
C++ Audio Effect Acceleration - Verifying Object-Oriented Programming and SIMD with Assembly
Introduction
When implementing audio effects in C++, a design that prioritizes extensibility through virtual functions can end up slower than expected, depending on how the processing is written. In this article, we compare the same processing written three ways: "per-sample", "block-based (with auto-vectorization)", and "block-based (manual AVX2)". We then use MSVC's auto-vectorization reports and assembly listings (.cod) to examine why the measurements came out the opposite of what we expected. The goal is to clarify what to question, how to verify it, and which optimizations actually pay off when speeding up audio processing without giving up the design.
Background
In professional audio processing, it is common to apply multiple effects in a chain, and both of the following requirements must be met:
- Effects should be easy to add or replace (maintainability, extensibility)
- Processing must be fast enough for real-time use (latency, CPU usage)
In this context, object-oriented programming (OOP) and SIMD optimization can work well together, but whether they actually do depends on how the code is written. This article pins down where the two diverge by analyzing measurement logs and the generated code.
What is SIMD
SIMD is a mechanism for executing the same operation on multiple data elements at once, and it is particularly effective for operations such as multiplication and min/max over contiguous data like float arrays (see the previous article on this topic). In C++, the compiler can vectorize loops automatically, or developers can write the instructions explicitly using AVX2 and similar instruction sets.
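As a minimal illustration (separate from the benchmark code later in this article), the same gain operation can be written either as a plain loop that the compiler may auto-vectorize, or explicitly with AVX2 intrinsics that process 8 floats per instruction:
#include <immintrin.h>

// Plain loop: one float per iteration in the source; the optimizer may vectorize it.
void gain_scalar(const float* in, float* out, int n, float g) {
    for (int i = 0; i < n; ++i) out[i] = in[i] * g;
}

// Explicit AVX2: 8 floats per iteration (tail handling omitted for brevity).
void gain_avx2(const float* in, float* out, int n, float g) {
    const __m256 vg = _mm256_set1_ps(g);
    for (int i = 0; i + 8 <= n; i += 8) {
        _mm256_storeu_ps(out + i, _mm256_mul_ps(_mm256_loadu_ps(in + i), vg));
    }
}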
Target Audience
- Those who have written simple DSP (gain/clip, etc.) in C++
- Those who understand sample-by-sample vs. block processing concepts but aren't confident about optimization verification
- Those who believe "hand-written SIMD (AVX2) should be faster" but have been disappointed by actual measurements
References
- /FA, /Fa (Listing file)
- loop pragma (no_vector / ivdep, etc.)
- Vectorizer and parallelizer messages (C5001/C5002 and reason codes)
- Avoiding AVX-SSE Transition Penalties (Intel documentation)
Conditions for OOP and SIMD to Work Together
The Problem with virtual in Inner Loops
SIMD excels at repeating the same calculation over consecutive array elements. However, when a virtual call such as fx->processSample(x) is made for every sample, an indirect call ends up inside the innermost loop, and the compiler can no longer treat the loop as a simple, analyzable traversal.

If we process in blocks (e.g., 512 samples at a time) instead of sample by sample, the virtual call happens only once per block, and the inner loop that traverses the array can be a plain for loop. This shape is much more likely to benefit from the compiler's auto-vectorization, and it is the starting point for combining OOP and SIMD.

Test Environment
- CPU: Intel(R) Core(TM) i7-11700F @ 2.50GHz
- Compiler: MSVC v143 (VS 2022 C++ x64/x86 build tools)
- Language mode: C++17
- Build configuration: ReleaseAsm x64
Compilation settings were adjusted to enable optimization while allowing verification of the generated code.
- Optimization: /O2, /Ob2, /Oi, /Ot
- Instruction set: /arch:AVX2
- Output: /FAcs (listing with source-equivalent information and machine code)
Since the target is sample code, the absolute values will vary with CPU generation and the exact build options; here we only care about the relative trend between the patterns.
Experiment Conditions
- Sample rate: 48kHz
- Block size: 512
- Total samples: 10,240,000 (512 × 20,000)
- Number of effects: 8 (applying gain and clip alternately)
- Measurement: Multiple runs under identical conditions, taking the fastest value
- Noise reduction: Raising thread priority and pinning to CPU core
Additionally, we output a .cod listing with MSVC's /FAcs for the verification discussed later.
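For reference, here is a minimal sketch of the best-of-N timing and noise-reduction setup described above. The actual harness is not shown in this article; the Windows calls (SetThreadAffinityMask, SetThreadPriority) are one possible way to implement the core pinning and priority boost.
#include <windows.h>
#include <algorithm>
#include <chrono>

// Pin the measuring thread to one core and raise its priority (Windows-specific).
static void prepare_thread_for_timing() {
    SetThreadAffinityMask(GetCurrentThread(), 1);                    // pin to core 0
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);  // reduce scheduling noise
}

// Run fn() `repeats` times and return the fastest wall-clock time in milliseconds.
template <class F>
static double best_of_n_ms(F&& fn, int repeats) {
    double best = 1e300;
    for (int r = 0; r < repeats; ++r) {
        auto t0 = std::chrono::steady_clock::now();
        fn();
        auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    return best;
}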
Test 1: Unexpected Results from A/B/C Measurements
We tested the following three patterns and measured their execution speed.
A: Calling virtual for Each Sample
A configuration where sample-based processSample() is called 8 times per sample. While extensible as OOP, the virtual call in the inner loop makes it difficult for the compiler to apply SIMD optimization.
sample-virtual code snippet
struct ISampleFx {
    virtual ~ISampleFx() = default;
    virtual float processSample(float x) noexcept = 0;
};

static BenchResult bench_sample_virtual(
    const float* in, float* out, size_t nSamples, int chainLen, int repeats)
{
    auto chain = make_sample_chain(chainLen);
    for (int r = 0; r < repeats; ++r) {
        float sum = 0.0f;
        for (size_t i = 0; i < nSamples; ++i) {
            float y = in[i];
            for (auto& fx : chain) {
                y = fx->processSample(y); // ← virtual for each sample
            }
            out[i] = y;
            sum += y;
        }
        // timing / best-of-N omitted
    }
}
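The per-sample effects and make_sample_chain are omitted from the excerpt above. A minimal sketch, assuming gain and clip alternate as in the experiment conditions (the coefficient values are placeholders; requires <memory> and <vector>):
struct GainSample : ISampleFx {
    float g;
    explicit GainSample(float gain) : g(gain) {}
    float processSample(float x) noexcept override { return x * g; }
};

struct ClipSample : ISampleFx {
    float lo, hi;
    ClipSample(float l, float h) : lo(l), hi(h) {}
    float processSample(float x) noexcept override {
        return x < lo ? lo : (x > hi ? hi : x);
    }
};

// Alternate gain and clip, chainLen effects in total.
static std::vector<std::unique_ptr<ISampleFx>> make_sample_chain(int chainLen) {
    std::vector<std::unique_ptr<ISampleFx>> chain;
    for (int i = 0; i < chainLen; ++i) {
        if (i % 2 == 0) chain.emplace_back(std::make_unique<GainSample>(1.001f));      // placeholder gain
        else            chain.emplace_back(std::make_unique<ClipSample>(-0.8f, 0.8f)); // placeholder thresholds
    }
    return chain;
}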
B: virtual per Block, Simple for Inside Blocks
Push virtual calls to the block boundaries, with simple for loops for array traversal inside the blocks.
block-virtual scalar code snippet
struct IBlockFx {
    virtual ~IBlockFx() = default;
    virtual void processBlock(const float* in, float* out, int n) noexcept = 0;
};

struct GainBlockScalar : IBlockFx {
    float g;
    explicit GainBlockScalar(float gain) : g(gain) {}
    void processBlock(const float* in, float* out, int n) noexcept override {
        for (int i = 0; i < n; ++i) {
            out[i] = in[i] * g; // ← simple loop (potential for auto-vectorization)
        }
    }
};

static BenchResult bench_block_virtual(
    const float* in, float* out, size_t nSamples, int blockSize, int chainLen, /*...*/)
{
    auto chain = make_block_chain(chainLen, /*useAvx2=*/false);
    std::vector<float> scratchA(blockSize), scratchB(blockSize);
    const size_t nBlocks = nSamples / (size_t)blockSize;
    for (size_t b = 0; b < nBlocks; ++b) {
        const float* curIn = in + b * (size_t)blockSize;
        float* curOut = scratchA.data();
        for (int i = 0; i < (int)chain.size(); ++i) {
            chain[i]->processBlock(curIn, curOut, blockSize); // ← virtual per block
            curIn = curOut;
            curOut = (curOut == scratchA.data()) ? scratchB.data() : scratchA.data();
        }
        // writing to out (checksum addition omitted)
    }
}
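ClipBlockScalar, which the compiler report and assembly discussed below refer to, is not shown in the excerpt; it follows the same pattern as GainBlockScalar with a simple clamp loop. A minimal sketch:
struct ClipBlockScalar : IBlockFx {
    float lo, hi;
    ClipBlockScalar(float l, float h) : lo(l), hi(h) {}
    void processBlock(const float* in, float* out, int n) noexcept override {
        for (int i = 0; i < n; ++i) {
            float x = in[i];
            if (x < lo) x = lo; // simple clamp (potential for auto-vectorization)
            if (x > hi) x = hi;
            out[i] = x;
        }
    }
};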
C: Manual AVX2 intrinsics Inside Blocks
Like B, virtual calls are placed at block boundaries, but the block internals process 8 elements at a time using intrinsics. This implementation uses loadu/storeu without assuming alignment.
block-virtual AVX2 code snippet
#include <immintrin.h> // AVX2 intrinsics

struct GainBlockAVX2 : IBlockFx {
    float g;
    explicit GainBlockAVX2(float gain) : g(gain) {}
    void processBlock(const float* in, float* out, int n) noexcept override {
        const __m256 vg = _mm256_set1_ps(g);
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 x = _mm256_loadu_ps(in + i);
            __m256 y = _mm256_mul_ps(x, vg);
            _mm256_storeu_ps(out + i, y);
        }
        for (; i < n; ++i) out[i] = in[i] * g; // scalar tail
    }
};

struct ClipBlockAVX2 : IBlockFx {
    float lo, hi;
    ClipBlockAVX2(float l, float h) : lo(l), hi(h) {}
    void processBlock(const float* in, float* out, int n) noexcept override {
        const __m256 vlo = _mm256_set1_ps(lo);
        const __m256 vhi = _mm256_set1_ps(hi);
        int i = 0;
        for (; i + 8 <= n; i += 8) {
            __m256 x = _mm256_loadu_ps(in + i);
            x = _mm256_max_ps(x, vlo);
            x = _mm256_min_ps(x, vhi);
            _mm256_storeu_ps(out + i, x);
        }
        for (; i < n; ++i) { // scalar tail
            float x = in[i];
            out[i] = x < lo ? lo : (x > hi ? hi : x);
        }
    }
};
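make_block_chain, used by bench_block_virtual above, is also omitted from the excerpts; a minimal sketch that selects the scalar or AVX2 implementations via the useAvx2 flag (the gain/threshold values are placeholders, and ClipBlockScalar is the scalar clamp sketched earlier):
static std::vector<std::unique_ptr<IBlockFx>> make_block_chain(int chainLen, bool useAvx2) {
    std::vector<std::unique_ptr<IBlockFx>> chain;
    for (int i = 0; i < chainLen; ++i) {
        const bool isGain = (i % 2 == 0); // gain and clip alternate
        if (useAvx2) {
            if (isGain) chain.emplace_back(std::make_unique<GainBlockAVX2>(1.001f));
            else        chain.emplace_back(std::make_unique<ClipBlockAVX2>(-0.8f, 0.8f));
        } else {
            if (isGain) chain.emplace_back(std::make_unique<GainBlockScalar>(1.001f));
            else        chain.emplace_back(std::make_unique<ClipBlockScalar>(-0.8f, 0.8f));
        }
    }
    return chain;
}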
The results were as follows:
A) sample-virtual 83.016 ms 123.35 MSamples/s
B) block-virtual scalar 6.675 ms 1534.15 MSamples/s
C) block-virtual AVX2 8.327 ms 1229.79 MSamples/s
The jump from A to B (approximately 12.4x) clearly shows the effect of moving the virtual calls to block boundaries. However, B outperforms C: the hand-written AVX2 version loses to an implementation that looks scalar in the source. At this point, the question shifts from "Is hand-written SIMD slow?" to "Was B really scalar?"
Compiler Report Investigation
The build log showed that the loops in B (GainBlockScalar / ClipBlockScalar) were classified as loop vectorized (C5001). This suggests that B might be effectively SIMD-optimized. Meanwhile, some parts of the AVX2 implementation were marked as not vectorized (C5002), indicating that writing manual SIMD code doesn't always align well with compiler optimizations.
However, reports are just hints, not conclusions. We need to examine the .cod to see what instructions are actually being generated.
Assembly Investigation
GainBlockScalar Is Using YMM-Width Multiplication
Checking the assembly for GainBlockScalar::processBlock, we can see that it broadcasts the coefficient into a YMM register and multiplies 8 elements at a time. Here's an excerpt from the relevant section:
; GainBlockScalar::processBlock (excerpt)
vbroadcastss ymm2, DWORD PTR [rcx+8]
...
vmulps ymm0, ymm2, YMMWORD PTR [rdi+rdx]
vmovups YMMWORD PTR [r8+rdx], ymm0
The presence of vmulps and YMMWORD PTR indicates that this loop is generated as 256-bit (float×8) width SIMD, at least in part. This means that while the source is scalar, the generated code is not.
ClipBlockScalar Also Uses YMM-Width min/max
Similarly, we can see vmaxps / vminps in ClipBlockScalar::processBlock:
; ClipBlockScalar::processBlock (excerpt)
vmaxps ymm0, ymm3, YMMWORD PTR [r11+rax-32]
vminps ymm0, ymm4, ymm0
vmovups YMMWORD PTR [r11+rax-32], ymm0
At this point, the most likely explanation for why "B is faster than C" in Test 1 is that the premise "B is a scalar implementation" was wrong in the first place.
Test 2: Using no_vec to Make a Valid Comparison
B': Intentionally Disabling Auto-vectorization
In Test 2, we create B' by adding #pragma loop(no_vector) to the inner loops of B's processBlock, suppressing the compiler's auto-vectorization. This turns the comparison against C into a genuine "scalar vs. hand-written SIMD" matchup.
block-virtual scalar(no_vec) code snippet
struct GainBlockScalarNoVec : IBlockFx {
    float g;
    explicit GainBlockScalarNoVec(float gain) : g(gain) {}
    void processBlock(const float* in, float* out, int n) noexcept override {
        #pragma loop(no_vector)
        for (int i = 0; i < n; ++i) {
            out[i] = in[i] * g;
        }
    }
};

struct ClipBlockScalarNoVec : IBlockFx {
    float lo, hi;
    ClipBlockScalarNoVec(float l, float h) : lo(l), hi(h) {}
    void processBlock(const float* in, float* out, int n) noexcept override {
        #pragma loop(no_vector)
        for (int i = 0; i < n; ++i) {
            float x = in[i];
            if (x < lo) x = lo;
            if (x > hi) x = hi;
            out[i] = x;
        }
    }
};
We confirmed that the pragma appears as source-line comments in the listing at the relevant points:
; 128 : #pragma loop(no_vector)
; 129 : for (int i = 0; i < n; ++i) {
The execution results were as follows:
A) sample-virtual 83.016 ms 123.35 MSamples/s
B) block-virtual scalar 6.675 ms 1534.15 MSamples/s
B') block-virtual scalar(no_vec) 27.504 ms 372.31 MSamples/s
C) block-virtual AVX2 8.327 ms 1229.79 MSamples/s
The large slowdown from B to B' strongly suggests that B was indeed benefiting from auto-vectorization. Also, C now outperforms B', which gives us a valid "hand-written SIMD beats scalar" comparison.
Assembly Verification: B' Uses Scalar Operations
Examining the .cod for B' with #pragma loop(no_vector), we can see that the main loop in GainBlockScalarNoVec::processBlock is generated as scalar instructions processing one element at a time. Here's an excerpt:
; GainBlockScalarNoVec::processBlock (excerpt)
; 128 : #pragma loop(no_vector)
vmovss xmm0, DWORD PTR [rdi+rdx]
vmulss xmm0, xmm0, DWORD PTR [rcx+8]
vmovss DWORD PTR [r8+rdx], xmm0
The vmulss instruction here is a scalar multiplication (1 element) using XMM registers, which is different from vmulps that calculates with YMM width (8 elements). This confirms that B' has "fallen back" to scalar not just in appearance but in the generated code as well.
Similarly, in ClipBlockScalarNoVec::processBlock, we can see scalar instructions like vmaxss / vminss:
; ClipBlockScalarNoVec::processBlock (excerpt)
; 137 : #pragma loop(no_vector)
vmovss xmm0, DWORD PTR [r11+rax-4]
vmaxss xmm1, xmm0, DWORD PTR [rcx+8]
vminss xmm2, xmm1, DWORD PTR [rcx+12]
vmovss DWORD PTR [rax-20], xmm2
From the above, the main reason B was fast in Test 1 is that, even though the source was scalar, the compiler generated YMM-width SIMD instructions through auto-vectorization. The Test 1 comparison of B and C was therefore not "scalar vs. hand-written AVX2" but "auto-vectorized implementation vs. hand-written AVX2 implementation".
Under these conditions, the hand-written AVX2 version carries extra baggage: tail handling and the inability to assume alignment (hence loadu/storeu). The auto-vectorized code, meanwhile, can come out ahead on unrolling and instruction scheduling, which is enough to explain a gap of around 25%. It is therefore inappropriate to conclude that hand-written SIMD is pointless just because C lost to B. In comparative experiments, "what is being compared with what" has to be established from the generated code, not from the source, before drawing any conclusions.
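As one example, a follow-up experiment that wanted to isolate the alignment factor could allocate the scratch buffers with 32-byte alignment and switch the kernels to aligned loads/stores. This is a sketch only and was not measured in this article, so it may or may not change the ranking:
// Sketch only (not measured): assumes in/out are 32-byte aligned and n is a multiple of 8.
void gainBlockAligned(const float* in, float* out, int n, float g) {
    const __m256 vg = _mm256_set1_ps(g);
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 x = _mm256_load_ps(in + i);              // aligned load instead of loadu
        _mm256_store_ps(out + i, _mm256_mul_ps(x, vg)); // aligned store instead of storeu
    }
}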
Discussion
The Starting Point for Optimization Is Not "Writing SIMD" But "Arranging Code So SIMD Works"
As A → B shows, simply pushing virtual calls to block boundaries and making inner loops into simple array traversals can change performance by an order of magnitude. In practice, reorganizing the processing structure is more effective than hand-written instructions.
Scalar Implementation Is Not Necessarily Scalar in Practice
As shown by the results of B and B', the compiler will auto-vectorize in optimized builds. Without understanding "what the generated code actually is," conclusions can be reversed in comparative experiments. In this article, we used compiler reports as clues and verified with assembly output to determine what we were actually comparing.
Conclusion
In this article, we analyzed the "measurement inversion" that often occurs when trying to combine OOP and SIMD, using measurement results, compiler reports, and assembly output as evidence. The conclusion is that B (block-virtual scalar) was fast because the generated code was SIMD-optimized even though the source was scalar, and when auto-vectorization was disabled in B', C (hand-written AVX2) became superior. In practice, it's safer to first "blockify and simplify loops" to maximize compiler optimization, and then consider hand-written SIMD only if necessary.

