
What is SIMD - Trying to Speed up Audio Buffer Gain Processing in C++
Introduction
In this article, we will review the basics of SIMD (Single Instruction Multiple Data), which is frequently mentioned when writing audio processing code in C++, and check how much processing speed actually changes through a simple benchmark. Using a simple process of "applying gain to an entire buffer" as our subject, we'll also observe how behavior changes with different compilation options.
Target Audience
- Those who write audio processing applications in C++
- Those who have heard of SIMD but want to get a better sense of how much faster it actually makes things
References
- /O options (Optimize Code) - Visual Studio / MSVC C++
- /arch (x64) - Enable Extended Instruction Set
- Intel® Intrinsics Guide
- SIMD Introduction (C Language SIMD Introduction: Table of Contents)
- Intro to practical SIMD for graphics
What is SIMD?
Why is it well-suited for audio processing?
In real-time audio processing, there are many operations that repeat the same calculation for all samples in a buffer. Typical examples include:
- Applying a constant gain to an audio buffer
- Adding multiple tracks together for mixing
- Multiplying envelope or filter coefficients to all samples
When written straightforwardly in C++, this typically becomes a for loop like the following:
// Scalar version of gain processing
void applyGainScalar(float* buffer, size_t n, float gain) {
    for (size_t i = 0; i < n; ++i) {
        buffer[i] *= gain;
    }
}
In this case, the CPU conceptually processes 1 sample per loop iteration. It reads one value from memory, multiplies it, and writes one result back - repeating this process for each sample.
SIMD (Single Instruction, Multiple Data) breaks this "one instruction per data item" pattern and provides a mechanism to process multiple data points with a single instruction. The CPU provides vector registers that are 128 or 256 bits wide; a single register can hold several 32-bit float values (4 or 8, respectively), and an addition or multiplication is applied to all of them at once. For example, with AVX2's 256-bit registers you can handle eight 32-bit floats at a time, meaning a single instruction performs the multiplication for 8 samples simultaneously.
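To make the register picture concrete, here is a minimal, self-contained sketch (separate from the functions used later in this article) that loads eight floats into one 256-bit register, multiplies all of them by a gain with a single instruction, and prints the results. It assumes a CPU and build settings that support AVX2 (e.g. /arch:AVX2):
#include <immintrin.h>
#include <cstdio>

int main() {
    float samples[8] = { 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f };
    __m256 x = _mm256_loadu_ps(samples);  // load 8 floats into one 256-bit register
    __m256 g = _mm256_set1_ps(0.5f);      // broadcast the gain to all 8 lanes
    __m256 y = _mm256_mul_ps(x, g);       // one instruction, 8 multiplications
    _mm256_storeu_ps(samples, y);         // write all 8 results back
    for (float v : samples) {
        std::printf("%f\n", v);
    }
}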

How to use SIMD
There are two main approaches to using SIMD in C++:
- Let the compiler optimize for you: Write a straightforward for loop and enable optimization options like /O2 or /Ot. The compiler will automatically vectorize loops that meet certain conditions. Modern compilers are quite smart and will often replace your code with SSE/AVX instructions without you having to think about it.
- Explicitly call vector instructions using intrinsic functions: Include headers like #include <immintrin.h> and use functions such as _mm256_set1_ps, _mm256_mul_ps, and _mm256_loadu_ps to explicitly load, multiply, and write back 8 elements at once.
In this article, as an example of the latter approach, we'll write gain processing using AVX2 as follows:
#include <immintrin.h>

// Example of applying gain to 8 samples at a time using AVX2
void applyGainAvx(float* buffer, size_t n, float gain) {
    __m256 g = _mm256_set1_ps(gain);   // Expand gain to an 8-element vector
    size_t i = 0;
    size_t simdN = n & ~size_t(7);     // Round down to a multiple of 8
    for (; i < simdN; i += 8) {
        __m256 x = _mm256_loadu_ps(&buffer[i]);  // Load 8 samples
        __m256 y = _mm256_mul_ps(x, g);          // Multiply 8 samples simultaneously
        _mm256_storeu_ps(&buffer[i], y);         // Store back
    }
    // Process the remainder with scalar code
    for (; i < n; ++i) {
        buffer[i] *= gain;
    }
}
The code becomes somewhat harder to read, but the advantage is that we can explicitly state "this loop always uses AVX2". Loops that perform the same calculation on all samples in a continuous array, which is common in audio processing, are very well-suited for SIMD.
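For example, the track-mixing case mentioned earlier can be written with exactly the same pattern. The following is only a minimal sketch (it is not used in the benchmark below, and mixAddAvx is a name chosen here for illustration), assuming dst and src are float buffers of the same length:
#include <immintrin.h>

// Sketch: mix (add) src into dst, 8 samples at a time with AVX2
void mixAddAvx(float* dst, const float* src, size_t n) {
    size_t i = 0;
    size_t simdN = n & ~size_t(7);                       // round down to a multiple of 8
    for (; i < simdN; i += 8) {
        __m256 a = _mm256_loadu_ps(&dst[i]);             // load 8 samples from each buffer
        __m256 b = _mm256_loadu_ps(&src[i]);
        _mm256_storeu_ps(&dst[i], _mm256_add_ps(a, b));  // add and store 8 results at once
    }
    for (; i < n; ++i) {                                 // scalar remainder
        dst[i] += src[i];
    }
}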
Actual Measurements
Benchmark Conditions
Now, let's use the applyGainScalar and applyGainAvx functions from earlier to measure how much the processing time actually changes.
The benchmark conditions are as follows:
- Compiler/IDE: Visual Studio 2022 / MSVC
- Build Configuration: Release / x64
- Execution Method: "Start Without Debugging" (Ctrl + F5)
- Number of Samples: 1,048,576 samples (1 << 20)
- Loop Count: Each implementation repeated 200 times
- Gain Value: 0.5f

Benchmark Code
#include <immintrin.h>
#include <chrono>
#include <functional>
#include <iostream>
#include <vector>
#include <random>
#include <algorithm>

// Scalar version
void applyGainScalar(float* buffer, size_t n, float gain) {
    for (size_t i = 0; i < n; ++i) {
        buffer[i] *= gain;
    }
}

// SIMD version (AVX2)
void applyGainAvx(float* buffer, size_t n, float gain) {
    __m256 g = _mm256_set1_ps(gain);
    size_t i = 0;
    size_t simdN = n & ~size_t(7); // Round down to a multiple of 8
    for (; i < simdN; i += 8) {
        __m256 x = _mm256_loadu_ps(&buffer[i]);
        __m256 y = _mm256_mul_ps(x, g);
        _mm256_storeu_ps(&buffer[i], y);
    }
    for (; i < n; ++i) {
        buffer[i] *= gain;
    }
}

// Benchmark helper
double bench(const std::function<void()>& f, int iterations) {
    using clock = std::chrono::high_resolution_clock;
    // Warmup
    for (int i = 0; i < 5; ++i) {
        f();
    }
    auto start = clock::now();
    for (int i = 0; i < iterations; ++i) {
        f();
    }
    auto end = clock::now();
    std::chrono::duration<double> diff = end - start;
    return diff.count();
}

int main() {
    const size_t N = 1 << 20; // 1M samples
    const int iterations = 200;
    const float gain = 0.5f;
    std::vector<float> buffer(N);
    // Fill with random numbers
    std::mt19937 rng(12345);
    std::uniform_real_distribution<float> dist(-1.0f, 1.0f);
    for (auto& v : buffer) {
        v = dist(rng);
    }
    // Scalar benchmark (reset buffer each time)
    std::vector<float> bufScalar = buffer;
    double tScalar = bench([&] {
        std::copy(buffer.begin(), buffer.end(), bufScalar.begin());
        applyGainScalar(bufScalar.data(), bufScalar.size(), gain);
    }, iterations);
    // SIMD benchmark
    std::vector<float> bufSimd = buffer;
    double tSimd = bench([&] {
        std::copy(buffer.begin(), buffer.end(), bufSimd.begin());
        applyGainAvx(bufSimd.data(), bufSimd.size(), gain);
    }, iterations);
    std::cout << "Scalar: " << tScalar << " sec" << std::endl;
    std::cout << "SIMD : " << tSimd << " sec" << std::endl;
    std::cout << "Speedup: " << (tScalar / tSimd) << "x" << std::endl;
}
Here, we copy from buffer to bufScalar / bufSimd each time before running the processing. This ensures that both the scalar and SIMD versions process exactly the same input data, and it also reduces bias from cache state that would otherwise depend on which implementation runs first. This isn't a rigorous benchmark, but it's sufficient to see the general trends.
We compared the following two compilation option patterns:
- No code optimization: /Od
  All options: /permissive- /ifcOutput "TestSimd\x64\Release\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /Od /Ob2 /sdl /Fd"TestSimd\x64\Release\vc143.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oi /MD /FC /Fa"TestSimd\x64\Release\" /EHsc /nologo /Fo"TestSimd\x64\Release\" /Ot /Fp"TestSimd\x64\Release\TestSimd.pch" /diagnostics:column
- Maximum speed priority: /O2
  All options: /permissive- /ifcOutput "TestSimd\x64\Release\" /GS /GL /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Ob2 /sdl /Fd"TestSimd\x64\Release\vc143.pdb" /Zc:inline /fp:precise /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /arch:AVX2 /Gd /Oi /MD /FC /Fa"TestSimd\x64\Release\" /EHsc /nologo /Fo"TestSimd\x64\Release\" /Ot /Fp"TestSimd\x64\Release\TestSimd.pch" /diagnostics:column
Both use the "Configuration: Release / x64" settings, with only the optimization level being switched for measurement.

Measurement Results and Interpretation
| Optimization option | Scalar time [s] | SIMD time [s] | Speedup |
|---|---|---|---|
| /Od | 0.295703 | 0.109595 | 2.70x |
| /O2 | 0.077822 | 0.067121 | 1.16x |
With the /Od option that disables code optimization, the scalar implementation truly processes "one sample at a time in sequence," while only the SIMD implementation processes 8 samples at a time using AVX2. As a result, we observed an approximately 2.7x speed difference, which matches our expectations.
On the other hand, with /O2 enabling optimization, the scalar implementation's loop also becomes subject to the compiler's automatic vectorization. Since we specified /arch:AVX2, the compiler can freely use AVX2 instructions and will automatically replace simple loops with vector instructions like vmulps. As a result, the machine code for the scalar and the handwritten SIMD implementations becomes quite similar, and the difference between them shrinks to about 1.16x. In addition, the std::copy (memory copy) performed in each benchmark iteration costs the same for both implementations, so the gain computation itself accounts for a smaller share of the total time and the relative difference shrinks further.
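If you want to confirm that the compiler really did vectorize the scalar loop, MSVC's auto-vectorizer can report what it did. As a rough example (the source file name here is just a placeholder), compiling from a developer command prompt with the /Qvec-report option prints a message for each vectorized loop; level 2 also reports loops that could not be vectorized, together with a reason code:
cl /O2 /arch:AVX2 /Qvec-report:2 /EHsc main.cpp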
From these two patterns, we can understand the following:
- SIMD itself can be expected to provide a 2-3x improvement even without optimization
- In Release builds, the compiler performs aggressive vectorization, so the additional effect of handwritten SIMD may not be as dramatic
- In practical projects, a reasonable approach is to "first write straightforward C++ code, rely on optimization and auto-vectorization, and only consider handwritten SIMD for hot spots that are still performance bottlenecks" (a minimal sketch of keeping both paths side by side follows this list)
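One common way to keep such a handwritten path optional is to select the implementation when the code is built. The sketch below is only an illustration of that idea, not something measured in this article: applyGain is a hypothetical wrapper name, and it assumes the AVX2 path is compiled with /arch:AVX2 (in which case MSVC, GCC, and Clang all define __AVX2__). If the binary has to run on CPUs that may lack AVX2, a runtime CPU feature check would be needed instead.
// Sketch: pick the implementation at compile time based on the build settings
void applyGain(float* buffer, size_t n, float gain) {
#if defined(__AVX2__)
    applyGainAvx(buffer, n, gain);      // AVX2 build: use the handwritten SIMD path
#else
    applyGainScalar(buffer, n, gain);   // portable scalar fallback
#endif
}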
Conclusion
In this article, we've reviewed the basic concepts of SIMD and confirmed how much speed actually changes through a simple benchmark of gain processing. While we saw a 2-3x difference without optimization, in Release builds the compiler automatically vectorizes code, which can reduce the additional benefit of handwritten SIMD to around 10% in some cases.
In practice, it seems appropriate to first write straightforward loops, profile them, and only carefully tune with SIMD for parts that are genuinely bottlenecks.

