Verifying the impact of C++ inline specification on performance in MSVC

Verifying the impact of C++ inline specification on performance in MSVC

Adding the inline keyword doesn't automatically make code faster. Measurements with MSVC and assembly analysis confirm that whether inlining occurs depends greatly on the function's characteristics, translation unit boundaries, and optimization options such as LTO.
2026.02.25

This page has been translated by machine translation. View original

Introduction

"Adding inline makes function calls disappear and code run faster." When learning C++, many people might initially understand it this way. However, the reality is not that simple.

In this article, we'll use MSVC to empirically examine how the presence or absence of the inline keyword affects execution speed. Does simply adding the inline keyword make code faster, or are other factors more significant? Through this verification, we'll clarify the relationship between the inline keyword and inline expansion.

What is an inline function?

In C++, you can add the inline keyword to function declarations.

inline int add(int a, int b) { return a + b; }

The specified meaning in the specification is the relaxation of the ODR (One Definition Rule). Normally, only one definition of the same function is allowed throughout the program. With inline, you can place the same definition in multiple translation units without a linker error. This is a necessary mechanism when writing function definitions in header files.

In the C++ standard, inline is also described as a hint for inline expansion, but it has no binding force and compilers can ignore it. At least from the specification standpoint, the only guarantee is the relaxation of ODR, not a promise of speed improvement.

What is inline expansion?

Inline expansion is an optimization technique where the compiler replaces a function call with the body of the function.

// Calling code (conceptual image)
result = add(x, y);  →  result = x + y;  // call instruction disappears

This eliminates call overhead, leading to faster execution for small functions. Additionally, at the expansion site, there are benefits of additional optimizations like constant folding and loop optimization becoming more effective.

Adding the inline keyword doesn't guarantee that the compiler will perform inline expansion. Conversely, even without it, the compiler may automatically expand the function. Compilers automatically determine whether to expand based on function size and call frequency, estimating costs and benefits. MSVC also has an extension called __forceinline that strongly requests expansion from the compiler. Even so, it's not absolutely guaranteed.

Test Environment

Item Value
OS Windows 11
CPU Intel Core i7-11700F
Compiler MSVC v143 (Visual Studio 2022)
C++ Standard C++14

Target Audience

  • Those who have heard of C++ inline but don't know the mechanism well
  • Those interested in how compilers optimize function calls

References

Verification Method

In this article, we'll empirically verify how compilers actually behave.

  • Test 1: Presence or absence of inline specification
    We'll prepare 3 patterns for the same function: unmodified (plain) / inline / __forceinline, and compare execution times. This directly tests whether adding inline actually makes code faster.
  • Test 2: Function characteristics
    We'll prepare 3 types of functions: minimal function (add), function with branching (clamp), and function with a loop (heavy), to check how function size and structure affect inline expansion decisions.
  • Test 3: Call location
    We'll compare execution times when function definitions and calls are in the same .cpp file (same-TU) versus different .cpp files (cross-TU).
  • Test 4: Compilation conditions
    We'll build the same code with 3 configurations: Debug (no optimization), Release_NoLTO (with optimization), and Release_LTO (with optimization + link-time optimization) to check the impact of optimization options.

Test Functions

We've prepared 3 types of functions. Each is defined with 3 modifiers: unmodified (plain) / inline / __forceinline in the same translation unit (same-TU). We add static to prevent link conflicts with identically named functions for cross-TU.

// --- add: minimal function (1 instruction) ---
static int add_plain_same_tu(int a, int b) {
    return a + b;
}
static inline int add_inlined_same_tu(int a, int b) {
    return a + b;
}
static __forceinline int add_forced_same_tu(int a, int b) {
    return a + b;
}

// --- clamp: function with branching ---
static int clamp_plain_same_tu(int value, int lo, int hi) {
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}
static inline int clamp_inlined_same_tu(int value, int lo, int hi) {
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}
static __forceinline int clamp_forced_same_tu(int value, int lo, int hi) {
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}

// --- heavy: function with loop ---
static int heavy_plain_same_tu(int n) {
    int sum = 0;
    for (int i = 1; i <= n; ++i) { sum += i * i; }
    return sum;
}
static inline int heavy_inlined_same_tu(int n) {
    int sum = 0;
    for (int i = 1; i <= n; ++i) { sum += i * i; }
    return sum;
}
static __forceinline int heavy_forced_same_tu(int n) {
    int sum = 0;
    for (int i = 1; i <= n; ++i) { sum += i * i; }
    return sum;
}

For the different translation unit (cross-TU), we define unmodified functions in a separate file.

// cross_tu_functions.h
#pragma once
int add_plain_cross_tu(int a, int b);
int clamp_plain_cross_tu(int value, int lo, int hi);
int heavy_plain_cross_tu(int n);
// cross_tu_functions.cpp
#include "cross_tu_functions.h"

int add_plain_cross_tu(int a, int b) {
    return a + b;
}
int clamp_plain_cross_tu(int value, int lo, int hi) {
    if (value < lo) return lo;
    if (value > hi) return hi;
    return value;
}
int heavy_plain_cross_tu(int n) {
    int sum = 0;
    for (int i = 1; i <= n; ++i) { sum += i * i; }
    return sum;
}

Measurement Method

We used the RUN_BENCH macro for measurements. To avoid calls through function pointers, we directly expanded the call expressions as macro arguments. After a 1 million warmup, we called add and clamp 100 million times, and heavy 10 million times, recording the elapsed time. Addition to a volatile variable prevents Dead Code Elimination. Note that the measurements in this article are from a single trial and include the effects of optimization suppression via volatile. Since our goal is to understand trends, differences of a few milliseconds are treated as margin of error.

bench.h (measurement macro and utilities)
#pragma once

#define _CRT_SECURE_NO_WARNINGS

#include <chrono>
#include <cstdio>
#include <string>
#include <vector>

struct BenchResult {
    std::string config;
    std::string function;
    std::string inline_type;
    std::string call_type;
    int iterations;
    double time_ms;
};

// RUN_BENCH: direct call macro (avoids function pointer to preserve inlining)
// - Warmup: 1,000,000 iterations
// - Measurement: `iters` iterations
// - volatile sink prevents dead-code elimination
#define RUN_BENCH(results, cfg, func, itype, ctype, iters, call_expr) \
    do {                                                              \
        volatile int sink = 0;                                        \
        for (int i = 0; i < 1000000; ++i) {                           \
            sink += (call_expr);                                      \
        }                                                             \
        auto start = std::chrono::high_resolution_clock::now();       \
        sink = 0;                                                     \
        for (int i = 0; i < (iters); ++i) {                           \
            sink += (call_expr);                                      \
        }                                                             \
        auto end = std::chrono::high_resolution_clock::now();         \
        double ms = std::chrono::duration<double, std::milli>(        \
            end - start).count();                                     \
        BenchResult r;                                                \
        r.config = (cfg);                                             \
        r.function = (func);                                          \
        r.inline_type = (itype);                                      \
        r.call_type = (ctype);                                        \
        r.iterations = (iters);                                       \
        r.time_ms = ms;                                               \
        (results).push_back(r);                                       \
    } while (0)

inline void write_csv(const std::vector<BenchResult>& results,
                      const std::string& filename) {
    FILE* fp = fopen(filename.c_str(), "w");
    if (!fp) return;
    fprintf(fp, "config,function,inline_type,call_type,iterations,time_ms\n");
    for (const auto& r : results) {
        fprintf(fp, "%s,%s,%s,%s,%d,%.3f\n",
                r.config.c_str(), r.function.c_str(),
                r.inline_type.c_str(), r.call_type.c_str(),
                r.iterations, r.time_ms);
    }
    fclose(fp);
}

inline void print_results(const std::vector<BenchResult>& results) {
    printf("%-14s %-24s %-14s %-10s %12s %10s\n",
           "config", "function", "inline_type", "call_type",
           "iterations", "time_ms");
    printf("----------------------------------------------------"
           "-------------------------------------------\n");
    for (const auto& r : results) {
        printf("%-14s %-24s %-14s %-10s %12d %10.3f\n",
               r.config.c_str(), r.function.c_str(),
               r.inline_type.c_str(), r.call_type.c_str(),
               r.iterations, r.time_ms);
    }
}
main() function (measurement execution part)
int main() {
    std::vector<BenchResult> results;
    const char* cfg = BENCH_CONFIG;

    const int ADD_ITERS   = 100000000; // 100 million
    const int CLAMP_ITERS = 100000000; // 100 million
    const int HEAVY_ITERS =  10000000; // 10 million

    // --- same-TU: add ---
    RUN_BENCH(results, cfg, "add_plain",  "plain",        "same_tu", ADD_ITERS,
              add_plain_same_tu(i, i + 1));
    RUN_BENCH(results, cfg, "add_inline", "inline",       "same_tu", ADD_ITERS,
              add_inlined_same_tu(i, i + 1));
    RUN_BENCH(results, cfg, "add_forced", "__forceinline", "same_tu", ADD_ITERS,
              add_forced_same_tu(i, i + 1));

    // --- same-TU: clamp ---
    RUN_BENCH(results, cfg, "clamp_plain",  "plain",        "same_tu", CLAMP_ITERS,
              clamp_plain_same_tu(i, 0, 1000));
    RUN_BENCH(results, cfg, "clamp_inline", "inline",       "same_tu", CLAMP_ITERS,
              clamp_inlined_same_tu(i, 0, 1000));
    RUN_BENCH(results, cfg, "clamp_forced", "__forceinline", "same_tu", CLAMP_ITERS,
              clamp_forced_same_tu(i, 0, 1000));

    // --- same-TU: heavy ---
    RUN_BENCH(results, cfg, "heavy_plain",  "plain",        "same_tu", HEAVY_ITERS,
              heavy_plain_same_tu(i % 50 + 1));
    RUN_BENCH(results, cfg, "heavy_inline", "inline",       "same_tu", HEAVY_ITERS,
              heavy_inlined_same_tu(i % 50 + 1));
    RUN_BENCH(results, cfg, "heavy_forced", "__forceinline", "same_tu", HEAVY_ITERS,
              heavy_forced_same_tu(i % 50 + 1));

    // --- cross-TU ---
    RUN_BENCH(results, cfg, "add_plain",   "plain", "cross_tu", ADD_ITERS,
              add_plain_cross_tu(i, i + 1));
    RUN_BENCH(results, cfg, "clamp_plain", "plain", "cross_tu", CLAMP_ITERS,
              clamp_plain_cross_tu(i, 0, 1000));
    RUN_BENCH(results, cfg, "heavy_plain", "plain", "cross_tu", HEAVY_ITERS,
              heavy_plain_cross_tu(i % 50 + 1));

    // --- Output ---
    print_results(results);

    std::string csv_name = std::string("results_") + cfg + ".csv";
    write_csv(results, csv_name);
    printf("\nResults written to %s\n", csv_name.c_str());

    return 0;
}

Build Configurations

We used three build configurations. The BENCH_CONFIG macro embeds the configuration name, reflected in the CSV filename.

Configuration Name (Article Notation) VS Configuration Name Optimization LTO
Debug Debug None (/Od) None
Release_NoLTO ReleaseNoLTO Yes (/O2 /Oi) None
Release_LTO Release Yes (/O2 /Oi) Yes (/GL + /LTCG)

Build commands:

msbuild test-inline-code-performance.sln /p:Configuration=Debug /p:Platform=x64
msbuild test-inline-code-performance.sln /p:Configuration=ReleaseNoLTO /p:Platform=x64
msbuild test-inline-code-performance.sln /p:Configuration=Release /p:Platform=x64

We used the /FA option to output assembly listings.

Verification Results

Test 1: Presence or absence of inline specification

Here are the results for same translation unit with Release_NoLTO.

Test 1: Presence or absence of inline specification

Function plain (ms) inline (ms) __forceinline (ms)
add (100 million) 110 84 102
clamp (100 million) 53 51 51
heavy (10 million) 58 56 56

For clamp and heavy, execution times are almost identical across all three modifiers. For add, there's some variation in the numbers, but checking the assembly confirmed that inline expansion occurred for all three modifiers. These differences likely stem from variations in code layout rather than whether inline expansion occurred. At least in this MSVC environment, the presence or absence of inline did not affect the compiler's optimization decisions.

Test 2: Function Characteristics

We reorganize the same data along the function axis. We checked the assembly for plain modifier, same-TU, Release_NoLTO.

Function Characteristic Inline Expansion with plain
add Minimal (1 instruction) Expanded
clamp With branching Expanded
heavy With loop Not expanded

The compiler automatically expanded the minimal add function. The clamp function with branching was also expanded. However, heavy with a loop was deemed too costly and was not expanded unless __forceinline was used. The reason heavy shows almost identical execution times across all three modifiers in Test 1 is that the calculation cost of the loop body dominates, making the presence or absence of call overhead less noticeable in the results. In this test, the function's characteristics had a much greater impact on expansion decisions than the presence of the inline keyword.

Test 3: Call Location

We compare same translation unit and different translation unit in Release_NoLTO. For cross-TU, only plain modifier was measured.

Test 3: Call Location

Function same-TU (ms) cross-TU (ms)
add (100 million) 110 114
clamp (100 million) 53 138
heavy (10 million) 58 58

Pay attention to clamp. It went from 53ms in same-TU to 138ms in cross-TU, about 2.6 times slower. When crossing translation units, the compiler cannot reference the function definition, preventing inline expansion. As a result, not only does call overhead remain, but opportunities for additional optimizations like constant folding that would happen at the expansion site are also lost. For add, the function body is so minimal that the call overhead itself is small, making the difference less noticeable. For heavy, there's no difference because it wasn't expanded to begin with.

Test 4: Compilation Conditions

We compare three build configurations in cross-TU.

Test 4: Compilation Conditions

Function Debug (ms) Release_NoLTO (ms) Release_LTO (ms)
add (100 million) 264 114 104
clamp (100 million) 321 138 53
heavy (10 million) 378 58 57

Debug is generally slower because optimizations are disabled. In Release_NoLTO, heavy shows significant improvement, but cross-TU clamp remains at 138ms. Note that in Release_LTO, clamp improves to 53ms. This is because LTO (Link-Time Optimization) enables inline expansion across translation units. The cross-TU performance degradation confirmed in Test 3 can be resolved with LTO.

Checking the Assembly

Beyond just numbers, we also verified inline expansion by examining the generated assembly.

Release_NoLTO: add (same-TU) → With expansion

$LL16@main:
    mov     ecx, DWORD PTR sink$1[rsp]
    inc     ecx
    add     ecx, edx          ; Addition is directly embedded
    mov     DWORD PTR sink$1[rsp], ecx
    add     edx, 2
    cmp     edx, 2000000
    jl      SHORT $LL16@main

There's no call instruction. The function body is directly expanded in the loop as add ecx, edx. Note that cmp edx, 2000000 is the comparison instruction for the warmup loop (1 million times), with the value changed due to the compiler's loop transformation. This doesn't affect our inline expansion verification.

Release_NoLTO: add (cross-TU) → No expansion

$LL88@main:
    lea     ebx, DWORD PTR [rax+1]
    mov     edx, ebx
    mov     ecx, eax
    call    ?add_plain_cross_tu@@YAHHH@Z  ; Function call remains
    ...
    jl      SHORT $LL88@main

Because the translation units differ, the call instruction remains, causing a function call each time. In Debug builds, call instructions are output even in the same translation unit because optimizations are disabled. We also confirmed that in Release_LTO cross-TU, the call instructions for add and clamp disappear, generating code equivalent to same-TU.

What We Learned

Through this verification in our MSVC environment, we confirmed the following:

  1. The presence or absence of the inline keyword alone does not change the result of inline expansion
  2. Whether inline expansion occurs depends greatly on the function's size and structure
  3. Crossing translation units loses opportunities for inline expansion
  4. Enabling LTO allows expansion across translation unit boundaries

However, this doesn't mean inline has no practical value. Adding inline allows you to place function definitions in header files, enabling the compiler to reference the function body from the calling translation unit. As confirmed in Test 3, this is important for creating opportunities for inline expansion.

Conclusion

Simply adding inline doesn't automatically make code faster. Inline expansion is automatically determined by the compiler based on function characteristics and costs. Function design, translation unit configuration, and LTO utilization affect speed more than the presence of the keyword. When performance is a concern, it's best to start by measuring with a profiler.

Share this article

FacebookHatena blogX

Related articles