Is it faster to cache non-static member references to local variables in C++? Verified with MSVC

I tested whether repeatedly reading this->member or obj.member inside C++ loops is really slower than caching the value in a local variable, by running benchmarks with MSVC and inspecting the generated assembly.
2026.03.14

Introduction

When writing loop processing in C++, it is sometimes said that "it's better to extract a non-static member variable to a local variable first rather than directly referencing it within the loop every time."

Example:
Code considered slow

for (int i = 0; i < count; ++i) {
    sum += this->member_;
}

Modified code

const int value = this->member_;
for (int i = 0; i < count; ++i) {
    sum += value;
}

However, modern compilers are quite smart, and with MSVC this advice does not always hold. In this article, I benchmark with MSVC, examine the generated assembly, and verify the following questions:

  • Is there a runtime difference between direct reference and local caching?
  • Does the trend change between Debug and Release?
  • Under what conditions are differences more likely to appear?

What is MSVC

MSVC stands for Microsoft Visual C++ Compiler. It is a C++ compiler included in Visual Studio and is widely used in Windows environments. This article does not deal with general C++ principles but confirms how MSVC actually optimizes references to non-static member variables.

Test Environment

  • OS: Windows
  • Compiler: MSVC 19.44.35223
  • Architecture: x64
  • Build Configurations: Debug / Release / AsmRelease
  • Measurement method: std::chrono::steady_clock

Target Audience

  • Those interested in C++ performance optimization
  • Those who want to know about MSVC's optimization behavior
  • Those who want to verify if traditional cautions still apply today

Verification Approach

In this verification, I avoided drawing conclusions from benchmark numbers alone. Even when a numerical difference appears, it is hard to tell whether it comes from the member reference itself or from some other difference in code shape.

I therefore combined execution-time measurement with inspection of the generated assembly. I prepared three build configurations: Debug, Release, and AsmRelease. AsmRelease exists to emit assembly listings while keeping Release-equivalent optimization settings.

I started with baseline experiments measuring simple cases, and then conducted additional experiments to examine conditions where differences might be more apparent. Beyond comparing execution times, I also checked the AsmRelease assembly output to see if loads remained within the loop each time, or if values were extracted before the loop and kept in registers.

In the baseline experiments, I compared the following 4 patterns:

  • Simple loop
  • Somewhat complex loop
  • With noinline function calls
  • Alias-like case

Additionally, I checked these cases in follow-up experiments:

  • Cases where the compiler tends to be conservative by passing this to external functions
  • Comparison between this->member and obj.member
  • Comparison between inline getter and noinline getter

Each case ran 10,000,000 iterations and was measured 7 times; I report the minimum of the 7 runs to observe trends less affected by noise. This is not a strict statistical comparison, but an indicator to cross-check against the assembly differences.
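
The measurement protocol above can be sketched as follows. This is a minimal illustration, not the article's actual harness; `min_elapsed_ms` and its callable parameter are hypothetical names.

```cpp
#include <chrono>

// Minimal sketch of the measurement protocol: run `bench` `runs` times
// with std::chrono::steady_clock and keep the minimum elapsed time,
// which reduces the influence of scheduling noise.
template <typename Fn>
double min_elapsed_ms(Fn&& bench, int runs = 7) {
    double best = -1.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = std::chrono::steady_clock::now();
        bench();
        const auto stop = std::chrono::steady_clock::now();
        const double ms =
            std::chrono::duration<double, std::milli>(stop - start).count();
        if (best < 0.0 || ms < best) best = ms;
    }
    return best;
}
```

steady_clock is used rather than system_clock because it is monotonic, so the difference of two time points is never negative.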

Excerpt of the code used in the experiments

Below is an extract of the essential parts from the whole verification code. Some function names and surrounding definitions are omitted for brevity in explanation.

// 1. Simple loop
__declspec(noinline) [[nodiscard]] std::uint64_t direct_simple(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]) + member_;
    }
    return acc;
}

// 2. Somewhat complex loop
__declspec(noinline) [[nodiscard]] std::uint64_t direct_complex(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const auto value = static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]);
        if ((i & 1U) == 0) {
            acc += (value * 3U) + member_;
        } else {
            acc ^= (value + member_) << (i & 7U);
        }
        acc += (value ^ member_) & 0x1FU;
    }
    return acc;
}

// 3. With noinline function calls
__declspec(noinline) [[nodiscard]] std::uint64_t direct_with_call(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const auto value = static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]);
        acc += add_with_barrier(value, member_);
    }
    return acc;
}

// 4. Alias-like case
__declspec(noinline) [[nodiscard]] std::uint64_t direct_alias_like(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    const auto* self = this;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const auto value = static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]);
        touch_pointer(self);
        acc += self->member_ + value;
    }
    return acc;
}

// 5. Case where this escapes outside
__declspec(noinline) [[nodiscard]] std::uint64_t direct_this_escape(std::uint64_t iterations) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const auto value = static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]);
        observe_nonconst(this);
        acc += member_ + value;
    }
    return acc;
}

// 6. Comparison between this->member and obj.member: this->member
__declspec(noinline) [[nodiscard]] std::uint64_t direct_simple(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]) + member_;
    }
    return acc;
}

// 7. Comparison between this->member and obj.member: obj.member
__declspec(noinline) std::uint64_t direct_ref_simple(const BenchTarget& target, std::uint64_t iterations) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(target.data_[static_cast<std::size_t>(i % target.data_.size())]) + target.member_;
    }
    return acc;
}

// 8. Comparison between inline getter and noinline getter: inline getter
__declspec(noinline) [[nodiscard]] std::uint64_t inline_getter_loop(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]) + inline_getter();
    }
    return acc;
}

// 9. Comparison between inline getter and noinline getter: noinline getter
__declspec(noinline) [[nodiscard]] std::uint64_t noinline_getter_loop(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]) + noinline_getter();
    }
    return acc;
}
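
The excerpts above call several helpers whose definitions are omitted from the article. The following are hypothetical stand-ins, consistent with how each helper is used in the excerpts; the real definitions may differ. The `BENCH_NOINLINE` macro is my own addition to keep the sketch portable beyond MSVC.

```cpp
#include <cstdint>
#include <vector>

// Portability shim (assumption): the article uses __declspec(noinline).
#if defined(_MSC_VER)
#define BENCH_NOINLINE __declspec(noinline)
#else
#define BENCH_NOINLINE __attribute__((noinline))
#endif

struct BenchTarget {
    std::vector<std::uint32_t> data_{1, 2, 3, 4, 5, 6, 7, 8};
    std::uint32_t member_ = 42;

    // noinline call barrier used by direct_with_call
    BENCH_NOINLINE static std::uint64_t add_with_barrier(std::uint64_t a,
                                                         std::uint64_t b) {
        return a + b;
    }
    // opaque observer of a const pointer, used by direct_alias_like
    BENCH_NOINLINE static void touch_pointer(const BenchTarget*) {}
    // opaque observer of a mutable pointer; lets `this` escape
    BENCH_NOINLINE static void observe_nonconst(BenchTarget*) {}
    // trivially inlinable accessor
    std::uint32_t inline_getter() const { return member_; }
    // accessor that must remain an out-of-line call
    BENCH_NOINLINE std::uint32_t noinline_getter() const { return member_; }
};
```

What matters for the experiments is only each helper's shape at the call site (const vs non-const pointer, inlinable vs not), not its body.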

Baseline Experiments

First, let's look at the most basic comparison. I'll observe if direct references to this->member alone are really slower.

Release Results

Pattern                        Direct Reference   Local Cache   Difference
Simple loop                    23.145 ms          22.616 ms     +0.529 ms
Somewhat complex loop          24.163 ms          22.753 ms     +1.410 ms
With noinline function calls   23.061 ms          22.623 ms     +0.438 ms
Alias-like case                22.510 ms          22.759 ms     -0.249 ms

In Release, the differences across all four patterns were quite small and did not show a consistent advantage for either approach. The simple loop, noinline function calls, and alias-like cases were nearly equivalent. Only the somewhat complex loop showed a slight advantage for local caching, but even this was not a dramatic difference.

Debug Results

Pattern                        Direct Reference   Local Cache   Difference
Simple loop                    61.166 ms          59.519 ms     +1.647 ms
Somewhat complex loop          68.370 ms          66.793 ms     +1.577 ms
With noinline function calls   78.261 ms          77.087 ms     +1.174 ms
Alias-like case                78.001 ms          77.391 ms     +0.610 ms

In Debug, local caching was slightly faster across all four patterns. However, the differences were only a few percent, so even here, direct reference was not extremely slow. The overall assessment from the baseline experiments is that differences almost disappear in Release and become slightly more visible in Debug.

Assembly Observation

In the baseline experiments, I checked the AsmRelease output in addition to the numbers. Here, I'll look at what happened in each of the four patterns.

Simple Loop

In the simple loop, both direct reference and local cache loaded the member only once before the loop, and then register values were used afterward.

; Direct reference
mov r11d, DWORD PTR [rcx]
...
add rax, r11
; Local cache
mov edi, DWORD PTR [rcx]
...
add rax, rdi

Even though the code is written differently for direct reference and local cache, the assembly is almost identical. At least in this case, it cannot simply be said that "it's slow because this->member is read within the loop."

Somewhat Complex Loop

In the somewhat complex loop as well, the member was loaded once before the loop. Despite the branches and extra operations, MSVC still kept the value in a register across the loop body.

; Direct reference
mov r10d, DWORD PTR [rcx]
...
lea eax, [r10+rax*3]
...
mov ecx, r10d
; Local cache
mov esi, DWORD PTR [rcx]
...
lea eax, [rsi+rax*3]
...
mov ecx, esi

In this pattern too, direct reference did not re-read memory on every iteration. A moderately complex loop body alone is not enough to defeat the optimization.

With noinline Function Calls

Even in the case with noinline function calls, the member was extracted before the loop. It doesn't reload the member with each call.

; Direct reference
mov edi, DWORD PTR [rcx]
...
mov edx, edi
call add_with_barrier
; Local cache
mov esi, DWORD PTR [rcx]
...
mov edx, esi
call add_with_barrier

This level of noinline calls did not result in disadvantageous assembly for direct reference. Here too, the difference between direct reference and local cache is quite small.

Alias-like Case

The alias-like case showed clear differences in assembly. In the method that directly references the member after touch_pointer(self), in-loop reloading remained.

; Direct reference
call touch_pointer
mov ecx, DWORD PTR [r11]
...
add rax, rcx
; Local cache
mov edi, DWORD PTR [rcx]
...
call touch_pointer
lea rcx, [rdi+r8]

Only in this case was there a clear assembly difference between direct reference and local cache. However, even so, the measured difference was limited. It shows that clear assembly differences and large execution time differences are separate matters.

Looking across the four baseline patterns, the assembly for direct reference and local cache was almost identical in three of them, with a clear difference only in the alias-like case. At least with this version of MSVC, the meaningful question is which cases leave a reload inside the loop, not whether a member is referenced directly.

Additional Experiments

The baseline experiments hinted that local caching has some effect in Debug, but the picture was not yet clear. So in the additional experiments, I deliberately created conditions where differences might appear.

Release Results

Pattern                              Classification    Direct Reference   Local Cache   Difference
this passed to external function     -                 22.934 ms          22.680 ms     +0.254 ms
this->member vs obj.member           this->member      23.774 ms          22.847 ms     +0.927 ms
this->member vs obj.member           obj.member        23.073 ms          22.499 ms     +0.574 ms
inline vs noinline getter            inline getter     22.461 ms          22.590 ms     -0.129 ms
inline vs noinline getter            noinline getter   22.159 ms          22.286 ms     -0.127 ms

Overall, the differences remained quite small in the additional experiments as well. Passing this to an external function did change code generation, but the measured difference was limited. Neither the this->member vs obj.member comparison nor the getter variants showed large differences in this configuration.

Debug Results

Pattern                              Classification    Direct Reference   Local Cache   Difference
this passed to external function     -                 81.551 ms          80.794 ms     +0.757 ms
this->member vs obj.member           this->member      60.696 ms          62.371 ms     -1.675 ms
this->member vs obj.member           obj.member        60.916 ms          59.882 ms     +1.034 ms
inline vs noinline getter            inline getter     82.071 ms          60.007 ms     +22.064 ms
inline vs noinline getter            noinline getter   84.161 ms          60.834 ms     +23.327 ms

In Debug, the getter cases stood out. Both the inline getter and the noinline getter were significantly faster with local caching, reflecting the direct cost of calling a getter on every iteration. This difference, however, comes from whether a getter call remains inside the loop, not from the member reference itself.
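
The article only excerpts the direct variants; the local-cache counterpart that produced the Debug speedup would look roughly like this. `GetterTarget` and `cached_getter_loop` are hypothetical names, and the noinline attribute is omitted here for portability.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of the local-cache variant of the getter loop: the getter is
// called once before the loop and its result reused, which is exactly
// what removes the per-iteration call cost measured in Debug.
struct GetterTarget {
    std::vector<std::uint32_t> data_{1, 2, 3, 4};
    std::uint32_t member_ = 7;

    std::uint32_t noinline_getter() const { return member_; }

    std::uint64_t cached_getter_loop(std::uint64_t iterations) const {
        const std::uint64_t m = noinline_getter();  // single call, hoisted by hand
        std::uint64_t acc = 0;
        for (std::uint64_t i = 0; i < iterations; ++i) {
            acc += static_cast<std::uint64_t>(
                       data_[static_cast<std::size_t>(i % data_.size())]) + m;
        }
        return acc;
    }
};
```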

Assembly Observation

In additional experiments as well, I checked the AsmRelease output beyond just numerical differences.

Case Where this Escapes Outside

In this case, only direct reference retained the in-loop reload. Because each iteration crosses observe_nonconst(this), the compiler presumably cannot prove that the member is unchanged across the call and must read it again.

; Direct reference
call observe_nonconst
mov ecx, DWORD PTR [r11]
...
add rax, rcx

On the other hand, local cache extracted the value before the loop.

; Local cache
mov edi, DWORD PTR [rcx]
...
call observe_nonconst
lea rcx, [rdi+r8]

While there are clear assembly differences between direct reference and local cache, the measured differences remained small, showing that assembly differences don't directly translate to large execution time differences.
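
The reload is not conservatism for its own sake: if the callee that receives the escaped pointer actually mutated the member, caching before the loop would change the result. The article's observe_nonconst presumably mutates nothing, but the following sketch (with a hypothetical mutating observer, `bump_member`) shows why the compiler cannot assume that.

```cpp
#include <cstdint>

// Why the compiler must reload after `this` escapes: if the callee can
// mutate the member, a pre-loop snapshot diverges from reality.
struct Escapee {
    std::uint32_t member_ = 0;

    static void bump_member(Escapee* p) { ++p->member_; }  // mutating observer

    std::uint64_t direct_sum(std::uint64_t iterations) {
        std::uint64_t acc = 0;
        for (std::uint64_t i = 0; i < iterations; ++i) {
            bump_member(this);  // member_ changes under the loop
            acc += member_;     // must observe the updated value
        }
        return acc;
    }

    std::uint64_t cached_sum(std::uint64_t iterations) {
        const std::uint64_t m = member_;  // snapshot: misses every update
        std::uint64_t acc = 0;
        for (std::uint64_t i = 0; i < iterations; ++i) {
            bump_member(this);
            acc += m;
        }
        return acc;
    }
};
```

For three iterations starting from zero, direct_sum accumulates 1+2+3 while cached_sum accumulates 0+0+0; the two rewrites are only equivalent when the member really is loop-invariant.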

Comparison between this->member and obj.member

In the obj.member comparison, even with direct reference, the member was loaded before the loop. This is the same direction of optimization as baseline's this->member.

; Direct reference
mov r11d, DWORD PTR [rcx]
...
add rax, r11

Similarly with local cache, there was no in-loop reloading.

; Local cache
mov edi, DWORD PTR [rcx]
...
add rax, rdi

From these results, it's clear that in this simple case, the difference between this->member and obj.member itself is not dominant. The issue seems to be not whether it's this or obj, but whether the compiler can consider that value as loop-invariant.

inline getter

Inline getter also collapsed into almost the same code as direct reference. Getter calls do not remain within the loop.

; Direct reference
mov r11d, DWORD PTR [rcx]
...
add rax, r11

With local cache as well, it's loaded before the loop.

; Local cache
mov edi, DWORD PTR [rcx]
...
add rax, rdi

Under these conditions, inline getter was treated almost the same as simple member references.

noinline getter

The clearest case was the noinline getter. In noinline_getter_loop, a call to noinline_getter remained inside the loop on every iteration.

; Direct reference
...
call noinline_getter
mov ecx, eax
add rax, rcx

noinline_getter itself is trivial: it just reads the member and returns.

; noinline_getter body
mov eax, DWORD PTR [rcx]
ret 0

On the other hand, with local cache, the getter was called only once before the loop, and its return value was reused.

; Local cache
call noinline_getter
mov edi, eax
...
add rax, rdi

While there was a very clear assembly difference, the measured difference remained small in Release, showing that execution time is not determined solely by assembly differences.

Findings from Verification

First, it's difficult to generalize the old caution in Release. At least in this version of MSVC, simple this->member, obj.member, and inline getters are optimized and often become almost the same code as local cache.

On the other hand, differences did appear in Debug, in the case where this escapes, and with the noinline getter, that is, wherever the compiler cannot safely retain the value. The essence is not that member variables are slow, but whether the compiler can treat the value as loop-invariant. In practice, there is no need to sacrifice readability by always extracting members to local variables.

Conclusion

In this verification, we found that in modern MSVC, it cannot be said that "directly referencing non-static member variables within loops is always slow." In Release, it's sufficiently optimized in many cases, while differences remain in Debug, this escape, and noinline getter-like forms. In the end, the criterion to base judgments on should not be whether it's a member variable, but whether it's a code shape that allows the compiler to retain that value.
