
Is it faster in C++ to cache non-static member variables in local variables? Verified with MSVC
Introduction
When writing loop processing in C++, it is sometimes said that "it's better to extract a non-static member variable to a local variable first rather than directly referencing it within the loop every time."
Example:
Code considered slow
for (int i = 0; i < count; ++i) {
    sum += this->member_;
}
Modified code
const int value = this->member_;
for (int i = 0; i < count; ++i) {
    sum += value;
}
However, modern compilers have become quite intelligent, and with MSVC this caution is not always valid. In this article, we benchmark with MSVC, examine the generated assembly, and verify the following questions:
- Is there a runtime difference between direct reference and local caching?
- Does the trend change between Debug and Release?
- Under what conditions are differences more likely to appear?
What is MSVC
MSVC stands for Microsoft Visual C++ Compiler. It is a C++ compiler included in Visual Studio and is widely used in Windows environments. This article does not deal with general C++ principles but confirms how MSVC actually optimizes references to non-static member variables.
Test Environment
- OS: Windows
- Compiler: MSVC 19.44.35223
- Architecture: x64
- Build Configurations: Debug / Release / AsmRelease
- Measurement method: `std::chrono::steady_clock`
Target Audience
- Those interested in C++ performance optimization
- Those who want to know about MSVC's optimization behavior
- Those who want to verify if traditional cautions still apply today
References
- MSVC Compiler Options
- Compiler options listed by category
- /FA, /Fa (Listing file)
- /Ob (Inline Function Expansion)
Verification Approach
In this verification, I avoided drawing conclusions based solely on benchmark numbers. Even if there is a numerical difference, it's unclear whether it's due to the member reference difference or other code shape differences.
I will proceed with a strategy combining execution time measurement and assembly code examination. First, I prepared Build Configurations: Debug, Release, and AsmRelease. AsmRelease was prepared to examine the assembly while maintaining Release-equivalent optimization.
I started with baseline experiments measuring simple cases, and then conducted additional experiments to examine conditions where differences might be more apparent. Beyond comparing execution times, I also checked the AsmRelease assembly output to see if loads remained within the loop each time, or if values were extracted before the loop and kept in registers.
In the baseline experiments, I compared the following 4 patterns:
- Simple loop
- Somewhat complex loop
- With noinline function calls
- Alias-like case
Additionally, I checked these cases in follow-up experiments:
- Cases where the compiler tends to be conservative because `this` is passed to external functions
- Comparison between `this->member` and `obj.member`
- Comparison between an inline getter and a noinline getter
I ran each measurement 7 times with 10,000,000 iterations per run and report the minimum, to observe trends less affected by noise. This is not a strict statistical comparison but an indicator to check consistency with the assembly differences.
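The measurement protocol can be sketched as follows. This is a hypothetical harness, not the article's original code; the name `min_elapsed_ms` and its signature are illustrative.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical sketch of the measurement harness assumed in this article:
// run the benchmark `repeats` times (7 in the article) and keep the
// minimum elapsed time in milliseconds, measured with steady_clock.
template <typename Fn>
double min_elapsed_ms(Fn&& fn, int repeats = 7) {
    double best = std::numeric_limits<double>::max();
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        fn();  // one full run (10,000,000 iterations in the article)
        const auto end = std::chrono::steady_clock::now();
        best = std::min(best,
            std::chrono::duration<double, std::milli>(end - start).count());
    }
    return best;
}
```

Taking the minimum of several runs is a common way to filter out scheduler and cache noise, at the cost of hiding run-to-run variance.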
Excerpt of the code used in the experiments
Below is an extract of the essential parts of the full verification code. Some function names and surrounding definitions are omitted for brevity.
// 1. Simple loop
__declspec(noinline) [[nodiscard]] std::uint64_t direct_simple(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]) + member_;
    }
    return acc;
}

// 2. Somewhat complex loop
__declspec(noinline) [[nodiscard]] std::uint64_t direct_complex(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const auto value = static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]);
        if ((i & 1U) == 0) {
            acc += (value * 3U) + member_;
        } else {
            acc ^= (value + member_) << (i & 7U);
        }
        acc += (value ^ member_) & 0x1FU;
    }
    return acc;
}

// 3. With noinline function calls
__declspec(noinline) [[nodiscard]] std::uint64_t direct_with_call(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const auto value = static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]);
        acc += add_with_barrier(value, member_);
    }
    return acc;
}

// 4. Alias-like case
__declspec(noinline) [[nodiscard]] std::uint64_t direct_alias_like(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    const auto* self = this;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const auto value = static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]);
        touch_pointer(self);
        acc += self->member_ + value;
    }
    return acc;
}

// 5. Case where this escapes outside
__declspec(noinline) [[nodiscard]] std::uint64_t direct_this_escape(std::uint64_t iterations) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        const auto value = static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]);
        observe_nonconst(this);
        acc += member_ + value;
    }
    return acc;
}

// 6. Comparison between this->member and obj.member: this->member
__declspec(noinline) [[nodiscard]] std::uint64_t direct_simple(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]) + member_;
    }
    return acc;
}

// 7. Comparison between this->member and obj.member: obj.member
__declspec(noinline) std::uint64_t direct_ref_simple(const BenchTarget& target, std::uint64_t iterations) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(target.data_[static_cast<std::size_t>(i % target.data_.size())]) + target.member_;
    }
    return acc;
}

// 8. Comparison between inline getter and noinline getter: inline getter
__declspec(noinline) [[nodiscard]] std::uint64_t inline_getter_loop(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]) + inline_getter();
    }
    return acc;
}

// 9. Comparison between inline getter and noinline getter: noinline getter
__declspec(noinline) [[nodiscard]] std::uint64_t noinline_getter_loop(std::uint64_t iterations) const {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        acc += static_cast<std::uint64_t>(data_[static_cast<std::size_t>(i % data_.size())]) + noinline_getter();
    }
    return acc;
}
Baseline Experiments
First, let's look at the most basic comparison: whether directly referencing this->member really is slower.
Release Results
| Pattern | Direct Reference | Local Cache | Difference |
|---|---|---|---|
| Simple loop | 23.145 ms | 22.616 ms | 0.529 ms |
| Somewhat complex loop | 24.163 ms | 22.753 ms | 1.410 ms |
| With noinline function calls | 23.061 ms | 22.623 ms | 0.438 ms |
| Alias-like case | 22.510 ms | 22.759 ms | -0.249 ms |
In Release, the differences across all four patterns were quite small and did not show a consistent advantage for either approach. The simple loop, noinline function calls, and alias-like cases were nearly equivalent. Only the somewhat complex loop showed a slight advantage for local caching, but even this was not a dramatic difference.
Debug Results
| Pattern | Direct Reference | Local Cache | Difference |
|---|---|---|---|
| Simple loop | 61.166 ms | 59.519 ms | 1.647 ms |
| Somewhat complex loop | 68.370 ms | 66.793 ms | 1.577 ms |
| With noinline function calls | 78.261 ms | 77.087 ms | 1.174 ms |
| Alias-like case | 78.001 ms | 77.391 ms | 0.610 ms |
In Debug, local caching was slightly faster across all four patterns. However, the differences were only a few percent, so even here, direct reference was not extremely slow. The overall assessment from the baseline experiments is that differences almost disappear in Release and become slightly more visible in Debug.
Assembly Observation
In the baseline experiments, I checked the AsmRelease output in addition to the numbers. Here, I'll look at what happened in each of the four patterns.
Simple Loop
In the simple loop, both direct reference and local cache loaded the member only once before the loop, and then register values were used afterward.
; Direct reference
mov r11d, DWORD PTR [rcx]
...
add rax, r11
; Local cache
mov edi, DWORD PTR [rcx]
...
add rax, rdi
Even though the code is written differently for direct reference and local cache, the assembly is almost identical. At least in this case, it cannot simply be said that "it's slow because this->member is read within the loop."
Somewhat Complex Loop
Even in the somewhat complex loop, the member was loaded once before the loop. Despite the branches and extra operations, MSVC was still able to keep the value in a register.
; Direct reference
mov r10d, DWORD PTR [rcx]
...
lea eax, [r10+rax*3]
...
mov ecx, r10d
; Local cache
mov esi, DWORD PTR [rcx]
...
lea eax, [rsi+rax*3]
...
mov ecx, esi
In this pattern as well, direct reference did not result in re-reading memory every time. It shows that "just because it's a somewhat complex loop doesn't immediately put you at a disadvantage."
With noinline Function Calls
Even in the case with noinline function calls, the member was extracted before the loop. It doesn't reload the member with each call.
; Direct reference
mov edi, DWORD PTR [rcx]
...
mov edx, edi
call add_with_barrier
; Local cache
mov esi, DWORD PTR [rcx]
...
mov edx, esi
call add_with_barrier
This level of noinline calls did not result in disadvantageous assembly for direct reference. Here too, the difference between direct reference and local cache is quite small.
Alias-like Case
The alias-like case showed clear differences in assembly. In the method that directly references the member after touch_pointer(self), in-loop reloading remained.
; Direct reference
call touch_pointer
mov ecx, DWORD PTR [r11]
...
add rax, rcx
; Local cache
mov edi, DWORD PTR [rcx]
...
call touch_pointer
lea rcx, [rdi+r8]
Only in this case was there a clear assembly difference between direct reference and local cache. However, even so, the measured difference was limited. It shows that clear assembly differences and large execution time differences are separate matters.
Looking across the four baseline patterns, the assembly for direct reference and local cache was almost identical in three of them, with a clear difference only in the alias-like case. At least with this version of MSVC, it is easy to misjudge the situation unless we check in which cases the in-loop reload actually remains.
Additional Experiments
Looking at just the baseline experiments, while there seemed to be a trend that local caching has some effect in Debug, it was not yet fully clear. So in additional experiments, I intentionally created conditions where differences might appear.
Release Results
| Pattern | Classification | Direct Reference | Local Cache | Difference |
|---|---|---|---|---|
| Case where the compiler tends to be conservative by passing `this` to external functions | - | 22.934 ms | 22.680 ms | 0.254 ms |
| Comparison between `this->member` and `obj.member` | `this->member` | 23.774 ms | 22.847 ms | 0.927 ms |
| Comparison between `this->member` and `obj.member` | `obj.member` | 23.073 ms | 22.499 ms | 0.574 ms |
| Comparison between inline getter and noinline getter | inline getter | 22.461 ms | 22.590 ms | -0.129 ms |
| Comparison between inline getter and noinline getter | noinline getter | 22.159 ms | 22.286 ms | -0.127 ms |
Overall, differences remained quite small in the additional experiments as well. While passing `this` to an external function did change code generation, the measured differences were limited. Neither the `this->member` vs `obj.member` comparison nor the getter comparisons showed large differences in this configuration.
Debug Results
| Pattern | Classification | Direct Reference | Local Cache | Difference |
|---|---|---|---|---|
| Case where the compiler tends to be conservative by passing `this` to external functions | - | 81.551 ms | 80.794 ms | 0.757 ms |
| Comparison between `this->member` and `obj.member` | `this->member` | 60.696 ms | 62.371 ms | -1.675 ms |
| Comparison between `this->member` and `obj.member` | `obj.member` | 60.916 ms | 59.882 ms | 1.034 ms |
| Comparison between inline getter and noinline getter | inline getter | 82.071 ms | 60.007 ms | 22.064 ms |
| Comparison between inline getter and noinline getter | noinline getter | 84.161 ms | 60.834 ms | 23.327 ms |
In Debug, the getter comparisons stood out. Both the inline getter and the noinline getter were significantly faster with local caching, showing the direct cost of calling a getter inside the loop on every iteration. However, these differences should be interpreted as a matter of whether getter calls remain inside the loop, rather than as a difference in non-static member references themselves.
Assembly Observation
In additional experiments as well, I checked the AsmRelease output beyond just numerical differences.
Case Where this Escapes Outside
In this case, only the direct reference retained the in-loop reload. Because each iteration crosses observe_nonconst(this), the compiler presumably cannot safely assume the member is unchanged.
; Direct reference
call observe_nonconst
mov ecx, DWORD PTR [r11]
...
add rax, rcx
On the other hand, local cache extracted the value before the loop.
; Local cache
mov edi, DWORD PTR [rcx]
...
call observe_nonconst
lea rcx, [rdi+r8]
While there are clear assembly differences between direct reference and local cache, the measured differences remained small, showing that assembly differences don't directly translate to large execution time differences.
Comparison between this->member and obj.member
In the obj.member comparison, even with direct reference, the member was loaded before the loop. This is the same direction of optimization as baseline's this->member.
; Direct reference
mov r11d, DWORD PTR [rcx]
...
add rax, r11
Similarly with local cache, there was no in-loop reloading.
; Local cache
mov edi, DWORD PTR [rcx]
...
add rax, rdi
From these results, it's clear that in this simple case, the difference between this->member and obj.member itself is not dominant. The issue seems to be not whether it's this or obj, but whether the compiler can consider that value as loop-invariant.
inline getter
Inline getter also collapsed into almost the same code as direct reference. Getter calls do not remain within the loop.
; Direct reference
mov r11d, DWORD PTR [rcx]
...
add rax, r11
With local cache as well, it's loaded before the loop.
; Local cache
mov edi, DWORD PTR [rcx]
...
add rax, rdi
Under these conditions, inline getter was treated almost the same as simple member references.
noinline getter
The most clear-cut case was the noinline getter. In noinline_getter_loop, a call to noinline_getter remained inside the loop on every iteration.
; Direct reference
...
call noinline_getter
mov ecx, eax
add rax, rcx
The noinline_getter itself is trivial: just a function that reads the member.
; noinline_getter body
mov eax, DWORD PTR [rcx]
ret 0
On the other hand, with local cache, the getter was called only once before the loop, and its return value was reused.
; Local cache
call noinline_getter
mov edi, eax
...
add rax, rdi
While there was a very clear assembly difference, the measured difference remained small in Release, showing that execution time is not determined solely by assembly differences.
Findings from Verification
First, it's difficult to generalize the old caution in Release. At least in this version of MSVC, simple this->member, obj.member, and inline getters are optimized and often become almost the same code as local cache.
On the other hand, differences did appear in Debug, in the case where this escapes, and with the noinline getter, that is, wherever the compiler cannot safely retain the value. The essence is not that member variables are slow, but whether the compiler can treat the value as loop-invariant. In practical terms, there is no need to sacrifice readability by always extracting members to local variables in everyday code.
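The loop-invariant point can be illustrated with a minimal example that is not from the article's code: when the loop writes through a pointer that might alias the source, the compiler must reload the source every iteration, and a local copy removes that obligation. The function names here are illustrative.

```cpp
#include <cstdint>

// Direct read: `out[i] = acc` may write through an alias of *src, so a
// conforming optimizer must reload *src on every iteration.
std::uint64_t sum_direct(const std::uint64_t* src, std::uint64_t* out,
                         std::uint64_t n) {
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        out[i] = acc;   // potential alias of *src
        acc += *src;
    }
    return acc;
}

// Local cache: the copy is provably loop-invariant, so it can live in a
// register regardless of what the stores do.
std::uint64_t sum_cached(const std::uint64_t* src, std::uint64_t* out,
                         std::uint64_t n) {
    const std::uint64_t v = *src;
    std::uint64_t acc = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        out[i] = acc;
        acc += v;
    }
    return acc;
}
```

When `src` and `out` do not overlap, both functions return the same result; the difference is only in what the compiler is allowed to prove, which mirrors the `this`-escape and alias-like cases above.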
Conclusion
In this verification, we found that in modern MSVC, it cannot be said that "directly referencing non-static member variables within loops is always slow." In Release, it's sufficiently optimized in many cases, while differences remain in Debug, this escape, and noinline getter-like forms. In the end, the criterion to base judgments on should not be whether it's a member variable, but whether it's a code shape that allows the compiler to retain that value.