Optimization of access to members in C++
I'm running into an inconsistent optimization behavior with different compilers for the following code:
class tester
{
public:
    tester(int* arr_, int sz_)
        : arr(arr_), sz(sz_)
    {}

    int doadd()
    {
        sm = 0;
        for (int n = 0; n < 1000; ++n)
        {
            for (int i = 0; i < sz; ++i)
            {
                sm += arr[i];
            }
        }
        return sm;
    }

protected:
    int* arr;
    int sz;
    int sm;
};
The doadd function simulates some intensive access to members (ignore the overflows in addition for this question). Compared with similar code implemented as a function:
int arradd(int* arr, int sz)
{
    int sm = 0;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}
The doadd method runs about 1.5 times slower than the arradd function when compiled in Release mode with Visual C++ 2008. When I modify the doadd method to be as follows (copying all members into locals):
int doadd()
{
    int mysm = 0;
    int* myarr = arr;
    int mysz = sz;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < mysz; ++i)
        {
            mysm += myarr[i];
        }
    }
    sm = mysm;
    return sm;
}
Runtimes become roughly the same. Am I right in concluding that this is a missed optimization by the Visual C++ compiler? g++ seems to do better, running both the member function and the free function at the same speed when compiling with -O2 or -O3.
The benchmarking is done by invoking the doadd member and the arradd function on a sufficiently large array (a few million integers in size).
EDIT: Some fine-grained testing shows that the main culprit is the sm member. Replacing all the others with local versions still leaves the runtime long, but once I replace sm with mysm the runtime becomes equal to the function version.
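For reference, the fast variant I'm describing looks roughly like this (doadd_localsum is just an illustrative name; only sm is replaced by a local, while arr and sz are still read as members):

int doadd_localsum()
{
    // Accumulate into a local; arr and sz are still accessed through 'this'.
    int mysm = 0;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            mysm += arr[i];
        }
    }
    sm = mysm;   // write the member back once, at the end
    return sm;
}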
Resolution
Disappointed with the answers (sorry guys), I shook off my laziness and dove into the disassembly listings for this code. My answer below summarizes the findings. In short: it has nothing to do with aliasing; it has everything to do with loop unrolling, and with some strange heuristics MSVC applies when deciding which loop to unroll.
It may be an aliasing issue - the compiler cannot know that the instance variable sm will never be pointed at by arr, so it has to treat sm as if it were effectively volatile, and store it back on every iteration. You could make sm a different type to test this hypothesis. Or just use a temporary local sum (which will get cached in a register) and assign it to sm at the end.
MSVC is correct, in that it is the only one that, given the code we've seen, is guaranteed to work correctly. GCC employs optimizations that are probably safe in this specific instance, but that can only be verified by seeing more of the program.
Because sm is not a local variable, MSVC apparently assumes that it might alias arr.
That's a fairly reasonable assumption: because arr is protected, a derived class might set it to point to sm, so arr could alias sm.
GCC sees that sm doesn't actually alias arr, and so it doesn't write sm back to memory until after the loop, which is much faster.
It's certainly possible to instantiate the class so that arr points to sm, which MSVC would handle, but GCC wouldn't.
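For concreteness, here is a sketch of how a derived class could legally set up that aliasing (evil_tester is a made-up name, purely illustrative):

class evil_tester : public tester
{
public:
    evil_tester()
        : tester(0, 1)   // sz == 1, so the loop only ever reads arr[0]
    {
        arr = &sm;       // arr and sm are both protected, so this is allowed
    }
};

With sz == 1 the loop stays in bounds, so this aliasing is perfectly legal - which is exactly the case GCC's optimization has to worry about, as explained below.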
Assuming that sz > 1, GCC's optimization is permissible in general. Because the function loops over arr, treating it as an array of sz elements, calling the function with sz > 1 would yield undefined behavior if arr did alias sm, and so GCC can safely assume that they don't alias. But if sz == 1, or if the compiler can't be sure what sz's value might be, then it runs the risk that sz might be 1, in which case arr and sm could alias perfectly legally, and GCC's code would break.
So most likely, GCC simply gets away with it by inlining the whole thing, and seeing that in this case, they don't alias.
I disassembled the code with MSVC to better understand what's going on. Turns out aliasing wasn't a problem at all, and neither was some kind of paranoid thread safety.
Here is the interesting part of the arradd function, disassembled:
for (int n = 0; n < 10; ++n)
{
for (int i = 0; i < sz; ++i)
013C101C mov ecx,ebp
013C101E mov ebx,29B9270h
{
sm += arr[i];
013C1023 add eax,dword ptr [ecx-8]
013C1026 add edx,dword ptr [ecx-4]
013C1029 add esi,dword ptr [ecx]
013C102B add edi,dword ptr [ecx+4]
013C102E add ecx,10h
013C1031 sub ebx,1
013C1034 jne arradd+23h (13C1023h)
013C1036 add edi,esi
013C1038 add edi,edx
013C103A add eax,edi
013C103C sub dword ptr [esp+10h],1
013C1041 jne arradd+16h (13C1016h)
013C1043 pop edi
013C1044 pop esi
013C1045 pop ebp
013C1046 pop ebx
ecx points to the array, and we can see that the inner loop is unrolled x4 here - note the four consecutive add instructions reading from consecutive addresses, and ecx being advanced by 16 bytes (4 words) at a time inside the loop.
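Roughly, that x4-unrolled inner loop corresponds to something like the following C++ (my own sketch of the shape of the generated code, ignoring the outer loop and any leftover elements when sz isn't a multiple of 4):

// Four independent accumulators, matching the four add instructions above.
int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
for (int i = 0; i + 3 < sz; i += 4)
{
    s0 += arr[i];
    s1 += arr[i + 1];
    s2 += arr[i + 2];
    s3 += arr[i + 3];
}
sm += s0 + s1 + s2 + s3;   // combine the partial sums, as the code after the loop does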
For the unoptimized version of the member function, doadd:
int tester::doadd()
{
    sm = 0;
    for (int n = 0; n < 10; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}
The disassembly is (it's harder to find since the compiler inlined it into main):
int tr_result = tr.doadd();
013C114A xor edi,edi
013C114C lea ecx,[edi+0Ah]
013C114F nop
013C1150 xor eax,eax
013C1152 add edi,dword ptr [esi+eax*4]
013C1155 inc eax
013C1156 cmp eax,0A6E49C0h
013C115B jl main+102h (13C1152h)
013C115D sub ecx,1
013C1160 jne main+100h (13C1150h)
Note 2 things:
- The sum is stored in a register - edi. Hence, no aliasing "care" is taken here: the value of sm isn't re-read all the time. edi is initialized just once and then used as a temporary. You don't see it written back, since the compiler optimized that away and used edi directly as the return value of the inlined code.
- The loop is not unrolled. Why? No good reason.
Finally, here's an "optimized" version of the member function, with mysm keeping the sum in a local manually:
int tester::doadd_opt()
{
    sm = 0;
    int mysm = 0;
    for (int n = 0; n < 10; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            mysm += arr[i];
        }
    }
    sm = mysm;
    return sm;
}
The (again, inlined) disassembly is:
int tr_result_opt = tr_opt.doadd_opt();
013C11F6 xor edi,edi
013C11F8 lea ebp,[edi+0Ah]
013C11FB jmp main+1B0h (13C1200h)
013C11FD lea ecx,[ecx]
013C1200 xor ecx,ecx
013C1202 xor edx,edx
013C1204 xor eax,eax
013C1206 add ecx,dword ptr [esi+eax*4]
013C1209 add edx,dword ptr [esi+eax*4+4]
013C120D add eax,2
013C1210 cmp eax,0A6E49BFh
013C1215 jl main+1B6h (13C1206h)
013C1217 cmp eax,0A6E49C0h
013C121C jge main+1D1h (13C1221h)
013C121E add edi,dword ptr [esi+eax*4]
013C1221 add ecx,edx
013C1223 add edi,ecx
013C1225 sub ebp,1
013C1228 jne main+1B0h (13C1200h)
The loop here is unrolled, but just x2.
This explains my speed-difference observations quite well. For a 175e6 array, the function runs ~1.2 secs, the unoptimized member ~1.5 secs, and the optimized member ~1.3 secs. (Note that this may differ for you; on another machine I got closer runtimes for all 3 versions.)
What about gcc? When compiled with it, all 3 versions ran at ~1.5 secs. Suspecting a lack of unrolling, I looked at gcc's disassembly, and indeed: gcc doesn't unroll any of the versions.
As Paul wrote, it is probably because the sm member really is updated in "real" memory every time, while the local sum in the function can be accumulated in a register (after compiler optimization).
You can get similar issues when passing in pointer arguments. If you like getting your hands dirty, you may find the restrict keyword useful in future.
http://developers.sun.com/solaris/articles/cc_restrict.html
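For example, MSVC and GCC expose this in C++ as the __restrict extension (restrict itself is a C99 keyword). A sketch, not benchmarked against the code above, that promises the compiler nothing else aliases the memory reached through arr:

// __restrict tells the compiler that arr is the only way this function
// reaches that memory, so the running sum can safely stay in a register.
int arradd_restrict(int* __restrict arr, int sz)
{
    int sm = 0;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < sz; ++i)
        {
            sm += arr[i];
        }
    }
    return sm;
}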
This isn't really the same code at all. If you put the sm, arr and sz variables inside the class instead of making them local, the compiler can't (easily) guess that some other class won't inherit from the tester class and access these members, doing something like `arr = &sm; doadd();`. Hence, accesses to these variables can't be optimized away the way they can when the variables are local to the function.

In the end the reason is basically the one Paul pointed out: sm is updated in real memory when it's a class member, but can be kept in a register when it's a local in a function. The memory reads in the addition shouldn't change the resulting time much, as memory must be read anyway to get each value.

If tester is not exported to another module, is not aliased even indirectly through something exported, and there is no aliasing like the above, the compiler could optimize away the intermediate writes to sm. Some compilers like gcc seem to optimize aggressively enough to detect such cases (would they still if the tester class were exported?). But these are really hard guesses. There are still much simpler optimizations that compilers don't yet perform (like inlining across different compilation units).
The key is probably that doadd is written like this if you make the member accesses explicit with this:
int doadd()
{
    this->sm = 0;
    for (int n = 0; n < 1000; ++n)
    {
        for (int i = 0; i < this->sz; ++i)
        {
            this->sm += this->arr[i];
        }
    }
    return this->sm;
}
Therein lies the problem: all class members are accessed via the this pointer, whereas arradd has all its variables on the stack. You found that by moving all the members onto the stack as local variables, the speed then matches arradd. So this indicates that the this indirection is responsible for the performance loss.
Why might that be? As I understand it, this is usually stored in a register, so I don't think it's ultimately any slower than just accessing the stack (which is an offset into the stack pointer as well). As other answers point out, it's probably the aliasing problem that generates less optimal code: the compiler can't tell whether any of the memory addresses overlap. Updating sm could also, in theory, change the contents of arr, so it decides to write the value of sm out to main memory every time rather than tracking it in a register. When variables are on the stack, the compiler can assume they're all at different memory addresses. The compiler doesn't see the program as clearly as you do: it can tell what's on the stack (because you declared it like that), but everything else is just arbitrary memory addresses that could be anything, anywhere, overlapping any other pointer.
I'm not surprised the optimisation in your question (using local variables) isn't made - not only would the compiler have to prove that the memory of arr does not overlap anything pointed to by this, but also that not updating the member variables until the end of the function is equivalent to the unoptimised version that updates them throughout the function. That can be a lot trickier to determine than you imagine, especially if you take concurrency into account.