STL find performs better than hand-crafted loop
I have a question about the following C++ code fragment:
#include <boost/progress.hpp>
#include <vector>
#include <algorithm>
#include <numeric>
#include <iostream>
struct incrementor
{
    incrementor() : curr_() {}
    unsigned int operator()() { return curr_++; }
private:
    unsigned int curr_;
};

template<class Vec>
char const* value_found(Vec const& v, typename Vec::const_iterator i)
{
    return i == v.end() ? "no" : "yes";
}

template<class Vec>
typename Vec::const_iterator find1(Vec const& v, typename Vec::value_type val)
{
    return std::find(v.begin(), v.end(), val);
}

template<class Vec>
typename Vec::const_iterator find2(Vec const& v, typename Vec::value_type val)
{
    for(typename Vec::const_iterator i = v.begin(), end = v.end(); i < end; ++i)
        if(*i == val) return i;
    return v.end();
}

int main()
{
    using namespace std;
    typedef vector<unsigned int>::const_iterator iter;
    vector<unsigned int> vec;
    vec.reserve(10000000);

    boost::progress_timer pt;
    generate_n(back_inserter(vec), vec.capacity(), incrementor());
    // shuffle added so there is no doubt that the compiler
    // cannot assume the data is sorted
    random_shuffle(vec.begin(), vec.end());
    cout << "value generation required: " << pt.elapsed() << endl;

    double d;
    pt.restart();
    iter found = find1(vec, vec.capacity());
    d = pt.elapsed();
    cout << "first search required: " << d << endl;
    cout << "first search found value: " << value_found(vec, found) << endl;

    pt.restart();
    found = find2(vec, vec.capacity());
    d = pt.elapsed();
    cout << "second search required: " << d << endl;
    cout << "second search found value: " << value_found(vec, found) << endl;
    return 0;
}
On my machine (Intel i7, Windows Vista), STL find (called via find1) runs about 10 times faster than the hand-crafted loop (called via find2). I first thought that Visual C++ performs some kind of vectorization (maybe I am mistaken here), but as far as I can see the assembly does not look as if it uses vectorization. Why is the STL loop faster? The hand-crafted loop is essentially identical to the loop in the body of STL find.
I was asked to post the program's output. Without shuffle:
value generation required: 0.078
first search required: 0.008
first search found value: no
second search required: 0.098
second search found value: no
With shuffle (caching effects):
value generation required: 1.454
first search required: 0.009
first search found value: no
second search required: 0.044
second search found value: no
Many thanks,
dusha.
P.S. I return the iterator and print whether the value was found in order to prevent the compiler from deciding the loop is not needed at all and optimizing it away. The searched-for value is obviously not in the vector.
P.P.S. I was asked to post the assembly generated for the find functions. Here it is:
found=find1(vec, vec.capacity());
001811D0 lea eax,[esp+5Ch]
001811D4 call std::vector<unsigned int,std::allocator<unsigned int> >::capacity (1814D0h)
001811D9 mov esi,dword ptr [esp+60h]
001811DD mov ecx,dword ptr [esp+64h]
001811E1 cmp esi,ecx
001811E3 je wmain+180h (1811F0h)
001811E5 cmp dword ptr [esi],eax
001811E7 je wmain+180h (1811F0h)
001811E9 add esi,4
001811EC cmp esi,ecx
001811EE jne wmain+175h (1811E5h)
found=find2(vec, vec.capacity());
001812AE lea eax,[esp+5Ch]
001812B2 call std::vector<unsigned int,std::allocator<unsigned int> >::capacity (1814D0h)
001812B7 mov ecx,dword ptr [esp+60h]
001812BB mov edx,dword ptr [esp+64h]
001812BF cmp ecx,edx
001812C1 je wmain+262h (1812D2h)
001812C3 cmp dword ptr [ecx],eax
001812C5 je wmain+34Fh (1813BFh)
001812CB add ecx,4
001812CE cmp ecx,edx
001812D0 jne wmain+253h (1812C3h)
find2 uses the ecx register instead of esi. What is the difference between these two registers? Can it be that esi assumes the pointer is properly aligned and therefore brings additional performance?
I read some assembly references: ecx is just a counter, whereas esi is the memory source. So I think the STL algorithm knows that a random access iterator is properly aligned and therefore uses memory pointers, whereas in the non-STL version there is no assumption about alignment. Am I right?
Visual C++'s find algorithm uses unchecked iterators, whereas your hand-written loop is using checked iterators.
My other guess was that you're calling std::vector<T>::end() on every iteration of your loop in find2, while std::find results in only one call to the begin and end accessors (I am an idiot: your loop already caches end()).
Make sure that you compile your code in release mode with checked iterators turned off: set _SECURE_SCL=0 in your preprocessor definitions.
Also, boost::progress_timer has a resolution of milliseconds, I believe (it is based on std::clock), which makes it very unreliable for accurately measuring short durations. You need to make the measured code significantly slower so that you get rid of other factors (such as your process being put on hold and so on). You should measure using the high-performance counters, as suggested by DeadMG.
Your measurement methodology is broken. Measuring the execution speed of code is very difficult to do correctly, because the total elapsed time depends on factors which may not be explicitly controlled by the code you've written.
Some things to check (you may consider some of these obvious):
- What optimisation settings are you using? You are testing a release build, right?
- You said you've checked the assembly code generated by the STL version and it isn't using vectorisation. But maybe it's using some other common optimisation method such as loop unrolling?
- Why are you using i < end rather than i != end in your loop? (I actually doubt this makes any difference, but who knows?)
(My original answer was completely stupid -- not sure why it got a vote -- I'm leaving it here since some of the comments relate to it)
In this case, I suspect you're just seeing the effects of the memory hierarchy. When you call find1(), the CPU has to read all the data from RAM. That data will then be stored in the CPU's cache, which is much (easily 10 to 100 times) faster than accessing RAM. When you call find2(), the CPU can read the whole array from cache memory, and so find2() takes less time to execute.
To get some more evidence, try swapping the code so that you measure find2() first and then find1(). If your results get reversed, it's likely that you're seeing the effect of the cache. If they don't, then it's something else.
Edit: After some further thought (actually, just after some thought), I think my original suspicion must be incorrect (the size of the array you're searching through makes it unlikely that the entire array has been put into the cache). There may still be cache effects, but they are probably somewhat more subtle. Still, try reversing the measurements anyway; it'd be interesting to see what effect it has.
find doesn't take a value_type; it takes a const value_type&. Now, for unsigned int this should make no difference. However, it's perfectly possible that your optimizer just didn't notice this and failed to optimize your loop body correctly.
Edit: What I would suggest is that you're kind of lying to the compiler with your use of the for loop. You could rewrite it as:
typename Vec::const_iterator i = v.begin();
typename Vec::const_iterator end = v.end();
while (i != end && *i != val)
    ++i;
return i;
Of course, the guy who wrote std::find knows exactly how smart the optimizer is, and what exactly it can and can't cope with.
Edit: I ran your test on my machine. It's an i7 930, no overclock, on Visual Studio 2010. I replaced boost::progress_timer with the high-performance counter.
__int64 frequency, begin, end;
QueryPerformanceFrequency((LARGE_INTEGER*)&frequency);
double d;
QueryPerformanceCounter((LARGE_INTEGER*)&begin);
iter found = find1(vec, vec.capacity());
QueryPerformanceCounter((LARGE_INTEGER*)&end);
d = ((end - begin) / (double)frequency) * 1000000; // microseconds
cout << "first search required: " << d << endl;
cout << "first search found value: " << value_found(vec, found) << endl;
QueryPerformanceCounter((LARGE_INTEGER*)&begin);
found = find2(vec, vec.capacity());
QueryPerformanceCounter((LARGE_INTEGER*)&end);
d = ((end - begin) / (double)frequency) * 1000000; // microseconds
cout << "second search required: " << d << endl;
cout << "second search found value: " << value_found(vec, found) << endl;
Says that both of them took roughly 0.24 microseconds to run - that is, there is no difference. My suggestion is that your optimizer is just immature, and that your version of std::find is written exactly to present the compiler with the correct optimization opportunities, whereas your find just doesn't tick the correct optimization boxes.
Edit: Your timing data is clearly busted. My i7 runs it in 0.23 microseconds - that is, 0.00000023 seconds - whereas yours takes 0.008 seconds. Unless my i7 is about 40,000 times faster than yours, there's no way. There is no way an i7 will take that long to traverse only ten million items. Of course, I'm actually running 64-bit Windows 7, although I didn't compile in 64-bit mode.
I'm going to post the disassembly now.
find1:
00F810D3 mov esi,dword ptr [esp+34h]
00F810D7 mov eax,dword ptr [esp+3Ch]
00F810DB mov ecx,dword ptr [esp+38h]
00F810DF sub eax,esi
00F810E1 sar eax,2
00F810E4 cmp esi,ecx
00F810E6 je main+0B3h (0F810F3h)
00F810E8 cmp dword ptr [esi],eax
00F810EA je main+0B3h (0F810F3h)
00F810EC add esi,4
00F810EF cmp esi,ecx
00F810F1 jne main+0A8h (0F810E8h)
find2:
00F8119A mov ecx,dword ptr [esp+34h]
00F8119E mov eax,dword ptr [esp+3Ch]
00F811A2 mov edx,dword ptr [esp+38h]
00F811A6 sub eax,ecx
00F811A8 sar eax,2
00F811AB cmp ecx,edx
00F811AD jae main+17Fh (0F811BFh)
00F811AF nop
00F811B0 cmp dword ptr [ecx],eax
00F811B2 je main+254h (0F81294h)
00F811B8 add ecx,4
00F811BB cmp ecx,edx
00F811BD jb main+170h (0F811B0h)
00F811BF mov esi,edx
You can see that find2 is slightly different from find1. I checked by replacing the call to find2 with another call to find1, which does produce identical disassembly. Curious that they produce different assembly.
I don't use Visual C++ myself, but with GCC I also got the result that find2 is a little slower. However, I was able to make find2 just slightly faster than find1 by manually unrolling the loop:
template<class Vec>
typename Vec::const_iterator find2(Vec const& v, typename Vec::value_type val)
{
    // note: assumes the range has an even number of elements
    for(typename Vec::const_iterator i = v.begin(), end = v.end(); i != end; ) {
        if (i[0] == val) return i;
        if (i[1] == val) return i + 1;
        i += 2;
    }
    return v.end();
}
My guess as to why std::find is faster is that the compiler has all the information it needs to figure out that the vector size is a multiple of 2, and that this unrolling is therefore possible. Another guess is that it is just a space/speed tradeoff - that the compiler skips this optimization in the general case.
A lot of C/C++ users complain that once they write a specialised version of a function, the non-specialised version outperforms it!
The reason is simple: when you are writing the optimisation passes of a compiler back end, you will think of ways to improve the code generated for std::find, and hence it outperforms your implementation.
Also note that std::find, for VC++ at least, has different versions, which invoke different functions and search algorithms for different kinds of iterators.
So, all in all, I think the compiler seems to understand your data is sorted and hence performs a better search.