Unexpected results with OpenMP on i7 and Xeon
While parallelizing two nested for-loops, I have run into behavior I cannot explain. I have tried three different kinds of parallelization using OpenMP on an i7 860 and a Xeon E5540, and I expected the code to behave more or less the same on both platforms, meaning that one of the platforms should be faster for all three cases I tested. But that is not the case:
- For case 1, the Xeon is faster by ~10%,
- for case 2, the i7 is faster by a factor of 2, and
- for case 3, the Xeon is again faster, by a factor of 1.5.
Do you have an idea what could cause this?
Please state when you need more information or clarification!
To further clarify, my question is meant more generally: if I run the same code on an i7 and on a Xeon system, shouldn't the use of OpenMP give comparable (proportional) results?
pseudo code:
for 1:4
    for 1:1000
        vector_multiplication
    end
end
The cases:
case 1: no pragma omp, no parallelization
case 2: pragma omp on the first (outer) for-loop
case 3: pragma omp on the second (inner) for-loop
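Roughly, in C, the three cases look like the sketch below (the function, array names, and sizes are placeholders for illustration, not my actual code; the output is sliced per (i, j) so the iterations are independent):

#include <stddef.h>

#define N_OUTER 4
#define N_INNER 1000
#define VEC_LEN 4096

static void vec_mult(double *out, const double *a, const double *b, size_t n)
{
    for (size_t k = 0; k < n; ++k)
        out[k] = a[k] * b[k];
}

/* case 1: the same loops with no pragma at all */

void case2(double *out, const double *a, const double *b)
{
    #pragma omp parallel for               /* case 2: outer loop, at most 4 threads busy */
    for (int i = 0; i < N_OUTER; ++i)
        for (int j = 0; j < N_INNER; ++j)
            vec_mult(out + (size_t)(i * N_INNER + j) * VEC_LEN, a, b, VEC_LEN);
}

void case3(double *out, const double *a, const double *b)
{
    for (int i = 0; i < N_OUTER; ++i) {
        #pragma omp parallel for           /* case 3: inner loop, 1000 iterations to share */
        for (int j = 0; j < N_INNER; ++j)
            vec_mult(out + (size_t)(i * N_INNER + j) * VEC_LEN, a, b, VEC_LEN);
    }
}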
Results
Here are the actual numbers from the time command:
case 1
Time Xeon i7
real 11m14.120s 12m53.679s
user 11m14.030s 12m46.220s
sys 0m0.080s 0m0.176s
case 2
Time Xeon i7
real 8m57.144s 4m37.859s
user 71m10.530s 29m07.797s
sys 0m0.300s 0m00.128s
case 3
Time Xeon i7
real 2m00.234s 3m35.866s
user 11m52.870s 22m10.799s
sys 0m00.170s 0m00.136s
[Update]
Thanks for all the hints. I am still researching what the reason could be.
There have been good answers here about possible variations due to compilation effects, etc., which are quite correct; but there are other reasons to expect differences. A fairly straightforward (e.g., low arithmetic intensity) computation like this tends to be very sensitive to memory bandwidth, and the amount of memory bandwidth available per thread will depend on how many threads you run. Is memory set up the same way on both systems?
It looks like the i7 860 has a higher clock speed, but the E5540 has higher total memory bandwidth. Since case 2 can only make use of 4 threads, and case 3 can make use of more, it's not at all crazy to think that in the 4-thread case the clock speed wins but in the 8-thread case the increased memory contention (8 threads trying to pull in/push out values) tips the hand to the higher-bandwidth Xeon.
Making this potentially more complicated is the fact that it looks like you're running 8 threads -- are these dual-socket systems or are you using hyperthreading? This makes the situation much more complicated, since hyperthreading actually helps hide some of the memory contention by switching in threads when another thread is stuck waiting for memory.
If you want to try to see if finite memory bandwidth is playing a role here, you can artificially add more computation to the problem (e.g., multiply exp(sin(a)) by cos(b)*cos(b) or something) to ensure the problem is compute-bound, eliminating one variable as you try to get to the bottom of things. Compiling the code on each system with optimizations for that particular machine (with -march or -xHost or what have you) eliminates another variable. If hyperthreading is on, turning it off (or just setting OMP_NUM_THREADS to the number of physical cores) gets rid of another variable. Once you understand what's going on in this simplified case, relaxing the restrictions above one by one should help you understand what's going on a little better.
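As a sketch of the idea (the array and function names here are made up, not from your code), the inner kernel could become something like:

#include <math.h>
#include <stddef.h>

/* Same loop shape, but with extra exp/sin/cos work per element so the
   kernel is compute-bound instead of bandwidth-bound; out, a, b are
   placeholder names. */
void heavy_kernel(double *out, const double *a, const double *b, size_t n)
{
    #pragma omp parallel for
    for (size_t k = 0; k < n; ++k)
        out[k] = exp(sin(a[k])) * cos(b[k]) * cos(b[k]);
}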
Things that can influence the efficiency of OpenMP are, e.g., the computation of the iteration bounds. Ensure that these are computed beforehand and that the iteration variables are as local as possible. If you have C99, do something like
#pragma omp parallel for
for (size_t i = start, is = stop; i < is; ++i) ...
to ensure that the expressions start and stop are evaluated at the beginning (or use _Pragma, see my other answer).
Then the only way to see if this really was successful is to look at the assembler code (usually the -S option).
If it isn't, look into the other compiler options. gcc has the option -march=native, which compiles optimally for the actual architecture. Other platforms might have similar options. Start by searching for the best architecture options for case 1.
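For example, with gcc it might look like this (the source file name is a placeholder, and -O3/-fopenmp are my usual additions, not something from the question):

gcc -O3 -march=native -fopenmp -S vector_mult.c      # writes vector_mult.s for inspection
gcc -O3 -march=native -fopenmp -o vector_mult vector_mult.c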
This is more of a comment than an answer.
It's not entirely clear what you have measured. For greater certainty I would want to:
- Insert timing statements at points in my code to report execution time. That way I know what I am measuring, with more certainty than the Linux time command gives me (see the sketch after this list).
- Use 2 different compilers, to ensure that I am measuring something about OpenMP rather than aspects of one implementation of it.
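For the first point, a minimal sketch of the kind of timing I mean, using omp_get_wtime (the function being measured is a placeholder):

#include <omp.h>
#include <stdio.h>

extern void run_all_cases(void);    /* placeholder for the code being measured */

int main(void)
{
    double t0 = omp_get_wtime();    /* wall-clock seconds before the region */
    run_all_cases();
    double t1 = omp_get_wtime();    /* wall-clock seconds after the region */
    printf("compute section: %.3f s\n", t1 - t0);
    return 0;
}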
All that aside, I tend to agree that your initial results bear further investigation.
One of the things I suggest you try is collapsing your two loops and letting OpenMP schedule 4000 iterations rather than 4 or 1000, as in the sketch below.
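A sketch of what I mean (vec_mult and the sizes are just placeholders matching the question's pseudo code; collapse needs OpenMP 3.0 or later):

#include <stddef.h>

extern void vec_mult(double *out, const double *a, const double *b, size_t n);

void run_collapsed(double *out, const double *a, const double *b)
{
    /* collapse(2) flattens the 4 x 1000 nest into one 4000-iteration
       space for the scheduler; the two loops must stay perfectly nested */
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 1000; ++j)
            vec_mult(out + (size_t)(i * 1000 + j) * 4096, a, b, 4096);
}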