
Unexpected results with OpenMP on i7 and Xeon

While parallelizing two nested for-loops, I have run into behavior I cannot explain. I have tried three different kinds of parallelization using OpenMP on an i7 860 and a Xeon E5540, and I expected the code to behave more or less the same on both platforms, meaning that one of the platforms should be consistently faster for all three cases I tested. But that is not the case:

  • For case 1, the Xeon is faster by ~10%,
  • for case 2, the i7 is faster by a factor of 2, and
  • for case 3, the Xeon is again faster, by a factor of 1.5.

Do you have an idea what could cause this?

Please state when you need more information or clarification!

To further clarify, my question is meant more generally: if I run the same code on an i7 and on a Xeon system, shouldn't the use of OpenMP give comparable (proportional) results?

pseudo code:

for 1:4
    for 1:1000
        vector_multiplication
    end
end

The cases:

case 1: no pragma omp, no parallelization

case 2: pragma omp on the first (outer) for-loop

case 3: pragma omp on the second (inner) for-loop
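
For reference, a minimal C sketch of how the three cases could look (vector_multiplication() stands in for the actual kernel, which is not shown here; case 1 is simply the same loops with no pragma at all):

void vector_multiplication(void);   /* placeholder for the real kernel */

void case2(void)                    /* pragma omp on the first (outer) loop */
{
    #pragma omp parallel for
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 1000; ++j)
            vector_multiplication();
}

void case3(void)                    /* pragma omp on the second (inner) loop */
{
    for (int i = 0; i < 4; ++i) {
        #pragma omp parallel for
        for (int j = 0; j < 1000; ++j)
            vector_multiplication();
    }
}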

Results

Here are the actual numbers from the time command:

case 1

Time   Xeon        i7
real   11m14.120s  12m53.679s
user   11m14.030s  12m46.220s
sys      0m0.080s    0m0.176s

case 2

Time   Xeon        i7
real    8m57.144s   4m37.859s
user   71m10.530s  29m07.797s
sys      0m0.300s   0m00.128s

case 3

Time   Xeon        i7
real    2m00.234s   3m35.866s
user   11m52.870s  22m10.799s
sys     0m00.170s   0m00.136s

[Update]

Thanks for all the hints. I am still researching what the reason could be.


There have been good answers here about possible variations from compilation effects and the like, which are quite correct; but there are other reasons to expect differences. A fairly straightforward (e.g., low-arithmetic-intensity) computation like this tends to be very sensitive to memory bandwidth, and the amount of memory bandwidth available per thread depends on how many threads you run. Is memory set up the same way on both systems?

It looks like the i7 860 has a higher clock speed, but the E5540 has higher total memory bandwidth. Since case 2 can only make use of 4 threads, and case 3 can make use of more, it's not at all crazy to think that in the 4-thread case the clock speed wins, but in the 8-thread case the increased memory contention (8 threads trying to pull in/push out values) tips the balance to the higher-bandwidth Xeon.

Making this potentially more complicated is the fact that it looks like you're running 8 threads -- are these dual-socket systems or are you using hyperthreading? This makes the situation much more complicated, since hyperthreading actually helps hide some of the memory contention by switching in threads when another thread is stuck waiting for memory.

If you want to see whether finite memory bandwidth is playing a role here, you can artificially add more computation to the problem (e.g., multiply exp(sin(a)) by cos(b)*cos(b) or something similar) to ensure the problem is compute-bound, eliminating one variable as you try to get to the bottom of things. Compiling the code on each system with optimizations for that particular machine (with -march or -xHost or what have you) eliminates another variable. If hyperthreading is on, turning it off (or just setting OMP_NUM_THREADS to the number of physical cores) gets rid of yet another. Once you understand what's going on in this simplified case, relaxing the restrictions above one by one should help you understand the full picture a little better.
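
As an illustration of the "add more computation" idea, a sketch of a padded kernel (the names n, a, b, and c are made up here; only the extra exp/sin/cos arithmetic matters):

#include <math.h>

/* Hypothetical kernel: the extra transcendental math raises the flop count
   per element without changing the memory traffic.  If the two machines'
   timings move closer together with this version, memory bandwidth was
   likely the limiting factor before. */
void padded_kernel(int n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        c[i] = a[i] * b[i] * exp(sin(a[i])) * cos(b[i]) * cos(b[i]);
}

With gcc this would be built with -fopenmp and linked against -lm.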


Things that can influence the efficiency of OpenMP are, e.g., the computation of the iteration bounds. Ensure that these are computed beforehand and that the iteration variables are as local as possible. If you have C99, do something like

#pragma omp parallel for
for (size_t i = start, is = stop; i < is; ++i) ...

to ensure that the expressions start and stop are evaluated at the beginning. (or use _Pragma, see my other answer.)

Then the only way to see whether this really was successful is to look at the assembler output (usually the -S compiler option).

If it isn't, look into the other compiler options. gcc has the option -march=native, which targets the architecture of the machine it is compiled on. Other platforms might have similar options. Start by searching for the best architecture options for case 1.


This is more of a comment than an answer.

It's not entirely clear what you have measured. For greater certainty I would want to:

  1. Insert timing statements at points in my code to report execution time (see the omp_get_wtime() sketch after this list). That way I know what I am measuring, with more certainty than the Linux time command gives me.

  2. Use 2 different compilers, to ensure that I am measuring something about OpenMP rather than aspects of one implementation of it.
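
A minimal sketch of point 1, using omp_get_wtime() (run_all_cases() here is just a placeholder for the loops being measured):

#include <omp.h>
#include <stdio.h>

void run_all_cases(void);          /* placeholder for the code under test */

int main(void)
{
    double t0 = omp_get_wtime();   /* wall-clock time in seconds */
    run_all_cases();
    double t1 = omp_get_wtime();
    printf("elapsed: %f s\n", t1 - t0);
    return 0;
}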

All that aside, I tend to agree that your initial results bear further investigation.

One of the things I suggest you try is collapsing your two loops and letting OpenMP schedule 4000 iterations rather than 4 or 1000.
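
A sketch of what that could look like with the collapse clause (OpenMP 3.0 or later; the loops must be perfectly nested, and vector_multiplication() again stands in for the real kernel):

/* collapse(2) lets OpenMP distribute all 4 * 1000 = 4000 iterations
   over the threads, rather than only the 4 outer or 1000 inner ones. */
#pragma omp parallel for collapse(2)
for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 1000; ++j)
        vector_multiplication();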
