Parallel Sum for Vectors

2023-03-10 17:35 问答作者：

Could someone please provide some suggestions on how I can decrease the following for loop's runtime through multithreading? Suppose I also have two vectors called 'a' and 'b'.

for (int j = 0; j < 8000; j++){
    // Perform an operation and store in the vector 'a'
    // Add 'a' to 'b' coefficient wise
}

开发者_如何学Python

This for loop is executed many times in my program. The two operations in the for loop above are already optimized, but they only run on one core. However, I have 16 cores available and would like to make use of them.

I've tried modifying the loop as follows. Instead of having the vector 'a', I have 16 vectors, and suppose that the i-th one is called a[i]. My for loop now looks like

for (int j = 0; j < 500; j++){
    for (int i = 0; i < 16; i++){
        // Perform an operation and store in the vector 'a[i]'
    }
    for (int i = 0; i < 16; i++){
        // Add 'a[i]' to 'b' coefficient wise
    }

}

I use the OpenMp on each of the for loops inside by adding '#pragma omp parallel for' before each of the inner loops. All of my processors are in use but my runtime only increases significantly. Does anyone have any suggestions on how I can decrease the runtime of this loop? Thank You in Advance.

omp creates threads for your program whereever you insert pragma tag, so it's createing threads for inner tags but the problem is 16 threads are created, each one does 1 operation and then all of them are destroyed using your method. creating and destroying threads take a lot of time so the method you used increases the overal time of your process although it uses all 16 cores. you didn't have to create inner fors just put #pragma omp parallel for tag before your 8000 loop it's up to omp to seperate values between treads so what you did to create the second loop, is omp's job. that way omp create threads only once and then process 500 numbers useing that each thread and end all of them after that (using 499 less thread creation and destruction)

Actually, I am going to put these comments in an answer.

Forking threads for trivial operations just adds overhead.

First, make sure your compiler is using vector instructions to implement your loop. (If it does not know how to do this, you might have to code with vector instructions yourself; try searching for "SSE instrinsics". But for this sort of simple addition of vectors, automatic vectorization ought to be possible.)

Assuming your compiler is a reasonably modern GCC, invoke it with:

gcc -O3 -march=native ...

Add -ftree-vectorizer-verbose=2 to find out whether or not it auto-vectorized your loop and why.

If you are already using vector instructions, then it is possible you are saturating your memory bandwidth. Modern CPU cores are pretty fast... If so, you need to restructure at a higher level to get more operations inside each iteration of the loop, finding ways to perform lots of operations on blocks that fit inside the L1 cache.

Does anyone have any suggestions on how I can decrease the runtime of this loop?

for (int j = 0; j < 500; j++){  // outer loop
  for (int i = 0; i < 16; i++){  // inner loop

Always try to make outer loop iterations lesser than inner loop. This will save you from inner loop initializations that many times. In above code inner loop i = 0; is initialized 500 times. Now,

for (int i = 0; j < 16; i++){  // outer loop
  for (int j = 0; j < 500; j++){  // inner loop

Now, inner loop j = 0; is initialized only 16 times ! Give a try by modifying your code accordingly, if it makes any impact.

继续阅读：multithreading openmp parallel-processing

Parallel Sum for Vectors

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？