parallelization of nested loop in threading building blocks
I am new to Threading Building Blocks and am trying to implement the FFT algorithm in TBB to get some hands-on experience. In this algorithm only the innermost loop can be parallelized, but doing so degraded performance to an unacceptable degree (more than a thousand times slower). I have tried array sizes up to 2^20, with the same result. My code is given below:
for (stage1 = 0; stage1 < lim; stage1++)
{
    butterfly_index1 = butterfly_index2;
    butterfly_index2 = butterfly_index2 + butterfly_index2;  // doubles each stage
    k = -1 * 2 * PI / butterfly_index2;                      // twiddle-angle step
    k_next = 0.0;
    for (stage2 = 0; stage2 < butterfly_index1; stage2++)
    {
        sine = sin(k_next);
        cosine = cos(k_next);
        k_next = k_next + k;
        FFT.sine = &sine;
        FFT.cosine = &cosine;
        FFT.n2 = &butterfly_index2;
        FFT.loop_init = &stage2;
        FFT.n1 = &butterfly_index1;
        parallel_for(blocked_range<int>(stage2, SIZE, SIZE/4),
                     FFT, simple_partitioner());
    }
}
and the body for the parallel_for is
void operator()(const blocked_range<int> &r) const
{
    for (int k = r.begin(); k != r.end(); k++)
    {
        // skip indices that do not belong to this butterfly group
        if (k != *loop_init)
        {
            if ((k - (*loop_init)) % (*n2) != 0)
                continue;
        }
        temp_real = (*cosine) * X[k + *n1].real - (*sine) * X[k + *n1].img;
        temp_img  = (*sine) * X[k + *n1].real + (*cosine) * X[k + *n1].img;
        X[k + *n1].real = X[k].real - temp_real;
        X[k + *n1].img  = X[k].img - temp_img;
        X[k].real = X[k].real + temp_real;
        X[k].img  = X[k].img + temp_img;
    }
}
If I replace the parallel_for with a normal serial loop, the results are correct.
For very short workloads, a huge slowdown can result from the overhead of thread creation. But with 2^20 elements in the array, I do not believe thread creation alone causes such a huge performance drop.
Another significant source of performance degradation is the compiler's inability to optimize (in particular, vectorize) the code after it has been converted to TBB. Find out whether your compiler can produce a vectorization report, and look for differences between the serial and TBB versions.
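For example (the exact flag names are from memory and vary by compiler version, so check your compiler's documentation), GCC can list loops that failed to vectorize, and the Intel compiler can emit a per-loop vectorization report:

    g++  -O2 -fopt-info-vec-missed fft.cpp -ltbb
    icpc -O2 -qopt-report=2 -qopt-report-phase=vec fft.cpp -ltbb

Comparing the reports for the serial and TBB builds will show which loops lost vectorization.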
A possible source of slowdown might be the copy constructor of the parallel_for body class, because bodies are copied a lot. But the given code does not look suspicious in this regard: the body seems to contain just a few pointers. Still, check whether copying could be a problem.
Another usual source of significant overhead is too fine-grained parallelism, i.e. a lot of tasks each containing very little work. But that is not the case here either, because the grainsize SIZE/4 in the third parameter to blocked_range means the body's operator() will be invoked at most four times per parallel_for call.
I would recommend not specifying simple_partitioner and a grainsize for initial experiments, and instead letting TBB distribute the work dynamically. You can tune it later if necessary.
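A minimal sketch of what that looks like (in recent TBB versions auto_partitioner is the default, so the partitioner argument can simply be dropped):

    // Let TBB choose the chunk sizes dynamically: with no grainsize and no
    // partitioner argument, parallel_for uses auto_partitioner by default.
    parallel_for(blocked_range<int>(stage2, SIZE), FFT);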
The problem with my code was the decrease in workload as the n2 variable grows: as the outer loop proceeds, the workload of each parallel_for call halves, and after just a few iterations it becomes too small for TBB to give any performance boost. So the solution is to parallelize the inner loop only for the iterations where its workload is sufficient, and to run it serially for the remaining iterations. Secondly, the for-loop header containing the condition check (k != r.end()) also degrades performance; the solution is to replace r.end() with a locally defined variable initialized to r.end().
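A sketch of both fixes, using names of my own invention (Butterfly stands in for the body class from the question, and MIN_PARALLEL_WORK is a made-up cutoff to be tuned by experiment, not anything from TBB):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    using namespace tbb;

    struct Butterfly   // stand-in for the FFT body class above
    {
        void operator()(const blocked_range<int> &r) const
        {
            int end = r.end();   // hoisted so the call is not re-evaluated each iteration
            for (int k = r.begin(); k != end; k++)
            {
                // ... butterfly computation as in the question ...
            }
        }
    };

    const int MIN_PARALLEL_WORK = 4096;   // hypothetical cutoff; tune by experiment

    void run_inner_loop(int stage2, int SIZE, int butterfly_index2, const Butterfly &FFT)
    {
        if (SIZE / butterfly_index2 >= MIN_PARALLEL_WORK)
            parallel_for(blocked_range<int>(stage2, SIZE), FFT);   // enough work: go parallel
        else
            FFT(blocked_range<int>(stage2, SIZE));   // too little work: run the range serially
    }

Calling the body object directly on the full range reuses the same code for the serial fallback, so the butterfly logic does not have to be duplicated.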
Thanks to the Intel software forums for their assistance in solving this problem.