parallelization of nested loop in threading building blocks
I am new to Threading Building Blocks and am trying to implement the FFT algorithm in TBB to get some hands-on experience. In this algorithm only the innermost loop can be parallelized, but doing so degraded performance to an unacceptable degree (more than a thousand times slower). I have tried array sizes up to 2^20, with the same result. My code is given below:
for (stage1 = 0; stage1 < lim; stage1++)
{
    butterfly_index1 = butterfly_index2;
    butterfly_index2 = butterfly_index2 + butterfly_index2;  // doubles each stage
    k = -1 * 2 * PI / butterfly_index2;                      // twiddle-angle step
    k_next = 0.0;
    for (stage2 = 0; stage2 < butterfly_index1; stage2++)
    {
        sine = sin(k_next);
        cosine = cos(k_next);
        k_next = k_next + k;
        FFT.sine = &sine;
        FFT.cosine = &cosine;
        FFT.n2 = &butterfly_index2;
        FFT.loop_init = &stage2;
        FFT.n1 = &butterfly_index1;
        parallel_for(blocked_range<int>(stage2, SIZE, SIZE/4),
                     FFT, simple_partitioner());
    }
}
and the body for the parallel_for is
void operator()(const blocked_range<int> &r) const
{
    for (int k = r.begin(); k != r.end(); k++)
    {
        // skip indices that do not belong to this butterfly group
        if (k != *loop_init)
        {
            if ((k - (*loop_init)) % (*n2) != 0)
                continue;
        }
        temp_real = (*cosine) * X[k + *n1].real - (*sine) * X[k + *n1].img;
        temp_img  = (*sine) * X[k + *n1].real + (*cosine) * X[k + *n1].img;
        X[k + *n1].real = X[k].real - temp_real;
        X[k + *n1].img  = X[k].img - temp_img;
        X[k].real = X[k].real + temp_real;
        X[k].img  = X[k].img + temp_img;
    }
}
If I replace the parallel_for with a normal serial loop, the results are correct.
For very short workloads, a huge slowdown can result from the overhead of thread creation. But with 2^20 elements in the array, I do not believe thread creation alone causes such a huge performance drop.
Another significant source of performance degradation is the compiler's inability to optimize (in particular, vectorize) the code after it has been converted to TBB. Find out whether your compiler can produce a vectorization report, and look for differences between the serial and TBB versions.
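For example (the exact flag names are from memory and vary by compiler version, so check your compiler's documentation), GCC can list loops that failed to vectorize, and the Intel compiler can emit a per-loop vectorization report:

    g++  -O2 -fopt-info-vec-missed fft.cpp -ltbb
    icpc -O2 -qopt-report=2 -qopt-report-phase=vec fft.cpp -ltbb

Comparing the reports for the serial and TBB builds will show which loops lost vectorization.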
A possible source of slowdown might be the copy constructor of the parallel_for body class, because bodies are copied a lot. But the given code does not look suspicious in this regard: the body seems to contain just a few pointers. Still, check whether copying could be a problem.
Another usual source of significant overhead is too fine-grained parallelism, i.e. a lot of tasks each containing very little work. But that is not the case here either, because the grainsize SIZE/4 in the third parameter to blocked_range means the body's operator() will be invoked at most four times per parallel_for call.
I would recommend not specifying simple_partitioner and a grainsize for initial experiments, and instead letting TBB distribute the work dynamically. You can tune it later if necessary.
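A minimal sketch of what that looks like (in recent TBB versions auto_partitioner is the default, so the partitioner argument can simply be dropped):

    // Let TBB choose the chunk sizes dynamically: with no grainsize and no
    // partitioner argument, parallel_for uses auto_partitioner by default.
    parallel_for(blocked_range<int>(stage2, SIZE), FFT);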
The problem with my code was the decrease in workload as the n2 variable grows: as the outer loop proceeds, the workload of each parallel_for call halves, and after just a few iterations it becomes too small for TBB to give any performance boost. So the solution is to parallelize the inner loop only for the iterations where its workload is sufficient, and to run it serially for the remaining iterations. Secondly, the for-loop header containing the condition check (k != r.end()) also degrades performance; the solution is to replace r.end() with a locally defined variable initialized to r.end().
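A sketch of both fixes, using names of my own invention (Butterfly stands in for the body class from the question, and MIN_PARALLEL_WORK is a made-up cutoff to be tuned by experiment, not anything from TBB):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    using namespace tbb;

    struct Butterfly   // stand-in for the FFT body class above
    {
        void operator()(const blocked_range<int> &r) const
        {
            int end = r.end();   // hoisted so the call is not re-evaluated each iteration
            for (int k = r.begin(); k != end; k++)
            {
                // ... butterfly computation as in the question ...
            }
        }
    };

    const int MIN_PARALLEL_WORK = 4096;   // hypothetical cutoff; tune by experiment

    void run_inner_loop(int stage2, int SIZE, int butterfly_index2, const Butterfly &FFT)
    {
        if (SIZE / butterfly_index2 >= MIN_PARALLEL_WORK)
            parallel_for(blocked_range<int>(stage2, SIZE), FFT);   // enough work: go parallel
        else
            FFT(blocked_range<int>(stage2, SIZE));   // too little work: run the range serially
    }

Calling the body object directly on the full range reuses the same code for the serial fallback, so the butterfly logic does not have to be duplicated.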
Thanks to the Intel software forums for their assistance in solving this problem.