CUDA fast math operations
This is my code:
__device__ void calculateDT(float *devD, int *devImg, int cntVoxelLi, int *neighVoxels)
{
    float minV = devD[cntVoxelLi];
    int cv = devImg[cntVoxelLi];
    float v = 0, cuVal = 0;
    int c1 = 0, d1 = 0, r1 = 0;
    GetInd2Sub(cntVoxelLi, r1, c1, d1);
    for (int ind = 0; ind < 9; ind++)
    {
        v = pow(float(cv - devImg[neighVoxels[ind]]), 2);
        cuVal = devD[neighVoxels[ind]] + (1 - exp(-v / 100));
        minV = min(minV, cuVal);
    }
    devD[cntVoxelLi] = minV;
}
When I run the entire program it takes about 15 seconds. But when I remove the
exp(-v/100)
it takes only 7 seconds, so it seems this exp operation takes much of the time. I tried the expf function as well. How can I improve the performance?
The performance difference you are seeing is mostly the result of compiler optimization. When you remove the exp expression, the variable v becomes unused, and the compiler will remove the calculation of v because it is effectively dead code. So the large drop in execution time is due to the elimination of all of the floating point computation from the kernel loop, not from the removal of the exp function alone.
As for performance optimizations, the obvious one is to eliminate the use of pow for computing a simple square (although the compiler may already be doing this itself), and to tidy up the floating point expressions to eliminate a number of implicit integer-to-float and double-to-float conversions (hint: 0 is an integer, 0. is double precision, and 0.f is single precision).
It is difficult to comment on the kernel's memory transaction performance from the code you have posted. The CUDA 4 visual profiler has some useful diagnostics which show whether a piece of code is memory or arithmetic limited; you might find it useful to profile the code and see what it reports.
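Since the question title mentions fast math: if the reduced precision is acceptable, nvcc can map expf and the other single-precision math functions onto the faster hardware intrinsics for the whole file. A minimal compile-line sketch (the source file name is assumed):

```shell
# --use_fast_math makes nvcc substitute the fast intrinsics
# (e.g. __expf for expf) and enables other fast-math simplifications.
nvcc -O3 --use_fast_math -o dt kernel.cu
```

Alternatively, you can call __expf(-v / 100.0f) explicitly in the kernel to limit the precision trade-off to that one call site.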