Replacing a C for loop with CUDA

What is the best way to do this in CUDA?

...
for (int i = 0; i < size; ++i)
  for (int j = i + 1; j < size; ++j)
    temp_norm += exp((train[i] - train[j]) / tau);

Would this be equivalent?

...
int i = threadIdx.x + blockIdx.x * blockDim.x;
int j = threadIdx.y + blockIdx.y * blockDim.y;

if (i>=size || j>=size) return;

if (j > i)
  temp_norm += exp((train[i] - train[j]) / tau);

Any help would be much appreciated!


How best to implement this really depends on how big size is. But assuming it is quite large, e.g. 1000 or more...

To do it the way you suggest, you would need to use atomicAdd(), which can be expensive if too many threads atomically add to the same address. A better way is probably to use a parallel reduction.
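For reference, the atomicAdd() version of your 2-D grid approach might look like the following untested sketch (the kernel name and the single result accumulator are my own; note that atomicAdd() on float requires compute capability 2.0 or later):

```cuda
// Hypothetical kernel: each (i, j) pair with j > i contributes one term.
// Every contributing thread atomically adds into one accumulator, which
// serializes under contention -- hence the recommendation to reduce instead.
__global__ void pairSumAtomic(const float *train, int size, float tau,
                              float *result)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;

    if (i >= size || j >= size) return;

    if (j > i)
        atomicAdd(result, expf((train[i] - train[j]) / tau));
}
```

With size around 1000 that is roughly 500,000 atomic adds to the same address, which is exactly the contention pattern you want to avoid.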

Check out the "reduction" sample in the NVIDIA CUDA SDK.

YMMV with the following since it is untested, and I don't know your data size, but something like this should work. Use the "reduction6" kernel from that example, but add your computation to the first while loop. Replace the initialization of i and gridSize with

unsigned int i = blockIdx.x*blockSize + threadIdx.x;
unsigned int gridSize = blockSize * gridDim.x;

Replace the while (i < n) loop with

while (i < size)
{
  for (unsigned int j = i+1; j<size; ++j)
      mySum += exp((train[j]-train[i])/tau);   
  i += gridSize;
}
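Putting those pieces together, the modified kernel might look roughly like the following untested sketch (blockSize is the template parameter from the SDK sample; I have used a simple shared-memory tree reduction rather than the sample's fully unrolled one):

```cuda
// Sketch only: each block writes one partial sum to blockSums; sum those
// on the host or with a second reduction pass, as in the SDK sample.
template <unsigned int blockSize>
__global__ void pairSumReduce(const float *train, unsigned int size,
                              float tau, float *blockSums)
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockSize + threadIdx.x;
    unsigned int gridSize = blockSize * gridDim.x;

    // Each thread accumulates the inner j-loop for a strided set of i values.
    float mySum = 0.0f;
    while (i < size)
    {
        for (unsigned int j = i + 1; j < size; ++j)
            mySum += expf((train[j] - train[i]) / tau);
        i += gridSize;
    }

    // Standard shared-memory tree reduction within the block.
    sdata[tid] = mySum;
    __syncthreads();

    for (unsigned int s = blockSize / 2; s > 0; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = sdata[0];
}
```

One thing to keep in mind: the per-i work is unbalanced (large i values have short inner loops), which the grid-stride outer loop only partially smooths out.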

(Note, floating point arithmetic is non-associative, so the different order of operations in a parallel implementation may give you a slightly different answer than the sequential implementation. It may even give you a slightly more accurate answer due to the balanced tree reduction, depending on your input data.)
