
Strange CUDA behavior in vector multiplication program

I'm having some trouble with a very basic CUDA program. The program multiplies two vectors on the host and on the device and then compares the results. This works without a problem. The trouble starts when I try different numbers of threads and blocks for learning purposes. I have the following kernel:

__global__ void multiplyVectorsCUDA(float *a,float *b, float *c, int N){
    int idx = threadIdx.x;
    if (idx<N) 
        c[idx] = a[idx]*b[idx];
}

which I call like:

multiplyVectorsCUDA <<<nBlocks, nThreads>>> (vector_a_d,vector_b_d,vector_c_d,N);

For the moment I've fixed nBlocks to 1, so I only vary the vector size N and the number of threads nThreads. From what I understand, there will be one thread for each multiplication, so N and nThreads should be equal.
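On the "one thread per multiplication" point: with a single block, any element at an index of nThreads or higher is never written. Below is a minimal sketch of one common way to cover all N elements with any block size, using a global index and a rounded-up block count. The kernel name multiplyVectorsGlobalIdx and the host variable names are made up for illustration and are not from the original program. Error checking is left out to keep it short.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Element-wise product indexed across all blocks, not just within one block.
__global__ void multiplyVectorsGlobalIdx(const float *a, const float *b,
                                         float *c, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (idx < N)                                      // guard the partial last block
        c[idx] = a[idx] * b[idx];
}

int main() {
    const int N = 16;
    const int nThreads = 8;                             // deliberately smaller than N
    const int nBlocks = (N + nThreads - 1) / nThreads;  // round up so every element has a thread

    float a_h[N], b_h[N], c_h[N];
    for (int i = 0; i < N; ++i) {
        a_h[i] = float(i + 1);
        b_h[i] = 2.0f;
    }

    float *a_d, *b_d, *c_d;
    cudaMalloc(&a_d, N * sizeof(float));
    cudaMalloc(&b_d, N * sizeof(float));
    cudaMalloc(&c_d, N * sizeof(float));
    cudaMemcpy(a_d, a_h, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, N * sizeof(float), cudaMemcpyHostToDevice);

    multiplyVectorsGlobalIdx<<<nBlocks, nThreads>>>(a_d, b_d, c_d, N);

    // Blocking copy: it waits for the kernel to finish before reading c_d.
    cudaMemcpy(c_h, c_d, N * sizeof(float), cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (c_h[i] != a_h[i] * b_h[i])
            ++mismatches;
    printf("blocks=%d threads=%d mismatches=%d\n", nBlocks, nThreads, mismatches);

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    return 0;
}
```

With this indexing nThreads no longer has to equal N. It only has to stay within the device's per-block thread limit.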

The problem is the following:

  1. I first call the kernel with N=16 and nThreads<16 which doesn't work. (This is ok)
  2. Then I call it with N=16 and nThreads=16 which works fine. (Again works as expected)
  3. But when I call it with N=16 and nThreads<16 it still works!

I don't understand why the last step doesn't fail like the first one. It only fails again if I restart my PC.

Has anyone run into something like this before or can explain this behavior?


Wait, so are you calling all three in a row? I don't know the rest of your code, but are you sure you're clearing the device memory you allocated between runs? If not, the output buffer could still hold the correct results from the second call. That would explain why the same arguments fail the first time but pass the third time, and why it only fails again after rebooting (rebooting clears all the allocated memory).
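To make the stale-memory idea concrete, here is a rough sketch that reuses the kernel from the question and launches it three times against the same output buffer. The helper runAndCount, the clearFirst flag and the input values are hypothetical and only there for illustration. Whether this is what happened in the question can't be confirmed from the posted code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 16;

// Kernel from the question: only threadIdx.x is used, so with one block
// just the first nThreads elements of c are written.
__global__ void multiplyVectorsCUDA(float *a, float *b, float *c, int N) {
    int idx = threadIdx.x;
    if (idx < N)
        c[idx] = a[idx] * b[idx];
}

// Hypothetical helper: optionally zero the output buffer, launch with one
// block of nThreads threads, and count elements that differ from the host product.
static int runAndCount(float *a_d, float *b_d, float *c_d,
                       const float *a_h, const float *b_h,
                       int nThreads, bool clearFirst) {
    if (clearFirst)
        cudaMemset(c_d, 0, N * sizeof(float));  // wipe results left by earlier launches
    multiplyVectorsCUDA<<<1, nThreads>>>(a_d, b_d, c_d, N);

    float c_h[N];
    cudaMemcpy(c_h, c_d, N * sizeof(float), cudaMemcpyDeviceToHost);  // waits for the kernel

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (c_h[i] != a_h[i] * b_h[i])
            ++mismatches;
    return mismatches;
}

int main() {
    float a_h[N], b_h[N];
    for (int i = 0; i < N; ++i) {
        a_h[i] = float(i + 1);  // nonzero, so a zeroed output never matches by accident
        b_h[i] = 2.0f;
    }

    float *a_d, *b_d, *c_d;
    cudaMalloc(&a_d, N * sizeof(float));
    cudaMalloc(&b_d, N * sizeof(float));
    cudaMalloc(&c_d, N * sizeof(float));
    cudaMemcpy(a_d, a_h, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, N * sizeof(float), cudaMemcpyHostToDevice);

    printf("16 threads, no clear: %d mismatches\n",
           runAndCount(a_d, b_d, c_d, a_h, b_h, 16, false));
    printf(" 8 threads, no clear: %d mismatches\n",
           runAndCount(a_d, b_d, c_d, a_h, b_h, 8, false));
    printf(" 8 threads, cleared:  %d mismatches\n",
           runAndCount(a_d, b_d, c_d, a_h, b_h, 8, true));

    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
    return 0;
}
```

If the theory holds, the second launch should look correct only because elements 8 to 15 still contain what the first launch wrote, and the third launch should expose the missing elements once the buffer has been zeroed.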


I don't know if it's OK to answer my own question, but I realized I had a bug in the code that compares the host and device vectors (that part of the code wasn't posted). Sorry for the inconvenience. Could someone please close this post, since it won't let me delete it?
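Since the comparison code was never posted, the actual bug is unknown. Purely as an illustration, this is one way a host-side check could be written so that every element is visited and NaN values are counted as failures. The function name countMismatches and the relTol parameter are assumptions, not part of the original program. It is plain host code, so it compiles unchanged in a .cu file.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: returns how many of the n elements differ between the
// host result and the device result after it has been copied back to the host.
int countMismatches(const float *host, const float *device, std::size_t n,
                    float relTol = 1e-5f) {
    int mismatches = 0;
    for (std::size_t i = 0; i < n; ++i) {
        float diff  = std::fabs(host[i] - device[i]);
        float scale = std::fmax(std::fabs(host[i]), std::fabs(device[i]));
        // Written as !(diff <= ...) so that a NaN difference counts as a mismatch.
        if (!(diff <= relTol * std::fmax(scale, 1.0f)))
            ++mismatches;
    }
    return mismatches;
}
```

Returning a count instead of a bool makes it easy to see how many elements a launch left unwritten, which is handy when experimenting with nThreads smaller than N.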
