CUDA kernels consistently returning bad results
I am a CUDA beginner who has successfully compiled and run several code samples using CUDA libraries such as CUFFT and CUBLAS. Lately, however, I have been trying to write my own simple kernels and am repeatedly getting nonsense values back after calling them. That is, when I pass a parameter into a kernel, set its value in the kernel, and then copy the results back to the host and read the values, they are bogus. I have tried many different simple tutorial kernels that seem to work for most people online, but I always get nonsensical values. For example:
#include <stdio.h>

#define SIZE 10

// Kernel definition, see also section 4.2.3 of the NVIDIA CUDA Programming Guide
__global__ void vecAdd(float* A, float* B, float* C) {
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}
int main() {
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    cudaMalloc((void**)&devPtrA, memsize);
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);

    // __global__ functions are called: Func<<< Dg, Db, Ns >>>(parameter);
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);

    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);
    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);
}
This is a fairly straightforward problem; the results should be:
C[0]=0.000000
C[1]=1.000000
C[2]=2.000000
C[3]=3.000000
C[4]=4.000000
C[5]=5.000000
C[6]=6.000000
C[7]=7.000000
C[8]=8.000000
C[9]=9.000000
However, my awesome results are always random and generally look more like:
C[0]=nan
C[1]=-32813464158208.000000
C[2]=nan
C[3]=-27667211200843743232.000000
C[4]=34559834084263395806523272811251761152.000000
C[5]=9214363188332593152.000000
C[6]=nan
C[7]=-10371202300694685655937271382147072.000000
C[8]=121653576586393934243511643668480.000000
C[9]=-30648783863808.000000
So basically, when I pass parameters into a CUDA kernel with the intention of storing results in them to be copied back to the host, I get junk back.
This one really has me stumped. Any help would be greatly appreciated.
Thanks.
You should always check for errors returned by API calls. C developers are completely accustomed to checking for NULL from malloc() for example, since not checking for NULL frequently results in a null pointer dereference later on (bad things ensue). C++ developers often rely on exceptions, but many APIs are C-style (including the CUDA calls you are using and many other libraries) so you should be aware of when to check for errors.
Ideally you would check for errors on every API call. Personally I wouldn't use the SAFE_CALL macros from CUTIL; instead, I would check the error, handle it properly, and either throw an exception (C++) or at least clean up properly. That way, when you grow the experiment into a bigger app, you have already thought about error handling. A minimal helper along those lines is sketched after the snippet below.
At the very least, you should check for an error at the end:
cudaError_t cudaResult;
cudaResult = cudaGetLastError();
if (cudaResult != cudaSuccess)
{
    // Do whatever you want here
    // I normally create a std::string msg with a description of where I am
    // and append cudaGetErrorString(cudaResult)
}
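For checking every call, one pattern that works well is a small wrapper that turns any failure into an exception. This is only a minimal sketch; the helper name checkCudaCall is mine, not part of the CUDA API:

#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Throws if a CUDA runtime call did not return cudaSuccess.
inline void checkCudaCall(cudaError_t result, const char* context)
{
    if (result != cudaSuccess)
    {
        throw std::runtime_error(std::string(context) + ": " + cudaGetErrorString(result));
    }
}

// Usage in your main, for example:
//   checkCudaCall(cudaMalloc((void**)&devPtrA, memsize), "cudaMalloc devPtrA");
//   vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
//   checkCudaCall(cudaGetLastError(), "vecAdd launch");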
I ran your code and did not get any errors. I would check whether the SDK samples still run. Also, GPUOcelot provides emulation support if you ever need it (it seems a bit overkill in this case).
My output:

cuda2:~/tests$ ./test
C[0]=0.000000
C[1]=1.000000
C[2]=2.000000
C[3]=3.000000
C[4]=4.000000
C[5]=5.000000
C[6]=6.000000
C[7]=7.000000
C[8]=8.000000
C[9]=9.000000
I get the same behavior on a Linux host with your code when I don't have the kernel module loaded. Are you certain you have the driver loaded? You can check that you have a CUDA-capable device by running the deviceQuery sample executable that comes with the SDK.
As an update: if you do have the module loaded (verified with lsmod), you may need to ensure the /dev/nvidia* device nodes exist. There is a sample script in the Getting Started guide to get you going (page 6, http://developer.download.nvidia.com/compute/cuda/3_2_prod/docs/Getting_Started_Linux.pdf).
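A quick way to confirm from code that the runtime can actually see a device is to query the device count. This is just a minimal sketch of my own, not one of the SDK samples:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int deviceCount = 0;
    cudaError_t result = cudaGetDeviceCount(&deviceCount);
    if (result != cudaSuccess)
    {
        // A failure here often means the driver/module or /dev/nvidia* nodes are missing
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(result));
        return 1;
    }
    printf("Found %d CUDA-capable device(s)\n", deviceCount);
    return 0;
}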
As a further update: if you were handling the errors as Tom suggested, you would catch this error. If you want a quick-and-dirty approach that tells you where you ran into the error, look at the code for the CUDA by Example book (http://developer.nvidia.com/object/cuda-by-example.html). It provides a HANDLE_ERROR macro that terminates your program when it hits an error and prints a message on stdout. This isn't the best approach for production code, but it is quick and dirty.
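The idea behind that macro is roughly the following; this is my own sketch of the pattern, not the book's exact code:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Quick-and-dirty check: report where the error happened, then bail out.
#define HANDLE_ERROR(call)                                           \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            printf("%s in %s at line %d\n",                          \
                   cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage:
//   HANDLE_ERROR(cudaMalloc((void**)&devPtrA, memsize));
//   HANDLE_ERROR(cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost));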
I finally figured this out. I am running on a 64-bit Mac Pro and had been passing -Xcompiler "arch x86_64" as an argument to nvcc. A moderator on the NVidia forums pointed out that, on a Mac, I should pass "-m64" to nvcc instead. I must have missed this in the documentation. Passing -m64 fixed my output and kernels seem to be launching successfully now. Thanks to all for your answers.
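For anyone hitting the same issue, the fix amounted to compiling with something like the following (the file and output names here are just placeholders):

nvcc -m64 -o vecadd vecadd.cu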