CUDA kernels consistently returning bad results
I am a CUDA beginner who has successfully compiled and run several code samples using CUDA libraries such as CUFFT and CUBLAS. Lately, however, I have been trying to write my own simple kernels and am repeatedly getting nonsense values back after calling them. That is, when I pass a parameter into a kernel, set its value in the kernel, and then copy the results back to the host and read the values, they are bogus. I have tried many different simple tutorial kernels that seem to work for most people online, but I always get nonsensical values. For example:
#include <stdio.h>

#define SIZE 10

// Kernel definition, see also section 4.2.3 of the NVIDIA CUDA Programming Guide
__global__ void vecAdd(float* A, float* B, float* C) {
    // threadIdx.x is a built-in variable provided by CUDA at runtime
    int i = threadIdx.x;
    A[i] = 0;
    B[i] = i;
    C[i] = A[i] + B[i];
}
int main() {
    int N = SIZE;
    float A[SIZE], B[SIZE], C[SIZE];
    float *devPtrA;
    float *devPtrB;
    float *devPtrC;
    int memsize = SIZE * sizeof(float);

    cudaMalloc((void**)&devPtrA, memsize);
    cudaMalloc((void**)&devPtrB, memsize);
    cudaMalloc((void**)&devPtrC, memsize);
    cudaMemcpy(devPtrA, A, memsize, cudaMemcpyHostToDevice);
    cudaMemcpy(devPtrB, B, memsize, cudaMemcpyHostToDevice);

    // __global__ functions are called: Func<<< Dg, Db, Ns >>>(parameter);
    vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);

    cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost);
    for (int i = 0; i < SIZE; i++)
        printf("C[%d]=%f\n", i, C[i]);

    cudaFree(devPtrA);
    cudaFree(devPtrB);
    cudaFree(devPtrC);
}
This is a fairly straightforward problem; the results should be:
C[0]=0.000000
C[1]=1.000000
C[2]=2.000000
C[3]=3.000000
C[4]=4.000000
C[5]=5.000000
C[6]=6.000000
C[7]=7.000000
C[8]=8.000000
C[9]=9.000000
However, my awesome results are always random and generally look more like:
C[0]=nan
C[1]=-32813464158208.000000
C[2]=nan
C[3]=-27667211200843743232.000000
C[4]=34559834084263395806523272811251761152.000000
C[5]=9214363188332593152.000000
C[6]=nan
C[7]=-10371202300694685655937271382147072.000000
C[8]=121653576586393934243511643668480.000000
C[9]=-30648783863808.000000
So basically, when I pass parameters into a CUDA kernel with the intention of storing results in them to be copied back to the host, I get junk back.
This one really has me stumped. Any help would be greatly appreciated.
Thanks.
You should always check for errors returned by API calls. C developers are completely accustomed to checking for NULL from malloc() for example, since not checking for NULL frequently results in a null pointer dereference later on (bad things ensue). C++ developers often rely on exceptions, but many APIs are C-style (including the CUDA calls you are using and many other libraries) so you should be aware of when to check for errors.
Ideally you would check for errors on every API call. Personally I wouldn't use the SAFE_CALL macros from CUTIL; instead, I would check the error, handle it properly, and either throw an exception (C++) or at least clean up properly. That way, when you grow the experiment into a bigger app, you have already thought about error handling. A minimal helper along those lines is sketched after the snippet below.
At the very least, you should check for an error at the end:
cudaError_t cudaResult;
cudaResult = cudaGetLastError();
if (cudaResult != cudaSuccess)
{
    // Do whatever you want here
    // I normally create a std::string msg with a description of where I am
    // and append cudaGetErrorString(cudaResult)
}
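For checking every call, one pattern that works well is a small wrapper that turns any failure into an exception. This is only a minimal sketch; the helper name checkCudaCall is mine, not part of the CUDA API:

#include <cuda_runtime.h>
#include <stdexcept>
#include <string>

// Throws if a CUDA runtime call did not return cudaSuccess.
inline void checkCudaCall(cudaError_t result, const char* context)
{
    if (result != cudaSuccess)
    {
        throw std::runtime_error(std::string(context) + ": " + cudaGetErrorString(result));
    }
}

// Usage in your main, for example:
//   checkCudaCall(cudaMalloc((void**)&devPtrA, memsize), "cudaMalloc devPtrA");
//   vecAdd<<<1, N>>>(devPtrA, devPtrB, devPtrC);
//   checkCudaCall(cudaGetLastError(), "vecAdd launch");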
I ran your code and did not get any errors. I would check whether the SDK samples still run. Also, GPUOcelot provides emulation support if you ever need it (it seems a bit overkill in this case).
My output:

cuda2:~/tests$ ./test
C[0]=0.000000
C[1]=1.000000
C[2]=2.000000
C[3]=3.000000
C[4]=4.000000
C[5]=5.000000
C[6]=6.000000
C[7]=7.000000
C[8]=8.000000
C[9]=9.000000
I get the same behavior on a Linux host with your code when I don't have the kernel module loaded. Are you certain you have the driver loaded? You can check that you have a CUDA-capable device by running the deviceQuery sample executable that comes with the SDK.
As an update: if you do have the module loaded (verified with lsmod), you may need to ensure the /dev/nvidia* device nodes exist. There is a sample script in the Getting Started guide to get you going (page 6, http://developer.download.nvidia.com/compute/cuda/3_2_prod/docs/Getting_Started_Linux.pdf).
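A quick way to confirm from code that the runtime can actually see a device is to query the device count. This is just a minimal sketch of my own, not one of the SDK samples:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int deviceCount = 0;
    cudaError_t result = cudaGetDeviceCount(&deviceCount);
    if (result != cudaSuccess)
    {
        // A failure here often means the driver/module or /dev/nvidia* nodes are missing
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(result));
        return 1;
    }
    printf("Found %d CUDA-capable device(s)\n", deviceCount);
    return 0;
}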
As a further update: if you were handling the errors as Tom suggested, you would catch this error. If you want a quick-and-dirty approach that tells you where you ran into the error, look at the code for the CUDA by Example book (http://developer.nvidia.com/object/cuda-by-example.html). It provides a HANDLE_ERROR macro that terminates your program when it hits an error and prints a message on stdout. This isn't the best approach for production code, but it is quick and dirty.
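The idea behind that macro is roughly the following; this is my own sketch of the pattern, not the book's exact code:

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Quick-and-dirty check: report where the error happened, then bail out.
#define HANDLE_ERROR(call)                                           \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            printf("%s in %s at line %d\n",                          \
                   cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Usage:
//   HANDLE_ERROR(cudaMalloc((void**)&devPtrA, memsize));
//   HANDLE_ERROR(cudaMemcpy(C, devPtrC, memsize, cudaMemcpyDeviceToHost));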
I finally figured this out. I am running on a 64-bit Mac Pro and had been passing -Xcompiler "arch x86_64" as an argument to nvcc. A moderator on the NVidia forums pointed out that, on a Mac, I should pass "-m64" to nvcc instead. I must have missed this in the documentation. Passing -m64 fixed my output and kernels seem to be launching successfully now. Thanks to all for your answers.
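For anyone hitting the same issue, the fix amounted to compiling with something like the following (the file and output names here are just placeholders):

nvcc -m64 -o vecadd vecadd.cu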