CUDA kernel function taking longer than equivalent host function
I'm following along with http://code.google.com/p/stanford-cs193g-sp2010/ and the video lectures posted online, working through the first of the posted problem sets, and I've encountered something slightly counterintuitive, at least with respect to the way the question is asked. The question asks me to derive a timing model for execution time on the CPU and GPU, assuming linear scaling, based on timings from a sample application run on my own machine.
-Plug the timing numbers printed by the code on the computer you're working on into that equation and report what the break-even point (when the CPU version is as fast as the GPU version) will be.
The issue I'm having is that my kernel takes a lot longer than the host version of the equivalent function (both are posted below), so there is no break-even point. The numbers I'm getting are as follows.
done with copy to gpu kernel
copy to gpu took 26.30630 ms
done with gpu shift cypher kernel
gpu shift cypher took 7.33203 ms
done with copy from gpu kernel
copy from gpu took 28.54141 ms
host shift cypher took 0.00186 ms
Worked! CUDA and reference output match.
Do you think there is something wrong with the way I'm doing things? Here is the kernel and host functions.
// This kernel implements a per-element shift
__global__ void shift_cypher(unsigned int *input_array, unsigned int *output_array,
                             unsigned int shift_amount, unsigned int alphabet_max,
                             unsigned int array_length)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    output_array[gid] = (input_array[gid] + shift_amount) % (alphabet_max + 1);
}
void host_shift_cypher(unsigned int *input_array, unsigned int *output_array,
                       unsigned int shift_amount, unsigned int alphabet_max,
                       unsigned int array_length)
{
    for(unsigned int i = 0; i < array_length; i++)
    {
        int element = input_array[i];
        int shifted = element + shift_amount;
        if(shifted > alphabet_max)
        {
            shifted = shifted % (alphabet_max + 1);
        }
        output_array[i] = shifted;
    }
}
The sample application runs with 16MB of integer elements, with a block size of 512. Here is the full source for the file in question http://pastebin.com/htYH0bA2
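For reference, the setup and launch look roughly like this. It's a sketch, not the pastebin source verbatim; the element count, shift amount, and alphabet_max values are assumptions based on the description above.

// Rough sketch of the setup and launch (assumed names and values, not the pastebin code)
unsigned int num_elements = 16 * 1024 * 1024;   // assumed from "16MB of integer elements"
size_t num_bytes = num_elements * sizeof(unsigned int);

unsigned int *device_input = 0;
unsigned int *device_output = 0;
cudaMalloc((void **)&device_input, num_bytes);
cudaMalloc((void **)&device_output, num_bytes);

unsigned int block_size = 512;
unsigned int num_blocks = num_elements / block_size;   // evenly divisible for these sizes

shift_cypher<<<num_blocks, block_size>>>(device_input, device_output,
                                         3 /* shift, assumed */, 255 /* alphabet_max, assumed */,
                                         num_elements);
cudaThreadSynchronize();   // wait for the kernel to finish before copying back/timing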
host shift cypher took 0.00186 ms
This looks very odd. Whatever you do with 16MB on a CPU, it should take more than a fraction of a millisecond.
Looking at the pastebin code, it seems you time everything with CUDA events. Although I have not used them myself, my guess is that they measure the actual time GPU kernels spend executing, which, in the case of just calling host code, will be next to nothing. Is this really how they measure host code execution in the Stanford course?
You could prove me wrong by just checking this result with any kind of C timer.
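For example, something along these lines gives a rough sanity check. clock() has coarse resolution, but it is more than enough to show whether the host loop really finishes in a fraction of a millisecond; the array size and argument values below are assumptions, not taken from the pastebin code.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

unsigned int num_elements = 16 * 1024 * 1024;   // assumed element count
unsigned int *host_input  = (unsigned int *)malloc(num_elements * sizeof(unsigned int));
unsigned int *host_output = (unsigned int *)malloc(num_elements * sizeof(unsigned int));

clock_t begin = clock();
host_shift_cypher(host_input, host_output, 3 /* shift */, 255 /* alphabet_max */, num_elements);
clock_t finish = clock();

printf("host shift cypher took %f ms\n",
       1000.0 * (double)(finish - begin) / CLOCKS_PER_SEC);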
This was a problem with the timer, as w.m pointed out. The issue, I believe, was that the event-recording calls in the timer handed control back to the CPU-based host function before the event was actually recorded. It's confusing, because you would think the start event would be recorded before the host code executes, but it seems to behave more like the start and stop events being recorded simultaneously, both after the host code finishes executing. Adding a cudaThreadSynchronize(); after recording the start event seems to fix the problem, ensuring the event is recorded before continuing with the host code. This could be a Windows-only discrepancy, or down to my CUDA version or hardware; I'm not sure. A sketch of the fixed timing is below, followed by my new, much more normal results.
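This is roughly the pattern, not the exact timer code from the pastebin; the arguments to the host call are placeholders for whatever the real code uses.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaThreadSynchronize();        // make sure the start event is actually recorded first

host_shift_cypher(host_input, host_output, shift_amount, alphabet_max, num_elements);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);     // wait for the stop event before reading the elapsed time

float elapsed_ms = 0.0f;
cudaEventElapsedTime(&elapsed_ms, start, stop);
printf("host shift cypher took %.5f ms\n", elapsed_ms);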
done with copy to gpu kernel
copy to gpu took 25.75683 ms
done with gpu shift cypher kernel
gpu shift cypher took 7.45667 ms
done with copy from gpu kernel
copy from gpu took 29.10922 ms
host shift cypher took 153.98772 ms
1 second sleep took 999.85291 ms
Worked! CUDA and reference output match.