
CUDA kernel function taking longer than equivalent host function

I'm following along with http://code.google.com/p/stanford-cs193g-sp2010/ and the video lectures posted online, doing one of the posted problem sets (the first one), and I've encountered something slightly counterintuitive, at least with respect to the way the question is asked. The question asks me to derive a timing model for execution time on the CPU and GPU, assuming linear scaling, based on timings from a sample application run on my own machine.

-Plug the timing numbers printed by the code on the computer you're working on into that equation and report what the break even point (when the cpu version is as fast as the gpu version) will be.

The issue I'm having is that my kernel is taking a lot longer than the host version of the equivalent function (I'll post both below), such that there is no break even point. The numbers I'm getting are as follows.

done with copy to gpu kernel
copy to gpu took 26.30630 ms
done with gpu shift cypher kernel
gpu shift cypher took 7.33203 ms
done with copy from gpu kernel
copy from gpu took 28.54141 ms
host shift cypher took 0.00186 ms
Worked! CUDA and reference output match. 

Do you think there is something wrong with the way I'm doing things? Here are the kernel and host functions.

// This kernel implements a per element shift
__global__ void shift_cypher(unsigned int *input_array, unsigned int *output_array, 
    unsigned int shift_amount, unsigned int alphabet_max, unsigned int array_length)
{
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    output_array[gid] = (input_array[gid] + shift_amount)%(alphabet_max+1);
}

void host_shift_cypher(unsigned int *input_array, unsigned int *output_array,
    unsigned int shift_amount, unsigned int alphabet_max, unsigned int array_length)
{
  for(unsigned int i=0;i<array_length;i++)
  {
    int element = input_array[i];
    int shifted = element + shift_amount;
    if(shifted > alphabet_max)
    {
      shifted = shifted % (alphabet_max + 1);
    }
    output_array[i] = shifted;
  }
}

The sample application runs with 16MB of integer elements, with a block size of 512. Here is the full source for the file in question http://pastebin.com/htYH0bA2
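For completeness, the kernel is launched with one thread per element. Here is a rough sketch of the launch configuration implied by that description (the helper name and the way the constants are written here are illustrative; the actual code is in the pastebin):

// Sketch of the launch configuration described above: one thread per element,
// 512 threads per block. Assumes array_length is an exact multiple of the
// block size (true for the sample sizes), so the kernel needs no bounds check.
// Meant to sit in the same .cu file as the shift_cypher kernel above.
void launch_shift_cypher(unsigned int *device_input, unsigned int *device_output,
    unsigned int shift_amount, unsigned int alphabet_max, unsigned int array_length)
{
    const unsigned int block_size = 512;
    const unsigned int grid_size  = array_length / block_size;

    shift_cypher<<<grid_size, block_size>>>(device_input, device_output,
        shift_amount, alphabet_max, array_length);
    cudaThreadSynchronize(); // wait for the kernel to finish before timing/checking results
}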


host shift cypher took 0.00186 ms

This looks very odd. Whatever you do with 16MB on a CPU, it should take more than a fraction of a millisecond.

Looking at the pastebin code, it seems you time everything with CUDA events. Although I haven't used them myself, my guess is that they measure the actual execution time of GPU kernels, which, in the case of just calling host code, will be next to nothing. Is this really how the Stanford course measures host code execution?

You could prove me wrong by just checking this result with any kind of C timer.
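For example, something like the following, compiled together with your host_shift_cypher, would give a sanity check (the array size and shift parameters below are just example values, not necessarily the ones your program uses):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// host_shift_cypher is the host function from the question above.
void host_shift_cypher(unsigned int *input_array, unsigned int *output_array,
    unsigned int shift_amount, unsigned int alphabet_max, unsigned int array_length);

int main(void)
{
    const unsigned int array_length = 16 * 1024 * 1024; // example size
    unsigned int *input  = (unsigned int *)malloc(array_length * sizeof(unsigned int));
    unsigned int *output = (unsigned int *)malloc(array_length * sizeof(unsigned int));

    // fill the input with something so the loop does real work
    for(unsigned int i = 0; i < array_length; i++)
        input[i] = i % 256;

    clock_t start = clock();
    host_shift_cypher(input, output, 13, 255, array_length); // example shift/alphabet
    clock_t end = clock();

    printf("host shift cypher took %.5f ms\n",
        1000.0 * (double)(end - start) / CLOCKS_PER_SEC);

    free(input);
    free(output);
    return 0;
}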


This was a problem with the timer, as w.m pointed out. The issue was, I believe, that the event-recording functions in the timer handed control back to the CPU-based host function before actually recording the event. It's confusing, because you would think the start event would be recorded before the host code executes, but it seems it was effectively recording the start and stop events simultaneously, both after the host code had finished executing. Adding a cudaThreadSynchronize(); to the start timer seems to fix the problem, ensuring the event gets recorded before continuing with the host code (a sketch of the adjusted timer follows the results below). This could be a Windows-only discrepancy, or specific to my CUDA version or hardware; I'm not sure. In any case, my new, much more normal results are as follows.

done with copy to gpu kernel
copy to gpu took 25.75683 ms
done with gpu shift cypher kernel
gpu shift cypher took 7.45667 ms
done with copy from gpu kernel
copy from gpu took 29.10922 ms
host shift cypher took 153.98772 ms
1 second sleep took 999.85291 ms
Worked! CUDA and reference output match. 
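
For reference, here is roughly what the event-based timer looks like with the synchronize added (the struct and helper names are a sketch of the usual CUDA-event pattern, not necessarily the exact code from the pastebin):

#include <stdio.h>
#include <cuda_runtime.h>

struct event_pair
{
    cudaEvent_t start;
    cudaEvent_t end;
};

void start_timer(event_pair *p)
{
    cudaEventCreate(&p->start);
    cudaEventCreate(&p->end);
    cudaEventRecord(p->start, 0);
    // The fix: make sure the start event is actually recorded before any
    // host-side work begins; without this, both events effectively get
    // recorded after the host code has already finished.
    cudaThreadSynchronize();
}

float stop_timer(event_pair *p, const char *name)
{
    cudaEventRecord(p->end, 0);
    cudaEventSynchronize(p->end);

    float elapsed_ms = 0.0f;
    cudaEventElapsedTime(&elapsed_ms, p->start, p->end);
    printf("%s took %.5f ms\n", name, elapsed_ms);

    cudaEventDestroy(p->start);
    cudaEventDestroy(p->end);
    return elapsed_ms;
}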
