
Is cudaMalloc slower than cudaMemcpy?

I am working on code that needs to be time efficient, so I am using cuFFT. But when I try to compute the FFT of very large data on the GPU, it is slower than CPU FFTW. After timing every line of code with a high-precision timer, I found the reason: cudaMalloc takes around 0.983 sec, while each of the remaining lines takes around 0.00xx sec, which is expected.

i have gone through some of the related posts but according to them

the main delay with GPUs is due to memory transfer, not memory allocation

And also in one of the posts it was written that

The very first call to any of the cuda library functions launches an initialisation subroutine

What is the actual reason for this delay? Or is it not normal to have such a delay in the execution of the code?

Thanks in Advance


Is it possible that the large delay you are seeing (nearly 1s) is due to driver initialisation? It seems rather long for a cudaMalloc. Also check your driver is up-to-date.

The delay for the first kernel launch can be due to a number of factors:

  1. Driver initialisation
  2. PTX compilation
  3. Context creation

The first of these is only applicable if you are running on a Linux system without X. In that case the driver is only loaded when required and unloaded afterwards. Running nvidia-smi -pm 1 as root puts the driver in persistence mode to avoid such delays; check out man nvidia-smi for details, and remember to add this to an init script since it won't persist across a reboot.
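A minimal sketch of enabling persistence mode as described above (the grep pattern is just one way to verify; output wording can vary between driver versions):

```shell
# Keep the driver loaded on a headless Linux box so the first CUDA
# call doesn't pay the driver load cost. Must run as root.
sudo nvidia-smi -pm 1

# Verify: look for "Persistence Mode : Enabled" in the query output.
nvidia-smi -q | grep -i persistence
```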

The second delay is in compiling the PTX for the specific device architecture in your system. This is easily avoided by embedding the binary for your device architecture (or architectures if you want to support multiple archs without compiling PTX) into the executable. See the CUDA C Programming Guide (available on NVIDIA website) for more information, section 3.1.1.2 talks about JIT compilation.
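As a sketch, embedding native code for your device architecture at build time looks roughly like this (the file names and the sm_20/sm_30 targets are placeholders; substitute the compute capability of your own GPU):

```shell
# Generate native SASS for each target architecture so no JIT
# compilation of PTX happens at first launch.
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -o myfft myfft.cu -lcufft
```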

The third point, context creation, is unavoidable but NVIDIA have gone to great effort to reduce the cost. Context creation involves copying the executable code to the device, copying any data objects, setting up the memory system etc.
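Since context creation is lazy, one common pattern is to trigger it deliberately before any timed region. A minimal sketch (cudaFree(0) is a conventional warm-up idiom; the 64 MiB size is arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Force driver initialisation and context creation up front,
    // so the timing below measures only cudaMalloc itself.
    cudaFree(0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float *d_buf;
    cudaEventRecord(start);
    cudaMalloc(&d_buf, 1 << 26);   // 64 MiB device allocation
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMalloc after warm-up: %.3f ms\n", ms);

    cudaFree(d_buf);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

With the warm-up call in place, the near-one-second cost should move to the cudaFree(0) line, confirming it was initialisation rather than allocation.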


It is understandable. nvcc embeds PTX code into the application binary, which has to be compiled to a native GPU binary by a JIT compiler at run time. This accounts for the start-up delay. AFAIK, cudaMalloc is not slower than cudaMemcpy.

It is also true that nvcc inserts calls to cudaRegisterFatBinary and cudaRegisterFunction into your code to register your kernel and its entry point with the runtime. I guess this is the initialisation you are talking about.

