
Is cudaMalloc slower than cudaMemcpy?

I am working on code that needs to be time efficient, so I am using cuFFT. But when I try to compute the FFT of very large data on the GPU, it is slower than CPU FFTW. After timing every line of code with a high-precision timer, I found the reason: cudaMalloc takes around 0.983 sec, while each of the remaining lines takes around 0.00xx sec, which is expected.

i have gone through some of the related posts but according to them

the main delay with GPUs is due to memory transfer, not memory allocation

And also in one of the posts it was written that

The very first call to any of the cuda library functions launches an initialisation subroutine

What is the actual reason for this delay? Or is it not normal to have such a delay in the execution of the code?

Thanks in Advance


Is it possible that the large delay you are seeing (nearly 1s) is due to driver initialisation? It seems rather long for a cudaMalloc. Also check your driver is up-to-date.

The delay for the first kernel launch can be due to a number of factors:

  1. Driver initialisation
  2. PTX compilation
  3. Context creation

The first of these is only applicable if you are running on a Linux system without X. In that case the driver is only loaded when required and unloaded afterwards. Running nvidia-smi -pm 1 as root puts the driver in persistence mode to avoid such delays; check out man nvidia-smi for details, and remember to add this to an init script since it won't persist across a reboot.
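A minimal sketch of enabling persistence mode as described above (the grep pattern is just one way to verify; output wording can vary between driver versions):

```shell
# Keep the driver loaded on a headless Linux box so the first CUDA
# call doesn't pay the driver load cost. Must run as root.
sudo nvidia-smi -pm 1

# Verify: look for "Persistence Mode : Enabled" in the query output.
nvidia-smi -q | grep -i persistence
```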

The second delay is in compiling the PTX for the specific device architecture in your system. This is easily avoided by embedding the binary for your device architecture (or architectures if you want to support multiple archs without compiling PTX) into the executable. See the CUDA C Programming Guide (available on NVIDIA website) for more information, section 3.1.1.2 talks about JIT compilation.
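As a sketch, embedding native code for your device architecture at build time looks roughly like this (the file names and the sm_20/sm_30 targets are placeholders; substitute the compute capability of your own GPU):

```shell
# Generate native SASS for each target architecture so no JIT
# compilation of PTX happens at first launch.
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -o myfft myfft.cu -lcufft
```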

The third point, context creation, is unavoidable but NVIDIA have gone to great effort to reduce the cost. Context creation involves copying the executable code to the device, copying any data objects, setting up the memory system etc.
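Since context creation is lazy, one common pattern is to trigger it deliberately before any timed region. A minimal sketch (cudaFree(0) is a conventional warm-up idiom; the 64 MiB size is arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Force driver initialisation and context creation up front,
    // so the timing below measures only cudaMalloc itself.
    cudaFree(0);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    float *d_buf;
    cudaEventRecord(start);
    cudaMalloc(&d_buf, 1 << 26);   // 64 MiB device allocation
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMalloc after warm-up: %.3f ms\n", ms);

    cudaFree(d_buf);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

With the warm-up call in place, the near-one-second cost should move to the cudaFree(0) line, confirming it was initialisation rather than allocation.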


It is understandable. nvcc embeds PTX code into the application binary, which has to be compiled to a native GPU binary by a JIT compiler at run time. This accounts for the start-up delay. AFAIK, cudaMalloc is not slower than cudaMemcpy.

It is also true that nvcc inserts calls to cudaRegisterFatBinary and cudaRegisterFunction into your code to register your kernel and its entry point with the runtime. I guess this is the initialisation you are talking about.

