Copying an integer from GPU to CPU

2023-02-17 08:15 问答作者：

I need to copy a single boolean or an integer value from the device to the host after every kernel call (I am calling the same kernel in a for loop). That is, after every kernel call, I need to send an integer or a boolean back to the host. What is the best way to do this?

Should I write the value directly to RAM? Or should I use cudaMemcpy()? Or is there any other way to do t开发者_JS百科his? Would copying just 1 integer after every kernel launch slow down my program?

Let me first answer your last question:

Would copying just 1 integer after every kernel launch slow down my program?

A bit - yes. Issuing the command, waiting for GPU to respond, etc, etc... The amount of data (1 int vs 100 ints) probably doesn't really matter in this case. However, you can still achieve speeds of thousands memory transfers per second. Most likely, your kernel will be slower than this single memory transfer (otherwise, it would be probably better to do the whole task on a CPU)

what is the best way to do this?

Well, I would suggest simply trying it yourself. As you said: you can either use mapped-pinned memory and have your kernel store the value directly to RAM, or use cudaMemcpy. The first one might be better if your kernels still have some work to do after sending the integer back. In that case, the latency of sending it to host could be hidden by the execution of the kernel.

If you use the first method, you will have to call cudaThreadsynchronize() to make sure the kernel ended its execution. Kernel calls are asynchronous.

You can use cudaMemcpyAsync which is also asynchronous, but GPU cannot have kernel running and having cudaMemcpyAsync executed in parallel, unless you use streams.

I never actually tried that, but if your program won't crash if the loop executes too many times, you might try to ignore synchronisation and let it iterate until the special value is seen in RAM. In that solution, the memory transfer might be completely hidden and you would pay an overhead only at the end. You will need however to somehow prevent the loop from iterating too many times, CUDA events may be helpful.

Why not use pinned memory? If your system supports it -- see CUDA C Programming Guide's section on pinned memory.

Copying data to and from the GPU will be much slower than accessing the data from the CPU. If you are not running a significant number of threads for this value then this will result in very slow performance, don't do it.

What you are describing sounds like a serial algorithm, your algorithm needs to be parallelised in order to make it worth doing using CUDA. If you can't rewrite your algorithm to become a single write of multiple data to the GPU, multiple threads, single write of multiple data back to CPU; then your algorithm should be done on CPU.

If you need the value computed in the previous kernel call to launch the next one then is serialized and your choice is to cudaMemcpy(dst,src, size =1, ...);

If all the kernel launch parameters do not depend on the previous launch then you can store all the result of each kernel invocation in GPU memory and then download all the results at once.

Copying an integer from GPU to CPU

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？