Dividing up CUDA cudaMemcpy into chunks
A co-worker and I were brainstorming on how to mitigate the memory transfer time between host and device, and it came up that perhaps arranging things into one mega-transfer (i.e. one single call) might help. This led me to create a test case where I took timings of transferring a few large data chunks vs. many small data chunks. I got some very interesting/strange results, and was wondering if anyone here had an explanation?
I won't put my whole code up here since it's quite long, but I tested the chunking in two different ways:
Explicitly writing out all cudaMemcpy's, e.g.:
cudaEventRecord(start, 0);
cudaMemcpy(aD, a, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 1*nBytes/10, a + 1*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 2*nBytes/10, a + 2*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 3*nBytes/10, a + 3*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 4*nBytes/10, a + 4*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 5*nBytes/10, a + 5*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 6*nBytes/10, a + 6*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 7*nBytes/10, a + 7*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 8*nBytes/10, a + 8*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaMemcpy(aD + 9*nBytes/10, a + 9*nBytes/10, nBytes/10, cudaMemcpyHostToDevice);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);

Putting the cudaMemcpy's into a for loop:
cudaEventRecord(start, 0);
for(int i = 0; i < nChunks; i++)
{
    cudaMemcpy(aD + i*nBytes/nChunks, a + i*nBytes/nChunks, nBytes/nChunks, cudaMemcpyHostToDevice);
}
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
To note, I also did a "warm-up" transfer at the start of each test just in case, though I don't think it was needed (the context was created by a cudaMalloc call).
I tested this on total transfer sizes ranging from 1 MB to 1 GB, where each test case transferred the same amount of information regardless of how it was chunked up. A sample of my output is this:
single large transfer   = 0.451616 ms
10 explicit transfers   = 0.198016 ms
100 explicit transfers  = 0.691712 ms
10 looped transfers     = 0.174848 ms
100 looped transfers    = 0.683744 ms
1000 looped transfers   = 6.145792 ms
10000 looped transfers  = 104.981247 ms
100000 looped transfers = 13097.441406 ms
What's interesting here and what I don't get is that, across the board, the 10 transfers were ALWAYS faster by a significant amount than any of the others, even the single large transfer! And that result stayed consistent no matter how large or small the data set was (i.e. 10x100MB vs 1x1GB or 10x1MB vs 1x10MB still results in the 10x being faster). If anyone has any insight on why this is or what I may be doing wrong to get these weird numbers, I would be very interested to hear what you have to say.
Thanks!
P.S. I know that cudaMemcpy carries an implicit synchronization, so I could have used a CPU timer and the cudaEventSynchronize is redundant, but I figured it was better to be on the safe side.
UPDATE: I wrote a function to try and take advantage of this apparent rip in the performance space-time continuum. When I use that function, though, which is written EXACTLY as in my test cases, the effect goes away and I see what I expect (a single cudaMemcpy is fastest). Perhaps this is all more akin to quantum physics than relativity, wherein the act of observing changes the behavior...
cudaMemcpy() is synchronous - CUDA waits until the memcpy is done before returning to your app.
If you call cudaMemcpyAsync(), the driver will return control to your app before the GPU necessarily has performed the memcpy.
It's critical that you call cudaMemcpyAsync() instead of cudaMemcpy(). Not because you want to overlap the transfers with GPU processing, but because that is the only way you will get CPU/GPU concurrency.
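As a rough illustration (not part of this answer), an async version of the chunked copy might look like the sketch below. The pinned allocation via cudaMallocHost, the single stream, and the buffer size are assumptions; the names a/aD just mirror the question's code.

// Sketch: chunked asynchronous copies queued on one stream. For cudaMemcpyAsync
// to return to the CPU immediately, the host buffer must be pinned (page-locked),
// so it is allocated with cudaMallocHost here instead of malloc.
float *a = NULL, *aD = NULL;
size_t nBytes = 100 * 1024 * 1024;                 // example size, not from the question
cudaMallocHost((void**)&a, nBytes);                // pinned host memory
cudaMalloc((void**)&aD, nBytes);

cudaStream_t stream;
cudaStreamCreate(&stream);

int nChunks = 10;
size_t chunk = nBytes / nChunks;
for (int i = 0; i < nChunks; i++)
{
    // Control returns to the CPU as soon as each copy is queued.
    cudaMemcpyAsync((char*)aD + i*chunk, (char*)a + i*chunk, chunk,
                    cudaMemcpyHostToDevice, stream);
}
cudaStreamSynchronize(stream);                     // block until every queued copy has finished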
On a cg1.4xlarge instance in Amazon EC2, it takes ~4 microseconds for the driver to request a memcpy of the GPU; so CPU/GPU concurrency is a good way to hide driver overhead.
I don't have a ready explanation for the disparity you are seeing at 10 - the main knee I'd expect to see is where the memcpy crosses over 64K in size. The driver inlines memcpy's smaller than 64K into the same buffer used to submit commands.
Use cudaThreadSynchronize() before and after each CUDA call to get the real memory transfer time. cudaMemcpy() is synchronous, but not necessarily with the CPU's execution; that depends on the function called.
CUDA calls are synchronous with respect to other CUDA calls, such as other memory transfers or kernel launches; this is managed in a separate CUDA thread that is invisible to the CUDA developer. cudaMemcpyAsync() is asynchronous with respect to other CUDA calls, which is why the GPU memory segments it copies must not overlap with those of other concurrent memory transfers.
Are you sure that cudaMemcpy(), which is synchronous within the CUDA execution thread, is in this case also synchronous with the CPU thread? Depending on the CUDA function it may or may not be, but if you use cudaThreadSynchronize() when measuring times, it will be synchronous with the CPU for sure, and the real time of each step will show up.
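As a minimal sketch of where those synchronization points would go (an assumption layered onto the question's event-based timing, not code from this answer):

cudaThreadSynchronize();                  // drain anything already queued before starting the clock
cudaEventRecord(start, 0);
cudaMemcpy(aD, a, nBytes, cudaMemcpyHostToDevice);
cudaThreadSynchronize();                  // make sure the copy has really finished
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);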
Perhaps it is some peculiarity in how CUDA measures time. You are measuring times which are less than 1 ms, which is very small. Did you try to time it with CPU based timer and compare results?
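One way to do that cross-check, sketched under the assumption that a gettimeofday-based wall clock is an acceptable CPU timer (buffer names again follow the question; requires <sys/time.h> and <stdio.h>):

struct timeval t0, t1;
cudaThreadSynchronize();                  // nothing pending before the measurement starts
gettimeofday(&t0, NULL);
cudaMemcpy(aD, a, nBytes, cudaMemcpyHostToDevice);
cudaThreadSynchronize();                  // the copy is definitely finished
gettimeofday(&t1, NULL);
double cpuMs = (t1.tv_sec - t0.tv_sec) * 1000.0
             + (t1.tv_usec - t0.tv_usec) / 1000.0;
printf("CPU timer: %f ms (compare against cudaEventElapsedTime)\n", cpuMs);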