CUDA Efficient memory access

2023-03-11 02:45

I want to store an image into device and I want to process it. I am using the following to copy the image to memory.

int *image = new int[W*H];
// init image here
int *devImage;
int sizei = W*H*sizeof(int);
cudaMalloc((void**)&devImage, sizei);
cudaMemcpy(devImage, image, sizei, cudaMemcpyHostToDevice);
//call device function here.

I have two device functions. The first accesses the image from left to right and the second accesses it from top to bottom. I found that the top-to-bottom access takes much less time than the left-to-right access. This is because of the time needed to access the memory. How can I access memory efficiently in CUDA?

This sounds like it may be an issue with coalesced memory access. You should try to have consecutive threads access consecutive elements from memory.

For example, assume you're using 10 threads (numbered 0-9) and you're operating on a 10x10 element data set. It is easy to picture the data laid out in a grid like the one below. In memory, however, the way you declared it in your code, it is laid out linearly, row after row, as a 100-element 1D array.

 0,  1,  2,  3...   9,
10, 11, 12, 13...  19,
20, 21, 22, 23...  29,
30, 31, 32, 33...  39,
 .   .              .
 .        .         .
 .             .    .
90, 91, 92, 93...  99
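The mapping from the grid picture to the 1D array can be written as a one-line helper. This is only a sketch; the name linearIndex is made up for illustration and does not appear in the question's code.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: position of element (row, col) in a row-major
// 1D array whose rows are `width` elements long.
__host__ __device__ inline int linearIndex(int row, int col, int width)
{
    return row * width + col;
}
```

With width = 10, elements in the same row are adjacent in memory (indices differ by 1), while elements in the same column are 10 elements apart.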

It sounds like your second function, going "from top to bottom", is performing coalesced reads: the ten threads operate on elements 0, 1, 2, 3... 9, then 10, 11, 12, 13... 19, etc. These reads are coalesced because, at each step, the ten threads read ten elements that are adjacent in the 1D linear memory layout.

It sounds like your first function, going "from left to right", may be accessing your array in an uncoalesced manner: the ten threads operate on elements 0, 10, 20, 30... 90, then 1, 11, 21, 31... 91, etc. In this case, reads are uncoalesced because the ten consecutive threads are reading memory locations that are actually far apart. Remember, in a 1D linear memory layout, elements 12 and 22 are ten elements (40 bytes for int) away from one another!
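To make the two patterns concrete, here is a minimal sketch of what such kernels might look like. The actual device functions are not shown in the question, so the kernel names, the column/row sums they compute, and the host helper runAccessPatternDemo are all assumptions made for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Coalesced ("top to bottom"): thread i owns column i and walks down it.
// At every step, neighboring threads read adjacent ints of the same row.
__global__ void sumColumnsCoalesced(const int *img, int *out, int W, int H)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col >= W) return;
    int sum = 0;
    for (int row = 0; row < H; ++row)
        sum += img[row * W + col];
    out[col] = sum;
}

// Uncoalesced ("left to right"): thread i owns row i and walks along it.
// At every step, neighboring threads read addresses W ints apart.
__global__ void sumRowsStrided(const int *img, int *out, int W, int H)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= H) return;
    int sum = 0;
    for (int col = 0; col < W; ++col)
        sum += img[row * W + col];
    out[row] = sum;
}

// Hypothetical host helper: runs both kernels and checks them against the CPU.
bool runAccessPatternDemo(int W, int H)
{
    std::vector<int> image(W * H);
    for (int i = 0; i < W * H; ++i) image[i] = i % 7;   // arbitrary test data

    int *devImage, *devColSums, *devRowSums;
    cudaMalloc((void**)&devImage, W * H * sizeof(int));
    cudaMalloc((void**)&devColSums, W * sizeof(int));
    cudaMalloc((void**)&devRowSums, H * sizeof(int));
    cudaMemcpy(devImage, image.data(), W * H * sizeof(int), cudaMemcpyHostToDevice);

    const int threads = 128;
    sumColumnsCoalesced<<<(W + threads - 1) / threads, threads>>>(devImage, devColSums, W, H);
    sumRowsStrided<<<(H + threads - 1) / threads, threads>>>(devImage, devRowSums, W, H);
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);

    std::vector<int> colSums(W), rowSums(H);
    cudaMemcpy(colSums.data(), devColSums, W * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(rowSums.data(), devRowSums, H * sizeof(int), cudaMemcpyDeviceToHost);

    // Compare with a straightforward CPU computation.
    for (int col = 0; col < W; ++col) {
        int ref = 0;
        for (int row = 0; row < H; ++row) ref += image[row * W + col];
        if (ref != colSums[col]) ok = false;
    }
    for (int row = 0; row < H; ++row) {
        int ref = 0;
        for (int col = 0; col < W; ++col) ref += image[row * W + col];
        if (ref != rowSums[row]) ok = false;
    }

    cudaFree(devImage);
    cudaFree(devColSums);
    cudaFree(devRowSums);
    return ok;
}
```

Both kernels use the same indexing expression, row * W + col; the only difference is which coordinate is tied to the thread index. If the left-to-right work cannot be reorganized this way, common alternatives are transposing the image first or staging tiles in shared memory, though which one pays off depends on the actual computation.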

The Best Practices Guide discusses the importance of coalesced access in section 3.2.1, and there's a pretty good description of coalesced accesses in this post.

For random access, use texture memory or surface memory.
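As a rough sketch of the texture route, the following reads a linear int buffer through a texture object using tex1Dfetch. The scattered index pattern, the kernel name gatherViaTexture, and the helper runTextureGatherDemo are invented for illustration; whether this is faster than plain global loads depends on the GPU and the access pattern, so it is worth measuring.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Each thread fetches one element at an arbitrary index through the texture path.
__global__ void gatherViaTexture(cudaTextureObject_t tex, const int *indices, int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<int>(tex, indices[i]);
}

// Hypothetical host helper: builds a texture object over a linear buffer,
// runs the gather, and checks the result on the CPU.
bool runTextureGatherDemo(int n)
{
    std::vector<int> image(n), indices(n), result(n);
    for (int i = 0; i < n; ++i) {
        image[i] = 3 * i + 1;                             // arbitrary test data
        indices[i] = (int)(((long long)i * 7919) % n);    // made-up scattered pattern
    }

    size_t bytes = n * sizeof(int);
    int *devImage, *devIndices, *devOut;
    cudaMalloc((void**)&devImage, bytes);
    cudaMalloc((void**)&devIndices, bytes);
    cudaMalloc((void**)&devOut, bytes);
    cudaMemcpy(devImage, image.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(devIndices, indices.data(), bytes, cudaMemcpyHostToDevice);

    // Describe the existing linear device buffer as a texture resource.
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = devImage;
    resDesc.res.linear.desc = cudaCreateChannelDesc<int>();
    resDesc.res.linear.sizeInBytes = bytes;

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;           // return raw ints, no conversion

    cudaTextureObject_t tex = 0;
    bool ok = (cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr) == cudaSuccess);

    if (ok) {
        const int threads = 128;
        gatherViaTexture<<<(n + threads - 1) / threads, threads>>>(tex, devIndices, devOut, n);
        ok = (cudaDeviceSynchronize() == cudaSuccess);
        cudaMemcpy(result.data(), devOut, bytes, cudaMemcpyDeviceToHost);
        for (int i = 0; i < n && ok; ++i)
            if (result[i] != image[indices[i]]) ok = false;
        cudaDestroyTextureObject(tex);
    }

    cudaFree(devImage);
    cudaFree(devIndices);
    cudaFree(devOut);
    return ok;
}
```

The texture object wraps the same device buffer that cudaMalloc returned, so no extra copy of the image is needed.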
