CUDA Efficient memory access
I want to store an image into device and I want to process it. I am using the following to copy the image to memory.
int *image = new int[W*H];
//init image h开发者_开发百科ere
int *devImage;
int sizei = W*H*sizeof(int);
cudaMalloc((void**)&devImage, sizei);
cudaMemcpy(devImage, image, sizei, cudaMemcpyHostToDevice);
//call device function here.
I have two device functions. In the first function I am accessing the image from left to right and in the second function I am accessing it from top to bottom. I found that the top to bottom access takes very less time compare to left to right. This is because of the time needed for accessing the memory. How can I efficiently access the memory in CUDA?
This sounds like it may be an issue with coalesced memory access. You should try to have consecutive threads access consecutive elements from memory.
For example, assume you're using 10 threads (numbered 0-9) and you're operating on a 10x10 element data set. It is easy to picture the data laid out in a grid like it is below, however, in memory, the way you declared it in your code, it is laid out in a linear manner, as a 100-element 1D array.
0, 1, 2, 3... 9,
10, 11, 12, 13... 19,
20, 21, 22, 23... 29,
30, 31, 32, 33... 39,
. . .
. . .
. . .
90, 91, 92, 93... 99
It sounds like your first implementation going "from top to bottom" is performing coalesced reads -- the ten threads operate on elements 0, 1, 2, 3... 9, then 10, 11, 12, 13... 19, etc. These reads are coalesced because the ten threads read ten elements that are adjacent in the 1D linear memory layout.
It sounds like your second implementation going "from left to right" may be accessing your array in an uncoalesced manner -- the ten threads operate on elements 0, 10, 20, 30... 90, then 1, 11, 21, 31... 91, etc. In this case, reads are uncoalesced because the ten consecutive threads are reading memory locations that are actually far apart. Remember, in a 1D linear memory layout, elements 12 and 22 are ten memory addresses away from one another!
The Best Practices Guide discusses the importance of coalesced access in section 3.2.1, and there's a pretty good description of coalesced accesses in this post.
Random access - Use texture memory or surface memory..
精彩评论