CUDA version slower than CPU version?
I am writing an image subsampler in CUDA and using threads to perform the averaging operation. However, if I do this without calling the kernel, it runs much faster than when I actually call the CUDA kernel. The image size right now is 1280x1024. Does the kernel call usually take substantial time, or is there something wrong with my implementation?
P.S. I tried calling just the kernel (with its body removed) and it takes pretty much the same time as the kernel with the code in it. Also, my code without the kernel call runs in approximately 350 ms, whereas with the kernel call it runs close to 1000 ms.
__global__ void subsampler(int *r_d, int *g_d, int *b_d, int height, int width, int *f_r, int *f_g, int *f_b) {
    int id = blockIdx.x * blockDim.x * blockDim.y + threadIdx.y * blockDim.x + threadIdx.x + blockIdx.y * gridDim.x * blockDim.x * blockDim.y;
    if (id < height * width / 4) {
        f_r[id] = (r_d[4*id] + r_d[4*id+1] + r_d[4*id+2] + r_d[4*id+3]) / 4;
        f_g[id] = (g_d[4*id] + g_d[4*id+1] + g_d[4*id+2] + g_d[4*id+3]) / 4;
        f_b[id] = (b_d[4*id] + b_d[4*id+1] + b_d[4*id+2] + b_d[4*id+3]) / 4;
    }
}
I define blockSizeX and blockSizeY to be 1 and 1 (I tried making them 4 and 16), but somehow 1x1 is the fastest:
dim3 blockSize(blocksizeX, blocksizeY);
int new_width = img_width / 2;
int new_height = img_height / 2;
int n_blocks_x = new_width / blocksizeX + (new_width % blocksizeX == 0 ? 0 : 1);
int n_blocks_y = new_height / blocksizeY + (new_height % blocksizeY == 0 ? 0 : 1);
dim3 gridSize(n_blocks_x, n_blocks_y);
and then I call the kernel with gridSize and blockSize.
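The launch itself is roughly the following (the device buffers here are the ones from the kernel signature, allocated with cudaMalloc beforehand):

subsampler<<<gridSize, blockSize>>>(r_d, g_d, b_d, img_height, img_width, f_r, f_g, f_b);
cudaDeviceSynchronize();  /* make sure the kernel has finished before timing or copying back */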
It may be that the kernel is not implemented very well, or it may be that the overhead of moving your data to and from the GPU card is swamping any computational benefit. Try benchmarking the kernel in isolation (without CPU <-> GPU memory transfers) to see how much of your total time is taken by the kernel and how much by memory transfers. You can then decide based on these measurements whether you need to do more work on the kernel.
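For example, CUDA events will time just the kernel. A minimal sketch, reusing the launch configuration from your post (error checking omitted):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
subsampler<<<gridSize, blockSize>>>(r_d, g_d, b_d, img_height, img_width, f_r, f_g, f_b);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   /* kernel-only time, in milliseconds */
printf("kernel: %.3f ms\n", ms);

Bracket your cudaMemcpy calls with the same pair of events to see what the transfers cost by comparison.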
While I'm not sure what hardware you're running this on, you should be able to make this kernel run at closer to 1000 fps, rather than 1000 ms/frame :)
Suggestion 1: If this processing has any interaction with visualization, through OpenGL/DirectX or similar, just do this as a shader -- all the details of grid/block size, memory layout, etc., are handled for you. If you really need to implement this yourself in CUDA, then keep reading:
First, I assume you're subsampling your 1280x1024 image by a factor of 2 in each direction, yielding a 640x512 image. Each pixel in the resulting image is the average of four pixels in the source image. The images have three channels, RGB.
Question 1: Do you really want 32 bits per channel or did you want RGB888 (8 bits per channel)? RGB888 is fairly common -- I will assume this is what you meant.
Question 2: Is your data actually planar, or are you extracting it from an interleaved format? RGB888 is an interleaved format, where pixels appear in memory as RGBRGBRGB. I would write your kernel to process the image in its native format. I will assume your data is actually planar, so you have three planes, R8, G8, and B8.
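To make the difference concrete, here is how the two layouts sit in memory (a sketch, not from your code):

/* Interleaved RGB888: one buffer, 3 bytes per pixel, laid out as
   R0 G0 B0 R1 G1 B1 R2 G2 B2 ... */
unsigned char *rgb;              /* width * height * 3 bytes  */

/* Planar R8/G8/B8: three buffers, 1 byte per pixel in each, as
   R0 R1 R2 ...   G0 G1 G2 ...   B0 B1 B2 ... */
unsigned char *r8, *g8, *b8;     /* width * height bytes each */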
The first thing to do is consider memory layout. You will want one thread for every pixel in the destination image. Given that the memory access pattern for subsampling is not coalesced, you will want to read the pixel data into shared memory. Consider a block size of 32x8 threads. Each output pixel averages a 2x2 block of source pixels, so each block reads in 32*8*4 = 1024 pixels per channel, or 3072 bytes total at 3 bytes per pixel. You will actually read in slightly more than that, to keep the loads coalesced, for a total of 4096 bytes per block. This now gives you:
dim3 block(32, 8);
dim3 grid(1280 / 2 / 32, 1024 / 2 / 8); // 20x64 blocks of 256 threads
Now comes the fun part: doing the shared memory. Your kernel could look like this:
typedef unsigned char uchar;  /* assuming 8-bit channels (RGB888) */

__global__ void subsample(uchar* r, uchar* g, uchar* b,    // in
                          uchar* ro, uchar* go, uchar* bo) // out
{
/* Flatten the 32x8 thread block into one index, 0..255 */
int tid = threadIdx.y * blockDim.x + threadIdx.x;
/* Global offset (in 4-byte words) into the output pixel arrays:
   each block produces 256 output pixels = 64 words per channel */
int gid = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x * blockDim.y / 4);
/* Global offset (in 4-byte words) into the input pixel arrays:
   each block consumes four input pixels per output pixel */
int gidin = gid * 4;
__shared__ uchar rc[1024];
__shared__ uchar gc[1024];
__shared__ uchar bc[1024];
/* Read r, g, and b into the shmem cache: 256 coalesced 4-byte
   loads = 1024 bytes per channel */
((int*)rc)[tid] = ((int*)r)[gidin + tid];
((int*)gc)[tid] = ((int*)g)[gidin + tid];
((int*)bc)[tid] = ((int*)b)[gidin + tid];
__syncthreads();
/* Shared memory for output */
__shared__ uchar roc[256];
__shared__ uchar goc[256];
__shared__ uchar boc[256];
/* Do the subsampling, one output pixel per thread. Store into the
   output shared memory. (The 2x2 averaging and row addressing are
   elided here.) */
...
__syncthreads();
/* Finally, write the result to global memory with coalesced stores:
   64 threads per channel, one 4-byte word each */
if (tid < 64) {
((int*)ro)[gid + tid] = ((int*)roc)[tid];
} else if (tid < 128) {
((int*)go)[gid + tid - 64] = ((int*)goc)[tid - 64];
} else if (tid < 192) {
((int*)bo)[gid + tid - 128] = ((int*)boc)[tid - 128];
}
}
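For completeness, the host side to go with this kernel might look like the following (a sketch only; buffer names are illustrative and error checking is omitted):

const int W = 1280, H = 1024;
uchar *r, *g, *b;        /* input planes, W*H bytes each          */
uchar *ro, *go, *bo;     /* output planes, (W/2)*(H/2) bytes each */
cudaMalloc(&r,  W * H);
cudaMalloc(&g,  W * H);
cudaMalloc(&b,  W * H);
cudaMalloc(&ro, (W/2) * (H/2));
cudaMalloc(&go, (W/2) * (H/2));
cudaMalloc(&bo, (W/2) * (H/2));
/* ... cudaMemcpy the source planes up ... */
dim3 block(32, 8);
dim3 grid(W / 2 / 32, H / 2 / 8);   /* 20x64 blocks of 256 threads */
subsample<<<grid, block>>>(r, g, b, ro, go, bo);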
Whew! Lot of stuff there, sorry for the code dump. Some principles to keep in mind:
1) Memory is fast when you use coalesced loads/stores. That means each of the 32 threads in a warp accesses a consecutive 4-byte word; when the word index matches the thread's index within the warp, all 32 accesses are combined into a single 128-byte transaction. This is how you get to the ~100 GB/s bandwidth of the GPU. (See the first sketch after this list.)
2) The pattern of memory access when doing subsampling is not coalesced, since it relies on 2D spatial locality, which the raw memory layout does not have. (You could use texture memory for this as well -- see the second sketch below.) By staging the input in shared memory and then processing it from there, you minimize the impact of the uncoalesced accesses on your compute performance.
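To illustrate principle 1, here is the difference in code (a sketch):

__global__ void copy_coalesced(const int* in, int* out)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
/* Thread i loads word i: each warp's 32 4-byte accesses combine
   into a single 128-byte transaction */
out[i] = in[i];
}

__global__ void copy_strided(const int* in, int* out)
{
int i = blockIdx.x * blockDim.x + threadIdx.x;
/* Stride-4 reads, like your original subsampler: each warp now
   spans 512 bytes and the hardware issues several transactions */
out[i] = in[4 * i];
}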
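And for the texture-memory aside in principle 2, the setup would look roughly like this with the texture-object API (a sketch, not part of the answer above; it assumes the R plane was allocated with cudaMallocPitch):

cudaResourceDesc res = {};
res.resType = cudaResourceTypePitch2D;
res.res.pitch2D.devPtr = r;
res.res.pitch2D.desc = cudaCreateChannelDesc<unsigned char>();
res.res.pitch2D.width = 1280;
res.res.pitch2D.height = 1024;
res.res.pitch2D.pitchInBytes = pitch;      /* returned by cudaMallocPitch */

cudaTextureDesc td = {};
td.readMode = cudaReadModeElementType;     /* raw 8-bit values, point sampled */

cudaTextureObject_t texR;
cudaCreateTextureObject(&texR, &res, &td, NULL);

/* In the kernel, the texture cache exploits 2D locality for you:
   unsigned char p = tex2D<unsigned char>(texR, x + 0.5f, y + 0.5f); */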
I hope this helps -- I can reply with more detail on some parts if you'd like.