Increasing block size decreases performance
P.S. I have edited this post with my actual code .I am downsampling an image so every 4 neighoring pixels (so for eg 1,2 from row 1 and 1,2 from row 2) give an output pixel.I get a effective bw of .5GB/s compared to theoretical maximum of 86.4 GB/s.The time I use is the difference in calling the kernel with instructions and calling an empty kernel. It looks pretty bad to me right now but I cant figure out what am I doing wrong.
__global__ void streamkernel(int *r_d,int *g_d,int *b_d,int height ,int width,int *f_r,int *f_g,int *f_b){
int id=blockIdx.x * blockDim.x*blockDim.y+ threadIdx.y*blockDim.x+threadIdx.x+blockIdx.y*gridDim.x*blockDim.x*blockD开发者_如何学编程im.y;
int number=2*(id%(width/2))+(id/(width/2))*width*2;
if (id<height*width/4)
{
f_r[id]=(r_d[number]+r_d[number+1];+r_d[number+width];+r_d[number+width+1];)/4;
f_g[id]=(g_d[number]+g_d[number+1]+g_d[number+width]+g_d[number+width+1])/4;
f_b[id]=(g_d[number]+g_d[number+1]+g_d[number+width]+g_d[number+width+1];)/4;
}
}
Try looking up the matrix multiplication example in CUDA SDK examples for how to use shared memory.
The problem with your current kernel is that it's doing 4 global memory reads and 1 global memory write for each 3 additions and 1 division. Each global memory access costs roughly 400 cycles. This means you're spending the vast majority of time doing memory access (what GPUs are bad at) rather than compute (what GPUs are good at).
Shared memory in effect allows you to cache this so that amortized, you get roughly 1 read and 1 write at each pixel for 3 additions and 1 division. That is still not doing so great on the CGMA ratio (compute to global memory access ratio, the holy grail of GPU computing).
Overall, I think for a simple kernel like this, a CPU implementation is likely going to be faster given the overhead of transferring data across the PCI-E bus.
You're forgetting the fact that one multiprocessor can execute up to 8 blocks simultaneously and the maximum performance is reached exactly then. However there are many factors that limit the number of blocks that can exist in parallel (incomplete list):
- Maximum amount of shared memory per multiprocessor limits the number of blocks if #blocks * shared memory per block would be > total shared memory.
- Maximum number of threads per multiprocessor limits the number of blocks if #blocks * #threads / block would be > max total #threads.
- ...
You should try to find a kernel execution configuration that causes exactly 8 blocks to be run on one multiprocessor. This will almost always yield the highest performance even if the occupancy is =/= 1.0! From this point on you can try to iteratively make changes that reduce the number of executed blocks per MP, but therefore increase the occupancy of your kernel and see if the performance increases.
The nvidia occupancy calculator(excel sheet) will be of great help.
精彩评论