Increasing block size decreases performance

2023-02-11 20:09 问答作者：

In my cuda code if I increase the blocksizeX ,blocksizeY it actually is taking more time .[Therefore I run it at 1x1]Also a chunk of my execution time ( for eg 7 out of 9 s ) is taken by just the call to the kernel .Infact I am quite amazed that even if I comment out the entire kernel the time is almost same.Any suggestions where and how to optimize?

P.S. I have edited this post with my actual code .I am downsampling an image so every 4 neighoring pixels (so for eg 1,2 from row 1 and 1,2 from row 2) give an output pixel.I get a effective bw of .5GB/s compared to theoretical maximum of 86.4 GB/s.The time I use is the difference in calling the kernel with instructions and calling an empty kernel. It looks pretty bad to me right now but I cant figure out what am I doing wrong.

 __global__ void streamkernel(int *r_d,int *g_d,int *b_d,int height ,int width,int *f_r,int *f_g,int *f_b){


    int id=blockIdx.x * blockDim.x*blockDim.y+ threadIdx.y*blockDim.x+threadIdx.x+blockIdx.y*gridDim.x*blockDim.x*blockD开发者_如何学编程im.y;
    int number=2*(id%(width/2))+(id/(width/2))*width*2;

     if (id<height*width/4)
    {

        f_r[id]=(r_d[number]+r_d[number+1];+r_d[number+width];+r_d[number+width+1];)/4;                              
        f_g[id]=(g_d[number]+g_d[number+1]+g_d[number+width]+g_d[number+width+1])/4;             
        f_b[id]=(g_d[number]+g_d[number+1]+g_d[number+width]+g_d[number+width+1];)/4;  
    }


  }

Try looking up the matrix multiplication example in CUDA SDK examples for how to use shared memory.

The problem with your current kernel is that it's doing 4 global memory reads and 1 global memory write for each 3 additions and 1 division. Each global memory access costs roughly 400 cycles. This means you're spending the vast majority of time doing memory access (what GPUs are bad at) rather than compute (what GPUs are good at).

Shared memory in effect allows you to cache this so that amortized, you get roughly 1 read and 1 write at each pixel for 3 additions and 1 division. That is still not doing so great on the CGMA ratio (compute to global memory access ratio, the holy grail of GPU computing).

Overall, I think for a simple kernel like this, a CPU implementation is likely going to be faster given the overhead of transferring data across the PCI-E bus.

You're forgetting the fact that one multiprocessor can execute up to 8 blocks simultaneously and the maximum performance is reached exactly then. However there are many factors that limit the number of blocks that can exist in parallel (incomplete list):

Maximum amount of shared memory per multiprocessor limits the number of blocks if #blocks * shared memory per block would be > total shared memory.
Maximum number of threads per multiprocessor limits the number of blocks if #blocks * #threads / block would be > max total #threads.
...

You should try to find a kernel execution configuration that causes exactly 8 blocks to be run on one multiprocessor. This will almost always yield the highest performance even if the occupancy is =/= 1.0! From this point on you can try to iteratively make changes that reduce the number of executed blocks per MP, but therefore increase the occupancy of your kernel and see if the performance increases.

The nvidia occupancy calculator(excel sheet) will be of great help.

Increasing block size decreases performance

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？