
CUDA register pressure

I have a kernel that does a linear least-squares fit. It turns out the threads are using too many registers, so occupancy is low. Here is the kernel:

__global__
void strainAxialKernel(
    float* d_dis,
    float* d_str
){
    int i = threadIdx.x;
    float a = 0;
    float c = 0;
    float e = 0;
    float f = 0;
    int shift = (int)((float)(i*NEIGHBOURS)/(float)WINDOW_PER_LINE);
    int j;
    __shared__ float dis[WINDOW_PER_LINE];
    __shared__ float str[WINDOW_PER_LINE];

    // fetch data from global memory
    dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
    __syncthreads();

    // least square fit
    for (j=-shift; j<NEIGHBOURS-shift; j++)                                     
    {                                                                           
        a += j;                                                                 
        c += j*j;                                                               
        e += dis[i+j];                                                          
        f += (float(j))*dis[i+j];                                               
    }                                                                       
    str[i] = AMP*(a*e-NEIGHBOURS*f)/(a*a-NEIGHBOURS*c)/(float)BLOCK_SPACING;    

    // compensate attenuation
    if (COMPEN_EXP>0 && COMPEN_BASE>0)                                          
    {                                                                           
        str[i]                                                                  
        = (float)(str[i]*pow((float)i/(float)COMPEN_BASE+1.0f,COMPEN_EXP));
    }   

    // write back to global memory
    if (!SIGN_PRESERVE && str[i]<0)                                             
    {                                                                           
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = -str[i];                          
    }                                                                           
    else                                                                        
    {                                                                           
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = str[i];                           
    }
}

I have 32x404 blocks with 96 threads in each block. On a GTS 250, each SM should be able to handle 8 blocks. Yet the Visual Profiler shows 11 registers per thread, and as a result occupancy is 0.625 (5 blocks per SM). BTW, the shared memory used by each block is 792 B, so the registers are the problem. The performance is not the end of the world; I am just curious whether there is any way I can get around this. Thanks.


There is always a trade-off between the fast but limited registers/shared memory and the slow but large global memory. There's no way to "get around" that trade-off. If you reduce register usage by spilling to global memory, you should get higher occupancy but slower memory access.

That said, here are some ideas to use fewer registers:

  1. Can shift be precomputed and stored in constant memory? Then each thread just needs to look up shift[i] (see the sketch after this list).
  2. Do a and c have to be floats?
  3. Or, can a and c be removed from the loop and computed once? And thus removed completely?
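
On point 1, a minimal sketch, assuming shift depends only on threadIdx.x as in the posted kernel (shift_lut and h_shift are names I made up):

__constant__ int shift_lut[WINDOW_PER_LINE];

// host side, once at startup: precompute each thread's shift
int h_shift[WINDOW_PER_LINE];
for (int k = 0; k < WINDOW_PER_LINE; ++k)
    h_shift[k] = (int)((float)(k*NEIGHBOURS)/(float)WINDOW_PER_LINE);
cudaMemcpyToSymbol(shift_lut, h_shift, sizeof(h_shift));

// in the kernel, replacing the float division:
int shift = shift_lut[i];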

On point 3: a is the sum of a simple arithmetic sequence, so reduce it to the closed form. Note that the loop's upper bound is exclusive, so j runs from -shift to NEIGHBOURS-shift-1 inclusive, giving something like this:

a = ((NEIGHBOURS-shift-1) - (-shift) + 1) * ((NEIGHBOURS-shift-1) + (-shift)) / 2

or

a = NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2

so instead, compute a once before the fit and drop its accumulator from the loop (you can probably reduce these expressions further):

a = NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2.0f;
str[i] = AMP*(a*e - NEIGHBOURS*f);
str[i] /= (a*a - NEIGHBOURS*c);
str[i] /= (float)BLOCK_SPACING;
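
Point 3 extends to c as well: the sum of j*j over the same range is a difference of square-pyramidal numbers, so both accumulators can leave the loop entirely. A minimal sketch (sumSq is a helper I made up; as a polynomial it telescopes correctly even for negative arguments):

// sum of k*k for k = 1..n, extended as a polynomial to any integer n
__device__ __forceinline__ float sumSq(int n) { return n*(n+1)*(2*n+1)/6.0f; }

...
int lo = -shift, hi = NEIGHBOURS - shift - 1;   // the loop's inclusive range
float a = NEIGHBOURS * (lo + hi) * 0.5f;        // arithmetic series
float c = sumSq(hi) - sumSq(lo - 1);            // telescoped square sums
float e = 0, f = 0;

for (int j = lo; j <= hi; j++) {
    e += dis[i+j];
    f += (float)j * dis[i+j];
}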


Occupancy is NOT a problem.

The SM in the GTS 250 (compute capability 1.1) may be able to hold 8 blocks (8x96 threads) in its register file simultaneously, but it only has 8 execution units, meaning that only 8 of those 8x96 (or, in your case, 5x96) resident threads would be advancing at any given moment. There's very little value in trying to squeeze more blocks onto an already saturated SM.

In fact, you could try playing with the -maxrregcount option to INCREASE the number of registers per thread; that could have a positive effect on performance.
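
For example, sweep the cap and measure (strain.cu and the specific values here are placeholders):

nvcc -arch=sm_11 -maxrregcount=16 -o strain strain.cu
nvcc -arch=sm_11 -maxrregcount=32 -o strain strain.cu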


You can use launch bounds to instruct the compiler to generate a register mapping for a maximum number of threads and a minimum number of blocks per multiprocessor. This can reduce register counts so that you can achieve the desired occupancy.

For your case, NVIDIA's occupancy calculator shows a theoretical peak occupancy of 63%, which matches what you're achieving. This is due to your register count, as you mention, but it is also due to the number of threads per block. Increasing the number of threads per block to 128 and decreasing the register count to 10 yields 100% theoretical peak occupancy: 6 blocks x 128 threads fills the 768-thread limit of a compute capability 1.1 SM, and 768 x 10 = 7680 registers fits in its 8192-register file.

To control the launch bounds for your kernel:

__global__ void
__launch_bounds__(128, 6)
MyKernel(...)
{
    ...
}

Then just launch with a block size of 128 threads and enjoy your occupancy. The compiler should generate your kernel such that it uses 10 or fewer registers.
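
A sketch of what the padded launch could look like for this kernel (the guard on the extra 32 threads is my assumption about how you'd adapt it; note that every thread must still reach the barrier):

__global__ void
__launch_bounds__(128, 6)
strainAxialKernel(float* d_dis, float* d_str)
{
    int i = threadIdx.x;
    __shared__ float dis[WINDOW_PER_LINE];

    if (i < WINDOW_PER_LINE)
        dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
    __syncthreads();            // all 128 threads reach the barrier
    if (i >= WINDOW_PER_LINE)
        return;                 // the 32 padding threads retire here

    // ... least square fit, attenuation compensation, and write-back as before ...
}

// launched with the padded block size:
strainAxialKernel<<<32*404, 128>>>(d_dis, d_str);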
