
CUDA register pressure

I have a kernel that does a linear least-squares fit. It turns out the threads are using too many registers, so occupancy is low. Here is the kernel:

__global__
void strainAxialKernel(
    float* d_dis,
    float* d_str
){
    int i = threadIdx.x;
    float a = 0;
    float c = 0;
    float e = 0;
    float f = 0;
    int shift = (int)((float)(i*NEIGHBOURS)/(float)WINDOW_PER_LINE);
    int j;
    __shared__ float dis[WINDOW_PER_LINE];
    __shared__ float str[WINDOW_PER_LINE];

    // fetch data from global memory
    dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
    __syncthreads();

    // least square fit
    for (j=-shift; j<NEIGHBOURS-shift; j++)                                     
    {                                                                           
        a += j;                                                                 
        c += j*j;                                                               
        e += dis[i+j];                                                          
        f += (float(j))*dis[i+j];                                               
    }                                                                       
    str[i] = AMP*(a*e-NEIGHBOURS*f)/(a*a-NEIGHBOURS*c)/(float)BLOCK_SPACING;    

    // compensate attenuation
    if (COMPEN_EXP>0 && COMPEN_BASE>0)                                          
    {                                                                           
        str[i]                                                                  
        = (float)(str[i]*pow((float)i/(float)COMPEN_BASE+1.0f,COMPEN_EXP));
    }   

    // write back to global memory
    if (!SIGN_PRESERVE && str[i]<0)                                             
    {                                                                           
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = -str[i];                          
    }                                                                           
    else                                                                        
    {                                                                           
        d_str[blockIdx.x*WINDOW_PER_LINE+i] = str[i];                           
    }
}

I have 32x404 blocks with 96 threads in each block. On a GTS 250, each SM should be able to handle 8 blocks. Yet the Visual Profiler shows 11 registers per thread, and as a result occupancy is 0.625 (5 blocks per SM). BTW, the shared memory used by each block is 792 B, so the registers are the problem. The performance is not the end of the world; I am just curious whether there is any way I can get around this. Thanks.


There is always a trade-off between the fast but limited registers/shared memory and the slow but large global memory. There's no way to "get around" that trade-off. If you reduce register usage by spilling to global memory, you should get higher occupancy but slower memory access.

That said, here are some ideas to use fewer registers:

  1. Can shift be precomputed and stored in constant memory? Then each thread just needs to look up shift[i] (see the sketch after this list).
  2. Do a and c have to be floats?
  3. Or, can a and c be removed from the loop and computed once? And thus removed completely?
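
On point 1, a minimal sketch, assuming shift depends only on threadIdx.x as in the posted kernel (shift_lut and h_shift are names I made up):

__constant__ int shift_lut[WINDOW_PER_LINE];

// host side, once at startup: precompute each thread's shift
int h_shift[WINDOW_PER_LINE];
for (int k = 0; k < WINDOW_PER_LINE; ++k)
    h_shift[k] = (int)((float)(k*NEIGHBOURS)/(float)WINDOW_PER_LINE);
cudaMemcpyToSymbol(shift_lut, h_shift, sizeof(h_shift));

// in the kernel, replacing the float division:
int shift = shift_lut[i];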

On point 3: a is the sum of a simple arithmetic sequence, so reduce it to the closed form. Note that the loop's upper bound is exclusive, so j runs from -shift to NEIGHBOURS-shift-1 inclusive, giving something like this:

a = ((NEIGHBOURS-shift-1) - (-shift) + 1) * ((NEIGHBOURS-shift-1) + (-shift)) / 2

or

a = NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2

so instead, compute a once before the fit and drop its accumulator from the loop (you can probably reduce these expressions further):

a = NEIGHBOURS * (NEIGHBOURS - 1 - 2*shift) / 2.0f;
str[i] = AMP*(a*e - NEIGHBOURS*f);
str[i] /= (a*a - NEIGHBOURS*c);
str[i] /= (float)BLOCK_SPACING;
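
Point 3 extends to c as well: the sum of j*j over the same range is a difference of square-pyramidal numbers, so both accumulators can leave the loop entirely. A minimal sketch (sumSq is a helper I made up; as a polynomial it telescopes correctly even for negative arguments):

// sum of k*k for k = 1..n, extended as a polynomial to any integer n
__device__ __forceinline__ float sumSq(int n) { return n*(n+1)*(2*n+1)/6.0f; }

...
int lo = -shift, hi = NEIGHBOURS - shift - 1;   // the loop's inclusive range
float a = NEIGHBOURS * (lo + hi) * 0.5f;        // arithmetic series
float c = sumSq(hi) - sumSq(lo - 1);            // telescoped square sums
float e = 0, f = 0;

for (int j = lo; j <= hi; j++) {
    e += dis[i+j];
    f += (float)j * dis[i+j];
}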


Occupancy is NOT a problem.

The SM in the GTS 250 (compute capability 1.1) may be able to hold 8 blocks (8x96 threads) in its register file simultaneously, but it only has 8 execution units, meaning that only 8 of those 8x96 (or, in your case, 5x96) resident threads would be advancing at any given moment. There's very little value in trying to squeeze more blocks onto an already saturated SM.

In fact, you could try playing with the -maxrregcount option to INCREASE the number of registers per thread; that could have a positive effect on performance.
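
For example, sweep the cap and measure (strain.cu and the specific values here are placeholders):

nvcc -arch=sm_11 -maxrregcount=16 -o strain strain.cu
nvcc -arch=sm_11 -maxrregcount=32 -o strain strain.cu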


You can use launch bounds to instruct the compiler to generate a register mapping for a maximum number of threads and a minimum number of blocks per multiprocessor. This can reduce register counts so that you can achieve the desired occupancy.

For your case, NVIDIA's occupancy calculator shows a theoretical peak occupancy of 63%, which matches what you're achieving. This is due to your register count, as you mention, but it is also due to the number of threads per block. Increasing the number of threads per block to 128 and decreasing the register count to 10 yields 100% theoretical peak occupancy: 6 blocks x 128 threads fills the 768-thread limit of a compute capability 1.1 SM, and 768 x 10 = 7680 registers fits in its 8192-register file.

To control the launch bounds for your kernel:

__global__ void
__launch_bounds__(128, 6)
MyKernel(...)
{
    ...
}

Then just launch with a block size of 128 threads and enjoy your occupancy. The compiler should generate your kernel such that it uses 10 or fewer registers.
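
A sketch of what the padded launch could look like for this kernel (the guard on the extra 32 threads is my assumption about how you'd adapt it; note that every thread must still reach the barrier):

__global__ void
__launch_bounds__(128, 6)
strainAxialKernel(float* d_dis, float* d_str)
{
    int i = threadIdx.x;
    __shared__ float dis[WINDOW_PER_LINE];

    if (i < WINDOW_PER_LINE)
        dis[i] = d_dis[blockIdx.x*WINDOW_PER_LINE+i];
    __syncthreads();            // all 128 threads reach the barrier
    if (i >= WINDOW_PER_LINE)
        return;                 // the 32 padding threads retire here

    // ... least square fit, attenuation compensation, and write-back as before ...
}

// launched with the padded block size:
strainAxialKernel<<<32*404, 128>>>(d_dis, d_str);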
