CUDA: Accessing arbitrarily long matrices in both dimensions

2023-03-07 05:01

Hey there, currently I'm using threads indexed in only one dimension to access all elements of a matrix, like this:

// Thread-ID
int idx = blockIdx.x * blockDim.x + threadIdx.x;
// Offset:
int offset = gridDim.x * blockDim.x;

while(idx < MATRIX_ROWS * MATRIX_COLS)
{   
    row = idx % MATRIX_ROWS;
    col = idx / MATRIX_ROWS;    

    matrix[ row ][ col ] = ...;
    idx += offset;
}

Now I wonder how to access arbitrarily long matrices with two-dimensional indexing. I would like each block to always access the elements of a single row. Something like this (the x-index refers to the columns and the y-index to the rows of the matrix):

// Thread-IDs
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

// Offset:
int offset = gridDim.x * blockDim.x;

while(idx < MATRIX_COLS)
{   
    matrix[ idy ][ idx ] = ...;
    idx += offset;
}

Now let's assume the matrix has more rows than the number of blocks I launched with the kernel: when launching N blocks, the first N rows of the matrix are handled correctly, but what about the other rows? How would you do that? Thanks!

EDIT: I came up with an idea but I don't know if that's somehow 'ugly' coding!?

// Thread-IDs
int idx0 = blockIdx.x * blockDim.x + threadIdx.x;
int idx = idx0;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

// Offset:
int offsetx = gridDim.x * blockDim.x;
int offsety = gridDim.y * blockDim.y;

while(idx < MATRIX_COLS && idy < MATRIX_ROWS)
{   
    matrix[ idy ][ idx ] = ...;

    idx += offsetx;
    if(idx >= MATRIX_COLS)
    {
        // Jump to the next row and start from the 'beginning' concerning columns
        idy += offsety;
        idx = idx0;
    }
}

Perhaps something like this is what you are looking for?

// Thread-IDs
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idy = blockIdx.y * blockDim.y + threadIdx.y;

// Offset:
int offsetx = gridDim.x * blockDim.x;
int offsety = gridDim.y * blockDim.y;

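// Assumes matrix is a flat, row-major float array of MATRIX_ROWS * MATRIX_COLS elements
for(int row = idy; row < MATRIX_ROWS; row += offsety) {
    // Pointer to the first element of this row
    float * pos = matrix + row * MATRIX_COLS;

#pragma unroll
    for(int col = idx; col < MATRIX_COLS; col += offsetx) {
        pos[col] = .....;
    }
}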

If MATRIX_COLS is a preprocessor define or constant, the compiler might be able to unroll the inner loop and give a bit more performance.

EDIT: The first version of the code was written with column-major ordering stuck in the back of my head, so ignore the comment that was here. Access should be coalesced doing it this way.
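
For reference, here is a minimal sketch of how the two nested stride loops could look as a complete kernel with a host-side launch. It assumes the matrix is stored as a flat, row-major float array; the names fillMatrix and fillOnGpu, the runtime rows/cols parameters, and the value written into each element are made up for illustration and are not from the original posts.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: writes row * cols + col into every element of a
// flat, row-major rows x cols matrix using stride loops in both dimensions.
__global__ void fillMatrix(float *matrix, int rows, int cols)
{
    // Starting position of this thread in each dimension
    const int idx = blockIdx.x * blockDim.x + threadIdx.x;
    const int idy = blockIdx.y * blockDim.y + threadIdx.y;

    // Total number of threads in each dimension, used as the stride
    const int offsetx = gridDim.x * blockDim.x;
    const int offsety = gridDim.y * blockDim.y;

    for (int row = idy; row < rows; row += offsety) {
        float *pos = matrix + (size_t)row * cols;  // first element of this row
        for (int col = idx; col < cols; col += offsetx) {
            // Threads with neighbouring threadIdx.x write neighbouring addresses
            pos[col] = (float)(row * cols + col);
        }
    }
}

// Hypothetical host helper: launches the kernel with the given grid and block
// shape and copies the result back. Returns an empty vector on any CUDA error.
std::vector<float> fillOnGpu(int rows, int cols, dim3 grid, dim3 block)
{
    std::vector<float> host((size_t)rows * cols, -1.0f);
    const size_t bytes = host.size() * sizeof(float);
    float *dev = nullptr;

    if (cudaMalloc(&dev, bytes) != cudaSuccess) return {};
    fillMatrix<<<grid, block>>>(dev, rows, cols);
    cudaError_t err = cudaGetLastError();  // catches launch configuration errors
    if (err == cudaSuccess)
        err = cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return {};
    return host;
}
```

Calling something like fillOnGpu(1000, 300, dim3(2, 4), dim3(64, 1)) launches only 4 blocks in y for 1000 rows. With blockDim.y equal to 1, each block works on a single row at a time and then jumps ahead by gridDim.y rows, so the grid size no longer has to match the matrix size.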
