Non Square Matrix Multiplication in CUDA

2023-02-20 02:53

The code I use for matrix multiplication in CUDA lets me multiply both square and non-square matrices; however, both Width and Height MUST be multiples of blocksize.

So, for example, I can multiply [3][6] * [6][3] (using blocksize=3), but I can't multiply [3][2]*[2][3].

Does anyone know a way to do that? This is my kernel:

#include <stdio.h>
#include <limits.h>
#include <stdlib.h>

#define blocksize 3
#define HM (1*blocksize)
#define WM (2*blocksize)
#define WN (1*blocksize)
#define HN WM
#define WP WN
#define HP HM
#define PTH WM
#define PTW HM

__global__ void nonsquare(float *M, float *N, float *P, int uWM, int uWN)
{
    __shared__ float MS[blocksize][blocksize];
    __shared__ float NS[blocksize][blocksize];

    int tx = threadIdx.x, ty = threadIdx.y, bx = blockIdx.x, by = blockIdx.y;
    int rowM = ty + by*blocksize;
    int colN = tx + bx*blocksize;
    float Pvalue = 0;

    for (int m = 0; m < uWM/blocksize; ++m) {
        MS[ty][tx] = M[rowM*uWM + (m*blocksize + tx)];
        NS[ty][tx] = M[colN + uWN*(m*blocksize + ty)];
        __syncthreads();

        for (int k = 0; k < blocksize; k++)
            Pvalue += MS[ty][k]*NS[k][tx];
        __syncthreads();
        P[rowM*WP + colN] = Pvalue;
    }
}

Thanks in advance!

I think the easiest thing to do would be to just pad the blocks on the end with zeros:

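// uHM (the number of rows of M) is assumed to be passed as an extra kernel argument.
// Round the tile count up so that a partial last tile is still processed.
for(int m=0; m< (uWM+blocksize-1)/blocksize;++m){
    int colM = m*blocksize+tx;
    int rowN = m*blocksize+ty;
    // Pad each tile separately: a thread whose M element is out of range
    // may still have to load a valid N element, and vice versa.
    if (rowM >= uHM || colM >= uWM)
        MS[ty][tx]=0.;
    else
        MS[ty][tx]=M[rowM*uWM+colM];
    if (rowN >= uWM || colN >= uWN)
        NS[ty][tx]=0.;
    else
        NS[ty][tx]=N[colN + uWN*rowN];
    __syncthreads();
    // ... inner product over k as before ...
}
// Guard the final store as well: if (rowM < uHM && colN < uWN) P[rowM*uWN+colN]=Pvalue;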

plus or minus. (That NS line should reference N, not M, right?)
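For reference, here is one way the zero-padding idea could look as a complete kernel plus host wrapper. This is only a sketch under a few assumptions: the matrices are row-major, the names `TILE`, `matmulPadded`, `matmul`, `hM`, `wM` and `wN` are made up for illustration, and error checking on the CUDA calls is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // tile edge; an arbitrary choice for this sketch

// P (hM x wN) = M (hM x wM) * N (wM x wN), all row-major.
// None of hM, wM, wN needs to be a multiple of TILE.
__global__ void matmulPadded(const float* M, const float* N, float* P,
                             int hM, int wM, int wN)
{
    __shared__ float MS[TILE][TILE];
    __shared__ float NS[TILE][TILE];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TILE + ty;   // row of M and of P
    int col = blockIdx.x * TILE + tx;   // column of N and of P
    float acc = 0.0f;

    // Round up so that a partial last tile is still visited.
    int numTiles = (wM + TILE - 1) / TILE;
    for (int m = 0; m < numTiles; ++m) {
        int colM = m * TILE + tx;
        int rowN = m * TILE + ty;
        // Each tile is padded independently; out-of-range elements become 0.
        MS[ty][tx] = (row < hM && colM < wM) ? M[row * wM + colM] : 0.0f;
        NS[ty][tx] = (rowN < wM && col < wN) ? N[rowN * wN + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += MS[ty][k] * NS[k][tx];
        __syncthreads();
    }

    // Threads outside P helped with the loads but must not store.
    if (row < hM && col < wN)
        P[row * wN + col] = acc;
}

// Host wrapper: copies inputs to the device, launches, copies the result back.
std::vector<float> matmul(const std::vector<float>& M, const std::vector<float>& N,
                          int hM, int wM, int wN)
{
    std::vector<float> P(hM * wN);
    float *dM, *dN, *dP;
    cudaMalloc(&dM, M.size() * sizeof(float));
    cudaMalloc(&dN, N.size() * sizeof(float));
    cudaMalloc(&dP, P.size() * sizeof(float));
    cudaMemcpy(dM, M.data(), M.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N.data(), N.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((wN + TILE - 1) / TILE, (hM + TILE - 1) / TILE);  // ceiling division
    matmulPadded<<<grid, block>>>(dM, dN, dP, hM, wM, wN);

    cudaMemcpy(P.data(), dP, P.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    return P;
}
```

With this, the [3][2]*[2][3] case from the question would be `matmul(M, N, 3, 2, 3)`. Note that the final store is guarded too, which the fragment above does not show.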

But, since I seem to be the only one here advocating using existing tuned libraries when possible -- why not use CUBLAS or MAGMA instead of rolling your own? They're fast, and tested by hundreds of users.
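If you go the library route, a call to `cublasSgemm` might look roughly like the sketch below. It assumes row-major host matrices as in the question; the wrapper name `matmulCublas` and the parameter names are invented for illustration, and status codes are ignored to keep it short. The only non-obvious part is that cuBLAS expects column-major storage, so the operands are passed in swapped order.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Row-major P (hM x wN) = M (hM x wM) * N (wM x wN) using cuBLAS.
// A row-major matrix read as column-major is its transpose, so we ask
// cuBLAS for P^T = N^T * M^T: pass N first, M second, and swap the
// row and column counts. No explicit transpose is needed.
std::vector<float> matmulCublas(const std::vector<float>& M,
                                const std::vector<float>& N,
                                int hM, int wM, int wN)
{
    std::vector<float> P(hM * wN);
    float *dM, *dN, *dP;
    cudaMalloc(&dM, M.size() * sizeof(float));
    cudaMalloc(&dN, N.size() * sizeof(float));
    cudaMalloc(&dP, P.size() * sizeof(float));
    cudaMemcpy(dM, M.data(), M.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dN, N.data(), N.size() * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                wN, hM, wM,      // rows of P^T, columns of P^T, inner dimension
                &alpha,
                dN, wN,          // N^T in column-major terms, leading dimension wN
                dM, wM,          // M^T in column-major terms, leading dimension wM
                &beta,
                dP, wN);         // P^T in column-major terms, leading dimension wN

    cudaMemcpy(P.data(), dP, P.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dM); cudaFree(dN); cudaFree(dP);
    return P;
}
```

This needs to be linked against cuBLAS (for example with `-lcublas` when building with nvcc). There is no restriction on the dimensions being multiples of anything.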

The underlying performance requirement here is that either the first or the second dimension of the shared memory "tile" be a round multiple of 16. Historically, that is what is necessary to achieve optimal global memory bandwidth (i.e. half-warp coalesced transactions). Whether it should be the first or the second dimension of the tile is dictated by whether the matrices are stored in column-major or row-major order. Nothing says that the shared memory tile needs to be square, only that the leading dimension of the storage (LDA in BLAS notation) be a round multiple of 16.
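To make the leading-dimension point concrete, here is a small sketch of uploading a matrix into device storage whose row stride is padded up to a multiple of 16. It assumes row-major storage (so the leading dimension is the row stride in elements); `roundUp` and `uploadPadded` are hypothetical helper names, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// Round n up to the next multiple of align (16 in the discussion above).
inline int roundUp(int n, int align)
{
    return (n + align - 1) / align * align;
}

// Copy a row-major rows x cols host matrix into device memory whose leading
// dimension is padded to a multiple of 16 elements. Padding elements are zero.
// Returns the device pointer; *ld receives the padded row stride in elements,
// so element (r, c) lives at dev[r * (*ld) + c].
float* uploadPadded(const float* host, int rows, int cols, int* ld)
{
    *ld = roundUp(cols, 16);
    size_t bytes = size_t(rows) * (*ld) * sizeof(float);

    float* dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemset(dev, 0, bytes);                    // zero the padding
    cudaMemcpy2D(dev, (*ld) * sizeof(float),      // destination, pitch in bytes
                 host, cols * sizeof(float),      // source, pitch in bytes
                 cols * sizeof(float), rows,      // width in bytes, height in rows
                 cudaMemcpyHostToDevice);
    return dev;
}
```

A kernel would then index with the padded stride rather than the logical column count, which is the same distinction BLAS makes between N and LDA.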

You could easily template the kernel with the tile dimensions as template arguments and instantiate several versions, depending on matrix dimensions. For a given architecture, there is probably an optimal tile dimension which balances occupancy and instruction level parallelism. The "clever" way to solve this is probably to decompose the matrix multiplication into two operations - the first doing the bulk of the work at the optimal tile size, and the second at a different size for the remaining columns. If the result is going straight back to host memory after the product is completed, the second operation might best be done on the host using an optimised BLAS, overlapped with the GPU kernel. This is the approach that many of the routines in the UTK Magma library use.
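A templated kernel along those lines could be sketched as follows. The tile shapes and the selection threshold are arbitrary placeholders rather than tuned values, the names `matmulTiled`, `launchWith` and `launchMatmul` are invented for this example, and row-major storage is assumed. It does not attempt the two-pass decomposition; it only shows non-square tiles chosen at compile time.

```cuda
#include <cuda_runtime.h>

// Each block computes a TM x TN tile of P, walking the inner dimension in
// slices of width TK. Launch with blockDim = (TN, TM).
template <int TM, int TN, int TK>
__global__ void matmulTiled(const float* M, const float* N, float* P,
                            int hM, int wM, int wN)
{
    __shared__ float MS[TM][TK];
    __shared__ float NS[TK][TN];

    int tx = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * TM + ty;
    int col = blockIdx.x * TN + tx;
    float acc = 0.0f;

    for (int k0 = 0; k0 < wM; k0 += TK) {
        // The tiles are not square, so a thread may load several elements
        // of one tile and none of the other; the strided loops cover both cases.
        for (int k = tx; k < TK; k += TN) {
            int colM = k0 + k;
            MS[ty][k] = (row < hM && colM < wM) ? M[row * wM + colM] : 0.0f;
        }
        for (int k = ty; k < TK; k += TM) {
            int rowN = k0 + k;
            NS[k][tx] = (rowN < wM && col < wN) ? N[rowN * wN + col] : 0.0f;
        }
        __syncthreads();

        for (int k = 0; k < TK; ++k)
            acc += MS[ty][k] * NS[k][tx];
        __syncthreads();
    }

    if (row < hM && col < wN)
        P[row * wN + col] = acc;
}

// Launch one instantiation with a matching grid and block shape.
template <int TM, int TN, int TK>
void launchWith(const float* dM, const float* dN, float* dP,
                int hM, int wM, int wN)
{
    dim3 block(TN, TM);
    dim3 grid((wN + TN - 1) / TN, (hM + TM - 1) / TM);
    matmulTiled<TM, TN, TK><<<grid, block>>>(dM, dN, dP, hM, wM, wN);
}

// Pick a tile shape from the matrix shape (the threshold is a placeholder).
void launchMatmul(const float* dM, const float* dN, float* dP,
                  int hM, int wM, int wN)
{
    if (wN >= 32)
        launchWith<8, 32, 16>(dM, dN, dP, hM, wM, wN);
    else
        launchWith<16, 16, 16>(dM, dN, dP, hM, wM, wN);
}
```

All pointers passed to `launchMatmul` are device pointers. Which instantiation is actually fastest would have to be measured on the target architecture.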
