
PyCUDA: C/C++ includes?

Something that isn't really mentioned anywhere (at least that I can see) is what library functions are exposed to inline CUDA kernels.

Specifically, I'm doing small / stupid matrix multiplications that don't deserve to be individually offloaded to the GPU, but I am offloading a larger section of the algorithm which includes this multiplication. No one ever liked using their own linalg functions, since someone has always done it better.

TL;DR: What libraries can I play with inside inline kernels under PyCUDA?


I don't know of any, and I always thought it would be useful to have.

For the size of problems that I usually work with (small matrices and tensors that arise in the finite element method), I just wrote C++ templates to do the operations. Templating the functions allows the compiler to know the trip counts at compile time, and it can unroll loops and keep results or intermediate results in register, which tends to be very efficient for kernel throughput. So the matrix-matrix product gets declared as

// idx2c is not defined in the snippet; one common definition is the
// column-major indexing macro from the CUBLAS examples:
// element (i,j) of a matrix with leading dimension ld.
#define idx2c(i,j,ld) (((j)*(ld))+(i))

// Multiply an l x m matrix a by an m x n matrix b, writing the l x n result
// to c. The dimensions are template parameters, so the compiler knows every
// trip count and can fully unroll the loops.
template < typename Real, unsigned int l, unsigned int m, unsigned int n >
__device__ __host__
void matmul(const Real *a,
            const Real *b,
                  Real *c)
{
    for (int i = 0; i < l; i++) {
        for (int j = 0; j < n; j++) {
            Real dotprod = Real(0);
            for (int k = 0; k < m; k++) {
                dotprod += a[idx2c(i,k,l)] * b[idx2c(k,j,m)];
            }
            c[idx2c(i,j,l)] = dotprod;
        }
    }
}

For the sort of sizes that crop up in my kernels (2x2, 3x3, 4x4, 8x8, 9x9), doing the above and letting the compiler work things out seems to be as good as any other approach I have tried. Because CUDA is effectively scalar at the thread level, there aren't any vector primitives or similar tricks that can be used to accelerate these sorts of small operations.
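
Driving this from an inline PyCUDA kernel might look roughly like the sketch below. Everything beyond the matmul template itself is a placeholder for illustration: the kernel name matmul3x3_batch, the one-3x3-product-per-thread batching, and the launch configuration. The PyCUDA-specific detail is that SourceModule needs no_extern_c=True (the source contains C++ templates), with the kernel entry point wrapped in extern "C" so get_function can find it by its unmangled name.

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

kernel_source = r"""
#define idx2c(i,j,ld) (((j)*(ld))+(i))

template <typename Real, unsigned int l, unsigned int m, unsigned int n>
__device__ void matmul(const Real *a, const Real *b, Real *c)
{
    for (int i = 0; i < l; i++)
        for (int j = 0; j < n; j++) {
            Real dotprod = Real(0);
            for (int k = 0; k < m; k++)
                dotprod += a[idx2c(i,k,l)] * b[idx2c(k,j,m)];
            c[idx2c(i,j,l)] = dotprod;
        }
}

// One 3x3 product per thread: thread t multiplies the t-th pair of matrices.
extern "C" __global__
void matmul3x3_batch(const float *a, const float *b, float *c, int nbatch)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < nbatch)
        matmul<float, 3, 3, 3>(a + 9 * t, b + 9 * t, c + 9 * t);
}
"""

# no_extern_c=True compiles the source as C++ so the template is legal;
# only the extern "C" kernel keeps an unmangled name for get_function().
mod = SourceModule(kernel_source, no_extern_c=True)
matmul3x3_batch = mod.get_function("matmul3x3_batch")

nbatch = 1024
# Each row of 9 floats is one 3x3 matrix, stored column-major to match idx2c.
a = np.random.rand(nbatch, 9).astype(np.float32)
b = np.random.rand(nbatch, 9).astype(np.float32)
c = np.empty_like(a)

matmul3x3_batch(drv.In(a), drv.In(b), drv.Out(c), np.int32(nbatch),
                block=(128, 1, 1), grid=((nbatch + 127) // 128, 1))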
