Matrix Multiplication using CUDA + MPI

I'm doing research on GPUs in cluster environments, using MPI for communication.

In order to compare speedups, I plan to create:

A GPU-only matrix multiplication (done).

A CPU-only matrix multiplication (also done).

But I can't find a nice implementation of CUDA + MPI matrix multiplication.

Does anyone have a hint about where I can find this, or can anyone suggest an implementation?


The MTL4 Matrix Template Library can be a great starting point. Right now MTL4 has multi-core support and DMM, and we are almost done with a full GPU implementation. Peter and I have been talking about distributed GPU algorithms, but since our focus is driven by PDE solvers for the moment, distributed GPU algorithms are difficult to make competitive against robust DMM.

However, I am working on a new geophysics/medical imaging solver set that is more conducive to distributed GPU computation, as the data sets are more modest and the video capabilities of the GPU are beneficial.

To get started, take a look at the MTL4 tutorial.


There is not much around. Your best bet is actually to write a block matrix multiplication over MPI and have each node do its block multiplications locally on the GPU.
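To make the block decomposition concrete, here is a minimal serial sketch of the scheme described above. In the real MPI version, each (bi, bj) block of C would be owned by one rank, the relevant blocks of A and B would be communicated to it, and `local_block_multiply` would be replaced by a GPU GEMM (e.g., a cuBLAS call) on that rank's device. All names here are illustrative, not from any existing library.

```cpp
#include <vector>
#include <cstddef>

using Matrix = std::vector<double>; // row-major n x n

// Multiply the (bi,bk) block of A by the (bk,bj) block of B and
// accumulate into the (bi,bj) block of C. bs is the block size.
// In the MPI+CUDA version this is the piece each node runs on its GPU.
static void local_block_multiply(const Matrix& A, const Matrix& B, Matrix& C,
                                 std::size_t n, std::size_t bs,
                                 std::size_t bi, std::size_t bj, std::size_t bk) {
    for (std::size_t i = bi * bs; i < (bi + 1) * bs; ++i)
        for (std::size_t k = bk * bs; k < (bk + 1) * bs; ++k)
            for (std::size_t j = bj * bs; j < (bj + 1) * bs; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// C = A * B with an nb x nb grid of blocks (n must be divisible by nb).
// The two outer loops stand in for the set of MPI ranks; the bk loop is
// the sum over block products that each rank accumulates locally.
Matrix block_multiply(const Matrix& A, const Matrix& B,
                      std::size_t n, std::size_t nb) {
    Matrix C(n * n, 0.0);
    std::size_t bs = n / nb;
    for (std::size_t bi = 0; bi < nb; ++bi)
        for (std::size_t bj = 0; bj < nb; ++bj)
            for (std::size_t bk = 0; bk < nb; ++bk)
                local_block_multiply(A, B, C, n, bs, bi, bj, bk);
    return C;
}
```

The point of the sketch is the loop structure: only the innermost block product touches the GPU, so the MPI layer is ordinary C code that moves blocks around, which keeps the speedup comparison against the pure-GPU and pure-CPU versions clean.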


The Combinatorial BLAS is a templated C++ MPI code that has a sparse matrix-matrix multiply operation. It uses a sqrt(p)-by-sqrt(p) processor grid and the SUMMA algorithm for matrix multiplication. One of the template arguments is a "sequential" component, which is the matrix local to one process. You may be able to use it directly with a finagled template argument that's your CUDA structure, but at least it can serve as a reference for your own code.
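For reference, the SUMMA schedule mentioned above can be simulated serially in a few lines. This is a sketch of the algorithm itself, not of the Combinatorial BLAS API: at stage k, process (i, j) on the g-by-g grid would receive A's block (i, k) broadcast along its process row and B's block (k, j) broadcast along its process column, then perform a local block GEMM. Here the broadcasts are implicit because everything lives in one address space.

```cpp
#include <vector>
#include <cstddef>

using Matrix = std::vector<double>; // row-major n x n

// C = A * B following the SUMMA schedule on a g x g process grid,
// simulated serially. n must be divisible by g.
Matrix summa(const Matrix& A, const Matrix& B, std::size_t n, std::size_t g) {
    Matrix C(n * n, 0.0);
    std::size_t bs = n / g; // block size per process
    for (std::size_t k = 0; k < g; ++k)         // SUMMA stage
        for (std::size_t i = 0; i < g; ++i)     // process row
            for (std::size_t j = 0; j < g; ++j) // process column
                // Local rank-bs update on process (i,j): in the MPI
                // version A's block (i,k) and B's block (k,j) arrive
                // via row/column broadcasts before this GEMM.
                for (std::size_t ii = i * bs; ii < (i + 1) * bs; ++ii)
                    for (std::size_t kk = k * bs; kk < (k + 1) * bs; ++kk)
                        for (std::size_t jj = j * bs; jj < (j + 1) * bs; ++jj)
                            C[ii * n + jj] += A[ii * n + kk] * B[kk * n + jj];
    return C;
}
```

The attraction of SUMMA for a CUDA + MPI build is that each stage's local update is exactly one dense GEMM per process, so the GPU kernel you already have for the single-GPU version slots in unchanged.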
