How to implement Radix sort on multi-GPU?

2023-01-25 15:50 问答作者：

How to implement Radix sort on multi-GPU – same way as on single GPU i.e. by splitting the data then building histograms on开发者_如何学运维 separate GPUs and then use merge data back (like bunch of cards)?

That method would work, but I don't think it would be the fastest approach. Specifically, merging histograms for every K bits (K=4 is currently best) would require the keys to be exchanged between GPUs 32/K = 8 times to sort 32-bit integers. Since the memory bandwidth between GPUs (~5GB/s) is much lower than the memory bandwidth on a GPU (~150GB/s) this will kill performance.

A better strategy would be to split the data into multiple parts, sort each part in parallel on a different GPU, and then merge the parts once at the end. This approach requires only one inter-GPU transfer (vs. 8 above) so it will be considerably faster.

Unfortunately this question is not adequately posed. It depends on element size, where the elements begin life in memory, and where you want the sorted elements to end up residing.

Sometimes it's possible to compress the sorted list by storing elements in groups sharing the same common prefix, or you can unique elements on the fly, storing each element once in the sorted list with an associated count. For example, you might sort a huge list of 32-bit integers into 64K distinct lists of 16-bit values, cutting your memory requirement in half.

The general principle is that you want to make the fewest number of passes over the data as possible and that your throughput will almost always correspond to bandwidth constraints associated with your storage policy.

If your data set exceeds the size of fast memory, you probably want to finish with a merge pass rather than continue to radix sort, as another person has already answered.

I'm just getting into GPU architecture and I don't understand the K=4 comment above. I've never seen an architecture yet where such a small K would prove optimal.

I suspect merging histograms is also the wrong approach. I'd probably let the elements fragment in memory rather than merge histograms. Is it that hard to manage meso-scale scatter/gather lists in the GPU fabric? I sure hope not.

Finally, it's hard to conceive of a reason why you would want to involve multiple GPUs for this task. Say your card has 2GB of memory and 60GB/s write bandwidth (that's what my mid-range card is showing). A three pass radix sort (11-bit histograms) requires 6GB of write bandwidth (likely your rate limiting factor), or about 100ms to sort a 2GB list of 32-bit integers. Great, they're sorted, now what? If you need to ship them anywhere else without some kind of preprocessing or compression, the sorting time will be small fish.

In any case, just compiled my first example programs today. There's still a lot to learn. My target application is permutation intensive, which is closely related to sorting. I'm sure I'll weigh in on this subject again in future.

继续阅读：concurrency gpu opencl parallel-processing

How to implement Radix sort on multi-GPU?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？