Can you predict the runtime of a CUDA kernel?

2023-04-06 09:41

To what degree can one predict or calculate the performance of a CUDA kernel?

Having worked a bit with CUDA, I find this nontrivial.

But a colleague of mine, who does not work with CUDA, told me that it can't be hard if you know the memory bandwidth, the number of processors, and their speed.

What he said does not seem consistent with what I have read. This is what I imagine could work. What do you think?

 Memory processed
------------------ = runtime for memory-bound kernels?
 Memory bandwidth

    FLOPs
------------- = runtime for compute-bound kernels?
 Peak FLOP/s

Such a calculation will rarely give a good prediction. Many factors hurt performance, and they interact with each other in extremely complicated ways. Your calculation therefore gives an upper bound on performance (that is, a lower bound on runtime), which in most cases is far from what the kernel actually achieves.

For example, among memory-bound kernels, those with many cache misses will behave differently from those with mostly hits. The same goes for kernels with divergent branches, kernels with barriers, and so on.

I suggest reading this paper, which might give you more ideas about the problem: "An Analytical Model for a GPU Architecture with Memory-level and Thread-level Parallelism Awareness".

Hope it helps.

I think you can predict a best case with a bit of work, as you said, from instruction counts, memory bandwidth, input size, etc.
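As a rough illustration of such a best-case estimate, here is a minimal Python sketch. It assumes the kernel cannot finish faster than either the memory limit or the compute limit allows, so it takes the larger of the two times. The function name `best_case_runtime_s` and the hardware figures (500 GB/s, 10 TFLOP/s) are made up for the example and do not describe any particular card.

```python
def best_case_runtime_s(bytes_moved, flops, bandwidth_bytes_per_s, peak_flops_per_s):
    """Optimistic lower bound on kernel runtime, in seconds.

    Takes the slower of the two limits: moving the data at full memory
    bandwidth, or executing the arithmetic at peak throughput.
    """
    memory_time = bytes_moved / bandwidth_bytes_per_s
    compute_time = flops / peak_flops_per_s
    return max(memory_time, compute_time)


# Hypothetical workload: element-wise c[i] = a[i] + b[i] over n floats.
n = 1 << 24
bytes_moved = 3 * 4 * n   # read a and b, write c, 4 bytes per float
flops = n                 # one addition per element

# Assumed hardware numbers, for illustration only.
bound = best_case_runtime_s(bytes_moved, flops,
                            bandwidth_bytes_per_s=500e9,
                            peak_flops_per_s=10e12)
print(f"best case: {bound * 1e6:.1f} microseconds")
```

With these assumed numbers the memory term dominates by a wide margin, so this kernel would be classed as memory-bound. The result is only a floor: a real run should not beat it, but may be much slower.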

However, predicting the actual or worst-case runtime is much trickier.

First off, there are factors like memory access patterns. E.g., with older CUDA-capable cards, you had to take care to distribute your global memory accesses so that they wouldn't all contend for a single memory bank. (The newer CUDA cards use a hash between logical and physical addresses to resolve this.)

Secondly, there are non-deterministic factors, such as how busy the PCI bus is, how busy the host kernel is, etc.

I suspect the easiest way to get close to actual run-times is basically to run the kernel on subsets of the input and see how long it actually takes.
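One way to use such subset timings is to fit a simple model to them and extrapolate to the full input. The Python sketch below assumes the runtime grows linearly with input size plus a fixed launch overhead, which is an assumption and will not hold for every kernel. The helper names `fit_linear` and `predict` are hypothetical, and the timings are placeholder values rather than real measurements; in practice you would fill them in from your own runs (for example, timed with CUDA events).

```python
def fit_linear(sizes, times):
    """Least-squares fit of time = overhead + per_element * size."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    per_element = sxy / sxx
    overhead = mean_y - per_element * mean_x
    return overhead, per_element


def predict(overhead, per_element, size):
    """Extrapolate the fitted line to a new input size."""
    return overhead + per_element * size


# Placeholder numbers, NOT real measurements: replace with your own timings.
sizes = [1_000_000, 2_000_000, 4_000_000]   # elements in each subset
times_ms = [0.30, 0.55, 1.05]               # measured kernel time per subset

overhead, per_element = fit_linear(sizes, times_ms)
full_size = 64_000_000
print(f"predicted: {predict(overhead, per_element, full_size):.2f} ms")
```

The extrapolation is only as good as the linearity assumption. Subsets too small to keep the GPU busy, or a full input whose working set no longer fits in cache, can both push the real runtime away from the fitted line, so it is worth timing at least one subset close to the full size.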
