Putting a for loop in a CUDA Kernel
Is it a bad idea to put a for loop in a kernel? Or is it a common thing to do?
It's common to put loops into kernels. It doesn't mean it's always a good idea, but it doesn't mean it's not, either.
The general problem of deciding exactly how to distribute your tasks and data effectively and exploit the available parallelism is a very hard, unsolved one, especially when it comes to CUDA. Active research is being carried out to determine efficiently (i.e., without blindly exploring the parameter space) how to achieve the best results for a given kernel.
Sometimes it can make a lot of sense to put loops into kernels. For instance, iterative computations on many elements of a large, regular data structure with strong data independence are ideally suited to kernels containing loops. Other times, you may decide to have each thread process many data points, e.g. if you wouldn't otherwise have enough shared memory to allocate one thread per task (this isn't uncommon when a large number of threads share a large amount of data; by increasing the amount of work done per thread, you can fit all the threads' shared data into shared memory).
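For the "each thread processes many data points" case, a common shape is a grid-stride loop. Below is a minimal sketch under that assumption; the kernel name, the scaling operation, and the launch configuration are illustrative, not from the question.

    // Grid-stride loop sketch: each thread handles several elements by
    // striding through the array by the total number of launched threads.
    __global__ void scale(float *data, float factor, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
        {
            data[i] *= factor;
        }
    }

    // Launched with fewer threads than elements, e.g.:
    //   scale<<<64, 256>>>(d_data, 2.0f, n);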
Your best bet is to make an educated guess, test, profile, and revise as you need. There's a lot of room to play around with optimizations... launch parameters, global vs. constant vs. shared memory, keeping the number of registers cool, ensuring coalescing and avoiding memory bank conflicts, etc. If you're interested in performance, you should check out the "CUDA C Best Practices" and "CUDA Occupancy Calculator" available from NVIDIA on the CUDA 4.0 documentation page (if you haven't already).
It's generally okay if you're careful about your memory access patterns. If the for loop accesses memory at random, leading to many uncoalesced memory reads, it could be very slow.
In fact, I once had a piece of code run slower with CUDA because I naively stuck a for loop in the kernel. However, once I thought about the memory access pattern, for example loading a chunk at a time into shared memory so each thread block could do its part of the for loop from shared memory at the same time, it was much quicker.
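A rough sketch of that idea, as an illustration rather than the original code: each block walks over the array in fixed-size chunks, the threads cooperatively load each chunk into shared memory with coalesced reads, and then every thread works on the chunk from shared memory. The TILE size, the kernel name, and the per-chunk average computed here are assumptions for the example (it expects blockDim.x == TILE).

    #define TILE 256

    __global__ void chunk_average(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE];

        // Each block processes chunks blockIdx.x, blockIdx.x + gridDim.x, ...
        for (int chunk = blockIdx.x * TILE; chunk < n; chunk += gridDim.x * TILE)
        {
            int i = chunk + threadIdx.x;

            // Coalesced load of one chunk into shared memory.
            tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
            __syncthreads();

            // Every thread now reads the whole chunk from fast shared memory
            // instead of re-reading global memory TILE times.
            float sum = 0.0f;
            for (int j = 0; j < TILE; ++j)
                sum += tile[j];

            if (i < n)
                out[i] = sum / TILE;

            __syncthreads();  // finish reads before the next chunk overwrites tile[]
        }
    }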
- A basic pattern for processing large data is the tile approach, where the input data is split up and each thread works on its tile of the data; a loop is definitely needed there.
Example 1: if the input data is a 2D matrix whose number of rows is known to exceed its number of columns, I would access the row using the unique grid/block index and access the columns using the tiled thread index approach, with a loop over the tile size (a sketch of this pattern follows after Example 2).
Example 2: if your threads need to compute a single value that is needed for further calculations (vector normalization, for example), you also need a tile approach, since threads can only be synchronized efficiently within a block.
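A minimal sketch of Example 1 under stated assumptions (row-major storage, one block per row, and a placeholder "scale by 2" as the per-element work); the names and launch parameters are illustrative:

    __global__ void process_rows(float *m, int rows, int cols)
    {
        int row = blockIdx.x;              // unique block index selects the row
        if (row >= rows) return;

        // Threads sweep across the columns one tile (blockDim.x) at a time.
        for (int tile = 0; tile < cols; tile += blockDim.x)
        {
            int col = tile + threadIdx.x;  // tiled thread index selects the column
            if (col < cols)
                m[row * cols + col] *= 2.0f;   // placeholder per-element work
        }
    }

    // Launched with one block per row, e.g.:
    //   process_rows<<<rows, 256>>>(d_matrix, rows, cols);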
As long as it's not at the top level, you should probably be OK. Doing so at the top level would negate all the advantages of CUDA.
As Dan points out, memory accesses become an issue. One way around this is to load the referenced memory into shared memory, or into texture memory if it doesn't fit in shared memory. The reason is that uncoalesced global memory accesses are very slow (~400 clock cycles rather than ~40 for shared memory).