Putting a for loop in a CUDA Kernel
Is it a bad idea to put a for loop in a kernel? Or is it a common thing to do?
It's common to put loops into kernels. It doesn't mean it's always a good idea, but it doesn't mean it's not, either.
The general problem of deciding exactly how to distribute your tasks and data effectively and exploit the available parallelism is a very hard, unsolved one, especially when it comes to CUDA. Active research is being carried out to determine efficiently (i.e., without blindly exploring the parameter space) how to achieve the best results for a given kernel.
Sometimes it can make a lot of sense to put loops into kernels. For instance, iterative computations on many elements of a large, regular data structure with strong data independence are ideally suited to kernels containing loops. Other times, you may decide to have each thread process many data points, e.g. if you wouldn't otherwise have enough shared memory to allocate one thread per task (this isn't uncommon when a large number of threads share a large amount of data; by increasing the amount of work done per thread, you can fit all the threads' shared data into shared memory).
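For the "each thread processes many data points" case, a common shape is a grid-stride loop. Below is a minimal sketch under that assumption; the kernel name, the scaling operation, and the launch configuration are illustrative, not from the question.

    // Grid-stride loop sketch: each thread handles several elements by
    // striding through the array by the total number of launched threads.
    __global__ void scale(float *data, float factor, int n)
    {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
        {
            data[i] *= factor;
        }
    }

    // Launched with fewer threads than elements, e.g.:
    //   scale<<<64, 256>>>(d_data, 2.0f, n);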
Your best bet is to make an educated guess, test, profile, and revise as you need. There's a lot of room to play around with optimizations... launch parameters, global vs. constant vs. shared memory, keeping the number of registers cool, ensuring coalescing and avoiding memory bank conflicts, etc. If you're interested in performance, you should check out the "CUDA C Best Practices" and "CUDA Occupancy Calculator" available from NVIDIA on the CUDA 4.0 documentation page (if you haven't already).
It's generally okay if you're careful about your memory access patterns. If the for loop accesses memory at random, leading to many uncoalesced memory reads, it could be very slow.
In fact, I once had a piece of code run slower with CUDA because I naively stuck a for loop in the kernel. However, once I thought about the memory access pattern, for example loading a chunk at a time into shared memory so each thread block could do its part of the for loop from shared memory at the same time, it was much quicker.
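A rough sketch of that idea, as an illustration rather than the original code: each block walks over the array in fixed-size chunks, the threads cooperatively load each chunk into shared memory with coalesced reads, and then every thread works on the chunk from shared memory. The TILE size, the kernel name, and the per-chunk average computed here are assumptions for the example (it expects blockDim.x == TILE).

    #define TILE 256

    __global__ void chunk_average(const float *in, float *out, int n)
    {
        __shared__ float tile[TILE];

        // Each block processes chunks blockIdx.x, blockIdx.x + gridDim.x, ...
        for (int chunk = blockIdx.x * TILE; chunk < n; chunk += gridDim.x * TILE)
        {
            int i = chunk + threadIdx.x;

            // Coalesced load of one chunk into shared memory.
            tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
            __syncthreads();

            // Every thread now reads the whole chunk from fast shared memory
            // instead of re-reading global memory TILE times.
            float sum = 0.0f;
            for (int j = 0; j < TILE; ++j)
                sum += tile[j];

            if (i < n)
                out[i] = sum / TILE;

            __syncthreads();  // finish reads before the next chunk overwrites tile[]
        }
    }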
- A basic pattern for processing large data is the tile approach, where the input data is split up and each thread works on its tile of the data; a loop is definitely needed there.
Example 1: if the input data is a 2D matrix whose number of rows is known to exceed its number of columns, I would access the row using the unique grid/block index and access the columns using the tiled thread index approach, with a loop over the tile size (a sketch of this pattern follows after Example 2).
Example 2: if your threads need to compute a single value that is needed for further calculations (vector normalization, for example), you also need a tile approach, since threads can only be synchronized efficiently within a block.
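A minimal sketch of Example 1 under stated assumptions (row-major storage, one block per row, and a placeholder "scale by 2" as the per-element work); the names and launch parameters are illustrative:

    __global__ void process_rows(float *m, int rows, int cols)
    {
        int row = blockIdx.x;              // unique block index selects the row
        if (row >= rows) return;

        // Threads sweep across the columns one tile (blockDim.x) at a time.
        for (int tile = 0; tile < cols; tile += blockDim.x)
        {
            int col = tile + threadIdx.x;  // tiled thread index selects the column
            if (col < cols)
                m[row * cols + col] *= 2.0f;   // placeholder per-element work
        }
    }

    // Launched with one block per row, e.g.:
    //   process_rows<<<rows, 256>>>(d_matrix, rows, cols);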
As long as it's not at the top level, you should probably be OK. Doing so at the top level would negate all the advantages of CUDA.
As Dan points out, memory accesses become an issue. One way around this is to load the referenced memory into shared memory, or into texture memory if it doesn't fit in shared memory. The reason is that uncoalesced global memory accesses are very slow (~400 clock cycles rather than ~40 for shared memory).