Large constant array in global memory
Is it possible to increase performance by running an algorithm with the following properties on a GPU:
- There are hundreds and even thousands of independent threads, which do not require any synchronization during calculations
- Each thread has a relatively small (less than 200 KB) local memory region containing thread-specific data (read/write)
- Each thread accesses a large memory block (hundreds of megabytes and even gigabytes). This memory is read-only
- For each access to the global memory there will be at least two accesses to the local memory
- There will be a lot of branches in the algorithm
Unfortunately, the algorithm is too complicated to show here.
My instinct is to use texture memory aggressively. The caching benefits will beat uncoalesced global memory reads by a mile.
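For illustration, here is a minimal sketch of routing reads through the texture cache using the texture object API (CUDA 5.0+; the names and sizes are placeholders, not the asker's code). Note that a 1D linear texture is capped at a device-specific element count, so a multi-gigabyte table would have to be split across several texture objects or tiled differently:

```cuda
#include <cuda_runtime.h>

// Reads the large read-only table through a 1D texture object,
// so repeated nearby fetches are served from the texture cache.
__global__ void lookupKernel(cudaTextureObject_t tableTex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tableTex, i);   // cached read-only fetch
}

int main()
{
    const int n = 1 << 20;                  // placeholder; the real table is far larger
    float *dTable, *dOut;
    cudaMalloc(&dTable, n * sizeof(float));
    cudaMalloc(&dOut,   n * sizeof(float));

    // Describe the linear device buffer as a texture resource.
    cudaResourceDesc resDesc = {};
    resDesc.resType                = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr      = dTable;
    resDesc.res.linear.desc        = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tableTex = 0;
    cudaCreateTextureObject(&tableTex, &resDesc, &texDesc, NULL);

    lookupKernel<<<(n + 255) / 256, 256>>>(tableTex, dOut, n);
    cudaDeviceSynchronize();

    cudaDestroyTextureObject(tableTex);
    cudaFree(dTable);
    cudaFree(dOut);
    return 0;
}
```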
For the writes, you may need to add some padding, etc., to avoid shared memory bank conflicts.
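To illustrate the padding trick (this is the standard shared memory transpose example, not the asker's algorithm): giving a 32x32 tile one extra column means threads reading down a column hit 32 different banks instead of the same one:

```cuda
#define TILE 32

// Tiled matrix transpose; 'in' is height x width, 'out' is width x height.
// Launch with blockDim = (TILE, TILE).
__global__ void transpose(float *out, const float *in, int width, int height)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 column: no 32-way bank conflicts

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;     // swapped block indices
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];
}
```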
The reliance on hundreds of megabytes or gigabytes of data is somewhat concerning. Can you carve it up somehow? Hope you have a big, beefy Tesla/Quadro with oodles of RAM.
That said, the name of the game for CUDA optimization is always to experiment, profile/measure, rinse, and repeat.
Before I start, please remember that there are two layers of parallelism in CUDA: blocks and threads.
> There are hundreds and even thousands of independent threads, which do not require any synchronization during calculations
Since you can launch as many as 65,535 blocks per grid dimension, you can treat each CUDA block as equivalent to one of your "threads".
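A minimal sketch of that mapping, with hypothetical names (`STATE_ELEMS`, `taskKernel` are mine, not from the question): the block index selects the independent task, and the block's threads cooperate on just that task:

```cuda
#define STATE_ELEMS 1024   // hypothetical per-task state size

// One CUDA block per independent task; blocks never synchronize with
// each other, matching the "no synchronization" property above.
__global__ void taskKernel(const float *bigTable, float *perTaskState, int nTasks)
{
    int task = blockIdx.x;                        // block index = task index
    if (task >= nTasks) return;
    float *state = perTaskState + task * STATE_ELEMS;

    // The block's threads stride over this task's state as a stand-in
    // for the real per-task computation.
    for (int i = threadIdx.x; i < STATE_ELEMS; i += blockDim.x)
        state[i] += bigTable[task];               // placeholder work
}

// host side: one block per task
// taskKernel<<<nTasks, 128>>>(dTable, dState, nTasks);
```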
> Each thread has a relatively small (less than 200 KB) local memory region containing thread-specific data (read/write)
Unfortunately, most cards have a shared memory limit of 16 KB per block. So if you can figure out how to work within this lower limit, great. If not, you will need to fall back on global memory accesses.
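One way to live with that limit (purely a sketch; the hot/cold split and `HOT_ELEMS` are my assumptions, not something from the question): stage only the most frequently accessed slice of the ~200 KB per-task state in shared memory and leave the cold remainder in global memory:

```cuda
#define STATE_ELEMS 51200   // ~200 KB of floats per task (hypothetical)
#define HOT_ELEMS   3584    // ~14 KB kept in shared memory, under the 16 KB cap

__global__ void stagedKernel(float *perTaskState)
{
    __shared__ float hot[HOT_ELEMS];
    float *mine = perTaskState + (size_t)blockIdx.x * STATE_ELEMS;

    // Cooperatively copy the hot slice into fast shared memory.
    for (int i = threadIdx.x; i < HOT_ELEMS; i += blockDim.x)
        hot[i] = mine[i];
    __syncthreads();

    // ... compute here, using hot[] for frequent data and
    //     mine[HOT_ELEMS..STATE_ELEMS) for the cold remainder ...

    __syncthreads();
    // Write the (possibly modified) hot slice back to global memory.
    for (int i = threadIdx.x; i < HOT_ELEMS; i += blockDim.x)
        mine[i] = hot[i];
}
```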
> Each thread accesses a large memory block (hundreds of megabytes and even gigabytes). This memory is read-only
You cannot bind such large arrays to textures or constant memory. So, within a given block, try to make the threads read contiguous chunks of data for the best performance.
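For example, a grid-stride loop keeps each warp's addresses consecutive, so the hardware can coalesce them into a few wide transactions (a sketch with hypothetical names; `atomicAdd` on `float` needs compute capability 2.0+):

```cuda
// Thread k reads elements k, k + stride, k + 2*stride, ... so within a
// warp the 32 addresses are always contiguous and therefore coalesced.
__global__ void scanTable(const float *bigTable, float *perBlockSum, size_t n)
{
    float acc = 0.0f;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        acc += bigTable[i];                       // coalesced read

    // Crude per-block accumulation just to keep the sketch self-contained.
    atomicAdd(&perBlockSum[blockIdx.x], acc);
}
```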
> For each access to the global memory there will be at least two accesses to the local memory
> There will be a lot of branches in the algorithm
Since you are essentially replacing a single thread in your original implementation with a block in CUDA, you may want to revise the code a little to implement a parallel version of the "per thread code" as well (a sketch follows below).
This may not be clear at first glance, but think it through a little. Any algorithm that has hundreds or thousands of independent parts with no synchronization needed is great for a parallel implementation, even with CUDA.
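As a concrete sketch of that rewrite (hypothetical names; assumes a power-of-two block size such as 128): a serial accumulation loop from the original per-thread code becomes a cooperative strided load followed by a shared memory tree reduction:

```cuda
#define BLOCK_THREADS 128   // assumed power-of-two block size

// Each block handles one task: its threads stride over the task's data,
// then combine their partial sums with a tree reduction in shared memory.
__global__ void taskReduce(const float *data, float *result, int elemsPerTask)
{
    __shared__ float partial[BLOCK_THREADS];
    const float *mine = data + (size_t)blockIdx.x * elemsPerTask;

    float acc = 0.0f;
    for (int i = threadIdx.x; i < elemsPerTask; i += blockDim.x)
        acc += mine[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();                          // every thread reaches this
    }
    if (threadIdx.x == 0)
        result[blockIdx.x] = partial[0];
}
```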