Large constant array in global memory

2023-03-27 18:40

Is it possible to increase performance by running on a GPU for an algorithm with the following properties?

There are hundreds and even thousands of independent threads, which do not require any synchronization during calculations
Each thread has a relatively small (less than 200 KB) local memory region containing thread-specific data. Read/Write
Each thread accesses a large memory block (hundreds of megabytes and even gigabytes). This memory is read-only
For each access to the global memory there will be at least two accesses to the local memory
There will be a lot of branches in the algorithm

Unfortunately, the algorithm is too complicated to show here.

My instinct is to use texture memory aggressively. The caching benefits will beat uncoalesced global memory reads by a mile.

For the writes, you may need to add some padding etc. to avoid bank conflicts.

The reliance on hundreds of meg or gigs of data is somewhat concerning. Can you carve it up somehow? Hope you have a big beefy Tesla/Quadro w/ oodles of RAM.

That said, the name of the game for CUDA optimization is always to experiment, profile/measure, rinse and repeat.
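One way to try a cached read-only path without binding a texture explicitly is the `__ldg()` intrinsic, which routes a load through the read-only data cache on devices of compute capability 3.5 and later. This is a swapped-in technique, not the explicit texture binding mentioned above, and the sketch is only an illustration: the kernel name `gather_sum`, the index pattern, and all sizes are made up, since the real algorithm is not shown in the question.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread sums `per_thread` scattered reads from a
// large read-only table. __ldg() requests the read-only data cache path
// (compute capability 3.5 or later).
__global__ void gather_sum(const float* __restrict__ table,
                           const int* __restrict__ indices,
                           float* out, int n_threads, int per_thread)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n_threads) return;

    float acc = 0.0f;
    for (int k = 0; k < per_thread; ++k) {
        int idx = __ldg(&indices[tid * per_thread + k]);
        acc += __ldg(&table[idx]);
    }
    out[tid] = acc;
}

// Host driver with assumed sizes. Returns true if the GPU result matches
// the same sums computed on the CPU.
bool run_gather_example()
{
    const int table_size = 1 << 20, n_threads = 1024, per_thread = 8;
    const int n_idx = n_threads * per_thread;

    std::vector<float> h_table(table_size), h_out(n_threads);
    std::vector<int> h_idx(n_idx);
    for (int i = 0; i < table_size; ++i) h_table[i] = float(i % 7);
    for (int i = 0; i < n_idx; ++i) h_idx[i] = (i * 7919) % table_size;  // scattered pattern

    float *d_table = nullptr, *d_out = nullptr;
    int* d_idx = nullptr;
    cudaError_t err = cudaMalloc(&d_table, table_size * sizeof(float));
    if (err == cudaSuccess) err = cudaMalloc(&d_idx, n_idx * sizeof(int));
    if (err == cudaSuccess) err = cudaMalloc(&d_out, n_threads * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(d_table, h_table.data(), table_size * sizeof(float),
                         cudaMemcpyHostToDevice);
    if (err == cudaSuccess)
        err = cudaMemcpy(d_idx, h_idx.data(), n_idx * sizeof(int),
                         cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        gather_sum<<<(n_threads + 255) / 256, 256>>>(d_table, d_idx, d_out,
                                                     n_threads, per_thread);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)  // this copy also waits for the kernel to finish
        err = cudaMemcpy(h_out.data(), d_out, n_threads * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_table); cudaFree(d_idx); cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // CPU reference; the values are small integers, so float sums are exact.
    for (int t = 0; t < n_threads; ++t) {
        float ref = 0.0f;
        for (int k = 0; k < per_thread; ++k) ref += h_table[h_idx[t * per_thread + k]];
        if (h_out[t] != ref) return false;
    }
    return true;
}
```

Calling `run_gather_example()` from `main` and timing the kernel with and without `__ldg()` under a profiler is one way to follow the "measure, then decide" advice. Whether the cache helps depends on how much locality the real access pattern has.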

Before I start, please remember that there are two layers of parallelism in CUDA: blocks and threads.

There are hundreds and even thousands of independent threads, which do not require any synchronization during calculations

Since you can launch at least 65535 blocks per grid dimension (and far more along x on devices of compute capability 3.0 and later), you can treat each CUDA block as the equivalent of one of your "threads".

Each thread has a relatively small (less than 200 KB) local memory region containing thread-specific data. Read/Write

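Unfortunately, shared memory is limited to a few tens of kilobytes per block (16 KB on older cards, 48 KB by default on more recent ones), far less than 200 KB. If you can figure out how to work within this lower limit, great. If not, you will need to keep the per-thread data in global memory.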
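Rather than relying on a remembered number, the limit can be queried at run time. The following is a minimal sketch using `cudaGetDeviceProperties`; the helper names `shared_mem_per_block` and `fits_in_shared` are hypothetical, and device 0 is assumed to be the card of interest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the default per-block shared memory limit of `device` in bytes,
// or 0 if the query fails.
size_t shared_mem_per_block(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return 0;
    return prop.sharedMemPerBlock;
}

// Reports whether a per-task working set of `bytes` would fit in one
// block's shared memory on `device`.
bool fits_in_shared(int device, size_t bytes)
{
    size_t limit = shared_mem_per_block(device);
    printf("shared memory per block: %zu bytes, needed: %zu bytes\n", limit, bytes);
    return limit != 0 && bytes <= limit;
}
```

For the question's case, `fits_in_shared(0, 200 * 1024)` would tell you whether the whole 200 KB region could live in shared memory. If it cannot, one option is to keep only the hottest part of the per-task state there and leave the rest in global memory.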

Each thread accesses a large memory block (hundreds of megabytes and even gigabytes). This memory is read-only

You cannot bind arrays this large to textures or constant memory (constant memory, for one, is limited to 64 KB). So within a given block, try to make the threads read contiguous chunks of data for the best performance.

For each access to the global memory there will be at least two accesses to the local memory. There will be a lot of branches in the algorithm.

Since you are essentially replacing a single thread in your original implementation with a block in CUDA, you may want to revise the code a little to try to implement a parallel version of the "per thread code" too.

This may not be clear at first glance, but think it through a little. Any algorithm that has hundreds or thousands of independent parts with no synchronization needed is great for a parallel implementation, even with CUDA.
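The block-per-task mapping and the contiguous reads can be sketched together. The kernel below is a stand-in, not the asker's algorithm: it assumes each task owns a contiguous slice of the read-only data and only needs a sum over it. The names `per_task_sum` and `run_per_task_example` and all sizes are made up for illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256

// Hypothetical kernel: block b handles task b. Its threads walk the task's
// slice with a block-stride loop, so neighbouring threads read neighbouring
// addresses (coalesced), then combine partial sums in shared memory.
__global__ void per_task_sum(const float* __restrict__ data,
                             float* results, int chunk)
{
    __shared__ float partial[THREADS_PER_BLOCK];

    const float* my_chunk = data + (size_t)blockIdx.x * chunk;
    float acc = 0.0f;
    for (int i = threadIdx.x; i < chunk; i += blockDim.x)
        acc += my_chunk[i];
    partial[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction; assumes blockDim.x == THREADS_PER_BLOCK (a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) results[blockIdx.x] = partial[0];
}

// Host driver with assumed sizes. Returns true if every task's sum is correct.
bool run_per_task_example()
{
    const int n_tasks = 512, chunk = 4096;
    const size_t n = (size_t)n_tasks * chunk;

    std::vector<float> h_data(n), h_res(n_tasks);
    for (int t = 0; t < n_tasks; ++t)
        for (int i = 0; i < chunk; ++i)
            h_data[(size_t)t * chunk + i] = float(t % 5 + 1);

    float *d_data = nullptr, *d_res = nullptr;
    cudaError_t err = cudaMalloc(&d_data, n * sizeof(float));
    if (err == cudaSuccess) err = cudaMalloc(&d_res, n_tasks * sizeof(float));
    if (err == cudaSuccess)
        err = cudaMemcpy(d_data, h_data.data(), n * sizeof(float),
                         cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        per_task_sum<<<n_tasks, THREADS_PER_BLOCK>>>(d_data, d_res, chunk);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess)  // this copy also waits for the kernel to finish
        err = cudaMemcpy(h_res.data(), d_res, n_tasks * sizeof(float),
                         cudaMemcpyDeviceToHost);
    cudaFree(d_data); cudaFree(d_res);
    if (err != cudaSuccess) return false;

    // Each slice holds `chunk` copies of one small integer, so sums are exact.
    for (int t = 0; t < n_tasks; ++t)
        if (h_res[t] != float(chunk) * float(t % 5 + 1)) return false;
    return true;
}
```

The point of the design is that the independent tasks map to blocks, which never need to synchronize with each other, while `__syncthreads()` is only used inside a block. A heavily branching per-task algorithm may not split across threads this cleanly, so treat this as a starting shape to profile rather than a recipe.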
