Are atomic operations on global memory in CUDA performed in parallel across a warp?

Posted 2023-01-05 18:49

I need to do an atomic FP add operation on global memory on a CC 2.0 device. If the global data referenced in a warp fit into an aligned 128-byte sector, will these operations be done in parallel or will they be executed one at a time?

My guess would be that they are parallel, but I am not sure of this.

Regards, Gautham Ganapathy

When programming you can think of atomic operations as conceptually parallel (while still satisfying the requirements of atomicity).

When optimizing, it helps to be aware of serialization that might be occurring. What actually happens depends on the hardware you are running on. Performance depends on the location and number of atomic memory units, as well as the pattern of memory accesses being performed in parallel.

For example, if the locations that are addressed in parallel map to completely different atomic units, they will occur in parallel. If many addresses in parallel map to the same atomic unit, they must be serialized.
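The two access patterns described above can be sketched as a pair of kernels. This is only an illustration: the kernel names and the host helper `run_contention_demo` are made up for this example, and how addresses map to atomic units is hardware-specific, so the sketch checks only that the results are correct and makes no claim about timing. `atomicAdd` on `float` requires compute capability 2.0 or higher.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Worst case: every thread updates the same float, so the updates
// to that one address have to be applied one after another.
__global__ void add_same_address(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&out[0], 1.0f);
}

// Friendlier case: each thread updates its own float, so no two
// threads ever contend for the same address.
__global__ void add_distinct_addresses(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&out[i], 1.0f);
}

// Hypothetical host helper: runs both kernels on one zeroed buffer
// and verifies the final values. Returns false on any CUDA error.
bool run_contention_demo(int n) {
    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return false;
    cudaMemset(d_out, 0, n * sizeof(float));  // all-zero bytes == 0.0f

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    add_same_address<<<blocks, threads>>>(d_out, n);
    add_distinct_addresses<<<blocks, threads>>>(d_out, n);

    std::vector<float> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    // out[0] received n adds from the first kernel plus 1 from the second;
    // every other element received exactly 1 add.
    if (h_out[0] != static_cast<float>(n) + 1.0f) return false;
    for (int i = 1; i < n; ++i) {
        if (h_out[i] != 1.0f) return false;
    }
    return true;
}
```

Both kernels give correct results regardless of how the hardware schedules the updates. To see the performance difference on a particular device, time each kernel separately (for example with CUDA events) rather than relying on a rule of thumb.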

Atomic operation performance has improved consistently from sm_11 (compute capability 1.1, where atomics first appeared), to sm_2x (Fermi devices), to sm_3x (Kepler devices). Kepler improved worst-case atomic memory operation performance (where many atomic operations access the same memory address) by up to 10X, and best-case performance (where many atomic operations access very different memory addresses) by up to 2X. Atomic performance on Kepler is high enough that you may consider using atomics where previously you might have employed explicit parallel reduction code. See this presentation for more details.
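As a rough sketch of what "using atomics instead of explicit parallel reduction code" can look like, the kernel below sums an array by having every thread add its element directly into one accumulator. The names `sum_atomic` and `sum_with_atomics` are invented for this example, and whether this beats a shared-memory tree reduction on a given device is something to measure, not assume.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread adds its own element straight into a single global
// accumulator. All threads hit the same address, which is the
// worst-case contention pattern described above.
__global__ void sum_atomic(const float *in, float *total, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, in[i]);
}

// Hypothetical host wrapper. Returns -1.0f if a CUDA call fails.
float sum_with_atomics(const std::vector<float> &values) {
    int n = static_cast<int>(values.size());
    if (n == 0) return 0.0f;

    float *d_in = nullptr;
    float *d_total = nullptr;
    float result = -1.0f;

    if (cudaMalloc(&d_in, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_total, sizeof(float)) == cudaSuccess) {
        cudaMemcpy(d_in, values.data(), n * sizeof(float),
                   cudaMemcpyHostToDevice);
        cudaMemset(d_total, 0, sizeof(float));  // all-zero bytes == 0.0f

        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        sum_atomic<<<blocks, threads>>>(d_in, d_total, n);

        float h_total = 0.0f;
        if (cudaMemcpy(&h_total, d_total, sizeof(float),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            result = h_total;
        }
    }
    cudaFree(d_in);     // cudaFree(nullptr) is a harmless no-op
    cudaFree(d_total);
    return result;
}
```

A common middle ground is to reduce within each block first and issue one `atomicAdd` per block, which cuts the number of contending atomics from one per element to one per block.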

Note: this discussion applies to global memory atomics. Shared memory atomics are a different beast; in general they result in serialization and therefore do not have very high performance.

Atomic operations are slower than normal operations, because they really can't happen in parallel.

What will probably happen is that each add will be done one at a time, but execution won't progress past the add until all the threads have completed it, so it will look parallel from the code's perspective.

I'm not sure if the access will be coalesced or not, but the speed penalty from the atomic operations will probably outweigh the memory access speed benefit.

To rephrase what has already been said: ATOMIC operations will be performed in sequence, but since all other operations will be halted in the meantime, they will APPEAR to have been performed at the same time (in parallel). One important thing to note is that although atomic operations are sequential, their ORDER cannot be controlled.
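The point about ordering can be made concrete with a small sketch. Each thread below takes a "ticket" from a shared counter; `atomicAdd` returns the value the counter held before that thread's increment. The names `take_tickets` and `tickets_form_permutation` are made up for this example.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Every thread increments one counter and records the value it saw
// just before its own increment.
__global__ void take_tickets(unsigned int *counter, unsigned int *tickets,
                             int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tickets[i] = atomicAdd(counter, 1u);
}

// Hypothetical host helper: checks that the tickets are exactly
// 0..n-1, each handed out once. Returns false on any CUDA error.
bool tickets_form_permutation(int n) {
    unsigned int *d_counter = nullptr;
    unsigned int *d_tickets = nullptr;
    bool ok = false;

    if (cudaMalloc(&d_counter, sizeof(unsigned int)) == cudaSuccess &&
        cudaMalloc(&d_tickets, n * sizeof(unsigned int)) == cudaSuccess) {
        cudaMemset(d_counter, 0, sizeof(unsigned int));

        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        take_tickets<<<blocks, threads>>>(d_counter, d_tickets, n);

        std::vector<unsigned int> h_tickets(n);
        if (cudaMemcpy(h_tickets.data(), d_tickets,
                       n * sizeof(unsigned int),
                       cudaMemcpyDeviceToHost) == cudaSuccess) {
            // Atomicity guarantees no ticket is lost or duplicated...
            std::sort(h_tickets.begin(), h_tickets.end());
            ok = true;
            for (int i = 0; i < n; ++i) {
                if (h_tickets[i] != static_cast<unsigned int>(i)) ok = false;
            }
            // ...but which thread got which ticket is unspecified,
            // so only the sorted set is checked here.
        }
    }
    cudaFree(d_counter);  // cudaFree(nullptr) is a harmless no-op
    cudaFree(d_tickets);
    return ok;
}
```

For floating-point adds this matters in practice: because FP addition is not associative, the same `atomicAdd`-based sum can differ in its last bits from run to run when the order of updates changes.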
