
Are atomic operations on global memory in CUDA performed in parallel across a warp?

I need to do an atomic FP add operation on global memory on a CC 2.0 device. If the global data referenced in a warp fits into an aligned 128-byte sector, will these operations be done in parallel or will they be executed one at a time?

My guess would be that they are parallel, but I am not sure of this.

Regards Gautham Ganapathy


When programming you can think of atomic operations as conceptually parallel (while still satisfying the requirements of atomicity).

When optimizing, it helps to be aware of serialization that might be occurring. What actually happens depends on the hardware you are running on. Performance depends on the location and number of atomic memory units, as well as the pattern of memory accesses being performed in parallel.

For example, if the locations that are addressed in parallel map to completely different atomic units, they will occur in parallel. If many addresses in parallel map to the same atomic unit, they must be serialized.
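As a rough illustration of the two access patterns described above, here is a minimal sketch. The kernel names and the `run_pattern` helper are hypothetical, error checking is omitted for brevity, and whether distinct addresses actually land on different atomic units depends on the hardware. The float overload of `atomicAdd` on global memory requires compute capability 2.0 or higher.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Contended pattern: every thread updates the same address, so the
// updates to that single location have to be serialized.
__global__ void add_same_address(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&out[0], 1.0f);
}

// Uncontended pattern: each thread updates its own address, so no two
// atomic operations target the same location.
__global__ void add_distinct_addresses(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&out[i], 1.0f);
}

// Hypothetical helper: runs one of the two patterns over n threads and
// returns the sum of all output elements (n in both cases).
float run_pattern(bool same_address, int n) {
    float *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_out, 0, n * sizeof(float));  // all-zero bytes == 0.0f

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (same_address)
        add_same_address<<<blocks, threads>>>(d_out, n);
    else
        add_distinct_addresses<<<blocks, threads>>>(d_out, n);

    // cudaMemcpy on the default stream waits for the kernel to finish.
    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);

    double total = 0.0;
    for (float v : h_out) total += v;
    return static_cast<float>(total);
}
```

Both patterns produce the same total; only the amount of serialization differs. Wrapping each launch in `cudaEvent` timing is one way to see how large that difference is on a given device.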

Atomic operation performance has improved consistently from sm_11 (Compute capability 1.1, where it first appeared), to sm_2x (Fermi devices), to sm_3x (Kepler devices). Kepler improved worst-case atomic memory operation performance (where many atomic operations access the same memory address) by up to 10X, and best case performance (where many atomic operations access very different memory addresses) by up to 2X. Atomic performance on Kepler is high enough that you may consider using atomics where previously you might have employed explicit parallel reduction code. See this presentation for more details.
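To make the "atomics instead of explicit reduction" idea concrete, here is a minimal sketch of a sum that uses `atomicAdd` in place of a reduction tree. The names `sum_atomic` and `sum_with_atomics` and the launch configuration are assumptions chosen for illustration, not taken from the presentation; error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread accumulates a private partial sum over a grid-stride loop,
// then issues a single atomicAdd into the shared result.
__global__ void sum_atomic(const float *in, float *result, int n) {
    float local = 0.0f;
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        local += in[i];
    atomicAdd(result, local);
}

// Hypothetical host wrapper: copies the input to the device, launches the
// kernel, and returns the accumulated sum.
float sum_with_atomics(const std::vector<float> &h_in) {
    int n = static_cast<int>(h_in.size());
    float *d_in = nullptr;
    float *d_result = nullptr;
    float h_result = 0.0f;

    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_result, 0, sizeof(float));  // all-zero bytes == 0.0f

    sum_atomic<<<64, 256>>>(d_in, d_result, n);  // assumed launch shape

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_result);
    return h_result;
}
```

Accumulating locally first keeps the number of atomic operations on the single result address at one per thread rather than one per element. Whether this beats a hand-written reduction depends on the device generation, as discussed above.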

Note: this discussion applies to global memory atomics. Shared memory atomics are a different beast: in general they result in serialization and therefore do not have very high performance.


Atomic operations are slower than normal operations, because they really can't happen in parallel.

What will probably happen is that each add will be done one at a time, but execution won't progress past the add until all the threads have completed it, so it will look parallel from the code's perspective.

I'm not sure if the access will be coalesced or not, but the speed penalty from the atomic operations will probably outweigh the memory access speed benefit.


To rephrase what has already been said: ATOMIC operations will be performed in sequence, but since all other operations will be halted at the moment, they will APPEAR to have been performed at the same time (in parallel). One important thing to note is that although atomic operations are sequential, their ORDER cannot be controlled.
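The uncontrolled order matters for the floating-point add in the question, because float addition is not associative. The host-side sketch below (function names are hypothetical) shows the same three values summed in two different orders; an atomic FP add whose order varies between runs can differ in the same way.

```cuda
// Adds three floats as (a + b) + c. The volatile temporary forces the
// intermediate result to be rounded to float.
float add_left_first(float a, float b, float c) {
    volatile float t = a + b;
    return t + c;
}

// Adds the same three floats as a + (b + c).
float add_right_first(float a, float b, float c) {
    volatile float t = b + c;
    return a + t;
}
```

With the inputs `1e8f`, `-1e8f`, `1.0f`, `add_left_first` returns `1.0f` while `add_right_first` returns `0.0f`, since `-1e8f + 1.0f` rounds back to `-1e8f` in single precision.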
