Synchronization in GPUs
I have some questions about how GPUs perform synchronization. As I understand it, when a warp encounters a barrier (say, in OpenCL) and the other warps of the same work-group haven't reached it yet, it has to wait. But what exactly does that warp do during the waiting time? Is it still an active warp? Or does it perform some kind of null operations?
I have also noticed that when there is a synchronization in the kernel, the number of instructions increases. I wonder where this increase comes from. Is the synchronization broken down into that many smaller GPU instructions? Or is it because the idle warps perform some extra instructions?
And finally, I wonder whether the cost added by a synchronization (let's say barrier(CLK_LOCAL_MEM_FENCE)), compared to a kernel without one, is affected by the number of warps in a work-group (or thread block)? Thanks
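For concreteness, here is a minimal example of the kind of kernel I mean (hypothetical, just to fix ideas): each work-item writes to local memory and then reads a neighbour's slot, so the barrier is needed before any read happens.

    // Hypothetical OpenCL kernel sketch: write my local slot, then
    // read my neighbour's. The barrier guarantees every write has
    // completed before any work-item starts reading.
    __kernel void neighbour_shift(__global const float *in,
                                  __global float *out,
                                  __local float *tmp)
    {
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);

        tmp[lid] = in[get_global_id(0)];              // write my slot
        barrier(CLK_LOCAL_MEM_FENCE);                 // wait for the whole work-group
        out[get_global_id(0)] = tmp[(lid + 1) % lsz]; // read neighbour's slot
    }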
An active warp is one that is resident on the SM, i.e. all the resources (registers etc.) have been allocated and the warp is available for execution provided it is schedulable. If a warp reaches a barrier before other warps in the same threadblock/work-group, it will still be active (it is still resident on the SM and all its registers are still valid), but it won't execute any instructions since it is not ready to be scheduled.
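To make the "early arriver" case concrete, here is a sketch (hypothetical, not from any real codebase) where the amount of work before the barrier varies per work-item, so some warps reach the barrier long before others and then simply sit resident without being scheduled:

    // Sketch: work-items do unequal amounts of work before the barrier.
    // Warps that finish early stay resident on the SM but are not
    // scheduled again until the whole work-group has arrived.
    __kernel void uneven_arrival(__global float *data)
    {
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        float x = data[gid];
        for (size_t i = 0; i < lid; ++i)  // more iterations for higher local ids
            x = x * 0.5f + 1.0f;

        barrier(CLK_LOCAL_MEM_FENCE);     // early warps wait here: active, not scheduled
        data[gid] = x;
    }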
Inserting a barrier not only stalls execution but also acts as a barrier for the compiler: the compiler is not allowed to perform most optimisations across the barrier since this may invalidate the purpose of the barrier. This is the most likely reason you are seeing more instructions - without the barrier the compiler is able to perform more optimisations.
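One way to see this effect (a sketch, assuming a typical optimising compiler): without the barriers below, the compiler could keep tmp values in registers across loop iterations; with the barriers, it has to emit a fresh local store and load on every iteration, since another work-item may have rewritten the slot, so the instruction count goes up.

    // Hypothetical sketch: the barriers forbid caching local memory
    // values in registers across iterations.
    __kernel void no_hoisting(__global float *out, __local float *tmp, int n)
    {
        size_t lid = get_local_id(0);
        float sum = 0.0f;

        for (int i = 0; i < n; ++i) {     // n must be uniform across the group
            tmp[lid] = sum + (float)i;
            barrier(CLK_LOCAL_MEM_FENCE); // all writes visible before the read
            sum += tmp[(lid + 1) % get_local_size(0)];
            barrier(CLK_LOCAL_MEM_FENCE); // all reads done before the next write
        }
        out[get_global_id(0)] = sum;
    }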
The cost of a barrier is very dependent on what your code is doing, but each barrier introduces a bubble where all warps have to (effectively) become idle before they all start work again, so if you have a very large threadblock/work-group then of course there is potentially a bigger bubble than with a small block. The impact of the bubble depends on your code - if your code is very memory bound then the barrier will expose memory latencies that were previously hidden, but if your code is more balanced then the effect may be less noticeable.
This means that in a very memory-bound kernel you may be better off launching a larger number of smaller blocks so that other blocks can be executing when one block is bubbling on a barrier. You would need to ensure that your occupancy increases as a result, and if you are sharing data between threads using the block-shared-memory then there is a trade-off to be had.
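On the host side, this trade-off just comes down to the local_work_size you pass when enqueuing the kernel. A sketch (assuming queue and kernel were created elsewhere; error checking omitted):

    #include <CL/cl.h>

    // Sketch: the same total work launched as fewer large work-groups
    // vs. more small ones.
    void launch(cl_command_queue queue, cl_kernel kernel)
    {
        cl_int err;
        size_t global = 65536;

        size_t large_group = 256;  // fewer, larger groups: bigger barrier bubble
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &global, &large_group, 0, NULL, NULL);

        size_t small_group = 64;   // more, smaller groups: other groups can run
                                   // while one is stalled at a barrier
        err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                     &global, &small_group, 0, NULL, NULL);
        (void)err;
    }

Whether the smaller groups actually win depends on occupancy and on how much local memory each group needs, as noted above.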