Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation) [closed]

2022-12-22 15:23 问答作者：

Closed. This question needs to be more focused. It is not currently accepting answers.

开发者_如何学JAVA

Want to improve this question? Update the question so it focuses on one problem only by editing this post.

Closed 4 years ago.

Improve this question

How are threads organized to be executed by a GPU?

Hardware

If a GPU device has, for example, 4 multiprocessing units, and they can run 768 threads each: then at a given moment no more than 4*768 threads will be really running in parallel (if you planned more threads, they will be waiting their turn).

Software

threads are organized in blocks. A block is executed by a multiprocessing unit. The threads of a block can be indentified (indexed) using 1Dimension(x), 2Dimensions (x,y) or 3Dim indexes (x,y,z) but in any case xyz <= 768 for our example (other restrictions apply to x,y,z, see the guide and your device capability).

Obviously, if you need more than those 4*768 threads you need more than 4 blocks. Blocks may be also indexed 1D, 2D or 3D. There is a queue of blocks waiting to enter the GPU (because, in our example, the GPU has 4 multiprocessors and only 4 blocks are being executed simultaneously).

Now a simple case: processing a 512x512 image

Suppose we want one thread to process one pixel (i,j).

We can use blocks of 64 threads each. Then we need 512*512/64 = 4096 blocks (so to have 512x512 threads = 4096*64)

It's common to organize (to make indexing the image easier) the threads in 2D blocks having blockDim = 8 x 8 (the 64 threads per block). I prefer to call it threadsPerBlock.

dim3 threadsPerBlock(8, 8);  // 64 threads

and 2D gridDim = 64 x 64 blocks (the 4096 blocks needed). I prefer to call it numBlocks.

dim3 numBlocks(imageWidth/threadsPerBlock.x,  /* for instance 512/8 = 64*/
              imageHeight/threadsPerBlock.y);

The kernel is launched like this:

myKernel <<<numBlocks,threadsPerBlock>>>( /* params for the kernel function */ );

Finally: there will be something like "a queue of 4096 blocks", where a block is waiting to be assigned one of the multiprocessors of the GPU to get its 64 threads executed.

In the kernel the pixel (i,j) to be processed by a thread is calculated this way:

uint i = (blockIdx.x * blockDim.x) + threadIdx.x;
uint j = (blockIdx.y * blockDim.y) + threadIdx.y;

Suppose a 9800GT GPU:

it has 14 multiprocessors (SM)
each SM has 8 thread-processors (AKA stream-processors, SP or cores)
allows up to 512 threads per block
warpsize is 32 (which means each of the 14x8=112 thread-processors can schedule up to 32 threads)

https://www.tutorialspoint.com/cuda/cuda_threads.htm

A block cannot have more active threads than 512 therefore __syncthreads can only synchronize limited number of threads. i.e. If you execute the following with 600 threads:

func1();
__syncthreads();
func2();
__syncthreads();

then the kernel must run twice and the order of execution will be:

func1 is executed for the first 512 threads
func2 is executed for the first 512 threads
func1 is executed for the remaining threads
func2 is executed for the remaining threads

Note:

The main point is __syncthreads is a block-wide operation and it does not synchronize all threads.

I'm not sure about the exact number of threads that __syncthreads can synchronize, since you can create a block with more than 512 threads and let the warp handle the scheduling. To my understanding it's more accurate to say: func1 is executed at least for the first 512 threads.

Before I edited this answer (back in 2010) I measured 14x8x32 threads were synchronized using __syncthreads.

I would greatly appreciate if someone test this again for a more accurate piece of information.

继续阅读：nvidia

Understanding CUDA grid dimensions, block dimensions and threads organization (simple explanation) [closed]

Hardware

Software

Now a simple case: processing a 512x512 image

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Hardware

Software

Now a simple case: processing a 512x512 image

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？