OpenCL - How to I query for a device's SIMD width?

2023-03-29 05:38 问答作者：

In CUDA, there is a concept of a warp, which is defined as 开发者_JAVA技巧the maximum number of threads that can execute the same instruction simultaneously within a single processing element. For NVIDIA, this warp size is 32 for all of their cards currently on the market.

In ATI cards, there is a similar concept, but the terminology in this context is wavefront. After some hunting around, I found out that the ATI card I have has a wavefront size of 64.

My question is, what can I do to query for this SIMD width at runtime for OpenCL?

I found the answer I was looking for. It turns out that you don't query the device for this information, you query the kernel object (in OpenCL). My source is:

http://www.hpc.lsu.edu/training/tutorials/sc10/tutorials/SC10Tutorials/docs/M13/M13.pdf

(Page 108)

which says:

The most efficient work group sizes are likely to be multiples of the native hardware execution width

wavefront size in AMD speak/warp size in Nvidia speak

Query device for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

So, in short, the answer appears to be to call the clGetKernelWorkGroupInfo() method with a param name of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. See this link for more information on this method:

http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetKernelWorkGroupInfo.html

On AMD, you can query CL_DEVICE_WAVEFRONT_WIDTH_AMD. That's different from CL_DEVICE_SIMD_WIDTH_AMD, which returns the number of threads it executes in each clock cycle. The latter may be smaller than the wavefront size, in which case it takes multiple clock cycles to execute one instruction for all the threads in a wavefront.

On NVIDIA, you can query the warp size width using clGetDeviceInfo with CL_DEVICE_WARP_SIZE_NV (although this is always 32 for current GPUs), however, this is an extension, as OpenCL defines nothing like warps or wavefronts. I don't know about any AMD extension that would allow to query for the wavefront size.

For AMD: clGetDeviceInfo(..., CL_DEVICE_WAVEFRONT_WIDTH_AMD, ...) (if cl_amd_device_attribute_query extension supported)

For Nvidia: clGetDeviceInfo(..., CL_DEVICE_WARP_SIZE_NV, ...) (if cl_nv_device_attribute_query extension supported)

But there is no uniform way. The way suggested by Jonathan DeCarlo doesn't work, I was using it for GPUs if these two extensions does not supported - for example Intel iGPU, but recently I faced wrong results on Intel HD 4600:

Intel HD 4600 says CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE=32 while in fact Intel GPUs seems to have wavefront equal to 16, so I faced incorrect results, everything works fine if barriers were used for wavefront=16.

P.S. I have not enough reputation to comment Jonathan DeCarlo answer about this, will be glad if somebody will add comment.

The closest to actual SIMD width is the result of get_max_sub_group_size() kernel runtime function from cl_khr_subgroups extension. It returns min(SIMD-width, work-group-size).

Worth attention is also function get_sub_group_size() which returns the size of the current sub-group, which is never bigger than SIMD width: for example if SIMD width is 32 and group size is 40, then get_sub_group_size for threads 0..31 will return 32 and for threads 32..39, it will return 8.

foot-note: to use this extension add #pragma OPENCL EXTENSION cl_khr_subgroups : enable at the top of your openCL kernel code.

UPDATE:
it seems that there's also corresponding host level function clGetKernelSubGroupInfo, but jocl that I use does not have a binding for it, so I cannot verify if it works.
Currently, if I need to check SIMD width at the host level, I run a helper kernel which calls get_max_sub_group_size() and stores it into its result buffer:

// run it with max work-group size
__kernel void getSimdWidth(__global uint *simdWidth) {
    if (get_local_id(0) == 0) simdWidth[0] = get_max_sub_group_size();
}

You can use the clGetDeviceInfo to get maximum number of workitems you can have in your local workset for each dimension. This is most likely multiple of your wavefront size.

See: http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetDeviceInfo.html

For CUDA (using NVIDIA), please take a look at B.4.5 Cuda programming guide from NVIDIA. There is a variable for containing this information. You can query this variable at runtime. For AMD , I'm not sure if there is such a variable.

继续阅读：gpgpu gpu opencl

OpenCL - How to I query for a device's SIMD width?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？