OpenCL - How to I query for a device's SIMD width?
In CUDA, there is a concept of a warp, which is defined as 开发者_JAVA技巧the maximum number of threads that can execute the same instruction simultaneously within a single processing element. For NVIDIA, this warp size is 32 for all of their cards currently on the market.
In ATI cards, there is a similar concept, but the terminology in this context is wavefront. After some hunting around, I found out that the ATI card I have has a wavefront size of 64.
My question is, what can I do to query for this SIMD width at runtime for OpenCL?
I found the answer I was looking for. It turns out that you don't query the device for this information, you query the kernel object (in OpenCL). My source is:
http://www.hpc.lsu.edu/training/tutorials/sc10/tutorials/SC10Tutorials/docs/M13/M13.pdf
(Page 108)
which says:
The most efficient work group sizes are likely to be multiples of the native hardware execution width
- wavefront size in AMD speak/warp size in Nvidia speak
- Query device for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE
So, in short, the answer appears to be to call the clGetKernelWorkGroupInfo() method with a param name of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. See this link for more information on this method:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetKernelWorkGroupInfo.html
On AMD, you can query CL_DEVICE_WAVEFRONT_WIDTH_AMD. That's different from CL_DEVICE_SIMD_WIDTH_AMD, which returns the number of threads it executes in each clock cycle. The latter may be smaller than the wavefront size, in which case it takes multiple clock cycles to execute one instruction for all the threads in a wavefront.
On NVIDIA, you can query the warp size width using clGetDeviceInfo with CL_DEVICE_WARP_SIZE_NV (although this is always 32 for current GPUs), however, this is an extension, as OpenCL defines nothing like warps or wavefronts. I don't know about any AMD extension that would allow to query for the wavefront size.
For AMD: clGetDeviceInfo(..., CL_DEVICE_WAVEFRONT_WIDTH_AMD, ...) (if cl_amd_device_attribute_query extension supported)
For Nvidia: clGetDeviceInfo(..., CL_DEVICE_WARP_SIZE_NV, ...) (if cl_nv_device_attribute_query extension supported)
But there is no uniform way. The way suggested by Jonathan DeCarlo doesn't work, I was using it for GPUs if these two extensions does not supported - for example Intel iGPU, but recently I faced wrong results on Intel HD 4600:
Intel HD 4600 says CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE=32 while in fact Intel GPUs seems to have wavefront equal to 16, so I faced incorrect results, everything works fine if barriers were used for wavefront=16.
P.S. I have not enough reputation to comment Jonathan DeCarlo answer about this, will be glad if somebody will add comment.
The closest to actual SIMD width is the result of get_max_sub_group_size()
kernel runtime function from cl_khr_subgroups extension. It returns min(SIMD-width, work-group-size)
.
Worth attention is also function get_sub_group_size()
which returns the size of the current sub-group, which is never bigger than SIMD width: for example if SIMD width is 32 and group size is 40, then get_sub_group_size
for threads 0..31 will return 32 and for threads 32..39, it will return 8.
foot-note: to use this extension add #pragma OPENCL EXTENSION cl_khr_subgroups : enable
at the top of your openCL kernel code.
UPDATE:
it seems that there's also corresponding host level function clGetKernelSubGroupInfo, but jocl that I use does not have a binding for it, so I cannot verify if it works.
Currently, if I need to check SIMD width at the host level, I run a helper kernel which calls get_max_sub_group_size()
and stores it into its result buffer:
// run it with max work-group size
__kernel void getSimdWidth(__global uint *simdWidth) {
if (get_local_id(0) == 0) simdWidth[0] = get_max_sub_group_size();
}
You can use the clGetDeviceInfo to get maximum number of workitems you can have in your local workset for each dimension. This is most likely multiple of your wavefront size.
See: http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clGetDeviceInfo.html
For CUDA (using NVIDIA), please take a look at B.4.5 Cuda programming guide from NVIDIA. There is a variable for containing this information. You can query this variable at runtime. For AMD , I'm not sure if there is such a variable.
精彩评论