Strange behaviour using local memory in OpenCL
I'm currently working on a project using OpenCL on an NVIDIA Tesla C1060 (driver version 195.17). However, I'm getting some strange behaviour I can't really explain. Here is the code that puzzles me (reduced for clarity and testing purposes):
kernel void TestKernel(global const int* groupOffsets, global float* result,
                       local int* tmpData, const int itemcount)
{
    unsigned int groupid = get_group_id(0);
    unsigned int globalsize = get_global_size(0);
    unsigned int groupcount = get_num_groups(0);
    for(unsigned int id = get_global_id(0); id < itemcount; id += globalsize, groupid += groupcount)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if(get_local_id(0) == 0)
            tmpData[0] = groupOffsets[groupid];
        barrier(CLK_LOCAL_MEM_FENCE);
        int offset = tmpData[0];
        result[id] = (float) offset;
    }
}
This code should load the offset for each work-group into local memory, then read it back and write it into the corresponding output vector entry. For most work-items this works, but in each work-group the work-items with local IDs 1 to 31 read an incorrect value. My output vector (for a work-group size of 128) is as follows:
index 0:       0
index 1-31:    470400
index 32-127:  0
index 128:     640
index 129-159: 471040
index 160-255: 640
index 256:     1280
index 257-287: 471680
index 288-511: 1280
...
The output I expected would be:
index 0-127: 0
index 128-255: 640
index 256-511: 1280
...
The strange thing is: the problem only occurs when I use fewer than itemcount work-items (so it works as expected when globalsize >= itemcount, meaning that every work-item processes only one entry). So I'm guessing it has something to do with the loop. Does anyone know what I'm doing wrong and how to fix it?
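For illustration, a host-side launch where the global size is smaller than itemcount (the case where the problem appears) looks roughly like the sketch below. The helper function, buffer names, and sizes are placeholders, not my actual host code; the relevant points are that the local-memory argument tmpData is passed with a NULL pointer and only a size, and that globalsize is smaller than itemcount:

#include <CL/cl.h>

/* Hypothetical helper: enqueue TestKernel with fewer work-items than entries.
   Context/queue/kernel creation and error checking are omitted for brevity. */
void launch_test_kernel(cl_command_queue queue, cl_kernel kernel,
                        cl_mem groupOffsets, cl_mem result, cl_int itemcount)
{
    size_t localsize  = 128;            /* work-group size used above          */
    size_t globalsize = 8 * localsize;  /* deliberately smaller than itemcount */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &groupOffsets);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &result);
    /* Local-memory argument: pass only a size, the pointer must be NULL.
       Only one int is used in the reduced kernel. */
    clSetKernelArg(kernel, 2, sizeof(cl_int), NULL);
    clSetKernelArg(kernel, 3, sizeof(cl_int), &itemcount);

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalsize, &localsize, 0, NULL, NULL);
}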
Update: I found out that it seems to work if I change
if(get_local_id(0) == 0)
    tmpData[0] = groupOffsets[groupid];
to
if(get_local_id(0) < 32)
    tmpData[0] = groupOffsets[groupid];
This astonishes me even more, so while it might fix the problem, I don't feel comfortable fixing it this way (it might break some other time). Besides, I would rather avoid losing performance when running on GeForce 8xxx class hardware due to the additional (uncoalesced on that hardware, as far as I understand) memory accesses. So the question still remains.
First, and importantly, you need to be careful that itemcount is a multiple of the local work size, to avoid divergence when executing the barrier. The OpenCL specification says of barrier:
All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.
You could implement this as follows:
unsigned int itemcountrounded = get_local_size(0) * ((itemcount + get_local_size(0) - 1) / get_local_size(0));
for(unsigned int id = get_global_id(0); id < itemcountrounded; id += globalsize, groupid += groupcount)
{
    // ...
    if (id < itemcount)
        result[id] = (float) offset;
}
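Putting that together, a complete version of the kernel could look like the sketch below (my reading of the fix, not tested on your hardware, and assuming groupOffsets has one entry per work-group's worth of items as in your data): every work-item loops up to the rounded count so all of them reach both barriers, and only the in-range ones write to result.

kernel void TestKernel(global const int* groupOffsets, global float* result,
                       local int* tmpData, const int itemcount)
{
    unsigned int groupid = get_group_id(0);
    unsigned int globalsize = get_global_size(0);
    unsigned int groupcount = get_num_groups(0);

    // Round itemcount up to a multiple of the work-group size so the loop
    // (and the barriers inside it) is executed uniformly by every work-item.
    unsigned int itemcountrounded =
        get_local_size(0) * ((itemcount + get_local_size(0) - 1) / get_local_size(0));

    for(unsigned int id = get_global_id(0); id < itemcountrounded;
        id += globalsize, groupid += groupcount)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if(get_local_id(0) == 0)
            tmpData[0] = groupOffsets[groupid];
        barrier(CLK_LOCAL_MEM_FENCE);
        int offset = tmpData[0];

        // Only in-range work-items write; all of them still hit the barriers.
        if(id < itemcount)
            result[id] = (float) offset;
    }
}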
You said the code was reduced for simplicity; what happens if you run exactly what you posted? I'm just wondering whether you need to put the barrier on global memory as well.
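If that turns out to matter, the fence flags can be combined in a single barrier call, e.g.:

// Fence both local and global memory at the barrier (only if the full code needs it)
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);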