Strange behaviour using local memory in OpenCL
I'm currently working on a project using OpenCL on an NVIDIA Tesla C1060 (driver version 195.17). However, I'm getting some strange behaviour I can't really explain. Here is the code that puzzles me (reduced for clarity and testing purposes):
kernel void TestKernel(global const int* groupOffsets, global float* result,
                       local int* tmpData, const int itemcount)
{
    unsigned int groupid = get_group_id(0);
    unsigned int globalsize = get_global_size(0);
    unsigned int groupcount = get_num_groups(0);
    for(unsigned int id = get_global_id(0); id < itemcount; id += globalsize, groupid += groupcount)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if(get_local_id(0) == 0)
            tmpData[0] = groupOffsets[groupid];
        barrier(CLK_LOCAL_MEM_FENCE);
        int offset = tmpData[0];
        result[id] = (float) offset;
    }
}
This code should load the offset for each work-group into local memory, then read it back and write it into the corresponding output vector entry. For most work-items this works, but in each work-group the work-items with local IDs 1 to 31 read an incorrect value. My output vector (for a work-group size of 128) is as follows:
index 0:       0
index 1-31:    470400
index 32-127:  0
index 128:     640
index 129-159: 471040
index 160-255: 640
index 256:     1280
index 257-287: 471680
index 288-511: 1280
...
The output I expected would be:
index 0-127: 0
index 128-255: 640
index 256-511: 1280
...
The strange thing is: the problem only occurs when I use fewer than itemcount work-items (so it works as expected when globalsize >= itemcount, meaning that every work-item processes only one entry). So I'm guessing it has something to do with the loop. Does anyone know what I'm doing wrong and how to fix it?
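For illustration, a host-side launch where the global size is smaller than itemcount (the case where the problem appears) looks roughly like the sketch below. The helper function, buffer names, and sizes are placeholders, not my actual host code; the relevant points are that the local-memory argument tmpData is passed with a NULL pointer and only a size, and that globalsize is smaller than itemcount:

#include <CL/cl.h>

/* Hypothetical helper: enqueue TestKernel with fewer work-items than entries.
   Context/queue/kernel creation and error checking are omitted for brevity. */
void launch_test_kernel(cl_command_queue queue, cl_kernel kernel,
                        cl_mem groupOffsets, cl_mem result, cl_int itemcount)
{
    size_t localsize  = 128;            /* work-group size used above          */
    size_t globalsize = 8 * localsize;  /* deliberately smaller than itemcount */

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &groupOffsets);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &result);
    /* Local-memory argument: pass only a size, the pointer must be NULL.
       Only one int is used in the reduced kernel. */
    clSetKernelArg(kernel, 2, sizeof(cl_int), NULL);
    clSetKernelArg(kernel, 3, sizeof(cl_int), &itemcount);

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalsize, &localsize, 0, NULL, NULL);
}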
Update: I found out that it seems to work if I change
if(get_local_id(0) == 0)
    tmpData[0] = groupOffsets[groupid];
to
if(get_local_id(0) < 32)
    tmpData[0] = groupOffsets[groupid];
This astonishes me even more, so while it might fix the problem, I don't feel comfortable fixing it this way (it might break some other time). Besides, I would rather avoid losing performance when running on GeForce 8xxx class hardware due to the additional (uncoalesced on that hardware, as far as I understand) memory accesses. So the question still remains.
First, and importantly, you need to be careful that itemcount is a multiple of the local work size, to avoid divergence when executing the barrier. The OpenCL specification says of barrier:
All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.
You could implement this as follows:
unsigned int itemcountrounded = get_local_size(0) * ((itemcount + get_local_size(0) - 1) / get_local_size(0));
for(unsigned int id = get_global_id(0); id < itemcountrounded; id += globalsize, groupid += groupcount)
{
    // ...
    if (id < itemcount)
        result[id] = (float) offset;
}
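Putting that together, a complete version of the kernel could look like the sketch below (my reading of the fix, not tested on your hardware, and assuming groupOffsets has one entry per work-group's worth of items as in your data): every work-item loops up to the rounded count so all of them reach both barriers, and only the in-range ones write to result.

kernel void TestKernel(global const int* groupOffsets, global float* result,
                       local int* tmpData, const int itemcount)
{
    unsigned int groupid = get_group_id(0);
    unsigned int globalsize = get_global_size(0);
    unsigned int groupcount = get_num_groups(0);

    // Round itemcount up to a multiple of the work-group size so the loop
    // (and the barriers inside it) is executed uniformly by every work-item.
    unsigned int itemcountrounded =
        get_local_size(0) * ((itemcount + get_local_size(0) - 1) / get_local_size(0));

    for(unsigned int id = get_global_id(0); id < itemcountrounded;
        id += globalsize, groupid += groupcount)
    {
        barrier(CLK_LOCAL_MEM_FENCE);
        if(get_local_id(0) == 0)
            tmpData[0] = groupOffsets[groupid];
        barrier(CLK_LOCAL_MEM_FENCE);
        int offset = tmpData[0];

        // Only in-range work-items write; all of them still hit the barriers.
        if(id < itemcount)
            result[id] = (float) offset;
    }
}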
You said the code was reduced for simplicity; what happens if you run exactly what you posted? I'm just wondering whether you need to put the barrier on global memory as well.
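If that turns out to matter, the fence flags can be combined in a single barrier call, e.g.:

// Fence both local and global memory at the barrier (only if the full code needs it)
barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);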