Performing a scan in OpenCL
I've been trying to get a simple scan working for quite some time now. For small problems the output is correct, but for large inputs I only get the correct results sometimes. I've checked Apple's OpenCL example and I'm basically doing the same thing (except for the bank-conflict handling, which I'm ignoring at the moment). Here's the code for the first phase:
__kernel void
scan_init(__global int * input,
          __global int * sums)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int chunk_size = get_local_size(0)*2;
    int chunk = gid/chunk_size;
    int offset = chunk*chunk_size;

    reduction(input, offset);

    // store sums
    if(lid==0)
    {
        sums[chunk] = input[(chunk+1)*chunk_size-1];
    }

    downsweep(input, offset);
}
And the reduction function itself:
// Up-sweep (reduce) phase: builds the partial sums of one chunk in place.
void reduction(__global int * input,
               int offset)
{
    int stride = 1;
    int grp_size = get_local_size(0);
    int lid = get_local_id(0);

    for(int d = grp_size; d > 0; d >>= 1)
    {
        barrier(CLK_GLOBAL_MEM_FENCE);
        if(lid < d)
        {
            int ai = stride*(2*lid+1)-1+offset;
            int bi = stride*(2*lid+2)-1+offset;
            input[bi] += input[ai];
        }
        stride *= 2;
    }
}
In the second phase, the partial sums are used to build the sum for each element:
// Down-sweep phase: distributes the partial sums back down the tree.
void downsweep(__global int * input,
               const unsigned int offset)
{
    int grp_size = get_local_size(0);
    int lid = get_local_id(0);
    int stride = grp_size*2;

    for(int d = 1; d <= grp_size; d *= 2)
    {
        barrier(CLK_GLOBAL_MEM_FENCE);
        stride >>= 1;
        if(lid+1 < d)
        {
            int src = 2*(lid + 1)*stride-1+offset;
            int dest = src + stride;
            input[dest] += input[src];
        }
    }
}
The input is padded to a size that is a multiple of the local work size. Each work group can scan a chunk of twice its size. I save the sum of each chunk in the sums array, which I use to check the result. The following is the output for an input of size 4000, an array of 1's:
Chunk size: 1024
Chunks: 4
Scan global size: 4096
Local work size: 512
Sum size: 4
0:1024 1:1120 2:2904 3:928
However, the expected result would be:
0:1024 1:1024 2:1024 3:928
If I run the code again, I get:
0:1056 1:5376 2:1024 3:928
0:1024 1:1088 2:1280 3:992
0:5944 1:11156 2:3662 3:1900
0:7872 1:1056 2:2111 3:1248
The call to the kernel is the following:
clEnqueueNDRangeKernel(cl_ctx->queue, scan_init, 1, NULL,
                       &scan_global_size, &local_work_size,
                       0, NULL, NULL);
Where global size is 4096 and local size is 512. If I limit the local work group size to 64, the output looks as follows:
0:128 1:128 2:128 3:288 4:128 5:128 6:192 7:192
8:192 9:254 10:128 11:256 12:128 13:360 14:128 15:128
16:128 17:128 18:128 19:288 20:128 21:128 22:128 23:128
24:192 25:128 26:128 27:192 28:128 29:128 30:128 31:32
And if I change the input size to 512 and use any chunk size, everything works great!
Finally, when using input size 513 and a group size of 256 (that is, I have two chunks, each having 512 elements, with the second one having only the first element set to 1), the result of the first phase is:
0:1 1:2 2:1 3:6 4:1 5:2 6:1 7:14
8:1 9:2 10:1 11:6 12:1 13:2 14:1 15:28
16:1 17:2 18:1 19:6 20:1 21:2 22:1 23:14
24:1 25:2 26:1 27:6 28:1 29:2 30:1 31:56
32:1 33:2 34:1 35:6 36:1 37:2 38:1 39:14
40:1 41:2 42:1 43:6 44:1 45:2 46:1 47:28
48:1 49:2 50:1 51:6 52:1 53:2 54:1 55:14
56:1 57:2 58:1 59:6 60:1 61:2 62:1 63:148
Where it should be:
0:1 1:2 2:1 3:4 4:1 5:2 6:1 7:8
8:1 9:2 10:1 11:4 12:1 13:2 14:1 15:16
16:1 17:2 18:1 19:4 20:1 21:2 22:1 23:8
24:1 25:2 26:1 27:4 28:1 29:2 30:1 31:32
32:1 33:2 34:1 35:4 36:1 37:2 38:1 39:8
40:1 41:2 42:1 43:4 44:1 45:2 46:1 47:16
48:1 49:2 50:1 51:4 52:1 53:2 54:1 55:8
56:1 57:2 58:1 59:4 60:1 61:2 62:1 63:64
My guess is that it's a problem with different threads accessing the same data concurrently; however, this shouldn't be the case, since every work group is processing a different chunk of the input data. Any help on this matter will be greatly appreciated!
I suspect the problem has to do with barrier() not being an inter-workgroup synchronisation primitive. Each workgroup has its own barriers, and you have no guarantees about the ordering of the workgroups themselves. When you change the input size to 512, your workgroups may all end up running on the same multiprocessor and therefore be incidentally synchronized.
Your chunk variable is get_group_id(0)/2, which means you've got two entire workgroups assigned to the same chunk; you probably want that the other way around. If they happen to run in lockstep, they'll simply overwrite each other's work, because their load-store dependencies match up. Otherwise, they may or may not interfere, always in the direction of summing values multiple times.
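A minimal sketch of that change, untested, with everything else kept as in your kernel: derive the chunk directly from the group id, so each workgroup owns exactly one chunk of 2*local_size elements (the global size then has to be halved, see below):

__kernel void
scan_init(__global int * input,
          __global int * sums)
{
    int lid = get_local_id(0);
    int chunk_size = get_local_size(0)*2;
    int chunk = get_group_id(0);   // one chunk per workgroup, not per pair
    int offset = chunk*chunk_size;

    reduction(input, offset);

    if(lid==0)
    {
        sums[chunk] = input[(chunk+1)*chunk_size-1];
    }

    downsweep(input, offset);
}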
A hint in this matter is in your question itself: "Each work group can scan a chunk of twice its size." That should mean a total work size of half the array size is sufficient.
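On the host side, that would mean enqueuing half as many work-items, along these lines (a sketch following the call you showed; padded_size is a hypothetical name for the padded input length):

// Each workgroup of local_work_size threads scans a chunk of
// 2*local_work_size elements, so padded_size/2 work-items suffice.
size_t local_work_size = 512;
size_t scan_global_size = padded_size / 2;   // e.g. 4096/2 = 2048
clEnqueueNDRangeKernel(cl_ctx->queue, scan_init, 1, NULL,
                       &scan_global_size, &local_work_size,
                       0, NULL, NULL);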
The loop in downsweep() also has an oddity. The first iteration does nothing: lid+1 >= 1, and d starts at 1. That could be an insignificant superfluous iteration, but it's an off-by-one in the planning.
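If you want to drop that empty pass, starting d at 2 and pre-halving the initial stride should visit exactly the same working iterations; a sketch, untested, and note this is cosmetic only and doesn't fix the inter-workgroup race above:

void downsweep(__global int * input,
               const unsigned int offset)
{
    int grp_size = get_local_size(0);
    int lid = get_local_id(0);
    int stride = grp_size;   // was grp_size*2; halved once up front

    // d = 1 never satisfied lid+1 < d, so start at d = 2.
    for(int d = 2; d <= grp_size; d *= 2)
    {
        barrier(CLK_GLOBAL_MEM_FENCE);   // still fences the preceding phase
        stride >>= 1;
        if(lid+1 < d)
        {
            int src = 2*(lid + 1)*stride-1+offset;
            int dest = src + stride;
            input[dest] += input[src];
        }
    }
}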