[OpenCL] Nearest neighbour using Euclidean distance
I'm using OpenCL to find the nearest neighbour between two sets of 3D points.
Nearest neighbour: for each point (x, y, z) in the dataset I have to find the nearest one in the model, using the squared distance (Ax-Bx)^2 + (Ay-By)^2 + (Az-Bz)^2.
Here is what I've done so far:
struct point {
    int x;
    int y;
    int z;
};

__kernel void
nearest_neighbour(__global struct point *model,
                  __global struct point *dataset,
                  __global int *nearest,
                  const unsigned int model_size)
{
    int g_dataset_id = get_global_id(0);
    int dmin = -1;
    int d, dx, dy, dz;

    for (int i = 0; i < model_size; ++i) {
        dx = model[i].x - dataset[g_dataset_id].x;
        dx = dx * dx;
        dy = model[i].y - dataset[g_dataset_id].y;
        dy = dy * dy;
        dz = model[i].z - dataset[g_dataset_id].z;
        dz = dz * dz;
        d = dx + dy + dz;
        if (dmin == -1 || d < dmin) {
            nearest[g_dataset_id] = i;
            dmin = d;
        }
    }
}
The code seems to work, but I'm sure it can be optimized. I would like to know how I can take advantage of local memory to make it better.
Thanks
P.S. I know there are other (better) methods to find the nearest neighbour, like a kd-tree, but for now I would like to do the easy one.
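For checking the kernel's results on the host, the same brute-force search can be written as plain C. This is a minimal sketch (the function name nearest_neighbour_ref is my own, not from the thread), with the same struct layout as the kernel:

```c
#include <limits.h>
#include <stddef.h>

struct point { int x, y, z; };

/* Brute-force nearest neighbour: for each dataset point, store the
 * index of the closest model point by squared Euclidean distance. */
static void nearest_neighbour_ref(const struct point *model, size_t model_size,
                                  const struct point *dataset, size_t dataset_size,
                                  int *nearest)
{
    for (size_t g = 0; g < dataset_size; ++g) {
        int dmin = INT_MAX;
        int imin = 0;
        for (size_t i = 0; i < model_size; ++i) {
            int dx = model[i].x - dataset[g].x;
            int dy = model[i].y - dataset[g].y;
            int dz = model[i].z - dataset[g].z;
            int d = dx * dx + dy * dy + dz * dz;
            if (d < dmin) {
                imin = (int)i;
                dmin = d;
            }
        }
        nearest[g] = imin;
    }
}
```

Comparing its output against the buffer the kernel fills is a quick way to catch indexing mistakes before optimizing.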
The compiler is probably hoisting these loop invariants for you, but to be sure it gets done, try this code, which assigns them to temporaries named datum_x and so on. Also, initializing dmin to INT_MAX lets you avoid the superfluous comparison with -1. Another approach is to unroll the first loop iteration (with i = 0) in order to initialize dmin.
int dmin = INT_MAX;
int d, dx, dy, dz;
int datum_x, datum_y, datum_z;

datum_x = dataset[g_dataset_id].x;
datum_y = dataset[g_dataset_id].y;
datum_z = dataset[g_dataset_id].z;

for (int i = 0; i < model_size; ++i) {
    dx = model[i].x - datum_x;
    dx = dx * dx;
    dy = model[i].y - datum_y;
    dy = dy * dy;
    dz = model[i].z - datum_z;
    dz = dz * dz;
    d = dx + dy + dz;
    if (d < dmin) {
        nearest[g_dataset_id] = i;
        dmin = d;
    }
}
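The unrolled-first-iteration variant mentioned above would look roughly like this. A host-side C sketch of the idea (the helper names sqdist and nearest_index are my own), assuming there is at least one model point:

```c
#include <stddef.h>

struct point { int x, y, z; };

/* Squared Euclidean distance between a model point and (px, py, pz). */
static int sqdist(struct point a, int px, int py, int pz)
{
    int dx = a.x - px, dy = a.y - py, dz = a.z - pz;
    return dx * dx + dy * dy + dz * dz;
}

/* Initialize dmin from i = 0, then scan the rest: no sentinel value
 * and no extra comparison inside the loop. Assumes model_size >= 1. */
static int nearest_index(const struct point *model, size_t model_size,
                         int px, int py, int pz)
{
    int imin = 0;
    int dmin = sqdist(model[0], px, py, pz);
    for (size_t i = 1; i < model_size; ++i) {
        int d = sqdist(model[i], px, py, pz);
        if (d < dmin) {
            imin = (int)i;
            dmin = d;
        }
    }
    return imin;
}
```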
Maybe a quick pre-filter can speed things up. Instead of calculating the squared distance immediately, you can first check whether the distance along each of the three coordinates already rules the point out. So you can replace your inner loop body with
{
    dx = model[i].x - datum_x;
    if (abs(dx) >= dmin) continue;
    dy = model[i].y - datum_y;
    if (abs(dy) >= dmin) continue;
    dz = model[i].z - datum_z;
    if (abs(dz) >= dmin) continue;
    dx = dx * dx;
    dy = dy * dy;
    dz = dz * dz;
    d = dx + dy + dz;
    if (d < dmin) {
        nearest[g_dataset_id] = i;
        dmin = d;
    }
}
I am not sure whether the extra calls to abs() and the extra ifs per point will filter out enough data points, but it is a simple enough change to try out, I think.
The first thing that occurred to me is the suggestion that Heath made. Every work item accesses the same memory item model[i] simultaneously. Depending on how good the compiler is at optimizing, it might be better to have each work item access a different element of the array. One way of staggering the accesses is:
int j;
int datum_x, datum_y, datum_z;

datum_x = dataset[g_dataset_id].x;
datum_y = dataset[g_dataset_id].y;
datum_z = dataset[g_dataset_id].z;

for (int i = 0; i < model_size; ++i) {
    j = (i + g_dataset_id) % model_size; // i --> j by cyclic permutation
    dx = model[j].x - datum_x;
    dx = dx * dx;
    dy = model[j].y - datum_y;
    dy = dy * dy;
    /* and so on... */
}
However, it may well be that the access to model[i] in your code is handled as a "broadcast," in which case my code will run slower.
I am pretty sure that a lot of time is spent writing the current minimum into nearest[g_dataset_id]. Access to global memory is often very slow, so you are better off keeping the current minimum in a register, as you already do with dmin = d.
Just like this:
...
int dmin = INT_MAX;
int imin = 0;
...
for (...)
{
    ...
    if (d < dmin)
    {
        imin = i;
        dmin = d;
    }
}
nearest[g_dataset_id] = imin; // write to global memory only once
Heath's suggestion can be applied to the output index too: maintain a variable nearest_id and write it only once, after the loop.
Instead of the 3-component struct, I would experiment with int4 vectors and use vector operations.
After Eric Bainville's suggestion I tried to get rid of the point struct. As suggested I used float4; here is what I've done:
__kernel void
nearest_neighbour(__global float4 *model,
                  __global float4 *dataset,
                  __global unsigned int *nearest,
                  const unsigned int model_size)
{
    int g_dataset_id = get_global_id(0);
    float dmin = MAXFLOAT;
    float d;
    /* Optimized: keep the query point in a private copy */
    float4 local_xyz = dataset[g_dataset_id];
    float4 d_xyz;
    int imin = 0;

    for (int i = 0; i < model_size; ++i) {
        d_xyz = model[i] - local_xyz;
        d_xyz *= d_xyz;
        d = d_xyz.x + d_xyz.y + d_xyz.z;
        if (d < dmin) {
            imin = i;
            dmin = d;
        }
    }
    nearest[g_dataset_id] = imin; // write to global memory only once
}
The problem is that this version runs a bit slower than the one based on the point struct, probably because in the struct version I used the pre-filter:
dx = model[i].x - local_x;
dx = dx * dx;
if (dx >= dmin) continue;
dy = model[i].y - local_y;
dy = dy * dy;
if (dy >= dmin) continue;
dz = model[i].z - local_z;
dz = dz * dz;
if (dz >= dmin) continue;
d = dx + dy + dz;
I can't use that pre-filter with the float4 version. In your opinion, are there other optimizations I can do on the float4 version?
Thank you all for your valuable suggestions
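One thing worth trying on the float4 version (my own suggestion, not something benchmarked in this thread): zero the w component of the difference and use OpenCL's built-in dot() for the squared distance, i.e. in the kernel d_xyz.w = 0.0f; d = dot(d_xyz, d_xyz); instead of summing .x/.y/.z by hand. A scalar C sketch showing why the zeroed w term makes the 4-component dot product equal to the 3D squared distance:

```c
/* 4-component dot product with w zeroed equals the 3D squared distance.
 * In the kernel this becomes: d_xyz.w = 0.0f; d = dot(d_xyz, d_xyz); */
static float sqdist3(const float a[4], const float b[4])
{
    float d[4] = { a[0] - b[0], a[1] - b[1], a[2] - b[2], 0.0f }; /* zero w */
    return d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + d[3] * d[3]; /* dot(d, d) */
}
```

Whether this is faster than the hand-written sum depends on the device; it is cheap to try.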
To your specific question, "I would like to know how I can take advantage of local memory to make it better":
Using the GPU's local memory can be tricky. You need to spend some quality time with the SDK's code samples and programming guide before tackling it.
Basically, you use local memory to cache a block of the global data -- in your case the model[] array -- so that you can read it from there faster than from global memory. If you want to try it, it would go something like this pseudocode:
For each block of the model array {
    1) Read data from __global and write it to __local
    2) Barrier
    3) For each model datum in the __local cache,
       read it and process it.
    4) Barrier
}
Step 3 is basically the loop you have now, except that it would only be processing a chunk of the model data instead of the whole thing.
Steps 2 and 4 are absolutely essential whenever you use local memory. You have to synchronize all of the threads in your work group. The barrier forces every work item to finish the code before the barrier before any of them is allowed to execute the code after it. This prevents work items from reading data out of local memory before it has been written there by the other threads. The call is barrier(CLK_LOCAL_MEM_FENCE), as in the code below; see the OpenCL docs for details.
In step 1, you would have each work item read a different datum from global memory and write it to the local cache.
Something like this (caution this is oversimplified and untested!):
__local float4 modelcache[CACHESIZE];
int me = get_local_id(0);

/* Assumes model_size is a multiple of CACHESIZE and the
   work group size equals CACHESIZE. */
for (int j = 0; j < model_size; j += CACHESIZE) {
    modelcache[me] = model[j + me];   // each work item loads one element
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int i = 0; i < CACHESIZE; ++i) {
        d_xyz = modelcache[i] - local_xyz;
        ... etc.
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}
The design question then is: How big should the local cache be? What's the work group size?
The local data store is shared between the work items in a work group. If your ND-range of work items executes a number of work groups in parallel, each work group has its own copy of modelcache.
If you make the local data arrays too small, you get very little or no benefit from using them. If you make them too big, then the GPU can't execute as many work groups in parallel, and you might actually run considerably slower.
Finally, I have to say that this particular algorithm isn't likely to benefit much from a local memory cache. In your program, all of the work items are reading the same model[i] locations at the same time, and most GPUs have hardware that is specifically optimized to do that fast.