How to structure this OpenCL brute-force code
I'm just starting to play about with OpenCL, and I'm stuck with how to structure the program in a reasonably efficient manner (mainly avoiding lots of transferring of data to/from the GPU or wherever the work is being done)
What I'm trying to do is, given:
v = r*i + b*j + g*k
..I know v
for various values of r
, g
and b
, but i
, j
and k
are unknown. I want to calculate reasonable values for i
/j
/k
via brute force
In other words, I have a bunch of "raw" RGB pixel values, and I have a desaturated version of these colours. I do not know the weightings (i/j/k) used calculate the desaturated values.
My initial plan was to:
load the data into a CL buffer (so the input r/g/b values, and the output)
have a kernel which takes the three possible matrix values, and the various pixel-data buffers.
It then performs
v = r*i + b*j + g*k
, and subtracts the value ofv
to the known value, and stores this in a "score" bufferAnother kernel calculates the RMS error for that value (if the difference is zero for all input values, the values for i/j/k are "correct")
I have this working (written using Python and PyCL, the code is here), but I'm wondering how I can parallelise this chunk of work more (by try multiple i/j/k values at once)
I issue is, I have the 4 read-only buffers (3 for the input values, 1 for the expected values), but I need a separate "score" buffer for every combination of i/j/k
Another issue is the RMS calculation is the slowest part, since it's effectively single-threaded (total up all the values in "score" and sqrt() the total)
Basically, I'm wondering if there's a sensible way to structure such a program.
It seems like a task well-suited to OpenCL - hopefully the description of my goal wasn't too convoluted! As mentioned, my current code is here, and in case it is clearer, this is the Python version of what I'm trying to do:
import sys
import math
import random
def make_test_data(w = 128, h = 128):
in_r, in_g, in_b = [], [], []
print "Make raw data"
for x in range(w):
for y in range(h):
in_r.append(random.random())
in_g.append(random.random())
in_b.append(random.random())
# the unknown values
mtx = [random.random(), random.random(), random.random()]
print "Secret numbers were: %s" % mtx
out_r = [(r*mtx[0] + g*mtx[1] + b*mtx[2]) for (r, g, b) in zip(in_r, in_g, in_b)]
return {'in_r': in_r, 'in_g': in_g, 'in_b': in_b,
'expected_r': out_r}
def score_matrix(ir, ig, ib, expected_r, mtx):
ms = 0
for i in range(len(ir)):
val = ir[i] * mtx[0] + ig[i] * mtx[1] + ib[i] * mtx[2]
ms += abs(val - expected_r[i]) ** 2
rms = math.sqrt(ms / float(len(ir)))
return rms
# Make random test data
test_data = make_test_data(16, 16)
lowest_rms = sys.maxint
closest = []
divisions = 10
for possible_r in range(divisions):
for possible_g in range(divisions):
for possible_b in range(divisions):
开发者_StackOverflow pr, pg, pb = [x / float(divisions-1) for x in (possible_r, possible_g, possible_b)]
rms = score_matrix(
test_data['in_r'], test_data['in_g'], test_data['in_b'],
test_data['expected_r'],
mtx = [pr, pg, pb])
if rms < lowest_rms:
closest = [pr, pg, pb]
lowest_rms = rms
print closest
Are i,j,k sets independent? I assumed that yes. Few things hurts your performance:
- running too many small kernels
- using global memory for communication between score_matrix and rm_to_rms
You could rewrite both kernels into one with following changes:
- make that one OpenCL work-group would work on different i,j,k - you can pre-generate this on CPU
in order to do 1 you need to process multiple elements of array with one thread you can do it like this:
int i = get_thread_id(0); float my_sum = 0; for (; i < array_size; i += get_local_size(0)){ float val = in_r[i] * mtx_r + in_g[i] * mtx_g + in_b[i] * mtx_b; my_sum += pow(fabs(expect_r[i] - val), 2); }
after this you write my_sum for each thread into local memory and sum it up with reduce (O(log(n)) algorithm).
save result into global memory
alternatively if you need to compute i,j,k sequentially you can look up barrier and memory fence functions in OpenCL specification so you can use these instead of running two kernels, just remember to sum up everything in first step, write into global synchronize all threads, and then sum up again
There are two potential issues:
- Kernel launch overhead may be large if the work required to process each of your images is small. This is what you would address by combining the evaluation of multiple
i,j,k
values in a single kernel. - Serialization of the sum calculation for the RMSE. This is likely the larger issue, currently.
To address (2), notice that summation can be evaluated in parallel, but it is not as trivial as mapping a function separately over every pixel in your input. That is because summation requires communicating values between neighboring elements, rather than treating all elements independently. This pattern is commonly called a reduction.
PyOpenCL includes high-level support for common reductions. What you want here is a sum reduction: pyopencl.array.sum(array)
.
Looking further into how this is implemented in raw OpenCL, Apple's OpenCL docs include an example of parallel reduction for sum. The pieces most relevant to what you want to do are the kernel and the main
and create_reduction_pass_counts
functions of the host C program which runs the reduction.
精彩评论