How can I speed-up this loop (in C)?
I'm trying to parallelize a convolution function in C. Here's the original function which convolves two arrays of 64-bit floats:
void convolve(const Float64 *in1,
UInt32 in1Len,
const Float64 *in2,
UInt32 in2Len,
Float64 *results)
{
UInt32 i, j;
for (i = 0; i < in1Len; i++) {
for (j = 0; j < in2Len; j++) {
results[i+j] += in1[i] * in2[j];
}
}
}
In order to allow for concurrency (without semaphores), I created a function that computes the result for a particular position in the results
array:
void convolveHelper(const Float64 *in1,
UInt32 in1Len,
const Float64 *in2,
UInt32 in2Len,
Float64 *result,
UInt32 outPosition)
{
UInt32 i, j;
for (i = 0; i < in1Len; i++) {
if (i > outPosition)
开发者_如何学运维 break;
j = outPosition - i;
if (j >= in2Len)
continue;
*result += in1[i] * in2[j];
}
}
The problem is, using convolveHelper
slows down the code about 3.5 times (when running on a single thread).
Any ideas on how I can speed-up convolveHelper
, while maintaining thread safety?
Convolutions in the time domain become multiplications in the Fourier domain. I suggest you grab a fast FFT library (like FFTW) and use that. You'll go from O(n^2) to O(n log n).
Algorithmic optimizations nearly always beat micro-optimizations.
The most obvious thing that could help would be to pre-compute the starting and ending indices of the loop, and remove the extra tests on i
and j
(and their associated jumps). This:
for (i = 0; i < in1Len; i++) {
if (i > outPosition)
break;
j = outPosition - i;
if (j >= in2Len)
continue;
*result += in1[i] * in2[j];
}
could be rewritten as:
UInt32 start_i = (in2Len < outPosition) ? outPosition - in2Len + 1 : 0;
UInt32 end_i = (in1Len < outPosition) ? in1Len : outPosition + 1;
for (i = start_i; i < end_i; i++) {
j = outPosition - i;
*result += in1[i] * in2[j];
}
This way, the condition j >= in2Len
is never true, and the loop test is essentially the combination of the tests i < in1Len
and i < outPosition
.
In theory you also could get rid of the assignment to j
and turn i++
into ++i
, but the compiler is probably doing those optimizations for you already.
Instead of the two
if
statements in the loop, you can calculate the correct minimum/maximum values fori
before the loop.You're calculating each result position separately. Instead, you can split the
results
array into blocks and have each thread calculate a block. The calculation for a block will look like theconvolve
function.
Unless your arrays are very big, using a thread is unlikely to actually help much, since the overhead of starting a thread will be greater than the cost of the loops. However, let's assume that your arrays are large, and threading is a net win. In that case, I'd do the following:
- Forget your current
convolveHelper
, which is too complicated and won't help much. Split the interior of the loop into a thread function. I.e. just make
for (j = 0; j < in2Len; j++) { results[i+j] += in1[i] * in2[j]; }
into its own function that takes i
as a parameter along with everything else.
- Have the body of
convolve
simply launch threads. For maximum efficiency, use a semaphore to make sure that you never create more threads than you have cores.
Answer lies in Simple Math & NOT multi-threading (UPDATED)
Here's why...
consider ab + ac
U can optimise it as a*(b+c) (one multimplication less)
In ur case there are in2Len unnecessary multiplications in the inner-loop. Which can be eliminated.
Hence, modifying the code as follows should give us the reqd convolution:
(NOTE: The following code returns circular-convolution which must be unfolded to obtain the linear-convolution result.
void convolve(const Float64 *in1,
UInt32 in1Len,
const Float64 *in2,
UInt32 in2Len,
Float64 *results)
{
UInt32 i, j;
for (i = 0; i < in1Len; i++) {
for (j = 0; j < in2Len; j++) {
results[i+j] += in2[j];
}
results[i] = results[i] * in1[i];
}
}
This should give U the max performance jump more than anything else. Try it our and see!!
GoodLUCK!!
CVS @ 2600Hertz
I finally figured out how to correctly precompute the start/end indexes (a suggestion given by both Tyler McHenry and interjay):
if (in1Len > in2Len) {
if (outPosition < in2Len - 1) {
start = 0;
end = outPosition + 1;
} else if (outPosition >= in1Len) {
start = 1 + outPosition - in2Len;
end = in1Len;
} else {
start = 1 + outPosition - in2Len;
end = outPosition + 1;
}
} else {
if (outPosition < in1Len - 1) {
start = 0;
end = outPosition + 1;
} else if (outPosition >= in2Len) {
start = 1 + outPosition - in2Len;
end = in1Len;
} else {
start = 0;
end = in1Len;
}
}
for (i = start; i < end; i++) {
*result = in1[i] * in2[outPosition - i];
}
Unfortunately, precomputing the indexes produces no noticeable decrease in execution time :(
Let the convolve helper work on larger sets, calculating multiple results, using a short outer loop.
The key in parallelization is to find a good balance between the distribution of work between threads. Do no use more threads than the number of CPU cores.
Split the work evenly between all threads. With this kind of problem, the complexity of each threads work should be the same.
精彩评论