How can I speed-up this loop (in C)?

2022-12-27 19:39 问答作者：

I'm trying to parallelize a convolution function in C. Here's the original function which convolves two arrays of 64-bit floats:

void convolve(const Float64 *in1,
              UInt32 in1Len,
              const Float64 *in2,
              UInt32 in2Len,
              Float64 *results)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        for (j = 0; j < in2Len; j++) {
            results[i+j] += in1[i] * in2[j];
        }
    }
}

In order to allow for concurrency (without semaphores), I created a function that computes the result for a particular position in the results array:

void convolveHelper(const Float64 *in1,
                    UInt32 in1Len,
                    const Float64 *in2,
                    UInt32 in2Len,
                    Float64 *result,
                    UInt32 outPosition)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {
        if (i > outPosition)
      开发者_如何学运维      break;
        j = outPosition - i;
        if (j >= in2Len)
            continue;
        *result += in1[i] * in2[j];
    }
}

The problem is, using convolveHelper slows down the code about 3.5 times (when running on a single thread).

Any ideas on how I can speed-up convolveHelper, while maintaining thread safety?

Convolutions in the time domain become multiplications in the Fourier domain. I suggest you grab a fast FFT library (like FFTW) and use that. You'll go from O(n^2) to O(n log n).

Algorithmic optimizations nearly always beat micro-optimizations.

The most obvious thing that could help would be to pre-compute the starting and ending indices of the loop, and remove the extra tests on i and j (and their associated jumps). This:

for (i = 0; i < in1Len; i++) {
   if (i > outPosition)
     break;
   j = outPosition - i;
   if (j >= in2Len)
     continue;
   *result += in1[i] * in2[j];
}

could be rewritten as:

UInt32 start_i = (in2Len < outPosition) ? outPosition - in2Len + 1 : 0;
UInt32 end_i = (in1Len < outPosition) ? in1Len : outPosition + 1;

for (i = start_i; i < end_i; i++) {
   j = outPosition - i;
   *result += in1[i] * in2[j];
}

This way, the condition j >= in2Len is never true, and the loop test is essentially the combination of the tests i < in1Len and i < outPosition.

In theory you also could get rid of the assignment to j and turn i++ into ++i, but the compiler is probably doing those optimizations for you already.

Instead of the two if statements in the loop, you can calculate the correct minimum/maximum values for i before the loop.
You're calculating each result position separately. Instead, you can split the results array into blocks and have each thread calculate a block. The calculation for a block will look like the convolve function.

Unless your arrays are very big, using a thread is unlikely to actually help much, since the overhead of starting a thread will be greater than the cost of the loops. However, let's assume that your arrays are large, and threading is a net win. In that case, I'd do the following:

Forget your current convolveHelper, which is too complicated and won't help much.

Split the interior of the loop into a thread function. I.e. just make

for (j = 0; j < in2Len; j++) {
    results[i+j] += in1[i] * in2[j];
}

into its own function that takes i as a parameter along with everything else.

Have the body of convolve simply launch threads. For maximum efficiency, use a semaphore to make sure that you never create more threads than you have cores.

Answer lies in Simple Math & NOT multi-threading (UPDATED)

Here's why...

consider ab + ac

U can optimise it as a*(b+c) (one multimplication less)

In ur case there are in2Len unnecessary multiplications in the inner-loop. Which can be eliminated.

Hence, modifying the code as follows should give us the reqd convolution:

(NOTE: The following code returns circular-convolution which must be unfolded to obtain the linear-convolution result.

void convolve(const Float64 *in1,
              UInt32 in1Len,
              const Float64 *in2,
              UInt32 in2Len,
              Float64 *results)
{
    UInt32 i, j;

    for (i = 0; i < in1Len; i++) {

        for (j = 0; j < in2Len; j++) {
            results[i+j] += in2[j];
        }

        results[i] = results[i] * in1[i];

    }
}

This should give U the max performance jump more than anything else. Try it our and see!!

GoodLUCK!!

CVS @ 2600Hertz

I finally figured out how to correctly precompute the start/end indexes (a suggestion given by both Tyler McHenry and interjay):

if (in1Len > in2Len) {
    if (outPosition < in2Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in1Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 1 + outPosition - in2Len;
        end = outPosition + 1;
    }
} else {
    if (outPosition < in1Len - 1) {
        start = 0;
        end = outPosition + 1;
    } else if (outPosition >= in2Len) {
        start = 1 + outPosition - in2Len;
        end = in1Len;
    } else {
        start = 0;
        end = in1Len;
    }
}

for (i = start; i < end; i++) {
    *result = in1[i] * in2[outPosition - i];
}

Unfortunately, precomputing the indexes produces no noticeable decrease in execution time :(

Let the convolve helper work on larger sets, calculating multiple results, using a short outer loop.

The key in parallelization is to find a good balance between the distribution of work between threads. Do no use more threads than the number of CPU cores.

Split the work evenly between all threads. With this kind of problem, the complexity of each threads work should be the same.

继续阅读：c concurrency loops optimization performance

How can I speed-up this loop (in C)?

Answer lies in Simple Math & NOT multi-threading (UPDATED)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

Answer lies in Simple Math & NOT multi-threading (UPDATED)

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集 河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？