C Loop Optimizations with Conditionals on Looping Variable
Apologies if this is asked in the archives. I found some similar questions, but none that seemed exactly what I wanted.
The distilled version of the problem I'm working on is as follows. I have a series of calculations to perform that will store values in four (very large) arrays: A, B, C, and D. These calculations are interdependent; for example, calculating B[i] could require using A[i-1]. I am capable of expressing everything in a single loop, but that results in edge cases where, for certain values of i, only some of the calculations should be performed. For example:
for (i = 0; i < end; i++)
{
    if (i == 0) {
        /* calculate A[i+1] and B[i+1] */
    } else if (i == end - 1) {
        /* calculate C[i-1] and D[i-1] */
    } else {
        /* calculate A[i+1], B[i+1], C[i-1], D[i-1] */
    }
}
For performance reasons, I would like to avoid having conditionals inside my loop. Evaluating a conditional would be cheap compared to the calculations, but possibly not negligible. My question is whether a compiler might reliably expand that to:
/* calculate A[1] and B[1] */
for (i = 1; i < end - 1; i++)
{
    /* calculate A[i+1], B[i+1], C[i-1], D[i-1] */
}
/* calculate C[end-2] and D[end-2] */
I gather from the archives that the compiler would break apart my loop if the conditional expressions were constant, but here they depend on i, which in principle could be changed by some of my calculations. Will it detect that I'm not tampering with the iteration variable and thus break it apart in the sensible way?
Extra information, in case you decide to answer the question by suggesting a better way to do things:
Originally the code was written with four loops, one calculating the elements of each array. This was the most intuitive way to write the code, but it was inefficient: since calculating elements in one array depended on elements in the other arrays, each of the four loops had to read all four arrays from memory. These arrays do not fit in cache, so that access pattern is far from optimal, and I needed code that loops through my arrays only once.
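Schematically, the original four-loop version looked something like this (calcA() and friends are hypothetical stand-ins for the real formulas):

for (i = 1; i < end; i++) A[i] = calcA(B[i-1], C[i-1]);  /* pass 1 streams A, B, C */
for (i = 1; i < end; i++) B[i] = calcB(A[i-1], D[i-1]);  /* pass 2 streams B, A, D */
for (i = 1; i < end; i++) C[i] = calcC(A[i-1]);          /* pass 3 */
for (i = 1; i < end; i++) D[i] = calcD(B[i-1]);          /* pass 4 */

Each pass streams the full arrays through the cache again; the fused single loop above touches each element only once.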
I'm also aware that I can break my loop apart by hand, and indeed that is how things are currently done. However these calculations involve nontrivial formulas (and I can't afford the performance hit of calling a function during every iteration of this loop), so breaking apart the code caused code duplication that is not only very hard to read, but almost unmaintainable the next time my formulas get tweaked (which they will...)
Thanks in advance!
To answer your question in a broader sense: when optimization is critical, the profiler is your friend. We developers are notoriously poor at guessing where in our code the processor spends most of its time. A profiler will show you exactly where the "expensive" operations are, so you can concentrate on fixing the areas that will give you the most significant improvements.
I'm curious about your statement that you "can't afford the performance hit of calling a function during every iteration of this loop...." How do you know that? Many modern processors are optimized for function calls, particularly if you can pass a pointer (to a struct?) instead of many individual arguments. If your calculations are indeed "nontrivial", then the overhead of a function call may be negligible.
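For instance, here is a minimal sketch of what I mean (the struct layout and all the names are made up for illustration):

#include <stddef.h>

typedef struct {
    double *A, *B, *C, *D;   /* the four large arrays */
    size_t  end;
} IterState;                 /* hypothetical name */

/* One pointer argument instead of many: cheap to pass, and easy to extend
 * the next time the formulas grow another input. */
static void middleStep(IterState *s, size_t i)
{
    /* ... calculate s->A[i+1], s->B[i+1], s->C[i-1], s->D[i-1] ... */
}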
Other things to think about:
- As an experiment, re-index your calculations so they operate exactly on i itself, rather than i-1 or i+1, as much as possible. So, for example, use A[i], B[i], C[i-2], and D[i-2]. I'd be surprised if that made a significant improvement using an optimizing compiler, but you never know....
- Precompute anything you can.
- Try to break your calculations into smaller components that are either constant or common, as James Greenhalgh suggested, so they might be reused.
- Can you rewrite your equations more efficiently? Analysis of the math may lead you to a shortcut: perhaps you can rewrite some (or all) of the iterations in closed form.
- Can you replace your equations altogether with something simpler? For example, suppose you need to sort a set of locations by their distance from your house. The distance calculation requires subtraction, squaring, addition, and a square root. In general the square root is by far the most expensive operation. But if you need only the relative distances, you can skip the square root altogether: sorting by the square of the distance generates the same sort order! (There's a small sketch of this after the list.)
- If inlining isn't possible, can you define your functions (or their components) as efficient macros, so at least you can avoid repeating the code? As you mentioned, clipboard inheritance is the mortal enemy of maintainability. (A macro sketch also follows the list.)
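Here is a minimal sketch of the squared-distance trick (the Point type and all the names are invented for illustration):

#include <stdlib.h>

typedef struct { double x, y; } Point;

static Point house;   /* the reference location */

/* Compare by squared distance: same ordering as by distance, no sqrt(). */
static int by_dist_sq(const void *pa, const void *pb)
{
    const Point *a = pa, *b = pb;
    double dax = a->x - house.x, day = a->y - house.y;
    double dbx = b->x - house.x, dby = b->y - house.y;
    double da = dax * dax + day * day;
    double db = dbx * dbx + dby * dby;
    return (da > db) - (da < db);
}

/* usage: qsort(points, n, sizeof *points, by_dist_sq); */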
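And here is how the macro route might look applied to your loop (the formula bodies are placeholders):

/* Each formula lives in exactly one place: no call overhead, no duplication. */
#define CALC_AB(i) do {                               \
        A[(i)+1] = /* formula for A goes here */ 0;   \
        B[(i)+1] = /* formula for B goes here */ 0;   \
    } while (0)

#define CALC_CD(i) do {                               \
        C[(i)-1] = /* formula for C goes here */ 0;   \
        D[(i)-1] = /* formula for D goes here */ 0;   \
    } while (0)

CALC_AB(0);                        /* peeled first iteration */
for (i = 1; i < end - 1; i++) {
    CALC_AB(i);
    CALC_CD(i);
}
CALC_CD(end - 1);                  /* peeled last iteration */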
If nothing else, going through this exercise will teach you oodles about the way your compiler and the C language work. Good luck!
In addition to profiling, I'd suggest looking over the code the compiler is actually emitting (cc -S *.c for many compilers). This should tell you how (or if) the loop is being unrolled, as well as show which loop invariants are being moved out of the loop. Don't forget to specify the same optimization settings as you do for your regular compilations.
My question is whether a compiler might reliably expand that to...
The answer you are looking for is no. Compiler optimisation is greatly overrated, especially if you aren't using MSVC and targeting Wintel.
As already suggested, inspect the assembler output yourself to check. Personally, I would play it safe and write the code so the compiler doesn't have to do the work for you. In fact, when you need to do something special for the first and last iterations, I think refactoring loops like this is more readable anyway: instead of having to think about the conditionals and what they mean, the order of execution follows the order in which the code is written, which is much more natural (and closer to written language).
Now, you seem to suggest this was unreadable. The correct solution, IMO, is not to stick all the code in the loop, but instead to move the unreadable blocks of code into well-named functions with strong inlining hints (__forceinline etc.). Then the iteration can look like this:
prepareForIteration( A, B );
for( int i = 1; i < ( end - 1 ); ++i )
{
iterationStep( i, A, B, C, D );
}
finaliseIteration( end, C, D );
Of course, since you know what the code actually does, I'm sure you can find better names...
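For completeness, here is a sketch of how the helpers and inline hints might be spelled (the array types and the portability macro are my assumptions; __forceinline is the MSVC spelling, while GCC and clang use an attribute):

#ifdef _MSC_VER
#  define ALWAYS_INLINE static __forceinline
#else
#  define ALWAYS_INLINE static inline __attribute__((always_inline))
#endif

ALWAYS_INLINE void iterationStep(int i, double *A, double *B,
                                 double *C, double *D)
{
    /* calculate A[i+1], B[i+1], C[i-1], D[i-1]; the formulas live here, once */
}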
The best thing you can do is what you already did: Loop through all four arrays at the same time rather than through each one separately. Cache-friendly access patterns are by far the most important micro-optimization when dealing with large arrays.
For the example you give, I would try the following if you are using gcc or clang-llvm:
for (i = 0; i < end; i++)
{
    if (__builtin_expect(i == 0, 0)) {
        /* calculate A[i+1] and B[i+1] */
    } else if (__builtin_expect(i == end - 1, 0)) {
        /* calculate C[i-1] and D[i-1] */
    } else {
        /* calculate A[i+1], B[i+1], C[i-1], D[i-1] */
    }
}
This is a "hint" to the compiler -- which it will pass along to the CPU -- to help with branch prediction. On a modern CPU, a mispredicted branch can cost hundreds of cycles. (On the other hand, on a modern CPU, the logic for predicting each branch based on how often it was taken in the past is pretty sophisticated. So your mileage may vary.) A correctly predicted branch costs 1 cycle or even less, so this is the next thing I would worry about.
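If you want to keep the source portable, the usual idiom is to hide the builtin behind a macro (a common convention, e.g. in the Linux kernel, rather than a standard API):

#ifdef __GNUC__
#  define unlikely(x) __builtin_expect(!!(x), 0)
#else
#  define unlikely(x) (x)
#endif

/* then: if (unlikely(i == 0)) ... */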
Note that classical "optimization" techniques like loop unrolling are almost certainly worthless and probably even counter-productive.
Finally, if your algorithm can be vectorized, then that is your next step. The advantage of moving conditionals out of the loop entirely is that it may make it easier for the compiler to vectorize it automatically. But if the loop is non-trivial, you may have to write the vectorized code by hand; try a search for "SSE intrinsics".
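To give a flavor of what hand-vectorized code looks like, here is a minimal SSE sketch; it assumes the arrays hold floats and uses a made-up formula, so treat it as a shape rather than a drop-in:

#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical body: D[i-1] = A[i+1] + B[i+1] * C[i-1], four lanes at a time. */
for (i = 1; i + 4 <= end - 1; i += 4) {
    __m128 a = _mm_loadu_ps(&A[i + 1]);
    __m128 b = _mm_loadu_ps(&B[i + 1]);
    __m128 c = _mm_loadu_ps(&C[i - 1]);
    _mm_storeu_ps(&D[i - 1], _mm_add_ps(a, _mm_mul_ps(b, c)));
}
/* finish any leftover iterations with the scalar loop */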
As others have suggested, start by using a profiler and the -S option (or equivalent) to your compiler. Good luck.