Improving the performance of C code
What is the most unorthodox way to improve the performance of C code? This is no-holds-barred! Everything goes including changing loop structures to gotos, hardcoding anything and everything, using case statements in bizarre ways etc. Don't worry at all about maintainability, readability etc.
p.s. This is practical... and I'm well aware of how to improve the performance of code in reasonable ways (improve algorithms, profile before you optimize, etc.)
In my experience the most unorthodox way of optimizing C code is to profile the application, identify slowly performing structures and/or db hits, and then design reasonable solutions around them using Big O analysis.
Duff's Device is the canonical example. It's so weird that Tom Duff admitted, "This code forms some sort of argument in [the debate about fall-through in case statements], but I'm not sure whether it's for or against".
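For reference, here is a sketch of Duff's Device as a byte copy (the function name and the zero-count guard are mine; the original assumed a nonzero count):

```c
#include <stddef.h>

/* Duff's Device: an 8-way unrolled copy that handles the leftover
   count % 8 bytes by jumping into the middle of the unrolled body,
   interleaving a switch with a do-while. */
void duff_copy(char *dst, const char *src, size_t count)
{
    if (count == 0)
        return;                      /* the classic version assumes count > 0 */
    size_t n = (count + 7) / 8;      /* number of times through the loop body */
    switch (count % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--n > 0);
    }
}
```

The switch jumps into the loop body to dispose of the remainder first, so no separate cleanup loop is needed.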
Abusing the constant 0x5f3759df to compute inverse square roots quickly has to rank pretty high...
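A sketch of that trick (the Quake III version used a pointer cast; memcpy is used here for the type pun to stay within strict-aliasing rules):

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root via the 0x5f3759df constant: one integer
   bit hack for the initial guess, then one Newton-Raphson step. */
float fast_rsqrt(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);        /* reinterpret the float's bits */
    i = 0x5f3759df - (i >> 1);       /* magic-constant initial guess */
    float y;
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - half * y * y);   /* one Newton iteration refines it */
    return y;
}
```

The result is accurate to within a fraction of a percent; a second Newton step tightens it further if needed.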
Use inline assembly?
Seriously though, if by merely changing the C code you can improve performance, chances are you can do it cleanly.
A few exceptions:
1) If you rely on the alignment semantics of pointers of different types, you can often perform block operations through wider pointers that technically risk a bounds overrun, but in practice don't because of your system's alignment characteristics. So a memory copy can be performed by copying initial chars until the pointer is aligned, then doing the inner block through a long* pointer.
2) It may be possible to copy stack frames in clever ways if you know the memory order in which your compiler assigns local variables. This may allow you to implement co-routines which the language doesn't otherwise support. Coroutines are often a simpler and faster way of implementing some kinds of loop control.
3) Unions are always a bit "hacky" however you use them. It's a way of implementing polymorphism with fairly loose type checking.
4) Use of the C preprocessor as a way of auto-generating code is usually very difficult to debug and read. As such people tend to avoid this.
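A minimal sketch of exception (1). Note this is nonportable by design: it assumes src and dst share alignment and that unaligned word access is tolerated, as the text warns.

```c
#include <stddef.h>
#include <stdint.h>

/* Word-wise copy: align the destination byte-by-byte, then move
   word-sized chunks, then finish the tail byte-by-byte. */
void word_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    /* leading bytes until dst is word-aligned */
    while (n > 0 && ((uintptr_t)dst % sizeof(unsigned long)) != 0) {
        *dst++ = *src++;
        n--;
    }
    /* inner block, one word at a time */
    while (n >= sizeof(unsigned long)) {
        *(unsigned long *)dst = *(const unsigned long *)src;
        dst += sizeof(unsigned long);
        src += sizeof(unsigned long);
        n -= sizeof(unsigned long);
    }
    /* trailing bytes */
    while (n-- > 0)
        *dst++ = *src++;
}
```

In practice, your libc's memcpy already does this (and more), usually better.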
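As an illustration of exception (4), the "X-macro" pattern generates an enum and its matching name table from a single list, at the cost of readability (the names here are illustrative):

```c
/* One master list of items... */
#define COLORS \
    X(RED)     \
    X(GREEN)   \
    X(BLUE)

/* ...expanded once into enum constants... */
#define X(name) COLOR_##name,
enum color { COLORS COLOR_COUNT };
#undef X

/* ...and again into a parallel string table, guaranteed in sync. */
#define X(name) #name,
static const char *color_names[] = { COLORS };
#undef X
```

Adding an entry to COLORS updates both the enum and the table; debugging the expansion, however, is as painful as the text says.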
Profile your code, find the slow spots, and use inline assembly to optimize them.
1) Loop unrolling. You save a jump, comparison, and increment every iteration if you don't actually loop.
2) Avoid double-indirection. It's usually faster to perform arithmetic than a retrieval, so a[y*height + x] is usually faster than a[y][x]. Plus a one-dimensional array of size MxN saves M (or N) words' worth of pointers compared to a rectangular matrix of dimensions MxN.
3) Use ridiculous assembly optimizations whenever possible. For instance, on the x86 architecture, you can use the XCHG instruction to exchange two values in one operation instead of the normal temp=a; a=b; b=temp; pattern. (BSWAP, by contrast, reverses the byte order within a single register, which is handy for endian conversion rather than swapping two variables.)
And of course, don't forget:
4) Don't do bounds checking or error handling.
That having been said, I'd avoid all of these except (2) in practice.
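Point (2) is the one that holds up best. A minimal sketch of flat-array indexing (the helper names are illustrative):

```c
#include <stdlib.h>

/* One contiguous allocation instead of an array of row pointers. */
double *make_matrix(size_t rows, size_t cols)
{
    return malloc(rows * cols * sizeof(double));
}

/* Row-major access: one multiply-add, no pointer chase per row. */
static inline double get(const double *m, size_t cols, size_t y, size_t x)
{
    return m[y * cols + x];
}

static inline void set(double *m, size_t cols, size_t y, size_t x, double v)
{
    m[y * cols + x] = v;
}
```

Besides saving the row-pointer loads, the single block is friendlier to the cache and to prefetchers.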
You're looking for an unorthodox, no holds-barred, yet general-purpose solution to optimizing C?
Rewrite it in assembly language.
For point 3 in Dathan's answer above, another way of swapping: you can swap variables in an unconventional manner using xor.
int x = 3, y = 4; x = x ^ y; y = y ^ x; x = x ^ y;
Now x and y are swapped! :)
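Wrapped as a function, the trick needs an aliasing guard: if both pointers name the same object, the first xor zeroes it and the value is lost (and on modern CPUs a plain temporary swap is usually at least as fast anyway):

```c
/* XOR swap with an aliasing guard; swapping an object with itself
   via x ^= x would otherwise zero it. */
void xor_swap(int *a, int *b)
{
    if (a != b) {
        *a ^= *b;
        *b ^= *a;
        *a ^= *b;
    }
}
```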
Another thing: when you are dividing something by 2, it is better to use the shift-right operator. The same could be said for multiplying by 2: shift left.
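A sketch of the equivalence. Note it only holds cleanly for unsigned (or non-negative) values: for negative signed numbers, x / 2 rounds toward zero while x >> 1 rounds toward negative infinity, and any modern compiler already emits the shift for you where it is safe.

```c
/* Shift-based halving and doubling, exact for unsigned values. */
unsigned half(unsigned x)  { return x >> 1; }
unsigned twice(unsigned x) { return x << 1; }
```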
In the old Borland C compiler, there was a _stklen variable which you could assign in order to reduce stack size and code. I haven't seen anything like that nowadays, as compiler technology has advanced since.
Where you would call malloc and then zero the memory yourself, it can be better to call calloc instead, as it returns memory already initialized to zero.
Using the ternary operator instead of an if/else statement was apparently faster; I suspect compiler writers have since become smarter about machine-code generation. I simply cannot provide proof of that, but it held true back when Borland C 3.01 ruled the roost.
Inlining code with assembly routines.
I like this question topic as it reminds me of the old days when memory was precious and we had to squeeze a quart into a pint pot using the hocus-pocus tricks of x86 code. Thanks for posting this question, Mr.Database.
Your compiler is almost certainly better at optimizing than your ugly attempts would give you. Most of the historical little tricks are now pointless. People ignoring readability and maintainability tend to write code that ends up less efficient because the real optimizations are made more difficult.
When code has been optimized in all ways possible and still needs performance gain, rewriting critical portions in ASM is the best hope to have any effect.
There is nothing unorthodox left to do for C code performance. All of the effective techniques have been "orthodoxed".
The best I've found is to use a profiler with access to CPU performance counters and pay special attention to cache and branch misses. Add cache prefetches wherever you can and remove unpredictable branches wherever you can.
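A sketch of the prefetch half of that advice, using GCC/Clang's __builtin_prefetch (the 16-element lookahead is a guess you would tune with the profiler):

```c
#include <stddef.h>

/* Sum an array while hinting the element 16 slots ahead into cache.
   __builtin_prefetch(addr, 0, 1): 0 = prefetch for read, 1 = low
   temporal locality. It is only a hint, so correctness is unaffected. */
long sum_with_prefetch(const long *a, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], 0, 1);
        total += a[i];
    }
    return total;
}
```

On a simple linear scan like this the hardware prefetcher usually wins anyway; the technique pays off on irregular access patterns the hardware can't predict.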
Don't bother with loop unrolling. If the branch is predictable it is almost free. Let the compiler worry about it.
On some very parallel architectures like IA64 it can be faster to unroll a loop all the way to the end. One example of this is avoiding the C string functions. Use memset to zero a string array, memcpy to set the string and memcmp to compare the entire array against another similar array. This can use 64-bit loads, never has to check for the zero terminator and can be optimized to not loop or branch at all if using a "small" array size of 64 or 128. The memxxx() functions are usually compiler built-ins and very optimized.
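A sketch of that fixed-size-string idea (the struct and helper names are mine): a zero-padded 64-byte slot assigned and compared with the memxxx() functions, which the compiler can inline into wide loads with no terminator scan.

```c
#include <string.h>

#define SLOT 64

/* A "string" that is really a fixed, zero-padded 64-byte block. */
typedef struct { char buf[SLOT]; } name_t;

static void name_set(name_t *n, const char *s)
{
    memset(n->buf, 0, SLOT);       /* zero the whole slot first */
    strncpy(n->buf, s, SLOT - 1);  /* copy, final byte stays zero */
}

static int name_eq(const name_t *a, const name_t *b)
{
    /* one fixed-size compare: no per-byte terminator check */
    return memcmp(a->buf, b->buf, SLOT) == 0;
}
```

Because the padding is always zeroed, two slots holding the same text compare equal byte-for-byte.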
I hear a lot of answers of the form "Try doing X, Y, or Z", but that's like saying "Here, have a fish, and eat for a day".
I would rather teach you how to fish - for performance problems. People who say "Profile First" are on the right track but (IMHO) are far too timid.
Here's an example of aggressive performance tuning.
Here's a short explanation of why it works.
Here's a long explanation of why it works.
That will teach you to fish by showing you how to find out where the fish are and how big they are. Once you find them, you can cook them up (fix them) in many wonderful ways. The great thing is, once you find and dispose of one fish (performance problem), the others get bigger and easier to catch.
Duff's Device & Carmack's Fast InvSqrt
In DSP applications, it's still worth going to assembly language to get the best performance out of the SIMD instructions that C compilers don't do very well with. But that's not really a "C" solution.
Something I do pretty often is use curve-fitting software to replace functions with approximations that are faster to calculate. Sometimes LUTs are still faster than doing a bunch of calculations, but not as often as they used to be.
See this chapter, It’s a plain Wonderful Life by Abrash (it's about 5 pages: click 'Next' at the bottom of each screen).
Summary (some quotes from the article):
- Table-driven magic (huge lookup table and incredible state machine)
- An approach to performance programming that operates at a more efficient, tightly integrated level than you may ever see again
- Astonishing economy of effort
Inline Assembly.