开发者

Producing the fastest possible executable

I have a very large program which I have been compiling under visual studio (v6 then migrated to 2008). I need the executable to run as fast as possible. The program spends most of its time processing integers of various sizes and does very little IO.

Obviously I will select maximum optimization, but it seems that there are a variety of things that can be done which don't come under the heading of optimization which do still affect the speed of the executable. For example selecting the __fastcall calling convention or setting structure member alignment to a large number.

So my question is: Are there other compiler/linker options I should be using to make the program faster which are not controlled from the "optimization" page of the "propertie开发者_开发问答s" dialog.

EDIT: I already make extensive use of profilers.


Another optimization option to consider is optimizing for size. Sometimes size-optimized code can run faster than speed-optimized code due to better cache locality.

Also, beyond optimization operations, run the code under a profiler and see where the bottlenecks are. Time spent with a good profiler can reap major dividends in performance (especially it if gives feedback on the cache-friendliness of your code).

And ultimately, you'll probably never know what "as fast as possible" is. You'll eventually need to settle for "this is fast enough for our purposes".


Profile-guided optimization can result in a large speedup. My application runs about 30% faster with a PGO build than a normal optimized build. Basically, you run your application once and let Visual Studio profile it, and then it is built again with optimization based on the data collected.


1) Reduce aliasing by using __restrict.

2) Help the compiler in common subexpression elimination / dead code elimination by using __pure.

3) An introduction to SSE/SIMD can be found here and here. The internet isn't exactly overflowing with articles about the topic, but there's enough. For a reference list of intrinsics, you can search MSDN for 'compiler intrinsics'.

4) For 'macro parallelization', you can try OpenMP. It's a compiler standard for easy task parallelization -- essentially, you tell the compiler using a handful of #pragmas that certain sections of the code are reentrant, and the compiler creates the threads for you automagically.

5) I second interjay's point that PGO can be pretty helpful. And unlike #3 and #4, it's almost effortless to add in.


You're asking which compiler options can help you speed up your program, but here's some general optimisation tips:

1) Ensure your algorithms are appropriate for the job. No amount of fiddling with compiler options will help you if you write an O(shit squared) algorithm.

2) There's no hard and fast rules for compiler options. Sometimes optimise for speed, sometimes optimise for size, and make sure you time the differences!

3) Understand the platform you are working on. Understand how the caches for that CPU operate, and write code that specifically takes advantage of the hardware. Make sure you're not following pointers everywhere to get access to data which will thrash the cache. Understand the SIMD operations available to you and use the intrinsics rather than writing assembly. Only write assembly if the compiler is definitely not generating the right code (i.e. writing to uncached memory in bad ways). Make sure you use __restrict on pointers that will not alias. Some platforms prefer you to pass vector variables by value rather than by reference as they can sit in registers - I could go on with this but this should be enough to point you in the right direction!

Hope this helps,

-Tom


Forget micro-optimization such as what you are describing. Run your application through a profiler (there is one included in Visual Studio, at least in some editions). The profiler will tell you where your application is spending its time.

Micro-optimization will rarely give you more than a few percentage points increase in performance. To get a really big boost, you need to identify areas in your code where inefficient algorithms and/or data structures are being used. Focus on those, for example by changing algorithms. The profiler will help identify these problem areas.


Check which /precision mode you are using. Each one generates quite different code and you need to choose based on what accuracy is required in your app. Our code needs precision (geometry, graphics code) but we still use /fp:fast (C/C++ -> Code generation options).

Also make sure you have /arch:SSE2, assuming your deployment covers processors that all support SSE2. This will result is quite a big difference in performance, as compile will use very few cycles. Details are nicely covered in the blog SomeAssemblyRequired

Since you are already profiling, I would suggest loop unrolling if it is not happening. I have seen VS2008 not doing it more frequently (templates, references etc..)

Use __forceinline in hotspots if applicable.

Change hotspots of your code to use SSE2 etc as your app seems to be compute intense.


You should always address your algorithm and optimise that before relying on compiler optimisations to get you significant improvements in most cases.

Also you can throw hardware at the problem. Your PC may already have the necessary hardware lying around mostly unused: the GPU! One way of improving performance of some types of computationally expensive processing is to execute it on the GPU. This is hardware specific but NVIDIA provide an API for exactly that: CUDA. Using the GPU is likely to get you far greater improvement than using the CPU.


I agree with what everyone has said about profiling. However you mention "integers of various sizes". If you are doing much arithmetic with mismatched integers a lot of time can be wasted in changing sizes, shorts to ints for example, when the expressions are evaluated.

I'll throw in one more thing too. Probably the most significant optimisation is in choosing and implementing the best algorithm.


You have three ways to speed up your application:

  1. Better algorithm - you've not specified the algorithm or the data types (is there an upper limit to integer size?) or what output you want.

  2. Macro parallelisation - split the task into chunks and give each chunk to a separate CPU, so, on a two core cpu divide the integer set into two sets and give half to each cpu. This depends on the algorithm you're using - not all algorithms can be processed like this.

  3. Micro parallelisation - this is like the above but uses SIMD. You can combine this with point 2 as well.


You say the program is very large. That tells me it probably has many classes in a hierarchy.

My experience with that kind of program is that, while you are probably assuming that the basic structure is just about right, and to get better speed you need to worry about low-level optimization, chances are very good that there are large opportunities for optimization that are not of the low-level kind.

Unless the program has already been tuned aggressively, there may be room for massive speedup in the form of mid-stack operations that can be done differently. These are usually very innocent-looking and would never grab your attention. They are not cases of "improve the algorithm". They are usually cases of "good design" that just happen to be on the critical path.

  • Unfortunately, you cannot rely on profilers to find these things, because they are not designed to look for them.

This is an example of what I'm talking about.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜