Optimization flags with g++
I am using g++ to compile C++ code for a scientific simulation. Currently I am using the -O3 and -funroll-loops flags. I noticed a big difference between -O0, -O1, -O2, and -O3, but almost no difference with -funroll-loops.
Do you have any suggestions or tricks that would help me squeeze out even better performance?
Thanks!
Edit, as suggested in the comments:
I am asking here about 'pure' compile-time optimization, i.e. whether there are cleverer things to do than just -O3. The compute-intensive part of the code manipulates blitz::array objects in huge loops.
Edit2: I actually deal with a lot of floating-point (double) math.
Without seeing the code, we can only give you generic advice that applies to a broad range of problems.
- Try GCC's profile-guided optimisation. Compile an instrumented build with -fprofile-generate, do a few test runs with a realistic workload, then use the profile data from those runs when building the final binary (-fprofile-use). GCC can then make better guesses about which branches are taken and optimise the code accordingly.
- Try to parallelise your code if you can. You mentioned you have loops over big data sets; this works if the work items are independent and can be partitioned. For example, keep a work queue and a worker thread pool sized to the number of CPUs, dispatch work items to the queue instead of processing them sequentially, and let the pool threads grab items off the queue and process them in parallel.
- Look at the size of the data units your code works with and try to fit them in as few L1 cache lines as possible (a line is usually 64 bytes). For example, if you have 66-byte data items and your cache line size is 64 bytes, it may be worth packing the structure, or otherwise squeezing it, to fit in 64 bytes.
It's hard to tell without knowing the code you want to accelerate. Also, knowing the code may allow us to make improvements to it, to make it faster.
As general advice, try specifying the -march option to tell GCC which CPU model you are targeting (e.g. -march=native for the build machine). You can try -fomit-frame-pointer if you make many function calls (especially recursive ones). If you use floating-point math heavily and stay away from corner cases (e.g. NaNs, FP exceptions), you can try -ffast-math. That last one may buy you a huge speedup, but in some cases it produces wrong results. Analyse your code to ensure it is safe.
I don't have enough mojo to comment or to edit Alex B's answer so I will answer instead.
After you turn on profiling and run your application per Alex B's answer, actually look at the profile information for hot spots where your application spends most of its time. If you find any, examine the code to see what you can do to make them less hot.
Appropriate algorithm replacement will generally outperform any automated optimization by a wide margin.