How to profile a C++ function at assembly level?
I have a function that is the bottleneck of my program. It requires no access to memory and requires only calculation. It is the inner loop and called many times so any small gains to this function is big wins for my program.
I come from a background in optimizing SPU code on the PS3 where you take a SPU program and run it through a pipeline analyzer where you can put each assembly statement in its own column and 开发者_高级运维you minimize the amount of cycles the function takes. Then you overlay loops so you can minimized pipeline dependencies even more. With that program and a list of all the cycles each assembly instruction takes I could optimize much better then the compiler ever could.
On a different platform it had events I could register (cache misses, cycles, etc.) and I could run the function and track CPU events. That was pretty nice as well.
Now I'm doing a hobby project on Windows using Visual Studio C++ 2010 w/ a Core i7 Intel processor. I don't have the money to justify paying the large cost of VTune.
My question:
How do I profile a function at the assembly level for an Intel processor on Windows?
I want to compile, view disassembly, get performance metrics, adjust my code and repeat.
There are some great free tools available, mainly AMD's CodeAnalyst (from my experiences on my i7 vs my phenom II, its a bit handicapped on the Intel processor cause it doesn't have access to the direct hardware specific counters, though that might have been bad config).
However, a lesser know tool is the Intel Architecture Code Analyser (which is free like CodeAnalyst), which is similar to the spu tool you described, as it details latency, throughput and port pressure (basically the request dispatches to the ALU's, MMU and the like) line by line for your programs assembly. Stan Melax gave a nice talk on it and x86 optimization at this years GDC, under the title "hotspots, flops and uops: to-the-metal cpu optimization".
Intel also has a few more tools in the same vein as IACA, avaibale under the performance tuning section of their experimental/what-if code site, such as PTU, which is (or was) an experimental evolution of VTune, from what I can see, its free.
Its also a good idea to have read the intel optimization manual before diving into this.
EDIT: as Ben pointed out, the timings might not be correct for older processors, but that can be easily made up for using Agner Fog's Optimization manuals, which also contain many other gems.
You might want to try some of the utilities included in valgrind like callgrind or cachegrind.
Callgrind can do profiling and dump assembly.
And kcachegrind is a nice GUI, and will show the dumps including assembly and number of hits per instruction etc.
From you description it sounds like you problem may be embarrassingly parallel, have you considered using ppl's parallel_for
?
精彩评论