Programmatically counting cache faults
I need to evaluate the time taken by a C++ function in a bunch of hypothesis about memory hierarchy efficiency (e.g: time taken when we have a cache miss, a cache hit or page fault when reading a portion of an array), so I'd like to have some libraries that let me count the cache miss / page faults in order to be capable of auto-generating a performance summary.
I know there are some tools like cachegrind that gives some r开发者_开发技巧elated statistics on a given application execution, but I'd like a library, as I've already said.
edit Oh, I forgot: I'm using Linux and I'm not interested in portability, it's an academic thing.
Any suggestion is welcome!
Most recent CPUs (both AMD and Intel) have performance monitor registers that can be used for this kind of job. For Intel, they're covered in the programmer's reference manual, volume 3B, chapter 30. For AMD, it's in the BIOS and Kernel Developer's Guide.
Either way, you can count things like cache hits, cache misses, memory requests, data prefetches, etc. They have pretty specific selectors, so you could get a count of (for example) the number of reads on the L2 cache to fill lines in the L1 instruction cache (while still excluding L2 reads to fill lines in the L1 data cache).
There is a Linux kernel module to give access to MSRs (Model-specific registers). Offhand, I don't know whether it gives access to the performance monitor registers, but I'd expect it probably does.
It looks like now there is exactly what I was searching for: perf_event_open.
It lets you do interesting things like initializing/enabling/disabling some performance counters for subsequently fetching their values through an uniform and intuitive API (it gives you a special file descriptor which hosts a struct containing the previously requested informations).
It is a linux-only solution and the functionalities varies depending on the kernel version, so be careful :)
Intel VTune is a performance tuning tool that does exactly what you are asking for;
Of course it works with Intel processors, as it access the internal processor counters, as explained by Jerry Coffin
, so this probably not work on an AMD processor.
It expose literally undreds of counters, like cache hit/misses, branch prediction rates, etc. the real issue with it is understanding which counters to check ;)
The cache misses cannot be just counted easily. Most tools or profilers simulate the memory access by redirecting memory accesses to a function that provides this feature. That means these kind of tools instrument the code at all places where a memory access is done and makes your code run awfully slowly. This is not what your intent is I guess.
However depending on the hardware you might have some other possibilities. But even if this is the case the OS should support it (because otherwise you would get system global stats not the ones related to a process or thread)
EDIT: I could find this interesting article that may help you: http://lwn.net/Articles/417979/
精彩评论