Assembly Performance Tuning
I am writing a compiler (more for fun than anything else), but I want to try to make it as efficient as possible. For example, I was told that on Intel architecture the use of any register other than EAX for performing math incurs a cost (presumably because it swaps into EAX to do the actual piece of math). Here is at least one source that states the possibility (http://www.swansontec.com/sregisters.html).
I would like to verify and measure these differences in performance characteristics. Thus, I have written this program in C++:
#include "stdafx.h"
#include <intrin.h>
#include <iostream>
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    __int64 startval;
    __int64 stopval;
    unsigned int value; // Keep the value to keep it from being optimized out
    startval = __rdtsc(); // Get the CPU Tick Counter using assembly RDTSC opcode
    // Simple Math: a = (a << 3) + 0x0054E9
    _asm {
        mov ebx, 0x1E532 // Seed
        shl ebx, 3
        add ebx, 0x0054E9
        mov value, ebx
    }
    stopval = __rdtsc();
    __int64 val = (stopval - startval);
    cout << "Result: " << value << " -> " << val << endl;
    int i;
    cin >> i;
    return 0;
}
I tried this code swapping eax and ebx but I'm not getting a "stable" number. I would hope that the test would be deterministic (the same number every time) because it's so short that it's unlikely a context switch is occurring during the test. As it stands there is no statistical difference but the number fluctuates so wildly that it would be impossible to make that determination. Even if I take a large number of samples the number is still impossibly varied.
I'd also like to test xor eax, eax vs mov eax, 0, but have the same problem.
Is there any way to do these kinds of performance tests on Windows (or anywhere else)? When I used to program Z80 for my TI-Calc I had a tool where I could select some assembly and it would tell me how many clock cycles it would take to execute the code. Can that not be done with our newfangled modern processors?
EDIT: There are a lot of answers indicating to run the loop a million times. To clarify, this actually makes things worse. The CPU is much more likely to context switch and the test becomes about everything but what I am testing.
To even have a hope of repeatable, deterministic timing at the level that RDTSC gives, you need to take some extra steps. First, RDTSC is not a serializing instruction, so it can be executed out of order, which will usually render it meaningless in a snippet like the one above.
You normally want to use a serializing instruction, then your RDTSC, then the code in question, another serializing instruction, and the second RDTSC.
Nearly the only serializing instruction available in user mode is CPUID. That, however, adds one more minor wrinkle: CPUID is documented by Intel as requiring varying amounts of time to execute -- the first couple of executions can be slower than others.
As such, the normal timing sequence for your code would be something like this:
xor eax, eax
cpuid
xor eax, eax
cpuid
xor eax, eax
cpuid               ; Intel says by the third execution, the timing will be stable.
rdtsc               ; read the clock
push eax            ; save the start time (low half)
push edx            ; save the start time (high half)
mov ebx, 0x1E532    ; seed; the test sequence starts here
shl ebx, 3
add ebx, 0x0054E9
mov value, ebx
xor eax, eax        ; serialize
cpuid
rdtsc               ; get end time
pop ecx             ; get start time back (high half)
pop ebx             ; low half; EBX instead of EBP so inline code keeps its frame pointer
sub eax, ebx        ; find end-start; the result is in EDX:EAX
sbb edx, ecx
We're starting to get close, but there's one last point that's difficult to deal with using inline code on most compilers: there can also be some effects from crossing cache lines, so you normally want to force your code to be aligned to a 16-byte (paragraph) boundary. Any decent assembler will support that, but inline assembly in a compiler usually won't.
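For readers who would rather stay in C++ than hand-write the sequence, here is a minimal sketch of the same CPUID / RDTSC / code / CPUID / RDTSC ordering using compiler intrinsics. The names `serialize`, `test_sequence`, `Timing` and `time_once` are made up for illustration, and the non-MSVC branch assumes a GCC- or Clang-style compiler targeting x86. The compiler is still free to schedule the arithmetic differently than written, and the reported count includes the cost of the second CPUID, so treat the number as a rough indication only.

```cpp
#include <cstdint>

#if defined(_MSC_VER)
#include <intrin.h>
// CPUID is a serializing instruction: earlier instructions complete first.
static inline void serialize() {
    int regs[4];
    __cpuid(regs, 0);
}
#else
#include <x86intrin.h>
// Assumption: GCC/Clang extended asm on an x86 target.
static inline void serialize() {
    unsigned a = 0, b, c = 0, d;
    __asm__ __volatile__("cpuid"
                         : "+a"(a), "=b"(b), "+c"(c), "=d"(d)
                         :
                         : "memory");
    (void)b;
    (void)d;
}
#endif

// The arithmetic from the question: a = (a << 3) + 0x0054E9.
static inline std::uint32_t test_sequence(std::uint32_t seed) {
    return (seed << 3) + 0x0054E9u;
}

struct Timing {
    std::uint32_t value;   // result of the test sequence
    std::uint64_t cycles;  // TSC ticks between the two reads
};

Timing time_once(std::uint32_t seed) {
    // Three CPUIDs up front, since the first executions can be slower.
    serialize();
    serialize();
    serialize();
    const std::uint64_t start = __rdtsc();
    // The volatile store keeps the result from being optimized away.
    volatile std::uint32_t sink = test_sequence(seed);
    serialize();  // wait for the test sequence before reading the clock again
    const std::uint64_t stop = __rdtsc();
    return {sink, stop - start};
}
```

Calling `time_once(0x1E532)` returns the computed value together with a tick count; comparing counts across many calls is more informative than trusting any single one.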
Having said all that, I think you're wasting your time. As you can guess, I've done a fair amount of timing at this level, and I'm quite certain what you've heard is an outright myth. In reality, all recent x86 CPUs use a set of what are called "rename registers". To make a long story short, this means the name you use for a register doesn't really matter much -- the CPU has a much larger set of registers (e.g., around 40 for Intel) that it uses for the actual operations, so your putting a value in EBX vs. EAX has little effect on the register that the CPU is really going to use internally. Either could be mapped to any rename register, depending primarily on which rename registers happen to be free when that instruction sequence starts.
I'd suggest taking a look at Agner Fog's "Software optimization resources" - in particular, the assembly and microarchitecture manuals (2 and 3), and the test code, which includes a rather more sophisticated framework for measurements using the performance monitor counters.
The Z80, and possibly the TI, had the advantage of synchronized memory access, no caches, and in-order execution of the instructions. That made it a lot easier to calculate the number of clocks per instruction.
On current x86 CPUs, instructions using AX or EAX are not faster per se, but some instructions might be shorter than the instructions using other registers. That might just save a byte in the instruction cache!
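As a concrete illustration of the size difference, the sketch below writes out the machine-code bytes for the `add` from the question with EAX and with EBX as the destination. The byte sequences follow the standard `05 id` (accumulator short form) and `81 /0 id` encodings; the array names and `bytes_saved` are hypothetical, chosen only for this example.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// add eax, 0x000054E9 -> opcode 05 followed by a 32-bit immediate.
constexpr std::array<std::uint8_t, 5> add_eax_imm32 = {
    0x05, 0xE9, 0x54, 0x00, 0x00};

// add ebx, 0x000054E9 -> opcode 81 /0, ModRM byte C3 selects EBX,
// then the same 32-bit immediate.
constexpr std::array<std::uint8_t, 6> add_ebx_imm32 = {
    0x81, 0xC3, 0xE9, 0x54, 0x00, 0x00};

// Number of bytes the EAX form saves over the EBX form.
constexpr std::size_t bytes_saved() {
    return add_ebx_imm32.size() - add_eax_imm32.size();
}

static_assert(bytes_saved() == 1, "the EAX form is one byte shorter");
```

The saving only applies when the immediate needs the full 32 bits. For small immediates that fit in a signed byte, both registers can use the `83 /0 ib` form, which is the same length for either one.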
Go here and download the Architectures Optimization Reference Manual.
There are many myths. I think the EAX claim is one of them.
Also note that you can't talk anymore about 'which instruction is faster'. On today's hardware there is no one-to-one relation between instructions and execution time. Some instructions are preferred to others not because they are 'faster' but because they break dependencies between other instructions.
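One way to see why dependencies can matter more than the cost of any single instruction is to compare one long chain of additions with several independent chains. The sketch below is only an illustration with made-up function names. Both functions compute the same sum, but in the second the four accumulators do not depend on each other, so an out-of-order CPU can overlap the additions. An optimizing compiler may unroll or vectorize either version, so no particular speed difference is guaranteed.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One accumulator: every add has to wait for the previous add's result.
std::uint64_t sum_single_chain(const std::vector<std::uint32_t>& data) {
    std::uint64_t total = 0;
    for (std::uint32_t x : data) {
        total += x;
    }
    return total;
}

// Four accumulators: four independent dependency chains that are
// combined only once, after the loop.
std::uint64_t sum_four_chains(const std::vector<std::uint32_t>& data) {
    std::uint64_t a = 0, b = 0, c = 0, d = 0;
    const std::size_t n = data.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += data[i];
        b += data[i + 1];
        c += data[i + 2];
        d += data[i + 3];
    }
    for (; i < n; ++i) {  // leftover elements when n is not a multiple of 4
        a += data[i];
    }
    return a + b + c + d;
}
```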
I believe that if there's a difference nowadays it will only be because some of the legacy instructions have a shorter encoding for the variant that uses EAX. To test this, repeat your test case a million times or more before you compare cycle counts.
You're getting ridiculous variance because rdtsc does not serialize execution. Depending on inaccessible details of the execution state, the instructions you're trying to benchmark may in fact be executed entirely before or after the interval between the rdtsc instructions! You will probably get better results if you insert a serializing instruction (such as cpuid) immediately after the first rdtsc and immediately before the second. See this Intel tech note (PDF) for gory details.
Starting your program is going to take much longer than running 4 assembly instructions once, so any difference from your assembly will drown in the noise. Running the program many times won't help, but it would probably help if you run the 4 assembly instructions inside a loop, say, a million times. That way the program start-up happens only once.
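A rough sketch of how the loop idea can be combined with repeated sampling follows. Interruptions such as context switches only ever make a sample slower, so taking the minimum (or the median) over many short runs tends to filter them out rather than average them in. This sketch swaps in `std::chrono::steady_clock` instead of RDTSC to stay portable; `Summary`, `summarize` and `collect_samples` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Summary {
    std::int64_t min;     // fastest sample: least disturbed by outside noise
    std::int64_t median;  // upper median when the sample count is even
};

// Reduce noisy samples to two robust statistics.
// Takes the vector by value because it sorts its own copy.
Summary summarize(std::vector<std::int64_t> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    return {samples.front(), samples[samples.size() / 2]};
}

// Time `iterations` runs of the question's arithmetic, `repeats` times over.
// Each returned sample is the elapsed time of one inner loop in nanoseconds.
std::vector<std::int64_t> collect_samples(int repeats, int iterations) {
    using clock = std::chrono::steady_clock;
    std::vector<std::int64_t> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int r = 0; r < repeats; ++r) {
        // volatile keeps the compiler from deleting the loop.
        volatile std::uint32_t value = 0x1E532u;
        const auto start = clock::now();
        for (int i = 0; i < iterations; ++i) {
            value = (value << 3) + 0x0054E9u;
        }
        const auto stop = clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
                .count());
    }
    return samples;
}
```

For example, `summarize(collect_samples(1000, 1000))` gives a minimum and a median for a thousand short runs; comparing the minimum of two variants is usually more stable than comparing their means.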
There can still be variation. One especially annoying thing that I've experienced myself is that your CPU might have a feature like Intel's Turbo Boost where it will dynamically adjust its speed based on things like the temperature of your CPU. This is more likely to be the case on a laptop. If you've got that, then you will have to turn it off for any benchmark results to be reliable.
I think what the article tries to say about the EAX register is that since some operations can only be performed on EAX, it's better to use it from the start. This was very true with the 8086 (MUL comes to mind), but the 386 made the ISA much more orthogonal, so it's much less true these days.