AMD64 -- nopw assembly instruction?
In this compiler output, I'm trying to understand how machine-code encoding of the nopw
instruction works:
000000000开发者_运维问答04004d0 <main>:
4004d0: eb fe jmp 4004d0 <main>
4004d2: 66 66 66 66 66 2e 0f nopw %cs:0x0(%rax,%rax,1)
4004d9: 1f 84 00 00 00 00 00
There is some discussion about "nopw" at http://john.freml.in/amd64-nopl. Can anybody explain the meaning of 4004d2-4004e0? From looking at the opcode list, it seems that 66 ..
codes are multi-byte expansions. I feel I could probably get a better answer to this here than I would unless I tried to grok the opcode list for a few hours.
That asm output is from the following (insane) code in C, which optimizes down to a simple infinite loop:
long i = 0;
main() {
recurse();
}
recurse() {
i++;
recurse();
}
When compiled with gcc -O2
, the compiler recognizes the infinite recursion and turns it into an infinite loop; it does this so well, in fact, that it actually loops in the main()
without calling the recurse()
function.
editor's note: padding functions with NOPs isn't specific to infinite loops. Here's a set of functions with a range of lengths of NOPs, on the Godbolt compiler explorer.
The 0x66
bytes are an "Operand-Size Override" prefix. Having more than one of these is equivalent to having one.
The 0x2e
is a 'null prefix' in 64-bit mode (it's a CS: segment override otherwise - which is why it shows up in the assembly mnemonic).
0x0f 0x1f
is a 2 byte opcode for a NOP that takes a ModRM byte
0x84
is ModRM byte which in this case codes for an addressing mode that uses 5 more bytes.
Some CPUs are slow to decode instructions with many prefixes (e.g. more than three), so a ModRM byte that specifies a SIB + disp32 is a much better way to use up an extra 5 bytes than five more prefix bytes.
AMD K8 decoders in Agner Fog's microarch pdf:
Each of the instruction decoders can handle three prefixes per clock cycle. This means that three instructions with three prefixes each can be decoded in the same clock cycle. An instruction with 4 - 6 prefixes takes an extra clock cycle to decode.
Essentially, those bytes are one long NOP instruction that will never get executed anyway. It's in there to ensure that the next function is aligned on a 16-byte boundary, because the compiler emitted a .p2align 4
directive, so the assembler padded with a NOP. gcc's default for x86 is
-falign-functions=16
. For NOPs that will be executed, the optimal choice of long-NOP depends on the microarchitecture. For a microarchitecture that chokes on many prefixes, like Intel Silvermont or AMD K8, two NOPs with 3 prefixes each might have decoded faster.
The blog article the question linked to ( http://john.freml.in/amd64-nopl ) explains why the compiler uses a complicated single NOP instruction instead of a bunch of single-byte 0x90 NOP instructions.
You can find the details on the instruction encoding in AMD's tech ref documents:
- http://developer.amd.com/documentation/guides/pages/default.aspx#manuals
Mainly in the "AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions". I'm sure Intel's technical references for the x64 architecture will have the same information (and might even be more understandable).
The assembler (not the compiler) pads code up to the next alignment boundary with the longest NOP instruction it can find that fits. This is what you're seeing.
I would guess this is just the branch-delay instruction.
I belive that the nopw is junk - i is never read in your program, and there are thus no need to increment it.
精彩评论