x86 opcode alignment references and guidelines
I'm generating some opcodes dynamically in a JIT compiler and I'm looking for guidelines for opcode alignment.
1) I've read comments that briefly "recommend" alignment by adding nops after calls.
2) I've also read about using nops to optimize instruction sequences for parallelism.
3) I've read that alignment of ops is good for "cache" performance.
Usually these comments don't give any supporting references. It's one thing to read a blog or a comment that says "it's a good idea to do such and such," but it's another to actually write a compiler that implements specific op sequences and realize that most material online, especially blogs, is not useful for practical application. So I'm a believer in finding things out myself (disassembly, etc., to see what real-world apps do). This is one case where I need some outside info.
I notice compilers will usually start an odd-byte-length instruction immediately after whatever instruction sequence came before, so the compiler is not taking any special care in most cases. I see a "nop" here or there, but usually it seems nop is used sparingly, if at all. How critical is opcode alignment? Can you provide references for cases that I can actually use for implementation? Thanks.
I would recommend against inserting nops except for the alignment of branch targets. On some specific CPUs, branch prediction algorithms may penalize control transfers to control transfers, and so a nop may be able to act as a flag and invert the prediction, but otherwise it is unlikely to help.
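If you do decide to pad, it is worth using the multi-byte NOP encodings rather than runs of single-byte 0x90, so the padding decodes as a handful of instructions instead of dozens. Below is a minimal sketch of how a JIT might align its next emission point (say, a branch target); `buf`, `pos`, and `alignment` are illustrative names of my own, and the encodings are the canonical 0F 1F forms documented in the Intel manuals and Agner Fog's tables:

    /* Pad the emission cursor up to the next `alignment` boundary
       (alignment must be a power of two) using multi-byte NOPs. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static size_t align_emit_cursor(uint8_t *buf, size_t pos, size_t alignment)
    {
        /* Canonical 1..8-byte NOP encodings. */
        static const uint8_t nops[8][8] = {
            { 0x90 },
            { 0x66, 0x90 },
            { 0x0F, 0x1F, 0x00 },
            { 0x0F, 0x1F, 0x40, 0x00 },
            { 0x0F, 0x1F, 0x44, 0x00, 0x00 },
            { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 },
            { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 },
            { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
        };
        size_t pad = (alignment - (pos & (alignment - 1))) & (alignment - 1);
        while (pad > 0) {
            size_t n = pad > 8 ? 8 : pad;   /* longest NOP first */
            memcpy(buf + pos, nops[n - 1], n);
            pos += n;
            pad -= n;
        }
        return pos;   /* new, aligned cursor */
    }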
Modern CPUs are going to translate your ISA ops into micro-ops anyway. This may make classical alignment techniques less important, as presumably the micro-operation transcoder will leave out nops and change both the size and alignment of the secret true machine ops.
However, by the same token, optimizations based on first principles should do little or no harm.
The theory is that one makes better use of the cache by starting loops at cache line boundaries. If a loop were to start in the middle of a cache line, then the first half of the cache line would be unavoidably loaded and kept loaded during the loop, and this would be wasted space in the cache if the loop is longer than 1/2 of a cache line.
Also, for branch targets, the initial load of the cache line loads the largest forward window of instruction stream when the target is aligned.
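To put numbers on that, here is a sketch assuming 64-byte cache lines; the helper names are just for illustration:

    #include <stddef.h>

    #define LINE 64

    /* Bytes of useful instruction stream at/after `target` fetched by
       the cache line that contains it. */
    static size_t forward_window(size_t target) {
        return LINE - (target % LINE);
    }

    /* Number of cache lines a `len`-byte block at `start` touches. */
    static size_t lines_spanned(size_t start, size_t len) {
        return (start % LINE + len + LINE - 1) / LINE;
    }

For example, a 40-byte loop starting at offset 48 touches lines_spanned(48, 40) == 2 lines and gets a forward_window(48) of only 16 bytes; moved to offset 64, it touches a single line and the full 64-byte window is loop code.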
Regarding separating straight-line instructions that are not branch targets with nops, there are few reasons for doing this on modern CPUs. (There was a time when RISC machines had delay slots, which often led to inserting nops after control transfers.) Decoding the instruction stream is easy to pipeline, and if an architecture has odd-byte-length ops, you can be assured that they are decoded reasonably.
The best source for all these micro optimizations is Agner Fog's x86 optimization manuals. Those documents should have everything you need, and then some. :)
One thing I can think of is aligning a loop so that the loop code doesn't cross any cache-line boundary, i.e. the loop is < 64 bytes and starts at an address divisible by 64. The entire loop would then fit in a single cache line and leave more cache lines available for other things. I doubt that would matter in a real-world program though, no matter how "hot" that particular loop happens to be.
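If you wanted to act on that in a JIT, one plausible rule is to pad only when it actually buys something. A sketch, with hypothetical names, again assuming 64-byte lines:

    #include <stdbool.h>
    #include <stddef.h>

    /* Pad the loop head to a 64-byte boundary only when the body would
       otherwise straddle a line boundary, would fit in one line once
       aligned, and the padding cost stays small. */
    static bool should_align_loop(size_t head, size_t body_len, size_t max_pad)
    {
        size_t pad = (64 - (head % 64)) % 64;
        bool crosses = (head % 64) + body_len > 64;
        bool fits_after = body_len <= 64;
        return crosses && fits_after && pad <= max_pad;
    }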