x86 opcode alignment references and guidelines
I'm generating some opcodes dynamically in a JIT compiler and I'm looking for guidelines for opcode alignment.
1) I've read comments that briefly "recommend" alignment by adding nops after calls.
2) I've also read about using nops to optimize instruction sequences for parallelism.
3) I've read that alignment of ops is good for "cache" performance.
Usually these comments don't give any supporting references. It's one thing to read a blog or a comment that says "it's a good idea to do such and such," but it's another to actually write a compiler that implements specific op sequences and realize that most material online, especially blogs, is not useful for practical application. So I'm a believer in finding things out myself (disassembly, etc., to see what real-world apps do). This is one case where I need some outside info.
I notice compilers will usually start an odd-byte-length instruction immediately after whatever instruction sequence came before, so the compiler is not taking any special care in most cases. I see a "nop" here or there, but usually it seems nop is used sparingly, if at all. How critical is opcode alignment? Can you provide references for cases that I can actually use for implementation? Thanks.
I would recommend against inserting nops except for the alignment of branch targets. On some specific CPUs, branch prediction algorithms may penalize control transfers to control transfers, and so a nop may be able to act as a flag and invert the prediction, but otherwise it is unlikely to help.
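If you do decide to pad, it is worth using the multi-byte NOP encodings rather than runs of single-byte 0x90, so the padding decodes as a handful of instructions instead of dozens. Below is a minimal sketch of how a JIT might align its next emission point (say, a branch target); `buf`, `pos`, and `alignment` are illustrative names of my own, and the encodings are the canonical 0F 1F forms documented in the Intel manuals and Agner Fog's tables:

    /* Pad the emission cursor up to the next `alignment` boundary
       (alignment must be a power of two) using multi-byte NOPs. */
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    static size_t align_emit_cursor(uint8_t *buf, size_t pos, size_t alignment)
    {
        /* Canonical 1..8-byte NOP encodings. */
        static const uint8_t nops[8][8] = {
            { 0x90 },
            { 0x66, 0x90 },
            { 0x0F, 0x1F, 0x00 },
            { 0x0F, 0x1F, 0x40, 0x00 },
            { 0x0F, 0x1F, 0x44, 0x00, 0x00 },
            { 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00 },
            { 0x0F, 0x1F, 0x80, 0x00, 0x00, 0x00, 0x00 },
            { 0x0F, 0x1F, 0x84, 0x00, 0x00, 0x00, 0x00, 0x00 },
        };
        size_t pad = (alignment - (pos & (alignment - 1))) & (alignment - 1);
        while (pad > 0) {
            size_t n = pad > 8 ? 8 : pad;   /* longest NOP first */
            memcpy(buf + pos, nops[n - 1], n);
            pos += n;
            pad -= n;
        }
        return pos;   /* new, aligned cursor */
    }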
Modern CPUs are going to translate your ISA ops into micro-ops anyway. This may make classical alignment techniques less important, as presumably the micro-operation transcoder will leave out nops and change both the size and alignment of the secret true machine ops.
However, by the same token, optimizations based on first principles should do little or no harm.
The theory is that one makes better use of the cache by starting loops at cache line boundaries. If a loop were to start in the middle of a cache line, then the first half of the cache line would be unavoidably loaded and kept loaded during the loop, and this would be wasted space in the cache if the loop is longer than 1/2 of a cache line.
Also, for branch targets, the initial load of the cache line loads the largest forward window of instruction stream when the target is aligned.
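To put numbers on that, here is a sketch assuming 64-byte cache lines; the helper names are just for illustration:

    #include <stddef.h>

    #define LINE 64

    /* Bytes of useful instruction stream at/after `target` fetched by
       the cache line that contains it. */
    static size_t forward_window(size_t target) {
        return LINE - (target % LINE);
    }

    /* Number of cache lines a `len`-byte block at `start` touches. */
    static size_t lines_spanned(size_t start, size_t len) {
        return (start % LINE + len + LINE - 1) / LINE;
    }

For example, a 40-byte loop starting at offset 48 touches lines_spanned(48, 40) == 2 lines and gets a forward_window(48) of only 16 bytes; moved to offset 64, it touches a single line and the full 64-byte window is loop code.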
Regarding separating straight-line instructions that are not branch targets with nops, there are few reasons for doing this on modern CPUs. (There was a time when RISC machines had delay slots, which often led to inserting nops after control transfers.) Decoding the instruction stream is easy to pipeline, and if an architecture has odd-byte-length ops, you can be assured that they are decoded reasonably.
The best source for all these micro optimizations is Agner Fog's x86 optimization manuals. Those documents should have everything you need, and then some. :)
One thing I can think of is aligning a loop so that the loop code doesn't cross any cache-line boundary, i.e. the loop is < 64 bytes and starts at an address divisible by 64. The entire loop would then fit in a single cache line and leave more cache lines available for other things. I doubt that would matter in a real-world program though, no matter how "hot" that particular loop happens to be.
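If you wanted to act on that in a JIT, one plausible rule is to pad only when it actually buys something. A sketch, with hypothetical names, again assuming 64-byte lines:

    #include <stdbool.h>
    #include <stddef.h>

    /* Pad the loop head to a 64-byte boundary only when the body would
       otherwise straddle a line boundary, would fit in one line once
       aligned, and the padding cost stays small. */
    static bool should_align_loop(size_t head, size_t body_len, size_t max_pad)
    {
        size_t pad = (64 - (head % 64)) % 64;
        bool crosses = (head % 64) + body_len > 64;
        bool fits_after = body_len <= 64;
        return crosses && fits_after && pad <= max_pad;
    }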