Machine code alignment

Posted 2023-02-15 01:21

I am trying to understand the principles of machine code alignment. I have an assembler implementation that generates machine code at run time. I align every branch destination to 16 bytes, but that does not look like the optimal choice: I've noticed that if I remove the alignment, the same code sometimes runs faster. I think it has something to do with cache line width: some instructions are split across a cache line boundary and the CPU stalls because of that. So if some alignment bytes are inserted in one place, they push instructions somewhere further along past a cache line boundary...
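The hypothesis above can be checked mechanically. Below is a minimal sketch, assuming a 64-byte cache line (typical for a Core i7, but worth confirming for the actual target). The helper names and the instruction lengths are made up for illustration; they are not part of any real assembler.

```python
CACHE_LINE = 64  # assumed line size in bytes


def padding_for(offset, alignment=16):
    """Bytes of padding needed to bring offset up to the next multiple of alignment."""
    return (-offset) % alignment


def crosses_line(offset, length, line=CACHE_LINE):
    """True if an instruction of `length` bytes starting at `offset` spans two lines."""
    return offset // line != (offset + length - 1) // line


def layout(lengths, start=0):
    """Start offset of each instruction when they are laid out back to back."""
    offsets = []
    pos = start
    for n in lengths:
        offsets.append(pos)
        pos += n
    return offsets


# Hypothetical instruction lengths in bytes, not taken from real code.
lengths = [5, 3, 7, 4]
for pad in (0, 6):
    offs = layout(lengths, start=50 + pad)
    split = [o for o, n in zip(offs, lengths) if crosses_line(o, n)]
    print("padding", pad, "-> instructions split across a line at", split)
```

With these made-up lengths, the 7-byte instruction at offset 58 straddles the boundary at 64 when there is no padding, and six bytes of padding move it to offset 64 so nothing is split. The same padding in a different spot could just as easily create a split, which matches the observation that alignment sometimes hurts.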

I was hoping to implement an automatic alignment procedure that processes the code as a whole and inserts alignment according to the specification of the CPU (cache line width, 32/64 bits and so on)...

Can someone give some hints about this procedure? As an example, the target could be an Intel Core i7 CPU on a 64-bit platform.

Thank you.

I'm not qualified to answer your question because this is such a vast and complicated topic. There are probably many more mechanisms in play here, other than cache line size.

However, I would like to point you to Agner Fog's site and the optimization manuals for compiler makers that you can find there. They contain a plethora of information on these kinds of subjects - cache lines, branch prediction and data/code alignment.

Paragraph (16-byte) alignment is usually the best. However, the padding bloats the code, which can push the target of a "local" (short) JMP out of range so that the jump is no longer local. It may also mean that less code fits in the cache. I would only align major segments of code; I would not align every tiny subroutine/JMP section.
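To make the "no longer local" point concrete, here is a rough sketch, assuming "local" means the 2-byte x86 short JMP (signed 8-bit displacement measured from the end of the instruction) as opposed to the 5-byte near JMP with a 32-bit displacement. The function names and offsets are hypothetical.

```python
def align_up(offset, alignment=16):
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & -alignment


def jmp_size(jmp_offset, target_offset):
    """Encoded size of a JMP: 2 bytes if a rel8 displacement reaches, else 5 (rel32)."""
    # rel8 is relative to the address of the byte after the 2-byte short JMP.
    disp = target_offset - (jmp_offset + 2)
    return 2 if -128 <= disp <= 127 else 5


jmp_at = 3      # hypothetical offset of the JMP instruction
target = 129    # hypothetical offset of its destination before alignment

print("unaligned target:", target, "JMP size:", jmp_size(jmp_at, target))
print("aligned target:  ", align_up(target), "JMP size:", jmp_size(jmp_at, align_up(target)))
```

In this example the destination moves from 129 to 144, the displacement grows from 124 to 139, and the jump has to switch to the longer encoding. That in turn shifts everything after the jump by three more bytes, so an automatic pass would need to iterate until the sizes stop changing.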

Not an expert, however... Branches to places that are not going to be in the instruction cache should benefit from alignment the most, because you'll read a whole cache line of instructions to fill the pipeline. Given that, forward branches will benefit on the first run of a function. Backward branches ("for" and "while" loops, for example) will probably not benefit, because the branch target and the following instructions have already been read into the cache. Do follow the links in Martin's answer.
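A back-of-the-envelope way to see the cold-branch argument, assuming 64-byte cache lines: count how many useful bytes the first fetched line delivers from the branch target onward, and how many lines the first stretch of code touches. The addresses and the 32-byte window are made up for illustration.

```python
def bytes_until_line_end(target, line=64):
    """Instruction bytes available from `target` to the end of its cache line."""
    return line - (target % line)


def lines_touched(target, nbytes, line=64):
    """Number of cache lines covered by `nbytes` of code starting at `target`."""
    return (target + nbytes - 1) // line - target // line + 1


# Hypothetical branch targets: unaligned, 16-byte aligned, 64-byte aligned.
for target in (0x103C, 0x1030, 0x1040):
    print(hex(target),
          "useful bytes in first line:", bytes_until_line_end(target),
          "lines for first 32 bytes:", lines_touched(target, 32))
```

A target at 0x103C gets only 4 useful bytes out of the first line and needs a second line almost immediately, while a target at 0x1040 gets all 64. Once the loop body is already cached, as with a backward branch, this miss cost does not apply, which is the distinction the answer draws.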

As mentioned previously, this is a very complex area. Agner Fog's site seems like a good place to visit. As to the complexities: I ran across Torbjörn Granlund's article "Improved Division by Invariant Integers", and in the code he uses to illustrate his new algorithm, the first instruction at what I guess is the main label is a nop (no operation). According to the commentary, it improves performance significantly. Go figure.

Tags: assembly cpu-architecture cpu-cache cpu-speed optimization
