
Why is the LLVM execution engine faster than compiled code?

I have a compiler which targets LLVM, and I provide two ways to run the code:

  1. Run it automatically. This mode compiles the code to LLVM IR and uses the ExecutionEngine JIT to compile it to machine code on the fly and run it, without ever generating an output file.
  2. Compile it and run it separately. This mode outputs an LLVM .bc file, which I manually optimise (with opt), compile to native assembly (with llc), assemble to machine code and link (with gcc), and run (see the command sketch after this list).

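For reference, the two modes boil down to roughly the following commands (the filenames are placeholders, and lli, LLVM's stock JIT runner, stands in here for my embedded use of the ExecutionEngine):

    # Mode #1, approximately: JIT-compile and run the bitcode in one step
    lli program.bc

    # Mode #2: optimise, generate native assembly, assemble and link, run
    opt -std-compile-opts program.bc -o program-opt.bc
    llc -O3 program-opt.bc -o program.s
    gcc program.s -o program
    ./program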
I was expecting approach #2 to be faster than approach #1, or at least the same speed, but running a few speed tests, I am surprised to find that #2 consistently runs about twice as slow. That is a huge speed difference.

Both cases are running the same LLVM source code. With approach #1, I haven't yet bothered to run any LLVM optimisation passes (which is why I was expecting it to be slower). With approach #2, I am running opt with -std-compile-opts and llc with -O3, to maximise optimisation, yet it isn't getting anywhere near #1. Here is an example run of the same program:

  • #1 without optimisation: 11.833s
  • #2 without optimisation: 22.262s
  • #2 with optimisation (-std-compile-opts and -O3): 18.823s

Is the ExecutionEngine doing something special that I don't know about? Is there any way for me to optimise the compiled code to achieve the same performance as the ExecutionEngine JIT?


It is normal for a VM with a JIT to run some applications faster than a compiled application does. A VM with a JIT is effectively a simulator for a virtual computer that also runs a compiler at execution time. Because both tasks are built into the VM, the simulator can feed runtime information to the compiler, which can then recompile the code to run more efficiently. That information is not available to statically compiled code.
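One concrete instance of this with LLVM: the JIT compiles for the exact CPU it is running on (on x86 it auto-detects the host's subtarget features, such as the available SSE level), while llc targets a fairly generic CPU unless told otherwise. Here is a sketch of how to hand the static pipeline the same information; note that the -mcpu=native spelling is an assumption on my part, and only newer llc releases accept it:

    # tune static codegen for the host CPU (-mcpu=native: newer llc only)
    llc -O3 -mcpu=native program-opt.bc -o program.s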

This effect has also been noted with Java VMs and with Python's PyPy VM, among others.


Another issue is code alignment, among other optimizations. Modern CPUs are so complex that it is hard to predict which techniques will make the final binary execute faster.

As a real-life example, consider Google's Native Client; I mean the original NaCl compilation approach, not the one involving LLVM (as far as I know, there is currently a direction toward supporting both "nativeclient" and modified "LLVM bitcode" code).

As you can see in the presentations (check youtube.com) or in papers such as Native Client: A Sandbox for Portable, Untrusted x86 Native Code, even though their alignment technique makes the code larger, in some cases aligning instructions (for example, padding with noops) results in better cache hit rates.

Aligning instructions with noops and reordering instructions are well-known techniques in parallel computing, and they show their impact here as well.
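If you want a rough view of how much noop padding a given binary carries, disassembling it and counting noop instructions is a quick, if crude, probe (the binary name is a placeholder):

    # count disassembly lines containing a nop variant (nop, nopw, nopl, ...)
    objdump -d program | grep -c nop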

I hope this answer gives an idea of how many circumstances can influence the execution speed of code, that there are many possible reasons for differences between pieces of code, and that each of them needs investigation. Nevertheless, it is an interesting topic, so if you find more details, don't hesitate to edit your answer and let us know in a postscript what you have found (maybe a link to a whitepaper or dev blog with new findings :) ). Benchmarks are always welcome; take a look at http://llvm.org/OpenProjects.html#benchmark .
