What are some tips for optimizing the assembly code generated by a compiler?
I am currently in the process of writing a compiler and I seem to have run into some problems getting it to output code that executes in a decent timeframe.
A brief overview of the compiler:
7Basic is a compiler that aims to compile 7Basic code directly into machine code for the target architecture开发者_JAVA技巧 / platform. Currently 7Basic generates x86 assembly given a source file.
The problem is that the assembly code generated by the compiler is slow and inefficient.
For example, this code (which compiles down to this assembly code) takes nearly 80.47 times longer to execute than the equivalent C code.
Part of the problem is that the compiler is generating code like the following:
push eax
push 5000000
pop ebx
pop eax
Instead of the more logical:
mov ebx,5000000
...which accomplishes the same thing.
My question is: what are some techniques to avoid this sort of problem? The parser basically uses recursion to parse the expressions, so the code generated reflects this.
One technique is called peephole optimisation. This takes an iterative approach to cleaning up assembly code. Essentially you scan through the assembly code, looking at only two or three instructions at a time, and see whether you can reduce them into something simpler. For example,
push eax ; 1
push 5000000 ; 2
pop ebx ; 3
pop eax ; 4
The first step would look at lines 2 and 3, and replace that with:
push eax ; 1
mov ebx,5000000 ; 2a
pop eax ; 4
Second, you might consider 1 and 4, and if eax
is not touched in the middle instruction, remove them both, leaving what you want:
mov ebx,5000000 ; 2a
You might want to consider generating C code rather than assembly and then let a C compiler (e.g. gcc) handle the code generation for you. There's no point trying to re-invent the wheel.
I'm taking a compiler course at the moment. I've made some great progress in outputting efficient code, but you should look into the dragon book. It is a rite of passage. You should take a look at the code from Jeremy Bennett's book, Introduction to Compiling Techniques: A First Course Using ANSI C, LEX, and YACC. The book itself is very hard to find, but you can download the source code for the compiler free from
http://www.jeremybennett.com/publications/download.html
The code generator file (cg.c) has some functions for generating fairly optimized code. The target language isn't i386, but you should consider looking at how he describes registers and keeps track of where symbol table entries are stored. His output assembly could be further optimized, but it provides a great base for producing code that could rival the output from gcc -S in some regards.
One general optimization would be to subtract the stack pointer to reserve space for all local and temporary variables upon entering a function. Then just reference the offsets instead of constantly pushing/popping.
For example, if your intermediate code is a list of quadruples, you should simply iterator through it for each function and keep track of the maximum offset. Then output the line to subtract the amount of space on the stack. This removes the need to push so many variables on and off. To remove the need to pop them, you can simply mov their value from their offset on the stack into a register. This will significantly improve performance.
There are any number of reasons a particular code generator may emit the instruction sequence you list. The most likely is that the code generator you're using just isn't trying very hard to emit optimum code.
This pattern of emitted code suggests to me that your code generator doesn't know that the x86 has "mov immediate" instructions that embed the constant value into the instruction stream directly. The x86 encoding for opcodes with immediate values can get a little complicated (variable length R/M bytes) but this is already required if you want to use many of the x86 instructions.
This emitted code also suggests that the code generator doesn't know that EAX is not modified by the EBX instructions. This feels like the codegen is template driven rather than discrete logic.
This kind of codegen happens when the compiler's internal intermediate representation of operations is not detailed enough to represent all the facets of the target architecture. This is particularly true if the code generator architecture was originally designed for a RISC instruction set but has been repurposed to emit x86 instructions. RISC architecture tend to have very few and very simple load, store, and operate reg/reg instructions, whereas the x86 instruction set has evolved organically over decades to include a wide variety of opcodes that operate directly on memory, inline constants into the instructions, and a whole mess of other stuff. If the compiler's intermediate representation (expression graph) is wired for RISC, it will be difficult to make it grok the wide variety and subtleties of x86.
Peephole optimizations will help, but one obvious issue is that your compiler doesn't do register allocation!
http://en.wikipedia.org/wiki/Register_allocation
If you want to get serious performance levels, you're doing to have to look into that. It can be done in a single pass if you do it greedily "on the fly".
精彩评论