How does bootstrapping work for gcc?

I was looking up the PyPy project (Python in Python), and started pondering the question of what runs the outer layer of Python. Surely, I conjectured, it can't be, as the old saying goes, "turtles all the way down"! After all, Python is not valid x86 assembly!

Soon I remembered the concept of bootstrapping, and looked up compiler bootstrapping. "Ok", I thought, "so it can be either written in a different language or hand compiled from assembly". In the interest of performance, I'm sure C compilers are just built up from assembly.

This is all well, but the question still remains, how does the computer get that assembly file?!

Say I buy a new cpu with nothing on it. During the first operation I wish to install an OS, which runs C. What runs the C compiler? Is there a miniature C compiler in the BIOS?

Can someone explain this to me?


Say I buy a new cpu with nothing on it. During the first operation I wish to install an OS, which runs C. What runs the C compiler? Is there a miniature C compiler in the BIOS?

I understand what you're asking... what would happen if we had no C compiler and had to start from scratch?

The answer is you'd have to start from assembly or hardware. That is, you can build a compiler either in software or in hardware. If there were no compilers in the whole world, these days you could probably do it faster in assembly; however, back in the day I believe compilers were in fact dedicated pieces of hardware. The Wikipedia article is somewhat short and doesn't back me up on that, but never mind.

The next question, I guess, is what happens today? Well, compiler writers have been busy writing portable C for years, so the compiler should be able to compile itself. It's worth discussing at a very high level what compilation is. Basically, you take a set of statements and produce assembly from them. That's it. Well, it's actually more complicated than that - you can do all sorts of things with lexers and parsers, and I only understand a small subset of it - but essentially, you're looking to map C to assembly.
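
To make "map C to assembly" concrete, here is a toy sketch in Python (mine, not anything gcc actually contains) that emits x86-64-flavoured assembly for the single statement return a + b;. The function name, the stack offsets and the instruction choices are all simplified assumptions - a real compiler also has to parse the source, lay out the stack frame, emit a prologue and epilogue, and much more.

# Toy illustration: "compile" the C statement `return a + b;` into
# x86-64-flavoured assembly. A hand-rolled sketch, not what gcc does.

def compile_return_add(offsets):
    """Emit assembly for `return a + b;` given the stack offsets of a and b."""
    asm = []
    asm.append("    movl  {0}(%rbp), %eax".format(offsets["a"]))  # load a into %eax
    asm.append("    addl  {0}(%rbp), %eax".format(offsets["b"]))  # add b to it
    asm.append("    ret")                                         # result is returned in %eax
    return "\n".join(asm)

# Pretend a front end has already parsed the C and laid out the stack frame.
print(compile_return_add({"a": -4, "b": -8}))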

Under normal operation, the compiler produces assembly code matching your platform, but it doesn't have to. It can produce assembly code for any platform you like, provided it knows how to. So the first step in making C work on your platform is to create a target in an existing compiler, start adding instructions and get basic code working.
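
As a sketch of what "creating a target" means in spirit: the part of the compiler that understands the language can stay the same, and only the part that spells out instructions changes per target. Everything below (the tiny IR, the target names, the instruction templates) is invented for illustration - real back ends, such as gcc's machine descriptions, are vastly more involved.

# Toy illustration of retargeting: the same front-end output (a tiny
# intermediate representation for `a + b`) fed to two invented back ends.

IR = [("load", "t1", "a"),
      ("load", "t2", "b"),
      ("add",  "t3", "t1", "t2")]

BACKENDS = {
    # instruction templates are made up for illustration
    "targetA": {"load": "    LOAD  {0}, {1}",
                "add":  "    ADD   {0}, {1}, {2}"},
    "targetB": {"load": "    ld    {0}, [{1}]",
                "add":  "    add   {0}, {1}, {2}"},
}

def emit(ir, target):
    templates = BACKENDS[target]
    return "\n".join(templates[ins[0]].format(*ins[1:]) for ins in ir)

print(emit(IR, "targetA"))
print(emit(IR, "targetB"))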

Once this is done, in theory, you can now cross compile from one platform to another. The next stages are: building a kernel, bootloader and some basic userland utilities for that platform.

Then, you can have a go at compiling the compiler for that platform (once you've got a working userland and everything you need to run the build process). If that succeeds, you've got basic utilities, a working kernel, userland and a compiler system. You're now well on your way.

Note that in the process of porting the compiler, you probably needed to write an assembler and linker for that platform too. To keep the description simple, I omitted them.

If this is of interest, Linux From Scratch is an interesting read. It doesn't tell you how to create a new target from scratch (which is significantly non-trivial) - it assumes you're going to build for an existing, known target - but it does show you how to cross compile the essentials and begin building up the system.

Python does not actually compile down to native assembly. For a start, the running Python program keeps track of reference counts for objects, something that a CPU won't do for you. However, the concept of instruction-based code is at the heart of Python too. Have a play with this:

>>> def hello(x, y, z, q):
...     print "Hello, world"
...     q()
...     return x+y+z
... 
>>> import dis
>>> dis.dis(hello)
  2           0 LOAD_CONST               1 ('Hello, world')
              3 PRINT_ITEM          
              4 PRINT_NEWLINE       

  3           5 LOAD_FAST                3 (q)
              8 CALL_FUNCTION            0
             11 POP_TOP             

  4          12 LOAD_FAST                0 (x)
             15 LOAD_FAST                1 (y)
             18 BINARY_ADD          
             19 LOAD_FAST                2 (z)
             22 BINARY_ADD          
             23 RETURN_VALUE

There you can see how Python thinks of the code you entered. This is Python bytecode, i.e. the assembly language of Python. It effectively has its own "instruction set", if you like, for implementing the language. This is the concept of a virtual machine.
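
To see why this is called a virtual machine, here is a minimal sketch of a stack machine that could execute bytecode of this shape. The instruction names echo CPython's, but the loop below is a made-up toy, not how CPython is actually implemented.

# A toy stack-based virtual machine. Instruction names echo CPython's
# bytecode, but this loop is a simplified illustration, not CPython itself.

def run(code, locals_):
    stack = []
    for op, arg in code:
        if op == "LOAD_CONST":
            stack.append(arg)
        elif op == "LOAD_FAST":
            stack.append(locals_[arg])
        elif op == "BINARY_ADD":
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        elif op == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError("unknown opcode: " + op)

# Roughly the tail of the hello() example above: return x + y + z
program = [("LOAD_FAST", "x"),
           ("LOAD_FAST", "y"),
           ("BINARY_ADD", None),
           ("LOAD_FAST", "z"),
           ("BINARY_ADD", None),
           ("RETURN_VALUE", None)]

print(run(program, {"x": 1, "y": 2, "z": 3}))   # prints 6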

Java has exactly the same kind of idea. I took a class with a main function and ran javap -c on it to get this:

invalid.site.ningefingers.main();
  Code:
   0:   aload_0
   1:   invokespecial   #1; //Method java/lang/Object."<init>":()V
   4:   return
public static void main(java.lang.String[]);
  Code:
   0:   iconst_0
   1:   istore_1
   2:   iconst_0
   3:   istore_1
   4:   iload_1
   5:   aload_0
   6:   arraylength
   7:   if_icmpge   57
   10:  getstatic   #2; 
   13:  new #3; 
   16:  dup
   17:  invokespecial   #4; 
   20:  ldc #5; 
   22:  invokevirtual   #6; 
   25:  iload_1
   26:  invokevirtual   #7; 
   //.......
}

I take it you get the idea. These are the assembly languages of the Python and Java worlds, i.e. how the Python interpreter and the Java compiler think, respectively.

Something else worth reading up on is JonesForth. It is both a working Forth interpreter and a tutorial, and I can't recommend it enough for thinking about "how things get executed" and how you write a simple, lightweight language.


In the interest of performance, I'm sure C compilers are just built up from assembly.

C compilers are, nowadays, (almost?) completely written in C or higher-level languages (Clang is C++, for instance). Compilers gain little to nothing from including hand-written assembly code. The things that take the most time are as slow as they are because they solve very hard problems, where "hard" means "big computational complexity" - rewriting in assembly brings at most a constant-factor speedup, and that doesn't really matter at that level.

Also, most compilers aim for high portability, so architecture-specific tricks in the front and middle end are out of the question (and in the back ends they're not desirable either, because they may break cross-compilation).

Say I buy a new cpu with nothing on it. During the first operation I wish to install an OS, which runs C. What runs the C compiler? Is there a miniature C compiler in the BIOS?

When you're installing an OS, there's (usually) no C compiler being run. The setup CD is full of precompiled binaries for that architecture. If a C compiler is included (as is the case with many Linux distros), it's an already-compiled executable too. And those distros that make you build your own kernel etc. also ship at least one executable - the compiler. That is, of course, unless you have to compile your own kernel on an existing installation of something that already has a C compiler.

If by "new CPU" you mean a new architecture that isn't backwards-compatible to anything that's yet supported, self-hosting compilers can follow the usual porting procedure: First write a backend for that new target, then compile yourself for it, and suddenly you got a mature compiler with a battle-hardened (compiled a whole compiler) native backend on the new platform.


If you buy a new machine with a pre-installed OS, it doesn't even need to include a compiler anywhere, because all the executable code has been compiled on some other machine, by whoever provides the OS - your machine doesn't need to compile anything itself.

How do you get to this point if you have a completely new CPU architecture? In this case, you would probably start by writing a new code generation back-end for your new CPU architecture (the "target") for an existing C compiler that runs on some other platform (the "host") - a cross-compiler.

Once your cross-compiler (running on the host) works well enough to generate a correct compiler (and necessary libraries, etc.) that will run on the target, then you can compile the compiler with itself on the target platform, and end up with a target-native compiler, which runs on the target and generates code which runs on the target.

It's the same principle with a new language: you have to write code in an existing language that you do have a toolchain for, which will compile your new language into something that you can work with (let's call this the "bootstrap compiler"). Once you get this working well enough, you can write a compiler in your new language (the "real compiler"), and then compile the real compiler with the bootstrap compiler. At this point you're writing the compiler for your new language in the new language itself, and your language is said to be "self-hosting".
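
As a toy illustration of that bootstrap stage, in the sketch below Python plays the role of the existing language and a made-up one-statement language plays the role of the new one. The "bootstrap compiler" (compile_mini, a name invented here) translates the new language into Python and runs the result. In a real bootstrap, the next step would be to rewrite this translator in the new language itself and compile it with this one.

# Toy "bootstrap compiler": written in an existing language (Python),
# it compiles a made-up mini-language into Python source and runs it.
# The mini-language has exactly one statement form: print <n> + <m>

def compile_mini(source):
    """Translate mini-language source into Python source."""
    out = []
    for line in source.strip().splitlines():
        words = line.split()
        if len(words) == 4 and words[0] == "print" and words[2] == "+":
            out.append("print({0} + {1})".format(int(words[1]), int(words[3])))
        else:
            raise SyntaxError("cannot compile: " + line)
    return "\n".join(out)

program = """
print 1 + 2
print 40 + 2
"""

generated = compile_mini(program)   # stage 0: compile with the bootstrap compiler
exec(generated)                     # run the compiled output: prints 3 and 42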
