Optimisation for a brainfuck interpreter
As an exercise to help me learn about interpreters and optimisation, neither of which I know anything about, I have written a brainfuck interpreter in C. It appears to work flawlessly thus far, though it does not compete well in开发者_如何学Go execution speed compared to other fast interpreters.
What are some ways that I can change this interpreter to improve performance (or otherwise)?
One interesting aspect of my interpreter (though most others probably do this as well) is that I run one loop that reads through the source input and converts each instruction into a
struct { long instruction; long loop; }
The loop
value is the index of the matching ]
instruction, if the instruction is a [
, and the index of the matching [
instruction, if the instruction is a ]
, allowing quick jumping. I'd imagine that this 'parsing' process (which doesn't take long) improves execution times over doing redundant reparsing to find matching square brackets every time they are needed.
An interesting test of brainfuck interpreter speed is this program:
++++++++[->-[->-[->-[-]<]<]<]>++++++++[<++++++++++>-]<[>+>+<<-]>-.>-----.>
- the first version of the interpreter
- the interpreter after implementing Jerry Coffin's answer, which removes the giant switch in the runtime loop, by making the
instruction
struct'sinstruction
a direct pointer to an operation function - this runs slower than the previous version (function call overhead?) - the interpreter after reversing the previous change and adding an optimisation to 'collapse' multiple consecutive non-loop operations, reducing loop cycles - this runs slightly faster than the original
Well, this is not C. And it is not an interpeter. So, yeah, pretty much totally inappropriate for this question.
But what it is, is, a perfectly portable brainfuck compiler using C++0x variadic templates. You have to #define PROGRAM
as a comma-separated sequence of C-syntax characters, because I could not extract them from the string at compile-time. But otherwise it is legit. I think.
Tested with g++ 4.5.2, using g++ -std=c++0x -O2 -Wall
.
#include <cstdio>
#include <vector>
#define PROGRAM '+', '+', '+', '+', '+', '+', '+', '+', '[', '-', '>', \
'-', '[', '-', '>', '-', '[', '-', '>', '-', '[', '-', ']', '<', \
']', '<', ']', '<', ']', '>', '+', '+', '+', '+', '+', '+', '+', \
'+', '[', '<', '+', '+', '+', '+', '+', '+', '+', '+', '+', '+', \
'>', '-', ']', '<', '[', '>', '+', '>', '+', '<', '<', '-', ']', \
'>', '-', '.', '>', '-', '-', '-', '-', '-', '.', '>'
template<char... all>
struct C;
template<char... rest>
struct C<'>', rest...> {
typedef C<rest...> rest_t;
typedef typename rest_t::remainder remainder;
static char *body(char *p) {
return rest_t::body(p+1);
}
};
template<char... rest>
struct C<'<', rest...> {
typedef C<rest...> rest_t;
typedef typename rest_t::remainder remainder;
static char *body(char *p) {
return rest_t::body(p-1);
}
};
template<char... rest>
struct C<'+', rest...> {
typedef C<rest...> rest_t;
typedef typename rest_t::remainder remainder;
static char *body(char *p) {
++*p;
return rest_t::body(p);
}
};
template<char... rest>
struct C<'-', rest...> {
typedef C<rest...> rest_t;
typedef typename rest_t::remainder remainder;
static char *body(char *p) {
--*p;
return rest_t::body(p);
}
};
template<char... rest>
struct C<'.', rest...> {
typedef C<rest...> rest_t;
typedef typename rest_t::remainder remainder;
static char *body(char *p) {
putchar(*p);
return rest_t::body(p);
}
};
template<char... rest>
struct C<',', rest...> {
typedef C<rest...> rest_t;
typedef typename rest_t::remainder remainder;
static char *body(char *p) {
*p = getchar();
return rest_t::body(p);
}
};
template<char... rest>
struct C<'[', rest...> {
typedef C<rest...> rest_t;
typedef typename rest_t::remainder::remainder remainder;
static char *body(char *p) {
while (*p) {
p = rest_t::body(p);
}
return rest_t::remainder::body(p);
}
};
template<char... rest>
struct C<']', rest...> {
typedef C<rest...> rest_t;
struct remainder_hack {
typedef typename rest_t::remainder remainder;
static char *body(char *p) {
return rest_t::body(p);
}
};
typedef remainder_hack remainder;
static char *body(char *p) {
return p;
}
};
template<>
struct C<> {
static char *body(char *p) {
return p;
}
struct remainder {
static char *body(char *p) {
return p;
}
};
};
int
main(int argc, char *argv[])
{
std::vector<char> v(30000, 0);
C<PROGRAM> thing;
thing.body(&v[0]);
return 0;
}
I can see a couple of possibilities. I think the way I'd go would be to turn it into a compiler that produced direct-threaded code. I.e., as you read the input, instead of copying most of "instructions" into memory more or less as-is, I'd instead write the code to implement each instruction as a function, and copy a pointer to each function into memory. Then executing the code would consist of calling those functions in order. I'd probably have that function return the index (or perhaps address) of the next instruction to execute, so you'd end up with something like:
typedef int return_type;
typedef return_type (*f)(void);
f *im = malloc(sizeof(f) * ia);
ci = (*(im[ci]))();
I'd also have three separate functions for each instruction, one for each BF_END_* mode, so you'd only have to deal with that during the "compilation" phase. When you execute the code, you'd have a pointer directly to the correct function.
Edit:
I've been playing with the code a bit. I've separated the loop addresses into a separate array, and merged most of the parsing together, so it looks like this:
for (ii = 0; (i = getc(fp)) != EOF; ++ii) {
if (++in > ia) {
ia *= 2;
im = realloc(im, sizeof(*im) * ia);
loops = realloc(loops, sizeof(*loops) * ia);
}
im[in-1] = i;
switch (i) {
case BF_OP_LSTART:
if (ln >= la)
ls = realloc(ls, sizeof(*ls) * (la *= 2));
ls[ln++] = ii;
break;
case BF_OP_LEND:
loops[in-1] = ls[--ln];
loops[ls[ln]] = ii;
break;
}
}
That doesn't make any real difference to the speed, but does make the code a lot shorter, and (at least in my opinion) easier to understand.
Edit2:
Okay, I've had a chance to play with this a bit more, and found one (rather strange) optimization that does seem to help at least a bit. Compilers often produce marginally better code for switch statements with dense case values, so I tried converting to that, and got an improvement of around 9-10% (depending a bit on compiler).
#include <stdio.h>
#include <stdlib.h>
#include <stdarg.h>
#include <string.h>
#include <unistd.h>
#include <errno.h>
#define BF_END_ERROR 'e'
#define BF_END_IGNORE 'i'
#define BF_END_WRAP 'w'
#define BF_OP_VINC '+'
#define BF_OP_VDEC '-'
#define BF_OP_PINC '>'
#define BF_OP_PDEC '<'
#define BF_OP_LSTART '['
#define BF_OP_LEND ']'
#define BF_OP_IN ','
#define BF_OP_OUT '.'
enum {
C_OP_VINC,
C_OP_VDEC,
C_OP_PINC,
C_OP_PDEC,
C_OP_LSTART,
C_OP_LEND,
C_OP_IN,
C_OP_OUT
};
typedef struct {
long instruction; /* instruction type */
long loop; /* 'other' instruction index in a loop */
} instruction;
void die(const char *s, ...) {
va_list a;
va_start(a, s);
fprintf(stderr, "brief: error: ");
vfprintf(stderr, s, a);
putchar(10);
va_end(a);
exit(1);
}
int main(int argc, char **argv) {
unsigned instruction_count = 0;
long
ci = 0, /* current cell index */
cn = 4096, /* number of cells to allocate */
cw = BF_END_WRAP, /* cell wrap behaviour */
ia = 4096, /* number of allocated instructions */
ii = 0, /* current instruction index */
in = 0, /* number of used instructions */
la = 4096, /* loop stack allocation */
ln = 0, /* loop stack used */
va = 0, /* minimum value */
vb = 255, /* maximum value */
vw = BF_END_WRAP /* value wrap behaviour */
;
instruction *im = malloc(sizeof(instruction) * ia); /* instruction memory */
long *cm = NULL; /* cell memory */
long *ls = malloc(sizeof(long) * la); /* loop stack */
FILE *fp = NULL;
int i;
while ((i = getopt(argc, argv, "a:b:c:f:hv:w:")) != -1) {
switch (i) {
case 'a': va = atol(optarg); break;
case 'b': vb = atol(optarg); break;
case 'c': cn = atol(optarg); break;
case 'f':
fp = fopen(optarg, "r");
if (!fp)
die("%s: %s", optarg, strerror(errno));
break;
case 'h':
fputs(
"brief: a flexible brainfuck interpreter\n"
"usage: brief [options]\n\n"
"options:\n"
" -a set minimum cell value (default 0)\n"
" -b set maximum cell value (default 255)\n"
" -c set cells to allocate (default 4096)\n"
" -f source file name (required)\n"
" -h this help output\n"
" -v value over/underflow behaviour\n"
" -w cell pointer over/underflow behaviour\n\n"
, stderr);
fputs(
"cells are 'long int' values, so do not use -a with a "
"value less than -2^31 or -2^63, and do not use -b with a "
"value more than 2^31-1 or 2^63-1, depending on your "
"architecture's 'long int' size.\n\n"
"over/underflow behaviours can be one of:\n"
" e throw an error and quit upon over/underflow\n"
" i do nothing when attempting to over/underflow\n"
" w wrap-around to other end upon over/underflow\n"
, stderr);
exit(1);
break;
case 'v': vw = optarg[0]; break;
case 'w': cw = optarg[0]; break;
default: break;
}
}
if (!fp)
die("no source file specified; use -f");
for (ii = 0; (i = getc(fp)) != EOF; ++ii) {
if (++in > ia) {
ia *= 2;
im = realloc(im, sizeof(*im) * ia);
}
switch (i) {
case BF_OP_LSTART:
if (ln >= la)
ls = realloc(ls, sizeof(*ls) * (la *= 2));
ls[ln++] = ii;
im[in-1].instruction = C_OP_LSTART;
break;
case BF_OP_LEND:
im[in-1].loop = ls[--ln];
im[ls[ln]].loop = ii;
im[in-1].instruction = C_OP_LEND;
break;
case BF_OP_VINC:
im[in-1].instruction = C_OP_VINC;
break;
case BF_OP_VDEC:
im[in-1].instruction = C_OP_VDEC;
break;
case BF_OP_PINC:
im[in-1].instruction = C_OP_PINC;
break;
case BF_OP_PDEC:
im[in-1].instruction = C_OP_PDEC;
break;
case BF_OP_IN:
im[in-1].instruction = C_OP_IN;
break;
case BF_OP_OUT:
im[in-1].instruction = C_OP_OUT;
break;
}
}
cm = memset(malloc(cn * sizeof(long)), 0, cn * sizeof(long));
for (ii = 0; ii < in; ii++) {
++instruction_count;
switch (im[ii].instruction) {
case C_OP_VINC:
if (cm[ci] == vb)
switch (vw) {
case BF_END_ERROR:
die("value overflow");
break;
case BF_END_IGNORE: break;
case BF_END_WRAP: cm[ci] = 0; break;
}
else ++cm[ci];
break;
case C_OP_VDEC:
if (cm[ci] == 0)
switch (vw) {
case BF_END_ERROR:
die("value underflow");
break;
case BF_END_IGNORE: break;
case BF_END_WRAP: cm[ci] = vb; break;
}
else --cm[ci];
break;
case C_OP_PINC:
if (ci == cn - 1)
switch (cw) {
case BF_END_ERROR:
die("cell index overflow");
break;
case BF_END_IGNORE: break;
case BF_END_WRAP: ci = 0; break;
}
else ++ci;
break;
case C_OP_PDEC:
if (ci == 0)
switch (cw) {
case BF_END_ERROR:
die("cell index underflow");
break;
case BF_END_IGNORE: break;
case BF_END_WRAP: ci = cn - 1; break;
}
else --ci;
break;
case C_OP_IN:
cm[ci] = getchar();
break;
case C_OP_OUT:
putchar(cm[ci]);
break;
case C_OP_LSTART:
if (!cm[ci])
ii = im[ii].loop;
break;
case C_OP_LEND:
if (cm[ci])
ii = im[ii].loop;
break;
default: break;
}
}
fprintf(stderr, "Executed %d instructions\n", instruction_count);
free(cm);
return 0;
}
Brainfuck should be pretty easy to compile into C code, which you then compile and execute. That would probably be a very fast BF "interpreter".
Fundamentally all you have to do is generate pretty trivial code for each brainfuck operator from left to right in the program. One can easily optimize sequences of + and -; similarly, one can optimize sequences of < and >, by caching the counts of each recently encountered. This is a kind of peephole optimization.
Here's a draft compiler, accepting BF code on the command line and printing the compiled program to the console:
int increments; // holds pending increment operations
void flush_increments(){
if (increments==0) return;
printf(" *ptr+=%d;\n",increments);
increments=0;
}
int steps; // holds pending pointer steps
void flush_steps(){
if (steps==0) return;
printf(" ptr+=%d;\n",steps);
steps=0;
}
int main(int argc, char **argv){
// Brainfuck compiler
if( !(argc > 1) )
return 1;
unsigned char *code = argv[1];
int nesting=0;
printf("int main(){\n");
printf(" #define CELLSPACE 1000\n");
printf(" unsigned char *ptr = malloc(sizeof(char)*CELLSPACE);\n");
printf(" if(ptr == NULL) return 1;\n")
printf(" for(int i=0;i<CELLSPACED;i++) ptr[i]=0; // reset cell space to zeros");
increments=0;
steps=0;
for(;;) {
switch(*code++) {
case '+':
flush_steps();
++increments;
break;
case '-':
flush_steps();
--increments;
break;
case '>':
flush_increments();
++steps;
break;
case '<':
flush_increments();
--steps;
break;
case '[':
flush_increments();
flush_steps();
printf("while(*ptr){");
++nesting;
break;
case ']':
flush_increments();
flush_steps();
if (--nesting<0)
{ printf("Unmatched ']'\n");
return 1;
}
printf("}\n";);
break;
case '.':
flush_increments();
flush_steps();
printf(" putc(*ptr, stdout);\n");
break;
case ',':
increments=0;
flush_steps();
printf("*ptr = getc(stdin);");
break;
case '\0':
printf("}");
if (nesting>0)
{ printf("Unmatched '['\n");
return 1;
}
return 0;
}
}
}
This is keyed in off the top of my head, inspired by Matthew Blanchard's code (thanks Matthew!), but not tested. I'll leave that to some other soul; feel free to patch the code if you find a problem. Obviously it would be improved if it wrote its code to a file :-}
[I used the http://en.wikipedia.org/wiki/Brainfuck article as the obvious inspiration for the code to generate].
The OP's BF program:
++++++++[->-[->-[->-[-]<]<]<]>++++++++[<++++++++++>-]<[>+>+<<-]>-.>-----.>
should compile to (indentation added):
int main(){
#define CELLSPACE 1000
unsigned char *ptr = malloc(sizeof(char)*CELLSPACE);
if(ptr == NULL) return 1;
for(int i=0;i<CELLSPACED;i++) ptr[i]=0; // reset cell space to zeros
*ptr+=8;
while(*ptr) {
*ptr+=-1;
ptr+=1;
*ptr+=-1;
while(*ptr) {
*ptr+=-1;
ptr+=1;
*ptr+=-1;
while(*ptr) {
*ptr+=-1;
ptr+=1;
*ptr+=-1;
while(*ptr) {
*ptr+=-1;
}
ptr+=-1;
}
ptr+=-1;
}
ptr+=1;
*ptr+=8;
while (*ptr) {
ptr+=-1;
*ptr+=10;
ptr+=1;
*ptr+=-1;
}
ptr+=-1;
while (*ptr) {
ptr+=1;
*ptr+=1;
ptr+=1;
*ptr+=1;
ptr+=-2;
*ptr+=-1;
}
ptr+=1;
*ptr+=-1;
putc(*ptr,stdout);
ptr+=1;
*ptr+=-5;
putc(*ptr,stdout);
ptr+=1;
}
This probably averages pretty close to one machine instruction per BF op.
Somebody who was really ambitious would compute the possible values for ptr at each point in the program; I guess in many cases it refers to a constant cell. Then one could avoid indirect accesses.
If you really wanted to go nuts, you could figure out what the BF commands do up till the first input request; that must be a "constant" initial memory configuration, and generate a CELLSPACE intializer with that constant, and generate code for the rest of the program the way I have shown. If you did that, OP's example program would vanish into a single CELLSPACE initialzer and a few putc calls.
Since the entire point of this project is learning, using a tool or replacement solution clearly isn't answering the question.
First, a disclaimer: I'm not an x86 programmer -- I've done a decent amount of work in embedded environments and now (with mobile phones) ARM chips. On to the good stuff...
There's two basic ways of making your interpreter faster: Make it optimize the BF code itself, and make the interpreter itself optimized. I'll recommend a bit of both in one step below.
As far as I know, x86 spends a lot of dye space on providing relatively impressive fast branching properties. Likely due to this, I've seen several compilers (including gcc) produce nested branches in favor of actual jump tables for x86. Jump tables sound attractive in theory, but really x86 optimizes so well for legacy techniques that conventional big-O thinking just doesn't apply in most practice. That's why long term x86 developers will tell you if you want to calculate how fast code is then you need to write it, run it, and time it.
Despite the speed at which branches can occur on x86, it's still likely worth spending a bit of overhead on not branching. Since BF instructions are so simple anyways, that can take the form of "do most of the instructions all at once since it's faster than another branch." Some of this may even be done in parallel by the processor where branching cannot be. (x86 has enough decoding and execution units in a single core such that this is likely feasible)
Another thing that's going to kill your performance is error checking (such as wrapping). Keeping it in there is causing you performance issues (not major right now) but more importantly preventing you from optimizing.
Additionally, your program is very generic. It will allow you to use any maximum value for wrapping. If it was a power of two, you'd just perform a bitwise AND (very fast since it's one CPU cycle on virtually all modern platforms) equivalent to wrapping instead of comparing and branching. The devil is in the details for writing truly fast interpreters -- every little bit you add makes it that much more complex.
Toward that end, I recommend streamlining what your BF interpreter does (make it wrap at a power of two for values, for example). The bitwise AND trick alone and forcing wrap to be the option (as is the case in most modern programming languages for overflows/underflows) already removes at least two branches.
Once those are taken care of, this enables an interesting optimization: throw away BF instructions and instead make your machine perform different instructions that are more suitable for an interpreter.
Consider Java: Java does not interpret. It JITs into an entirely different language.
I recommend applying the same logic. You've already done this a bit by giving your instructions a value associated with them.
I recommend creating an instruction with the following information:
- the amount to add to the value at the data pointer (or subtract -- subtraction is just adding a negative number)
- the amount to add to the value of the data pointer (again, or subtract)
- a pointer to the next instruction to execute
- a value which is compared to the value at the data pointer before it is changed
The interpreter loop then changes to the logic as follows: add the data value in the instruction to the value at the data pointer add the data address value in the instruction to the data pointer itself if the value at the data pointer matches our comparison value set the instruction pointer to the new instruction pointer continue (to the next interpreter loop) if the comparison value is a special value (e.g. you could pick 0x80000000) handle input/output stuffs increase the instruction address by one
Initial interpretation now becomes trickier: Instructions can now combine +, -, <, > and sometimes even [, and usually ] into the same single instruction. The first branch in the loop can be represented without branching if you're clever about it (and even more efficiently with some assembly stuck in or some compiler intrinsics). You may want to inform the compiler that the second branch is unlikely to hit (input/output is the bottleneck in that case, not the speed of interpretation, so even if you're doing a lot of input/output, one little unoptimized branch won't make the difference).
Care must be taken for the end-of-running condition. The last instruction in your interpreted BF program should now ALWAYS make the instruction pointer NULL so the loop exits.
The order in which interpretation happens is important because -/+ is done before <> is done before [ is done before ] is done before I/O. So, >+ is two interpreted instructions while +> is one.
Beyond engineering a fast tight-looped interpreter like this, you're looking into more complex code analysis in which case you'll get into compiler design and less away from a straight-up interpreter. This is not something I do every day, but Louden's "Compiler Construction" book was a very good read for me, but it won't end up being a small project that way. Unless you're serious about making this thing ridiculously fast and in the end probably compiled code, I'd stay away from throwing the big, hard-hitting optimizations at it.
I hope I've given you an idea to try and test that leads you to more optimization steps. I haven't tried any of this myself so it's still all conjecture based on past experience. However, even if it doesn't end up being faster, you'll have gained some experience in rewriting BF code to a relatively different architecture from a vanilla BF interpreter.
P.S. Great question!
Use the LLVM JIT infrastructure and have it optimize the code for you...
edit: actually it has already been done multiple times; compiler: http://www.remcobloemen.nl/2010/02/brainfuck-using-llvm/ JIT: https://github.com/resistor/BrainFTracing (note how the "compiler" is 230 lines long, counting also blank lines, comments and #includes)
edit2: for the downvoter: since you seem to have missed it, the meaning of my answer was "don't reinvent the wheel"
I'd see several things.
Your switch
is quite complicated because of the error handling. Try to reorganize that, that you only have the fast path inside the switch, and call one or several functions for errors. In general, any shorter you get the code inside the switch
and the less variables you use in there, the better your optimizer will be able to kick in.
You have too many indirections. E.g your index ci
doesn't serve much. Have a pointer that points to the actual cell. This lets you save a register. You could do similar with ii
. Instead of keeping the number of the instruction, just have a pointer into the position in cm
.
In any case, check the assembler that is generated. You will quickly see where your compiler produces too much register spilling or things like that.
Here's an example of how you can make a fast BF interpreter:
int main(int argc, char **argv)
{
if( !(argc > 1) )
return 1;
unsigned char *progmem = argv[1];
unsigned char *cellmem = malloc(sizeof(char)*CELLSPACE);
if(cellmem == NULL)
return 1;
unsigned char **loopdepth = malloc(sizeof(char*)*MAXLOOPDEPTH);
if(loopdepth == NULL)
return 1;
unsigned char *origcellmem = cellmem;
unsigned char **origloopdepth = loopdepth;
for(;;)
{
switch(*progmem)
{
case '+':
++*cellmem;
break;
case '-':
--*cellmem;
break;
case '>':
cellmem++;
break;
case '<':
cellmem--;
break;
case '[':
*loopdepth = progmem-1;
loopdepth++;
break;
case ']':
loopdepth--;
if(*cellmem)
{
progmem = *loopdepth;
}
break;
case '.':
putc(*cellmem, stdout);
break;
case ',':
*cellmem = getc(stdin);
break;
case '\0':
free(origcellmem);
free(origloopdepth);
return 0;
}
progmem++;
}
}
Alright so, the highlights of my code that should make it faster than your solution:
I'm not doing any checking each loop, the compiler is likely to generate a raw unconditional loop here (or so the C wizards tell me.) And since I'm using the raw data from the string instead of structs I just have to put '\0' at the end of the switch! This means the only time my interpreter checks if it needs to end the program is when nothing else matches the switch.
I'm using simple pointers for everything, and only manipulating those, I'm not doing arithmatic on integers and then accessing the pointed to memory with the [] operators, I'm simply manipulating the pointers and their pointed to memory directly.
I'm using a simple 'stack' to hold loops, this limits the maximum depth of loops you can have but 32 is plenty enough for most brainfuck programs, and it's adjustable in source.
I've ordered the switch case for how often things appear, seeing as + and - are the most common brainfuck characters, and then > and < and then [ and ] and finally the input characters. I am not 100% on this but I'm -pretty sure- the order of the switch matters IF ONLY A LITTLE BIT!
I'm not doing any error checking, I'm assuming the user's input is perfect. While this might not give nice errors, like running out of bounds and such, that's EXACTLy what languages do at runtime, a C program can generate a segfault easily, we probably -shouldn't- check for these. (A quick note, mine generated a LOT of segfaults in writing this :P)
And lastly one possible optimization I thought of:
Run length encoding, like used in compression. You could run length encode the brainfuck in a simple format, so that: +++ turns into 3+ and the interpreter 'gets' it and instead of adding one three times, it adds three one time. The performance gain here could be AMAZING in some places
And there you go, that's really all I've got. I don't know C amazingly well so I might have made some mistakes, but I tried my best, feel free to benchmark the provided code, I've got no idea exactly how fast it runs. It accepts input as a command line argument.
精彩评论