On improving Haskell's performance compared to C in fibonacci micro-benchmark
I came across this question, which compared the performance of various compilers on computing fibonaci numbers the naive way.
I tried doing this with Haskell to see how it compares to C.
C code:
#include <stdio.h>
#include <stdlib.h>
int fib (int n) {
if (n < 2) return 1;
return fib (n-1) + fib (n-2);
}
int main (int argc, char* argv[]) {
printf ("%i\n", fib (atoi(argv[1])));
return 0;
}
Result:
> gcc -O3 main.c -o fib
> time ./fib 40
165580141
real 0m0.421s
user 0m0.420s
sys 0m0.000s
Haskell:
module Main where
import System.Environment (getArgs)
fib :: Int -> Int
fib n | n < 2 = 1
| otherwise = fib (n-1) + fib (n-2)
main = getArgs >>= print . fib . read . head
Result:
> ghc -O3 -fllvm -optlo-O3 Main.hs -o fib
> time ./fib 40
165580141
real 0m1.476s
user 0m1.476s
sys 0m0.000s
Profiling with
> ghc -O3 -fllvm -optlo-O3 -prof -auto-all -caf-all -rtsopts Main.hs -fforce-recomp -o fib
> ./fib 40 +RTS -prof
reveals that fib
takes 100% time and alloc, no surprise. I took some profile of the heap, but don't know what they imply:
> ./fib 40 +RTS -hc
> ./fib 40 +RTS -hd
So my question: Is there anything I can do from my side to make this Haskell program's performance closer to C, or this is just the way GHC does things that happens to make it slower in this micro-benchmark? (I'm not asking for an asymptotically faster algorithm 开发者_StackOverflowto compute fibs.)
Thank you very much.
[EDIT]
It turned out that ghc -O3
was faster than ghc -O3 -fllvm -optlo-O3
in this case. But optlo-block-placement
made an observable difference for the LLVM backend:
> ghc -O3 Main.hs -o fib -fforce-recomp
> time ./fib 40
165580141
real 0m1.283s
user 0m1.284s
sys 0m0.000s
> ghc -O3 -fllvm -optlo-O3 -o fib -fforce-recomp
> time ./fib 40
165580141
real 0m1.449s
user 0m1.448s
sys 0m0.000s
> ghc -O3 -fllvm -optlo-O3 -optlo-block-placement -o fib -fforce-recomp
> time ./fib 40
165580141
real 0m1.112s
user 0m1.096s
sys 0m0.016s
The reason I wanted to investigate this was because both C and OCaml were significantly faster than Haskell for this program. I sort of couldn't accept that and wanted to learn more to make sure I already did everything I could :D
> ocamlopt main.ml -o fib
> time ./fib 40
165580141
real 0m0.668s
user 0m0.660s
sys 0m0.008s
The heap profile is not very interesting here, as GHC compiles fib
into a function that operates soleley on the stack. Just look at the profile... Only 800 bytes are ever allocated, the small overhead of your main
implementation.
As far as GHC's core-level is concerned, this actually gets optimized as far as it can get. The low-level code generation is another matter though. Let us have a quick dive into the code GHC generates:
_Main_zdwfib_info:
.Lc1RK:
leal -8(%ebp),%eax
cmpl 84(%ebx),%eax
jb .Lc1RM
This is the check for stack space. Probably something C doesn't need, as it can let the operation system handle stack space allocation. Haskell has user level threads, so stack space is managed manually.
cmpl $2,0(%ebp)
jl .Lc1RO
The comparison against 2 from your code.
movl 0(%ebp),%eax
decl %eax
The parameter is reloaded from the stack and decremented to get the parameter for the recursive call. The reload is probably unnecessary - not sure it makes a difference though.
movl %eax,-8(%ebp)
movl $_s1Qp_info,-4(%ebp)
addl $-8,%ebp
jmp _Main_zdwfib_info
The parameter and the return address are pushed on top of the stack, and we jump directly to the label in order to recurse.
.Lc1RO:
movl $1,%esi
addl $4,%ebp
jmp *0(%ebp)
The code for the case that the parameter is less than 2. The return value is passed in a register.
Bottom line: Everything seems to be working like it should, it's unlikely you will be able to squeeze much more out of this by changing the program. The custom stack check is an obvious source of slowdowns, not sure whether it can be blamed for the full time difference though.
These seems like a really feeble 'benchmark' as barsoap
says. Suppose I compare the following almost equally naive programs:
module Main where
import System.Environment (getArgs)
fib :: Int -> Int
fib 0 = 1
fib 1 = 1
fib 2 = 2
fib n = (\x y -> x + y + y ) (fib (n-3)) (fib (n-2) )
main = getArgs >>= print . fib . read . head
... and in the other corner ...
#include <stdio.h>
#include <stdlib.h>
int fib (int n) {
if (n < 2) return 1;
if (n < 3) return n;
return (fib (n-3) + fib (n-2)) + fib (n-2);
}
int main (int argc, char* argv[]) {
printf ("%i\n", fib (atoi(argv[1])));
return 0;
}
Then the Glorious ghc
crushes gcc
, not too surprisingly, really:
$ ghc --make -fforce-recomp fib.hs -o fibh
[1 of 1] Compiling Main ( fib.hs, fib.o )
Linking fibh ...
$ time ./fibh 40
165580141
real 0m0.013s
user 0m0.007s
sys 0m0.003s
$ gcc fib.c -o fibc
$ time ./fibc 40
165580141
real 0m1.495s
user 0m1.483s
sys 0m0.003s
now turning on optimizations, ghc
picks up a little more speed:
$ ghc --make -fforce-recomp fib.hs -O3 -o fibhO
$ time ./fibhO 40
165580141
real 0m0.007s
user 0m0.002s
sys 0m0.004s
but gcc
finally gets a clue.
$ gcc fib.c -O3 -o fibcO
$ time ./fibcO 40
165580141
real 0m0.007s
user 0m0.004s
sys 0m0.002s
I think the explanation is the ghc
's wariness of common subexpression elimination: it is dangerous where '(almost) everything is an expression', and it figures the programmer knows how to use a lambda.
GHC compiles this fine. The next step is to micro-optimize the output from the backend of GHC. Playing with various LLVM flags can help here.
To do this, use ghc-core to inspect the assembly, and try other flags to LLVM to see what you get.
Another approach would be to add a small amount of parallelism.
Try this:
fibs :: [Integer]
fibs = 0:1:zipWith (+) fibs (tail fibs)
fib n = fibs !! n
$ time ./fib 10000
33644[...]6875
./fib 10000 0.03s user 0.01s system 89% cpu 0.039 total
(That's on an good ole Athlon64 3200+)
The version you are using is, for every n
, computing fib (n-1)
and fib (n-2)
, that is, has an complexity of roughly triangular nature. The version above is linear: Each fib is only computed once. Despite what the non-Haskell programming hivemind seems to think, Haskell does not automatically memoise (which would be slower, in general, than good ole dynamic programming, anyway).
There's even faster (using maths tricks) version of fibonnaci on the Haskell Wiki.
Change the C version to be non-recursive, and my bet is that you'll see both Haskell and C having very similar performance. Tight loops are just easier to optimise.
精彩评论