Why does my performance slow to a crawl when I move methods into a base class?
I'm writing different implementations of immutable binary trees in C#, and I wanted my trees to inherit some common methods from a base class.
Unfortunately, classes which derive from the base class are abysmally slow. Non-derived classes perform adequately. Here are two nearly identical implementations of an AVL tree to demonstrate:
- AvlTree: http://pastebin.com/V4WWUAyT
- DerivedAvlTree: http://pastebin.com/PussQDmN
The two trees have exactly the same code, except that I've moved the DerivedAvlTree.Insert method into a base class. Here's a test app:
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using Juliet.Collections.Immutable;

namespace ConsoleApplication1
{
    class Program
    {
        const int VALUE_COUNT = 5000;

        static void Main(string[] args)
        {
            var avlTreeTimes = TimeIt(TestAvlTree);
            var derivedAvlTreeTimes = TimeIt(TestDerivedAvlTree);
            Console.WriteLine("avlTreeTimes: {0}, derivedAvlTreeTimes: {1}", avlTreeTimes, derivedAvlTreeTimes);
        }

        static double TimeIt(Func<int, int> f)
        {
            var seeds = new int[] { 314159265, 271828183, 231406926, 141421356, 161803399, 266514414, 15485867, 122949829, 198491329, 42 };
            var times = new List<double>();
            foreach (int seed in seeds)
            {
                var sw = Stopwatch.StartNew();
                f(seed);
                sw.Stop();
                times.Add(sw.Elapsed.TotalMilliseconds);
            }
            // throwing away top and bottom results
            times.Sort();
            times.RemoveAt(0);
            times.RemoveAt(times.Count - 1);
            return times.Average();
        }

        static int TestAvlTree(int seed)
        {
            var rnd = new System.Random(seed);
            var avlTree = AvlTree<double>.Create((x, y) => x.CompareTo(y));
            for (int i = 0; i < VALUE_COUNT; i++)
            {
                avlTree = avlTree.Insert(rnd.NextDouble());
            }
            return avlTree.Count;
        }

        static int TestDerivedAvlTree(int seed)
        {
            var rnd = new System.Random(seed);
            var avlTree2 = DerivedAvlTree<double>.Create((x, y) => x.CompareTo(y));
            for (int i = 0; i < VALUE_COUNT; i++)
            {
                avlTree2 = avlTree2.Insert(rnd.NextDouble());
            }
            return avlTree2.Count;
        }
    }
}
- AvlTree: inserts 5000 items in 121 ms
- DerivedAvlTree: inserts 5000 items in 2182 ms
My profiler indicates that the program spends an inordinate amount of time in BaseBinaryTree.Insert. Anyone who's interested can see the EQATEC log file I've created with the code above (you'll need the EQATEC profiler to make sense of the file).
I really want to use a common base class for all of my binary trees, but I can't do that if performance will suffer.
What causes my DerivedAvlTree to perform so badly, and what can I do to fix it?
Note - there's now a "clean" solution here, so skip to the final edit if you only want a version that runs fast and don't care about all of the detective work.
It doesn't seem to be the difference between direct and virtual calls that's causing the slowdown. It's something to do with those delegates; I can't quite explain specifically what it is, but a look at the generated IL is showing a lot of cached delegates which I think might not be getting used in the base class version. But the IL itself doesn't seem to be significantly different between the two versions, which leads me to believe that the jitter itself is partly responsible.
Take a look at this refactoring, which cuts the running time by about 60%:
public virtual TreeType Insert(T value)
{
    Func<TreeType, T, TreeType, TreeType> nodeFunc = (l, x, r) =>
    {
        int compare = this.Comparer(value, x);
        if (compare < 0) { return CreateNode(l.Insert(value), x, r); }
        else if (compare > 0) { return CreateNode(l, x, r.Insert(value)); }
        return Self();
    };
    return Insert<TreeType>(value, nodeFunc);
}

private TreeType Insert<U>(T value,
    Func<TreeType, T, TreeType, TreeType> nodeFunc)
{
    return this.Match<TreeType>(
        () => CreateNode(Self(), value, Self()),
        nodeFunc);
}
This should (and apparently does) ensure that the insertion delegate is only being created once per insert - it's not getting created on each recursion. On my machine it cuts the running time from 350 ms to 120 ms (by contrast, the single-class version runs in about 30 ms, so this is still nowhere near where it should be).
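To make the "created once per insert versus once per recursion" distinction concrete, here is a minimal sketch that is unrelated to the tree code. ClosureDemo, SumPerCall and SumShared are hypothetical names invented for this illustration, and DelegatesCreated is a hand-maintained counter that mirrors the delegate constructions; it is not something the runtime reports.

```csharp
using System;

static class ClosureDemo
{
    // Hand-maintained counter: incremented next to each delegate construction.
    public static int DelegatesCreated;

    // Variant A: the lambda captures 'value', so every recursive call that
    // reaches the lambda expression builds a fresh delegate instance.
    public static int SumPerCall(int depth, int value)
    {
        if (depth == 0) return 0;
        DelegatesCreated++;
        Func<int, int> add = x => x + value;
        return add(SumPerCall(depth - 1, value));
    }

    // Variant B: the delegate is built once by the entry point below
    // and then passed down through the recursion unchanged.
    public static int SumShared(int depth, Func<int, int> add)
    {
        if (depth == 0) return 0;
        return add(SumShared(depth - 1, add));
    }

    public static int SumShared(int depth, int value)
    {
        DelegatesCreated++;
        Func<int, int> add = x => x + value;
        return SumShared(depth, add);
    }

    static void Main()
    {
        DelegatesCreated = 0;
        int a = SumPerCall(3, 2);
        Console.WriteLine("per call: result {0}, delegates {1}", a, DelegatesCreated);

        DelegatesCreated = 0;
        int b = SumShared(3, 2);
        Console.WriteLine("shared:   result {0}, delegates {1}", b, DelegatesCreated);
    }
}
```

Both variants compute the same result; only the number of delegate constructions differs, which is the effect the refactoring above is aiming for.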
But here's where it gets even weirder - after trying the above refactoring, I figured, hmm, maybe it's still slow because I only did half the work. So I tried materializing the first delegate as well:
public virtual TreeType Insert(T value)
{
    Func<TreeType> nilFunc = () => CreateNode(Self(), value, Self());
    Func<TreeType, T, TreeType, TreeType> nodeFunc = (l, x, r) =>
    {
        int compare = this.Comparer(value, x);
        if (compare < 0) { return CreateNode(l.Insert(value), x, r); }
        else if (compare > 0) { return CreateNode(l, x, r.Insert(value)); }
        return Self();
    };
    return Insert<TreeType>(value, nilFunc, nodeFunc);
}

private TreeType Insert<U>(T value, Func<TreeType> nilFunc,
    Func<TreeType, T, TreeType, TreeType> nodeFunc)
{
    return this.Match<TreeType>(nilFunc, nodeFunc);
}
And guess what... this made it slower again! With this version, on my machine, it took a little over 250 ms on this run.
This defies all logical explanations that might relate the issue to the compiled bytecode, which is why I suspect that the jitter is in on this conspiracy. I think the first "optimization" above might be (WARNING - speculation ahead) allowing that insertion delegate to be inlined - it's a known fact that the jitter can't inline virtual calls - but there's still something else that's not being inlined and that's where I'm presently stumped.
My next step would be to selectively disable inlining on certain methods via the MethodImplAttribute and see what effect that has on the runtime - that would help to prove or disprove this theory.
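For anyone who wants to try that experiment, this is roughly what disabling inlining on a single method looks like. It is only a sketch: InliningProbe and its methods are hypothetical stand-ins, not the tree code. MethodImplAttribute and MethodImplOptions.NoInlining live in System.Runtime.CompilerServices. Whether the unmarked method really gets inlined is up to the jitter, so treat any timing difference as a hint and measure in a release build without the debugger attached.

```csharp
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;

static class InliningProbe
{
    // The jitter is told never to inline this method.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int AddNoInline(int a, int b) { return a + b; }

    // No attribute: the jitter is free to inline this if it chooses to.
    public static int AddDefault(int a, int b) { return a + b; }

    public static int RunNoInline(int iterations)
    {
        int acc = 0;
        for (int i = 0; i < iterations; i++) { acc = AddNoInline(acc, 1); }
        return acc;
    }

    public static int RunDefault(int iterations)
    {
        int acc = 0;
        for (int i = 0; i < iterations; i++) { acc = AddDefault(acc, 1); }
        return acc;
    }

    static void Main()
    {
        const int N = 100000000;

        var sw = Stopwatch.StartNew();
        int a = RunNoInline(N);
        sw.Stop();
        Console.WriteLine("NoInlining: {0} ms (result {1})", sw.Elapsed.TotalMilliseconds, a);

        sw = Stopwatch.StartNew();
        int b = RunDefault(N);
        sw.Stop();
        Console.WriteLine("Default:    {0} ms (result {1})", sw.Elapsed.TotalMilliseconds, b);
    }
}
```

Applying the same attribute to methods such as CreateNode or Self in the base class, one at a time, would show which of them the timing is sensitive to.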
I know this isn't a complete answer but hopefully it at least gives you something to work with, and maybe some further experimentation with this decomposition can produce results that are close in performance to the original version.
Edit: Hah, right after I submitted this I stumbled on another optimization. If you add this method to the base class:
private TreeType CreateNilNode(T value)
{
    return CreateNode(Self(), value, Self());
}
Now the running time drops to 38 ms here, just barely above the original version. This blows my mind, because nothing actually references this method! The private Insert<U> method is still identical to the very first code block in my answer. I was going to change the first argument to reference the CreateNilNode method, but I didn't have to. Either the jitter is seeing that the anonymous delegate is the same as the CreateNilNode method and sharing the body (probably inlining again), or... or, I don't know. This is the first instance I've ever witnessed where adding a private method and never calling it can speed up a program by a factor of 4.
You'll have to check this to make sure I haven't accidentally introduced any logic errors - pretty sure I haven't, the code is almost the same - but if it all checks out, then here you are, this runs almost as fast as the non-derived AvlTree.
FURTHER UPDATE
I was able to come up with a version of the base/derived combination that actually runs slightly faster than the single-class version. Took some coaxing, but it works!
What we need to do is create a dedicated inserter that can create all of the delegates just once, without needing to do any variable capturing. Instead, all of the state is stored in member fields. Put this inside the BaseBinaryTree class:
protected class Inserter
{
    private TreeType tree;
    private Func<TreeType> nilFunc;
    private Func<TreeType, T, TreeType, TreeType> nodeFunc;
    private T value;

    public Inserter(T value)
    {
        this.nilFunc = () => CreateNode();
        this.nodeFunc = (l, x, r) => PerformMatch(l, x, r);
        this.value = value;
    }

    public TreeType Insert(TreeType parent)
    {
        this.tree = parent;
        return tree.Match<TreeType>(nilFunc, nodeFunc);
    }

    private TreeType CreateNode()
    {
        return tree.CreateNode(tree, value, tree);
    }

    private TreeType PerformMatch(TreeType l, T x, TreeType r)
    {
        int compare = tree.Comparer(value, x);
        if (compare < 0) { return tree.CreateNode(l.Insert(value, this), x, r); }
        else if (compare > 0) { return tree.CreateNode(l, x, r.Insert(value, this)); }
        return tree;
    }
}
Yes, yes, I know, it's very un-functional using that mutable internal tree state, but remember that this isn't the tree itself, it's just a throwaway "runnable" instance. Nobody ever said that perf-opt was pretty! This is the only way to avoid creating a new Inserter for each recursive call, which would otherwise slow this down on account of all the new allocations of the Inserter and its internal delegates.
Now replace the insertion methods of the base class with this:
public TreeType Insert(T value)
{
    return Insert(value, null);
}

protected virtual TreeType Insert(T value, Inserter inserter)
{
    if (inserter == null)
    {
        inserter = new Inserter(value);
    }
    return inserter.Insert(Self());
}
I've made the public Insert method non-virtual; all of the real work is delegated to a protected method that takes (or creates its own) Inserter instance. Altering the derived class is simple enough, just replace the overridden Insert method with this:
protected override DerivedAvlTree<T> Insert(T value, Inserter inserter)
{
    return base.Insert(value, inserter).Balance();
}
That's it. Now run this. It will take almost the exact same amount of time as the AvlTree, usually a few milliseconds less in a release build.
The slowdown is clearly due to some specific combination of virtual methods, anonymous methods and variable capturing that's somehow preventing the jitter from making an important optimization. I'm not so sure that it's inlining anymore, it might just be caching the delegates, but I think the only people who could really elaborate are the jitter folks themselves.
It's not anything to do with the derived class calling the original implementation and then also calling Balance, is it?
I think you'll probably need to look at the generated machine code to see what's different. All I can see from the source code is that you've changed a lot of static methods into virtual methods called polymorphically... in the first case the JIT knows exactly what method will be called and can do a direct call instruction, possibly even inline. But with a polymorphic call it has no choice but to do a v-table lookup and indirect call. That lookup represents a significant fraction of the work being done.
Life might get a little better if you call ((TreeType)this).Method() instead of this.Method(), but likely you can't remove the polymorphic call unless you also declare the overriding methods as sealed. And even then, you might pay the penalty of a runtime check on the this instance.
Putting your reusable code into generic static methods in the base class might help somewhat as well, but I think you're still going to be paying for polymorphic calls. C# generics just don't optimize as well as C++ templates.
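As a rough sketch of the two suggestions (a sealed override plus a static generic helper in the base class), here is a toy hierarchy. Shape, Square and TotalArea are hypothetical names made up for illustration and have nothing to do with the tree classes. Note that for reference types the generic helper still goes through the virtual Area() call, so this shows the shape of the code rather than a guaranteed speedup.

```csharp
using System;

abstract class Shape
{
    public abstract double Area();

    // Reusable logic as a static generic method on the base class.
    // TShape is constrained to Shape, so Area() is still a virtual call here.
    public static double TotalArea<TShape>(TShape[] shapes) where TShape : Shape
    {
        double total = 0;
        foreach (TShape s in shapes) { total += s.Area(); }
        return total;
    }
}

// 'sealed' tells the compiler and jitter that no further overrides can exist.
sealed class Square : Shape
{
    private readonly double side;

    public Square(double side) { this.side = side; }

    public override double Area() { return side * side; }
}

static class SealedDemo
{
    static void Main()
    {
        var squares = new[] { new Square(2), new Square(3) };
        Console.WriteLine(Shape.TotalArea(squares));
    }
}
```

Whether a given runtime turns the call on a sealed type into a direct call is something you would have to confirm by looking at the generated machine code.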
You're running under the VS IDE, right? It's taking about 20 times longer, right?
Wrap a loop around it to iterate it 10 times, so the long version takes 20 seconds. Then while it is running, hit the "pause" button, and look at the call stack. You will see exactly what the problem is with 95% certainty. If you don't believe what you see, try it a few more times. Why does it work? Here's the long explanation, and here's the short one.