How can I make my function run as fast as "Contains" on an ArrayList?
I can't figure out a discrepancy between the time it takes for the Contains
method to find an element in an ArrayList
and the time it takes for a small function that I wrote to do the same thing. The documentation states that Contains
performs a linear search, so it's supposed to be in O(n)
and not any other faster method. However, while the exact values may not be relevant, the Contains
method returns in 00:00:00.1087087
seconds while my function takes 00:00:00.1876165
. It might not be much, but this difference becomes more evident when dealing with even larger arrays. What am I missing and how should I write my function to match Contains
's performances?
I'm using C# on .NET 3.5.
public partial class Window1 : Window
{
public bool DoesContain(ArrayList list, object element)
{
for (int i = 0; i < list.Count; i++)
if (list[i].Equals(element)) return true;
return false;
}
public Window1()
{
InitializeComponent();
ArrayList list = new ArrayList();
for (int i = 0; i < 10000000; i++) list.Add("zzz " + i);
Stopwatch sw = new Stopwatch();
sw.Start();
//Console.Out.WriteLine(list.Contains("zzz 9000000") + " " + sw.Elapsed);
Console.Out.WriteLine(DoesContain(list, "zzz 9000000") + " " + sw.Elapsed);
}
}
EDIT:
Okay, now, lads, look:
public partial class Window1 : Window
{
public bool DoesContain(ArrayList list, object element)
{
int count = list.Count;
for (int i = count - 1; i >= 0; i--)
if (element.Equals(list[i])) return true;
return false;
}
public bool DoesContain1(ArrayList list, object element)
{
int count = list.Count;
for (int i = 0; i < count; i++)
if (element.Equals(list[i])) return true;
return false;
}
public Window1()
{
InitializeComponent();
ArrayList list = new ArrayList();
for (int i = 0; i < 10000000; i++) list.Add("zzz " + i);
Stopwatch sw = new Stopwatch();
long total = 0;
int nr = 100;
for (int i = 0; i < nr; i++)
{
sw.Reset();
sw.Start();
DoesContain(list,"zzz");
total += sw.ElapsedMilliseconds;
}
Console.Out.WriteLine(total / nr);
total = 0;
for (int i = 0; i < nr; i++)
{
sw.Reset();
sw.Start();
DoesContain1(list, "zzz");
total += sw.ElapsedMilliseconds;
}
Console.Out.WriteLine(total / nr);
total = 0;
for (int i = 0; i < nr; i++)
{
sw.Reset();
sw.Start();
开发者_StackOverflow中文版 list.Contains("zzz");
total += sw.ElapsedMilliseconds;
}
Console.Out.WriteLine(total / nr);
}
}
I made an average of 100 running times for two versions of my function(forward and backward loop) and for the default Contains
function. The times I've got are 136
and
133
milliseconds for my functions and a distant winner of 87
for the Contains
version. Well now, if before you could argue that the data was scarce and I based my conclusions on a first, isolated run, what do you say about this test? Not does only on average Contains
perform better, but it achieves consistently better results in each run. So, is there some kind of disadvantage in here for 3rd party functions, or what?
First, you're not running it many times and comparing averages.
Second, your method isn't being jitted until it actually runs. So the just in time compile time is added into its execution time.
A true test would run each multiple times and average the results (any number of things could cause one or the other to be slower for run X out of a total of Y), and your assemblies should be pre-jitted using ngen.exe.
As you're using .NET 3.5, why are you using ArrayList
to start with, rather than List<string>
?
A few things to try:
- You could see whether using
foreach
instead of afor
loop helps You could cache the count:
public bool DoesContain(ArrayList list, object element) { int count = list.Count; for (int i = 0; i < count; i++) { if (list[i].Equals(element)) { return true; } return false; } }
You could reverse the comparison:
if (element.Equals(list[i]))
While I don't expect any of these to make a significant (positive) difference, they're the next things I'd try.
Do you need to do this containment test more than once? If so, you might want to build a HashSet<T>
and use that repeatedly.
I'm not sure if you're allowed to post Reflector code, but if you open the method using Reflector, you can see that's it's essentially the same (there are some optimizations for null values, but your test harness doesn't include nulls).
The only difference that I can see is that calling list[i]
does bounds checking on i
whereas the Contains method does not.
Using the code below I was able to get the following timings relatively consitently (within a few ms):
1: 190ms DoesContainRev
2: 198ms DoesContainRev1
3: 188ms DoesContainFwd
4: 203ms DoesContainFwd1
5: 199ms Contains
Several things to notice here.
This is run with release compiled code from the commandline. Many people make the mistake of benchmarking code inside the Visual Studio debugging environment, not to say anyone here did but something to be careful of.
The
list[i].Equals(element)
appears to be just a bit slower thanelement.Equals(list[i])
.using System; using System.Diagnostics; using System.Collections; namespace ArrayListBenchmark { class Program { static void Main(string[] args) { Stopwatch sw = new Stopwatch(); const int arrayCount = 10000000; ArrayList list = new ArrayList(arrayCount); for (int i = 0; i < arrayCount; i++) list.Add("zzz " + i); sw.Start(); DoesContainRev(list, "zzz"); sw.Stop(); Console.WriteLine(String.Format("1: {0}", sw.ElapsedMilliseconds)); sw.Reset(); sw.Start(); DoesContainRev1(list, "zzz"); sw.Stop(); Console.WriteLine(String.Format("2: {0}", sw.ElapsedMilliseconds)); sw.Reset(); sw.Start(); DoesContainFwd(list, "zzz"); sw.Stop(); Console.WriteLine(String.Format("3: {0}", sw.ElapsedMilliseconds)); sw.Reset(); sw.Start(); DoesContainFwd1(list, "zzz"); sw.Stop(); Console.WriteLine(String.Format("4: {0}", sw.ElapsedMilliseconds)); sw.Reset(); sw.Start(); list.Contains("zzz"); sw.Stop(); Console.WriteLine(String.Format("5: {0}", sw.ElapsedMilliseconds)); sw.Reset(); Console.ReadKey(); } public static bool DoesContainRev(ArrayList list, object element) { int count = list.Count; for (int i = count - 1; i >= 0; i--) if (element.Equals(list[i])) return true; return false; } public static bool DoesContainFwd(ArrayList list, object element) { int count = list.Count; for (int i = 0; i < count; i++) if (element.Equals(list[i])) return true; return false; } public static bool DoesContainRev1(ArrayList list, object element) { int count = list.Count; for (int i = count - 1; i >= 0; i--) if (list[i].Equals(element)) return true; return false; } public static bool DoesContainFwd1(ArrayList list, object element) { int count = list.Count; for (int i = 0; i < count; i++) if (list[i].Equals(element)) return true; return false; } } }
With a really good optimizer there should not be difference at all, because the semantics seems to be the same. However the existing optimizer can optimize your function not so good as the hardcoded Contains
is optimized. Some of the points for optimization:
- comparing to a property each time can be slower than counting downwards and comparing against 0
- function call itself has its performance penalty
- using iterators instead of explicit indexing can be faster (
foreach
loop instead of plainfor
)
First, if you are using types you know ahead of time, I'd suggest using generics. So List instead of ArrayList. Underneath the hood, ArrayList.Contains actually does a bit more than what you are doing. The following is from reflector:
public virtual bool Contains(object item)
{
if (item == null)
{
for (int j = 0; j < this._size; j++)
{
if (this._items[j] == null)
{
return true;
}
}
return false;
}
for (int i = 0; i < this._size; i++)
{
if ((this._items[i] != null) && this._items[i].Equals(item))
{
return true;
}
}
return false;
}
Notice that it forks itself on being passed a null value for item. However, since all the values in your example are not null, the additional check on null at the beginning and in the second loop should in theory take longer.
Are you positive you are dealing with fully compiled code? I.e., when your code runs the first time it gets JIT compiled where as the framework is obviously already compiled.
After your Edit, I copied the code and made a few improvements to it.
The difference was not reproducable, it turns out to be a measuring/rounding issue.
To see that, change your runs to this form:
sw.Reset();
sw.Start();
for (int i = 0; i < nr; i++)
{
DoesContain(list,"zzz");
}
total += sw.ElapsedMilliseconds;
Console.WriteLine(total / nr);
I just moved some lines. The JIT issue was insignificant with this numbr of repetitions.
My guess would be that ArrayList is written in C++ and could be taking advantage of some micro-optimizations (note: this is a guess).
For instance, in C++ you can use pointer arithmetic (specifically incrementing a pointer to iterate an array) to be faster than using an index.
using an array structure, you can't search faster than O(n) whithout any additional information. if you know that the array is sorted, then you can use binary search algorithm and spent only o(log(n)) otherwise you should use a set.
Revised after reading comments:
It does not use some Hash-alogorithm to enable fast lookup.
Use SortedList<TKey,TValue>
, Dictionary<TKey, TValue>
or System.Collections.ObjectModel.KeyedCollection<TKey, TValue>
for fast access based on a key.
var list = new List<myObject>(); // Search is sequential
var dictionary = new Dictionary<myObject, myObject>(); // key based lookup, but no sequential lookup, Contains fast
var sortedList = new SortedList<myObject, myObject>(); // key based and sequential lookup, Contains fast
KeyedCollection<TKey, TValue>
is also fast and allows indexed lookup, however, it needs to be inherited as it is abstract. Therefore, you need a specific collection. However, with the following you can create a generic KeyedCollection
.
public class GenericKeyedCollection<TKey, TValue> : KeyedCollection<TKey, TValue> {
public GenericKeyedCollection(Func<TValue, TKey> keyExtractor) {
this.keyExtractor = keyExtractor;
}
private Func<TValue, TKey> keyExtractor;
protected override TKey GetKeyForItem(TValue value) {
return this.keyExtractor(value);
}
}
The advantage of using the KeyedCollection is that the Add method does not require that a key is specified.
精彩评论