Performance-comparison of Sort() and BinarySearch() with Integers/Strings
Originally i wanted to ask if it's faster to sort Integers than Strings. But i have answered this question myself and i'm suprised of the big difference. Why is sorting and BinarySearch Integers as much faster compared to Strings?
The (VB.Net) Test with 1.000.000 Int32/Strings:
Private Function CheckIntBinarySearch() As TimeSpan
Dim watch As New System.Diagnostics.Stopwatch()
Dim rnd As New Random(Date.Now.Millisecond)
Dim intCol1 As New List(Of Int32)
Dim intCol2 As New List(Of Int32)
Dim contains As Int32
For i As Int32 = 1 To 1000000
intCol1.Add(rnd.Next(1, 1000000))
Next
For i As Int32 = 1 To 1000000
intCol2.Add(rnd.Next(1, 1000000))
Next
Me.output.WriteLine("Integers sorting ...")
watch.Start()
intCol1.Sort()
watch.Stop()
Me.output.WriteLine("Sorting finished: " & 开发者_StackOverflowwatch.Elapsed.TotalSeconds & " seconds elapsed.")
Me.output.WriteLine("Integers BinarySearch ...")
watch.Start()
For Each Val As Int32 In intCol2
If intCol1.BinarySearch(Val) > -1 Then contains += 1
Next
watch.Stop()
Me.output.WriteLine("BinarySearch finished(contains " & contains & "): " & watch.Elapsed.TotalSeconds & " seconds elapsed.")
Return watch.Elapsed
End Function
Private Function CheckStringBinarySearch() As TimeSpan
Dim watch As New System.Diagnostics.Stopwatch()
Dim rnd As New Random(Date.Now.Millisecond)
Dim stringCol1 As New List(Of String)
Dim stringCol2 As New List(Of String)
Dim contains As Int32
For i As Int32 = 1 To 1000000
stringCol1.Add(rnd.Next(1, 1000000).ToString)
Next
For i As Int32 = 1 To 1000000
stringCol2.Add(rnd.Next(1, 1000000).ToString)
Next
Me.output.WriteLine("Strings sorting ...")
watch.Start()
stringCol1.Sort()
watch.Stop()
Me.output.WriteLine("Sorting finished: " & watch.Elapsed.TotalSeconds & " seconds elapsed.")
Me.output.WriteLine("Strings BinarySearch ...")
watch.Start()
For Each Val As String In stringCol2
If stringCol1.BinarySearch(Val) > -1 Then contains += 1
Next
watch.Stop()
Me.output.WriteLine("BinarySearch finished(contains " & contains & "): " & watch.Elapsed.TotalSeconds & " seconds elapsed.")
Return watch.Elapsed
End Function
Checking performance 5 times:
For i As Int32 = 1 To 5
intChecks.Add(CheckIntBinarySearch())
Next
For i As Int32 = 1 To 5
stringChecks.Add(CheckStringBinarySearch())
Next
Output:
1.)Integers sorting ...
Sorting finished: 0,2292863 seconds elapsed.
Integers BinarySearch ...
BinarySearch finished(contains 630857): 0,9365744 seconds elapsed.
2.)Integers sorting ...
Sorting finished: 0,2287382 seconds elapsed.
Integers BinarySearch ...
BinarySearch finished(contains 632600): 0,9053134 seconds elapsed.
3.)Integers sorting ...
Sorting finished: 0,2318829 seconds elapsed.
Integers BinarySearch ...
BinarySearch finished(contains 631475): 0,9038594 seconds elapsed.
4.)Integers sorting ...
Sorting finished: 0,2308994 seconds elapsed.
Integers BinarySearch ...
BinarySearch finished(contains 632346): 0,9011047 seconds elapsed.
5.)Integers sorting ...
Sorting finished: 0,2266423 seconds elapsed.
Integers BinarySearch ...
BinarySearch finished(contains 632982): 0,893541 seconds elapsed.
1.)Strings sorting ...
Sorting finished: 6,5661916 seconds elapsed.
Strings BinarySearch ...
BinarySearch finished(contains 632579): 12,9545657 seconds elapsed.
2.)Strings sorting ...
Sorting finished: 6,5641975 seconds elapsed.
Strings BinarySearch ...
BinarySearch finished(contains 631478): 13,0184132 seconds elapsed.
3.)Strings sorting ...
Sorting finished: 6,4281382 seconds elapsed.
Strings BinarySearch ...
BinarySearch finished(contains 631775): 12,7684214 seconds elapsed.
4.)Strings sorting ...
Sorting finished: 6,9455087 seconds elapsed.
Strings BinarySearch ...
BinarySearch finished(contains 631478): 13,7057234 seconds elapsed.
5.)Strings sorting ...
Sorting finished: 6,6707111 seconds elapsed.
Strings BinarySearch ...
BinarySearch finished(contains 632346): 13,0493649 seconds elapsed.
Int32
Sort average: 0,22948982 secondsString
Sort average: 6,63494942 secondsInt32
BinarySearch average: 0,90807858 secondsString
BinarySearch average: 13,09929772 seconds
Conclusion:
- sorting Integers is 29 times faster than sorting Strings
- BinarySearch Integers is 14,4 times faster than BinarySearch Strings
Why? Consider having large collections of "String-Integers"("1","2","3",...). Would it even be better to parse them to integers before sorting and searching them? What is the cost of parsing Strings to integers? Ok, i think this is another question.
There are a number of issues with String comparison which integer comparison escapes.
- Think about how you might write a String comparison. First you might check if the reference was null, then you might check the lengths of the two strings match, then you might compare each digit one at a time. This is at least 2 comparisons, where the integer only uses one.
- Consider comparing "111112" and "111113". For the string 6 comparisons need to be made (one for each character) before getting a result. For the integer it is only one comparison.
- Integer comparisons and arrays can be optimised to use register specific instructions, so a lot of sort/search may happen inside the CPU cache directly - with strings multiple method invocations may be needed to access characters, check length, etc.
- If you need a locale specific comparison, or one which can cope with "ß" being the same as "ss" or "æ" the same as "ae" you can't even use the length optimisation.
- Strings are reference types and integers are value types, so there may be a dereference.
This list is not exhaustive, but at least gives a sense of the issues.
As an aside, not affecting performance - you should be aware that "12" < "2" when comparing strings (this happens in dictionary order), so the code above may not be what you want.
A computer can compare 2 integers with a single instruction, which only takes nanoseconds.
Comparing two strings is another kettle of fish, which involves:
- Get the length of string A
- Get the length of string B
- Check the lengths, if they're both zero, consider the strings equal.
- Get the location of string A
- Get the location of string B
- Start comparing, character by character A against B. (Some processors can do this in one instruction, but it's a long instruction.)
Comparing strings might appear similar to comparing integers, but for a computer, it's much harder, as your test results show
You are using a culture-aware comparison for sorting the strings. Remember, not all languages of the world agree on the alphabet.
This means that whenever two strings are being compared, .NET will look up the user's current culture, and then compare the string using the rules of that language. This is a non-trivial operation - for example "ae" and "æ" are considered equal.
To speed up the sort of strings, use:
stringCol1.Sort(StringComparer.Ordinal)
With this change, you're removing a large part of the overhead (all the culture-aware stuff), but then smirkingman's answer still applies for the simple string comparison.
To get an idea of how complex the default string comparison is, take a look at the Unicode Collation Algorithm.
The problem is that strings are reference types and integers are value types.
For every string, a dereference must be made to the actual location of the string and a comparison must be made.
For integer, the value is already there and the comparison is much much cheaper.
As others said, comparing integers is a single sub
instruction, while to compare two strings they have to be pushed by reference on the stack, the function called, the function entry code performed, then a loop over the characters performed, the function exit code performed, then the return, followed by a single sub
instruction.
If you pause it a few times that's what you'll see. On nearly every pause, it will be in the string-compare function, telling you that function accounts for 90% or more of the time.
精彩评论