开发者

Performance-comparison of Sort() and BinarySearch() with Integers/Strings

Originally i wanted to ask if it's faster to sort Integers than Strings. But i have answered this question myself and i'm suprised of the big difference. Why is sorting and BinarySearch Integers as much faster compared to Strings?

The (VB.Net) Test with 1.000.000 Int32/Strings:

Private Function CheckIntBinarySearch() As TimeSpan
    Dim watch As New System.Diagnostics.Stopwatch()
    Dim rnd As New Random(Date.Now.Millisecond)
    Dim intCol1 As New List(Of Int32)
    Dim intCol2 As New List(Of Int32)
    Dim contains As Int32
    For i As Int32 = 1 To 1000000
        intCol1.Add(rnd.Next(1, 1000000))
    Next
    For i As Int32 = 1 To 1000000
        intCol2.Add(rnd.Next(1, 1000000))
    Next
    Me.output.WriteLine("Integers sorting ...")
    watch.Start()
    intCol1.Sort()
    watch.Stop()
    Me.output.WriteLine("Sorting finished: " & 开发者_StackOverflowwatch.Elapsed.TotalSeconds & " seconds elapsed.")

    Me.output.WriteLine("Integers BinarySearch ...")
    watch.Start()
    For Each Val As Int32 In intCol2
        If intCol1.BinarySearch(Val) > -1 Then contains += 1
    Next
    watch.Stop()
    Me.output.WriteLine("BinarySearch finished(contains " & contains & "): " & watch.Elapsed.TotalSeconds & " seconds elapsed.")
    Return watch.Elapsed
End Function

Private Function CheckStringBinarySearch() As TimeSpan
    Dim watch As New System.Diagnostics.Stopwatch()
    Dim rnd As New Random(Date.Now.Millisecond)
    Dim stringCol1 As New List(Of String)
    Dim stringCol2 As New List(Of String)
    Dim contains As Int32
    For i As Int32 = 1 To 1000000
        stringCol1.Add(rnd.Next(1, 1000000).ToString)
    Next
    For i As Int32 = 1 To 1000000
        stringCol2.Add(rnd.Next(1, 1000000).ToString)
    Next
    Me.output.WriteLine("Strings sorting ...")
    watch.Start()
    stringCol1.Sort()
    watch.Stop()
    Me.output.WriteLine("Sorting finished: " & watch.Elapsed.TotalSeconds & " seconds elapsed.")
    Me.output.WriteLine("Strings BinarySearch ...")
    watch.Start()
    For Each Val As String In stringCol2
        If stringCol1.BinarySearch(Val) > -1 Then contains += 1
    Next
    watch.Stop()
    Me.output.WriteLine("BinarySearch finished(contains " & contains & "): " & watch.Elapsed.TotalSeconds & " seconds elapsed.")
    Return watch.Elapsed
End Function

Checking performance 5 times:

For i As Int32 = 1 To 5
   intChecks.Add(CheckIntBinarySearch())
Next
For i As Int32 = 1 To 5
   stringChecks.Add(CheckStringBinarySearch())
Next

Output:

    1.)Integers sorting ...
    Sorting finished: 0,2292863 seconds elapsed.
    Integers BinarySearch ...
    BinarySearch finished(contains 630857): 0,9365744 seconds elapsed.
    2.)Integers sorting ...
    Sorting finished: 0,2287382 seconds elapsed.
    Integers BinarySearch ...
    BinarySearch finished(contains 632600): 0,9053134 seconds elapsed.
    3.)Integers sorting ...
    Sorting finished: 0,2318829 seconds elapsed.
    Integers BinarySearch ...
    BinarySearch finished(contains 631475): 0,9038594 seconds elapsed.
    4.)Integers sorting ...
    Sorting finished: 0,2308994 seconds elapsed.
    Integers BinarySearch ...
    BinarySearch finished(contains 632346): 0,9011047 seconds elapsed.
    5.)Integers sorting ...
    Sorting finished: 0,2266423 seconds elapsed.
    Integers BinarySearch ...
    BinarySearch finished(contains 632982): 0,893541 seconds elapsed.
    1.)Strings sorting ...
    Sorting finished: 6,5661916 seconds elapsed.
    Strings BinarySearch ...
    BinarySearch finished(contains 632579): 12,9545657 seconds elapsed.
    2.)Strings sorting ...
    Sorting finished: 6,5641975 seconds elapsed.
    Strings BinarySearch ...
    BinarySearch finished(contains 631478): 13,0184132 seconds elapsed.
    3.)Strings sorting ...
    Sorting finished: 6,4281382 seconds elapsed.
    Strings BinarySearch ...
    BinarySearch finished(contains 631775): 12,7684214 seconds elapsed.
    4.)Strings sorting ...
    Sorting finished: 6,9455087 seconds elapsed.
    Strings BinarySearch ...
    BinarySearch finished(contains 631478): 13,7057234 seconds elapsed.
    5.)Strings sorting ...
    Sorting finished: 6,6707111 seconds elapsed.
    Strings BinarySearch ...
    BinarySearch finished(contains 632346): 13,0493649 seconds elapsed.
  • Int32 Sort average: 0,22948982 seconds
  • String Sort average: 6,63494942 seconds
  • Int32 BinarySearch average: 0,90807858 seconds
  • String BinarySearch average: 13,09929772 seconds

Conclusion:

  • sorting Integers is 29 times faster than sorting Strings
  • BinarySearch Integers is 14,4 times faster than BinarySearch Strings

Why? Consider having large collections of "String-Integers"("1","2","3",...). Would it even be better to parse them to integers before sorting and searching them? What is the cost of parsing Strings to integers? Ok, i think this is another question.


There are a number of issues with String comparison which integer comparison escapes.

  1. Think about how you might write a String comparison. First you might check if the reference was null, then you might check the lengths of the two strings match, then you might compare each digit one at a time. This is at least 2 comparisons, where the integer only uses one.
  2. Consider comparing "111112" and "111113". For the string 6 comparisons need to be made (one for each character) before getting a result. For the integer it is only one comparison.
  3. Integer comparisons and arrays can be optimised to use register specific instructions, so a lot of sort/search may happen inside the CPU cache directly - with strings multiple method invocations may be needed to access characters, check length, etc.
  4. If you need a locale specific comparison, or one which can cope with "ß" being the same as "ss" or "æ" the same as "ae" you can't even use the length optimisation.
  5. Strings are reference types and integers are value types, so there may be a dereference.

This list is not exhaustive, but at least gives a sense of the issues.

As an aside, not affecting performance - you should be aware that "12" < "2" when comparing strings (this happens in dictionary order), so the code above may not be what you want.


A computer can compare 2 integers with a single instruction, which only takes nanoseconds.

Comparing two strings is another kettle of fish, which involves:

  1. Get the length of string A
  2. Get the length of string B
  3. Check the lengths, if they're both zero, consider the strings equal.
  4. Get the location of string A
  5. Get the location of string B
  6. Start comparing, character by character A against B. (Some processors can do this in one instruction, but it's a long instruction.)

Comparing strings might appear similar to comparing integers, but for a computer, it's much harder, as your test results show


You are using a culture-aware comparison for sorting the strings. Remember, not all languages of the world agree on the alphabet.

This means that whenever two strings are being compared, .NET will look up the user's current culture, and then compare the string using the rules of that language. This is a non-trivial operation - for example "ae" and "æ" are considered equal.

To speed up the sort of strings, use:

 stringCol1.Sort(StringComparer.Ordinal)

With this change, you're removing a large part of the overhead (all the culture-aware stuff), but then smirkingman's answer still applies for the simple string comparison.

To get an idea of how complex the default string comparison is, take a look at the Unicode Collation Algorithm.


The problem is that strings are reference types and integers are value types.

For every string, a dereference must be made to the actual location of the string and a comparison must be made.

For integer, the value is already there and the comparison is much much cheaper.


As others said, comparing integers is a single sub instruction, while to compare two strings they have to be pushed by reference on the stack, the function called, the function entry code performed, then a loop over the characters performed, the function exit code performed, then the return, followed by a single sub instruction.

If you pause it a few times that's what you'll see. On nearly every pause, it will be in the string-compare function, telling you that function accounts for 90% or more of the time.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜