
Why is my MergeSort so slow in Python?

I'm having some trouble understanding this behaviour. I'm measuring the execution time with the timeit module and get the following results for 10000 cycles:

  • Merge : 1.22722930395
  • Bubble: 0.810706578175
  • Select: 0.469924766812

This is my code for MergeSort:

def mergeSort(array):
    if len(array) <= 1:
        return array
    else:
        # split in half, sort each half recursively, then merge
        left = array[:len(array) // 2]
        right = array[len(array) // 2:]
        return merge(mergeSort(left), mergeSort(right))

def merge(array1, array2):
    merged_array = []
    # repeatedly take the smaller head element of the two lists
    while len(array1) > 0 or len(array2) > 0:
        if array2 and not array1:
            merged_array.append(array2.pop(0))
        elif (array1 and not array2) or array1[0] < array2[0]:
            merged_array.append(array1.pop(0))
        else:
            merged_array.append(array2.pop(0))
    return merged_array
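
A sketch of how such timings can be taken with the timeit module (the exact harness isn't shown here; the list size and value range are assumptions based on the edit below):

import random
import timeit

# build one input list; each timed call sorts a fresh copy,
# so the runs are independent of each other
data = [random.randint(0, 1000) for _ in range(1000)]

elapsed = timeit.timeit(lambda: mergeSort(list(data)), number=10000)
print(elapsed)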

Edit:

I've changed the list operations to use pointers, and my tests now work with a list of 1000 random numbers from 0-1000. (By the way, I changed to only 10 cycles here.)

result:

  • Merge : 0.0574434420723
  • Bubble: 1.74780097558
  • Select: 0.362952293025

This is my rewritten merge definition:

def merge(array1, array2):
    merged_array = []
    pointer1, pointer2 = 0, 0
    # advance two indexes instead of popping from the front
    while pointer1 < len(array1) and pointer2 < len(array2):
        if array1[pointer1] < array2[pointer2]:
            merged_array.append(array1[pointer1])
            pointer1 += 1
        else:
            merged_array.append(array2[pointer2])
            pointer2 += 1
    # drain whichever list still has elements left
    while pointer1 < len(array1):
        merged_array.append(array1[pointer1])
        pointer1 += 1

    while pointer2 < len(array2):
        merged_array.append(array2[pointer2])
        pointer2 += 1

    return merged_array

seems to work pretty well now :)


list.pop(0) pops the first element and has to shift all the remaining ones; that's an extra O(n) operation which must not happen.
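
If you'd rather keep the pop-from-the-front style, one workaround (a minimal sketch, and an alternative rather than the fix suggested below) is collections.deque, whose popleft() is O(1):

from collections import deque

def merge_deque(array1, array2):
    # deque.popleft() is O(1); list.pop(0) shifts every remaining element
    queue1, queue2 = deque(array1), deque(array2)
    merged_array = []
    while queue1 and queue2:
        if queue1[0] < queue2[0]:
            merged_array.append(queue1.popleft())
        else:
            merged_array.append(queue2.popleft())
    merged_array.extend(queue1)
    merged_array.extend(queue2)
    return merged_array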

Also, slicing a list object creates a copy:

left = array[:len(array) // 2]
right = array[len(array) // 2:]

Which means you're also using O(n * log(n)) memory instead of O(n).

I can't see your BubbleSort, but I bet it works in place; no wonder it's faster.

You need to rewrite it to work in place: instead of copying part of the original list, pass starting and ending indexes.
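
A sketch of that index-based approach (hypothetical code; it still allocates one temporary list per merge, but avoids both pop(0) and the slice copies at each split):

def mergeSortInplace(array, lo=0, hi=None):
    # sorts array[lo:hi] in place; pass index ranges instead of slicing
    if hi is None:
        hi = len(array)
    if hi - lo <= 1:
        return array
    mid = (lo + hi) // 2
    mergeSortInplace(array, lo, mid)
    mergeSortInplace(array, mid, hi)
    # merge the two sorted runs array[lo:mid] and array[mid:hi]
    merged, i, j = [], lo, mid
    while i < mid and j < hi:
        if array[i] <= array[j]:
            merged.append(array[i]); i += 1
        else:
            merged.append(array[j]); j += 1
    merged.extend(array[i:mid])
    merged.extend(array[j:hi])
    array[lo:hi] = merged
    return array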


For starters: I cannot reproduce your timing results, on 100 cycles and lists of size 10000. The exhaustive benchmark with timeit of all implementations discussed in this answer (including bubblesort and your original snippet) is posted as a gist here. I find the following results for the average duration of a single run:

  • Python's native (Tim)sort : 0.0144600081444
  • Bubblesort : 26.9620819092
  • (Your) Original Mergesort : 0.224888720512

Now, to make your function faster, you can do a few things.

  • Edit: Well, apparently, I was wrong on that one (thanks cwillu): length computation takes O(1) in Python. But removing useless computation everywhere still improves things a bit (Original Mergesort: 0.224888720512, no-length Mergesort: 0.195795390606):

    def nolenmerge(array1, array2):
        merged_array = []
        while array1 or array2:
            if not array1:
                merged_array.append(array2.pop(0))
            elif (not array2) or array1[0] < array2[0]:
                merged_array.append(array1.pop(0))
            else:
                merged_array.append(array2.pop(0))
        return merged_array

    def nolenmergeSort(array):
        n = len(array)
        if n <= 1:
            return array
        left = array[:n // 2]
        right = array[n // 2:]
        return nolenmerge(nolenmergeSort(left), nolenmergeSort(right))
    
  • Second, as suggested in this answer, pop(0) is linear. Rewrite your merge to pop() from the end instead:

    def fastmerge(array1, array2):
        # pop() from the tail is O(1); build the result in descending
        # order, then reverse it once at the end
        merged_array = []
        while array1 or array2:
            if not array1:
                merged_array.append(array2.pop())
            elif (not array2) or array1[-1] > array2[-1]:
                merged_array.append(array1.pop())
            else:
                merged_array.append(array2.pop())
        merged_array.reverse()
        return merged_array
    

    This is again faster (no-len Mergesort: 0.195795390606, no-len Mergesort + fastmerge: 0.126505711079).

  • Third - and this would only be useful as-is if you were using a language that does tail call optimization; without it, it's a bad idea - your mergeSort is not tail-recursive: it calls both mergeSort(left) and mergeSort(right) recursively while there is remaining work in the call (the merge).

    But you can make it tail-recursive by using CPS (continuation-passing style). This will run out of stack size for even modest lists if you don't do TCO:

    def cps_merge_sort(array):
        return cpsmergeSort(array, lambda x: x)

    def cpsmergeSort(array, continuation):
        n = len(array)
        if n <= 1:
            return continuation(array)
        left = array[:n // 2]
        right = array[n // 2:]
        return cpsmergeSort(left, lambda leftR:
                            cpsmergeSort(right, lambda rightR:
                                         continuation(fastmerge(leftR, rightR))))
    

    Once this is done, you can do TCO by hand to defer the call stack management done by recursion to the while loop of a normal function (trampolining, explained e.g. here; a trick originally due to Guy Steele). Trampolining and CPS work great together.

    You write a thunking function that "records" and delays application: it takes a function and its arguments, and returns a function that returns (that original function applied to those arguments).

    thunk = lambda name, *args: lambda: name(*args)
    

    You then write a trampoline that manages calls to thunks: it applies a thunk until the thunk returns a result (as opposed to another thunk).

    def trampoline(bouncer):
        while callable(bouncer):
            bouncer = bouncer()
        return bouncer
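
    As a quick sanity check, the same thunk/trampoline pair can drive a deliberately deep recursion; here is a hedged sketch with a CPS factorial (hypothetical, not part of the mergesort). Note that here the continuation call must be thunked too; the mergesort below gets away without that because its continuation chains are only O(log n) deep:

    import math

    def fact_cps(n, continuation):
        # every call, including the continuation, is deferred as a thunk,
        # so the trampoline runs the whole computation in constant stack
        if n == 0:
            return thunk(continuation, 1)
        return thunk(fact_cps, n - 1, lambda r: thunk(continuation, n * r))

    assert trampoline(fact_cps(2000, lambda x: x)) == math.factorial(2000)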
    

    Then all that's left is to "freeze" (thunk) all the recursive calls of the original CPS function, to let the trampoline unwrap them in proper sequence. At every call, your function now returns a thunk, without recursing (and discards its own frame):

    def tco_cpsmergeSort(array, continuation):
        n = len(array)
        if n <= 1:
            return continuation(array)
        left = array[:n // 2]
        right = array[n // 2:]
        return thunk(tco_cpsmergeSort, left, lambda leftR:
                     thunk(tco_cpsmergeSort, right, lambda rightR:
                           continuation(fastmerge(leftR, rightR))))

    mycpsmergesort = lambda l: trampoline(tco_cpsmergeSort(l, lambda x: x))
    

Sadly this does not go that fast (recursive mergesort: 0.126505711079, this trampolined version: 0.170638551712). OK, I guess the stack blowup of the recursive merge sort algorithm is in fact modest: as soon as you get out of the leftmost path in the array-slicing recursion pattern, the algorithm starts returning (and removing frames). So for 10K-sized lists you get a function stack depth of at most log2(10000) ≈ 14 ... pretty modest.

You can do a slightly more involved stack-based TCO elimination along the lines of what this SO answer gives:

    def leftcomb(l):
        # walk down the left spine, pushing the (unsorted) right halves;
        # returns the leftmost slice and the work stack
        maxn, stack = len(l), []
        n = maxn // 2
        while maxn > 1:
            stack.append((l[n:maxn], False))
            maxn, n = n, n // 2
        return l[:maxn], stack

    def tcomergesort(l):
        l, stack = leftcomb(l)
        while stack:  # l is sorted; stack contains tagged slices
            i, ordered = stack.pop()
            if ordered:
                l = fastmerge(l, i)
            else:
                stack.append((l, True))  # store the return call
                rsub, ssub = leftcomb(i)
                stack.extend(ssub)       # recurse
                l = rsub
        return l

But this goes only a tad faster (trampolined mergesort: 0.170638551712, this stack-based version: 0.144994809628). Apparently, the stack-building that Python does at the recursive calls of our original merge sort is pretty inexpensive.

The final results? On my machine (Ubuntu Natty's stock Python 2.7.1+), the average run timings (out of 100 runs, except for Bubblesort; lists of size 10000 containing random integers from 0-10000000) are:

  • Python's native (Tim)sort : 0.0144600081444
  • Bubblesort : 26.9620819092
  • Original Mergesort : 0.224888720512
  • no-len Mergesort : 0.195795390606
  • no-len Mergesort + fastmerge : 0.126505711079
  • trampolined CPS Mergesort + fastmerge : 0.170638551712
  • stack-based mergesort + fastmerge: 0.144994809628


Your merge-sort has a big constant factor; you have to run it on large lists to see the asymptotic complexity benefit.
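
A rough way to see this (a sketch; mergeSort is the question's function, and the sizes are arbitrary): time the same sort at two sizes and compare the growth. An O(n log n) sort should take roughly 170x longer on a 100x larger input, while an O(n^2) sort takes 10000x longer:

    import random
    import timeit

    for n in (1000, 100000):
        data = [random.randint(0, n) for _ in range(n)]
        # three runs each, on fresh copies, to keep this quick
        elapsed = timeit.timeit(lambda: mergeSort(list(data)), number=3)
        print("%d %.4f" % (n, elapsed / 3))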


Umm.. 1,000 records?? You are still well within polynomial-coefficient dominance here. Say I have:

  • selection-sort: 15 * n^2 (reads) + 5 * n^2 (swaps)
  • insertion-sort: 5 * n^2 (reads) + 15 * n^2 (swaps)
  • merge-sort: 200 * n * log(n) (reads) + 1000 * n * log(n) (merges)
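
Plugging n = 1000 into those made-up constants shows how close the race is:

    import math

    n = 1000
    selection = 15 * n ** 2 + 5 * n ** 2       # 2.0e7 operations
    insertion = 5 * n ** 2 + 15 * n ** 2       # 2.0e7 operations
    merge = (200 + 1000) * n * math.log(n, 2)  # ~1.2e7 operations

    print("%g %g %g" % (selection, insertion, merge))  # same ballpark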

You're going to be in a close race for a long while. By the way, 2x faster in sorting is NOTHING; try 100x slower. That's where the real differences are felt. Try "won't finish in my lifetime" algorithms (there are known regular expressions that take that long to match simple strings).

So try 1M or 1G records and let us know if you still think merge-sort isn't doing too well.

That being said..

There are lots of things making this merge-sort expensive. First of all, nobody ever runs quick-sort or merge-sort on small-scale data structures. Where you have "if len <= 1", people generally put "if len <= 16: (use an inline insertion-sort); else: merge-sort", at EACH recursion level (a sketch follows below).

Insertion-sort has a smaller coefficient cost at small sizes of n. Note that 50% of your work is done in this last mile.
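
A sketch of that hybrid (hypothetical code; it reuses the index-based merge() from the question's edit):

    def insertionSort(array):
        # cheap for tiny inputs: few reads/writes, no allocation
        for i in range(1, len(array)):
            key = array[i]
            j = i - 1
            while j >= 0 and array[j] > key:
                array[j + 1] = array[j]
                j -= 1
            array[j + 1] = key
        return array

    def hybridMergeSort(array, cutoff=16):
        if len(array) <= cutoff:
            return insertionSort(array)
        mid = len(array) // 2
        return merge(hybridMergeSort(array[:mid], cutoff),
                     hybridMergeSort(array[mid:], cutoff))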

Next, you are needlessly running array1.pop(0) instead of maintaining index counters. If you're lucky, Python is efficiently managing start-of-array offsets, but, all else being equal, you're mutating input parameters.

Also, you know the size of the target array during the merge, so why copy-and-grow merged_array repeatedly? Pre-allocate the target array at the start of the function. That'll save at least a dozen 'clones' per merge level.
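
A sketch of a pre-allocated merge (note: CPython lists already over-allocate geometrically, so append is amortized O(1); pre-allocating just avoids the intermediate resizes):

    def mergePrealloc(array1, array2):
        # allocate the output once; write by index instead of appending
        merged = [None] * (len(array1) + len(array2))
        i = j = 0
        for k in range(len(merged)):
            if j >= len(array2) or (i < len(array1) and array1[i] <= array2[j]):
                merged[k] = array1[i]
                i += 1
            else:
                merged[k] = array2[j]
                j += 1
        return merged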

In general, merge-sort uses 2x the size of RAM. Your algorithm is probably using 20x because of all the temporary merge buffers (hopefully Python can free structures before recursing). It breaks elegance, but generally the best merge-sort algorithms make an immediate allocation of a merge buffer equal to the size of the source array, and perform complex address arithmetic (or array-index + span-length) to just keep merging data structures back and forth. It won't be as elegant as a simple recursive problem like this, but it comes somewhat close.
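
A sketch of that single-buffer scheme: allocate one auxiliary array up front and let each level of the recursion merge between the source and the buffer, swapping roles (index arithmetic instead of slices; names here are hypothetical):

    def bufferedMergeSort(array):
        aux = list(array)  # the one up-front merge buffer
        _splitMerge(aux, array, 0, len(array))
        return array

    def _splitMerge(src, dst, lo, hi):
        # sort dst[lo:hi], using src as scratch; the roles swap at each level
        if hi - lo <= 1:
            return
        mid = (lo + hi) // 2
        _splitMerge(dst, src, lo, mid)
        _splitMerge(dst, src, mid, hi)
        # merge the two sorted halves of src back into dst
        i, j = lo, mid
        for k in range(lo, hi):
            if j >= hi or (i < mid and src[i] <= src[j]):
                dst[k] = src[i]
                i += 1
            else:
                dst[k] = src[j]
                j += 1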

In C sorting, cache coherence is your biggest enemy. You want hot data structures so you maximize your cache hits. By allocating transient temp buffers (even if the memory manager is returning pointers to hot memory) you run the risk of making slow DRAM calls (pre-filling cache lines for data you're about to over-write). This is one advantage insertion-sort, selection-sort and quick-sort have over merge-sort (when implemented as above).

Speaking of which, quick-sort is naturally elegant code, naturally efficient code, and doesn't waste any memory (Google it on Wikipedia; they have a JavaScript implementation on which to base your code). Squeezing the last ounce of performance out of quick-sort is hard (especially in scripting languages, which is why people generally just use the C API to do that part), and you have a worst case of O(n^2). You can try to be clever and do a combination bubble-sort/quick-sort to mitigate the worst case.
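
For reference, a minimal in-place quick-sort sketch (random pivot, Hoare-style partition; a simple version, not the tuned implementations alluded to above):

    import random

    def quickSort(array, lo=0, hi=None):
        if hi is None:
            hi = len(array) - 1
        if lo >= hi:
            return array
        # a random pivot dodges the already-sorted-input worst case
        pivot = array[random.randint(lo, hi)]
        i, j = lo, hi
        while i <= j:
            while array[i] < pivot:
                i += 1
            while array[j] > pivot:
                j -= 1
            if i <= j:
                array[i], array[j] = array[j], array[i]
                i += 1
                j -= 1
        quickSort(array, lo, j)
        quickSort(array, i, hi)
        return array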

Happy coding.
