开发者

Quicksort- how pivot-choosing strategies affect the overall Big-oh behavior of quicksort?

I have came up with several strategies, but I am not entirely sure how they affect the overall behavior. I know the average case is O(NlogN), so I would assume that would be in the answer somewhere. I want to just put NlogN+1 for if I just select the 1st item in the array as the the pivot for the quicksort, but I don't know whether that is either correct nor acceptable? If anyone could enlighten me on this subject that would be great. Thanks!

Possible Strategies:

a) Array is random: pick the first item since that is the most cost effective choice.

b) Array is mostly sorted: pick middle item so we are likely to compliment the binary recursion of splitting in half each time.

c) Array is relatively large: pick first, middle and last indexes in array and compare them, picking the smallest to ensure we avoid worst case.

d) Perform 'c' w开发者_如何学运维ith randomly generated indexes to make selection less deterministic.


An important fact you should know is that in an array of distinct elements, quicksort with a random choice of partition will run in O(n lg n). There are many good proofs of this, and the one on Wikipedia actually has a pretty good discussion of this. If you're willing to go for a slightly less formal proof that's mostly mathematically sound, the intuition goes as follows. Whenever we pick a pivot, let's say that a "good" pivot is a pivot that gives us at least a 75%/25% split; that is, it's greater than at least 25% of the elements and at most 75% of the elements. We want to bound the number of times that we can get a pivot of this sort before the algorithm terminates. Suppose that we get k splits of this sort and consider the size of the largest subproblem generated this way. It has size at most (3/4)kn, since on each iteration we're getting rid of at least a quarter of the elements. If we consider the specific case where k = log3/4 (1/n) = log4/3 n, then the size of the largest subproblem after k good pivots are chosen will be 1, and the recursion will stop. This means that if we choose get O(lg n) good pivots, the recursion will terminate. But on each iteration, what's the chance of getting such a pivot? Well, if we pick the pivot randomly, then there's a 50% chance that it's in the middle 50% of the elements, and so on expectation we'll choose two random pivots before we get a good pivot. Each step of choosing a pivot takes O(n) time, and so we should spend roughly O(n) time before getting each good pivot. Since we get at most O(lg n) good pivots, the overall runtime is O(n lg n) on expectation.

An important detail in the above discussion is that if you replace the 75%/25% split with any constant split - say, a (100 - k%) / k% split - the over asymptotic analysis is the same. You'll get that quicksort takes, on average, O(n lg n) time.

The reason that I've mentioned this proof is that it gives you a good framework for thinking about how to choose a pivot in quicksort. If you can pick a pivot that's pretty close to the middle on each iteartion, you can guarantee O(n lg n) runtime. If you can't guarantee that you'll get a good pivot on any iteration, but can say that on expectation it takes only a constant number of iterations before you get a good pivot, then you can also guarantee O(n lg n) expected runtime.

Given this, let's take a look at your proposed pivot schemes. For (a), if the array is random, picking the first element as the pivot is essentially the same as picking a random pivot at each step, and so by the above analysis you'll get O(n lg n) runtime on expectation. For (b), if you know that the array is mostly sorted, then picking the median is a good strategy. The reason is that if we can say that each element is "pretty close" to where it should be in the sorted sequence, then you can make an argument that every pivot you choose is a good pivot, giving you the O(n lg n) runtime you want. (The term "pretty close" isn't very mathematically precise, but I think you could formalize this without too much difficulty if you wanted to).

As for (c) and (d), of the two, (d) is the only one guaranteed to get O(n lg n) on expectation. If you deterministically pick certain elements to use as pivots, your algorithm will be vulnerable to deterministic sequences that can degenerate it to O(n2) behavior. There's actually a really interesting paper on this called "A Killer Adversary for Quicksort" by McIlroy that describes how you can take any deterministic quicksort and construct a pathologically worst-case input for it by using a malicious comparison function. You almost certainly want to avoid this in any real quicksort implementation, since otherwise malicious users could launch DoS attacks on your code by feeding in these killer sequences to force your program to sort in quadratic time and thus hang. On the other hand, because (d) is picking its sample points randomly, it is not vulnerable to this attack, because on any sequence the choice of pivots is random.

Interestingly, though, for (d), while it doesn't hurt to pick three random elements and take the median, you don't need to do this. The earlier proof is enough to show that you'll get O(n lg n) on expectation with a single random pivot choice. I actually don't know if picking the median of three random values will improve the performance of the quicksort algorithm, though since quicksort is always Ω(n lg n) it certainly won't be asymptotically any better than just picking random elements as the pivots.

I hope that this helps out a bit - I really love the quicksort algorithm and all the design decisions involved in building a good quicksort implementation. :-)


You have to understand that there are already many algorithms that will allow you to sustain a O(nlog(n)) complexity. Using randomized quick sort has expected time complexity of O(nlog(n)), and is usually considered better than other approaches.

You would be able to sustain O(nlog(n)) if you would a mix of all above, i.e. conditionally apply one of them based on the "profile" of your input data set. That being said, categorising an input data set in itself a challenge. In any case, to do any better, you have to research on your input data set and choose on the possible alternatives.


The best pivot is the one which can divide the array exactly in two halves. The median of the array is off course the best choice. I will suggest this approach :-
select some random indexes
calculate median of these elements
Use this as pivot element

From the O(n) median finding algorithm, I think 5 random indexes should be sufficient enough.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜