Does quicksort with randomized median-of-three do appreciably better than randomized quicksort?
I was just answering a question about different approaches for picking the partition in a quicksort implementation and came up with a question that I honestly don't know how to answer. It's a bit math-heavy, and this may be the wrong site on which to ask this, so if this needs to move please let me know and I'll gladly migrate it elsewhere.
It's well-known that a quicksort implementation that picks its pivots uniformly at random will end up running in expected O(n lg n) time (there's a nice proof of this on Wikipedia). However, due to the cost of generating random numbers, many quicksort implementations don't pick pivots randomly, but instead rely on a "median-of-three" approach in which three elements are chosen deterministically and of which the median is chosen as the pivot. This is known to degenerate to O(n2) in the worst-case (see this great paper on how to generate those worst-case inputs, for example).
Now, suppose that we combine these two approaches by picking three random elements from the sequence and using their median as the choice of pivot. I know that this also guarantees O(n lg n)开发者_StackOverflow社区 average-case runtime using a slightly different proof than the one for the regular randomized quicksort. However, I have no idea what the constant factor in front of the n lg n term is in this particular quicksort implementation. For regular randomized quicksort Wikipedia lists the actual runtime of randomized quicksort as requiring at most 1.39 n lg n comparisons (using lg as the binary logarithm).
My question is this: does anyone know of a way to derive the constant factor for the number of comparisons made using a "median-of-three" randomized quicksort? If we go even more generally, is there an expression for the constant factor on quicksort using a randomized median-of-k approach? I'm curious because I think it would be fascinating to see if there is some "sweet spot" of this approach that makes fewer comparisons than other randomized quicksort implementations. I mean, wouldn't it be cool to be able to say that randomized quicksort with a randomized median-of-six pivot choice makes the fewest comparisons? Or be able to conclusively say that you should just pick a pivot element at random?
Here's a heuristic derivation of the constant. I think it can be made rigorous, with a lot more effort.
Let P be a continuous random variable with values in [0, 1]. Intuitively, P is the fraction of values less than the pivot. We're looking to find the constant c such that
c n lg n = E[n + c P n lg (P n) + c (1 - P) n lg ((1 - P) n)].
A little bit of algebra later, we have
c = 1/E[-P lg P - (1 - P) lg (1 - P))].
In other words, c is the reciprocal of the expected entropy of the Bernoulli distribution with mean P. Intuitively, for each element, we need to compare it to pivots in a way that yields about lg n bits of information.
When P is uniform, the pdf of P is 1. The constant is
In[1]:= -1/NIntegrate[x Log[2, x] + (1 - x) Log[2, 1 - x], {x, 0, 1}]
Out[1]= 1.38629
When the pivot is a median of 3, the pdf of P is 6 x (1 - x). The constant is
In[2]:= -1/NIntegrate[6 x (1 - x) (x Log[2, x] + (1 - x) Log[2, 1 - x]), {x, 0, 1}]
Out[2]= 1.18825
The constant for the usual randomized quicksort is easy to compute because the probability that two elements k locations apart are compared is exactly 2/(k+1): the probability that one of the those two elements is chosen as a pivot before any of the k-1 elements between them. Unfortunately nothing so clever applies to your algorithm.
I'm hesitant to attempt your bolded question because I can answer your "underlying" question: asymptotically speaking, there is no "sweet spot". The total added cost of computing medians of k elements, even O(n1 - ε) elements, is linear, and the constant for the n log n term decreases with the array being split more evenly. The catch is of course constants on the linear term that are spectacularly impractical, highlighting one of the drawbacks of asymptotic analysis.
Based on my comments below, I guess k = O(nα) for 0 < α < 1 is the "sweet spot".
If the initial state of the set is randomly ordered, you will get the exact same constant factor for randomly picking three items to calculate the median as when picking three items deterministically.
The motive for picking item by random would be that the deterministic method would give a result that is worse than the average. If the deterministic method gives a good median, you can't improve on it by picking items by random.
So, which ever method gives the best result depends on the input data, it can't be determined for every possible set.
The only sure way to lower the constant factor is to increase the number of items that you use to calcuate the median, but at some point calculating the median will be more expensive than what you gain by getting a better median value.
Yes, it does. Bentley and McIlroy, authors of the C
standard library's qsort
function, wrote in their paper, Engineering a Sort Function the following numbers:
- 1.386 n lg n average comparisons using first, middle or a randomized pivot
- 1.188 n lg n average comparisons using a median of 3 pivot
- 1.094 n lg n average comparisons using a median of 3 medians pivot
According to the above paper:
Our final code therefore chooses the middle element of smaller arrays, the median of the first, middle and last elements of a mid-sized array, and the pseudo-median of nine evenly spaced elements of a large array.
Just a thought: If you use the median-of-three approach, and you find it to be better, why not use a median-of-five or median-of-eleven approach? And while you are on it, maybe one can think of a median-of-n optimization... hmmm... Ok, that is obviously a bad idea (since you would have to sort your sequence for that...).
Basically, to choose your pivot element as the median-of-m elements, you sort those m elements, right? Therefore, I'd simply guess, that one of the constants you are looking for is "2": By first sorting 3 elements to choose your pivot you execute how many additional comparisons? Lets say its 2. You do this inside the quicksort over and over again. A basic conclusion would be that the median-of-3 is therefore 2-times slower then the simple random quicksort.
But what is working for you here? That you get a better device-and-conquer-distribution, and you are better protected against the degenerated case (a bit).
So, back to my infamous question at the beginning: Why not choose the pivot element from a median-of-m, m being 5, 7, n/3, or so. There must be a sweet spot where the sorting of the m elements is worse then the gain from the better divide-and-conquer behavior and quicksort. I guess, this sweet-spot is there very early -- you have to fight first against the constant factor of 2 comparisons if you choose median-of-3. It is worth an experiment, I admit, but I would not be too expectant of the result :-) But if I am wrong, and the gain is huge: don't stop at 3!
精彩评论