Average number of intervals from an input in 0..N
The question sprang up when examining the "Find the K missing numbers in this set supposed to cover [0..N]" questio开发者_开发百科n.
The author of the question asked for CS answers instead of equation-based answers, and his proposal was to sort the input and then iterate over it to list the K missing numbers.
While this seems fine to me, it also seems wasteful. Let's take an example:
- N = 200
- K = 2 (we will consider K << N)
- missing elements: 53, 75
The "sorted" set can be represented as: [0, 52] U [54, 74] U [76, 200]
, which is way more compact than enumerating all values of the set (and allows to retrieve the missing numbers in O(K) operations, to be compared with O(N) if the set is sorted).
However this is the final result, but during the construction the list of intervals might be much larger, as we feed the elements one at a time....
Let us, therefore, introduce another variable: let I
be the number of elements of the set that we fed to the structure so far. Then, we may at worst have: min((N-K)/2, I)
intervals (I think...)
From which we deduce that the number of intervals reached during the construction is the maximum encountered for I in [0..N], the worst case being (N-K)/2
thus O(N)
.
I have however a gut feeling that if the input is random, instead of being specially crafted, we might get a much lower bound... and thus the always so tricky question:
How much intervals... in average ?
Your approach vs. the proposed one with sorting seems to be a classical trade-off of which operations is cheap and which one is expensive.
I find your notation a bit confusing, so please allow me to use my own:
Let S
be the set. Let n
be the number of items in the set: n = |S|
. Let max
be the biggest number in the set: max = max(S)
. Let k
be the number of elements not in the set: k = |{0,...,max} \ S|
.
For the sorting solution, we could very cheaply insert all n
elements into S
using hashing. That would take expected O(n)
. Then for finding the k
missing elements, we sort the set in O(nlogn)
, and then determine the missing elements in O(n)
.
That is, the overall cost for adding n
elements and then finding the missing k
elements takes expected O(n) + O(nlogn) + O(n) = O(nlogn)
.
You suggest a different approach in which we represent the set as a list of dense subsets of S
. How would you implement such a data structure? I suggest a sorted tree (instead of a list) so that an insert becomes efficient. Because what do you have to do for an insert of a new element e
? I think you have to:
- Find the potential candidate subset(s) in the tree where
e
could be added - If a subset already contains
e
, nothing has to be done. - If a subset contains
e+1
and another subset containse-1
, merge the subsets together and adde
to the result - If a subset already contains
e+1
, bute-1
is not contained inS
, adde
to that subset - If a subset already contains
e-1
, bute+1
is not contained inS
, adde
to that subset - Otherwise, create a new subset holding only the element
e
and insert it into the tree.
We can expect that finding the subsets needed for the above operations takes O(logn)
. The operations of 4. or 5. take constant time if we represent the subsets as pairs of integers (we just have to decrement the lower or increment the upper boundary). 3. and 6. potentially require changing the tree structure, but we expect that to take at most O(logn)
, so the whole "insert" will not take more than O(logn)
.
Now with such a datastructure in place, we can easily determine the k
missing numbers by traversing the tree in order and collecting the numbers not in any of the subsets. The costs are linear in the number of nodes in the tree, which is <= n/2
, so the total costs are O(n)
for that.
However, if we consider again the complete sequence operations, we get for n
inserts O(nlogn)
+ O(n)
for finding the k missing numbers, so the overall costs are again O(nlogn)
.
This is not better than the expected costs of the first algorithm.
A third solution is to use a boolean array
to represent the set and a single integer max
for the biggest element in the set.
If an element e
is added to the Set, you set array[e] = true
. You can implement the variable size of the set using table expansion, so the costs for inserting an element into the array is constant.
To retrieve the missing elements, you just collect those elements f
where array[f] == false
. This will take O(max)
.
The overall costs for inserting n elements and finding the k missing ones is thus: O(n) + O(max)
. However, max = n + k
, and so we get as the overall costs O(n + k)
.
A fourth method which is a cross-over between the third one and the one using hashing is the following one, which also uses hashing, but doesn't require sorting.
Store your set S
in a hash set, and also store the maximum element in S
in a variable max
. To find the k
missing ones, first generate a result set R containing all numbers {0,...,max}
. Then iterate over S
and delete every element in S
from R
.
The costs for that are also O(n + k)
.
精彩评论