Fast algorithm for repeated calculation of percentile?
In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:
- Get value x
- Insert x in an already sorted array at the back
- Swap x down until the array is sorted
- Read the element at position array[array.size * 3/4]
Point 3 is O(n), and the rest is O(1), but this is still quite slow, especially if the array gets larger. Is there any way to optimize this?
UPDATE
Thanks Nikita! Since I am using C++ this is the solution easiest to implement. Here is the code:
#include <algorithm>
#include <functional>
#include <vector>

template<class T>
class IterativePercentile {
public:
    /// Percentile has to be in range [0, 1)
    IterativePercentile(double percentile)
        : _percentile(percentile)
    { }

    // Adds a number in O(log(n))
    void add(const T& x) {
        // _lower is a max-heap, so front() is its largest element
        if (_lower.empty() || x <= _lower.front()) {
            _lower.push_back(x);
            std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
        } else {
            _upper.push_back(x);
            std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
        }

        // Target size of the lower heap for the current element count
        unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;
        if (_lower.size() > size_lower) {
            // lower to upper
            std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
            _upper.push_back(_lower.back());
            std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
            _lower.pop_back();
        } else if (_lower.size() < size_lower) {
            // upper to lower
            std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
            _lower.push_back(_upper.back());
            std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
            _upper.pop_back();
        }
    }

    /// Access the percentile in O(1); requires at least one prior add()
    const T& get() const {
        return _lower.front();
    }

    void clear() {
        _lower.clear();
        _upper.clear();
    }

private:
    double _percentile;
    std::vector<T> _lower;  // max-heap holding the smaller values
    std::vector<T> _upper;  // min-heap holding the larger values
};
You can do it with two heaps. Not sure if there's a less 'contrived' solution, but this one provides O(log n) time complexity per insertion, and heaps are included in the standard libraries of most programming languages.
The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the biggest 25%). The first one has its biggest element on top (a max-heap), the second one its smallest (a min-heap).
- Adding an element.
See if the new element x is <= max(A). If it is, add it to heap A, otherwise to heap B.
Now, if we added x to heap A and it became too big (holds more than 75% of the elements), we need to remove the biggest element from A (O(log n)) and add it to heap B (also O(log n)).
Similarly if heap B became too big.
- Finding "0.75 median"
Just take the largest element from A (or smallest from B). Requires O(logn) or O(1) time, depending on heap implementation.
edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) is the rest, then, for every n > 0, array[array.size * 3/4] = min(B).
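As a rough illustration of the sizing rule size(A) = floor(n * 0.75), here is a minimal C++ sketch built on std::priority_queue. The class name Percentile75 and its interface are made up for this example, and it assumes plain double values; treat it as one possible reading of the scheme rather than a reference implementation.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Hypothetical helper class (not from the answer): two heaps with
// size(A) = floor(0.75 * n), so that min(B) equals sorted[n * 3 / 4].
class Percentile75 {
public:
    void add(double x) {
        // Route x by value: anything not above max(A) belongs in A.
        if (!a_.empty() && x <= a_.top()) {
            a_.push(x);
        } else {
            b_.push(x);
        }
        // Restore the size rule; at most one element has to move.
        const std::size_t n = a_.size() + b_.size();
        const std::size_t want_a = n * 3 / 4;  // floor(0.75 * n)
        if (a_.size() > want_a) {
            b_.push(a_.top());
            a_.pop();
        } else if (a_.size() < want_a) {
            a_.push(b_.top());
            b_.pop();
        }
    }

    // Precondition: add() has been called at least once.
    double get() const { return b_.top(); }

private:
    // A: max-heap of the smallest elements.
    std::priority_queue<double> a_;
    // B: min-heap of the largest elements.
    std::priority_queue<double, std::vector<double>, std::greater<double>> b_;
};
```

With this sizing B is never empty once a value has been added, so get() reads min(B) in O(1) and add() costs O(log n).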
A simple Order Statistics Tree is enough for this.
A balanced version of this tree supports O(log n) insert/delete and access by rank. So you not only get the 75th percentile, but also the 66th or 50th or whatever you need, without having to change your code.
If you access the 75th percentile frequently but insert less frequently, you can always cache the 75th percentile element during an insert/delete operation.
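Note that most standard library trees (such as Java's TreeMap or C++'s std::set) are balanced search trees without rank queries, so they are not order statistic trees out of the box; you need a tree that is augmented with subtree sizes.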
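For C++ with GCC, one way to try this without writing the tree yourself is the policy-based data structures extension that ships with libstdc++. The sketch below assumes that extension is available (it is not standard C++), and the wrapper name RankedMultiset is invented for the example. Duplicates are kept apart by pairing each value with an insertion counter.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <ext/pb_ds/assoc_container.hpp>  // GCC/libstdc++ extension
#include <ext/pb_ds/tree_policy.hpp>

// Hypothetical wrapper around the GNU order-statistics tree.
class RankedMultiset {
public:
    // O(log n); the counter makes equal values distinct keys.
    void insert(double x) { tree_.insert(std::make_pair(x, next_id_++)); }

    std::size_t size() const { return tree_.size(); }

    // Value at 0-based rank k in sorted order, O(log n).
    // Precondition: k < size().
    double at_rank(std::size_t k) const { return tree_.find_by_order(k)->first; }

    // Same index as in the question: array[array.size * 3/4].
    // Precondition: size() > 0.
    double percentile75() const { return at_rank(size() * 3 / 4); }

private:
    typedef std::pair<double, std::size_t> Key;
    typedef __gnu_pbds::tree<Key, __gnu_pbds::null_type, std::less<Key>,
                             __gnu_pbds::rb_tree_tag,
                             __gnu_pbds::tree_order_statistics_node_update>
        Tree;
    Tree tree_;
    std::size_t next_id_ = 0;
};
```

Any other percentile is just a different rank passed to at_rank, which is the flexibility this answer is pointing at.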
If you can do with an approximate answer, you can use a histogram instead of keeping all the values in memory.
For each new value, add it to the appropriate bin. Calculate the 75th percentile by traversing the bins and summing counts until 75% of the population size is reached. The percentile value then lies between the low and high bounds of the bin you stopped at.
This gives O(B) complexity per query, where B is the number of bins, which is range_size/bin_size (use a bin_size appropriate to your use case).
I have implemented this logic in a JVM library: https://github.com/IBM/HBPE which you can use as a reference.
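To make the bin traversal concrete, here is a small C++ sketch with fixed-width bins. It is not taken from the linked library; the class name HistogramPercentile, the constructor parameters, and the choice to clamp out-of-range values into the first or last bin are all assumptions made for the example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical fixed-width histogram over [lo, lo + bins * width).
// Precondition: bins >= 1 and width > 0.
class HistogramPercentile {
public:
    HistogramPercentile(double lo, double width, std::size_t bins)
        : lo_(lo), width_(width), counts_(bins, 0), total_(0) {}

    // O(1): bump the count of the bin that x falls into.
    void add(double x) {
        const double offset = (x - lo_) / width_;
        std::size_t bin = 0;  // values below lo are clamped to the first bin
        if (offset >= static_cast<double>(counts_.size())) {
            bin = counts_.size() - 1;  // values above the range go to the last bin
        } else if (offset > 0) {
            bin = static_cast<std::size_t>(offset);
        }
        ++counts_[bin];
        ++total_;
    }

    // O(B): returns the [low, high) bounds of the bin where the cumulative
    // count first reaches p * total. Precondition: 0 < p <= 1, total > 0.
    std::pair<double, double> bounds(double p) const {
        const double target = p * static_cast<double>(total_);
        std::size_t bin = 0;
        std::size_t seen = counts_[0];
        while (static_cast<double>(seen) < target && bin + 1 < counts_.size()) {
            ++bin;
            seen += counts_[bin];
        }
        const double low = lo_ + static_cast<double>(bin) * width_;
        return std::make_pair(low, low + width_);
    }

private:
    double lo_;
    double width_;
    std::vector<std::size_t> counts_;
    std::size_t total_;
};
```

The answer is only as precise as the bin width, so a narrower bin_size trades memory and query time for accuracy.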
You can use binary search to find the correct position in O(log n). However, shifting the array up is still O(n).
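In C++ that could look roughly like the sketch below, which uses std::upper_bound for the search. The helper names insert_sorted and percentile75 are made up for the example, and double values are assumed.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: inserts x while keeping `sorted` in ascending order.
// The binary search is O(log n), but vector::insert still shifts the tail,
// so the call as a whole stays O(n).
inline void insert_sorted(std::vector<double>& sorted, double x) {
    std::vector<double>::iterator pos =
        std::upper_bound(sorted.begin(), sorted.end(), x);
    sorted.insert(pos, x);
}

// Same index as in the question: array[array.size * 3/4].
// Precondition: sorted is not empty.
inline double percentile75(const std::vector<double>& sorted) {
    return sorted[sorted.size() * 3 / 4];
}
```

The shift is a single contiguous memory move, so in practice this can be noticeably faster than swapping the element down one position at a time, even though the asymptotic cost is unchanged.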
If your values are non-negative integers in a known, bounded range, the following will be very fast:
Create a large array of integers (even bytes will work if the counts stay small) with one element for each possible value, from 0 up to the maximum value of your data. For example, if the maximum value of t is 100,000, create an array
int[] index = new int[100001]; // indices 0..100000, about 400 KB
Now iterate over the entire set of values:
for (int t : set_of_values) {
    index[t]++;
}
// You can do a try/catch on ArrayIndexOutOfBoundsException just in case :)
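Now calculate the percentile (the 90th in this example) as
int sum = 0, i = 0;
// Walk the counts until 90% of the values have been covered.
while (sum < 0.9 * set_of_values.length) {
    sum += index[i++];
}
return i - 1; // i was already advanced past the value that crossed the threshold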
You can also consider using a TreeMap instead of an array if the values don't conform to these restrictions.
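The sorted-map variant can be sketched in C++ with std::map, which keeps its keys ordered much like a TreeMap does. The class name SparseCounts is invented for the example. Note that it uses the same "sum counts until the fraction is reached" definition as the loop above, which can differ by one position from array[array.size * 3/4].

```cpp
#include <cstddef>
#include <map>

// Hypothetical helper: value -> occurrence count, kept in key order.
class SparseCounts {
public:
    // O(log k), where k is the number of distinct values seen so far.
    void add(long value) {
        ++counts_[value];
        ++total_;
    }

    // Smallest value v such that at least p * n of the inputs are <= v.
    // O(k). Precondition: 0 < p <= 1 and at least one add().
    long percentile(double p) const {
        const double target = p * static_cast<double>(total_);
        std::size_t seen = 0;
        long last = counts_.begin()->first;
        for (std::map<long, std::size_t>::const_iterator it = counts_.begin();
             it != counts_.end(); ++it) {
            seen += it->second;
            last = it->first;
            if (static_cast<double>(seen) >= target) break;
        }
        return last;
    }

private:
    std::map<long, std::size_t> counts_;
    std::size_t total_ = 0;
};
```

This handles negative or very large values without allocating one slot per possible value, at the cost of a logarithmic insert.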
Here is a JavaScript solution. Copy and paste it into the browser console and it works. $scores contains the list of scores and $percentile is the percentile to compute. For this list, the 75th percentile is 76.8 and the 90th percentile is 87.9.
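function get_percentile($percentile, $array) {
    // Sort a copy numerically; the default sort() compares values as strings.
    var $sorted = $array.slice().sort(function (a, b) { return a - b; });
    var $index = ($percentile / 100) * $sorted.length;
    var $result;
    if (Math.floor($index) === $index) {
        // Index falls exactly between two elements: average them.
        $result = ($sorted[$index - 1] + $sorted[$index]) / 2;
    }
    else {
        $result = $sorted[Math.floor($index)];
    }
    return $result;
}
var $scores = [22.3, 32.4, 12.1, 54.6, 76.8, 87.3, 54.6, 45.5, 87.9];
get_percentile(75, $scores); // 76.8
get_percentile(90, $scores); // 87.9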