Computing standard deviation over a circular buffer
I need to calculate the standard deviation of values are stored in a circular buffer. The final algorithm will run on a resource-constrained device, thus I want it to be as lightweight as possible. The naive approach would be to re-evaluate the standard deviation over the whole buffer each time that a new value is pushed in, but it would be really slow. Ideally, I'd like an algorithm that dynamically updates the cur开发者_StackOverflow中文版rent value of standard deviation as new values are pushed in.
Wikipedia reports some techniques for rapid calculation, but they can be used on streams: in my case when a new value is pushed in, the standard deviation should be calculated as if the last value that has been popped never existed.
tl;dr: how can I calculate standard deviation over a circular buffer with minimal computational effort?
Standard deviation can be expressed as:
stddev = sqrt(1/N * SUM(x[i]^2) - 1/N^2 * SUM(x[i])^2)
For the uncorrected sample standard deviation. For the corrected sample standard deviation one can write:
stddev = sqrt(1/(N-1) * SUM(x[i]^2) - 1/(N^2-N) * SUM(x[i])^2)
If you maintain two accumulators, one for SUM(x[i]^2)
, and one for SUM(x[i])
, then these are trivial to update for each new value (subtract the effect of the oldest value, and add in the effect of the newest value). N
will of course be the length of your buffer.
Of course, if you are doing this in floating-point, then you will probably run into roundoff errors ((x + y) - x != y
, in general). I don't think there's any easy fix for that.
Keep three numbers for the set: The count of values, the sum of the values, and the sum of the squares of the values. Let's call these k, sn, and sn2 for convenience.
If, as I think you're implying, a new value always replaces an old value in the circular queue, then maybe the count is constant. Or maybe it's possible to have a less than full queue. Either way:
Every time you add a value to the queue, add one to the count, add this value to the sum, and add the square of this value to the square-sum. That is, if you add a new value "n", then k=k+1, sn=sn+n, and sn2=sn2+n^2.
Every time you remove a value from the queue, subtract one from the count, subtract this value from the sum, and subtract the square of this value from the square-sum. That is, if you remove a value "n", then k=k-1, sn=sn-n, and sn2=sn2-n^2.
You can then easily recalculate the standard deviation after every change without having to retotal everything.
Note this means that you do have to be able to "catch" a value before it is truly deleted.
Note: I suspect this is how my pocket calculator does it, because it has functions to dump sum(n) and sum(n^2), and I can "delete" a value that I never added, like I can say add 2 and 4 to the set, and then delete 3, and it say's okay fine. So I don't think it's keeping a list: it must just be keeping the count and the sums.
The simplest thing to do is to invert Knuth's formulas (from the Wikipedia article on variance calculation:
Mk-1 = Mk - (xk - Mk)/(k-1)
Sk-1 = Sk - (xk - Mk-1)(xk - Mk)
However, note that floating point errors will accumulate over the entire run! This means that your mean and variance numbers will tend to drift, and this method can't really be used as an accurate online method.
A stable online method which takes O(log N) operations per sample (where N is the number of elements in your queue) would be to use a binary merge tree of statistical items, using the formula for "parallel merge" from the Wikipedia article, as follows (in C++ form):
struct Statistic {
int k;
Element M;
Element S;
Statistic(Element x)
: k(1)
, M(x)
, S(0)
{}
Statistic(Statistic a, Statistic b)
: k(a.k + b.k)
, M(a.M*a.k + b.M*b.k)/float(k)
, S(a.S + b.S + (a.M-b.M)*(a.M-b.M)*(a.k*b.k/float(k)))
{}
};
For a stable O(log N) online algorithm, maintain a balanced binary tree of the above statistics, with the leaves representing individual elements; the root will yield the desired online statistics. As you update the elements (in rotating buffer style), it will take O(log N) operations to propagate each update from the leaf to the root.
精彩评论