On-line/Streaming algorithm for the selection of the top X percent of a time series

I am working on a time series of numeric values, such as those produced by a temperature sensor. I'd like to filter those values, roughly selecting those samples that form e.g. the top 10% of the received values.

The obvious solution of recording all samples and using any well-known algorithm for the extraction of the k-highest values is not possible in my case for two reasons:

  • The series may be infinite, memory is definitely not.

  • I'd like this algorithm to be usable in real-time, or at least in a streaming mode with predetermined latency.

The distribution of the values is not normal, nor is it consistent with any well-known distribution that I know of. Metrics that I already have available at any time include the mean, the variance and the skewness of the values that have already been received.
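For context, running metrics like the mean and variance mentioned above are typically maintained with Welford-style single-pass updates; a minimal sketch (class and method names are illustrative, and the third-moment accumulator needed for skewness is omitted):

```python
class RunningStats:
    """Welford-style online mean/variance: one pass, O(1) memory.
    Skewness would need an additional third-central-moment accumulator."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # population variance of the samples seen so far
        return self.m2 / self.n if self.n else 0.0
```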

Unlike this question, I do not need perfect accuracy, although I would like to be able to tune the parameters of the selection algorithm.

I believe something similar is used in single-pass variable bit-rate (VBR) media codecs to allocate the available bandwidth to each frame, by determining the number of available bits. Unfortunately all the VBR algorithms I studied are too focused on DSP and media streams for me to understand and/or implement.

Are there any known algorithms that could help me deal with this issue? Any hints that would orient me towards the right direction would be greatly appreciated.


If you decide you are only interested in the last 10N items you can use two heaps, one of size N and one of size 9N, to keep track of the N highest items in the last 10N. When you see a new item first remove the oldest item. If it came from the small heap, take the largest item from the large heap and put it in the small. Now look at the new item and either put it straight in the large heap or take the smallest item from the small heap and transfer it to the large one before putting the new item in the small heap.

At any time you have a small heap full of high items and a large heap full of low items, and you know whether the latest item was in the top 10% of those 10N.
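The bookkeeping above can be sketched as follows. For simplicity this version keeps the window in a single sorted list with binary search instead of two heaps; the behaviour is the same (is the newest sample in the top fraction of the last `window` samples?), only the asymptotic cost per update differs. All names here are illustrative:

```python
import bisect
from collections import deque

class TopPercentWindow:
    """Report whether each new sample lies in the top `fraction`
    of the most recent `window` samples."""

    def __init__(self, window=100, fraction=0.10):
        self.window = window
        self.fraction = fraction
        self.samples = deque()  # samples in arrival order
        self.sorted = []        # same samples, kept sorted ascending

    def add(self, value):
        """Insert `value`; return True if it is in the top fraction."""
        if len(self.samples) == self.window:
            # evict the oldest sample from both structures
            oldest = self.samples.popleft()
            del self.sorted[bisect.bisect_left(self.sorted, oldest)]
        self.samples.append(value)
        bisect.insort(self.sorted, value)
        # cutoff = number of items forming the top fraction
        cutoff = max(1, int(len(self.sorted) * self.fraction))
        return value >= self.sorted[-cutoff]
```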

But is this really what you want? Note that if your samples rise steadily and then fall steadily over a period of time much larger than your 10N samples then nearly half the time the latest item will be in the top 10% - in fact it will be the largest item seen in the memory of 10N items.

There have been academic papers on finding approximate quantiles of streaming data. One such is "Effective Computation of Biased Quantiles over Data Streams", by Cormode, Korn, Muthukrishnan, and Srivastava.


You cannot get high accuracy without storing the whole stream and knowing something about the distribution.

Imagine a stream where the values arrive in descending order. The 10% head of the stream should be your answer, but since you have no idea how long the stream is, at any point everything you have processed so far could still turn out to be within the top 10% of the whole stream, so you must keep all of it.

If you have to store the whole stream, you're no longer looking at an online algorithm.

Now if you know something about the distribution, then perhaps you can do better. A naive algorithm would be to chunk the stream into slices and calculate the top 10% of each slice.
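The chunked approach could be sketched like this (function and parameter names are illustrative); latency is bounded by the chunk size, while accuracy depends on how stationary the stream is:

```python
import heapq

def top_fraction_by_chunk(stream, chunk_size=1000, fraction=0.10):
    """Buffer `chunk_size` samples at a time and emit the top
    `fraction` of each chunk, largest first."""
    chunk = []
    for value in stream:
        chunk.append(value)
        if len(chunk) == chunk_size:
            k = max(1, int(chunk_size * fraction))
            yield from heapq.nlargest(k, chunk)
            chunk.clear()
    if chunk:  # flush a final partial chunk
        k = max(1, int(len(chunk) * fraction))
        yield from heapq.nlargest(k, chunk)
```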
