How can I equalize a probability distribution in a realtime stream?
I am getting a (very fast) realtime stream of data points between 0 and 1 and need to sort them into "buckets".
Assume a 0.6 arrives and my buckets each cover an interval of width 0.25. Then 0.6 goes into the third bucket. However, when a lot of numbers around 0.6 come in, they will all end up in bucket three, which is bad.
I want to change the intervals covered by the four buckets such that each bucket has an equal probability of being hit. For example, it might be better to make the first bucket cover 0-0.5, the second 0.5-0.6, the third 0.6-0.65, and the last 0.65-1.
The problem is, I cannot store the values - only which buckets are hit how often. So is there a working update formula for this?
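For reference, looking up which bucket a value falls into when the boundaries are unequal is a one-liner with `bisect`; this sketch uses the example edges 0.5, 0.6, 0.65 (the adaptation of those edges is the actual open question):

```python
import bisect

# Interior edges of the four buckets from the example above.
boundaries = [0.5, 0.6, 0.65]
counts = [0, 0, 0, 0]   # hits per bucket; this is all we may store

def bucket_of(x):
    # bisect_right returns the index of the first edge greater than x,
    # which is exactly the index of the bucket x falls into.
    return bisect.bisect_right(boundaries, x)

for x in (0.1, 0.55, 0.6, 0.62, 0.9):
    counts[bucket_of(x)] += 1
```

After this loop, 0.6 and 0.62 land together in the third bucket (index 2), so `counts` is `[1, 1, 2, 1]`.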
Thank you very much in advance!
If you are not storing the values then at one point in time all you have is
how many values fall between x and y.
To keep an equal probability between your bins/buckets, the actual values are critical. Let's start with this realtime stream and assume it initially received 0.12, 0.37, 0.62, and 0.87. The bins will have 1 in each.
For the values 0.24, 0.49, 0.74, 0.99, again each bin gets 1.
For the values 0.01, 0.26, 0.51, 0.76, again each bin gets 1.
You will end up with 3 in each bin. Now suppose 0.6 comes in about 6 times, putting the third bin at 9 while the rest stay at 3. Say you now have to update your boundaries. If you move the bin boundaries at this point, the probability for each bin will no longer be correct.
You cannot move bins on the basis of counts, or even the average of what each bin holds. My example values can come in any order, so it's even impossible to move bins after the first value arrives without knowledge of all previous values.
It would be interesting to hear how others think about this.
I think that you want to keep a sort of tree of buckets.
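One way to sketch that idea (my own minimal interpretation, not a standard algorithm): each bucket splits in half once its count exceeds a threshold, so hot regions like 0.6 get progressively narrower leaves. The split threshold and the even division of the stored count between children are arbitrary assumptions, since the actual values are gone:

```python
class BucketNode:
    """A bucket over [lo, hi) that splits in half once its count
    exceeds a threshold -- a sketch of an adaptive bucket tree.
    Only counts are stored, never the values themselves."""
    SPLIT_AT = 8  # arbitrary threshold

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.count = 0
        self.left = self.right = None

    def add(self, x):
        if self.left:                       # internal node: recurse
            child = self.left if x < self.left.hi else self.right
            child.add(x)
        else:
            self.count += 1
            if self.count > self.SPLIT_AT:
                mid = (self.lo + self.hi) / 2
                self.left = BucketNode(self.lo, mid)
                self.right = BucketNode(mid, self.hi)
                # Divide the stored count evenly between the children --
                # an approximation, since we no longer know the values.
                self.left.count = self.count // 2
                self.right.count = self.count - self.count // 2
                self.count = 0

    def leaves(self):
        """Flat list of (lo, hi, count) leaf buckets, left to right."""
        if self.left:
            return self.left.leaves() + self.right.leaves()
        return [(self.lo, self.hi, self.count)]
```

Feeding many values near 0.6 into `BucketNode(0.0, 1.0)` leaves wide leaves over the quiet regions and narrow leaves around 0.6; adjacent leaves can then be merged into a handful of roughly equal-probability buckets.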
This sounds like a job for Huffman coding or a variant of Arithmetic coding.
I know there is a streaming variant of Huffman coding; I am not sure about arithmetic coding. But it seems like at the very least you could periodically define a new model and copy the old counts into it.
Since you don't have the old values you might have to guess at the bounds instead of computing them. You could, for example, define 100 new buckets under 0.6 and then later collapse the unused buckets.