
How to keep a random subset of a stream of data?

I have a stream of events flowing through my servers. It is not feasible for me to store all of them, but I would like to periodically be able to process some of them in aggregate. So, I want to keep a subset of the stream that is a random sampling of everything I've seen, but is capped to a max size.

So, for each new item, I need an algorithm to decide if I should add it to the stored set, or if I should discard it. If I add it, and I'm already at my limit, I need an algorithm to evict one of the old items.

Obviously, this is easy as long as I'm below my limit (just save everything). But how can I maintain a good random sampling without being biased towards old items or new items once I'm past that limit?

Thanks,


This is a common interview question.

One easy way to do it is to keep the nth element with probability k/n (or 1, whichever is smaller), where k is your maximum sample size. If you keep it and you're already at the limit, evict a uniformly random stored element to make room.

This gives you a uniformly random subset of the n elements. If you don't know n, you can estimate it and get an approximately uniform subset.
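As a rough illustration (not from the answer above), here is what that rule might look like in Ruby; stream, k, and reservoir are placeholder names:

reservoir = []   # the stored subset, capped at k elements
n = 0            # number of events seen so far

stream.each do |event|
  n += 1
  if reservoir.length < k
    reservoir << event            # below the cap: keep everything
  elsif rand(n) < k               # keep the nth element with probability k/n
    reservoir[rand(k)] = event    # evict a uniformly random stored element
  end
end

At any point, each of the n events seen so far is in the sample with the same probability k/n.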


This is called reservoir sampling. Source: http://en.wikipedia.org/wiki/Reservoir_sampling

array R[k];    // result: will hold a uniform sample of k elements
integer i, j;  // S below is the input sequence, indexed from 1, with length(S) >= k

// fill the reservoir array
for each i in 1 to k do
    R[i] := S[i]
done;

// replace elements with gradually decreasing probability
for each i in k+1 to length(S) do
    j := random(1, i);   // important: inclusive range
    if j <= k then
        R[j] := S[i]
    fi
done

A decent explanation/proof: http://propersubset.com/2010/04/choosing-random-elements.html


While this paper isn't precisely what you're looking for, it may be a good starting point in your search.


Store samples in a first-in, first-out (FIFO) queue.

Set a sampling rate of so many events between samples, or randomize it a bit, depending on your pattern of events.

Save every nth event, or whenever your rate tells you to, and append it to the end of the queue.

Pop one off the front if the size gets too big.
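A rough sketch of this approach in Ruby (names like stream, rate, and max_size are placeholders, not part of the answer):

queue = []
stream.each_with_index do |event, i|
  next unless (i % rate).zero?            # save every rate-th event
  queue << event                          # append to the end of the queue
  queue.shift if queue.length > max_size  # pop the oldest off the front if too big
end

Note that because old items are popped off the front, this keeps a recent window of sampled events rather than a uniform sample of everything ever seen.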


This assumes you don't know the total number of events that will be received and that you don't need a minimum number of elements in the subset.

arr = new array[MAX_SIZE]   // stores the sampled events; assuming first index is 1
counter = 1                 // counts events seen so far

while( receiving event ){
  random = random(1, counter)          // random integer between 1 and counter, inclusive
  if( counter == random ){
     if( counter <= MAX_SIZE ){
        arr[counter] = event
     }
     else{
        tmpRandom = random(1, MAX_SIZE) // pick a stored slot to overwrite
        arr[tmpRandom] = event
     }
  }
  counter += 1
}


Assign a probability of recording each event and store the events in an indexable data structure. When the structure reaches its size threshold, remove a random element before adding the new one. In Ruby, you could do this:

@storage = []
prob = 0.002

while ( message = getnextMessage ) do
    # evict a randomly chosen stored element once the cap is reached
    @storage.delete_at(rand(@storage.length)) if @storage.length > MAX_LEN
    @storage << message if rand() < prob
end

This addresses your max size AND your non-bias toward when the event occurred. You could also choose which element gets deleted by partitioning your stored elements into buckets and then removing an element from any bucket that has more than one element. The bucket method allows you to keep one from each hour, for example.
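A hedged sketch of that bucket idea, assuming each stored message exposes a timestamp accessor returning a Time (that accessor and the method name are assumptions for illustration):

# Evict a random message from a random hour-bucket holding more than one
# message, so at least one message per hour survives.
def evict_from_crowded_bucket(storage)
  buckets = storage.group_by { |m| m.timestamp.strftime('%Y-%m-%d %H') }
  crowded = buckets.values.select { |b| b.length > 1 }
  return if crowded.empty?
  victim = crowded.sample.sample     # random crowded bucket, then random member
  storage.delete_at(storage.index(victim))
end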

You should also know that sampling theory is Big Math. If you need more than a layman's idea about this you should consult a qualified mathematician in your area.
