开发者

Need some help calculating percentile

An rpc server is given which receives millions of requests a day. Each request i takes processing time Ti to get processed. We want to find the 65th percentile processing time (when processing times are sorted accor开发者_JAVA百科ding to their values in increasing order) at any moment. We cannot store processing times of all the requests of the past as the number of requests is very large. And so the answer need not be exact 65th percentile, you can give some approximate answer i.e. processing time which will be around the exact 65th percentile number.

Hint: Its something to do how a histogram (i.e. an overview) is stored for a very large data without storing all of data.


Take one day's data. Use it to figure out what size to make your buckets (say one day's data shows that the vast majority (95%?) of your data is within 0.5 seconds of 1 second (ridiculous values, but hang in)

To get 65th percentile, you'll want at least 20 buckets in that range, but be generous, and make it 80. So you divide your 1 second window (-0.5 seconds to +0.5 seconds) into 80 buckets by making each 1/80th of a second wide.

Each bucket is 1/80th of 1 second. Make bucket 0 be (center - deviation) = (1 - 0.5) = 0.5 to itself + 1/80th of a second. Bucket 1 is 0.5+1/80th - 0.5 + 2/80ths. Etc.

For every value, find out which bucket it falls in, and increment a counter for that bucket.

To find 65th percentile, get the total count, and walk the buckets from zero until you get to 65% of that total.

Whenever you want to reset, set the counters all to zero.

If you always want to have good data available, keep two of these, and alternate resetting them, using the one you reset least recently as having more useful data.


Use an updown filter:

if q < x:
    q += .01 * (x - q)  # up a little
else:
    q += .005 * (x - q)  # down a little

Here a quantile estimator q tracks the x stream, moving a little towards each x. If both factors were .01, it would move up as often as down, tracking the 50 th percentile. With .01 up, .005 down, it floats up, 67 th percentile; in general, it tracks the up / (up + down) th percentile. Bigger up/down factors track faster but noisier -- you'll have to experiment on your real data.

(I have no idea how to analyze updowns, would appreciate a link.)

The updown() below works on long vectors X, Q in order to plot them:

Need some help calculating percentile

#!/usr/bin/env python
from __future__ import division
import sys
import numpy as np
import pylab as pl

def updown( X, Q, up=.01, down=.01 ):
    """ updown filter: running ~ up / (up + down) th percentile
        here vecs X in, Q out to plot
    """
    q = X[0]
    for j, x in np.ndenumerate(X):
        if q < x:
            q += up * (x - q)  # up a little
        else:
            q += down * (x - q)  # down a little
        Q[j] = q
    return q

#...............................................................................
if __name__ == "__main__":

    N = 1000
    up = .01
    down = .005
    plot = 0
    seed = 1
    exec "\n".join( sys.argv[1:] )  # python this.py N= up= down=
    np.random.seed(seed)
    np.set_printoptions( 2, threshold=100, suppress=True )  # .2f

    title = "updown random.exponential: N %d  up %.2g  down %.2g" % (N, up, down)
    print title
    X = np.random.exponential( size=N )
    Q = np.zeros(N)
    updown( X, Q, up=up, down=down )
        # M = np.zeros(N)
        # updown( X, M, up=up, down=up )
    print "last 10 Q:", Q[-10:]
    if plot:
        fig = pl.figure( figsize=(8,3) )
        pl.title(title)
        x = np.arange(N)
        pl.plot( x, X, "," )
        pl.plot( x, Q )
        pl.ylim( 0, 2 )
        png = "updown.png"
        print >>sys.stderr, "writing", png
        pl.savefig( png )
        pl.show()


An easier way to get the value that represents a given percentile of a list or array is the scoreatpercentile function in the scipy.stats module.

>>>import scipy.stats as ss
>>>ss.scoreatpercentile(v,65)

there's a sibling percentileofscore to return the percentile given the value


you will need to store a running sum and a total count.

then check out standard deviation calculations.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜