开发者

calculate mean and variance with one iteration

I have an iterator of numbers, for example a file object:

f = open("datafile.dat")

now I want to compute:

mean = get_mean(f)
sigma = get_sigma(f, mean)

What is the best imp开发者_如何学JAVAlementation? Suppose that the file is big and I would like to avoid to read it twice.


If you want to iterate once, you can write your sum function:

def mysum(l):
    s2 = 0
    s = 0
    for e in l:
        s += e
        s2 += e * e
    return (s, s2)

and use the result in your sigma function.

Edit: now you can calculate the variance like this: (s2 - (s*s) / N) / N

By taking account of @Adam Bowen's comment,
keep in mind that if we use mathematical tricks and transform the original formulas
we may degrade the results.


I think Nick D has the correct answer.

Assuming you want to compute both mean and variance in one sweep of the file (and you don't really want two functions that have to be called one after the other), you can collect the sum of the values and of their squares and them use such sums (toghether with the number of read elements) to compute at the same time mean and variance.

There are some numerical stability issues, but the idea in

http://en.wikipedia.org/wiki/Computational_formula_for_the_variance

is the basic ingredient you need. Some more details are at

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

where I suggest you to read the "Naïve algorithm".

Hope this helps,

Massimo


You can compute both in one pass. See:

http://www.johndcook.com/standard_deviation.html


Make a list from the iterable, or use itertools.tee().


I am not sure there is much choice.

You will have to iterate your numbers twice in any case as the standard deviation will require the mean information on each value.

If you have enough memory, you can gain on the I/O access by loading your file in memory during the first iteration but that is about it IMO.


As I feel that there are good elements scattered in multiple answers, I would like to summarize:

  • If your file is too big to conveniently fit in memory, and if you want a good precision in the variance, you do need to read the file twice (with one pass, the variance is the difference between two large numbers, which is not precise because of floating point limitations). Note that your operating system is likely to provide some automatic speed-up for the second file reading, as it may still be in RAM during the second pass.

  • If you do not care for the precision of the variance, you can simply iterate once over the file and calculate the quantities suggested by Nick D, with the details provided in the comment by Adam Bowen.


You have two solutions

  1. Make a list out of your iterator and loop it as many time as you wish. Drawback is everything will be in memory, so not suitable if your file is big. Simple use of itertools.tee also will not save you

  2. There is no other solution , unless , you do not need to pass output of get_mean to get_sigma, because in that case they can only be in series, but if you remove this restriction then you can run both functions in parallel using threads, and use itertools.tee to have two iterators from one


You can use map reduce in an elegant fashion way

sample is the list you want to get its variance

sample = [a,b,c, ...]

mean = float(reduce(lambda x,y : x+y, sample)) / len(sample)

variance = reduce(lambda x,y: x+y, map(lambda xi: (xi-mean)**2, sample))/ len(sample)

In a succinct line of code:

variance = reduce(lambda x,y: x+y, map(lambda xi: (xi-(float(reduce(lambda x,y : x+y, sample)) / len(sample)))**2, sample))/ len(sample)
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜