calculate mean and variance with one iteration

2022-12-21 03:47 Q&A author:

I have an iterator of numbers, for example a file object:

f = open("datafile.dat")

Now I want to compute:

mean = get_mean(f)
sigma = get_sigma(f, mean)

What is the best implementation? Suppose that the file is big and I would like to avoid reading it twice.

If you want to iterate once, you can write your own sum function:

def mysum(l):
    s2 = 0
    s = 0
    for e in l:
        s += e
        s2 += e * e
    return (s, s2)

and use the result in your sigma function.

Edit: now you can calculate the variance like this, where N is the number of elements: (s2 - (s*s) / N) / N

Taking @Adam Bowen's comment into account,
keep in mind that if we use mathematical tricks to transform the original formulas,
we may degrade the accuracy of the results.
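
Putting the pieces of this answer together, a minimal sketch might look like the following. The name mean_and_variance is made up for illustration, and the function assumes the iterable yields numbers rather than raw lines. It counts N while summing and returns the population variance, using the formula given in the edit above.

```python
def mean_and_variance(numbers):
    """One pass: accumulate count, sum and sum of squares (hypothetical helper)."""
    n = 0
    s = 0.0
    s2 = 0.0
    for e in numbers:
        n += 1
        s += e
        s2 += e * e
    if n == 0:
        raise ValueError("need at least one value")
    mean = s / n
    # Population variance, (s2 - (s*s) / N) / N. This can lose precision
    # when the mean is large compared to the spread of the data.
    variance = (s2 - (s * s) / n) / n
    return mean, variance
```

For a file with one number per line (an assumption about datafile.dat), this could be called as mean_and_variance(float(line) for line in f), and sigma is then variance ** 0.5.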

I think Nick D has the correct answer.

Assuming you want to compute both mean and variance in one sweep of the file (and you don't really want two functions that have to be called one after the other), you can collect the sum of the values and the sum of their squares and then use these sums (together with the number of elements read) to compute the mean and variance at the same time.

There are some numerical stability issues, but the idea in

http://en.wikipedia.org/wiki/Computational_formula_for_the_variance

is the basic ingredient you need. Some more details are at

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance

where I suggest you read the "Naïve algorithm" section.

Hope this helps,

Massimo

You can compute both in one pass. See:

http://www.johndcook.com/standard_deviation.html
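
As far as I recall, the linked page describes Welford's running-variance method. A rough sketch of that technique, not a copy of the code on that page, could look like this. The function name is hypothetical, and it returns the population variance (dividing by n), which is a choice made here for consistency with the other answers.

```python
def running_mean_and_variance(numbers):
    """Welford-style single pass; returns (mean, population variance)."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in numbers:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the already updated mean
    if n == 0:
        raise ValueError("need at least one value")
    return mean, m2 / n
```

Because it updates the mean incrementally instead of subtracting two large sums, this variant tends to be much less sensitive to rounding than the plain sum-of-squares formula.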

Make a list from the iterable, or use itertools.tee().
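
A small sketch of the itertools.tee() variant follows. The bodies of get_mean and get_sigma are assumptions (the question only names them), and the in-memory iterator stands in for the file. Note that when the first copy is consumed completely before the second one starts, tee has to buffer every item, so memory use ends up similar to building a list.

```python
import itertools
import math

def get_mean(numbers):
    # Assumed implementation: plain arithmetic mean of an iterable.
    n = 0
    total = 0.0
    for x in numbers:
        n += 1
        total += x
    return total / n

def get_sigma(numbers, mean):
    # Assumed implementation: population standard deviation.
    n = 0
    total = 0.0
    for x in numbers:
        n += 1
        total += (x - mean) ** 2
    return math.sqrt(total / n)

values = iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # stands in for the file
first, second = itertools.tee(values)
mean = get_mean(first)           # consumes `first`; tee buffers each item for `second`
sigma = get_sigma(second, mean)  # replays the buffered items
```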

I am not sure there is much choice.

You will have to iterate over your numbers twice in any case, as the standard deviation requires the mean for each value.

If you have enough memory, you can gain on the I/O access by loading your file in memory during the first iteration but that is about it IMO.

As I feel that there are good elements scattered in multiple answers, I would like to summarize:

If your file is too big to fit conveniently in memory, and if you want good precision in the variance, you do need to read the file twice (with one pass, the variance is computed as the difference between two large numbers, which is imprecise because of floating-point limitations). Note that your operating system is likely to speed up the second read automatically, as the file may still be cached in RAM during the second pass.
If you do not care about the precision of the variance, you can simply iterate once over the file and calculate the quantities suggested by Nick D, with the details provided in the comment by Adam Bowen.
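
To make the precision issue concrete, here is a small illustrative comparison. The function names and the data are made up for this sketch: the values have a small spread on top of a large offset, which is the case where the sum-of-squares formula suffers most.

```python
def one_pass_variance(values):
    # Sum and sum of squares, as in Nick D's answer.
    n = 0
    s = 0.0
    s2 = 0.0
    for x in values:
        n += 1
        s += x
        s2 += x * x
    return (s2 - (s * s) / n) / n

def two_pass_variance(values):
    values = list(values)  # for a file, this would be a second read instead
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values) / len(values)

# Small spread on top of a large offset; the exact population variance is 22.5.
data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(one_pass_variance(data))  # far from 22.5 in double precision
print(two_pass_variance(data))  # 22.5
```

The squares are around 1e18, where neighbouring double-precision values are 128 apart, so the small differences that make up the variance are lost before the subtraction happens.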

You have two options:

Make a list out of your iterator and loop over it as many times as you wish. The drawback is that everything will be held in memory, so this is not suitable if your file is big. Simply using itertools.tee will not save you either.
There is no other solution unless you drop the requirement of passing the output of get_mean to get_sigma, because with that requirement the two functions can only run in series. If you remove this restriction, you can run both functions in parallel using threads, and use itertools.tee to get two iterators from one.

You can use map and reduce in an elegant way.

sample is the list whose variance you want to compute:

from functools import reduce  # needed on Python 3, where reduce is no longer a builtin

sample = [a,b,c, ...]

mean = float(reduce(lambda x,y : x+y, sample)) / len(sample)

variance = reduce(lambda x,y: x+y, map(lambda xi: (xi-mean)**2, sample))/ len(sample)

In a succinct line of code:

variance = reduce(lambda x,y: x+y, map(lambda xi: (xi-(float(reduce(lambda x,y : x+y, sample)) / len(sample)))**2, sample))/ len(sample)
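
For comparison, Python's standard library (3.4 and later) covers the same calculation in the statistics module. The sample values below are made up for illustration. Like the reduce version, this needs the data in memory or at least traverses it more than once.

```python
import statistics

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up data

mean = statistics.mean(sample)
# pvariance is the population variance (divides by len(sample)), like the
# reduce version above; passing mu avoids recomputing the mean.
variance = statistics.pvariance(sample, mu=mean)
```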
