Calculating the mean for a set of numbers while neglecting outliers

2023-03-09 14:32

First of all, this is more of a math question than a coding one, so please be patient. I am trying to figure out an algorithm to calculate the mean of a set of numbers. However, I need to neglect any numbers that are not close to the majority of the results. Here is an example of what I am trying to do:

Let's say I have a set of numbers similar to the following:

{ 90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400 }

It is clear that the majority of the numbers in the set above lie between 90 and 99; however, there are some outliers: { 300, 400, 2, 3 }. I need to calculate the mean of those numbers while neglecting the outliers. I remember reading about something like this in a statistics class, but I can't remember what it was or how to approach the solution.

I would appreciate any help.

Thanks

What you could do is:

1. estimate the percentage of outliers in your data: roughly 25% (4 of 15 values) in the provided dataset,
2. compute suitable quantiles: 8-quantiles for your dataset, so that the lowest and highest eighths, which hold the outliers, can be excluded,
3. compute the mean of the values lying between the first and the last quantile.
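The steps above can be sketched in Python as follows. This is a minimal illustration, not the answerer's own code: it assumes Python 3.8+ for `statistics.quantiles`, and the choice of `method="inclusive"` and the name `trimmed` are assumptions made for this example.

```python
import statistics

data = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]

# Step 2: 8-quantiles give 7 cut points; the first and last bound the central 75%.
# method="inclusive" is an assumption; other interpolation rules shift the cut points.
cut_points = statistics.quantiles(data, n=8, method="inclusive")
low, high = cut_points[0], cut_points[-1]

# Step 3: average only the values between the first and the last cut point.
trimmed = [x for x in data if low <= x <= high]
trimmed_mean = statistics.fmean(trimmed)

print(sorted(trimmed))
print(trimmed_mean)
```

On the sample data, the four values 2, 3, 300 and 400 fall outside the cut points, so the mean is taken over the remaining eleven values.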

PS: Outliers constituting 25% of your dataset is a lot!

PPS: For the second step, we assumed the outliers are "symmetrically distributed". See the graph below, where we use 4-quantiles and 1.5 times the interquartile range (IQR) from Q1 and Q3:

[Figure: 4-quantiles with limits at 1.5 times the IQR from Q1 and Q3]
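The 1.5 times IQR rule mentioned here can be sketched as below. This is an illustrative reading of the rule rather than the answerer's code; the variable names and the `method="inclusive"` choice are assumptions, and `statistics.quantiles` requires Python 3.8+.

```python
import statistics

data = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]

# 4-quantiles return three cut points: Q1, the median, and Q3.
q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Values more than 1.5 * IQR below Q1 or above Q3 are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inliers = [x for x in data if lower_fence <= x <= upper_fence]
iqr_mean = statistics.fmean(inliers)

print(lower_fence, upper_fence)
print(iqr_mean)
```

Unlike the fixed 8-quantile cut, this rule does not require guessing the share of outliers in advance; the fences adapt to the spread of the central half of the data.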

First you need to determine the standard deviation and mean of the full set. The outliers are those values that lie more than 3 standard deviations away from the (full set) mean.
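A minimal sketch of this rule in Python is shown below. The use of the population standard deviation (`pstdev`) is an assumption; the answer does not say which variant is meant.

```python
import statistics

data = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]

# Mean and standard deviation of the full, unfiltered set.
mean = statistics.fmean(data)
stdev = statistics.pstdev(data)  # population standard deviation (assumed)

# Keep only values within 3 standard deviations of the full-set mean.
kept = [x for x in data if abs(x - mean) <= 3 * stdev]
filtered_mean = statistics.fmean(kept)

print(mean, stdev)
print(len(data) - len(kept), "values removed")
```

On the question's sample data this removes nothing: the four outliers inflate the standard deviation to roughly 98.7, so the 3-sigma band around the mean of about 115.3 is wide enough to contain even 400. The rule works better when outliers are a small fraction of the data.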

A simple method that works well is to take the median instead of the average. The median is far more robust to outliers.
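As a quick illustration on the question's data, using Python's standard `statistics` module (the variable names are chosen for this example):

```python
import statistics

data = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]

plain_mean = statistics.fmean(data)       # pulled upward by 300 and 400
robust_center = statistics.median(data)   # middle value of the sorted data

print(plain_mean)
print(robust_center)
```

The median lands inside the 90 to 99 cluster, while the plain mean does not.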

You could also minimize a Geman-McClure function:

\hat{x} = \arg\min_{x} \sum_i G(x_i - x), \quad \text{where } G(r) = \frac{r^2}{r^2 + \sigma^2}

If you plot the G function, you will find that it saturates, which is a good way of softly excluding outliers.
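One way to carry out this minimization is iteratively reweighted least squares, sketched below. The answer does not specify a solver or a value for sigma, so the function name `geman_mcclure_location`, the median starting point, `sigma=5.0` and the fixed iteration count are all assumptions made for illustration.

```python
import statistics

def geman_mcclure_location(values, sigma, iterations=50):
    """Approximate the x minimizing sum(G(v - x)), G(r) = r^2 / (r^2 + sigma^2)."""
    x = statistics.median(values)  # robust starting point (assumed)
    for _ in range(iterations):
        # Weight is G'(r) / r up to a constant factor: sigma^2 / (r^2 + sigma^2)^2.
        # Far-away points get weights close to zero, so they barely move x.
        weights = [sigma**2 / ((v - x) ** 2 + sigma**2) ** 2 for v in values]
        x = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return x

data = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]
estimate = geman_mcclure_location(data, sigma=5.0)
print(estimate)
```

The objective is not convex, so the result depends on the starting point and on sigma, which sets the scale beyond which a residual is treated as an outlier.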

I'd be very careful about this. You could be doing yourself and your conclusions a great disservice.

How is your program supposed to recognize outliers? The normal distribution says that about 99.7% of the values fall within +/- three standard deviations of the mean, so you could calculate the mean and standard deviation for the unfiltered data, exclude the values that fall outside that range, and recalculate.
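The three-standard-deviation coverage of a normal distribution can be checked with `statistics.NormalDist` (Python 3.8+); this small check is an addition for illustration only:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Probability mass between -3 and +3 standard deviations.
coverage = std_normal.cdf(3) - std_normal.cdf(-3)
print(f"{coverage:.2%}")
```

The result is about 99.73%, which applies only to data that really is normally distributed.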

However, you might be throwing away something significant by doing so. The normal distribution isn't sacred; outliers are far more common in real life than the normal distribution would suggest. Read Taleb's "Black Swan" to see what I mean.

Be sure you understand fully what you're excluding before you do so. I think it'd be far better to leave all the data points, warts and all, and come up with a good written explanation for them.

Another approach would be to use an alternative measure such as the median, which is less sensitive to outliers than the mean. It's harder to calculate, though.
