Calculating the mean for a set of numbers while neglecting outliers
First of all, this is more of a math question than a coding one, so please be patient. I am trying to figure out an algorithm to calculate the mean of a set of numbers. However, I need to neglect any numbers that are not close to the majority of the results. Here is an example of what I am trying to do:
Let's say I have a set of numbers similar to the following:
{ 90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400 }
It is clear for the set above that the majority of the numbers lie between 90 and 99; however, I have some outliers like { 300, 400, 2, 3 }. I need to calculate the mean of those numbers while neglecting the outliers. I remember reading about something like that in a statistics class, but I can't remember what it was or how to approach the solution.
I would appreciate any help.
Thanks
What you could do is:
- estimate the percentage of outliers in your data: about 27% (4/15) of the provided dataset,
- compute the adequate quantiles: 8-quantiles for your dataset, so that each tail (roughly 12.5% of the values) covers the outliers,
- compute the mean of the values lying between the first and the last quantile.
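The three steps above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the function name `quantile_trimmed_mean`, the use of `method="inclusive"`, and the decision to keep values on the cut points themselves are illustrative choices, not part of the answer.

```python
import statistics

def quantile_trimmed_mean(values, n_quantiles=8):
    """Mean of the values between the first and last n-quantile cut points."""
    # statistics.quantiles returns n_quantiles - 1 cut points in ascending order.
    # method="inclusive" (an assumption) treats the data as the whole population.
    cuts = statistics.quantiles(values, n=n_quantiles, method="inclusive")
    low, high = cuts[0], cuts[-1]
    # Keep only the values between the first and the last cut point.
    kept = [v for v in values if low <= v <= high]
    # statistics.mean raises StatisticsError if nothing falls between the cuts.
    return statistics.mean(kept)

sample = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]
print(quantile_trimmed_mean(sample))
```

For the sample in the question, the two cut points fall at 68.25 and 149.25, so the eleven values from 90 to 99 are kept and the result is roughly 93.09.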
PS: Outliers making up roughly a quarter of your dataset is a lot!
PPS: For the second step, we assumed the outliers are "symmetrically distributed". A related rule uses 4-quantiles (quartiles) and treats as outliers the values lying more than 1.5 times the interquartile range (IQR) below Q1 or above Q3.
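The quartile-based rule (1.5 times the IQR beyond Q1 and Q3) might look like this in Python. The function name `iqr_filtered_mean`, the parameter `k`, and the `method="inclusive"` setting are assumptions made for illustration.

```python
import statistics

def iqr_filtered_mean(values, k=1.5):
    """Mean of the values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    # With n=4, statistics.quantiles returns the three quartiles Q1, Q2, Q3.
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Values outside the two fences are treated as outliers and dropped.
    kept = [v for v in values if lower <= v <= upper]
    return statistics.mean(kept)

sample = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]
print(iqr_filtered_mean(sample))
```

On the question's sample, Q1 is 91 and Q3 is 97, which gives fences at 82 and 106; the values 2, 3, 300 and 400 are dropped and the mean of the rest is roughly 93.09.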
First, determine the mean and standard deviation of the full set. The outliers are those values that lie more than 3 standard deviations from the (full-set) mean.
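A minimal sketch of that rule, assuming the population standard deviation (`statistics.pstdev`) and a hypothetical helper named `sigma_clipped_mean` with an adjustable multiplier `k`:

```python
import statistics

def sigma_clipped_mean(values, k=3.0):
    """Mean of the values within k standard deviations of the full-set mean."""
    mean = statistics.mean(values)
    # Population standard deviation of the full, unfiltered set (an assumption;
    # statistics.stdev would give the sample standard deviation instead).
    std = statistics.pstdev(values)
    kept = [v for v in values if abs(v - mean) <= k * std]
    return statistics.mean(kept)

sample = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]
print(sigma_clipped_mean(sample))         # k = 3
print(sigma_clipped_mean(sample, k=2.0))  # a tighter, hypothetical threshold
```

It is worth checking the result on your own data. On the question's sample, the extreme values inflate the standard deviation to about 98.7, so no value lies more than 3 standard deviations from the mean (about 115.3) and the first call returns the plain mean. With `k=2.0` only 400 is removed.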
A simple method that works well is to take the median instead of the average. The median is far more robust to outliers.
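As a quick illustration with the numbers from the question, using Python's standard `statistics` module (the variable names are just for this sketch):

```python
import statistics

sample = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]

# The median is the middle value of the sorted data (here the 8th of 15).
median_value = statistics.median(sample)
# The plain mean, for comparison, is pulled upward by 300 and 400.
mean_value = statistics.mean(sample)

print(median_value, mean_value)
```

Here the median is 92, while the plain mean is about 115.3.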
You could also minimize a Geman-McClure function:
\hat{x} = \arg\min_x \sum_i G(x_i - x), where G(r) = r^2 / (r^2 + \sigma^2)
If you plot the G function, you will find that it saturates, which is a good way of softly excluding outliers.
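One way to carry out this minimization is iteratively reweighted least squares (IRLS), sketched below. The answer does not say how to minimize the function, so the IRLS scheme, the starting point (the median), the function name `geman_mcclure_location`, and the scale `sigma=5.0` are all assumptions. Because G is not convex, this finds a local minimum near the starting point.

```python
import statistics

def geman_mcclure_location(values, sigma=5.0, max_iter=100, tol=1e-9):
    """Location estimate that locally minimizes sum(G(v - x)) via IRLS."""
    x = statistics.median(values)  # robust starting point (an assumption)
    for _ in range(max_iter):
        # For G(r) = r^2 / (r^2 + sigma^2), G'(r) = 2 r sigma^2 / (r^2 + sigma^2)^2,
        # so the IRLS weight G'(r) / (2 r) is sigma^2 / (r^2 + sigma^2)^2.
        weights = [sigma**2 / ((v - x) ** 2 + sigma**2) ** 2 for v in values]
        # Weighted mean: far-away values get weights close to zero.
        new_x = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_x - x) < tol:
            return new_x
        x = new_x
    return x

sample = [90, 91, 92, 95, 2, 3, 99, 92, 92, 91, 300, 91, 92, 99, 400]
print(geman_mcclure_location(sample))
```

The weights fall off quickly once a residual is much larger than `sigma`, so the values 2, 3, 300 and 400 contribute almost nothing and the estimate stays inside the 90 to 99 cluster. The choice of `sigma` matters: it should reflect the spread you expect among the non-outlier values.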
I'd be very careful about this. You could be doing yourself and your conclusions a great disservice.
How is your program supposed to recognize outliers? The normal distribution says that about 99.7% of the values fall within +/- three standard deviations of the mean, so you could calculate the mean and standard deviation for the unfiltered data, exclude the values that fall outside the assumed range, and recalculate.
However, you might be throwing away something significant by doing so. The normal distribution isn't sacred; outliers are far more common in real life than the normal distribution would suggest. Read Taleb's "Black Swan" to see what I mean.
Be sure you understand fully what you're excluding before you do so. I think it'd be far better to leave all the data points, warts and all, and come up with a good written explanation for them.
Another approach would be to use an alternative measure like the median, which is less sensitive to outliers than the mean. It's harder to calculate, though.