Returning the middle n (values not index) from a collection

I have a List<int> and I need to remove the outliers so want to use an approach where I only take the middle n. I want the middle in terms of values, not index.

For instance, given the following list, if I wanted the middle 80% I would expect the 11 and the 100 to be removed.

11,22,22,33,44,44,55,55,55,100.

Is there an easy / built in way to do this in LINQ?


I have a List<int> and I need to remove the outliers so want to use an approach where I only take the middle n. I want the middle in terms of values, not index.

Removing outliers correctly depends entirely on the statistical model that accurately describes the distribution of the data -- which you have not supplied for us.

On the assumption that it is a normal (Gaussian) distribution, here's what you want to do.

First compute the mean. That's easy; it's just the sum divided by the number of items.

Second, compute the standard deviation. Standard deviation is a measure of how "spread out" the data is around the mean. Compute it by:

  • take the difference of each point from the mean
  • square the difference
  • take the mean of the squares -- this is the variance
  • take the square root of the variance -- this is the standard deviation

In a normal distribution, about 80% of the items are within roughly 1.28 standard deviations of the mean. So, for example, suppose the mean is 50 and the standard deviation is 20. You would expect about 80% of the sample to fall between 50 - 1.28 * 20 = 24.4 and 50 + 1.28 * 20 = 75.6. You can then filter out items from the list that are outside of that range.

Note however that this is not removing "outliers". This is removing elements that are more than 1.28 standard deviations from the mean, in order to get an 80% interval around the mean. In a normal distribution one expects to see "outliers" on a regular basis. 99.73% of items are within three standard deviations of the mean, which means that if you have a thousand observations, it is perfectly normal to see two or three observations more than three standard deviations from the mean! In fact, anywhere up to, say, five observations more than three standard deviations away from the mean when given a thousand observations probably does not indicate an outlier.

I think you need to very carefully define what you mean by outlier and describe why you are attempting to eliminate them. Things that look like outliers are potentially not outliers at all; they may be real data that you should be paying attention to.

Also, note that none of this analysis is correct if the normal distribution is incorrect! You can get into big, big trouble eliminating what look like outliers when in fact you've actually got the entire statistical model wrong. If the model is more "tail heavy" than the normal distribution then outliers are common, and not actually outliers. Be careful! If your distribution is not normal then you need to tell us what the distribution is before we can recommend how to identify outliers and eliminate them.
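
The mean and standard deviation filter described in this answer could be sketched roughly as follows. This is only an illustration under the normality assumption: the class name NormalFilter, the method name WithinStdDevs and the parameter k are made up for the example, and it uses the population variance (dividing by the number of items) as in the steps above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NormalFilter
{
    // Hypothetical helper: keeps the values within k standard deviations of the mean.
    // Only meaningful if the data is roughly normally distributed.
    public static List<int> WithinStdDevs(IReadOnlyList<int> values, double k)
    {
        if (values.Count == 0) return new List<int>();

        double mean = values.Average();
        // Population variance: the mean of the squared differences from the mean.
        double variance = values.Average(v => (v - mean) * (v - mean));
        double stdDev = Math.Sqrt(variance);

        return values.Where(v => Math.Abs(v - mean) <= k * stdDev).ToList();
    }
}
```

For the list in the question the mean is 44.1 and the standard deviation is about 23.8, so calling WithinStdDevs with k = 1.28 keeps everything from 22 to 55 and drops the 11 and the 100.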


You could use the Enumerable.OrderBy method to sort your list, then use Enumerable.Skip and the Enumerable.Take functions, e.g.:

var result = nums.OrderBy(x => x).Skip(1).Take(8);

Where nums is your list of integers.

Figuring out what values to use as arguments for Skip and Take should look something like this, if you just want the "middle n values":

nums.OrderBy(x => x).Skip((nums.Count - n) / 2).Take(n);

However, (nums.Count - n) / 2 is integer division, so when nums.Count - n is odd the result is rounded down and the window cannot be centred exactly. How do you want the code to behave in that case?
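
One possible way to wrap this up, as a sketch rather than a definitive answer: the class name MiddleByRank and the method name MiddleN are invented for the example, and the choice to let the extra element fall off the high end when the count difference is odd is just one option.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MiddleByRank
{
    // Hypothetical helper: sorts the values and keeps the middle n by sorted position.
    // When (count - n) is odd, integer division rounds the skip down,
    // so one more element is dropped from the high end than from the low end.
    public static List<int> MiddleN(IEnumerable<int> source, int n)
    {
        var sorted = source.OrderBy(x => x).ToList();
        if (n < 0 || n > sorted.Count)
            throw new ArgumentOutOfRangeException(nameof(n));

        int skip = (sorted.Count - n) / 2;
        return sorted.Skip(skip).Take(n).ToList();
    }
}
```

With the ten values from the question, MiddleN(nums, 8) drops the 11 and the 100, while MiddleN(nums, 7) drops the 11 at the low end and both a 55 and the 100 at the high end.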


Assuming you're not doing any weighted average funny business:

List<int> ints = new List<int>() { 11,22,22,33,44,44,55,55,55,100 };

int min = ints.Min();
double range = (ints.Max() - min);

var results = ints.Select(o => new { IntegralValue = o, Weight = (o - min) / range });

You can then filter on Weight as needed, dropping the top/bottom n% as desired.

In your case:

var middle = results.Where(o => o.Weight >= .1 && o.Weight < .9);

Edit: As an extension method, because I like extension methods:

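using System.Collections.Generic;
using System.Linq;

public static class Lulz
{
    // Keeps the values whose position within the min..max range falls in the middle percentage.
    public static List<int> MiddlePercentage(this List<int> ints, double percentage)
    {
        int min = ints.Min();
        double range = ints.Max() - min;

        // Weight is 0 for the minimum and 1 for the maximum.
        var results = ints.Select(o => new { IntegralValue = o, Weight = (o - min) / range });

        // Fraction of the range to cut from each end.
        double tolerance = (1 - percentage) / 2;
        return results.Where(o => o.Weight >= tolerance && o.Weight < 1 - tolerance).Select(o => o.IntegralValue).ToList();
    }
}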

Usage:

List<int> ints = new List<int>() { 11,22,22,33,44,44,55,55,55,100 };
var results = ints.MiddlePercentage(.8);


Normally, if you wanted to exclude statistical outliers from a set of values, you'd compute the arithmetic mean and standard deviation for the set, and then remove values lying further from the mean than you'd like (measured in standard deviations). A normal distribution, your classic bell-shaped curve, exhibits the following properties:

  • About 68% of the data will lie within +/- 1 standard deviation from the mean.
  • About 95% of the data will lie within +/- 2 standard deviations from the mean.
  • About 99.7% of the data will lie within +/- 3 standard deviations of the mean.

You can get Linq extension methods for computation of standard deviation (and other statistical functions) at http://www.codeproject.com/KB/linq/LinqStatistics.aspx


I am not going to question the validity of calculating outliers since I had a similar need to do exactly this kind of selection. The answer to the specific question of taking the middle n is:

List<int> ints = new List<int>() { 11,22,22,33,44,44,55,55,55,100 };
var result = ints.Skip(1).Take(ints.Count() - 2);

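This skips the first item and stops before the last, giving you just the middle items. Note that it works by position, so it returns the middle values only because the example list is already sorted; an unsorted list would need an OrderBy first. Here is a link to a .NET Fiddle demonstrating this query.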

https://dotnetfiddle.net/p1z7em


I have a List<int> and I need to remove the outliers so want to use an approach where I only take the middle n. I want the middle in terms of values, not index.

If I understand correctly we want to keep any values that fall into the middle 80% of the 11-100 range, or

min + (max - min - (max - min) * 0.8) / 2 < x < max - (max - min - (max - min) * 0.8) / 2

Assuming an ordered list, we can SkipWhile the values are lower than lowerBound, and then TakeWhile they are lower than upperBound:

public void Calculate()
{
    var numbers = new[] { 11, 22, 22, 33, 44, 44, 55, 55, 55, 100 };

    var percentage = 0.8;

    var result = RemoveOutliers(numbers, percentage);
}

// Assumes numbers is non-empty and sorted in ascending order.
private IEnumerable<int> RemoveOutliers(int[] numbers, double percentage)
{
    int min = numbers.First();
    int max = numbers.Last();
    double range = max - min;
    // Part of the range to cut from each end.
    double margin = (range - range * percentage) / 2;
    double lowerBound = min + margin;
    double upperBound = max - margin;
    return numbers.SkipWhile(n => n < lowerBound).TakeWhile(n => n < upperBound);
}
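
If the input is not guaranteed to be sorted, the same bounds can be applied with Min, Max and Where instead. This is a sketch only: the class name RangeFilter and the method name MiddleOfRange are made up for the example, and it uses the strict comparisons from the inequality above, so values exactly on a bound are dropped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RangeFilter
{
    // Hypothetical helper: keeps values strictly inside the middle
    // `percentage` of the min..max range. Input order is preserved.
    public static List<int> MiddleOfRange(IEnumerable<int> numbers, double percentage)
    {
        var list = numbers.ToList();
        if (list.Count == 0) return new List<int>();

        int min = list.Min();
        int max = list.Max();
        // Part of the range to cut from each end.
        double margin = (max - min) * (1 - percentage) / 2;
        double lowerBound = min + margin;
        double upperBound = max - margin;

        return list.Where(n => n > lowerBound && n < upperBound).ToList();
    }
}
```

For the values in the question and a percentage of 0.8 the bounds come out at roughly 19.9 and 91.1, so the 11 and the 100 are removed regardless of where they appear in the input.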