Returning the middle n (values not index) from a collection

2023-02-27 00:50 Q&A author:

I have a List<int> and I need to remove the outliers, so I want to use an approach where I only take the middle n. I want the middle in terms of values, not index.

For instance, given the following list, if I wanted the middle 80% I would expect that the 11 and 100 would be removed.

11,22,22,33,44,44,55,55,55,100.

Is there an easy / built in way to do this in LINQ?

I have a List<int> and I need to remove the outliers, so I want to use an approach where I only take the middle n. I want the middle in terms of values, not index.

Removing outliers correctly depends entirely on the statistical model that accurately describes the distribution of the data -- which you have not supplied for us.

On the assumption that it is a normal (Gaussian) distribution, here's what you want to do.

First compute the mean. That's easy; it's just the sum divided by the number of items.

Second, compute the standard deviation. Standard deviation is a measure of how "spread out" the data is around the mean. Compute it by:

take the difference of each point from the mean
square the difference
take the mean of the squares -- this is the variance
take the square root of the variance -- this is the standard deviation

In a normal distribution roughly 80% of the items are within about 1.2 standard deviations of the mean (the exact multiplier is closer to 1.28; 1.2 covers about 77%). So, for example, suppose the mean is 50 and the standard deviation is 20. You would expect about 80% of the sample to fall between 50 - 1.2 * 20 = 26 and 50 + 1.2 * 20 = 74. You can then filter out items from the list that are outside of that range.
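The steps above can be sketched in C# roughly as follows. This is only an illustration under the normality assumption: `NormalFilter` and `WithinStdDevs` are hypothetical names, not part of LINQ, and the variance is the population form (dividing by the count) as described in the steps.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class NormalFilter
{
    // Hypothetical helper: keeps the values that lie within k population
    // standard deviations of the mean (bounds inclusive).
    public static List<int> WithinStdDevs(IReadOnlyList<int> values, double k)
    {
        if (values.Count == 0) return new List<int>();

        // Step 1: the mean is the sum divided by the number of items.
        double mean = values.Average();

        // Step 2: the variance is the mean of the squared differences,
        // and the standard deviation is its square root.
        double variance = values.Average(v => (v - mean) * (v - mean));
        double stdDev = Math.Sqrt(variance);

        // Step 3: keep only the values inside mean +/- k * stdDev.
        double lower = mean - k * stdDev;
        double upper = mean + k * stdDev;
        return values.Where(v => v >= lower && v <= upper).ToList();
    }
}
```

For the list in the question (mean 44.1, standard deviation about 23.8), `NormalFilter.WithinStdDevs(ints, 1.28)` keeps everything except 11 and 100.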

Note, however, that this is not removing "outliers". This is removing elements that are more than roughly 1.2 standard deviations from the mean, in order to get an approximately 80% interval around the mean. In a normal distribution one expects to see "outliers" on a regular basis. 99.73% of items are within three standard deviations of the mean, which means that if you have a thousand observations, it is perfectly normal to see two or three observations more than three standard deviations from the mean! In fact, anywhere up to, say, five observations more than three standard deviations away from the mean in a thousand observations probably does not indicate an outlier.

I think you need to define very carefully what you mean by outlier and describe why you are attempting to eliminate them. Things that look like outliers are potentially not outliers at all; they may be real data that you should be paying attention to.

Also, note that none of this analysis is correct if the normal distribution is incorrect! You can get into big, big trouble eliminating what look like outliers when in fact you've actually got the entire statistical model wrong. If the model is more "tail heavy" than the normal distribution then outliers are common, and not actually outliers. Be careful! If your distribution is not normal then you need to tell us what the distribution is before we can recommend how to identify outliers and eliminate them.

You could use the Enumerable.OrderBy method to sort your list, then use the Enumerable.Skip and Enumerable.Take methods, e.g.:

var result = nums.OrderBy(x => x).Skip(1).Take(8);

Where nums is your list of integers.

Figuring out what values to use as arguments for Skip and Take should look something like this, if you just want the "middle n values":

nums.OrderBy(x => x).Skip((nums.Count - n) / 2).Take(n);

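However, when nums.Count - n is odd, (nums.Count - n) / 2 is not a whole number and integer division truncates it, so the window cannot be centered exactly. How do you want the code to behave in that case?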
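One possible way to settle the odd case is sketched below. `MiddleSelector` and `MiddleN` are hypothetical names, and the policy chosen here (let integer division truncate, so the extra element is dropped from the high end) is just one assumption; rounding the skip count up instead would drop it from the low end.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MiddleSelector
{
    // Hypothetical helper: returns the n middle values by sorted position.
    // When (nums.Count - n) is odd the window cannot be centered exactly;
    // integer division truncates, so one more element is dropped from the
    // high end than from the low end.
    public static List<int> MiddleN(IReadOnlyCollection<int> nums, int n)
    {
        if (n < 0 || n > nums.Count)
            throw new ArgumentOutOfRangeException(nameof(n));

        int skip = (nums.Count - n) / 2;
        return nums.OrderBy(x => x).Skip(skip).Take(n).ToList();
    }
}
```

With the ten values from the question, `MiddleN(nums, 8)` drops 11 and 100, while `MiddleN(nums, 7)` drops 11 from the bottom and both 55 and 100 from the top.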

Assuming you're not doing any weighted average funny business:

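List<int> ints = new List<int>() { 11,22,22,33,44,44,55,55,55,100 };

int min = ints.Min();
double range = ints.Max() - min;

// Weight is each value's position within [min, max], scaled to 0..1.
var results = ints.Select(o => new { IntegralValue = o, Weight = (o - min) / range });

// Keep the values whose weight falls in the middle 80% of the range.
var middle = results.Where(o => o.Weight >= .1 && o.Weight < .9);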

You can then filter on Weight as needed. Drop the top/bottom n% as desired.

In your case:

results.Where(o => o.Weight >= .1 && o.Weight < .9)

Edit: As an extension method, because I like extension methods:

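using System.Collections.Generic;
using System.Linq;

public static class Lulz
{
    // Keeps the values lying in the middle `percentage` (0..1) of the min-to-max range.
    public static List<int> MiddlePercentage(this List<int> ints, double percentage)
    {
        int min = ints.Min();
        double range = ints.Max() - min;

        // Weight is each value's position within [min, max], scaled to 0..1.
        var results = ints.Select(o => new { IntegralValue = o, Weight = (o - min) / range });

        // Fraction of the range to cut from each end.
        double tolerance = (1 - percentage) / 2;
        return results.Where(o => o.Weight >= tolerance && o.Weight < 1 - tolerance).Select(o => o.IntegralValue).ToList();
    }
}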

Usage:

List<int> ints = new List<int>() { 11,22,22,33,44,44,55,55,55,100 };
var results = ints.MiddlePercentage(.8);

Normally, if you wanted to exclude statistical outliers from a set of values, you'd compute the arithmetic mean and standard deviation for the set, and then remove values lying further from the mean than you'd like (measured in standard deviations). A normal distribution, your classic bell-shaped curve, exhibits the following properties:

About 68% of the data will lie within +/- 1 standard deviation from the mean.
About 95% of the data will lie within +/- 2 standard deviations from the mean.
About 99.7% of the data will lie within +/- 3 standard deviations of the mean.

You can get LINQ extension methods for computing the standard deviation (and other statistical functions) at http://www.codeproject.com/KB/linq/LinqStatistics.aspx

I am not going to question the validity of calculating outliers since I had a similar need to do exactly this kind of selection. The answer to the specific question of taking the middle n is:

List<int> ints = new List<int>() { 11,22,22,33,44,44,55,55,55,100 };
var result = ints.Skip(1).Take(ints.Count() - 2);

This skips the first item and stops before the last, giving you just the middle items. Note that it relies on the list already being sorted by value, as it is here. Here is a link to a .NET Fiddle demonstrating this query.

https://dotnetfiddle.net/p1z7em
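To avoid hard-coding the 1 and the 2, the counts can be derived from the percentage. The following is a sketch under a few assumptions: `Trimmer` and `MiddleFraction` are hypothetical names, the input is sorted first so the caller does not have to, and the number of items to keep is rounded to the nearest whole item.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Trimmer
{
    // Hypothetical helper: keeps the middle keepFraction (0..1) of the
    // values by sorted position, e.g. 0.8 keeps the middle 80% of the items.
    public static List<int> MiddleFraction(IEnumerable<int> values, double keepFraction)
    {
        if (keepFraction < 0 || keepFraction > 1)
            throw new ArgumentOutOfRangeException(nameof(keepFraction));

        var sorted = values.OrderBy(v => v).ToList();

        // Round to the nearest item count; rounding (rather than truncating)
        // also absorbs tiny floating-point errors in the multiplication.
        int keep = (int)Math.Round(sorted.Count * keepFraction, MidpointRounding.AwayFromZero);

        // Split the dropped items between both ends; integer division
        // drops the odd one out from the high end.
        int skip = (sorted.Count - keep) / 2;
        return sorted.Skip(skip).Take(keep).ToList();
    }
}
```

For the list in the question, `Trimmer.MiddleFraction(ints, 0.8)` keeps 8 of the 10 items and drops 11 and 100.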

I have a List<int> and I need to remove the outliers, so I want to use an approach where I only take the middle n. I want the middle in terms of values, not index.

If I understand correctly, we want to keep any values that fall into the middle 80% of the 11-100 range, or

min + (max - min - (max - min) * 0.8) / 2 < x < max - (max - min - (max - min) * 0.8) / 2

Assuming an ordered list, we can SkipWhile the values are lower than the lowerBound, and then TakeWhile the values are lower than the upperBound:

public void Calculate()
{
    // The array must already be sorted in ascending order.
    var numbers = new[] { 11, 22, 22, 33, 44, 44, 55, 55, 55, 100 };

    var percentage = 0.8;

    var result = RemoveOutliers(numbers, percentage);
}

private IEnumerable<int> RemoveOutliers(int[] numbers, double percentage)
{
    // Relies on the input being sorted: first is the minimum, last is the maximum.
    int min = numbers.First();
    int max = numbers.Last();
    double range = max - min;
    // Cut half of the excluded part of the range from each end.
    double lowerBound = min + (range - range * percentage) / 2;
    double upperBound = max - (range - range * percentage) / 2;
    return numbers.SkipWhile(n => n < lowerBound).TakeWhile(n => n < upperBound);
}
