Best way to find out if IEnumerable<> has unique values

2023-02-19 10:49 问答作者：

I have a lot of code in which I do something like this

bool GetIsUnique(IEnumerab开发者_JAVA百科le<T> values)
{
    return values.Count() == values.Distinct().Count;
}

Is there a better faster nicer way to do this?

I would make this a nice extension method

public static bool IsUnique<T>(this IEnumerable<T> list)
{
    var hs = new HashSet<T>();
    return list.All(hs.Add);  
}

Checks that all items can be added to a HashSet.

Your method needs to iterate through the sequence twice, with a few of potential drawbacks:

Iterating twice will be slower than iterating once for sequences of any significant size.
Some sequences will throw an exception if you try to iterate them more than once; others might return different results for subsequent iterations.
Your method uses Count which needs to iterate the entire sequence each time. There's no reason why you shouldn't break-out early as soon as you know that there's a duplicate value.

The following method only needs to iterate through the sequence once, and will break-out early as soon as any duplicate value is encountered:

bool GetIsUnique<T>(IEnumerable<T> values)
{
    var set = new HashSet<T>();

    foreach (T item in values)
    {
        if (!set.Add(item))
            return false;
    }
    return true;
}

I think it depends on what you want to do if there are non unique values. @Jamiec's Or @LukeH's answer are great answers and probably best for pure speed, but it can't tell you where issues are.

You might also consider something like

var group = values.GroupBy(x => x);
return group.Any(g => g.Count() > 1);

On it's own its worse than the HashSet implementation. But if you keep that group around you can find which elements are duplicated.

var group = values.GroupBy(x => x);
return group.Where(g => g.Count() > 1);

var group = values.GroupBy(x => x);
return group.Where(g => g.Count() > 1).Select(g => g.Key);

Thinking about it with GroupBy lets you keep your options open for what to do next. But if all you care about is knowing if all values are unique I'd go with the HashSet

You would be doing two loops through the data for the above - once to get the count, once to get the distinct count. Especially bad if the first two items are identical! Try something like this:

bool GetIsUnique<T>(IEnumerable<T> values)
{
    HashSet<T> hashSet = new HashSet<T>();
    foreach(var value in values)
    {
        if (hashSet.Contains(value))
        {
            return false;
        }
        hashSet.Add(value);
    }
    return true;
}

This one will finish as soon as it finds a duplicate. Obviously it on the speed of the hash lookup but given Distinct uses a set internally I'd still expect it to be quicker.

Two basic rules:

The easiest way to read and understand is almost always the best way to code something. That code is deliciously easy to read, so I'd say you're good to go.
Performance ("faster") should only be a concern if you can prove that this is the method that's slowing your program down, or if you're building a library that other people will have access to without having access to the source code.

The other methods are going to be faster (they'll short circuit when a repeated value is found, returning false), but, I'd still stick with your version if it were my code.

A quick way of finding the first duplicate, if present, is:

public static bool TryFindFirstDuplicate<T>(this IEnumerable<T> source, out T duplicate)
{
    var set = new HashSet<T>();
    foreach (var item in source)
    {
        if (!set.Add(item))
        {
            duplicate = item;
            return true;
        }
    }
    duplicate = default(T);
    return false;
}

I'm suprised no one has tested this just yet:

Comparing Gluip's version in the question with JamieC, LukeK, and MrKWatkins, All three answers are better than the asker's version.

Of the three answers, they are all pretty comparable but in most case JamieC's is slightly faster.

When the data has no duplicates or a duplicate is at the end of the IEnumerable, the size or algorithm had no major differences.

When the data has a duplicate near the middle, or at the beginning, Gluip's version in the original question shows its slowness compared to the other three.

The time to check a set appears to scale linearly with the size of set, meaning no algorithm is preferable for large or small sets.

Here's a test program that spits out a CSV that you can load into a spreadsheet program and sort and graph if you like:

Test Program:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace AreUniqueTest
{
class Program
{
    const int Iterations = 1000;

    enum DupeLocation
    {
        None,
        Early,
        Center,
        Late,
    }

    enum SetSize
    {
        Tiny= 10,
        Small = 100,
        Medium = 1000,
        Large = 10000,
        Huge = 100000,
    }

    static void Main()
    {
        Dictionary<string, Func<IEnumerable<int>, bool>> functions = new Dictionary<string, Func<IEnumerable<int>, bool>>
        {
            {"Gluip", GetIsUniqueGluip},
            {"LukeH", GetIsUniqueLukeH },
            {"Jamiec", GetIsUniqueJamiec },
            {"MrKWatkins", GetIsUniqueMrKWatkins }
        };

        var output = new StringBuilder();

        Console.WriteLine("Function,SetSize,DupeLocation,TotalTicks,AverageTicks");
        output.AppendLine("Function,SetSize,DupeLocation,TotalTicks,AverageTicks");

        foreach (SetSize size in Enum.GetValues(typeof(SetSize)))
        {
            var sizevalue = (int) size;
            foreach (DupeLocation location in Enum.GetValues(typeof(DupeLocation)))
            {
                var data = CreateTestData((int)size, location);
                foreach (string functionKey in functions.Keys)
                {
                    var ticks = RunSet(functions[functionKey], data, Iterations);
                    var avg = ticks / Iterations;
                    Console.WriteLine($"{functionKey},{sizevalue},{location},{ticks},{avg}");
                    output.AppendLine($"{functionKey},{sizevalue},{location},{ticks},{avg}");
                }
            }
        }

        File.WriteAllText("output.csv", output.ToString());
        Process.Start("output.csv");
    }

    static long RunSet<T>(Func<IEnumerable<T>, bool> getIsUnique, IEnumerable<T> values, int iterations)
    {
        var stopwatch = new Stopwatch();
        stopwatch.Start();
        for (var i = 0; i < iterations; i++)
        {
            getIsUnique.Invoke(values);
        }
        stopwatch.Stop();
        return stopwatch.ElapsedTicks;
    }

    static bool GetIsUniqueLukeH<T>(IEnumerable<T> values)
    {
        var set = new HashSet<T>();

        foreach (T item in values)
        {
            if (!set.Add(item))
                return false;
        }
        return true;
    }

    static bool GetIsUniqueGluip<T>(IEnumerable<T> values)
    {
        return values.Count() == values.Distinct().Count();
    }

    static bool GetIsUniqueJamiec<T>(IEnumerable<T> list)
    {
        var hs = new HashSet<T>();
        return list.All(hs.Add);
    }

    static bool GetIsUniqueMrKWatkins<T>(IEnumerable<T> values)
    {
        HashSet<T> hashSet = new HashSet<T>();
        foreach (var value in values)
        {
            if (hashSet.Contains(value))
            {
                return false;
            }
            hashSet.Add(value);
        }
        return true;
    }

    static int[] CreateTestData(int size, DupeLocation location)
    {
        var result = new int[size];
        Parallel.For(0, size, i => { result[i] = i; });
        return SetDupe(result, location);
    }

    static int[] SetDupe(int[] values, DupeLocation location)
    {
        switch (location)
        {
            case DupeLocation.Early:
                values[1] = values[0];
                break;
            case DupeLocation.Center:
                var midpoint = values.Length / 2;
                values[midpoint] = values[midpoint + 1];
                break;
            case DupeLocation.Late:
                values[values.Length - 1] = values[values.Length - 2];
                break;
            // case DupeLocation.None: // do nothing.
        }
        return values;
    }
}
}

继续阅读：.net linq

Best way to find out if IEnumerable<> has unique values

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？