开发者

Does Distinct() method keep original ordering of sequence intact?

I want to remove duplicates from list, without changing order of unique elements in the list.

Jon Sk开发者_Go百科eet & others have suggested to use the following:

list = list.Distinct().ToList();

Reference:

  • How to remove duplicates from a List<T>?
  • Remove duplicates from a List<T> in C#

Is it guaranteed that the order of unique elements would be same as before? If yes, please give a reference that confirms this as I couldn't find anything on it in documentation.


It's not guaranteed, but it's the most obvious implementation. It would be hard to implement in a streaming manner (i.e. such that it returned results as soon as it could, having read as little as it could) without returning them in order.

You might want to read my blog post on the Edulinq implementation of Distinct().

Note that even if this were guaranteed for LINQ to Objects (which personally I think it should be) that wouldn't mean anything for other LINQ providers such as LINQ to SQL.

The level of guarantees provided within LINQ to Objects is a little inconsistent sometimes, IMO. Some optimizations are documented, others not. Heck, some of the documentation is flat out wrong.


In the .NET Framework 3.5, disassembling the CIL of the Linq-to-Objects implementation of Distinct() shows that the order of elements is preserved - however this is not documented behavior.

I did a little investigation with Reflector. After disassembling System.Core.dll, Version=3.5.0.0 you can see that Distinct() is an extension method, which looks like this:

public static class Emunmerable
{
    public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source)
    {
        if (source == null)
            throw new ArgumentNullException("source");

        return DistinctIterator<TSource>(source, null);
    }
}

So, interesting here is DistinctIterator, which implements IEnumerable and IEnumerator. Here is simplified (goto and lables removed) implementation of this IEnumerator:

private sealed class DistinctIterator<TSource> : IEnumerable<TSource>, IEnumerable, IEnumerator<TSource>, IEnumerator, IDisposable
{
    private bool _enumeratingStarted;
    private IEnumerator<TSource> _sourceListEnumerator;
    public IEnumerable<TSource> _source;
    private HashSet<TSource> _hashSet;    
    private TSource _current;

    private bool MoveNext()
    {
        if (!_enumeratingStarted)
        {
            _sourceListEnumerator = _source.GetEnumerator();
            _hashSet = new HashSet<TSource>();
            _enumeratingStarted = true;
        }

        while(_sourceListEnumerator.MoveNext())
        {
            TSource element = _sourceListEnumerator.Current;

             if (!_hashSet.Add(element))
                 continue;

             _current = element;
             return true;
        }

        return false;
    }

    void IEnumerator.Reset()
    {
        throw new NotSupportedException();
    }

    TSource IEnumerator<TSource>.Current
    {
        get { return _current; }
    }

    object IEnumerator.Current
    {        
        get { return _current; }
    }
}

As you can see - enumerating goes in order provided by source enumerable (list, on which we are calling Distinct). Hashset is used only for determining whether we already returned such element or not. If not, we are returning it, else - continue enumerating on source.

So, it is guaranteed, that Distinct() will return elements exactly in same order, which are provided by collection to which Distinct was applied.


According to the documentation the sequence is unordered.


Yes, Enumerable.Distinct preserves order. Assuming the method to be lazy "yields distinct values are soon as they are seen", it follows automatically. Think about it.

The .NET Reference source confirms. It returns a subsequence, the first element in each equivalence class.

foreach (TSource element in source)
    if (set.Add(element)) yield return element;

The .NET Core implementation is similar.

Frustratingly, the documentation for Enumerable.Distinct is confused on this point:

The result sequence is unordered.

I can only imagine they mean "the result sequence is not sorted." You could implement Distinct by presorting then comparing each element to the previous, but this would not be lazy as defined above.


By default when use Distinct linq operator uses Equals method but you can use your own IEqualityComparer<T> object to specify when two objects are equals with a custom logic implementing GetHashCode and Equals method. Remember that:

GetHashCode should not used heavy cpu comparision ( eg. use only some obvious basic checks ) and its used as first to state if two object are surely different ( if different hash code are returned ) or potentially the same ( same hash code ). In this latest case when two object have the same hashcode the framework will step to check using the Equals method as a final decision about equality of given objects.

After you have MyType and a MyTypeEqualityComparer classes follow code not ensure the sequence maintain its order:

var cmp = new MyTypeEqualityComparer();
var lst = new List<MyType>();
// add some to lst
var q = lst.Distinct(cmp);

In follow sci library I implemented an extension method to ensure Vector3D set maintain the order when use a specific extension method DistinctKeepOrder:

relevant code follows:

/// <summary>
/// support class for DistinctKeepOrder extension
/// </summary>
public class Vector3DWithOrder
{
    public int Order { get; private set; }
    public Vector3D Vector { get; private set; }
    public Vector3DWithOrder(Vector3D v, int order)
    {
        Vector = v;
        Order = order;
    }
}

public class Vector3DWithOrderEqualityComparer : IEqualityComparer<Vector3DWithOrder>
{
    Vector3DEqualityComparer cmp;

    public Vector3DWithOrderEqualityComparer(Vector3DEqualityComparer _cmp)
    {
        cmp = _cmp;
    }

    public bool Equals(Vector3DWithOrder x, Vector3DWithOrder y)
    {
        return cmp.Equals(x.Vector, y.Vector);
    }

    public int GetHashCode(Vector3DWithOrder obj)
    {
        return cmp.GetHashCode(obj.Vector);
    }
}

In short Vector3DWithOrder encapsulate the type and an order integer, while Vector3DWithOrderEqualityComparer encapsulates original type comparer.

and this is the method helper to ensure order maintained

/// <summary>
/// retrieve distinct of given vector set ensuring to maintain given order
/// </summary>        
public static IEnumerable<Vector3D> DistinctKeepOrder(this IEnumerable<Vector3D> vectors, Vector3DEqualityComparer cmp)
{
    var ocmp = new Vector3DWithOrderEqualityComparer(cmp);

    return vectors
        .Select((w, i) => new Vector3DWithOrder(w, i))
        .Distinct(ocmp)
        .OrderBy(w => w.Order)
        .Select(w => w.Vector);
}

Note : further research could allow to find a more general ( uses of interfaces ) and optimized way ( without encapsulate the object ).


This highly depends on your linq-provider. On Linq2Objects you can stay on the internal source-code for Distinct, which makes one assume the original order is preserved.

However for other providers that resolve to some kind of SQL for example, that isn´t neccessarily the case, as an ORDER BY-statement usually comes after any aggregation (such as Distinct). So if your code is this:

myArray.OrderBy(x => anothercol).GroupBy(x => y.mycol);

this is translated to something similar to the following in SQL:

SELECT * FROM mytable GROUP BY mycol ORDER BY anothercol;

This obviously first groups your data and sorts it afterwards. Now you´re stuck on the DBMS own logic of how to execute that. On some DBMS this isn´t even allowed. Imagine the following data:

mycol anothercol
1     2
1     1
1     3
2     1
2     3

when executing myArr.OrderBy(x => x.anothercol).GroupBy(x => x.mycol) we assume the following result:

mycol anothercol
1     1
2     1

But the DBMS may aggregate the anothercol-column so, that allways the value of the first row is used, resulting in the following data:

mycol anothercol
1    2
2    1

which after ordering will result in this:

mycol anothercol
2    1
1    2

This is similar to the following:

SELECT mycol, First(anothercol) from mytable group by mycol order by anothercol;

which is the completely reverse order than what you expected.

You see the execution-plan may vary depending on what the underlying provider is. This is why there´s no guarantee about that in the docs.


A bit late to the party, but no one really posted the best complete code to accomplish this IMO, so let me offer this (which is essentially identical to what .NET Framework does with Distinct())*:

    public static IEnumerable<T> DistinctOrdered<T>(this IEnumerable<T> items)
    {
        HashSet<T> returnedItems = new HashSet<T>();
        foreach (var item in items)
        {
            if (returnedItems.Add(item))
                yield return item;
        }                       
    }

This guarantees the original order without reliance on undocumented or assumed behavior. I also believe this is more efficient than using multiple LINQ methods though I'm open to being corrected here.

(*) The .NET Framework source uses an internal Set class, which appears to be substantively identical to HashSet.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜