Finding duplicates within list of list

Simple situation. I have a list of lists, almost table-like, and I am trying to find out if any of the lists are duplicated.

Example:

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

I would like to know that there are 4 total items, 2 of which are duplicates. I was thinking about doing something like a SQL checksum but I didn't know if there was a better/easier way.

I care about performance, and I care about ordering.

Additional Information That May Help

  • Things inserted into this list will never be removed
  • Not bound to any specific collection.
  • Don't care about the function signature
  • The type is not restricted to int


Let's try to get the best performance. If n is the number of lists and m is the length of each list, then we can get O(nm + n log n + n), plus some small probability of hash codes being equal for different lists.

Major steps:

  1. Calculate hash codes*
  2. Sort them
  3. Go over list to find dupes

* This is the important step. For simplicity you can calculate the hash as = ... ^ (list[i] << i) ^ (list[i + 1] << (i + 1))

Edit, for those who think PLINQ can boost this but a good algorithm cannot: PLINQ can also be added here, because all of the steps are easily parallelizable.

My code:

static public void Main()
{
    List<List<int>> list = new List<List<int>>(){
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
      new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
      new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
    };
    var hashList = list.Select((l, ind) =>
    {
        // Order-sensitive hash: rotate each element left by its index and XOR them together.
        uint hash = 0;
        for (int i = 0; i < l.Count; i++)
        {
            uint el = (uint)l[i];
            hash ^= (el << i) | (el >> (32 - i));
        }
        return new {hash, ind};
    }).OrderBy(l => l.hash).ToList();
    //hashList.Sort();
    // Walk the hash-sorted list; within each run of equal hashes, compare the
    // underlying lists pairwise to rule out hash collisions.
    uint prevHash = hashList[0].hash;
    int firstInd = 0;
    for (int i = 1; i <= hashList.Count; i++)
    {
        if (i == hashList.Count || hashList[i].hash != prevHash)
        {
            for (int n = firstInd; n < i; n++)
                for (int m = n + 1; m < i; m++)
                {
                    List<int> x = list[hashList[n].ind];
                    List<int> y = list[hashList[m].ind];
                    if (x.Count == y.Count && x.SequenceEqual(y))
                        Console.WriteLine("Dupes: {0} and {1}", hashList[n].ind, hashList[m].ind);
                }                    
        }
        if (i == hashList.Count)
            break;
        if (hashList[i].hash != prevHash)
        {
            firstInd = i;
            prevHash = hashList[i].hash;
        }
    }
}


Unless you're doing some seriously heavy lifting, perhaps the following simple code will work for you:

var lists = new List<List<int>>()
{
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 2, 3, 4, 5, 6 },
   new List<int>() {0 ,1, 4, 2, 4, 5, 6 },
   new List<int>() {0 ,3, 2, 5, 1, 6, 4 }
};

var duplicates = from list in lists
                 where lists.Except(new[] { list }).Any(l => l.SequenceEqual(list))
                 select list;

Obviously you could get better performance if you hand-tweak an algorithm such that you don't have to scan the lists each iteration, but there is something to be said for writing declarative, simpler code.

(Plus, thanks to the Awesomeness of LINQ®, by adding a .AsParallel() call to the above code, the algorithm will run on multiple cores, thus running potentially faster than the complex, hand-tweaked solutions mentioned in this thread.)
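For example, parallelizing the outer scan might look like this (just a sketch; exactly where you put AsParallel() is a judgment call, and for small inputs the parallelization overhead may outweigh the gain):

var duplicates = from list in lists.AsParallel()
                 where lists.Except(new[] { list }).Any(l => l.SequenceEqual(list))
                 select list;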


Something like this will give you the correct results:

List<List<int>> list = new List<List<int>>(){
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,2, 3, 4, 5, 6 },
  new List<int>() {0 ,1 ,4, 2, 4, 5, 6 },
  new List<int>() {0 ,3 ,2, 5, 1, 6, 4 }
};

var duplicates = list.ToLookup(l => String.Join(",", l.Select(i => i.ToString()).ToArray()))
                     .Where(lk => lk.Count() > 1)
                     .SelectMany(group => group);


You will have to iterate through each index of each list at least once, but you can potentially speed up the process by creating a custom hash table, so that you can quickly reject non-duplicate lists without having to do comparisons per-item.

Algorithm:

Create a custom hashtable (dictionary: hash -> list of lists)
For each list
  Take a hash of the list (one that takes order into account)
  Search in hashtable
  If you find matches for the hash
    For each list in the hash entry, re-compare the lists
      If you find a duplicate, return true
  Else if you don't find matches for the hash
    Create a temp list
    Append the current list to our temp list
    Add the temp list to the dictionary as a new hash entry
You didn't find any duplicates, so return false

If you have a strong enough hashing algorithm for your input data, you might not even have to do the sub-comparisons, since there wouldn't be any hash collisions.

I have some example code. The missing bits are:

  • An optimization so that we do the dictionary lookup only once per list (for search and insert). Might have to make your own Dictionary/Hash Table class to do this?
  • A better hashing algorithm that you find by profiling a bunch of them against your data

Here is the code:

public bool ContainsDuplicate(List<List<int>> input)
{
    var encounteredLists = new Dictionary<int, List<EnumerableWrapper>>();

    foreach (List<int> currentList in input)
    {
        var currentListWrapper = new EnumerableWrapper(currentList);
        int hash = currentListWrapper.GetHashCode();

        if (encounteredLists.ContainsKey(hash))
        {
            foreach (EnumerableWrapper currentEncounteredEntry in encounteredLists[hash])
            {
                if (currentListWrapper.Equals(currentEncounteredEntry))
                    return true;
            }

            // Hash collision but no duplicate found: remember this list as well,
            // so that a later copy of it can still be detected.
            encounteredLists[hash].Add(currentListWrapper);
        }
        else
        {
            var newEntry = new List<EnumerableWrapper>();
            newEntry.Add(currentListWrapper);
            encounteredLists[hash] = newEntry;
        }
    }

    return false;
}

sealed class EnumerableWrapper
{
    public EnumerableWrapper(IEnumerable<int> list)
    {
        if (list == null)
            throw new ArgumentNullException("list");
        this.List = list;
    }

    public IEnumerable<int> List { get; private set; }

    public override bool Equals(object obj)
    {
        bool result = false;

        var other = obj as EnumerableWrapper;
        if (other != null)
            result = Enumerable.SequenceEqual(this.List, other.List);

        return result;
    }

    public override int GetHashCode()
    {
        // Todo: Implement your own hashing algorithm here
        var sb = new StringBuilder();
        foreach (int value in List)
            sb.Append(value.ToString());
        return sb.ToString().GetHashCode();
    }
}


Here's a potential idea (this assumes that the values are numerical):

Implement a comparer that multiplies each member of each collection by its index, then sums the whole thing:

Value:    0  5  8  3  2  0  5  3  5  1
Index:    1  2  3  4  5  6  7  8  9  10
Multiple: 0  10 24 12 10 0  35 24 45 10

Member CheckSum: 170

So, the whole "row" has a number that changes with the members and ordering. Fast to compute and compare.
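A minimal sketch of that checksum (the method name is illustrative; since different rows can occasionally sum to the same value, equal checksums should still be confirmed with an element-by-element comparison):

static long RowCheckSum(List<int> row)
{
    long sum = 0;
    for (int i = 0; i < row.Count; i++)
        sum += (long)row[i] * (i + 1);   // multiply each member by its 1-based index
    return sum;                          // the example row above gives 170
}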


You could also try probabilistic algorithms if duplicates are either very rare or very common. e.g. a bloom filter
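A rough sketch of the Bloom-filter idea (the filter size, the two order-sensitive hash functions and the class name are all illustrative). A false result from MightContain is definite; a true result only means "maybe seen", so it should be confirmed with an exact SequenceEqual comparison before the list is reported as a duplicate:

using System.Collections;
using System.Collections.Generic;

class ListBloomFilter
{
    private const int Size = 1 << 20;                 // number of bits in the filter
    private readonly BitArray _bits = new BitArray(Size);

    // Two order-sensitive hashes over the list, reduced modulo the filter size.
    private static void Indexes(List<int> list, out int i1, out int i2)
    {
        unchecked
        {
            uint h1 = 2166136261, h2 = 17;
            foreach (int x in list)
            {
                h1 = (h1 ^ (uint)x) * 16777619;       // FNV-1a style
                h2 = h2 * 31 + (uint)x;               // polynomial style
            }
            i1 = (int)(h1 % Size);
            i2 = (int)(h2 % Size);
        }
    }

    public bool MightContain(List<int> list)
    {
        Indexes(list, out int i1, out int i2);
        return _bits[i1] && _bits[i2];
    }

    public void Add(List<int> list)
    {
        Indexes(list, out int i1, out int i2);
        _bits[i1] = true;
        _bits[i2] = true;
    }
}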


What about writing your own list comparer:

class ListComparer:IEqualityComparer<List<int>>
{
     public bool Equals(List<int> x, List<int> y)
     {
        if(x.Count != y.Count)
          return false;

        for(int i = 0; i < x.Count; i++)
          if(x[i] != y[i])
             return false;

       return true;
     }

     public int GetHashCode(List<int> obj)
     {
        // Combine the elements in order so that equal lists get equal hash codes.
        unchecked
        {
            int hash = 17;
            foreach (int item in obj)
                hash = hash * 31 + item;
            return hash;
        }
     }
}

and then just:

var nonDuplicatedList = list.Distinct(new ListComparer());
var distinctCount = nonDuplicatedList.Count();


If they are all single digits and have the same number of elements, you can put them together, so the first one becomes 123456, and check if the numbers are the same.

Then you would have the list {123456, 123456, 142456, 325164},

which is easier to check for duplicates. If the individual members can have more than one digit, you would have to modify this.

Edit: added sample code; it can be optimized, it's just a quick example to explain what I meant.

var combined = new List<int>();

for (int i = 0; i < list.Count; i++)
{
    List<int> tempList = list[i];
    int temp = 0;
    // Build the digits in order, so {0, 1, 2, 3, 4, 5, 6} becomes 123456.
    for (int j = 0; j < tempList.Count; j++)
    {
        temp = temp * 10 + tempList[j];
    }
    combined.Add(temp);
}

for (int i = 0; i < combined.Count; i++)
{
    // Start at i + 1 so an element is not compared with itself.
    for (int j = i + 1; j < combined.Count; j++)
    {
        if (combined[i] == combined[j])
        {
            return true;
        }
    }
}
return false;


There are a number of good solutions here already, but I believe this one will consistently run the fastest unless there is some structure of the data that you haven't yet told us about.

  • Create a map from integer key to List<int>, and a map from key to List<List<int>>
  • For each List<int>, compute a hash using some simple function like (...((x0)*a + x1)*a + ...)*a + xN, which you can calculate recursively; a should be something like 1367130559 (i.e. some large prime that is randomly non-close to any interesting power of 2).
  • Add the hash and the list it comes from as a key-value pair, if it does not exist. If it does exist, look in the second map. If the second map has that key, append the new List<int> to the accumulating list. If not, take the List<int> you looked up from the first map and the List<int> you were testing, and add a new entry in the second map containing a list of those two items.
  • Repeat until you've passed through your entire first list. Now you have a hashmap with a list of potential collisions (the second map), and a hashmap with a list of keys (the first map).
  • Iterate through the second map. For each entry, take the List<List<int>> therein and sort it lexicographically. Now just walk through doing equality comparisons to count the number of different blocks.
  • Your total number of items is equal to the length of your original list.
  • Your number of distinct items is equal to the size of your first hashmap plus the sum of (number of blocks - 1) for each entry in your second hashmap.
  • Your number of duplicate items is the difference of those two numbers (and you can find out all sorts of other things if you want).

If you have N non-duplicate items, and M entries which are duplicates from a set of K items, then it will take you O(N+M+2K) to create the initial hash maps, at the very worst O(M log M) to do the sorting (and probably more like O(M log(M/K))), and O(M) to do the final equality test.
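A rough sketch of this two-map approach (this is not the answerer's code: the hash constant is the one suggested above, the names are illustrative, and the duplicate count is reported as total items minus distinct items, as described):

using System;
using System.Collections.Generic;
using System.Linq;

static class DuplicateCounter
{
    const int A = 1367130559;   // large prime suggested above

    static int Hash(List<int> l)
    {
        unchecked
        {
            int h = 0;
            foreach (int x in l)
                h = h * A + x;   // (...((x0)*a + x1)*a + ...)*a + xN
            return h;
        }
    }

    // Lexicographic comparison, used to sort each collision bucket.
    static int CompareLex(List<int> x, List<int> y)
    {
        int n = Math.Min(x.Count, y.Count);
        for (int i = 0; i < n; i++)
        {
            int c = x[i].CompareTo(y[i]);
            if (c != 0) return c;
        }
        return x.Count.CompareTo(y.Count);
    }

    public static int CountDuplicates(List<List<int>> input)
    {
        var firstSeen = new Dictionary<int, List<int>>();          // key -> first list with that key
        var collisions = new Dictionary<int, List<List<int>>>();   // key -> all lists sharing that key

        foreach (var list in input)
        {
            int h = Hash(list);
            if (!firstSeen.ContainsKey(h))
                firstSeen[h] = list;
            else if (collisions.TryGetValue(h, out var bucket))
                bucket.Add(list);
            else
                collisions[h] = new List<List<int>> { firstSeen[h], list };
        }

        // Distinct items = unique keys + extra distinct blocks hiding behind shared keys.
        int distinct = firstSeen.Count;
        foreach (var bucket in collisions.Values)
        {
            bucket.Sort(CompareLex);
            int blocks = 1;
            for (int i = 1; i < bucket.Count; i++)
                if (!bucket[i].SequenceEqual(bucket[i - 1]))
                    blocks++;
            distinct += blocks - 1;
        }

        return input.Count - distinct;   // number of duplicate items
    }
}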


Check out "C# 3.0: Need to return duplicates from a List<>"; it shows you how to return duplicates from a list.

Example from that page:

var duplicates = from car in cars
                 group car by car.Color into grouped
                 from car in grouped.Skip(1)
                 select car;
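Adapted to the list-of-lists from the question, the same idea might look like this (a sketch; the joined-string key is just one way to make the grouping order-sensitive, and as in the linked example only the extra copies are returned):

var duplicates = from l in list
                 group l by String.Join(",", l.Select(i => i.ToString()).ToArray()) into grouped
                 from l in grouped.Skip(1)
                 select l;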