
Removing duplicates from a list with "priority"

Given a collection of records like this:

string ID1;
string ID2;
string Data1;
string Data2;
// ...
string DataN;

Initially Data1..N are null, and can pretty much be ignored for this question. ID1 and ID2 both uniquely identify the record. All records will have an ID2; some will also have an ID1. Given an ID2, there is a (time-consuming) method to get its corresponding ID1. Given an ID1, there is a (time-consuming) method to get Data1..N for the record. Our ultimate goal is to fill in Data1..N for all records as quickly as possible.

Our immediate goal is to (as quickly as possible) eliminate all duplicates in the list, keeping the one with more information.

For example, if Rec1 = {ID1="ABC", ID2="XYZ"} and Rec2 = {ID1=null, ID2="XYZ"}, then these are duplicates, but we must specifically remove Rec2 and keep Rec1.

That last requirement rules out the standard ways of removing duplicates (e.g. HashSet), as they consider both sides of the "duplicate" to be interchangeable.


How about you split your original list into three: ones with all data, ones with ID1, and ones with just ID2?

Then do:

var unique = allData.Concat(id1Data.Except(allData))
                    .Concat(id2Data.Except(id1Data).Except(allData));

having defined equality just on the basis of ID2.

I suspect there are more efficient ways of expressing that, but the fundamental idea is sound as far as I can tell. Splitting the initial list into three is simply a matter of using GroupBy (and then calling ToList on each group to avoid repeated queries).
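For concreteness, here's a minimal sketch of that idea. The Record type and the Id2Comparer are hypothetical stand-ins for the question's record, and the split uses Where rather than the GroupBy mentioned above, purely for brevity:

using System.Collections.Generic;
using System.Linq;

class Record
{
    public string ID1;
    public string ID2;
    public string Data1; // stands in for Data1..N
}

// Equality defined purely on ID2, as described above.
class Id2Comparer : IEqualityComparer<Record>
{
    public bool Equals(Record x, Record y) => x.ID2 == y.ID2;
    public int GetHashCode(Record obj) => obj.ID2.GetHashCode();
}

static class Dedup
{
    public static List<Record> KeepRichest(List<Record> records)
    {
        var comparer = new Id2Comparer();

        // Split into three lists by how much information each record carries.
        var allData = records.Where(r => r.Data1 != null).ToList();
        var id1Data = records.Where(r => r.Data1 == null && r.ID1 != null).ToList();
        var id2Data = records.Where(r => r.Data1 == null && r.ID1 == null).ToList();

        // Except removes anything already represented in a "better" list,
        // so only the richest version of each ID2 survives.
        return allData.Concat(id1Data.Except(allData, comparer))
                      .Concat(id2Data.Except(id1Data, comparer)
                                     .Except(allData, comparer))
                      .ToList();
    }
}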

EDIT: Potentially nicer idea: split the data up as before, then do:

var result = new HashSet<...>(allData);
result.UnionWith(id1Data);
result.UnionWith(id2Data);

I believe that UnionWith keeps the existing elements rather than overwriting them with new but equal ones. On the other hand, that's not explicitly specified. It would be nice for it to be well-defined...

(Again, either make your type implement equality based on ID2, or create the hash set using an equality comparer which does so.)
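Filled in with the hypothetical Record and Id2Comparer from the sketch above, the HashSet version might look like this:

var result = new HashSet<Record>(allData, new Id2Comparer());
result.UnionWith(id1Data); // elements whose ID2 is already present are kept, not replaced
result.UnionWith(id2Data);

If the keep-the-existing-element behaviour of UnionWith worries you, the Except/Concat version above avoids relying on it.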


This may smell quite a bit, but I think a LINQ Distinct will still work for you if you ensure the two compared objects come out to be the same. The following comparer would do this:

private class Comp : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        var equalityOfB = x.ID2 == y.ID2;
        if (x.ID1 == y.ID1 && equalityOfB)
            return true;
        if (x.ID1 == null && equalityOfB)
        {
            // Side effect: copy the known ID1 onto the poorer record, so
            // whichever one Distinct keeps ends up with full information.
            x.ID1 = y.ID1;
            return true;
        }
        if (y.ID1 == null && equalityOfB)
        {
            y.ID1 = x.ID1;
            return true;
        }
        return false;
    }

    public int GetHashCode(Item obj)
    {
        return obj.ID2.GetHashCode();
    }
}

Then you could use it on a list like this:

var l = new[] { 
  new Item { ID1 = "a", ID2 = "b" }, 
  new Item { ID1 = null, ID2 = "b" } };
var l2 = l.Distinct(new Comp()).ToArray();


I had a similar issue a couple of months ago.

Try something like this...

public static List<T> RemoveDuplicateSections<T>(List<T> sections) where T : INamedObject
{
    var uniqueStore = new Dictionary<string, int>();
    var finalList = new List<T>();
    foreach (T currValue in sections)
    {
        // Keep only the first occurrence of each name.
        if (!uniqueStore.ContainsKey(currValue.Name))
        {
            uniqueStore.Add(currValue.Name, 0);
            finalList.Add(currValue);
        }
    }
    return finalList;
}
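Note that this keeps the first record seen for each key, so to satisfy the priority requirement you would need to order the input so that fuller records come first. A sketch of that adaptation, keyed on ID2 and reusing the hypothetical Record type from the earlier sketch:

// OrderBy is stable, so within each priority class the original order is kept.
var ordered = records.OrderBy(r => r.ID1 == null ? 1 : 0);
var seen = new HashSet<string>();
var finalList = new List<Record>();
foreach (var r in ordered)
{
    if (seen.Add(r.ID2)) // Add returns false when this ID2 was already seen
        finalList.Add(r);
}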


records.GroupBy(r => r, new RecordByIDsEqualityComparer())
       .Select(g => g.OrderByDescending(r => r, new RecordByFullnessComparer()).First())

or if you want to merge the records, then Aggregate instead of OrderByDescending/First.
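Neither comparer is defined in this answer. As an illustration, RecordByIDsEqualityComparer could be the ID2-based Id2Comparer sketched earlier, and RecordByFullnessComparer might (hypothetically) rank records by how many fields are non-null:

class RecordByFullnessComparer : IComparer<Record>
{
    public int Compare(Record x, Record y)
        => Fullness(x).CompareTo(Fullness(y));

    // Count the non-null fields; a higher count means a richer record.
    static int Fullness(Record r)
        => (r.ID1 != null ? 1 : 0) + (r.Data1 != null ? 1 : 0);
}

// Keep the fullest record in each ID2 group:
var deduped = records
    .GroupBy(r => r, new Id2Comparer())
    .Select(g => g.OrderByDescending(r => r, new RecordByFullnessComparer()).First())
    .ToList();

// Or merge each group instead of picking one record:
var merged = records
    .GroupBy(r => r, new Id2Comparer())
    .Select(g => g.Aggregate((a, b) => new Record
    {
        ID2 = a.ID2,
        ID1 = a.ID1 ?? b.ID1,
        Data1 = a.Data1 ?? b.Data1,
    }))
    .ToList();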
