
Removing duplicates from a list with "priority"

Given a collection of records like this:

string ID1;
string ID2;
string Data1;
string Data2;
// ...
string DataN;

Initially Data1..N are null, and can pretty much be ignored for this question. ID1 and ID2 both uniquely identify the record. All records will have an ID2; some will also have an ID1. Given an ID2, there is a (time-consuming) method to get its corresponding ID1. Given an ID1, there is a (time-consuming) method to get Data1..N for the record. Our ultimate goal is to fill in Data1..N for all records as quickly as possible.

Our immediate goal is to (as quickly as possible) eliminate all duplicates in the list, keeping the one with more information.

For example, if Rec1 = {ID1="ABC", ID2="XYZ"} and Rec2 = {ID1=null, ID2="XYZ"}, then these are duplicates, but we must specifically remove Rec2 and keep Rec1.

That last requirement rules out the standard ways of removing duplicates (e.g. HashSet), as they consider both sides of the "duplicate" to be interchangeable.


How about you split your original list into three: ones with all data, ones with ID1, and ones with just ID2?

Then do:

var unique = allData.Concat(id1Data.Except(allData))
                    .Concat(id2Data.Except(id1Data).Except(allData));

having defined equality just on the basis of ID2.

I suspect there are more efficient ways of expressing that, but the fundamental idea is sound as far as I can tell. Splitting the initial list into three is simply a matter of using GroupBy (and then calling ToList on each group to avoid repeated queries).
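For concreteness, here's a minimal sketch of that idea. The Record type and the Id2Comparer are hypothetical stand-ins for the question's record, and the split uses Where rather than the GroupBy mentioned above, purely for brevity:

using System.Collections.Generic;
using System.Linq;

class Record
{
    public string ID1;
    public string ID2;
    public string Data1; // stands in for Data1..N
}

// Equality defined purely on ID2, as described above.
class Id2Comparer : IEqualityComparer<Record>
{
    public bool Equals(Record x, Record y) => x.ID2 == y.ID2;
    public int GetHashCode(Record obj) => obj.ID2.GetHashCode();
}

static class Dedup
{
    public static List<Record> KeepRichest(List<Record> records)
    {
        var comparer = new Id2Comparer();

        // Split into three lists by how much information each record carries.
        var allData = records.Where(r => r.Data1 != null).ToList();
        var id1Data = records.Where(r => r.Data1 == null && r.ID1 != null).ToList();
        var id2Data = records.Where(r => r.Data1 == null && r.ID1 == null).ToList();

        // Except removes anything already represented in a "better" list,
        // so only the richest version of each ID2 survives.
        return allData.Concat(id1Data.Except(allData, comparer))
                      .Concat(id2Data.Except(id1Data, comparer)
                                     .Except(allData, comparer))
                      .ToList();
    }
}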

EDIT: Potentially nicer idea: split the data up as before, then do:

var result = new HashSet<...>(allData);
result.UnionWith(id1Data);
result.UnionWith(id2Data);

I believe that UnionWith keeps the existing elements rather than overwriting them with new but equal ones. On the other hand, that's not explicitly specified. It would be nice for it to be well-defined...

(Again, either make your type implement equality based on ID2, or create the hash set using an equality comparer which does so.)
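Filled in with the hypothetical Record and Id2Comparer from the sketch above, the HashSet version might look like this:

var result = new HashSet<Record>(allData, new Id2Comparer());
result.UnionWith(id1Data); // elements whose ID2 is already present are kept, not replaced
result.UnionWith(id2Data);

If the keep-the-existing-element behaviour of UnionWith worries you, the Except/Concat version above avoids relying on it.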


This may smell quite a bit, but I think a LINQ Distinct will still work for you if you ensure the two compared objects come out to be the same. The following comparer would do this:

private class Comp : IEqualityComparer<Item>
{
    public bool Equals(Item x, Item y)
    {
        var equalityOfB = x.ID2 == y.ID2;
        if (x.ID1 == y.ID1 && equalityOfB)
            return true;
        if (x.ID1 == null && equalityOfB)
        {
            // Side effect: copy the known ID1 onto the poorer record, so
            // whichever one Distinct keeps ends up with full information.
            x.ID1 = y.ID1;
            return true;
        }
        if (y.ID1 == null && equalityOfB)
        {
            y.ID1 = x.ID1;
            return true;
        }
        return false;
    }

    public int GetHashCode(Item obj)
    {
        return obj.ID2.GetHashCode();
    }
}

Then you could use it on a list like this:

var l = new[] { 
  new Item { ID1 = "a", ID2 = "b" }, 
  new Item { ID1 = null, ID2 = "b" } };
var l2 = l.Distinct(new Comp()).ToArray();


I had a similar issue a couple of months ago.

Try something like this...

public static List<T> RemoveDuplicateSections<T>(List<T> sections) where T : INamedObject
{
    var uniqueStore = new Dictionary<string, int>();
    var finalList = new List<T>();
    foreach (T currValue in sections)
    {
        // Keep only the first occurrence of each name.
        if (!uniqueStore.ContainsKey(currValue.Name))
        {
            uniqueStore.Add(currValue.Name, 0);
            finalList.Add(currValue);
        }
    }
    return finalList;
}
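Note that this keeps the first record seen for each key, so to satisfy the priority requirement you would need to order the input so that fuller records come first. A sketch of that adaptation, keyed on ID2 and reusing the hypothetical Record type from the earlier sketch:

// OrderBy is stable, so within each priority class the original order is kept.
var ordered = records.OrderBy(r => r.ID1 == null ? 1 : 0);
var seen = new HashSet<string>();
var finalList = new List<Record>();
foreach (var r in ordered)
{
    if (seen.Add(r.ID2)) // Add returns false when this ID2 was already seen
        finalList.Add(r);
}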


records.GroupBy(r => r, new RecordByIDsEqualityComparer())
       .Select(g => g.OrderByDescending(r => r, new RecordByFullnessComparer()).First())

or if you want to merge the records, then Aggregate instead of OrderByDescending/First.
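Neither comparer is defined in this answer. As an illustration, RecordByIDsEqualityComparer could be the ID2-based Id2Comparer sketched earlier, and RecordByFullnessComparer might (hypothetically) rank records by how many fields are non-null:

class RecordByFullnessComparer : IComparer<Record>
{
    public int Compare(Record x, Record y)
        => Fullness(x).CompareTo(Fullness(y));

    // Count the non-null fields; a higher count means a richer record.
    static int Fullness(Record r)
        => (r.ID1 != null ? 1 : 0) + (r.Data1 != null ? 1 : 0);
}

// Keep the fullest record in each ID2 group:
var deduped = records
    .GroupBy(r => r, new Id2Comparer())
    .Select(g => g.OrderByDescending(r => r, new RecordByFullnessComparer()).First())
    .ToList();

// Or merge each group instead of picking one record:
var merged = records
    .GroupBy(r => r, new Id2Comparer())
    .Select(g => g.Aggregate((a, b) => new Record
    {
        ID2 = a.ID2,
        ID1 = a.ID1 ?? b.ID1,
        Data1 = a.Data1 ?? b.Data1,
    }))
    .ToList();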
