Efficient algorithm to find additions and removals from 2 collections
Hi I would like to implement an efficient algorithm to handle the following case:
Lets assume we have 2 lists with the following elements:
Source: [a,b,c,d,e] New: [d,e,f,g]
Now I have to update source with the new information. The algorithm should be able to find that 'f' and 'g' are new entries, that 'a', 'b' and 'c' has been removed and that 'd' and 'e' have not being modified.
The operations involved are set-intersect op开发者_如何学Pythonerations between Source and New, and viceversa. I am looking for an efficient algorithm to implement in C# for arbitrary non-sorted enumerations.
Thanks in advance,
var added = New.Except(Source);
var removed = Source.Except(New);
var notModified = Source.Intersect(New);
If you want to have an approach where you "show your workings", I'd suggest putting them each into HashSets, as that allows for a fast Contains
check, compared with other enumerations.
Edit:
Okay, if we're going for total speed at the cost of efficiency of expression, then with the following assumptions:
- We have a reasonably hash-able type of item (if not, but they can be absolutely sorted, then a SortedList might beat a hash-set).
- We cannot predict whether Source or New will be larger (in the example, there's a slight advantage of doing this the other way around to how I have this, but I'm assuming that is just by chance in the data and that we have to expect each with equal likelihood.
Then I would suggest:
HashSet<T> removed = Source as HashSet<T> ?? new HashSet<T>(Source);
LinkedList<T> added = new LinkedList<T>();
LinkedList<T> notModified = new LinkedList<T>();
foreach(T item in New)
if(removed.Remove(item))
notModified.AddLast(item);
else
added.AddLast(item);
In setting up removed
I test if it's already a hashset to avoid a wasteful building of a new one (I assume the input is typed as IEnumerable<T>
). Of course, this is a destructive action so we may wish to avoid it anyway.
Note also that I modify the hashset while enumerating through it. This is allowed by hashset, but outside of the guarantees given by the enumerators, so is implementation-depended. Still, with the current framework impl. it's more efficient to do so than test and add to a different removed collection.
I went for linked-lists for the two other collections, as they tend to come out well in terms of insertion cost (not just O(1), but a fast O(1) compared to using another set).
Now, if you want to go further still, there're probably micro-optimisations available in the implementation of hash-set if you roll your own.
I have not tested this for performance, but my gut feeling is that you should first sort the two lists. Then you can step through the lists key each removed, added or unchanged element as you progress.
1- Sort the Old and New list
2- Set up a pointer for each list lets call them p1 and p2
3- Step the pointers using the following algorithm
a) If Old[p1] = New[p2] the items are unchanged, increment p1 and p2
b) If Old[p1] < New[p2] then Old[p1] has been removed, increment p1
c) If Old[p1] > new[p2] then New[p2] is a new element, increment p2
d) If p1 > Old.ItemCount then break out of loop, rest of New contains new items
e) If p2 > New.ItemCount then break out of loop, rest of Old items have been removed
f) If p1 < Old.ItemCount and p2 < Old.ItemCount Goto step **a**
That was just off the top of my head, but the basics should be correct. The key to this is that the lists are sorted of course.
Here is a quick and dirty demo, I included the sort for demonstration purposed, of course in this case the data is already sorted.
static void Main(string[] args)
{
string[] oldList = { "a", "b", "c", "d", "e" };
string[] newList = { "d", "e", "f", "g" };
Array.Sort(oldList);
Array.Sort(newList);
int p1 = 0;
int p2 = 0;
while (p1 < oldList.Length && p2 < newList.Length)
{
if (string.Compare(oldList[p1], newList[p2]) == 0)
{
Console.WriteLine("Unchanged:\t{0}", oldList[p1]);
p1++;
p2++;
}
else if (string.Compare(oldList[p1], newList[p2]) < 0)
{
Console.WriteLine("Removed:\t{0}", oldList[p1]);
p1++;
}
else if (string.Compare(oldList[p1], newList[p2]) > 0)
{
Console.WriteLine("Added:\t\t{0}", newList[p2]);
p2++;
}
}
while (p1 < oldList.Length)
{
Console.WriteLine("Removed:\t{0}", oldList[p1]);
p1++;
}
while (p2 < newList.Length)
{
Console.WriteLine("Added :\t\t{0}", newList[p2]);
p2++;
}
Console.ReadKey();
}
The output from the above
Removed: a
Removed: b
Removed: c
Unchanged: d
Unchanged: e
Added : f
Added : g
You might use the set operations available in Linq.
string[] list1 = { "a","b","c","d","e"};
string[] list2 = { "d", "e", "f", "g" };
string[] newElements = list2.Except(list1).ToArray();
string[] commonElements = list2.Intersect(list1).ToArray();
string[] removedElements = list1.Except(list2).ToArray();
Note: The above code assumes that each of the lists is distinct, i.e. does not contain the same element more than once. For example, for the lists [a, b, c, c] and [a, b, c] the code won't detect the removed element.
Call the sets X and Y. If set X supports rapid lookups, and you have a convenient means of "tagging" and "untagging" items in it, you could start by tagging all the items in X, and then query X for each item in Y. If an item isn't found, the item is "new" in Y. If the item is found, it's common to both sets and you should untag it in X. Repeat for all items in Y. When you're done, any items in X that are still tagged have been "deleted" from Y.
This approach only requires one of the sets to support convenient queries and tagging. It requires querying one set for all the records in the other, and then grabbing from it all items that haven't generated hits. There is no requirement to sort either set.
I think what you are looking are set operations i.e. union, etc. Take a look at this article: http://srtsolutions.com/public/item/251070
精彩评论