Efficient way to clone a HashSet<T>?
A few days ago, I answered an interesting question on SO about HashSet<T>. A possible solution involved cloning the hashset, and in my answer I suggested doing something like this:
HashSet<int> original = ...
HashSet<int> clone = new HashSet<int>(original);
Although this approach is quite straightforward, I suspect it's very inefficient: the constructor of the new HashSet<T> needs to add each item from the original hashset separately, checking each time whether it is already present. This is clearly a waste of time: since the source collection is an ISet<T>, it is guaranteed not to contain duplicates. There should be a way to take advantage of that knowledge...
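Conceptually, the constructor boils down to something like the following sketch (not the actual BCL source, and CloneNaively is just an illustrative name) that shows why every element pays for a duplicate check:
// Rough illustration of what new HashSet<int>(original) has to do:
// every element goes through Add, which hashes it and probes for a
// duplicate, even though a set can never contain one.
static HashSet<int> CloneNaively(HashSet<int> original)
{
    var clone = new HashSet<int>(original.Comparer);
    foreach (int item in original)
    {
        clone.Add(item);   // each call re-checks membership
    }
    return clone;
}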
Ideally, HashSet<T> would implement ICloneable, but unfortunately that's not the case. I also checked with Reflector to see whether the HashSet<T> constructor does anything specific when the source collection is itself a hashset, but it doesn't. It could probably be done by using reflection on private fields, but that would be an ugly hack...
So, has anyone come up with a clever solution to clone a hashset more efficiently?
(Note that this question is purely theoretical; I don't need to do that in a real program)
If you really wanted the most efficient way to clone a HashSet<T>, you'd do the following (but possibly at the cost of maintainability):
- Use Reflector or the debugger to figure out exactly which fields in HashSet<T> need to be copied. You may need to do this recursively for each field.
- Use Reflection.Emit or expression trees to generate a method which does the necessary copying of all of the fields. It may need to call other generated methods which copy the value of each field. We're using runtime code generation because it's the only way to directly access private fields.
- Use FormatterServices.GetUninitializedObject(...) to instantiate a blank object, then use the method generated in step 2 to copy the original object into the new blank object.
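A hedged sketch of what steps 2 and 3 could look like with expression trees; the ReflectionCloner/Copier/ShallowClone names are my own illustration, it assumes full trust, and it only produces a shallow copy (Expression.Assign also cannot write readonly fields, so those are skipped here):
using System;
using System.Linq;
using System.Linq.Expressions;
using System.Reflection;
using System.Runtime.Serialization;

public static class ReflectionCloner
{
    static class Copier<T> where T : class
    {
        // Compiled once per type: copies every writable instance field
        // (public and private) from source to target. Reference-typed fields
        // (such as HashSet<T>'s internal arrays) are copied by reference, so
        // a real cloner would still have to clone those separately.
        public static readonly Action<T, T> CopyFields = Build();

        static Action<T, T> Build()
        {
            var source = Expression.Parameter(typeof(T), "source");
            var target = Expression.Parameter(typeof(T), "target");
            var copies = typeof(T)
                .GetFields(BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic)
                .Where(f => !f.IsInitOnly) // Expression.Assign rejects readonly fields
                .Select(f => (Expression)Expression.Assign(
                    Expression.Field(target, f),
                    Expression.Field(source, f)))
                .ToList();
            var body = copies.Count > 0
                ? (Expression)Expression.Block(copies)
                : Expression.Empty();
            return Expression.Lambda<Action<T, T>>(body, source, target).Compile();
        }
    }

    public static T ShallowClone<T>(T original) where T : class
    {
        // Step 3: create a blank instance without running any constructor,
        // then let the generated delegate copy the fields across.
        var clone = (T)FormatterServices.GetUninitializedObject(typeof(T));
        Copier<T>.CopyFields(original, clone);
        return clone;
    }
}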
I checked the .NET Framework source code for both version 4.5.2 and version 4.7.2. Version 4.7.2 does have an optimization in the constructor that handles the case where the passed-in collection is itself a HashSet<T>, using some internal cloning logic. You also need to pass the comparer to the constructor for this logic to kick in. Version 4.5.2 does NOT appear to have this optimization.
Example:
var clonedSet = new HashSet<int>(set, set.Comparer);
EDIT: After closer inspection this does not seem to be such a good idea: with fewer than 60 elements in the original hashset, the method below appears to be slower than just creating a new hashset.
DISCLAIMER: this seems to work, but use it at your own risk. If you are going to serialize the cloned hashsets, you probably want to copy the SerializationInfo field m_siInfo as well.
I also faced this problem and took a stab at it. Below you will find an extension method that uses FieldInfo.GetValue and SetValue to copy the required fields. It is faster than using the HashSet<T>(IEnumerable<T>) constructor; how much faster depends on the number of elements in the original hashset. For 1,000 elements the difference is about a factor of 7; with 100,000 elements it's about a factor of 3.
There are other ways which may be even faster, but this has gotten rid of the bottleneck for me for now. I tried using expression trees and Reflection.Emit but hit a roadblock; if I get those to work I'll update this post.
using System;
using System.Collections.Generic;
using System.Reflection;
using System.Runtime.Serialization;

public static class HashSetExtensions
{
    public static HashSet<T> Clone<T>(this HashSet<T> original)
    {
        // Create a blank instance without running any constructor, then copy
        // the private state across. Relies on the .NET Framework field layout
        // of HashSet<T> (the m_* fields below).
        var clone = (HashSet<T>)FormatterServices.GetUninitializedObject(typeof(HashSet<T>));
        Copy(Fields<T>.comparer, original, clone);

        if (original.Count == 0)
        {
            Fields<T>.freeList.SetValue(clone, -1);
        }
        else
        {
            Fields<T>.count.SetValue(clone, original.Count);
            // The bucket and slot arrays hold the actual data, so they are
            // cloned rather than shared with the original.
            Clone(Fields<T>.buckets, original, clone);
            Clone(Fields<T>.slots, original, clone);
            Copy(Fields<T>.freeList, original, clone);
            Copy(Fields<T>.lastIndex, original, clone);
            Copy(Fields<T>.version, original, clone);
        }

        return clone;
    }

    // Copies a field value as-is.
    static void Copy<T>(FieldInfo field, HashSet<T> source, HashSet<T> target)
    {
        field.SetValue(target, field.GetValue(source));
    }

    // Copies an array-typed field by shallow-cloning the array itself.
    static void Clone<T>(FieldInfo field, HashSet<T> source, HashSet<T> target)
    {
        field.SetValue(target, ((Array)field.GetValue(source)).Clone());
    }

    // Cached FieldInfo handles for HashSet<T>'s private fields.
    static class Fields<T>
    {
        public static readonly FieldInfo freeList = GetFieldInfo("m_freeList");
        public static readonly FieldInfo buckets = GetFieldInfo("m_buckets");
        public static readonly FieldInfo slots = GetFieldInfo("m_slots");
        public static readonly FieldInfo count = GetFieldInfo("m_count");
        public static readonly FieldInfo lastIndex = GetFieldInfo("m_lastIndex");
        public static readonly FieldInfo version = GetFieldInfo("m_version");
        public static readonly FieldInfo comparer = GetFieldInfo("m_comparer");

        static FieldInfo GetFieldInfo(string name)
        {
            return typeof(HashSet<T>).GetField(name, BindingFlags.Instance | BindingFlags.NonPublic);
        }
    }
}
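Usage then looks like this (assuming the extension method above is in scope):
var original = new HashSet<int> { 1, 2, 3 };
var clone = original.Clone();   // field-by-field copy instead of re-adding each item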
Easy pattern which should work for many collections:
Class cloneableDictionary(Of T, U)
    Inherits Dictionary(Of T, U)
    Function clone() As Dictionary(Of T, U)
        Return CType(Me.MemberwiseClone, cloneableDictionary(Of T, U))
    End Function
End Class
Unfortunately, I don't know that Microsoft did anything to prevent calling MemberwiseClone in places where it shouldn't be called (e.g. by declaring something other than a method, like maybe a class, with the name MemberwiseClone), so I don't know how one can tell whether such an approach is likely to work.
I think there's a fair reason for a standard collection not to support a public cloning method but only a protected one: it's possible that a class which derives from a collection might break severely if cloned, and if the base class' cloning method is public there's no way to prevent an object of a derived class from being given to code that expects to clone it.
That having been said, it would have been nice if .NET included cloneableDictionary and other such classes as standard types (though obviously not implemented essentially as above).
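A rough C# equivalent of the same pattern applied to HashSet<T> (CloneableHashSet is my own illustrative name; keep in mind that MemberwiseClone is a shallow copy, so the clone initially shares its internal arrays with the original, and whether that is safe depends on implementation details):
using System.Collections.Generic;

public class CloneableHashSet<T> : HashSet<T>
{
    public CloneableHashSet() { }
    public CloneableHashSet(IEnumerable<T> items) : base(items) { }

    public HashSet<T> Clone()
    {
        // MemberwiseClone is protected, so it can only be exposed from a subclass.
        return (HashSet<T>)MemberwiseClone();
    }
}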
An O(n) clone is as good as it can get, theoretically, if the two sets are not going to share the same underlying data structure.
Checking whether or not an element is in a HashSet should be a constant time (i.e. O(1)) operation.
So you could create a wrapper that would just wrap an existing HashSet and hold on to any new additions, but that seems pretty perverse.
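For what it's worth, a bare-bones sketch of that wrapper idea (LayeredSet is a hypothetical name; it only supports additions, not removals, and assumes the wrapped set is not mutated afterwards):
using System.Collections;
using System.Collections.Generic;

// An add-only "clone" that layers new elements on top of an existing set
// instead of copying it.
public class LayeredSet<T> : IEnumerable<T>
{
    private readonly HashSet<T> baseSet;
    private readonly HashSet<T> additions = new HashSet<T>();

    public LayeredSet(HashSet<T> baseSet)
    {
        this.baseSet = baseSet;
    }

    public int Count
    {
        get { return baseSet.Count + additions.Count; }
    }

    public bool Contains(T item)
    {
        return baseSet.Contains(item) || additions.Contains(item);
    }

    public bool Add(T item)
    {
        return !baseSet.Contains(item) && additions.Add(item);
    }

    public IEnumerator<T> GetEnumerator()
    {
        foreach (T item in baseSet) yield return item;
        foreach (T item in additions) yield return item;
    }

    IEnumerator IEnumerable.GetEnumerator()
    {
        return GetEnumerator();
    }
}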
When you say 'efficient', you mean 'more efficient than the existing O(n) method' - I posit you can't actually get more efficient than O(n) without playing pretty serious semantic games about what 'clone' means.
Just a random thought. It might be silly.
Since they did not implement ICloneable, and the constructor does not use the knowledge that the source is of the same type, I guess we're left with one option: implementing the optimized version ourselves and adding it as an extension method to the type.
Something like:
namespace ExtensionMethods
{
    public static class MyExtensions
    {
        public static HashSet<int> Clone(this HashSet<int> original)
        {
            HashSet<int> clone = new HashSet<int>();
            //your optimized code here
            return clone;
        }
    }
}
Then, your code from the question would look like this:
HashSet<int> original = ...
HashSet<int> clone = original.Clone();