What's better for creating distinct data structures: HashSet or Linq's Distinct()?
I'm wondering whether I can get a consensus on which method is the better approach to creating a distinct set of elements: a C# HashSet
or using IEnumerable's .Distinct()
, which is a Linq function?
Let's say I'm looping through query results from the DB with DataReader, and my options are t开发者_Go百科o add the objects I construct to a List<SomeObject>
or to a HashSet<SomeObject>
With the List
option, I would wind up having to do something like:
myList = myList.Distinct().ToList<SomeObject>();
With the HashSet
, my understanding is that adding elements to it takes care of the non-duplication by itself, assuming you've overrided the GetHashCode()
and Equals()
methods in SomeObject. I'm concerned mainly with the risks and performance aspects of the options.
Thanks.
Anthony Pegram has said it the best. Use the right tool for the job. I say this because a Distinct
or HashSet
isn't that big different when it comes to performance. Use a HashSet
when the collection should always hold only distinct stuffs. It also tells the programmer that you cant add duplicates to it. Use a normal List<T>
and .Distinct()
ont it when you will have to add duplicates and remove duplicates later. The intention matters.
In general,
a) a HashSet may not do any good if you're adding new objects from db and you haven't specified a custom Equals
of your own. Every object from db can be a new instance for your hashset (if you are just new-ing) and that will lead to duplicates in the collection. In that case use normal List<T>
.
b) If you do have an equality comparer defined for hashset, and your collection should always hold only distinct objects, use hashset.
c) If you do have an equality comparer defined for hashset, and you want only distinct objects from db but collection need not always hold only distinct objects (ie duplicates needed to be added later), a faster approach is to get the items from db to a hashset and then return a regular list from that hashset.
d) The best thing you should do is to give the task of removing duplicates to database, thats the right tool And that's first class!
As for performance differences, in my testing I always found HashSet to be faster, but then that's only marginal. That's obvious considering with List approach you have to first add and then do a distinct on it.
Test method: Starting with two general functions,
public static void Benchmark(Action method, int iterations = 10000)
{
Stopwatch sw = new Stopwatch();
sw.Start();
for (int i = 0; i < iterations; i++)
method();
sw.Stop();
MsgBox.ShowDialog(sw.Elapsed.TotalMilliseconds.ToString());
}
public static List<T> Repeat<T>(this ICollection<T> lst, int count)
{
if (count < 0)
throw new ArgumentOutOfRangeException("count");
var ret = Enumerable.Empty<T>();
for (var i = 0; i < count; i++)
ret = ret.Concat(lst);
return ret.ToList();
}
Implementation:
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
Benchmark(() =>
{
hash.Clear();
foreach (var item in d)
{
hash.Add(item);
}
});
~3300 ms
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();
Benchmark(() =>
{
list.Clear();
foreach (var item in d)
{
list.Add(item);
}
list = list.Distinct().ToList();
});
~5800 ms
A difference of 2.5 seconds is not bad for a list of 10000 objects when iterated another 10000 times. For normal cases the difference will be hardly noticeable.
The best approach possibly for you with your current design:
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();
Benchmark(() =>
{
hash.Clear();
foreach (var item in d)
{
hash.Add(item);
}
list = hash.ToList();
});
~3300 ms
There isn't any significant difference, see..
Partly unrelated - after posting this answer, I was curious to know what's the best approach in removing duplicates, from a normal list.
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
HashSet<int> hash = new HashSet<int>();
List<int> list = new List<int>();
Benchmark(() =>
{
hash = new HashSet<int>(d);
});
~3900 ms
var d = Enumerable.Range(1, 100).ToList().Repeat(100);
List<int> list = new List<int>();
Benchmark(() =>
{
list = d.Distinct().ToList();
});
~3200 ms
Here the right tool Distinct
is faster than hackish HashSet
! Perhaps its the overhead of creating a hash set.
I have tested with various other combinations like reference types, without duplicates in original list etc. The results are consistent.
What's better is what's the most expressive of describing your intention. The internal implementation details are more or less going to be the same, the difference being "who's writing the code?"
If your intention is to create from the ground up a distinct collection of items from a source that is not a collection of said items, I would argue for the HashSet<T>
. You have to create the item, you have to build the collection, you might as well build the right one from the beginning.
Otherwise, if you already have a collection of items and you want to eliminate duplicates, I would argue for invoking Distinct()
. You already have a collection, you just want an expressive way to get the distinct items out of it.
"Better" is a tricky word to use - it can mean so many different things to different people.
For readability, I would go for Distinct()
as I personally find this more comprehensible.
For performance, I suspect a hand-crafted HashSet implementation might perform mildly quicker - but I doubt it would be very different as the internal implementation of Distinct
will no doubt itself use some form of hashing.
For what I think of as "best" implementation... I think you should use Distinct
but somehow push this down to the database layer - i.e. change the underlying database SELECT before you fill the DataReader.
For large collections HashSet is likely to be faster. It relies on the hashcode of the objects to quickly determine whether or not an element already exists in the set.
In practice, it (most likely) won't matter (but you should measure if you care).
I instinctively guessed at first that HashSet
would be faster, because of the fast hash checking it uses. However, I looked up the current (4.0) implementation of Distinct in the reference sources, and it uses a similar Set
class (which also relies on hashing) under the covers. Conclusion; there are no practical performance difference.
For your case, I would go with .Distinct
for readability - it clearly conveys the intent of the code. However, I agree with one of the other answers, that you probably should perform this operationn in the DB if possible.
If yor looping through the results of a DbReader adding your resutls to a Hashset would be better than adding it to a List and than doing a Distinct on that. You would save one itteration. (Distinct internally uses a HashSet)
The implementation of Distinct may use HashSet. Take a look at Jon Skeet's Edulinq implementation.
精彩评论