Python: detect duplicates using a set
I have a large number of objects I need to store in memory for processing in Python. Specifically, I'm trying to remove duplicates from a large set of objects. I want to consider two objects "equal" if a certain instance variable in the object is equal. So, I assumed the easiest way to do this would be to insert all my objects into a set, and override the __hash__ method so that it hashes the instance variable I'm concerned with.
So, as a test I tried the following:
    class Person:
        def __init__(self, n, a):
            self.name = n
            self.age = a

        def __hash__(self):
            return hash(self.name)

        def __str__(self):
            return "{0}:{1}".format(self.name, self.age)

    myset = set()
    myset.add(Person("foo", 10))
    myset.add(Person("bar", 20))
    myset.add(Person("baz", 30))
    myset.add(Person("foo", 1000))  # try adding a duplicate

    for p in myset:
        print(p)
Here, I define a Person class, and any two instances of Person with the same name variable are to be equal, regardless of the value of any other instance variable. Unfortunately, this outputs:
baz:30
foo:10
bar:20
foo:1000
Note that foo appears twice, so this program failed to notice duplicates. Yet the expression hash(Person("foo", 10)) == hash(Person("foo", 1000)) is True. So why doesn't this properly detect duplicate Person objects?
You forgot to also define __eq__().
If a class does not define a __cmp__() or __eq__() method it should not define a __hash__() operation either; if it defines __cmp__() or __eq__() but not __hash__(), its instances will not be usable in hashed collections. If a class defines mutable objects and implements a __cmp__() or __eq__() method, it should not implement __hash__(), since hashable collection implementations require that an object's hash value is immutable (if the object's hash value changes, it will be in the wrong hash bucket).
A set obviously will have to deal with hash collisions. If the hash of two objects matches, the set will compare them using the == operator to make sure they are really equal. In your case, this will only yield True if the two objects are the same object (the standard implementation for user-defined classes).
Long story short: Also define __eq__() to make it work.
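A minimal sketch of that fix, reusing the Person class from the question. The parameter names and the isinstance check are choices made here for illustration, not part of the original code. It assumes that two people count as equal exactly when their names match:

```python
class Person:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __eq__(self, other):
        # Two people are "equal" when their names match; age is ignored.
        if not isinstance(other, Person):
            return NotImplemented
        return self.name == other.name

    def __hash__(self):
        # Must agree with __eq__: equal objects hash to the same value.
        return hash(self.name)

    def __str__(self):
        return "{0}:{1}".format(self.name, self.age)


people = set()
people.add(Person("foo", 10))
people.add(Person("bar", 20))
people.add(Person("baz", 30))
people.add(Person("foo", 1000))  # equal to the first "foo", so not added

for p in people:
    print(p)
```

With both methods defined, the set should end up with three entries. The first "foo" (age 10) stays, because adding an element that compares equal to an existing one leaves the set unchanged.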
A hash function alone is not enough to distinguish objects; you also have to implement the comparison function (i.e. __eq__).
A hash function effectively says "A maybe equals B" or "A not equals B (for sure)".
If it says "maybe equals" then equality has to be checked anyway to make sure, which is why you also need to implement __eq__.
Nevertheless, defining __hash__ will significantly speed things up by making "A not equal B (for sure)" an O(1) operation.
The hash function must however always follow the "hash rule":
- "hash rule": equal things must hash to the same value
- (justification: or else we'd say "A not equals B (for sure)" when that is not the case)
For example, you could hash everything by def __hash__(self): return 1. This would still be correct, but it would be inefficient because you'd have to check __eq__ each time, which may be a long process if you have complicated large data structures (e.g. with large lists, dictionaries, etc.).
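To illustrate the cost of a constant hash, here is a rough sketch. The ConstantHash class and its eq_calls counter are hypothetical names introduced only for this demonstration. Every object hashes to 1, so the set cannot rule anything out by hash and has to fall back on __eq__:

```python
class ConstantHash:
    eq_calls = 0  # class-level counter of how often __eq__ runs

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        ConstantHash.eq_calls += 1
        return isinstance(other, ConstantHash) and self.value == other.value

    def __hash__(self):
        # Correct (equal objects share a hash) but useless for filtering.
        return 1


items = {ConstantHash(i) for i in range(100)}

print(len(items))             # still 100 distinct items
print(ConstantHash.eq_calls)  # many comparisons were needed to build the set
```

The set is still correct, but the number of __eq__ calls grows quickly with the number of elements, which is the inefficiency described above.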
Do note that you technically follow the "hash rule" by ignoring age in your implementation def __hash__(self): return hash(self.name). If Bob is a person of age 20 and Bob is another person of age 30 and they are different people (likely, unless this is some sort of keeps-track-of-people-over-time-as-they-age program), then they will hash to the same value and have to be compared with __eq__. This is perfectly fine, but I would implement it like so:
    def __hash__(self):
        return hash((self.name, self.age))
Do note that your way is still correct. It would however have been a coding error to use hash((self.name, self.age)) in a world where Person("Bob", age=20) and Person("Bob", age=30) were actually the same person, because the hash function would be saying they're different while the equals function would say they're equal (and the equals function would never even be consulted).
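A small sketch of that coding error, using a hypothetical InconsistentPerson class written only for this illustration. Here __eq__ compares names only, while __hash__ also mixes in the age, so the two methods disagree:

```python
class InconsistentPerson:
    def __init__(self, name, age):
        self.name = name
        self.age = age

    def __eq__(self, other):
        # Equality ignores age...
        if not isinstance(other, InconsistentPerson):
            return NotImplemented
        return self.name == other.name

    def __hash__(self):
        # ...but the hash does not (deliberate bug): equal objects can
        # now hash to different values, which breaks the "hash rule".
        return hash((self.name, self.age))


a = InconsistentPerson("Bob", 20)
b = InconsistentPerson("Bob", 30)

print(a == b)              # True: __eq__ says they are the same person
print(hash(a) == hash(b))  # almost certainly False
print(len({a, b}))         # so the set almost certainly keeps both
```

Because the set checks hashes before it ever calls __eq__, the two Bobs are treated as different elements even though they compare equal.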
You also need the __eq__() method.