Why does dict have worst case O(n) for so many operations?
How exactly is dict implemented that it has a linear time lookup for collisions? I would assume that it is implemented as a hashtable backed by a list. I would presume that a better implementation would be O(log(n)) for various operations, using a tree to back the table instead. Is there some magic happening behind the scenes to kee开发者_如何学JAVAp the constant time lookups alive for as long as possible?
My source for this, by the way, is this:
http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=python+complexity
Dict is O(1) for most operations, except for operations that touch all elements, such as iteration and copy (in which case, it's obviously O(n)).
See: http://wiki.python.org/moin/TimeComplexity
It has O(n) worst case, because you can always contrive a pathological example where all the keys have the same hash value.
The point of choosing one implementation over another isn't necessarily about the upper-bound, but rather the expected amortized performance. While different algorithms can have degenerate cases it's usually "better in practice" than using an approach with a provable lower upper-bound. In some cases, however, structures must be designed to guard against pathologically bad input.
Also, some languages/libraries -- not sure about Python -- actually change the underlying implementation, such as when the number of items exceeds a low n. This affects the amortized performance (in certain cases), but not necessarily the big O.
And in conclusion: "It depends".
Happy coding.
Consider even the best hash function in the galaxy. There's still a chance that you may walk up one day with a list of values whose best hash function value happens to be all the same. If you put those in a dict, the system has no choice but to perform linear searches.
Using a balanced tree would keep the worst-case time down at O(log n), but the maintenance costs are pretty high. Usually, hash tables perform pretty well.
I would presume that a better implementation would be O(log(n)) for various operations, using a tree to back the table instead.
Trees and hash tables have very different requirements and performance characteristics.
- Trees require an ordered type.
- Trees require performing order comparisons to find the object. For some objects, like strings, this prevents some significant optimizations: you always need to perform a string comparison, which is nontrivially expensive. This makes the constant factor of O(log n) fairly high.
- Hash tables require a hashable type, and that you can test for equality, but they don't require an ordered type.
- Testing for equality can be optimized significantly. If two strings are interned, you can test if they're equal in O(1) by comparing their pointer, rather than O(n) by comparing the whole string. This is a massive optimization: in every
foo.bar
lookup that gets translated tofoo.__dict__["bar"]
,"bar"
is an interned string. - Hash tables are O(n) in the worst case, but examine what leads to that worst case: a very bad hash table implementation (eg. you only have one bucket), or a broken hash function that always returns the same value. When you have a proper hash function and a proper bucketing algorithm, lookups are very cheap--very often approaching constant time.
Trees do have significant advantages:
- They tend to have lower memory requirements, since they don't have to preallocate buckets. The smallest tree might be 12 bytes (node pointer and two child pointers), where a hash table tends to be 128 bytes or more--sys.getsizeof({}) on my system is 136.
- They allow ordered traversal; it's extremely useful to be able to iterate over [a,b) in an ordered set, which hash tables don't allow.
I do consider it a flaw that Python doesn't have a standard binary tree container, but for the performance characteristics needed by the Python core, like __dict__
lookups, a hash table does make more sense.
Reliable sources of information about the hash functions and collision-resolution strategy that are actually used include the comments in the source file dictobject.c and the whole of the file dictnotes.txt
精彩评论