Why does dict have worst case O(n) for so many operations?

2023-02-07 15:44 问答作者：

How exactly is dict implemented that it has a linear time lookup for collisions? I would assume that it is implemented as a hashtable backed by a list. I would presume that a better implementation would be O(log(n)) for various operations, using a tree to back the table instead. Is there some magic happening behind the scenes to kee开发者_如何学JAVAp the constant time lookups alive for as long as possible?

My source for this, by the way, is this:

http://www.google.com/search?sourceid=chrome&ie=UTF-8&q=python+complexity

Dict is O(1) for most operations, except for operations that touch all elements, such as iteration and copy (in which case, it's obviously O(n)).

See: http://wiki.python.org/moin/TimeComplexity

It has O(n) worst case, because you can always contrive a pathological example where all the keys have the same hash value.

The point of choosing one implementation over another isn't necessarily about the upper-bound, but rather the expected amortized performance. While different algorithms can have degenerate cases it's usually "better in practice" than using an approach with a provable lower upper-bound. In some cases, however, structures must be designed to guard against pathologically bad input.

Also, some languages/libraries -- not sure about Python -- actually change the underlying implementation, such as when the number of items exceeds a low n. This affects the amortized performance (in certain cases), but not necessarily the big O.

And in conclusion: "It depends".

Happy coding.

Consider even the best hash function in the galaxy. There's still a chance that you may walk up one day with a list of values whose best hash function value happens to be all the same. If you put those in a dict, the system has no choice but to perform linear searches.

Using a balanced tree would keep the worst-case time down at O(log n), but the maintenance costs are pretty high. Usually, hash tables perform pretty well.

I would presume that a better implementation would be O(log(n)) for various operations, using a tree to back the table instead.

Trees and hash tables have very different requirements and performance characteristics.

Trees require an ordered type.
Trees require performing order comparisons to find the object. For some objects, like strings, this prevents some significant optimizations: you always need to perform a string comparison, which is nontrivially expensive. This makes the constant factor of O(log n) fairly high.
Hash tables require a hashable type, and that you can test for equality, but they don't require an ordered type.
Testing for equality can be optimized significantly. If two strings are interned, you can test if they're equal in O(1) by comparing their pointer, rather than O(n) by comparing the whole string. This is a massive optimization: in every foo.bar lookup that gets translated to foo.__dict__["bar"], "bar" is an interned string.
Hash tables are O(n) in the worst case, but examine what leads to that worst case: a very bad hash table implementation (eg. you only have one bucket), or a broken hash function that always returns the same value. When you have a proper hash function and a proper bucketing algorithm, lookups are very cheap--very often approaching constant time.

Trees do have significant advantages:

They tend to have lower memory requirements, since they don't have to preallocate buckets. The smallest tree might be 12 bytes (node pointer and two child pointers), where a hash table tends to be 128 bytes or more--sys.getsizeof({}) on my system is 136.
They allow ordered traversal; it's extremely useful to be able to iterate over [a,b) in an ordered set, which hash tables don't allow.

I do consider it a flaw that Python doesn't have a standard binary tree container, but for the performance characteristics needed by the Python core, like __dict__ lookups, a hash table does make more sense.

Reliable sources of information about the hash functions and collision-resolution strategy that are actually used include the comments in the source file dictobject.c and the whole of the file dictnotes.txt

继续阅读：python

Why does dict have worst case O(n) for so many operations?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？