If consistent hash is efficient,why don't people use it everywhere?

2023-03-16 07:52 问答作者：

I was asked some shortcommings of consistent hash. But I think it just costs a little more than a traditional hash%N hash. As the title mentioned, if consistent hash is very good, why not we just use it?

Do you know more? Who can tell m开发者_运维问答e some?

Implementing consistent hashing is not trivial and in many cases you have a hash table that rarely or never needs remapping or which can remap rather fast.

The only substantial shortcoming of consistent hashing I'm aware of is that implementing it is more complicated than simple hashing. More code means more places to introduce a bug, but there are freely available options out there now.

Technically, consistent hashing consumes a bit more CPU; consulting a sorted list to determine which server to map an object to is an O(log n) operation, where n is the number of servers X the number of slots per server, while simple hashing is O(1).

In practice, though, O(log n) is so fast it doesn't matter. (E.g., 8 servers X 1024 slots per server = 8192 items, log2(8192) = 13 comparisons at most in the worst case.) The original authors tested it and found that computing the cache server using consistent hashing took only 20 microseconds in their setup. Likewise, consistent hashing consumes space to store the sorted list of server slots, while simple hashing takes no space, but the amount required is minuscule, on the order of Kb.

Why is it not better known? If I had to guess, I would say it's only because it can take time for academic ideas to propagate out into industry. (The original paper was written in 1997.)

I assume you're talking about hash tables specifically, since you mention mod N. Please correct me if I'm wrong in that assumption, as hashes are used for all sorts of different things.

The reason is that consistent hashing doesn't really solve a problem that hash tables pressingly need to solve. On a rehash, a hash table probably needs to reassign a very large fraction of its elements no matter what, possibly a majority of them. This is because we're probably rehashing to increase the size of our table, which is usually done quadratically; it's very typical, for instance, to double the amount of nodes, once the table starts to get too full.

So in consistent hashing terms, we're not just adding a node; we're doubling the amount of nodes. That means, one way or another, best case, we're moving half of the elements. Sure, a consistent hashing technique could cut down on the moves, and try to approach this ideal, but the best case improvement is only a constant factor of 2x, which doesn't change our overall complexity.

Approaching from the other end, hash tables are all about cache performance, in most applications. All interest in making them go fast is on computing stuff as quickly as possible, touching as little memory as possible. Adding consistent hashing is probably going to be more than a 2x slowdown, no matter how you look at this; ultimately, consistent hashing is going to be worse.

Finally, this entire issue is sort of unimportant from another angle. We want rehashing to be fast, but it's much more important that we don't rehash at all. In any normal practical scenario, when a programmer sees he's having a problem due to rehashing, the correct answer is nearly always to find a way to avoid (or at least limit) the rehashing, by choosing an appropriate size to begin with. Given that this is the typical scenario, maintaining a fairly substantial side-structure for something that shouldn't even be happening is obviously not a win, and again, makes us overall slower.

Nearly all of the optimization effort on hash tables is either in how to calculate the hash faster, or how to perform collision resolution faster. These are things that happen on a much smaller time scale than we're talking about for consistent hashing, which is usually used where we're talking about time scales measured in microseconds or even milliseconds because we have to do I/O operations.

The reason is because Consistent Hashing tends to cause more work on the Read side for range scan queries.

For example, if you want to search for entries that are sorted by a particular column then you'd need to send the query to EVERY node because consistent hashing will place even "adjacent" items in separate nodes.

It's often preferred to instead use a partitioning that is going to match the usage patterns. Better yet replicate the same data in a host of different partitions/formats

继续阅读：algorithm consistent-hashing data-structures hash language-agnostic

If consistent hash is efficient,why don't people use it everywhere?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？