
Hash table optimized for full iteration + key replacement

I have a hash table where the vast majority of run-time accesses follow one of two patterns:

  • Iterate through all key/value pairs. (The speed of this operation is critical.)
  • Modify keys, i.e. remove a key/value pair and add another with the same value but a different key, detecting duplicate keys and combining values if necessary. This is done in a loop, affecting many thousands of keys, with no other operations intervening.

I would also like it to consume as little memory as possible.

Other standard operations must be available, though they are used less frequently, e.g.

  • Insert a new key/value pair
  • Given a key, look up the corresponding value
  • Change the value associated with an existing key

Of course all "standard" hash table implementations, including standard libraries of most high-level-languages, have all of these capabilities. What I am looking for is an implementation that is optimized for the operations in the first list.
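For the critical full-iteration case, what matters most is that the table's storage is one contiguous array, so that iterating is a single linear scan with no pointer chasing. A minimal sketch of that layout (illustrative only: fixed capacity, no deletion or resizing, and `FlatTable` and its method names are made up here):

```python
# Illustrative sketch: a flat, open-addressed table whose full iteration is
# one linear scan over a single contiguous array. No resizing, no deletion.

EMPTY = object()  # sentinel marking an unused slot

class FlatTable:
    def __init__(self, capacity=8):
        self.slots = [EMPTY] * capacity  # each slot: EMPTY or a (key, value) pair

    def _find(self, key):
        # Linear probing: walk forward from the home slot until we hit
        # either the key itself or an empty slot.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not EMPTY and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def insert(self, key, value):
        self.slots[self._find(key)] = (key, value)

    def lookup(self, key):
        slot = self.slots[self._find(key)]
        return slot[1] if slot is not EMPTY else None

    def items(self):
        # The critical operation: one cache-friendly pass over contiguous storage.
        for slot in self.slots:
            if slot is not EMPTY:
                yield slot
```

In a real implementation the slots would be a packed array of the 16-byte keys plus values rather than Python tuples, but the access pattern during iteration is the same.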

Issues with common implementations:

  • Most hash table implementations use separate chaining (i.e. a linked list for each bucket.) This works but I am hoping for something that occupies less memory with better locality of reference. Note: my keys are small (13 bytes each, padded to 16 bytes.)
  • Most open addressing schemes have a major disadvantage for my application: Keys are removed and replaced in large groups. That leaves deletion markers that increase the load factor, requiring the table to be re-built frequently.
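The tombstone problem in the second bullet can be made concrete. The sketch below (hypothetical names; plain linear probing, no resizing) renames a batch of keys the way the question describes: each delete must leave a DELETED marker so later probe chains stay intact, and the markers keep counting toward the load factor until the table is rebuilt.

```python
# Hypothetical sketch of the tombstone problem: deleting from a linear-probe
# table leaves a DELETED marker so probe chains stay intact, and those markers
# still lengthen probes and consume load until the table is rebuilt.

EMPTY, DELETED = object(), object()

class ProbeTable:
    def __init__(self, capacity=16):
        self.slots = [EMPTY] * capacity

    def _probe(self, key):
        # Returns the key's slot if present, else the first tombstone seen
        # (reusable on insert), else the first truly empty slot.
        i = hash(key) % len(self.slots)
        first_dead = None
        while self.slots[i] is not EMPTY:
            if self.slots[i] is DELETED:
                if first_dead is None:
                    first_dead = i
            elif self.slots[i][0] == key:
                return i
            i = (i + 1) % len(self.slots)
        return first_dead if first_dead is not None else i

    def insert(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def delete(self, key):
        i = self._probe(key)
        if self.slots[i] is not EMPTY and self.slots[i] is not DELETED:
            self.slots[i] = DELETED  # tombstone, NOT EMPTY: probes must continue past it

    def live(self):
        return sum(s is not EMPTY and s is not DELETED for s in self.slots)

    def occupied(self):
        # Live entries *and* tombstones both consume probe-chain length.
        return sum(s is not EMPTY for s in self.slots)
```

For example, inserting integer keys 0–5 into a 16-slot table and then renaming each to 100–105 in a loop leaves 6 live entries but 12 occupied slots, because each new key's probe finds an empty slot before it reaches a reusable tombstone.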

Schemes that work, but are less than ideal:

  • Separate chaining with an array (instead of a linked list) per bucket:

    Poor locality of reference, resulting from memory fragmentation as small arrays are reallocated many times

  • Linear probing/quadratic hashing/double hashing (with or without Brent's Variation):

    Table quickly fills up with deletion markers

  • Cuckoo hashing

    Only works for <50% load factor, and I want a high LF to save memory and speed up iteration.

Is there a specialized hashing scheme that would work well for this case?


Note: I have a good hash function that works well with both power-of-2 and prime table sizes, and can be used for double hashing, so this shouldn't be an issue.


Would extendible hashing help? Iterating through the keys by walking the 'directory' should be fast. I'm not sure whether the "modify key for value" operation is any better under this scheme or not.


Based on how you're accessing the data, does it really make sense to use a hash table at all?

Since your main use cases involve iteration, a sorted list or a B-tree might be a better data structure.

It doesn't seem like you really need the constant-time random access a hash table is built for.


You can do much better than a 50% load factor with cuckoo hashing.

Two hash functions with four items per bucket will get you over 90% occupancy with little effort. See this paper:

http://www.ru.is/faculty/ulfar/CuckooHash.pdf

I'm building a pre-computed dictionary using a cuckoo hash and getting a load factor of better than 99% with two hash functions and seven items per bucket.
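For reference, the bucketized scheme described above looks roughly like this. This is a hedged sketch, not the paper's code: `BucketCuckoo`, the second-hash mixing constant, and `MAX_KICKS` are all made up here. Each key has two candidate buckets of four slots, so a lookup touches at most eight slots, and insertion evicts a random victim when both buckets are full.

```python
# Hypothetical sketch of bucketized cuckoo hashing: two hash functions, each
# naming a bucket of up to 4 slots. Multi-slot buckets are what push the
# achievable load factor above 90% (vs. ~50% for one-item-per-slot cuckoo).
import random

SLOTS_PER_BUCKET = 4
MAX_KICKS = 64  # give up and signal "rebuild/grow needed" after this many evictions

class BucketCuckoo:
    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]  # each holds <= 4 (k, v) pairs

    def _bucket_ids(self, key):
        n = len(self.buckets)
        h1 = hash(key) % n
        h2 = hash((key, 0x9E3779B9)) % n  # second, roughly independent hash
        return h1, h2

    def get(self, key):
        # A lookup touches at most two buckets, i.e. at most 8 slots.
        for b in self._bucket_ids(key):
            for k, v in self.buckets[b]:
                if k == key:
                    return v
        return None

    def insert(self, key, value):
        for _ in range(MAX_KICKS):
            b1, b2 = self._bucket_ids(key)
            for b in (b1, b2):
                for i, (k, _) in enumerate(self.buckets[b]):
                    if k == key:  # duplicate key: overwrite in place
                        self.buckets[b][i] = (key, value)
                        return True
            for b in (b1, b2):
                if len(self.buckets[b]) < SLOTS_PER_BUCKET:
                    self.buckets[b].append((key, value))
                    return True
            # Both buckets full: evict a random victim and re-insert it.
            b = random.choice((b1, b2))
            i = random.randrange(SLOTS_PER_BUCKET)
            (key, value), self.buckets[b][i] = self.buckets[b][i], (key, value)
        return False  # eviction chain too long: caller should rebuild or grow
```

The four-slot buckets are what lift the load factor: most keys find room in one of their eight candidate slots before any eviction chain is needed, and iteration over `self.buckets` is still a scan of one array of small fixed-size groups.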
