Hash Table lookup - with perfect hash, in C
I have a C-language app where I need to do table lookups.
The entries are strings, All are known at the start of runtime. The table is initi开发者_如何学JAVAalized once, and then looked up many times. The table can change, but it's basically as if the app starts over. I think this means I can use a perfect-hash? It's ok to consume some time for the hashtable initialization, as it happens just once.
There will be between 3 and 100,000 entries, each one unique, and I estimate that 80% of cases will have fewer than 100 entries. A simple naive lookup is "fast enough" in those cases. (== no one is complaining)
However in the cases where there are 10k+ entries, the lookup speed of a naive approach is unacceptable. What's a good approach for delivering good hashtable-based lookup performance for strings in C? Assume I do not have a 3rd-party commercial library like Boost/etc. What hash algorithm should I use? how do I decide?
Generating a perfect hash is not a simple problem. There's libraries devoted to the task. In this case the most popular one is probably CMPH. I haven't used it though so can't help beyond that. gperf is another tool, but it requires the strings to be known at compile time (you could work around it by compiling a .so and loading, but kind of overkill).
But frankly, I'd at least try to go with a binary search first. Simply sort the array using qsort
, then search with bsearch
(or roll your own). Both those are part of stdlib.h
since C89.
If a naive (I assume you mean linear) approach is ok for 100 entries (so 50 comparisons are done on average) then a binary search will be more than sufficient for 100,000 entries (it takes at most 17 comparisons).
So I wouldn't bother with hashes at all but just resort to sorting your string table on startup (e.g. using qsort
) and later using a binary search (e.g. using bsearch
) to look up entries.
If the (maximal) table size is known, a plain hashtable with chaining is very easy to implement. Size overhead is only two ints per item. With a reasonable hash function only 1.5 probes per lookup are needed on average, this for a 100% loaded table.
Constructing a perfect hash is only feasible if your data does not change. Once it changes, you'll have to recompute and rehash, which is way more expensive than doing a few extra compares.
精彩评论