
Choosing a data structure for very large data

I have x (millions of) positive integers, whose values can be as large as the type allows (+2,147,483,647). Assuming they are unique, what is the best way to store them for a lookup-intensive program?

So far I have thought of using a binary AVL tree or a hash table, where the integer is the key to the mapped data (a name). However, I am not sure whether I can handle such large keys, in such large quantities, with a hash table (wouldn't that create a load factor above 0.8, in addition to being prone to collisions?).

Could I get some advice on which data structure might be suitable for my situation?


The choice of structure depends heavily on how much memory you have available. I'm assuming, based on the description, that you need lookup only, not iteration over all the values, find-nearest, or other similar operations.

Best is probably a bucketed hash table. By placing hash collisions into buckets and keeping separate arrays in the bucket for keys and values, you can both reduce the size of the table proper and take advantage of CPU cache speedup when searching a bucket. Linear search within a bucket may even end up faster than binary search!
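
As a rough illustration of that layout, here is a minimal C++ sketch; the class name, the hash mix, and the fixed bucket count are my own choices for the example, not a tuned implementation:

    // Minimal sketch of a bucketed hash table for int -> string lookup.
    #include <cstdint>
    #include <string>
    #include <vector>

    class BucketedHashTable {
        struct Bucket {
            std::vector<int32_t> keys;       // packed keys: cache-friendly linear scan
            std::vector<std::string> values; // values kept in a parallel array
        };
        std::vector<Bucket> buckets_;

        size_t index(int32_t key) const {
            // Fibonacci-style multiplicative mix; any decent hash works here.
            return (static_cast<uint32_t>(key) * 2654435761u) % buckets_.size();
        }

    public:
        explicit BucketedHashTable(size_t bucket_count) : buckets_(bucket_count) {}

        void insert(int32_t key, std::string value) {
            Bucket& b = buckets_[index(key)];
            b.keys.push_back(key);
            b.values.push_back(std::move(value));
        }

        // Linear search within the bucket: for short buckets this scans a
        // contiguous array of ints, which the CPU cache handles very well.
        const std::string* find(int32_t key) const {
            const Bucket& b = buckets_[index(key)];
            for (size_t i = 0; i < b.keys.size(); ++i)
                if (b.keys[i] == key) return &b.values[i];
            return nullptr;
        }
    };

Keeping the keys in their own array is what makes the linear scan competitive with binary search: the comparisons touch one dense int array rather than chasing pointers.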

AVL trees are nice for data sets that are read-intensive but not read-only and that require ordered enumeration, find-nearest, and similar operations, but they're an annoying amount of work to implement correctly. You may get better performance from a B-tree because of its CPU cache behavior, though, especially a cache-oblivious B-tree algorithm.


Have you looked into B-trees? The search depth runs between log_m(n) and log_(m/2)(n), so if you choose m to be around 8-10 you should be able to keep the depth below 10. For example, with m = 10 and ten million keys, the best case is log_10(10,000,000) = 7 node visits.


A bit vector, with the bit at each index set if that number is present. You can extend it to count the number of occurrences of each number. There is a nice column about bit vectors in Bentley's Programming Pearls.
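
A minimal sketch of the idea, assuming only membership needs to be tested (covering the full positive int range takes 2^31 bits = 256 MB, so this trades a fixed memory cost for O(1) lookups):

    #include <cstdint>
    #include <vector>

    class BitVector {
        std::vector<uint64_t> words_;
    public:
        explicit BitVector(uint64_t max_value) : words_(max_value / 64 + 1, 0) {}

        void set(uint64_t n)            { words_[n / 64] |= (uint64_t{1} << (n % 64)); }
        bool contains(uint64_t n) const { return (words_[n / 64] >> (n % 64)) & 1; }
    };

    int main() {
        BitVector present(2147483647);  // full positive int range: 256 MB
        present.set(123456789);
        return present.contains(123456789) ? 0 : 1;
    }

Note that a plain bit vector only answers "is this number present?"; it cannot store the mapped names from the question.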


If memory isn't an issue, a hash map is probably your best bet. Hash map lookups are O(1) on average, meaning that as you scale up the number of items to be looked up, the time it takes to find a value stays roughly the same.

A map where the key is the int, and the value is the name.
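
A brief sketch using std::unordered_map, which is one plausible reading of "map" here and gives the average-case O(1) lookups described above (the reserve size is just an example figure):

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <unordered_map>

    int main() {
        std::unordered_map<int32_t, std::string> names;
        names.reserve(2'000'000);            // pre-size to avoid rehashing
        names.emplace(2147483647, "alice");  // keys near INT_MAX are no problem

        auto it = names.find(2147483647);
        if (it != names.end())
            std::cout << it->second << '\n'; // prints "alice"
    }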


Do try hash tables first. There are some variants that can tolerate being very dense without significant slowdown (like Brent's variation).

If you only need to store the 32-bit integers, and not any associated record, use a set rather than a map, like hash_set in most C++ libraries. It would use only 4-byte records, plus some constant overhead and a little slack to avoid being 100% full. In the worst case, to handle 'millions' of numbers you'd need a few tens of megabytes. Big, but nothing unmanageable.
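
A sketch of the set-only variant; std::unordered_set is the standardized successor to the pre-standard hash_set named above (the load factor and reserve size are illustrative):

    #include <cstdint>
    #include <unordered_set>

    int main() {
        std::unordered_set<int32_t> present;
        present.max_load_factor(0.8f);   // tolerate a fairly dense table
        present.reserve(2'000'000);

        present.insert(2147483647);
        bool found = present.count(2147483647) > 0;  // membership test
        return found ? 0 : 1;
    }

One caveat: std::unordered_set is node-based, so its per-element overhead is noticeably more than the 4 bytes quoted above; an open-addressing table (like the Brent's-variation kind) comes much closer to that figure.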

If you need it to be much tighter, just store the numbers sorted in a plain array and use binary search to fetch them. It will be O(log n) instead of O(1), but for 'millions' of records that's still only twenty-odd steps to reach any one of them. In C you have bsearch(), which is about as fast as it gets.
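
The same idea in C++, where std::lower_bound plays the role of C's bsearch() over a sorted std::vector:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    bool contains(const std::vector<int32_t>& sorted_keys, int32_t key) {
        auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), key);
        return it != sorted_keys.end() && *it == key;
    }

    int main() {
        std::vector<int32_t> keys = {1000000007, 5, 42};  // example data
        std::sort(keys.begin(), keys.end());              // one-time O(n log n) sort
        return contains(keys, 42) ? 0 : 1;                // ~log2(n) comparisons
    }

At 4 bytes per key with zero per-element overhead, this is the most compact of the options discussed here.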

Edit: I just saw that your question mentions 'mapped data (a name)'. Are those names unique? Do they also have to be in memory? If so, they will definitely dominate the memory requirements. Even so, if the names are typical English words, most will be 10 bytes or less, keeping the total size in the tens of megabytes, maybe up to a hundred megs; still very manageable.
