c++ container for checking whether ordered data is in a collection

2022-12-16 20:52 问答作者：

I have data that is a set of ordered ints

[0] = 12345 [1] = 12346 [2] = 12454 etc.

I need to开发者_如何学JAVA check whether a value is in the collection in C++, what container will have the lowest complexity upon retrieval? In this case, the data does not grow after initiailization. In C# I would use a dictionary, in c++, I could either use a hash_map or set. If the data were unordered, I would use boost's unordered collections. However, do I have better options since the data is ordered? Thanks

EDIT: The size of the collection is a couple of hundred items

Just to detail a bit over what have already been said.

Sorted Containers

The immutability is extremely important here: std::map and std::set are usually implemented in terms of binary trees (red-black trees for my few versions of the STL) because of the requirements on insertion, retrieval and deletion operation (and notably because of the invalidation of iterators requirements).

However, because of immutability, as you suspected there are other candidates, not the least of them being array-like containers. They have here a few advantages:

minimal overhead (in term of memory)
contiguity of memory, and thus cache locality

Several "Random Access Containers" are available here:

Boost.Array
std::vector
std::deque

So the only thing you actually need to do can be broken done in 2 steps:

push all your values in the container of your choice, then (after all have been inserted) use std::sort on it.
search for the value using std::binary_search, which has O(log(n)) complexity

Because of cache locality, the search will in fact be faster even though the asymptotic behavior is similar.

If you don't want to reinvent the wheel, you can also check Alexandrescu's [AssocVector][1]. Alexandrescu basically ported the std::set and std::map interfaces over a std::vector:

because it's faster for small datasets
because it can be faster for frozen datasets

Unsorted Containers

Actually, if you really don't care about order and your collection is kind of big, then a unordered_set will be faster, especially because integers are so trivial to hash size_t hash_method(int i) { return i; }.

This could work very well... unless you're faced with a collection that somehow causes a lot of collisions, because then unsorted containers will search over the "collisions" list of a given hash in linear time.

Conclusion

Just try the sorted std::vector approach and the boost::unordered_set approach with a "real" dataset (and all optimizations on) and pick whichever gives you the best result.

Unfortunately we can't really help more there, because it heavily depends on the size of the dataset and the repartition of its elements

If the data is in an ordered random-access container (e.g. std::vector, std::deque, or a plain array), then std::binary_search will find whether a value exists in logarithmic time. If you need to find where it is, use std::lower_bound (also logarithmic).

Use a sorted std::vector, and use a std::binary_search to search it.

Your other options would be a hash_map (not in the C++ standard yet but there are other options, e.g. SGI's hash_map and boost::unordered_map), or an std::map.

If you're never adding to your collection, a sorted vector with binary_search will most likely have better performance than a map.

I'd suggest using a std::vector<int> to store them and a std::binary_search or std::lower_bound to retrieve them.

Both std::unordered_set and std::set add significant memory overhead - and even though the unordered_set provides O(1) lookup, the O(logn) binary search will probably outperform it given that the data is stored contiguously (no pointer following, less chance of a page fault etc.) and you don't need to calculate a hash function.

If you already have an ordered array or std::vector<int> or similar container of the data, you can just use std::binary_search to probe each value. No setup time, but each probe will take O(log n) time, where n is the number of ordered ints you've got.

Alternately, you can use some sort of hash, such as boost::unordered_set<int>. This will require some time to set up, and probably more space, but each probe will take O(1) time on the average. (For small n, this O(1) could be more than the previous O(log n). Of course, for small n, the time is negligible anyway.)

There is no point in looking at anything like std::set or std::map, since those offer no advantage over binary search, given that the list of numbers to match will not change after being initialized.

So, the questions are the approximate value of n, and how many times you intend to probe the table. If you aren't going to check many values to see if they're in the ints provided, then setup time is very important, and std::binary_search on the sorted container is the way to go. If you're going to check a lot of values, it may be worth setting up a hash table. If n is large, the hash table will be faster for probing than binary search, and if there's a lot of probes this is the main cost.

So, if the number of ints to compare is reasonably small, or the number of probe values is small, go with the binary search. If the number of ints is large, and the number of probes is large, use the hash table.

继续阅读：collections

c++ container for checking whether ordered data is in a collection

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？