What are the fastest-performing options for a read-only, unordered collection of unique strings?
Disclaimer: I realize the totally obvious answer to this question is HashSet<string>. It is absurdly fast, it is unordered, and its values are unique.
But I'm just wondering: because HashSet<T> is a mutable class, it has Add, Remove, etc., and so I am not sure whether the underlying data structure that makes those operations possible sacrifices some read performance -- in particular, I'm concerned with Contains.
Basically, I'm wondering what are the absolute fastest-performing data structures in existence that can supply a Contains method for objects of type string, within or outside of the .NET Framework itself.
I'm interested in all kinds of answers, regardless of their limitations. For example, I can imagine that some structure might be restricted to strings of a certain length, or might be optimized depending on the problem domain (e.g., the range of possible input values). If it exists, I want to hear about it.
One last thing: I'm not restricting this to read-only data structures. Obviously any read-write data structure could be embedded inside a read-only wrapper. The only reason I even mentioned the word "read-only" is that I don't have any requirement for a data structure to allow adding, removing, etc. If it has those functions, though, I won't complain.
UPDATE:
Moron's answer is an excellent example of the sort of thing I'm looking for. A Trie* definitely seems like a great possibility for the following reason: HashSet<T>.Contains depends on the GetHashCode function of some IEqualityComparer<string>, which, as far as I can tell, is O(n)** by default in .NET. In other words, every character in a string must be examined for HashSet<string>.Contains to return either true or false. For a Trie, only a return value of true would take O(n) to determine; a return value of false could potentially be determined much more quickly.
This is of course hypothetical. So far I have not written or come across a Trie implementation in .NET that can beat a HashSet<string> at Contains (though an implementation I wrote myself came quite close for the alphabet 'a' to 'z'). I'm just saying, it seems possible.
*That link, by the way, also led me to another intriguing/similar possibility: DAWG.
**Here "n" refers to the length of the string.

Tries are good for doing a Contains, especially for strings over a finite alphabet. Given a string s, the time complexity of Contains on a trie is O(|s|) (where |s| is the length of s), which is optimal.
Your wondering aside, HashSet is the fastest collection for this. There's no faster general-purpose method, because the underlying hash table allows O(1) read/write access.
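For completeness, a minimal usage sketch:

    var words = new HashSet<string>(new[] { "alpha", "beta", "gamma" });
    bool found = words.Contains("beta"); // average O(1), independent of set size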
A hashing container approaches O(1) for insert and retrieval, so from an order-of-magnitude perspective you can't get much better than that.
Within a hash container, your performance over time is going to be related to two things: how good a distribution your hash function provides, and how fast it can be computed. These are not equivalent -- a poorly distributed function (where you end up with a lot of collisions) is going to hurt performance much more than a slower but better-distributed hash function.
Thus, if you could come up with a perfect hash function that was also extremely fast to compute, that would be an improvement. It's possible that constraining the data in specific ways may make that easier. But odds are, whatever you come up with will not be as good as what already exists.
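As an illustration of that trade-off, here is a sketch of plugging a custom hash into HashSet<string> via IEqualityComparer<string>. FNV-1a is chosen here only as an example of a cheap, reasonably well-distributed function; whether it actually beats the default String.GetHashCode is something you would have to measure:

    using System.Collections.Generic;

    // FNV-1a comparer for HashSet<string>; purely illustrative -- measure before adopting.
    sealed class Fnv1aComparer : IEqualityComparer<string>
    {
        public bool Equals(string x, string y) => string.Equals(x, y);

        public int GetHashCode(string s)
        {
            unchecked
            {
                uint hash = 2166136261;   // FNV-1a 32-bit offset basis
                foreach (char c in s)
                {
                    hash ^= c;
                    hash *= 16777619;     // FNV-1a 32-bit prime
                }
                return (int)hash;
            }
        }
    }

    // Usage: var set = new HashSet<string>(new Fnv1aComparer());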
Hash tables are amortized O(1) for lookup. You can't do better than that; O(1/n) algorithms are perpetual motion devices. There are only two things that make them behave poorly:
- A poor hashing function that causes many collisions. The worst case degenerates lookup to O(n). You'll have no trouble with strings; they hash really well, and String.GetHashCode() does a terrific job.
- A collection that is heavily mutated, with many items removed that were added early. This can leave many empty hash buckets that need to be skipped by iterators. Degradation to O(n) is technically possible, although quite rare. A simple workaround is to rebuild the collection by reassigning the reference (like table = new HashSet<string>(table); see the snippet below).
These kinds of problems are rare. You don't design for them up front (other than choosing the hash function); you start considering them only when you detect performance problems in the program.
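A minimal sketch of that rebuild workaround from the second bullet (the variable name table is just an example):

    var table = new HashSet<string>();
    // ... many Adds and Removes over the lifetime of the program ...
    table = new HashSet<string>(table); // copy-construct: re-hashes entries into fresh buckets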