Fast String Search like startsWith() not equals()

2023-01-09 10:44 Q&A author:

I have an ordered list (a dictionary of 100K words) and many words to search for in this list frequently, so performance is an issue. I know that HashSet.contains(theWord) and Collections.binarySearch(sortedList, theWord) are very fast, but I am actually not looking for the whole word.

What I want is, say, to search for "se" and get all the words that start with "se". Is there a ready-to-use solution in Java or in any library?

A better example: a fast solution for the following operation on a sorted list:

List.subList(String beginIndex, String endIndex) // returns the interval

myWordList.subList("ab", "bc");

Note: Here is a very similar question, but its accepted answer is not satisfying: Overriding HashSet's Contains Method

What you're looking for here is a data structure commonly called a 'trie':

http://en.wikipedia.org/wiki/Trie

It stores strings in a tree indexed by prefix, where the first level of the tree contains the first character of the string, the second level the second character, etc. The result is that it allows you to extract subsets of very large sets of strings by prefix extremely quickly.
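To make the idea concrete, here is a minimal sketch of such a trie in Java. The class name PrefixTrie and its methods insert and wordsWithPrefix are made up for this illustration and do not come from any library; it also assumes the words are plain Java strings handled one char at a time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Minimal illustrative trie; class and method names are hypothetical.
class PrefixTrie {
    private static final class Node {
        // TreeMap keeps children sorted, so results come back in char order.
        final Map<Character, Node> children = new TreeMap<>();
        boolean isWord;
    }

    private final Node root = new Node();

    // Adds a word by walking (and creating where needed) one node per character.
    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isWord = true;
    }

    // Returns every stored word that starts with the given prefix.
    List<String> wordsWithPrefix(String prefix) {
        List<String> result = new ArrayList<>();
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return result; // no stored word has this prefix
            }
        }
        collect(node, new StringBuilder(prefix), result);
        return result;
    }

    // Depth-first walk below the prefix node, rebuilding each word on the way.
    private void collect(Node node, StringBuilder current, List<String> out) {
        if (node.isWord) {
            out.add(current.toString());
        }
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            current.append(e.getKey());
            collect(e.getValue(), current, out);
            current.setLength(current.length() - 1);
        }
    }

    public static void main(String[] args) {
        PrefixTrie trie = new PrefixTrie();
        for (String w : new String[] {"sea", "seal", "send", "set", "sit", "bee"}) {
            trie.insert(w);
        }
        System.out.println(trie.wordsWithPrefix("se")); // prints [sea, seal, send, set]
    }
}
```

Finding the prefix node costs one step per prefix character, regardless of how many words are stored; the remaining work is proportional to the size of the matching subtree.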

The trie structure is very well suited to dictionaries and to finding words with common prefixes. A Trie implementation has been contributed to Google Collections/Guava, although it may not be part of the released library, so check the version you use.

There's really no big need for new structures: the problem can be solved by binary search on your list. In particular, you can modify binary search to return the first matching element (the first element with the specified prefix).
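One way to read "modify binary search to return the first matching element" is a lower-bound search followed by a startsWith check. The sketch below assumes the list is sorted by the natural String ordering (compareTo, case-sensitive); the names PrefixLowerBound, lowerBound and firstWithPrefix are invented for this example.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; assumes the list is sorted by String.compareTo.
class PrefixLowerBound {
    // Returns the index of the first element that is >= key,
    // or sorted.size() if every element is smaller.
    static int lowerBound(List<String> sorted, String key) {
        int lo = 0;
        int hi = sorted.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).compareTo(key) < 0) {
                lo = mid + 1; // mid is too small, answer lies to the right
            } else {
                hi = mid;     // mid may be the answer, keep it in range
            }
        }
        return lo;
    }

    // Returns the index of the first word starting with prefix, or -1 if none.
    static int firstWithPrefix(List<String> sorted, String prefix) {
        int i = lowerBound(sorted, prefix);
        return (i < sorted.size() && sorted.get(i).startsWith(prefix)) ? i : -1;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("bee", "sea", "seal", "send", "set", "sit");
        int first = firstWithPrefix(words, "se");
        // Matching words are contiguous, so scan forward until the prefix stops matching.
        for (int i = first; i >= 0 && i < words.size() && words.get(i).startsWith("se"); i++) {
            System.out.println(words.get(i)); // prints sea, seal, send, set (one per line)
        }
    }
}
```

This works because every word with the prefix compares greater than or equal to the prefix itself, so the matches form one contiguous block that starts at the lower bound.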

List.subList (String beginIndex, String endIndex) // returns the interval
I may be missing something, but what kind of index has a String type? Can you clarify this part?

Your search result will be a contiguous range of your ordered word list. To get it, you need the indices of the first and the last element of the range.

To get the first index, run a binary search with the original search string ("se"), comparing it to the word at the current position in each iteration. Stop when the word at the current position is greater than or equal to the search string, but the word before it (at current - 1) is smaller.

To get the last index, run another binary search on the search term + "z" ("sez"), but now stop only when the word at the current index is smaller than "sez" and the word after it (at current + 1) is greater.

Finally, return the range marked by the first and last index by whatever means are available in your programming language.

This method is built on two assumptions:

String comparison sees "b" greater than "az"
"z" is the highest char value among the list of words

I have this algorithm implemented in a JavaScript data manipulation library (jOrder.net).
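For Java, the two-search idea above could be sketched roughly as follows. This is not the jOrder code. It leans on Collections.binarySearch to find insertion points, and it swaps the "z" suffix for Character.MAX_VALUE so that a word such as "sezzle" still falls inside the range. The sketch assumes the list is sorted by natural String ordering, contains no duplicates, and that no word contains the character U+FFFF; the names PrefixRange, insertionPoint and withPrefix are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper class; assumes a duplicate-free list sorted by String.compareTo.
class PrefixRange {
    // Index where key is, or would be inserted, in the sorted list.
    static int insertionPoint(List<String> sorted, String key) {
        int i = Collections.binarySearch(sorted, key);
        return i >= 0 ? i : -(i + 1); // negative result encodes the insertion point
    }

    // Returns a view of all words that start with prefix.
    static List<String> withPrefix(List<String> sorted, String prefix) {
        int from = insertionPoint(sorted, prefix);
        // Sentinel that sorts after every word with this prefix (assumption: no U+FFFF in words).
        int to = insertionPoint(sorted, prefix + Character.MAX_VALUE);
        return sorted.subList(from, to);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("bee", "sea", "seal", "send", "set", "sezzle", "sit");
        System.out.println(withPrefix(words, "se")); // prints [sea, seal, send, set, sezzle]
    }
}
```

Both searches are O(log n), and subList returns a view rather than a copy, so the cost does not grow with the number of matches.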
