开发者

Fast String Search like startsWith() not equals()

I have an ordered list (a dictionary - 100K words) and many words to seach on this list frequen开发者_StackOverflow社区tly. So performance is an issue. I know that a HashSet.contains(theWord) or Collections.binarySearch(sortedList, theWord) are very fast. But I am actually not looking for the whole word.

What I want is let's say searching for "se" and getting all the words starts with "se". So is there a ready to use solution in Java or any libraries?

A better example: On a sorted list a quick solution for the following operation

List.subList (String beginIndex, String endIndex) // returns the interval

myWordList.subList(“ab”, “bc”);

Note: Here is a very similar question but accepted answer is not satisfying. Overriding HashSet's Contains Method


What you're looking for here is a data structure commanly called a 'trie':

http://en.wikipedia.org/wiki/Trie

It stores strings in a tree indexed by prefix, where the first level of the tree contains the first character of the string, the second level the second character, etc. The result is that it allows you to extract subsets of very large sets of strings by prefix extremely quickly.


The Trie structure is very well suited for dictionaries and finding words with common prefixes. There is a contribution of a Trie implementation in Google Collections/Guava.


There's really no big need for new structures: problem can be solved by binary search on your list. In particular, you can modify binary search to return first matching element (first element with specified prefix).

List.subList (String beginIndex, String endIndex) // returns the interval
I may be stupid, but what kind of index has string type? Can you clarify this part?


Your search result will be a range from your ordered word list. To get that, you need the index of the first and the last element of the range.

To get the first, run a binary search with the original search string ("se"), comparing it to the current position in each iteration. Stop when the word at the current position is greater than the search string, but the current-1 th word is lower.

To get the last index, run another binary search on the search term+"z" ("sez"), but now stop only when the word at the current index is smaller than "sez" but current+1 is greater.

Finally return the range marked by the first and last index by whatever means that are available in your programming language.

This method is built on two assumptions:

  • String comparison sees "b" greater than "az"
  • "z" is the highest char value among the list of words

I have this algorithm implemented in a JavaScript data manipulation library (jOrder.net).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜