开发者

fast index for "contains string"

I n my application i have upt o millions of short strings (mostly shorter than 32 characters). I want to implement a search box with a attached list that contains only elements that contain the whole string entered in the search box. How can i prebuild a index to find such strings fast? All sorted STL containers check the whole string.

For a entered search string "str" i need to 开发者_如何学编程find all strings containg "str": "main street", "struve", "ustr" etc.


You can build a Permuterm indexes.

For "struve" you would insert into a Radix tree (or a general purpose search tree):

struve$
truve$s
ruve$st
uve$str
ve$stru
e$struv
$struve

To search the infix you would search from the root node for matching prefix strings.


You might start by looking at trie's. Although they are mostly used as prefix trees, the data structure itself may be adapted for faster general search.


If the strings are of arbitrary length, and of arbitrary count, you could try the Aho-Corasick algorithm, which is simple to implement and scales at O(n) of the length of the search text, and performs searches on all strings simultaneously.

Alternatively, if the number of strings you're looking for are small, try the Horspool algorithms, which is exceptionally easy to implement and less than O(n) on average per string.


You say you have millions of short string so I assume you can't store it RAM and keep it in a database. Let's assume you keep your "short strings" in table named my_string (id, string). Create another table, let's name it my_substring (id, substring[unique]), containing every substring of every string in my_string. Also create a joining table for two tables above: my_substring_to_string (id, substring_id, string_id), its contents is obvious I suppose.

Now searching is straightforward and fast: search for your substring in my_substring (remember to create an index on my_substring.substring) and join it with my_string via my_substring_to_string.

Adding and removing of a new short string will require an update in my_substring and my_substring_to_string but these are quite straightforward.

If this solution will produce my_substring table with unacceptably large size, it can be optimized. Instead of keeping every substring try to keep every suffix and search for 'substring%' with ilike. For example if the word is 'blues', you have to store suffixes: 'blues', 'lues', 'ues', 'es', 's' (joined with 'blues'). Then search for 'lu' (ilike 'lu%') will match 'lues'. This way a database will still be ale to use an index created on my_substring.substring column, so searching still will be fast.


I would use SQLite. Maybe using a database in memory if anyway you load everything in RAM and need extream performances.


I'd probably start with an inverted index -- i.e. a list of letters, and attached to each a list of the words containing that letter. If you're only working with letters (especially if you restrict it to English, or at least Western European languages), you can also pretty easily create inverted indices for digraphs (i.e. each pair of letters), trigraphs, and so on -- though much beyond trigraphs may not gain a lot, since by then you've usually reduced the list to the point that you can do normal string searches within the lists pretty easily.

Note that I don't intend "list" to mean "linked list", but just "some sort of sequential data structure", which usually means a vector...

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜