fast index for "contains string"

2022-12-16 13:42 问答作者：

I n my application i have upt o millions of short strings (mostly shorter than 32 characters). I want to implement a search box with a attached list that contains only elements that contain the whole string entered in the search box. How can i prebuild a index to find such strings fast? All sorted STL containers check the whole string.

For a entered search string "str" i need to 开发者_如何学编程find all strings containg "str": "main street", "struve", "ustr" etc.

You can build a Permuterm indexes.

For "struve" you would insert into a Radix tree (or a general purpose search tree):

struve$
truve$s
ruve$st
uve$str
ve$stru
e$struv
$struve

To search the infix you would search from the root node for matching prefix strings.

You might start by looking at trie's. Although they are mostly used as prefix trees, the data structure itself may be adapted for faster general search.

If the strings are of arbitrary length, and of arbitrary count, you could try the Aho-Corasick algorithm, which is simple to implement and scales at O(n) of the length of the search text, and performs searches on all strings simultaneously.

Alternatively, if the number of strings you're looking for are small, try the Horspool algorithms, which is exceptionally easy to implement and less than O(n) on average per string.

You say you have millions of short string so I assume you can't store it RAM and keep it in a database. Let's assume you keep your "short strings" in table named my_string (id, string). Create another table, let's name it my_substring (id, substring[unique]), containing every substring of every string in my_string. Also create a joining table for two tables above: my_substring_to_string (id, substring_id, string_id), its contents is obvious I suppose.

Now searching is straightforward and fast: search for your substring in my_substring (remember to create an index on my_substring.substring) and join it with my_string via my_substring_to_string.

Adding and removing of a new short string will require an update in my_substring and my_substring_to_string but these are quite straightforward.

If this solution will produce my_substring table with unacceptably large size, it can be optimized. Instead of keeping every substring try to keep every suffix and search for 'substring%' with ilike. For example if the word is 'blues', you have to store suffixes: 'blues', 'lues', 'ues', 'es', 's' (joined with 'blues'). Then search for 'lu' (ilike 'lu%') will match 'lues'. This way a database will still be ale to use an index created on my_substring.substring column, so searching still will be fast.

I would use SQLite. Maybe using a database in memory if anyway you load everything in RAM and need extream performances.

I'd probably start with an inverted index -- i.e. a list of letters, and attached to each a list of the words containing that letter. If you're only working with letters (especially if you restrict it to English, or at least Western European languages), you can also pretty easily create inverted indices for digraphs (i.e. each pair of letters), trigraphs, and so on -- though much beyond trigraphs may not gain a lot, since by then you've usually reduced the list to the point that you can do normal string searches within the lists pretty easily.

Note that I don't intend "list" to mean "linked list", but just "some sort of sequential data structure", which usually means a vector...

继续阅读：algorithm stl

fast index for "contains string"

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？