Fuzzy runtime search without using a database/index
I need to filter a stream of text articles by checking every entry for fuzzy matches of predefined strings (I am searching for misspelled product names; sometimes they have a different word order and extra non-letter characters like ":" or ",").
I get excellent results by putting these articles into a Sphinx index and searching it, but unfortunately I receive hundreds of articles every second, and rebuilding the index after every article is too slow (and I understand Sphinx isn't designed for such a task). I need a library that can build an in-memory index of a small (~100 KB) text and perform fuzzy search over it. Does anything like this exist?
This problem is almost identical to Bayesian spam filtering, and tools already written for that can simply be trained to recognize matches according to your criteria.
Added in response to a comment:
So how are you partitioning the stream into bins now? If you already have a corpus of separated articles, just feed that into the classifier. Bayesian classifiers are the way to do fuzzy content matching in context and can classify everything from spam to nucleotides to astronomical spectral categories.
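A minimal sketch of that approach in Python, assuming scikit-learn is installed; the training corpus, labels, and product names below are hypothetical placeholders. Character n-grams are used instead of whole words so that misspellings and shuffled word order still produce overlapping features:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: 1 = article mentions the product, 0 = it does not.
train_texts = ["Acme Wdget, blue: on sale!", "weather report for tuesday"]
train_labels = [1, 0]

# Character n-grams (3-5 chars, word-boundary aware) tolerate misspellings
# and reordered words far better than whole-word features.
classifier = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    MultinomialNB(),
)
classifier.fit(train_texts, train_labels)

# With a real training corpus this classifies each incoming article:
print(classifier.predict(["restock: acme widgt deluxe"]))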
You could use less stochastic methods (e.g. Levenshtein distance), but at some point you have to describe the difference between hits and misses. The beauty of Bayesian methods, especially if you already have a segregated corpus in hand, is that you don't actually need to spell out how you are classifying.
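For the less stochastic route, the Python standard library gets you most of the way: difflib's SequenceMatcher ratio isn't exact Levenshtein distance, but it serves the same purpose. The product list, normalization, and 0.6 threshold below are illustrative assumptions:

import difflib
import re

PRODUCT_NAMES = ["acme widget deluxe"]  # hypothetical target strings

def normalize(text):
    # Lowercase and strip punctuation such as ":" or "," so stray
    # characters don't break matching.
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower())

def is_fuzzy_match(article, threshold=0.6):
    words = normalize(article).split()
    for name in PRODUCT_NAMES:
        n = len(name.split())
        # Slide a window with the same word count as the product name.
        for i in range(max(1, len(words) - n + 1)):
            window = " ".join(words[i:i + n])
            if difflib.SequenceMatcher(None, name, window).ratio() >= threshold:
                return True
    return False

print(is_fuzzy_match("Sale today! Acme Wdget Delux, half price"))  # True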
How about using the SQLite FTS3 extension?
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT);
(You may create any number of columns; all of them will be indexed.)
After that you can insert whatever you like and search it without rebuilding the index, matching either a specific column or the whole row.
(http://www.sqlite.org/fts3.html)
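A quick sketch using Python's built-in sqlite3 module with an in-memory database, assuming your SQLite build includes FTS3 (most do). Note that MATCH does token and prefix matching rather than true fuzzy matching, so heavy misspellings still call for one of the approaches above:

import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database, nothing on disk
conn.execute("CREATE VIRTUAL TABLE articles USING fts3(content TEXT)")
conn.executemany(
    "INSERT INTO articles(content) VALUES (?)",
    [("Acme Widget: deluxe model, on sale",),
     ("unrelated weather report",)],
)
# MATCH requires all given tokens; 'widg*' is a prefix query.
for (content,) in conn.execute(
        "SELECT content FROM articles WHERE content MATCH 'acme widg*'"):
    print(content)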