Fuzzy match of an English sentence with a set of English sentences stored in a database

2023-03-15 05:35 问答作者：

There are about 1000 records in a database table. There is a column named title which is used to store the title of articles. Before inserting a record, I need to check if there is already an article with similar title exists in that table. If so, I will skip.

What's the fastest way to perform this kind of fuzzy matching? Assuming all words in sentences can be found in a English dictionary. If 70% of words in sentence #1 can be found 开发者_如何学Pythonin sentence #2, we consider them a match. Ideally, the algorithm can pre-compute a value for each sentence so that the value can be stored in the database.

For a 1000 records, doing the dumb thing and just iterating over all the records could work (assuming that the strings aren't too long and you aren't getting hit with too many queries). Just pull all of the titles out of your database, and then sort them by their distance to your given string (for example, you could use Levenshtein distance for this metric).

A fancier way to do approximate string matching would be to precompute n-grams of all your strings and store them in your database (some systems support this feature natively). This will definitely scale better performance wise, but it could mean more work:

http://en.wikipedia.org/wiki/N-gram

You can read up on forward / reverse indexing of token - value storage for getting faster search results. I personally prefer reverse indexing which stores a hash map of token(key) to value (here title).

Whenever you write a new article, like a new stackoverflow question, the tokens in the title would be searched to map against all the titles available.

To optimize the result, i.e. get the fuzzy logic for results, you can sort the titles by the max amount of occurrences in tokens being searched for. Eg, if t1,t2 and t3 refer to the tokens 'what' 'is' 'love', and the title 'what this love is for?' would exist in all the tokens mappings, it would be placed at the topmost.

You can play around with this more. I hope this approach is more simple and appealing.

继续阅读：algorithm fuzzy-search

Fuzzy match of an English sentence with a set of English sentences stored in a database

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？