开发者

Fuzzy match of an English sentence with a set of English sentences stored in a database

There are about 1000 records in a database table. There is a column named title which is used to store the title of articles. Before inserting a record, I need to check if there is already an article with similar title exists in that table. If so, I will skip.

What's the fastest way to perform this kind of fuzzy matching? Assuming all words in sentences can be found in a English dictionary. If 70% of words in sentence #1 can be found 开发者_如何学Pythonin sentence #2, we consider them a match. Ideally, the algorithm can pre-compute a value for each sentence so that the value can be stored in the database.


For a 1000 records, doing the dumb thing and just iterating over all the records could work (assuming that the strings aren't too long and you aren't getting hit with too many queries). Just pull all of the titles out of your database, and then sort them by their distance to your given string (for example, you could use Levenshtein distance for this metric).

A fancier way to do approximate string matching would be to precompute n-grams of all your strings and store them in your database (some systems support this feature natively). This will definitely scale better performance wise, but it could mean more work:

http://en.wikipedia.org/wiki/N-gram


You can read up on forward / reverse indexing of token - value storage for getting faster search results. I personally prefer reverse indexing which stores a hash map of token(key) to value (here title).

Whenever you write a new article, like a new stackoverflow question, the tokens in the title would be searched to map against all the titles available.

To optimize the result, i.e. get the fuzzy logic for results, you can sort the titles by the max amount of occurrences in tokens being searched for. Eg, if t1,t2 and t3 refer to the tokens 'what' 'is' 'love', and the title 'what this love is for?' would exist in all the tokens mappings, it would be placed at the topmost.

You can play around with this more. I hope this approach is more simple and appealing.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜