Sphinx and "did you mean ... ?" suggestions idea. WIll it work?

2023-02-10 05:17 问答作者：

I'm trying to come up with the fastest way to make search suggestions. At first I thought a Levenstein UDF function combined with a mysql table would do the job. But using levenshtein, mysql would have to go over every row in the table (tons of words) which would make the query really slow.

Now I recently installed and started to use Sphinx (http://sphinxsearch.com/) for fulltext searching mainly because of its performance and tight mysql integration with SphinxSE.

So I asked myself if I can implement a "did you mean" algorithm using sphinx to boost performance somehow, and I think I found a simple one. Basically i take all the keywords I want to correct, put a space between each letter, then put it in the sphinx index. If the word is 'keyword' it becomes 'k e y w o r d'. Now when the user enters a word I split it in to letters and search in the sphinx index for a record (I just need one) that matches any of the letters provided. The best part is that sphinx is very good on calculating relevance (weight) of the matched rows, so the best match will always 开发者_高级运维have the biggest weight (I think). It also accounts for word (letters in my case) positions so the best match will be in that order.

With the sphinx query I get the most similar word in my keywords list. Then I check it with php using the extended Levenshtain distance which accounts for rearranged letters https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance . If the string distance is lower than 2 (and != 0) then suggest the word. Otherwise don't suggest anything.

Is there a problem with my idea? Something I didn't think of? Any expected glitches with the sphinx query, and quirks with the sphinx relevance calculation which woudn't give the best match? Please correct me if I'm mistaking somewhere.

I can't see a problem with your idea. Go for it. Just to point out that your method is only relevant if you want to override the builtin behaviour that is very similar to LD.

For example, with sphinx 1.10-beta, you can specify min_infix_len and expand_keywords and use sphinx's builtin weighting methods (BM25 and some proprietary code) for good results. http://sphinxsearch.com/blog/2010/08/17/how-sphinx-relevance-ranking-works/

Don't forget to memcache these queries, and create a warm-up script.

You could just log every search query that's entered, along with a next search query that the user enters.

Lets assume that lots of users search for rhinosorous but actually mean rhinoceros. Because users will correct their query, this will mean there will be a lot of rhinosorous queries with rhinoceros as the next query.

You can select suggestions like this:

SELECT id, query, next_query, COUNT(id) AS count FROM queries GROUP BY query ORDER BY COUNT(id) DESC

If the top result has a count that's a high % of all queries for that keyword, display a message.

I haven't tested this, its just an idea.

I think it will be interesting to you to read what Andrew Aksyonoff (author of Sphinx) thinks about implementation of this task via Sphinx - http://habrahabr.ru/blogs/sphinx/61807/ (use translator to translate from russian)

继续阅读：keyword levenshtein-distance search-suggestion sphinx

Sphinx and "did you mean ... ?" suggestions idea. WIll it work?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？