how to increase mysql full-text accuracy?
I am running a simple mysql full-text query that searches for users on my site based off of their "display name". The query example is below - in this example we are searching 'lancaster toy store':
SELECT MATCH(`display_name`) AGAINST ('lancaster toy store') as `rel`
WHERE MATCH(`display_name`) AGAINST ('lancaster toy store')
ORDER BY `rel` DESC
It works well in that it pulls up a good amount of results, but an example of the results would be:
- charlotte toy store
- toy store on broadway
- arizona toy stores
- toy store of lancaster
- east coast toys
As you can see, my problem is that people are searching for 'lancaster toy store', and the obvious best result is coming up near the middle or bottom.
I am using the porter-stemmer technique, as well.
Any ideas how to get more accu开发者_开发百科rate results?
UPDATE
Here's the real query (the actual search term is 'lancaster restore'):
SELECT `id`,
MATCH (`display_name`) AGAINST ('lancast* restor*' IN BOOLEAN MODE)
AS `RELEVANCY`
FROM `users`
WHERE `status` = 'active'
&& MATCH (`display_name`) AGAINST ('lancast* restor*' IN BOOLEAN MODE)
ORDER BY `RELEVANCY` DESC
LIMIT 25
and here are the results:
- Habitat for Humanity of Orange County - ReStores
- ReStore 15 Fourth Street Dover NH
- Morris Habitat for Humanity ReStore
- Habitat ReStore Lima Ohio
- Habitat for Humanity Charlotte ReStore
- ReStore Montgomery County
- Dayton Ohio Habitat for Humanity ReStore
- ReStore
- Lancaster Area Habitat for Humanity ReStore
Erm, stemming might be useful to get results, but as not all arguments are required (no +
in front of it), you'd be better of searching for:
MATCH (display_name)
AGAINST ('lancast* restor* >lancaster >restore' IN BOOLEAN MODE)
or at the very least
ORDER BY MATCH (display_name)
AGAINST ('lancast* restor* >lancaster >restore' IN BOOLEAN MODE) DESC
MySQL has a powerful text engine, but if you are looking for high presition searching then your bet should be somewhere else. If you use Lucene, using the API for any of the languages available (.NET, PHP using Zend_Search_Lucene, etc..) then you´ll get the power of a framework made for indexing.
It will cost you some development time, but you can make a ridiculous ammount of searches using it. For instance, if you decide to hold descriptions of your data, then making search queries with mysql WILL end up in disaster, because is not made for that, not with the efficiency of Lucene.
I´m just giving a small advice here! Remember, it should cost some development time to integrate to a mature application the indexing feature, but the array of options that you can place will be huge and worth it.
Best regards, David
I don't know what the porter-stemmer technique is, but using your sample data and query with a standard MySQL fulltext index, the only result that should be returned is #4:
4. toy store of lancaster
I noticed your sample query is missing a FROM clause, so I assume that is not the exact query you are running. Is it missing anything else? Perhaps you are using BOOLEAN MODE in your query? If you are using BOOLEAN MODE, that would explain the extra results, but result #4 should be at the top of the list since it has all 3 of the words.
Can you provide your exact query?
Since you asked it in a very general way - "Any ideas how to get more accurate results?" - the answer might also be a bit more general: use some other full text search engine instead of the MySQL's built in one. A good one is Sphinx search ( http://sphinxsearch.com/), it is easy to set up, easy to learn, plays well along with MySQL and offers far better speed and precision.
I'll throw in an alternative suggestion. If you are at an early stage, you can go for Solr - a dedicated search engine, so that you don't rely on full-text search capabilities of the database.
If it's getting the correct result and all you really want is better ordering, you could re-sort the results. As a first option, get a word frequency distribution for your corpus and rank results with rare word matches higher.
Throwing word order in should also help. OTOH general search is as much art as science.
精彩评论