开发者

Whats the easiest site search application to implement, that supports fuzzy searching?

I have a site that needs to search thru about 20-30k records, which are mostly movie and TV show names. The site runs php/mysql with memcache.

Im looking to replace the FULLTEXT with soundex() searching that I currently have, which works... kind of, but isn't very good in many situations.

Are there any decent search scripts out there that are simple to implement, and will provide a开发者_C百科 decent searching capability (of 3 columns in a table).


ewemli's answer is in the right direction but you should be combining FULLTEXT and soundex mapping, not replacing the fulltext, otherwise your LIKE queries are likely be very slow.

create table with_soundex (
  id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
  original TEXT,
  soundex TEXT,
  FULLTEXT (soundex)
);

insert into with_soundex (original, soundex) values 

('add some test cases', CONCAT_WS(' ', soundex('add'), soundex('some'), soundex('test'), soundex('cases'))),
('this is some text', CONCAT_WS(' ', soundex('this'), soundex('is'), soundex('some'), soundex('text'))),
('one more test case', CONCAT_WS(' ', soundex('one'), soundex('more'), soundex('test'), soundex('case'))),
('just filling the index', CONCAT_WS(' ', soundex('just'), soundex('filling'), soundex('the'), soundex('index'))),
('need one more example', CONCAT_WS(' ', soundex('need'), soundex('one'), soundex('more'), soundex('example'))),
('seems to need more', CONCAT_WS(' ', soundex('seems'), soundex('to'), soundex('need'), soundex('more')))
('some helpful cases to consider', CONCAT_WS(' ', soundex('some'), soundex('helpful'), soundex('cases'), soundex('to'), soundex('consider')))

select * from with_soundex where match(soundex) against (soundex('test'));
+----+---------------------+---------------------+
| id | original            | soundex             |
+----+---------------------+---------------------+
|  1 | add some test cases | A300 S500 T230 C000 | 
|  2 | this is some text   | T200 I200 S500 T230 | 
|  3 | one more test case  | O500 M600 T230 C000 | 
+----+---------------------+---------------------+

select * from with_soundex where match(soundex) against (CONCAT_WS(' ', soundex('test'), soundex('some')));
+----+--------------------------------+---------------------------+
| id | original                       | soundex                   |
+----+--------------------------------+---------------------------+
|  1 | add some test cases            | A300 S500 T230 C000       | 
|  2 | this is some text              | T200 I200 S500 T230       | 
|  3 | one more test case             | O500 M600 T230 C000       | 
|  7 | some helpful cases to consider | S500 H414 C000 T000 C5236 | 
+----+--------------------------------+---------------------------+

That gives pretty good results (within the limits of the soundex algo) while taking maximum advantage of an index (any query LIKE '%foo' has to scan every row in the table).

Note the importance of running soundex on each word, not on the entire phrase. You could also run your own version of soundex on each word rather than having SQL do it but in that case make sure you do it both when storing and retrieving in case there are differences between the algorithms (for instance, MySQL's algo doesn't limit itself to the standard 4 chars)


If you are looking for a simple existing solution instead of creating your own solution check out

  • Sphider.eu
  • PHPDig


There is a function SOUNDEX in mysql. If you want to search for a movie title :

select * from movie where soundex(title) = soundex( 'the title' );

Of course it doesn't work to search in text, such as movie or plot summary.


Soundex is a relatively simple algo. You can also decide to handle all that at the applicative level, it may be easier:

  • when text is stored, tokenize it and apply soundex on all words
  • store the original text and the soundex version in two columns
  • when you search, compute the soundex at the app. level and then use a regular LIKE at the db level.


Soundex has limitations to deal with fuzzy search. A better function is edit distance, which can be integrated into MySQL using UDF. Check http://flamingo.ics.uci.edu/toolkit/ for a C++ implementation for MySQL on Linux.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜