开发者

levenshtein alternative

i have a big set of queries and use levenshtein to calculate typos, now levenshtein causes mysql to take full cpu time. My query is a fulltext search + levenshtein in a UNION statement. sql1 is my current query, sql2 is only fulltext search which is fast and doesnt use too much cpu time, the last one the leventhein one which will peak!

Any of you have an alternative way to get typos as well? Please don't answer normalize data, I have thought of that, but is not applicable to my data, as I cannot pre-make the matches/calculations and create a separate table with indexes.

            $sql1 = "(SELECT * FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='' AND MATCH(prop_value) AGAINST ('+usama bin laden' IN BOOLEAN MODE)) UNION (SELECT s.* FROM开发者_如何学C (SELECT levenshtein(prop_value, 'usama bin laden') AS dist, sanction_id, prop_type, prop_value FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='') s WHERE dist < 3) ORDER BY sanction_id";

        $sql2 = "SELECT * FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='' AND MATCH(prop_value) AGAINST ('+usama bin laden' IN BOOLEAN MODE) ORDER BY sanction_id";

        $sql3 = "SELECT s.* FROM (SELECT levenshtein(prop_value, 'usama bin laden') AS dist, sanction_id, prop_type, prop_value FROM ci_sanctions_properties WHERE prop_type='LASTNAME' AND prop_value!='') s WHERE dist < 3";


If you are tied only to MySQL there is not an easy solution.

Usually this is solved using specialized ngram indexing for fast candidate lookup filtering and then calculating levensthein only on like 10-50 candidates which is faster that calculating levensthein for all pairs.

Specialized fulltext search engines like Solr/Lucene have this built in.

PostgreSQL has pg_trgm contrib module (http://www.postgresql.org/docs/9.0/static/pgtrgm.html) which works like a charm.

You can even simulate this in MySQL using fulltext indexing, but you have to collect words from all your documents convert them to ngrams, create fulltext indexes on them, and hack them all together for fast lookup. Which brings all sorts of trouble with redundancy, sync...not worth your time.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜