开发者

MYSQL search database for similar results

Essentially what I want to do is search a number of MYSQL databases and return results where a certain field is more than 50% similar to another record in the databases.

What am I trying to achieve? I have a number of writers who add content to a ne开发者_JS百科twork of websites that I own, I need a tool that will tell me if any of the pages they have written are too similar to any of the pages currently published on the network. This could run on post/update or as a cron... either way would work for me.

I've tried making something with php, drawing the records from the database and using the function similar_text(), which gives a % difference between two strings - this however is not a workable solution as you have to compare every entry against every other entry & I worked out with microtime that it would take around 80 hours to completely search all of the entries!

Wondering if it's even possible!?

Thanks!


You are probably looking for is SOUNDEX. It is the only sound based search in mysql. If you have A LOT of data to compare, you're probably going to need to pregenerate the soundex and compare the soundex columns or use it live like this:

SELECT * FROM data AS t1 LEFT JOIN data AS t2 ON SOUNDEX(t1.fieldtoanalyse) = SOUNDEX(t2.fieldtoanalyse)

Note that you can also use the

t1.fieldtoanalyze SOUNDS LIKE t2.fieldtoanalyze

syntax.

Finaly, you can save the SOUNDEX to a column, just create a column and:

UPDATE data SET fieldsoundex = SOUNDEX(fieldtoanalyze)

and then compare live with pregenerated values

More on Soundex

Soundex is a function that analyzes the composition of a word but in a very crude way. It is very useful for comparisons of "Color" vs "Colour" and "Armor" vs "Armour" but can also sometimes dish out weird results with long words because the SOUNDEX of a word is a letter + a 3 number code. There is just so much you can do sadly with these combinations.

Note that there is no levenstein or metaphone implementation in mysql... not yet, but probably, levenstein would have been the best for your case.


Anything is possible.

Without knowing your criteria for similar, it's difficult to offer a specific solution. However, my suggestion would be pre-build a similarity table, utilize a function such as similar_text(). Use this as your index table when searching by term.

You'll take an initial hit to build such an index. However, you can manage it easier as new records are added.


Thanks for your answers guys, for anyone looking for a solution to a problem similar to this I used the SOUNDEX function to pull out entries that had a similar title then compared them with the similar_text() function. Not quite a complete database comparison, but near as I could get it!

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜