开发者

how to find duplicate string values which are not exactly same but similar in SQL

I have a table with column name company_name(string),I need to find the duplicate values of this column,i can't use distinct or group by because they are not exactly same. Eg:

atec,inc
atec inc
ate inc  

abc group
abc groups
abc grups

how do I find such entries in sql or is there any ruby libraries for 开发者_如何转开发this.


If there's some sort of pattern to the variations you could try with regular expressions, which MySQL seems to support (we use Postgres, which definitely does):

http://dev.mysql.com/doc/refman/5.1/en/regexp.html

If that's not the case, I'm afraid you'll have to do it in your app. As Pete mentioned there are various algorithms for finding similar sounding things, we use/have used Levenshtein distance, Hamming distances of simhashes and Kölner Phonetic Encoding for this purpose.

  • http://en.wikipedia.org/wiki/Levenshtein_distance
  • http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik
  • http://en.wikipedia.org/wiki/Hamming_distance


This would be a perfect use for a functional index, but as I don't think MySQL has one of those, may I suggest the Full Text Searching in MySQL?

http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html

Specifically the boolean search:

http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html

Also, if you're not familiar with stemming of words, that's powerful concept that would also help you increase the retrieval of your search.

Hope that helps.


I don't know that SQL has a SELECT WHERE A SORTA_OF_LIKE B but look up "Levenshtein distance," perhaps a way of doing sorta' what you want.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜