how to find duplicate string values which are not exactly same but similar in SQL
I have a table with column name company_name(string),I need to find the duplicate values of this column,i can't use distinct or group by because they are not exactly same. Eg:
atec,inc
atec inc
ate inc
abc group
abc groups
abc grups
how do I find such entries in sql or is there any ruby libraries for 开发者_如何转开发this.
If there's some sort of pattern to the variations you could try with regular expressions, which MySQL seems to support (we use Postgres, which definitely does):
http://dev.mysql.com/doc/refman/5.1/en/regexp.html
If that's not the case, I'm afraid you'll have to do it in your app. As Pete mentioned there are various algorithms for finding similar sounding things, we use/have used Levenshtein distance, Hamming distances of simhashes and Kölner Phonetic Encoding for this purpose.
- http://en.wikipedia.org/wiki/Levenshtein_distance
- http://de.wikipedia.org/wiki/K%C3%B6lner_Phonetik
- http://en.wikipedia.org/wiki/Hamming_distance
This would be a perfect use for a functional index, but as I don't think MySQL has one of those, may I suggest the Full Text Searching in MySQL?
http://dev.mysql.com/doc/refman/5.5/en/fulltext-search.html
Specifically the boolean search:
http://dev.mysql.com/doc/refman/5.5/en/fulltext-boolean.html
Also, if you're not familiar with stemming of words, that's powerful concept that would also help you increase the retrieval of your search.
Hope that helps.
I don't know that SQL has a SELECT WHERE A SORTA_OF_LIKE B but look up "Levenshtein distance," perhaps a way of doing sorta' what you want.
精彩评论