Remove first row of all certain duplicate entries in MySQL
My issue is that I have duplicate entries in my table. However, it's not that simple. Some are valid duplicates while others are not. For example, E Pearl River, NY
and East Pearl River, NY
are the same city w开发者_如何学Chile Bardonia, NY
and Nanuet, NY
are not. So, if I removed ALL duplicates, it would remove valid cities like Bardonia.
My question is, does anyone know a way (in SQL or any programming language) to loop through each row and do the following:
- Check if row is duplicate
- Check if there is a partial match for
city
(i.e. E Pearl River, NY and East Pearl River, NY) - Remove first of the duplicate rows
- If there is no partial match for
city
(i.e. Bardonia, NY and Nanuet, NY), move along.
Before
+—————————+——————————————————————+
| zipcode | city |
+—————————+——————————————————————+
| 10956 | E Pearl River, NY |
| 10956 | East Pearl River, NY |
| 10977 | Spring Valley, NY |
| 10954 | Bardonia, NY |
| 10954 | Nanuet, NY |
+—————————+——————————————————————+
After
+—————————+——————————————————————+
| zipcode | city |
+—————————+——————————————————————+
| 10956 | East Pearl River, NY |
| 10977 | Spring Valley, NY |
| 10954 | Bardonia, NY |
| 10954 | Nanuet, NY |
+—————————+——————————————————————+
Any help is greatly appreciated!
There is no easy solution for your problem as it involves natural language processing.
How do you define E Pearl River, NY is duplicate for East Pearl River, NY? What about East and West Pearl River? For natural language processing you should take a look at Lucene or MySQL Full-Text Search.
For fairly easy (but not necessarily very accurate) solution, you could try implementing simple Levenshtein distance measure.
精彩评论