开发者

Remove first row of all certain duplicate entries in MySQL

My issue is that I have duplicate entries in my table. However, it's not that simple. Some are valid duplicates while others are not. For example, E Pearl River, NY and East Pearl River, NY are the same city w开发者_如何学Chile Bardonia, NY and Nanuet, NY are not. So, if I removed ALL duplicates, it would remove valid cities like Bardonia.

My question is, does anyone know a way (in SQL or any programming language) to loop through each row and do the following:

  • Check if row is duplicate
  • Check if there is a partial match for city (i.e. E Pearl River, NY and East Pearl River, NY)
  • Remove first of the duplicate rows
  • If there is no partial match for city (i.e. Bardonia, NY and Nanuet, NY), move along.

Before

+—————————+——————————————————————+
| zipcode |        city          |
+—————————+——————————————————————+
|  10956  | E Pearl River, NY    |
|  10956  | East Pearl River, NY |
|  10977  | Spring Valley, NY    |
|  10954  | Bardonia, NY         |
|  10954  | Nanuet, NY           |
+—————————+——————————————————————+

After

+—————————+——————————————————————+
| zipcode |        city          |
+—————————+——————————————————————+
|  10956  | East Pearl River, NY |
|  10977  | Spring Valley, NY    |
|  10954  | Bardonia, NY         |
|  10954  | Nanuet, NY           |
+—————————+——————————————————————+

Any help is greatly appreciated!


There is no easy solution for your problem as it involves natural language processing.

How do you define E Pearl River, NY is duplicate for East Pearl River, NY? What about East and West Pearl River? For natural language processing you should take a look at Lucene or MySQL Full-Text Search.

For fairly easy (but not necessarily very accurate) solution, you could try implementing simple Levenshtein distance measure.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜