Delete duplicates based on string length criteria
Background
Remove duplicate city names from a temporary table, based on the length of the name.
Problem
The following query returns 350,000 rows:
select
tc.id,
tc.name_lowercase,
tc.population,
tc.latitude_decimal,
tc.longitude_decimal
from
climate.temp_city tc
inner join (
select
tc2.latitude_decimal,
tc2.longitude_decimal
from
climate.temp_city tc2
group by
tc2.latitude_decimal,
tc2.longitude_decimal
having
count(*) > 3
) s on
tc.latitude_decimal = s.latitude_decimal and
tc.longitude_decimal = s.longitude_decimal
Sample data:
940308;"sara" ;;-53.4333333;-68.1833333
935665;"estancia la sara";;-53.4333333;-68.1833333
935697;"estancia sara" ;;-53.4333333;-68.1833333
937204;"la sara" ;;-53.4333333;-68.1833333
940350;"seccion gap" ;;-52.1666667;-68.5666667
941448;"zanja pique" ;;-52.1666667;-68.5666667
935941;"gap" ;;-52.1666667;-68.5666667
935648;"estancia gap" ;;-52.1666667;-68.5666667
939635;"ritchie" ;;-51.9833333;-70.4
934948;"d.e. ritchie" ;;-51.9833333;-70.4
934992;"diego richtie" ;;-51.9833333;-70.4
934993;"diego ritchie" ;;-51.9833333;-70.4
934990;"diego e. ritchie";;-51.9833333;-70.4
I would like to remove all duplicates, retaining the rows where:
- population is not null; and
- the name is the longest of the duplicates (
max(tc.name_lowercase)
); and - if neither of these conditions are met, retain
max(tc.id)
.
From the given set of data, the remaining rows would be:
935665;"estancia la sara";;-53.4333333;-68.1833333
935648;"estancia gap" ;;-52.1666667;-68.5666667
934990;"diego e. ritchie";;-51.9833333;-70.4
Question
How would y开发者_C百科ou select just the rows with duplicate lat/long values that meet the problem criteria?
Thank you!
I think you're looking for something like this:
SELECT t.id, t.name_lowercase, t.latitude_decimal, t.longitude_decimal
FROM (SELECT MAX(LENGTH(name_lowercase)) AS len, latitude_decimal, longitude_decimal FROM temp_city GROUP BY latitude_decimal, lng) AS max_length,
temp_city t
WHERE max_length.latitude_decimal = t.latitude_decimal
AND max_length.longitude_decimal = t.longitude_decimal
AND max_length.len = LENGTH(t.name_lowercase);
Where temp_city
is the table that contains your sample results.
The above will run into problems if your temp_city
also contains this row:
1 | xxxancia la sara | -53.4333333 | -68.1833333
You didn't offer a way to choose which row from amongst those whose name
has the maximum length so both of these will be returned:
1 | xxxancia la sara | -53.4333333 | -68.1833333
935665 | estancia la sara | -53.4333333 | -68.1833333
UPDATE: If max(tc.id)
is the extra criteria from trimming down the above duplicates, then you can wrap another layer on:
SELECT t.id, t.name_lowercase, t.latitude_decimal, t.longitude_decimal
FROM
(
SELECT MAX(t.id) AS id
FROM
(
SELECT MAX(LENGTH(name_lowercase)) AS len, latitude_decimal, longitude_decimal
FROM temp_city
GROUP BY latitude_decimal, longitude_decimal
) AS max_length,
temp_city t
WHERE max_length.latitude_decimal = t.latitude_decimal
AND max_length.longitude_decimal = t.longitude_decimal
AND max_length.len = LENGTH(t.name_lowercase)
GROUP BY t.latitude_decimal, t.longitude_decimal, LENGTH(t.name_lowercase)
) AS tt,
temp_city t
WHERE t.id = tt.id
精彩评论