Removing duplicates from SQL Join
The following is a hypothetical situation this which is close to my real problem. Table1
recid firstname lastname company
1 A B AAA
2 D E DEF
3 G H IJK
4 A 开发者_如何学C B ABC
I have a table2 that looks like this
recid firstname lastname company
10 A B ABC
20 D E DEF
30 M D DIM
40 A B CCC
Now if I join the table on recid
, it will give 0
results, there will be no duplicates because recid
is unique. But if I join on firstname
and lastname
columns, which are not unique and there are duplicates, I get duplicates on inner join. The more columns I add on join, the worse it becomes (more duplicates are created).
In the above simple situation, how can I remove duplicates in the following query? I want to compare firstname
and lastname
, if they match, I return firstname
, lastname
and recid
from table2
SELECT DISTINCT * FROM
(SELECT recid, first, last FROM table1) a
INNER JOIN
(SELECT recid, first,last FROM table2) b
ON a.first = b.first
The script is here if anyone wants to play with it in future
CREATE TABLE table1 (recid INT NOT NULL PRIMARY KEY, first varchar(20), last varchar(20), company varchar(20))
CREATE TABLE table2 (recid INT NOT NULL PRIMARY KEY, first varchar(20), last varchar(20), company varchar(20))
INSERT INTO TABLE1 VALUES(1,'A','B','ABC')
INSERT INTO TABLE1 VALUES(2,'D','E','DEF')
INSERT INTO TABLE1 VALUES(3,'M','N','MNO')
INSERT INTO TABLE1 VALUES(4,'A','B','ABC')
INSERT INTO TABLE2 VALUES(10,'A','B','ABC')
INSERT INTO TABLE2 VALUES(20,'D','E','DEF')
INSERT INTO TABLE2 VALUES(30,'Q','R','QRS')
INSERT INTO TABLE2 VALUES(40,'A','B','ABC')
You don't want to do a join per se, you're merely testing for existence/set inclusion.
I don't know what current flavor of SQL you're coding in, but this should work.
SELECT MAX(recid), firstname, lastname
FROM table2 T2
WHERE EXISTS (SELECT * FROM table1 WHERE firstname = T2.firstame AND lastname = T2.lastname)
GROUP BY lastname, firstname
If you want to implement as a join, leaving the code largely the same:
i.e.
SELECT max(t2.recid), t2.firstame, t2.lastname
FROM Table2 T2
INNER JOIN Table1 T1
ON T2.firstname = t1.firstname and t2.lastname = t1.lastname
GROUP BY t2.firstname, t2.lastname
Depending on the DBMS, an inner join may be implemented differently to an Exists (semi-join vs join) but the optimizer can sometimes figure it out anyway and chose the correct operator regardless of which way you write it.
SELECT t2.recid, t2.first, t2.last
FROM table1 t1
INNER JOIN table2 t2 ON t1.first = t2.first AND t1.last = t2.last
GROUP BY t2.recid, t2.first, t2.last
EDIT: Added picture
There are some situations when you have to do grouping in a subquery
SELECT distinct b.recid, b.first, b.last
FROM table1 a
INNER JOIN (
SELECT MAX(recid) as recid, first, last
FROM table2
GROUP BY first, last
) b
ON a.first = b.first AND a.last = b.last
精彩评论