How to find duplicate values in SQL Server
I'm using SQL Server 2008. I have a table
Customers
customer_number int
field1 varchar
field2 varchar
field3 varchar
field4 varchar
... and a lot more columns, that don't matter for my queries.
The customer_number column is the primary key. I'm trying to find duplicate values and some differences between them.
Please help me find all rows that have the same
1) field1, field2, field3, field4
2) values in exactly 3 of the columns, with the remaining one different (excluding rows from list 1)
3) values in exactly 2 of the columns, with the other two different (excluding rows from lists 1 and 2)
In the end, I'll have 3 tables with these results and an additional groupId, which will be the same for each group of similar rows. (For example, in the 3-column case, rows that share the same 3 column values form a separate group.)
Thank you.
Here's a handy query for finding duplicates in a table. Suppose you want to find all email addresses in a table that exist more than once:
SELECT email, COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
HAVING ( COUNT(email) > 1 )
You could also use this technique to find rows that occur exactly once:
SELECT email
FROM users
GROUP BY email
HAVING ( COUNT(email) = 1 )
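To try both queries without a SQL Server instance, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. The sample rows and the id column are made up for illustration; only the users table and email column come from the queries above.

```python
import sqlite3

# In-memory database with hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO users (email) VALUES (?)",
    [("a@example.com",), ("b@example.com",), ("a@example.com",),
     ("c@example.com",), ("a@example.com",), ("b@example.com",)],
)

# Values that occur more than once, with their counts.
duplicates = conn.execute(
    "SELECT email, COUNT(email) AS NumOccurrences "
    "FROM users GROUP BY email HAVING COUNT(email) > 1 "
    "ORDER BY email"
).fetchall()

# Values that occur exactly once.
singles = conn.execute(
    "SELECT email FROM users GROUP BY email HAVING COUNT(email) = 1"
).fetchall()

print(duplicates)
print(singles)
```

Note that COUNT(email) skips NULLs, so rows with a NULL email are not reported by either query.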
The easiest would probably be to write a stored procedure to iterate over each group of customers with duplicates and insert the matching ones per group number respectively.
However, I've thought about it and you can probably do this with a subquery. Hopefully I haven't made it more complicated than it ought to be, but this should get you what you're looking for in the first table of duplicates (all four fields). Note that this is untested, so it might need a little tweaking.
Basically, it gets each group of fields where there are duplicates, a group number for each, then gets all customers with those fields and assigns the same group number.
INSERT INTO FourFieldsDuplicates(group_no, customer_no)
SELECT Groups.group_no, custs.customer_number
FROM (-- one row, and one group number, per duplicated combination
      SELECT ROW_NUMBER() OVER(ORDER BY c.field1, c.field2, c.field3, c.field4) AS group_no,
             c.field1, c.field2, c.field3, c.field4
      FROM Customers c
      GROUP BY c.field1, c.field2, c.field3, c.field4
      HAVING COUNT(*) > 1) Groups
-- join back to pick up every customer that has one of those combinations
INNER JOIN Customers custs ON custs.field1 = Groups.field1
                          AND custs.field2 = Groups.field2
                          AND custs.field3 = Groups.field3
                          AND custs.field4 = Groups.field4
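The same idea can be tried on a small sample with Python's built-in sqlite3 module. This is only a sketch: SQLite stands in for SQL Server, window functions need SQLite 3.25 or newer, the sample rows are invented, and the result is selected rather than inserted into a results table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Customers (customer_number INTEGER PRIMARY KEY, "
    "field1 TEXT, field2 TEXT, field3 TEXT, field4 TEXT)"
)
# Hypothetical sample rows: customers 1+2 and 4+5 are full duplicates.
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?, ?, ?, ?)",
    [(1, "a", "b", "c", "d"),
     (2, "a", "b", "c", "d"),
     (3, "a", "b", "c", "x"),
     (4, "p", "q", "r", "s"),
     (5, "p", "q", "r", "s"),
     (6, "z", "z", "z", "z")],
)

# The subquery numbers each duplicated combination; the join assigns that
# number to every customer that has the combination.
four_field_groups = conn.execute("""
    SELECT g.group_no, c.customer_number
    FROM (SELECT ROW_NUMBER() OVER (ORDER BY field1, field2, field3, field4) AS group_no,
                 field1, field2, field3, field4
          FROM Customers
          GROUP BY field1, field2, field3, field4
          HAVING COUNT(*) > 1) AS g
    INNER JOIN Customers AS c
            ON c.field1 = g.field1 AND c.field2 = g.field2
           AND c.field3 = g.field3 AND c.field4 = g.field4
    ORDER BY g.group_no, c.customer_number
""").fetchall()

print(four_field_groups)
```

Because the join uses plain equality, combinations containing NULL never join back to their rows; that is true in SQL Server as well.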
The other ones are a bit more complicated, however, as you'll need to expand out the possibilities. The three-field groups would then be:
INSERT INTO ThreeFieldsDuplicates(group_no, customer_no)
SELECT Groups.group_no, custs.customer_number
FROM (SELECT ROW_NUMBER() OVER(ORDER BY GroupsInner.field1, GroupsInner.field2,
                                        GroupsInner.field3, GroupsInner.field4) AS group_no,
             GroupsInner.field1, GroupsInner.field2,
             GroupsInner.field3, GroupsInner.field4
      FROM (-- In each branch, NULL marks the one field that is allowed to differ.
            -- The HAVING belongs inside each branch: after GROUP BY every
            -- combination appears only once, so an outer COUNT(*) > 1 never matches.
            SELECT c.field1, c.field2, c.field3, NULL AS field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_number)
            GROUP BY c.field1, c.field2, c.field3
            HAVING COUNT(*) > 1
            UNION ALL
            SELECT c.field1, c.field2, NULL AS field3, c.field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_number)
            GROUP BY c.field1, c.field2, c.field4
            HAVING COUNT(*) > 1
            UNION ALL
            SELECT c.field1, NULL AS field2, c.field3, c.field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_number)
            GROUP BY c.field1, c.field3, c.field4
            HAVING COUNT(*) > 1
            UNION ALL
            SELECT NULL AS field1, c.field2, c.field3, c.field4
            FROM Customers c
            WHERE NOT EXISTS(SELECT d.customer_no
                             FROM FourFieldsDuplicates d
                             WHERE d.customer_no = c.customer_number)
            GROUP BY c.field2, c.field3, c.field4
            HAVING COUNT(*) > 1) GroupsInner) Groups
INNER JOIN Customers custs ON (Groups.field1 IS NULL OR custs.field1 = Groups.field1)
                          AND (Groups.field2 IS NULL OR custs.field2 = Groups.field2)
                          AND (Groups.field3 IS NULL OR custs.field3 = Groups.field3)
                          AND (Groups.field4 IS NULL OR custs.field4 = Groups.field4)
-- Keep customers already listed as four-field duplicates out of the result.
WHERE NOT EXISTS(SELECT d.customer_no
                 FROM FourFieldsDuplicates d
                 WHERE d.customer_no = custs.customer_number)
Hopefully this produces the right results and I'll leave the last one as an exercise. :-D
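Since the queries above are untested, one way to sanity-check them on a small sample is a brute-force comparison of every pair of rows. The sketch below is plain Python, not SQL; the function name and the sample rows are hypothetical. It counts equal fields per pair and does not apply the "except rows from list 1" exclusion, so treat it as a rough cross-check only.

```python
from itertools import combinations

def pairs_with_k_equal(rows, k):
    """Return (id_a, id_b) pairs whose field tuples agree in exactly k positions."""
    result = []
    for (id_a, fields_a), (id_b, fields_b) in combinations(rows.items(), 2):
        equal = sum(1 for x, y in zip(fields_a, fields_b) if x == y)
        if equal == k:
            result.append((id_a, id_b))
    return result

# customer_number -> (field1, field2, field3, field4), invented for illustration
customers = {
    1: ("a", "b", "c", "d"),
    2: ("a", "b", "c", "d"),
    3: ("a", "b", "c", "x"),
    4: ("a", "b", "y", "z"),
}

all_four = pairs_with_k_equal(customers, 4)
exactly_three = pairs_with_k_equal(customers, 3)
exactly_two = pairs_with_k_equal(customers, 2)

print(all_four)
print(exactly_three)
print(exactly_two)
```

One difference to keep in mind: Python treats None == None as equal, whereas SQL's = never matches two NULLs.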
I'm not sure if you require an equality check on different fields (like field1=field2).
Otherwise this might be enough.
Edit
Feel free to adjust the test data to provide inputs that give a wrong output according to your specifications.
Test data
DECLARE @Customers TABLE (
customer_number INTEGER IDENTITY(1, 1)
, field1 INTEGER
, field2 INTEGER
, field3 INTEGER
, field4 INTEGER)
INSERT INTO @Customers
SELECT 1, 1, 1, 1
UNION ALL SELECT 1, 1, 1, 1
UNION ALL SELECT 1, 1, 1, NULL
UNION ALL SELECT 1, 1, 1, 2
UNION ALL SELECT 1, 1, 1, 3
UNION ALL SELECT 2, 1, 1, 1
All Equal
SELECT ROW_NUMBER() OVER (ORDER BY c1.customer_number)
, c1.field1
, c1.field2
, c1.field3
, c1.field4
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND ISNULL(c2.field1, 0) = ISNULL(c1.field1, 0)
AND ISNULL(c2.field2, 0) = ISNULL(c1.field2, 0)
AND ISNULL(c2.field3, 0) = ISNULL(c1.field3, 0)
AND ISNULL(c2.field4, 0) = ISNULL(c1.field4, 0)
One field different
SELECT ROW_NUMBER() OVER (ORDER BY field1, field2, field3, field4)
, field1
, field2
, field3
, field4
FROM (
SELECT DISTINCT c1.field1
, c1.field2
, c1.field3
, field4 = NULL
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND c2.field1 = c1.field1
AND c2.field2 = c1.field2
AND c2.field3 = c1.field3
AND ISNULL(c2.field4, 0) <> ISNULL(c1.field4, 0)
UNION ALL
SELECT DISTINCT c1.field1
, c1.field2
, NULL
, c1.field4
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND c2.field1 = c1.field1
AND c2.field2 = c1.field2
AND ISNULL(c2.field3, 0) <> ISNULL(c1.field3, 0)
AND c2.field4 = c1.field4
UNION ALL
SELECT DISTINCT c1.field1
, NULL
, c1.field3
, c1.field4
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND c2.field1 = c1.field1
AND ISNULL(c2.field2, 0) <> ISNULL(c1.field2, 0)
AND c2.field3 = c1.field3
AND c2.field4 = c1.field4
UNION ALL
SELECT DISTINCT NULL
, c1.field2
, c1.field3
, c1.field4
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND ISNULL(c2.field1, 0) <> ISNULL(c1.field1, 0)
AND c2.field2 = c1.field2
AND c2.field3 = c1.field3
AND c2.field4 = c1.field4
) c
You can simply write something like this to count duplicate entries; I think it works:
use *DATABASE_NAME*
go
SELECT *YOUR_FIELD*, COUNT(*) AS dupes
FROM *YOUR_TABLE_NAME*
GROUP BY *YOUR_FIELD*
HAVING (COUNT(*) > 1)
Enjoy
There is a clean way of doing this with CUBE(), which aggregates by all possible combinations of columns:
SELECT
field1,field2,field3,field4
,duplicate_row_count = COUNT(*)
,grp_id = GROUPING_ID(field1,field2,field3,field4)
INTO #duplicate_rows
FROM table_name
GROUP BY CUBE(field1,field2,field3,field4)
HAVING COUNT(*) > 1
AND GROUPING_ID(field1,field2,field3,field4) IN (0,1,2,4,8,3,5,6,9,10,12)
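The numbers (0,1,2,4,8,3,5,6,9,10,12) are just the bitmasks (0000,0001,0010,0100,...,1010,1100) of the grouping sets that we care about: those with 4, 3, or 2 matching columns. A bit set to 1 means that column was aggregated away (field1 is the leftmost bit, field4 the rightmost), so it shows up as NULL in the result.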
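To see where those numbers come from, here is a small Python sketch that derives them. The grouping_id helper is a hypothetical name; it only mimics how GROUPING_ID orders its bits (first argument is the most significant bit, and a bit is 1 when the column is aggregated).

```python
from itertools import combinations

COLUMNS = ["field1", "field2", "field3", "field4"]

def grouping_id(aggregated):
    """Bitmask for one grouping set: bit is 1 for each aggregated column."""
    mask = 0
    for col in COLUMNS:
        mask = (mask << 1) | (1 if col in aggregated else 0)
    return mask

# Map "number of matching columns" -> list of bitmasks, for 4, 3 and 2 matches.
masks_by_matches = {}
for n_aggregated in range(0, 3):
    masks = sorted(
        grouping_id(set(combo)) for combo in combinations(COLUMNS, n_aggregated)
    )
    masks_by_matches[len(COLUMNS) - n_aggregated] = masks

print(masks_by_matches)
```

The three lists correspond to the three commented-out WHERE clauses used when joining back to the original table.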
Then join this back to the original table using a technique that treats NULLs in #duplicate_rows as wildcards
SELECT a.*
FROM table_name a
INNER JOIN #duplicate_rows b
ON NULLIF(b.field1,a.field1) IS NULL
AND NULLIF(b.field2,a.field2) IS NULL
AND NULLIF(b.field3,a.field3) IS NULL
AND NULLIF(b.field4,a.field4) IS NULL
--WHERE grp_id IN (0) --Use this for 4 matches
--WHERE grp_id IN (1,2,4,8) --Use this for 3 matches
--WHERE grp_id IN (3,5,6,9,10,12) --Use this for 2 matches
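The NULLIF join condition can be checked in isolation with Python's built-in sqlite3 module, where NULLIF behaves the same way for this purpose. This is a simplified sketch with two fields instead of four, a regular table named duplicate_rows in place of the temp table, and invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (id INTEGER PRIMARY KEY, field1 TEXT, field2 TEXT)")
conn.execute("CREATE TABLE duplicate_rows (field1 TEXT, field2 TEXT)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "y"), (3, "b", "x"), (4, "c", None)],
)
# One pattern row: field1 must be 'a', field2 is NULL and so acts as a wildcard.
conn.execute("INSERT INTO duplicate_rows VALUES ('a', NULL)")

# NULLIF(b.col, a.col) is NULL when b.col is NULL (wildcard) or when both
# values are equal; otherwise it returns b.col and the row is rejected.
matches = conn.execute("""
    SELECT a.id
    FROM table_name a
    INNER JOIN duplicate_rows b
            ON NULLIF(b.field1, a.field1) IS NULL
           AND NULLIF(b.field2, a.field2) IS NULL
    ORDER BY a.id
""").fetchall()

print(matches)
```

One caveat: a genuine NULL in the data is indistinguishable from a wildcard in the pattern table, so columns that really contain NULLs may need separate handling.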