Fastest technique for deleting duplicate data
After searching stackoverflow.com I found several questions asking how to remove duplicates, but none of them addressed speed.
In my case I have a table with 10 columns that contains 5 million exact row duplicates. In addition, I have at least a million other rows with duplicates in 9 of the 10 columns. My current technique is taking (so far) 3 hours to delete these 5 million rows. Here is my process:
-- Step 1: **This step took 13 minutes.** Insert only one of the n duplicate rows into a temp table
select
    MAX(prikey) as MaxPriKey, -- identity(1, 1)
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
into #dupTemp
from sourceTable
group by
    a,
    b,
    c,
    d,
    e,
    f,
    g,
    h,
    i
having COUNT(*) > 1
Next,
-- Step 2: **This step is taking the 3+ hours**
-- delete the row when all the non-unique columns are the same (duplicates) and
-- have a smaller prikey not equal to the max prikey
delete
from sourceTable
from sourceTable
inner join #dupTemp on
    sourceTable.a = #dupTemp.a and
    sourceTable.b = #dupTemp.b and
    sourceTable.c = #dupTemp.c and
    sourceTable.d = #dupTemp.d and
    sourceTable.e = #dupTemp.e and
    sourceTable.f = #dupTemp.f and
    sourceTable.g = #dupTemp.g and
    sourceTable.h = #dupTemp.h and
    sourceTable.i = #dupTemp.i and
    sourceTable.PriKey != #dupTemp.MaxPriKey
Any tips on how to speed this up, or a faster way? Remember I will have to run this again for rows that are not exact duplicates.
Thanks so much.
UPDATE:
I had to stop step 2 from running at the 9 hour mark. I tried OMG Ponies' method and it finished after only 40 minutes. I tried my step 2 with Andomar's batch delete; it ran for 9 hours before I stopped it.

UPDATE: I ran a similar query with one less field to get rid of a different set of duplicates, and the query ran for only 4 minutes (8,000 rows) using OMG Ponies' method. I will try the CTE technique the next chance I get; however, I suspect OMG Ponies' method will be tough to beat.
What about EXISTS:
DELETE FROM sourceTable
WHERE EXISTS (SELECT NULL
                FROM #dupTemp dt
               WHERE sourceTable.a = dt.a
                 AND sourceTable.b = dt.b
                 AND sourceTable.c = dt.c
                 AND sourceTable.d = dt.d
                 AND sourceTable.e = dt.e
                 AND sourceTable.f = dt.f
                 AND sourceTable.g = dt.g
                 AND sourceTable.h = dt.h
                 AND sourceTable.i = dt.i
                 AND sourceTable.PriKey < dt.MaxPriKey)
Can you afford to have the original table unavailable for a short time?
I think the fastest solution is to create a new table without the duplicates. Basically the approach that you use with the temp table, but creating a "regular" table instead.
Then drop the original table and rename the intermediate table to have the same name as the old table.
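A minimal sketch of that swap, assuming SQL Server and the column names from the question (the staging table name is illustrative, and any indexes, constraints, and the identity property would need to be recreated on the new table):

-- Keep exactly one row per duplicate group in a new permanent table
SELECT MAX(prikey) AS prikey, a, b, c, d, e, f, g, h, i
INTO sourceTable_dedup
FROM sourceTable
GROUP BY a, b, c, d, e, f, g, h, i;

-- Swap: drop the original and rename the new table into place
DROP TABLE sourceTable;
EXEC sp_rename 'sourceTable_dedup', 'sourceTable';

Since this only ever writes the rows you keep, it avoids logging millions of deletes, at the cost of a short outage while the tables are swapped.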
The bottleneck in bulk row deletion is usually the transaction log that SQL Server has to build up. You might be able to speed it up considerably by splitting the removal into smaller transactions. For example, to delete 100 rows at a time:
while 1=1
begin
    delete top (100)
    from sourceTable
    ...
    if @@rowcount = 0
        break
end
...based on OMG Ponies' comment above, a CTE method that's a little more compact. This method works wonders on tables where (for whatever reason) you have no primary key, so rows can be identical on all columns.
;WITH cte AS (
    SELECT ROW_NUMBER() OVER
        (PARTITION BY a,b,c,d,e,f,g,h,i ORDER BY prikey DESC) AS sequence
    FROM sourceTable
)
DELETE
FROM cte
WHERE sequence > 1
Well, lots of different things. First, would something like this work (do a select to make sure; maybe even put it into a temp table of its own, #recordsToDelete):
-- note: this keeps only the MaxPriKey rows; for it to be safe, #dupTemp would
-- need one MaxPriKey per group for EVERY group (i.e. built without the
-- HAVING COUNT(*) > 1 filter), otherwise rows that were never duplicated
-- get deleted as well
delete sourceTable
from sourceTable
left join #dupTemp on
    sourceTable.PriKey = #dupTemp.MaxPriKey
where #dupTemp.MaxPriKey is null
Next, you can index temp tables; put an index on prikey.
If you have the records you want to delete in a temp table, you can delete in batches, which is often faster than locking up the whole table with a single delete; see the sketch below.
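For instance, a sketch combining both ideas, assuming the #dupTemp layout from step 1 (the #recordsToDelete table, the index names, and the batch size are all illustrative):

-- Put an index on the key column of the temp table
CREATE CLUSTERED INDEX ix_dup ON #dupTemp (MaxPriKey);

-- Collect the prikeys to delete once...
SELECT s.prikey
INTO #recordsToDelete
FROM sourceTable s
INNER JOIN #dupTemp d
    ON  s.a = d.a AND s.b = d.b AND s.c = d.c
    AND s.d = d.d AND s.e = d.e AND s.f = d.f
    AND s.g = d.g AND s.h = d.h AND s.i = d.i
WHERE s.prikey < d.MaxPriKey;

CREATE CLUSTERED INDEX ix_del ON #recordsToDelete (prikey);

-- ...then delete them in batches to keep each transaction small
WHILE 1 = 1
BEGIN
    DELETE TOP (10000) s
    FROM sourceTable s
    INNER JOIN #recordsToDelete r
        ON s.prikey = r.prikey;

    IF @@ROWCOUNT = 0 BREAK;
END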
Here's a version where you can combine both steps into a single step.
WITH cte AS
(
    SELECT prikey,
           ROW_NUMBER() OVER (PARTITION BY a,b,c,d,e,f,g,h,i
                              ORDER BY prikey DESC) AS sequence
    FROM sourceTable
)
DELETE
FROM sourceTable
WHERE prikey IN
(
    SELECT prikey
    FROM cte
    WHERE sequence > 1
);
By the way, do you have any indexes that can be temporarily removed?
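If so, one option on SQL Server is to disable nonclustered indexes for the duration of the delete and rebuild them afterward, so the delete doesn't have to maintain them row by row; a sketch (the index name is hypothetical):

-- Disable a nonclustered index before the mass delete
ALTER INDEX ix_sourceTable_cover ON sourceTable DISABLE;

-- ... run the delete here ...

-- Rebuild it once the delete has finished
ALTER INDEX ix_sourceTable_cover ON sourceTable REBUILD;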
If you're using an Oracle database, I recently found that the following statement performs best, in terms of both total duration and CPU consumption. I ran several tests with different data sizes, from tens of rows to thousands, always in a loop, and used the TKProf tool to analyze the results.
Compared to the ROW_NUMBER() solution above, this approach took 2/3 of the original time and consumed about 50% of the CPU time. It seemed to scale linearly, i.e. it should give similar results with any input data size.
Feel free to give me your feedback. I wonder if there is a better method.
DELETE FROM sourceTable
WHERE ROWID IN
(
    -- delete all rows...
    SELECT ROWID
    FROM sourceTable t
    MINUS
    -- ...but keep one ROWID for every unique combination
    SELECT rid
    FROM
    (
        SELECT a, b, c, d, e, f, g, h, i,
               MAX(ROWID) KEEP (DENSE_RANK FIRST ORDER BY ROWID) AS rid
        FROM sourceTable t
        GROUP BY a, b, c, d, e, f, g, h, i
    )
);