How do I clean up data that would violate a Primary Key in SQL Server 2008?
I have some crappy data from a source I cannot control, it needs to go into a table with a composite primary key that looks like this:
PK_Part1, PK_Part2, StringData, DateData
My crappy data has full duplicates, PK Duplicates with different StringData, PK Duplicates with different DateData, and PK Duplicates with different StringData and DateData.
So I might开发者_如何学C see:
1234,1234,Blah,2011-1-1
1234,1234,Blah,2011-1-1
4321,4321,Blah,2011-1-1
4321,4321,Blah,2011-10-10
5678,5678,Blah,2011-1-1
5678,5678,Blah1,2011-1-1
8765,8765,Blah,2011-1-1
8765,8765,Blah,2011-10-10
8765,8765,Blah1,2011-10-10
How so I clean this up in SQL Server 2008? given that:
A) I want only the data associated with the latest date B) I'm trying to force the issue with the source about the string data, but for now longer string is better, same length either will do. C) I have to assume the source will be of no help and load everything nowI had hoped to use MERGE
but it seems to compare all rows of the Source table and Target table before doing any of the 'MATCH' or 'NO MATCH' statements so I got PK violations, and removing the PK constraint let all the duplicates in.
If you don't have that data in SQL Server already: BULK INSERT
that into a temporary table:
CREATE TABLE #tempStaging
(PK_Part1 INT, PK_Part2 INT, StringData VARCHAR(500), DateData DATE)
BULK INSERT #tempStaging
FROM 'c:\yourfile.txt'
WITH (FIELDTERMINATOR =',',
ROWTERMINATOR ='\n')
Then you should be able to do something like:
;WITH CleaupData AS
(
SELECT
PK_Part1, PK_Part2, StringData, DateData,
ROW_NUMBER() OVER(PARTIION BY PK_Part1, PK_Part2
ORDER BY DateData DESC, LEN(StringData) DESC) as 'RowNum'
FROM
#tempStaging
)
INSERT INTO dbo.YourTargetTable(PK_Part1, PK_Part2, StringData, DateData)
SELECT PK_Part1, PK_Part2, StringData, DateData
FROM CleanupData
WHERE RowNum = 1
This will "partition" your data based on some criteria (some ID or something), and each partition of data is order by date (descending - newest first).
So the entry with the RowNum = 1
is the newest entry for each partition - pick that one and toss out all others, and your data is cleaned up !
HINT: this assumes that your target table is empty! If that's not the case, then yes - you might need to apply a MERGE
statement instead, based on the CTE that selects out the data to keep from the BULK INSERT
.
The data form the source should go into a temp table, a holding temp area. Then you can choose the best one from that (since your sample data contains duplicate part1+part2 even within the input data)
Sample table and temp table
create table pkdup(
PK_Part1 int, PK_Part2 int, StringData varchar(100), DateData datetime,
primary key (PK_Part1,PK_Part2))
insert pkdup select 1234,1234,'', GETDATE()+1000
create table #tmp(col1 nvarchar(max), col2 nvarchar(max), col3 nvarchar(max), col4 datetime)
insert #tmp values
(1234,1234,'Blah','2011-1-1'),
(1234,1234,'Blah','2011-1-1'),
(4321,4321,'Blah','2011-1-1'),
(4321,4321,'Blah','2011-10-10'),
(5678,5678,'Blah','2011-1-1'),
(5678,5678,'Blah1','2011-1-1'),
(8765,8765,'Blah','2011-1-1'),
(8765,8765,'Blah','2011-10-10'),
(8765,8765,'Blah1','2011-10-10');
The merge statement
merge pkdup as target
using (
select col1, col2, col3, col4
from (select *, row_number() over (
partition by col1, col2
order by col4 desc, len(col3) desc) rownum
from #tmp) t
where rownum=1 -- only the best
) as source
on source.col1=target.PK_Part1 and source.col2=target.PK_Part2
WHEN MATCHED AND (source.col4 > target.datedata or (source.col4=target.datedata and len(source.col3) > target.stringdata))
THEN UPDATE SET target.stringdata = source.col3, target.datedata = source.col4
WHEN NOT MATCHED THEN
INSERT (PK_Part1, PK_Part2, StringData, DateData)
VALUES (source.col1, source.col2, source.col3, source.col4);
We usually put such data into a staging table and then get rid of the duplicates in the staging table before we try to run a merge statement.
not sure if you can apply a string length function in a join, but if you can, try this:
select PK_Part1, PK_Part2, max_date, max_len, first(StringData) as first_string
from
(select PK_Part1, PK_Part2, max_date, max(len(StringData)) as max_len
from table inner join
(select PK_Part1, PK_Part2, max(DateData) as max_date
from table
group by
PK_Part1, PK_Part2) md
on table.PK_Part1 = md.PK_Part1 and
table.PK_Part2 = md.PK_Part2 and
table.DateData = md.max_date
group by
PK_Part1, PK_Part2, max_date) ml
inner join table on
table.PK_Part1 = ml.PK_Part1 and
table.PK_Part2 = ml.PK_Part2 and
table.DateData = ml.max_date and
len(table.StringData) = ml.max_len
group by
PK_Part1, PK_Part2, max_date, max_len
精彩评论