开发者

How do I clean up data that would violate a Primary Key in SQL Server 2008?

I have some crappy data from a source I cannot control, it needs to go into a table with a composite primary key that looks like this:

PK_Part1, PK_Part2, StringData, DateData

My crappy data has full duplicates, PK Duplicates with different StringData, PK Duplicates with different DateData, and PK Duplicates with different StringData and DateData.

So I might开发者_如何学C see:

1234,1234,Blah,2011-1-1
1234,1234,Blah,2011-1-1
4321,4321,Blah,2011-1-1
4321,4321,Blah,2011-10-10
5678,5678,Blah,2011-1-1
5678,5678,Blah1,2011-1-1
8765,8765,Blah,2011-1-1
8765,8765,Blah,2011-10-10
8765,8765,Blah1,2011-10-10

How so I clean this up in SQL Server 2008? given that:

A) I want only the data associated with the latest date

B) I'm trying to force the issue with the source about the string data, but for now longer string is better, same length either will do.

C) I have to assume the source will be of no help and load everything now

I had hoped to use MERGE but it seems to compare all rows of the Source table and Target table before doing any of the 'MATCH' or 'NO MATCH' statements so I got PK violations, and removing the PK constraint let all the duplicates in.


If you don't have that data in SQL Server already: BULK INSERT that into a temporary table:

CREATE TABLE #tempStaging
(PK_Part1 INT, PK_Part2 INT, StringData VARCHAR(500), DateData DATE)

BULK INSERT #tempStaging
FROM 'c:\yourfile.txt'
WITH (FIELDTERMINATOR =',',
     ROWTERMINATOR ='\n')

Then you should be able to do something like:

;WITH CleaupData AS
(
  SELECT 
      PK_Part1, PK_Part2, StringData, DateData,
      ROW_NUMBER() OVER(PARTIION BY PK_Part1, PK_Part2
                        ORDER BY DateData DESC, LEN(StringData) DESC) as 'RowNum'
  FROM
      #tempStaging
)
INSERT INTO dbo.YourTargetTable(PK_Part1, PK_Part2, StringData, DateData)
    SELECT PK_Part1, PK_Part2, StringData, DateData 
    FROM CleanupData
    WHERE RowNum = 1

This will "partition" your data based on some criteria (some ID or something), and each partition of data is order by date (descending - newest first).

So the entry with the RowNum = 1 is the newest entry for each partition - pick that one and toss out all others, and your data is cleaned up !

HINT: this assumes that your target table is empty! If that's not the case, then yes - you might need to apply a MERGE statement instead, based on the CTE that selects out the data to keep from the BULK INSERT.


The data form the source should go into a temp table, a holding temp area. Then you can choose the best one from that (since your sample data contains duplicate part1+part2 even within the input data)

Sample table and temp table

create table pkdup(
    PK_Part1 int, PK_Part2 int, StringData varchar(100), DateData datetime,
    primary key (PK_Part1,PK_Part2))
insert pkdup select 1234,1234,'', GETDATE()+1000

create table #tmp(col1 nvarchar(max), col2 nvarchar(max), col3 nvarchar(max), col4 datetime)
insert #tmp values
(1234,1234,'Blah','2011-1-1'),
(1234,1234,'Blah','2011-1-1'),
(4321,4321,'Blah','2011-1-1'),
(4321,4321,'Blah','2011-10-10'),
(5678,5678,'Blah','2011-1-1'),
(5678,5678,'Blah1','2011-1-1'),
(8765,8765,'Blah','2011-1-1'),
(8765,8765,'Blah','2011-10-10'),
(8765,8765,'Blah1','2011-10-10');

The merge statement

merge pkdup as target
using (
    select col1, col2, col3, col4
    from (select *, row_number() over (
        partition by col1, col2
        order by col4 desc, len(col3) desc) rownum
        from #tmp) t
    where rownum=1 -- only the best
    ) as source
on source.col1=target.PK_Part1 and source.col2=target.PK_Part2
WHEN MATCHED AND (source.col4 > target.datedata or (source.col4=target.datedata and len(source.col3) > target.stringdata))
    THEN UPDATE SET target.stringdata = source.col3, target.datedata = source.col4
WHEN NOT MATCHED THEN
    INSERT (PK_Part1, PK_Part2, StringData, DateData)
    VALUES (source.col1, source.col2, source.col3, source.col4);


We usually put such data into a staging table and then get rid of the duplicates in the staging table before we try to run a merge statement.


not sure if you can apply a string length function in a join, but if you can, try this:

select PK_Part1, PK_Part2, max_date, max_len, first(StringData) as first_string
from        
   (select PK_Part1, PK_Part2, max_date, max(len(StringData)) as max_len
    from table inner join
           (select PK_Part1, PK_Part2, max(DateData) as max_date
           from table
           group by
           PK_Part1, PK_Part2) md
    on table.PK_Part1 = md.PK_Part1 and 
           table.PK_Part2 = md.PK_Part2 and 
           table.DateData = md.max_date
    group by
           PK_Part1, PK_Part2, max_date) ml
   inner join table on 
           table.PK_Part1 = ml.PK_Part1 and 
           table.PK_Part2 = ml.PK_Part2 and 
           table.DateData = ml.max_date and
           len(table.StringData) = ml.max_len
   group by
           PK_Part1, PK_Part2, max_date, max_len
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜