开发者

tsql bulk update

MyTableA has several million records. 开发者_StackOverflow中文版On regular occasions every row in MyTableA needs to be updated with values from TheirTableA.

Unfortunately I have no control over TheirTableA and there is no field to indicate if anything in TheirTableA has changed so I either just update everything or I update based on comparing every field which could be different (not really feasible as this is a long and wide table).

Unfortunately the transaction log is ballooning doing a straight update so I wanted to chunk it by using UPDATE TOP, however, as I understand it I need some field to determine if the records in MyTableA have been updated yet or not otherwise I'll end up in an infinite loop:

declare @again as bit;
set @again = 1;

while @again = 1
  begin
    update top (10000) MyTableA
    set my.A1 = their.A1, my.A2 = their.A2, my.A3 = their.A3
    from MyTableA my
    join TheirTableA their on my.Id = their.Id

    if @@ROWCOUNT > 0
      set @again = 1
    else
      set @again = 0
end

is the only way this will work if I add in a

where my.A1 <> their.A1 and my.A2 <> their.A2 and my.A3 <> their.A3

this seems like it will be horribly inefficient with many columns to compare

I'm sure I'm missing an obvious alternative?


Assuming both tables are the same structure, you can get a resultset of rows that are different using

SELECT * into #different_rows from MyTable EXCEPT select * from TheirTable and then update from that using whatever key fields are available.


Well, the first, and simplest solution, would obviously be if you could change the schema to include a timestamp for last update - and then only update the rows with a timestamp newer than your last change.

But if that is not possible, another way to go could be to use the HashBytes function, perhaps by concatenating the fields into an xml that you then compare. The caveat here is an 8kb limit (https://connect.microsoft.com/SQLServer/feedback/details/273429/hashbytes-function-should-support-large-data-types) EDIT: Once again, I have stolen code, this time from:

http://sqlblogcasts.com/blogs/tonyrogerson/archive/2009/10/21/detecting-changed-rows-in-a-trigger-using-hashbytes-and-without-eventdata-and-or-s.aspx

His example is:

select batch_id
from (
    select distinct batch_id, hash_combined = hashbytes( 'sha1', combined )
    from (  select batch_id,
                   combined =(  select batch_id, batch_name, some_parm, some_parm2
                                from deleted c       --  need old values
                                where c.batch_id = d.batch_id
                                for xml path( '' ) )
            from deleted d
            union all
            select batch_id,
                   combined =(  select batch_id, batch_name, some_parm, some_parm2
                                from some_base_table c       --  need current values (could use inserted here)
                                where c.batch_id = d.batch_id
                                for xml path( '' ) )
            from deleted d
        ) as r
    ) as c
group by batch_id
having count(*) > 1

A last resort (and my original suggestion) is to try Binary_Checksum? As noted in the comment, this does open the risk for a rather high collision rate.

http://msdn.microsoft.com/en-us/library/ms173784.aspx

I have stolen the following example from lessthandot.com - link to the full SQL (and other cool functions) is below.

--Data Mismatch
SELECT 'Data Mismatch', t1.au_id
FROM( SELECT BINARY_CHECKSUM(*) AS CheckSum1 ,au_id FROM pubs..authors) t1
JOIN(SELECT BINARY_CHECKSUM(*) AS CheckSum2,au_id FROM tempdb..authors2) t2 ON t1.au_id =t2.au_id
WHERE CheckSum1 <> CheckSum2

Example taken from http://wiki.lessthandot.com/index.php/Ten_SQL_Server_Functions_That_You_Have_Ignored_Until_Now


I don't know if this is better than adding where my.A1 <> their.A1 and my.A2 <> their.A2 and my.A3 <> their.A3, but I would definitely give it a try (assuming SQL Server 2005+):

declare @again as bit;
set @again = 1;

declare @idlist table (Id int);

while @again = 1
  begin
    update top (10000) MyTableA
    set my.A1 = their.A1, my.A2 = their.A2, my.A3 = their.A3
    output inserted.Id into @idlist (Id)
    from MyTableA my
    join TheirTableA their on my.Id = their.Id
    left join @idlist i on my.Id = i.Id
    where i.Id is null
    /* alternatively (instead of left join + where):
    where not exists (select * from @idlist where Id = my.Id) */

    if @@ROWCOUNT > 0
      set @again = 1
    else
      set @again = 0
end

That is, declare a table variable for collecting the IDs of the rows being updated and use that table for looking up (and omitting) IDs that have already been updated.

A slight variation on the method would be to use a local temporary table instead of a table variable. That way you would be able to create an index on the ID lookup table, which might result in better performance.


If schema change is not possible. How about using trigger to save off the Ids that have changed. And only import/export those rows.

Or use trigger to export it immediately.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜