Faster way to update 250k rows with SQL
I need to update about 250k rows in a table, and each row to update gets a different value that depends on the row itself (not calculated from the row id or the key, but determined externally).
I tried a parametrized query, but it turns out to be slow. (I could still try a table-valued parameter, SqlDbType.Structured, on SQL Server 2008, but I'd like a general approach that also works on several databases, including MySQL, Oracle and Firebird.)
Concatenating the individual updates into one huge batch is also slow (though about twice as fast as making thousands of individual calls, i.e. roundtrips, with parametrized queries).
What about creating a temp table and running an UPDATE that joins my table with the temp one? Would that be faster?
How slow is "slow"?
The main problem with this is that it creates an enormous entry in the database's log file: if there's a power failure half-way through the update, the database needs to have logged each action so that it can roll back. That is most likely where the "slowness" comes from, more than anything else. (Obviously, with such a large number of rows there are other ways to make the whole thing inefficient, e.g. doing one DB roundtrip per update would be unbearably slow; I'm just saying that once you eliminate the obvious things, you'll still find it's pretty slow.)
There are a few ways you can do it more efficiently. One would be to do the update in chunks, say 1,000 rows at a time. That way, the database writes lots of small log entries rather than one really huge one.
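On SQL Server, a batched update might look something like the sketch below. The names are hypothetical: it assumes the externally computed values already sit in a staging table #Changes(Id, NewValue) and that Id is the key of TargetTable.

```sql
-- Apply the staged changes 1,000 rows at a time so each transaction
-- (and therefore each log entry) stays small.
DECLARE @done TABLE (Id INT PRIMARY KEY);

WHILE 1 = 1
BEGIN
    DELETE FROM @done;

    UPDATE TOP (1000) t
    SET    t.SomeColumn = c.NewValue
    OUTPUT inserted.Id INTO @done (Id)      -- remember which rows this batch touched
    FROM   TargetTable AS t
    JOIN   #Changes    AS c ON c.Id = t.Id;

    IF @@ROWCOUNT = 0 BREAK;                -- nothing left to apply

    -- Drop the applied rows from the staging table so the next batch moves on.
    DELETE c
    FROM   #Changes AS c
    JOIN   @done    AS d ON d.Id = c.Id;
END
```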
Another way would be to turn off, or turn "down", the database's logging for the duration of the update. In SQL Server, for example, you can set the recovery model to "simple" or "bulk-logged", which speeds things up considerably (with the caveat that you are more at risk if there's a power failure or something during the update).
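For example (SQL Server syntax, with a placeholder database name; note that leaving the full recovery model breaks the log-backup chain until you take a fresh backup):

```sql
-- Reduce logging for the duration of the bulk work, then switch back.
ALTER DATABASE MyDb SET RECOVERY BULK_LOGGED;

-- ... run the bulk load / update here ...

ALTER DATABASE MyDb SET RECOVERY FULL;
```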
Edit: Just to expand a little more, probably the most efficient way to actually execute the queries in the first place would be to do a BULK INSERT of all the new rows into a temporary table, and then do a single UPDATE of the existing table from that (or to do the UPDATE in chunks of 1,000 as I said above). Most of my answer was addressing the problem once you've implemented it like that: you'll still find it's pretty slow...
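A rough sketch of that idea (SQL Server syntax; the table, column and file names are made up, and the staging table could just as well be filled from the client with SqlBulkCopy instead of BULK INSERT from a file):

```sql
-- 1. Stage all the new values in one bulk operation.
CREATE TABLE #Changes (Id INT PRIMARY KEY, NewValue NVARCHAR(100));

BULK INSERT #Changes
FROM 'C:\data\changes.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

-- 2. Apply them with a single set-based UPDATE.
UPDATE t
SET    t.SomeColumn = c.NewValue
FROM   TargetTable AS t
JOIN   #Changes    AS c ON c.Id = t.Id;

DROP TABLE #Changes;
```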
call a stored procedure if possible
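For example, on SQL Server 2008 a stored procedure taking a table-valued parameter (the SqlDbType.Structured route the question mentions) lets the client send all the changes in one call. A minimal sketch with hypothetical names:

```sql
-- Table type describing one change: which row, and its new value.
CREATE TYPE dbo.ChangeList AS TABLE
(
    Id       INT           PRIMARY KEY,
    NewValue NVARCHAR(100) NOT NULL
);
GO

-- The procedure applies every change in a single set-based UPDATE.
CREATE PROCEDURE dbo.ApplyChanges
    @Changes dbo.ChangeList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    UPDATE t
    SET    t.SomeColumn = c.NewValue
    FROM   TargetTable AS t
    JOIN   @Changes    AS c ON c.Id = t.Id;
END
GO
```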
If the columns being updated are part of indexes, you could:
- drop these indexes
- do the update
- re-create the indexes.
If you need those indexes to retrieve data while the update runs, though, this doesn't help.
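Roughly, on SQL Server (hypothetical index, table and staging-table names):

```sql
-- Drop the index on the column being rewritten, do one big update, rebuild it.
DROP INDEX IX_TargetTable_SomeColumn ON TargetTable;

UPDATE t
SET    t.SomeColumn = c.NewValue
FROM   TargetTable AS t
JOIN   #Changes    AS c ON c.Id = t.Id;

CREATE INDEX IX_TargetTable_SomeColumn ON TargetTable (SomeColumn);
```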
You should use SqlBulkCopy with the SqlBulkCopyOptions.KeepIdentity flag set.
As part of a SqlTransaction, do a query to SELECT all the records that need updating and then DELETE them, returning the selected (and now removed) records. Read them into C# in a single batch. Update the records in memory on the C# side, now that you've narrowed the selection, then SqlBulkCopy those updated records back, keys and all. Don't forget to commit the transaction. It's more work, but it's very fast.
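A rough C# sketch of that flow, with hypothetical table and column names and no error handling (the DELETE ... OUTPUT is one way to select and remove the rows in a single statement):

```csharp
using System.Data;
using System.Data.SqlClient;

class BulkUpdateSketch
{
    static void Run(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        {
            conn.Open();
            using (var tx = conn.BeginTransaction())
            {
                // 1. Pull out the rows that need changing and delete them in one round trip.
                var rows = new DataTable();
                var cmd = new SqlCommand(
                    @"DELETE FROM TargetTable
                      OUTPUT deleted.*
                      WHERE NeedsUpdate = 1", conn, tx);
                using (var reader = cmd.ExecuteReader())
                {
                    rows.Load(reader);
                }

                // 2. Apply the externally computed values in memory.
                foreach (DataRow row in rows.Rows)
                {
                    row["SomeColumn"] = ComputeNewValue(row);
                }

                // 3. Bulk copy the modified rows back, keeping their original keys.
                using (var bulk = new SqlBulkCopy(conn, SqlBulkCopyOptions.KeepIdentity, tx))
                {
                    bulk.DestinationTableName = "TargetTable";
                    bulk.WriteToServer(rows);
                }

                tx.Commit();
            }
        }
    }

    static object ComputeNewValue(DataRow row)
    {
        // Stand-in for whatever external logic produces the new value.
        return row["SomeColumn"];
    }
}
```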
Here's what I would do:
- Retrieve the entire table, that is, the columns you need in order to calculate/retrieve/find/produce the changes externally
- Calculate/produce those changes
- Run a bulk insert to a temporary table, uploading the information you need server-side in order to do the changes. This would require the key information + new values for all the rows you intend to change.
- Run SQL on the server to copy the new values from the temporary table into the production table (see the sketch after this answer).
Pros:
- Running the final step server-side is faster than running tons and tons of individual SQL, so you're going to lock the table in question for a shorter time
- Bulk insert like this is fast
Cons:
- Requires extra space in your database for the temporary table
- Produces more log data, logging both the bulk insert and the changes to the production table
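For that final server-side step, a portable form is a correlated-subquery UPDATE, which also works on databases without SQL Server's UPDATE ... FROM join syntax (hypothetical names; Changes stands for the staging/temporary table):

```sql
-- Copy each staged new value into the production table, touching only
-- the rows that actually have a staged change.
UPDATE TargetTable
SET    SomeColumn = (SELECT c.NewValue
                     FROM   Changes c
                     WHERE  c.Id = TargetTable.Id)
WHERE  EXISTS (SELECT 1
               FROM   Changes c
               WHERE  c.Id = TargetTable.Id);
```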
Here are things that can make your updates slow:
- executing updates one by one through a parametrized query
  - solution: do the update in one statement
- a large transaction creates a big log entry
  - see codeka's answer
- updating indexes (the RDBMS updates each index after every row; if you change an indexed column, this can be very costly on a large table)
  - if you can, drop the indexes before the update and recreate them afterwards
- updating a field that has a foreign key constraint: for each updated record the RDBMS has to go and look up the corresponding key
  - if you can, disable foreign key constraints before the update and enable them afterwards (see the sketch after this list)
- triggers and row-level checks
  - if you can, disable triggers before the update and enable them afterwards (also shown in the sketch below)
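A sketch of the last two points on SQL Server (hypothetical constraint, trigger and table names; WITH CHECK makes the constraint re-validate existing rows when it is re-enabled):

```sql
-- Temporarily switch off the foreign key check and the table's triggers.
ALTER TABLE TargetTable NOCHECK CONSTRAINT FK_TargetTable_Parent;
DISABLE TRIGGER ALL ON TargetTable;

-- ... run the big update here ...

-- Switch them back on; WITH CHECK re-validates the data against the FK.
ENABLE TRIGGER ALL ON TargetTable;
ALTER TABLE TargetTable WITH CHECK CHECK CONSTRAINT FK_TargetTable_Parent;
```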