Optimization suggestions needed for a SQL UPDATE statement. Two ~5 million record tables are involved
I'm looking for any suggestions to optimize the following PROC SQL statement from a SAS program. The two tables involved contain around 5 million records each and the runtime is about 46 hours.
The statement updates a "new" version of the "old" table. It sets a column when a "PK_ID" in the "old" table was listed without a value for "3RD_ID" and "CODE", but in the "new" table that same "PK_ID" is now listed WITH a value for "3RD_ID" and "CODE".
Thanks for any suggestions!
PROC SQL _METHOD;
    UPDATE NEW_TABLE AS N
        SET NEW_2ND_ID =
            (SELECT 2ND_ID
             FROM OLD_TABLE AS O
             WHERE N.PK_ID = O.PK_ID
               AND N.2ND_ID <> O.2ND_ID
               AND O.3RD_ID IS NULL
               AND O.CODE IS NULL
               AND N.3RD_ID IS NOT NULL
               AND N.CODE IS NOT NULL
               AND N.2ND_ID IS NOT NULL)
        WHERE N.3RD_ID IS NOT NULL
          AND N.PK_ID IS NOT NULL
          AND N.CODE IS NOT NULL
          AND N.2ND_ID IS NOT NULL;
QUIT;
All of the answers so far are firmly oriented toward the SQL part of your question, but neglect the SAS part to some extent. I would strongly recommend trying a data step update/modify/merge instead of PROC SQL for this kind of update. It should be possible to sort both tables and apply logic similar to your SQL's to ensure that the correct rows/columns are updated (a sketch follows below).
I've seen similar kinds of updates run in a matter of minutes on 20 million or more rows.
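For example, here's a minimal sketch of the sort-and-merge approach. It assumes PK_ID is unique in each table, and it uses SAS-legal stand-in names (SECOND_ID for 2ND_ID, THIRD_ID for 3RD_ID, NEW_SECOND_ID for NEW_2ND_ID), since SAS variable names can't begin with a digit; NEW_TABLE2 is just the output copy:

proc sort data=OLD_TABLE; by PK_ID; run;
proc sort data=NEW_TABLE; by PK_ID; run;

data NEW_TABLE2;
    merge NEW_TABLE (in=n)
          OLD_TABLE (in=o keep=PK_ID SECOND_ID THIRD_ID CODE
                          rename=(SECOND_ID=O_2ND_ID THIRD_ID=O_3RD_ID CODE=O_CODE));
    by PK_ID;
    if n;                                   /* keep every row of the new table */
    if o and missing(O_3RD_ID) and missing(O_CODE)
         and not missing(THIRD_ID) and not missing(CODE)
         and not missing(SECOND_ID) and SECOND_ID ne O_2ND_ID
        then NEW_SECOND_ID = O_2ND_ID;      /* same test as the correlated subquery */
    drop O_2ND_ID O_3RD_ID O_CODE;
run;

Two sequential sorts plus one sequential pass over each table is typically far cheaper than 5 million correlated subquery executions.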
Also, check out http://runsubmit.com, a SAS-specific, Stack Overflow-style site, for more SAS-specific answers.
Disclosure: I'm a SAS employee. I have nothing to do with runsubmit, which is independently run.
I am not familiar with the variant of SQL you're using. However, whether you get better performance or not, you should be using the ANSI join syntax. Here is what it would look like in T-SQL, modify it for your system:
UPDATE N
SET N.2ND_ID = O.2ND_ID
FROM
NEW_TABLE AS N
INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
WHERE
N.2ND_ID <> O.2ND_ID
AND N.3RD_ID IS NOT NULL
AND O.3RD_ID IS NULL
AND N.CODE IS NOT NULL
AND O.CODE IS NULL
Note that the extra conditions I removed aren't necessary; for example, N.2ND_ID <> O.2ND_ID already guarantees that neither of those columns is null.
However, on two 5-million row tables you're going to get abysmal performance. Here are some ideas to speed it up. I bet that you can get this down to under an hour with the right combination of strategies.
Split the update into batches (small pieces, looping through the entire set). While this sounds counter to the normal database advice of "don't loop, use sets" it really isn't: you're just using smaller sets, not looping at the row level. The best way to batch an update like this is to "walk the clustered index." I'm not sure if that term makes sense in the DBMS you're using, but essentially it means to choose the chunks you update during each loop based on the order they will be found in the table object that you're updating. PK_ID sounds like it is the candidate to use, but if the raw table data is not ordered by this column then it will become more complicated. In T-SQL a batching loop might look like this:
DECLARE @ID int, @Count int
SET @ID = 1
SET @Count = 1

WHILE @Count > 0
BEGIN
    UPDATE N
    SET N.2ND_ID = O.2ND_ID
    FROM NEW_TABLE AS N
    INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
    WHERE N.2ND_ID <> O.2ND_ID
        AND N.3RD_ID IS NOT NULL
        AND O.3RD_ID IS NULL
        AND N.CODE IS NOT NULL
        AND O.CODE IS NULL
        AND N.PK_ID BETWEEN @ID AND @ID + 4999

    SET @Count = @@ROWCOUNT
    SET @ID = @ID + 5000
END
This example assumes that your PK_ID column is densely packed, so that each update truly hits about 5000 rows. If that's not the case, then switch to a method using TOP 5000: either output the updated PK_IDs into a table, or find the @StartID and @EndID for the next update in one step, then perform it. A sketch of that variant follows.
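A hedged sketch of the TOP-based variant, where #Done is a scratch table introduced here to track the keys already processed (the OUTPUT clause captures them as each batch commits):

CREATE TABLE #Done (PK_ID int PRIMARY KEY)

DECLARE @Count int
SET @Count = 1

WHILE @Count > 0
BEGIN
    UPDATE TOP (5000) N
    SET N.2ND_ID = O.2ND_ID
    OUTPUT INSERTED.PK_ID INTO #Done (PK_ID)    -- remember which keys are done
    FROM NEW_TABLE AS N
    INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
    WHERE N.2ND_ID <> O.2ND_ID
        AND N.3RD_ID IS NOT NULL
        AND O.3RD_ID IS NULL
        AND N.CODE IS NOT NULL
        AND O.CODE IS NULL
        AND N.PK_ID NOT IN (SELECT PK_ID FROM #Done)

    SET @Count = @@ROWCOUNT
END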
In my experience, good batch sizes tend to be between 1,000 and 20,000 rows. In MS SQL Server, the sweet spot seems to be just below the number that forces a switch from a seek to a scan (eventually the engine assumes a single scan is cheaper than a multitude of seeks, though it is often wrong when dealing with 5-million-row tables).
Select the IDs and data to be updated into a working/temp table first, then join to that. The idea is to take the hit of the huge scan with a simple INSERT statement, then add indexes to the temp table and perform the update without needing a complicated WHERE clause. Once the table contains only the rows to be updated and the columns required, not only can the WHERE clause lose most of its conditions, but the temp table has many fewer rows and many more rows per page (because it has no extraneous columns), which will greatly improve performance. This can even be done in stages where a "shadow" of the New table is created, then a "shadow" of the Old table, then the join between them, and finally a join back to the New table to update it. While this sounds like a lot of work, I think you will be surprised at the totally-crazy speed of completion this can offer.
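A hedged sketch of the staging idea in T-SQL, where #ToUpdate is a scratch table invented for this example (assuming 2ND_ID is an int; adjust the type to match your data):

CREATE TABLE #ToUpdate (PK_ID int PRIMARY KEY, OLD_2ND_ID int)

-- Take the hit of the huge scan once, with a plain INSERT
INSERT INTO #ToUpdate (PK_ID, OLD_2ND_ID)
SELECT N.PK_ID, O.2ND_ID
FROM NEW_TABLE AS N
INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
WHERE N.2ND_ID <> O.2ND_ID
    AND N.3RD_ID IS NOT NULL AND O.3RD_ID IS NULL
    AND N.CODE IS NOT NULL AND O.CODE IS NULL

-- The update itself now needs no complicated WHERE clause
UPDATE N
SET N.2ND_ID = T.OLD_2ND_ID
FROM NEW_TABLE AS N
INNER JOIN #ToUpdate AS T ON N.PK_ID = T.PK_ID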
Anything you can do to convert the read from the old table to be seeks instead of scans will help. Anything you can do to reduce the amount of disk used to hold temp data (such as gigantic hash tables for 5 million rows) will help.
Don't use UPDATE; create a similar new table and use an INSERT INTO (fields) SELECT fields FROM both tables instead.
- Drop indexes before running the query.
- Drop triggers before running the query.
Something like:
insert into NEW_TABLE_COPY (field1, field2, NEW_2ND_ID)   -- the freshly created replacement table
select field1, field2, (SELECT 2ND_ID FROM OLD_TABLE ....) from NEW_TABLE
- Recreate indexes after the query finishes.
- Recreate triggers after the query finishes.
(In the end, you will replace your existing table with this new one.)
Right now it is a nested subquery, so the SELECT statement will be fired once for each record that matches the WHERE condition.
I would recommend going for a SQL UPDATE with a JOIN instead. An explanation can be found here: http://bytes.com/topic/oracle/answers/65819-sql-update-join-syntax.
Once you have the UPDATE with JOIN in place, apply proper indexes.
Also note that you should not keep indexes that include 2ND_ID active during the update; they should be disabled and then rebuilt afterwards, since this can be a mass data update.
UPDATE (
    SELECT O.2ND_ID AS OLD_2ND_ID, N.2ND_ID AS NEW_2ND_ID
    FROM OLD_TABLE O
    INNER JOIN NEW_TABLE N ON O.PK_ID = N.PK_ID
    WHERE N.2ND_ID <> O.2ND_ID
      AND O.3RD_ID IS NULL
      AND O.CODE IS NULL
      AND N.3RD_ID IS NOT NULL
      AND N.CODE IS NOT NULL
      AND N.2ND_ID IS NOT NULL
) t
SET t.NEW_2ND_ID = t.OLD_2ND_ID
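For the index advice above, a sketch in SQL Server syntax (IX_NEW_2ND_ID is a hypothetical index name; in Oracle the equivalent would be ALTER INDEX ... UNUSABLE before and ALTER INDEX ... REBUILD after):

-- Before the mass update: stop maintaining the index on 2ND_ID
ALTER INDEX IX_NEW_2ND_ID ON NEW_TABLE DISABLE

-- ... run the UPDATE ...

-- After the update: rebuild the index in one pass
ALTER INDEX IX_NEW_2ND_ID ON NEW_TABLE REBUILD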
You can also try putting the new table in a different device than the old one, to take advantage of its parallelism. If you can convince the DBA, of course.