Optimization suggestions needed for a SQL UPDATE statement. Two ~5 million record tables are involved
I'm looking for any suggestions to optimize the following PROC SQL statement from a SAS program. The two tables involved contain around 5 million records each and the runtime is about 46 hours.
The statement updates a "new" version of the "old" table. It sets a column when a "PK_ID" in the "old" table was listed without a value for "3RD_ID" and "CODE", but in the "new" table that same "PK_ID" is now listed WITH a value for "3RD_ID" and "CODE".
Thanks for any suggestions!
PROC SQL _METHOD;
    UPDATE NEW_TABLE AS N
        SET NEW_2ND_ID =
            (SELECT 2ND_ID
             FROM OLD_TABLE AS O
             WHERE N.PK_ID = O.PK_ID
               AND N.2ND_ID <> O.2ND_ID
               AND O.3RD_ID IS NULL
               AND O.CODE IS NULL
               AND N.3RD_ID IS NOT NULL
               AND N.CODE IS NOT NULL
               AND N.2ND_ID IS NOT NULL)
        WHERE N.3RD_ID IS NOT NULL
          AND N.PK_ID IS NOT NULL
          AND N.CODE IS NOT NULL
          AND N.2ND_ID IS NOT NULL;
QUIT;
All of the answers so far are firmly oriented toward the SQL part of your question, but neglect the SAS part to some extent. I would strongly recommend trying a data step update/modify/merge instead of PROC SQL for this kind of update. It should be possible to sort both tables and apply logic similar to your SQL's to ensure that the correct rows/columns are updated (a sketch follows below).
I've seen similar kinds of updates run in a matter of minutes on 20 million or more rows.
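For example, here's a minimal sketch of the sort-and-merge approach. It assumes PK_ID is unique in each table, and it uses SAS-legal stand-in names (SECOND_ID for 2ND_ID, THIRD_ID for 3RD_ID, NEW_SECOND_ID for NEW_2ND_ID), since SAS variable names can't begin with a digit; NEW_TABLE2 is just the output copy:

proc sort data=OLD_TABLE; by PK_ID; run;
proc sort data=NEW_TABLE; by PK_ID; run;

data NEW_TABLE2;
    merge NEW_TABLE (in=n)
          OLD_TABLE (in=o keep=PK_ID SECOND_ID THIRD_ID CODE
                          rename=(SECOND_ID=O_2ND_ID THIRD_ID=O_3RD_ID CODE=O_CODE));
    by PK_ID;
    if n;                                   /* keep every row of the new table */
    if o and missing(O_3RD_ID) and missing(O_CODE)
         and not missing(THIRD_ID) and not missing(CODE)
         and not missing(SECOND_ID) and SECOND_ID ne O_2ND_ID
        then NEW_SECOND_ID = O_2ND_ID;      /* same test as the correlated subquery */
    drop O_2ND_ID O_3RD_ID O_CODE;
run;

Two sequential sorts plus one sequential pass over each table is typically far cheaper than 5 million correlated subquery executions.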
Also, check out http://runsubmit.com, a SAS-specific, Stack Overflow-style site, for more SAS-specific answers.
Disclosure: I'm a SAS employee. I have nothing to do with runsubmit, which is independently run.
I am not familiar with the variant of SQL you're using. However, whether you get better performance or not, you should be using the ANSI join syntax. Here is what it would look like in T-SQL, modify it for your system:
UPDATE N
SET N.2ND_ID = O.2ND_ID
FROM
NEW_TABLE AS N
INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
WHERE
N.2ND_ID <> O.2ND_ID
AND N.3RD_ID IS NOT NULL
AND O.3RD_ID IS NULL
AND N.CODE IS NOT NULL
AND O.CODE IS NULL
Note that the extra conditions I removed aren't necessary; for example, N.2ND_ID <> O.2ND_ID already guarantees that neither of those columns is null.
However, on two 5-million row tables you're going to get abysmal performance. Here are some ideas to speed it up. I bet that you can get this down to under an hour with the right combination of strategies.
Split the update into batches (small pieces, looping through the entire set). While this sounds counter to the normal database advice of "don't loop, use sets" it really isn't: you're just using smaller sets, not looping at the row level. The best way to batch an update like this is to "walk the clustered index." I'm not sure if that term makes sense in the DBMS you're using, but essentially it means to choose the chunks you update during each loop based on the order they will be found in the table object that you're updating. PK_ID sounds like it is the candidate to use, but if the raw table data is not ordered by this column then it will become more complicated. In T-SQL a batching loop might look like this:
DECLARE @ID int, @Count int
SET @ID = 1
SET @Count = 1

WHILE @Count > 0
BEGIN
    UPDATE N
    SET N.2ND_ID = O.2ND_ID
    FROM NEW_TABLE AS N
    INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
    WHERE N.2ND_ID <> O.2ND_ID
        AND N.3RD_ID IS NOT NULL
        AND O.3RD_ID IS NULL
        AND N.CODE IS NOT NULL
        AND O.CODE IS NULL
        AND N.PK_ID BETWEEN @ID AND @ID + 4999

    SET @Count = @@ROWCOUNT
    SET @ID = @ID + 5000
END
This example assumes that your PK_ID column is densely packed, so that each update truly hits about 5000 rows. If that's not the case, then switch to a method using TOP 5000: either output the updated PK_IDs into a table, or find the @StartID and @EndID for the next update in one step, then perform it. A sketch of that variant follows.
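A hedged sketch of the TOP-based variant, where #Done is a scratch table introduced here to track the keys already processed (the OUTPUT clause captures them as each batch commits):

CREATE TABLE #Done (PK_ID int PRIMARY KEY)

DECLARE @Count int
SET @Count = 1

WHILE @Count > 0
BEGIN
    UPDATE TOP (5000) N
    SET N.2ND_ID = O.2ND_ID
    OUTPUT INSERTED.PK_ID INTO #Done (PK_ID)    -- remember which keys are done
    FROM NEW_TABLE AS N
    INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
    WHERE N.2ND_ID <> O.2ND_ID
        AND N.3RD_ID IS NOT NULL
        AND O.3RD_ID IS NULL
        AND N.CODE IS NOT NULL
        AND O.CODE IS NULL
        AND N.PK_ID NOT IN (SELECT PK_ID FROM #Done)

    SET @Count = @@ROWCOUNT
END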
In my experience, good batch sizes tend to be between 1,000 and 20,000 rows. In MS SQL Server, the sweet spot seems to be just below the number that forces a switch from a seek to a scan (eventually the engine assumes a single scan is cheaper than a multitude of seeks, though it is often wrong when dealing with 5-million-row tables).
Select the IDs and data to be updated into a working/temp table first, then join to that. The idea is to take the hit of the huge scan with a simple INSERT statement, then add indexes to the temp table and perform the update without needing a complicated WHERE clause. Once the table contains only the rows to be updated and the columns required, not only can the WHERE clause lose most of its conditions, but the temp table has many fewer rows and many more rows per page (because it has no extraneous columns), which will greatly improve performance. This can even be done in stages where a "shadow" of the New table is created, then a "shadow" of the Old table, then the join between them, and finally a join back to the New table to update it. While this sounds like a lot of work, I think you will be surprised at the totally-crazy speed of completion this can offer.
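A hedged sketch of the staging idea in T-SQL, where #ToUpdate is a scratch table invented for this example (assuming 2ND_ID is an int; adjust the type to match your data):

CREATE TABLE #ToUpdate (PK_ID int PRIMARY KEY, OLD_2ND_ID int)

-- Take the hit of the huge scan once, with a plain INSERT
INSERT INTO #ToUpdate (PK_ID, OLD_2ND_ID)
SELECT N.PK_ID, O.2ND_ID
FROM NEW_TABLE AS N
INNER JOIN OLD_TABLE AS O ON N.PK_ID = O.PK_ID
WHERE N.2ND_ID <> O.2ND_ID
    AND N.3RD_ID IS NOT NULL AND O.3RD_ID IS NULL
    AND N.CODE IS NOT NULL AND O.CODE IS NULL

-- The update itself now needs no complicated WHERE clause
UPDATE N
SET N.2ND_ID = T.OLD_2ND_ID
FROM NEW_TABLE AS N
INNER JOIN #ToUpdate AS T ON N.PK_ID = T.PK_ID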
Anything you can do to convert the read from the old table to be seeks instead of scans will help. Anything you can do to reduce the amount of disk used to hold temp data (such as gigantic hash tables for 5 million rows) will help.
Don't use UPDATE; create a similar new table and use an INSERT INTO (fields) SELECT fields FROM both tables instead.
- Drop indexes before running the query.
- Drop triggers before running the query.
Something like:
insert into NEW_TABLE_COPY (field1, field2, NEW_2ND_ID)   -- the freshly created replacement table
select field1, field2, (SELECT 2ND_ID FROM OLD_TABLE ....) from NEW_TABLE
- Recreate indexes after the query finishes.
- Recreate triggers after the query finishes.
(In the end, you will replace your existing table with this new one.)
Right now it is a nested subquery, so the SELECT statement will be fired once for each record that matches the WHERE condition.
I would recommend going for a SQL UPDATE with a JOIN instead. An explanation can be found here: http://bytes.com/topic/oracle/answers/65819-sql-update-join-syntax.
Once you have the UPDATE with JOIN in place, apply proper indexes.
Also note that you should not keep indexes that include 2ND_ID active during the update; they should be disabled and then rebuilt afterwards, since this can be a mass data update.
UPDATE (
    SELECT O.2ND_ID AS OLD_2ND_ID, N.2ND_ID AS NEW_2ND_ID
    FROM OLD_TABLE O
    INNER JOIN NEW_TABLE N ON O.PK_ID = N.PK_ID
    WHERE N.2ND_ID <> O.2ND_ID
      AND O.3RD_ID IS NULL
      AND O.CODE IS NULL
      AND N.3RD_ID IS NOT NULL
      AND N.CODE IS NOT NULL
      AND N.2ND_ID IS NOT NULL
) t
SET t.NEW_2ND_ID = t.OLD_2ND_ID
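For the index advice above, a sketch in SQL Server syntax (IX_NEW_2ND_ID is a hypothetical index name; in Oracle the equivalent would be ALTER INDEX ... UNUSABLE before and ALTER INDEX ... REBUILD after):

-- Before the mass update: stop maintaining the index on 2ND_ID
ALTER INDEX IX_NEW_2ND_ID ON NEW_TABLE DISABLE

-- ... run the UPDATE ...

-- After the update: rebuild the index in one pass
ALTER INDEX IX_NEW_2ND_ID ON NEW_TABLE REBUILD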
You can also try putting the new table in a different device than the old one, to take advantage of its parallelism. If you can convince the DBA, of course.