开发者

Query with a UNION sub-query takes a very long time

I've been having an odd problem with some queries that depend on a sub query. They run lightning fast, until I use a UNION statement in the sub query. Then they run endlessly, I've given after 10 mi开发者_开发技巧nutes. The scenario I'm describing now isn't the original one I started with, but I think it cuts out a lot of possible problems yet yields the same problem. So even though it's a pointless query, bear with me!

I have a table:

tblUser - 100,000 rows
tblFavourites - 200,000 rows

If I execute:

SELECT COUNT(*) 
FROM tblFavourites 
WHERE userID NOT IN (SELECT uid FROM tblUser);  

… then it runs in under a second. However, if I modify it so that the sub query has a UNION, it will run for at least 10 minutes (before I give up!)

SELECT COUNT(*) 
FROM tblFavourites 
WHERE userID NOT IN (SELECT uid FROM tblUser UNION SELECT uid FROM tblUser);  

A pointless change, but it should yield the same result and I don't see why it should take any longer?

Putting the sub-query into a view and calling that instead has the same effect.

Any ideas why this would be? I'm using SQL Azure.


Problem solved. See my answer below.



UNION generate unique values, so the DBMS engine makes sorts. You can use safely UNION ALL in this case.


UNION is really doing a DISTINCT on all fields in the combined data set. It filters out dupes in the final results.

Is Uid indexed? If not it may take a long time as the query engine:

  • Generates the first result set
  • Generates the second result set
  • Filters out all the dupes (which is half the records) in a hash table

If duplicates aren't a concern (and using IN means they won't be) then use UNION ALL which removes the expensive Sort/Filter step.


UNION's are usually implemented via temporary in-memory tables. You're essentially copying your tblUser two times into memory, WITH NO INDEX. Then every row in tblFavourites incur a complete table scan over 200,000 rows - that's 200Kx200K=40 billion double-row scans (because the query engine must get the uid from both table rows)

If your tblUser has an index on uid (which is definitely true because all tables in SQL Azure must have a clustered index), then each row in tblFavourites incurs a very fast index lookup, resulting in only 200Kxlog(100K) =200Kx17 = 200K row scans, each with 17 b-tree index comparisons (which is much faster than reading the uid from a row on a data page), so it should equate to roughly 200Kx(3-4) or 1 million double-row scans. I believe newer versions of SQL server may also build a temp hash table containing just the uid's, so essentially it gets down to 200K row scans (assuming hash table lookup to be trivial).

You should also generate your query plan to check.

Essentially, the non-UNION query runs around 500,000 times faster if tblUser has an index (should be on SQL Azure).


It turns out the problem was due to one of the indexes ... tblFavourites contained two foreign keys to the primary key (uid) in tblUser:

userId
otherUserId

both columns had the same definition and same indexes, but I discovered that swapping userId for otherUserId in the original query solved the problem.

I ran:

ALTER INDEX ALL ON tblFavourites REBUILD

... and the problem went away. The query now executes almost instantly.

I don't know too much about what goes on behind the scenes in Sql Server/Azure ... but I can only imagine that it was a damaged index or something? I update statistics frequently, but that had no effect.

Thanks!

---- UPDATE

The above was not fully correct. It did fix the problem for around 20 minutes, then it returned. I have been in touch with Microsoft support for several days and it seems the problem is to do with the tempDB. They are working on a solution at their end.


I just ran into this problem. I have about 1million rows to go through and then I realized that some of my IDs were in another table, so I unioned to get the same information in one "NOT EXISTS." I went from the query taking about 7 sec to processing only 5000 rows after a minute or so. This seemed to help. I absolutely hate the solution, but I've tried a multitude of things that all end up w/the same extremely slow execution plan. This one got me what I needed in about 18 sec.

    DECLARE @PIDS TABLE ([PID] [INT] PRIMARY KEY)
    INSERT INTO @PIDS SELECT DISTINCT [ID] FROM [STAGE_TABLE] WITH(NOLOCK)
    INSERT INTO @PIDS SELECT DISTINCT [OTHERID] FROM [PRODUCTION_TABLE] WITH(NOLOCK)
        WHERE NOT EXISTS(SELECT [PID] FROM @PIDS WHERE [PID] = [OTHERID]

    SELECT (columns needed)
    FROM [ORDER_HEADER] [OH] WITH(NOLOCK)
    INNER JOIN @PIDS ON [OH].[SOME_ID] = [PID]

(And yes I tried "WHERE EXISTS IN..." for the final select... inner join was faster) Please let me say again, I personaly feel this is really ugly, but I actually use this join twice in my proc, so it's going to save me time in the long run. Hope this helps.


Doesn't it make more sense to rephrase the questions from

"UserIds that aren't on the combined list of all the Ids that apper in this table and/or that table"

to

"UserIds that aren't on this table AND aren't on that table either

SELECT COUNT(*) 
FROM tblFavourites 
WHERE userID NOT IN (SELECT uid FROM tblUser) 
AND   userID NOT IN (SELECT uid FROM tblUser);  
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜