Huge performance dip with TOP in SQL Server
Consider this SQL performance difference, where in the first, I select all 26,000 rows, and in the second, I just want the first 5.
SELECT tw.*
FROM entity e
JOIN entity_tag et on et.entity_id = e.id
JOIN tag t on t.tag_id = et.tag_id
JOIN tagrelatedtweets trt on trt.FK_Tag_ID = t.tag_id
JOIN tweets tw on tw.PK_Tweet_ID = trt.FK_Tweet_ID
WHERE e.id = 765131
ORDER BY tw.[timestamp]
vs
SELECT TOP (5) tw.*
FROM entity e
JOIN entity_tag et on et.entity_id = e.id
JOIN tag t on t.tag_id = et.tag_id
JOIN tagrelatedtweets trt on trt.FK_Tag_ID = t.tag_id
JOIN tweets tw on tw.PK_Tweet_ID = trt.FK_Tweet_I开发者_开发百科D
WHERE e.id = 765131
ORDER BY tw.[timestamp]
Without: CPU = 201 | Reads: 6880 | Writes: 0 | Duration: 451
With: CPU = 302439 | Reads: 7453199 | Writes: 3169 | Duration: 74188
This just doesn't make sense to me... Is there another way to go about this?
After Martin's suggestion of a REBUILD STATISTICS on all tables involved, there is a small improvement, but the trick with changing the TOP amount into a parameter works the best.
Before statistics rebuild:
CPU = 302439 | Reads: 7453199 | Writes: 3169 | Duration: 74188
After statistics rebuild:
CPU = 127734 | Reads: 4100436 | Writes: 2656 | Duration: 16880
With parameter:
CPU = 218 | Reads: 6899 | Writes: 0 | Duration: 83
Query with parameter:
DECLARE @TOP INT; SET @TOP=5;
SELECT TOP (@TOP) tw.*
FROM entity e
JOIN entity_tag et on et.entity_id = e.id
JOIN tag t on t.tag_id = et.tag_id
JOIN tagrelatedtweets trt on trt.FK_Tag_ID = t.tag_id
JOIN tweets tw on tw.PK_Tweet_ID = trt.FK_Tweet_ID
WHERE e.id = 765131
ORDER BY tw.timestamp desc
One last remark for those of you using Entity Framework; if you experience this behavior, you can simulate the same parameter-based behavior as follows:
.Take(100).ToList().Take(5)
I know it isn't pretty, but it's the only way to trigger the correct execution plan if you are using entity framework as far as I can tell.
Thanks a lot for pointing me in the correct direction Martin!
Following discussion in the comments it seems that for some reason the plan SQL Server is choosing to optimize for the "First 5 Rows Only" case is sub optimal. It probably will be using non blocking operators such as nested loops joins rather than blocking ones such as hash joins.
As you are unable to post the execution plans it is difficult to know exactly the reason why but comparing the plans for both versions of the query and looking at actual vs estimated number of rows in the actual execution plan for the problem query should set some light on the matter.
It seems in this case that hiding the TOP 5
information from the optimiser at compile time is sufficient to give you the plan you want however!
精彩评论