Which is faster: JOIN with GROUP BY or a Subquery?

2023-01-05 04:21 问答作者：

Let's say we have two tables: 'Car' and 'Part', with a joining table in 'Car_Part'. Say I want to see all cars that have a part 123 in them. I could do this:

SELECT Car.Col1, Car.Col2, Car.Col3 
FROM Car
INNER JOIN Car_Part ON Car_Part.Car_Id = Car.Car_Id
WHERE Car_Part.Part_Id = @part_to_look_for
G开发者_运维技巧ROUP BY Car.Col1, Car.Col2, Car.Col3

Or I could do this

SELECT Car.Col1, Car.Col2, Car.Col3 
FROM Car
WHERE Car.Car_Id IN (SELECT Car_Id FROM Car_Part WHERE Part_Id = @part_to_look_for)

Now, everything in me wants to use the first method because I've been brought up by good parents who instilled in me a puritanical hatred of sub-queries and a love of set theory, but it has been suggested to me that doing that big GROUP BY is worse than a sub-query.

I should point out that we're on SQL Server 2008. I should also say that in reality I want to select based the Part Id, Part Type and possibly other things too. So, the query I want to do actually looks like this:

SELECT Car.Col1, Car.Col2, Car.Col3 
FROM Car
INNER JOIN Car_Part ON Car_Part.Car_Id = Car.Car_Id
INNER JOIN Part ON Part.Part_Id = Car_Part.Part_Id
WHERE (@part_Id IS NULL OR Car_Part.Part_Id = @part_Id)
AND (@part_type IS NULL OR Part.Part_Type = @part_type)
GROUP BY Car.Col1, Car.Col2, Car.Col3

Or...

SELECT Car.Col1, Car.Col2, Car.Col3 
FROM Car
WHERE (@part_Id IS NULL OR Car.Car_Id IN (
    SELECT Car_Id 
    FROM Car_Part 
    WHERE Part_Id = @part_Id))
AND (@part_type IS NULL OR Car.Car_Id IN (
    SELECT Car_Id
    FROM Car_Part
    INNER JOIN Part ON Part.Part_Id = Car_Part.Part_Id
    WHERE Part.Part_Type = @part_type))

The best thing you can do is test them yourself, on realistic data volumes. That would not only benefit for this query, but for all future queries when you are not sure which is the best way.

Important things to do include:
- test on production level data volumes
- test fairly & consistently (clear cache: http://www.adathedev.co.uk/2010/02/would-you-like-sql-cache-with-that.html)
- check the execution plan

You could either monitor using SQL Profiler and check the duration/reads/writes/CPU there, or SET STATISTICS IO ON; SET STATISTICS TIME ON; to output stats in SSMS. Then compare the stats for each query.

If you can't do this type of testing, you'll be potentially exposing yourself to performance problems down the line that you'll have to then tune/rectify. There are tools out there you can use that will generate data for you.

I have similar data so I checked the execution plan for both styles of query. To my surprise, the Column In Subquery (CIS) produced an execution plan with 25% less I/O cost to than the inner join (IJ) query. In the CIS execution plan I get an 2 index scans of the intermediate table (Car_Part) versus an index scan of the intermediate and a relatively more expensive hash join in the IJ. My indexes are healthy but non-clustered so it stands to reason that the index scans might be made a bit faster by clustering them. I doubt this would impact the cost of the hash join which is the more expensive step in the IJ query.

Like the others have pointed out, it depends on your data. If you're working with many gigabytes in these 3 tables then tune away. If your rows are numbered in the hundreds or thousands then you might be splitting hairs over a very small performance gain. I would say that the IJ query is much more readable so as long as it's good enough, do any future developer who touches your code a favour and give them something easier to read. The row count in my tables are 188877, 283912, 13054 and both queries returned in less time that it took to sip coffee.

Small postscript: since you're not aggregating any numerical values, it looks like you mean to select distinct. Unless you're actually going to do something with the group, it's easier to see your intention with select distinct rather than group by at the end. IO cost is the same but one indicates your intention better IMHO.

With SQL Server 2008 I would expect In to be quicker as it is equivalent to this.

SELECT Car.Col1, Car.Col2, Car.Col3 
FROM Car
WHERE EXISTS(SELECT * FROM Car_Part
            WHERE Car_Part.Car_Id = Car.Car_Id
            AND Car_Part.Part_Id = @part_to_look_for
)

i.e. it only has to check for the existence of the row not join it then remove duplicates. This is discussed here.

继续阅读：group-by join sql-server sql-server-2008 subquery

Which is faster: JOIN with GROUP BY or a Subquery?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？