Optimizing multiple joins
I'm trying to figure out a way to speed up a particularly cumbersome query which aggregates some data by date across a couple of tables. The full (ugly) query is below along with an EXPLAIN ANALYZE
to show just how horrible it is.
If anyone could take a peek and see if they can spot any major issues (which is likely, I'm not a Postgres guy) that would be superb.
So here goes. The query is:
SELECT
to_char(p.period, 'DD/MM/YY') as period,
coalesce(o.value, 0) AS outbound,
coalesce(i.value, 0) AS inbound
FROM (
SELECT
date '2009-10-01' + s.day
AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS p
LEFT OUTER JOIN(
SELECT
SUM(b.body_size) AS value,
b.body_time::date AS period
FROM body AS b
LEFT JOIN
envelope e ON e.message_id = b.message_id
WHERE
e.envelope_command = 1
AND b.body_time BETWEEN '2009-10-01'
AND (date '2009-10-31' + INTERVAL '1 DAY')
GROUP BY period
ORDER BY period
) AS o ON p.period = o.period
LEFT OUTER JOIN(
SELECT
SUM(b.body_size) AS value,
b.body_time::date AS period
FROM body AS b
LEFT JOIN
envelope e ON e.message_id = b.message_id
WHERE
e.envelope_command = 2
AND b.body_time BETWEEN '2009-10-01'
AND (date '2009-10-31' + INTERVAL '1 DAY')
GROUP BY period
ORDER BY period
) AS i ON p.period = i.period
The EXPLAIN ANALYZE
can be found here: on explain.depesz.c开发者_如何学编程om
Any comments or questions are appreciated.
Cheers
There are always 2 things to consider when optimising queries:
- What indexes can be used (you may need to create indexes)
- How the query is written (you may need to change the query to allow the query optimser to be able to find appropriate indexes, and to not re-read data redundantly)
A few observations:
You are performing date manipulations before you join your dates. As a general rule this will prevent a query optimser from using an index even if it exists. You should try to write your expressions in such a way that indexed columns exist unaltered on one side of the expression.
Your subqueries are filtering to the same date range as
generate_series
. This is a duplication, and it limits the optimser's ability to choose the most efficient optimisation. I suspect that may have been written in to improve performance because the optimser was unable to use an index on the date column (body_time
)?NOTE: We would actually very much like to use an index on
Body.body_time
ORDER BY
within the subqueries is at best redundant. At worst it could force the query optimiser to sort the result set before joining; and that is not necessarily good for the query plan. Rather only apply ordering right at the end for final display.Use of
LEFT JOIN
in your subqueries is inappropriate. Assuming you're using ANSI conventions forNULL
behaviour (and you should be), any outer joins toenvelope
would returnenvelope_command=NULL
, and these would consequently be excluded by the conditionenvelope_command=?
.Subqueries
o
andi
are almost identical save for theenvelope_command
value. This forces the optimser to scan the same underlying tables twice. You can use a pivot table technique to join to the data once, and split the values into 2 columns.
Try the following which uses the pivot technique:
SELECT p.period,
/*The pivot technique in action...*/
SUM(
CASE WHEN envelope_command = 1 THEN body_size
ELSE 0
END) AS Outbound,
SUM(
CASE WHEN envelope_command = 2 THEN body_size
ELSE 0
END) AS Inbound
FROM (
SELECT date '2009-10-01' + s.day AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS s(day)
) AS p
/*The left JOIN is justified to ensure ALL generated dates are returned
Also: it joins to a subquery, else the JOIN to envelope _could_ exclude some generated dates*/
LEFT OUTER JOIN (
SELECT b.body_size,
b.body_time,
e.envelope_command
FROM body AS b
INNER JOIN envelope e
ON e.message_id = b.message_id
WHERE envelope_command IN (1, 2)
) d
/*The expressions below allow the optimser to use an index on body_time if
the statistics indicate it would be beneficial*/
ON d.body_time >= p.period
AND d.body_time < p.period + INTERVAL '1 DAY'
GROUP BY p.Period
ORDER BY p.Period
EDIT: Added filter suggested by Tom H.
Building on Craig Young's suggestions, here is the amended query which runs in ~1.8 seconds for the data set I'm working on. That is a slight improvement on the original ~2.0s and a huge improvement on Craig's which took ~22s.
SELECT
p.period,
/* The pivot technique... */
SUM(CASE envelope_command WHEN 1 THEN body_size ELSE 0 END) AS Outbound,
SUM(CASE envelope_command WHEN 2 THEN body_size ELSE 0 END) AS Inbound
FROM
(
/* Get days range */
SELECT date '2009-10-01' + day AS period
FROM generate_series(0, date '2009-10-31' - date '2009-10-01') AS day
) p
/* Join message information */
LEFT OUTER JOIN
(
SELECT b.body_size, b.body_time::date, e.envelope_command
FROM body AS b
INNER JOIN envelope e ON e.message_id = b.message_id
WHERE
e.envelope_command IN (2, 1)
AND b.body_time::date BETWEEN (date '2009-10-01') AND (date '2009-10-31')
) d ON d.body_time = p.period
GROUP BY p.period
ORDER BY p.period
I uninstalled my PostgreSQL server a couple of days ago, so you'll likely have to play around with this, but hopefully it's a good start for you.
The keys are:
- You shouldn't need the subqueries - just do the direct joins and aggregate
- You should be able to use INNER JOINs, which are typically more performant than OUTER JOINs
If nothing else, I think that the query below is a bit clearer.
I used a calendar table in my query, but you can replace that with the generate_series as you were using it.
Also, depending on indexing, it might be better to compare the body_date with >= and < rather than pulling out the date part and comparing. I don't know enough about PostgreSQL to know how it works behind the scenes, so I would try both approaches to see which the server can optimize better. In pseudo-code you would be doing: body_date >= date (time=midnight) AND body_date < date + 1 (time=midnight).
SELECT
CAL.calendar_date AS period,
SUM(O.body_size) AS outbound,
SUM(I.body_size) AS inbound
FROM
Calendar CAL
INNER JOIN Body OB ON
OB.body_time::date = CAL.calendar_date
INNER JOIN Envelope OE ON
OE.message_id = OB.message_id AND
OE.envelope_command = 1
INNER JOIN Body IB ON
IB.body_time::date = CAL.calendar_date
INNER JOIN Envelope IE ON
IE.message_id = IB.message_id AND
IE.envelope_command = 2
GROUP BY
CAL.calendar_date
精彩评论