Optimizing MySQL query for GROUP BY with date functions
I have a report that pulls information from a summary table and ideally will pull two periods at once: the current period and the previous period. My table is structured as follows:
report_table
item_id INT(11)
amount DECIMAL(8,2)
day DATE
The primary key is (item_id, day). This table currently holds 37k records, with 92 different items and 1200 different days. I am using MySQL 5.1.
Here is my select statement:
SELECT r.day, sum(r.amount)/(count(distinct r.item_id)*count(r.day)) AS `current_avg_day`,
sum(r2.amount)/(count(distinct r2.item_id)*count(r2.day)) AS `previous_avg_day`
FROM `client_location_item` AS `cla`
INNER JOIN `client_location` AS `cl`
INNER JOIN `report_item_day` AS `r`
INNER JOIN `report_item_day` AS `r2`
WHERE (r.item_id = cla.item_id)
AND (cla.location_id = cl.location_id)
AND (r.day between from_unixtime(1293840000) and from_unixtime(1296518399))
AND (r2.day between from_unixtime(1291161600) and from_unixtime(1293839999))
AND (cl.location_code = 'LOCATION')
group by month(r.day);
At present this query takes 2.2 seconds in my environment. The explain plan is:
'1', 'SIMPLE', 'cl', 'ALL', 'PRIMARY', NULL, NULL, NULL, '33', 'Using where; Using temporary; Using filesort'
'1', 'SIMPLE', 'cla', 'ref', 'PRIMARY,location_id,location_id_idxfk', 'location_id', '4', 'cl.location_id', '1', 'Using index'
'1', 'SIMPLE', 'r', 'ref', 'PRIMARY', 'PRIMARY', '4', 'cla.asset_id', '211', 'Using where'
'1', 'SIMPLE', 'r2', 'ALL', NULL, NULL, NULL, NULL, '37602', 'Using where; Using join buffer'
If I add an index on the day column, instead of running faster, the query runs in 2.4 seconds. The explain plan for that version is:
'1', 'SIMPLE', 'r2', 'range', 'report_day_day_idx', 'report_day_day_idx', '3', NULL, '1092', 'Using where; Using temporary; Using filesort'
'1', 'SIMPLE', 'r', 'range', 'PRIMARY,report_day_day_idx', 'report_day_day_idx', '3', NULL, '1180', 'Using where; Using join buffer'
'1', 'SIMPLE', 'cla', 'eq_ref', 'PRIMARY,location_id,location_id_idxfk', 'PRIMARY', '4', 'r.asset_id', '1', 'Using where'
'1', 'SIMPLE', 'cl', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'cla.location_id', '1', 'Using where'
According to the MySQL documentation, the most efficient GROUP BY execution is when an index can be used to retrieve the grouping columns, but it also states that the only aggregate functions that can really make use of such an index are MIN() and MAX(). Does anyone have any ideas for further optimizing my query? Or for why my 'indexed' version runs more slowly even though it examines fewer rows than the non-indexed version?
Create table:
CREATE TABLE `report_item_day` (
`item_id` int(11) NOT NULL,
`amount` decimal(8,2) DEFAULT NULL,
`day` date NOT NULL,
PRIMARY KEY (`item_id`,`day`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
Of course, the other option I have is to make two DB calls, one for each time period. If I do that, each query immediately drops to 0.031s. Still, I feel there should be a way to optimize the combined query to achieve comparable results.
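Each per-period query is essentially the query above with the r2 self-join dropped; a sketch of the current-period version:
SELECT r.day,
       SUM(r.amount) / (COUNT(DISTINCT r.item_id) * COUNT(r.day)) AS `current_avg_day`
FROM `client_location_item` AS `cla`
INNER JOIN `client_location` AS `cl` ON cl.location_id = cla.location_id
INNER JOIN `report_item_day` AS `r` ON r.item_id = cla.item_id
WHERE cl.location_code = 'LOCATION'
  AND r.day BETWEEN from_unixtime(1293840000) AND from_unixtime(1296518399)
GROUP BY month(r.day);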
Three things:
1) I don't see a condition on r2.item_id in the WHERE clause. Without it, r2 is joined in as a Cartesian product and will sum up amounts for other item_ids as well.
Change your original query to look like this:
SELECT r.day
,sum(r.amount)/(count(distinct r.item_id)*count(r.day)) AS `current_avg_day`
,sum(r2.amount)/(count(distinct r2.item_id)*count(r2.day)) AS `previous_avg_day`
FROM `client_location_item` AS `cla`
INNER JOIN `client_location` AS `cl`
INNER JOIN `report_item_day` AS `r`
INNER JOIN `report_item_day` AS `r2`
WHERE (r.item_id = cla.item_id) AND (r2.item_id = cla.item_id) AND (cla.location_id = cl.location_id)
AND (r.day between from_unixtime(1293840000) and from_unixtime(1296518399))
AND (r2.day between from_unixtime(1291161600) and from_unixtime(1293839999))
AND (cl.location_code = 'LOCATION')
group by month(r.day);
See if the EXPLAIN PLAN changes after this.
2) Do this: ALTER TABLE report_item_day ADD INDEX (day, item_id);
This lets MySQL scan the index by day instead of by item_id.
See if the EXPLAIN PLAN changes after this.
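To double-check that the new index exists and to see the optimizer's cardinality estimates for it, something along these lines works (the index name is just an example):
-- Same index as in step 2, just with an explicit (example) name:
ALTER TABLE report_item_day ADD INDEX report_item_day_day_item_idx (day, item_id);
-- List the table's indexes and the optimizer's cardinality estimates:
SHOW INDEX FROM report_item_day;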
3) Last resort: refactor the query
SELECT r.day
,sum(r.amount)/(count(distinct r.item_id)*count(r.day)) AS `current_avg_day`
,sum(r2.amount)/(count(distinct r2.item_id)*count(r2.day)) AS `previous_avg_day`
FROM
(SELECT CLA.item_id FROM client_location CL, client_location_item CLA
 WHERE CL.location_code = 'LOCATION' AND CLA.location_id = CL.location_id) A,
report_item_day r,
report_item_day r2
WHERE (r.item_id = A.item_id)
AND (r2.item_id = A.item_id)
AND (r.day between from_unixtime(1293840000) and from_unixtime(1296518399))
AND (r2.day between from_unixtime(1291161600) and from_unixtime(1293839999))
group by month(r.day);
This can definitely be refactored further; I've just refactored it a little.
Give it a Try !!!
Why are you selecting day when you are grouping on month? I don't entirely understand what you would like the output of your query to look like. I hate MySQL for allowing that!
I will show you two approaches to querying two periods in one go. The first is a UNION ALL query; it should do what your two-query approach already does, and it will return two rows, one for each period.
select sum(r.amount) / (count(distinct r.item_id) * count(r.day) ) as curr_avg
from report_item_day r
join client_location_item cla using(item_id)
join client_location cl using(location_id)
where cl.location_code = 'LOCATION'
and r.day between from_unixtime(1293840000) and from_unixtime(1296518399)
union all
select sum(r.amount) / (count(distinct r.item_id) * count(r.day) ) as prev_avg
from report_item_day r
join client_location_item cla using(item_id)
join client_location cl using(location_id)
where cl.location_code = 'LOCATION'
and r.day between from_unixtime(1291161600) and from_unixtime(1293839999)
The following approach is potentially faster than the above, but it is much uglier and harder to read.
select period
,sum(amount) / (count(distinct item_id) * count(day) ) as avg_day
from (select case when r.day between from_unixtime(1293840000) and from_unixtime(1296518399) then 'Current'
when r.day between from_unixtime(1291161600) and from_unixtime(1293839999) then 'Previous'
end as period
,r.amount
,r.item_id
,r.day
from report_item_day r
join client_location_item cla using(item_id)
join client_location cl using(location_id)
where cl.location_code = 'LOCATION'
and ( r.day between from_unixtime(1293840000) and from_unixtime(1296518399)
or r.day between from_unixtime(1291161600) and from_unixtime(1293839999)
)
) v
group
by period;
Note 1: You didn't give us DDL, so I can't test if the syntax is correct
Note 2: Consider creating a calendar table, keyed by DATE. Add appropriate columns such as MONTH, WEEK, FINANCIAL_YEAR and so on, to support the reporting you are doing. The queries will be much, much easier to write and understand.
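A minimal sketch of such a calendar table (all names here are placeholders, not part of your existing schema):
CREATE TABLE calendar (
  cal_date       DATE     NOT NULL,
  cal_month      TINYINT  NOT NULL,  -- 1-12
  cal_week       TINYINT  NOT NULL,  -- ISO week number
  financial_year SMALLINT NOT NULL,
  PRIMARY KEY (cal_date)
) ENGINE=InnoDB;
The report would then join report_item_day to this table on day = cal_date and group on a pre-computed column such as cal_month instead of month(r.day).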
First of all (and this might be just aesthetics), why aren't you using ON/USING clauses in your INNER JOINs? Why put the join conditions in the WHERE clause instead of where they belong, in the FROM clause?
Second, my guess about the indexed vs. non-indexed issue is that with the index the query first has to look up the records matching the range in the index, whereas the non-indexed version scans rows already in memory, and memory is faster than disk. But I can't be too sure.
Now, for the query. Here's part of the doc. on JOINs:
The `conditional_expr` used with ON is any conditional expression of the form
that can be used in a WHERE clause. Generally, you should use the ON clause for
conditions that specify how to join tables, and the WHERE clause to restrict
which rows you want in the result set.
So yeah, move the join conditions to the FROM clause. Also, you might be interested in the Index hint syntax: http://dev.mysql.com/doc/refman/5.0/en/index-hints.html
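For example, your original query restructured with the join conditions in ON clauses looks roughly like this (a sketch; the logic is kept exactly as posted, including r2 having no item_id join condition):
SELECT r.day,
       SUM(r.amount)  / (COUNT(DISTINCT r.item_id)  * COUNT(r.day))  AS `current_avg_day`,
       SUM(r2.amount) / (COUNT(DISTINCT r2.item_id) * COUNT(r2.day)) AS `previous_avg_day`
FROM `client_location_item` AS `cla`
INNER JOIN `client_location` AS `cl`
        ON cl.location_id = cla.location_id
INNER JOIN `report_item_day` AS `r`
        ON r.item_id = cla.item_id
CROSS JOIN `report_item_day` AS `r2`   -- the original has no join condition for r2
WHERE cl.location_code = 'LOCATION'
  AND r.day  BETWEEN from_unixtime(1293840000) AND from_unixtime(1296518399)
  AND r2.day BETWEEN from_unixtime(1291161600) AND from_unixtime(1293839999)
GROUP BY month(r.day);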
And lastly, you could try using a view, but be wary of performance issues: http://www.mysqlperformanceblog.com/2007/08/12/mysql-view-as-performance-troublemaker/
Good luck.