Optimizing Mysql Query For Group by with date functions

2023-02-10 13:33 问答作者：

I have a report that pulls information from a summary table and ideally will pull from two periods at once, the current period and the previous period. My table is structured thusly:

report_table
item_id INT(11)
amount Decimal(8,2)
day DATE

The primary key is item_id, day. This table currently holds 37k records with 92 different items and 1200 different days. I am using Mysql 5.1.

Here is my select statement:

SELECT r.day, sum(r.amount)/(count(distinct r.item_id)*count(r.day)) AS `current_avg_day`, 
sum(r2.amount)/(count(distinct r2.item_id)*count(r2.day)) AS `previous_avg_day` 
FROM `client_location_item` AS `cla`
 INNER JOIN `client_location` AS `cl`
 INNER JOIN `report_item_day` AS `r`
 INNER JOIN `report_item_day` AS `r2` 
 WHERE (r.item_id = cla.item_id) 
 AND (cla.location_id = cl.location_id) 
 AND (r.day between from_unixtime(1293840000) and开发者_开发百科 from_unixtime(1296518399)) 
 AND (r2.day between from_unixtime(1291161600) and from_unixtime(1293839999)) 
 AND (cl.location_code = 'LOCATION')
 group by month(r.day);

At present this query takes 2.2 seconds in my environment. The explain plan is:

'1', 'SIMPLE', 'cl', 'ALL', 'PRIMARY', NULL, NULL, NULL, '33', 'Using where; Using temporary; Using filesort'
'1', 'SIMPLE', 'cla', 'ref', 'PRIMARY,location_id,location_id_idxfk', 'location_id', '4', 'cl.location_id', '1', 'Using index'
'1', 'SIMPLE', 'r', 'ref', 'PRIMARY', 'PRIMARY', '4', cla.asset_id', '211', 'Using where'
'1', 'SIMPLE', 'r2', 'ALL', NULL, NULL, NULL, NULL, '37602', 'Using where; Using join buffer'

If I add an index to the "day" column, instead of my query running faster, it runs in 2.4 seconds. The explain plan for the query at that time is:

'1', 'SIMPLE', 'r2', 'range', 'report_day_day_idx', 'report_day_day_idx', '3', NULL, '1092', 'Using where; Using temporary; Using filesort'
'1', 'SIMPLE', 'r', 'range', 'PRIMARY,report_day_day_idx', 'report_day_day_idx', '3', NULL, '1180', 'Using where; Using join buffer'
'1', 'SIMPLE', 'cla', 'eq_ref', 'PRIMARY,location_id,location_id_idxfk', 'PRIMARY', '4', 'r.asset_id', '1', 'Using where'
'1', 'SIMPLE', 'cl', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', cla.location_id', '1', 'Using where'

According to the MySQL documentation the most efficient group by execution is when there is an index to retrieve the grouping columns. But it also states that the only functions that can really make use of the indexes are min() and max(). Does anyone have any ideas what I can do to further optimize my query? Or why, my 'indexed' version runs more slowly despite having fewer rows overall than the non-indexed version?

Create table:

CREATE TABLE `report_item_day` (
  `item_id` int(11) NOT NULL,
  `amount` decimal(8,2) DEFAULT NULL,
  `day` date NOT NULL,
  PRIMARY KEY (`item_id`,`day`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1

Of course the other option I have is to make 2 db calls, one for each time period. If I do that, straight away the query for each drops to 0.031s. Still I feel like there should be a way to optimize this query to achieve comparable results.

Three things:

1) I don't see in the WHERE clause something for r2.item_id. Without it, r2 is factored in via a Cartesian Product and will sum up other item_ids as well.

Change your original query to look like this:

SELECT r.day
      ,sum(r.amount)/(count(distinct r.item_id)*count(r.day)) AS `current_avg_day`
      ,sum(r2.amount)/(count(distinct r2.item_id)*count(r2.day)) AS `previous_avg_day`
FROM `client_location_item` AS `cla`
INNER JOIN `client_location` AS `cl`
INNER JOIN `report_item_day` AS `r`
INNER JOIN `report_item_day` AS `r2`
WHERE (r.item_id = cla.item_id) AND (r2.item_id = cla.item_id) AND (cla.location_id = cl.location_id)
AND (r.day between from_unixtime(1293840000) and from_unixtime(1296518399))
AND (r2.day between from_unixtime(1291161600) and from_unixtime(1293839999))
AND (cl.location_code = 'LOCATION')
group by month(r.day);

See if the EXPLAIN PLAN changes after this.

2) Do this : ALTER TABLE report_itme_day ADD INDEX (date,item_id);

This will index scan the date instead of the item id.

See if the EXPLAIN PLAN changes after this.

3) Last resort : Refactor the query

SELECT r.day, sum(r.amount)/(count(distinct r.item_id)*count(r.day)) AS `current_avg_day`, sum(r2.amount)/(count(distinct r2.item_id)*count(r2.day)) AS `previous_avg_day` FROM
(SELECT CLA.item_id FROM client_location CL,client_location_item CLA WHERE CLA.location_code = 'LOCATION' AND CLA.location_id=CL.location_id) A,
report_item_day r,
report_item_day r2,
WHERE (r.item_id  = A.item_id)
AND   (r2.item_id = A.item_id)
AND   (r.day  between from_unixtime(1293840000) and from_unixtime(1296518399))
AND   (r2.day between from_unixtime(1291161600) and from_unixtime(1293839999))
group by month(r.day);

This can definitely be refactored further. I just refactorted it a littte.

Give it a Try !!!

Why you are selecting day when you are grouping on month? I don't entirely what you would like the output of your query to look like. I hate MySQL for allowing that!

I will show you two approaches to query for 2 periods in one go. The first one is a union all query. It should do what your 2-query approach already does. It will return 2 rows, one for each period.

select sum(r.amount)  / (count(distinct r.item_id)  * count(r.day) ) as curr_avg
  from report_item_day r
  join client_location_item cla using(item_id)
  join client_location      cl  using(location_id)
 where cl.location_code = 'LOCATION'
   and r.day between from_unixtime(1293840000) and from_unixtime(1296518399)
union all
select sum(r.amount)  / (count(distinct r.item_id)  * count(r.day) ) as prev_avg
  from report_item_day r
  join client_location_item cla using(item_id)
  join client_location      cl  using(location_id)
 where cl.location_code = 'LOCATION'
   and r.day between from_unixtime(1291161600) and from_unixtime(1293839999)

The following approach is potentially faster than the above, but it is much uglier and harder to read.

select period
      ,sum(amount) / (count(distinct item_id) * count(day) ) as avg_day
  from (select case when r.day between from_unixtime(1293840000) and from_unixtime(1296518399) then 'Current'
                    when r.day between from_unixtime(1291161600) and from_unixtime(1293839999) then 'Previous'
                end as period
               ,r.amount
               ,r.item_id
               ,r.day
           from report_item_day r
           join client_location_item cla using(item_id)
           join client_location      cl  using(location_id)
          where cl.location_code = 'LOCATION'
            and (    r.day between from_unixtime(1293840000) and from_unixtime(1296518399)
                  or r.day between from_unixtime(1291161600) and from_unixtime(1293839999)
                )
         ) v
 group 
     by period;

Note 1: You didn't give us DDL, so I can't test if the syntax is correct
Note 2: Consider creating a calendar table, keyed by DATE. Add appropriate columns such as MONTH, WEEK, FINANCIAL_YEAR etcetera, to be able to support the reporting you are doing. The queries will be much much easier to write and understand.

First of all (and this might be just aesthetics), why aren't you using ON / USING clauses in your INNER JOIN ? Why make the JOIN on the WHERE clause instead of the actual part, in the FROM?

Second, my guess with the indexed vs non-indexed issue is that now it has to check against an index first for the records that matches said range, whereas in the non-indexed version memory goes faster than disk. But I can't be too sure.

Now, for the query. Here's part of the doc. on JOINs:

The `conditional_expr` used with ON is any conditional expression of the form 
that can be used in a WHERE clause. Generally, you should use the ON clause for
conditions that specify how to join tables, and the WHERE clause to restrict
which rows you want in the result set.

So yeah, move the join conditions to the FROM clause. Also, you might be interested in the Index hint syntax: http://dev.mysql.com/doc/refman/5.0/en/index-hints.html

And lastly, you could try using a view, but be wary of performance issues: http://www.mysqlperformanceblog.com/2007/08/12/mysql-view-as-performance-troublemaker/

Good luck.

继续阅读：group-by query-optimization

Optimizing Mysql Query For Group by with date functions

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？