Optimizing MySQL query using GROUP BY on time functions

2023-01-31 13:47

I have the following query:

SELECT location, step, COUNT(*), AVG(foo), YEAR(start), MONTH(start), DAY(start)
FROM table WHERE jobid = 'xxx' AND start BETWEEN '2010-01-01' AND '2010-01-08'
GROUP BY location, step, YEAR(start), MONTH(start), DAY(start)

Originally I had indexes on individual columns, such as jobid and start, but quickly realized that MySQL only really honors one index per table in a select. As such, it would use the jobid index and then do a pretty large scan to filter out by the start range.

Adding an index on (jobid, start) helped quite a bit, but the GROUP BY is still causing performance issues. I've read the docs on GROUP BY optimizations and understand that in order to benefit from these optimizations I need an index that contains (location, step, start), but I still have two open questions:

1. Will the GROUP BY optimizations even work with the time functions (YEAR, MONTH, DAY, etc.)? Or am I going to have to store these values as separate columns? The reason I like using the functions is that I can control the time zone on a per-connection basis and get back results tailored to the end user's time zone. If I have to pre-store the year, month, and day, I'll do it in UTC and then all my users will just get reports in UTC.
2. Even if I can solve issue #1, can I even do this? The index (jobid, start) helped with the WHERE clause, but the GROUP BY needs a different index to be optimized: (location, step, start) or, depending on the answer to #1, (location, step, year, month, day). The problem is that those two indexes don't share a common left-hand set of columns, so I don't believe my WHERE and GROUP BY can be made compatible such that the same index gets used. So my question is: am I just hosed here?

Any other thoughts on how to achieve this would be helpful. And, just to preempt a few questions/comments that might come up:

Yes, this is a time-series data set.
Yes, it would benefit from something like RRDtool, but doing so would cause me to lose the ability to produce timezone-specific results.
Yes, pre-calculating rollups would probably be a good idea, but I don't need awesome performance and so I'm OK with good performance if it lets me customize the results for each user's timezone.

With the above said, if anyone has any design suggestions on how to do something like rollups or round-robin databases and still get timezone-specific results, I'm all ears!

Update: as requested, here is some more info:

show indexes from output:

Table  Non_unique  Key_name  Seq_in_index  Column_name  Collation  Cardinality  Sub_part  Packed  Null  Index_type
step   0           PRIMARY   1             step_id      A          16           NULL      NULL          BTREE
step   1           start     1             start        A          16           NULL      NULL          BTREE
step   1           step      1             step         A          2            NULL      NULL          BTREE
step   1           foo       1             foo          A          16           NULL      NULL    YES   BTREE
step   1           location  1             location     A          2            NULL      NULL    YES   BTREE
step   1           jobid     1             jobid        A          2            NULL      NULL    YES   BTREE

show create table output:

CREATE TABLE `step` (
  `start` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `step` smallint(2) unsigned NOT NULL,
  `step_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
  `location` varchar(12) DEFAULT NULL,
  `jobid` varchar(37) DEFAULT NULL,
  PRIMARY KEY (`step_id`),
  KEY `start_time` (`start`),
  KEY `step` (`step`),
  KEY `location` (`location`),
  KEY `job_id` (`jobid`)
) ENGINE=InnoDB AUTO_INCREMENT=240 DEFAULT CHARSET=utf8

Instead of doing this:

GROUP BY location, step, YEAR(start), MONTH(start), DAY(start)
ORDER BY location, step, YEAR(start), MONTH(start), DAY(start)

try:

GROUP BY location, step, date_format(start, '%Y%m%d')
ORDER BY location, step, date_format(start, '%Y%m%d')

Create a single composite index on (jobid, start, location, step).

Then group in that column order first, and sort afterwards:

SELECT location, step, COUNT(*), AVG(foo), YEAR(start), MONTH(start), DAY(start)
FROM table WHERE jobid = 'xxx' AND start BETWEEN '2010-01-01' AND '2010-01-08'
GROUP BY YEAR(start), MONTH(start), DAY(start), location, step
ORDER BY location, step, YEAR(start), MONTH(start), DAY(start)
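
As a rough sketch of how that suggestion could be tried out, assuming the table is the `step` table from the question and that it has the `foo` column shown in the index listing (the index name below is made up for illustration):

```sql
-- Hypothetical index name; the column order follows the suggestion above.
ALTER TABLE step
  ADD INDEX jobid_start_location_step (jobid, start, location, step);

-- Check which index is chosen and whether the plan still needs a
-- temporary table or a filesort (see the `key` and `Extra` columns).
EXPLAIN
SELECT location, step, COUNT(*), AVG(foo), YEAR(start), MONTH(start), DAY(start)
FROM step
WHERE jobid = 'xxx' AND start BETWEEN '2010-01-01' AND '2010-01-08'
GROUP BY YEAR(start), MONTH(start), DAY(start), location, step
ORDER BY location, step, YEAR(start), MONTH(start), DAY(start);
```

If `Extra` still reports "Using temporary; Using filesort", the index is only helping the WHERE clause, not the grouping.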

UPDATE

It looks like MySQL cannot use the index for the GROUP BY when the YEAR, MONTH and DAY functions are used, since:

After removing start from the WHERE clause, the EXPLAIN output still shows "Using filesort".
Adding 3 columns (y = YEAR(start), m = MONTH(start), d = DAY(start)), creating an index on (jobid, y, m, d, location, step), and changing the WHERE clause to ... AND y = 2010 AND m = 12 AND d BETWEEN 1 AND 8 does remove the "Using temporary; Using filesort".

Keeping three extra columns seems like a bad idea, though, since whether the GROUP BY uses a temporary table or not shouldn't make that much of a performance difference.
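
For reference, the three-column experiment described above might look roughly like this. It is only a sketch: the index name is invented, the backfill is done in UTC (which is exactly the per-user time zone limitation the question raises), and new rows would need y, m and d filled in by the application or a trigger.

```sql
-- Extra columns holding the date parts of `start`.
ALTER TABLE step
  ADD COLUMN y SMALLINT UNSIGNED NOT NULL DEFAULT 0,
  ADD COLUMN m TINYINT UNSIGNED NOT NULL DEFAULT 0,
  ADD COLUMN d TINYINT UNSIGNED NOT NULL DEFAULT 0;

-- Backfill in a fixed time zone so the stored parts are unambiguous.
SET time_zone = '+00:00';
UPDATE step SET y = YEAR(start), m = MONTH(start), d = DAY(start);

-- Hypothetical index name; column order as described above.
ALTER TABLE step
  ADD INDEX jobid_ymd_location_step (jobid, y, m, d, location, step);

EXPLAIN
SELECT location, step, COUNT(*), AVG(foo), y, m, d
FROM step
WHERE jobid = 'xxx' AND y = 2010 AND m = 12 AND d BETWEEN 1 AND 8
GROUP BY y, m, d, location, step;
```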

"...and understand that in order to benefit from these optimizations I need an index that contains (location, step, start)"

Nope. You could create a composite index on (jobid, start, location, step), and it would help if there were no BETWEEN. Because you're using a range condition in the WHERE clause, no index will be used for the GROUP BY, and the best you can do for this query is the (jobid, start) index.

The best solution, IMHO, is to decompose this table into some pre-calculated form. For example, aggregate the data hourly with a scheduled job.
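
One way such an hourly rollup could still give timezone-specific daily reports is to store the bucket start as a TIMESTAMP, so that YEAR/MONTH/DAY are evaluated in the session time zone at report time. The following is a sketch under assumptions, not a tested design: the `step_hourly` table and its column names are hypothetical, the date window is a placeholder for whatever the scheduled job would cover, and hourly UTC buckets only line up with local day boundaries for time zones at a whole-hour offset.

```sql
-- Hypothetical rollup table: one row per job, hour, location and step.
CREATE TABLE step_hourly (
  jobid      VARCHAR(37)       NOT NULL,
  hour_start TIMESTAMP         NOT NULL DEFAULT CURRENT_TIMESTAMP,
  location   VARCHAR(12)       NOT NULL,
  step       SMALLINT UNSIGNED NOT NULL,
  row_count  INT UNSIGNED      NOT NULL,  -- COUNT(*) for the hour
  foo_count  INT UNSIGNED      NOT NULL,  -- non-NULL foo values, needed for AVG
  foo_sum    DOUBLE            NOT NULL,
  PRIMARY KEY (jobid, hour_start, location, step)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

-- Scheduled job: run in UTC so truncating to the hour is unambiguous.
SET time_zone = '+00:00';

INSERT INTO step_hourly
  (jobid, hour_start, location, step, row_count, foo_count, foo_sum)
SELECT jobid, DATE_FORMAT(start, '%Y-%m-%d %H:00:00'), location, step,
       COUNT(*), COUNT(foo), COALESCE(SUM(foo), 0)
FROM step
WHERE start >= '2010-01-01 00:00:00' AND start < '2010-01-09 00:00:00'
  AND jobid IS NOT NULL AND location IS NOT NULL
GROUP BY jobid, DATE_FORMAT(start, '%Y-%m-%d %H:00:00'), location, step
ON DUPLICATE KEY UPDATE
  row_count = VALUES(row_count),
  foo_count = VALUES(foo_count),
  foo_sum   = VALUES(foo_sum);

-- Report: the session time zone decides which day each hour falls into.
SET time_zone = '-05:00';

SELECT location, step,
       SUM(row_count) AS cnt,
       SUM(foo_sum) / NULLIF(SUM(foo_count), 0) AS avg_foo,
       YEAR(hour_start), MONTH(hour_start), DAY(hour_start)
FROM step_hourly
WHERE jobid = 'xxx' AND hour_start BETWEEN '2010-01-01' AND '2010-01-08'
GROUP BY location, step, YEAR(hour_start), MONTH(hour_start), DAY(hour_start);
```

Storing the sum and count rather than the hourly average is what allows the daily AVG to be recomputed exactly from the hourly rows.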

There's a chance this query will run faster if location and step are integer foreign keys into other tables that hold just a name and an integer id.

First, the query would group on integer data, which compares a lot faster than strings. Second, there's a chance the DB engine will index these columns automatically.

I'd also consider moving jobid into a separate table if its values repeat.
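
A minimal sketch of that normalization for the location column, assuming a new lookup table (the `location` table, the `location_id` column and the constraint name are all hypothetical):

```sql
-- Lookup table: one row per distinct location name.
CREATE TABLE location (
  location_id SMALLINT UNSIGNED NOT NULL AUTO_INCREMENT,
  name        VARCHAR(12)       NOT NULL,
  PRIMARY KEY (location_id),
  UNIQUE KEY name (name)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

INSERT INTO location (name)
SELECT DISTINCT location FROM step WHERE location IS NOT NULL;

-- Add the integer key to the fact table and backfill it from the names.
ALTER TABLE step ADD COLUMN location_id SMALLINT UNSIGNED NULL;

UPDATE step s
JOIN location l ON l.name = s.location
SET s.location_id = l.location_id;

-- InnoDB creates an index on the referencing column if none exists yet.
ALTER TABLE step
  ADD CONSTRAINT fk_step_location
  FOREIGN KEY (location_id) REFERENCES location (location_id);
```

The same pattern would apply to jobid; whether it actually speeds up the GROUP BY is something to confirm with EXPLAIN and timing on real data.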
