MySQL query with dependent subquery takes too long

2023-01-08 22:57 问答作者：

I need an SQL guru to help me speed up my query.

I have 2 tables, quantities and prices. quantities records a quantity value between 2 timestamps, 15 minutes apart. prices records a price for a given timestamp, for a given price type and there is a price 5 record for every 5 minutes.

I need 2 work out the total price for each period, e.g. hour or day, between two timestamps. This is calculated by the sum of the (quantity multiplied by the average of the 3 prices in the 15 minute quantity window) in each period.

For example, let's say I want to see the total price each hour for 1 day. The total price value in each row in the result set is the sum of the total prices for each of the four 15 minute periods in that hour. And the total price for each 15 minute period is calculated by multiplying the quantity value in that period by the average of the 3 prices (one for each 5 minutes) in that quantity's period.

Here's the query I'm using, and the results:

SELECT
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
  SELECT AVG( `prices`.`price` )
  FROM `prices`
  WHERE `prices`.`timestamp` >= `quantities`.`start_timestamp`
  AND `prices`.`timestamp` < `quantities`.`end_timestamp`
  AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

+---------------------+---------------------+----------+
| start               | end                 | total    |
+---------------------+---------------------+----------+
| 2010-07-01 00:00:00 | 2010-07-01 01:00:00 | 0.677733 |
| 2010-07-01 01:00:00 | 2010-07-01 02:00:00 | 0.749133 |
| 2010-07-01 02:00:00 | 2010-07-01 03:00:00 | 0.835467 |
| 2010-07-01 03:00:00 | 2010-07-01 04:00:00 | 0.692233 |
| 2010-07-01 04:00:00 | 2010-07-01 05:00:00 | 0.389533 |
| 2010-07-01 05:00:00 | 2010-07-01 06:00:00 | 0.335300 |
| 2010-07-01 06:00:00 | 2010-07-01 07:00:00 | 1.231467 |
| 2010-07-01 07:00:00 | 2010-07-01 08:00:00 | 0.352800 |
| 2010-07-01 08:00:00 | 2010-07-01 09:00:00 | 1.447200 |
| 2010-07-01 09:00:00 | 2010-07-01 10:00:00 | 0.756733 |
| 2010-07-01 10:00:00 | 2010-07-01 11:00:00 | 0.599467 |
| 2010-07-01 11:00:00 | 2010-07-01 12:00:00 | 1.056467 |
| 2010-07-01 12:00:00 | 2010-07-01 13:00:00 | 1.252600 |
| 2010-07-01 13:00:00 | 2010-07-01 14:00:00 | 1.285567 |
| 2010-07-01 14:00:00 | 2010-07-01 15:00:00 | 0.442933 |
| 2010-07-01 15:00:00 | 2010-07-01 16:00:00 | 0.692567 |
| 2010-07-01 16:00:00 | 2010-07-01 17:00:00 | 1.281067 |
| 2010-07-01 17:00:00 | 2010-07-01 18:00:00 | 0.652033 |
| 2010-07-01 18:00:00 | 2010-07-01 19:00:00 | 1.721900 |
| 2010-07-01 19:00:00 | 2010-07-01 20:00:00 | 1.362400 |
| 2010-07-01 20:00:00 | 2010-07-01 21:00:00 | 1.099300 |
| 2010-07-01 21:00:00 | 2010-07-01 22:00:00 | 0.646267 |
| 2010-07-01 22:00:00 | 2010-07-01 23:00:00 | 0.873100 |
| 2010-07-01 23:00:00 | 2010-07-02 00:00:00 | 0.546533 |
+---------------------+---------------------+----------+
24 rows in set (5.16 sec)

I need the query to run a lot faster than this, and would have though it would be possible. Here's the results from EXPLAIN EXTENDED ...

+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
| id | select_type        | table      | type  | possible_keys     | key             | key_len | ref   | rows  | Extra                                        |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
|  1 | PRIMARY            | quantities | range | start_timestamp   | start_timestamp | 8       | NULL  |    89 | Using where; Using temporary; Using filesort |
|  2 | DEPENDENT SUBQUERY | prices     | ref   | timestamp,type_id | type_id         | 4       | const | 22930 | Using where                                  |
+----+--------------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------------------------------------------+
2 rows in set, 3 warnings (0.00 sec)

I noticed the dependent sub query is not using the timestamp field in the key and the query is scanning loads of rows.

Can anyone help me get this running a hell of a lot faster?

Here are the SQL statements required to create the schema and fill it with a lot of data (2 months worth)

# Create prices table

CREATE TABLE `prices` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `timestamp` datetime NOT NULL,
  `type_id` int(11) NOT NULL,
  `price` float(8,2) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `timestamp` (`timestamp`),
  KEY `type_id` (`type_id`)
) ENGINE=MyISAM;

# Create quantities table

CREATE TABLE `quantities` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `start_timestamp` datetime NOT NULL,
  `end_timestamp` datetime NOT NULL,
  `quantity` float(7,2) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `start_timestamp` (`start_timestamp`),
  KEY `end_timestamp` (`end_timestamp`)
) ENGINE=MyISAM;

# Insert 开发者_Python百科first 2 rows into prices, one for each of 2 types, starting 64 days ago

INSERT INTO `prices` (`id`, `timestamp`, `type_id`, `price`) VALUES
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '1', RAND()),
(NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), '2', RAND());

# Fill the prices table with a record for each type, for every 5 minutes, for the next 64 days

INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 32 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 16 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 8 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 4 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 2 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 1 DAY), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 12 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 6 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 3 HOUR), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 90 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 45 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 20 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 10 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_ADD(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices;
INSERT INTO prices (`timestamp`, `type_id`, `price`) SELECT DATE_SUB(`timestamp`, INTERVAL 5 MINUTE), `type_id`, RAND() FROM prices WHERE MOD( (TIME_TO_SEC( `timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 *60 ) = 0 AND `timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');

# Insert first row into quantities, start timestamp is 64 days ago, end timestamp is start timestamp plus 15 minutes

INSERT INTO `quantities` (`id`, `start_timestamp`, `end_timestamp`, `quantity`) VALUES (NULL, DATE_SUB(CURDATE(), INTERVAL 64 DAY), DATE_SUB(CURDATE(), INTERVAL '63 23:45' DAY_MINUTE), RAND());

# Fill the quantities table with a record for each 15 minute period for the next 64 days

INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 32 DAY), DATE_ADD(`end_timestamp`, INTERVAL 32 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 16 DAY), DATE_ADD(`end_timestamp`, INTERVAL 16 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 8 DAY), DATE_ADD(`end_timestamp`, INTERVAL 8 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 4 DAY), DATE_ADD(`end_timestamp`, INTERVAL 4 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 2 DAY), DATE_ADD(`end_timestamp`, INTERVAL 2 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 1 DAY), DATE_ADD(`end_timestamp`, INTERVAL 1 DAY), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 12 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 12 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 6 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 6 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 3 HOUR), DATE_ADD(`end_timestamp`, INTERVAL 3 HOUR), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 90 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 90 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 45 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 45 MINUTE), RAND() FROM quantities;
INSERT INTO `quantities` (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_ADD(`start_timestamp`, INTERVAL 15 MINUTE), DATE_ADD(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities;
INSERT INTO quantities (`start_timestamp`, `end_timestamp`, `quantity`) SELECT DATE_SUB(`start_timestamp`, INTERVAL 15 MINUTE), DATE_SUB(`end_timestamp`, INTERVAL 15 MINUTE), RAND() FROM quantities WHERE MOD( (TIME_TO_SEC( `start_timestamp`) - TIME_TO_SEC(CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00')) ), 45 * 60 ) = 0 AND `start_timestamp` > CONCAT(DATE_SUB(CURDATE(), INTERVAL 64 DAY), ' 00:00:00');

Here is my first attempt. This one is dirty and uses the following properties on data:

there are three 5 minute prices for each quarter in quantities (if this is violated in data the query will not work)
notice for each and cardinality of three, this is not guaranteed by data integrity checks so therefore I call it dirty
it is also not flexible to changes in periods

Query 1:

SELECT sql_no_cache
    min(q.start_timestamp) as start,  
    max(q.end_timestamp) as end, 
    sum((p1.price + p2.price + p3.price)/3*q.quantity) as total 
FROM 
    quantities q join 
    prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 join 
    prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 join 
    prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1 
WHERE 
    q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59' 
GROUP BY hour(q.start_timestamp);

This one returns results in 0.01 sec on my slow testing machine, where original query runs in ~6 sec, and gnarf's query in ~0.85 sec (all queries always tested with SQL_NO_CACHE keyword which does not reuse the results, but on a warm database).

EDIT: Here is a version that is not sensitive to missing rows on the price side Query 1a

SELECT sql_no_cache
    min(q.start_timestamp) as start,  
    max(q.end_timestamp) as end, 
    sum( ( COALESCE(p1.price,0) + COALESCE(p2.price,0) + COALESCE(p3.price,0) ) / ( 
         3 -
         COALESCE(p1.price-p1.price,1) - 
         COALESCE(p2.price-p2.price,1) - 
         COALESCE(p3.price-p3.price,1)
        )
       *q.quantity) as total 
FROM 
    quantities q LEFT JOIN 
    prices p1 on q.start_timestamp = p1.timestamp and p1.type_id = 1 LEFT JOIN
    prices p2 on p2.timestamp = adddate(q.start_timestamp, interval 5 minute) and p2.type_id = 1 LEFT JOIN
    prices p3 on p3.timestamp = adddate(q.start_timestamp, interval 10 minute) and p3.type_id = 1 
WHERE 
    q.start_timestamp between '2010-07-01 00:00:00' and '2010-07-01 23:59:59' 
GROUP BY hour(q.start_timestamp);

EDIT2: Query 2: Here is a direct improvement, and different approach, to your query with minimal changes that brings the execuction time to ~0.22 sec on my machine

SELECT sql_no_cache
MIN( `quantities`.`start_timestamp` ) AS `start`,
MAX( `quantities`.`end_timestamp` ) AS `end`,
SUM( `quantities`.`quantity` * (
  SELECT AVG( `prices`.`price` )
  FROM `prices`
  WHERE 
    `prices`.`timestamp` >= '2010-07-01 00:00:00' 
    AND `prices`.`timestamp` < '2010-07-02 00:00:00' 
    AND `prices`.`timestamp` >= `quantities`.`start_timestamp`
    AND `prices`.`timestamp` < `quantities`.`end_timestamp`
    AND `prices`.`type_id` = 1
) ) AS total
FROM `quantities`
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

That is mysql 5.1, I think I have read that in 5.5 this kind of thing (merging indexes) will be available to the query planner. Also, if you could make your start_timestamp and timestamp be related through foreign key that should allow these kind of correlated queries to make use of indexes (but for this you would need to modify design and establish some sort of timeline table which could then be referenced by quantities and prices both).

Query 3: Finally, the last version which does it in ~0.03 sec, but should be as robust and flexible as Query 2

SELECT sql_no_cache
MIN(start),
MAX(end),
SUM(subtotal)
FROM 
(
SELECT sql_no_cache
q.`start_timestamp` AS `start`,
q.`end_timestamp` AS `end`,
AVG(p.`price` * q.`quantity`) AS `subtotal`
FROM `quantities` q
LEFT JOIN `prices` p ON p.timestamp >= q.start_timestamp AND 
                        p.timestamp < q.end_timestamp AND
                        p.timestamp >= '2010-07-01 00:00:00' AND 
                        p.`timestamp` < '2010-07-02 00:00:00' 
WHERE q.`start_timestamp` >= '2010-07-01 00:00:00' 
AND q.`start_timestamp` < '2010-07-02 00:00:00'
AND p.type_id = 1
GROUP BY q.`start_timestamp`
) forced_tmp
GROUP BY hour( start );

NOTE: Do not forget to remove sql_no_cache keywords in production.

There are many counter intuitive tricks applied in the above queries (sometimes conditions repeated in the join condition speed up queries, sometimes they slow them down). Mysql is great little RDBMS and really fast when it comes to relatively simple queries, but when the complexity increases it is easy to run into the above scenarios.

So in general, I apply the following principle to set my expectations regarding the performance of a query:

if the base result set has < 1,000 rows then query should do its business in ~0.01 sec (base result set is the number of rows that functionally determine resulting set)

In this particular case you start with less then 1000 rows (all the prices and quantities in one day, with 15 minutes precision) and from that you should be able to compute the final results.

This should return the same results and perform slightly faster:

SELECT
  MIN( `quantities`.`start_timestamp` ) AS `start`,
  MAX( `quantities`.`end_timestamp` ) AS `end`,
  SUM( `quantities`.`quantity` * `prices`.`price` ) 
   * COUNT(DISTINCT `quantities`.`id`) 
   / COUNT(DISTINCT `prices`.`id`)
    AS total
FROM `quantities`
JOIN `prices` ON `prices`.`timestamp` >= `quantities`.`start_timestamp`
  AND `prices`.`timestamp` < `quantities`.`end_timestamp`
  AND `prices`.`type_id` = 1
WHERE `quantities`.`start_timestamp` >= '2010-07-01 00:00:00'
  AND `quantities`.`start_timestamp` < '2010-07-02 00:00:00'
GROUP BY HOUR(  `quantities`.`start_timestamp` );

Since you can't calculate AVG() inside the SUM(), I had to do some interesting COUNT(DISTINCT) to calculate the number of prices returned per quantities. I'm wondering if this gives you the same results with "real" data...

Using JOIN:

+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys                 | key             | key_len | ref  | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp,end_timestamp | start_timestamp | 8       | NULL |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ALL   | timestamp,type_id             | NULL            | NULL    | NULL | 36862 |    62.20 | Using where; Using join buffer               |
+----+-------------+------------+-------+-------------------------------+-----------------+---------+------+-------+----------+----------------------------------------------+

vs. the same query only adding LEFT to the JOIN

+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
| id | select_type | table      | type  | possible_keys     | key             | key_len | ref   | rows  | filtered | Extra                                        |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+
|  1 | SIMPLE      | quantities | range | start_timestamp   | start_timestamp | 8       | NULL  |    89 |   100.00 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | prices     | ref   | timestamp,type_id | type_id         | 4       | const | 22930 |   100.00 |                                              |
+----+-------------+------------+-------+-------------------+-----------------+---------+-------+-------+----------+----------------------------------------------+

Interesting that LEFT can completely removes the end_timestamp as a possible key, and changes the selected keys so much, making it take 15 times as long...

This reference page could help you out a little more if you want to look at specifying index hints for your JOINS

Remember, just because you have indexes on your columns doesn't necessarily mean they'll run faster. As it stands, the index created is for each individual column, which, if you were only limiting the data on one column, would return results quite fast.

So to try and avoid "Using filesort" (which you need to do as much as possible), maybe try the following index:

CREATE INDEX start_timestamp_end_timestamp_id ON quantities (start_timestamp,end_timestamp,id);

And something similar for the prices table (combining the 3 individual indexes you have into 1 index for faster lookup)

An excellent resource which explains it in great detail and how to optimize your indexes (and what the different Explain's mean, and what to aim for) is: http://hackmysql.com/case1

继续阅读：mysql-dependent-subquery

MySQL query with dependent subquery takes too long

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？