开发者

Optimize query for multiple columns extracted from the same table

This is a follow up of another question here on SO.

I have this two database tables (more tables omitted):

acquisitions (acq)
    id {PK}
    id_cu {FK}
    datetime
    { Unique Constraint: id_cu - datetime }

data
    id {PK}
    id_acq {FK acquisitions}
    id_meas
    id_elab
    value

Every possible id and datetimeare all indexed.

Granted I will not change db structure I need to extract data this way:

  • rows grouped by datetime
  • each column corresponding data.value for selected acq.id_cu - data.id_meas - data.id_elab combination. (see note on bottom of post)
  • allow empty cells if data is missing for some column but existing for others in a datetime

My current query is built this way (see SO question):

SELECT datetime, MAX(v1) AS v1, MAX(v2) AS v2, MAX(v3) AS v3 FROM (

SELECT acq.datetime AS datetime, data.value AS v1, NULL AS v2, NULL AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 3 AND data.id_meas = 2 AND data.id_elab = 1

UNION

SELECT acq.datetime AS datetime, NULL AS v1, data.value AS v2, NULL AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 5 AND data.id_meas = 4 AND data.id_elab = 6

UNION

SELECT acq.datetime AS datetime, NULL AS v1, NULL AS v2, data.value AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 7 AND data.id_meas = 9 AND data.id_elab = 8

) AS T
WHERE datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"
GROUP BY datetime

Here for retrieving just 3 columns but, as I said, columns are frequently more than 50.

It works flawlessly but I'd like to know if it can be optimized in speed.

This is MySQL EXPLAIN EXTENDED for the query above:

+----+--------------+--------------+------+------------------------------------------------+-----------------------+---------+------------------------+-------+----------+----------------------------------------------+
| id | select_type  | table        | type | possible_keys                                  | key                   | key_len | ref                    | rows  | filtered | Extra                                        |
+----+--------------+--------------+------+------------------------------------------------+-----------------------+---------+------------------------+-------+----------+----------------------------------------------+
|  1 | PRIMARY      | <derived2>   | ALL  | NULL                                           | NULL                  | NULL    | NULL                   | 82466 |   100.00 | Using where; Using temporary; Using filesort |
|  2 | DERIVED      | acquisitions | ref  | PRIMARY,id_cu,ix_acquisitions_id_cu            | id_cu                 | 4       |                        | 18011 |   100.00 |                                              |
|  2 | DERIVED      | data         | ref  | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab | ix_data_id_acq        | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                                  |
|  3 | UNION        | acquisitions | ref  | PRIMARY,id_cu,ix_acquisitions_id_cu            | ix_acquisitions_id_cu | 4       |                        | 20864 |   100.00 |                                              |
|  3 | UNION        | data         | ref  | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab | ix_data_id_acq        | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                                  |
|  4 | UNION        | acquisitions | ref  | PRIMARY,id_cu,ix_acquisitions_id_cu            | id_cu                 | 4       |                        | 31848 |   100.00 |                                              |
|  4 | UNION        | data         | ref  | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab | ix_data_id_acq        | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                                  |
| NULL | UNION RESULT | <union2,3,4> | ALL  | NULL                                           | NULL                  | NULL    | NULL                   |  NULL |     NULL |                                              |
+----+--------------+--------------+------+------------------------------------------------+-----------------------+---------+------------------------+-------+----------+----------------------------------------------+
8 rows in set, 1 warning (8.24 sec)

Currently with (edit: checked today) 390k acquisitions and 9.2M data values (and growing) it takes about 10 minutes to extract a table of 59 columns. I'm aware the precedent software took up to 1 hour to extract data.

Thanks for your patience reading till here :)


Update

After Denis answer I tried his changes 1. and 2., this is the result of the new query:

SELECT datetime, MAX(v1) AS v1, MAX(v2) AS v2, MAX(v3) AS v3 FROM (

SELECT acq.datetime AS datetime, data.value AS v1, NULL AS v2, NULL AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 3 AND data.id_meas = 2 AND data.id_elab = 1
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"

UNION ALL

SELECT acq.datetime AS datetime, NULL AS v1, data.value AS v2, NULL AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 5 AND data.id_meas = 4 AND data.id_elab = 6
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"

UNION ALL

SELECT acq.datetime AS datetime, NULL AS v1, NULL AS v2, data.value AS v3 
FROM acq INNER JOIN data ON acq.id = data.id_acq
WHERE acq.id_cu = 7 AND data.id_meas = 9 AND data.id_elab = 8
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"

) AS T GROUP BY datetime

and here the new EXPLAIN EXTENDED:

+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type  | table        | type  | possible_keys                                                | key            | key_len | ref                    | rows  | filtered | Extra                           |
+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
|  1 | PRIMARY      | <derived2>   | ALL   | NULL                                                         | NULL           | NULL    | NULL                   | 51997 |   100.00 | Using temporary; Using filesort |
|  2 | DERIVED      | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 14827 |   100.00 | Using where                     |
|  2 | DERIVED      | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
|  3 | UNION        | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 18663 |   100.00 | Using where                     |
|  3 | UNION        | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
|  4 | UNION        | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 13260 |   100.00 | Using where                     |
|  4 | UNION        | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
| NULL | UNION RESULT | <union2,3,4> | ALL   | NULL                                                         | NULL           | NULL    | NULL                   |  NULL |     NULL |                                 |
+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
8 rows in set, 1 warning (3.01 sec)

a good gain in performace with no doubt


Update (2)

This comes adding point 3.

EXPLAIN EXTENDED SELECT datetime, MAX(v1) AS v1, MAX(v2) AS v2, MAX(v3) AS v3 FROM (

SELECT acquisitions.datetime AS datetime, MAX(data.value) AS v1, NULL AS v2, NULL AS v3 
FROM acquisitions INNER JOIN data ON acquisitions.id = data.id_acq
WHERE acquisitions.id_cu = 1 AND data.id_meas = 1 AND data.id_elab = 2
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"
GROUP BY datetime

UNION ALL

SELECT acquisitions.datetime AS datetime, NULL AS v1, MAX(data.value) AS v2, NULL AS v3 
FROM acquisitions INNER JOIN data ON acquisitions.id = data.id_acq
WHERE acquisitions.id_cu = 4 AND data.id_meas = 1 AND data.id_elab = 2
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"
GROUP BY datetime

UNION ALL

SELECT acquisitions.datetime AS datetime, NULL AS v1, NULL AS v2, MAX(data.value) AS v3 
FROM acquisitions INNER JOIN data ON acquisitions.id = data.id_acq
WHERE acquisitions.id_cu = 8 AND data.id_meas = 1 AND data.id_elab = 2
AND datetime >= "2011-03-01 00:00:00" AND datetime <= "2011-04-30 23:59:59"
GROUP BY datetime

) AS T GROUP BY datetime;

and this is the result of EXPLAIN EXTENDED

+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
| id | select_type  | table        | type  | possible_keys                                                | key            | key_len | ref                    | rows  | filtered | Extra                           |
+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
|  1 | PRIMARY      | <derived2>   | ALL   | NULL                               开发者_如何转开发                          | NULL           | NULL    | NULL                   | 51997 |   100.00 | Using temporary; Using filesort |
|  2 | DERIVED      | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 14827 |   100.00 | Using where                     |
|  2 | DERIVED      | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
|  3 | UNION        | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 18663 |   100.00 | Using where                     |
|  3 | UNION        | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
|  4 | UNION        | acquisitions | range | PRIMARY,id_cu,ix_acquisitions_datetime,ix_acquisitions_id_cu | id_cu          | 12      | NULL                   | 13260 |   100.00 | Using where                     |
|  4 | UNION        | data         | ref   | ix_data_id_meas,ix_data_id_acq,ix_data_id_elab               | ix_data_id_acq | 4       | sensor.acquisitions.id |     9 |   100.00 | Using where                     |
| NULL | UNION RESULT | <union2,3,4> | ALL   | NULL                                                         | NULL           | NULL    | NULL                   |  NULL |     NULL |                                 |
+----+--------------+--------------+-------+--------------------------------------------------------------+----------------+---------+------------------------+-------+----------+---------------------------------+
8 rows in set, 1 warning (3.06 sec)

Just a little slower, is this supposed to benefit from large number of coulmns? I'll try it...


Update (3)

I tried with and without MAX(data.value)... GROUP BY datetime and, on a 60 columns query I get better results with. Results vary from try to try, this is one of them.

  • original query 9m12.144s
  • with Denis' 1. and 2. 4m6.597s
  • with Denis' 1., 2. and 3. 4m0.210s

It's about 57% less time required.


Update (4)

I tried Andiry solution but it is way slower than Denis optimization.

Tested on a 3 combinations/columns:

  • unoptimized: 1m3s
  • Denis' optimization: 1.7s
  • Andiry's CASE: 9.3s

I also tested on a 12 combinations/columns:

  • unoptimized: untested
  • Denis' optimization: 3.6s
  • Andiry's CASE: 13.7s

Moreover Andiry's solution also pulls in acquisition dates where there's no data for any of the selected combination but existing for others.

Immagine Control Unit 1 gets data every 30 minutes at :00 and :30 while Control Unit 2 at :15 and :45: I'll double the number of rows with NULL filled empty ones.


Note:

It's all about a sensors system: there are several control units (one for each id_cu) with many sensors each.

A single sensor is identified by a id_cu / id_meas couple and sends different elaborations for each measure, say MIN (id_elab=1), MAX (id_elab=2), AVERAGE (id_elab=3), INSTANT (id_elab=...) etc, one for each id_elab.

The user is free to pickup as many elaborations he wants, say:

  • AVERAGE value (3) of sensor #3 of control unit #1 for a results column so id_cu=1 / id_meas=3 / id_elab=3
  • AVERAGE value (3) of sensor #5 of control unit #1 for a results column so id_cu=1 / id_meas=5 / id_elab=3
  • MIN value (1) of sensor #2 of control unit #4 for another column so id_cu=4 / id_meas=2 / id_elab=1
  • (put any valid id_cu, id_meas, id_elab combination)
  • ...

and so on, up to tens of choices...

Here's a partial DDL (irrelevant tables excluded):

CREATE TABLE acquisitions (
    id INTEGER NOT NULL AUTO_INCREMENT, 
    id_cu INTEGER NOT NULL, 
    datetime DATETIME NOT NULL, 
    PRIMARY KEY (id), 
    UNIQUE (id_cu, datetime), 
    FOREIGN KEY(id_cu) REFERENCES ctrl_units (id) ON DELETE CASCADE
)

CREATE TABLE data (
    id INTEGER NOT NULL AUTO_INCREMENT, 
    id_acq INTEGER NOT NULL, 
    id_meas INTEGER NOT NULL, 
    id_elab INTEGER NOT NULL, 
    value FLOAT, 
    PRIMARY KEY (id), 
    FOREIGN KEY(id_acq) REFERENCES acquisitions (id) ON DELETE CASCADE
)

CREATE TABLE ctrl_units (
    id INTEGER NOT NULL, 
    name VARCHAR(40) NOT NULL, 
    PRIMARY KEY (id)
)

CREATE TABLE sensors (
    id_cu INTEGER NOT NULL, 
    id_meas INTEGER NOT NULL, 
    id_elab INTEGER NOT NULL, 
    name VARCHAR(40) NOT NULL, 
    `desc` VARCHAR(80), 
    PRIMARY KEY (id_cu, id_meas), 
    FOREIGN KEY(id_cu) REFERENCES ctrl_units (id) ON DELETE CASCADE
)


There are three main issues:

  1. Use union all, rather than union. you're grouping and fetching min/max values, so there's no point in introducing a sort step to drop duplicate rows.

  2. The where clause could be placed in each of the union substatements:

    select ...
    from (
    select ... from ...  where ...
    union all
    select ... from ...  where ...
    union all
    ...
    )
    group by ...
    

    The way you've written it, it starts by fetching all rows, appends them all, and finally filters the ones you need. Injecting the where clause in the union substatements will make it fetch only the rows you need, finally append them all.

  3. Along the same lines, pre-aggregate the aggregates:

    select ..., max(foo) as foo
    from (
    select ..., max(foo) as foo from ...  where ... group by ...
    union all
    select ..., max(foo) as foo from ...  where ... group by ...
    union all
    ...
    )
    group by ...
    

    The optimizer will make better use of existing indexes, and you'll end up appending only a few rows, rather than millions.


SELECT
  acq.datetime,
  MAX(CASE WHEN acq.id_cu = 2 AND data.id_meas = 2 AND data.id_elab = 1 THEN data.value END) AS v1,
  MAX(CASE WHEN acq.id_cu = 5 AND data.id_meas = 4 AND data.id_elab = 6 THEN data.value END) AS v2,
  MAX(CASE WHEN acq.id_cu = 7 AND data.id_meas = 9 AND data.id_elab = 8 THEN data.value END) AS v3
FROM acq
  INNER JOIN data acq.id = data.id_acq
WHERE datetime >= 2011-03-01 00:00:00 AND datetime <= 2011-04-30 23:59:59
GROUP BY acq.datetime

This may look about the same as your original query, but the main difference is that logically it scans the tables just once instead of three or however many times with UNIONs.


Basically I think you are going to get better results with the single SELECT and CASE dealing with the conditions. At any rate you might want to benchmark and compare...

SELECT acq.datetime AS datetime, 
       MAX(
           CASE acq.id_cu
           WHEN 1 THEN data.value
           END 
       ) as v1,
       MAX(
           CASE acq.id_cu
           WHEN 4 THEN data.value
           END 
       ) as v2,
       MAX(
           CASE acq.id_cu
           WHEN 8 THEN data.value
           END 
       ) as v3
FROM 
       acq INNER JOIN data ON acq.id = data.id_acq
WHERE 
       data.id_meas = 1 AND data.id_elab = 2 AND
       datetime BETWEEN "2011-03-01 00:00:00" AND "2011-04-30 23:59:59"

This should do a clean range scan. Also, there's more that could be done with composite indexes.

Finally, is there anything wrong with using GROUP BY, for example

SELECT data.id_means, acq.datetime AS datetime, MAX(data.value)
FROM 
       acq INNER JOIN data ON acq.id = data.id_acq
WHERE 
       data.id_elab = 2 AND
       datetime BETWEEN "2011-03-01 00:00:00" AND "2011-04-30 23:59:59" AND
       data.id_means IN (1,4,8)
GROUP BY
       data.id_means

This is the most simple form (and most flexible) - even though the rows have not been transposed to columns for you (for different values of data.id_meas). But this will give you the best idea of what performance to expect and which indexes should are the most useful for the query.

EDIT: To get max data.value for *acq.id_cu - data.id_meas - data.id_elab combination* you should be able to use simply

SELECT 
       acq.id_cu, data.id_meas, data.id_elab, acq.datetime AS datetime, MAX(data.value)
FROM 
       acq INNER JOIN data ON acq.id = data.id_acq
WHERE 
       data.id_elab = 2 AND
       datetime BETWEEN "2011-03-01 00:00:00" AND "2011-04-30 23:59:59" AND
       data.id_means IN (1,4,8)
GROUP BY
       acq.id_cu, data.id_meas, data.id_elab, acq.datetime

will give max(data.value) for all combinations of acq.id_cu, data.id_meas, data.id_elab, acq.datetime (after filtering it with values from where - adjusting where impacts the results). This will not show NULLs for combinations that have no rows, but there is a workaround for that if this is the right direction for you. GROUP BY determines ordering too, so changing the order of columns in group by.

If my answer is still missing the point some sample data/test case would be useful.

The confusing part in your example is when you say

each column corresponding data.value for selected acq.id_cu - data.id_meas - data.id_elab combination.

but when you select the data in example queries you select them straight into columns with only datetime grouping, so if it is really a combination of values it is not possible to identify which row corresponds to which combination (there can be multiple rows for a certain date). If it is not really a combination of all values that you want to filter/group on, but the grouping conditions that determine the value of max depend directly on datetime.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜