Optimizing an SQL query with generated GROUP BY statement

I have this query:

   SELECT ROUND(AVG(temp)*multT + conT,2) as temp,
          FLOOR(timestamp/$secondInterval) as meh
     FROM sensor_locass
LEFT JOIN sensor_data USING(sensor_id)
    WHERE sensor_id = '$id'
      AND project_id = '$project'
 GROUP BY meh
 ORDER BY timestamp ASC

The purpose is to select data for drawing a graph. I average over a pixel's worth of data so the graph stays faithful to the underlying data.

So far, optimization has included adding indexes and switching between MyISAM and InnoDB, but with no luck.

Since the time interval changes with the graph zoom level and the period of data collection, I cannot make a separate column for the GROUP BY statement, and the query is slow. Does anyone have ideas for optimizing this query, or the table, to make the grouping faster? I currently have indexes on the timestamp, sensor_id and project_id columns, but the timestamp index is not used.

When running EXPLAIN EXTENDED with the query I get the following:

    id  select_type  table          type  possible_keys                       key               key_len  ref                               rows   filtered  Extra
    1   SIMPLE       sensor_locass  ref   sensor_id_lookup,project_id_lookup  sensor_id_lookup  4        const                             2      100.00    Using where; Using temporary; Using filesort
    1   SIMPLE       sensor_data    ref   idsensor_lookup                     idsensor_lookup   4        webstech.sensor_locass.sensor_id  66857  100.00

The sensor_data table currently contains 2.7 million data points, which is only a small fraction of the amount of data I will end up having to work with. Any helpful ideas, comments or solutions would be most welcome.

EDIT: table definitions:

CREATE TABLE `sensor_data` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `gateway_id` int(11) NOT NULL,
 `timestamp` int(10) NOT NULL,
 `v1` int(11) NOT NULL,
 `v2` int(11) NOT NULL,
 `v3` int(11) NOT NULL,
 `sensor_id` int(11) NOT NULL,
 `temp` decimal(5,3) NOT NULL,
 `oxygen` decimal(5,3) NOT NULL,
 `batVol` decimal(4,3) NOT NULL,
 PRIMARY KEY (`id`),
 KEY `gateway_id` (`gateway_id`),
 KEY `time_lookup` (`timestamp`),
 KEY `idsensor_lookup` (`sensor_id`)
) ENGINE=MyISAM AUTO_INCREMENT=2741126 DEFAULT CHARSET=latin1

CREATE TABLE `sensor_locass` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `project_id` int(11) NOT NULL,
 `sensor_id` int(11) NOT NULL,
 `start` date NOT NULL,
 `end` date NOT NULL,
 `multT` decimal(6,3) NOT NULL,
 `conT` decimal(6,3) NOT NULL,
 `multO` decimal(6,3) NOT NULL,
 `conO` decimal(6,3) NOT NULL,
 `xpos` decimal(4,2) NOT NULL,
 `ypos` decimal(4,2) NOT NULL,
 `lat` decimal(9,6) NOT NULL,
 `lon` decimal(9,6) NOT NULL,
 `isRef` tinyint(1) NOT NULL,
 PRIMARY KEY (`id`),
 KEY `sensor_id_lookup` (`sensor_id`),
 KEY `project_id_lookup` (`project_id`)
) ENGINE=MyISAM AUTO_INCREMENT=238 DEFAULT CHARSET=latin1


Despite everyone's answers, changing the primary key to optimize the search on the table with 238 rows isn't going to change anything, especially when the EXPLAIN shows a single key narrowing the search to two rows. And adding timestamp to the primary key on sensor_data won't work either, since nothing is querying the timestamp, just calculating on it (unless you can restrict on the timestamp values, as galymzhan suggests).

Oh, and you can drop the LEFT from your query, since the match on project_id makes it irrelevant anyway (but it doesn't slow anything down either). And please don't interpolate variables directly into a query if those variables come from user input, to avoid $project_id = "'; DROP TABLES; --"-style SQL injection exploits.
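For example, here is a minimal sketch using MySQL's server-side prepared statements (a client-side API such as PHP's mysqli or PDO placeholders achieves the same thing); the @-variables and their values are illustrative, not taken from the question:

    -- Placeholders keep user input out of the SQL text entirely.
    PREPARE graph_stmt FROM
        'SELECT ROUND(AVG(temp)*multT + conT, 2) AS temp,
                FLOOR(timestamp/?) AS meh
           FROM sensor_locass
           JOIN sensor_data USING (sensor_id)
          WHERE sensor_id = ? AND project_id = ?
          GROUP BY meh
          ORDER BY timestamp ASC';
    SET @interval = 60, @id = 1, @project = 1;  -- example values only
    EXECUTE graph_stmt USING @interval, @id, @project;
    DEALLOCATE PREPARE graph_stmt;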

Adjusting your heap sizes could work for a while, but you'll have to keep adjusting them as you scale.
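If you do go down that road, raising the in-memory temporary table limits looks something like this; the 256 MB figure is purely illustrative, not a recommendation:

    -- Both variables must be raised together: the smaller of the two
    -- is what actually limits in-memory temporary tables.
    SET GLOBAL tmp_table_size      = 268435456;  -- 256 MB, illustrative
    SET GLOBAL max_heap_table_size = 268435456;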

The answer vdrmrt suggests might work, but then you'd need to populate your aggregate table with every single possible value of $secondInterval, which I'm assuming isn't very plausible given the flexibility you said you needed.

In the same vein, you could consider rrdtool, either using it directly or modifying your data in the same way that it does. What I'm referring to specifically is that it keeps the raw data for a given period of time (usually a few days), then averages the data points together over larger and larger periods of time. The end result is that you can zoom in to high detail for recent periods, but if you look back further the data has effectively been lossy-compressed to averages over large periods of time (e.g. one data point per second for a day, one data point per minute for a week, one data point per hour for a month, etc.). You could customize those averages initially, but unless you kept both the raw data and the summarized data, you wouldn't be able to go back and adjust. In particular, you could not dynamically zoom in to high detail on some older arbitrary point (such as looking at the per-second data for one hour of time occurring six months ago).

So you'll have to decide whether such restrictions are reasonable given your requirements.

If not, I would argue that you are trying to do something in MySQL that it was not designed for. I would suggest pulling the raw data you need and taking the averages in PHP rather than in your query. As has already been pointed out, the main reason your query takes a long time is that the GROUP BY clause forces MySQL to crunch all the data in memory, but since it's too much data, it actually writes that data temporarily to disk (hence the "Using filesort").

However, you have much more flexibility in terms of how much memory you can use in PHP. Furthermore, since you are combining nearby rows, you can pull the data out row by row, combining it on the fly, and thereby never need to keep all the rows in memory in your PHP process. You can then drop the GROUP BY and avoid the filesort. Use an ORDER BY timestamp instead, and if MySQL doesn't optimize it correctly, make sure you use FORCE INDEX FOR ORDER BY (time_lookup), the name of the timestamp index, as sketched below.
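A hedged sketch of that streaming read, using the time_lookup index name from the sensor_data definition above (the join is kept only so multT and conT are still available; the per-pixel averaging then happens in PHP as the rows arrive):

    -- No GROUP BY: rows come back in timestamp order and are
    -- averaged into pixel buckets on the fly in the PHP loop.
    SELECT temp, multT, conT, timestamp
      FROM sensor_data FORCE INDEX FOR ORDER BY (time_lookup)
      JOIN sensor_locass USING (sensor_id)
     WHERE sensor_id = '$id'
       AND project_id = '$project'
     ORDER BY timestamp ASC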


I'd suggest that you find a natural primary key for your tables and switch to InnoDB. This is a guess at what your data looks like:

sensor_data: PRIMARY KEY (sensor_id, timestamp)

sensor_locass: PRIMARY KEY (sensor_id, project_id)
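A hedged sketch of the corresponding ALTER statements, assuming those column pairs really are unique and the surrogate id columns no longer need to be the primary key:

    -- AUTO_INCREMENT must be removed first, since MySQL requires an
    -- AUTO_INCREMENT column to be part of a key.
    ALTER TABLE sensor_data
        MODIFY `id` int(11) NOT NULL,
        DROP PRIMARY KEY,
        ADD PRIMARY KEY (`sensor_id`, `timestamp`),
        ENGINE = InnoDB;

    ALTER TABLE sensor_locass
        MODIFY `id` int(11) NOT NULL,
        DROP PRIMARY KEY,
        ADD PRIMARY KEY (`sensor_id`, `project_id`),
        ENGINE = InnoDB;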

InnoDB will order all the data this way, so rows you're likely to SELECT together will be together on disk. I think your GROUP BY will always cause some trouble, though. If you can keep the temporary table below the size where it switches over to a filesort (tmp_table_size and max_heap_table_size), it'll be much faster.

How many rows are you generally returning? How long is it taking now?


As Joshua suggested, you should define (sensor_id, project_id) as the primary key for the sensor_locass table, because at the moment the table has 2 separate indexes, one on each of the columns. According to the MySQL docs, a SELECT will choose only one of them (the most restrictive one, the one that finds the fewest rows), while a composite primary key allows MySQL to use both columns for indexing the data.

However, EXPLAIN shows that MySQL examined 66857 rows of the joined table, so you should somehow optimize that too. Maybe you could query sensor data for a given interval of time, like timestamp BETWEEN begin AND end?
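For instance, a sketch of the original query restricted to the visible range; $begin and $end are hypothetical placeholders for whatever window the graph is currently showing:

    SELECT ROUND(AVG(temp)*multT + conT, 2) AS temp,
           FLOOR(timestamp/$secondInterval) AS meh
      FROM sensor_locass
      JOIN sensor_data USING (sensor_id)
     WHERE sensor_id = '$id'
       AND project_id = '$project'
       AND timestamp BETWEEN $begin AND $end  -- hypothetical bounds
     GROUP BY meh
     ORDER BY timestamp ASC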


I agree that the first step should be to define (sensor_id, project_id) as the primary key for sensor_locass. If that is not enough and your data is relatively static, you can create an aggregated table that you refresh, for example, every day, and then query from there. What you still have to do is define a range of values for secondInterval, store them in a new table, and add that field to the primary key of your aggregated table.
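A hedged sketch of the two supporting tables this assumes; the names match the query below, and the column types are guesses based on sensor_data:

    -- Each row of secondIntervalRange is one supported zoom level.
    CREATE TABLE secondIntervalRange (
        secondInterval int(11) NOT NULL,
        PRIMARY KEY (secondInterval)
    );

    -- One pre-averaged point per sensor, project, zoom level and bucket.
    CREATE TABLE aggregated_sensor_data (
        sensor_id      int(11) NOT NULL,
        project_id     int(11) NOT NULL,
        secondInterval int(11) NOT NULL,
        timestamp      int(10) NOT NULL,
        temp           decimal(6,2) NOT NULL,
        meh            int(11) NOT NULL,
        PRIMARY KEY (sensor_id, project_id, secondInterval, meh)
    );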

The query to populate the aggregated table will be something like this:

INSERT INTO aggregated_sensor_data (sensor_id, project_id, secondInterval, timestamp, temp, meh)
SELECT
    sensor_locass.sensor_id,
    sensor_locass.project_id,
    secondInterval,
    MIN(timestamp) AS timestamp,
    ROUND(AVG(temp)*multT + conT, 2) AS temp,
    FLOOR(timestamp/secondInterval) AS meh
FROM
    sensor_locass
    LEFT JOIN sensor_data
    USING (sensor_id)
    CROSS JOIN secondIntervalRange
WHERE
        sensor_id = '$id'
    AND
        project_id = '$project'
GROUP BY
    sensor_locass.sensor_id,
    sensor_locass.project_id,
    secondInterval,
    meh
ORDER BY
    timestamp ASC

And you can use this query to extract the aggregated data:

SELECT
    temp,
    meh
FROM
    aggregated_sensor_data
WHERE
     sensor_id = '$id'
    AND project_id = '$project'
    AND secondInterval = $secondInterval
ORDER BY
    timestamp ASC   


If you want the timestamp index to be used, you will have to tell MySQL explicitly to use that index. MySQL 5.1 supports USE INDEX FOR ORDER BY / FORCE INDEX FOR ORDER BY. Have a look at it here: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html
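For example, a sketch of the hint applied to the original query; the hint goes right after the table name in the FROM clause, and time_lookup is the actual name of the timestamp index in the sensor_data definition above:

    SELECT ROUND(AVG(temp)*multT + conT, 2) AS temp,
           FLOOR(timestamp/$secondInterval) AS meh
      FROM sensor_data FORCE INDEX FOR ORDER BY (time_lookup)
      JOIN sensor_locass USING (sensor_id)
     WHERE sensor_id = '$id'
       AND project_id = '$project'
     GROUP BY meh
     ORDER BY timestamp ASC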
