How to optimize a query that's using GROUP BY on a large number of rows

The table looks like this:

    CREATE TABLE `tweet_tweet` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `text` varchar(256) NOT NULL,
      `created_at` datetime NOT NULL,
      `created_date` date NOT NULL,
...
      `positive_sentiment` decimal(5,2) DEFAULT NULL,
      `negative_sentiment` decimal(5,2) DEFAULT NULL,
      `entity_id` int(11) DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `tweet_tweet_entity_created` (`entity_id`,`created_at`)
    ) ENGINE=MyISAM AUTO_INCREMENT=1097134 DEFAULT CHARSET=utf8

The EXPLAIN output for the query looks like this:

    mysql> explain SELECT `tweet_tweet`.`entity_id`,
           STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
           AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
           COUNT(`tweet_tweet`.`id`) AS `tweet_count`
           FROM `tweet_tweet`
           WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
           GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;
    +----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
    | id | select_type | table       | type | possible_keys | key  | key_len | ref  | rows    | Extra                                        |
    +----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
    |  1 | SIMPLE      | tweet_tweet | ALL  | NULL          | NULL | NULL    | NULL | 1097452 | Using where; Using temporary; Using filesort |
    +----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
    1 row in set (0.00 sec)

About 300k rows are added to the table every day. The query takes about 4 seconds right now, but I want to get it down to around 1 second, and I'm afraid it will take exponentially longer as the days go on. The total number of rows in tweet_tweet is currently only a little over 1M, but it will be growing fast.

Any thoughts on optimizing this? Do I need any more indexes? Should I be using something like Cassandra instead of MySQL? =)


You may try reordering the fields in the index, i.e. KEY tweet_tweet_entity_created (created_at, entity_id). That will allow MySQL to use the index for the range condition on created_at, reducing the number of rows that actually need to be grouped and ordered.
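
A minimal sketch of that change, assuming the new index name tweet_tweet_created_entity is free to use (the name is purely illustrative):

    -- Replace the (entity_id, created_at) index with one that leads on
    -- created_at, so the range condition in the WHERE clause can use it.
    ALTER TABLE `tweet_tweet`
      DROP INDEX `tweet_tweet_entity_created`,
      ADD INDEX `tweet_tweet_created_entity` (`created_at`, `entity_id`);

Note that with created_at first, MySQL can range-scan the index for the WHERE clause, but it will still need a temporary table and a filesort for the GROUP BY, since rows no longer come out in entity_id order.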


You're not using the index tweet_tweet_entity_created. Change your query to:

    explain SELECT `tweet_tweet`.`entity_id`,
           STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
           AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
           COUNT(`tweet_tweet`.`id`) AS `tweet_count`
           FROM `tweet_tweet` FORCE INDEX (tweet_tweet_entity_created)
           WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
           GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;

You can read more about index hints in the MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

Sometimes MySQL's query optimizer needs a little help.


MySQL has a dirty little secret: when you create an index over multiple columns, only a leftmost prefix of it is really used, so a query that doesn't filter on the first column gets no benefit from the index. I've made tables that used unique keys and foreign keys, and I often had to add a separate index for one or more of the columns.

I suggest adding an extra index on just created_at at a minimum, as sketched below. I don't know whether adding indexes on the aggregated columns would also speed things up.
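
A minimal sketch, assuming the index name tweet_tweet_created is not already taken:

    -- Single-column index so the range filter on created_at can be
    -- satisfied from an index instead of a full table scan.
    ALTER TABLE `tweet_tweet`
      ADD INDEX `tweet_tweet_created` (`created_at`);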


If your MySQL version is 5.1 or higher, you can consider the partitioning option for large tables:

http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
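
A hedged sketch of monthly range partitioning on created_at (MySQL requires the partitioning column to be part of every unique key, so the primary key must be widened first; the partition names and boundaries below are purely illustrative):

    -- The partitioning column must appear in every unique key,
    -- including the primary key.
    ALTER TABLE `tweet_tweet`
      DROP PRIMARY KEY,
      ADD PRIMARY KEY (`id`, `created_at`);

    -- Monthly range partitions; add new ones as time moves forward.
    ALTER TABLE `tweet_tweet`
      PARTITION BY RANGE (TO_DAYS(`created_at`)) (
        PARTITION p201009 VALUES LESS THAN (TO_DAYS('2010-10-01')),
        PARTITION p201010 VALUES LESS THAN (TO_DAYS('2010-11-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
      );

The range condition on created_at can then be pruned to just the most recent partitions rather than scanning the whole table.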
