How to optimize a query that's using GROUP BY on a large number of rows

The table looks like this:

    CREATE TABLE `tweet_tweet` (
      `id` int(11) NOT NULL AUTO_INCREMENT,
      `text` varchar(256) NOT NULL,
      `created_at` datetime NOT NULL,
      `created_date` date NOT NULL,
...
      `positive_sentiment` decimal(5,2) DEFAULT NULL,
      `negative_sentiment` decimal(5,2) DEFAULT NULL,
      `entity_id` int(11) DEFAULT NULL,
      PRIMARY KEY (`id`),
      KEY `tweet_tweet_entity_created` (`entity_id`,`created_at`)
    ) ENGINE=MyISAM AUTO_INCREMENT=1097134 DEFAULT CHARSET=utf8

The EXPLAIN output for the query looks like this:

    mysql> explain SELECT `tweet_tweet`.`entity_id`,
           STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
           AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
           COUNT(`tweet_tweet`.`id`) AS `tweet_count`
           FROM `tweet_tweet`
           WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
           GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;
    +----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
    | id | select_type | table       | type | possible_keys | key  | key_len | ref  | rows    | Extra                                        |
    +----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
    |  1 | SIMPLE      | tweet_tweet | ALL  | NULL          | NULL | NULL    | NULL | 1097452 | Using where; Using temporary; Using filesort |
    +----+-------------+-------------+------+---------------+------+---------+------+---------+----------------------------------------------+
    1 row in set (0.00 sec)

About 300k rows are added to the table every day. The query takes about 4 seconds right now, but I want to get it down to around 1 second, and I'm afraid it will take exponentially longer as the days go on. The total number of rows in tweet_tweet is currently only a little over 1M, but it will be growing fast.

Any thoughts on optimizing this? Do I need any more indexes? Should I be using something like Cassandra instead of MySQL? =)


You may try reordering the fields in the index, i.e. KEY tweet_tweet_entity_created (created_at, entity_id). That will allow MySQL to use the index for the range condition on created_at, reducing the number of rows that actually need to be grouped and ordered.
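
A minimal sketch of that change, assuming the new index name tweet_tweet_created_entity is free to use (the name is purely illustrative):

    -- Replace the (entity_id, created_at) index with one that leads on
    -- created_at, so the range condition in the WHERE clause can use it.
    ALTER TABLE `tweet_tweet`
      DROP INDEX `tweet_tweet_entity_created`,
      ADD INDEX `tweet_tweet_created_entity` (`created_at`, `entity_id`);

Note that with created_at first, MySQL can range-scan the index for the WHERE clause, but it will still need a temporary table and a filesort for the GROUP BY, since rows no longer come out in entity_id order.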


You're not using the index tweet_tweet_entity_created. Change your query to:

    explain SELECT `tweet_tweet`.`entity_id`,
           STDDEV_POP(`tweet_tweet`.`positive_sentiment`) AS `sentiment_stddev`,
           AVG(`tweet_tweet`.`positive_sentiment`) AS `sentiment_avg`,
           COUNT(`tweet_tweet`.`id`) AS `tweet_count`
           FROM `tweet_tweet` FORCE INDEX (tweet_tweet_entity_created)
           WHERE `tweet_tweet`.`created_at` > '2010-10-06 16:24:43'
           GROUP BY `tweet_tweet`.`entity_id` ORDER BY `tweet_tweet`.`entity_id` ASC;

You can read more about index hints in the MySQL manual: http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

Sometimes MySQL's query optimizer needs a little help.


MySQL has a dirty little secret: when you create an index over multiple columns, only a leftmost prefix of it is really used, so a query that doesn't filter on the first column gets no benefit from the index. I've made tables that used unique keys and foreign keys, and I often had to add a separate index for one or more of the columns.

I suggest adding an extra index on just created_at at a minimum, as sketched below. I don't know whether adding indexes on the aggregated columns would also speed things up.
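
A minimal sketch, assuming the index name tweet_tweet_created is not already taken:

    -- Single-column index so the range filter on created_at can be
    -- satisfied from an index instead of a full table scan.
    ALTER TABLE `tweet_tweet`
      ADD INDEX `tweet_tweet_created` (`created_at`);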


If your MySQL version is 5.1 or higher, you can consider the partitioning option for large tables:

http://dev.mysql.com/doc/refman/5.1/en/partitioning.html
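
A hedged sketch of monthly range partitioning on created_at (MySQL requires the partitioning column to be part of every unique key, so the primary key must be widened first; the partition names and boundaries below are purely illustrative):

    -- The partitioning column must appear in every unique key,
    -- including the primary key.
    ALTER TABLE `tweet_tweet`
      DROP PRIMARY KEY,
      ADD PRIMARY KEY (`id`, `created_at`);

    -- Monthly range partitions; add new ones as time moves forward.
    ALTER TABLE `tweet_tweet`
      PARTITION BY RANGE (TO_DAYS(`created_at`)) (
        PARTITION p201009 VALUES LESS THAN (TO_DAYS('2010-10-01')),
        PARTITION p201010 VALUES LESS THAN (TO_DAYS('2010-11-01')),
        PARTITION pmax    VALUES LESS THAN MAXVALUE
      );

The range condition on created_at can then be pruned to just the most recent partitions rather than scanning the whole table.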
