Optimize GROUP BY after ranged index query
I have a content application that needs to count responses in a time slice, then order them by number of responses. It currently works great with a small data set, but needs to scale to millions rows. My current query won't work.
mysql> describe Responses;
+---------------+---------------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------+---------------------+------+-----+---------+-------+
| site_id | int(10) unsigned | NO | MUL | NULL | |
| content_id | bigint(20) unsigned | NO | PRI | NULL | |
| response_id | bigint(20) unsigned | NO | PRI | NULL | |
| date | int(10) unsigned | NO | | NULL | |
+---------------+---------------------+------+-----+---------+-------+
The table type is InnoDB, the primary key is on (content_id, response_id). There is an additional index on (content_id, date) used to find responses to a piece of content, and another additional index on (site_id, date) used in the query I am have trouble with:
mysql> explain SELECT content_id id, COUNT(response_id) num_responses
FROM Responses
WHERE site_id = 1
AND date > 1234567890
AND date < 1293579867
GROUP BY content_id
ORDER BY num_responses DESC
LIMIT 0, 10;
+----+-------------+-----------+-------+---------------+------+---------+------+------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-----------+-------+---------------+------+---------+------+------+-----------------------------------------------------------+
| 1 | SIMPLE | Responses | range | date | date | 8 | NULL | 102 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+-----------+-------+---------------+------+---------+------+------+-----------------------------------------------------------+
That's the best I've been able to come up with, but it will end up being in th开发者_如何学运维e 1,000,000's of rows needing to be counted, resulting in 10,000's of rows to sort, to pull in a handful of results.
I can't think of a way to precalculate the count either, as the date range is arbitrary. I have some liberty with changing the primary key: it can be composed of content_id, response_id, and site_id in any order, but cannot contain date.
The application is developed mostly in PHP, so if there is an quicker way to accomplish the same results by splitting the query into subqueries, using temporary tables, or doing things on the application side, I'm open to suggestions.
(Reposted from comments by request)
Set up a table that has three columns: id, date, and num_responses. The column num_responses consists of the number of responses for the given id on the given date. Backfill the table appropriately, and then at around midnight (or later) each night, run a script that adds a new row for the previous day.
Then, to get the rows you want, you can merely query the table mentioned above.
Rather than calculating each time, how about cache the calculated count since the last query, and add the increment of count to update the cache by putting date condition into the WHERE clause?
Have you considered partitioning the table by date? Are there any indices on the table?
精彩评论