
Very slow MySQL query for 2.5 million row table

I'm really struggling to get the query time down; it currently has to query a 2.5 million row table and takes over 20 seconds.

Here is the query:

SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;

Here is the table structure:

CREATE TABLE `log` (
  `id` int(11) NOT NULL auto_increment,
  `instance` varchar(255) NOT NULL,
  `email` varchar(255) NOT NULL,
  `type` enum('play','claim','friend','email') NOT NULL,
  `result` enum('win','win-small','lose','none') NOT NULL,
  `timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `play_date` date NOT NULL,
  `email_refer` varchar(255) NOT NULL,
  `remote_addr` varchar(15) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `email` (`email`),
  KEY `result` (`result`),
  KEY `timestamp` (`timestamp`),
  KEY `email_refer` (`email_refer`),
  KEY `type_2` (`type`,`timestamp`),
  KEY `type_4` (`type`,`play_date`),
  KEY `type_result` (`type`,`play_date`,`result`)
);

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  log ref type_2,type_4,type_result   type_4  1   const   270404  Using where

The query is using the type_4 index.

Does anyone know how I could speed this query up?

Thanks, Tom


That's relatively good already. The performance sink is that the query has to compare 270,404 varchars for equality for the COUNT(DISTINCT(email)), meaning that 270,404 rows have to be read.

You should be able to make the count faster by creating a covering index. This means that the actual rows do not need to be read, because all the required information is present in the index itself.

To do this, change the index as follows:

KEY `type_4` (`type`,`play_date`, `email`)
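
For reference, one way to apply that change, sketched against the log table above (it drops and recreates the type_4 index in a single ALTER):

ALTER TABLE log
  DROP INDEX type_4,
  ADD INDEX type_4 (type, play_date, email);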

I would be surprised if that wouldn't speed things up quite a bit.

(Thanks to MarkR for the proper term.)


Your indexing is probably as good as you can get it. You have a compound index on the two columns in your WHERE clause, and the EXPLAIN you posted indicates that it is being used. Unfortunately, there are 270,404 rows that match the criteria in your WHERE clause, and they all need to be considered. Also, you're not returning unnecessary columns in your select list.

My advice would be to aggregate the data daily (or hourly or whatever makes sense) and cache the results. That way you can access slightly stale data instantly. Hopefully this is acceptable for your purposes.
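
A minimal sketch of that approach, assuming a hypothetical summary table named log_daily_counts that a scheduled job refreshes:

CREATE TABLE log_daily_counts (
  play_date date NOT NULL,
  unique_emails int NOT NULL,
  PRIMARY KEY (play_date)
);

-- recompute the cached counts (could be restricted to recent dates only)
REPLACE INTO log_daily_counts (play_date, unique_emails)
SELECT play_date, COUNT(DISTINCT email)
  FROM log
 WHERE type = 'play'
 GROUP BY play_date;

The reporting query then becomes a trivial range scan over log_daily_counts instead of a 270,404-row aggregation.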


Try an index on play_date, type (the same as type_4, just with the fields reversed) and see if that helps.

There are 4 possible types, and I assume hundreds of possible dates. If the query uses the type, play_date index, it basically (not 100% accurate, but the general idea) says:

(A) Find all the Play records (about 25% of the file)
(B) Now within that subset, find all of the requested dates

By reversing the index, the approach is:

(A) Find all the dates within range (maybe 1-2% of the file)
(B) Now find all PLAY types within that smaller portion of the file
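
If you want to try that without touching type_4, here is a sketch (the index name date_type is just a placeholder), followed by re-running EXPLAIN on the original query to compare the key and rows columns:

ALTER TABLE log ADD INDEX date_type (play_date, type);

EXPLAIN
SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;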

Hope this helps


Extracting email to a separate table should be a good performance boost, since counting distinct varchar fields takes a while. Other than that, the correct index is used and the query itself is as optimized as it can be (except for the email, of course).
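
One possible shape for that, purely as a sketch (the emails table and an email_id column on log are assumptions, not part of the current schema):

CREATE TABLE emails (
  id int NOT NULL auto_increment,
  email varchar(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY email (email)
);

-- log would then store email_id int instead of the varchar, and the report
-- becomes COUNT(DISTINCT email_id): comparing 4-byte integers rather than
-- up-to-255-byte strings.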


The COUNT(DISTINCT(email)) part is the bit that's killing you. If you only truly need the first 2000 results out of 270,404, perhaps it would help to do the email count only for those results instead of for the whole set.

SELECT shortlist.date, COUNT(DISTINCT log.email) AS count
FROM log,
(
    SELECT id, play_date AS date
      FROM log
     WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
       AND type = 'play'
     ORDER BY play_date desc
     LIMIT 2000
) AS shortlist
WHERE shortlist.id = log.id
GROUP BY shortlist.date


Try creating an index only on play_date.


Long term, I would recommend building a summary table with a primary key of play_date and count of distinct emails.

Depending on how up to date you need it to be, either allow it to be updated daily (by play_date) or keep it live via a trigger on the log table.
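
A sketch of the trigger variant, with a hypothetical dedup table play_emails keyed on (play_date, email); the distinct count per date then becomes a cheap COUNT(*) grouped by play_date over that much smaller table:

CREATE TABLE play_emails (
  play_date date NOT NULL,
  email varchar(255) NOT NULL,
  PRIMARY KEY (play_date, email)
);

DELIMITER //
CREATE TRIGGER log_track_play AFTER INSERT ON log
FOR EACH ROW
BEGIN
  IF NEW.type = 'play' THEN
    INSERT IGNORE INTO play_emails (play_date, email)
    VALUES (NEW.play_date, NEW.email);
  END IF;
END//
DELIMITER ;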


There is a good chance a table scan will be quicker than random access to over 200,000 rows:

SELECT ... FROM log IGNORE INDEX (type_2,type_4,type_result) ...

Also, for large grouped queries you may see better performance by forcing a file sort rather than a hashtable-based group (since, if this turns out to need more than tmp_table_size or max_heap_table_size, performance collapses):

SELECT SQL_BIG_RESULT ...
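
Putting both hints onto the original query, purely as an example of where they go:

SELECT SQL_BIG_RESULT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log IGNORE INDEX (type_2, type_4, type_result)
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;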