
Very slow MySQL query for 2.5 million row table

I'm really struggling to get the query time down; it currently has to query a 2.5 million row table and takes over 20 seconds.

Here is the query:

SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;

Here is the table structure:

CREATE TABLE `log` (
  `id` int(11) NOT NULL auto_increment,
  `instance` varchar(255) NOT NULL,
  `email` varchar(255) NOT NULL,
  `type` enum('play','claim','friend','email') NOT NULL,
  `result` enum('win','win-small','lose','none') NOT NULL,
  `timestamp` timestamp NOT NULL default CURRENT_TIMESTAMP,
  `play_date` date NOT NULL,
  `email_refer` varchar(255) NOT NULL,
  `remote_addr` varchar(15) NOT NULL,
  PRIMARY KEY  (`id`),
  KEY `email` (`email`),
  KEY `result` (`result`),
  KEY `timestamp` (`timestamp`),
  KEY `email_refer` (`email_refer`),
  KEY `type_2` (`type`,`timestamp`),
  KEY `type_4` (`type`,`play_date`),
  KEY `type_result` (`type`,`play_date`,`result`)
);

id  select_type table   type    possible_keys   key key_len ref rows    Extra
1   SIMPLE  log ref type_2,type_4,type_result   type_4  1   const   270404  Using where

The query is using the type_4 index.

Does anyone know how I could speed this query up?

Thanks, Tom


That's relatively good already. The performance sink is that the query has to compare 270,404 varchars for equality for the COUNT(DISTINCT(email)), meaning that 270,404 rows have to be read.

You should be able to make the count faster by creating a covering index. This means that the actual rows do not need to be read, because all the required information is present in the index itself.

To do this, change the index as follows:

KEY `type_4` (`type`,`play_date`, `email`)
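
For reference, one way to apply that change, sketched against the log table above (it drops and recreates the type_4 index in a single ALTER):

ALTER TABLE log
  DROP INDEX type_4,
  ADD INDEX type_4 (type, play_date, email);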

I would be surprised if that wouldn't speed things up quite a bit.

(Thanks to MarkR for the proper term.)


Your indexing is probably as good as you can get it. You have a compound index on the two columns in your WHERE clause, and the EXPLAIN you posted indicates that it is being used. Unfortunately, there are 270,404 rows that match the criteria in your WHERE clause, and they all need to be considered. Also, you're not returning unnecessary columns in your select list.

My advice would be to aggregate the data daily (or hourly or whatever makes sense) and cache the results. That way you can access slightly stale data instantly. Hopefully this is acceptable for your purposes.
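
A minimal sketch of that approach, assuming a hypothetical summary table named log_daily_counts that a scheduled job refreshes:

CREATE TABLE log_daily_counts (
  play_date date NOT NULL,
  unique_emails int NOT NULL,
  PRIMARY KEY (play_date)
);

-- recompute the cached counts (could be restricted to recent dates only)
REPLACE INTO log_daily_counts (play_date, unique_emails)
SELECT play_date, COUNT(DISTINCT email)
  FROM log
 WHERE type = 'play'
 GROUP BY play_date;

The reporting query then becomes a trivial range scan over log_daily_counts instead of a 270,404-row aggregation.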


Try an index on play_date, type (the same as type_4, just with the fields reversed) and see if that helps.

There are 4 possible types, and I assume hundreds of possible dates. If the query uses the type, play_date index, it basically (not 100% accurate, but the general idea) says:

(A) Find all the Play records (about 25% of the file)
(B) Now within that subset, find all of the requested dates

By reversing the index, the approach is:

(A) Find all the dates within range (maybe 1-2% of the file)
(B) Now find all PLAY types within that smaller portion of the file
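
If you want to try that without touching type_4, here is a sketch (the index name date_type is just a placeholder), followed by re-running EXPLAIN on the original query to compare the key and rows columns:

ALTER TABLE log ADD INDEX date_type (play_date, type);

EXPLAIN
SELECT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;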

Hope this helps


Extracting email to a separate table should be a good performance boost, since counting distinct varchar fields takes a while. Other than that, the correct index is used and the query itself is as optimized as it can be (except for the email, of course).
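
One possible shape for that, purely as a sketch (the emails table and an email_id column on log are assumptions, not part of the current schema):

CREATE TABLE emails (
  id int NOT NULL auto_increment,
  email varchar(255) NOT NULL,
  PRIMARY KEY (id),
  UNIQUE KEY email (email)
);

-- log would then store email_id int instead of the varchar, and the report
-- becomes COUNT(DISTINCT email_id): comparing 4-byte integers rather than
-- up-to-255-byte strings.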


The COUNT(DISTINCT(email)) part is the bit that's killing you. If you only truly need the first 2000 results out of 270,404, perhaps it would help to do the email count only for those results instead of for the whole set.

SELECT shortlist.date, COUNT(DISTINCT log.email) AS count
FROM log,
(
    SELECT id, play_date AS date
      FROM log
     WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
       AND type = 'play'
     ORDER BY play_date desc
     LIMIT 2000
) AS shortlist
WHERE shortlist.id = log.id
GROUP BY shortlist.date


Try creating an index only on play_date.


Long term, I would recommend building a summary table with a primary key of play_date and count of distinct emails.

Depending on how up to date you need it to be, either allow it to be updated daily (by play_date) or keep it live via a trigger on the log table.
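
A sketch of the trigger variant, with a hypothetical dedup table play_emails keyed on (play_date, email); the distinct count per date then becomes a cheap COUNT(*) grouped by play_date over that much smaller table:

CREATE TABLE play_emails (
  play_date date NOT NULL,
  email varchar(255) NOT NULL,
  PRIMARY KEY (play_date, email)
);

DELIMITER //
CREATE TRIGGER log_track_play AFTER INSERT ON log
FOR EACH ROW
BEGIN
  IF NEW.type = 'play' THEN
    INSERT IGNORE INTO play_emails (play_date, email)
    VALUES (NEW.play_date, NEW.email);
  END IF;
END//
DELIMITER ;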


There is a good chance a table scan will be quicker than random access to over 200,000 rows:

SELECT ... FROM log IGNORE INDEX (type_2,type_4,type_result) ...

Also, for large grouped queries you may see better performance by forcing a file sort rather than a hashtable-based group (since, if this turns out to need more than tmp_table_size or max_heap_table_size, performance collapses):

SELECT SQL_BIG_RESULT ...
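
Putting both hints onto the original query, purely as an example of where they go:

SELECT SQL_BIG_RESULT play_date AS date, COUNT(DISTINCT(email)) AS count
FROM log IGNORE INDEX (type_2, type_4, type_result)
WHERE play_date BETWEEN '2009-02-23' AND '2020-01-01'
AND type = 'play'
GROUP BY play_date
ORDER BY play_date desc;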