
MySQL: Indexing Table With 100+ Million Rows

I've found myself in a bit of a predicament. I have a table used for page hit tracking with nearly 105 million rows(!). It looks like this:

CREATE TABLE `media_hits` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `media_code` char(7) NOT NULL,
  `day` date NOT NULL,
  `hits` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `media_code` (`media_code`,`day`)
) ENGINE=InnoDB;

As you can imagine, running any kind of query on this table takes a long time. A typical query would be the following:

SELECT DISTINCT(`media_code`), COUNT(*) AS c
FROM `media_hits`
WHERE `day` >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY(`media_code`)
ORDER BY c DESC
LIMIT 200;

This query takes forever. And EXPLAIN on the query gives me this:

           id: 1
  select_type: SIMPLE
        table: media_hits
         type: index
possible_keys: NULL
          key: media_code
      key_len: 10
          ref: NULL
         rows: 104773158
        Extra: Using where; Using index; Using temporary; Using filesort

That's just plain awful. So my question is: What can I do about this? Trying to add proper indexes now is impossible. The ALTER TABLE query would probably take over a week to run. I tried deleting rows older than 6 months, but 24 hours later that query was still running.

I need to fix this somehow. The only thing that comes to mind is creating a new table with proper indexes and starting to record hits in that table. In the background I could have a script slowly inserting records from the old media_hits table. Can anyone offer suggestions on how to index this table, and possibly some hints on which columns I should index?


For this kind of job, indexing alone will most probably not help you much. It is better to think about a caching strategy, with additional tables storing the aggregates you need.

For example, for your query above, you might add a second table "media_code_per_day" containing three columns: "media_code", "counter" and "date". Every time you insert a row into your original table, also update "media_code_per_day" accordingly. Then you can run a new query on "media_code_per_day" instead of your original query.
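As a minimal sketch (the exact column definitions and the upsert pattern are my assumptions, not something spelled out above; I also reuse the column name `day` from the original table), the aggregate table and its per-insert upkeep could look like this:

CREATE TABLE `media_code_per_day` (
  `media_code` char(7) NOT NULL,
  `day` date NOT NULL,
  `counter` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (`media_code`, `day`),
  KEY `day` (`day`)
) ENGINE=InnoDB;

-- run alongside every insert into media_hits
INSERT INTO `media_code_per_day` (`media_code`, `day`, `counter`)
VALUES ('abc1234', CURDATE(), 1)
ON DUPLICATE KEY UPDATE `counter` = `counter` + 1;

The top-200 query from the question then only has to touch this much smaller table:

SELECT `media_code`, SUM(`counter`) AS c
FROM `media_code_per_day`
WHERE `day` >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY `media_code`
ORDER BY c DESC
LIMIT 200;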

Of course, to initialize the new table in your situation, you will have to make one batch run over all your existing rows, but that is only needed once.
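That one-time backfill could be as simple as the statement below (it still scans all 100+ million rows, so expect it to run for a long time):

INSERT INTO `media_code_per_day` (`media_code`, `day`, `counter`)
SELECT `media_code`, `day`, COUNT(*)
FROM `media_hits`
GROUP BY `media_code`, `day`;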


For that specific query, an index on (day, media_code) would help the most. It will still have to use a temporary table because of the GROUP BY, and a filesort because you are ordering by COUNT(*), but that index will significantly reduce the number of rows it has to scan.
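For reference, adding that index might look like the statement below (the index name is arbitrary); on a table this size the ALTER itself will of course take a long time, which is exactly the concern raised in the question:

ALTER TABLE `media_hits` ADD INDEX `day_media_code` (`day`, `media_code`);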

If you need better performance than that, then you will probably have to do as @DocBrown says and make some aggregate tables. But I would try the index first to see if it helps enough, before going to all the work of extra tables.

Also, you can add a LIMIT to a DELETE query if you want to slowly clean up old rows without running one big delete that takes days. You can delete them in batches (say 10K or 100K rows at a time) to gradually shrink the table until it is small enough to add the index.
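A rough sketch of that batched cleanup, reusing the 6-month cutoff from the question; repeat the statement (ideally with a short pause in between) until it reports 0 rows affected:

DELETE FROM `media_hits`
WHERE `day` < DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
LIMIT 100000;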


You could also take a look at the Vertica Community Edition, where something simple like

SELECT count(*) FROM event_track;
    count    
------------
1595756573
(1 row)

comes back in 6 seconds on a system where the query has not been run recently. Yes, that is close to 1.6 billion rows, and I run queries like the one you mentioned above against it all the time, with a very reasonable response time (often seconds, less often minutes).

The nice thing is that once you have dumped your live data out of MySQL into a huge CSV file, it is fast and easy to import it into Vertica with a single COPY command.
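As a rough sketch of that round trip (the file path, delimiter, and quoting are assumptions), the MySQL side can use SELECT ... INTO OUTFILE and the Vertica side a plain COPY:

-- in MySQL: dump the table to CSV
SELECT `media_code`, `day`, `hits`
FROM `media_hits`
INTO OUTFILE '/tmp/media_hits.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';

-- in Vertica: load the CSV into an equivalent table
COPY media_hits FROM '/tmp/media_hits.csv' DELIMITER ',' ENCLOSED BY '"';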

https://dba.stackexchange.com/a/35614/20451 has details on where to download vertica.
