
MySQL: Indexing Table With 100+ Million Rows

I've found myself in a bit of a predicament. I have a table used for page hit tracking with nearly 105 million rows(!). It looks like this:

CREATE TABLE `media_hits` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `media_code` char(7) NOT NULL,
  `day` date NOT NULL,
  `hits` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY  (`id`),
  UNIQUE KEY `media_code` (`media_code`,`day`)
) ENGINE=InnoDB;

As you can imagine, running any kind of query on this table takes a long time. A typical query would be the following:

SELECT DISTINCT(`media_code`), COUNT(*) AS c
FROM `media_hits`
WHERE `day` >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY(`media_code`)
ORDER BY c DESC
LIMIT 200;

This query takes forever. And EXPLAIN on the query gives me this:

           id: 1
  select_type: SIMPLE
        table: media_hits
         type: index
possible_keys: NULL
          key: media_code
      key_len: 10
          ref: NULL
         rows: 104773158
        Extra: Using where; Using index; Using temporary; Using filesort

That's just plain awful. So my question is: What can I do about this? Trying to add proper indexes now is impossible. The ALTER TABLE query would probably take over a week to run. I tried deleting rows older than 6 months, but 24 hours later that query was still running.

I need to fix this somehow. The only thing that comes to mind is creating a new table with proper indexes and starting to record hits in that table. In the background I could have a script slowly inserting records from the old media_hits table. Can anyone offer suggestions on how to index this table, and possibly some hints on which columns I should index?


For this kind of job, indexing alone will most probably not help you much. It is better to think about a caching strategy, with additional tables storing the aggregates you need.

For example, for your query above, you might add a second table "media_code_per_day" containing three columns: "media_code", "counter" and "date". Every time you insert a row into your original table, also update "media_code_per_day" accordingly. Then you can run a new query on "media_code_per_day" instead of your original query.
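As a minimal sketch (the exact column definitions and the upsert pattern are my assumptions, not something spelled out above; I also reuse the column name `day` from the original table), the aggregate table and its per-insert upkeep could look like this:

CREATE TABLE `media_code_per_day` (
  `media_code` char(7) NOT NULL,
  `day` date NOT NULL,
  `counter` int(10) unsigned NOT NULL default '0',
  PRIMARY KEY (`media_code`, `day`),
  KEY `day` (`day`)
) ENGINE=InnoDB;

-- run alongside every insert into media_hits
INSERT INTO `media_code_per_day` (`media_code`, `day`, `counter`)
VALUES ('abc1234', CURDATE(), 1)
ON DUPLICATE KEY UPDATE `counter` = `counter` + 1;

The top-200 query from the question then only has to touch this much smaller table:

SELECT `media_code`, SUM(`counter`) AS c
FROM `media_code_per_day`
WHERE `day` >= DATE_SUB(NOW(), INTERVAL 1 DAY)
GROUP BY `media_code`
ORDER BY c DESC
LIMIT 200;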

Of course, to initialize the new table in your situation, you will have to make one batch run over all your existing rows, but that is only needed once.
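That one-time backfill could be as simple as the statement below (it still scans all 100+ million rows, so expect it to run for a long time):

INSERT INTO `media_code_per_day` (`media_code`, `day`, `counter`)
SELECT `media_code`, `day`, COUNT(*)
FROM `media_hits`
GROUP BY `media_code`, `day`;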


For that specific query, an index on (day, media_code) would help the most. It will still have to use a temporary table because of the GROUP BY, and a filesort because you are ordering by COUNT(*), but that index will significantly reduce the number of rows it has to scan.
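For reference, adding that index might look like the statement below (the index name is arbitrary); on a table this size the ALTER itself will of course take a long time, which is exactly the concern raised in the question:

ALTER TABLE `media_hits` ADD INDEX `day_media_code` (`day`, `media_code`);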

If you need better performance than that, then you will probably have to do as @DocBrown says and make some aggregate tables. But I would try the index first to see if it helps enough, before going to all the work of extra tables.

Also, you can add a LIMIT to a DELETE query if you want to slowly clean up old rows without running one big delete that takes days. You can delete them in batches (say 10K or 100K rows at a time) to gradually shrink the table until it is small enough to add the index.
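A rough sketch of that batched cleanup, reusing the 6-month cutoff from the question; repeat the statement (ideally with a short pause in between) until it reports 0 rows affected:

DELETE FROM `media_hits`
WHERE `day` < DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
LIMIT 100000;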


You could also take a look at the Vertica Community Edition, where something simple like

SELECT count(*) FROM event_track;
    count    
------------
1595756573
(1 row)

comes back in 6 seconds on a system where the query has not been run recently. Yes, that is close to 1.6 billion rows, and I run queries like the one you mentioned above against it all the time, with a very reasonable response time (often seconds, less often minutes).

The nice thing is that once you have dumped your live data out of MySQL into a huge CSV file, it is fast and easy to import it into Vertica with a single COPY command.
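As a rough sketch of that round trip (the file path, delimiter, and quoting are assumptions), the MySQL side can use SELECT ... INTO OUTFILE and the Vertica side a plain COPY:

-- in MySQL: dump the table to CSV
SELECT `media_code`, `day`, `hits`
FROM `media_hits`
INTO OUTFILE '/tmp/media_hits.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';

-- in Vertica: load the CSV into an equivalent table
COPY media_hits FROM '/tmp/media_hits.csv' DELIMITER ',' ENCLOSED BY '"';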

https://dba.stackexchange.com/a/35614/20451 has details on where to download vertica.
