
A very interesting MySQL problem (related to indexing, millions of records, and algorithms)

This problem is pretty hard to describe, and therefore difficult to search for an answer to. I hope some experts can share their opinions on it.

I have a table with around 1 million records. The table structure is similar to this:

create table items (
  uid bigint(15) not null primary key,
  updated int(11) not null,
  enabled tinyint(1) not null,
  key idx_updated (updated),
  key idx_enabled (enabled)
);

The scenario is like this: I have to select all of the records every day and do some processing. It takes around 3 seconds to handle each item.

I have written a PHP script to fetch 200 items each time using the following query:

select * from items where updated < unix_timestamp(now()) - 86400 and enabled = 1 limit 200;

I then update the "updated" field of the selected items to make sure they won't be selected again within one day. The update query is something like this:

update items set updated = unix_timestamp(now()) where uid in (1,2,3,4,...);

Then the PHP script continues to run and processes the data, which doesn't require any MySQL connection anymore.


Since I have a million records and each record takes 3 seconds to process, it's definitely impossible to do it sequentially. Therefore, I launch the PHP script every 10 seconds.

However, as time goes by and the table grows, the SELECT gets much slower. Sometimes it takes more than 100 seconds to run!


Do you guys have any suggestions on how I might solve this problem?


There are two points that I can think of that should help:

a. unix_timestamp(now()) - 86400

... this will evaluate now() for every single row; make it a constant by setting a variable to that value before each run.
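For example, a minimal sketch using a MySQL user variable (the @cutoff name is just illustrative):

set @cutoff = unix_timestamp(now()) - 86400;
select * from items where updated < @cutoff and enabled = 1 limit 200;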

b. Indexes help reads but can slow down writes

Consider disabling the indexes before updating (DISABLE KEYS) and re-enabling them before reading (ENABLE KEYS).


I don't think the index on enabled is doing you any good; the cardinality is too low. Remove it and your UPDATEs should go faster.
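A quick sketch, assuming the index is simply named enabled (run SHOW INDEX first to confirm the real name):

show index from items;
alter table items drop index enabled;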

I am not sure what you mean when you say each record takes 3 seconds, since you are handling them in batches of 200. How are you determining this, and what other processing is involved?


You could do this:

  1. dispatcher.php: Manages the whole process.
    • fetches items in convenient packages from the database
    • calls worker.php on the same server with an HTTP post containing all UIDs fetched (I understand that worker.php would not need more than the UID to do its job)
    • maintains a counter of how many worker.php scripts are running. When one is started, the counter is incremented up to a certain limit; when a worker returns, the counter is decremented. See "Asynchronous PHP calls?".
    • repeats until all records are fetched once. Maintain a MySQL LIMIT counter and do not work with updated.
  2. worker.php: does the actual work
    • does its thing with each item posted.
    • writes to a helper table the ID of each item it has processed (no index on that table)
  3. dispatcher.php: housekeeping.
    • once all workers have returned, updates the main table from the helper table in a single statement (see the sketch after this list)
  4. error recovery
    • since worker.php would update the helper table after each item done, you can use the state of the helper table to recover from a crash. Saving the "work package" of each individual worker before it starts running would help to recover worker states as well.

You would have a multi-threaded processing chain this way and could even distribute the whole thing across multiple machines.
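A minimal sketch of the fetch, helper table and housekeeping statements (table and column names are illustrative; the dispatcher would advance the OFFSET by the batch size on each round):

create table processed_uids (uid bigint not null);  -- deliberately no index

-- dispatcher.php: fetch one package without touching `updated`
select uid from items where enabled = 1 order by uid limit 200 offset 0;

-- worker.php: record each item as soon as it is done
insert into processed_uids (uid) values (12345);

-- dispatcher.php housekeeping: a single statement once all workers have returned
update items join processed_uids using (uid)
set items.updated = unix_timestamp(now());

truncate table processed_uids;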


You could try running this before the update:

ALTER TABLE items DISABLE KEYS;

and then when you're done updating,

ALTER TABLE items ENABLE KEYS;

That should rebuild the indexes much faster than updating them one record at a time. (Note that DISABLE KEYS only affects non-unique indexes on MyISAM tables.)


For a table with fewer than a couple of billion records, the primary key should be an unsigned int rather than a bigint.
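A sketch of that change, assuming nothing else references uid and you can afford the table rebuild:

alter table items modify uid int unsigned not null;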


One idea:

Use a HANDLER; that should improve your performance considerably:

http://dev.mysql.com/doc/refman/5.1/en/handler.html
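A rough sketch of a batched read through HANDLER, assuming the index on updated is named updated and the application pre-computes the cutoff timestamp (shown as a literal here):

handler items open;
handler items read `updated` < (1609459200) where enabled = 1 limit 200;
handler items close;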
