A performance question in MySQL
I’m seeing a performance behavior in mysqld that I don’t understand.
I have a table t with a primary key id and four data columns col1, …, col4.
The data are in 4 TSV files 'col1.tsv', … 'col4.tsv'. The procedure I use to ingest them is:
CREATE TABLE t (
id INT NOT NULL,
col1 INT NOT NULL,
col2 INT NOT NULL,
col3 INT NOT NULL,
col4 CHAR(12) CHARACTER SET latin1 NOT NULL );
LOAD DATA LOCAL INFILE # POP 1
'col1.tsv' INTO TABLE t (id, col1);
ALTER TABLE t ADD PRIMARY KEY (id);
SET GLOBAL hot_keys.key_buffer_size= # something suitable
CACHE INDEX t IN hot_keys;
LOAD INDEX INTO CACHE t;
DROP TABLE IF EXISTS tmpt;
CREATE TABLE tmpt ( id INT NOT NULL, val INT NOT NULL );
LOAD DATA LOCAL INFILE 'col2.tsv' INTO TABLE tmpt;
INSERT INTO t (id, col2) # POP 2
SELECT tt.id, tt.val FROM tmpt tt
ON DUPLICATE KEY UPDATE col2=tt.val;
DROP TABLE IF EXISTS tmpt;
CREATE TABLE tmpt ( id INT NOT NULL, val INT NOT NULL );
LOAD DATA LOCAL INFILE 'col3.tsv' INTO TABLE tmpt;
INSERT INTO t (id, col3) # POP 3
SELECT tt.id, tt.val FROM tmpt tt
ON DUPLICATE KEY UPDATE col3=tt.val;
DROP TABLE IF EXISTS tmpt;
CREATE TABLE tmpt ( id INT NOT NULL,
val CHAR(12) CHARACTER SET latin1 NOT NULL );
LOAD DATA LOCAL INFILE 'col4.tsv' INTO TABLE tmpt;
INSERT INTO t (id, col4) # POP 4
SELECT tt.id, tt.val FROM tmpt tt
ON DUPLICATE KEY UPDATE col4=tt.val;
Now here’s the performance thing I don’t understand. Sometimes the POP 2 and 3 INSERT INTO … SELECT … ON DUPLICATE KEY UPDATE queries run very fast, with mysqld occupying 100% of a core, and at other times mysqld bogs down at 1% CPU, reading t.MYD, i.e. table t’s MyISAM data file, at random offsets.
I’ve had a very hard time isolating in which circumstances it is fast and in which it is slow but I have found one repeatable case:
In the above sequence, POP 2 and 3 are very slow. But if I create t without col4 then POP 2 and POP 3 are very fast. Why?
And if, after that, I add col4 with an ALTER TABLE query then POP 4 runs very fast too.
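For concreteness, the variant I mean is roughly this (the empty-string default is just something I added to satisfy NOT NULL, not part of the original definition):

# Create t without col4, run POP 1-3, then bolt the CHAR column on afterwards.
ALTER TABLE t
  ADD COLUMN col4 CHAR(12) CHARACTER SET latin1 NOT NULL DEFAULT '';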
Again, when the INSERTs run slow, mysqld is bogged down in file IO reading from random offsets in table t’s MyISAM data file. I don’t even understand why it is reading that file.
MySQL server version 5.0.87. OS X 10.6.4 on Core 2 Duo iMac.
UPDATE
I eventually found (what I think is) the answer to this question. The mysterious difference between some inserts being slow and some fast is dependent on the data.
The clue was: when the insert is slow, mysqld is seeking on average 0.5GB between reads on t.MYD. When it is fast, successive reads have tiny relative offsets.
The confusion arose because some of the 'col?.tsv' files happen to have their rows in roughly the same order w.r.t. the id column, while others are randomly ordered relative to it.
I was able to drastically reduce overall processing time by using sort(1) on the tsv files before loading and inserting them.
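If you'd rather not pre-sort the files, a rough in-database equivalent (a sketch, not what I actually ran) is to physically reorder the staging table before the merge; note this only helps if t itself was also loaded in ascending id order:

# ALTER TABLE ... ORDER BY rewrites the MyISAM staging table's rows in id
# order, so the merge below scans tmpt, and therefore touches t.MYD, in
# roughly ascending id order instead of seeking randomly.
ALTER TABLE tmpt ORDER BY id;

INSERT INTO t (id, col2) # same shape as POP 2
SELECT tt.id, tt.val FROM tmpt tt
ON DUPLICATE KEY UPDATE col2=tt.val;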
It's a pretty open question... here's a speculative, open answer. :)
... when the INSERTs run slow, mysqld is bogged down in file IO reading from random offsets in table t’s MyISAM data file. I don’t even understand why it is reading that file.
I can think of two possible explanations:
- Even after it knows there is a primary-key collision, it has to read what is currently stored in the field being updated -- if that value coincidentally already equals the new one (0 in this case), it won't perform the update, i.e. zero rows affected.
- Moreover, when you update a field, I believe MySQL re-writes the whole row back to disk (if not multiple rows, due to paging), and not just that single field as one might assume. (The Handler counters sketched below are a rough way to watch both effects.)
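A rough diagnostic sketch, using standard status counters, to see how many per-row reads and rewrites one of the slow statements actually does:

FLUSH STATUS; # reset this session's counters
# ... run one POP 2/3 style INSERT ... ON DUPLICATE KEY UPDATE here ...
# Handler_read_key counts rows fetched via the primary key (the per-row
# reads from t.MYD); Handler_update counts rows rewritten.
SHOW SESSION STATUS LIKE 'Handler%';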
But if I create t without col4 then POP 2 and POP 3 are very fast. Why?
If it's a fixed-row-size MyISAM table, which it looks like given the datatypes, then including the CHAR field, even if it's blank, will make the table 75% larger on disk (4 bytes per INT field = 16 bytes, whereas the CHAR(12) would add another 12 bytes). So, in theory, you'll need to read/write 75% more.
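A quick way to check the row format and on-disk row size (standard MySQL command; the byte counts below are what I'd expect from the definitions above, not measured):

# Row_format should say "Fixed"; Avg_row_length should be roughly 28-29
# bytes with col4 (4 INTs + CHAR(12) + record header byte) and roughly
# 16-17 bytes without it.
SHOW TABLE STATUS LIKE 't';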
Does your dataset fit in memory? Have you considered using InnoDB or Memory tables?
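If you do try InnoDB, a minimal sketch of the same table (sizing innodb_buffer_pool_size is up to you) might look like:

# On InnoDB the rows are stored clustered on the primary key, so updates
# by id stay localized even without pre-sorting the input.
CREATE TABLE t (
id INT NOT NULL,
col1 INT NOT NULL,
col2 INT NOT NULL,
col3 INT NOT NULL,
col4 CHAR(12) CHARACTER SET latin1 NOT NULL,
PRIMARY KEY (id) ) ENGINE=InnoDB;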
Addendum
If the usable/active/hot dataset goes from fitting in memory to not fitting in memory, a performance drop of orders of magnitude isn't unheard of. A couple of reads:
http://jayant7k.blogspot.com/2010/10/foursquare-outage-post-mortem.html
http://www.mysqlperformanceblog.com/2010/11/19/is-there-benefit-from-having-more-memory/