How can I make MySQL as fast as a flat file in this scenario?
- Assume a key-value table with at least 10s of millions of rows.
- Define an operation that takes a large number of IDs (again, tens of millions), finds the corresponding values, and sums them.
Using a database, this operation seems like it can approach (disk seek time) * (number of lookups).
Using a flat file and reading through the entire contents, this operation will approach (file size) / (drive transfer rate).
Plugging in some rough values (from Wikipedia and/or experimentation):

- seek time = 0.5 ms
- transfer rate = 64 MB/s
- file size = 800 MB (for 70 million int/double key/value pairs)
- value lookups = 65 million

DB time = 0.5 ms * 65,000,000 = 32,500 s ≈ 9 hours
Flat file time = 800 MB / (64 MB/s) = 12.5 s
Experimental results are not as bad for MySQL, but the flat file still wins.
Experiments:
Create InnoDB and MyISAM id/value pair tables, e.g.:

CREATE TABLE `ivi` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `val` double DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Fill with 32 million rows of data of your choice. Query with:

select sum(val) from ivi where id not in (1,12,121,1121); -- be sure to change the numbers each time, or clear the query cache
Use the following code to create and read the key/value flat file from Java:

import java.io.*;
import java.util.Arrays;
import java.util.HashSet;

private static void writeData() throws IOException {
    // negative start time; adding the end time below yields elapsed ms
    long t = -System.currentTimeMillis();
    File dat = new File("/home/mark/dat2");
    if (dat.exists()) {
        dat.delete();
    }
    FileOutputStream fos = new FileOutputStream(dat);
    ObjectOutputStream os = new ObjectOutputStream(new BufferedOutputStream(fos));
    // write 32 million int/double key/value pairs
    for (int i = 0; i < 32000000; i++) {
        os.writeInt(i);
        os.writeDouble(i / 2.0);
    }
    os.flush();
    os.close();
    t += System.currentTimeMillis();
    System.out.println("time ms = " + t);
}

private static void performSummationQuery() throws IOException {
    long t = -System.currentTimeMillis();
    File dat = new File("/home/mark/dat2");
    FileInputStream fin = new FileInputStream(dat);
    ObjectInputStream in = new ObjectInputStream(new BufferedInputStream(fin));
    // IDs to exclude from the sum
    HashSet<Integer> set = new HashSet<Integer>(Arrays.asList(11, 101, 1001, 10001, 100001));
    int i;
    double d;
    double sum = 0;
    try {
        // scan the whole file sequentially, summing every non-excluded value
        while (true) {
            i = in.readInt();
            d = in.readDouble();
            if (!set.contains(i)) {
                sum += d;
            }
        }
    } catch (EOFException e) {
        // end of file reached; fall through with the completed sum
    }
    in.close();
    System.out.println("sum = " + sum);
    t += System.currentTimeMillis();
    System.out.println("time ms = " + t);
}
RESULTS:
- InnoDB: 8.0-8.1 s
- MyISAM: 3.1-16.5 s
- Stored proc: 80-90 s
- Flat file: 1.6-2.4 s (even after: echo 3 > /proc/sys/vm/drop_caches)
My experiments have shown that a flat file wins against the database here. Unfortunately, I still need to do "standard" CRUD operations on this table. But this is the use pattern that's killing me.
So what's the best way I can have MySQL behave like itself most of the time, yet win over a flat file in the above scenario?
EDIT:
To clarify some points:

1. I have dozens of such tables, some of which will have hundreds of millions of rows, and I cannot store them all in RAM.
2. The case I have described is what I need to support. The values associated with an ID might change, and the selection of IDs is ad hoc, so there is no way to pre-generate and cache any sums. I need to do the work of "find each value and sum them all" every time.

Thanks.
Your numbers assume that MySQL will perform disk I/O 100% of the time, while in practice that is rarely the case. If your MySQL server has enough RAM and your table is indexed appropriately, your cache hit rate will rapidly approach 100%, and MySQL will perform very little disk I/O as a direct result of your sum operation. If you frequently have to deal with calculations across 10,000,000 rows, you may also consider adjusting your schema to reflect real-world usage (keeping a "cached" sum on hand isn't always a bad idea, depending on your specific needs).
I highly recommend you put together a test database, throw in tens of millions of test rows, and run some real queries in MySQL to determine how the system will perform. Spending 15 minutes doing this will give you far more accurate information.
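A minimal sketch for populating such a test table, assuming the `ivi` schema from the question (each repetition of the self-insert doubles the row count, so roughly 25 repetitions reach 32 million rows):

INSERT INTO ivi (val) VALUES (0.5);
-- repeat this statement until the table reaches the desired size
INSERT INTO ivi (val) SELECT id / 2.0 FROM ivi;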
Telling MySQL to ignore the primary (and only) index speeds both queries up.
For InnoDB it saves about a second per query. On MyISAM it keeps the query time consistently at the minimum time seen.
The change is to add

ignore index(`PRIMARY`)

after the table name in the query.
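For example, the experimental query from above becomes the following; skipping the index pushes the query toward a sequential table scan, the same access pattern that makes the flat file fast:

select sum(val) from ivi ignore index(`PRIMARY`) where id not in (1,12,121,1121);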
EDIT:
I appreciate all the input but much of it was of the form "you shouldn't do this", "do something completely different", etc. None of it addressed the question at hand:
"So what's the best way I can have MySQL behave like itself most of the time, yet win over a flat file in the above scenario?"
So far, the solution I have posted: use MyISAM and ignore the index, seems to be closest to flat file performance for this use case, while still giving me a database when I need the database.
I'd use a summary table maintained by triggers, which gives sub-second performance - something like the following:
select
st.tot - v.val
from
ivi_sum_total st
join
(
select sum(val) as val from ivi where id in (1,12,121,1121)
) v;
+---------------------+
| st.tot - v.val |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (0.07 sec)
Full schema
drop table if exists ivi_sum_total;
create table ivi_sum_total
(
tot decimal(65,5) default 0
)
engine=innodb;
drop table if exists ivi;
create table ivi
(
id int unsigned not null auto_increment,
val decimal(65,5) default 0,
primary key (id, val)
)
engine=innodb;
delimiter #
create trigger ivi_before_ins_trig before insert on ivi
for each row
begin
update ivi_sum_total set tot = tot + new.val;
end#
create trigger ivi_before_upd_trig before update on ivi
for each row
begin
update ivi_sum_total set tot = (tot - old.val) + new.val;
end#
-- etc...
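The elided part would presumably include a matching delete trigger; a sketch under that assumption:

create trigger ivi_before_del_trig before delete on ivi
for each row
begin
update ivi_sum_total set tot = tot - old.val;
end#

delimiter ;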
Testing
select count(*) from ivi;
+----------+
| count(*) |
+----------+
| 32000000 |
+----------+
select
st.tot - v.val
from
ivi_sum_total st
join
(
select sum(val) as val from ivi where id in (1,12,121,1121)
) v;
+---------------------+
| st.tot - v.val |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (0.07 sec)
select sum(val) from ivi where id not in (1,12,121,1121);
+---------------------+
| sum(val) |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (29.89 sec)
select * from ivi_sum_total;
+---------------------+
| tot |
+---------------------+
| 1048317683047.43227 |
+---------------------+
1 row in set (0.03 sec)
select * from ivi where id = 2;
+----+-------------+
| id | val |
+----+-------------+
| 2 | 11781.30443 |
+----+-------------+
1 row in set (0.01 sec)
start transaction;
update ivi set val = 0 where id = 2;
commit;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
select * from ivi where id = 2;
+----+---------+
| id | val |
+----+---------+
| 2 | 0.00000 |
+----+---------+
1 row in set (0.00 sec)
select * from ivi_sum_total;
+---------------------+
| tot |
+---------------------+
| 1048317671266.12784 |
+---------------------+
1 row in set (0.00 sec)
select
st.tot - v.val
from
ivi_sum_total st
join
(
select sum(val) as val from ivi where id in (1,12,121,1121)
) v;
+---------------------+
| st.tot - v.val |
+---------------------+
| 1048317626939.47621 |
+---------------------+
1 row in set (0.01 sec)
select sum(val) from ivi where id not in (1,12,121,1121);
+---------------------+
| sum(val) |
+---------------------+
| 1048317626939.47621 |
+---------------------+
1 row in set (31.07 sec)
You are comparing apples and oranges, as far as I can see. MySQL (or any other relational database) is not meant to work with data that requires disk I/O all the time; that destroys the point of the index. Even worse, the index becomes a burden, since it does not fit in RAM at all. That's why people use sharding / summary tables. In your example, the size of the database (and so the disk I/O) will be much larger than the flat file, since there is a primary index on top of the data itself. As z5h stated, ignoring the primary index can save you some time, but it will never be as fast as a plain flat file.
I would suggest you use summary tables: have a background job do the summarizing, and UNION the summary table with the rest of the "live" table. But even MySQL does not handle rapidly growing data well; after some hundreds of millions of rows it starts to fail. That's why people are working on distributed systems like HDFS and map/reduce frameworks like Hadoop.
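A conceptual sketch of that pattern (the `ivi_summary` and `ivi_live` table names are hypothetical):

-- the background job periodically folds ivi_live rows into ivi_summary;
-- ad-hoc queries then combine both tables
select sum(val) from
(
select val from ivi_summary where id not in (1,12,121,1121)
union all
select val from ivi_live where id not in (1,12,121,1121)
) t;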
P.S.: My technical examples are not 100% right; I just want to go through the concepts.
Is it a single-user system?
The performance of a flat file will degrade significantly with multiple users. With a DB, it "should" schedule disk reads to satisfy queries running in parallel.
There is one option nobody has considered yet...

Since the aforementioned Java code uses a HashSet, why not use a HASH index?

By default, indexes in MyISAM tables use BTREE indexing.
By default, indexes in MEMORY tables use HASH indexing.
Simply force the MyISAM table to use a HASH index instead of a BTREE:
CREATE TABLE `ivi`
(
`id` int(11) NOT NULL AUTO_INCREMENT,
`val` double DEFAULT NULL,
PRIMARY KEY (`id`) USING HASH
) ENGINE=MyISAM;
Now that should level the playing field a little. However, index range searching has poor performance when using a hash index. If you retrieve one id at a time, it should be faster than your previous testing in MyISAM.
If you want to load the data much faster:
- Get rid of the AUTO_INCREMENT property
- Get rid of the primary key
- Use a regular index
CREATE TABLE `ivi`
(
`id` int(11) NOT NULL,
`val` double DEFAULT NULL,
KEY id (`id`) USING HASH
) ENGINE=MyISAM;
Then do something like this:
ALTER TABLE ivi DISABLE KEYS;
...
... (Load data and manually generate id)
...
ALTER TABLE ivi ENABLE KEYS;
This will build the index after the data is done loading.
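A sketch of that bulk load, assuming the key/value pairs have first been exported to a CSV file (the file path and format are assumptions):

ALTER TABLE ivi DISABLE KEYS;
-- bulk-load id,val pairs in one sequential pass
LOAD DATA INFILE '/tmp/ivi.csv'
INTO TABLE ivi
FIELDS TERMINATED BY ','
(id, val);
ALTER TABLE ivi ENABLE KEYS;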
You should also consider sizing the key_buffer_size in /etc/my.cnf to handle large numbers of MyISAM keys.
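For example (the 1 GB value is purely illustrative; size it to your MyISAM index data and available RAM):

-- check the current setting
SHOW VARIABLES LIKE 'key_buffer_size';
-- takes effect immediately; persist the setting in /etc/my.cnf as well
SET GLOBAL key_buffer_size = 1073741824;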
Give it a try and let us know if this helped and what you found!
You might want to have a look at NDBAPI. I imagine these people were able to achieve speeds close to those of working with a flat file while still having the data stored in InnoDB.