How can I make MySQL as fast as a flat file in this scenario?
- Assume a key-value table with at least 10s of millions of rows.
- Define an operation that takes a large number of IDs (again, tens of millions), finds the corresponding values, and sums them.
Using a database, this operation seems like it can approach (disk seek time) * (number of lookups).
Using a flat file and reading through the entire contents, this operation will approach (file size) / (drive transfer rate).
Plugging in some rough values (from Wikipedia and/or experimentation):

- seek time = 0.5 ms
- transfer rate = 64 MB/s
- file size = 800 MB (for 70 million int/double key/value pairs)
- value lookups = 65 million

DB time = 0.5 ms * 65,000,000 = 32,500 s ≈ 9 hours
Flat file time = 800 MB / (64 MB/s) = 12.5 s
Experimental results are not as bad for MySQL, but the flat file still wins.
Experiments:
Create InnoDB and MyISAM id/value pair tables, e.g.:

CREATE TABLE `ivi` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `val` double DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

Fill with 32 million rows of data of your choice. Query with:

select sum(val) from ivi where id not in (1,12,121,1121); -- be sure to change the numbers each time, or clear the query cache
Use the following code to create and read the key/value flat file from Java:

import java.io.*;
import java.util.Arrays;
import java.util.HashSet;

private static void writeData() throws IOException {
    // negative start time; adding the end time below yields elapsed ms
    long t = -System.currentTimeMillis();
    File dat = new File("/home/mark/dat2");
    if (dat.exists()) {
        dat.delete();
    }
    FileOutputStream fos = new FileOutputStream(dat);
    ObjectOutputStream os = new ObjectOutputStream(new BufferedOutputStream(fos));
    // write 32 million int/double key/value pairs
    for (int i = 0; i < 32000000; i++) {
        os.writeInt(i);
        os.writeDouble(i / 2.0);
    }
    os.flush();
    os.close();
    t += System.currentTimeMillis();
    System.out.println("time ms = " + t);
}

private static void performSummationQuery() throws IOException {
    long t = -System.currentTimeMillis();
    File dat = new File("/home/mark/dat2");
    FileInputStream fin = new FileInputStream(dat);
    ObjectInputStream in = new ObjectInputStream(new BufferedInputStream(fin));
    // IDs to exclude from the sum
    HashSet<Integer> set = new HashSet<Integer>(Arrays.asList(11, 101, 1001, 10001, 100001));
    int i;
    double d;
    double sum = 0;
    try {
        // scan the whole file sequentially, summing every non-excluded value
        while (true) {
            i = in.readInt();
            d = in.readDouble();
            if (!set.contains(i)) {
                sum += d;
            }
        }
    } catch (EOFException e) {
        // end of file reached; fall through with the completed sum
    }
    in.close();
    System.out.println("sum = " + sum);
    t += System.currentTimeMillis();
    System.out.println("time ms = " + t);
}
RESULTS:
- InnoDB: 8.0-8.1 s
- MyISAM: 3.1-16.5 s
- Stored proc: 80-90 s
- Flat file: 1.6-2.4 s (even after: echo 3 > /proc/sys/vm/drop_caches)
My experiments have shown that a flat file wins against the database here. Unfortunately, I still need to do "standard" CRUD operations on this table. But this is the use pattern that's killing me.
So what's the best way I can have MySQL behave like itself most of the time, yet win over a flat file in the above scenario?
EDIT:
To clarify some points:

1. I have dozens of such tables, some of which will have hundreds of millions of rows, and I cannot store them all in RAM.
2. The case I have described is what I need to support. The values associated with an ID might change, and the selection of IDs is ad hoc, so there is no way to pre-generate and cache any sums. I need to do the work of "find each value and sum them all" every time.

Thanks.
Your numbers assume that MySQL will perform disk I/O 100% of the time, while in practice that is rarely the case. If your MySQL server has enough RAM and your table is indexed appropriately, your cache hit rate will rapidly approach 100%, and MySQL will perform very little disk I/O as a direct result of your sum operation. If you frequently have to deal with calculations across 10,000,000 rows, you may also consider adjusting your schema to reflect real-world usage (keeping a "cached" sum on hand isn't always a bad idea, depending on your specific needs).
I highly recommend you put together a test database, throw in tens of millions of test rows, and run some real queries in MySQL to determine how the system will perform. Spending 15 minutes doing this will give you far more accurate information.
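A minimal sketch for populating such a test table, assuming the `ivi` schema from the question (each repetition of the self-insert doubles the row count, so roughly 25 repetitions reach 32 million rows):

INSERT INTO ivi (val) VALUES (0.5);
-- repeat this statement until the table reaches the desired size
INSERT INTO ivi (val) SELECT id / 2.0 FROM ivi;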
Telling MySQL to ignore the primary (and only) index speeds both queries up.
For InnoDB it saves about a second per query. On MyISAM it keeps the query time consistently at the minimum time seen.
The change is to add

ignore index(`PRIMARY`)

after the table name in the query.
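For example, the experimental query from above becomes the following; skipping the index pushes the query toward a sequential table scan, the same access pattern that makes the flat file fast:

select sum(val) from ivi ignore index(`PRIMARY`) where id not in (1,12,121,1121);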
EDIT:
I appreciate all the input but much of it was of the form "you shouldn't do this", "do something completely different", etc. None of it addressed the question at hand:
"So what's the best way I can have MySQL behave like itself most of the time, yet win over a flat file in the above scenario?"
So far, the solution I have posted: use MyISAM and ignore the index, seems to be closest to flat file performance for this use case, while still giving me a database when I need the database.
I'd use a summary table maintained by triggers, which gives sub-second performance - something like the following:
select
st.tot - v.val
from
ivi_sum_total st
join
(
select sum(val) as val from ivi where id in (1,12,121,1121)
) v;
+---------------------+
| st.tot - v.val |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (0.07 sec)
Full schema
drop table if exists ivi_sum_total;
create table ivi_sum_total
(
tot decimal(65,5) default 0
)
engine=innodb;
drop table if exists ivi;
create table ivi
(
id int unsigned not null auto_increment,
val decimal(65,5) default 0,
primary key (id, val)
)
engine=innodb;
delimiter #
create trigger ivi_before_ins_trig before insert on ivi
for each row
begin
update ivi_sum_total set tot = tot + new.val;
end#
create trigger ivi_before_upd_trig before update on ivi
for each row
begin
update ivi_sum_total set tot = (tot - old.val) + new.val;
end#
-- etc...
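The elided part would presumably include a matching delete trigger; a sketch under that assumption:

create trigger ivi_before_del_trig before delete on ivi
for each row
begin
update ivi_sum_total set tot = tot - old.val;
end#

delimiter ;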
Testing
select count(*) from ivi;
+----------+
| count(*) |
+----------+
| 32000000 |
+----------+
select
st.tot - v.val
from
ivi_sum_total st
join
(
select sum(val) as val from ivi where id in (1,12,121,1121)
) v;
+---------------------+
| st.tot - v.val |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (0.07 sec)
select sum(val) from ivi where id not in (1,12,121,1121);
+---------------------+
| sum(val) |
+---------------------+
| 1048317638720.78064 |
+---------------------+
1 row in set (29.89 sec)
select * from ivi_sum_total;
+---------------------+
| tot |
+---------------------+
| 1048317683047.43227 |
+---------------------+
1 row in set (0.03 sec)
select * from ivi where id = 2;
+----+-------------+
| id | val |
+----+-------------+
| 2 | 11781.30443 |
+----+-------------+
1 row in set (0.01 sec)
start transaction;
update ivi set val = 0 where id = 2;
commit;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1 Changed: 1 Warnings: 0
select * from ivi where id = 2;
+----+---------+
| id | val |
+----+---------+
| 2 | 0.00000 |
+----+---------+
1 row in set (0.00 sec)
select * from ivi_sum_total;
+---------------------+
| tot |
+---------------------+
| 1048317671266.12784 |
+---------------------+
1 row in set (0.00 sec)
select
st.tot - v.val
from
ivi_sum_total st
join
(
select sum(val) as val from ivi where id in (1,12,121,1121)
) v;
+---------------------+
| st.tot - v.val |
+---------------------+
| 1048317626939.47621 |
+---------------------+
1 row in set (0.01 sec)
select sum(val) from ivi where id not in (1,12,121,1121);
+---------------------+
| sum(val) |
+---------------------+
| 1048317626939.47621 |
+---------------------+
1 row in set (31.07 sec)
You are comparing apples and oranges, as far as I can see. MySQL (or any other relational database) is not meant to work with data that requires disk I/O all the time; that destroys the point of the index. Even worse, the index becomes a burden, since it does not fit in RAM at all. That's why people use sharding / summary tables. In your example, the size of the database (and so the disk I/O) will be much larger than the flat file, since there is a primary index on top of the data itself. As z5h stated, ignoring the primary index can save you some time, but it will never be as fast as a plain flat file.
I would suggest you use summary tables: have a background job do the summarizing, and UNION the summary table with the rest of the "live" table. But even MySQL does not handle rapidly growing data well; after some hundreds of millions of rows it starts to fail. That's why people are working on distributed systems like HDFS and map/reduce frameworks like Hadoop.
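A conceptual sketch of that pattern (the `ivi_summary` and `ivi_live` table names are hypothetical):

-- the background job periodically folds ivi_live rows into ivi_summary;
-- ad-hoc queries then combine both tables
select sum(val) from
(
select val from ivi_summary where id not in (1,12,121,1121)
union all
select val from ivi_live where id not in (1,12,121,1121)
) t;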
P.S.: My technical examples are not 100% right; I just want to go through the concepts.
Is it a single-user system?
The performance of a flat file will degrade significantly with multiple users. With a DB, it "should" schedule disk reads to satisfy queries running in parallel.
There is one option nobody has considered yet...

Since the aforementioned Java code uses a HashSet, why not use a HASH index?

By default, indexes in MyISAM tables use BTREE indexing.
By default, indexes in MEMORY tables use HASH indexing.
Simply force the MyISAM table to use a HASH index instead of a BTREE:
CREATE TABLE `ivi`
(
`id` int(11) NOT NULL AUTO_INCREMENT,
`val` double DEFAULT NULL,
PRIMARY KEY (`id`) USING HASH
) ENGINE=MyISAM;
Now that should level the playing field a little. However, index range searching has poor performance when using a hash index. If you retrieve one id at a time, it should be faster than your previous testing in MyISAM.
If you want to load the data much faster:
- Get rid of the AUTO_INCREMENT property
- Get rid of the primary key
- Use a regular index
CREATE TABLE `ivi`
(
`id` int(11) NOT NULL,
`val` double DEFAULT NULL,
KEY id (`id`) USING HASH
) ENGINE=MyISAM;
Then do something like this:
ALTER TABLE ivi DISABLE KEYS;
...
... (Load data and manually generate id)
...
ALTER TABLE ivi ENABLE KEYS;
This will build the index after the data is done loading.
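A sketch of that bulk load, assuming the key/value pairs have first been exported to a CSV file (the file path and format are assumptions):

ALTER TABLE ivi DISABLE KEYS;
-- bulk-load id,val pairs in one sequential pass
LOAD DATA INFILE '/tmp/ivi.csv'
INTO TABLE ivi
FIELDS TERMINATED BY ','
(id, val);
ALTER TABLE ivi ENABLE KEYS;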
You should also consider sizing the key_buffer_size in /etc/my.cnf to handle large numbers of MyISAM keys.
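For example (the 1 GB value is purely illustrative; size it to your MyISAM index data and available RAM):

-- check the current setting
SHOW VARIABLES LIKE 'key_buffer_size';
-- takes effect immediately; persist the setting in /etc/my.cnf as well
SET GLOBAL key_buffer_size = 1073741824;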
Give it a try and let us know if this helped and what you found!
You might want to have a look at NDBAPI. I imagine these people were able to achieve speeds close to those of working with a flat file while still having the data stored in InnoDB.