SQLite import of a tab file: does .import do one insert per row, or group them in a transaction?

I am importing millions of rows from a tab-separated file, and SQLite's .import (with .mode tabs) is very slow. I have three indexes, so the slowness probably comes from the indexing. But first I would like to check whether .import adds the rows by grouping many (or all) of them into a single commit. I was unable to find documentation on how .import works. Does anyone know? If the indexes are the problem (I had that issue before with MySQL), how can I disable them and rebuild them at the end of the .import?

[Update 1]

Following @sixfeetsix comment.

My schema is:

CREATE TABLE ensembl_vf_b36 (
        variation_name  varchar(20),
        chr     varchar(4),
        start   integer,
        end     integer,
        strand  varchar(5),
        allele_string    varchar(3),
        map_weight      varchar(2),
        flags           varchar(50),
        validation_status       varchar(100),
        consequence_type        varchar(50)
);
CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);

data:

rs35701516      NT_113875       352     352     1       G/A     2       NULL    NULL    INTERGENIC
rs12090193      NT_113875       566     566     1       G/A     2       NULL    NULL    INTERGENIC
rs35448845      NT_113875       758     758     1       A/C     2       NULL    NULL    INTERGENIC
rs17274850      NT_113875       1758    1758    1       G/A     2       genotyped       cluster,freq    INTERGENIC

There are 15,608,032 entries in this table.

And these are the stats:

 $  time sqlite3 -separator '   ' test_import.db '.import variations_build_36_ens-54.tab ensembl_vf_b36'

real    29m27.643s
user    4m14.176s
sys     0m15.204s

[Update 2]

@sixfeetsix has a good answer, and if you are reading this you might also be interested in:

Faster bulk inserts in sqlite3?

Sqlite3: Disabling primary key index while inserting?

[Update 3] Solution: from 30 min -> 4 min

Even with all the optimisations (see the accepted answer) it still takes almost 30 minutes, but if the indexes are not created during the import and are instead added at the end, the total time is 4 minutes:

-- importing without indexes:
real    2m22.274s
user    1m38.836s
sys     0m4.850s

-- adding indexes
$ time sqlite3 ensembl-test-b36.db < add_indexes-b36.sql

real    2m18.344s
user    1m26.264s
sys     0m6.422s
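
(add_indexes-b36.sql is not reproduced above; it is essentially just the CREATE INDEX statements from the schema, run once the import has finished, roughly:)

-- rebuild the indexes only after the bulk .import has finished
CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);
-- ...plus the other two indexes on this table (their definitions are not shown above)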


I believe the slowness indeed comes from building the index as more and more records are added. Depending on the RAM you have, you can tell SQLite to use enough memory so that all this index-building activity is done in memory (i.e. without all the I/O that would otherwise happen with less memory).

For 15M records, I'd say you should set the cache size to 500000 (the value is in pages, so with a 1 KB page size that is roughly 500 MB of cache).

You can also tell sqlite to keep its transaction journal in memory.

Finally, you can set synchronous to OFF so sqlite never waits for writes to be committed to disk.

Using this, I was able to cut the time it takes to import 15M records by a factor of about 5 (from 14 minutes down to 2.5), with records made of random GUIDs split into 5 columns, using the three middle columns as an index:

b40c1c2f    912c    46c7    b7a0    3a7d8da724c1
9c1cdf2e    e2bc    4c60    b29d    e0a390abfd26
b9691a9b    b0db    4f33    a066    43cb4f7cf873
01a360aa    9e2e    4643    ba1f    2aae3fd013a6
f1391f8b    f32c    45f0    b137    b99e6c299528

So to try this I suggest you put all the instructions in some file, say import_test:

pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
.mode tabs
.import variations_build_36_ens-54.tab ensembl_vf_b36

Then try it:

time sqlite3 test_import.db < import_test

EDIT

This is a reply to Pablo's (the OP's) comments following this answer (it's too long to fit as a comment). My (educated) guesses are that:

  1. Because .import is not SQL per se, it doesn't have much to do with transactions; I'm even inclined to think it is written to go faster than if you had done all of this in one "normal" transaction (a rough sketch of that alternative appears below); and,
  2. If you have enough memory to allocate, and you set up your environment as I suggest, the real (time) hog here is reading the flat file and then writing the final content of the database, because what happens in between happens extremely fast; i.e. fast enough that there isn't much time to gain by optimizing it, compared with the (probably) incompressible time spent on disk I/O.
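
For illustration, the "one normal transaction" route mentioned in point 1 would look roughly like this in plain SQL (just a sketch, reusing a row from the sample data in the question):

BEGIN TRANSACTION;
INSERT INTO ensembl_vf_b36 VALUES
    ('rs35701516', 'NT_113875', 352, 352, '1', 'G/A', '2', NULL, NULL, 'INTERGENIC');
-- ... one INSERT per row, repeated for the remaining ~15M rows ...
COMMIT;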

If I'm wrong though I'd be glad to hear why for my own benefit.

EDIT 2

I did a comparison test between having the index in place during .import, and having it added immediately after .import finishes. I used the same technique of generating 15M records made of split random UUIDs:

import csv, uuid
# Python 2: write 15M rows, each a random UUID split on '-' into 5 tab-separated columns
w = csv.writer(open('bla.tab', 'wb'), dialect='excel-tab')
for i in xrange(15000000):
    w.writerow(str(uuid.uuid4()).split('-'))

Then I tested importing with the index created before the import and with it created immediately after (the script shown here creates the index before; the index-after variant is sketched below it):

pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
CREATE INDEX test_idx on test (f2, f3, f4);
.mode tabs
.import bla.tab test
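
(For the "index added after" run, the same script is used with the CREATE INDEX statement moved to the end; that variant isn't reproduced in full here:)

pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
.mode tabs
.import bla.tab test
-- same pragmas and table; the index is now built after the bulk import
CREATE INDEX test_idx on test (f2, f3, f4);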

So here is the time when adding the index before:

[someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test 
memory

real   2m58.839s
user   2m21.411s
sys    0m6.086s

And when the index is added after:

[someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test 
memory

real   2m19.261s
user   2m12.531s
sys    0m4.403s

You see how the difference in "user" time (about 9 s: 2m21.4s vs 2m12.5s) doesn't account for the full difference in "real" time (about 40 s: 2m58.8s vs 2m19.3s)? To me this means that some extra I/O happens when the index is created before the import, so I was wrong to think that everything was being done in memory with no extra I/O.

Conclusion: create the index after and you'll have even better import times (just as Donal mentioned).
