SQLite import of a tab file: does .import do one insert per row, or group them into a single transaction?
I am importing millions of rows from a tab file, and SQLite .import with .mode tabs is very slow. I have three indexes, so the slowness probably comes from the indexing. But first I would like to check whether .import adds the rows by grouping many/all of them into a single commit. I was unable to find documentation on how .import works. Does anyone know? If the indexing is the problem (I had that issue before with MySQL), how can I disable it and reindex at the end of the .import?
[Update 1]
Following @sixfeetsix's comment.
My schema is:
CREATE TABLE ensembl_vf_b36 (
variation_name varchar(20),
chr varchar(4),
start integer,
end integer,
strand varchar(5),
allele_string varchar(3),
map_weight varchar(2),
flags varchar(50),
validation_status varchar(100),
consequence_type varchar(50)
);
CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);
data:
rs35701516 NT_113875 352 352 1 G/A 2 NULL NULL INTERGENIC
rs12090193 NT_113875 566 566 1 G/A 2 NULL NULL INTERGENIC
rs35448845 NT_113875 758 758 1 A/C 2 NULL NULL INTERGENIC
rs17274850 NT_113875 1758 1758 1 G/A 2 genotyped cluster,freq INTERGENIC
There are 15,608,032 entries in this table.
And these are the stats:
$ time sqlite3 -separator ' ' test_import.db '.import variations_build_36_ens-54.tab ensembl_vf_b36'
real 29m27.643s
user 4m14.176s
sys 0m15.204s
[Update 2]
@sixfeetsix has a good answer, and if you are reading this you might also be interested in:
Faster bulk inserts in sqlite3?
Sqlite3: Disabling primary key index while inserting?
[Update 3] Solution: from 30 min -> 4 min
Even with all the optimisations (see the accepted answer) the import still takes almost 30 minutes, but if the indexes are left out during the import and added at the end, the total time is about 4 minutes (the two steps are sketched after the timings below):
-- importing without indexes:
real 2m22.274s
user 1m38.836s
sys 0m4.850s
-- adding indexes
$ time sqlite3 ensembl-test-b36.db < add_indexes-b36.sql
real 2m18.344s
user 1m26.264s
sys 0m6.422s
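For completeness, the two steps look roughly like this. The no-index import script (the file name is mine; it assumes ensembl_vf_b36 was already created as in Update 1):
pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
.mode tabs
.import variations_build_36_ens-54.tab ensembl_vf_b36
and add_indexes-b36.sql, which just rebuilds the index from the schema above once all the rows are loaded:
CREATE INDEX pos_vf_b36_idx on ensembl_vf_b36 (chr, start, end);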
I believe the slowness indeed comes from building the index as more and more records are added. Depending on how much RAM you have, you can tell sqlite to use enough memory so that all this index-building activity is done in memory (i.e., without all the I/O that would otherwise happen with less memory).
For 15M records, I'd say you should set your cache size at 500000.
You can also tell sqlite to keep its transaction journal in memory.
Finally, you can set synchronous to OFF so sqlite never waits for writes to be committed to disk.
Using this I was able to divide the time it takes to import 15M records by 5 (14 minutes down to 2.5), with records made of random GUIDs split into 5 columns, using the three middle columns as an index:
b40c1c2f 912c 46c7 b7a0 3a7d8da724c1
9c1cdf2e e2bc 4c60 b29d e0a390abfd26
b9691a9b b0db 4f33 a066 43cb4f7cf873
01a360aa 9e2e 4643 ba1f 2aae3fd013a6
f1391f8b f32c 45f0 b137 b99e6c299528
So to try this I suggest you put all the instructions in some file, say import_test:
pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
.mode tabs
.import variations_build_36_ens-54.tab ensembl_vf_b36
Then try it:
time sqlite3 test_import.db < import_test
EDIT
This is a reply to Pablo's (the OP's) comments following this answer (it's too long to fit as a comment). My (educated) guesses are that:
- Because .import is not SQL per se, it doesn't have much to do with transactions; I'm even inclined to think it is written to go faster than if you had all of this done in one "normal" transaction; and,
- If you have enough memory to allocate, and you set up your environment as I suggest, the real (time) hog here is reading the flat file and then writing the final content of the database, because what happens in between happens extremely fast; i.e., fast enough that there isn't much time to be gained by optimizing it when you compare such potential gains with the (probably) incompressible time spent on disk I/O.
If I'm wrong though I'd be glad to hear why for my own benefit.
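To illustrate the first point: the hand-rolled SQL alternative the question alludes to is batching the INSERTs yourself inside one explicit transaction, so you don't pay an implicit commit per row. A rough sketch using the first sample row from the question (not what .import literally does internally):
BEGIN TRANSACTION;
INSERT INTO ensembl_vf_b36 VALUES ('rs35701516', 'NT_113875', 352, 352, '1', 'G/A', '2', 'NULL', 'NULL', 'INTERGENIC');
-- ... millions more INSERTs ...
COMMIT;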
EDIT 2
I did a comparison test between having the index in place during .import, and having it added immediately after .import finishes. I used the same technique of generating 15M records made of split random UUIDs:
import csv, uuid
w = csv.writer(open('bla.tab', 'wb'), dialect='excel-tab')
for i in xrange(15000000):
    w.writerow(str(uuid.uuid4()).split('-'))
Then I tested importing with the index created before the import versus created after it (the script below creates the index before the .import; the after variant is sketched following it):
pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
CREATE INDEX test_idx on test (f2, f3, f4);
.mode tabs
.import bla.tab test
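The index-after variant is the same script with the CREATE INDEX statement moved below the .import, roughly:
pragma journal_mode=memory;
pragma synchronous=0;
pragma cache_size=500000;
create table test (f1 text, f2 text, f3 text, f4 text, f5 text);
.mode tabs
.import bla.tab test
CREATE INDEX test_idx on test (f2, f3, f4);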
So here is the time when adding the index before:
[someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test
memory
real 2m58.839s
user 2m21.411s
sys 0m6.086s
And when the index is added after:
[someone@somewhere ~]$ time sqlite3 test_speed.sqlite < import_test
memory
real 2m19.261s
user 2m12.531s
sys 0m4.403s
You see how the difference in "user" time (~9s) doesn't account for the full difference in real time (~40s)? To me this means that there is some extra I/O happening when the index is created before, and so I was wrong to think that it was all being done in memory with no extra I/O.
Conclusion: create the index after and you'll have even better import times (just as Donal mentioned).