Faster Insertion of Records into a Table with SQLAlchemy
I am parsing a log and inserting it into either MySQL or SQLite using SQLAlchemy and Python. Right now I open a connection to the DB and, as I loop over each line, insert it after it is parsed (this is just one big table right now; I'm not very experienced with SQL). I then close the connection when the loop is done. The summarized code is:
    log_table = schema.Table('log_table', metadata,
                             schema.Column('id', types.Integer, primary_key=True),
                             schema.Column('time', types.DateTime),
                             schema.Column('ip', types.String(length=15))
                             ....
    engine = create_engine(...)
    metadata.bind = engine
    connection = engine.connect()
    ....
    for line in file_to_parse:
        m = line_regex.match(line)
        if m:
            fields = m.groupdict()
            pythonified = pythoninfy_log(fields)  # Turn them into ints, datetimes, etc.
            if use_sql:
                ins = log_table.insert(values=pythonified)
                connection.execute(ins)
            parsed += 1
My two questions are:
- Is there a way to speed up the inserts within this basic framework? Maybe have a Queue of inserts and some insertion threads, some sort of bulk inserts, etc?
- When I used MySQL, the insert time for ~1.2 million records was 15 minutes. With SQLite, the insert time was a little over an hour. Does that time difference between the DB engines seem about right, or does it mean I am doing something very wrong?
The big thing you should try is putting a transaction around multiple inserts, since it is committing the data to disk that really takes a long time. You'll need to decide on the batching level, but a crude first attempt would be to wrap a transaction around the whole lot.
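A minimal sketch of that idea with SQLAlchemy, reusing the names from the question and wrapping the whole file in a single transaction (a per-batch commit would look similar):

    trans = connection.begin()
    try:
        for line in file_to_parse:
            m = line_regex.match(line)
            if m:
                fields = m.groupdict()
                pythonified = pythoninfy_log(fields)
                if use_sql:
                    connection.execute(log_table.insert(values=pythonified))
                parsed += 1
        trans.commit()  # one commit for the whole file instead of one per row
    except:
        trans.rollback()
        raise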
Without knowing the table engine (MyISAM? InnoDB?), schema, and indexes, it's hard to comment on specifics between the two databases you're using there.
However, when using MySQL like this, you will likely find that it is far faster to write your data out to a temporary text file and then use the LOAD DATA INFILE syntax to load it all into your database. It looks like you can call the execute method on your connection object to run the SQL necessary to do this.
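A rough sketch of that approach (the temporary file path, the parsed_rows name, and the column list are all illustrative, and LOCAL may or may not be needed depending on whether the file is visible to the MySQL server):

    import csv

    # Write the parsed rows out as a tab-separated file first
    with open('/tmp/parsed_log.tsv', 'w') as f:
        writer = csv.writer(f, delimiter='\t')
        for row in parsed_rows:  # hypothetical list of already-pythonified dicts
            writer.writerow([row['time'], row['ip']])

    # Then load the whole file in one statement
    connection.execute(
        "LOAD DATA LOCAL INFILE '/tmp/parsed_log.tsv' "
        "INTO TABLE log_table "
        "FIELDS TERMINATED BY '\\t' "
        "(time, ip)")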
Further, if you are dead set on adding things row by row, and you're recreating the table every time, you can verify key constraints in your program and add those constraints only after all rows have been inserted, saving the DB the time of doing constraint checks on every insert.
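If you go that route, deferring a secondary index with SQLAlchemy might look roughly like this (the index name and column are only illustrative):

    from sqlalchemy import Index

    # Define log_table without the secondary index, bulk-load the rows, then:
    Index('ix_log_table_ip', log_table.c.ip).create(bind=engine)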
I did the following to achieve some batching:
    inserts = []
    insert_every = 1000
    for line in file_to_parse:
        m = line_regex.match(line)
        if m:
            fields = m.groupdict()
            pythonified = pythoninfy_log(fields)  # Turn them into ints, datetimes, etc.
            if use_sql:  # This uses globals, ick :-/
                inserts.append(pythonified)
                if (parsed % insert_every) == 0:
                    # Insert the accumulated batch in one executemany call
                    connection.execute(log_table.insert(), inserts)
                    inserts = []
            parsed += 1
    if use_sql:
        if len(inserts) > 0:
            # Flush any rows left over after the loop
            connection.execute(log_table.insert(), inserts)
This doesn't use transactions, but in a very lazy manner it allowed me to cut the insert/parse stage from ~13 seconds to ~2 seconds with the MySQL backend on a smaller sample. I will now see what the difference between MySQL and SQLite is with this change using the full sample.
I found the basic information for this here.
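For completeness, the same batched executemany call could also be wrapped in an explicit transaction, as suggested in the answer above; a minimal sketch, reusing the names from the code above:

    trans = connection.begin()
    try:
        # One executemany for the accumulated batch, committed as a unit
        connection.execute(log_table.insert(), inserts)
        trans.commit()
    except:
        trans.rollback()
        raise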
Results:
    Engine    Non-Grouped Insert Time (min)    Grouped Insert Time (min)
    SQLite    61                               8
    MySQL     15                               2.5
I didn't flush the filesystem cache between the MySQL and SQLite runs, so the source text file may have been cached for the second run, but I don't think that would make a significant difference.