Bulk loading of data with libpqxx
I need to bulk load a large amount of data (about 7.000.000 entries) into a PostgreSQL database using libpqxx. I have read the documentation on how to populate the database, but I am not sure on how to use this in my case. First I cannot use files, so a COPY on the database is ou开发者_Go百科t of question. Also the database and the table I am loading into needs to be functional while I am importing.
The scenario is as follows: We get a file with all the data (including the entries already present) from another application at regular intervals (about once a month). Because of the amount of entries it is just not possible to check each entry for existence and we just do a bulk insertion of the new data (after preprocessing).
Currently to handle this, I create a new table, insert the data using the tablewriter from libpqxx (without a transaction) then in a transaction I rename the old table away and the new table to the correct place.
Also we not only need to do this for one table, but for multiple tables with different layout. Thus I tried to separate the writing of the table from the parsing of the date. Now I just need to factor out the creation of the table. For this I use a
create temporary table foo_temp (like foo including indexes including defaults including constraints );
This way I get a table similar to foo
and I do not need to actually know the layout at the place where I do the writing. However this leaves me with the problem, that this will produce a table with the indexes and the constraints and the guide above says that indexes will make the bulk insertion slow. However if I drop the indexes and constraints (or do not copy them in the first place) I need a way to recreate them afterwards in the same manor they were set for the original table.
Any good hints on how to handle this in a fast way?
EDIT:
On a related side: Playing around with the database I just noticed, that the CREATE TABLE
above will not copy any foreign key constraints, so it seems I need to manually specify these as well. Or is there a way to handle these together with all the other constraints?
I achieve more than 100000 insert per second with indexing and constraints creation using next strategy: 1. In first connection create table which inherits parent, copy data in binary mode using PQputCopyData. 2. Use several another connections for index creation. PostgreSQL create one thread per client, so for using multi-core advantage we need to use several connections.
You can send data to indexing threads inside application using something like thread-safe queue or using PostgreSQL NOTIFY.
There are a pg_index and pg_constraint tables, which references indexes and constraints. (The former requires a join with pg_class to get the full details.) See the pg_catalog for the gory details.
You can use the latter to fetch the needed index definitions and so on after the mass-inserts/updates.
You don't need external files in order to use COPY. You can copy from "STDIN". "STDIN Specifies that input comes from the client application."
For an example, use pg_dumpall
which will show you how to use/format COPY with your existing data.
You can use the PQputCopyData
to transmit the necessary bulk data to the PostgreSQL server.
http://www.postgresql.org/docs/9.0/interactive/libpq-copy.html#LIBPQ-COPY-SEND
精彩评论