Using COPY FROM stdin to load tables, reading input file only once
I've got a large (~60 million row) fixed width source file with ~1800 records per row.
I need to load this file into 5 different tables on an instance of Postgres 8.3.9.
My dilemma is that, because the file is so large, I'd like to only have to read it once.
This is straightforward enough using INSERT or COPY as normal, but I'm trying to get a load speed boost by putting my COPY FROM statements inside a transaction that includes a TRUNCATE; skipping the logging this way is supposed to give a considerable speed boost (according to http://www.cirrusql.com/node/3). As I understand it, you can disable logging outright in Postgres 9.x, but I don't have that option on 8.3.9.
The script below has me reading the input file twice, which I want to avoid... any ideas on how I could accomplish this by reading the input file only once? Doesn't have to be bash--I also tried using psycopg2, but couldn't figure out how to stream file output into the COPY statement as I'm doing below. I can't COPY FROM file because I need to parse it on the fly.
#!/bin/bash
table1="copytest1"
table2="copytest2"
#note: $1 refers to the first argument used when invoking this script
#which should be the location of the file one wishes to have python
#parse and stream out into psql to be copied into the data tables
( echo 'BEGIN;'
echo 'TRUNCATE TABLE ' ${table1} ';'
echo 'COPY ' ${table1} ' FROM STDIN'
echo "WITH NULL AS '';"
cat $1 | python2.5 ~/parse_${table1}.py
echo '\.'
echo 'TRUNCATE TABLE ' ${table2} ';'
echo 'COPY ' ${table2} ' FROM STDIN'
echo "WITH NULL AS '';"
cat $1 | python2.5 ~/parse_${table2}.py
echo '\.'
echo 'COMMIT;'
) | psql -U postgres -h chewy.somehost.com -p 5473 -d db_name
exit 0
Thanks!
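For the psycopg2 route mentioned above: recent psycopg2 versions provide cursor.copy_expert(), which reads COPY data from any file-like object with a read() method, so the parser output can be streamed straight into COPY without a temporary file. Below is a rough, untested sketch; ParsedSource, parse_table1 and the file path are illustrative placeholders rather than the actual parser, and it still makes one pass per table.

# Rough sketch, untested: stream parsed rows into COPY via psycopg2's
# copy_expert(), which accepts any object exposing read()/readline().
import psycopg2

class ParsedSource(object):
    """Parses one fixed-width line per read() call and hands it to COPY."""
    def __init__(self, path, parse_line):
        self.src = open(path, 'r')
        self.parse_line = parse_line

    def read(self, size=-1):
        line = self.src.readline()
        if not line:
            return ''                    # empty string ends the COPY stream
        return self.parse_line(line)     # one tab-separated, newline-terminated row

    readline = read                      # copy_expert() may call either method

def parse_table1(line):
    # placeholder: slice the fixed-width record into the columns of copytest1
    return '\t'.join([line[0:10].strip(), line[10:20].strip()]) + '\n'

conn = psycopg2.connect(host='chewy.somehost.com', port=5473,
                        database='db_name', user='postgres')
cur = conn.cursor()
cur.execute('TRUNCATE TABLE copytest1')
cur.copy_expert("COPY copytest1 FROM STDIN WITH NULL AS ''",
                ParsedSource('/path/to/bigfile', parse_table1))
conn.commit()   # TRUNCATE + COPY committed together, as in the bash script

To feed several tables from a single pass in Python you would still need something like one connection (or thread) per table, which is essentially what the named-pipe suggestion below does, or the staging-table approach further down.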
You could use named pipes instead of your anonymous pipe. With this approach, your Python script can feed the corresponding data to different psql processes, one per table.
Create pipes:
mkfifo fifo_table1
mkfifo fifo_table2
Run psql instances:
psql db_name < fifo_table1 &
psql db_name < fifo_table2 &
Your Python script would look roughly like this (pseudocode):
SQL_BEGIN = """
BEGIN;
TRUNCATE TABLE %s;
COPY %s FROM STDIN WITH NULL AS '';
"""

fifo1 = open('fifo_table1', 'w')
fifo2 = open('fifo_table2', 'w')
bigfile = open('mybigfile', 'r')

print >> fifo1, SQL_BEGIN % ('table1', 'table1')  # ugly; with Python 2.6 you could use the .format() syntax
print >> fifo2, SQL_BEGIN % ('table2', 'table2')

for line in bigfile:
    # your code, which parses the line and decides which table it belongs to
    if belongs_to_table1:          # placeholder for your routing condition
        print >> fifo1, data       # data = the parsed, tab-separated row
    else:
        print >> fifo2, data

print >> fifo1, '\\.'              # terminate the COPY data stream
print >> fifo2, '\\.'
print >> fifo1, 'COMMIT;'
print >> fifo2, 'COMMIT;'
fifo1.close()
fifo2.close()
Maybe this is not the most elegant solution, but it should work.
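One practical note on the named pipes: opening a FIFO for writing blocks until some process has it open for reading, so the two psql instances need to be started before the Python script opens fifo_table1 and fifo_table2, and each COPY data stream has to end with a line containing only \. before the COMMIT is sent.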
Why use COPY for the second table? I would assume that doing a:
INSERT INTO table2 (...) SELECT ... FROM table1;
would be faster than using COPY.
Edit
If you need to import different rows into different tables but from the same source file, maybe inserting everything into a staging table and then inserting the rows from there into the target tables is faster:
Import the whole text file into one staging table:
COPY staging_table FROM STDIN ...;
After that step, the whole input file is in staging_table.
Then copy the rows from the staging table to the individual target tables by selecting only those that qualify for the corresponding table:
INSERT INTO table_1 (...) SELECT ... FROM staging_table WHERE (conditions for table_1);
INSERT INTO table_2 (...) SELECT ... FROM staging_table WHERE (conditions for table_2);
This is of course only feasible if you have enough space in your database to keep the staging table around.
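If the space is available, the whole staging approach also fits in a single psycopg2 transaction, so the file is read exactly once and the COPY into the staging table still sits after a TRUNCATE in the same transaction. A rough, untested sketch; staging_table, the column names, the substr() offsets and the WHERE conditions are placeholders for the real record layout:

# Rough sketch, untested: one pass over the file loads staging_table,
# then plain SQL fans the rows out into the target tables.
import psycopg2

conn = psycopg2.connect(host='chewy.somehost.com', port=5473,
                        database='db_name', user='postgres')
cur = conn.cursor()

cur.execute('TRUNCATE TABLE staging_table')
# Assumes the raw fixed-width lines contain no tabs or backslashes, so they
# can be copied verbatim into a single text column.
cur.copy_expert('COPY staging_table (raw_line) FROM STDIN',
                open('/path/to/bigfile', 'r'))

# Split the fixed-width lines in SQL; offsets and conditions are placeholders.
cur.execute("""
    INSERT INTO table_1 (col_a, col_b)
    SELECT substr(raw_line, 1, 10), substr(raw_line, 11, 10)
      FROM staging_table
     WHERE substr(raw_line, 1, 2) = '01'
""")
cur.execute("""
    INSERT INTO table_2 (col_c, col_d)
    SELECT substr(raw_line, 21, 10), substr(raw_line, 31, 10)
      FROM staging_table
     WHERE substr(raw_line, 1, 2) = '02'
""")
conn.commit()

Making staging_table a temporary table is another option on 8.3, since temp tables are not WAL-logged, although it then has to be created inside the same session on every run.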