Using COPY FROM stdin to load tables, reading input file only once
I've got a large (~60 million row) fixed width source file with ~1800 records per row.
I need to load this file into 5 different tables on an instance of Postgres 8.3.9.
My dilemma is that, because the file is so large, I'd like to only have to read it once.
This is straightforward enough using INSERT or COPY as normal, but I'm trying to get a load speed boost by putting my COPY FROM statements inside a transaction that includes a TRUNCATE; skipping the logging this way is supposed to give a considerable speed boost (according to http://www.cirrusql.com/node/3). As I understand it, you can disable logging outright in Postgres 9.x, but I don't have that option on 8.3.9.
The script below has me reading the input file twice, which I want to avoid... any ideas on how I could accomplish this by reading the input file only once? Doesn't have to be bash--I also tried using psycopg2, but couldn't figure out how to stream file output into the COPY statement as I'm doing below. I can't COPY FROM file because I need to parse it on the fly.
#!/bin/bash
table1="copytest1"
table2="copytest2"
#note: $1 refers to the first argument used when invoking this script
#which should be the location of the file one wishes to have python
#parse and stream out into psql to be copied into the data tables
( echo 'BEGIN;'
echo 'TRUNCATE TABLE ' ${table1} ';'
echo 'COPY ' ${table1} ' FROM STDIN'
echo "WITH NULL AS '';"
cat $1 | python2.5 ~/parse_${table1}.py
echo '\.'
echo 'TRUNCATE TABLE ' ${table2} ';'
echo 'COPY ' ${table2} ' FROM STDIN'
echo "WITH NULL AS '';"
cat $1 | python2.5 ~/parse_${table2}.py
echo '\.'
echo 'COMMIT;'
) | psql -U postgres -h chewy.somehost.com -p 5473 -d db_name
exit 0
Thanks!
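For the psycopg2 route mentioned above: recent psycopg2 versions provide cursor.copy_expert(), which reads COPY data from any file-like object with a read() method, so the parser output can be streamed straight into COPY without a temporary file. Below is a rough, untested sketch; ParsedSource, parse_table1 and the file path are illustrative placeholders rather than the actual parser, and it still makes one pass per table.

# Rough sketch, untested: stream parsed rows into COPY via psycopg2's
# copy_expert(), which accepts any object exposing read()/readline().
import psycopg2

class ParsedSource(object):
    """Parses one fixed-width line per read() call and hands it to COPY."""
    def __init__(self, path, parse_line):
        self.src = open(path, 'r')
        self.parse_line = parse_line

    def read(self, size=-1):
        line = self.src.readline()
        if not line:
            return ''                    # empty string ends the COPY stream
        return self.parse_line(line)     # one tab-separated, newline-terminated row

    readline = read                      # copy_expert() may call either method

def parse_table1(line):
    # placeholder: slice the fixed-width record into the columns of copytest1
    return '\t'.join([line[0:10].strip(), line[10:20].strip()]) + '\n'

conn = psycopg2.connect(host='chewy.somehost.com', port=5473,
                        database='db_name', user='postgres')
cur = conn.cursor()
cur.execute('TRUNCATE TABLE copytest1')
cur.copy_expert("COPY copytest1 FROM STDIN WITH NULL AS ''",
                ParsedSource('/path/to/bigfile', parse_table1))
conn.commit()   # TRUNCATE + COPY committed together, as in the bash script

To feed several tables from a single pass in Python you would still need something like one connection (or thread) per table, which is essentially what the named-pipe suggestion below does, or the staging-table approach further down.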
You could use named pipes instead of your anonymous pipe. With this approach, your Python script can feed the corresponding data to different psql processes, one per table.
Create pipes:
mkfifo fifo_table1
mkfifo fifo_table2
Run psql instances:
psql db_name < fifo_table1 &
psql db_name < fifo_table2 &
Your Python script would look roughly like this (pseudocode):
SQL_BEGIN = """
BEGIN;
TRUNCATE TABLE %s;
COPY %s FROM STDIN WITH NULL AS '';
"""

fifo1 = open('fifo_table1', 'w')
fifo2 = open('fifo_table2', 'w')
bigfile = open('mybigfile', 'r')

print >> fifo1, SQL_BEGIN % ('table1', 'table1')  # ugly; with Python 2.6 you could use the .format() syntax
print >> fifo2, SQL_BEGIN % ('table2', 'table2')

for line in bigfile:
    # your code, which parses the line and decides which table it belongs to
    if belongs_to_table1:          # placeholder for your routing condition
        print >> fifo1, data       # data = the parsed, tab-separated row
    else:
        print >> fifo2, data

print >> fifo1, '\\.'              # terminate the COPY data stream
print >> fifo2, '\\.'
print >> fifo1, 'COMMIT;'
print >> fifo2, 'COMMIT;'
fifo1.close()
fifo2.close()
Maybe this is not the most elegant solution, but it should work.
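One practical note on the named pipes: opening a FIFO for writing blocks until some process has it open for reading, so the two psql instances need to be started before the Python script opens fifo_table1 and fifo_table2, and each COPY data stream has to end with a line containing only \. before the COMMIT is sent.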
Why use COPY for the second table? I would assume that doing a:
INSERT INTO table2 (...) SELECT ... FROM table1;
would be faster than using COPY.
Edit
If you need to import different rows into different tables but from the same source file, maybe inserting everything into a staging table and then inserting the rows from there into the target tables is faster:
Import the whole text file into one staging table:
COPY staging_table FROM STDIN ...;
After that step, the whole input file is in staging_table.
Then copy the rows from the staging table to the individual target tables by selecting only those that qualify for the corresponding table:
INSERT INTO table_1 (...) SELECT ... FROM staging_table WHERE (conditions for table_1);
INSERT INTO table_2 (...) SELECT ... FROM staging_table WHERE (conditions for table_2);
This is of course only feasible if you have enough space in your database to keep the staging table around.
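If the space is available, the whole staging approach also fits in a single psycopg2 transaction, so the file is read exactly once and the COPY into the staging table still sits after a TRUNCATE in the same transaction. A rough, untested sketch; staging_table, the column names, the substr() offsets and the WHERE conditions are placeholders for the real record layout:

# Rough sketch, untested: one pass over the file loads staging_table,
# then plain SQL fans the rows out into the target tables.
import psycopg2

conn = psycopg2.connect(host='chewy.somehost.com', port=5473,
                        database='db_name', user='postgres')
cur = conn.cursor()

cur.execute('TRUNCATE TABLE staging_table')
# Assumes the raw fixed-width lines contain no tabs or backslashes, so they
# can be copied verbatim into a single text column.
cur.copy_expert('COPY staging_table (raw_line) FROM STDIN',
                open('/path/to/bigfile', 'r'))

# Split the fixed-width lines in SQL; offsets and conditions are placeholders.
cur.execute("""
    INSERT INTO table_1 (col_a, col_b)
    SELECT substr(raw_line, 1, 10), substr(raw_line, 11, 10)
      FROM staging_table
     WHERE substr(raw_line, 1, 2) = '01'
""")
cur.execute("""
    INSERT INTO table_2 (col_c, col_d)
    SELECT substr(raw_line, 21, 10), substr(raw_line, 31, 10)
      FROM staging_table
     WHERE substr(raw_line, 1, 2) = '02'
""")
conn.commit()

Making staging_table a temporary table is another option on 8.3, since temp tables are not WAL-logged, although it then has to be created inside the same session on every run.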