
Filling a PostgreSQL database with a large amount of data

I have a PostgreSQL database with a certain structure, and I have several million XML files. I have to parse each file, extract certain data, and fill the tables in the database. What I want to know is the best language/framework/algorithm for performing this routine.

I wrote a program in C# (Mono) using the DbLinq ORM. It does not use threading; it just parses file by file, fills table objects, and submits a group of objects (for example, 200) to the database at a time. It turns out to be rather slow: it processes about 400 files per minute, and it will take about a month to finish the job.

I would appreciate your thoughts and tips.


I think it would be faster if you used small programs in a pipe that:

  • join your files into one big stream;

  • parse the input stream and generate an output stream in PostgreSQL COPY format - the same format pg_dump uses when creating backups, similar to tab-separated values; it looks like this (a parser sketch follows the shell example below):

COPY table_name (table_id, table_value) FROM stdin;
1   value1
2   value2
3   value3
\.
  • load the COPY stream into PostgreSQL started temporarily with the "-F" option to disable fsync calls.

For example on Linux:

find -name \*.xml -print0 | xargs -0 cat \
  | parse_program_generating_copy \
  | psql dbname
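
As a rough illustration, parse_program_generating_copy could be a small Python script along these lines. This is only a sketch: it reads NUL-separated file paths from stdin (rather than one concatenated byte stream, which standard XML parsers do not handle well), and the table name, column names, and XML element names are placeholders for your real schema.

#!/usr/bin/env python3
# Hypothetical sketch of parse_program_generating_copy: reads NUL-separated
# file paths on stdin (e.g. from `find ... -print0`) and writes COPY-format
# rows for a placeholder table_name (table_id, table_value) to stdout.
import sys
import xml.etree.ElementTree as ET

def field(root, tag):
    # Missing elements become COPY NULLs; backslash, tab and newline
    # must be escaped in COPY text format.
    text = root.findtext(tag)
    if text is None:
        return "\\N"
    return text.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n")

def main():
    paths = sys.stdin.buffer.read().decode().split("\0")
    print("COPY table_name (table_id, table_value) FROM stdin;")
    for path in filter(None, paths):
        root = ET.parse(path).getroot()
        # The element names "id" and "value" are assumptions about the XML.
        print(field(root, "id") + "\t" + field(root, "value"))
    print("\\.")

if __name__ == "__main__":
    main()

With a script like this the pipe becomes find -name \*.xml -print0 | python3 parse_to_copy.py | psql dbname, dropping the xargs -0 cat step because the script opens the files itself.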

Using COPY is much faster than inserting with an ORM. Joining the files into one stream lets reading and writing to the database run in parallel. Disabling fsync allows a big speedup, but it will require restoring the database from a backup if the server crashes during loading.


Generally I believe Perl is a good option for parsing tasks, though I do not know Perl myself. It sounds like your performance demands are so extreme that you might need to write your own XML parser, since a standard one might become the bottleneck (test this before you start implementing). I myself use Python and psycopg2 to communicate with Postgres.

Whichever language you choose, you certainly want to use COPY FROM, probably from stdin, using Perl/Python/another language to feed the data into Postgres.
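
For the Python/psycopg2 route, a minimal sketch of COPY ... FROM STDIN could look like the following. The connection string, table and column names are placeholders, generate_rows() stands in for the XML-parsing step, and the values are assumed to need no COPY escaping.

# Minimal sketch of streaming rows into Postgres with psycopg2's COPY support.
import io
import psycopg2

def generate_rows():
    # Placeholder for the XML-parsing step: yield (table_id, table_value) pairs.
    yield ("1", "value1")
    yield ("2", "value2")

def copy_buffer(rows):
    # Build a tab-separated buffer in COPY text format.
    buf = io.StringIO()
    for table_id, table_value in rows:
        buf.write(table_id + "\t" + table_value + "\n")
    buf.seek(0)
    return buf

conn = psycopg2.connect("dbname=dbname")  # connection string is an assumption
with conn, conn.cursor() as cur:
    cur.copy_expert(
        "COPY table_name (table_id, table_value) FROM STDIN",
        copy_buffer(generate_rows()),
    )
conn.close()

For millions of files you would feed the rows in chunks, or pass a file-like object that streams them, rather than building one buffer in memory; the COPY call itself stays the same.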

Instead of spending a lot of time optimizing everything, you could also take a suboptimal solution and run it massively in parallel on, say, 100 EC2 instances. That would be a lot cheaper than spending hours and hours searching for the optimal solution.
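
The same idea of splitting the work applies on a single machine too. Here is a rough sketch of partitioning the file list across worker processes; process_batch() is a stand-in for whatever parse-and-insert routine you already have, and the glob pattern is an assumption.

# Rough sketch of splitting the file list across local worker processes.
import glob
from multiprocessing import Pool

def process_batch(paths):
    # Stand-in for the existing parse-and-insert routine; returns files handled.
    return len(paths)

def chunks(items, size):
    for i in range(0, len(items), size):
        yield items[i:i + size]

if __name__ == "__main__":
    files = glob.glob("data/**/*.xml", recursive=True)  # path pattern is an assumption
    with Pool(processes=8) as pool:
        done = sum(pool.map(process_batch, list(chunks(files, 200))))
    print("processed", done, "files")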

Without knowing anything about the size of the files, 400 files per minute does not sound too bad. Ask yourself whether it is worth spending a week of development to cut the time to a third, or just starting the run now and waiting a month.
