
Writing to PostgreSQL database format without using PostgreSQL

I am collecting lots of data from lots of machines. These machines cannot run PostgreSQL and they cannot connect to a PostgreSQL database. At the moment I save the data from these machines in CSV files and use the COPY FROM command to import it into the PostgreSQL database. Even on high-end hardware this process takes hours. Therefore, I was thinking about writing the data in PostgreSQL's on-disk database format directly. I would then simply copy these files into the data directory and start the PostgreSQL server, which would find the database files and accept them as databases.

Is such a solution feasible?
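
For reference, the loading step currently looks roughly like this (the table, columns, and file path are placeholder names, not the real schema):

    -- Placeholder table and file names; the real schema isn't shown here.
    COPY measurements (machine_id, recorded_at, value)
    FROM '/var/lib/postgresql/import/machine_data.csv'
    WITH (FORMAT csv, HEADER true);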


Theoretically this might be possible if you studied the source code of PostgreSQL very closely.

But you essentially wind up (re)writing the core of PostgreSQL, which qualifies as "not feasible" from my point of view.

Edit:
You might want to have a look at pg_bulkload, which claims to be faster than COPY (I haven't used it myself, though).


Why can't they connect to the database server? If it is because of library dependencies, I suggest you set up some sort of client-server solution (web services, perhaps) that could queue and submit the data along the way.

Relying on batch operations will always give you a headache when dealing with large amounts of data, and if COPY FROM isn't fast enough for you, I don't think anything will be.


Yeah, you can't just write the files out in any reasonable way. In addition to the data page format, you'd need to replicate the commit logs, part of the write-ahead logs, some of the transaction-visibility machinery, any conversion code for the types you use, and possibly the TOAST and varlena code. Oh, and the system catalog data, as already mentioned. Rough guess: you might get by with borrowing only 200K lines of code from the server. PostgreSQL is built from the ground up around being extensible; you can't even interpret what an integer means without first looking up the type information for the integer type in the system catalog.
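
To see that even basic types are described by catalog rows rather than hard-wired, you can look one up in the pg_type system catalog (this query is just an illustration):

    -- Even the humble 4-byte integer is defined by a catalog row;
    -- anything reading the data files needs this metadata to decode values.
    SELECT oid, typname, typlen, typalign
    FROM pg_type
    WHERE typname = 'int4';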

There are some tips for speeding up the COPY process at Bulk Loading and Restores. Turning off synchronous_commit in particular may help. Another trick that may be useful: if you start a transaction, TRUNCATE a table, and then COPY into it, that COPY goes much faster, because it can skip the usual write-ahead log protection (note that this shortcut is only available when WAL archiving and replication are disabled). However, it's easy to discover that COPY is actually bottlenecked on CPU, and there's nothing useful you can do about that directly. Some people split the incoming file into pieces and run multiple COPY operations at once to work around this.
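
A minimal sketch of the transaction-plus-TRUNCATE trick, combined with turning off synchronous_commit (table and file names are assumed placeholders):

    -- Assumed table/file names. When WAL archiving/replication is off,
    -- a COPY that follows a TRUNCATE in the same transaction can skip
    -- write-ahead logging for the loaded data.
    SET synchronous_commit = off;
    BEGIN;
    TRUNCATE measurements;
    COPY measurements
    FROM '/var/lib/postgresql/import/machine_data.csv'
    WITH (FORMAT csv, HEADER true);
    COMMIT;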

Realistically, pg_bulkload is probably your best bet, unless it too becomes CPU-bound, at which point a splitter outside the database plus multiple parallel loads is really what you need.
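
To illustrate the parallel approach: after splitting the CSV outside the database, each chunk is loaded in its own session so the per-COPY CPU work spreads across cores (file and table names are assumed placeholders):

    -- Run each statement in a separate connection, concurrently,
    -- e.g. from separate psql processes.
    -- Session 1:
    COPY measurements FROM '/var/lib/postgresql/import/part_aa.csv' WITH (FORMAT csv);
    -- Session 2:
    COPY measurements FROM '/var/lib/postgresql/import/part_ab.csv' WITH (FORMAT csv);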

