
Writing to PostgreSQL database format without using PostgreSQL

I am collecting lots of data from lots of machines. These machines cannot run PostgreSQL and they cannot connect to a PostgreSQL database. At the moment I save the data from these machines in CSV files and use the COPY FROM command to import it into the PostgreSQL database. Even on high-end hardware this process takes hours. Therefore, I was thinking about writing the data in PostgreSQL's on-disk database format directly. I would then simply copy these files into the data directory and start the PostgreSQL server, which would find the database files and accept them as databases.

Is such a solution feasible?
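
For reference, the loading step currently looks roughly like this (the table, columns, and file path are placeholder names, not the real schema):

    -- Placeholder table and file names; the real schema isn't shown here.
    COPY measurements (machine_id, recorded_at, value)
    FROM '/var/lib/postgresql/import/machine_data.csv'
    WITH (FORMAT csv, HEADER true);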


Theoretically this might be possible if you studied the source code of PostgreSQL very closely.

But you essentially wind up (re)writing the core of PostgreSQL, which qualifies as "not feasible" from my point of view.

Edit:
You might want to have a look at pg_bulkload, which claims to be faster than COPY (I haven't used it myself, though).


Why can't they connect to the database server? If it is because of library dependencies, I suggest you set up some sort of client-server solution (web services, perhaps) that could queue and submit the data along the way.

Relying on batch operations will always give you a headache when dealing with large amounts of data, and if COPY FROM isn't fast enough for you, I don't think anything will be.


Yeah, you can't just write the files out in any reasonable way. In addition to the data page format, you'd need to replicate the commit logs, part of the write-ahead logs, some of the transaction-visibility machinery, any conversion code for the types you use, and possibly the TOAST and varlena code. Oh, and the system catalog data, as already mentioned. Rough guess: you might get by with borrowing only 200K lines of code from the server. PostgreSQL is built from the ground up around being extensible; you can't even interpret what an integer means without first looking up the type information for the integer type in the system catalog.
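
To see that even basic types are described by catalog rows rather than hard-wired, you can look one up in the pg_type system catalog (this query is just an illustration):

    -- Even the humble 4-byte integer is defined by a catalog row;
    -- anything reading the data files needs this metadata to decode values.
    SELECT oid, typname, typlen, typalign
    FROM pg_type
    WHERE typname = 'int4';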

There are some tips for speeding up the COPY process at Bulk Loading and Restores. Turning off synchronous_commit in particular may help. Another trick that may be useful: if you start a transaction, TRUNCATE a table, and then COPY into it, that COPY goes much faster, because it can skip the usual write-ahead log protection (note that this shortcut is only available when WAL archiving and replication are disabled). However, it's easy to discover that COPY is actually bottlenecked on CPU, and there's nothing useful you can do about that directly. Some people split the incoming file into pieces and run multiple COPY operations at once to work around this.
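
A minimal sketch of the transaction-plus-TRUNCATE trick, combined with turning off synchronous_commit (table and file names are assumed placeholders):

    -- Assumed table/file names. When WAL archiving/replication is off,
    -- a COPY that follows a TRUNCATE in the same transaction can skip
    -- write-ahead logging for the loaded data.
    SET synchronous_commit = off;
    BEGIN;
    TRUNCATE measurements;
    COPY measurements
    FROM '/var/lib/postgresql/import/machine_data.csv'
    WITH (FORMAT csv, HEADER true);
    COMMIT;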

Realistically, pg_bulkload is probably your best bet, unless it too becomes CPU-bound, at which point a splitter outside the database plus multiple parallel loads is really what you need.
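
To illustrate the parallel approach: after splitting the CSV outside the database, each chunk is loaded in its own session so the per-COPY CPU work spreads across cores (file and table names are assumed placeholders):

    -- Run each statement in a separate connection, concurrently,
    -- e.g. from separate psql processes.
    -- Session 1:
    COPY measurements FROM '/var/lib/postgresql/import/part_aa.csv' WITH (FORMAT csv);
    -- Session 2:
    COPY measurements FROM '/var/lib/postgresql/import/part_ab.csv' WITH (FORMAT csv);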

