Copy data from PostgreSQL to CouchDB

I have a simple PHP script:

$cc = new couchClient('http://localhost:5984', 'posts_votes');
for ($i = 0; $i < 107000000; $i += 100000) {
    // Usage of dibi database layer
    $data = dibi::select('*')->from('posts_votes')
        ->offset($i)
        ->limit(100000);
    foreach ($data as $n => $row) {
        $doc = new stdClass();
        $doc->post_id = $row->post_id;
        $doc->date = $row->date->format('Y-m-d H:i:s');
        $doc->upvotes = $row->upvotes;

        $cc->storeDoc($doc);
        if (($i+$n) % 1000 == 0) {
            echo ($i+$n).'...';
        }
    }
}

but this script is only able to copy approximately 8,500 records per minute (when running the script directly through the PHP interpreter, without Apache), which is not fast enough when I need to copy ~100 000 000 records.

Is there any way to do this faster?


Look into the storeDocs() method instead of storeDoc(). storeDocs (plural) will use the CouchDB _bulk_docs API, which will be much faster.

Perhaps you can experiment to find the best batch size.
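For example, a minimal sketch of such a batching loop, assuming storeDocs() accepts an array of documents (as the name suggests); the batch size of 1000 is just a starting point to tune:

$cc = new couchClient('http://localhost:5984', 'posts_votes');
$batchSize = 1000;   // starting point; experiment to find what works best
$batch = array();

for ($i = 0; $i < 107000000; $i += 100000) {
    $data = dibi::select('*')->from('posts_votes')
        ->offset($i)
        ->limit(100000);
    foreach ($data as $row) {
        $doc = new stdClass();
        $doc->post_id = $row->post_id;
        $doc->date    = $row->date->format('Y-m-d H:i:s');
        $doc->upvotes = $row->upvotes;
        $batch[] = $doc;

        if (count($batch) >= $batchSize) {
            $cc->storeDocs($batch);   // one _bulk_docs POST instead of 1000 single writes
            $batch = array();
        }
    }
}
if (count($batch) > 0) {
    $cc->storeDocs($batch);           // flush whatever is left
}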


An excerpt from guide.couchdb.org about bulk inserts:

The fastest mode for importing data into CouchDB via HTTP is the _bulk_docs endpoint. The bulk documents API accepts a collection of documents in a single POST request and stores them all to CouchDB in a single index operation.

Bulk docs is the API to use when you are importing a corpus of data using a scripting language. It can be 10 to 100 times faster than individual bulk updates and is just as easy to work with from most languages.


but this script is able to copy approximately 8500 records per minute

Part of the problem might be that you're fetching 100k rows into memory each time before processing them. This might be one of those cases where good old PDO will work better.

  1. Create and execute a PDO statement for select post_id, date, upvotes from posts_votes (optionally with a limit and an offset).

  2. Use: while ($row = $stmt->fetch(PDO::FETCH_OBJ))

That way you'll only be creating one object at a time.
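A rough sketch of that streaming approach, combined with the bulk API from the other answer (the DSN, credentials, and batch size are placeholders, and couchClient::storeDocs() accepting an array of documents is assumed):

// Stream rows from PostgreSQL one at a time instead of materializing 100k-row chunks
$pdo = new PDO('pgsql:host=localhost;dbname=mydb', 'user', 'password');  // placeholder DSN
$cc  = new couchClient('http://localhost:5984', 'posts_votes');

$stmt = $pdo->prepare('select post_id, date, upvotes from posts_votes');
$stmt->execute();

$batch = array();
while ($row = $stmt->fetch(PDO::FETCH_OBJ)) {
    $doc = new stdClass();
    $doc->post_id = $row->post_id;
    $doc->date    = $row->date;       // timestamps come back from PDO as strings
    $doc->upvotes = $row->upvotes;
    $batch[] = $doc;

    if (count($batch) >= 1000) {      // placeholder batch size
        $cc->storeDocs($batch);       // bulk write instead of one request per document
        $batch = array();
    }
}
if (count($batch) > 0) {
    $cc->storeDocs($batch);
}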

An even faster way, though, would likely be to build one huge CouchDB payload from within psql, and use the output to bulk-load things into CouchDB. I'm unsure of the precise Couch syntax, but here's a basis to get you started:

select string_agg('{' ||
         '"post_id": ' || post_id || ', ' ||
         '"date": "' || date::timestamp::text || '", ' ||
         '"upvotes": ' || upvotes ||
       '}', ',')
from posts_votes;
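Once psql has written that aggregated string to a file (for example with the -t and -o options), loading it is a single request. A rough sketch, assuming the output landed in /tmp/docs.json (the path is just an example) and using the standard _bulk_docs envelope:

// Wrap the psql output in the {"docs": [...]} envelope that _bulk_docs expects
$json = '{"docs": [' . file_get_contents('/tmp/docs.json') . ']}';

$ch = curl_init('http://localhost:5984/posts_votes/_bulk_docs');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: application/json'));
curl_setopt($ch, CURLOPT_POSTFIELDS, $json);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
echo curl_exec($ch);   // CouchDB returns one status entry per document
curl_close($ch);

For ~100 million rows you would still want to split this into chunks rather than sending one gigantic request.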
