Berkeley DB Bulk Feature
Hi, I can't find anything about Berkeley DB's bulk insert feature written in C. I can find bulk update, select, and delete at http://download.oracle.com/docs/cd/E17076_02/html/programmer_reference/am_misc_bulk.html. Can anybody tell me how to write a bulk insert? I'm new to both C and Berkeley DB.
- I also want to write quite a lot of data (maybe 30 GB) using this feature, so please also advise me on performance.
- My boss wants me to use the Hash access method.
Thanks
Kevin
I don't know if this is going to help or hurt given your newness to both C and Berkeley DB.
You would need to use the DB_MULTIPLE flag with DB->put(). In order to do this you need to create a bulk DBT structure for your keys, and one for your data. The buffers must be large enough to hold the entire set of keys and values. You then initialize both of them with DB_MULTIPLE_WRITE_INIT, and add your keys and values to the respective buffers with DB_MULTIPLE_WRITE_NEXT.
This was added in 4.8, and honestly I can't find a concrete example for you via Google.
EDIT: At least in the latest releases, there is example code for bulk operations shipped with Berkeley DB. Take a look at examples/c/ex_bulk.c
You can try grouping your inserts into one or more transactions. For example: start a transaction, do the inserts, commit the transaction. That's a standard way to speed up database changes because it reduces the per-statement transaction overhead.
I'm not familiar with the Berkeley DB API, so it might have something better suited for bulk operations; I'm just offering general advice.
Edit:
Some links regarding transactions:
1. Wikipedia entry
2. Berkeley DB Transaction Throughput
For the sake of C++ users, here's how to do it using the Berkeley DB C++ API, which is both undocumented and has zero examples. It does work pretty well, though!
Create a Dbt (a "database thang", I'm not making that up) to hold a memory buffer:
void* buf = new unsigned char[bufferSize];
Dbt* dbt = new Dbt;
dbt->set_data(buf);
dbt->set_ulen(bufferSize);
dbt->set_flags(DB_DBT_USERMEM);
Associate that with a DbMultipleKeyDataBuilder:
DbMultipleKeyDataBuilder* dbi = new DbMultipleKeyDataBuilder(*dbt);
Append your key/value pairs one at a time until you are done or the buffer is full:
dbi->append(curKeyBuf, curKeyLen, curDataBuf, curDataLen);
... (lots more of these) ...
Use your Db* db, and a transaction txn if you wish, and bulk write:
db->put(txn, dbt, NULL, DB_MULTIPLE_KEY);
delete dbi;
I've omitted lots of detail, such as checking that the buffer is full, or big enough to hold even one key/value pair.
A DbMultipleKeyDataBuilder can only be used once, but a really efficient implementation will keep a pool of buffer Dbt objects and reuse them. These Dbts can be used for bulk reading as well, so a common pool can serve both.
The Berkeley DB forums are monitored by several Berkeley DB developers. That would be another good place to post such questions.
Bulk loading a hash in Berkeley DB has been a problem in the past. The following paper explores this further and suggests an algorithm to speed it up: the data is sorted in the order a linear hash (as in Berkeley DB) expects, so loading can be done in one scan of the sorted data. This scales well to large datasets.
Davood Rafiei, Cheng Hu, "Bulk Loading a Linear Hash File", Proc. of the DaWaK Conference, 2006. https://webdocs.cs.ualberta.ca/~drafiei/papers/dawak06.pdf