
Insert many key/value pairs fast into Berkeley DB with hash access

I'm trying to build a hash with Berkeley DB that will contain many tuples (approx. 18GB of key/value pairs), but in all my tests the performance of the insert operations degrades drastically over time. I've written this program to test the performance:

#include <iostream>
#include <db_cxx.h>
#include <ctime>

#define MILLION 1000000

int main () {
    long long a = 0;
    long long b = 0;

    int passes = 0;
    int i = 0;
    u_int32_t flags = DB_CREATE;

    Db* dbp = new Db(NULL, 0);
    dbp->set_cachesize(0, 1024 * 1024 * 1024, 1);   // 1GB cache

    int ret = dbp->open(
            NULL,           // no transaction
            "test.db",
            NULL,
            DB_HASH,
            flags,
            0);
    if (ret != 0) {
        std::cerr << "open failed: " << db_strerror(ret) << std::endl;
        return 1;
    }

    time_t time1 = time(NULL);

    while (passes < 100) {
        while (i < MILLION) {
            Dbt key(&a, sizeof(long long));
            Dbt data(&b, sizeof(long long));

            dbp->put(NULL, &key, &data, 0);
            a++; b++; i++;
        }

        // ask the cache to flush some dirty pages after each pass
        DbEnv* dbep = dbp->get_env();
        int tmp;
        dbep->memp_trickle(50, &tmp);

        i = 0;
        passes++;
        std::cout << "Inserted one million --> pass: " << passes
                  << " took: " << time(NULL) - time1 << "sec" << std::endl;
        time1 = time(NULL);
    }

    dbp->close(0);
    delete dbp;
    return 0;
}

Perhaps you can tell me why the "put" operation takes increasingly longer after some time, and how to fix it.

Thanks for your help, Andreas


You may want to look at the information provided by the db_stat utility and the HASH-specific tuning functions that are available. Please see the BDB Reference Guide section on configuring a HASH database.
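For example, here is a sketch of two of those tuning calls, made before open(); the numbers below are illustrative assumptions for roughly 18GB of 8-byte/8-byte pairs, not measured recommendations:

// Sketch: HASH tuning hints, set on the Db handle before open(); the
// numbers are illustrative assumptions, not measured recommendations.
Db* dbp = new Db(NULL, 0);

// Expected final element count (~1.1 billion 16-byte pairs for 18GB),
// so buckets are allocated up front instead of splitting during load.
dbp->set_h_nelem(1100000000u);

// Fill factor; the reference guide suggests roughly
// (pagesize - 32) / (key_size + data_size + 8) as a starting point,
// which is about 169 for 4KB pages with 8-byte keys and values.
dbp->set_h_ffactor(169);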

I would expect you to get tens of thousands of inserts per second on commodity hardware. What are you experiencing, and what is your performance target?

Regards,

Dave


I would recommend trying the bulk insert API; you can read about it in the documentation here: http://www.oracle.com/technology/documentation/berkeley-db/db/api_reference/CXX/dbput.html#put_DB_MULTIPLE_KEY
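A minimal sketch of what that might look like, assuming a BDB release that ships the DbMultipleKeyDataBuilder helper (4.8 or later); the buffer size and batch count are arbitrary:

// Sketch: bulk insert with DB_MULTIPLE_KEY, assuming BDB 4.8+ which
// provides the DbMultipleKeyDataBuilder helper. Untested illustration.
#include <db_cxx.h>

void bulk_put(Db* dbp, long long start, int count) {
    // The application owns the bulk buffer: DB_DBT_USERMEM, and the
    // docs ask for a multiple of 1024 bytes.
    const u_int32_t BUFSIZE = 1024 * 1024;
    char* buf = new char[BUFSIZE];
    Dbt bulk;
    bulk.set_data(buf);
    bulk.set_ulen(BUFSIZE);
    bulk.set_flags(DB_DBT_USERMEM);

    // Pack key/value pairs into the buffer...
    DbMultipleKeyDataBuilder builder(bulk);
    for (long long v = start; v < start + count; ++v)
        if (!builder.append(&v, sizeof(v), &v, sizeof(v)))
            break;                      // buffer full

    // ...and hand them all to a single put() call.
    Dbt unused;                         // ignored with DB_MULTIPLE_KEY
    dbp->put(NULL, &bulk, &unused, DB_MULTIPLE_KEY);
    delete[] buf;
}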

Also, I would guess that your call to memp_trickle is responsible for most of the slowdown. As the cache becomes dirtier, finding pages to trickle becomes more expensive. In fact, since you are only writing, having a large cache only hurts: once you've written the data you don't use it again, so you don't want it hanging around in the cache. I would recommend testing different (smaller) cache sizes.
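For instance, a sketch of the only change that takes in the test program; 64MB is an arbitrary starting point to measure from, not a recommendation:

// Sketch: a much smaller cache for this write-only load; 64MB is an
// arbitrary starting point -- the sweet spot has to be measured.
dbp->set_cachesize(0, 64 * 1024 * 1024, 1);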

Finally, if your sole concern is insert performance, using a larger page size will help. You'll be able to fit more data on each page and that will result in fewer disk writes.
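For example (a sketch; set_pagesize must be called before open(), and 64KB, BDB's maximum, is used purely as an illustration):

// Sketch: a larger page size, set before Db::open(); 64KB is BDB's
// maximum page size and is used here only as an illustration.
dbp->set_pagesize(64 * 1024);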

-Ben


The memp_trickle is almost certainly slowing things down. It's often good to use trickle, but it belongs in its own thread to be effective. BDB (except when you get into the higher-level replication APIs) creates no threads for you -- nothing happens behind the scenes, thread-wise. Trickle is effective when you have dirty pages being forced from the cache (look at the stat output to see if that's what's happening).
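A sketch of what moving trickle into its own thread might look like, assuming C++11's std::thread; the 20% target and the one-second interval are arbitrary:

// Sketch: running memp_trickle in a dedicated thread, assuming C++11
// std::thread; 20% and the 1s interval are arbitrary illustrations.
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> done(false);

void trickle_loop(DbEnv* env) {
    while (!done) {
        int wrote = 0;
        env->memp_trickle(20, &wrote);  // keep ~20% of cache pages clean
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}

// In main(), after opening the database:
//   std::thread t(trickle_loop, dbp->get_env());
//   ... run the insert loop without the inline memp_trickle call ...
//   done = true; t.join();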

You might also consider using a BTREE instead of a HASH. Yeah, I know you specifically said hash, but why? If you are looking to maximize performance, why add that restriction? You may be able to take advantage of locality of reference to reduce your cache footprint -- there is often much more locality than you believe, or perhaps you can create some: if you generate keys that are random digits, for example, prepend the date and time. That usually introduces locality into a perceived 'random' system.

If you do use BTREE, you'll need to pay some attention to byte ordering for your keys on your system (look up Endianness in Wikipedia); if you are using a little-endian system, you'll need to swap the bytes, as in the sketch below. Using BTREE with the right ordering and introduced locality means your key/value pairs will be stored in 'key-generation-time' order, so if you see the most action on the recent keys, you'll tend to hit the same pages over and over (see your cache hit rate in your stats). So you'll need less cache. Another way to think of it: with the same amount of cache, your solution will scale by a larger multiple.
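Here is a sketch of that byte swap for a little-endian machine, so that BTREE's default byte-wise (lexicographic) comparison sorts 64-bit keys numerically:

// Sketch: store 64-bit keys big-endian so BTREE's default byte-wise
// comparison matches numeric order on a little-endian machine.
#include <cstdint>

uint64_t to_big_endian(uint64_t v) {
    uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r = (r << 8) | (v & 0xff);      // move the low byte to the top
        v >>= 8;
    }
    return r;
}

// Usage in the insert loop, with the database opened as DB_BTREE:
//   uint64_t k = to_big_endian(a);
//   Dbt key(&k, sizeof(k));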

I expect your actual app really doesn't insert integer keys in order (if it does, you'll be lucky). So you should write a benchmark that closely simulates your access patterns, at least with respect to: size of key, size of data, access pattern, number of items in the database, and mix of reads and writes. Once you have that, look at the stats -- pay close attention to anything that implies I/O or contention.

BTW, I've recently started a blog at http://libdb.wordpress.com to discuss BDB performance tuning (and other matters related to BDB). You may get some good ideas there. There can be a huge difference in latency and throughput depending on the kind of tuning you do. In particular, see http://libdb.wordpress.com/2011/01/31/revving-up-a-benchmark-from-626-to-74000-operations-per-second/


Your performance degradation can have several causes that aren't actually connected with your code. I may be mistaken, but I think this is all about the database's internal structure (and the data structure it uses).

Think of a situation where the database uses some approach other than a hash table, for example a red-black tree. Inserting into that tree takes O(log N) time in the Big-O sense, and every inserted element increases the time needed for the next insert.

Unfortunately, the same could happen with a plain hash table, so that the initial O(1) insertion time degrades to something worse. This can have several causes, but it's all about hash collisions, which can happen due to a poor hash function, data that is bad for the hash function in use, or even the phase of the moon.

If I were you, I would try to dig into your DB's internal structure. Also, testing your keys with something other than your DB (e.g. boost::unordered_map) could benefit your testing and profiling.
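A sketch of such a baseline, using std::unordered_map in place of boost::unordered_map, with a smaller item count chosen arbitrarily so it fits in RAM:

// Sketch: a pure in-memory baseline for the same insert pattern, using
// std::unordered_map as a stand-in for boost::unordered_map. 10M pairs
// is an arbitrary count chosen to fit comfortably in RAM.
#include <ctime>
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<long long, long long> m;
    m.reserve(10 * 1000000);            // pre-size to avoid rehashing
    time_t t0 = time(NULL);
    for (long long v = 0; v < 10 * 1000000LL; ++v) {
        m[v] = v;
        if ((v + 1) % 1000000 == 0)
            std::cout << (v + 1) / 1000000 << "M inserted after "
                      << time(NULL) - t0 << "s" << std::endl;
    }
    return 0;
}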

Edit: Also, did you try changing the cache size in your sample? Or maybe there are other performance-related parameters that could be modified?
