Fast Bloom Filters in C - 64-bit Ints, High-frequency Initialize/Query/Destroy Cycle
I need a bloom filter implementation for part of a large project. The whole project is in C (and C only! no C++), and unfortunately I have not been able to find any decent C-based bloom filter implementations (except for a proof-of-concept implementation).
My bloom filter requirements:
1. The module containing the bloom filter runs every 50ms. The entire module needs to finish executing within 5-6ms, which means the whole bloom filter code has to be done in less than 3ms.
2. Elements are 64-bit integers.
3. I have fewer than 8k elements in total (inserts/queries inclusive). The common case is a few hundred inserts into the filter, and 1000-1500 queries.
Every 50ms, I receive two sets (W, R) of 64-bit ints. I need to find the intersection between W and R received in this epoch (IOW, the bloom filter has to start fresh for every epoch). The code below shows the general control flow:
sleep(50ms)
...module code..
clear(bloomfilter) /* basically a memset(0) on bloomfilter bitmap */
W = getListW()
for each entry in W
insert(bloomfilter, entry)
R = getListR()
for each entry in R
if (present(bloomfilter, entry))
..do something with entry..
..rest of module code..
Now, I have seen several papers that claim to do fast bloom filter operations on very large data sets, but my requirements are different: I need fast seeding (inserting W) and fast querying. Hash functions are another concern; I cannot afford heavy-duty hash functions like SHA-1 due to the time constraints.
You'll want to keep this simple, since you're dealing with a small number of elements and they are 64-bit ints (which are fast to compare on a 32-bit machine, and lightning fast on a 64-bit one). As a first shot, I'd go with a hash table of 64K slots. When you insert, compute a 16-bit 'hash' of the 64-bit int by combining its 16-bit pieces to get a table index. If that isn't fast enough, profile it to find out why.
This doesn't sound as sexy as doing something with bloom filters, but really, you're only dealing with 8K integers. Here's some code I whipped up just now (I didn't try to compile it). It's probably pretty fast assuming a random distribution of inserted numbers, but note it won't work if any of the inserted values is 0.
#include <assert.h>
#include <stdint.h>
#include <string.h>

uint64_t table[65536] = {0};

void clear(void)
{
    memset(table, 0, sizeof(table));
}

uint16_t hash(uint64_t ele)
{
    assert(ele != 0);  /* 0 marks an empty slot, so it cannot be stored */
    uint16_t *parts = (uint16_t *)&ele;
    uint16_t h = 0x5AA5;
    h = h * 131 + parts[0];
    h = h * 131 + parts[1];
    h = h * 131 + parts[2];
    h = h * 131 + parts[3];
    return h;
}

void insert(uint64_t ele)
{
    uint16_t h = hash(ele);
    while (table[h])   /* linear probing; h wraps mod 65536 automatically,
                          and the table can't fill with at most 8K inserts */
        ++h;
    table[h] = ele;
}

int find(uint64_t ele)
{
    uint16_t h = hash(ele);
    while (table[h] != ele)
    {
        if (!table[h])
            return 0;
        ++h;
    }
    return 1;
}
You'll need better collision resolution if your inserts aren't randomly distributed. You could also probably come up with a better hash method.
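As one possibility for that better hash (a sketch, not part of the answer above), you could run each element through a strong 64-bit mixer and then fold the result to a 16-bit index for the 64K-slot table. The constants below are the standard splitmix64 finalizer constants:

```c
#include <stdint.h>

/* splitmix64 finalizer: a well-known 64-bit avalanche mixer */
static uint64_t mix64(uint64_t x)
{
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Fold the mixed value down to a 16-bit table index. */
static uint16_t hash16(uint64_t ele)
{
    uint64_t m = mix64(ele);
    return (uint16_t)(m ^ (m >> 16) ^ (m >> 32) ^ (m >> 48));
}
```

This costs a handful of multiplies and shifts per element, so it should still fit comfortably in the stated time budget even for clustered or sequential inputs.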
If I understand you:
- You will implement each bloom filter as a bitmap of size N.
- You assume a hash function which uniformly distributes the elements.
If you have ~1000 elements, you would size the bloom filter bitset so only some tolerable load factor of the bits are set, perhaps an average of 1 in 8, to keep the set-intersection false-positive rate low. Nevertheless you may always get some false positives. For example, with bloom filter set intersection you may get a false positive when set1 = { e1 } and set2 = { e2 }, with e1 != e2, so that set1 intersect set2 = { } but bf(set1) intersect bf(set2) != { } (e1 and e2 happen to set the same bit). Note you will never get false negatives: if bf(set1) intersect bf(set2) = { }, then necessarily set1 intersect set2 = { }.
I think your algorithm should form BFs for both R and W, then intersect them as many bits at a time as possible, as shown in variant 2 below.
Quick hack, rusty C:
#include <stdbool.h>
#include <string.h>

typedef unsigned long ulong;

#define N 8192u                   /* filter size in bits (1 KB) */
#define BPW (8u * sizeof(ulong))  /* bits per word */

typedef struct BF { ulong bits[N / BPW]; } BF;

unsigned hash(ulong e) { return foo(e) % N; }  /* foo() = your mixing function */

void clear(BF *pbf) { memset(pbf->bits, 0, sizeof(pbf->bits)); }

void add(BF *pbf, ulong e) { unsigned h = hash(e); pbf->bits[h / BPW] |= 1UL << (h % BPW); }

bool hit(BF *pbf, ulong e) { unsigned h = hash(e); return (pbf->bits[h / BPW] >> (h % BPW)) & 1; }

bool intersect(BF *pbfResult, BF *pbf1, BF *pbf2)
{
    bool empty = true;
    for (unsigned i = 0; i < N / BPW; i++)
        if ((pbfResult->bits[i] = pbf1->bits[i] & pbf2->bits[i]) != 0)
            empty = false;
    return !empty;
}
void intersectRW(unsigned nr, ulong* r, unsigned nw, ulong* w) {
BF bfR, bfW, bfIntersection;
unsigned i;
clear(&bfR);
for (i = 0; i < nr; i++)
add(&bfR, r[i]);
// variant 1: enumerate elements of W that hit in BF(R)
for (i = 0; i < nw; i++)
if (hit(&bfR, w[i]))
... w[i] ...
// variant 2: determine if intersection of BFs is empty and get intersection BF
clear(&bfW);
for (i = 0; i < nw; i++)
add(&bfW, w[i]);
bool any = intersect(&bfIntersection, &bfR, &bfW);
...
}
Expected runtime?
- Each invocation initializes 3 BFs of 1 KB each (128 64-bit ulongs); these smallish bitmaps live on the stack, should easily fit into the L1 cache, and at any rate have great spatial locality;
- adds 100-1000 elements to bfR e.g. ~1000 inlined invocations of add, some bit shifts and stores;
- hit tests 100-1000 elements of bfR e.g. ~1000 inlined invocations of hit, some bit shifts, masks, tests;
- or variant 2, performs elementwise ANDs on just ~128 pairs of ulongs
(Note of course all the / and % in the code above are optimized into shifts and masks.)
In total this might be some tens of thousands of instructions and a few thousand L1 or L2 cache hits; on a 2 GHz machine, I would be surprised if this takes more than a few ms once warmed up.
As for hash functions, you didn't tell us about the distribution of these 64-bit elements. If they are already well distributed, you can just fold the 64-bits down to 16-bits with a couple of shifts, xors, and a mask.
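As a sketch of that fold (only sensible if the input bits are already well distributed), xoring the four 16-bit quarters together takes just two shift-xors and a mask:

```c
#include <stdint.h>

/* Fold a 64-bit value to 16 bits: xor of its four 16-bit quarters.
   Assumes the input is already well mixed. */
uint16_t fold16(uint64_t x)
{
    x ^= x >> 32;   /* fold the top half into the bottom half */
    x ^= x >> 16;   /* low 16 bits are now the xor of all four quarters */
    return (uint16_t)(x & 0xFFFFu);
}
```

For example, fold16(0x0001000200030004) is 0x0001 ^ 0x0002 ^ 0x0003 ^ 0x0004 = 0x0004.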
* Today's curious fact -- MS VC++ 4.0's fine-grained 'minimal rebuild' feature (http://msdn.microsoft.com/en-us/library/kfz8ad09(VS.80).aspx) depends upon bloom filters galore -- but we had never heard of bloom filters at the time. Rather, we thought we had invented a novel set with probabilistic-membership-test data structure... *
What do you think?
Happy hacking!
Wait, I forgot to mention:
- Overkill, but you can speed up the clear and the intersect operation using vector SIMD instructions (e.g. SSE).
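As a sketch of that SIMD idea (assuming an x86 target with SSE2, operating on the 1 KB bitmaps from the code above, i.e. 128 64-bit words), the AND-and-test loop could look like this:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* AND two bitmaps 16 bytes at a time; returns nonzero iff the
   intersection has any bit set. nwords (count of 64-bit words)
   must be even; buffers need not be 16-byte aligned. */
int intersect_sse2(uint64_t *out, const uint64_t *a, const uint64_t *b,
                   size_t nwords)
{
    __m128i any = _mm_setzero_si128();
    for (size_t i = 0; i < nwords; i += 2) {
        __m128i x = _mm_and_si128(_mm_loadu_si128((const __m128i *)(a + i)),
                                  _mm_loadu_si128((const __m128i *)(b + i)));
        _mm_storeu_si128((__m128i *)(out + i), x);
        any = _mm_or_si128(any, x);   /* accumulate "anything set?" */
    }
    /* 'any' is all-zero iff every compared byte equals zero */
    return _mm_movemask_epi8(_mm_cmpeq_epi8(any, _mm_setzero_si128())) != 0xFFFF;
}
```

For 128-word bitmaps this is 64 loop iterations, so the win over the plain scalar loop is modest; as the answer says, it is overkill at this problem size.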
- You may be able to take advantage of other properties of the data. For example, if there is any similarity between each invocation's R and W arrays, you may be able to turn the brute force algorithm into an incremental algorithm, although you might have to use counting bloom filters.
- Depending upon the load factor and the repetitiveness of the elements themselves, you may not need to clear the bitmaps each iteration. You only need to clear them when you finally get a non-empty intersection (then rerun the add()s and the intersect().)
- Your problem size doesn't need it here, but if you had millions of elements, you could divide the input R and W lists into sublists, hand them out to multiple cores, build private copies of BFs for R and W, then fold (OR) the BF(R)s and BF(W)s together.
You've got a relatively small number of integers and 3 ms to process them.
Is your CPU fast enough to keep this simple and sort both lists? The sort should be fast as everything will fit comfortably in the cache. Running through two lists to find the intersection is quite fast and you'll never have to be concerned about dealing with false-positives as you would with a Bloom filter.
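A sketch of that alternative (helper names are mine, not from the question), using qsort and a two-pointer merge over the sorted lists:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static int cmp_u64(const void *pa, const void *pb)
{
    uint64_t a = *(const uint64_t *)pa, b = *(const uint64_t *)pb;
    return (a > b) - (a < b);   /* avoid overflow; never compute a - b */
}

/* Sort both lists in place, then walk them in lockstep; writes the
   common elements into 'out' and returns how many were found.
   Exact result: no false positives, unlike a bloom filter. */
size_t intersect_sorted(uint64_t *w, size_t nw,
                        uint64_t *r, size_t nr, uint64_t *out)
{
    qsort(w, nw, sizeof(*w), cmp_u64);
    qsort(r, nr, sizeof(*r), cmp_u64);
    size_t i = 0, j = 0, k = 0;
    while (i < nw && j < nr) {
        if (w[i] < r[j])        i++;
        else if (w[i] > r[j])   j++;
        else { out[k++] = w[i]; i++; j++; }
    }
    return k;
}
```

At a few hundred inserts and ~1500 queries, the two sorts are O(n log n) over tiny cache-resident arrays, so this should also land well under the 3ms budget.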