What's the most efficient bit vector compression method for my use case?
- allow fast individual bit setting/unsetting
- permit efficient range queries over the bitmap
- possibly allow fast XOR-ing/AND-ing of two indexes
Thx in advance for your suggestions.
Check out FastBit:
https://sdm.lbl.gov/fastbit/
You could use a simple tree data structure like this:
struct node {
node * leftChild;
node * rightChild;
long mask;
};
struct tree {
int exponent; // the size of the tree is 2^exponent
node rootNode;
};
Every node represents a sub-array of the large bit array that is (2^n)*sizeof(long) bits, n>=0. Leaf nodes store a raw bit mask in 'mask' if they're at the bottom of the tree, otherwise they store 0 in 'mask'. This way, leaf node with a 'mask' value of 0 can represent a (2^n)*sizeof(long) -sized empty area in the bit array, so sparse bit arrays can be stored efficiently.
leftChild and rightChild are of course null in all leaf nodes. Every other node has both a leftChild and rightChild pointer, and every node that isn't a leaf node has at least one descendant node with mask that has bits set in it.
To find a bit at a given index:
bool find_bit_at_index(tree t, long ind) {
long divider = 1 << (t.exponent - 1);
node *n = &t.rootNode;
node *lastNode;
while (n)
{
lastNode = n;
if (ind >= divider) {
n = n->rightChild;
ind -= divider;
}
else {
n = n->leftChild;
}
divider >>= 1;
}
return lastNode->mask & (1 << ind);
}
Constructing the tree and developing the rest of the algorithms should be easy enough once you understand the idea. I haven't actually tested the code, since this isn't a complete solution, some typos or such might remain. And I'm not a bitmap index expert, there might be (probably is) a ready-made package that does this better, but this solution is simple and should be relatively efficient. 1% might not yet be sparse enough to make this better compared to just a plain bit array (assuming longs storing 64 bits each, it doesn't take more than 2 longs to have more than one bit set on average), but if the sparsity increases beyond that the space and time savings will show.
精彩评论