What's the most efficient bit vector compression method for my use case?

2023-02-06 07:14 问答作者：

I'm working on a project in computational biology and I need to store an index of locuses that differ between ma开发者_运维百科ny sequences. For now, I'm using a B+Tree for that purpose, but I guess using a bitmap index would be way faster for such a use case : only a small number of locus differ between two sequences, 1% on average, and they are nearly equally distributed along the sequence; so it seems like there is a lot of room for bitmap index compression. My problem is that I cannot manage to find a compression method that can efficiently:

allow fast individual bit setting/unsetting
permit efficient range queries over the bitmap
possibly allow fast XOR-ing/AND-ing of two indexes

Thx in advance for your suggestions.

Check out FastBit:

https://sdm.lbl.gov/fastbit/

You could use a simple tree data structure like this:

struct node {
    node * leftChild;
    node * rightChild;
    long mask;
};
struct tree {
    int exponent; // the size of the tree is 2^exponent
    node rootNode;
};

Every node represents a sub-array of the large bit array that is (2^n)*sizeof(long) bits, n>=0. Leaf nodes store a raw bit mask in 'mask' if they're at the bottom of the tree, otherwise they store 0 in 'mask'. This way, leaf node with a 'mask' value of 0 can represent a (2^n)*sizeof(long) -sized empty area in the bit array, so sparse bit arrays can be stored efficiently.

leftChild and rightChild are of course null in all leaf nodes. Every other node has both a leftChild and rightChild pointer, and every node that isn't a leaf node has at least one descendant node with mask that has bits set in it.

To find a bit at a given index:

bool find_bit_at_index(tree t, long ind) {
    long divider = 1 << (t.exponent - 1);
    node *n = &t.rootNode;
    node *lastNode;
    while (n)
    {
       lastNode = n;
       if (ind >= divider) {
          n = n->rightChild;
          ind -= divider;
       }
       else {
          n = n->leftChild;
       }
       divider >>= 1;
    }
    return lastNode->mask & (1 << ind);
}

Constructing the tree and developing the rest of the algorithms should be easy enough once you understand the idea. I haven't actually tested the code, since this isn't a complete solution, some typos or such might remain. And I'm not a bitmap index expert, there might be (probably is) a ready-made package that does this better, but this solution is simple and should be relatively efficient. 1% might not yet be sparse enough to make this better compared to just a plain bit array (assuming longs storing 64 bits each, it doesn't take more than 2 longs to have more than one bit set on average), but if the sparsity increases beyond that the space and time savings will show.

继续阅读：bit-manipulation bitmap c compression indexing

What's the most efficient bit vector compression method for my use case?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？