java: sparse bit vector
Are there any well-known li开发者_开发知识库braries in Java for sparse bit vectors?
(And are there guidelines for how sparse is useful to use them vs. java.util.BitSet?)
TL;DR go here Efficient Sparse BitSet implementation in Java
I know this is an "old" question, but having the same question I stumbled across this post. While the answers are good, I was ultimately not satisfied. After further digging, I think I've come across the "definitive" answer to the question of sparse BitSets in Java.
In this presentation the author, Dr. Bruce Haddon, discusses the efforts of his researchers to create a highly memory-efficient and high-performance replacement for the standard Java BitSet.
The original links to his presentation are dead, but I contacted Dr. Haddon and have preserved both the code and presentation here:
https://github.com/brettwooldridge/SparseBitSet
I cannot recommend reading this presentation more highly. It is a fascinating read even if you have no interest in sparse bit sets, it is more about the true nature of problem solving...
Slides: Is it Computer Science, Software Engineering, or Hacking?
If its really sparse (e.g., less than 1% loading), then using a hash table indexed by bit index is probably pretty good; mere presence or absence of the index in the table is all you need to know if the bit is one or zero respectively.
If the density is upwards of a few percent, you can use a hash table indexed by bit index divided by 64, and store long words in the hash table containing actual bits. Bit N is set if the hash table contains value V for int(N/64) and (V>>(N mod 64))&1 is true.
Both of these answers assume you want to optimize random access to bits. If you want to optimize sequential (or other access) to bits by index, then you may want a sparse matrix structure, using the same kind of low-level bit vector representation depending on expected density. See Sparse Matrices
The colt library has sparse matrices (1D, 2D and 3D). It also has an efficient BitVector, with 1 bit per value, rather than 8-bits as boolean[]
does.
However, the sparse matrices do not support bits directly - only doubles and objects. You could wrap the 1D sparse double matrix by maping bit index to long indices (bitIndex>>6)
since each long holds 64 bits, convert the retrieved double to a raw long value, and use bit manipulation to access the bits of the retrieved long. A little work, but nowhere near as much as implementing the sparse vector yourself. Once your wrapper is working, you might avoid converting doubles to longs, and implement a real sparse long 1d matrix using the available Colt source code for the double 1D sparse matrix as a starting point.
EDIT: More info. The Colt vectors/matrices require no memory initially for storage, assuming all bits (longs) are initially 0. Setting a value to non-zero consumes memory. Setting the value back to 0 continues to consume memory, although memory for zero values is reclaimed periodically.
If the bits are truly sparse, such that each backing long value only has one bit set, then the storage overhead will be very poor, requiring 64-bits per actual bit stored. But as you mention typical case is 20-40% sparse, then the overhead will be much lower, with possibly no wasted storage if bits are clustered in ranges, e.g. bits from 0-100, then 1000-1100, and 2000-2200 (values in hex.) Overall, only 1/16 of the region is assigned to bits, but the clustering means that the bits are stored with no wasted space.
You could try FastUtil's AVL Tree Map.
CERN COLT is widely used for vector and matrix computation, and has sparse matrices, but isn't specifically used for bit vectors.
http://acs.lbl.gov/software/colt/api/cern/colt/matrix/impl/SparseObjectMatrix1D.html
A hash table where the mere presence or absence of the key tells you something? That would be a hash set then! I'm skeptical of the performance of a set (even a hashed one) over the BitSet. It really depends on whether speed or memory is the primary driver.
You could try the JavaEWAH library.
https://code.google.com/p/javaewah/
Depending on your problem it may be a good fit.
(It is used by Apache Hive and others.)
精彩评论