Storing and indexing binary strings in a database
A binary string as defined here is fixed size "array" of bits. I call them strings since there is no order on them (sorting/indexing them as numbers has no meaning), each bit is independent of the others. Each such string is N bits long, with N in the hundreds.
I need to store these strings and given a new binary string query for the nearest neighbor using the Hamming distance as the distance metric.
There are specialized data-structures (metric-trees) for metric-based search (VP-trees, cover-trees, M-trees), but I need to use a regular database (MongoDB in my case).Is there some indexing function that can be applied to the binary strings that can help the DB access only a subset of the records before performing the o开发者_高级运维ne-to-one Hamming distance match? Alternatively, how would it be possible to implement such Hamming based search on a standard DB?
The hamming distance is a metric so it satisfies the triangle inequality. For each bitstring in your database, you could store the it's hamming distance to some pre-defined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.
So let's say
C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)
So for each B
, you would store it's corresponding distB
.
A lower bound for hamming(B,S)
would then be abs(distB-distS)
. And the upper bound would be distB+distS
.
You can discard all B
such that the lower bound is higher than the lowest upper bound.
I'm not 100% sure as to the optimal way to choose which C
. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.
You could use an approach similar to levenshtein automata, applied to bitstrings:
- Compute the first (lexicographically smallest) bitstring that would meet your hamming-distance criteria.
- Fetch the first result from the database that's greater than or equal to that value
- If the value is a match, output it and fetch the next result. Otherwise, compute the next value greater than it that is a match, and goto 2.
This is equivalent to doing a merge join over your database and the list of possible matches, without having to generate every possible match. It'll reduce the search space, but it's still likely to require a significant number of queries.
精彩评论