Storing and indexing binary strings in a database

2023-03-18 16:06 问答作者：

A binary string as defined here is fixed size "array" of bits. I call them strings since there is no order on them (sorting/indexing them as numbers has no meaning), each bit is independent of the others. Each such string is N bits long, with N in the hundreds.

I need to store these strings and given a new binary string query for the nearest neighbor using the Hamming distance as the distance metric.

There are specialized data-structures (metric-trees) for metric-based search (VP-trees, cover-trees, M-trees), but I need to use a regular database (MongoDB in my case).

Is there some indexing function that can be applied to the binary strings that can help the DB access only a subset of the records before performing the o开发者_高级运维ne-to-one Hamming distance match? Alternatively, how would it be possible to implement such Hamming based search on a standard DB?

The hamming distance is a metric so it satisfies the triangle inequality. For each bitstring in your database, you could store the it's hamming distance to some pre-defined constant bitstring. Then you can use the triangle inequality to filter out bitstrings in the database.

So let's say

C <- some constant bitstring
S <- bitstring you're trying to find the best match for
B <- a bitstring in the database
distS <- hamming_dist(S,C)
distB <- hamming_dist(B,C)

So for each B, you would store it's corresponding distB.

A lower bound for hamming(B,S) would then be abs(distB-distS). And the upper bound would be distB+distS.

You can discard all B such that the lower bound is higher than the lowest upper bound.

I'm not 100% sure as to the optimal way to choose which C. I think you would want it to be a bitstring that's close to the "center" of your metric space of bitstrings.

You could use an approach similar to levenshtein automata, applied to bitstrings:

Compute the first (lexicographically smallest) bitstring that would meet your hamming-distance criteria.
Fetch the first result from the database that's greater than or equal to that value
If the value is a match, output it and fetch the next result. Otherwise, compute the next value greater than it that is a match, and goto 2.

This is equivalent to doing a merge join over your database and the list of possible matches, without having to generate every possible match. It'll reduce the search space, but it's still likely to require a significant number of queries.

继续阅读：algorithm binary database hamming-distance indexing

Storing and indexing binary strings in a database

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？