开发者

How to fuzzy match a short bit pattern in a long one?

I encounter a problem when try to match a short bit pattern in a long one: I have one long bit pattern, e.g. 6k bits, stored in a char array, also a short one, say 150 bits, stored in a char array, too. Now I want to check whether the short bit pattern is in the long bit pattern. While there is no need for short bit pattern to match some part of long bit pattern exactly, I will define a threshold, if the bit-error-rate under it, I will take the two pattern match.

Given the misalignment problem, I can't come up with an elegant solution. One way I can find out is convert the bit pattern into char pattern, i.e. convert bit 1 to '1', 0 to '0' and apply some string matching algorithm. But I'm afrai开发者_开发技巧d it may cost memory 7-8 times more burden my system. Someone around me recommend the Rabin Fingerprint, however it seems not designed for this kind of problem.

Hope you can give me a hand.

Thanks and best Regards.


The operation you're looking for is population count or the closely related hamming distance.

Rather than implement lots of bitwise arithmetic by hand, try the Gnu Multi-Precision Library, which includes several bit-string functions.

  • Use mpz_tdiv_q_2exp to right-shift the long pattern one bit at a time,
  • mpz_tdiv_r_2exp to extract the last 150 bits, and
  • mpz_hamdist to find the number of bits flipped between the extracted bits and the short pattern.

Should be plenty fast, and fast to write as well!

As an initial optimization, I would suggest shifting the 150-bit pattern by one-bit increments up to 7 bits, so you have 8 patterns to match against, from 150 up to 157 bits. Then, rather than shift the long pattern one bit at a time (which is slow and likely dominates the runtime), shift 8 bits at a time. Be sure to clear the bits you do not wish to compare.


Lets call short bit sequence S and long bit sequence L. The algorithm I have in mind is as following:

1- XOR S with size(S) rightmost bits of L. Say this is R
2- AND R with R-1 until zero, count how many times, if less than threshold 
   pattern is found
3- Shift right L and go to 1 if size(L) >= size(S)

This should take O(size(L)*size(S)) time in the worst case. But since the number of 1s is way smaller than size(S) in each iteration, in practice it should be efficient.


Solutions with moving the short bit pattern along the longer one have complexity O(N*M) (N - size of short segment, M - size of long segment).

If sizes will grow, you may consider this as a problem of finding shift maximizing (or exceeding threshold of) correlation between two signals and speed it up with Fast Fourier Transform. This may give something like O(N*logN) if I don't mistake.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜