
Minimal binary diff for similar 1000 byte blocks with static noise?

I need a minimal diff for similar 1000 byte blocks. These blocks will have at most 20% of the bits different. The flipped bits will be like radio static -- randomly flipped bits with a uniform distribution over the whole block. Here's my pseudocode using XOR and LZO compression:

minimal_diff=lzo(XOR(block1,block2))

Since the blocks are small, I'm using LZO in the hope that its compression format has minimal boilerplate.
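A minimal runnable sketch of that pseudocode, using Python's stdlib zlib as a stand-in for LZO (assumption: any LZ-style codec works the same way here; an actual lzo binding would slot in identically):

```python
import zlib

def make_diff(block1: bytes, block2: bytes) -> bytes:
    # XOR the two equal-length blocks, then compress the (0-biased) result.
    # zlib stands in for lzo; swap in any codec with compress/decompress.
    assert len(block1) == len(block2)
    xored = bytes(a ^ b for a, b in zip(block1, block2))
    return zlib.compress(xored, level=9)

def apply_diff(block1: bytes, diff: bytes) -> bytes:
    # Recover block2 by decompressing the diff and XORing it back onto block1.
    xored = zlib.decompress(diff)
    return bytes(a ^ b for a, b in zip(block1, xored))
```

Note that at a full 20% flip rate the compressed XOR may come out close to, or larger than, 1000 bytes; the scheme only wins when the actual flip density is lower.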

I have reviewed algorithms such as xdelta and bsdiff, but they will not work for random static noise like this; they are oriented toward finding shifted sequences of bytes.

Can error correcting codes work here for creating a minimal diff? How exactly?

Exact algorithms would be nice. If it's just research-paper theory and not implemented, then I'm not interested.

NOTE: The similar bits in each block line up. There is no shifting. There is just some random noise bit flips that differentiate the blocks.


If it's truly random noise, then it does not really compress. If you have 8,000 bits (1,000 bytes x 8 bits/byte) and every individual bit flips independently with probability 1/5 (20%), then by Shannon's source-coding theorem you cannot encode the changed bits in fewer than 8,000 x H(1/5) bits, where H is the binary entropy function: 8,000 x (-4/5 x log2(4/5) - 1/5 x log2(1/5)) = 8,000 x (4/5 x 0.322 + 1/5 x 2.322) = 8,000 x (0.2575 + 0.4644) ≈ 5,775 bits, i.e. about 722 bytes.
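The arithmetic above can be checked in a few lines (this is just a verification of the entropy bound, not part of any diff algorithm):

```python
from math import log2

p = 0.2                     # bit-flip probability
n_bits = 1000 * 8           # 8,000 bits per block

# Binary entropy H(p): the minimum average number of bits needed
# to encode one bit position's flipped/not-flipped state.
h = -(1 - p) * log2(1 - p) - p * log2(p)   # ≈ 0.722 bits per bit

lower_bound_bits = n_bits * h              # ≈ 5,775 bits ≈ 722 bytes
```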

Because the trivial way to represent the changed bits takes 1,000 bytes (just encode the XOR of the two blocks), compression can save you at most about 30% of the space. If you consistently achieve more, then the flipped bits are not uniformly random or the flip probability is lower than 20%.

Standard algorithms like Lempel-Ziv are designed for structured data, i.e. data that is not random noise. Noise like this is best handled by a plain entropy coder such as Huffman or arithmetic coding. But since you can save at most about 30%, it's a question of whether it's actually worth the effort.


Have you tried standard compression algorithms already? What performance do you see? You should get fairly good compression ratios on the xor of the old and new blocks, due to the high bias towards 0s.

Other than the standard options, one alternative that springs to mind is encoding each diff as a list of variable-length integers specifying the distance between flipped bits. For example, using 5-bit variable-length integers (one continuation bit plus four payload bits per group), you could describe gaps of up to 16 bits in 5 bits, gaps of 17 to 272 bits in 10 bits, and so forth. If there's any regularity to the intervals between flipped bits, you can run a regular compressor over this encoding for further savings.

