
Minimal binary diff for similar 1000 byte blocks with static noise?

I need a minimal diff for similar 1000 byte blocks. These blocks will have at most 20% of the bits different. The flipped bits will be like radio static -- randomly flipped bits with a uniform distribution over the whole block. Here's my pseudocode using XOR and LZO compression:

minimal_diff=lzo(XOR(block1,block2))

Since the blocks are small, I'm using LZO in the hope that its compression format has minimal boilerplate.
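A minimal runnable sketch of that pseudocode, using Python's stdlib zlib as a stand-in for LZO (assumption: any LZ-style codec works the same way here; an actual lzo binding would slot in identically):

```python
import zlib

def make_diff(block1: bytes, block2: bytes) -> bytes:
    # XOR the two equal-length blocks, then compress the (0-biased) result.
    # zlib stands in for lzo; swap in any codec with compress/decompress.
    assert len(block1) == len(block2)
    xored = bytes(a ^ b for a, b in zip(block1, block2))
    return zlib.compress(xored, level=9)

def apply_diff(block1: bytes, diff: bytes) -> bytes:
    # Recover block2 by decompressing the diff and XORing it back onto block1.
    xored = zlib.decompress(diff)
    return bytes(a ^ b for a, b in zip(block1, xored))
```

Note that at a full 20% flip rate the compressed XOR may come out close to, or larger than, 1000 bytes; the scheme only wins when the actual flip density is lower.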

I have reviewed algorithms such as xdelta and bsdiff, but they will not work for random static noise like this; they are oriented toward finding shifted sequences of bytes.

Can error correcting codes work here for creating a minimal diff? How exactly?

Exact algorithms would be nice. If it's just research-paper theory and not implemented, then I'm not interested.

NOTE: The similar bits in each block line up. There is no shifting. There is just some random noise bit flips that differentiate the blocks.


If it's truly random noise, then it does not really compress. If you have 8,000 bits (1,000 bytes x 8 bits/byte) and every individual bit flips independently with probability 1/5 (20%), then by Shannon's source-coding theorem you cannot encode the changed bits in fewer than 8,000 x H(1/5) bits, where H is the binary entropy function: 8,000 x (-4/5 x log2(4/5) - 1/5 x log2(1/5)) = 8,000 x (4/5 x 0.322 + 1/5 x 2.322) = 8,000 x (0.2575 + 0.4644) ≈ 5,775 bits, i.e. about 722 bytes.
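The arithmetic above can be checked in a few lines (this is just a verification of the entropy bound, not part of any diff algorithm):

```python
from math import log2

p = 0.2                     # bit-flip probability
n_bits = 1000 * 8           # 8,000 bits per block

# Binary entropy H(p): the minimum average number of bits needed
# to encode one bit position's flipped/not-flipped state.
h = -(1 - p) * log2(1 - p) - p * log2(p)   # ≈ 0.722 bits per bit

lower_bound_bits = n_bits * h              # ≈ 5,775 bits ≈ 722 bytes
```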

Because the trivial way to represent the changed bits takes 1,000 bytes (just encode the XOR of the two blocks), compression can save you at most about 30% of the space. If you consistently achieve more, then the flipped bits are not uniformly random or the flip probability is lower than 20%.

Standard algorithms like Lempel-Ziv are designed for structured data, i.e. data that is not random noise. Noise like this is best handled by a plain entropy coder such as Huffman or arithmetic coding. But since you can save at most about 30%, it's a question of whether it's actually worth the effort.


Have you tried standard compression algorithms already? What performance do you see? You should get fairly good compression ratios on the xor of the old and new blocks, due to the high bias towards 0s.

Other than the standard options, one alternative that springs to mind is encoding each diff as a list of variable-length integers specifying the distance between flipped bits. For example, using 5-bit variable-length integers (one continuation bit plus four payload bits per group), you could describe gaps of up to 16 bits in 5 bits, gaps of 17 to 272 bits in 10 bits, and so forth. If there's any regularity to the intervals between flipped bits, you can run a regular compressor over this encoding for further savings.

