How does the ability to compress a stream affect a compression algorithm?
I recently backed up my soon-to-expire university home directory by sending it as a tar stream and compressing it on my end:

    ssh user@host "tar cf - my_dir/" | bzip2 > uni_backup.tar.bz2
This got me thinking: I only know the basics of how compression works, but I would imagine that this ability to compress a stream of data would lead to poorer compression, since the algorithm needs to finish handling a block of data at some point, write it to the output stream, and move on to the next block.
Is this the case? Or do these programs simply read a lot of data into memory, compress it, write it, and repeat? Or are there clever tricks used in these “stream compressors”? I see that the man pages of both bzip2 and xz discuss memory usage, and man bzip2 also hints that little is lost by chopping the data to be compressed into blocks:
Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.
I would still love to hear whether other tricks are used, or where I can read more about this.
This question relates more to buffer handling than to compression algorithms, although a bit could be said about those too.
Some compression algorithms are inherently "block based", which means they absolutely need to work with blocks of a specific size. This is the situation of bzip2, whose block size is selected with the "level" switch, from 100 kB to 900 kB. So, if you stream data into it, it will wait for the block to be filled, and start compressing that block when it's full (alternatively, for the last block, it will work with whatever size it receives).
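As a rough sketch of that behaviour (my own illustration in Python, not anything taken from bzip2 itself), the bz2 module exposes an incremental compressor: you can feed it data in arbitrarily small pieces, but it typically returns nothing until an internal block is complete, and flush() finishes the final, possibly shorter block. The file names here are just placeholders:

    import bz2

    # Incremental bzip2 compression: feed data in small pieces; the compressor
    # buffers input internally and only emits output once a full block
    # (100 kB - 900 kB, chosen by compresslevel) is ready.
    compressor = bz2.BZ2Compressor(9)  # level 9 -> 900 kB blocks

    compressed_parts = []
    with open("my_dir.tar", "rb") as stream:          # hypothetical input file
        while True:
            chunk = stream.read(64 * 1024)            # read the stream 64 kB at a time
            if not chunk:
                break
            out = compressor.compress(chunk)          # usually b"" until a block fills up
            if out:
                compressed_parts.append(out)

    compressed_parts.append(compressor.flush())       # compress the last, partial block
    with open("my_dir.tar.bz2", "wb") as f:
        f.write(b"".join(compressed_parts))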
Some other compression algorithms can handle streams, which means they can continuously compress new data using older data kept in a memory buffer. Algorithms based on a "sliding window" can do this, and zlib typically achieves it.
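Here is a minimal sketch of that with Python's zlib bindings (again my own example; the chunk contents are placeholders). The compressor keeps a 32 KiB sliding window, so each compress() call can back-reference data fed in earlier calls, and output can be emitted continuously instead of waiting for a fixed-size block:

    import zlib

    # Sliding-window (DEFLATE) streaming: the compressor keeps the last 32 KiB
    # of input as its window, so matches can span chunk boundaries.
    compressor = zlib.compressobj(level=6, wbits=15)   # 2**15 = 32 KiB window

    def compress_stream(chunks):
        """Yield compressed output as input chunks arrive."""
        for chunk in chunks:
            out = compressor.compress(chunk)
            if out:
                yield out
            # Z_SYNC_FLUSH would force out buffered data without ending the stream:
            # yield compressor.flush(zlib.Z_SYNC_FLUSH)
        yield compressor.flush()                        # finish the stream

    # Usage: repetition across chunks still compresses well, because the second
    # chunk can back-reference the first one through the sliding window.
    data = [b"the quick brown fox " * 100, b"the quick brown fox " * 100]
    compressed = b"".join(compress_stream(data))
    print(len(compressed), "bytes from", sum(len(c) for c in data), "bytes of input")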
Now, even "sliding window" compressors may nonetheless select to cut input data into blocks, either for easier buffer management, or to develop multi-threading capabilities, such as pigz.