How does the ability to compress a stream affect a compression algorithm?
I recently backed up my soon-to-expire university home directory by sending it as a tar stream and compressing it on my end:

    ssh user@host "tar cf - my_dir/" | bzip2 > uni_backup.tar.bz2
This got me thinking: I only know the basics of how compression works, but I would imagine that this ability to compress a stream of data would lead to poorer compression, since the algorithm needs to finish handling a block of data at some point, write it to the output stream, and move on to the next block.
Is this the case? Or do these programs simply read a lot of data into memory, compress it, write it, and repeat? Or are there clever tricks used in these “stream compressors”? I see that the man pages of both bzip2 and xz discuss memory usage, and man bzip2 also hints that little is lost by chopping the data to be compressed into blocks:
Larger block sizes give rapidly diminishing marginal returns. Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. It is also important to appreciate that the decompression memory requirement is set at compression time by the choice of block size.
I would still love to hear whether other tricks are used, or where I can read more about this.
This question relates more to buffer handling than to compression algorithms, although a bit could be said about those too.
Some compression algorithms are inherently "block based", which means they absolutely need to work with blocks of a specific size. This is the situation of bzip2, whose block size is selected with the "level" switch, from 100 kB to 900 kB. So, if you stream data into it, it will wait for the block to be filled, and start compressing that block when it's full (alternatively, for the last block, it will work with whatever size it receives).
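As a rough sketch of that behaviour (my own illustration in Python, not anything taken from bzip2 itself), the bz2 module exposes an incremental compressor: you can feed it data in arbitrarily small pieces, but it typically returns nothing until an internal block is complete, and flush() finishes the final, possibly shorter block. The file names here are just placeholders:

    import bz2

    # Incremental bzip2 compression: feed data in small pieces; the compressor
    # buffers input internally and only emits output once a full block
    # (100 kB - 900 kB, chosen by compresslevel) is ready.
    compressor = bz2.BZ2Compressor(9)  # level 9 -> 900 kB blocks

    compressed_parts = []
    with open("my_dir.tar", "rb") as stream:          # hypothetical input file
        while True:
            chunk = stream.read(64 * 1024)            # read the stream 64 kB at a time
            if not chunk:
                break
            out = compressor.compress(chunk)          # usually b"" until a block fills up
            if out:
                compressed_parts.append(out)

    compressed_parts.append(compressor.flush())       # compress the last, partial block
    with open("my_dir.tar.bz2", "wb") as f:
        f.write(b"".join(compressed_parts))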
Some other compression algorithms can handle streams, which means they can continuously compress new data using older data kept in a memory buffer. Algorithms based on a "sliding window" can do this, and zlib typically achieves it.
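Here is a minimal sketch of that with Python's zlib bindings (again my own example; the chunk contents are placeholders). The compressor keeps a 32 KiB sliding window, so each compress() call can back-reference data fed in earlier calls, and output can be emitted continuously instead of waiting for a fixed-size block:

    import zlib

    # Sliding-window (DEFLATE) streaming: the compressor keeps the last 32 KiB
    # of input as its window, so matches can span chunk boundaries.
    compressor = zlib.compressobj(level=6, wbits=15)   # 2**15 = 32 KiB window

    def compress_stream(chunks):
        """Yield compressed output as input chunks arrive."""
        for chunk in chunks:
            out = compressor.compress(chunk)
            if out:
                yield out
            # Z_SYNC_FLUSH would force out buffered data without ending the stream:
            # yield compressor.flush(zlib.Z_SYNC_FLUSH)
        yield compressor.flush()                        # finish the stream

    # Usage: repetition across chunks still compresses well, because the second
    # chunk can back-reference the first one through the sliding window.
    data = [b"the quick brown fox " * 100, b"the quick brown fox " * 100]
    compressed = b"".join(compress_stream(data))
    print(len(compressed), "bytes from", sum(len(c) for c in data), "bytes of input")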
Now, even "sliding window" compressors may nonetheless select to cut input data into blocks, either for easier buffer management, or to develop multi-threading capabilities, such as pigz.