Does DeflateStream "skip" decompression if the data was not originally compressed?
I'm not familiar with the internals of DeflateStream, but I need to store files in a vendor's DB system that uses DeflateStream on binary attachments. The first thing I noticed was that all of my files were 10-50% BIGGER after compression, which I attribute to a less sophisticated compression algorithm being applied to files that are already highly compressed (in this case they were all PDFs). My question, however, relates to the fact that when I wrote the original, uncompressed file straight into the BLOB, the vendor's application had no problem opening it (it opened the attachments I compressed with deflate as well). Is there a header on the compressed data that tells DeflateStream the data is not compressed, so that it basically passes it on as-is? This is the specification; can anyone familiar with it point out where this is defined, or am I off base and the vendor is doing some magic behind the scenes?
No, there is no such magic in DeflateStream.
The built-in DeflateStream exhibits a compression anomaly in which previously compressed data actually increases in size. This has been reported to Microsoft previously, but they declined to fix the problem. It has to do with a naive implementation of the DEFLATE format in DeflateStream. Ways that I know of to avoid the problem:
Use an alternative DeflateStream that does not exhibit this problem. See DotNetZip for one example. It includes a DeflateStream that just works.
Use the broken DeflateStream, compress the stream, compare sizes, and if the "compressed" stream is larger, fall back to using the "uncompressed" stream.
If you choose the first option, you still have the situation where you are compressing data that has already been compressed. In other words, unnecessary double compression. So you may want to look into avoiding that, regardless of which option you choose.
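The compare-and-fall-back idea can be sketched roughly as follows. This uses Python's `zlib` module as a stand-in for a .NET DeflateStream, and the one-byte `RAW`/`DEFLATED` flag and the `pack`/`unpack` names are hypothetical: how the vendor's application actually tells the two cases apart is not known from this thread.

```python
import os
import zlib

RAW = b"\x00"       # hypothetical flag: payload is stored as-is
DEFLATED = b"\x01"  # hypothetical flag: payload is a raw DEFLATE stream

def pack(data: bytes) -> bytes:
    """Deflate the data, but keep the original if deflating did not shrink it."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)  # wbits=-15: raw DEFLATE, no zlib header
    deflated = comp.compress(data) + comp.flush()
    if len(deflated) < len(data):
        return DEFLATED + deflated
    return RAW + data

def unpack(blob: bytes) -> bytes:
    """Reverse pack(): inflate only if the flag says the payload was deflated."""
    flag, payload = blob[:1], blob[1:]
    if flag == DEFLATED:
        return zlib.decompress(payload, -15)
    return payload

text = b"hello world " * 500  # repetitive, compresses well
noise = os.urandom(4096)      # stands in for already-compressed data such as a PDF
```

With this scheme the worst case costs one extra byte, and the reader never runs the inflater over data that was stored verbatim.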
Stream compression is different from file compression. When compressing a file, it's generally possible to make multiple passes over the entire file and determine which compression scheme to use before having to commit to one. When compressing a stream, it's often necessary to start outputting data before the compression routine has processed enough data to know what compression method is going to be optimal.
This effect can be somewhat mitigated by dividing data into blocks, deciding for each block how to represent the data, and including a header at the start of each block identifying how it is stored. Unfortunately, the extra block headers will add to the size of the resulting stream. Further, many compression schemes improve in effectiveness as they process a stream; it may well be that every 1 KB block in a file would expand if "compressed" individually, even if compressing the whole file would result in a considerable space savings (since the compressor could, for example, build up a dictionary of common byte sequences). It would be possible to design a compressor/decompressor pair so that a block of data which would expand is written out verbatim by the compressor (with a header byte indicating that this is the case), and the decompressor processes that block the same way the compressor would have, so as to add to the dictionary the same byte sequences that would have been added had the block been stored in "compressed" form. Such an approach would probably be a good one, though it would add considerably to the complexity of the decompressor.
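A small experiment illustrates the point about independent blocks. It is only a sketch using Python's `zlib` (not the .NET DeflateStream), and the test data is contrived: one random 1 KB unit repeated 32 times, so that no single block is compressible but the stream as a whole is highly redundant. The names `raw_deflate`, `BLOCK`, and `unit` are illustrative.

```python
import os
import zlib

def raw_deflate(data: bytes) -> bytes:
    """Compress to a raw DEFLATE stream (wbits=-15: no zlib or gzip wrapper)."""
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()

BLOCK = 1024
unit = os.urandom(BLOCK)  # random bytes: incompressible on their own
data = unit * 32          # but the same 1 KB repeats across the stream

# Compress the stream in one go: later copies can refer back to the first.
whole = len(raw_deflate(data))

# Compress each 1 KB block independently: nothing to refer back to.
per_block = sum(
    len(raw_deflate(data[i:i + BLOCK])) for i in range(0, len(data), BLOCK)
)

print(len(data), whole, per_block)
```

Compressed block by block, every block comes out slightly larger than its input, while the single-pass stream is a small fraction of the original size.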
I suspect the biggest problem for DeflateStream, though, is that there may not be any way to improve the worst-case "compression" performance without producing compressed data that is incompatible with the existing "uncompress" code. Suppose one has a string of bytes Q, and one needs a sequence of bytes which, when fed to the "uncompress" code that shipped with .NET 2.0, will yield Q. It may well be that for some possible values of Q, there are no such input sequences which aren't a lot bigger than Q. If that's the case, there's no way Microsoft could "fix" the problem without a time machine.
It all depends on how the DEFLATE stream was created.
DEFLATE supports a "non-compressed block" (BTYPE=00) and all data in this block, should it be utilized, is stored verbatim with no compression -- just the block header, length, and raw data. However, a stream can be a valid DEFLATE stream and contain zero (or not enough) "non-compressed" blocks even if this resulted in a sub-par compression ratio.
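To make the stored-block layout concrete, here is a minimal sketch that asks Python's `zlib` for compression level 0, which emits only non-compressed blocks, and then picks the block header apart. It assumes the input is small enough to fit in one stored block, and it describes zlib's output, not what the .NET DeflateStream produces.

```python
import os
import zlib

def raw_deflate(data: bytes, level: int) -> bytes:
    """Compress to a raw DEFLATE stream (wbits=-15: no zlib or gzip wrapper)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()

noise = os.urandom(1000)        # stands in for already-compressed data
stored = raw_deflate(noise, 0)  # level 0: stored (BTYPE=00) blocks only

header = stored[0]
bfinal = header & 1             # bit 0: set on the last block of the stream
btype = (header >> 1) & 3       # bits 1-2: block type, 0 means stored
# After the 3 header bits the stream is padded to a byte boundary,
# then come LEN and NLEN as 16-bit little-endian values.
length = int.from_bytes(stored[1:3], "little")   # LEN: bytes of raw data
nlength = int.from_bytes(stored[3:5], "little")  # NLEN: one's complement of LEN

print(btype, length, len(stored) - len(noise))
```

Each stored block therefore costs five bytes of overhead (one header byte after padding, plus LEN and NLEN) and can carry up to 65,535 bytes of raw data, which is how a compressor that uses these blocks keeps the expansion on incompressible input small.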
The overall compression ratio will depend upon the data, compressor algorithm/implementation, and amount of effort it puts into performing the compression.
Happy coding.