
What if data compressed with GZipStream or DeflateStream is longer than the raw data?

I'm no expert on the formats, but I'm guessing that for certain inputs the compressed data can actually come out longer than the original, due to formatting overhead.

I'm OK with this, but what I'm not OK with is the documented behaviour of the count parameter to GZipStream/DeflateStream.Write(): "The maximum number of compressed bytes to write." The usual practice (unless compressing in chunks) is to pass in the length of the input data:

public static byte[] Compress(byte[] data)
{
    using (var compressed = new IO.MemoryStream(data.Length))
    {
        using (var compressor = new IO.Compression.DeflateStream(compressed, IO.Compression.CompressionMode.Compress))
            compressor.Write(data, 0, data.Length);
        return compressed.ToArray();
    }
}

In the edge case I'm talking about, the write statement won't write out the whole compressed data stream, just the first data.Length bytes of it. I could just double the buffer size but for large data sets that's a bit wasteful, and anyway I don't like the guesswork.

Is there a better way to do this?


I am pretty sure that it is a mistake in the documentation. Documentation in earlier versions reads "The number of bytes compressed.", which is consistent with how all other streams work.

The same change was made to the documentation of the Read method, where it makes sense, but I think that the change was made by mistake to the documentation of the Write method. Someone corrected the documentation of the Read method, and thought that the same correction would apply to the Write method also.

The normal behavior for the Read method of a stream is that it can return less data than requested, and the method returns the number of bytes actually placed in the buffer. The Write method on the other hand always writes all the data specified. It wouldn't make any sense for the method to write less data in any implementation. As the method doesn't have a return value, it could not return the number of bytes written.

The count specified is not the size of the output, it's the size of the data that you send into the method. If the output is larger than the input, it will still all be written to the stream.
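For instance, here is a minimal sketch of that reading (the helper name is hypothetical, and it assumes the same IO alias for System.IO as the question's code): count simply selects how many input bytes are consumed, and the stream writes however many compressed bytes that produces.

public static byte[] CompressFirstHalf(byte[] data)
{
    using (var compressed = new IO.MemoryStream())
    {
        using (var compressor = new IO.Compression.DeflateStream(compressed, IO.Compression.CompressionMode.Compress))
            // count = data.Length / 2 means "consume the first half of the input";
            // the number of compressed bytes actually written may be smaller or larger.
            compressor.Write(data, 0, data.Length / 2);
        return compressed.ToArray();
    }
}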

Edit:

I added a comment about this to the community content of the documentation of the method in MSDN Library. Let's see if Microsoft follows up on that...


You are right. If a compression algorithm makes some inputs shorter, then some others must become longer. This follows from the pigeonhole principle: there are fewer possible shorter outputs than there are inputs, so no lossless scheme can shrink them all.

Many algorithms still have good worst-case behaviour, because if the data would expand too much they can instead emit a non-compressed block into the stream, which is just a few bytes of header followed by a copy of the original data in uncompressed form.

For example the DEFLATE algorithm has this feature:

3.2.4. Non-compressed blocks (BTYPE=00)

         Any bits of input up to the next byte boundary are ignored.
         The rest of the block consists of the following information:

              0   1   2   3   4...
            +---+---+---+---+================================+
            |  LEN  | NLEN  |... LEN bytes of literal data...|
            +---+---+---+---+================================+

         LEN is the number of data bytes in the block.  NLEN is the
         one's complement of LEN.

So if you add room for the headers plus an extra 1% you should be fine.
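As a back-of-the-envelope bound (my own sketch, not from the spec beyond the stored-block layout quoted above): each stored block holds at most 65,535 bytes behind roughly 5 bytes of header, plus a fixed amount of gzip/zlib framing.

// Rough worst-case output size if every block ends up stored (BTYPE=00).
// Constants are approximate: ~5 header bytes per 65535-byte stored block,
// plus ~18 bytes of gzip header/trailer slack (raw DEFLATE needs less).
public static int WorstCaseCompressedSize(int inputLength)
{
    int blocks = Math.Max(1, (inputLength + 65534) / 65535); // ceiling division
    return inputLength + 5 * blocks + 18;
}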

If you want to test whether your code works when the compressed output is larger than the input, try generating a few kilobytes of completely random data and compressing it. If you choose the bytes uniformly at random, it's extremely likely that the output will be longer than the input.
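A quick test along those lines (a sketch, using the same IO alias as the question's code; exact sizes vary by framework version):

static void TestRandomData()
{
    var data = new byte[4096];
    new Random().NextBytes(data); // uniformly random, essentially incompressible

    using (var compressed = new IO.MemoryStream())
    {
        using (var compressor = new IO.Compression.DeflateStream(compressed, IO.Compression.CompressionMode.Compress))
            compressor.Write(data, 0, data.Length);

        // MemoryStream.ToArray still works after the DeflateStream has closed it.
        byte[] output = compressed.ToArray();
        Console.WriteLine("raw: {0} bytes, compressed: {1} bytes", data.Length, output.Length);
    }
}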


The documentation is poorly worded here. "The maximum number of compressed bytes to write" actually means the number of bytes from the source that you want written as compressed data. You can test this by compressing a single letter encoded as ASCII: the buffer length is obviously 1, but you'll get a 108-byte array out of it.
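For example (a sketch, reusing the question's IO alias; the exact output size depends on the framework version and on GZipStream vs. DeflateStream):

static void TestSingleLetter()
{
    byte[] one = System.Text.Encoding.ASCII.GetBytes("a"); // buffer length is 1

    using (var compressed = new IO.MemoryStream())
    {
        using (var compressor = new IO.Compression.GZipStream(compressed, IO.Compression.CompressionMode.Compress))
            compressor.Write(one, 0, one.Length); // count is 1, yet far more than 1 byte comes out

        Console.WriteLine(compressed.ToArray().Length);
    }
}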


According to Jean-Loup Gailly and the zlib maintainers (zlib being the compression library underlying gzip and zip, derived from the original PKWare Zip application), "the compression method currently used in zlib essentially never expands the data."

This is unlike LZW, as used in *nix's compress(1) and in GIF images, which can double or triple the size of the input. Try running compress against an already-compressed or encrypted file and see what you get. Then try running gzip against a compressed file and see what happens.

http://www.zlib.net/

As noted, for degenerate input, the gzipped output will just incur a small amount of overhead for the required header and control blocks.


Thanks for the great and extremely fast answers. You guys are awesome.

After a bit of digging around, it seems .NET 4 (didn't I tell you I was using .NET 4 :)) has added a new CopyTo method which makes the whole thing a lot easier.

public static byte[] Compress(byte[] data)
{
    using (var rawData = new IO.MemoryStream(data))
    using (var compressed = new IO.MemoryStream(data.Length))
    {
        using (var compressor = new IO.Compression.DeflateStream(compressed, IO.Compression.CompressionMode.Compress))
            rawData.CopyTo(compressor);
        return compressed.ToArray();
    }
}
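
For completeness, the matching decompression can use the same CopyTo pattern in the other direction (my own sketch, not part of the original answer):

public static byte[] Decompress(byte[] data)
{
    using (var compressed = new IO.MemoryStream(data))
    using (var decompressor = new IO.Compression.DeflateStream(compressed, IO.Compression.CompressionMode.Decompress))
    using (var rawData = new IO.MemoryStream())
    {
        decompressor.CopyTo(rawData);
        return rawData.ToArray();
    }
}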