Get the uncompressed file size of a very large .gz file on a 64-bit platform
According to the gzip specification, the uncompressed file size is stored in the last 4 bytes of a .gz file.
I have created 2 files with
dd if=/dev/urandom of=500M bs=1024 count=500000
dd if=/dev/urandom of=5G bs=1024 count=5000000
I gzipped them
gzip 500M 5G
I checked the last 4 bytes with
tail -c4 500M.gz | od -I (returns 512000000 as expected)
tail -c4 5G.gz | od -I (returns 825032704, which I did not expect)
It seems that crossing the invisible 32-bit barrier turns the value written into ISIZE into complete nonsense, which is more annoying than if an error bit had been used instead.
Does anyone know a way to get the uncompressed size of a .gz file without extracting it?
thanks
specification: http://www.gzip.org/zlib/rfc-gzip.html
edit: if anyone wants to try it out, you could use /dev/zero instead of /dev/urandom
There isn't one.
The only way to get the exact uncompressed size of a gzip stream is to actually decompress it (even if you write everything to /dev/null and just count the bytes).
It's worth noting that ISIZE is defined as
ISIZE (Input SIZE)
This contains the size of the original (uncompressed) input
data modulo 2^32.
in the gzip RFC, so it isn't actually breaking at the 32-bit barrier; what you're seeing is expected behavior.
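To make the wraparound concrete, the two sizes from the question can be reduced modulo 2^32 with plain shell arithmetic. This is only a sketch: `expected_isize` is a hypothetical helper name, and it assumes a shell with 64-bit integer arithmetic (such as bash on a 64-bit platform).

```shell
# ISIZE holds the uncompressed size modulo 2^32 (4294967296).
# expected_isize is an illustrative name, not part of gzip.
expected_isize() {
    # $1: true uncompressed size in bytes
    echo $(( $1 % 4294967296 ))
}

expected_isize 512000000     # prints 512000000 (below 2^32, unchanged)
expected_isize 5120000000    # prints 825032704 (wrapped around once)
```

The second result is exactly the value read from the end of the 5G archive in the question.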
I haven't tried this with a file of the size you mentioned, but I often find the uncompressed size of a .gz file with
zcat file.gz | wc -c
when I don't want to leave the uncompressed file lying around, or bother to compress it again.
Obviously, the data is uncompressed, but it is then piped straight to wc.
It's worth a try, anyway.
EDIT: When I tried creating a 5G file with data from /dev/random, it produced a file 5G of size 5120000000, although my file manager reported this as 4.8G.
Then I compressed it with gzip 5G; the resulting 5G.gz was roughly the same size (random data barely compresses).
Then zcat 5G.gz | wc -c reported the same size as the original file: 5120000000 bytes. So my suggestion seems to have worked for this trial, anyway.
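The same comparison can be tried on a small throwaway file, without needing 5G of disk space. The sketch below puts the counted size next to the ISIZE field. The helper names `gz_true_size` and `gz_isize` are made up for illustration, and because od reads in host byte order, `gz_isize` assumes a little-endian machine (ISIZE is stored little-endian).

```shell
# Hypothetical helpers; the names are illustrative, not part of gzip.

# True uncompressed size: decompress to a pipe and count the bytes.
gz_true_size() {
    gzip -dc "$1" | wc -c | tr -d ' '
}

# ISIZE field: the last 4 bytes, read as an unsigned 32-bit integer.
# od uses host byte order, so this assumes a little-endian machine.
gz_isize() {
    tail -c 4 "$1" | od -An -tu4 | tr -d ' \n'
}

# Small demo on a throwaway file.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/sample" bs=1000 count=100 2>/dev/null
gzip "$tmp/sample"
echo "counted: $(gz_true_size "$tmp/sample.gz")"
echo "ISIZE:   $(gz_isize "$tmp/sample.gz")"
rm -r "$tmp"
```

For anything under 4 GiB the two numbers should agree; above that, only the counted size can be trusted.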
Thanks for waiting
gzip does have a -l option:
-l --list
For each compressed file, list the following fields:
compressed size: size of the compressed file
uncompressed size: size of the uncompressed file
ratio: compression ratio (0.0% if unknown)
uncompressed_name: name of the uncompressed file
The uncompressed size is given as -1 for files not in gzip format, such as compressed .Z files. To
get the uncompressed size for such a file, you can use:
zcat file.Z | wc -c
In combination with the --verbose option, the following fields are also displayed:
method: compression method
crc: the 32-bit CRC of the uncompressed data
date & time: time stamp for the uncompressed file
The compression methods currently supported are deflate, compress, lzh (SCO compress -H) and pack.
The crc is given as ffffffff for a file not in gzip format.
With --name, the uncompressed name, date and time are those stored within the compress file if
present.
With --verbose, the size totals and compression ratio for all files is also displayed, unless some
sizes are unknown. With --quiet, the title and totals lines are not displayed.
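A short sketch of what that looks like in practice, using a small throwaway file. The awk step assumes the usual two-line layout for a single file (a title line, then one data line). Note that, as far as I know, older gzip releases (before 1.12) take the uncompressed size for -l straight from ISIZE, so for files of 4 GiB or more the listing can show the same wrapped value as in the question.

```shell
# Throwaway sample so the listing has something to show.
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/sample" bs=1024 count=100 2>/dev/null
gzip "$tmp/sample"

gzip -l "$tmp/sample.gz"     # compressed, uncompressed, ratio, name
gzip -lv "$tmp/sample.gz"    # adds method, crc, date and time

# Extract only the uncompressed size: second column of the data line.
listed_size=$(gzip -l "$tmp/sample.gz" | awk 'NR == 2 { print $2 }')
echo "$listed_size"
rm -r "$tmp"
```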