Random access gzip stream

2022-12-24 02:19 问答作者：

I'd like to be able to do random access into a gzipped file. I can afford to do some preprocessing on it (say, build some kind of index), provided that the result of the preprocessing is much smaller than the file itself.

Any advice?

My thoughts were:

Hack on an existing gzip implementation and serialize its decompressor state every, say, 1 megab开发者_Go百科yte of compressed data. Then to do random access, deserialize the decompressor state and read from the megabyte boundary. This seems hard, especially since I'm working with Java and I couldn't find a pure-java gzip implementation :(
Re-compress the file in chunks of 1Mb and do same as above. This has the disadvantage of doubling the required disk space.
Write a simple parser of the gzip format that doesn't do any decompressing and only detects and indexes block boundaries (if there even are any blocks: I haven't yet read the gzip format description)

Have a look at this link (C code example).

/* zran.c -- example of zlib/gzip stream indexing and random access
...

Gzip is just zlib with an envelope.

The BGZF file format, compatible with GZIP was developped by the biologists.

(...) The advantage of BGZF over conventional gzip is that BGZF allows for seeking without having to scan through the entire file up to the position being sought.

In http://picard.svn.sourceforge.net/viewvc/picard/trunk/src/java/net/sf/samtools/util/ , have a look at BlockCompressedOutputStream and BlockCompressedInputStream.java

FWIW: I've developed a command line tool upon zlib's zran.c source code which can do random access to gzip with the creation of indexes for gzip files: https://github.com/circulosmeos/gztool

It can even create an index for a still-growing gzip file (for example a log created by rsyslog directly in gzip format) thus reducing in the practice to zero the time of index creation. See the -S (Supervise) option.

interesting question. I don't understand why your 2nd option (recompress file in chunks) would double the disk space. Seems to me it would be the same, less a small amount of overhead. If you have control over the compression piece, then that seems like the right idea.

Maybe what you mean is that you don't have control over the input, and therefore it would double.

If you can do it, I'm imagining modelling it as a CompressedFileStream class that uses as its backing store, a series of 1mb gzip'd blobs. When reading, a Seek() on the stream would move to the appropriate blob and decompress. A Read() past the end of a blob would cause the stream to open the next blob.

ps: GZIP is described in IETF RFC 1952, but it uses DEFLATE for the compression format. There'd be no reason to use the GZIP elaboration if you implemented this CompressedFileStream class as I've imagined it.

继续阅读：compression language-agnostic large-files random-access

Random access gzip stream

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？