
Indexed Compression Library

I am working with a system that compresses large files (40 GB) and then stores them in an archive.

Currently I am using libz.a to compress the files with C++, but when I want to get data out of a file I need to extract the whole thing. Does anyone know of a compression component (preferably .NET compatible) that can store an index of original file positions so that, instead of decompressing the entire file, it can seek to just what is needed?

Example:

Original File        Compressed File
10 - 27          =>  2 - 5
100 - 202        =>  10 - 19
...
10230 - 102020   =>  217 - 298

Since I know the data I need only occurs in the original file between positions 10-27, I'd like a way to map the original file positions to the compressed file positions.

Does anyone know of a compression library or similar readily available tool that can offer this functionality?


I'm not sure if this is going to help you a lot, as the solution depends on your requirements, but I had a similar problem in a project I am working on (at least I think so): I had to keep many text articles on disk and access them in a fairly random manner, and because of the size of the data I had to compress them.

The problem with compressing all this data at once is that most algorithms depend on previously seen data when decompressing. For example, the popular LZW method builds a dictionary (the instructions on how to decompress the data) on the fly, while doing the decompression, so decompressing a stream from the middle is not possible, although I believe these methods might be tuned for it.

The solution I have found to work best, although it does decrease your compression ratio, is to pack the data in chunks. In my project it was simple: each article was one chunk, I compressed them one by one, and then I created an index file that recorded where each "chunk" starts. Decompressing was easy in that case: just decompress the whole chunk, which was the one article I wanted.

So, my file looked like this:

Index; compress(A1); compress(A2); compress(A3)

instead of

compress(A1;A2;A3).
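
For illustration, here is a minimal sketch of that chunk-per-article layout in C#, using DeflateStream from System.IO.Compression. The ChunkedArchive class, the Write/ReadChunk names and the in-memory index format are my own inventions for the sketch, not part of any existing library:

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

static class ChunkedArchive
{
    // Compress each article as its own deflate stream, one after another,
    // and return the index: for every article, the offset and length of
    // its compressed bytes inside the archive file.
    public static List<(long Offset, long Length)> Write(
        string archivePath, IEnumerable<byte[]> articles)
    {
        var index = new List<(long, long)>();
        using (var archive = File.Create(archivePath))
        {
            foreach (var article in articles)
            {
                long start = archive.Position;
                using (var deflate = new DeflateStream(
                           archive, CompressionMode.Compress, leaveOpen: true))
                {
                    deflate.Write(article, 0, article.Length);
                }
                index.Add((start, archive.Position - start));
            }
        }
        // The index itself would be persisted separately (or as a footer)
        // so that readers can locate each chunk later.
        return index;
    }

    // Decompress exactly one article, identified by its index entry.
    public static byte[] ReadChunk(string archivePath, (long Offset, long Length) entry)
    {
        using (var archive = File.OpenRead(archivePath))
        {
            archive.Seek(entry.Offset, SeekOrigin.Begin);
            using (var deflate = new DeflateStream(archive, CompressionMode.Decompress))
            using (var result = new MemoryStream())
            {
                deflate.CopyTo(result);
                return result.ToArray();
            }
        }
    }
}

With 40 GB of data you would probably keep that index in a small separate file, or at a fixed position in the archive, so you never have to scan the whole thing to find a chunk.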

If you can't split your data in such an elegant manner, you can always create chunks artificially, for example by packing the data in 5 MB chunks. Then, when you need to read data from 7 MB to 13 MB, you just decompress the chunks covering 5-10 MB and 10-15 MB. Your index file would then look like:

0     -> 0
5MB   -> sizeof(compress 5MB)
10MB  -> sizeof(compress 5MB) + sizeof(compress next 5MB)

The problem with this solution is that it gives a slightly worse compression ratio: the smaller the chunks are, the worse the compression will be.
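
To make the fixed-size variant concrete, here is a rough sketch of reading an arbitrary original-file byte range by decompressing only the chunks that cover it. The names (FixedChunkReader, ReadRange) and the offset-array index format are assumptions of mine, and it relies on each chunk having been written as its own DeflateStream, as in the previous sketch:

using System;
using System.IO;
using System.IO.Compression;

static class FixedChunkReader
{
    const long ChunkSize = 5 * 1024 * 1024;   // 5 MB of original data per chunk

    // compressedOffsets[i] is where chunk i starts in the compressed file,
    // i.e. the running sum of compressed chunk sizes from the index.
    // Reads the original-file byte range [begin, end) by decompressing
    // only the chunks that overlap it.
    public static byte[] ReadRange(
        string compressedPath, long[] compressedOffsets, long begin, long end)
    {
        long firstChunk = begin / ChunkSize;
        long lastChunk = (end - 1) / ChunkSize;

        using (var file = File.OpenRead(compressedPath))
        using (var result = new MemoryStream())
        {
            for (long i = firstChunk; i <= lastChunk; i++)
            {
                // Jump straight to this chunk's compressed data.
                file.Seek(compressedOffsets[i], SeekOrigin.Begin);
                using (var deflate = new DeflateStream(
                           file, CompressionMode.Decompress, leaveOpen: true))
                using (var chunk = new MemoryStream())
                {
                    // Decompress the whole chunk into memory...
                    deflate.CopyTo(chunk);

                    // ...then keep only the part that falls inside [begin, end).
                    long chunkStart = i * ChunkSize;
                    long from = Math.Max(begin, chunkStart) - chunkStart;
                    long to = Math.Min(end, chunkStart + chunk.Length) - chunkStart;
                    result.Write(chunk.GetBuffer(), (int)from, (int)(to - from));
                }
            }
            return result.ToArray();
        }
    }
}

Whether 5 MB is the right granularity depends on how big your typical reads are: smaller chunks waste less decompression work per read, but, as noted above, compress worse.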

Also: having many chunks of data doesn't mean you have to have many separate files on the hard drive; just pack them one after another in a single file and remember where each one starts.

Also: http://dotnetzip.codeplex.com/ is a nice library for creating zip files that you can use for the compression, and it is written in C#. It worked pretty nicely for me, and you can use its built-in ability to put many files into one zip archive to take care of splitting the data into chunks.
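
If you go the DotNetZip route, the same idea might look roughly like this: one zip entry per chunk, so the zip's own central directory effectively acts as the index. The entry naming scheme and chunk size here are my choice, and the AddEntry/Extract calls are the overloads I remember from the versions I have used, so treat this as a sketch rather than a drop-in solution:

using System;
using System.IO;
using Ionic.Zip;   // DotNetZip

static class ZipChunks
{
    const int ChunkSize = 5 * 1024 * 1024;

    // Split the original file into 5 MB pieces and store each piece
    // as its own entry inside one zip archive.
    public static void Pack(string originalPath, string zipPath)
    {
        using (var input = File.OpenRead(originalPath))
        using (var zip = new ZipFile())
        {
            var buffer = new byte[ChunkSize];
            int chunkIndex = 0, n;
            while ((n = ReadFull(input, buffer)) > 0)
            {
                var piece = new byte[n];
                Array.Copy(buffer, piece, n);
                zip.AddEntry("chunk" + chunkIndex++, piece);
            }
            zip.Save(zipPath);
        }
    }

    // Pull back a single chunk without touching the others.
    public static byte[] Unpack(string zipPath, int chunkIndex)
    {
        using (var zip = ZipFile.Read(zipPath))
        using (var output = new MemoryStream())
        {
            zip["chunk" + chunkIndex].Extract(output);
            return output.ToArray();
        }
    }

    // Read until the buffer is full or the file ends.
    static int ReadFull(Stream s, byte[] buf)
    {
        int total = 0, n;
        while (total < buf.Length && (n = s.Read(buf, total, buf.Length - total)) > 0)
            total += n;
        return total;
    }
}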
