
Indexed Compression Library

I am working with a system that compresses large files (40 GB) and then stores them in an archive.

Currently I am using libz.a to compress the files with C++, but when I want to get data out of a file I need to extract the whole thing. Does anyone know of a compression component (preferably .NET compatible) that can store an index of original file positions so that, instead of decompressing the entire file, it can seek to just what is needed?

Example:

Original File        Compressed File
10 - 27          =>  2 - 5
100 - 202        =>  10 - 19
...
10230 - 102020   =>  217 - 298

Since I know the data I need only occurs in the original file between positions 10-27, I'd like a way to map the original file positions to the compressed file positions.

Does anyone know of a compression library or similar readily available tool that can offer this functionality?


I'm not sure if this is going to help you a lot, as the solution depends on your requirements, but I had a similar problem in a project I am working on (at least I think so): I had to keep many text articles on disk and access them in a fairly random manner, and because of the size of the data I had to compress them.

The problem with compressing all this data at once is that most algorithms depend on previously seen data when decompressing. For example, the popular LZW method builds a dictionary (the instructions on how to decompress the data) on the fly, while doing the decompression, so decompressing a stream from the middle is not possible, although I believe these methods might be tuned for it.

The solution I have found to work best, although it does decrease your compression ratio, is to pack the data in chunks. In my project it was simple: each article was one chunk, I compressed them one by one, and then I created an index file that recorded where each "chunk" starts. Decompressing was easy in that case: just decompress the whole chunk, which was the one article I wanted.

So, my file looked like this:

Index; compress(A1); compress(A2); compress(A3)

instead of

compress(A1;A2;A3).
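
For illustration, here is a minimal sketch of that chunk-per-article layout in C#, using DeflateStream from System.IO.Compression. The ChunkedArchive class, the Write/ReadChunk names and the in-memory index format are my own inventions for the sketch, not part of any existing library:

using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

static class ChunkedArchive
{
    // Compress each article as its own deflate stream, one after another,
    // and return the index: for every article, the offset and length of
    // its compressed bytes inside the archive file.
    public static List<(long Offset, long Length)> Write(
        string archivePath, IEnumerable<byte[]> articles)
    {
        var index = new List<(long, long)>();
        using (var archive = File.Create(archivePath))
        {
            foreach (var article in articles)
            {
                long start = archive.Position;
                using (var deflate = new DeflateStream(
                           archive, CompressionMode.Compress, leaveOpen: true))
                {
                    deflate.Write(article, 0, article.Length);
                }
                index.Add((start, archive.Position - start));
            }
        }
        // The index itself would be persisted separately (or as a footer)
        // so that readers can locate each chunk later.
        return index;
    }

    // Decompress exactly one article, identified by its index entry.
    public static byte[] ReadChunk(string archivePath, (long Offset, long Length) entry)
    {
        using (var archive = File.OpenRead(archivePath))
        {
            archive.Seek(entry.Offset, SeekOrigin.Begin);
            using (var deflate = new DeflateStream(archive, CompressionMode.Decompress))
            using (var result = new MemoryStream())
            {
                deflate.CopyTo(result);
                return result.ToArray();
            }
        }
    }
}

With 40 GB of data you would probably keep that index in a small separate file, or at a fixed position in the archive, so you never have to scan the whole thing to find a chunk.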

If you can't split your data in such an elegant manner, you can always create chunks artificially, for example by packing the data in 5 MB chunks. Then, when you need to read data from 7 MB to 13 MB, you just decompress the chunks covering 5-10 MB and 10-15 MB. Your index file would then look like:

0     -> 0
5MB   -> sizeof(compress 5MB)
10MB  -> sizeof(compress 5MB) + sizeof(compress next 5MB)

The problem with this solution is that it gives a slightly worse compression ratio: the smaller the chunks are, the worse the compression will be.
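
To make the fixed-size variant concrete, here is a rough sketch of reading an arbitrary original-file byte range by decompressing only the chunks that cover it. The names (FixedChunkReader, ReadRange) and the offset-array index format are assumptions of mine, and it relies on each chunk having been written as its own DeflateStream, as in the previous sketch:

using System;
using System.IO;
using System.IO.Compression;

static class FixedChunkReader
{
    const long ChunkSize = 5 * 1024 * 1024;   // 5 MB of original data per chunk

    // compressedOffsets[i] is where chunk i starts in the compressed file,
    // i.e. the running sum of compressed chunk sizes from the index.
    // Reads the original-file byte range [begin, end) by decompressing
    // only the chunks that overlap it.
    public static byte[] ReadRange(
        string compressedPath, long[] compressedOffsets, long begin, long end)
    {
        long firstChunk = begin / ChunkSize;
        long lastChunk = (end - 1) / ChunkSize;

        using (var file = File.OpenRead(compressedPath))
        using (var result = new MemoryStream())
        {
            for (long i = firstChunk; i <= lastChunk; i++)
            {
                // Jump straight to this chunk's compressed data.
                file.Seek(compressedOffsets[i], SeekOrigin.Begin);
                using (var deflate = new DeflateStream(
                           file, CompressionMode.Decompress, leaveOpen: true))
                using (var chunk = new MemoryStream())
                {
                    // Decompress the whole chunk into memory...
                    deflate.CopyTo(chunk);

                    // ...then keep only the part that falls inside [begin, end).
                    long chunkStart = i * ChunkSize;
                    long from = Math.Max(begin, chunkStart) - chunkStart;
                    long to = Math.Min(end, chunkStart + chunk.Length) - chunkStart;
                    result.Write(chunk.GetBuffer(), (int)from, (int)(to - from));
                }
            }
            return result.ToArray();
        }
    }
}

Whether 5 MB is the right granularity depends on how big your typical reads are: smaller chunks waste less decompression work per read, but, as noted above, compress worse.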

Also: having many chunks of data doesn't mean you have to have many separate files on the hard drive; just pack them one after another in a single file and remember where each one starts.

Also: http://dotnetzip.codeplex.com/ is a nice library for creating zip files that you can use for the compression, and it is written in C#. It worked pretty nicely for me, and you can use its built-in ability to put many files into one zip archive to take care of splitting the data into chunks.
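
If you go the DotNetZip route, the same idea might look roughly like this: one zip entry per chunk, so the zip's own central directory effectively acts as the index. The entry naming scheme and chunk size here are my choice, and the AddEntry/Extract calls are the overloads I remember from the versions I have used, so treat this as a sketch rather than a drop-in solution:

using System;
using System.IO;
using Ionic.Zip;   // DotNetZip

static class ZipChunks
{
    const int ChunkSize = 5 * 1024 * 1024;

    // Split the original file into 5 MB pieces and store each piece
    // as its own entry inside one zip archive.
    public static void Pack(string originalPath, string zipPath)
    {
        using (var input = File.OpenRead(originalPath))
        using (var zip = new ZipFile())
        {
            var buffer = new byte[ChunkSize];
            int chunkIndex = 0, n;
            while ((n = ReadFull(input, buffer)) > 0)
            {
                var piece = new byte[n];
                Array.Copy(buffer, piece, n);
                zip.AddEntry("chunk" + chunkIndex++, piece);
            }
            zip.Save(zipPath);
        }
    }

    // Pull back a single chunk without touching the others.
    public static byte[] Unpack(string zipPath, int chunkIndex)
    {
        using (var zip = ZipFile.Read(zipPath))
        using (var output = new MemoryStream())
        {
            zip["chunk" + chunkIndex].Extract(output);
            return output.ToArray();
        }
    }

    // Read until the buffer is full or the file ends.
    static int ReadFull(Stream s, byte[] buf)
    {
        int total = 0, n;
        while (total < buf.Length && (n = s.Read(buf, total, buf.Length - total)) > 0)
            total += n;
        return total;
    }
}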
