开发者

compressed string storage

Lets say I have many objects containing strings of non-trivial length (around ~3-4kb). The strings are all different from each other yet at the same time contain lots of common parts/subsequences. On average maybe 80-90% of any individual string is contained withing the others as well. Is there an easy way to automatically exploit this huge redundancy for compressing the data?

Ideall开发者_运维问答y the solution would be C++ and transparent for the user (i.e. I can use it as if I was accessing a regular read only const std::string but instead reading from compressed storage).


Algorithmically, Lempel–Ziv–Welch with one dictionary for all objects/strings might be a good start.


You can use huffman coding implementation is not hard, Also there are zip algorithms in languages (like C# and java) and you can use them.

Also If you sure 80-90% are repeated in all, create a dictionary of all words, then for each string store the position of dictionary word, means have a bit array of big size (10000 i.e) and mark the related position bits[i] to 1 if a words[i] exists in the current string. think each word length is 5 character then the abbreviation takes around 1/5 size.


If the common parts of the strings are common because they are composed from other strings, then you might get some traction by using the stlport rope class, which looks for all the world like a std::string, but uses substring tree representation with copy on write that makes them both very space efficient (common substrings are shared) and very good at inserts and deletes (log(n))

When to use rope:

  • you are making a template engine. document instances are made from a template by substituting varying data in the template, and then cached for future uses. Parts that are common to templates and instances are stored only once and shared across instances, inserts and deletes are cheap.

When not to use rope:

  • you are loading many documents from outside the domain of your application (from disk, or over a network) and using them without modification. rope doesn't share strings if they are not copied from one rope to another. If you can afford to do the work to find the common substrings, rope can still be used to improve your final representations.


Like @Saeed mentioned, a simple Huffman coding will perform well here.

There is no need in dictionary, if the common words are known apriori (you've mentioned that it's a HTML). Just precompute a huffman table using statistical data from many HTML files (Note that you can encode whole tag by a single symbol, and you can have as many symbols as you want).

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜