compressed string storage

2023-01-28 12:16 问答作者：

Lets say I have many objects containing strings of non-trivial length (around ~3-4kb). The strings are all different from each other yet at the same time contain lots of common parts/subsequences. On average maybe 80-90% of any individual string is contained withing the others as well. Is there an easy way to automatically exploit this huge redundancy for compressing the data?

Ideall开发者_运维问答y the solution would be C++ and transparent for the user (i.e. I can use it as if I was accessing a regular read only const std::string but instead reading from compressed storage).

Algorithmically, Lempel–Ziv–Welch with one dictionary for all objects/strings might be a good start.

You can use huffman coding implementation is not hard, Also there are zip algorithms in languages (like C# and java) and you can use them.

Also If you sure 80-90% are repeated in all, create a dictionary of all words, then for each string store the position of dictionary word, means have a bit array of big size (10000 i.e) and mark the related position bits[i] to 1 if a words[i] exists in the current string. think each word length is 5 character then the abbreviation takes around 1/5 size.

If the common parts of the strings are common because they are composed from other strings, then you might get some traction by using the stlport rope class, which looks for all the world like a std::string, but uses substring tree representation with copy on write that makes them both very space efficient (common substrings are shared) and very good at inserts and deletes (log(n))

When to use rope:

you are making a template engine. document instances are made from a template by substituting varying data in the template, and then cached for future uses. Parts that are common to templates and instances are stored only once and shared across instances, inserts and deletes are cheap.

When not to use rope:

you are loading many documents from outside the domain of your application (from disk, or over a network) and using them without modification. rope doesn't share strings if they are not copied from one rope to another. If you can afford to do the work to find the common substrings, rope can still be used to improve your final representations.

Like @Saeed mentioned, a simple Huffman coding will perform well here.

There is no need in dictionary, if the common words are known apriori (you've mentioned that it's a HTML). Just precompute a huffman table using statistical data from many HTML files (Note that you can encode whole tag by a single symbol, and you can have as many symbols as you want).

继续阅读：algorithm compression data-structures string

compressed string storage

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？