Effectively compress strings of 10-1000 characters in Java?

2023-02-21 16:50 问答作者：

I need to compress strings (written in a known but variable language) of anywhere from 10 to 1000 char开发者_开发知识库acters into individual UDP packets.

What compression algorithms available in Java are well suited to this task?

Are there maybe open source Java libraries available to do this?

"It depends".

I would start with just the primary candidates: LZMA ("7-zip"), deflate (direct, zlib: deflate + small wrapper, gzip: deflate + slightly larger wrapper, zip: deflate + even larger wrapper), bzip2 (I doubt this would be that good here, works best with a relative large window), perhaps even one of other LZ* branches like LZS which has an RFC for IP Payload compression but...

...run some analysis based upon the actual data and compression/throughput using several different approaches. Java has both GZIPOutputStream ("deflate in gzip wrapper") and DeflaterOutputStream ("plain deflate", recommend over gzip or zip "wrappers") standard and there are LZMA Java implementations (just need compressor, not container) so these should all be trivial to mock-up.

If there is regularity between the packets then it is is possible this could be utilized -- e.g. build cache mappings, Huffman tables, or just modify the "windows" of one of the other algorithms -- but packet-loss and "de-compressibility" likely needs to be accounted for. Going down this route though adds far more complexity. More ideas for helping out the compressor may be found at SO: How to find a good/optimal dictionary for zlib 'setDictionary' when processing a given set of data?.

Also the protocol should likely have a simple "fall back" of zero-compression because some [especially small random] data might not be practically compressible or might "compress" to a larger size (zlib actually has this guard, but also has the "wrapper overhead" so it would be better encoded separately for very small data). The overhead of the "wrapper" for the compressed data -- such as gzip or zip -- also needs to be taken into account for such small sizes. This is especially important to consider of string data less than ~100 characters.

Happy coding.

Another thing to consider is the encoding used to shove the characters into the output stream. I would first start with UTF-8, but that may not always be ideal.

See SO: Best compression algorithm for short text strings which suggests SMAZ, but I do not know how this algorithm will transfer to unicode / binary.

Also consider that not all deflate (or other format) implementations are created equal. I am not privy on Java's standard deflate compared to a 3rd party (say JZlib) in terms of efficiency for small data, but consider Compressing Small Payloads [.NET] which shows rather negative numbers for "the same compression" format. The article also ends nicely:

...it’s usually most beneficial to compress anyway, and determine which payload (the compressed or the uncompressed one) has the smallest size and include a small token to indicate whether decompression is required.

My final conclusion: always test using real-world data and measure the benefits, or you might be in for a little surprise in the end!

Happy coding. For real this time.

The simplest thing to do would be to layer a GZIPOutputStream on top of a ByteArrayOutputStream, as that is built into the JDK, using

ByteArrayOutputStream baos = new ByteArrayOutputStream();
GZIPOutputStream zos = new GZIPOutputStream(baos);

zos.write(someText.getBytes());
zos.finish();
zos.flush();


byte[] udpBuffer = baos.toByteArray();

There maybe other algorithms that do a better job, but I'd try this first, to see if it fits your needs as it doesn't require any extra jars, and does a pretty good job.

Most standard compression algorithims doesn't work so well with small amounts of data. Often there is a header and a checksum and it takes time for the compression to warmup. I.e. it builds a data dictionary based on the data it has seen.

For this reason you can find that

small packets may be smaller or the same size with no compression.
a simple application/protocol specific compression is better
you have to provide a prebuilt data dictionary to the compression algorithim and strip out the headers as much as possible.

I usually go with second option for small data packets.

good compression algorithm for short strings/url is lzw implementation, it is in java and can be easily ported for client gwt: https://code.google.com/p/lzwj/source/browse/src/main/java/by/dev/madhead/lzwj/compress/LZW.java

some remarks

use 9 bit code word length for small strings (though you may try which is better). original ratio is from 1 (very small strings, compressed is not larger than original string) to 0.5 (larger strings)
in case of client gwt for other code word lengths it was required to adjust input/output processing to work on per-byte basis, to avoid bugs when buffering bit sequence into long, which is emulated for js.

I'm using it for complex url parameters encoding in client gwt, together with base64 encoding and autobean serialization to json.

upd: base64 implementation is here: http://www.source-code.biz/base64coder/java you have to change it to make url-safe, i.e. change following characters:

'+' -> '-'

'/' -> '~'

'=' -> '_'

继续阅读：compression

Effectively compress strings of 10-1000 characters in Java?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？