开发者

Better compression algorithm for vector data?

I need to compress some spatially correlated data records. Currently I am getting 1.2x-1.5x compression with zlib, but I figure it should be possible to get more like 2x. The data records have various fields, but for example, zlib seems to have trouble compressing lists of points.

The points represent a road network. They are pairs of fixed-point 4-byte integers of the form XXXXYYYY. Typically, if a single data block has 100 points, there will be only be a few combinations of the top two bytes of X and Y (spatial correlation). But the bottom bytes are always changing and must look like random data to zlib.

Similarly, the records have 4-byte IDs which tend to have constant high bytes and variable low bytes.

Is there another开发者_开发技巧 algorithm that would be able to compress this kind of data better? I'm using C++.

Edit: Please no more suggestions to change the data itself. My question is about automatic compression algorithms. If somebody has a link to an overview of all popular compression algorithms I'll just accept that as answer.


You'll likely get much better results if you try to compress the data yourself based on your knowledge of its structure.

General-purpose compression algorithms just treat your data as a bitstream. They look for commonly-used sequences of bits, and replace them with a shorter dictionary indices.

But the duplicate data doesn't go away. The duplicated sequence gets shorter, but it's still duplicated just as often as it was before.

As I understand it, you have a large number of data points of the form

XXxxYYyy, where the upper-case letters are very uniform. So factor them out.

Rewrite the list as something similar to this:

XXYY // a header describing the common first and third byte for all the subsequent entries
xxyy // the remaining bytes, which vary
xxyy
xxyy
xxyy
...
XXYY // next unique combination of 1st and 3rd byte)
xxyy
xxyy
...

Now, each combination of the rarely varying bytes is listed only once, rather than duplicated for every entry they occur in. That adds up to a significant space saving.

Basically, try to remove duplicate data yourself, before running it through zlib. You can do a better job of it because you have additional knowledge about the data.

Another approach might be, instead of storing these coordinates as absolute numbers, write them as deltas, relative deviations from some location chosen to be as close as possible to all the entries. Your deltas will be smaller numbers, which can be stored using fewer bits.


Not specific to your data, but I would recommend checking out 7zip instead of zlib if you can. I've seen ridiculously good compression ratios using this.

http://www.7-zip.org/


Without seeing the data and its exact distribution, I can't say for certain what the best method is, but I would suggest that you start each group of 1-4 records with a byte whose 8 bits indicate the following:

0-1 Number of bytes of ID that should be borrowed from previous record 2-4 Format of position record 6-7 Number of succeeding records that use the same 'mode' byte

Each position record may be stored one of eight ways; all types other than 000 use signed displacements. The number after the bit code is the size of the position record.

000 - 8 - Two full four-byte positions 001 - 3 - Twelve bits for X and Y 010 - 2 - Ten-bit X and six-bit Y 011 - 2 - Six-bit X and ten-bit Y 100 - 4 - Two sixteen-bit signed displacements 101 - 3 - Sixteen-bit X and 8-bit Y signed displacement 110 - 3 - Eight-bit signed displacement for X; 16-bit for Y 111 - 2 - Two eight-bit signed displacements

A mode byte of zero will store all the information applicable to a point without reference to any previous point, using a total of 13 bytes to store 12 bytes of useful information. Other mode bytes will allow records to be compacted based upon similarity to previous records. If four consecutive records differ only in the last bit of the ID, and either have both X and Y within +/- 127 of the previous record, or have X within +/- 31 and Y within +/- 511, or X within +/- 511 and Y within +/- 31, then all four records may be stored in 13 bytes (an average of 3.25 bytes each (a 73% reduction in space).

A "greedy" algorithm may be used for compression: examine a record to see what size ID and XY it will have to use in the output, and then grab up to three more records until one is found that either can't "fit" with the previous records using the chosen sizes, or could be written smaller (note that if e.g. the first record has X and Y displacements both equal to 12, the XY would be written with two bytes, but until one reads following records one wouldn't know which of the three two-byte formats to use).

Before setting your format in stone, I'd suggest running your data through it. It may be that a small adjustment (e.g. using 7+9 or 5+11 bit formats instead of 6+10) would allow many data to pack better. The only real way to know, though, is to see what happens with your real data.


It looks like the Burrows–Wheeler transform might be useful for this problem. It has a peculiar tendency to put runs of repeating bytes together, which might make zlib compress better. This article suggests I should combine other algorithms than zlib with BWT, though.

Intuitively it sounds expensive, but a look at some source code shows that reverse BWT is O(N) with 3 passes over the data and a moderate space overhead, likely making it fast enough on my target platform (WinCE). The forward transform is roughly O(N log N) or slightly over, assuming an ordinary sort algorithm.


Sort the points by some kind of proximity measure such that the average distance between adjacent points is small. Then store the difference between adjacent points.

You might do even better if you manage to sort the points so that most differences are positive in both the x and y axes, but I can't say for sure.

As an alternative to zlib, a family of compression techniques that works well when the probability distribution is skewed towards small numbers is universal codes. They would have to be tweaked for signed numbers (encode abs(x)<<1 + (x < 0 ? 1 : 0)).


You might want to write two lists to the compressed file: a NodeList and a LinkList. Each node would have an ID, x, y. Each link would have a FromNode and a ToNode, along with a list of intermediate xy values. You might be able to have a header record with a false origin and have node xy values relative to that.

This would provide the most benefit if your streets follow an urban grid network, by eliminating duplicate coordinates at intersections.

If the compression is not required to be lossless, you could use truncated deltas for intermediate coordinates. While someone above mentioned deltas, keep in mind that a loss in connectivity would likely cause more problems than a loss in shape, which is what would happen if you use truncated deltas to represent the last coordinate of a road (which is often an intersection).

Again, if your roads aren't on an urban grid, this probably wouldn't buy you much.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜