
C# compress a byte array

I do not know much about compression algorithms. I am looking for a simple compression algorithm (or code snippet) which can reduce the size of a byte[,,] or byte[]. I cannot make use of System.IO.Compression. Also, the data has lots of repetition.

I tried implementing the RLE algorithm (posted below for your inspection). However, it produces arrays 1.2 to 1.8 times larger than the input.

using System.Collections.Generic;

public static class RLE
{
    /* Encode the input as (run length, value) byte pairs. */
    public static byte[] Encode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength;

        for (int i = 0; i < source.Length; i++)
        {
            runLength = 1;
            while (runLength < byte.MaxValue 
                && i + 1 < source.Length 
                && source[i] == source[i + 1])
            {
                runLength++;
                i++;
            }
            dest.Add(runLength);
            dest.Add(source[i]);
        }

        return dest.ToArray();
    }

    /* Expand (run length, value) pairs back into the original bytes. */
    public static byte[] Decode(byte[] source)
    {
        List<byte> dest = new List<byte>();
        byte runLength; 

        for (int i = 1; i < source.Length; i+=2)
        {
            runLength = source[i - 1];

            while (runLength > 0)
            {
                dest.Add(source[i]);
                runLength--;
            }
        }
        return dest.ToArray();
    }

}
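
For example, a quick round trip with these methods (the test data here is mine) shows why the output grows on non-repeating input:

    byte[] repetitive = { 7, 7, 7, 7, 7, 7 };  // Encode -> { 6, 7 }  (6 bytes -> 2)
    byte[] distinct   = { 1, 2, 3, 4, 5 };     // Encode -> { 1,1, 1,2, 1,3, 1,4, 1,5 }  (5 -> 10)
    byte[] roundTrip  = RLE.Decode(RLE.Encode(distinct));  // equals 'distinct' again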

I have also found a Java LZW implementation based on strings and integers. I have converted it to C# and the results look good (code posted below). However, I am not sure how it works, nor how to make it work with bytes instead of strings and integers.

using System.Collections.Generic;

public class LZW
{
    /* Compress a string to a list of output symbols. */
    public static int[] compress(string uncompressed)
    {
        // Build the dictionary.
        int dictSize = 256;
        Dictionary<string, int> dictionary = new Dictionary<string, int>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add("" + (char)i, i);

        string w = "";
        List<int> result = new List<int>();

        for (int i = 0; i < uncompressed.Length; i++)
        {
            char c = uncompressed[i];
            string wc = w + c;
            if (dictionary.ContainsKey(wc))
                w = wc;
            else
            {
                result.Add(dictionary[w]);
                // Add wc to the dictionary.
                dictionary.Add(wc, dictSize++);
                w = "" + c;
            }
        }

        // Output the code for w.
        if (w != "")
            result.Add(dictionary[w]);
        return result.ToArray();
    }

    /* Decompress a list of output codes back to a string. */
    public static string decompress(int[] compressed)
    {
        int dictSize = 256;
        Dictionary<int, string> dictionary = new Dictionary<int, string>();
        for (int i = 0; i < dictSize; i++)
            dictionary.Add(i, "" + (char)i);

        string w = "" + (char)compressed[0];
        string result = w;
        for (int i = 1; i < compressed.Length; i++)
        {
            int k = compressed[i];
            string entry = "";
            if (dictionary.ContainsKey(k))
                entry = dictionary[k];
            else if (k == dictSize)
                // Special LZW case: the code is the entry currently being
                // built, which is always w plus its own first character.
                entry = w + w[0];

            result += entry;

            // Add w+entry[0] to the dictionary.
            dictionary.Add(dictSize++, w + entry[0]);

            w = entry;
        }

        return result;
    }
}
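
To give a feel for how the compressor works: each output code stands for the longest already-seen sequence starting at the current position, and every miss adds one new, longer sequence to the dictionary. For example, compress("ABABABA") yields [65, 66, 256, 258], where 65 and 66 are the codes for 'A' and 'B', 256 is the entry "AB" added on the first miss, and 258 is "ABA". As for a byte-oriented version, here is a minimal sketch (my adaptation, not tested against the original Java): the dictionary is keyed on (prefix code, next byte) pairs, so no strings are involved. Note that it uses C# 7 tuple keys, which older runtimes may not support, and that the int codes would still need to be bit-packed (classically 12 bits each) before any space is actually saved.

    using System.Collections.Generic;

    public static class ByteLZW
    {
        public static int[] Compress(byte[] data)
        {
            // Codes 0..255 are implicit: they stand for the single byte values,
            // so the dictionary only stores the multi-byte sequences.
            var dictionary = new Dictionary<(int prefix, byte next), int>();
            int dictSize = 256;
            var result = new List<int>();
            if (data.Length == 0)
                return result.ToArray();

            int w = data[0];                 // code of the current longest match
            for (int i = 1; i < data.Length; i++)
            {
                byte c = data[i];
                if (dictionary.TryGetValue((w, c), out int code))
                    w = code;                // the match extends by one byte
                else
                {
                    result.Add(w);           // emit the longest match found
                    dictionary[(w, c)] = dictSize++;
                    w = c;                   // restart matching at this byte
                }
            }
            result.Add(w);                   // emit the final pending match
            return result.ToArray();
        }
    }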


Have a look here. I used this code as a basis to compress in one of my work projects. Not sure how much of the .NET Framework is accessible in the Xbox 360 SDK, so not sure how well this will work for you.


The problem with that RLE algorithm is that it is too simple. It prefixes every byte with how many times it is repeated, which means that in long stretches of non-repeating bytes, each single byte is prefixed with a "1". On data without any repetition this doubles the size; for example, the five distinct bytes 01 02 03 04 05 encode as 01 01 01 02 01 03 01 04 01 05.

This can be avoided by using Code-type RLE instead; the 'Code' (also called 'Token') is a byte with two possible meanings: either it indicates how many times the single following byte is repeated, or it indicates how many non-repeating bytes follow that should be copied as they are. The two cases are distinguished by the highest bit of the token, which leaves 7 bits for the value, so each code can repeat or copy up to 127 bytes.

This means that even in worst-case scenarios, the final size can only be about 1/127th larger than the original file size.
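
For illustration, here is a rough sketch of that token scheme (my code, not the optimised implementation from the page linked below; it assumes `using System.Collections.Generic;`):

    public static byte[] TokenRleEncode(byte[] src)
    {
        var dest = new List<byte>();
        int i = 0;
        while (i < src.Length)
        {
            // Measure the run of identical bytes starting at i (7-bit cap).
            int run = 1;
            while (run < 127 && i + run < src.Length && src[i + run] == src[i])
                run++;

            if (run > 1)
            {
                dest.Add((byte)(0x80 | run)); // high bit set: repeat next byte
                dest.Add(src[i]);
                i += run;
            }
            else
            {
                // Collect literals until a run of 2+ starts (or the 7-bit cap).
                int lit = 1;
                while (lit < 127 && i + lit < src.Length
                    && (i + lit + 1 >= src.Length || src[i + lit] != src[i + lit + 1]))
                    lit++;

                dest.Add((byte)lit);          // high bit clear: copy bytes verbatim
                for (int j = 0; j < lit; j++)
                    dest.Add(src[i + j]);
                i += lit;
            }
        }
        return dest.ToArray();
    }

    public static byte[] TokenRleDecode(byte[] src)
    {
        var dest = new List<byte>();
        int i = 0;
        while (i < src.Length)
        {
            byte token = src[i++];
            int count = token & 0x7F;
            if ((token & 0x80) != 0)          // run token: one byte, repeated
            {
                for (int j = 0; j < count; j++)
                    dest.Add(src[i]);
                i++;
            }
            else                              // copy token: literal bytes follow
            {
                for (int j = 0; j < count; j++)
                    dest.Add(src[i + j]);
                i += count;
            }
        }
        return dest.ToArray();
    }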

A good explanation of the whole concept, plus full working (and, in fact, heavily optimised) C# code, can be found here:

http://www.shikadi.net/moddingwiki/RLE_Compression

Note that sometimes the data will end up larger than the original anyway, simply because there are not enough repeating bytes in it for RLE to work. A good way to deal with such compression failures is to add a header to your final data. If you simply add one extra byte at the start that is 0 for uncompressed data and 1 for RLE-compressed data, then, whenever RLE fails to give a smaller result, you save the data uncompressed with the 0 in front, and your final output is exactly one byte larger than the original. The system on the other side can then read that first byte to decide whether the following data needs to be decompressed or just copied, as in the sketch below.
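
A minimal sketch of that one-byte header, reusing the TokenRleEncode sketch from above (assumes `using System;` for Array.Copy):

    public static byte[] EncodeWithFallback(byte[] src)
    {
        byte[] packed = TokenRleEncode(src);
        bool useRle = packed.Length < src.Length;
        byte[] payload = useRle ? packed : src;

        var result = new byte[payload.Length + 1];
        result[0] = (byte)(useRle ? 1 : 0);   // 1 = RLE data follows, 0 = stored as-is
        Array.Copy(payload, 0, result, 1, payload.Length);
        return result;
    }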


Look into Huffman coding; it's a pretty simple algorithm. Basically, use fewer bits for the byte values that show up more often, and keep a table of how each value is encoded. Since the codewords are written back to back with no separators, they must form a prefix code, so that the decoder always knows where one codeword ends and the next begins.
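
A minimal sketch of the idea (my illustration, not a full codec): count byte frequencies, build the tree by repeatedly merging the two rarest nodes, and read the codewords off the tree. Packing the resulting bit strings into bytes and shipping the table to the decoder are the remaining steps.

    using System.Collections.Generic;

    public static class HuffmanSketch
    {
        private class Node
        {
            public int Freq;
            public byte Value;          // meaningful only for leaves
            public Node Left, Right;    // both null for leaves
        }

        /* Build a prefix-free code table: frequent bytes get short bit strings. */
        public static Dictionary<byte, string> BuildCodes(byte[] data)
        {
            // Count how often each byte value occurs.
            var freq = new int[256];
            foreach (byte b in data) freq[b]++;

            // One leaf per byte value that actually appears.
            var nodes = new List<Node>();
            for (int v = 0; v < 256; v++)
                if (freq[v] > 0)
                    nodes.Add(new Node { Freq = freq[v], Value = (byte)v });

            var codes = new Dictionary<byte, string>();
            if (nodes.Count == 0) return codes;

            // Repeatedly merge the two least-frequent nodes. Re-sorting each
            // round is crude but fine for at most 256 symbols.
            while (nodes.Count > 1)
            {
                nodes.Sort((a, b) => a.Freq.CompareTo(b.Freq));
                var parent = new Node
                {
                    Freq = nodes[0].Freq + nodes[1].Freq,
                    Left = nodes[0],
                    Right = nodes[1]
                };
                nodes.RemoveRange(0, 2);
                nodes.Add(parent);
            }

            // Walk the tree: left edge = '0', right edge = '1'. Leaves near the
            // root (the frequent bytes) receive the shortest codewords.
            void Walk(Node n, string path)
            {
                if (n.Left == null)
                {
                    codes[n.Value] = path.Length > 0 ? path : "0";
                    return;
                }
                Walk(n.Left, path + "0");
                Walk(n.Right, path + "1");
            }
            Walk(nodes[0], "");
            return codes;
        }
    }

Because no codeword is a prefix of another, the decoder can consume the bit stream unambiguously without separators, which is exactly the property mentioned above.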
