Simple Java algorithm to encode/decode the following string

Suppose I have

    String input = "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3";

I want to encode it into a string with fewer characters that also hides the actual information by representing it in Roman characters, e.g. the above encodes to something like "Adqwqkjlhs". It must be possible to decode the encoded string back to the original string.

The string input is actually something I parse from the hash of a URL, but the original format is lengthy and open to manipulation.

Any ideas?

Thanks

Edit #1

Each number can be from 0 to 99, and the numbers are separated by commas so that String.split(",") can retrieve the String[].

Edit #2 (Purpose of encoded string)

Suppose the above string encodes to bmtwva1131gpefvb1xv; then I can have a URL like www.shortstring.com/input#bmtwva1131gpefvb1xv. From there I would decode bmtwva1131gpefvb1xv back into the comma-separated numbers.


This isn't much of an improvement on Nathan Hughes' solution, but the longer the strings are, the more you save.

Encoding: create a String starting with "1", making each of the numbers in the source string 2 digits, thus "0" becomes "00", "5" becomes "05", "99" becomes "99", etc. Represent the resulting number in base 36.

Decoding: take the base 36 string, convert it back to base 10, skip the leading "1", then turn every 2 digits into an int and rebuild the original string.

Example code (needs java.math.BigInteger and java.util.StringTokenizer imported):

    String s = "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3";

    // ENCODE the string
    StringTokenizer tokenizer = new StringTokenizer(s,",");
    StringBuilder b = new StringBuilder();
    b.append("1");  // This is a primer character, in case we end up with a bunch of zeroes at the beginning
    while(tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken().trim();
        if(token.length()==1) {
            b.append("0");
            b.append(token);
        }
        else {
            b.append(token);
        }
    }

    System.out.println(b);
    // We get this String: 101020000000000000000000000000000000000010202030004000000040003

    String encoded = (new BigInteger(b.toString())).toString(36);
    System.out.println(encoded);
    // We get this String: kcocwisb8v46v8lbqjw0n3oaad49dkfdbc5zl9vn


    // DECODE the string

    String decoded = (new BigInteger(encoded, 36)).toString();
    System.out.println(decoded);
    // We should get this String: 101020000000000000000000000000000000000010202030004000000040003

    StringBuilder p = new StringBuilder();
    int index = 1;   // we skip the first "1", it was our primer
    while(index<decoded.length()) {
        if(index>1) {
            p.append(",");
        }
        p.append(Integer.parseInt(decoded.substring(index,index+2)));
        index = index+2;
    }

    System.out.println(p);
    // We should get this String: 1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3

I don't know of an easy way to turn a large number into base 64. Carefully chosen symbols (like +, _, -) are OK to use in a URL, so 0-9, a-z, A-Z, plus "_" and "-" makes 64. The BigInteger.toString(int radix) method only goes up to Character.MAX_RADIX, which is 36 (no uppercase letters). If you can find a way to convert a large number to base 64, the resulting encoded String will be even shorter.

EDIT: looks like this does it for you: http://commons.apache.org/codec/apidocs/org/apache/commons/codec/binary/Base64.html
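For what it's worth, here is a minimal sketch of that idea using java.util.Base64 (Java 8+) rather than Commons Codec; that swap is an assumption on my part, and the class and method names are made up for illustration. It builds the same primed decimal string as above, turns the BigInteger into bytes, and encodes those with the URL-safe alphabet (A-Z, a-z, 0-9, "-", "_"):

```java
import java.math.BigInteger;
import java.util.Base64;

// Hypothetical helper: primed decimal string -> BigInteger -> URL-safe Base64.
final class Base64NumberCodec {

    static String encode(String csv) {
        StringBuilder digits = new StringBuilder("1"); // primer keeps leading zeros
        for (String token : csv.split(",")) {
            int value = Integer.parseInt(token.trim());
            if (value < 0 || value > 99) {
                throw new IllegalArgumentException("out of range: " + value);
            }
            if (value < 10) {
                digits.append('0'); // pad every number to two digits
            }
            digits.append(value);
        }
        byte[] raw = new BigInteger(digits.toString()).toByteArray();
        return Base64.getUrlEncoder().withoutPadding().encodeToString(raw);
    }

    static String decode(String encoded) {
        byte[] raw = Base64.getUrlDecoder().decode(encoded);
        String digits = new BigInteger(raw).toString();
        StringBuilder csv = new StringBuilder();
        // Start at 1 to skip the primer, then read two digits per number.
        for (int i = 1; i < digits.length(); i += 2) {
            if (i > 1) {
                csv.append(',');
            }
            csv.append(Integer.parseInt(digits.substring(i, i + 2)));
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        String csv = "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3";
        String encoded = encode(csv);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(csv)); // true if the round trip works
    }
}
```

For the sample input this should come out a few characters shorter than the base 36 version, since each character carries 6 bits instead of roughly 5.17.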


How about saving it as a base 36 number?

In Java that would be

    new java.math.BigInteger("120000000000000000012230400403").toString(36)

which would evaluate to "bmtwva1131gpefvb1xv"

You would get the original number back with

    new java.math.BigInteger("bmtwva1131gpefvb1xv", 36)

It's a good point that this doesn't handle leading 0s (Thilo's suggestion of adding a leading 1 would work). About the commas: if the numbers were equally sized (01 instead of 1), then I think there wouldn't be a need for commas.


I suggest you look at base64, which provides 6 bits of information per character. In general, your encoding efficiency is log2(K) bits per symbol, where K is the number of symbols in the set of allowable symbols.
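To put rough numbers on that, here is a small sketch (class and method names are hypothetical) that computes log2(K) and a lower bound on the encoded length for a list of values. It only estimates lengths; it does not encode anything.

```java
// Hypothetical helper for comparing alphabets by information per symbol.
final class SymbolEfficiency {

    // Bits carried by one symbol drawn from an alphabet of k symbols: log2(k).
    static double bitsPerSymbol(int k) {
        return Math.log(k) / Math.log(2);
    }

    // Lower bound on the characters needed to store `count` items that each take
    // one of `valuesPerItem` possible values, using an alphabet of k symbols.
    static int minChars(int count, int valuesPerItem, int k) {
        double totalBits = count * bitsPerSymbol(valuesPerItem);
        return (int) Math.ceil(totalBits / bitsPerSymbol(k));
    }

    public static void main(String[] args) {
        // 31 numbers in the range 0-99, as in the question's sample input.
        System.out.println("base 36: at least " + minChars(31, 100, 36) + " chars");
        System.out.println("base 64: at least " + minChars(31, 100, 64) + " chars");
    }
}
```

These are theoretical minimums; a real encoding also has to carry the length or a terminator, so it will usually be slightly longer.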

With an 8-bit character set, many of the characters are not permitted in URLs, so you need to choose a subset of legal URL characters.


Just to clarify: I didn't mean encode your "1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,2,2,3,0,4,0,0,0,4,0,3" string as base64 -- I meant figure out what information you really want to encode, expressed as a string of raw binary bytes, and encode that in base64. It will exclude control characters (although you might want to use an alternate form where all 64 characters can be used in URLs without escaping) and be more efficient than converting numbers to a printable number form.


"The number can be from 0 to 99, and each number is separate by a comma for String.split(",") to retrieve the String[]"

OK, now you have a clear definition. Here's a suggestion:

Convert your information from its original form to a binary number / byte array. If all you have is a string of comma-separated numbers from 0-99, then here are two options:

  • (slow) -- treat as numbers in base 100, convert to a BigInteger (e.g. n = n * 100 + x[i] for each number x in the array), convert to a byte array, and be sure to precede the whole thing by its length, so that "0,0,0,0" can be distinguished from "0,0" (numerically equal in base 100, but with a different length). Then convert the result to base64.

  • (more efficient) -- treat as numbers in base 128 (since that is a power of 2), and use any number from 100-127 as a termination character. Each block of 6 numbers therefore contains 42 (=6*7) bits of information, which can be encoded as a string of 7 characters using base64. (Pad with termination characters as needed to reach a whole multiple of 6 of the original numbers.)
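A rough sketch of the first (slow) option. Two assumptions of mine: the count fits in a single prefix byte (at most 255 numbers), and the base64 step uses java.util.Base64's URL-safe encoder. The names are illustrative only.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.Base64;

// Hypothetical codec: length byte + base-100 BigInteger, then URL-safe Base64.
final class Base100Codec {
    private static final BigInteger HUNDRED = BigInteger.valueOf(100);

    static String encode(int[] values) {
        if (values.length > 255) {
            throw new IllegalArgumentException("count must fit in one byte");
        }
        BigInteger n = BigInteger.ZERO;
        for (int v : values) {
            if (v < 0 || v > 99) {
                throw new IllegalArgumentException("out of range: " + v);
            }
            n = n.multiply(HUNDRED).add(BigInteger.valueOf(v)); // n = n * 100 + x[i]
        }
        byte[] number = n.toByteArray();
        byte[] out = new byte[number.length + 1];
        out[0] = (byte) values.length; // length prefix distinguishes "0,0" from "0,0,0,0"
        System.arraycopy(number, 0, out, 1, number.length);
        return Base64.getUrlEncoder().withoutPadding().encodeToString(out);
    }

    static int[] decode(String text) {
        byte[] in = Base64.getUrlDecoder().decode(text);
        int count = in[0] & 0xFF;
        BigInteger n = new BigInteger(1, Arrays.copyOfRange(in, 1, in.length));
        int[] values = new int[count];
        // Peel off base-100 digits from the least significant end.
        for (int i = count - 1; i >= 0; i--) {
            BigInteger[] quotientAndRemainder = n.divideAndRemainder(HUNDRED);
            values[i] = quotientAndRemainder[1].intValue();
            n = quotientAndRemainder[0];
        }
        return values;
    }

    public static void main(String[] args) {
        int[] sample = {1, 2, 0, 0, 99};
        System.out.println(Arrays.toString(decode(encode(sample))));
    }
}
```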

Because you have a potentially variable-length array of numbers as inputs, you need to encode the length somehow -- either directly as a prefix, or indirectly by using a termination character.

For the inverse algorithm, just reverse the steps and you'll get an array of numbers from 0 to 99 -- using either the prefixed length or termination character to determine the size of the array -- which you can convert to a human-readable string separated with commas.
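Here is one way the second (more efficient) option and its inverse could look. Assumptions on my part: 127 is the termination value, the 64-symbol alphabet is the URL-safe base64 one, and each 42-bit block is held in a long. The names are made up for the example.

```java
import java.util.Arrays;

// Hypothetical codec: six 7-bit values per block -> seven 6-bit characters.
final class SevenBitCodec {
    private static final String ALPHABET =
            "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
    private static final int TERMINATOR = 127; // any value in 100-127 would do

    static String encode(int[] values) {
        StringBuilder out = new StringBuilder();
        for (int start = 0; start < values.length; start += 6) {
            long block = 0;
            for (int i = 0; i < 6; i++) {
                int index = start + i;
                int v = index < values.length ? values[index] : TERMINATOR; // pad last block
                if (v < 0 || (index < values.length && v > 99)) {
                    throw new IllegalArgumentException("out of range: " + v);
                }
                block = (block << 7) | v;
            }
            // 42 bits -> seven 6-bit symbols, most significant first.
            for (int shift = 36; shift >= 0; shift -= 6) {
                out.append(ALPHABET.charAt((int) ((block >>> shift) & 0x3F)));
            }
        }
        return out.toString();
    }

    static int[] decode(String text) {
        if (text.length() % 7 != 0) {
            throw new IllegalArgumentException("length must be a multiple of 7");
        }
        int[] values = new int[text.length() / 7 * 6];
        int count = 0;
        for (int start = 0; start < text.length(); start += 7) {
            long block = 0;
            for (int i = 0; i < 7; i++) {
                int symbol = ALPHABET.indexOf(text.charAt(start + i));
                if (symbol < 0) {
                    throw new IllegalArgumentException("bad character");
                }
                block = (block << 6) | symbol;
            }
            for (int shift = 35; shift >= 0; shift -= 7) {
                int v = (int) ((block >>> shift) & 0x7F);
                if (v >= 100) {
                    return Arrays.copyOf(values, count); // hit the terminator
                }
                values[count++] = v;
            }
        }
        return Arrays.copyOf(values, count);
    }

    public static void main(String[] args) {
        int[] sample = {1, 2, 0, 0, 99, 4, 7};
        System.out.println(encode(sample));
        System.out.println(Arrays.toString(decode(encode(sample))));
    }
}
```

For the 31-number sample in the question this produces 6 blocks, i.e. 42 characters, so padding to a multiple of 6 costs a little on short inputs.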

If you have access to the original information in a raw binary form before it's encoded as a string, use that instead. (but please post a question with the input format requirements for that information)


If the numbers are between 0 and 255, you can create a byte array out of them. Once you have a byte array, you have many choices:

  1. Use base64 on the byte array, which will create a compact string (almost) URL compatible
  2. Convert them to chars, using your own algorithm based on maximum values
  3. Convert them to longs, and then use Long.toString(x,31).

To convert back, you'll obviously have to apply the chosen algorithm in the opposite way.
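A minimal sketch of option 1, assuming Java 8's java.util.Base64 with its URL-safe alphabet (which avoids the "+" and "/" that make standard base64 only almost URL compatible). The class and method names are illustrative.

```java
import java.util.Base64;

// Hypothetical codec: one byte per number, then URL-safe Base64.
final class ByteArrayCodec {

    static String encode(String csv) {
        String[] tokens = csv.split(",");
        byte[] bytes = new byte[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            int value = Integer.parseInt(tokens[i].trim());
            if (value < 0 || value > 255) {
                throw new IllegalArgumentException("out of range: " + value);
            }
            bytes[i] = (byte) value; // values above 127 wrap to negative bytes
        }
        return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
    }

    static String decode(String text) {
        byte[] bytes = Base64.getUrlDecoder().decode(text);
        StringBuilder csv = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                csv.append(',');
            }
            csv.append(bytes[i] & 0xFF); // undo the sign wrap
        }
        return csv.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("1,2,0,0,255");
        System.out.println(encoded + " -> " + decode(encoded));
    }
}
```

Spending a whole byte per number is simple but wastes space when the values only go up to 99, so this is the least compact of the byte-based approaches.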


Modified UUENCODE:

Split the binary data into groups of 6 bits.

Make an array of 64 characters (choose URL-safe ones and keep them in ASCII order for easy searching): 0..9, A..Z, _, a..z, ~

Map between the 6-bit groups and the characters.
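A sketch of those three steps, assuming the input is already a byte array and that any leftover bits (fewer than 8) can be dropped when decoding. The class name is hypothetical, and the lookup uses a plain indexOf rather than a binary search.

```java
import java.util.Arrays;

// Hypothetical 6-bit codec using the 64-character alphabet listed above.
final class SixBitCodec {
    static final String ALPHABET =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz~";

    static String encode(byte[] data) {
        StringBuilder out = new StringBuilder();
        int buffer = 0; // bits read but not yet emitted
        int bits = 0;   // number of valid bits in buffer
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 6) {
                bits -= 6;
                out.append(ALPHABET.charAt((buffer >>> bits) & 0x3F));
            }
            buffer &= (1 << bits) - 1; // keep only the leftover bits
        }
        if (bits > 0) {
            // Pad the final partial group with zero bits on the right.
            out.append(ALPHABET.charAt((buffer << (6 - bits)) & 0x3F));
        }
        return out.toString();
    }

    static byte[] decode(String text) {
        byte[] out = new byte[text.length() * 6 / 8];
        int buffer = 0;
        int bits = 0;
        int pos = 0;
        for (int i = 0; i < text.length(); i++) {
            int value = ALPHABET.indexOf(text.charAt(i));
            if (value < 0) {
                throw new IllegalArgumentException("bad character at " + i);
            }
            buffer = (buffer << 6) | value;
            bits += 6;
            if (bits >= 8) {
                bits -= 8;
                out[pos++] = (byte) (buffer >>> bits);
                buffer &= (1 << bits) - 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] sample = {1, 2, 0, 0, 99};
        String encoded = encode(sample);
        System.out.println(encoded + " -> " + Arrays.toString(decode(encoded)));
    }
}
```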
