Shorten an already short string in Java
I'm looking for a way to shorten an already short string as much as possible.
The string is a hostname:port combo and could look like "my-domain.se:2121" or "123.211.80.4:2122".
I know regul开发者_如何学JAVAar compression is pretty much out of the question on strings this short due to the overhead needed and the lack of repetition but I have an idea of how to do it.
Because the alphabet is limited to 39 characters ([a-z][0-9]-:.) every character could fit in 6 bits. This reduce the length with up to 25% compared to ASCII. So my suggestion is somthing along these lines:
- Encode the string to a byte array using some kind of custom encoding
- Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense).
And then reverse the process to get the original string.
So to my questions:
- Could this work?
- Is there a better way?
- How?
You could encode the string as base 40 which is more compact than base 64. This will give you 12 such tokens into a 64 bit long. The 40th token could be the end of string marker to give you the length (as it will not be a whole number of bytes any more)
If you use arithmetic encoding, it could be much smaller but you would need a table of frequencies for each token. (using a long list of possible examples)
class Encoder {
public static final int BASE = 40;
StringBuilder chars = new StringBuilder(BASE);
byte[] index = new byte[256];
{
chars.append('\0');
for (char ch = 'a'; ch <= 'z'; ch++) chars.append(ch);
for (char ch = '0'; ch <= '9'; ch++) chars.append(ch);
chars.append("-:.");
Arrays.fill(index, (byte) -1);
for (byte i = 0; i < chars.length(); i++)
index[chars.charAt(i)] = i;
}
public byte[] encode(String address) {
try {
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
for (int i = 0; i < address.length(); i += 3) {
switch (Math.min(3, address.length() - i)) {
case 1: // last one.
byte b = index[address.charAt(i)];
dos.writeByte(b);
break;
case 2:
char ch = (char) ((index[address.charAt(i+1)]) * 40 + index[address.charAt(i)]);
dos.writeChar(ch);
break;
case 3:
char ch2 = (char) ((index[address.charAt(i+2)] * 40 + index[address.charAt(i + 1)]) * 40 + index[address.charAt(i)]);
dos.writeChar(ch2);
break;
}
}
return baos.toByteArray();
} catch (IOException e) {
throw new AssertionError(e);
}
}
public static void main(String[] args) {
Encoder encoder = new Encoder();
for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
System.out.println(s + " (" + s.length() + " chars) encoded is " + encoder.encode(s).length + " bytes.");
}
}
}
prints
twitter.com:2122 (16 chars) encoded is 11 bytes.
123.211.80.4:2122 (17 chars) encoded is 12 bytes.
my-domain.se:2121 (17 chars) encoded is 12 bytes.
www.stackoverflow.com:80 (24 chars) encoded is 16 bytes.
I leave decoding as an exercise. ;)
First of all, IP addresses are designed to fit into 4 bytes and port numbers into 2. The ascii representation is only for humans to read, so it doesn't make sense to do compression on that.
Your idea for compressing domain name strings is doable.
Well in your case, I would use a specialized algo for your usecase. Recognize that you can store something other than strings. So for a IPv4 address : port, you would have a class that captured 6 bytes -- 4 for the address and 2 for the port. Another for type for apha-numeric hostnames. The port would always be stored in two bytes. The hostname part itself could also have specialized support for .com
, for example. So a sample hierarchy may be:
HostPort
|
+----+--------+
| |
IPv4 HostnamePort
|
DotComHostnamePort
public interface HostPort extends CharSequence { }
public HostPorts {
public static HostPort parse(String hostPort) {
...
}
}
In this case, the DotComHostnamePort allows you to drop .com
from the host name and save 4 chars/bytes, depending on whether you store hostnames in punyform or in UTF16 form.
The first two bytes could contain the port number. If you always start with this fixed length port number, you don't need to include the separator :
. Instead use a bit that indicates whether an IP address follows (see Karl Bielefeldt's solution) or a host name.
You could encode them using the CDC Display code. This encoding was used back in the old days when bits were scarce and programmers were nervous.
What you are suggesting is similar to base 64 encoding/decoding and there might be some mileage in looking at some of those implementations (base 64 encoding uses 6 bits).
As a starter if you use Apaches base 64 library
String x = new String(Base64.decodeBase64("my-domain.se:2121".getBytes()));
String y = new String(Base64.encodeBase64(x.getBytes()));
System.out.println("x = " + x);
System.out.println("y = " + y);
It will shorten your string by a few chars. This obviously does not work as what you end up with is not what you started with.
精彩评论