Encoding byte data into digits
Is there a common method to encode and decode arbitrary data so the encoded end result consists of numbers only - like base64_encode but without the letters?
Fictitious example:
$encoded = numbers_encode("Mary had a little lamb");
echo $encoded; // outputs e.g. 12238433742239423742322 (开发者_如何学运维fictitious result)
$decoded = numbers_decode("12238433742239423742322");
echo $decoded; // outputs "Mary had a little lamb"
You can think of a (single byte character) string as a base-256 encoded number where "\x00" represents 0, ' ' (space, i.e., "\x20") represents 32 and so on until "\xFF", which represents 255.
A representation only with numbers 0-9 can be accomplished simply by changing the representation to base 10.
Note that "base64 encoding" is not actually a base conversion. base64 breaks the input into groups of 3 bytes (24 bits) and does the base conversion on those groups individually. This works well because a number with 24 bits can be represented with four digits in base 64 (2^24 = 64^4).
This is more or less what el.pescado does – he splits the input data into 8-bit pieces and then converts the number into base 10. However, this technique has one disadvantage relatively to base 64 encoding – it does not align correctly with the byte boundary. To represent a number with 8-bits (0-255 when unsigned) we need three digits in base 10. However, the left-most digit has less information than the others. It can either be 0, 1 or 2 (for unsigned numbers).
A digit in base 10 stores log(10)/log(2) bits. No matter the chunk size you choose, you're never going to be able to align the representations with 8-bit bytes (in the sense of "aligning" I've described in the paragraph before). Consequently, the most compact representation is a base conversion (which you can see as if it were a "base encoding" with only one big chunk).
Here is an example with bcmath.
bcscale(0);
function base256ToBase10(string $string) {
//argument is little-endian
$result = "0";
for ($i = strlen($string)-1; $i >= 0; $i--) {
$result = bcadd($result,
bcmul(ord($string[$i]), bcpow(256, $i)));
}
return $result;
}
function base10ToBase256(string $number) {
$result = "";
$n = $number;
do {
$remainder = bcmod($n, 256);
$n = bcdiv($n, 256);
$result .= chr($remainder);
} while ($n > 0);
return $result;
}
For
$string = "Mary had a little lamb";
$base10 = base256ToBase10($string);
echo $base10,"\n";
$base256 = base10ToBase256($base10);
echo $base256;
we get
36826012939234118013885831603834892771924668323094861 Mary had a little lamb
Since each digit encodes only log(10)/log(2)=~3.32193
bits expect the number to tend to be 140% longer (not 200% longer, as would be with el.pescado's answer).
Well, that would be "base 8" encoding rather than Base 64. This is better know as Octal.
All Base64 does is convert bit streams in to 6 bit blocks (0-63), and assigns a character from a 64 character character set. Octal uses 3 bits, 0-7. So it COULD use ABCDEFGH, but instead uses 0-7. You can't (easily) use 0-9 because 0-9 is up to 4 bits, but not completely 4 bits. That's what makes it a lousy encoding for binary data.
Regardless of how you encode you'll always end back up at a smaller base. It may be possible to shrink the resultant integer a bit smaller with some dechex() conversions but ultimately you'll only save a few characters. That being said, the number really balloons the moment you start representing multi-byte characters with 0-9.
I have to wonder if integers as IDs, representing words, or complete strings, wouldn't provide a smaller footprint. Not really a direct encoding but a viable option.
@el.pescado gets credit for the first half but he did challenge the reader. So, I responded (mainly because I wanted to understand what's happening).
function pekka_encode($s) {
$out = '';
for ($i=0;$i<strlen($s); $i++) {
$out .= sprintf("%03d", ord($s[$i]));
}
return $out;
}
function pekka_decode($s) {
$out = '';
for ($i=0;$i<strlen($s);$i+=3) {
$out .= chr($s[$i].$s[$i+1].$s[$i+2]);
}
return $out;
}
Very simple example - it represents every input byte as 3-digit decimal number:
function data2numbers ($data) {
$out = "";
for ($i = 0; $i < strlen ($data); $i++) {
$out .= sprintf ("%03d", ord ($data[$i]));
}
return $out;
}
Downside is that it triples size of any input data (every input byte is represented as three output bytes).
Decoding function is left as an exercise to the reader;)
精彩评论