Perl: How to encode and decode characters in uppercase alpha letters only
im dealing with that quite some time now. I have characters from lets say the Latin alphabet and want them to be encoded in uppercase alpha strings only. Is there any module that could do this? Or any BaseX encoding that i can modify to just use uc alpha characters?
i currently have implemented parts of it using regex substitutions, but it only covers a few characters and is definetly not efficient :)
anyway if there is no way to deal with that via a module or function, is there any way to do this efficient via a regex?
i thought about a tr开发者_开发问答/[\+,\-,...]/[PLUS,MINUS,...]/cds;
but it seems like tr only substitutes char by char and not char by sequence of chars :(
any ideas?
achim
To answer the tr
question:
%subs = ( '+' => 'PLUS' );
my $pat = join '|', map quotemeta, keys %subs;
s/($pat)/$subs{$1}/g;
Base 26 is possible to do, but it's a bit hard and inefficient to implement since 26 is not a power of 2. But it's definitely what you want. I'll see about coding it up.
In the meantime, here's a base 16 solution:
sub bytes_to_base16 {
my $e = unpack('H*', $_);
$e =~ tr/0123456789ABCDEFabcdef/ABCDEFGHIJKLMNOPKLMNOP/;
return $e;
}
sub base16_to_bytes {
my $e = $_[0];
$e =~ tr/ABCDEFGHIJKLMNOP/0123456789ABCDEF/;
return pack('H*', $_);
}
Let's see how efficient base 26 is compared to base 16:
$ perl -MMath::BigInt -MMath::BigFloat -E'
my $n = Math::BigInt->new(1);
my $bs = 0;
for (1..10) {
$n <<= 8;
++$bs;
my $bd16 = 2*$bs;
my $bd26 = Math::BigFloat->new($n)->blog(26, 5)->bceil->numify;
say sprintf "%2d bytes takes %2d base16 digits or %2d base26 digits.".
" base26 is %3.0f%% of the size of base16.",
$bs, $bd16, $bd26, $bd26/$bd16*100;
}
'
1 bytes takes 2 base16 digits or 2 base26 digits. base26 is 100% of the size of base16.
2 bytes takes 4 base16 digits or 4 base26 digits. base26 is 100% of the size of base16.
3 bytes takes 6 base16 digits or 6 base26 digits. base26 is 100% of the size of base16.
4 bytes takes 8 base16 digits or 7 base26 digits. base26 is 88% of the size of base16.
5 bytes takes 10 base16 digits or 9 base26 digits. base26 is 90% of the size of base16.
6 bytes takes 12 base16 digits or 11 base26 digits. base26 is 92% of the size of base16.
7 bytes takes 14 base16 digits or 12 base26 digits. base26 is 86% of the size of base16.
8 bytes takes 16 base16 digits or 14 base26 digits. base26 is 88% of the size of base16.
9 bytes takes 18 base16 digits or 16 base26 digits. base26 is 89% of the size of base16.
10 bytes takes 20 base16 digits or 18 base26 digits. base26 is 90% of the size of base16.
An efficient implementation would produce slightly less efficient output.
$ perl -MMath::BigInt -MMath::BigFloat -E'
my $bs = 0;
for (1..10) {
++$bs;
my $bd16 = 2*$bs;
my $bd26 = int($bs/4)*7 + ($bs%4)*2;
say sprintf "%2d bytes takes %2d base16 digits or %2d base26 digits.".
" base26 is %3.0f%% of the size of base16.",
$bs, $bd16, $bd26, $bd26/$bd16*100;
}
'
1 bytes takes 2 base16 digits or 2 base26 digits. base26 is 100% of the size of base16.
2 bytes takes 4 base16 digits or 4 base26 digits. base26 is 100% of the size of base16.
3 bytes takes 6 base16 digits or 6 base26 digits. base26 is 100% of the size of base16.
4 bytes takes 8 base16 digits or 7 base26 digits. base26 is 88% of the size of base16.
5 bytes takes 10 base16 digits or 9 base26 digits. base26 is 90% of the size of base16.
6 bytes takes 12 base16 digits or 11 base26 digits. base26 is 92% of the size of base16.
7 bytes takes 14 base16 digits or 13 base26 digits. base26 is 93% of the size of base16.
8 bytes takes 16 base16 digits or 14 base26 digits. base26 is 88% of the size of base16.
9 bytes takes 18 base16 digits or 16 base26 digits. base26 is 89% of the size of base16.
10 bytes takes 20 base16 digits or 18 base26 digits. base26 is 90% of the size of base16.
Note that the efficient implementation uses an extra digits for inputs that are 7 bytes long.
So is it worth the effort of using base26 over base16? Probably not, unless each byte is really precious.
And finally, here's a base 26 implementation.
my @syms = ('A'..'Z');
my %syms = map { $syms[$_] => $_ } 0..$#syms;
sub bytes_to_base26 {
my $e = '';
my $full_blocks = int(length($_[0]) / 4);
for (0..$full_blocks-1) {
my $block = unpack('N', substr($_[0], $_*4, 4));
$e .= join '', @syms[
$block / 26**6 % 26,
$block / 26**5 % 26,
$block / 26**4 % 26,
$block / 26**3 % 26,
$block / 26**2 % 26,
$block / 26**1 % 26,
$block / 26**0 % 26,
];
}
my $extra = substr($_[0], $full_blocks*4);
for my $block (unpack('C*', $extra)) {
$e .= join '', @syms[
$block / 26**1 % 26,
$block / 26**0 % 26,
];
}
return $e;
}
sub base26_to_bytes {
my $d = '';
my $full_blocks = int(length($_[0]) / 7);
for (0..$full_blocks-1) {
my $block = 0;
$block = $block*26 + $syms{$_} for unpack '(a)*', substr($_[0], $_*7, 7);
$d .= pack('N', $block);
}
my $extra = substr($_[0], $full_blocks*7);
my @extra = unpack('(a)*', $extra);
while (@extra) {
my $block = 0;
$block = $block*26 + $syms{ shift(@extra) };
$block = $block*26 + $syms{ shift(@extra) };
$d .= pack('C', $block);
}
return $d;
}
The simplest approach is to use base16 encoding, as others have suggested, and remap the digits to letters -- but then you're only using 16 out of 26 characters, which is wasteful.
The most efficient possible encoding would be base26, but that would be very difficult -- in effect you'd be treating the entire input as a large binary number and converting it from base 2 to base 26.
log2(26) is just over 4.7, so at best (in the absence of compression) you can encode 4.7 bits per letter. A less wasteful encoding might encode 4 bytes (32 bits) in 7 letters. 7 letters gives you about 32.9 bits of information, so you're not losing as much information. And it can all be done in 32-bit arithmetic. Then you'll have to decide what to do if the input isn't a multiple of 4 bytes.
(The actual implementation is left as an exercise -- at least for now.)
You can use Base32 encoding, with 26 uppercase letters and 6 digits:
http://pastebin.com/YPvfrpHW
Just change the $code
array to whatever charset you want to use.
Edit: Whoops, just noticed you're Perl and not PHP, sorry. You should be able to find a Base32 module on CPAN that does the same thing.
Edit 2: FWIW, I see Convert::Base32, Encode::Base32, and MIME::Base32 on CPAN.
For a bit of fun, here's my Enigma simulator. There isn't an easy way to achieve what you want to do as the wheels don't have any escape chars, and any sequences you invent to represent an escape sequence will significantly reduce the strength of the cipher.
However, 8 bit latin input could be mapped from 0-255 to [A-P][A-P] using 65+($Char&15).65+($Char>>4), and reversed on output, but R-Z would be wasted and there would be a lot of holes in the input, though this could be solved running through gzip first.
The Germans usually used X to represent spaces, and spelled out punctuation if really necessary, trying to avoid spelling the same thing twice the same. I know it is annoying, but that is the way it is. If we increase the number of letters on the wheels, then it ceases to be an Enigma machine!
#!/usr/bin/perl
#Tinigma 2010 Usage:tinigma.pl 123 rng ini "GHWVYYDVPQGEWQWVT"
($n,$o,$p)=map(ord()-65,split//,uc$ARGV[1]);($z,$y,$x)=map(ord
()-65,split//,uc$ARGV[2]);($l,$m,$r)=map$_-1,split//,$ARGV[0];
$t=uc$ARGV[3];$t=~s/[^A-Z]//g;$b=26;$j=0;@N=qw(7 25 11 6 1);@R
=('EKMFLGDQVZNTOWYHXUSPAIBRCJ'x3,'AJDKSIRUXBLHWTMCQGZNPYFVOE'x
3,'BDFHJLCPRTXVZNYEIWGAKMUSQO'x3,'ESOVPZJAYQUIRHXLNFTGKDCMWB'x
3,'VZBRGITYUPSDNHLXAWMJQOFECK'x3,'YRUHQSLDPXNGOKMIEBFZCWVJAT'x
3);@t=split//,$t;for$v(@R){$i=0;for(split//,$v){$c=ord($_)-65;
$F[$j][$i]=$c;$R[$j][$c+$b*(int($i/$b))]=$i;$i++}$j++}@S=@{$F[
5]};$f=$y==$F[$m][$N[$m]]?1:0;$i=0;for(@t){if($f){$y++;$y%=$b;
$z++;$z%=$b;$f=0}if($x==$F[$r][$N[$r]]){$y++;$y%=$b;if($y==$F[
$m][$N[$m]]){$f=1}}$x++;$x%=$b;$e.=chr(($R[$r][$R[$m][$R[$l][$
S[$F[$l][$F[$m][$F[$r][ord($_)-39+$x-$n]-$x+$n+$y-$o]-$y+$o+$z
-$p]-$z+$p]+$z-$p]-$z+$p+$y-$o]-$y+$o+$x-$n]-$x+$n)%$b+65)}
print"$e\n"
This solution was already briefly mentioned by Keith Thompson and jrockway.
Here we look into it and implement it.
The problem is very simple if you know that:
- Any file (binary or text) can be seen as one big number.
Think of the file's bytes as the digits of a number in a base 28 = 256. - Any number can be converted to any natural base N ≥ 2.
- The N digits for representing a number in base N can be chosen freely.
Usually we use0
,1
,2
, … but its possible to useA
,B
,C
, … or even
精彩评论