Perl: How to encode and decode characters in uppercase alpha letters only

2023-03-29 06:29 问答作者：

im dealing with that quite some time now. I have characters from lets say the Latin alphabet and want them to be encoded in uppercase alpha strings only. Is there any module that could do this? Or any BaseX encoding that i can modify to just use uc alpha characters?

i currently have implemented parts of it using regex substitutions, but it only covers a few characters and is definetly not efficient :)

anyway if there is no way to deal with that via a module or function, is there any way to do this efficient via a regex?

i thought about a tr开发者_开发问答/[\+,\-,...]/[PLUS,MINUS,...]/cds;

but it seems like tr only substitutes char by char and not char by sequence of chars :(

any ideas?

achim

To answer the tr question:

%subs = ( '+' => 'PLUS' );
my $pat = join '|', map quotemeta, keys %subs;
s/($pat)/$subs{$1}/g;

Base 26 is possible to do, but it's a bit hard and inefficient to implement since 26 is not a power of 2. But it's definitely what you want. I'll see about coding it up.

In the meantime, here's a base 16 solution:

sub bytes_to_base16 {
   my $e = unpack('H*', $_);
   $e =~ tr/0123456789ABCDEFabcdef/ABCDEFGHIJKLMNOPKLMNOP/;
   return $e;
}

sub base16_to_bytes {
   my $e = $_[0];
   $e =~ tr/ABCDEFGHIJKLMNOP/0123456789ABCDEF/;
   return pack('H*', $_);
}

Let's see how efficient base 26 is compared to base 16:

$ perl -MMath::BigInt -MMath::BigFloat -E'
   my $n = Math::BigInt->new(1);
   my $bs = 0;
   for (1..10) {
      $n <<= 8;
      ++$bs;
      my $bd16 = 2*$bs;
      my $bd26 = Math::BigFloat->new($n)->blog(26, 5)->bceil->numify;
      say sprintf "%2d bytes takes %2d base16 digits or %2d base26 digits.".
                  " base26 is %3.0f%% of the size of base16.",
         $bs, $bd16, $bd26, $bd26/$bd16*100;
      }
'
 1 bytes takes  2 base16 digits or  2 base26 digits. base26 is 100% of the size of base16.
 2 bytes takes  4 base16 digits or  4 base26 digits. base26 is 100% of the size of base16.
 3 bytes takes  6 base16 digits or  6 base26 digits. base26 is 100% of the size of base16.
 4 bytes takes  8 base16 digits or  7 base26 digits. base26 is  88% of the size of base16.
 5 bytes takes 10 base16 digits or  9 base26 digits. base26 is  90% of the size of base16.
 6 bytes takes 12 base16 digits or 11 base26 digits. base26 is  92% of the size of base16.
 7 bytes takes 14 base16 digits or 12 base26 digits. base26 is  86% of the size of base16.
 8 bytes takes 16 base16 digits or 14 base26 digits. base26 is  88% of the size of base16.
 9 bytes takes 18 base16 digits or 16 base26 digits. base26 is  89% of the size of base16.
10 bytes takes 20 base16 digits or 18 base26 digits. base26 is  90% of the size of base16.

An efficient implementation would produce slightly less efficient output.

$ perl -MMath::BigInt -MMath::BigFloat -E'
   my $bs = 0;
   for (1..10) {
      ++$bs;
      my $bd16 = 2*$bs;
      my $bd26 = int($bs/4)*7 + ($bs%4)*2;
      say sprintf "%2d bytes takes %2d base16 digits or %2d base26 digits.".
                  " base26 is %3.0f%% of the size of base16.",
         $bs, $bd16, $bd26, $bd26/$bd16*100;
      }
'
 1 bytes takes  2 base16 digits or  2 base26 digits. base26 is 100% of the size of base16.
 2 bytes takes  4 base16 digits or  4 base26 digits. base26 is 100% of the size of base16.
 3 bytes takes  6 base16 digits or  6 base26 digits. base26 is 100% of the size of base16.
 4 bytes takes  8 base16 digits or  7 base26 digits. base26 is  88% of the size of base16.
 5 bytes takes 10 base16 digits or  9 base26 digits. base26 is  90% of the size of base16.
 6 bytes takes 12 base16 digits or 11 base26 digits. base26 is  92% of the size of base16.
 7 bytes takes 14 base16 digits or 13 base26 digits. base26 is  93% of the size of base16.
 8 bytes takes 16 base16 digits or 14 base26 digits. base26 is  88% of the size of base16.
 9 bytes takes 18 base16 digits or 16 base26 digits. base26 is  89% of the size of base16.
10 bytes takes 20 base16 digits or 18 base26 digits. base26 is  90% of the size of base16.

Note that the efficient implementation uses an extra digits for inputs that are 7 bytes long.

So is it worth the effort of using base26 over base16? Probably not, unless each byte is really precious.

And finally, here's a base 26 implementation.

my @syms = ('A'..'Z');
my %syms = map { $syms[$_] => $_ } 0..$#syms;

sub bytes_to_base26 {
   my $e = '';

   my $full_blocks = int(length($_[0]) / 4);
   for (0..$full_blocks-1) {
      my $block = unpack('N', substr($_[0], $_*4, 4));
      $e .= join '', @syms[
         $block / 26**6 % 26,
         $block / 26**5 % 26,
         $block / 26**4 % 26,
         $block / 26**3 % 26,
         $block / 26**2 % 26,
         $block / 26**1 % 26,
         $block / 26**0 % 26,
      ];
   }

   my $extra = substr($_[0], $full_blocks*4);
   for my $block (unpack('C*', $extra)) {
      $e .= join '', @syms[
         $block / 26**1 % 26,
         $block / 26**0 % 26,
      ];
   }

   return $e;
}

sub base26_to_bytes {
   my $d = '';

   my $full_blocks = int(length($_[0]) / 7);
   for (0..$full_blocks-1) {
      my $block = 0;
      $block = $block*26 + $syms{$_} for unpack '(a)*', substr($_[0], $_*7, 7);
      $d .= pack('N', $block);
   }

   my $extra = substr($_[0], $full_blocks*7);
   my @extra = unpack('(a)*', $extra);
   while (@extra) {
      my $block = 0;
      $block = $block*26 + $syms{ shift(@extra) };
      $block = $block*26 + $syms{ shift(@extra) };
      $d .= pack('C', $block);
   }

   return $d;
}

The simplest approach is to use base16 encoding, as others have suggested, and remap the digits to letters -- but then you're only using 16 out of 26 characters, which is wasteful.

The most efficient possible encoding would be base26, but that would be very difficult -- in effect you'd be treating the entire input as a large binary number and converting it from base 2 to base 26.

log2(26) is just over 4.7, so at best (in the absence of compression) you can encode 4.7 bits per letter. A less wasteful encoding might encode 4 bytes (32 bits) in 7 letters. 7 letters gives you about 32.9 bits of information, so you're not losing as much information. And it can all be done in 32-bit arithmetic. Then you'll have to decide what to do if the input isn't a multiple of 4 bytes.

(The actual implementation is left as an exercise -- at least for now.)

You can use Base32 encoding, with 26 uppercase letters and 6 digits:

http://pastebin.com/YPvfrpHW

Just change the $code array to whatever charset you want to use.

Edit: Whoops, just noticed you're Perl and not PHP, sorry. You should be able to find a Base32 module on CPAN that does the same thing.

Edit 2: FWIW, I see Convert::Base32, Encode::Base32, and MIME::Base32 on CPAN.

For a bit of fun, here's my Enigma simulator. There isn't an easy way to achieve what you want to do as the wheels don't have any escape chars, and any sequences you invent to represent an escape sequence will significantly reduce the strength of the cipher.

However, 8 bit latin input could be mapped from 0-255 to [A-P][A-P] using 65+($Char&15).65+($Char>>4), and reversed on output, but R-Z would be wasted and there would be a lot of holes in the input, though this could be solved running through gzip first.

The Germans usually used X to represent spaces, and spelled out punctuation if really necessary, trying to avoid spelling the same thing twice the same. I know it is annoying, but that is the way it is. If we increase the number of letters on the wheels, then it ceases to be an Enigma machine!

#!/usr/bin/perl
#Tinigma 2010 Usage:tinigma.pl 123 rng ini "GHWVYYDVPQGEWQWVT"
($n,$o,$p)=map(ord()-65,split//,uc$ARGV[1]);($z,$y,$x)=map(ord
()-65,split//,uc$ARGV[2]);($l,$m,$r)=map$_-1,split//,$ARGV[0];
$t=uc$ARGV[3];$t=~s/[^A-Z]//g;$b=26;$j=0;@N=qw(7 25 11 6 1);@R
=('EKMFLGDQVZNTOWYHXUSPAIBRCJ'x3,'AJDKSIRUXBLHWTMCQGZNPYFVOE'x
3,'BDFHJLCPRTXVZNYEIWGAKMUSQO'x3,'ESOVPZJAYQUIRHXLNFTGKDCMWB'x
3,'VZBRGITYUPSDNHLXAWMJQOFECK'x3,'YRUHQSLDPXNGOKMIEBFZCWVJAT'x
3);@t=split//,$t;for$v(@R){$i=0;for(split//,$v){$c=ord($_)-65;
$F[$j][$i]=$c;$R[$j][$c+$b*(int($i/$b))]=$i;$i++}$j++}@S=@{$F[
5]};$f=$y==$F[$m][$N[$m]]?1:0;$i=0;for(@t){if($f){$y++;$y%=$b;
$z++;$z%=$b;$f=0}if($x==$F[$r][$N[$r]]){$y++;$y%=$b;if($y==$F[
$m][$N[$m]]){$f=1}}$x++;$x%=$b;$e.=chr(($R[$r][$R[$m][$R[$l][$
S[$F[$l][$F[$m][$F[$r][ord($_)-39+$x-$n]-$x+$n+$y-$o]-$y+$o+$z
-$p]-$z+$p]+$z-$p]-$z+$p+$y-$o]-$y+$o+$x-$n]-$x+$n)%$b+65)}
print"$e\n"

This solution was already briefly mentioned by Keith Thompson and jrockway.
Here we look into it and implement it.

The problem is very simple if you know that:

Any file (binary or text) can be seen as one big number.
Think of the file's bytes as the digits of a number in a base 2⁸ = 256.
Any number can be converted to any natural base N ≥ 2.
The N digits for representing a number in base N can be chosen freely.
Usually we use 0, 1, 2, … but its possible to use A, B, C, … or even

继续阅读：character-encodingdecodingencodingperlregex


                            更多精彩内容
                            如何在Go语言中高效使用Redis的Pipeline
如何实现Python编写的图形界面可以自由拖动
Linux软连接实现方式
mysql字符集引起的java.sql.SQLException:Incorrect string value:问题
Go使用Redis实现分布式锁的常见方法

Perl: How to encode and decode characters in uppercase alpha letters only

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

王昌瑞《潜梦追凶》剧组庆生新锐演员未来可期？

Is it allowed to ask users to enter credit card details for own payment method?

Escaping "<" in Perl-generated XML

imessage会显示已读吗？

微信重新建群怎么建？