Perl's use encoding pragma breaking UTF strings
I have a problem with Perl and the encoding pragma.
(I use UTF-8 everywhere: in input, in output, and in the Perl scripts themselves. I don't want to use any other encoding, ever.)
However, when I write
binmode(STDOUT, ':utf8');
use utf8;
$r = "\x{ed}";
print $r;
I see the string "í" (which is what I want, and which is the Unicode character U+00ED). But when I add the "use encoding" pragma like this
binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
print $r;
all I see is a box character. Why?
Moreover, when I add Data::Dumper and let the Dumper print the new string like this
binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
use Data::Dumper;
print Dumper($r);
I see that perl changed the string to "\x{fffd}". Why?
use encoding 'utf8' is broken. Rather than interpreting \x{ed} as the code point U+00ED, it interprets it as the single byte 237 (0xED) and then tries to decode that byte as UTF-8. A lone 0xED byte is not valid UTF-8, so the decode fails and the byte is replaced with the replacement character U+FFFD, literally "�".
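To see why a lone byte 237 cannot survive a UTF-8 decode, here is a minimal sketch that calls the Encode module directly. It does not load the encoding pragma, so it only approximates what the pragma does internally; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The lone byte 0xED starts a three-byte UTF-8 sequence that never
# finishes, so the decoder substitutes the replacement character.
my $lone_octet = "\xED";
my $from_lone  = decode('UTF-8', $lone_octet);

# The two bytes C3 AD are the valid UTF-8 encoding of U+00ED.
my $valid_octets = "\xC3\xAD";
my $from_valid   = decode('UTF-8', $valid_octets);

printf "lone ED -> U+%04X\n", ord $from_lone;
printf "C3 AD   -> U+%04X\n", ord $from_valid;
```

With Encode's default (lenient) error handling, the first line should report U+FFFD and the second U+00ED, which matches the "\x{fffd}" that Data::Dumper shows in the question.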
Just stick with use utf8 to specify that your source is in UTF-8, and binmode or the open pragma to specify the encoding for your file handles.
Your actual code needs neither use encoding nor use utf8 to run properly -- the only thing it depends on is the encoding layer on STDOUT.
binmode(STDOUT, ":utf8");
print "\xed";
is an equally valid complete program that does what you want.
use utf8 should be used only if you have UTF-8 in literal strings in your program -- e.g. if you had written
my $r = "í";
then use utf8 would cause that string to be interpreted as the single character U+00ED instead of the series of bytes C3 AD.
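A small sketch of that difference, assuming the script file itself is saved as UTF-8 (the variable names are only for illustration). Because the utf8 pragma is lexically scoped, both interpretations can sit side by side:

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.
my $as_bytes = do { no utf8;  "í" };   # read as two bytes: C3 AD
my $as_chars = do { use utf8; "í" };   # read as one character: U+00ED

printf "without use utf8: length %d\n", length $as_bytes;
printf "with use utf8:    length %d\n", length $as_chars;
```

Only the literal "í" is affected. An escape such as "\x{ed}" means U+00ED with or without use utf8.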
use encoding should never be used, especially by someone who likes Unicode. If you want the encoding of STDIN/STDOUT to be changed, you should use -C or PERLUNICODE, or binmode them yourself, and if you want other handles to be opened automatically with encoding layers, you should use the open pragma (use open).
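As a rough sketch of the open pragma route, the following sets a UTF-8 layer on the standard handles and on handles opened later in the same lexical scope. Choosing :encoding(UTF-8) rather than the :utf8 layer used in the question is an assumption of this example, not something the original code requires.

```perl
use strict;
use warnings;

# :std applies the layer to STDIN, STDOUT and STDERR as well as to
# handles opened later in this lexical scope.
use open qw(:std :encoding(UTF-8));

my $r = "\x{ed}";    # the character U+00ED
print $r, "\n";      # leaves the program as the bytes C3 AD 0A
```

The command-line equivalents for the standard handles are perl -CS script.pl, or setting PERLUNICODE=S in the environment. Neither requires any change to the script.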