Perl's use encoding pragma breaking UTF strings
I have a problem with Perl and the encoding pragma.
(I use UTF-8 everywhere: in input, in output, and in the Perl scripts themselves. I never want to use any other encoding.)
However, when I write
binmode(STDOUT, ':utf8');
use utf8;
$r = "\x{ed}";
print $r;
I see the string "í", which is what I want (it is the Unicode character U+00ED). But when I add the "use encoding" pragma like this
binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
print $r;
all I see is a box character. Why?
Moreover, when I add Data::Dumper and let Dumper print the new string like this
binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
use Data::Dumper;
print Dumper($r);
I see that Perl changed the string to "\x{fffd}". Why?
use encoding 'utf8' is broken. Rather than interpreting \x{ed} as the code point U+00ED, it interprets it as the single byte 237 and then tries to decode that byte as UTF-8. A lone byte 237 is not valid UTF-8, so the decoding fails and Perl substitutes the replacement character U+FFFD, literally "�".
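The misinterpretation described above can be reproduced without the pragma. This is only a sketch of the effect, not of how use encoding is implemented: it hands the single byte 0xED to Encode's decode, which (with its default error handling) substitutes U+FFFD for the malformed input. The helper name misdecode_as_utf8 is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: treat the given string as raw bytes and decode
# them as UTF-8, the way the answer says "use encoding" treats \x{ed}.
sub misdecode_as_utf8 {
    my ($bytes) = @_;
    # With no CHECK argument, decode() replaces malformed sequences
    # with U+FFFD instead of dying.
    return decode('UTF-8', $bytes);
}

# 0xED is a UTF-8 lead byte with no continuation bytes after it,
# so it cannot be decoded on its own.
my $result = misdecode_as_utf8("\xED");
printf "U+%04X\n", ord $result;   # prints U+FFFD
```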
Just stick with use utf8 to specify that your source is in UTF-8, and binmode or the open pragma to specify the encoding for your file handles.
Your actual code needs neither use encoding nor use utf8 to run properly -- the only thing it depends on is the encoding layer on STDOUT.
binmode(STDOUT, ":utf8");
print "\xed";
is an equally valid complete program that does what you want.
use utf8 should be used only if you have UTF-8 in literal strings in your program -- e.g. if you had written
my $r = "í";
then use utf8 would cause that string to be interpreted as the single character U+00ED instead of the series of bytes C3 AD.
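To see the difference in practice, here is a small sketch that measures the same literal with and without use utf8 in effect. It assumes the script file itself is saved as UTF-8; the sub name literal_lengths is just an illustrative choice.

```perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8, so the literal
# below is stored on disk as the two bytes C3 AD.
sub literal_lengths {
    my ($chars, $bytes);
    {
        use utf8;                 # literals are decoded as UTF-8 text
        $chars = length "í";      # one character, U+00ED
    }
    {
        no utf8;                  # literals are taken as raw bytes
        $bytes = length "í";      # two bytes, C3 AD
    }
    return ($chars, $bytes);
}

my ($chars, $bytes) = literal_lengths();
print "with use utf8: $chars, without: $bytes\n";
```

Because use utf8 is lexically scoped, the two blocks can sit side by side in one file.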
use encoding should never be used, especially by someone who likes Unicode. If you want the encoding of stdin/stdout to be changed, you should use -C or PERLUNICODE, or binmode them yourself; and if you want other handles to be automatically opened with encoding layers, you should use the open pragma.
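As a rough sketch of the open pragma approach, the following writes U+00ED through a handle that picks up its encoding layer from the pragma, then reads the file back as raw bytes to show what actually landed on disk. The helper name roundtrip_bytes and the use of a temporary file are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Default layers for open() in this scope, plus STDIN/STDOUT/STDERR (:std).
use open qw(:std :encoding(UTF-8));

# Hypothetical helper: write text via a default-layer handle and
# return the raw bytes that ended up in the file.
sub roundtrip_bytes {
    my ($text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    # No layer in the mode, so the pragma's :encoding(UTF-8) applies.
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out;

    # An explicit layer in the mode takes precedence over the default.
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;                      # slurp the whole file
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

my $bytes = roundtrip_bytes("\x{ed}");
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";  # C3 AD
```

For STDIN/STDOUT alone, running the script with perl -CS or calling binmode on the handles should give a similar effect without the pragma.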