Question about the "utf-8"-behavior
#!/usr/bin/env perl
use warnings;
use 5.012;
use Encode qw(encode);
no warnings qw(utf8);
my $c = "\x{ffff}";
my $utf_8 = encode( 'utf-8', $c );
my $utf8 = encode( 'utf8', $c );
say "utf-8 : @{[ unpack '(B8)*', $utf_8 ]}";
say "utf8 : @{[ unpack '(B8)*', $utf8 ]}";
# utf-8 : 11101111 10111111 10111101
# utf8 : 11101111 10111111 10111111
Does "utf-8" encode this way in order to automatically fix my codepoint to the last interchangeable codepoint (of the first plane)?
See the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.
To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and it allows basically any codepoint, regardless of what the Unicode standard says about that codepoint.
The other encoding is called utf-8 (a.k.a. utf-8-strict). This allows only codepoints that the Unicode standard permits for interchange.
\x{FFFF} is a noncharacter: Unicode reserves it for internal use and does not intend it for interchange. But Perl's utf8 encoding doesn't care about that.
By default, the encode function replaces any character that does not exist in the destination charset with a substitution character (see the Handling Malformed Data section). For utf-8, that substitution character is U+FFFD (REPLACEMENT CHARACTER), which is encoded in UTF-8 as 11101111 10111111 10111101 (binary). So utf-8 is not clamping your codepoint to the nearest valid one; the three bytes in your output are the replacement character. It only looks like clamping because U+FFFD happens to sit just below the noncharacters U+FFFE and U+FFFF.
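If silent substitution is not what you want, the CHECK argument of encode can turn it into an error instead. The following is a minimal sketch, not a definitive recipe. It assumes the documented Encode::FB_CROAK and Encode::LEAVE_SRC constants, and it uses U+110000 (a value beyond the Unicode range) rather than \x{FFFF}, because the treatment of noncharacters may differ between Encode versions. The helper name hex_bytes is made up for this example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);
no warnings qw(utf8);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', $_ } unpack 'C*', shift }

# U+110000 is beyond the Unicode range, so strict UTF-8 cannot represent it.
my $char = "\x{110000}";

# Default CHECK: the character is silently replaced with U+FFFD.
my $substituted = encode( 'UTF-8', $char );

# FB_CROAK makes encode die instead of substituting.
# LEAVE_SRC stops encode from modifying $char in place when CHECK is set.
my $strict = eval {
    encode( 'UTF-8', $char, Encode::FB_CROAK | Encode::LEAVE_SRC );
};
my $croaked = defined $strict ? 0 : 1;

# Perl's lax utf8 encoding accepts the value as-is.
my $lax = encode( 'utf8', $char );

print "default : ", hex_bytes($substituted), "\n";
print "croaked : ", ( $croaked ? "yes" : "no" ), "\n";
print "lax     : ", hex_bytes($lax), "\n";
```

Wrapping the call in eval lets the program decide what to do with a character that cannot be encoded, rather than discovering a U+FFFD in the output later.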