Question about the "utf-8"-behavior

#!/usr/bin/env perl
use warnings;
use 5.012;
use Encode qw(encode);

no warnings qw(utf8);

my $c = "\x{ffff}";

my $utf_8 = encode( 'utf-8', $c );
my $utf8 = encode( 'utf8', $c );

say "utf-8 :  @{[ unpack '(B8)*', $utf_8 ]}";
say "utf8  :  @{[ unpack '(B8)*', $utf8 ]}";

# utf-8 :  11101111 10111111 10111101
# utf8  :  11101111 10111111 10111111

Does the "utf-8" encoding behave this way in order to automatically replace my codepoint with the last interchangeable codepoint (of the first plane)?


See the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.

To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.

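The other encoding is called utf-8 (a.k.a. utf-8-strict). This accepts only codepoints that the Unicode standard permits for interchange: the range 0..0x10FFFF, excluding surrogates and, as the output in the question shows, noncharacters such as U+FFFF.

\x{FFFF} is a noncharacter: Unicode reserves it for internal use, so it is not meant to be exchanged as text. Perl's utf8 encoding doesn't care about that and encodes it as-is.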

By default, the encode function replaces any character that does not exist in the destination charset with a substitution character (see the Handling Malformed Data section). For utf-8, that substitution character is U+FFFD (REPLACEMENT CHARACTER), which is encoded in UTF-8 as 11101111 10111111 10111101 (binary).
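
If you would rather detect the problem than get a silent substitution, encode accepts a third CHECK argument. Below is a minimal sketch, assuming Encode::FB_CROAK suits your use case. try_encode is a hypothetical helper name, not part of Encode. Whether the strict encoding rejects U+FFFF can depend on the Encode version installed, so no output is shown for that case.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

no warnings qw(utf8);

# Hypothetical helper: returns the encoded bytes as hex pairs,
# or undef if Encode refuses the character under FB_CROAK.
sub try_encode {
    my ($encoding, $char) = @_;

    # With a CHECK value set, encode() may modify its source argument,
    # so work on a copy.
    my $copy  = $char;
    my $bytes = eval { encode($encoding, $copy, Encode::FB_CROAK()) };

    return defined $bytes ? join(' ', unpack('(H2)*', $bytes)) : undef;
}

for my $encoding (qw(utf8 utf-8)) {
    my $hex = try_encode($encoding, "\x{ffff}");
    printf "%-5s : %s\n", $encoding, $hex // 'rejected (encode died)';
}
```

The lax utf8 encoding always yields ef bf bf here, matching the second line of output in the question. For utf-8, an Encode that substitutes U+FFFD by default will instead die under FB_CROAK, and the helper reports that as a rejection.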
