Unicode::Normalize - query about the 'Normalization Form'
#!/usr/local/bin/perl
use warnings;
use 5.014;
use Unicode::Normalize qw(NFD NFC compose);
my $string1 = "\x{f5}";
my $NFD_string1 = NFD( $string1 );
# PV = 0x831150 "o\314\203"\0 [UTF8 "o\x{303}"] *
my $composed_NFD_string1 = compose( $NFD_string1 );
# PV = 0x77bc40 "\303\265"\0 [UTF8 "\x{f5}"] *
my $NFC_string1 = NFC( $string1 );
# PV = 0x836e30 "\303\265"\0 [UTF8 "\x{f5}"] *
my $string2 = "o\x{303}";
my $NFD_string2 = NFD( $string2 );
# PV = 0x780da0 "o\314\203"\0 [UTF8 "o\x{303}"] *
my $composed_NFD_string2 = compose( $NFD_string2 );
# PV = 0x782dc0 "\303\265"\0 [UTF8 "\x{f5}"] *
my $NFC_string2 = NFC( $string2 );
# PV = 0x7acba0 "\303\265"\0 [UTF8 "\x{f5}"] *
# * from Devel::Peek::Dump output
say 'OK' if $NFD_string1 eq $NFD_string2;
say 'OK' if $NFC_string1 eq $NFC_string2;
Output:
OK
OK
After trying this, I asked myself:
Is there a reason to use Normalization Form D instead of Normalization Form C?
Not everything has a composite form, and NFC actually does an NFD first. Part of NFD is putting continuation characters in order after the starter character so you can compare two grapheme clusters (the fancy name for a starter along with its continuation characters) to see if they are the same. For what you are doing in this example, you should get the same answers, but NFC actually does more work.
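Here is a minimal sketch of both points, assuming only the Unicode::Normalize module already used in the question. The q-with-tilde string is my own illustration: as far as I know, Unicode has no precomposed character for it, so NFC has nothing to compose it into. The decompose, reorder and compose functions are the building blocks the module documents for NFD and NFC.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD NFC decompose reorder compose);

# o + COMBINING TILDE has a precomposed form (U+00F5).
my $o_tilde = "o\x{303}";
# q + COMBINING TILDE is an illustrative sequence with no precomposed form.
my $q_tilde = "q\x{303}";

my $nfc_o = NFC($o_tilde);
my $nfc_q = NFC($q_tilde);

printf "o + tilde: NFC length %d\n", length $nfc_o;
printf "q + tilde: NFC length %d\n", length $nfc_q;

# NFC is NFD (decompose, then reorder) followed by an extra compose step.
my $by_hand    = compose( reorder( decompose($o_tilde) ) );
my $same_as_nfc = $by_hand eq $nfc_o ? 1 : 0;
print "manual pipeline matches NFC: $same_as_nfc\n";
```

If all you need is a canonical form for comparison, NFD gets you there without the final compose step.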
There are a couple of reasons why only some things have a special composed version. Many of the composed characters came from historical character sets. The composed version of é is there to make the Latin-1 people happy. There are also the separate e and ´ versions, designed to let you build the grapheme on your own. There are many ways to do that, and it's not just accents and diacriticals. Grapheme clusters can have several of those continuation characters, and as you build them yourself, you can put them in any order you like (for whatever reason). However, each one has an assigned weight (its canonical combining class). NFD will reorder them by their weights so you can compare two grapheme clusters despite the order you used. Marks with the same weight keep their relative order, because swapping those can change how the cluster renders.
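A small sketch of that reordering, again assuming nothing beyond Unicode::Normalize. The two strings are my own example: the same base letter with a dot above (U+0307) and a dot below (U+0323), typed in opposite orders.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Same grapheme cluster, combining marks entered in different orders.
my $above_first = "q\x{307}\x{323}";   # dot above, then dot below
my $below_first = "q\x{323}\x{307}";   # dot below, then dot above

# A plain string comparison sees two different code point sequences.
my $raw_equal = $above_first eq $below_first ? 1 : 0;

# After NFD, both have their marks in the same canonical order.
my $nfd_equal = NFD($above_first) eq NFD($below_first) ? 1 : 0;

print "equal before NFD: $raw_equal\n";
print "equal after NFD:  $nfd_equal\n";
```

The dot below has the lower weight, so it ends up first in both normalized strings.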
It's all in Unicode Technical Report 15 (Unicode Normalization Forms, now published as Unicode Standard Annex #15), just as daxim said in the comment. You'll want to see the diagrams and read around the part that says:
Once a string has been fully decomposed, any sequences of combining marks that it contains are put into a well-defined order. This rearrangement of combining marks is done according to a subpart of the Unicode Normalization Algorithm known as the Canonical Ordering Algorithm. That algorithm sorts sequences of combining marks based on the value of their Canonical_Combining_Class (ccc) property, whose values are also defined in UnicodeData.txt. Most characters (including all non-combining marks) have a Canonical_Combining_Class value of zero, and are unaffected by the Canonical Ordering Algorithm. Such characters are referred to by a special term, starter. Only the subset of combining marks which have non-zero Canonical_Combining_Class property values are subject to potential reordering by the Canonical Ordering Algorithm. Those characters are called non-starters.
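You can look at those Canonical_Combining_Class values yourself. This is only a sketch: it uses getCombinClass, which Unicode::Normalize exports on request, and the three code points are ones I picked for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(getCombinClass);

# Code points chosen for illustration: one starter and two non-starters.
my %ccc = map { $_ => getCombinClass($_) } ( 0x6F, 0x303, 0x323 );

for my $cp ( sort { $a <=> $b } keys %ccc ) {
    my $kind = $ccc{$cp} == 0 ? 'starter' : 'non-starter';
    printf "U+%04X ccc=%d (%s)\n", $cp, $ccc{$cp}, $kind;
}
```

The letter o has class zero, so it is a starter and never moves. The two combining marks have different non-zero classes, which is what lets the Canonical Ordering Algorithm put them in a fixed order.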
Some things explicitly use NFD for their data, such as the HFS+ file system. That doesn't much matter in many cases because your programming language probably binds to library functions that transform your filename strings into the right form.
Sometime later today I'll be uploading Unicode::Support which demonstrates many of these things.
And, later today, Tom will come along and school us all. :)