PHP and Unicode: Weirdness between Windows and Linux

Look at IBM's Unicode for the working PHP programmer, especially listings 3 and 4.

On Ubuntu Lucid I get the same output from the code as IBM does, viz:

Здравсствуйте
Array
(
    [1] => 65279
    [2] => 1047
    [3] => 1076
    [4] => 1088
    [5] => 1072
    [6] => 1074
    [7] => 1089
    [8] => 1089
    [9] => 1090
    [10] => 1074
    [11] => 1091
    [12] => 1081
    [13] => 1090
    [14] => 1077
)
Здравсствуйте

However, on Windows I get a completely different response.

ðùð┤ÐÇð░ð▓ÐüÐüÐéð▓Ðâð╣ÐéðÁ
Array
(
    [1] => -131072
    [2] => 386138112
    [3] => 872677376
    [4] => 1074003968
    [5] => 805568512
    [6] => 839122944
    [7] => 1090781184
    [8] => 1090781184
    [9] => 1107558400
    [10] => 839122944
    [11] => 1124335616
    [12] => 956563456
    [13] => 1107558400
    [14] => 889454592
)
ðùð┤ÐÇð░ð▓ÐüÐüÐéð▓Ðâð╣ÐéðÁ

Aside from the fact that the Russian characters (which are in UTF-32) don't render in a CMD.EXE shell (because they're in UTF-32 not Windows' own UTF-16), why do the character values differ so significantly?


function utf8_to_unicode_code($utf8_string)
{
    $expanded = iconv("UTF-8", "UTF-32", $utf8_string);
    return unpack("L*", $expanded);
}

This does two things wrong:

  1. It uses “UTF-32”, which will drop an unwanted BOM at the start of the string, which is why you get 65279 (0xFEFF BOM). You don't want stray BOMs hanging around the place causing trouble.

  2. It unpacks with machine-specific byte order (capital L), which iconv's output may well not agree with. To be honest I wouldn't have expected a clash on a Windows box (i386 is little-endian regardless of OS), but clearly there is one: the values you've got are exactly what a reversed byte order would produce.
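To see that the Windows numbers really are the Linux numbers with their bytes reversed, here is a small sketch. The helper name swap_bytes_32 is made up for illustration; it packs a value as big-endian (N) and reads the same four bytes back as little-endian (V).

```php
<?php
// Hypothetical helper: reinterpret a 32-bit value with its four bytes reversed.
function swap_bytes_32(int $value): int
{
    // Write the value big-endian, then read those bytes back little-endian.
    $unpacked = unpack("V", pack("N", $value));
    return $unpacked[1];
}

printf("%d\n", swap_bytes_32(1047)); // 386138112, the Windows value at index 2
printf("%d\n", swap_bytes_32(1076)); // 872677376, the Windows value at index 3
```

The BOM follows the same pattern: 65279 is 0x0000FEFF, which reversed is 0xFFFE0000, and read as a signed 32-bit integer that is -131072. This suggests iconv on that Windows box produced big-endian UTF-32 while unpack read it as little-endian.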

Better to state both byte orderings explicitly, and avoid the BOM. Use UCS-4LE as the encoding, and unpack with V*. The same goes for unicode_code_to_utf8.
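A minimal sketch of what that could look like, keeping the function names from the IBM listings. The type declarations and the error check are my own additions, not part of the original article.

```php
<?php
// Convert a UTF-8 string to an array of Unicode code points.
// UCS-4LE fixes the byte order and emits no BOM; V* reads
// unsigned 32-bit little-endian values regardless of the host CPU.
function utf8_to_unicode_code(string $utf8_string): array
{
    $expanded = iconv("UTF-8", "UCS-4LE", $utf8_string);
    if ($expanded === false) {
        // Added for illustration: iconv returns false on invalid input.
        throw new InvalidArgumentException("Input is not valid UTF-8");
    }
    return unpack("V*", $expanded); // note: result keys start at 1
}

// Convert an array of Unicode code points back to a UTF-8 string.
function unicode_code_to_utf8(array $code_points): string
{
    $packed = pack("V*", ...$code_points);
    return iconv("UCS-4LE", "UTF-8", $packed);
}

print_r(utf8_to_unicode_code("Зд")); // [1] => 1047, [2] => 1076, no 65279 in front
```

With both sides pinned to little-endian, the result should be the same on Windows and Linux, and the array starts directly with the first real character.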

Also ignore listing 6. The ellipsis character—like the fi-ligature and others—is a ‘compatibility character’ which we wouldn't use in the modern Unicode-and-OpenType world. It's up to the font to provide contextual alternatives for fi or ... if it wants to, instead of requiring us to mangle the text.
