Lengths of Unicode strings are different

How come the lengths of the following strings are different, although the number of characters in the strings is the same?

echo strlen("馐 馑 馒 馓 馔 馕 首 馗 馘")."<BR>";
echo strlen("Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ")."<BR>";

Outputs

35
26


The characters in the first string take three bytes each in UTF-8, because their code points are high up in the range (around 39,000), whereas those in the second string take only two bytes each, being around 400. (The number of bytes/octets required per code point is discussed in the UTF-8 Wikipedia article.)

strlen counts the number of bytes in the string, not the number of characters, which is why it gives such odd results for Unicode text.
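
The byte arithmetic can be checked with a small sketch. It assumes the source file is saved as UTF-8; utf8_byte_widths is a hypothetical helper name used only for illustration, not a PHP built-in.

```php
<?php
// Hypothetical helper: returns the UTF-8 byte width of each character.
// Assumes $s is valid UTF-8.
function utf8_byte_widths(string $s): array
{
    // The /u flag makes PCRE split on code points rather than on bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    // strlen on a single character gives the number of bytes it occupies.
    return array_map('strlen', $chars);
}

$widths = utf8_byte_widths("馐 Ɛ A");
// Values: 3, 1, 2, 1, 1 (馐 is 3 bytes, Ɛ is 2, A and each space are 1).
print_r($widths);
// The sum is 7, the same value strlen("馐 Ɛ A") returns.
echo array_sum($widths), "\n";
```

Applied to the strings in the question, the same sum gives 9 × 3 + 8 = 35 and 9 × 2 + 8 = 26, the eight extra bytes being the spaces.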


I am no PHP expert, but it seems that strlen counts bytes... there is mb_strlen, which counts characters...

EDIT - for further reference on how multi-byte encoding works see http://en.wikipedia.org/wiki/Variable-width_encoding, and for UTF-8 in particular see http://en.wikipedia.org/wiki/UTF-8


It looks like it's counting the number of bytes in the encoding being used. For example, it looks like the second string is taking two bytes per non-space character, whereas the first string is taking three bytes per non-space character. I would expect:

echo strlen("A B C D E F G H I");

to print out 17 - a single byte per ASCII character.

My guess is that this is all using the UTF-8 encoding, which would certainly be in line with the varying width of the representation.


According to this post on php.net/strlen, PHP interprets all strings passed to strlen as ASCII; that is, it treats each byte as one character, so a multi-byte character is counted once per byte.


Use mb_strlen; it counts characters in the given encoding, not bytes as strlen does.
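
A minimal sketch of the difference, assuming the mbstring extension is loaded and the string literals are stored as UTF-8 (the variable names are just for illustration):

```php
<?php
// Requires the mbstring extension. The literals are assumed to be UTF-8.
$cjk   = "馐 馑 馒 馓 馔 馕 首 馗 馘";
$latin = "Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ";

// Passing the encoding explicitly avoids depending on the configured
// internal encoding.
echo mb_strlen($cjk, 'UTF-8'), "\n";    // 17 characters (9 symbols + 8 spaces)
echo mb_strlen($latin, 'UTF-8'), "\n";  // 17 characters (9 symbols + 8 spaces)

// strlen still reports bytes for the same strings.
echo strlen($cjk), "\n";    // 35 bytes
echo strlen($latin), "\n";  // 26 bytes
```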
