Why are the lengths of Unicode strings different?
How come the lengths of the following strings are different, although the number of characters in the strings is the same?
echo strlen("馐 馑 馒 馓 馔 馕 首 馗 馘")."<BR>";
echo strlen("Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ")."<BR>";
Outputs
35
26
The characters in the first string take three bytes each, because their code points sit around 39,000 in the Unicode character list, whereas the characters in the second string take only two bytes each, with code points around 400. (The number of bytes/octets required per character is discussed in the UTF-8 Wikipedia article.)
strlen counts the number of bytes in the string, not the number of characters, which is why it gives such odd-looking results for Unicode text.
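The byte counts in the question can be reproduced from UTF-8's code point ranges. The following is only a sketch: utf8ByteLength is a hypothetical helper written for this illustration, not a built-in PHP function.

```php
<?php
// Hypothetical helper (not part of PHP): how many bytes UTF-8 needs
// to encode a single code point.
function utf8ByteLength(int $codePoint): int
{
    if ($codePoint < 0x80) {
        return 1; // ASCII range
    }
    if ($codePoint < 0x800) {
        return 2; // up to U+07FF
    }
    if ($codePoint < 0x10000) {
        return 3; // up to U+FFFF
    }
    return 4;     // everything above U+FFFF
}

echo utf8ByteLength(0x9990), "\n"; // 馐 is U+9990 (decimal 39312): 3
echo utf8ByteLength(0x0190), "\n"; // Ɛ is U+0190 (decimal 400): 2

// Each string has 9 such characters plus 8 single-byte spaces.
echo 9 * utf8ByteLength(0x9990) + 8, "\n"; // 35
echo 9 * utf8ByteLength(0x0190) + 8, "\n"; // 26
```

The two totals match the 35 and 26 reported by strlen in the question.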
I am no PHP expert, but it seems that strlen counts bytes, whereas mb_strlen counts characters.
EDIT: for further reference on how multi-byte encodings work, see http://en.wikipedia.org/wiki/Variable-width_encoding, and for UTF-8 in particular see http://en.wikipedia.org/wiki/UTF-8
It looks like it's counting the number of bytes in the encoding being used. For example, it looks like the second string is taking two bytes per non-space character, whereas the first string is taking three bytes per non-space character. I would expect:
echo strlen("A B C D E F G H I");
to print out 17 - a single byte per ASCII character.
My guess is that this is all using the UTF-8 encoding, which would certainly be in line with the varying width of the representation.
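One way to check that guess is to dump the raw bytes with bin2hex. This sketch assumes PHP 7.0 or later and builds the characters with "\u{...}" escapes, which always produce UTF-8 regardless of how the source file is saved. If your own string literals give the same hex, the file is UTF-8.

```php
<?php
// "\u{...}" escapes (PHP 7.0+) always emit the UTF-8 encoding of the code point.
$cjk   = "\u{9990}"; // 馐
$latin = "\u{0190}"; // Ɛ

$cjkHex   = bin2hex($cjk);
$latinHex = bin2hex($latin);

echo $cjkHex, "\n";   // e9a690 (three bytes)
echo $latinHex, "\n"; // c690   (two bytes)

// Pure ASCII: one byte per character, spaces included.
echo strlen("A B C D E F G H I"), "\n"; // 17
```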
According to this post on php.net/strlen, PHP interprets all strings passed to strlen as ASCII, that is, as one byte per character, so the result is a byte count.
Use mb_strlen: it counts characters in the given encoding, not bytes as strlen does.
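A minimal sketch of the difference, assuming the mbstring extension is loaded and this file is saved as UTF-8. The encoding is passed explicitly so the result does not depend on the configured default.

```php
<?php
$cjk   = "馐 馑 馒 馓 馔 馕 首 馗 馘";
$latin = "Ɛ Ƒ ƒ Ɠ Ɣ ƕ Ɩ Ɨ Ƙ";

// strlen counts bytes.
echo strlen($cjk), "\n";   // 35
echo strlen($latin), "\n"; // 26

// mb_strlen counts characters in the given encoding.
echo mb_strlen($cjk, 'UTF-8'), "\n";   // 17
echo mb_strlen($latin, 'UTF-8'), "\n"; // 17
```

Both strings contain 9 characters and 8 spaces, so mb_strlen reports 17 for each.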