
In a Unicode string, how are planes indicated (or are they not)?

I have read the article by Joel and have done a lot of searching. Every site and article on Unicode talks about how there are 16 bits per code point, but Unicode supports more than 2^16 code points with Unicode planes.

But none of them explain how a Unicode string indicates the plane. Furthermore, this leaves the question of how a Unicode string can hold characters from multiple planes.

So, how are planes indicated in Unicode strings?


Someone can feel free to correct me on this; I'm still learning about Unicode myself.

I think your confusion is between a code point and how an encoding represents that code point. The number of bits or bytes per code point depends on the encoding. Take the simplest example, UTF-32. UTF-32 uses, drum roll please, 32 bits for each code point, so it can directly represent every Unicode character in every plane. UTF-16 is a variable-length encoding: it encodes each code point in one or two 16-bit code units. Code points in the first plane (plane 0, the Basic Multilingual Plane) take a single code unit; code points in the other planes take two code units, called a surrogate pair. UTF-8 is also variable-length and uses one to four bytes per code point. You can read more at http://en.wikipedia.org/wiki/UTF-16 and http://en.wikipedia.org/wiki/UTF-8.
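To make the difference between encodings concrete, here is a small Python sketch that measures how many bytes one character from plane 0 and one from plane 1 take in each encoding. The helper name `encoded_sizes` and the sample characters are my own choices for illustration, and I use the `-le` codec variants only so that no byte order mark is counted.

```python
def encoded_sizes(ch):
    """Return the byte length of a single character in three encodings.

    The little-endian codec names are used so that Python does not
    prepend a byte order mark, which would inflate the counts.
    """
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# Illustrative samples: two characters from plane 0 and one from plane 1.
samples = [
    ("\u0041", "U+0041 LATIN CAPITAL LETTER A (plane 0)"),
    ("\u20ac", "U+20AC EURO SIGN (plane 0)"),
    ("\U0001F600", "U+1F600 GRINNING FACE (plane 1)"),
]

for ch, label in samples:
    print(label, encoded_sizes(ch))
```

The plane 1 character needs four bytes in UTF-16 (two code units), while both plane 0 characters need only two; UTF-32 always uses four.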

In essence, the plane is not indicated separately: it is part of the code point's value, and any encoding that supports a given plane represents it as part of the encoded code point. It's just clearer in the case of UTF-32 than the others.
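As a rough illustration of that point, the sketch below derives the plane from the code point value itself and shows how UTF-16 splits a code point outside plane 0 into a surrogate pair. The function names `plane_of` and `utf16_code_units` are hypothetical helpers written for this answer, not part of any library, and the sketch does not handle lone surrogate code points.

```python
def plane_of(ch):
    """The plane is the code point divided by 0x10000 (65,536 code points per plane)."""
    return ord(ch) >> 16

def utf16_code_units(ch):
    """Return the UTF-16 code units for one character as a list of ints."""
    cp = ord(ch)
    if cp < 0x10000:
        # Plane 0: the code point is stored directly in one 16-bit unit.
        return [cp]
    # Planes 1-16: subtract 0x10000, then split the remaining 20 bits
    # into a high (lead) surrogate and a low (trail) surrogate.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

# One string holding characters from two different planes.
text = "\u0041\u20ac\U0001F600"
for ch in text:
    units = [f"{u:04X}" for u in utf16_code_units(ch)]
    print(f"U+{ord(ch):04X}", "plane", plane_of(ch), "UTF-16 units", units)
```

So a single string can mix planes freely: each code point carries its own plane in its value, and the decoder recovers it from the code units.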


I wrote a chapter that explains this topic (and some other Unicode issues) in a manual for an open-source project. Here is a link to the PDF manual (read Chapter 10). And here is a link to that chapter in the HTML version of the manual.
