
Unicode code point limit

As explained here, all Unicode encodings end at the largest code point, U+10FFFF. But I've heard differently, that they can go up to 6 bytes. Is that true?


UTF-8 underwent some changes during its life, and there are many specifications (most of which are outdated now) that standardized UTF-8. Most of the changes were introduced to improve compatibility with UTF-16 and to accommodate the ever-growing number of code points.

To make a long story short, UTF-8 was originally specified to allow code points of up to 31 bits (or 6 bytes). With RFC 3629, this was reduced to a maximum of 4 bytes, to be more compatible with UTF-16.
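You can see the 4-byte limit directly with a minimal Python sketch (Python's built-in UTF-8 codec follows RFC 3629):

```python
# The highest valid code point, U+10FFFF, takes exactly 4 UTF-8 bytes.
top = chr(0x10FFFF)
encoded = top.encode("utf-8")
print(len(encoded))    # 4
print(encoded.hex())   # f48fbfbf

# Code points beyond U+10FFFF do not exist; chr() refuses them outright.
try:
    chr(0x110000)
except ValueError as err:
    print("rejected:", err)
```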

Wikipedia has some more information. The specification of the Universal Character Set is closely linked to the history of Unicode and its transformation format (UTF).


See the answers to Do UTF-8, UTF-16, and UTF-32 Unicode encodings differ in the number of characters they can store?

UTF-8 and UTF-32 are theoretically capable of representing characters above U+10FFFF, but were artificially restricted to match UTF-16's capacity.


The largest Unicode code point and the encodings used for Unicode characters are two different things. According to the standard, the highest code point really is 0x10FFFF, but for that you need just 21 bits, which fit easily into 4 bytes, even with 11 bits to spare!

I guess by your question about 6 bytes you mean a 6-byte UTF-8 sequence, right? As others have answered already, the UTF-8 mechanism could in principle handle 6-byte sequences; you could even deal with 7-byte and 8-byte sequences. A 7-byte sequence gives you only what the continuation bytes have to offer, 6 × 6 = 36 bits, and an 8-byte sequence gives you 7 × 6 = 42 bits. You could do it, but it is not allowed because it is not needed: the highest code point is 0x10FFFF.
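The bit arithmetic above can be sketched in a few lines. This is a hypothetical helper, not a real codec: the 7- and 8-byte forms were never part of any standard, and even the 5- and 6-byte forms are forbidden since RFC 3629.

```python
def payload_bits(n):
    """Payload bits carried by a hypothetical n-byte UTF-8 sequence
    (original, pre-RFC-3629 scheme extended to 7 and 8 bytes)."""
    if n == 1:
        return 7                  # ASCII form: 0xxxxxxx
    lead = max(7 - n, 0)          # lead byte: 7 - n free bits, none left for n >= 7
    return lead + 6 * (n - 1)     # each continuation byte (10xxxxxx) adds 6 bits

for n in range(1, 9):
    print(f"{n}-byte sequence: {payload_bits(n)} bits")
```

This reproduces the numbers in the answer: 21 bits for 4 bytes (enough for 0x10FFFF), 31 bits for the original 6-byte maximum, and 36 / 42 bits for the never-standardized 7- and 8-byte forms.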

It is also forbidden to use a longer sequence than needed, as Hibou57 has mentioned. With UTF-8 one must always use the shortest possible sequence, or the sequence will be treated as invalid! An ASCII character must be a single 7-bit byte, of course.

The second thing is that a UTF-8 4-byte sequence gives you 3 bits of payload in the lead byte and 18 bits of payload in the continuation bytes, which makes 21 bits, and that matches the calculation for surrogates in the UTF-16 encoding: the bias 0x10000 is subtracted from the code point, and the remaining 20 bits are split between the high- and low-surrogate payload areas, 10 bits each.

The third and last thing is that within UTF-8 it is not allowed to encode high- or low-surrogate values. Surrogates are not characters but containers for them; surrogates can only appear in UTF-16, not in UTF-8 or UTF-32 encoded files.
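The surrogate math described above can be checked directly. U+1F600 here is just a sample code point above the BMP; the bias-and-split arithmetic is the UTF-16 rule itself, and the final snippet shows that a strict UTF-8 decoder (Python's included) refuses a lone surrogate.

```python
# UTF-16 surrogate-pair arithmetic for a sample code point, U+1F600.
cp = 0x1F600
v = cp - 0x10000                  # subtract the bias; 20 bits remain
hi = 0xD800 + (v >> 10)           # high surrogate takes the top 10 bits
lo = 0xDC00 + (v & 0x3FF)         # low surrogate takes the bottom 10 bits
print(hex(hi), hex(lo))           # 0xd83d 0xde00

# Strict UTF-8 rejects surrogate code points: 0xED 0xA0 0x80 would
# decode to U+D800 if surrogates were allowed, but they are not.
try:
    b"\xed\xa0\x80".decode("utf-8")
except UnicodeDecodeError as err:
    print("invalid:", err)
```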


Indeed, viewed purely as a bit-packing scheme, UTF‑8 could technically encode code points beyond the forever‑fixed upper limit of the valid range; one could encode such a code point, but it would not be a valid code point anywhere. On the other hand, you could encode a character with unneeded zeroed high‑order bits, e.g. encoding an ASCII code point in multiple bytes, as in 2#1100_0001#, 2#1000_0001# (using Ada's notation), which would be the ASCII letter A UTF‑8 encoded with two bytes. But such a sequence may be rejected by safety/security filters, as overlong forms have been used for hacking and piracy. RFC 3629 has some explanation about it. One should just stick to encoding valid code points (as defined by Unicode), the safe way (no extraneous bytes).
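The overlong two-byte form of the letter A mentioned above, 0xC1 0x81, is exactly what strict decoders reject. A minimal Python sketch:

```python
# Overlong encoding of ASCII 'A' (0x41): two bytes where one suffices.
overlong = bytes([0b1100_0001, 0b1000_0001])   # 0xC1 0x81
try:
    overlong.decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err)

# The shortest form is the only valid one: a single byte.
print("A".encode("utf-8"))   # b'A'
```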
