
If UTF-8 is an 8-bit encoding, why does it need 1-4 bytes?

On the Unicode site it's written that UTF-8 can be represented by 1-4 bytes. As I understand from this question https://softwareengineering.stackexchange.com/questions/77758/why-are-there-multiple-unicode-encodings UTF-8 is an 8-bit encoding. So, what's the truth? If it's an 8-bit encoding, then what's the difference between ASCII and UTF-8? If it's not, then why is it called UTF-8, and why do we need UTF-16 and the others if they occupy the same memory?


The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky - Wednesday, October 08, 2003

Excerpt from above:

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8 bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes. This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).

So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.

And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek Encoding, or the Hebrew ANSI Encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
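
To make the excerpt concrete, here is a minimal Python sketch of both points (the sample strings are my own, and cp1252 is Python's codec name for the Windows-1252 encoding mentioned above):

    # "Hello" is the same byte sequence in ASCII and in UTF-8.
    print("Hello".encode("utf-8").hex(" "))              # 48 65 6c 6c 6f
    print("Hello".encode("ascii").hex(" "))               # 48 65 6c 6c 6f

    # A legacy single-byte encoding cannot hold Russian letters, so they
    # degrade to question marks; UTF-8 stores every code point correctly.
    print("Привет".encode("cp1252", errors="replace"))    # b'??????'
    print("Привет".encode("utf-8"))                        # b'\xd0\x9f\xd1\x80...'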


UTF-8 is an 8-bit, variable-width encoding. The first 128 characters in Unicode, when represented with the UTF-8 encoding, have the same representation as the corresponding characters in ASCII.

To understand this further, Unicode treats characters as code points - a mere number that can be represented in multiple ways (the encodings). UTF-8 is one such encoding. It is the most commonly used, because for mostly-ASCII text it gives the best space-consumption characteristics among the Unicode encodings. If you store characters from the ASCII character set in UTF-8, the UTF-8-encoded data takes the same amount of space. This allowed applications that previously used ASCII to move (well, not quite seamlessly, but it certainly didn't result in something like Y2K) to Unicode, because the character representations are the same.
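
A quick way to check that claim, as a sketch in Python (any ASCII-only string would do):

    s = "Hello, world"                             # ASCII-only text
    assert s.encode("utf-8") == s.encode("ascii")  # identical byte sequences
    assert len(s.encode("utf-8")) == len(s)        # one byte per character, no extra space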

I'll leave this extract here from RFC 3629, which shows how the UTF-8 encoding works:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

You can see why the encoding results in characters occupying anywhere between 1 and 4 bytes (the right-hand column), depending on the range of the Unicode character (the left-hand column).
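
For instance, picking one character from each row of the table (a minimal Python sketch; the sample characters are my own choice):

    samples = {"A": 1,   # U+0041, first row  (0xxxxxxx)
               "é": 2,   # U+00E9, second row (110xxxxx 10xxxxxx)
               "€": 3,   # U+20AC, third row  (1110xxxx 10xxxxxx 10xxxxxx)
               "🚀": 4}  # U+1F680, fourth row (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx)

    for ch, expected in samples.items():
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")
        assert len(encoded) == expected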

UTF-16, UTF-32, UCS-2, etc. employ different encoding schemes, where the code points are represented as 16-bit or 32-bit code units instead of the 8-bit units that UTF-8 uses.
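
A rough illustration of those unit sizes in Python (using the little-endian codec names so that no byte-order mark is prepended):

    print("A".encode("utf-8").hex(" "))       # 41             one 8-bit unit
    print("A".encode("utf-16-le").hex(" "))   # 41 00          one 16-bit unit
    print("A".encode("utf-32-le").hex(" "))   # 41 00 00 00    one 32-bit unit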


The '8-bit' in the name means that the individual bytes (code units) of the encoding are 8 bits wide. In contrast, pure ASCII is a 7-bit encoding, as it only has code points 0-127. It used to be that software had problems with 8-bit encodings; one of the reasons for the Base-64 and uuencode encodings was to get binary data through email systems that did not handle 8-bit data. However, it has been a decade or more since that stopped being an acceptable excuse: software has had to be 8-bit clean, that is, capable of handling 8-bit encodings.
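
As a side note on what that workaround looked like, Base-64 wraps arbitrary 8-bit data in a 7-bit-safe alphabet; a small Python sketch (the sample string is my own):

    import base64

    data = "naïve café".encode("utf-8")               # contains bytes >= 0x80
    wrapped = base64.b64encode(data)                  # ASCII-only, survives 7-bit channels
    print(wrapped)                                    # b'bmHDr3ZlIGNhZsOp'
    print(base64.b64decode(wrapped).decode("utf-8"))  # naïve café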

Unicode itself is a 21-bit character set. There are a number of encodings for it:

  • UTF-32, where each Unicode code point is stored in a 32-bit integer.
  • UTF-16, where many Unicode code points are stored in a single 16-bit integer, but some need two 16-bit integers (so it needs 2 or 4 bytes per Unicode code point).
  • UTF-8, where a single Unicode code point can require 1, 2, 3 or 4 bytes.

So, "UTF-8 can be represented by 1-4 bytes" is probably not the most appropriate way of phrasing it. "Unicode code points can be represented by 1-4 bytes in UTF-8" would be more appropriate.


Just to complement the other answers about UTF-8 encoding, which uses 1 to 4 bytes:

As noted above, a 4-byte code totals 32 bits, but 11 of those 32 bits are used as prefixes in the control bytes, i.e. to identify the size (1 to 4 bytes) of a Unicode symbol's code and also to make it possible to recover the text easily even when decoding starts in the middle of it.

The key question is: why do we need so many bits (11) for control in a 32-bit code? Wouldn't it be useful to have more than 21 bits for the code point itself?

The point is that the scheme needs to make it easy to find your way back to the first byte of a code.

Thus, bytes other than the first cannot have all their bits available for encoding a Unicode symbol, because otherwise they could easily be confused with the first byte of a valid UTF-8 code.

So the model is

  • 0UUUUUUU for a 1-byte code. We have 7 Us, so there are 2^7 = 128 possibilities: the traditional ASCII codes.
  • 110UUUUU 10UUUUUU for a 2-byte code. Here we have 11 Us, so there are 2^11 = 2,048 bit patterns; subtracting the 128 code points already covered by the 1-byte form (legacy ASCII) leaves 1,920 possibilities.
  • 1110UUUU 10UUUUUU 10UUUUUU for a 3-byte code. Here we have 16 Us, so there are 2^16 = 65,536 - 2,048 = 63,488 possibilities.
  • 11110UUU 10UUUUUU 10UUUUUU 10UUUUUU for a 4-byte code. Here we have 21 Us, so there are 2^21 = 2,097,152 - 65,536 = 2,031,616 possibilities,

where U is a bit (0 or 1) used to encode the Unicode code point.

So the total number of possibilities is 128 + 1,920 + 63,488 + 2,031,616 = 2,097,152 code points (although UTF-8 as defined in RFC 3629 only uses sequences up to U+10FFFF, the upper limit of the table above).
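
Working through U+1F680 (the rocket mentioned just below) makes the prefix scheme visible. A minimal Python sketch, assuming nothing beyond the standard library; the continuation-byte test (byte & 0xC0 == 0x80) is what lets a decoder step back to the first byte of a character:

    encoded = "🚀".encode("utf-8")             # U+1F680
    print(encoded.hex(" "))                    # f0 9f 9a 80
    for b in encoded:
        print(f"{b:08b}")                      # 11110000, 10011111, 10011010, 10000000

    # Every byte matching 10xxxxxx is a continuation byte, so from an arbitrary
    # position we can walk backwards to the first byte of the current character.
    def start_of_char(data: bytes, i: int) -> int:
        while data[i] & 0xC0 == 0x80:          # skip continuation bytes
            i -= 1
        return i

    assert start_of_char(encoded, 2) == 0      # byte 2 belongs to the char starting at 0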

In the available Unicode tables (for example, in the Unicode Pad app for Android, or here), the code points appear in the form U+H, where H is a hex number of 1 to 6 digits. For example, U+1F680 represents the rocket icon 🚀.
