Confusion on Unicode and Multibyte Articles
Referring to Joel's article:
Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.
After reading the whole article, my point is this: if someone tells you his text is "in Unicode", you have no idea how much memory is taken up by each of his characters. He has to tell you, "My Unicode text is encoded in UTF-8"; only then do you know how much memory each character takes up.
Unicode = not necessarily 2 bytes per character
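A quick illustrative sketch of this point: the same character set, encoded different ways, yields different byte counts per character. (This example and its character choices are mine, not from Joel's article.)

```python
# The number of bytes per character depends on the encoding, not on "Unicode" itself.
for ch in ["A", "é", "中", "😀"]:
    utf8 = ch.encode("utf-8")        # 1-4 bytes per character
    utf16 = ch.encode("utf-16-le")   # 2 or 4 bytes (little-endian, no BOM)
    print(f"{ch!r}: UTF-8 = {len(utf8)} bytes, UTF-16 = {len(utf16)} bytes")
# 'A':  UTF-8 = 1 byte,  UTF-16 = 2 bytes
# 'é':  UTF-8 = 2 bytes, UTF-16 = 2 bytes
# '中': UTF-8 = 3 bytes, UTF-16 = 2 bytes
# '😀': UTF-8 = 4 bytes, UTF-16 = 4 bytes
```

So "my text is in Unicode" alone tells you nothing about storage size; the encoding does.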
However, when it comes to Code Project's article and Microsoft's Help, this confuses me:
Microsoft :
Unicode is a 16-bit character encoding, providing enough encodings for all languages. All ASCII characters are included in Unicode as "widened" characters.
Code Project :
The Unicode character set is a "wide character" (2 bytes per character) set that contains every character available in every language, including all technical symbols and special publishing characters. Multibyte character set (MBCS) uses either 1 or 2 bytes per character
Unicode = 2 bytes for each character?
Are 65,536 possible characters enough to represent every language in the world?
Why does the concept seem to differ between the web developer community and the desktop developer community?
Once upon a time,
- Unicode had only as many characters as fit in 16 bits, and
- UTF-8 did not exist or was not the de facto encoding to use.
These factors led UTF-16 (or rather, what is now called UCS-2) to be considered synonymous with "Unicode", because it was, after all, the encoding that supported all of Unicode.
Practically, you will see “Unicode” being used where “UTF-16” or “UCS-2” is meant. This is a historical confusion and should be ignored and not propagated. Unicode is a set of characters; UTF-8, UTF-16, and UCS-2 are different encodings.
(The difference between UTF-16 and UCS-2 is that UCS-2 is a true 16-bits-per-“character” encoding, and therefore encodes only the “BMP” (Basic Multilingual Plane) portion of Unicode, whereas UTF-16 uses “surrogate pairs” (for a total of 32 bits) to encode above-BMP characters.)
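To make the UCS-2/UTF-16 distinction concrete, here is a small sketch (my own example): a BMP character fits in one 16-bit code unit, while a character above the BMP needs a UTF-16 surrogate pair.

```python
bmp = "中"      # U+4E2D, inside the Basic Multilingual Plane
astral = "😀"   # U+1F600, outside the BMP

# A BMP character is a single 16-bit code unit in UTF-16 (same as UCS-2 would give).
assert len(bmp.encode("utf-16-le")) == 2

# An above-BMP character becomes a surrogate pair: two 16-bit units, 4 bytes total.
# UCS-2, being strictly 16 bits per character, cannot represent it at all.
assert len(astral.encode("utf-16-le")) == 4
```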
To expand on @Kevin's answer:
The description in Microsoft's Help is quite out of date, describing the state of the world in the NT 3.5/4.0 timeframe.
You'll also occasionally see UTF-32 and UCS-4 mentioned, most often in the *nix world. UTF-32 is a 32-bit encoding of Unicode and a subset of UCS-4. The Unicode Standard Annex #19 describes the differences between them.
The best reference I've found describing the various encoding models is the Unicode Technical Report #17 Unicode Character Encoding Model, especially the tables in section 4.
Are 65,536 possible characters enough to represent every language in the world?
No.
Why does the concept seem to differ between the web developer community and the desktop developer community?
Because Windows documentation is wrong. It took me a while to figure this out. MSDN says in at least two places that Unicode is a 16-bit encoding:
- http://www.microsoft.com/typography/unicode/cscp.htm
- http://msdn.microsoft.com/en-us/library/cwe8bzh0.aspx
One reason for the confusion is that at one point Unicode was a 16-bit encoding. From Wikipedia:
“Originally, both Unicode and ISO 10646 standards were meant to be fixed-width, with Unicode being 16 bit”
The other problem is that today the Windows APIs usually represent UTF-16-encoded string data as an array of wide characters, each 16 bits long, even though those same APIs also support surrogate pairs, where two 16-bit code units together represent one Unicode code point.
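The surrogate-pair arithmetic itself is simple, and a short sketch (my own, using the standard formulas from the Unicode specification) shows why two 16-bit units suffice for code points above U+FFFF:

```python
cp = 0x1F600                     # 😀, a code point above the BMP
v = cp - 0x10000                 # 20-bit value to split across the pair
high = 0xD800 + (v >> 10)        # high (lead) surrogate: top 10 bits
low = 0xDC00 + (v & 0x3FF)       # low (trail) surrogate: bottom 10 bits
print(hex(high), hex(low))       # the two 16-bit code units

# The pair matches what a real UTF-16 encoder produces (little-endian bytes).
assert "😀".encode("utf-16-le") == high.to_bytes(2, "little") + low.to_bytes(2, "little")
```

This is why a 16-bit `wchar_t` array on Windows can still hold all of Unicode: characters above the BMP simply occupy two array elements instead of one.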
Check out this blog post for more detailed information on the source of the confusion.