
Why was wchar_t invented?

Why is wchar_t needed? How is it superior to short (or __int16 or whatever)?

(If it matters: I live in the Windows world. I don't know what Linux does to support Unicode.)


See Wikipedia.

Basically, it's a portable type for "text" in the current locale (including non-ASCII characters such as umlauts). It predates Unicode and doesn't solve many problems, so today it mostly exists for backward compatibility. Don't use it unless you have to.


Why is wchar_t needed? How is it superior to short (or __int16 or whatever)?

In the C++ world, wchar_t is a distinct built-in type (in C it's just a typedef in <stddef.h>), so you can overload functions on it. For example, this makes it possible to output wide characters as characters rather than as their numerical values. In VC6, where wchar_t was just a typedef for unsigned short, this code

wchar_t wch = L'A';
std::wcout << wch;

would output 65 because

std::basic_ostream<wchar_t>::operator<<(unsigned short)

was invoked. In newer VC versions wchar_t is a distinct type, so

operator<<(std::basic_ostream<wchar_t>&, wchar_t)

is called, and that outputs A.
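Here is a minimal sketch (with hypothetical print overloads) of why the distinct type matters for overload resolution; if wchar_t were merely a typedef for unsigned short, the two declarations would collide:

#include <iostream>

// Hypothetical overloads: they can coexist only because wchar_t is a distinct type.
void print(wchar_t ch)       { std::wcout << L"character: " << ch << L'\n'; }
void print(unsigned short n) { std::wcout << L"number: " << n << L'\n'; }

int main() {
    print(L'A');                             // selects the wchar_t overload, prints the character
    print(static_cast<unsigned short>(65));  // selects the numeric overload, prints 65
}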


The reason there's a wchar_t is pretty much the same reason there's a size_t or a time_t - it's an abstraction that indicates what a type is intended to represent and allows implementations to choose an underlying type that can represent it properly on a particular platform.

Note that wchar_t doesn't need to be a 16-bit type - there are platforms where it's a 32-bit type.
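If you want to check what a given platform chose, a one-line sketch is enough:

#include <iostream>

int main() {
    // Commonly prints 2 on Windows and 4 on most Unix-like systems.
    std::cout << "sizeof(wchar_t) = " << sizeof(wchar_t) << '\n';
}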


It is usually considered a good thing to give data types meaningful names.

Which is better, char or int8? I think this:

char name[] = "Bob";

is much easier to understand than this:

int8 name[] = "Bob";

It's the same thing with wchar_t and int16.


As I read the relevant standards, it seems like Microsoft fcked this one up badly.

My manpage for the POSIX <stddef.h> says that:

  • wchar_t: Integer type whose range of values can represent distinct wide-character codes for all members of the largest character set specified among the locales supported by the compilation environment: the null character has the code value 0 and each member of the portable character set has a code value equal to its value when used as the lone character in an integer character constant.

So a 16-bit wchar_t is not enough if your platform supports Unicode. Each wchar_t is supposed to be a distinct value for a character. Therefore, wchar_t goes from being a useful way to work at the character level of text (after decoding from the locale's multibyte encoding, of course) to being completely useless on Windows platforms.


wchar_t is the primitive for storing and processing the platform's Unicode characters. Its size is not always 16 bits: on Unix systems wchar_t is 32 bits (maybe Unix users are more likely to use the Klingon characters that the extra bits are used for :-).

This can pose problems for porting projects, especially if you interchange wchar_t and short, or if you interchange wchar_t and Xerces' XMLCh.

Therefore, having wchar_t as a different type from short is very important for writing cross-platform code. Cleaning this up was one of the hardest parts of porting our application to Unix and then from VC6 to VC2005.


To add to Aaron's comment - in C++0x we are finally getting real Unicode character types: char16_t and char32_t, and also Unicode string literals.
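For illustration, this is roughly what those look like in C++11 syntax (assuming a conforming compiler):

int main() {
    char16_t c16 = u'A';            // UTF-16 code unit, at least 16 bits wide
    char32_t c32 = U'A';            // UTF-32 code unit, at least 32 bits wide

    const char16_t* s16 = u"text";  // UTF-16 encoded string literal
    const char32_t* s32 = U"text";  // UTF-32 encoded string literal

    (void)c16; (void)c32; (void)s16; (void)s32;
}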


It is "superior" in a sense that it allows you to separate contexts: you use wchar_t in character contexts (like strings), and you use short in numerical contexts (numbers). Now the compiler can perform type checking to help you catch situations where you mistakenly mix one with another, like pass an abstract non-string array of shorts to a string processing function.

As a side note (since this was a C question), in C++ wchar_t also allows you to overload functions independently from short, i.e. again provide separate overloads that work with strings and with numbers (for example). See the sketch below.
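A minimal sketch of the type-checking point, with a hypothetical text_length function standing in for any string-processing routine:

#include <cstddef>  // std::size_t
#include <cwchar>   // std::wcslen

// Hypothetical function that expects wide text, not arbitrary 16-bit numbers.
std::size_t text_length(const wchar_t* s) { return std::wcslen(s); }

int main() {
    wchar_t greeting[] = L"hello";
    short   samples[]  = { 1, 2, 3, 0 };

    text_length(greeting);    // OK: the argument really is a wide string
    // text_length(samples);  // compile error: short* does not convert to const wchar_t*
    (void)samples;
}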


wchar_t is a bit of a hangover from before Unicode standardisation. Unfortunately it's not very helpful, because the encoding is platform-specific (and on Solaris, locale-specific!) and the width is not specified. In addition, there are no guarantees that UTF-8/16/32 codecvt facets will be available, or indeed any indication of how you would access them. In general it's a bit of a nightmare for portable usage.

Apparently C++0x will have support for Unicode, but at the current rate of progress that may never happen...


Except for a small ISO 2022 Japanese minority, wchar_t is always going to be Unicode. If you are really anxious, you can make sure of that at compile time:

#ifndef __STDC_ISO_10646__
#error "non-unicode wchar_t, unsupported system"
#endif

Sometimes wchar_t is 16-bit UCS-2, sometimes 32-bit UCS-4; so what? Just use sizeof(wchar_t). wchar_t is NOT meant to be sent to disk or over the network; it is only meant to be used in memory.
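As an illustration of that advice, a minimal sketch that converts to the locale's multibyte encoding (via the standard wcstombs) before writing anything out, rather than dumping raw wchar_t bytes whose width and encoding vary by platform:

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main() {
    std::setlocale(LC_ALL, "");           // pick up the environment's locale
    const wchar_t* in_memory = L"hello";  // fine to keep as wchar_t in memory

    // Convert to a well-defined multibyte encoding before it leaves memory.
    char buffer[64];
    std::size_t n = std::wcstombs(buffer, in_memory, sizeof buffer);
    if (n != static_cast<std::size_t>(-1))
        std::fwrite(buffer, 1, n, stdout);
}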

See also Should UTF-16 be considered harmful? on this site.
