
Inconsistency in Unicode with wchar_t vs. ICU in C++

While wchar_t support is inconsistent across compilers, is it safe to assume that the wchar_t implementation and size are consistent with GNU/GCC, at least on Linux?

Setting aside the fact that the size of wchar_t depends on the system architecture (32-bit vs. 64-bit), is the wide character type on Linux (GNU/GCC) actually compiler-dependent or libstdc++-dependent? In other words, when changing or upgrading which of the two should I worry that wchar_t might no longer behave as expected in terms of size and support?

IBM ICU is another option, but can it be used in conjunction with std::string?

Should I totally dismiss wchar_t in favor of ICU?

Note: On Unix-like operating systems such as Linux with GNU/GCC, libstdc++ provides the core C++ library functionality for the compiler, and is therefore updated from time to time.


If you want to present strings to the user, you might have to take wchar_t (or some other library-defined type) into consideration. Different compilers and platforms define wchar_t differently, because they use different Unicode encodings. On Windows/Visual C++, for instance, wchar_t is a 16-bit type, suitable for UTF-16. On GCC/Linux, for instance, wchar_t is a 32-bit type, suitable for UTF-32.
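A quick way to see this on your own toolchain is to print sizeof(wchar_t); a minimal sketch (the exact value is implementation-defined, so treat the comments as typical outcomes rather than guarantees):

```cpp
#include <cstdio>

int main() {
    // Typically prints 2 on Windows/MSVC (UTF-16 code units) and
    // 4 on Linux/GCC (UTF-32 code units); the size is
    // implementation-defined, so never hard-code it.
    std::printf("sizeof(wchar_t) = %zu bytes\n", sizeof(wchar_t));
    return 0;
}
```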

The IBM ICU library has conversion functions for transforming from one encoding to another. Your platform (Win32 for instance) might also have functions for transforming from one encoding to another.
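This also answers the std::string question: you can keep UTF-8 in std::string at the boundaries and convert through ICU for processing. A minimal sketch, assuming ICU headers are installed and the program is linked against ICU (e.g. -licuuc); the sample text is only illustrative:

```cpp
#include <string>
#include <unicode/unistr.h>   // icu::UnicodeString

int main() {
    // Keep UTF-8 in std::string at the program's edges...
    std::string utf8 = "Gr\xC3\xBC\xC3\x9F dich";   // "Grüß dich" as raw UTF-8 bytes

    // ...and convert to ICU's UnicodeString (UTF-16 internally) for
    // Unicode-aware operations.
    icu::UnicodeString us = icu::UnicodeString::fromUTF8(utf8);
    us.toUpper();                 // uppercase using the default locale

    // Convert back to UTF-8 for output, files, network, etc.
    std::string back;
    us.toUTF8String(back);
    return 0;
}
```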

Depending on your requirements (speed, memory usage), you should pick an internal format that suits the platform. On Windows it might be UTF-16, and on Linux it might be UTF-32. That way you won't have to transcode strings all the time just to perform simple platform-defined operations on them (wcslen(), wcscmp(), etc.), as in the sketch below.
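A rough sketch of those native wide-string operations; note that they work on code units, not user-perceived characters:

```cpp
#include <cwchar>
#include <string>

int main() {
    // On GCC/Linux these wstrings hold UTF-32 code units; on MSVC, UTF-16.
    std::wstring a = L"na\u00EFve";   // "naïve"
    std::wstring b = L"naive";

    // Length in code units, not in user-perceived characters.
    std::size_t len = std::wcslen(a.c_str());

    // Element-wise comparison of code unit values, not locale-aware collation.
    int cmp = std::wcscmp(a.c_str(), b.c_str());

    (void)len;
    (void)cmp;
    return 0;
}
```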

For external formats (text files, etc.), I tend to use UTF-8. The reason is that files are considerably smaller if they contain text in a Western language. Another benefit is that you don't have to consider endianness with UTF-8, which makes errors (on your part or someone else's) less likely.

IBM ICU is a very big and capable library for handling Unicode strings. That said, it might be like using a sledgehammer to drive in a small nail. Do you need all of its functionality? The Unicode functionality supported by the target platform might already meet your requirements.


In principle, yes, wchar_t can change with a new compiler version (it is a language feature, not a library one, so it depends on the compiler rather than on libstdc++).

In practice, the odds of it suddenly changing size are pretty much zero.

It's not really clear what you actually need, though. wchar_t just allows you to store wide characters, and not much more. ICU is a complete Unicode library which does a lot more, and is pretty much essential if you want to do more complex text processing than simply printing strings.
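To give a sense of "a lot more", here is a hedged sketch (assuming ICU is installed and linked, e.g. -licuuc) that counts grapheme clusters, i.e. user-perceived characters, which plain wchar_t arithmetic cannot do:

```cpp
#include <cstdio>
#include <memory>
#include <unicode/brkiter.h>   // icu::BreakIterator
#include <unicode/locid.h>     // icu::Locale
#include <unicode/unistr.h>    // icu::UnicodeString

// Count user-perceived characters (grapheme clusters), something neither
// wcslen() nor std::wstring::size() can report.
int32_t count_graphemes(const icu::UnicodeString& text) {
    UErrorCode status = U_ZERO_ERROR;
    std::unique_ptr<icu::BreakIterator> it(
        icu::BreakIterator::createCharacterInstance(icu::Locale::getDefault(), status));
    if (U_FAILURE(status)) return -1;

    it->setText(text);
    int32_t count = 0;
    while (it->next() != icu::BreakIterator::DONE) ++count;
    return count;
}

int main() {
    // "e" followed by a combining acute accent: 2 code points, 1 grapheme.
    icu::UnicodeString s = icu::UnicodeString::fromUTF8("e\xCC\x81");
    std::printf("graphemes: %d\n", static_cast<int>(count_graphemes(s)));
    return 0;
}
```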

Finally, on *nix, plain char and std::string usually carry UTF-8-encoded text, so those are perfectly suitable for storing Unicode. wchar_t is rarely used for that reason.
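A minimal sketch of that approach, assuming the source file and the terminal locale are both UTF-8:

```cpp
#include <cstdio>
#include <string>

int main() {
    // std::string carries the UTF-8 bytes as-is; just remember that
    // .size() counts bytes, not characters.
    std::string s = "h\xC3\xA9llo";   // "héllo" as raw UTF-8 bytes
    std::printf("%s (%zu bytes)\n", s.c_str(), s.size());
    return 0;
}
```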

