Can you write UTF-8, UTF-16 and std::wstring representation of "U+9FA5 (龥)" and "U+0041 (A)" unicode characters?

2023-01-26 10:00 Asked by:

Please specify if there is a difference in representation between Windows and Linux machines (like std::wstring consuming 4 bytes in Linux and 2 bytes in Windows).

And please also specify endianness if necessary.

No, I can't. But this site can.

UTF-16 (little-endian on Windows) is the encoding used inside the MS Office family of products. It stores every character in the "standard" part of the Unicode character set, the Basic Multilingual Plane (BMP), as 2 bytes; code points beyond the BMP take 4 bytes as a surrogate pair.
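To make the byte order concrete, here is a minimal sketch of UTF-16 encoding for a single code point. The names `ByteOrder`, `utf16_units` and `utf16_bytes` are hypothetical helpers invented for this illustration, not standard library functions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

enum class ByteOrder { Big, Little };

// Encode one code point as UTF-16 code units (one unit inside the BMP,
// a surrogate pair above U+FFFF). Hypothetical helper for illustration.
std::vector<std::uint16_t> utf16_units(char32_t cp)
{
    // Surrogates and values above U+10FFFF are not valid code points.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000)
        return { static_cast<std::uint16_t>(cp) };
    cp -= 0x10000;
    return { static_cast<std::uint16_t>(0xD800 | (cp >> 10)),     // high surrogate
             static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)) }; // low surrogate
}

// Serialize the code units to bytes in the requested byte order.
std::vector<std::uint8_t> utf16_bytes(char32_t cp, ByteOrder order)
{
    std::vector<std::uint8_t> out;
    for (std::uint16_t u : utf16_units(cp)) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == ByteOrder::Big) { out.push_back(hi); out.push_back(lo); }
        else                         { out.push_back(lo); out.push_back(hi); }
    }
    return out;
}
```

With these helpers, U+9FA5 comes out as 9F A5 in UTF-16BE and A5 9F in UTF-16LE, and U+0041 as 00 41 and 41 00 respectively.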

Linux is probably using UTF-8, which stores standard ASCII characters in a single byte but may store other Unicode characters in two, three or four bytes, depending on the code point. The leftmost bits of each byte are flags: a lead byte indicates that the character is not ASCII and how many bytes it occupies, and each following byte is marked as a continuation byte. (The idea being that you can jump into a UTF-8 string at a random byte and be able to find the start of the character you are in.)
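The flag bits can be seen in a small encoder. This is only a sketch: `utf8_encode` and `utf8_char_start` are hypothetical helper names, and the second function assumes its input is already well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Encode one Unicode code point as UTF-8 bytes (hypothetical helper).
std::string utf8_encode(char32_t cp)
{
    // Surrogates and values above U+10FFFF are not valid code points.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));  // 10xxxxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));   // 10xxxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    }
    return out;
}

// From an arbitrary byte index, step back over continuation bytes
// (10xxxxxx) to the first byte of the character containing that index.
std::size_t utf8_char_start(const std::string& s, std::size_t i)
{
    assert(i < s.size());
    while (i > 0 && (static_cast<unsigned char>(s[i]) & 0xC0) == 0x80)
        --i;
    return i;
}
```

Under this scheme U+9FA5 encodes to the three bytes E9 BE A5 and U+0041 to the single byte 41. UTF-8 is a byte sequence, so endianness does not apply to it.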

For most of the Far Eastern character sets, which have high code points, UTF-16 (the encoding used by Java) is usually more efficient in space and processing time than UTF-8.
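To put numbers on the space comparison for the two characters in the question, here is a rough sketch that only counts bytes per code point. `utf8_size` and `utf16_size` are hypothetical helper names, and the sketch says nothing about processing time.

```cpp
#include <cstddef>

// Bytes needed to encode one code point in UTF-8 (hypothetical helper).
constexpr std::size_t utf8_size(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// Bytes needed to encode one code point in UTF-16 (hypothetical helper).
constexpr std::size_t utf16_size(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

// U+9FA5: 3 bytes in UTF-8, 2 bytes in UTF-16.
static_assert(utf8_size(0x9FA5) == 3 && utf16_size(0x9FA5) == 2, "CJK: UTF-16 is smaller");
// U+0041: 1 byte in UTF-8, 2 bytes in UTF-16.
static_assert(utf8_size(0x41) == 1 && utf16_size(0x41) == 2, "ASCII: UTF-8 is smaller");
```

So text that is mostly CJK tends to be smaller in UTF-16, while text that is mostly ASCII tends to be smaller in UTF-8.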

Is this what you want:

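#include <string>

int main()
{
    // \u9FA5 is the universal character name for U+9FA5.
    // ("\0x9FA5" would be a NUL character followed by the text "x9FA5".)
    std::wstring  data1 = L"U+9FA5 (\u9FA5)";
    std::wstring  data2 = L"U+0041 (A)";
}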

The wstring is just a container of wchar_t objects.
There is no implied encoding of the characters (it just stores what you put in it).

Windows wchar_t is currently 2 bytes, so it can only hold UTF-16 code units. Linux wchar_t is usually 4 bytes, so it can hold UTF-16 code units or whole UTF-32 code points. In most normal situations these overlap and the top half is just all zero (the exceptions, of course, are code points outside the BMP, which UTF-16 has to encode as surrogate pairs).
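A quick way to see what a std::wstring actually holds on a given platform is to dump its wchar_t values. This is a minimal sketch: `dump_units` and `payload_bytes` are hypothetical helper names, and it assumes the compiler's wide execution character set is Unicode, as it is with the mainstream Windows and Linux compilers.

```cpp
#include <cassert>   // for the usage check shown below the block
#include <cstddef>
#include <sstream>
#include <string>

// Render each wchar_t of the string as an uppercase hex number,
// separated by spaces (hypothetical helper for illustration).
std::string dump_units(const std::wstring& s)
{
    std::ostringstream out;
    out << std::hex << std::uppercase;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (i != 0)
            out << ' ';
        out << static_cast<unsigned long>(s[i]);
    }
    return out.str();
}

// Bytes occupied by the characters themselves, without the terminator.
// Depends on the platform: sizeof(wchar_t) is typically 2 on Windows
// and 4 on Linux.
std::size_t payload_bytes(const std::wstring& s)
{
    return s.size() * sizeof(wchar_t);
}
```

For example, `assert(dump_units(L"\u9FA5") == "9FA5");` should hold on both platforms, because U+9FA5 is in the BMP and fits in one wchar_t either way. Only `payload_bytes` differs: 2 with a 2-byte wchar_t and 4 with a 4-byte one. The order of those bytes in memory follows the machine's native endianness.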

Note: UTF-8 is not normally used internally in an application (though it can be), as its characters are not fixed width. But it is extremely useful for transport and storage because of its compactness (and backwards compatibility with ASCII does not hurt).

Note: C/C++ does not preclude the use of other encoding formats for its strings.
