Can you write UTF-8, UTF-16 and std::wstring representation of "U+9FA5 (龥)" and "U+0041 (A)" unicode characters?
Please specify if there is a difference in representation between Windows and Linux machines (like std::wstring consuming 4 bytes in Linux and 2 bytes in Windows).
And please also specify endianness if necessary.
No, I can't. But this site can.
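UTF-16 is the encoding used internally by Windows and the MS Office family of products (little-endian on those platforms, i.e. UTF-16LE). It stores every character in the Basic Multilingual Plane (BMP) as one 2-byte code unit whose value is identical to the Unicode code point; characters outside the BMP take two code units (a surrogate pair, 4 bytes).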
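To make the byte order question concrete, here is a minimal sketch of UTF-16 encoding for a single code point. The helper names to_utf16 and utf16_bytes are made up for illustration and are not part of any library; the sketch assumes the input is a valid code point.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point as UTF-16 code units.
std::vector<std::uint16_t> to_utf16(std::uint32_t cp)
{
    // Valid code points only: at most U+10FFFF and not a surrogate value.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000)
        return { static_cast<std::uint16_t>(cp) };    // BMP: one 16-bit unit
    cp -= 0x10000;                                     // 20 bits remain
    return { static_cast<std::uint16_t>(0xD800 + (cp >> 10)),      // high surrogate
             static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)) };  // low surrogate
}

// Hypothetical helper: serialize the code units to bytes in the chosen byte order.
std::vector<std::uint8_t> utf16_bytes(std::uint32_t cp, bool big_endian)
{
    std::vector<std::uint8_t> out;
    for (std::uint16_t u : to_utf16(cp)) {
        const std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (big_endian) { out.push_back(hi); out.push_back(lo); }
        else            { out.push_back(lo); out.push_back(hi); }
    }
    return out;
}
```

With these helpers, utf16_bytes(0x9FA5, true) gives 9F A5 (UTF-16BE) and utf16_bytes(0x9FA5, false) gives A5 9F (UTF-16LE); for U+0041 the results are 00 41 and 41 00.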
Linux is probably using UTF-8, which stores standard ASCII characters in a single byte but may store other Unicode characters in two, three or four bytes, depending on the code point. The leftmost bits of each byte are flags that indicate whether the byte is plain ASCII, the first byte of a multibyte character, or a continuation byte. (The idea is that you can jump into a UTF-8 string at a random byte and still find the start of the character you are in.)
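The flag bits described above can be sketched as follows. The function names to_utf8 and is_continuation are hypothetical, chosen only for this example, and the encoder assumes a valid code point (it does not reject surrogate values).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: encode one code point as UTF-8 bytes.
std::vector<std::uint8_t> to_utf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    std::vector<std::uint8_t> out;
    if (cp < 0x80) {                       // 0xxxxxxx: plain ASCII
        out.push_back(static_cast<std::uint8_t>(cp));
    } else if (cp < 0x800) {               // 110xxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {             // 1110xxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    } else {                               // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out.push_back(static_cast<std::uint8_t>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<std::uint8_t>(0x80 | (cp & 0x3F)));
    }
    return out;
}

// A byte of the form 10xxxxxx is a continuation byte, never the start of a
// character; this is what lets a reader resynchronize from a random offset.
bool is_continuation(std::uint8_t b)
{
    return (b & 0xC0) == 0x80;
}
```

Here to_utf8(0x9FA5) yields the three bytes E9 BE A5 and to_utf8(0x41) yields the single byte 41. UTF-8 is a byte sequence, so there is no endianness to specify.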
For most of the Far Eastern character sets, which have high code points, UTF-16 (as used by Java) is usually more efficient in space and processing time than UTF-8. U+9FA5, for example, takes 2 bytes in UTF-16 but 3 bytes in UTF-8.
Is this what you want:
#include <string>

int main()
{
    // \u9FA5 is the universal character name for U+9FA5 (龥).
    std::wstring data1 = L"U+9FA5 (\u9FA5)";
    std::wstring data2 = L"U+0041 (A)";
}
The wstring is just a container of wchar_t objects.
There is no implied encoding of the characters (it just stores what you put in it).
On Windows, wchar_t is currently 2 bytes, so it can only hold UTF-16 code units. On Linux, wchar_t is usually 4 bytes, so it can hold UTF-16 code units or whole UTF-32 code points. In most normal situations these overlap and the top half is just all zero (the exceptions, of course, are code points outside the BMP, which need a surrogate pair in UTF-16).
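A small sketch of how the wchar_t width affects storage. The helpers payload_bytes and wchar_units are invented for this example; the sketch assumes the implementation uses UTF-16 when wchar_t is 2 bytes and UTF-32 when it is 4 bytes, which the C++ standard does not guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Bytes occupied by the characters of a wide string (terminator excluded).
std::size_t payload_bytes(const std::wstring& s)
{
    return s.size() * sizeof(wchar_t);
}

// Hypothetical helper: how many wchar_t elements one code point needs,
// assuming UTF-16 for a 2-byte wchar_t and UTF-32 otherwise.
std::size_t wchar_units(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return (sizeof(wchar_t) == 2 && cp > 0xFFFF) ? 2 : 1;
}
```

Both U+9FA5 and U+0041 are in the BMP, so each needs one wchar_t on either platform: 2 bytes where wchar_t is 2 bytes (A5 9F and 41 00 in memory on a little-endian machine) and 4 bytes where it is 4 bytes (A5 9F 00 00 and 41 00 00 00 on a little-endian machine).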
Note: UTF-8 is not normally used internally in an application (though it can be) because its characters are not fixed width. But it is extremely useful for transport and storage because of its compactness (and backwards compatibility with ASCII does not hurt).
Note: C/C++ does not preclude the use of other encoding formats for its strings.