What exactly does U+ stand for and why can't I create a table of Unicode intermediate strings in my C++ application?
I'm trying to convert an application from Java + Swing to C++ + Qt. At one point I had to deal with some Unicode intermediates. In Java, this was fairly easy:
private static String[] hiraganaTable = {
"\u3042", "\u3044", "\u3046", "\u3048", "\u304a",
"\u304b", "\u304d", "\u304f", "\u3051", "\u3053",
...
};
...whereas in C++ I'm having problems:
QString hiraganaTable[] = {
"\x30\x42", "\x30\x44", "\x30\x46", "\x30\x48", "\x30\x4a",
"\x30\x4b", "\x30\x4d", "\x30\x4f", "\x30\x51", "\x30\x53",
...
};
I couldn't use \u in VS2008 because I got a heap of warnings of the form:
character represented by universal-character-name '\u3042' cannot be represented in the current code page (1250)
And don't call me stupid, I tried to use File->Advanced Save Options to no avail, the codepage didn't seem to change at all. Seems like this is a known problem: How to create a UTF-8 string literal in Visual C++ 2008
The table I'm using is fairly short, so with the help of Vim and some introductory-level regexp magic, I was able to convert it to \x30\x42 notation. Unfortunately, the QStrings would not initialize properly from such input. I tried everything: fromAscii(), fromUtf8(), fromLocal8Bit(), QString(QByteArray), the works. Then, after writing U+3042 without a BOM to a file and viewing it in hex mode, I found out that it is actually stored as "E3 81 82". Suddenly, an entry like this seemed to work with QString::fromAscii(). Now I'm left wondering what exactly the "U+" stands for in "U+3042" (since 0xE38182 - 0x3042 = 0xE35140, maybe I'd better add this Magic Constant to all my would-be Unicode chars?). How should I proceed from here to get an array of proper UTF-8 strings?
The problem is that C++ is based on C, which dates back to the ASCII age. The "default" C string literals such as "abc" are arrays of 8-bit char. Your Visual C++ compiler also has 16-bit Unicode (UTF-16) literals, though, with a slightly different syntax: L"abc\u3042". The type of such a literal is const wchar_t[N] instead of const char[N], and you can store it in a std::wstring.
Qt understands wchar_t, and a QString can be built from such data without conversion problems, for example with QString::fromWCharArray() or QString::fromStdWString().
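A minimal sketch of what such a wide-literal table could look like, using only the standard library so it compiles without Qt. The names kHiraganaTable, kHiraganaCount and hiraganaAt are made up for illustration, and only the first row of the table is shown:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical table name, mirroring the one in the question.
// The L prefix makes each literal an array of wchar_t instead of char,
// so a single element can hold the value 0x3042.
static const wchar_t* const kHiraganaTable[] = {
    L"\u3042", L"\u3044", L"\u3046", L"\u3048", L"\u304a",
};
static const std::size_t kHiraganaCount =
    sizeof(kHiraganaTable) / sizeof(kHiraganaTable[0]);

// Returns entry i of the table as a std::wstring.
std::wstring hiraganaAt(std::size_t i) {
    assert(i < kHiraganaCount);
    return std::wstring(kHiraganaTable[i]);
}
```

In a Qt program, something like QString::fromStdWString(hiraganaAt(0)) should then give a one-character QString. Note that wchar_t is 16 bits wide with Visual C++ but usually 32 bits with g++ on Linux; U+3042 fits in a single element either way.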
What you're seeing is the UTF-8 encoding of that character.
>>> u'\u3042'.encode('utf-8').encode('hex')
'e38182'
If you write them all out in UTF-8 then you should be fine.
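For the C++ side, a table written out in UTF-8 might look roughly like the sketch below. The names kHiraganaUtf8, hiraganaUtf8At and decodeUtf8ThreeBytes are hypothetical; the decoder is only there to check that the bytes really map back to the intended code points, and it deliberately handles nothing but three-byte sequences:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-8 byte strings for the first hiragana row (hypothetical table name).
// Each entry is three char-sized bytes, so plain char literals are enough.
static const char* const kHiraganaUtf8[] = {
    "\xe3\x81\x82", "\xe3\x81\x84", "\xe3\x81\x86", "\xe3\x81\x88", "\xe3\x81\x8a",
};
static const std::size_t kHiraganaUtf8Count =
    sizeof(kHiraganaUtf8) / sizeof(kHiraganaUtf8[0]);

// Returns entry i of the table as a byte string.
std::string hiraganaUtf8At(std::size_t i) {
    assert(i < kHiraganaUtf8Count);
    return std::string(kHiraganaUtf8[i]);
}

// Decodes exactly one three-byte UTF-8 sequence back to its code point.
// Returns 0 if the input is not a single three-byte sequence.
unsigned long decodeUtf8ThreeBytes(const std::string& s) {
    if (s.size() != 3) return 0;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    const unsigned char b1 = static_cast<unsigned char>(s[1]);
    const unsigned char b2 = static_cast<unsigned char>(s[2]);
    // Lead byte must look like 1110xxxx, the other two like 10xxxxxx.
    if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
        return 0;
    }
    return ((b0 & 0x0FUL) << 12) | ((b1 & 0x3FUL) << 6) | (b2 & 0x3FUL);
}
```

With Qt, each entry would then go through QString::fromUtf8() rather than fromAscii(), since the bytes are UTF-8 and not ASCII.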
The "U+" just indicates that you're looking at a Unicode codepoint as opposed to some specific encoding.
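To make the difference between a code point and its encoding concrete, here is a rough sketch of how a code point is turned into UTF-8 bytes. It also shows why no single "magic constant" can be added to U+3042 to get E3 81 82: the bits of the code point are split up and spread over several bytes. The function name encodeUtf8 is made up, and the sketch does not reject surrogate code points:

```cpp
#include <cassert>
#include <string>

// Encodes one Unicode code point (U+0000..U+10FFFF) as UTF-8.
// Hypothetical helper; surrogates (U+D800..U+DFFF) are not rejected.
std::string encodeUtf8(unsigned long cp) {
    assert(cp <= 0x10FFFFUL);
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 1110xxxx 10xxxxxx 10xxxxxx  (U+3042 lands here)
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+3042 the three branches of the bit split give 0xE0 | 0x3 = E3, 0x80 | 0x01 = 81 and 0x80 | 0x02 = 82, which matches the bytes seen in the hex dump.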
EDIT:
A small scriptlet to help you get started, in Python (same language as above):
>>> print ',\n'.join(', '.join('"%s"' % (y.encode('utf-8').encode('string-escape')
,) for y in x) for x in [u'あいうえお', u'かきくけこ', u'さしすせそ'])
"\xe3\x81\x82", "\xe3\x81\x84", "\xe3\x81\x86", "\xe3\x81\x88", "\xe3\x81\x8a",
"\xe3\x81\x8b", "\xe3\x81\x8d", "\xe3\x81\x8f", "\xe3\x81\x91", "\xe3\x81\x93",
"\xe3\x81\x95", "\xe3\x81\x97", "\xe3\x81\x99", "\xe3\x81\x9b", "\xe3\x81\x9d"
"U+dddd" where each d is a hexadecimal digit denotes a Unicode code point.
You cannot store 16-bit values in 8-bit chars; that's the main problem you're having.
Use wide characters, e.g. (these are string literals) L"\x3042" or L"\u3042".
Then figure out how to make QString accept those.
Note: Visual C++ will emit silly warnings for the \u notation used within literals, while g++ will emit silly warnings for that notation used outside literals.
Cheers & hth.,