How to handle unicode character sequences in C/C++?
What are the more portable and clean ways to handle unicode character sequences in C and C++ ?
Moreover, how to:
-Read unicode strings
-Convert unicode strings to ASCII to save some bytes (if the user only inputs ASCII)
-Print unicode strings
Should I use the environment too ? I've read about LC_CTYPE for example, should I care about it as a developer ?
What are the more portable and clean ways to handle unicode character sequences in C and C++ ?
Have all strings in your program be UTF-8, UTF-16, or UTF-32. If for some reason you need to work with a non-Unicode encoding, do the conversion on input and output.
Read unicode strings
Same way you'd read an ASCII file. But there's still a lot of non-Unicode data around, so you'll want to check whether the data is Unicode. If it's not (or if it's UTF-8 when your preferred internal encoding is UTF-32), you'll need to convert it.
- UTF-8 and UTF-32 can be reliably detected by validation.
- UTF-16 can be detected by the presence of a BOM.
- If it's not a UTF encoding, it's likely in ISO-8859-1 or windows-1252.
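To make the detection step a bit more concrete, here is a rough sketch of both checks. The function names is_valid_utf8 and utf16_bom are made up for this example, and the validator only checks well-formedness (lead and continuation bytes, overlong forms, surrogates, range); it is not a replacement for a proper library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3, or 4 bytes.
    static const unsigned long min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        unsigned long cp;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;              // stray continuation or invalid lead byte
        if (len > n - i) return false;  // sequence is truncated
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp[len]) return false;              // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        i += len;
    }
    return true;
}

// Hypothetical helper: "UTF-16LE", "UTF-16BE", or "" if there is no UTF-16 BOM.
// Note that FF FE 00 00 is also the UTF-32LE BOM, so a real detector
// would have to look at more than two bytes.
std::string utf16_bom(const std::string& s) {
    if (s.size() >= 2) {
        const unsigned char a = static_cast<unsigned char>(s[0]);
        const unsigned char b = static_cast<unsigned char>(s[1]);
        if (a == 0xFF && b == 0xFE) return "UTF-16LE";
        if (a == 0xFE && b == 0xFF) return "UTF-16BE";
    }
    return "";
}
```

A Latin-1 byte such as 0xE9 on its own fails the UTF-8 check, which is why validation works well as a detector: text in a legacy 8-bit encoding rarely happens to be valid UTF-8 unless it is pure ASCII.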
Convert unicode strings to ASCII to save some bytes (if the user only inputs ASCII)
Don't. If your data is all ASCII, then UTF-8 will take exactly the same amount of space. And if it isn't, you'll lose information when you convert to ASCII. If you care about saving bytes:
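- Choose the optimal UTF encoding. For characters U+0000 to U+007F, UTF-8 is the smallest (one byte each). For U+0080 to U+07FF, UTF-8 and UTF-16 tie at two bytes. For characters U+0800 to U+FFFF, UTF-16 is the smallest (two bytes versus three in UTF-8).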
- Use data compression like gzip. There is a SCSU encoding specifically designed for Unicode, but I don't know how good it is.
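If you want to check which encoding is smaller for your own data, the per-character sizes are easy to compute. This is only a sketch: the function names are invented for the example, and it assumes the input is already a sequence of valid code points.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Bytes needed for one code point in UTF-8 (assumes a valid scalar value).
std::size_t utf8_size(char32_t cp) {
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Bytes needed for one code point in UTF-16: one 16-bit unit inside the
// Basic Multilingual Plane, a surrogate pair above it.
std::size_t utf16_size(char32_t cp) {
    return cp < 0x10000 ? 2 : 4;
}

// Total encoded size of a whole string in each encoding.
std::size_t utf8_total(const std::u32string& text) {
    std::size_t total = 0;
    for (char32_t cp : text) total += utf8_size(cp);
    return total;
}

std::size_t utf16_total(const std::u32string& text) {
    std::size_t total = 0;
    for (char32_t cp : text) total += utf16_size(cp);
    return total;
}
```

For mostly-ASCII text UTF-8 wins by a factor of two; for text made mostly of CJK characters UTF-16 comes out about a third smaller.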
Print unicode strings
Writing UTF-8 is no different from writing ASCII.
Except at the Windows command prompt, because it still uses the old "OEM" code pages. There you can use WriteConsoleW with UTF-16 strings.
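A possible way to wrap this up, as a sketch only: write_utf8 is a name invented here, the program is assumed to keep its strings in UTF-8, and the Windows branch assumes standard output really is a console (WriteConsoleW fails when output is redirected to a file or pipe, so real code would need a fallback).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: writes a UTF-8 string to standard output.
// Returns true on success.
bool write_utf8(const std::string& utf8) {
    if (utf8.empty()) return true;
#ifdef _WIN32
    // Convert UTF-8 to UTF-16, then hand the UTF-16 text to the console.
    const int in_len = static_cast<int>(utf8.size());
    const int n = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, nullptr, 0);
    if (n <= 0) return false;
    std::wstring wide(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), in_len, &wide[0], n);
    DWORD written = 0;
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    return WriteConsoleW(out, wide.data(), static_cast<DWORD>(n), &written, nullptr) != 0;
#else
    // Elsewhere, UTF-8 bytes are written exactly like ASCII bytes.
    return std::fwrite(utf8.data(), 1, utf8.size(), stdout) == utf8.size();
#endif
}
```

Whether the characters actually show up correctly still depends on the terminal (and, on Windows, on the console font).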
Should I use the environment too ? I've read about LC_CTYPE for example, should I care about it as a developer ?
LC_CTYPE is a holdover from the days when every language had its own character encoding, and thus its own ctype.h functions. Today, the Unicode Character Database takes care of that. The beauty of Unicode is that it separates character encoding handling from locale handling (except for the special uppercase/lowercase rules for Lithuanian, Turkish, and Azeri).
But each language still has its own collation rules and number formatting rules, so you'll still need locales for those. And you'll need to set your locale's character encoding to UTF-8.
What are the more portable and clean ways to handle unicode character sequences in C and C++ ?
Use a library like ICU. If you can't, that is, abso-freaking-lutely can't, roll your own. Be prepared to have a Hard Time though. Also, do look up the sample source code in the Unicode.org documentation.
Should I use the environment too ?
Yes. You will probably need to use the std::setlocale function as well. This would allow you to set a locale corresponding to the encoding you want, e.g. if you want to use British English as a language and UTF-8 as encoding you'd set LC_CTYPE to en_GB.UTF8.
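A minimal sketch of that call, with a fallback because the requested locale may simply not be installed on the machine. The helper name select_ctype_locale is made up, and the exact spelling of locale names (en_GB.UTF-8, en_GB.UTF8, and so on) is platform-dependent.

```cpp
#include <cassert>
#include <clocale>
#include <string>

// Hypothetical helper: tries the preferred locale for LC_CTYPE, then falls
// back to whatever the user's environment specifies. Returns the name of
// the locale that was selected, or "" if neither could be set.
std::string select_ctype_locale(const char* preferred) {
    const char* chosen = std::setlocale(LC_CTYPE, preferred);
    if (chosen == nullptr) {
        // An empty name means "use the environment" (LC_ALL, LC_CTYPE, LANG).
        chosen = std::setlocale(LC_CTYPE, "");
    }
    return chosen != nullptr ? std::string(chosen) : std::string();
}
```

You would typically call something like select_ctype_locale("en_GB.UTF-8") once at startup, before any locale-sensitive I/O, and check the result instead of assuming the call succeeded.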
C++03 does not give you a way to deal with Unicode. Your best bet is to use the wchar_t data type (and by extension std::wstring). However, note that its size and character encoding differ between operating systems. E.g. Windows uses 2 bytes for wchar_t and UTF-16 encoding, whereas GNU/Linux and Mac OS X use 4 bytes and UTF-32.
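The practical consequence is that code storing a character outside the Basic Multilingual Plane in a std::wstring has to care about the width of wchar_t. A rough sketch, under the assumption that a 2-byte wchar_t means UTF-16 and a 4-byte one means UTF-32 (to_wide is a name invented for this example):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: encodes one code point as a std::wstring.
// Assumes 2-byte wchar_t holds UTF-16 and 4-byte wchar_t holds UTF-32.
std::wstring to_wide(char32_t cp) {
    std::wstring out;
    if (sizeof(wchar_t) >= 4 || cp < 0x10000) {
        // Fits in a single wchar_t.
        out += static_cast<wchar_t>(cp);
    } else {
        // 16-bit wchar_t: split into a UTF-16 surrogate pair.
        const char32_t v = cp - 0x10000;
        out += static_cast<wchar_t>(0xD800 + (v >> 10));
        out += static_cast<wchar_t>(0xDC00 + (v & 0x3FF));
    }
    return out;
}
```

So the same character can have length 1 in a std::wstring on one platform and length 2 on another, which is exactly why wchar_t is awkward for portable code.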
C++0x is supposed to amend the situation by allowing Unicode literals, codecvt facets, C Unicode TR support (read <uchar.h>) etc., but that's still a long way off for most compilers. (There are a few questions here on SO that ought to help you get started.)
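For reference, this is roughly what the Unicode literals look like on a compiler that implements them (standardized as C++11). The variable names are just for illustration; the characters are written as universal character names so the example does not depend on the source file's encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji).
// u"..." gives UTF-16 code units, U"..." gives UTF-32 code units.
const std::u16string sample_utf16 = u"\u00E9\u20AC\U0001F600";
const std::u32string sample_utf32 = U"\u00E9\u20AC\U0001F600";

// u8"..." gives UTF-8. sizeof counts the terminating NUL, hence the -1.
// (The element type is char up to C++17 and char8_t from C++20 on.)
const std::size_t euro_utf8_bytes = sizeof(u8"\u20AC") - 1;
```

The UTF-16 string has four code units for three characters because the emoji needs a surrogate pair, while the UTF-32 string has exactly one unit per character.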
Do you need to read, print, or convert Unicode to ASCII if it fits? Just use UTF-8 and all of this will be completely transparent to you.
- Reading and writing: no different from ASCII
- ASCII is already a subset of UTF-8
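In other words, there is nothing to convert: a string that contains only ASCII is already valid UTF-8, byte for byte. If you ever need to know whether some input is pure ASCII, a check like the following is enough (is_ascii is a name made up for this sketch):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true if every byte is in the ASCII range 0x00-0x7F.
// Such a string is byte-for-byte identical in ASCII and in UTF-8.
bool is_ascii(const std::string& s) {
    for (unsigned char c : s) {
        if (c >= 0x80) return false;  // part of a multi-byte UTF-8 sequence
    }
    return true;
}
```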
For text analysis/handling, use a good library such as ICU, Boost.Locale, or even Qt or GLib; they provide quite good tools for it.
There are good answers written here before this one, but none of them mentioned one particular thing that I see as a probable problem, since this question also has the C tag. My C knowledge is outdated, so please correct me if I'm wrong.
Note that zero-terminated strings, traditional C string functions, and a UTF-16 encoded data stream are likely a tricky combination: in UTF-16 many Western alphanumeric characters are encoded in two bytes, one of which is all zeros, so reading the character data as a series of chars is not what it used to be with single-byte charsets.
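A small illustration of the problem, assuming little-endian UTF-16 (the array and function names are invented for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// "Hi" encoded as UTF-16LE: each letter is followed by a zero byte,
// and the string ends with a two-byte terminator.
const char utf16le_hi[] = {'H', 0, 'i', 0, 0, 0};

// What a byte-oriented C function sees: it stops at the first zero byte.
std::size_t naive_length() {
    return std::strlen(utf16le_hi);
}

// Counts 16-bit units up to the 16-bit terminator instead.
std::size_t utf16_units(const char* bytes) {
    std::size_t n = 0;
    while (bytes[2 * n] != 0 || bytes[2 * n + 1] != 0) ++n;
    return n;
}
```

strlen reports a length of 1 here because the high byte of 'H' already looks like a terminator, while counting 16-bit units gives the expected 2. UTF-8 does not have this problem, since a zero byte only ever encodes U+0000.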