wchar_t argv in C -- Unicode
Does GCC support an equivalent of Microsoft's wmain()? I'm writing a C program and need to use Unicode throughout. If not, can char be converted to wchar_t?
You don't need wchar_t for Unicode. You can use char for the UTF-8 encoding of Unicode. Plus, wchar_t can be different sizes. On Windows, it is 16 bits, but on many Linux/Unix platforms it is 32 bits.
For more info specific to GCC, see this post I found via a Google search:
http://article.gmane.org/gmane.comp.gnu.mingw.user/22962
(According to that, the answer to your question of whether GCC supports wmain is "no".)
Many of C's standard string functions are encoding agnostic. You can use char* to store UTF-8 encoded strings and use them safely with:
strcpy strncpy strcat strncat strcmp strncmp strdup strchr
strrchr strcspn strspn strpbrk strstr strtok
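As a small illustration of why the byte-oriented functions still work, here is a sketch that searches one UTF-8 string for another with strstr. The helper name utf8_find is made up for this example, and both strings are assumed to be valid UTF-8.

```c
#include <string.h>

/* Return the byte offset of needle inside haystack, or -1 if it is absent.
   utf8_find is a hypothetical helper; both arguments are assumed to be
   valid UTF-8 stored in ordinary char arrays. */
long utf8_find(const char *haystack, const char *needle)
{
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1;
}
```

For example, utf8_find("na\xc3\xafve caf\xc3\xa9", "caf\xc3\xa9") returns 7, which is a byte offset, not a character index. Note that the length-limited variants (strncpy, strncat, strncmp) also count bytes, so a limit can fall in the middle of a multi-byte character.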
Some other functions will not give you correct results with Unicode strings. For example, strlen always counts bytes, not characters. The number of characters can be counted portably in C using mbstowcs(NULL, s, 0), which returns the number of characters in s successfully translated to wchar_t (or (size_t)-1 if s contains an invalid sequence). This works for UTF-8 as for any other supported encoding, as long as the appropriate locale has been selected.
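A minimal sketch of that counting technique follows. The function names count_chars and use_environment_locale are invented for this example; the result depends on which locale is active when count_chars runs.

```c
#include <locale.h>
#include <stdlib.h>

/* Count the characters in a multi-byte string according to the current
   LC_CTYPE locale. Returns (size_t)-1 if s is not valid in that encoding. */
size_t count_chars(const char *s)
{
    return mbstowcs(NULL, s, 0);
}

/* Select the locale named by the environment (LANG, LC_ALL, ...).
   Meant to be called once at startup, before count_chars.
   Returns 1 on success and 0 if the locale could not be set. */
int use_environment_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}
```

Under a UTF-8 locale, count_chars("\xc3\xa9") should give 1 where strlen gives 2. In the default "C" locale only plain ASCII input is reliably counted.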
If you want to do advanced operations on Unicode strings, such as complex code page conversions, regular expressions, or text wrapping on word boundaries, I suggest you use a good library like ICU.
Refer: Using Unicode in C/C++.
If you want to process Unicode command line arguments without wmain, you can use the standard main function declared without arguments and the Windows API functions GetCommandLineW, CommandLineToArgvW, and LocalFree. CommandLineToArgvW uses the same rules for command line parsing as the Microsoft runtime library.
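The three calls fit together roughly as in the sketch below. The function name dump_wide_args is made up for this example, and the #ifdef guard is only there so the file still compiles on non-Windows systems, where the function does nothing. On Windows you may need to link against shell32.

```c
#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW */
#endif
#include <stdio.h>

/* Print every command line argument as hex UTF-16 code units.
   Returns the argument count, or -1 on failure or on non-Windows builds.
   dump_wide_args is a hypothetical name used for illustration. */
int dump_wide_args(void)
{
#ifdef _WIN32
    int argc = 0;
    LPWSTR *argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv == NULL)
        return -1;
    for (int i = 0; i < argc; i++) {
        printf("argv[%d]:", i);
        for (const wchar_t *p = argv[i]; *p != L'\0'; p++)
            printf(" %04x", (unsigned)*p);
        printf("\n");
    }
    LocalFree(argv);    /* the whole array is a single allocation */
    return argc;
#else
    return -1;          /* GetCommandLineW is a Windows-only API */
#endif
}
```

Printing code units in hex sidesteps the separate question of how the console renders wide characters.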
If you do want to use wide strings, mbstowcs will convert a multi-byte string to a wchar_t string. The encoding it assumes the multi-byte string is in depends on the LC_CTYPE category of the current locale. It's necessary to set this with setlocale; otherwise you will get the "C" locale by default.
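Here is a sketch of such a conversion, assuming the caller has already selected a locale (for instance with setlocale(LC_CTYPE, "")). The helper name to_wide is invented for this example.

```c
#include <stdlib.h>
#include <wchar.h>

/* Convert a multi-byte string to a newly allocated wide string using the
   current LC_CTYPE locale. Returns NULL on invalid input or allocation
   failure; the caller frees the result. */
wchar_t *to_wide(const char *s)
{
    size_t n = mbstowcs(NULL, s, 0);        /* length without terminator */
    if (n == (size_t)-1)
        return NULL;                        /* invalid multi-byte sequence */

    wchar_t *w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;

    mbstowcs(w, s, n + 1);                  /* also writes the terminator */
    return w;
}
```

Calling mbstowcs twice, once to measure and once to convert, avoids guessing a buffer size.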
The question remains of what character encoding is used in argv. This could be UTF-8, or it could be one of the single-byte encodings like Latin-1. This depends on your terminal settings. Experimenting with xterm, I got different values for argv when I passed "é" on the command line, depending on the value of LANG that xterm inherited: for LANG=en_US.UTF-8, it gave "c3 a9"; for LANG=en_US, it gave "e9" (I think this is Latin-1.)
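An experiment like this one can be reproduced with a small hex dump of each argument. The helper below is a sketch; dump_hex is a name made up for this example, and you would call it on argv[1] from main.

```c
#include <stdio.h>

/* Print the bytes of s in hex, separated by spaces, followed by a newline.
   Returns the number of bytes printed. */
size_t dump_hex(const char *s)
{
    size_t n = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++, n++)
        printf(n ? " %02x" : "%02x", (unsigned)*p);
    printf("\n");
    return n;
}
```

Given the two-byte UTF-8 form of "é", dump_hex("\xc3\xa9") prints "c3 a9" and returns 2.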
Call setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "") first to pick up the locale from the environment, so that the correct multi-byte format (as set by the environment variables) is used. You will have problems if LANG is changed after the terminal emulator is started, though.
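To check which encoding the environment's locale actually selects, something like the following sketch may help. It relies on nl_langinfo(CODESET), which is a POSIX function rather than standard C, so it assumes a POSIX system such as Linux with glibc; environment_codeset is a name invented for this example.

```c
#include <langinfo.h>   /* nl_langinfo, CODESET (POSIX) */
#include <locale.h>

/* Adopt the locale named by the environment and report its character
   encoding, e.g. "UTF-8" or "ISO-8859-1". */
const char *environment_codeset(void)
{
    /* If this call fails, the "C" locale simply stays in effect. */
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```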
glibc provides several other functions for character set conversion which may be more appropriate; see the "Character Set Handling" section of the glibc manual for more information. My experience is that converting a string in argv to a given encoding is quite tricky, and it may have to be done in two stages: first convert it to wchar_t format, then convert it from wchar_t to the desired encoding (e.g. UTF-8).
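The second stage can be sketched by hand when the target is UTF-8. The code below assumes each wchar_t holds one full Unicode code point, which is the case with glibc but not on Windows, where wchar_t is 16 bits. The names utf8_encode and wide_to_utf8 are made up for this example; the first stage would be an mbstowcs call under the locale that matches argv.

```c
#include <stdlib.h>
#include <wchar.h>

/* Encode one code point as UTF-8 into out (at least 4 bytes).
   Returns the number of bytes written, or 0 for an invalid code point. */
static size_t utf8_encode(unsigned long cp, char *out)
{
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                           /* surrogates are not characters */
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}

/* Convert a wide string to a newly allocated UTF-8 string.
   Returns NULL on an invalid code point or allocation failure;
   the caller frees the result. */
char *wide_to_utf8(const wchar_t *w)
{
    size_t len = wcslen(w);
    char *buf = malloc(len * 4 + 1);        /* worst case: 4 bytes each */
    if (buf == NULL)
        return NULL;

    size_t pos = 0;
    for (size_t i = 0; i < len; i++) {
        size_t n = utf8_encode((unsigned long)w[i], buf + pos);
        if (n == 0) {
            free(buf);
            return NULL;
        }
        pos += n;
    }
    buf[pos] = '\0';
    return buf;
}
```

Encoding by hand avoids having to switch to a UTF-8 locale for wcstombs, which may not be installed on the system.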