wchar_t argv in C -- Unicode

2023-02-16 03:53

Does GCC support the Microsoft equivalent of wmain()? I'm writing a C program and need to use Unicode throughout. If not, can char be converted to wchar_t?

You don't need wchar_t for Unicode. You can use char with the UTF-8 encoding of Unicode. Also, wchar_t does not have a fixed size: on Windows it is 16 bits, but on many Linux/Unix platforms it is 32 bits.

For more info specific to GCC, see this post I found via a Google search:

http://article.gmane.org/gmane.comp.gnu.mingw.user/22962

(According to that, the answer to your question of whether GCC supports wmain is "no".)

Many of C's standard string functions are encoding agnostic. You can use char* to store UTF-8 encoded strings and use them safely with:

strcpy strncpy strcat strncat strcmp strncmp strdup strchr 
strrchr strcspn strspn strpbrk strstr strtok
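As a rough illustration of why the byte-oriented functions above are safe on UTF-8, here is a minimal sketch. The helper name utf8_find is made up for this example. It relies on the fact that a valid UTF-8 sequence never matches in the middle of another character, so strstr cannot report a false hit as long as both strings are valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte offset of the first occurrence of needle in
   haystack, or -1 if it does not occur. Both strings are assumed to be
   valid UTF-8; strstr just compares bytes. */
long utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}
```

Note that the result is a byte offset, not a character index. The same caveat applies to the length-limited functions: strncpy and strncat count bytes, so a limit can fall in the middle of a multi-byte character.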

Some other functions will not give the results you expect if you think in characters rather than bytes. For example, strlen always counts bytes, not characters. The number of characters can be counted in C in a portable way using mbstowcs(NULL, s, 0), which returns the number of wide characters that s would convert to, or (size_t)-1 if s contains an invalid sequence. This works for UTF-8 as for any other supported encoding, as long as the appropriate locale has been selected.
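A minimal sketch of the byte count versus character count difference, assuming a platform where mbstowcs accepts a NULL destination (POSIX does). The wrapper name mb_char_count is only for illustration.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: number of wide characters s converts to under the
   current LC_CTYPE locale, or (size_t)-1 if s contains a byte sequence
   that is invalid in that locale's encoding. */
size_t mb_char_count(const char *s)
{
    assert(s != NULL);
    return mbstowcs(NULL, s, 0);
}
```

For the UTF-8 string "caf\xc3\xa9" ("café"), strlen reports 5 bytes, while mb_char_count should report 4 once a UTF-8 locale has been selected with setlocale. In the default "C" locale the same call may fail with (size_t)-1 because the non-ASCII bytes are not valid there. Where wchar_t is 16 bits, characters outside the Basic Multilingual Plane may count as two.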

If you want to do advanced operations on Unicode strings like complex code page conversions, regular expressions, text wrapping on word boundaries etc, I suggest you use a good library like ICU.

See also: Using Unicode in C/C++.

If you want to process Unicode command line arguments without wmain, you can use the standard main function (with or without arguments) together with the Windows API functions GetCommandLineW, CommandLineToArgvW, and LocalFree. CommandLineToArgvW uses the same rules for command line parsing as the Microsoft runtime library.
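The three calls fit together roughly as follows. This is a sketch, not a complete program: the function name dump_wide_args is invented here, the non-Windows branch exists only so the file still compiles elsewhere, and depending on the toolchain you may need to link against shell32.

```c
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#include <shellapi.h>   /* CommandLineToArgvW */
#endif

/* Hypothetical helper: prints every command line argument as UTF-16 code
   units in hex. Returns the argument count, or -1 on failure and on
   non-Windows builds. */
int dump_wide_args(void)
{
#ifdef _WIN32
    int argc = 0;
    LPWSTR *argv = CommandLineToArgvW(GetCommandLineW(), &argc);
    if (argv == NULL)
        return -1;
    for (int i = 0; i < argc; i++) {
        printf("argv[%d]:", i);
        for (const WCHAR *p = argv[i]; *p != L'\0'; p++)
            printf(" %04x", (unsigned)*p);
        printf("\n");
    }
    LocalFree(argv);    /* array and strings are one allocation */
    return argc;
#else
    return -1;
#endif
}
```

Calling dump_wide_args() from an ordinary main gives access to the original UTF-16 arguments regardless of what the char-based argv contains.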

If you do want to use wide strings, mbstowcs will convert a multi-byte string to a wchar_t string. The encoding it assumes the multi-byte string is in depends on the LC_CTYPE category of the current locale. It's necessary to set this with setlocale; otherwise you will get the "C" locale by default.
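A minimal sketch of such a conversion, using mbstowcs twice: once to measure, once to convert. The name mb_to_wide is hypothetical, and the caller is assumed to have selected a locale beforehand, for example with setlocale(LC_CTYPE, "").

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical helper: converts a multi-byte string to a newly allocated
   wide string using the current LC_CTYPE locale. Returns NULL on invalid
   input or allocation failure; the caller frees the result. */
wchar_t *mb_to_wide(const char *s)
{
    assert(s != NULL);
    size_t n = mbstowcs(NULL, s, 0);       /* length without terminator */
    if (n == (size_t)-1)
        return NULL;
    wchar_t *ws = malloc((n + 1) * sizeof *ws);
    if (ws == NULL)
        return NULL;
    mbstowcs(ws, s, n + 1);                /* also writes the terminator */
    return ws;
}
```

Pure ASCII input converts the same way in any locale, so mb_to_wide("abc") yields L"abc" even under the default "C" locale. Non-ASCII input only converts correctly when the locale's encoding matches the bytes.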

The question remains of what character encoding is used in argv. This could be UTF-8, or it could be one of the single-byte encodings like Latin-1. This depends on your terminal settings. Experimenting with xterm, I got different values for argv when I passed "é" on the command line, depending on the value of LANG that xterm inherited: for LANG=en_US.UTF-8, it gave "c3 a9"; for LANG=en_US, it gave "e9" (I think this is Latin-1.)
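An experiment like this can be repeated with a small hex dump of each argument. The sketch below is only one way to do it; hex_bytes is a made-up helper name.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the bytes of s into out as space-separated
   lowercase hex, e.g. "c3 a9". Returns 0 on success, -1 if out is too
   small. */
int hex_bytes(const char *s, char *out, size_t outsz)
{
    assert(s != NULL && out != NULL);
    size_t len = strlen(s);
    size_t need = len ? len * 3 : 1;   /* 2 digits per byte, separators, NUL */
    if (outsz < need)
        return -1;
    size_t pos = 0;
    out[0] = '\0';
    for (size_t i = 0; i < len; i++) {
        if (i > 0)
            out[pos++] = ' ';
        pos += (size_t)sprintf(out + pos, "%02x",
                               (unsigned)(unsigned char)s[i]);
    }
    return 0;
}
```

Calling hex_bytes on argv[1] in main and printing the buffer shows exactly which bytes the shell passed in: the two bytes c3 a9 for a UTF-8 "é", or the single byte e9 for Latin-1.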

To make the conversion functions use the correct multi-byte encoding, first call setlocale(LC_CTYPE, "") or setlocale(LC_ALL, ""), which takes the locale from the environment variables. You will have problems if LANG is changed after the terminal emulator is started, though.
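To check which encoding the environment actually selected, POSIX systems offer nl_langinfo(CODESET). A short sketch, assuming a POSIX platform (langinfo.h is not available with the Microsoft runtime); env_codeset is a hypothetical name.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: adopts the LC_CTYPE locale named by the
   environment and returns the name of its character encoding. If the
   environment names a locale that is not installed, setlocale fails and
   the previous locale stays in effect. */
const char *env_codeset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

With LANG=en_US.UTF-8 this would typically report "UTF-8". The exact name returned for other locales varies between C libraries.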

glibc provides several other functions for character set conversion which may be more appropriate; see the "Character Set Handling" section of the glibc manual for more information. In my experience, converting a string in argv to a given encoding is quite tricky, and it may have to be done in two stages: first convert it to wchar_t, then convert it from wchar_t to the desired encoding (e.g. UTF-8).
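For the second stage, here is a hand-written sketch of a wchar_t to UTF-8 encoder. It assumes that each wchar_t holds exactly one Unicode code point, which is the case with glibc's 32-bit wchar_t but not with 16-bit wchar_t on Windows. The name wide_to_utf8 is invented for this example; glibc's iconv interface is an alternative that avoids writing the encoder yourself.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: encodes ws as UTF-8 into out. Returns the number
   of bytes written (excluding the terminator), or (size_t)-1 if out is
   too small or a value is not a valid Unicode code point. */
size_t wide_to_utf8(const wchar_t *ws, char *out, size_t outsz)
{
    assert(ws != NULL && out != NULL);
    if (outsz == 0)
        return (size_t)-1;
    size_t pos = 0;
    for (; *ws != L'\0'; ws++) {
        unsigned long cp = (unsigned long)*ws;
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;
        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (pos + len >= outsz)            /* keep room for the NUL */
            return (size_t)-1;
        if (len == 1) {
            out[pos++] = (char)cp;
        } else if (len == 2) {
            out[pos++] = (char)(0xC0 | (cp >> 6));
            out[pos++] = (char)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            out[pos++] = (char)(0xE0 | (cp >> 12));
            out[pos++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[pos++] = (char)(0x80 | (cp & 0x3F));
        } else {
            out[pos++] = (char)(0xF0 | (cp >> 18));
            out[pos++] = (char)(0x80 | ((cp >> 12) & 0x3F));
            out[pos++] = (char)(0x80 | ((cp >> 6) & 0x3F));
            out[pos++] = (char)(0x80 | (cp & 0x3F));
        }
    }
    out[pos] = '\0';
    return pos;
}
```

The first stage would be a locale-aware call to mbstowcs on the argv string after setlocale(LC_CTYPE, ""); the wide result is then passed to this function.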
