What is the encoding of argv?
It's not clear to me what encodings are used where in C's argv. In particular, I'm interested in the following scenario:
- A user uses locale L1 to create a file whose name, N, contains non-ASCII characters
- Later on, a user uses locale L2 to tab-complete the name of that file on the command line, which is fed into a program P as a command line argument
What sequence of bytes does P see on the command line?
I have observed that on Linux, creating a filename in the UTF-8 locale and then tab-completing it in (e.g.) the zh_TW.big5 locale seems to cause my program P to be fed UTF-8 rather than Big5. However, on OS X the same series of actions results in my program P getting a Big5-encoded filename.
Here is what I think is going on so far (long, and I'm probably wrong and need to be corrected):
Windows
File names are stored on disk in some Unicode format. So Windows takes the name N, converts it from L1 (the current code page) to a Unicode version of N we will call N1, and stores N1 on disk.
What I then assume happens is that when tab-completing later on, the name N1 is converted to locale L2 (the new current code page) for display. With luck, this will yield the original name N -- but this won't be true if N contained characters unrepresentable in L2. We call the new name N2.
When the user actually presses enter to run P with that argument, the name N2 is converted back into Unicode, yielding N1 again. This N1 is now available to the program in UCS-2 format via GetCommandLineW/wmain/tmain, but users of GetCommandLine/main will see the name N2 in the current locale (code page).
OS X
The disk-storage story is the same, as far as I know. OS X stores file names as Unicode.
With a Unicode terminal, I think what happens is that the terminal builds the command line in a Unicode buffer. So when you tab complete, it copies the file name as a Unicode file name to that buffer.
When you run the command, that Unicode buffer is converted to the current locale, L2, and fed to the program via argv, and the program can decode argv with the current locale into Unicode for display.
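As an aside, a program can at least ask which encoding the current locale claims its text (and hence argv) should be in. A minimal POSIX sketch (this only reports what the locale says; it cannot tell you what the terminal actually sent):

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int main(void)
{
    /* Adopt the user's locale; without this the "C" locale is in effect. */
    setlocale(LC_ALL, "");
    /* Report the character encoding the locale says text should be in. */
    printf("locale codeset: %s\n", nl_langinfo(CODESET));
    return 0;
}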
Linux
On Linux, everything is different and I'm extra-confused about what is going on. Linux stores file names as byte strings, not in Unicode. So if you create a file with name N in locale L1, then N as a byte string is what is stored on disk.
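A small sketch to convince yourself of this: the kernel will happily create a file whose name is not valid UTF-8, because to it a name is just bytes (the name "f\xFC.txt" below is my own example, Latin-1 for "fü.txt"):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* 0xFC is "ü" in Latin-1 and is not valid UTF-8, but the kernel
       treats the name as an opaque byte string and accepts it. */
    const char *name = "f\xFC.txt";
    int fd = open(name, O_CREAT | O_WRONLY, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    close(fd);
    /* Print the raw bytes that are now stored on disk. */
    for (const unsigned char *p = (const unsigned char *)name; *p; p++)
        printf("%02x ", *p);
    printf("\n");
    return 0;
}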
When I later run the terminal and try to tab-complete the name, I'm not sure what happens. It looks to me like the command line is constructed as a byte buffer, and the name of the file as a byte string is just concatenated onto that buffer. I assume that when you type a standard character it is encoded on the fly to bytes that are appended to that buffer.
When you run a program, I think that buffer is sent directly to argv. Now, what encoding does argv have? It looks like any characters you typed in the command line while in locale L2 will be in the L2 encoding, but the file name will be in the L1 encoding. So argv contains a mixture of two encodings!
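If that is right, a program can at least detect when an argument fails to decode in the locale it was started under. A sketch of my own using the standard mbrtowc decoder:

#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

int main(int argc, char **argv)
{
    setlocale(LC_ALL, "");          /* adopt the user's locale (L2) */
    if (argc < 2)
        return 1;
    mbstate_t st;
    memset(&st, 0, sizeof st);
    const char *p = argv[1];
    size_t left = strlen(p);
    while (left > 0) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, p, left, &st);
        if (n == (size_t)-1 || n == (size_t)-2) {
            /* A filename pasted in verbatim from another locale
               will typically trip this branch. */
            printf("undecodable byte 0x%02x at offset %td\n",
                   (unsigned char)*p, p - argv[1]);
            return 1;
        }
        if (n == 0)                 /* embedded NUL; cannot occur in argv */
            n = 1;
        p += n;
        left -= n;
    }
    printf("argument decodes cleanly in the current locale\n");
    return 0;
}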
Question
I'd really like it if someone could let me know what is going on here. All I have at the moment is half-guesses and speculation, and it doesn't really fit together. What I'd really like to be true is for argv to be encoded in the current code page (Windows) or the current locale (Linux / OS X), but that doesn't seem to be the case...
Extras
Here is a simple candidate program P that lets you observe encodings for yourself:
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
    /* Print each byte of the first argument as an integer
       (note: values may come out negative where char is signed). */
    for (char *c = argv[1]; *c; c++, len++) {
        printf("%d ", (int)(*c));
    }
    printf("\nLength: %d\n", len);
    return 0;
}
You can use locale -a to see available locales, and use export LC_ALL=my_encoding to change your locale.
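For example, a session might look like this (the exact locale names and spellings vary by system; zh_TW.big5 is just an illustration):

$> locale -a | grep -i big5
zh_TW.big5
$> export LC_ALL=zh_TW.big5
$> ./P my_file_ü    # tab-complete the non-ASCII name here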
Thanks everyone for your responses. I have learnt quite a lot about this issue and have discovered the following things that have resolved my question:
As discussed, on Windows argv is encoded using the current code page. However, you can retrieve the command line as UTF-16 using GetCommandLineW. Use of argv is not recommended for modern Windows apps with Unicode support, because code pages are deprecated.
On Unixes, the argv has no fixed encoding:
a) File names inserted by tab-completion/globbing will occur in argv verbatim as exactly the byte sequences by which they are named on disk. This is true even if those byte sequences make no sense in the current locale.
b) Input entered directly by the user using their IME will occur in argv in the locale encoding. (Ubuntu seems to use LOCALE to decide how to encode IME input, whereas OS X uses the Terminal.app encoding Preference.)
This is annoying for languages such as Python, Haskell or Java, which want to treat command line arguments as strings. They need to decide how to decode argv into whatever encoding is used internally for a String (which is UTF-16 for those languages). However, if they just use the locale encoding to do this decoding, then valid filenames in the input may fail to decode, causing an exception.
The solution to this problem adopted by Python 3 is a surrogate-byte encoding scheme (http://www.python.org/dev/peps/pep-0383/), which represents any undecodable byte in argv as a special Unicode code point. When that code point is encoded back to a byte stream, it just becomes the original byte again. This allows data from argv that is not valid in the current encoding (i.e. a filename named in something other than the current locale) to be round-tripped through the native Python string type and back to bytes with no loss of information.
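To make the trick concrete, here is a toy sketch of the idea in C (my own illustration, not Python's implementation; for brevity the "decoder" is plain ASCII, so any byte >= 0x80 counts as undecodable):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Every undecodable byte is smuggled through as the lone surrogate
   U+DC00 + byte, which can never appear in genuinely decoded text,
   so encoding can reverse the mapping losslessly. */
static size_t decode(const unsigned char *in, uint32_t *out)
{
    size_t n = 0;
    for (; *in; in++)
        out[n++] = (*in < 0x80) ? *in : 0xDC00u + *in;
    return n;
}

static size_t encode(const uint32_t *in, size_t n, unsigned char *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++)
        out[m++] = (in[i] >= 0xDC80u && in[i] <= 0xDCFFu)
                 ? (unsigned char)(in[i] - 0xDC00u)   /* escaped byte */
                 : (unsigned char)in[i];              /* ordinary char */
    return m;
}

int main(void)
{
    const unsigned char name[] = "f\xFC.txt";   /* not valid ASCII/UTF-8 */
    uint32_t cps[16];
    unsigned char back[16];
    size_t n = decode(name, cps);
    size_t m = encode(cps, n, back);
    back[m] = '\0';
    printf("round-trip %s\n",
           memcmp(name, back, m + 1) == 0 ? "OK" : "FAILED");
    return 0;
}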
As you can see, the situation is pretty messy :-)
I can only speak about Windows for now. On Windows, code pages are only meant for legacy applications and are not used by the system or by modern applications. Windows uses UTF-16 (and has done so for ages) for everything: text display, file names, the terminal, the system API. Conversions between UTF-16 and the legacy code pages are only performed at the highest possible level, directly at the interface between the system and the application (technically, the older API functions are implemented twice: one function FunctionW that does the real work and expects UTF-16 strings, and one compatibility function FunctionA that simply converts input strings from the current (thread) code page to UTF-16, calls FunctionW, and converts back the results). Tab-completion should always yield UTF-16 strings (it definitely does when using a TrueType font) because the console uses only UTF-16 as well. The tab-completed UTF-16 file name is handed over to the application. If that application is a legacy application (i.e., it uses main instead of wmain/GetCommandLineW etc.), then the Microsoft C runtime (probably) uses GetCommandLineA to have the system convert the command line. So basically I think what you're saying about Windows is correct (only that there is probably no conversion involved while tab-completing): the argv array will always contain the arguments in the code page of the current application, because the information about which code page (L1) the original program used has been irreversibly lost during the intermediate UTF-16 stage.
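To illustrate the A/W pattern, here is a hypothetical reconstruction of my own (not actual Windows source; SetConsoleTitleA/W is just a convenient example pair):

#include <windows.h>

/* Sketch of how an "A" compatibility wrapper works: convert the
   code-page string to UTF-16, call the real W function, clean up. */
static BOOL MySetConsoleTitleA(const char *title)
{
    int n = MultiByteToWideChar(CP_ACP, 0, title, -1, NULL, 0);
    if (n <= 0)
        return FALSE;
    WCHAR *wide = (WCHAR *)HeapAlloc(GetProcessHeap(), 0, n * sizeof(WCHAR));
    if (!wide)
        return FALSE;
    MultiByteToWideChar(CP_ACP, 0, title, -1, wide, n);
    BOOL ok = SetConsoleTitleW(wide);   /* the real work happens here */
    HeapFree(GetProcessHeap(), 0, wide);
    return ok;
}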
The conclusion is, as always on Windows: avoid the legacy code pages; use the UTF-16 API wherever you can. If you have to use main instead of wmain (e.g., to be platform independent), use GetCommandLineW instead of the argv array.
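A minimal sketch of that advice (CommandLineToArgvW requires linking against Shell32):

#include <windows.h>
#include <shellapi.h>
#include <stdio.h>

int main(void)
{
    /* Recover the arguments as UTF-16, bypassing the lossy
       code-page conversion that produced argv. */
    int wargc;
    LPWSTR *wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv == NULL)
        return 1;
    for (int i = 0; i < wargc; i++)
        wprintf(L"arg %d: %ls\n", i, wargv[i]);
    LocalFree(wargv);
    return 0;
}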
The output from your test app needed some modifications to make sense: you need the hex codes, and you need to get rid of the negative values (plain char may be signed), or you can't read multi-byte sequences such as UTF-8 special characters.
First, the modified program:
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        printf("Not enough arguments\n");
        return 1;
    }

    int len = 0;
    /* Cast to unsigned char so each byte prints as 00-ff in hex
       instead of a negative number where char is signed. */
    for (unsigned char *c = (unsigned char *)argv[1]; *c; c++, len++) {
        printf("%x ", (*c));
    }
    printf("\nLength: %d\n", len);
    return 0;
}
Then on my Ubuntu box that is using UTF-8 I get this output.
$> gcc -std=c99 argc.c -o argc
$> ./argc 1ü
31 c3 bc
Length: 3
And here you can see that in my case ü is encoded as two bytes and the 1 as a single byte, more or less exactly what you expect from a UTF-8 encoding.
And this actually matches what is in the LANG environment variable.
$> env | grep LANG
LANG=en_US.utf8
Hope this clarifies the Linux case a little.
/Good luck
Yep, users have to be careful when mixing locales on Unix in general. GUI file managers that display and change filenames also have this problem. On Mac OS X the standard Unix encoding is UTF-8. In fact the HFS+ filesystem, when called via the Unix interfaces, enforces UTF-8 filenames because it needs to convert them to UTF-16 for storage in the filesystem itself.