gcc, UTF-8 and limits.h
My OS is Debian, my default locale is UTF-8 and my compiler is gcc. By default CHAR_BIT in limits.h is 8 which is ok for ASCII because in ASCII 1 char = 8 bits. But since I am using UTF-8, chars can be up to 32 bits which contradicts the CHAR_BIT default value of 8.
If I modify CHAR_BIT to 32 in limits.h to better suit UTF-8, what do I have to do in order for this new value to come into effect? I guess I have to recompile gcc? Do I have to recompile the Linux kernel? What about the default installed Debian packages, will they work?
CHAR_BIT is the number of bits in a char; never, ever change this. It is not going to have the effect you want.
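To make that concrete, here is a minimal sketch of what CHAR_BIT actually describes. The helper names check_char_invariants and bits_in_object are made up for illustration; the only real pieces are CHAR_BIT from limits.h and the sizeof operator.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: facts that hold on every conforming C implementation,
   whatever the locale or the encoding of your strings. */
void check_char_invariants(void)
{
    assert(sizeof(char) == 1); /* sizeof counts in units of char */
    assert(CHAR_BIT >= 8);     /* the standard requires at least 8 bits */
}

/* Hypothetical helper: width in bits of an object whose sizeof is given. */
size_t bits_in_object(size_t size_in_chars)
{
    return size_in_chars * CHAR_BIT;
}
```

CHAR_BIT reports a property of the compiler's target; it is not a setting. Editing the header would only make it disagree with the code gcc really generates.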
Instead, work with strings of UTF-8 encoded chars, or use strings of wchar_t if you want to store Unicode characters directly.*
* Small print: The size of wchar_t is system-dependent as well. On Windows with MSVC, it's only 16 bits, which is only sufficient for the Basic Multilingual Plane. You can use it with UTF-16, though, which plays nice with the Windows API. On most other systems, wchar_t gives you the full 32 bits.
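If you want to check what your own platform gives you, something like the following sketch should do. The function names wchar_bits and wchar_holds_any_code_point are hypothetical; WCHAR_MAX comes from wchar.h, and 0x10FFFF is the highest Unicode code point.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: width of wchar_t in bits on this platform. */
size_t wchar_bits(void)
{
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Hypothetical helper: returns 1 if a single wchar_t can hold any code
   point up to U+10FFFF, and 0 otherwise (e.g. a 16-bit wchar_t). */
int wchar_holds_any_code_point(void)
{
    /* The standard itself only promises a WCHAR_MAX of at least 127. */
    assert(WCHAR_MAX >= 127);
    return WCHAR_MAX >= 0x10FFFF;
}
```

With a 32-bit wchar_t the second function returns 1; with a 16-bit wchar_t it returns 0, which is the case where you end up handling UTF-16 surrogate pairs.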
You do not need char to be 32 bits to use UTF-8. UTF-8 is a variable-length encoding built from 8-bit code units, and it is backward compatible with ASCII.
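As a rough illustration of "variable length, 8-bit units, ASCII compatible", the first byte of each UTF-8 sequence tells you how many bytes the sequence occupies. The function utf8_sequence_length below is a hypothetical helper, not a library call, and it is only a sketch: it looks at the bit pattern of the lead byte and does not validate the continuation bytes or reject overlong forms.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes in the UTF-8 sequence that starts
   at s, judged from the lead byte alone. Returns 0 for an empty string or
   for a byte that cannot start a sequence. */
size_t utf8_sequence_length(const char *s)
{
    unsigned char lead;

    assert(s != NULL);
    lead = (unsigned char)s[0];

    if (lead == 0) return 0;              /* end of string */
    if (lead < 0x80) return 1;            /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation or invalid byte */
}
```

Every byte here is an ordinary 8-bit char. An ASCII string is already valid UTF-8, because each of its bytes falls in the first branch.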
You may also use wchar_t, which is 32 bits on Linux, but it generally would not give you much added value, because Unicode processing is much more complicated than just managing code points.
C and C++ define char as a byte, i.e., the integer type for which sizeof returns 1. It doesn't have to be 8 bits, but the overwhelming majority of the time, it is. IMHO, it should have been named byte. But back in 1972 when C was created, Westerners didn't have to deal with multi-byte character encodings, so you could get away with conflating the "character" and "byte" types.
You just have to live with the confusing terminology. Or typedef it away. But don't edit your system header files. If you want a character type instead of a byte type, use wchar_t.
But a UTF-8 string is made of 8-bit code units, so char will work just fine. You just have to remember the distinction between char and character. For example, don't do this:
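#include <ctype.h>

/* Anti-example: treats every byte as if it were a complete character. */
void make_upper_case(char* pstr)
{
    while (*pstr != '\0')
    {
        /* Wrong for UTF-8: a byte >= 0x80 is only part of a character. */
        *pstr = toupper(*pstr);
        pstr++;
    }
}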
toupper('a') works as expected, but toupper('\xC3') is a nonsensical attempt to uppercase half of a character.
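One possible way to keep a byte-wise loop without corrupting UTF-8 is to touch only ASCII bytes and skip everything else. The sketch below is a swapped-in variant, not the code from the answer above, and the name make_ascii_upper_case is made up. It assumes you are happy to leave non-ASCII letters such as é unchanged; proper Unicode case mapping needs a dedicated library. It also casts to unsigned char first, because passing a negative char value to toupper is undefined behaviour on platforms where char is signed.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical variant: uppercases ASCII letters only and leaves every
   byte of a multi-byte UTF-8 sequence untouched. */
void make_ascii_upper_case(char *pstr)
{
    assert(pstr != NULL);

    for (; *pstr != '\0'; pstr++)
    {
        unsigned char byte = (unsigned char)*pstr;

        /* Bytes >= 0x80 belong to multi-byte sequences: skip them. */
        if (byte < 0x80)
        {
            *pstr = (char)toupper(byte);
        }
    }
}
```

Called on the UTF-8 bytes of "café", this changes the c, a and f and leaves the two bytes 0xC3 0xA9 of the é exactly as they were.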
UTF-8 encodes one character in one to four bytes.
Also, do not edit your system header files. (And no, modifying CHAR_BIT will not work, and neither will recompiling the kernel, gcc, or whatnot.)
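To see the difference between bytes and characters with a plain 8-bit char, you can count code points by skipping continuation bytes, which all have the bit pattern 10xxxxxx. This is only a sketch: utf8_count_code_points is a hypothetical name, and it assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated string that
   is assumed to be valid UTF-8. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++)
    {
        /* Continuation bytes look like 10xxxxxx; count everything else. */
        if (((unsigned char)*s & 0xC0) != 0x80)
        {
            count++;
        }
    }
    return count;
}
```

For the UTF-8 bytes of "café", strlen reports 5 bytes while this function reports 4 code points, all without touching CHAR_BIT.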
I'm pretty sure that CHAR_BIT is the number of bits in the 'char' variable type, not the maximum number of bits in any character. As you noticed it's a constant in limits.h, which doesn't change based on the locale settings.
CHAR_BIT will equal 8 on any reasonably new / sane system... non-8-bit bytes are rare these days :)