gcc, UTF-8 and limits.h

2022-12-27 02:41 Q&A author:

My OS is Debian, my default locale is UTF-8 and my compiler is gcc. By default, CHAR_BIT in limits.h is 8, which is OK for ASCII because in ASCII 1 char = 8 bits. But since I am using UTF-8, chars can be up to 32 bits, which contradicts the CHAR_BIT default value of 8.

If I modify CHAR_BIT to 32 in limits.h to better suit UTF-8, what do I have to do for this new value to take effect? I guess I have to recompile gcc? Do I have to recompile the Linux kernel? What about the default installed Debian packages, will they still work?

CHAR_BIT is the number of bits in a char; never, ever change this. It is not going to have the effect you want.

Instead, work with strings of UTF-8 encoded chars, or use strings of wchar_t if you want to store Unicode characters directly.*

* Small print: The size of wchar_t is system-dependent as well. On Windows with MSVC, it's only 16 bits, which is only sufficient for the Basic Multilingual Plane. You can use it with UTF-16, though, which plays nice with the Windows API. On most other systems, wchar_t gives you the full 32 bits.
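If you want to see what these sizes are on your own system, a minimal sketch follows. The helper names bits_in_char and bits_in_wchar_t are made up for illustration; the values they return depend on your compiler and platform.

```c
#include <limits.h>
#include <stddef.h>

/* Number of bits in one char (one "byte" in C's terminology). */
int bits_in_char(void)
{
    return CHAR_BIT;
}

/* Number of bits of storage in one wchar_t.
   sizeof counts in chars, so multiply by CHAR_BIT. */
int bits_in_wchar_t(void)
{
    return (int)(sizeof(wchar_t) * CHAR_BIT);
}
```

Printing both values with printf("%d %d\n", bits_in_char(), bits_in_wchar_t()) should show the difference between the byte type and the wide character type without touching any header.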

You do not need char to be 32 bits to use UTF-8. UTF-8 is a variable-length encoding designed around 8-bit code units, and it is backward compatible with ASCII.
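As a rough illustration of the variable-length point, here is a sketch that counts code points in a UTF-8 string stored in ordinary 8-bit chars. The function name utf8_code_points is hypothetical, and the code assumes the input is already valid UTF-8 (it does no validation).

```c
#include <stddef.h>

/* Count code points in a NUL-terminated UTF-8 string.
   Every code point has exactly one lead byte; continuation bytes
   have the bit pattern 10xxxxxx and are skipped.
   Assumption: the input is valid UTF-8. */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the two-byte string "\xC3\xA9" (the letter é), strlen reports 2 chars while this function reports 1 code point.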

You may also use wchar_t, which is 32 bits on Linux, but generally it would not give you much added value, because Unicode processing is much more complicated than just managing code points.

C and C++ define char as a byte, i.e., the integer type for which sizeof returns 1. It doesn't have to be 8 bits, but the overwhelming majority of the time, it is. IMHO, it should have been named byte. But back in 1972 when C was created, Westerners didn't have to deal with multi-byte character encodings, so you could get away with conflating the "character" and "byte" types.

You just have to live with the confusing terminology. Or typedef it away. But don't edit your system header files. If you want a character type instead of a byte type, use wchar_t.

But a UTF-8 string is made of 8-bit code units, so char will work just fine. You just have to remember the distinction between char and character. For example, don't do this:

void make_upper_case(char* pstr)
{
   while (*pstr != '\0')
   {
      *pstr = toupper(*pstr);
      pstr++;
   }
}

toupper('a') works as expected, but toupper('\xC3') is a nonsensical attempt to uppercase half of a character.
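One conservative alternative, sketched here under the assumption that uppercasing only the ASCII letters is acceptable, is to leave every byte of a multi-byte sequence alone. The name make_upper_case_ascii is made up for this example. It also casts to unsigned char first, because passing a negative char value to toupper is undefined behavior.

```c
#include <ctype.h>

/* Uppercase only the ASCII letters of a UTF-8 string, in place.
   Bytes >= 0x80 are part of multi-byte sequences and are left
   untouched, so non-ASCII letters such as "é" are NOT uppercased. */
void make_upper_case_ascii(char *pstr)
{
    for (; *pstr != '\0'; pstr++) {
        unsigned char byte = (unsigned char)*pstr;
        if (byte < 0x80)
            *pstr = (char)toupper(byte);
    }
}
```

This never corrupts the encoding, but it is not real Unicode case conversion; for that you would need wide-character functions or a Unicode library.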

UTF-8 encodes each character in one to four bytes, so a single character may span several chars.
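To make the byte layout concrete, here is a sketch of an encoder for a single code point. The function name utf8_encode and its return convention (0 for an invalid code point) are choices made for this example, not a standard API.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is
   a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                     /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx followed by 3 x 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Every output byte fits in 8 bits, which is why an 8-bit char is all UTF-8 needs.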

Also, do not edit your system header files. (And no, modifying CHAR_BIT will not work, even if you recompile the kernel, gcc, or anything else.)

I'm pretty sure that CHAR_BIT is the number of bits in the 'char' variable type, not the maximum number of bits in any character. As you noticed it's a constant in limits.h, which doesn't change based on the locale settings.

CHAR_BIT will equal 8 on any reasonably new / sane system... non-8-bit bytes are rare these days :)
