
Handling multibyte (non-ASCII) characters in C

I am trying to write my own version of wc (the Unix filter), but I have a problem with non-ASCII characters. I did a hex dump of a text file and found out that these characters occupy more than one byte, so they won't fit into a char. Is there any way I can read these characters from a file and handle them like single characters (in order to count characters in a file) in C? I've been googling a little bit and found the wchar_t type, but there weren't any simple examples of how to use it with files.


I've been googling a little bit and found the wchar_t type, but there weren't any simple examples of how to use it with files.

Well met. There weren't any simple examples because, unfortunately, proper character set support isn't simple.

Aside: In an ideal world, everybody would use UTF-8 (a Unicode encoding that is memory-efficient, robust, and backward-compatible with ASCII), the standard C library would include UTF-8 encoding-decoding support, and the answer to this question (and dealing with text in general) would be simple and straightforward.

The answer to the question "What is the best unicode library for C?" is to use the ICU library. You may want to look at ustdio.h, as it has a u_fgetc function, and adding Unicode support to your program will probably take little more than typing u_ a few times.
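For instance, here is a minimal sketch of a character-counting loop built on ustdio.h. It assumes ICU is installed and linked (e.g. with -licuio -licuuc), and it uses u_fgetcx rather than u_fgetc so that each iteration reads a whole code point instead of a single UTF-16 code unit:

#include <stdio.h>
#include <unicode/ustdio.h>

int main(int argc, char **argv)
{
    if (argc < 2)
    {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    /* NULL locale and codepage mean "use the defaults" */
    UFILE *f = u_fopen(argv[1], "r", NULL, NULL);
    if (f == NULL)
    {
        fprintf(stderr, "cannot open %s\n", argv[1]);
        return 1;
    }

    long chars = 0;
    UChar32 c;

    /* u_fgetcx reads one code point at a time, returning U_EOF at end */
    while ((c = u_fgetcx(f)) != U_EOF)
        chars++;

    u_fclose(f);
    printf("%ld\n", chars);
    return 0;
}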

Also, if you can spare a few minutes for some light reading, you may want to read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!) from Joel On Software.

I, personally, have never used ICU, but I probably will from now on :-)


If you want to write a standard C version of the wc utility that respects the current language setting when it is run, then you can indeed use the wchar_t versions of the stdio functions. At program startup, you should call setlocale():

setlocale(LC_CTYPE, "");   /* declared in <locale.h> */

This will cause the wide character functions to use the appropriate character set defined by the environment - e.g. on Unix-like systems, the LANG environment variable. For example, this means that if your LANG variable is set to a UTF-8 locale, the wide character functions will handle input and output in UTF-8. (This is how the POSIX wc utility is specified to work.)

You can then use the wide-character versions of all the standard functions. For example, if you have code like this:

/* Classic byte-oriented word count: getchar() from <stdio.h>,
   isspace() from <ctype.h> */
long words = 0;
int in_word = 0;
int c;

while ((c = getchar()) != EOF)
{
    if (isspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}

...you would convert it to the wide character version by changing c to a wint_t, getchar() to getwchar(), EOF to WEOF and isspace() to iswspace():

/* Wide-character version: getwchar() and wint_t from <wchar.h>,
   iswspace() from <wctype.h> */
long words = 0;
int in_word = 0;
wint_t c;

while ((c = getwchar()) != WEOF)
{
    if (iswspace(c))
    {
        if (in_word)
        {
            in_word = 0;
            words++;
        }
    }
    else
    {
        in_word = 1;
    }
}
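Putting it together, a minimal sketch of a complete locale-aware word counter over standard input might look like this (counts follow the same line/word/character logic as above; error handling is omitted for brevity):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

int main(void)
{
    long lines = 0, words = 0, chars = 0;
    int in_word = 0;
    wint_t c;

    /* Pick up the character encoding of the current locale (e.g. $LANG) */
    setlocale(LC_CTYPE, "");

    while ((c = getwchar()) != WEOF)
    {
        chars++;
        if (c == L'\n')
            lines++;
        if (iswspace(c))
            in_word = 0;
        else if (!in_word)
        {
            in_word = 1;
            words++;
        }
    }

    printf("%ld %ld %ld\n", lines, words, chars);
    return 0;
}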


Go have a look at ICU. That library is what you need to deal with all the issues.


Most of the answers so far have merit, but which you use depends on the semantics you want:

  • If you want to process text in the configured locale's encoding, and can accept complete failure when an invalid sequence is encountered, using getwchar() is fine.
  • If you want to process text in the configured locale's encoding, but need to detect and recover from invalid sequences, you need to read bytes and use mbrtowc manually (see the sketch after this list).
  • If you always want to process text as UTF-8, you need to read bytes and feed them to your own decoder. If you know in advance that the file will be valid UTF-8, you can just count bytes in the ranges 0x00-0x7F and 0xC2-0xF4 and skip all other bytes, but this could give wrong results in the presence of invalid sequences. A more robust approach is to decode the byte stream to Unicode codepoints and count the number of successful decodes.
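For the second case, here is a minimal sketch of decoding with mbrtowc and recovering from invalid sequences. The function name and buffer-based interface are illustrative, and setlocale(LC_CTYPE, "") must have been called beforehand:

#include <string.h>
#include <wchar.h>

/* Count the characters in a byte buffer, skipping invalid sequences.
   (Hypothetical helper; a real wc would refill the buffer on -2.) */
long count_mb_chars(const char *buf, size_t len)
{
    mbstate_t state;
    long chars = 0;
    size_t i = 0;

    memset(&state, 0, sizeof state);   /* initial conversion state */

    while (i < len)
    {
        wchar_t wc;
        size_t n = mbrtowc(&wc, buf + i, len - i, &state);

        if (n == (size_t)-1)           /* invalid sequence: */
        {
            memset(&state, 0, sizeof state);   /* reset the state, */
            i++;                               /* skip one byte and retry */
        }
        else if (n == (size_t)-2)      /* incomplete sequence at end */
        {
            break;                     /* would refill the buffer here */
        }
        else
        {
            chars++;
            i += (n == 0) ? 1 : n;     /* n == 0 means an embedded L'\0' */
        }
    }
    return chars;
}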

Hope this helps.


Are you sure you really need the number of characters? wc counts the number of bytes.

~$ echo 'דניאל' > hebrew.txt
~$ wc hebrew.txt 
 1  1 11 hebrew.txt

(11 = 5 two-byte characters + 1 byte for '\n')

However, if you really do want to count characters rather than bytes, and can assume that your text files are encoded in UTF-8, then the easiest approach is to count all bytes that are not trail bytes (trail bytes are those in the range 0x80 to 0xBF). For example:
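A quick sketch of that trick, reading from standard input and assuming the input really is valid UTF-8:

#include <stdio.h>

int main(void)
{
    long chars = 0;
    int c;

    /* In UTF-8, trail bytes have the form 10xxxxxx (0x80-0xBF);
       every other byte starts a new character. */
    while ((c = getchar()) != EOF)
        if ((c & 0xC0) != 0x80)
            chars++;

    printf("%ld\n", chars);
    return 0;
}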

If you can't assume UTF-8 but can assume that any non-UTF-8 files are in a single-byte encoding, then perform a UTF-8 validation check on the data. If it passes, return the number of UTF-8 lead bytes. If it fails, return the number of total bytes. A sketch of that fallback is below.
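Here is a sketch of that validate-then-count fallback. The function names are illustrative, and for brevity the validator checks lead/trail byte structure only; a fully strict one would also reject overlong encodings and surrogate code points:

#include <stddef.h>

static int is_valid_utf8(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    while (i < len)
    {
        unsigned char b = buf[i];
        size_t n;                          /* expected sequence length */

        if (b <= 0x7F)                     n = 1;
        else if (b >= 0xC2 && b <= 0xDF)   n = 2;
        else if (b >= 0xE0 && b <= 0xEF)   n = 3;
        else if (b >= 0xF0 && b <= 0xF4)   n = 4;
        else return 0;                     /* invalid lead byte */

        if (i + n > len)
            return 0;                      /* truncated sequence */

        for (size_t j = 1; j < n; j++)
            if ((buf[i + j] & 0xC0) != 0x80)
                return 0;                  /* bad trail byte */

        i += n;
    }
    return 1;
}

size_t count_chars(const unsigned char *buf, size_t len)
{
    if (!is_valid_utf8(buf, len))
        return len;                        /* single-byte fallback */

    size_t chars = 0;
    for (size_t i = 0; i < len; i++)
        if ((buf[i] & 0xC0) != 0x80)       /* count lead/ASCII bytes */
            chars++;
    return chars;
}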

(Note that the above approach is specific to wc. If you're actually doing something with the characters rather than just counting them, you'll need to know the encoding.)
