Extending 'isalnum' to recognize UTF-8 umlaut

I wrote a function that extends isalnum to recognize UTF-8-encoded umlauts.

Is there maybe a more elegant way to solve this issue?

The code is as follows:

bool isalnumlaut(const char character) {
    int cr = (int) (unsigned char) character;
    if (isalnum(character)
            || cr == 195 // UTF-8
            || cr == 132 // Ä
            || cr == 164 // ä
            || cr == 150 // Ö
            || cr == 182 // ö
            || cr == 156 // Ü
            || cr == 188 // ü
            || cr == 159 // ß
    ) {
        return true;
    } else {
        return false;
    }
}

EDIT:

I have now tested my solution several times, and it seems to do the job for my purposes. Any strong objections?


Your code doesn't do what you're claiming.

The UTF-8 representation of Ä is two bytes: 0xC3, 0x84. A lone byte with a value above 0x7F is meaningless in UTF-8, so a function that looks at one char at a time cannot tell which character that byte belongs to.
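
To make the byte layout concrete, here is a minimal sketch of how the two bytes combine into one code point. The helper name decode_two_byte is made up for this illustration, and it only covers two-byte sequences (110xxxxx 10xxxxxx); it is not a full UTF-8 decoder.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only: decode a two-byte UTF-8
// sequence. Returns the code point, or -1 if the two bytes do not form
// a valid two-byte sequence.
std::int32_t decode_two_byte(unsigned char lead, unsigned char trail) {
    if ((lead & 0xE0) != 0xC0) return -1;   // lead byte must look like 110xxxxx
    if ((trail & 0xC0) != 0x80) return -1;  // trail byte must look like 10xxxxxx
    const std::int32_t cp = ((lead & 0x1F) << 6) | (trail & 0x3F);
    if (cp < 0x80) return -1;               // overlong encoding of an ASCII value
    return cp;
}
```

With this, the pair 0xC3, 0x84 decodes to U+00C4 (Ä), while 0x84 on its own is rejected because it is a continuation byte with no lead byte in front of it.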


Some general suggestions:

  • Unicode is large. Consider using a library that has already handled the issues you're seeing, such as ICU.

  • It doesn't often make sense for a function to operate on a single code unit or code point. It makes much more sense to have functions that operate on either ranges of code points or single glyphs.

  • Your concept of alpha-numeric is likely to be underspecified for a character set as large as the Universal Character Set; do you want to treat the characters in the Cyrillic alphabet as alphanumerics? Unicode's concept of what is alphabetic may not match yours - especially if you haven't considered it.


I'm not 100% sure, but the C++ std::isalnum in <locale> almost certainly recognizes additional locale-specific characters: http://www.cplusplus.com/reference/std/locale/isalnum/
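
A rough sketch of what calling that overload could look like. The locale name "de_DE.UTF-8" is an assumption (it may not be installed on a given system, hence the fallback), and german_or_classic is a name invented for this example. Whether umlauts are reported as alphanumeric depends on the platform's locale data. Note also that this overload still classifies one character at a time, so UTF-8 input would have to be decoded to wchar_t first.

```cpp
#include <locale>
#include <stdexcept>

// Hypothetical helper: try a German UTF-8 locale and fall back to the
// classic "C" locale if the system does not provide it.
std::locale german_or_classic() {
    try {
        return std::locale("de_DE.UTF-8");  // assumed name; platform dependent
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Classify an already-decoded wide character using the locale's ctype facet.
bool isalnum_in_locale(wchar_t c) {
    return std::isalnum(c, german_or_classic());
}
```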


It's impossible with the interface you define, since UTF-8 is a multibyte encoding: a single character may require multiple chars to represent it. (I've got code in my library for determining whether a UTF-8 character is a member of a specified set of characters, but the character is specified by a pair of iterators, not a single char.)
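
The library code mentioned above is not shown, so the following is only a guess at what an iterator-based interface might look like. The name is_alnum_or_umlaut and the accepted set are assumptions for illustration, and the sketch handles only one- and two-byte sequences, which is enough for the German umlauts and ß.

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical sketch: test the UTF-8 character that starts at `first`.
// On return, `first` has been advanced past the bytes that were consumed.
// Sequences longer than two bytes and malformed input yield false.
template <typename Iter>
bool is_alnum_or_umlaut(Iter& first, Iter last) {
    if (first == last) return false;
    const unsigned char lead = static_cast<unsigned char>(*first++);
    if (lead < 0x80) return std::isalnum(lead) != 0;  // plain ASCII
    if ((lead & 0xE0) != 0xC0 || first == last) return false;
    const unsigned char trail = static_cast<unsigned char>(*first);
    if ((trail & 0xC0) != 0x80) return false;         // not a continuation byte
    ++first;
    const std::uint32_t cp = ((lead & 0x1Fu) << 6) | (trail & 0x3Fu);
    // Ä ä Ö ö Ü ü ß as code points
    static const std::u32string accepted =
        U"\u00C4\u00E4\u00D6\u00F6\u00DC\u00FC\u00DF";
    return accepted.find(static_cast<char32_t>(cp)) != std::u32string::npos;
}
```

A caller would loop over a string, calling the function until the iterator reaches the end. Because the function consumes a whole character at a time, a byte such as 0x84 is only ever accepted as the second half of Ä, never on its own.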
