Test if char* string contains multibyte characters

I receive a byte stream buffer from a TCP server, and it may contain multibyte sequences that encode Unicode characters. Is there always a BOM I can check for to detect those characters, and if not, how would you detect them?


If you know that the data is UTF-8, then you just have to check the high bit:

  • 0xxxxxxx = single-byte ASCII character
  • 1xxxxxxx = part of multi-byte character

Or, if you need to distinguish lead/trail bytes:

  • 10xxxxxx = 2nd, 3rd, or 4th byte of multi-byte character
  • 110xxxxx = 1st byte of 2-byte character
  • 1110xxxx = 1st byte of 3-byte character
  • 11110xxx = 1st byte of 4-byte character
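
The bit patterns above can be turned into a small lookup on the first byte of a sequence. This is only a sketch: the name `utf8_sequence_length` is hypothetical, and the function looks at the bit pattern alone, so it does not reject lead bytes that match a pattern but are never valid in UTF-8 (such as 0xC0 or 0xF5).

```cpp
#include <cstddef>

// Hypothetical helper: how many bytes the UTF-8 sequence starting with `c` should have.
// Returns 1 for ASCII, 2 to 4 for a lead byte, and 0 if `c` cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char c) {
    if ((c & 0x80) == 0x00) return 1; // 0xxxxxxx: single-byte ASCII character
    if ((c & 0xE0) == 0xC0) return 2; // 110xxxxx: 1st byte of a 2-byte character
    if ((c & 0xF0) == 0xE0) return 3; // 1110xxxx: 1st byte of a 3-byte character
    if ((c & 0xF8) == 0xF0) return 4; // 11110xxx: 1st byte of a 4-byte character
    return 0; // 10xxxxxx (trail byte) or 11111xxx (matches none of the patterns)
}
```

A caller could use the result to skip from one lead byte to the next, treating a return value of 0 as a sign that the data is not well-formed UTF-8 at that position.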


In UTF-8, any byte that has its 8th (highest) bit set is part of a multibyte code point. So checking (0x80 & c) != 0 for each byte is the simplest way to do this.
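
As a minimal sketch of that check applied to a whole buffer, assuming the data is already known to be UTF-8: the function name `contains_multibyte` is hypothetical, and it takes an explicit length because a buffer read from a socket is not necessarily NUL-terminated.

```cpp
#include <cstddef>

// Hypothetical helper: true if any byte in [data, data + len) has its high bit set,
// i.e. the buffer contains at least one byte of a multibyte UTF-8 sequence.
bool contains_multibyte(const char* data, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        // Convert to unsigned char first: plain char may be signed.
        if ((static_cast<unsigned char>(data[i]) & 0x80) != 0) {
            return true;
        }
    }
    return false;
}
```

Note that this only says whether non-ASCII bytes are present; it does not verify that they form valid UTF-8 sequences.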


There are lots of ways to detect multibyte characters, and unfortunately... none of them are reliable.

If this is the response to a web request, check the headers: the Content-Type header often states the page encoding, which can indicate whether multibyte characters are present.

You can also check for a BOM at the start of the data. Normal text is not expected to begin with those bytes, so it can't hurt to see if they're there. However, BOMs are optional and will often be absent (depending on implementation, configuration, etc.).
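
A BOM check could look roughly like the sketch below. The `Bom` enum and `detect_bom` function are hypothetical names; the byte values are the standard encodings of U+FEFF. A result of `Bom::None` says nothing about the encoding, since the BOM is optional.

```cpp
#include <cstddef>

// Hypothetical result type for the sketch.
enum class Bom { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Inspects the first bytes of a buffer for a byte order mark.
Bom detect_bom(const unsigned char* p, std::size_t len) {
    // UTF-32 is tested before UTF-16 because FF FE 00 00 also starts with FF FE.
    if (len >= 4 && p[0] == 0x00 && p[1] == 0x00 && p[2] == 0xFE && p[3] == 0xFF) return Bom::Utf32BE;
    if (len >= 4 && p[0] == 0xFF && p[1] == 0xFE && p[2] == 0x00 && p[3] == 0x00) return Bom::Utf32LE;
    if (len >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return Bom::Utf8;
    if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF) return Bom::Utf16BE;
    if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE) return Bom::Utf16LE;
    return Bom::None;
}
```

The order of the tests is a design choice with a known ambiguity: the bytes FF FE 00 00 could also be a UTF-16 LE BOM followed by a NUL character, and this sketch reports them as UTF-32 LE.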


Let me implement dan04's answer.

Hereafter I use C++14. If you can only use an older version of C++, you have to rewrite binary literals (e.g. 0b10) to integer literals (e.g. 2).

Implementation

// Classifies one byte of UTF-8 text.
// Taking `unsigned char` keeps the value non-negative, so the shifts never see sign bits.
int is_utf8_character(unsigned char c) {
    if ((c >> 7) == 0b1) {
        if ((c >> 6) == 0b10) {
            return 2; // trail byte: 2nd, 3rd or 4th byte of a multi-byte character
        } else {
            return 1; // lead byte: 1st byte of a multi-byte character
        }
    } else {
        return 0; // single-byte (ASCII) character
    }
}

Example

Code

#include <cstddef>
#include <iostream>
#include <string>

using namespace std;

namespace N {

    // Classifies one byte of UTF-8 text.
    // Taking `unsigned char` keeps the value non-negative, so the shifts never see sign bits.
    int is_utf8_character(unsigned char c) {
        if ((c >> 7) == 0b1) {
            if ((c >> 6) == 0b10) {
                return 2; // trail byte: 2nd, 3rd or 4th byte of a multi-byte character
            } else {
                return 1; // lead byte: 1st byte of a multi-byte character
            }
        } else {
            return 0; // single-byte (ASCII) character
        }
    }

    // Counts characters (code points): every byte that is not a trail byte starts one.
    unsigned get_string_length(const string &s) {
        unsigned length = 0;
        for (size_t i = 0; i < s.size(); ++i) {
            if (is_utf8_character(s[i]) != 2) {
                ++length;
            }
        }
        return length;
    }

    // Estimates how many terminal columns the string occupies.
    unsigned get_string_display_width(const string &s) {
        unsigned width = 0;
        for (size_t i = 0; i < s.size(); ++i) {
            const int kind = is_utf8_character(s[i]);
            if (kind == 0) {
                width += 1;
            } else if (kind == 1) {
                width += 2; // Assumption: a multi-byte character is twice as wide as a single-byte one.
            }
        }
        return width;
    }

}

int main() {

    const string s = "こんにちはhello"; // "こんにちは" means "hello" in Japanese.

    // Print the classification of every byte.
    for (size_t i = 0; i < s.size(); ++i) {
        cout << N::is_utf8_character(s[i]) << " ";
    }
    cout << "\n\n";

    cout << "       Length: " << N::get_string_length(s) << "\n";
    cout << "Display Width: " << N::get_string_display_width(s) << "\n";

}

Output

1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 0 0 0 0 0 

       Length: 10
Display Width: 15


BOMs are mostly optional. If the server you're receiving from is serving multibyte characters, it might assume that you know this and save itself the few bytes of the BOM (2 for UTF-16, 3 for UTF-8). Are you asking for a way to tell whether the data you receive is likely to be a multi-byte string?
