Test if char* string contains multibyte characters

2023-02-11 05:10

I receive a byte stream buffer from a TCP server which could contain multibyte sequences forming Unicode characters. I was wondering if there's always a way to check for a BOM to detect those characters, or else how would you do it?

If you know that the data is UTF-8, then you just have to check the high bit:

0xxxxxxx = single-byte ASCII character
1xxxxxxx = part of multi-byte character

Or, if you need to distinguish lead/trail bytes:

10xxxxxx = 2nd, 3rd, or 4th byte of multi-byte character
110xxxxx = 1st byte of 2-byte character
1110xxxx = 1st byte of 3-byte character
11110xxx = 1st byte of 4-byte character
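
The bit patterns above can be turned into a small helper that reports how many bytes a character occupies, given its first byte. This is only a sketch: the name utf8_sequence_length is made up for illustration, and it matches the patterns without validating the bytes that follow or rejecting lead bytes that never occur in well-formed UTF-8 (such as 0xC0 or 0xF5).

```cpp
// Hypothetical helper: expected length in bytes of a UTF-8 sequence,
// judged from its first byte only. Returns 0 if the byte cannot start
// a sequence (a 10xxxxxx trail byte or a 11111xxx byte).
int utf8_sequence_length(unsigned char lead) {
    if ((lead & 0x80u) == 0x00u) return 1; // 0xxxxxxx: single-byte ASCII
    if ((lead & 0xE0u) == 0xC0u) return 2; // 110xxxxx: 1st byte of 2-byte character
    if ((lead & 0xF0u) == 0xE0u) return 3; // 1110xxxx: 1st byte of 3-byte character
    if ((lead & 0xF8u) == 0xF0u) return 4; // 11110xxx: 1st byte of 4-byte character
    return 0;                              // trail byte or invalid lead byte
}
```

A caller walking a buffer could advance by the returned length, treating 0 as "skip one byte and resynchronize".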

In UTF-8, any byte with the high bit (the 8th bit) set is part of a multibyte code point. So checking (0x80 & c) != 0 for each byte is the simplest way to do this.
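
Applied to a whole buffer, the high-bit check might look like the sketch below. The function name contains_multibyte and the explicit length parameter are assumptions made for illustration; a length is passed because a buffer read from a socket is not necessarily NUL-terminated. The result only means something if the data really is UTF-8 (or another ASCII-compatible encoding).

```cpp
#include <cstddef>

// Hypothetical helper: returns true if any byte in buf[0..len) has its
// high bit set, i.e. belongs to a multibyte UTF-8 sequence.
bool contains_multibyte(const char* buf, std::size_t len) {
    for (std::size_t i = 0; i < len; ++i) {
        // Convert to unsigned char first: plain char may be signed.
        if (static_cast<unsigned char>(buf[i]) & 0x80u) {
            return true;
        }
    }
    return false;
}
```

For example, contains_multibyte("hello", 5) is false, while a buffer holding the bytes 'h', 0xC3, 0xA9 gives true.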

There are lots of ways to detect multibyte characters, and unfortunately... none of them are reliable.

If this is a web request being returned, check the headers: the Content-Type header will often indicate the page encoding (which can be indicative of multibyte character presence).
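
As a rough illustration of the header check, the sketch below pulls the charset parameter out of a Content-Type header value. The function name charset_from_content_type is hypothetical, and the parsing is deliberately simplistic: it assumes the value has already been separated from the header name and does not handle every form that HTTP allows.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the lowercased charset parameter of a
// Content-Type header value, or an empty string if none is present.
std::string charset_from_content_type(const std::string& value) {
    std::string lower(value);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });

    const std::string key = "charset=";
    std::size_t pos = lower.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();

    // The parameter ends at the next separator or at the end of the value.
    std::size_t end = lower.find_first_of("; \t", pos);
    std::string cs = lower.substr(pos, end == std::string::npos ? std::string::npos : end - pos);

    // Strip optional surrounding double quotes.
    if (cs.size() >= 2 && cs.front() == '"' && cs.back() == '"') {
        cs = cs.substr(1, cs.size() - 2);
    }
    return cs;
}
```

A result such as "utf-8" or "shift_jis" tells you multibyte sequences are possible; an empty result tells you nothing either way.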

You can also check for a BOM. A BOM should not appear in the middle of normal text, so it can't hurt to see whether the stream starts with one. However, BOMs are optional and are often absent (depending on implementation, configuration, etc.).
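
A BOM check at the start of the buffer could be sketched as follows. The enum Bom and the function detect_bom are names invented for this example. It only recognizes the UTF-8 and UTF-16 marks, so a UTF-32 little-endian BOM (FF FE 00 00) would be reported as UTF-16LE, and the absence of a BOM says nothing about the encoding.

```cpp
#include <cstddef>

// Hypothetical result type for the sketch below.
enum class Bom { None, Utf8, Utf16BE, Utf16LE };

// Hypothetical helper: inspects the first bytes of a buffer for a
// byte order mark. UTF-32 BOMs are deliberately not handled here.
Bom detect_bom(const unsigned char* buf, std::size_t len) {
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        return Bom::Utf8;    // EF BB BF
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        return Bom::Utf16BE; // FE FF
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        return Bom::Utf16LE; // FF FE
    }
    return Bom::None;
}
```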

Let me implement dan04's answer.

Hereafter I use C++14. If you can only use an older version of C++, you have to rewrite binary literals (e.g. 0b10) to integer literals (e.g. 2).

Implementation

int is_utf8_character(unsigned char c) { //takes `unsigned char` so the value is non-negative before shifting
    if ((c >> 7) == 0b1) {
        if ((c >> 6) == 0b10) {
            return 2; //2nd, 3rd or 4th byte of a multi-byte utf-8 character
        } else {
            return 1; //1st byte of a multi-byte utf-8 character
        }
    } else {
        return 0; //a single-byte (ASCII) character
    }
}

Example

Code

#include <iostream>
#include <string>

using namespace std;

namespace N {

    int is_utf8_character(unsigned char c) { //takes `unsigned char` so the value is non-negative before shifting
        if ((c >> 7) == 0b1) {
            if ((c >> 6) == 0b10) {
                return 2; //2nd, 3rd or 4th byte of a multi-byte utf-8 character
            } else {
                return 1; //1st byte of a multi-byte utf-8 character
            }
        } else {
            return 0; //a single-byte (ASCII) character
        }
    }

    unsigned get_string_length(const string &s) { //counts characters, not bytes
        unsigned width = 0;
        for (size_t i = 0; i < s.size(); ++i) {
            if (is_utf8_character(s[i]) != 2) { //skip trail bytes
                ++width;
            }
        }
        return width;
    }

    unsigned get_string_display_width(const string &s) {
        unsigned width = 0;
        for (size_t i = 0; i < s.size(); ++i) {
            if (is_utf8_character(s[i]) == 0) {
                width += 1;
            } else if (is_utf8_character(s[i]) == 1) {
                width += 2; //We assume a multi-byte character is twice as wide as a single-byte character.
            }
        }
        return width;
    }

}

int main() {

    const string s = "こんにちはhello"; //"こんにちは" means "hello" in Japanese.

    for (size_t i = 0; i < s.size(); ++i) {
        cout << N::is_utf8_character(s[i]) << " ";
    }
    cout << "\n\n";

    cout << "       Length: " << N::get_string_length(s) << "\n";
    cout << "Display Width: " << N::get_string_display_width(s) << "\n";

}

Output

1 2 2 1 2 2 1 2 2 1 2 2 1 2 2 0 0 0 0 0 

       Length: 10
Display Width: 15

BOMs are mostly optional. If the server that you're receiving from is serving multibyte characters, it might assume that you know this, and save itself the bytes for the BOM (2 bytes in UTF-16, 3 bytes in UTF-8). Are you asking for a way to tell whether data that you receive is likely to be a multi-byte string?
