How to iterate over unicode characters in C++?

2023-04-09 06:06

I know that to get a unicode character in C++ I can do:

std::wstring str = L"\u4FF0";

However, what if I want to get all the characters in the range 4FF0 to 5FF0? Is it possible to dynamically build a Unicode character? What I have in mind is something like this pseudo-code:

for (int i = 20464; i <= 24560; i++) { // From 4FF0 to 5FF0
    std::wstring str = L"\u" + hexa(i); // build the unicode character
    // do something with str
}

How would I do that in C++?

The wchar_t type held within a wstring is an integer type, so you can use it directly:

for (wchar_t c = 0x4ff0;  c <= 0x5ff0;  ++c) {
    std::wstring str(1, c);
    // do something with str
}
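As a minimal sketch of the same idea wrapped in a function: the helper name `wide_chars_in_range` is hypothetical, and the sketch assumes both bounds are at most 0xFFFF so that every value fits in a single wchar_t. It uses a wider loop counter than wchar_t so the loop cannot wrap around, and it skips the surrogate range, whose values are not characters on their own.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: one single-character wstring per value in [first, last].
// Assumes both bounds are <= 0xFFFF, so each value fits in one wchar_t.
std::vector<std::wstring> wide_chars_in_range(unsigned long first, unsigned long last) {
    std::vector<std::wstring> out;
    for (unsigned long cp = first; cp <= last; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            continue; // surrogate code units are not characters on their own
        out.push_back(std::wstring(1, static_cast<wchar_t>(cp)));
    }
    return out;
}
```

Calling `wide_chars_in_range(0x4FF0, 0x5FF0)` would give one string per code point in the range from the question.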

Be careful trying to do this with characters above 0xffff: on platforms where wchar_t is only 16 bits wide (e.g. Windows), they will not fit into a single wchar_t.

If, for example, you wanted to put the Emoticons block (U+1F600 to U+1F64F) into a string, you can create surrogate pairs where wchar_t is too narrow:

std::wstring str;
for (int c = 0x1f600; c <= 0x1f64f; ++c) {
    if (c <= 0xffff || sizeof(wchar_t) > 2)
        str.append(1, (wchar_t)c);
    else {
        str.append(1, (wchar_t)(0xd800 | ((c - 0x10000) >> 10)));
        str.append(1, (wchar_t)(0xdc00 | ((c - 0x10000) & 0x3ff)));
    }
}
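The surrogate-pair arithmetic can also be sketched independently of the width of wchar_t by targeting `std::u16string`, whose code unit is 16 bits wide. The helper name `append_utf16` is hypothetical, and the sketch assumes the input is a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper: append one code point to a UTF-16 string.
// Assumes cp is a valid code point (<= 0x10FFFF, not a surrogate).
void append_utf16(std::u16string& out, char32_t cp) {
    if (cp <= 0xFFFF) {
        // Fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Split the 20-bit offset into a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 | (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 | (v & 0x3FF)));
    }
}
```

For U+1F600, this appends the two code units 0xD83D and 0xDE00.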

You cannot step through Unicode characters as if the string were a simple array. A single code point can take several 'char's in UTF-8 or two 'WCHAR's in UTF-16, and what the user sees as one character can itself be built up out of several code points (a base letter plus combining diacritics, for example). If you're really serious about this stuff, you should use an API like Uniscribe or ICU.
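To make the difference between bytes, code points, and visible characters concrete, here is a small sketch that counts code points in a UTF-8 string by skipping continuation bytes. The function name `count_code_points` is hypothetical, and the sketch assumes the input is well-formed UTF-8. It does not group combining sequences, which is the part that needs a library such as ICU.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: number of code points in a UTF-8 string.
// Assumes well-formed UTF-8; continuation bytes look like 10xxxxxx.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char b : utf8) {
        if ((b & 0xC0) != 0x80)
            ++n; // count only the first byte of each encoded code point
    }
    return n;
}
```

For the string "e\xCC\x81" (the letter e followed by U+0301, a combining acute accent), this gives 2 code points in 3 bytes, although the text is displayed as one accented letter.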

Some resources to read:

http://en.wikipedia.org/wiki/UTF-16/UCS-2

http://en.wikipedia.org/wiki/Precomposed_character

http://en.wikipedia.org/wiki/Combining_character

http://scripts.sil.org/cms/scripts/page.php?item_id=UnicodeNames#4d2aa980

http://en.wikipedia.org/wiki/Unicode_equivalence

http://msdn.microsoft.com/en-us/library/dd374126.aspx

What about:

for (std::wstring::value_type i(0x4ff0); i <= 0x5ff0; ++i)
{
    std::wstring str(1, i);
}

Note that the code has not been tested, so it may not compile as-is.

Also, depending on the platform you are working on, a wstring's character unit may be 2, 4, or N bytes wide, so be intentional about how you use it.
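One way to avoid depending on the width of wchar_t is to use `char32_t` and `std::u32string` (C++11) instead, which is a different type from the one used in the answers above. The sketch below assumes the bounds are valid code points (at most 0x10FFFF), and the helper name `code_points_in_range` is hypothetical.

```cpp
#include <string>

// Hypothetical helper: all code points in [first, last] as a UTF-32 string.
// Assumes last <= 0x10FFFF. Surrogate values are skipped because they are
// not characters on their own.
std::u32string code_points_in_range(char32_t first, char32_t last) {
    std::u32string out;
    for (char32_t cp = first; cp <= last; ++cp) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            continue;
        out.push_back(cp);
    }
    return out;
}
```

With this, `code_points_in_range(0x1F600, 0x1F64F)` holds one element per code point regardless of platform. Converting the result to UTF-8 or UTF-16 for output is a separate step.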
