Find length in characters of a std::wstring

2023-03-30 02:07 Asked by:

I am working with std::wstring variables (C++) and I am trying to determine the length (in characters) of the string.

The functions .length() and .size() give results that aren't the length in characters (I think they tell me how many wide chars there are?).

So is there a way to determine the length in characters of a wstring?

What do you mean by "characters"?

std::basic_string is just a container for a series of values that we think of as a string. It doesn't care what encoding the values are in; all it does is store and manage an ordered sequence of values. Therefore its size and length functions state how many values it stores.
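As a small illustration of that point, the sketch below stores the same code point (U+1F600, chosen arbitrarily) in a std::u16string and a std::u32string and reports what size() sees. The helper names units_utf16 and units_utf32 are made up for this example. std::wstring behaves like one or the other depending on whether the platform's wchar_t is 16 or 32 bits wide.

```cpp
#include <cstddef>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane.
// Encoded as UTF-16 it needs a surrogate pair (two 16-bit code units);
// encoded as UTF-32 it is a single 32-bit code unit.
// size() reports the number of stored elements in both cases.
std::size_t units_utf16() { return std::u16string(u"\U0001F600").size(); }
std::size_t units_utf32() { return std::u32string(U"\U0001F600").size(); }
```

So the same text can report a size of 2 or 1 depending only on the element type, which is why size() on its own says nothing about "characters".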

If your std::wstring contains a string that represents, say, a valid UTF-16-encoded string, std::wstring does not care. Unicode encodings are just ways to encode codepoints. UTF-16 uses 16-bit code units to encode its codepoints, which can include surrogate pairs of 16-bit values that correspond to a single Unicode codepoint.

However, a Unicode codepoint is not a "character" by some definitions of that term. For example, there are combining codepoints, where multiple codepoints are combined to form a grapheme. There are non-visible codepoints (control codes and such).
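For instance, the two strings in this sketch would normally be displayed as the same single grapheme "é", yet they contain a different number of codepoints. The variable names are only for illustration.

```cpp
#include <string>

// One codepoint: U+00E9 LATIN SMALL LETTER E WITH ACUTE.
const std::u32string precomposed = U"\u00E9";

// Two codepoints: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
const std::u32string decomposed = U"e\u0301";
```

Here precomposed.size() is 1 and decomposed.size() is 2, and the two strings do not compare equal unless they are normalized first.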

If you want to know how many codepoints are in a std::wstring, then you will have to walk that string with a function that can process UTF-16 data. If you want to know how many graphemes (logical glyphs) are in the string, then you will need to walk it with a much more complex algorithm.
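A minimal sketch of the codepoint walk, assuming the std::wstring holds well-formed UTF-16 (16-bit wchar_t, as on Windows) or UTF-32 (32-bit wchar_t, as on most Unix-like systems). The function name count_code_points is hypothetical. It does not validate the input and it does not count graphemes.

```cpp
#include <cstddef>
#include <string>

// Counts codepoints by skipping low (trailing) surrogates, 0xDC00..0xDFFF.
// In well-formed UTF-16 every surrogate pair then contributes exactly one
// to the count; in UTF-32 no surrogates occur, so every element counts.
std::size_t count_code_points(const std::wstring& s) {
    std::size_t n = 0;
    for (wchar_t c : s) {
        const unsigned long u = static_cast<unsigned long>(c);
        if (u < 0xDC00 || u > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

For a string made of 'a', the surrogate pair 0xD83D 0xDE00, and 'b', size() returns 4 while count_code_points returns 3.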

To do this you must use the Unicode database. You should use ICU or some other Unicode library. Boost.Locale, which is now part of Boost, wraps some of the functionality of ICU in a nice way.

However, I doubt you actually need to do this. See definitions of grapheme, character, codepoint, codeunit. Probably what you mean is codepoints, but almost surely it's not very useful.

Depending on where your string comes from, you may not have any control over what it means, i.e. how it is encoded. To turn your string into something with definite semantics, you may have to perform the following steps:

1. Read a byte string from the environment via argv or getenv. This is a byte string with platform- and locale-dependent encoding.
2. Turn the byte string into an internal, fixed-width (there are caveats) wide string by means of mbstowcs(). You still don't know the encoding of the result! All you know is that each wide character is big enough to hold any of the "platform's character values", whatever that means. (In Windows, it means something broken).
3. Obtain a sequence of Unicode code points (i.e. definitive data that you can manipulate codepoint-wise) by using ICU or iconv() to translate WCHAR into UCS-4/UTF-32. Now you know what you're dealing with!
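The mbstowcs() step above could look roughly like the sketch below. The helper name widen is made up for this example, and the result is still in the platform's wide encoding, so the final conversion to UCS-4/UTF-32 with ICU or iconv() is not shown.

```cpp
#include <cstdlib>
#include <cstring>
#include <stdexcept>
#include <string>

// Converts a locale-encoded byte string to a wide string with mbstowcs().
// The conversion uses the current C locale (LC_CTYPE).
std::wstring widen(const char* bytes) {
    // Every wide character consumes at least one input byte, so
    // strlen(bytes) elements are always enough room for the result.
    std::wstring out(std::strlen(bytes), L'\0');
    if (out.empty()) {
        return out;
    }
    const std::size_t n = std::mbstowcs(&out[0], bytes, out.size());
    if (n == static_cast<std::size_t>(-1)) {
        throw std::runtime_error("invalid multibyte sequence for the current locale");
    }
    out.resize(n);  // drop the unused tail
    return out;
}
```

A program would typically call std::setlocale(LC_ALL, "") (from <clocale>) once at startup so that mbstowcs() interprets the bytes in the environment's encoding rather than in the default "C" locale.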

If you are reading data from a file with a documented encoding, or from the network, you would instead convert from the documented file encoding to UCS-4.

Once you have obtained a sequence of code points, the low-level language support for text processing ends. A sequence of code points is the best you can get at the binary level to represent a text. Any higher-level textual manipulation and processing is complicated and subtle and depends deeply on a proper definition of "text", so this is best left to a dedicated Unicode library (such as ICU). At the programming language level, "characters" are code points, but in any serious application that's probably not what you want and you want to know about graphemes and normalization and a hundred other little things.

Are you looking for wcslen?

#include <wchar.h>
size_t wcslen(const wchar_t *s);
