Find length in characters of a std::wstring
I am working with std::wstring variables (C++) and I am trying to determine the length (in characters) of the string.
The functions .length() and .size() give results that aren't the length in characters (I think they tell me how many wide chars there are?).
So is there a way to determine the length in characters of a wstring?
What do you mean by "characters"?
std::basic_string is just a container for a series of values that we think of as a string. It doesn't care what encoding the values are in; all it does is store and manage an ordered sequence of values. Therefore its size and length functions report how many values it stores.
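As a rough illustration of that point, the sketch below stores the same two code points in a UTF-16 string and in a UTF-32 string and asks each how many values it holds. The names stored_units, sample_utf16 and sample_utf32 are made up for this example; std::u16string and std::u32string are used instead of std::wstring only because the width of wchar_t differs between platforms.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns how many elements (code units) the string stores.
// This is exactly what size() and length() report, whatever the encoding.
template <typename CharT>
std::size_t stored_units(const std::basic_string<CharT>& s) {
    return s.size();
}

// "a" followed by U+1F600, which UTF-16 encodes as a surrogate pair.
inline std::u16string sample_utf16() { return u"a\U0001F600"; }

// The same two code points in UTF-32, where each fits in one code unit.
inline std::u32string sample_utf32() { return U"a\U0001F600"; }
```

Here stored_units(sample_utf16()) is 3 while stored_units(sample_utf32()) is 2, even though both strings represent the same text.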
If your std::wstring contains a string that represents, say, a valid UTF-16-encoded string, std::wstring does not care. Unicode encodings are just ways to encode codepoints. UTF-16 uses 16-bit code units to encode its codepoints, which can include surrogate pairs of 16-bit values that correspond to a single Unicode codepoint.
However, a Unicode codepoint is not a "character" by some definitions of that term. For example, there are combining codepoints, where multiple codepoints are combined to form a grapheme. There are non-visible codepoints (control codes and such).
If you want to know how many codepoints are in a std::wstring, then you will have to walk that string with a function that can process UTF-16 data. If you want to know how many graphemes (logical glyphs) are in the string, then you will need to walk it with a much more complex algorithm.
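A minimal sketch of such a walk for the codepoint case, assuming the string holds well-formed UTF-16 (no unpaired surrogates). The name count_utf16_code_points is hypothetical. It relies on the fact that every codepoint contributes exactly one code unit that is not a low surrogate, so counting those units counts codepoints. It does not count graphemes; that still needs a Unicode library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode codepoints in a string that is
// ASSUMED to contain well-formed UTF-16. Input is not validated.
template <typename CharT>
std::size_t count_utf16_code_points(const std::basic_string<CharT>& s) {
    std::size_t count = 0;
    for (CharT c : s) {
        const char32_t u = static_cast<char32_t>(c);
        // Low (trailing) surrogates occupy 0xDC00..0xDFFF. Each one is the
        // second half of a pair whose first half was already counted.
        const bool is_low_surrogate = (u >= 0xDC00 && u <= 0xDFFF);
        if (!is_low_surrogate) {
            ++count;
        }
    }
    return count;
}
```

For u"a\U0001F600" this yields 2, although size() is 3. For u"e\u0301" (the letter e followed by a combining acute accent) it also yields 2, even though a reader sees a single accented letter. Where wchar_t is 32 bits wide and the std::wstring holds UTF-32, no surrogates occur and the result simply equals size().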
To do this you must use the Unicode database. You should use ICU (how to do it in ICU) or some other Unicode library. Boost.Locale has already been accepted into Boost and will be available soon; it wraps some of the functionality of ICU in a nice way.
However, I doubt you actually need to do this. See the definitions of grapheme, character, code point, and code unit. What you probably mean is code points, but counting them is almost surely not very useful.
Depending on where your string comes from, you may not have any control over what it means, i.e. how it is encoded. To turn your string into something with definite semantics, you may have to perform the following steps:
1. Read a byte string from the environment via argv or getenv. This is a byte string with a platform- and locale-dependent encoding.
2. Turn the byte string into an internal, fixed-width (there are caveats) wide string by means of mbstowcs(). You still don't know the encoding of the result! All you know is that each wide character is big enough to hold any of the "platform's character values", whatever that means. (On Windows, it means something broken.)
3. Obtain a sequence of Unicode code points (i.e. definitive data that you can manipulate codepoint-wise) by using ICU or iconv() to translate WCHAR into UCS-4/UTF-32. Now you know what you're dealing with!
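The mbstowcs() step could look roughly like the sketch below. The helper name to_wide and its bool-plus-out-parameter interface are choices made for this example only. It assumes the caller has already selected the environment's locale, typically with std::setlocale(LC_ALL, "") at program start; without that call the default "C" locale applies. The final translation to UCS-4/UTF-32 with ICU or iconv() is not shown.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: converts a byte string in the CURRENT locale's
// encoding into a wide string. Returns false if the bytes are not a
// valid multibyte sequence in that locale.
bool to_wide(const char* bytes, std::wstring& out) {
    // A multibyte string never produces more wide characters than it has
    // bytes, so strlen(bytes) + 1 elements are always enough.
    std::vector<wchar_t> buf(std::strlen(bytes) + 1);
    const std::size_t n = std::mbstowcs(buf.data(), bytes, buf.size());
    if (n == static_cast<std::size_t>(-1)) {
        return false;  // invalid sequence for the current locale
    }
    out.assign(buf.data(), n);
    return true;
}
```

As the steps above say, the encoding of the result is still platform-defined, so out.size() is a count of wide characters and not necessarily of code points.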
If you are reading data from a file with a documented encoding, or from the network, you would instead convert from the documented file encoding to UCS-4.
Once you have obtained a sequence of code points, the low-level language support for text processing ends. A sequence of code points is the best you can get at the binary level to represent a text. Any higher-level textual manipulation and processing is complicated and subtle and depends deeply on a proper definition of "text", so this is best left to a dedicated Unicode library (such as ICU). At the programming language level, "characters" are code points, but in any serious application that's probably not what you want and you want to know about graphemes and normalization and a hundred other little things.
Are you looking for wcslen?
#include <wchar.h>
size_t wcslen(const wchar_t *s);
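For comparison with .size(), here is a small sketch of what wcslen reports for a std::wstring. The wrapper name c_length is made up for this example. wcslen counts wchar_t elements up to the first null terminator, so for a string without embedded nulls it gives the same number as .size(); it does not count code points or graphemes.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Hypothetical wrapper: length of the string as seen by wcslen, i.e. the
// number of wchar_t elements before the first L'\0'.
std::size_t c_length(const std::wstring& s) {
    return std::wcslen(s.c_str());
}
```

For L"abc" both c_length and .size() give 3. For a std::wstring built from the five elements a, b, L'\0', c, d, .size() is 5 while c_length stops at the embedded null and gives 2.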