How to check if the casting to wchar_t "failed"
I have code that does something like this:
char16_t msg[256]={0};
//...
wstring wstr;
for (int i = 0; i < len; ++i)
{
    if ((unsigned short)msg[i] != 167)
        wstr.push_back((wchar_t) msg[i]);
    else
        wstr.append(L"_<?>_");
}
As you can see, it uses some rather ugly hardcoding (I'm not sure it works in general, but it works for my data) to figure out whether the cast to wchar_t "failed" (167 being what I take to be the value of the replacement character). From Wikipedia:
The replacement character � (often a black diamond with a white question mark) is a symbol found in the Unicode standard at codepoint U+FFFD in the Specials table. It is used to indicate problems when a system is not able to decode a stream of data to a correct symbol. It is most commonly seen when a font does not contain a character, but is also seen when the data is invalid and does not match any character:
So I have 2 questions: 1. Is there a proper way to do this nicely? 2. Are there other characters, like the replacement character, that signal a failed conversion?
EDIT: I use gcc on Linux, so wchar_t is 32 bits, and the reason I need this cast to work is that weird wstrings kill my glog library. :) Also wcout dies. :( :)
It doesn't work like that. wchar_t and char16_t are both integer types in C++. Casting from one to the other follows the usual rules for integer conversions; it does not attempt to convert between character sets in any way, or verify that anything is a genuine Unicode code point.
Any replacement characters will have to come from more sophisticated code than a simple cast (or could be from the original input, of course).
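To make that concrete, here is a minimal sketch. The helper name widen_unit is made up for illustration, and it assumes wchar_t is at least 16 bits wide. The cast just carries the numeric value across, even for an unpaired surrogate such as 0xD800, which is not a valid character on its own. U+FFFD never shows up unless it was already in the input.

```cpp
// Hypothetical helper: widens one UTF-16 code unit exactly the way the
// cast in the question does. This is a plain integer conversion; nothing
// is decoded, validated, or replaced.
inline wchar_t widen_unit(char16_t c)
{
    return static_cast<wchar_t>(c);
}
```

So widen_unit(char16_t(0xD800)) yields a wchar_t holding 0xD800, not 0xFFFD.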
Provided that:
- The input in msg is a sequence of code points in the BMP
- wchar_t in your implementation is at least 16 bits and the wide character set used by your implementation is Unicode (or a 16-bit version of Unicode, whether that's BMP-only, or UTF-16).
Then the code you have should work fine. It will not validate the input, though, just copy the values.
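If you do want validation without pulling in a library, something along these lines might do. This is only a sketch, not a full Unicode implementation: the function name utf16_to_wide and the errors out-parameter are my own invention, and it assumes wchar_t is 32 bits and holds UTF-32 (true for gcc on Linux, not for Windows). It combines surrogate pairs, and replaces any unpaired surrogate with U+FFFD while counting it as an error.

```cpp
#include <cstddef>
#include <string>

// Sketch: decode UTF-16 code units into a wide string, assuming a 32-bit
// wchar_t that holds UTF-32. Unpaired surrogates become U+FFFD and are
// counted in *errors (if a pointer is supplied).
inline std::wstring utf16_to_wide(const char16_t* s, std::size_t len,
                                  std::size_t* errors = nullptr)
{
    static_assert(sizeof(wchar_t) >= 4, "this sketch assumes a 32-bit wchar_t");
    std::wstring out;
    std::size_t bad = 0;
    for (std::size_t i = 0; i < len; ++i) {
        const char32_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF && i + 1 < len
            && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // Valid surrogate pair: combine into one code point above the BMP.
            const char32_t lo = s[++i];
            out.push_back(static_cast<wchar_t>(
                0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00)));
        } else if (u >= 0xD800 && u <= 0xDFFF) {
            // Lone high or low surrogate: the input is not valid UTF-16.
            out.push_back(static_cast<wchar_t>(0xFFFD));
            ++bad;
        } else {
            // Ordinary BMP code point: the value carries over unchanged.
            out.push_back(static_cast<wchar_t>(u));
        }
    }
    if (errors)
        *errors = bad;
    return out;
}
```

A nonzero error count then tells you the input was malformed, which is the check the question was trying to get out of the cast.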
If you want to actually handle Unicode strings in C++ (and not merely sequences of 16-bit values), you should use the International Components for Unicode (ICU) library. Quoting the FAQ:
Why ICU4C?
The C and C++ languages and many operating system environments do not provide full support for Unicode and standards-compliant text handling services. Even though some platforms do provide good Unicode text handling services, portable application code can not make use of them. The ICU4C libraries fills in this gap. ICU4C provides an open, flexible, portable foundation for applications to use for their software globalization requirements. ICU4C closely tracks industry standards, including Unicode and CLDR (Common Locale Data Repository).
As a side effect, you get proper error reporting if a conversion fails...
If you don't mind platform-specific code, Windows has the MultiByteToWideChar API.
*Edit: I see you're on Linux; I'll leave my answer here, though, in case Windows people can benefit from it.
A cast cannot fail, nor will it produce any replacement characters. The 167 value in your code does not indicate a failed cast; it means something else that only the code's author knows.
Just for reference, Unicode code point 167 (0x00A7) is a section sign: §. Maybe that will ring some bells about what the code was supposed to do.
And though I don't know what it is, consider rewriting it with:
wchar_t msg[256];
...
wstring wstr(msg, wcslen(msg));
or
char16_t msg[256];
...
u16string u16str(msg, char_traits<char16_t>::length(msg)); // wcslen() only accepts wchar_t*
then do something with those 167 values if you need to.
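For instance, a rough sketch of the char16_t variant that keeps the question's marker substitution could look like this. The function name widen_with_marker is hypothetical, and it assumes msg is NUL-terminated and contains only BMP code points.

```cpp
#include <string>

// Sketch: copy a NUL-terminated char16_t buffer into a std::wstring,
// substituting the question's marker for every U+00A7 (decimal 167).
// Assumes BMP-only input, so each code unit is one character.
inline std::wstring widen_with_marker(const char16_t* msg)
{
    const std::u16string src(msg); // length is found via char_traits<char16_t>
    std::wstring out;
    for (char16_t c : src) {
        if (c == 0x00A7)
            out.append(L"_<?>_");
        else
            out.push_back(static_cast<wchar_t>(c));
    }
    return out;
}
```

Whether 167 deserves special treatment at all is still up to whoever knows what that value means in the data.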