why does mbstowcs return "invalid multibyte character"

2023-03-14 08:02 Q&A author:

"קמ"ד חיר!" is the input string, copy-pasted from a print of the variable in gdb. Calling mbstowcs on it with the destination argument set to NULL returns -1. Any ideas on what's wrong or how to fix this?

"\327\247\327\236"\327\223 \327\227\327\231\327\250!\000\000\000" is the string with the non-ASCII characters in octal.

The program's locale is C.

In the C locale, the mbstowcs function doesn't handle UTF-8, and on Windows there has traditionally been no locale you can set to have it translate UTF-8 to wchar_t. I'll use Windows examples, but the general idea is the same on most operating systems.

In the multi-byte character set world there may not be one meaning for a given octal value and there may not be one octal value for any given character. What a particular octal value means and how a character is represented (or even if it can be represented) is determined by locale.

When mbstowcs returns an error, that is (size_t)-1, it is basically telling you that there is no wide character equivalent to the multibyte character passed in to it. That might mean there is no Unicode character (unlikely but not impossible), or it might mean that the locale does not define a character for a given octal value (or sequence of octal values in the case of multi-byte characters).

If you don't explicitly set your locale (by calling setlocale), a C program starts in the "C" locale; calling setlocale(LC_ALL, "") switches to the locale given by your system configuration. To retrieve your current locale on Windows you can call _get_current_locale (portably, setlocale(LC_ALL, NULL) returns its name). Once you know your locale, you can figure out what character (if any) a particular octal value represents, and then you can figure out what the Unicode equivalent would be (if any).
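
A minimal sketch for checking which locale is actually in effect, assuming only the standard C library. Calling setlocale with a NULL second argument queries the current setting without changing it. The helper name ctype_locale_is is hypothetical, made up for this illustration.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the current LC_CTYPE locale has the
   given name, 0 otherwise. Passing NULL as the second argument makes
   setlocale report the current locale without modifying it. */
int ctype_locale_is(const char *name)
{
    assert(name != NULL);
    const char *current = setlocale(LC_CTYPE, NULL);
    return current != NULL && strcmp(current, name) == 0;
}
```

In a program that has not called setlocale yet, ctype_locale_is("C") should return 1, because the C standard has every program start in the "C" locale.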

One way to identify a problem character is to vary the length passed in to mbstowcs until you find a single character that causes the error. A brute force approach might be to start at length=1 and increase it until mbstowcs returns -1.
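
The brute-force search described above could be sketched roughly as follows. This is only an illustration: the function name first_unconvertible_char is hypothetical, and it assumes the locale you care about has already been selected with setlocale.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Returns the 1-based index of the first character that mbstowcs cannot
   convert in the current locale, or 0 if the whole string converts
   (or if memory cannot be allocated). */
size_t first_unconvertible_char(const char *s)
{
    assert(s != NULL);
    size_t max_chars = strlen(s);   /* at most one wide char per byte */
    wchar_t *buf = malloc((max_chars + 1) * sizeof *buf);
    if (buf == NULL)
        return 0;

    size_t bad = 0;
    for (size_t n = 1; n <= max_chars; n++) {
        size_t converted = mbstowcs(buf, s, n);  /* convert at most n chars */
        if (converted == (size_t)-1) {           /* invalid sequence found */
            bad = n;
            break;
        }
        if (converted < n)                       /* reached the terminator */
            break;
    }
    free(buf);
    return bad;
}
```

The loop restarts the conversion from the beginning for every length, so it is quadratic in the string length. That is fine for a one-off diagnosis but not something to leave in production code.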

Update July 25th

From the discussion in the comments, we discovered that the input string is (most likely) encoded as UTF-8. While the original answer is correct as far as it goes, it doesn't go far enough. On Windows, as of this writing, you cannot create a locale that will handle characters encoded in UTF-8.

When faced with UTF-8, instead of calling mbstowcs we can call MultiByteToWideChar with the code page CP_UTF8, but that code will only work on Windows:

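#include <windows.h>

int main()
{
    // UTF-8 encoded input, including the terminating NUL
    const BYTE bytes[] = {0xD7,0x99,0xD7,0x95,0xD7,0x97,0xD7,0x90,0xD7,0x99,0x20,0xD7,0x95,0xD7,0x9B,0xD7,0x98,0xD7,0xA8,0x00};

    // First call: get the length of the converted string in wide characters
    int length = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
        (const char *)bytes, sizeof(bytes), NULL, 0);
    if (length == 0)
        return 1;   // conversion failed; GetLastError() has the reason

    wchar_t *name = new wchar_t[length];

    // Second call: convert the string into the allocated buffer
    int result = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
        (const char *)bytes, sizeof(bytes), name, length);

    delete[] name;
    return result == 0;   // exit code 0 on success
}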

I bet it will work if you set a UTF-8 locale like so (the exact locale name is platform-dependent):

setlocale(LC_CTYPE, "UTF-8");
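
Since the accepted locale name differs between systems, a hedged sketch for POSIX-like platforms might try a few candidates before converting. The function names set_utf8_ctype and wide_length are hypothetical, and the candidate list is an assumption about typical installations, not a guaranteed set. Calling mbstowcs with a NULL destination to get the required length, as in the question, is POSIX behavior and is assumed here.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <stdlib.h>

/* Tries a few common UTF-8 locale names for LC_CTYPE and returns the first
   one the system accepts, or NULL if none of them is available. */
const char *set_utf8_ctype(void)
{
    static const char *const candidates[] = {
        "C.UTF-8",      /* assumption: present on many recent Linux systems */
        "en_US.UTF-8",  /* assumption: commonly installed, not guaranteed */
        "UTF-8"         /* the name from the answer above; some BSD-derived
                           systems reportedly accept it */
    };
    for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
        if (setlocale(LC_CTYPE, candidates[i]) != NULL)
            return candidates[i];
    }
    return NULL;
}

/* Number of wide characters s converts to in the current locale,
   or (size_t)-1 if s contains an invalid multibyte sequence. */
size_t wide_length(const char *s)
{
    assert(s != NULL);
    return mbstowcs(NULL, s, 0);
}
```

If set_utf8_ctype() returns a non-NULL name, wide_length("\327\247\327\236") should report 2 wide characters for the first two letters of the string in the question instead of failing.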
