UTF-8 encoded text arrives with extra characters, how come?

2023-02-09 07:38 Q&A author:

Data reaches my app in an XML document containing UTF-8 encoded text. The text that the user inputs is saved in the XML, and then my app reads it.

Recently it failed when the user wrote one special character at the end. The result is that in the XML every character has an extra 0x40 character before it. So instead of receiving:

67 6f 20 61 68 65 61 64 (go ahead)

it received:

40 67 40 6f 40 20 40 61 40 68 40 65 40 61 40 64 (@g@o@ @a@h@e@a@d)

What went wrong?

0x40 in binary is 01000000, which makes me think that the 1 is some sort of control bit and the data came in a different encoding...

If I understand correctly, you are saying the payload is a string of supposedly UTF-8 bytes, i.e.

40 62 20 C6 40 62

But this is not valid UTF-8. The C6 corrupts it. In UTF-8 a byte >= 0x80 is never valid on its own; it must always be part of a multi-byte sequence. You can see this if you paste the above (space-separated bytes) into my little conversion utility (use the UTF-8 bytes field).

http://sodved.awardspace.info/unicode.pl
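The same check can be done locally. Here is a minimal Python sketch, assuming Python 3; it is only an illustration, not what the asker's app does, and the helper name first_utf8_error is made up for this example. It asks the built-in UTF-8 decoder where the quoted bytes stop being valid.

```python
# Bytes quoted in the answer above.
payload = bytes.fromhex("40 62 20 C6 40 62")


def first_utf8_error(data: bytes):
    """Return (offset, reason) of the first UTF-8 decoding error, or None if valid.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


# With the C6 byte present, decoding fails at the offset of that byte.
print(first_utf8_error(payload))

# With the C6 byte removed, only ASCII remains, so the payload is valid UTF-8.
print(first_utf8_error(payload.replace(b"\xc6", b"")))
```

The reported offset is 3, which is the position of the C6 byte; the payload without it decodes cleanly.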

So I suspect whichever tool/library you are using is encountering the invalid UTF-8 data and is then trying some other way of processing it. In none of the standard single-byte encodings is C6 a curly quote. And C6 is not valid in the GSM 7-bit alphabet (http://www.developershome.com/sms/gsmAlphabet.asp).
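To see what C6 maps to in a few single-byte encodings, something like the following Python sketch can help. The list of codecs is just a sample I picked (all of them ship with Python), not an exhaustive survey, and decode_c6 is a made-up helper name.

```python
# The four common "curly" quotation marks: ‘ ’ “ ”
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"


def decode_c6(codec_names):
    """Decode the single byte 0xC6 with each named codec (illustrative helper)."""
    return {name: b"\xc6".decode(name) for name in codec_names}


# A sample of single-byte codecs; an assumption, not a complete list.
results = decode_c6(["latin-1", "cp1252", "cp1250", "cp437", "mac_roman"])
for name, char in results.items():
    print(f"{name}: U+{ord(char):04X} curly_quote={char in CURLY_QUOTES}")

# For comparison, this is where Windows-1252 actually puts the curly double quotes.
print("\u201c".encode("cp1252"), "\u201d".encode("cp1252"))
```

In this sample C6 never decodes to a curly quote; in Latin-1 and Windows-1252 it is the letter Æ, while the Windows-1252 curly double quotes live at 0x93 and 0x94.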

So your real problem is: what is it doing there? Are you sure about the encoding of the payload? Even in the GSM 7-bit default alphabet, without the C6, it seems weird:

¡b ¡b
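For reference, that GSM reading can be reproduced with a small Python sketch. Python has no built-in GSM codec, so the table below is hand-written and deliberately partial (only the three values needed here), and it assumes one septet per byte rather than the packed form used in SMS PDUs. The names GSM7_PARTIAL and decode_gsm7_unpacked are made up for this example.

```python
# Partial GSM 03.38 default-alphabet table: only the septets needed here.
# In that alphabet 0x40 is the inverted exclamation mark, not '@'.
GSM7_PARTIAL = {0x20: " ", 0x40: "\u00a1", 0x62: "b"}


def decode_gsm7_unpacked(data: bytes) -> str:
    """Decode unpacked GSM 7-bit values (one septet per byte), illustration only."""
    out = []
    for value in data:
        if value > 0x7F:
            # Anything above 0x7F does not fit in 7 bits, so it cannot be a septet.
            raise ValueError(f"0x{value:02X} is not a 7-bit value")
        # Raises KeyError for septets missing from this partial table.
        out.append(GSM7_PARTIAL[value])
    return "".join(out)


# The quoted payload with the C6 byte left out.
print(decode_gsm7_unpacked(bytes.fromhex("40 62 20 40 62")))
```

Feeding it the full payload, including C6, raises ValueError, which matches the point that C6 cannot be a GSM 7-bit character.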

The bytes 40 62 20 C6 40 62 are not valid UTF-8! The problem is the orphaned start byte C6. C6 in binary is 11000110, so it is the start byte of a 2-byte sequence (because it begins with 110; the remaining 5 bits, 00110, are payload bits of the code point). But the continuation byte that must follow the start byte is missing (the next byte, 40, is plain ASCII), so this is an illegal 2-byte sequence! Possibly the bytes are not UTF-8 encoded at all and the C6 is a single character in an ANSI (single-byte) code page. However, C6 is higher than 127 and so is not an ASCII character. Every character higher than 127 must be encoded as a proper multi-byte UTF-8 sequence when encoding to UTF-8.
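The bit patterns described above can be checked with a short Python sketch. The function name describe_utf8_byte is made up for this example, and it classifies bytes purely by their leading bits; it does not catch every invalid value (for instance C0, C1 and F5 and above are never legal in UTF-8).

```python
def describe_utf8_byte(value: int) -> str:
    """Classify one byte by its leading bits (illustrative, not a full validator)."""
    if value < 0x80:
        return "ASCII (0xxxxxxx)"
    if value < 0xC0:
        return "continuation byte (10xxxxxx)"
    if value < 0xE0:
        return "lead of 2-byte sequence (110xxxxx)"
    if value < 0xF0:
        return "lead of 3-byte sequence (1110xxxx)"
    if value < 0xF8:
        return "lead of 4-byte sequence (11110xxx)"
    return "never valid in UTF-8"


for value in bytes.fromhex("40 62 20 C6 40 62"):
    print(f"{value:02X} {value:08b} {describe_utf8_byte(value)}")

# Payload bits carried by the lead byte C6: the low 5 bits.
print(format(0xC6 & 0x1F, "05b"))

# How the character U+00C6 looks when it really is encoded as UTF-8.
print("\u00c6".encode("utf-8").hex(" "))
```

The C6 is classified as a 2-byte lead, but the byte after it is ASCII rather than a continuation byte, which is exactly the orphaned start byte described above. A correctly encoded U+00C6 would arrive as the two bytes c3 86.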
