
UTF-8 encoded text arrives with extra characters. Why?

Data arrives in my app via XML containing UTF-8 encoded data. The text that the user inputs is saved in the XML, and then my app reads it.

Recently it failed when the user wrote one special character at the end. As a result, every character in the XML has an extra 0x40 byte before it. So instead of receiving:

67 6f 20 61 68 65 61 64 (go ahead)

it received:

40 67 40 6f 40 20 40 61 40 68 40 65 40 61 40 64 (@g@o@ @a@h@e@a@d)

What went wrong?

0x40 in binary is 01000000, which makes me think that the 1 is some sort of control bit and the text came in a different encoding...


If I understand correctly, you are saying the payload is a string of supposedly UTF-8 bytes, i.e.

40 62 20 C6 40 62

But this is not valid UTF-8. The C6 corrupts it. In UTF-8 a byte of 0x80 or above is never valid on its own; such bytes only occur as part of a multi-byte sequence. You can see this if you paste the above (space-separated bytes) into my little conversion utility (use the UTF-8 bytes field).

http://sodved.awardspace.info/unicode.pl
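If the utility is not at hand, the same check can be sketched locally with Python's built-in strict UTF-8 decoder. The helper name `check_utf8` is made up for this illustration; it only reports whether a space-separated hex string decodes, and where it fails if it does not.

```python
def check_utf8(hex_bytes: str) -> str:
    """Try to decode space-separated hex bytes as strict UTF-8."""
    data = bytes.fromhex(hex_bytes)  # spaces between byte pairs are allowed
    try:
        return "valid: " + repr(data.decode("utf-8"))
    except UnicodeDecodeError as err:
        # err.start is the offset of the first offending byte
        return f"invalid at offset {err.start}: {err.reason}"


print(check_utf8("67 6f 20 61 68 65 61 64"))  # the expected "go ahead" bytes
print(check_utf8("40 62 20 C6 40 62"))        # fails at the C6 (offset 3)
```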

So I suspect whichever tool/library you are using is encountering the invalid UTF-8 data and is then trying some other way of processing it. In none of the standard single-byte encodings is C6 a curly quote. And C6 is not valid in GSM 7-bit (http://www.developershome.com/sms/gsmAlphabet.asp).

So your real problem is: what is it doing there? Are you sure about the encoding of the payload? Even in the GSM 7-bit default alphabet, without the C6, it seems weird:

¡b ¡b
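To show where that rendering comes from, here is a minimal sketch that reads the bytes 40 62 20 40 62 through a small excerpt of the GSM 7-bit default alphabet. The table `GSM7_PARTIAL` and the function `gsm7_decode_partial` are hypothetical names for this illustration; the table covers only the values needed here (0x40 is ¡, space and lowercase a-z match ASCII), and the sketch assumes one septet per byte rather than packed septets.

```python
# Deliberately partial excerpt of the GSM 7-bit default alphabet.
GSM7_PARTIAL = {0x40: "\u00a1", 0x20: " "}            # 0x40 is the inverted '!'
GSM7_PARTIAL.update({c: chr(c) for c in range(0x61, 0x7B)})  # a-z as in ASCII


def gsm7_decode_partial(hex_bytes: str) -> str:
    """Map each byte through the partial table; unknown values become U+FFFD."""
    return "".join(GSM7_PARTIAL.get(b, "\ufffd") for b in bytes.fromhex(hex_bytes))


print(gsm7_decode_partial("40 62 20 40 62"))
```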


The bytes 40 62 20 C6 40 62 are not valid UTF-8! The problem is the orphaned start byte C6. In binary, C6 is 11000110, so it is the start byte of a 2-byte sequence (it begins with 110; the remaining 5 bits, 00110, are payload bits of the code point). But the continuation byte that must follow the start byte is missing, so this is an illegal 2-byte sequence. Possibly the bytes are not UTF-8 encoded at all and the C6 is an ANSI character, i.e. a single-byte character. However, C6 is higher than 127 and so is not an ASCII character. Every character higher than 127 must be encoded as a proper multi-byte sequence when encoding to UTF-8.
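The bit-pattern reasoning can be sketched in a few lines of Python. `classify_utf8_byte` is a hypothetical helper that looks only at the leading bits; it is simplified and does not reject the lead bytes that are always invalid (C0, C1, F5 to FF). Treating "ANSI" as Latin-1 in the last lines is an assumption made purely for illustration.

```python
def classify_utf8_byte(b: int) -> str:
    """Classify a single byte by its leading bits (simplified)."""
    if b < 0x80:
        return "ASCII"            # 0xxxxxxx
    if b >> 6 == 0b10:
        return "continuation"     # 10xxxxxx
    if b >> 5 == 0b110:
        return "lead-2"           # 110xxxxx, needs 1 continuation byte
    if b >> 4 == 0b1110:
        return "lead-3"           # 1110xxxx, needs 2 continuation bytes
    if b >> 3 == 0b11110:
        return "lead-4"           # 11110xxx, needs 3 continuation bytes
    return "invalid"


print(format(0xC6, "08b"))          # bit pattern of C6
print(classify_utf8_byte(0xC6))     # start byte of a 2-byte sequence
print(format(0xC6 & 0x1F, "05b"))   # its 5 payload bits
print(classify_utf8_byte(0x40))     # the next byte is ASCII, not a continuation

# Assuming "ANSI" means Latin-1: C6 is one character there, and encoding
# that character to UTF-8 yields a proper 2-byte sequence instead.
ch = bytes([0xC6]).decode("latin-1")
print(ch.encode("utf-8").hex())
```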
