
Java UTF-8 differences

The JavaDoc says "The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls."

But what does this even mean? What's an embedded null in this context? I am trying to convert from a Java saved UTF-8 string to "real" UTF-8.


In C a string is terminated by the byte value 00.

The thing here is that you can have 0-chars in Java strings, but to avoid confusion when passing a string over to C (which all native methods are written in), the character is encoded another way, namely as the two bytes

11000000 10000000

(according to the javadoc) neither of which is actually 00.

This is a hack to work around something you cannot change easily.

Also note that a lenient decoder will turn this sequence back into 00, but strictly speaking it is not valid standard UTF-8: 0xC0 0x80 is an overlong encoding, which conformant UTF-8 decoders reject. Java calls this variant "modified UTF-8".
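The encoding is easy to observe directly. A minimal sketch (the class name is illustrative) that dumps what DataOutputStream.writeUTF actually writes for a single NUL character:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class NulEncodingDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF("\u0000"); // a string containing one NUL character
        }
        // writeUTF first writes a 16-bit byte count (here 2), then encodes
        // U+0000 as the two bytes 0xC0 0x80. The string data itself contains
        // no 0x00 byte; the 00 below is the high byte of the length prefix.
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
        // prints: 00 02 C0 80
    }
}
```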


No "embedded nulls" means that the raw data does not contain a single 0x00 (NULL) byte.

\u0000 gets encoded to (binary) 11000000 10000000, (hex) 0xC080.


That's not a Java-wide difference; it only applies to DataInputStream/DataOutputStream. If the string data was written using DataOutputStream, then just read it back using DataInputStream.

If you need to write the string data to, say, a file, don't use DataOutputStream, use a Writer, which is meant for character streams.
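For example, writing through an OutputStreamWriter produces standard UTF-8, where U+0000 really is a single zero byte (a small sketch with an illustrative class name):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriterDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            w.write("A\u0000B");
        }
        // Standard UTF-8: no length prefix, and U+0000 is one 0x00 byte.
        for (byte b : buf.toByteArray()) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
        // prints: 41 00 42
    }
}
```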


This applies only to the writeUTF method of DataOutputStream, not to normal character streams (OutputStreamWriter and the like).

It means that if you have a string "\u0000", it will be encoded as 0xC0 0x80 instead of simply 0x00.

And the other way around: the sequence 0xC0 0x80, which never occurs in valid UTF-8 data, decodes back to a NUL character.

Also, the documentation you linked seems to be from the time when Unicode was still a 16-bit character set. Nowadays it also allows characters above 0xFFFF, which are represented by two Java char values each (a surrogate pair, in UTF-16 format) and need 4 bytes in standard UTF-8. I'm not sure about the implementation here, though - it looks like these are simply written in CESU-8 format (i.e. two 3-byte sequences, each corresponding to one UTF-16 surrogate, which together give one Unicode character). You will have to take care of this, too.
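That surrogate behaviour can be checked directly. In this sketch (illustrative class name), writeUTF encodes each UTF-16 char of U+1F600 as its own 3-byte sequence, CESU-8 style, while getBytes produces the single 4-byte standard UTF-8 sequence:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class SurrogateDemo {
    public static void main(String[] args) throws IOException {
        String s = "\uD83D\uDE00"; // U+1F600: one code point, two Java chars

        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        // writeUTF: 16-bit length prefix, then each surrogate as 3 bytes.
        print(buf.toByteArray());
        // prints: 00 06 ED A0 BD ED B8 80

        // Standard UTF-8 uses a single 4-byte sequence instead.
        print(s.getBytes(StandardCharsets.UTF_8));
        // prints: F0 9F 98 80
    }

    static void print(byte[] bytes) {
        for (byte b : bytes) System.out.printf("%02X ", b & 0xFF);
        System.out.println();
    }
}
```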

If you are using Java, the simplest thing would be to use DataInputStream.readUTF() to read this into a String, and then convert it to real UTF-8 data (with getBytes("UTF-8") or an OutputStreamWriter).
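A minimal sketch of that round trip (illustrative class name; the input is simulated in-memory rather than read from a file):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        // Simulate data "saved" with writeUTF (length prefix + modified UTF-8).
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF("a\u0000b");
        }

        // readUTF understands that format and gives back the String...
        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        String s = in.readUTF();

        // ...and getBytes produces standard UTF-8: NUL is one 0x00 byte again.
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
        // prints: 61 00 62
    }
}
```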


If you are having difficulty reading a "saved" Java string, you need to look at the specification for the methods that read/write in that format:

  • If the string was written using DataOutput.writeUTF(), the DataInput.readUTF() javadoc is the definitive spec. In addition to the non-standard handling of NUL, it specifies that the string starts with an unsigned 16-bit byte count.

  • If the string was written using ObjectOutputStream.writeObject() then the serialization spec is definitive.
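The 16-bit byte count in the writeUTF format is easy to see (a sketch with an illustrative class name):

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class LengthPrefixDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF("hello");
        }
        byte[] b = buf.toByteArray();
        // The first two bytes are a big-endian unsigned 16-bit byte count
        // of the encoded string data that follows.
        int len = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
        System.out.println(len);      // prints: 5
        System.out.println(b.length); // prints: 7 (2-byte prefix + 5 data bytes)
    }
}
```

Note that the count is the number of encoded bytes, not the number of chars, so it will differ from String.length() whenever the string contains non-ASCII characters or NUL.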
