
How to convert a UTF-8 byteOffset into a charOffset for a Java String?

I have a byte offset for a byte array containing a UTF-8 encoded string. How can I transform that into a char offset for the corresponding Java String?

NB. this question used to read:

I have a byte offset into a standard Java String, and I would like to convert that to a character offset.

In practice this will mean a method like charOffsetBefore(int byteOffset) since any byte offset could be in the middle of a code point.

Thanks.


Please be extremely careful with your terminology, otherwise you'll get confused. There is no such thing as a "byte offset into a Java string". Java strings are made up of 16-bit chars (UTF-16 code units).

So I assume that you have a byte array and an offset and you want to convert that into a Java string and still preserve locations (so you can map back and forth).

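This depends on the encoding of the byte array. If it's UTF-8, then any byte that has its MSB set is part of a multi-byte encoding sequence; a byte with the MSB clear is a complete ASCII character on its own. Continuation bytes satisfy (b & 0xC0) == 0x80, so from your offset, step backwards until you reach a byte for which (b & 0xC0) == 0xC0 (note the parentheses: in Java, == binds tighter than &). That's the start of the encoding sequence (see the Wikipedia article on UTF-8).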
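As a rough illustration of the above, here is a minimal sketch of the charOffsetBefore method the question asks for. The class name Utf8Offsets is made up for the example, and the code assumes the array holds valid UTF-8; malformed input is not detected. It steps back over continuation bytes to the start of the sequence, then counts how many Java chars the preceding bytes decode to (a 4-byte sequence becomes a surrogate pair, so it counts as two chars).

```java
import java.nio.charset.StandardCharsets;

public final class Utf8Offsets {
    private Utf8Offsets() {}

    /**
     * Returns the index, in the decoded Java String, of the char that starts
     * the code point containing byteOffset. Assumes utf8 is valid UTF-8.
     */
    public static int charOffsetBefore(byte[] utf8, int byteOffset) {
        if (byteOffset < 0 || byteOffset > utf8.length) {
            throw new IndexOutOfBoundsException("byteOffset: " + byteOffset);
        }
        // Step back over continuation bytes (10xxxxxx) to the lead byte.
        int start = byteOffset;
        while (start > 0 && start < utf8.length && (utf8[start] & 0xC0) == 0x80) {
            start--;
        }
        // Count the UTF-16 chars produced by the bytes before 'start'.
        int chars = 0;
        for (int i = 0; i < start; i++) {
            int b = utf8[i] & 0xFF;
            if ((b & 0xC0) == 0x80) {
                continue; // continuation byte, already accounted for by its lead byte
            }
            // Lead bytes 0xF0 and above start 4-byte sequences (surrogate pair in Java).
            chars += (b >= 0xF0) ? 2 : 1;
        }
        return chars;
    }

    public static void main(String[] args) {
        byte[] bytes = "a\u00e9\u20acb".getBytes(StandardCharsets.UTF_8);
        // Byte 4 is in the middle of the 3-byte euro sign, which is char 2.
        System.out.println(charOffsetBefore(bytes, 4));
    }
}
```

For valid input, new String(utf8, 0, start, StandardCharsets.UTF_8).length() should give the same number, at the cost of allocating a temporary string.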

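If you're asking about characters in the Java string itself, then the encoding is UTF-16 and you need to look for surrogate pairs: a code point above U+FFFF takes two chars, so a char offset can also land in the middle of a code point.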
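For the UTF-16 side, a small sketch along the same lines: given a char offset into a String, move back one char if it points at the low (second) half of a surrogate pair. The class and method names are hypothetical; Character.isHighSurrogate and Character.isLowSurrogate are standard JDK methods.

```java
public final class CodePointBoundaries {
    private CodePointBoundaries() {}

    /** Returns charOffset, or charOffset - 1 if it splits a surrogate pair. */
    public static int codePointStart(String s, int charOffset) {
        if (charOffset > 0 && charOffset < s.length()
                && Character.isLowSurrogate(s.charAt(charOffset))
                && Character.isHighSurrogate(s.charAt(charOffset - 1))) {
            return charOffset - 1; // point at the high surrogate instead
        }
        return charOffset;
    }
}
```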


I would suggest that you do not have a byte offset into a standard Java String. If indeed you do, can you tell us how you got it (code, please)?
