How to convert a UTF-8 byteOffset into a charOffset for a Java String?
I have a byte offset into a byte array containing a UTF-8 encoded string. How can I transform that into a char offset for the corresponding Java String?
NB. this question used to read:
I have a byte offset into a standard Java String, and I would like to convert that to a character offset.
In practice this will mean a method like charOffsetBefore(int byteOffset)
since any byte offset could be in the middle of a code point.
Thanks.
Please be extremely careful with your terminology, otherwise you'll get confused. There is no such thing as a "byte offset into a Java string". Java strings are made up of 16-bit chars (UTF-16 code units).
So I assume that you have a byte array and an offset and you want to convert that into a Java string and still preserve locations (so you can map back and forth).
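This depends on the encoding of the byte array. If it's UTF-8, then any byte that has its MSB set is part of a multi-byte encoding sequence. Bytes for which (b & 0xC0) == 0x80 are continuation bytes, so search backwards from your offset for the byte for which (b & 0xC0) == 0xC0 (note the parentheses: in Java, == binds tighter than &). That's the start of the encoding sequence (see the Wikipedia article on UTF-8).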
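As a rough illustration of the bit test described above, here is a minimal sketch of the charOffsetBefore method the question asks for. It assumes the array holds well-formed UTF-8; the class name Utf8Offsets and the extra byte[] parameter are made up for this example.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Offsets {

    /**
     * Returns the index, in the decoded String, of the char that starts at
     * (or contains) the given byte offset. Assumes well-formed UTF-8.
     */
    static int charOffsetBefore(byte[] utf8, int byteOffset) {
        if (byteOffset < 0 || byteOffset > utf8.length) {
            throw new IndexOutOfBoundsException("byteOffset: " + byteOffset);
        }
        // Step back over continuation bytes (10xxxxxx) to the first byte
        // of the code point that the offset falls into.
        int start = byteOffset;
        while (start > 0 && start < utf8.length && (utf8[start] & 0xC0) == 0x80) {
            start--;
        }
        // Count the UTF-16 code units produced by the bytes before 'start'.
        int chars = 0;
        for (int i = 0; i < start; i++) {
            int b = utf8[i] & 0xFF;
            if ((b & 0xC0) == 0x80) {
                continue; // continuation byte, already counted with its lead byte
            }
            // A 4-byte sequence (lead byte 11110xxx) decodes to a surrogate
            // pair, i.e. two chars; everything else decodes to one char.
            chars += (b >= 0xF0) ? 2 : 1;
        }
        return chars;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), e-acute (2 bytes), an emoji (4 bytes), 'b' (1 byte)
        byte[] bytes = "a\u00e9\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i <= bytes.length; i++) {
            System.out.println(i + " -> " + charOffsetBefore(bytes, i));
        }
    }
}
```

If you don't mind the allocation, new String(utf8, 0, start, StandardCharsets.UTF_8).length() should give the same count once start has been aligned to a code point boundary.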
If you're asking about Java chars, then the encoding is UTF-16 and you need to look for surrogate pairs, since a code point above U+FFFF occupies two chars.
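For the UTF-16 side, a small sketch of what "look for surrogate pairs" could mean in practice: move a char index back by one when it lands on the second half of a pair. The class and method names (SurrogateAlign, alignToCodePoint) are hypothetical; Character.isLowSurrogate and Character.isHighSurrogate are standard JDK methods.

```java
public class SurrogateAlign {

    /**
     * If charIndex points at the low (second) half of a surrogate pair,
     * returns the index of the high (first) half; otherwise returns
     * charIndex unchanged.
     */
    static int alignToCodePoint(String s, int charIndex) {
        if (charIndex > 0 && charIndex < s.length()
                && Character.isLowSurrogate(s.charAt(charIndex))
                && Character.isHighSurrogate(s.charAt(charIndex - 1))) {
            return charIndex - 1;
        }
        return charIndex;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', one emoji (two chars), 'b'
        for (int i = 0; i <= s.length(); i++) {
            System.out.println(i + " -> " + alignToCodePoint(s, i));
        }
    }
}
```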
I would suggest that you do not have a byte offset into a standard Java String. If you really do, can you tell us how you got it (code please)?