
Java: reading character streams that contain supplementary Unicode characters

I'm having trouble reading supplementary Unicode characters in Java. I have a file that may contain characters in the supplementary set (anything greater than \uFFFF). When I set up my InputStreamReader to read the file as UTF-8, I would expect the read() method to return a single character for each supplementary character, but instead it seems to split on the 16-bit threshold.

I saw some other questions about basic Unicode character streams, but nothing seems to deal with characters that need more than 16 bits.

Here's some simplified sample code:

InputStreamReader input = new InputStreamReader(file, "UTF8");
int nextChar = input.read();
while(nextChar != -1) {
    ...
    nextChar = input.read();
}

Does anyone know what I need to do to correctly read in a UTF-8 encoded file that contains supplementary characters?


Java's char and String types work with UTF-16. So, if your input stream has astral (supplementary) characters, each one will appear as a surrogate pair, i.e., as two chars. The first char is the high surrogate, and the second char is the low surrogate.
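If you want one int per code point rather than per char, you can recombine the pair yourself. Here is a minimal sketch of that idea, not a definitive implementation: the class name CodePointReader and the helpers readCodePoint and decodeUtf8 are names I made up for illustration, and I'm wrapping the reader in a PushbackReader so an unpaired high surrogate doesn't swallow the char that follows it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, for illustration only.
class CodePointReader {

    // Reads one full code point, or returns -1 at end of stream.
    static int readCodePoint(PushbackReader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first; // end of stream, or an ordinary BMP char
        }
        int second = in.read();
        if (second != -1 && Character.isLowSurrogate((char) second)) {
            // Valid surrogate pair: combine into one supplementary code point.
            return Character.toCodePoint((char) first, (char) second);
        }
        if (second != -1) {
            in.unread(second); // not a low surrogate, so put it back
        }
        return first; // unpaired high surrogate, returned as-is
    }

    // Decodes a UTF-8 byte array into a list of code points.
    static List<Integer> decodeUtf8(byte[] utf8) {
        List<Integer> codePoints = new ArrayList<>();
        try (PushbackReader in = new PushbackReader(new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8))) {
            for (int cp = readCodePoint(in); cp != -1; cp = readCodePoint(in)) {
                codePoints.add(cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (written as its surrogate pair), 'b'
        byte[] utf8 = "a\uD83D\uDE00b".getBytes(StandardCharsets.UTF_8);
        for (int cp : decodeUtf8(utf8)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

For your case you would wrap the InputStreamReader over your file in the PushbackReader instead of the byte array used here.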


Though read() is defined to return int, and could theoretically return a supplementary character's code point "all at once", I believe the return type is only int to allow a value of -1 to be returned.

The value you're getting from read() is basically a char by another name, and in Java a char is limited to 16 bits.

Java can only represent supplementary characters as a UTF-16 surrogate pair. As far as Java is concerned, there is no such thing as a "single character" (at least in the char sense) once you get above 0xFFFF.
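To see the char versus code point distinction concretely, here is a small sketch (the class name SupplementaryDemo and the describe helper are just names I picked for the example). It builds a String holding one supplementary character and compares its length in chars with its length in code points.

```java
// Hypothetical demo class, for illustration only.
class SupplementaryDemo {

    // Reports how many chars and how many code points a string contains.
    static String describe(String s) {
        return s.length() + " chars, "
                + s.codePointCount(0, s.length()) + " code point(s)";
    }

    public static void main(String[] args) {
        // U+1F600 lies above 0xFFFF, so it is stored as a surrogate pair.
        String s = new String(Character.toChars(0x1F600));
        System.out.println(describe(s));

        // String.codePoints() iterates by code point instead of by char.
        s.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

So the data isn't lost or corrupted. It simply occupies two char slots, and the code point APIs on String and Character let you treat the pair as one unit.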
