Java: reading character streams with supplementary Unicode characters
I'm having trouble reading supplementary Unicode characters in Java. I have a file that may contain characters in the supplementary set (anything above \uFFFF). When I set up my InputStreamReader to read the file as UTF-8, I would expect the read() method to return a single character for each supplementary character. Instead, it seems to split them at the 16-bit boundary.
I saw some other questions about basic Unicode character streams, but nothing seems to deal with the case of characters wider than 16 bits.
Here's some simplified sample code:
InputStreamReader input = new InputStreamReader(file, "UTF8");
int nextChar = input.read();
while(nextChar != -1) {
...
nextChar = input.read();
}
Does anyone know what I need to do to correctly read in a UTF-8 encoded file that contains supplementary characters?
Java works with UTF-16. So, if your input stream has astral characters, they will appear as a surrogate pair, i.e., as two chars. The first character is the high surrogate, and the second character is the low surrogate.
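To make the surrogate-pair behaviour concrete, here is a minimal sketch that decodes the UTF-8 bytes of one supplementary character and records what read() returns. The class name SurrogatePairDemo, the helper readAll, and the choice of U+1F600 as the sample character are all assumptions made for illustration; none of them come from the question.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SurrogatePairDemo {

    // Hypothetical helper: decodes the given UTF-8 bytes and collects
    // every value that read() returns, in order.
    public static List<Integer> readAll(byte[] utf8) throws IOException {
        List<Integer> values = new ArrayList<>();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            for (int c = in.read(); c != -1; c = in.read()) {
                values.add(c);
            }
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        // U+1F600 is an arbitrary supplementary code point chosen for the demo.
        byte[] utf8 = new String(Character.toChars(0x1F600))
                .getBytes(StandardCharsets.UTF_8);
        for (int c : readAll(utf8)) {
            char ch = (char) c;
            System.out.printf("0x%04X high=%b low=%b%n",
                    c, Character.isHighSurrogate(ch), Character.isLowSurrogate(ch));
        }
    }
}
```

For this one character, read() should be called twice before returning -1: once for the high surrogate (0xD83D) and once for the low surrogate (0xDE00).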
Though read() is defined to return int, and could theoretically return a supplementary character's code point "all at once", I believe the return type is only int to allow a value of -1 to be returned.
The value you're getting from read() is basically a char by another name, and a Java char is limited to 16 bits.
In char form, Java can only represent a supplementary character as a UTF-16 surrogate pair; as far as Java is concerned, there is no such thing as a "single character" (at least in the char sense) once you get above 0xFFFF.
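If what you want is one int per code point, one possible approach is to combine the two halves yourself as you read. The following is only a sketch under stated assumptions: CodePointReader and readCodePoint are hypothetical names, and wrapping the reader in a PushbackReader is my own choice so that a char following an unpaired high surrogate is not lost. It uses Character.isHighSurrogate, Character.isLowSurrogate and Character.toCodePoint from the standard library.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

class CodePointReader {

    // Hypothetical helper: returns the next full code point, or -1 at end
    // of stream. An unpaired surrogate is returned as-is.
    public static int readCodePoint(PushbackReader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first;
        }
        int second = in.read();
        if (second != -1 && Character.isLowSurrogate((char) second)) {
            // Well-formed pair: merge both halves into one code point.
            return Character.toCodePoint((char) first, (char) second);
        }
        if (second != -1) {
            in.unread(second); // not a low surrogate, leave it for the next call
        }
        return first;
    }

    public static void main(String[] args) throws IOException {
        // 'a', then U+1F600 written as its surrogate pair, then 'b'.
        PushbackReader in = new PushbackReader(new StringReader("a\uD83D\uDE00b"));
        for (int cp = readCodePoint(in); cp != -1; cp = readCodePoint(in)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

To apply this to the loop in the question, you would wrap the InputStreamReader in a PushbackReader and call the helper instead of read(). The values it returns can exceed 0xFFFF, so they need to be kept as int rather than cast to char.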