Java: String.toCharArray() with unicode characters
I know that char cannot contain Unicode characters (like char c = '\u1023'). So how would I go about doing
String s = "ABCDEFG\u1023";
char[] c = s.toCharArray();
I would like to convert s to a CharArray for performance reasons as I have to loop through every character in a potentially very long string which is inefficient. Anything which achieves the same result is fine.
Thanks a lot!
EDIT: Actually char can contain unicode chars. I'm just being stupid. Thanks to those who helped out anyway.
Whoever told you that in Java a char can't contain Unicode characters was wrong:
The values of the integral types are integers in the following ranges:
- For char, from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535
Three things:
- A char most certainly can hold '\u1023'.
- toCharArray() will return a char array that is virtually the same as UTF-16.
- Since a char is 16 bits and Unicode spans 21 bits, the characters outside the BMP are encoded as two surrogate chars. Java 1.5 onwards has APIs for this, for example String.codePointAt(...). If you are using Java 1.4 or earlier, look into ICU4J.
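To make the surrogate point concrete, here is a small sketch comparing the length of the array returned by toCharArray() with the number of code points. The class and method names (CharArrayDemo, charCount, codePointCount) are made up for illustration, and U+1F600 is just an arbitrary character outside the BMP.

```java
final class CharArrayDemo {
    // Number of UTF-16 code units, i.e. the length of the array toCharArray() returns.
    static int charCount(String s) {
        return s.toCharArray().length;
    }

    // Number of Unicode code points in the string.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "ABCDEFG\u1023";            // U+1023 is in the BMP: one char
        String supplementary = "A\uD83D\uDE00";  // U+1F600 is outside the BMP: two chars

        System.out.println(charCount(bmp) + " chars, " + codePointCount(bmp) + " code points");
        System.out.println(charCount(supplementary) + " chars, "
                + codePointCount(supplementary) + " code points");
    }
}
```

For the string from the question both counts are 8, so toCharArray() is safe there. For the second string the array has 3 elements but only 2 code points.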
In Java, a char is essentially an unsigned short. To iterate through a string that contains Unicode characters outside the range supported by a single char (the first 65536 code points), use the following pattern, which stores each code point as an int.
for (int i = 0; i < str.length();) {
    int ch = str.codePointAt(i);
    // do stuff with ch...
    i += Character.charCount(ch);
}
Java was designed with first-class support for the first 65536 characters, which at the time was an improvement over C/C++, which had first-class support for only the first 128 or 256 characters. Unfortunately, it means that the above pattern is necessary in Java to support the out-of-range characters that are becoming more common.
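If you specifically want an array, as in the question, one possible approach (assuming Java 8 or later) is to collect the code points into an int[] with String.codePoints(). This is only a sketch; the class and method names are hypothetical, and whether it is faster than the loop above for a very long string would need measuring.

```java
final class CodePointArrayDemo {
    // Code-point analogue of toCharArray(): one int per Unicode code point.
    static int[] toCodePointArray(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        // "A", then U+1F600 written as a surrogate pair, then "B"
        int[] cps = toCodePointArray("A\uD83D\uDE00B");
        for (int cp : cps) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```

Each element is a full code point, so a surrogate pair in the string shows up as a single int in the array.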
A Java char can contain most Unicode characters, as the others have already mentioned, but characters outside of the Basic Multilingual Plane (BMP) are split into two chars (a surrogate pair), and handling those chars independently might break the string.
To be safe, you can split the string into a string array with one element per code point:
String[] c = s.codePoints()
.mapToObj(cp -> new String(Character.toChars(cp)))
.toArray(size -> new String[size]);
... or use the isSurrogate, isLowSurrogate and isHighSurrogate methods of the Character class to avoid altering a single char within a pair:
Character.isSurrogate('a');
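As an illustration of how those checks might be used, here is a sketch of a helper that cuts a string to a maximum number of chars without splitting a surrogate pair. The class SafeCutDemo and the method safePrefix are hypothetical names, not part of any library.

```java
final class SafeCutDemo {
    // Returns at most maxChars chars of s, backing off by one char
    // if the cut would fall between a high and a low surrogate.
    static String safePrefix(String s, int maxChars) {
        if (maxChars >= s.length()) {
            return s;
        }
        if (maxChars <= 0) {
            return "";
        }
        int end = maxChars;
        // Here end < s.length(), so both charAt calls are in range.
        if (Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE00B"; // "A", U+1F600 as a surrogate pair, "B"
        System.out.println(safePrefix(s, 2).length());
        System.out.println(safePrefix(s, 3).length());
    }
}
```

A plain s.substring(0, 2) on that string would keep a lone high surrogate, whereas the helper returns just "A".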