
Java: String.toCharArray() with unicode characters

I know that char cannot contain Unicode characters (like char c = '\u1023'). So how would I go about doing

    String s = "ABCDEFG\u1023";
    char[] c = s.toCharArray();

I would like to convert s to a char array for performance reasons, as I have to loop through every character in a potentially very long string, and doing that any other way seems inefficient. Anything that achieves the same result is fine.

Thanks a lot!

EDIT: Actually char can contain unicode chars. I'm just being stupid. Thanks to those who helped out anyway.


Whoever told you that a Java char can't contain Unicode characters was wrong. The Java Language Specification (§4.2.1) says:

The values of the integral types are integers in the following ranges:

  • For char, from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535


Three things:

  1. A char most certainly can hold '\u1023'.
  2. toCharArray() returns a char array containing the string's UTF-16 code units.
  3. Since a char is 16 bits wide and Unicode code points span 21 bits, characters outside the BMP are encoded as two surrogate chars. Java 1.5 onwards has APIs for this, for example String.codePointAt(...). If you are using Java 1.4 or earlier, look into ICU4J.
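
To make these points concrete, here is a minimal sketch. The class name CharArrayDemo, its helper methods, and the sample strings are made up for illustration. It compares the number of chars returned by toCharArray() with the number of code points, first for the string from the question and then for a string containing U+1F600, which lies outside the BMP.

```java
public class CharArrayDemo {
    // Number of UTF-16 code units (chars) in the array returned by toCharArray().
    static int charUnits(String s) {
        return s.toCharArray().length;
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // '\u1023' is inside the BMP, so it fits in a single char.
        String bmp = "ABCDEFG\u1023";
        // U+1F600 is outside the BMP and is written as a surrogate pair.
        String supplementary = "A\uD83D\uDE00";

        System.out.println(charUnits(bmp) + " chars, " + codePoints(bmp) + " code points");
        System.out.println(charUnits(supplementary) + " chars, "
                + codePoints(supplementary) + " code points");
    }
}
```

For the string in the question the two counts agree, so looping over the char array is safe. They only diverge once the string contains characters outside the BMP.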


In Java, a char is essentially an unsigned 16-bit integer. To iterate over a string that contains Unicode characters outside the range a single char can represent (the first 65536 code points), use the following pattern, which holds each code point in an int.

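    // The loop header has no increment: i advances by 1 or 2 chars per code point.
    for (int i = 0; i < str.length();) {
        int ch = str.codePointAt(i);   // full code point starting at index i
        // do stuff with ch...
        i += Character.charCount(ch);  // 2 for a surrogate pair, otherwise 1
    }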

Java was designed with first-class support for the first 65536 characters, which at the time was an improvement over C/C++, which had first-class support for only the first 128 or 256 characters. Unfortunately, it means that the above pattern is necessary in Java to support the out-of-range characters that are becoming more common.


A Java char can hold most Unicode characters, as others have already mentioned, but characters outside the Basic Multilingual Plane (BMP) are split into two chars (a surrogate pair), and handling those chars independently can corrupt the string.

To be safe, you can split the string into an array of strings, one per code point (this needs Java 8 or later for codePoints()):

    String[] c = s.codePoints()
        .mapToObj(cp -> new String(Character.toChars(cp)))
        .toArray(String[]::new);

... or use the static isSurrogate, isLowSurrogate and isHighSurrogate methods of the Character class to avoid altering a single char within a pair:

    Character.isSurrogate('a'); // false: 'a' is an ordinary BMP character
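
As a rough sketch of how those checks might be used while looping over the char array from the question: the class SurrogateScan and the method countCodePoints below are hypothetical names, not part of any library. The loop treats a high surrogate followed by a low surrogate as one unit and everything else, including an unpaired surrogate, as a unit of its own.

```java
public class SurrogateScan {
    // Counts logical characters in a char[] without splitting surrogate pairs.
    static int countCodePoints(char[] chars) {
        int count = 0;
        for (int i = 0; i < chars.length; i++) {
            boolean isPair = Character.isHighSurrogate(chars[i])
                    && i + 1 < chars.length
                    && Character.isLowSurrogate(chars[i + 1]);
            if (isPair) {
                i++; // step over the low surrogate so the pair stays together
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        char[] c = "A\uD83D\uDE00B".toCharArray(); // 'A', U+1F600, 'B'
        System.out.println(c.length + " chars, " + countCodePoints(c) + " code points");
    }
}
```

The same pair check can guard any per-char modification: when isPair is true, process chars[i] and chars[i + 1] together rather than one at a time.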
