Java: String.toCharArray() with unicode characters

Asked 2023-04-10 21:46

I know that char cannot contain Unicode characters (like char c = '\u1023'). So how would I go about doing

    String s = "ABCDEFG\u1023";
    char[] c = s.toCharArray();

I would like to convert s to a char array for performance reasons as I have to loop through every character in a potentially very long string which is inefficient. Anything which achieves the same result is fine.

Thanks a lot!

EDIT: Actually char can contain unicode chars. I'm just being stupid. Thanks to those who helped out anyway.

Whoever told you that a Java char can't contain Unicode characters was wrong. The Java Language Specification (section 4.2.1) states:

The values of the integral types are integers in the following ranges:

For char, from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535
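As a quick check of the range quoted above, here is a minimal sketch using the string from the question. The class name BmpCharDemo and the helper countChar are made up for illustration; only toCharArray() and plain char comparison come from the standard library.

```java
class BmpCharDemo {
    // Counts how many chars in s are equal to target (hypothetical helper).
    static int countChar(String s, char target) {
        int count = 0;
        for (char c : s.toCharArray()) {
            if (c == target) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        char c = '\u1023';                   // a BMP character fits in a single char
        String s = "ABCDEFG\u1023";
        char[] chars = s.toCharArray();

        System.out.println(chars.length);    // 8
        System.out.println((int) chars[7]);  // 4131, i.e. 0x1023
        System.out.println(countChar(s, c)); // 1
    }
}
```

So for this particular string, looping over the char array sees every character exactly once.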

Three things:

A char most certainly can hold '\u1023'.
toCharArray() returns a char array holding the string's UTF-16 code units.
Since a char is 16 bits wide and Unicode code points need up to 21 bits, the characters outside the BMP are encoded as two surrogate chars. Java 1.5 onwards has APIs for this, for example String.codePointAt(...). If you are using Java 1.4 or earlier, look into ICU4J.
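To make the surrogate point concrete, here is a small sketch with one character outside the BMP. U+1F600 is chosen arbitrarily as the example, and the class name SurrogateDemo and the helper hasSupplementary are made up for illustration.

```java
class SurrogateDemo {
    // True if s contains at least one well-formed surrogate pair,
    // i.e. it has more chars than code points (hypothetical helper).
    static boolean hasSupplementary(String s) {
        return s.length() != s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'A' followed by U+1F600, written as its surrogate pair.
        String s = "A\uD83D\uDE00";
        char[] chars = s.toCharArray();

        System.out.println(chars.length);                          // 3
        System.out.println(s.codePointCount(0, s.length()));       // 2
        System.out.println(Character.isHighSurrogate(chars[1]));   // true
        System.out.println(Character.isLowSurrogate(chars[2]));    // true
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1f600
        System.out.println(hasSupplementary(s));                   // true
    }
}
```

The char array has three elements for two characters, which is why code that treats each char as a whole character can go wrong on such input.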

In Java, a char is essentially an unsigned 16-bit integer. To iterate through a string that contains Unicode characters outside the range a single char can hold (the first 65536 code points), you should use the following pattern, which stores each code point as an int.

for (int i = 0; i < str.length();) {
    int ch = str.codePointAt(i);
    // do stuff with ch...
    i += Character.charCount(ch);
}

Java was designed with first-class support for the first 65536 characters, which at the time was an improvement over C/C++, which had first-class support for only the first 128 or 256 characters. Unfortunately, it means that the above pattern is necessary in Java to support the out-of-range characters that are becoming more common.

A Java char can hold most Unicode characters, as the others have already mentioned, but characters outside of the Basic Multilingual Plane (BMP) are split into two chars (a surrogate pair), and handling those chars independently might break the string.

To be safe you can split the string into an array of strings, one per code point:

String[] c = s.codePoints()
    .mapToObj(cp -> new String(Character.toChars(cp)))
    .toArray(size -> new String[size]);

... or use the static isSurrogate, isLowSurrogate and isHighSurrogate methods of the Character class to avoid altering a single char within a pair:

Character.isSurrogate('a');
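As one possible use of those checks, the sketch below cuts a string at a char index without splitting a surrogate pair. The class SafeTruncate and its truncate method are hypothetical names, not part of any library, and the method assumes maxChars is non-negative.

```java
class SafeTruncate {
    // Returns at most maxChars chars of s. If the cut would land between
    // a high and a low surrogate, it backs off by one char so the pair
    // is dropped as a whole. Assumes maxChars >= 0.
    static String truncate(String s, int maxChars) {
        if (maxChars >= s.length()) {
            return s;
        }
        int end = maxChars;
        if (end > 0
                && Character.isHighSurrogate(s.charAt(end - 1))
                && Character.isLowSurrogate(s.charAt(end))) {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE00B";                   // 'A', U+1F600, 'B'
        System.out.println(truncate(s, 2).length());   // 1, the pair is not split
        System.out.println(truncate(s, 3).length());   // 3, the pair is kept whole
    }
}
```

A plain s.substring(0, 2) on the same input would end with a lone high surrogate, which is the kind of breakage the answer warns about.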
