How can I iterate through the unicode codepoints of a Java String?

2022-12-08 01:55 Q&A author:

So I know about String#codePointAt(int), but it's indexed by the char offset, not by the codepoint offset.

I'm thinking about trying something like:

using String#charAt(int) to get the char at an index
testing whether the char is in the high-surrogates range
- if so, use String#codePointAt(int) to get the codepoint, and increment the index by 2
- if not, use the given char value as the codepoint, and increment the index by 1
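
The plan above can be sketched roughly as follows. This is only an illustration of the approach as described, not a recommendation, and the names ManualCodePointWalk and collect are made up for the example. One detail the plan glosses over: the index should advance by 2 only when the high surrogate is actually followed by a low surrogate, since an unpaired surrogate is treated as a code point of its own.

```java
import java.util.ArrayList;
import java.util.List;

final class ManualCodePointWalk {
    // Hypothetical helper: walks the string char by char as described above.
    static List<Integer> collect(String s) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)
                    && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                // Valid surrogate pair: both chars form one code point.
                out.add(s.codePointAt(i));
                i += 2;
            } else {
                // BMP char (or an unpaired surrogate): the char value is the code point.
                out.add((int) c);
                i += 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // "a", then U+1F600 written as a surrogate pair, then "b"
        for (int cp : collect("a\uD83D\uDE00b")) {
            System.out.println(Integer.toHexString(cp));
        }
    }
}
```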

But my concerns are:

I'm not sure whether codepoints which are naturally in the high-surrogates range will be stored as two char values or one
this seems like an awfully expensive way to iterate through characters
someone must have come up with something better.

Yes, Java uses a UTF-16-esque encoding for internal representations of Strings, and, yes, it encodes characters outside the Basic Multilingual Plane (BMP) using the surrogacy scheme.
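
To make the storage question concrete: a code point outside the BMP occupies two char values, while a char in the surrogate range that is not part of a valid pair is stored as one char and read back as a code point with that same value. A small sketch using only standard java.lang methods (the class name SurrogateDemo and the helper charsNeeded are made up for illustration):

```java
final class SurrogateDemo {
    // Hypothetical helper: number of char values needed to store one code point.
    static int charsNeeded(int codePoint) {
        return Character.toChars(codePoint).length;
    }

    public static void main(String[] args) {
        String smiley = new String(Character.toChars(0x1F600));
        System.out.println(smiley.length());                            // number of chars
        System.out.println(smiley.codePointCount(0, smiley.length()));  // number of code points

        // A lone high surrogate is stored as a single char and read back unchanged.
        String lone = "\uD83D";
        System.out.println(Integer.toHexString(lone.codePointAt(0)));
    }
}
```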

If you know you'll be dealing with characters outside the BMP, then here is the canonical way to iterate over the characters of a Java String:

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}
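
As a small runnable illustration of this loop, the sketch below counts the letters in a string one code point at a time. The class name CodePointLoop and the choice of Character.isLetter(int) as the per-codepoint work are assumptions made for the example.

```java
final class CodePointLoop {
    // Counts letters, treating each supplementary character as a single unit.
    static int countLetters(String s) {
        int letters = 0;
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            if (Character.isLetter(codepoint)) {
                letters++;
            }
            // Advance by 1 char for BMP code points, 2 for supplementary ones.
            offset += Character.charCount(codepoint);
        }
        return letters;
    }

    public static void main(String[] args) {
        // U+10400 (DESERET CAPITAL LETTER LONG I) is a letter outside the BMP.
        String s = "a1" + new String(Character.toChars(0x10400));
        System.out.println(countLetters(s));
    }
}
```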

Java 8 added CharSequence#codePoints, which returns an IntStream containing the code points. You can use the stream directly to iterate over them:

string.codePoints().forEach(c -> ...);

or with a for loop by collecting the stream into an array:

for(int c : string.codePoints().toArray()){
    ...
}

These approaches are probably more expensive than Jonathan Feinberg's solution, but they are faster to read and write, and the performance difference will usually be insignificant.
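
For a complete example of the stream-based approach, here is a minimal sketch assuming Java 8 or later. The class name CodePointStreamDemo and the method describe are hypothetical; the method formats each code point in U+XXXX notation.

```java
import java.util.List;
import java.util.stream.Collectors;

final class CodePointStreamDemo {
    // Formats every code point of the string as U+XXXX.
    static List<String> describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // "a" followed by U+1F600 written as a surrogate pair
        System.out.println(describe("a\uD83D\uDE00"));
    }
}
```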

Thought I'd add a workaround method that works with foreach loops (ref); you can also convert it to Java 8's new String#codePoints method easily when you move to Java 8.

You can use it with foreach like this:

 for(int codePoint : codePoints(myString)) {
   ....
 }

Here's the method:

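import java.util.Iterator;
import java.util.NoSuchElementException;

public static Iterable<Integer> codePoints(final String string) {
  return new Iterable<Integer>() {
    public Iterator<Integer> iterator() {
      return new Iterator<Integer>() {
        // Char offset of the next code point to return.
        int nextIndex = 0;
        public boolean hasNext() {
          return nextIndex < string.length();
        }
        public Integer next() {
          if (!hasNext()) {
            throw new NoSuchElementException();
          }
          int result = string.codePointAt(nextIndex);
          // Advance by 1 or 2 chars, depending on the code point.
          nextIndex += Character.charCount(result);
          return result;
        }
        public void remove() {
          throw new UnsupportedOperationException();
        }
      };
    }
  };
}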

Or, alternatively, if you just want to convert a string to a list of code points (in case your code can work with a collection of code points more easily; this might use more RAM than the above approach):

import java.util.ArrayList;
import java.util.List;

public static List<Integer> stringToCodePoints(String in) {
  if (in == null)
    throw new NullPointerException("got null");
  List<Integer> out = new ArrayList<Integer>();
  final int length = in.length();
  for (int offset = 0; offset < length; ) {
    final int codepoint = in.codePointAt(offset);
    out.add(codepoint);
    // Advance by 1 or 2 chars, depending on the code point.
    offset += Character.charCount(codepoint);
  }
  return out;
}

Thankfully, this uses codePointAt, which safely handles the surrogate pairs of UTF-16 (Java's internal string representation).
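
If a primitive int[] is what you actually want, a sketch along the following lines avoids boxing by sizing the array with String#codePointCount first. The class name CodePointArrays and the method toCodePointArray are hypothetical; on Java 8 and later, codePoints().toArray() should give the same result.

```java
final class CodePointArrays {
    // Hypothetical helper: copies the code points of a string into an int[].
    static int[] toCodePointArray(String in) {
        final int length = in.length();
        // codePointCount counts an unpaired surrogate as one code point,
        // which matches what codePointAt returns in the loop below.
        int[] out = new int[in.codePointCount(0, length)];
        int i = 0;
        for (int offset = 0; offset < length; ) {
            final int codepoint = in.codePointAt(offset);
            out[i++] = codepoint;
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(toCodePointArray("a\uD83D\uDE00b")));
    }
}
```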

Iterating over code points is filed as a feature request at Sun; see the bug report.

The bug report also includes an example of how to iterate over the code points of a String.
