Java: Interpreting UTF-8 in a Java program

My program receives an integer array from a browser application; the values are the UTF-8 bytes of a string (example in the code). I can echo my resulting string ("theString" in the code below) back to the browser and everything's fine. But it's not fine in the Java program: the input string is "Hällo", but the Java program prints it as "Hõllo".

import java.io.*;
import java.nio.charset.*;

public class TestCode {
   public static void main (String[] args) throws IOException {

      // H : 72
      // ä : 195 164
      // l : 108
      // o : 111
      // the following is the input sent from browser representing String = "Hällo"
      int[] utf8Array = {72, 195, 164, 108, 108, 111};

      String notYet = new String(utf8Array, 0, utf8Array.length);
      String theString = new String(notYet.getBytes(), Charset.forName("UTF-8"));

      System.out.println(theString);
   }
}


This will do the trick:

int[] utf8Array = {72, 195, 164, 108, 108, 111};
byte[] bytes = new byte[utf8Array.length];
for (int i = 0; i < utf8Array.length; ++i) {
    bytes[i] = (byte) utf8Array[i];
}
String theString = new String(bytes, Charset.forName("UTF-8"));

The problem with passing int[] directly is that the String(int[], int, int) constructor interprets every int as a separate Unicode code point, so 195 and 164 become two characters ("Ã" and "¤"). After converting to byte[], String treats the input as raw bytes in the given charset and understands that 195, 164 is actually a single character encoded as two bytes rather than two characters.
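To see the difference concretely, here is a minimal sketch that decodes the same array both ways. The class and method names (IntVsByteDemo, fromCodePoints, fromUtf8Bytes) are made up for illustration, and StandardCharsets assumes Java 7 or later; Charset.forName("UTF-8") works equally well on older versions.

```java
import java.nio.charset.StandardCharsets;

public class IntVsByteDemo {

    // Treats each int as one Unicode code point (what new String(int[], int, int) does).
    static String fromCodePoints(int[] values) {
        return new String(values, 0, values.length);
    }

    // Narrows each int to a byte, then decodes the byte sequence as UTF-8.
    static String fromUtf8Bytes(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] utf8Array = {72, 195, 164, 108, 108, 111};

        String asCodePoints = fromCodePoints(utf8Array);
        String asUtf8 = fromUtf8Bytes(utf8Array);

        System.out.println(asCodePoints.length());        // 6: 195 and 164 are two chars
        System.out.println((int) asCodePoints.charAt(1)); // 195 (U+00C3)
        System.out.println(asUtf8.length());              // 5: 195, 164 decoded as one char
        System.out.println(asUtf8.equals("H\u00e4llo"));  // true
    }
}
```

The comparison uses the escape \u00e4 instead of a literal "ä" so the result does not depend on the source file or console encoding.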

UPDATE: Answering your comment, unfortunately, Java is that verbose. Compare it to Scala:

val ints = Array(72, 195, 164, 108, 108, 111)
println(new String(ints map (_.toByte), "UTF-8"))

Once again, the difference between int and byte is not just the compiler being picky: they really mean different things when it comes to UTF-8 encoding.


You need to feed it with bytes instead of ints so that you can use the String constructor taking the charset as argument:

byte[] utf8Array = {72, (byte) 195, (byte) 164, 108, 108, 111};
String theString = new String(utf8Array, 0, utf8Array.length, "UTF-8");
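The casts on 195 and 164 are needed because Java's byte is signed (-128 to 127), so those values only fit after narrowing. A rough sketch of what happens, using a hypothetical helper toBytes (not part of any library) that also rejects values outside 0-255; StandardCharsets again assumes Java 7 or later:

```java
import java.nio.charset.StandardCharsets;

public class SignedByteDemo {

    // Hypothetical helper: converts unsigned octet values (0-255) to a byte[].
    static byte[] toBytes(int[] octets) {
        byte[] out = new byte[octets.length];
        for (int i = 0; i < octets.length; i++) {
            if (octets[i] < 0 || octets[i] > 255) {
                throw new IllegalArgumentException("not a byte value: " + octets[i]);
            }
            out[i] = (byte) octets[i]; // narrowing keeps the low 8 bits
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new int[] {72, 195, 164, 108, 108, 111});

        System.out.println(bytes[1]);        // -61: the signed view of 195
        System.out.println(bytes[1] & 0xFF); // 195: the unsigned value is preserved
        System.out.println(
            new String(bytes, StandardCharsets.UTF_8).equals("H\u00e4llo")); // true
    }
}
```

The bit pattern is unchanged by the cast, which is why the decoder still sees the two-byte UTF-8 sequence for "ä".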