Java: Interpreting UTF-8 in a Java program
My program is receiving an integer array from a browser application that's interpreted as UTF-8 (example in code). I can echo my resulting string ("theString" shown in the code below) back to the browser and everything's fine. But it's not fine in the Java program. The input string is "Hällo". But it prints out from the Java program as "Hõllo".
import java.io.*;
import java.nio.charset.*;

public class TestCode {
    public static void main (String[] args) throws IOException {
        // H : 72
        // ä : 195 164
        // l : 108
        // o : 111
        // the following is the input sent from browser representing String = "Hällo"
        int[] utf8Array = {72, 195, 164, 108, 108, 111};
        String notYet = new String(utf8Array, 0, utf8Array.length);
        String theString = new String(notYet.getBytes(), Charset.forName("UTF-8"));
        System.out.println(theString);
    }
}
This will do the trick:
int[] utf8Array = {72, 195, 164, 108, 108, 111};
byte[] bytes = new byte[utf8Array.length];
for (int i = 0; i < utf8Array.length; ++i) {
    bytes[i] = (byte) utf8Array[i];
}
String theString = new String(bytes, Charset.forName("UTF-8"));
The problem with passing int[] directly is that the String class interprets every int as a separate character (a Unicode code point), whereas after converting to byte[], String treats the input as raw bytes and understands that 195, 164 is actually a single character consisting of two bytes rather than two characters.
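To make the difference concrete, here is a minimal sketch that runs the same six values through both interpretations. The class name CodePointsVsBytes and the two helper methods are made up for illustration; StandardCharsets.UTF_8 is used as a constant equivalent of Charset.forName("UTF-8").

```java
import java.nio.charset.StandardCharsets;

public class CodePointsVsBytes {
    // Hypothetical helper: treats each int as one Unicode code point,
    // which is what the String(int[], int, int) constructor does.
    static String asCodePoints(int[] values) {
        return new String(values, 0, values.length);
    }

    // Hypothetical helper: narrows each int to a byte, then decodes
    // the resulting byte sequence as UTF-8.
    static String asUtf8(int[] values) {
        byte[] bytes = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            bytes[i] = (byte) values[i];
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int[] utf8Array = {72, 195, 164, 108, 108, 111};
        // Six code points: 195 and 164 become two separate characters.
        System.out.println(asCodePoints(utf8Array).length()); // 6
        // Five characters: 195, 164 decode to the single character U+00E4.
        System.out.println(asUtf8(utf8Array).length());       // 5
    }
}
```

Comparing lengths rather than printing the strings keeps the result independent of the console encoding.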
UPDATE: Answering your comment, unfortunately, Java is that verbose. Compare it to Scala:
val ints = Array(72, 195, 164, 108, 108, 111)
println(new String(ints map (_.toByte), "UTF-8"))
Once again, the difference between int and byte is not just the compiler being picky: they really mean different things when it comes to UTF-8 encoding.
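One practical consequence, sketched below: a Java byte is signed, so the cast (byte) 195 stores -61, although the bit pattern is the one the UTF-8 decoder expects. The cast also silently truncates values outside 0..255, so a range check may be worth having when the ints come from an external source. The class ByteRange and the helper toBytes are hypothetical names, not part of any library.

```java
public class ByteRange {
    // Hypothetical helper: converts ints to bytes, rejecting anything
    // outside 0..255 instead of letting the cast truncate it silently.
    static byte[] toBytes(int[] values) {
        byte[] out = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            if (values[i] < 0 || values[i] > 255) {
                throw new IllegalArgumentException(
                        "Not a byte value at index " + i + ": " + values[i]);
            }
            out[i] = (byte) values[i];
        }
        return out;
    }

    public static void main(String[] args) {
        byte b = (byte) 195;
        System.out.println(b);        // -61: Java bytes are signed
        System.out.println(b & 0xFF); // 195: masking recovers the unsigned value
    }
}
```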
You need to feed it bytes instead of ints so that you can use the String constructor that takes the charset as an argument:
byte[] utf8Array = {72, (byte) 195, (byte) 164, 108, 108, 111};
String theString = new String(utf8Array, 0, utf8Array.length, "UTF-8");
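If the bytes come from an untrusted source such as a browser, it may also help to know when they are not valid UTF-8. As far as I know, the String constructor that takes a Charset substitutes a replacement character for malformed input rather than failing. The sketch below uses a CharsetDecoder configured to report errors instead; the class StrictUtf8 and the method decodeStrict are hypothetical names, and wrapping the failure in IllegalArgumentException is just one possible choice.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {
    // Hypothetical helper: decodes UTF-8 and fails on malformed input
    // instead of substituting replacement characters.
    static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("Input is not valid UTF-8", e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8Array = {72, (byte) 195, (byte) 164, 108, 108, 111};
        System.out.println(decodeStrict(utf8Array).length()); // 5
    }
}
```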