Encoding error while reading and writing an HTML file with Java
I'm trying to read some text from an HTML file, modify it in a specific way, and write the result to a new HTML file. The problem is that the text is not in English, and as a result some characters are replaced with black-and-white "?" marks. My HTML file contains <meta http-equiv="Content-Type" content="text/html; charset=utf-8">. What am I doing wrong? Maybe not the right Readers and Writers?
StringBuilder sb = new StringBuilder();
BufferedReader br = new BufferedReader(new FileReader("inputFile.html"));
String line;
while ( (line = br.readLine()) != null) {
    sb.append(line);
}
String result = doSomeChanges(sb);
BufferedWriter out = new BufferedWriter(new FileWriter("outputFile.html"));
out.write(result);
out.close();
Maybe not the right Readers and Writers?
Exactly. FileReader and FileWriter are garbage; forget that they exist. They implicitly use the platform default encoding and, before Java 11, give you no way to override that default.
Instead, use this:
BufferedReader br = new BufferedReader(
new InputStreamReader(new FileInputStream("inputFile.html"), "UTF-8"));
BufferedWriter out = new BufferedWriter(
new OutputStreamWriter(new FileOutputStream("outputFile.html"), "UTF-8"));
FileReader and FileWriter use the platform default encoding, which isn't what you want here. (I've always viewed this as a fatal flaw in these APIs.)
Instead, use FileInputStream and FileOutputStream, wrapped in an InputStreamReader and OutputStreamWriter respectively. This allows you to explicitly set the encoding, which in this case should be UTF-8.
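The wrapping described above could look roughly like the sketch below. The class name Utf8FileCopy, the helper names readUtf8 and writeUtf8, and the temp-file round trip in main are made up for illustration and are not part of the question's code. Unlike the loop in the question, this version re-appends a '\n' after each line, on the assumption that line breaks should survive the copy.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8FileCopy {

    // Hypothetical helper: reads a whole file, decoding its bytes as UTF-8.
    // Line terminators are normalized to '\n'.
    static String readUtf8(String path) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    // Hypothetical helper: writes text to a file, encoding it as UTF-8.
    static void writeUtf8(String path, String text) throws IOException {
        try (BufferedWriter out = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(path), StandardCharsets.UTF_8))) {
            out.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Round trip through a temporary file so the demo does not depend
        // on inputFile.html being present.
        Path tmp = Files.createTempFile("utf8-demo", ".html");
        String sample = "<p>\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac</p>\n";
        writeUtf8(tmp.toString(), sample);
        String back = readUtf8(tmp.toString());
        System.out.println("round trip intact: " + sample.equals(back));
        Files.delete(tmp);
    }
}
```

In the question's code, doSomeChanges would sit between the read and the write. The try-with-resources blocks also close the reader, which the original snippet never did.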
To make life easier, you can also use FileUtils from the Apache Commons IO project, which has read and write methods for Files and Strings that take an encoding.
You use BufferedReader, which ignores the HTML structure of the file. That's why <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> has no effect.
Try this one:
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream("zzz"), "UTF-8"));
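To see why the meta tag makes no difference to a Reader, here is a small sketch (the class name MetaTagIsJustBytes and the helper decode are invented for this example). The same bytes, meta tag included, are decoded twice; only the charset handed to the decoder decides what comes out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MetaTagIsJustBytes {

    // Hypothetical helper: decodes raw bytes with the given charset.
    // The decoder never inspects the content for a <meta> declaration.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String html = "<meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=utf-8\">\u03b1\u03b2\u03b3";
        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);

        String right = decode(bytes, StandardCharsets.UTF_8);
        String wrong = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(html)); // true: charset matches the bytes
        System.out.println(wrong.equals(html)); // false: each Greek letter became two characters
    }
}
```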