HTML characters from a downloaded page don't appear correctly
Some pages have HTML special characters in their content, but they appear as a square (an unknown character).
What can I do?
Can I convert the String containing the characters to another format (UTF-8)? The problem happens during the conversion from InputStream to String. I really don't know what causes it.
public HttpURLConnection openConnection(String url) {
    try {
        URL urlDownload = new URL(url);
        HttpURLConnection con = (HttpURLConnection) urlDownload.openConnection();
        con.setInstanceFollowRedirects(true);
        con.connect();
        return con;
    } catch (Exception e) {
        return null;
    }
}
private String getContent(HttpURLConnection con) {
    try {
        return IOUtils.toString(con.getInputStream());
    } catch (Exception e) {
        System.out.println("Error downloading page: " + e);
        return null;
    }
}
page.setContent(getContent(openConnection(con)));
You need to read the InputStream using InputStreamReader with the charset as specified in the Content-Type header of the downloaded HTML page. Otherwise the platform default charset will be used, which is apparently not the same as the HTML's one in your case.
Reader reader = new InputStreamReader(input, "UTF-8");
// ...
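To make the effect of the charset argument concrete, here is a minimal sketch that reads a whole stream through an InputStreamReader. The class name StreamDecoding and the method readAll are hypothetical names chosen for illustration, and the method wraps IOException in UncheckedIOException only to keep the example short.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the question's code.
final class StreamDecoding {
    // Decodes every byte of the stream using the given charset and closes the stream.
    static String readAll(InputStream input, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(input, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }
}
```

If the bytes are the UTF-8 encoding of "página" and you pass StandardCharsets.UTF_8, you get the original text back. Passing StandardCharsets.ISO_8859_1 for the same bytes yields "pÃ¡gina", which is the kind of garbling a mismatched charset produces.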
You can of course also use an HTML parser like Jsoup, which takes this into account automatically.
String html = Jsoup.connect("http://stackoverflow.com").get().html();
Update: as per your updated question, you seem to be using URLConnection to request the HTML page and IOUtils to convert InputStream to String. You need to use it as follows:
String contentType = connection.getHeaderField("Content-Type");
String charset = "UTF-8"; // Default to UTF-8
if (contentType != null) { // The header may be absent
    for (String param : contentType.replace(" ", "").split(";")) {
        if (param.toLowerCase().startsWith("charset=")) {
            charset = param.split("=", 2)[1];
            break;
        }
    }
}
String html = IOUtils.toString(input, charset);
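The header parsing above can also be packaged as a small standalone helper. The following is only a sketch under a few assumptions: the names ContentTypeCharset and fromContentType are hypothetical, the fallback to UTF-8 mirrors the default chosen above, and quoted values such as charset="utf-8" are assumed to be worth tolerating.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the question's code.
final class ContentTypeCharset {
    private static final String KEY = "charset=";

    // Returns the charset named in a Content-Type header value,
    // or UTF-8 when the header is missing, has no charset, or names an unknown one.
    static Charset fromContentType(String contentType) {
        if (contentType == null) {
            return StandardCharsets.UTF_8;
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            // Parameter names are matched case-insensitively.
            if (p.regionMatches(true, 0, KEY, 0, KEY.length())) {
                String name = p.substring(KEY.length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return StandardCharsets.UTF_8;
                }
            }
        }
        return StandardCharsets.UTF_8;
    }
}
```

Returning a Charset rather than a String means an unknown charset name is caught here instead of later, when the stream is decoded.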
If you're still having problems with getting the characters right, then it can only mean that the console/viewer where you're printing those characters to doesn't support the charset. E.g., when you run the following in Eclipse
System.out.println(html);
Then you need to ensure that the Eclipse console uses UTF-8. You can set it by Window > Preferences > General > Workspace > Text File Encoding.
Or if you're writing it to some file by FileWriter, then you should rather be using InputStream/OutputStream from the beginning on without converting it to String first. If converting to String is really an important step, then you need to write it to new OutputStreamWriter(output, "UTF-8").
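Both options for writing the page out can be sketched as follows. The class PageSaver and its methods copy and writeUtf8 are hypothetical names used for illustration, and IOException is wrapped in UncheckedIOException only for brevity.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the question's code.
final class PageSaver {
    // Option 1: copy the raw bytes. Nothing is decoded, so no charset is involved.
    static void copy(InputStream in, OutputStream out) {
        try {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Option 2: the content is already a String, so encode it explicitly as UTF-8.
    // The caller remains responsible for closing the output stream.
    static void writeUtf8(String text, OutputStream out) {
        try {
            Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
            writer.write(text);
            writer.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With a file as the target, you would pass a FileOutputStream as the out argument in either case.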