Stop Jsoup from encoding
I'm trying to parse a URL with Jsoup whose page contains the following text: Ætterni.
After parsing the document, the same string looks like this: &AElig;tterni.
How do I prevent this from happening? I want the document 1:1, exactly as it was.
Code:
doc = Jsoup.connect(url).get();
String docEncoding=doc.outputSettings().charset().name();
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(localLink),docEncoding);
writer.write(doc.html());
writer.close();
Use
doc.outputSettings().escapeMode(EscapeMode.xhtml);
to avoid the entity conversion. Set it before calling doc.html(); EscapeMode here is org.jsoup.nodes.Entities.EscapeMode.
You don't seem to be using any of Jsoup's features here. I'd just stream the HTML as-is using java.net.URL. This way you get a 1:1 copy of the response.
InputStream input = new URL(url).openStream();
OutputStream output = new FileOutputStream(localLink);
// Now copy input to output the usual Java IO way.
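The copy step that the comment above leaves out could look roughly like the sketch below. The class name RawCopy and the method copy are made up for illustration, and the in-memory streams in main are stand-ins for new URL(url).openStream() and new FileOutputStream(localLink), so that the example runs without network access.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of Jsoup or the JDK.
class RawCopy {

    // Copies every byte from input to output without decoding anything.
    // Returns the number of bytes copied.
    public static long copy(InputStream input, OutputStream output) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = input.read(buffer)) != -1) {
            output.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the page is ISO-8859-1 encoded; \u00C6 is the letter Æ.
        byte[] page = "<p>\u00C6tterni</p>".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream saved = new ByteArrayOutputStream();

        // try-with-resources closes the stream even if copying fails.
        try (InputStream input = new ByteArrayInputStream(page)) {
            copy(input, saved);
        }

        // The saved bytes are identical to the bytes received.
        System.out.println(Arrays.equals(page, saved.toByteArray()));
    }
}
```

With real streams, wrapping both the InputStream and the FileOutputStream in the same try-with-resources statement takes care of closing them. On Java 9 and later, input.transferTo(output) does the same loop for you.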
You should not use a Reader/Writer for this, because it may corrupt the characters of sources in an unknown encoding: the platform default encoding would be used instead.
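To illustrate the risk, here is a small sketch of what a Reader/Writer pair effectively does: decode the bytes with some charset, then encode them again. The class CharsetRoundTrip and its roundTrip method are hypothetical names, and the charsets are passed explicitly (rather than relying on the platform default) only so that the result is the same on every machine.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Jsoup or the JDK.
class CharsetRoundTrip {

    // Decodes raw bytes with the assumed charset and encodes them again,
    // which is what reading through a Reader and writing through a Writer amounts to.
    public static byte[] roundTrip(byte[] raw, Charset assumed) {
        return new String(raw, assumed).getBytes(assumed);
    }

    public static void main(String[] args) {
        // Assumption: the server sent ISO-8859-1 bytes; \u00C6 is the letter Æ.
        byte[] original = "\u00C6tterni".getBytes(StandardCharsets.ISO_8859_1);

        // Wrong guess: the single byte for Æ is not valid UTF-8,
        // so it is replaced and the output no longer matches the input.
        byte[] wrongGuess = roundTrip(original, StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(original, wrongGuess)); // prints false

        // Right guess: the bytes survive the round trip unchanged.
        byte[] rightGuess = roundTrip(original, StandardCharsets.ISO_8859_1);
        System.out.println(Arrays.equals(original, rightGuess)); // prints true
    }
}
```

Copying raw bytes avoids the guess entirely, which is why byte streams are the safer choice when the goal is an exact copy.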