Stop Jsoup from encoding

I'm trying to parse a URL with Jsoup. The page contains the following text: Ætterni. After parsing, the same string in the document looks like this: &AElig;tterni.

How do I prevent this from happening? I want the document 1:1, exactly as it was.

Code:

doc = Jsoup.connect(url).get();
String docEncoding = doc.outputSettings().charset().name();
OutputStreamWriter writer = new OutputStreamWriter(new FileOutputStream(localLink), docEncoding);
writer.write(doc.html());
writer.close();


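Use doc.outputSettings().escapeMode(EscapeMode.xhtml); to avoid the conversion to named entities. In that mode Jsoup escapes only the characters that XHTML requires, such as <, > and &, so a character like Æ is written as-is as long as the output charset can encode it.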
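A minimal sketch of that setting, assuming Jsoup is on the classpath. The class name KeepCharacters and the method render are made up for illustration, and the UTF-8 output charset is an assumption; the question takes the charset from the fetched document instead. Whether the default escape mode actually produces &AElig; depends on the Jsoup version, so the sketch only shows the xhtml setting.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;

// Hypothetical helper, not part of Jsoup.
final class KeepCharacters {

    // Parses the given HTML and serializes the body with the xhtml escape mode.
    static String render(String html) {
        Document doc = Jsoup.parse(html);
        doc.outputSettings()
           .charset("UTF-8")                          // assumption: UTF-8 can encode the text
           .escapeMode(Entities.EscapeMode.xhtml);    // escape only what XHTML requires
        return doc.body().html();
    }

    public static void main(String[] args) {
        // \u00C6 is the letter Æ, written as an escape to keep this source ASCII-only.
        System.out.println(render("<p>\u00C6tterni</p>"));
    }
}
```

Note that the result is still Jsoup's re-serialization of the parsed tree, so whitespace and markup may differ from the original response even when the characters are left alone.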


You don't seem to be using any of Jsoup's features here. I'd just stream the HTML as-is using java.net.URL. That way you get a 1:1 copy of the response.

InputStream input = new URL(url).openStream();
OutputStream output = new FileOutputStream(localLink);
// Now copy input to output the usual Java IO way.

You should not use a Reader/Writer for this: when the source encoding is unknown, the platform default encoding would be used instead, which may corrupt the characters.
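The copy step that the snippet above leaves as a comment could look roughly like this. It is only a sketch using the standard library: the class name RawCopy, the 8 KB buffer size and the command-line arguments are assumptions, not something from the answer. The point is that the bytes are never decoded into characters, so no charset is involved at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper class for illustration.
final class RawCopy {

    // Copies every byte from in to out without decoding; returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192]; // assumed buffer size
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: RawCopy <url> <file>");
            return;
        }
        // try-with-resources closes both streams even if the copy fails.
        try (InputStream input = new URL(args[0]).openStream();
             OutputStream output = Files.newOutputStream(Paths.get(args[1]))) {
            copy(input, output);
        }
    }
}
```

On Java 9 and later, input.transferTo(output) does the same loop in one call.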
