Download a web page without character replacement

2023-01-16 03:22 Q&A author:

I'm trying to download a web page in Java with the following:

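import java.io.*;
import java.net.URL;

// Placeholder address from the question; new URL(...) needs a scheme such as http://
URL url = new URL("http://www.jksfljasdlfas.com");
File to = new File("/home/test/test.html");

// Decode the response as UTF-8 and write it back out as UTF-8
Reader in = new InputStreamReader(url.openStream(), "UTF-8");
Writer out = new OutputStreamWriter(new FileOutputStream(to), "UTF-8");

// Copy one character at a time until end of stream
int c;
while ((c = in.read()) != -1) {
    out.write(c);
}
in.close();
out.close();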

I download the page and some characters are replaced by entities:

this:

<a href="http://www.generation276.org/film/?m=200812&paged=2" >Pagina successiva »</a>

becomes this:

<a href="http://www.generation276.org/film/?m=200812&#038;paged=2" >Pagina successiva &raquo;</a>

When I download the same page with Chrome, the & remains a plain &.

I'm new to charsets and encodings; can anybody explain the problem?

The Java part is working perfectly fine.

Chrome is tricking you there. In Firefox, when I select View -> Page Source, I see this:

<a href="http://www.generation276.org/film/?m=200812&#038;paged=3" >
Pagina successiva &raquo;</a>

while with Firebug / Inspect Element I see this:

<a href="http://www.generation276.org/film/?m=200812&paged=3" style="">
Pagina successiva »</a>

and it copies to the clipboard as this:

<a href="http://www.generation276.org/film/?m=200812&amp;paged=3" style="">
Pagina successiva »</a>

Browsers don't always show you what's really there.
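To check the claim that the Java part is not what introduces the entities, here is a minimal sketch that pushes a sample line through the same kind of UTF-8 Reader/Writer loop and compares the bytes before and after. The class name RoundTripCheck and the in-memory streams are assumptions made for illustration; the question reads from a URL and writes to a file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the original question.
final class RoundTripCheck {

    // Copies the input through a UTF-8 Reader/Writer pair, one char at a time,
    // the same way the loop in the question does.
    static byte[] copy(byte[] input) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(input), StandardCharsets.UTF_8);
             Writer out = new OutputStreamWriter(sink, StandardCharsets.UTF_8)) {
            int c;
            while ((c = in.read()) != -1) {
                out.write(c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The writer is closed (and therefore flushed) by the time we get here.
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String served = "<a href=\"?m=200812&#038;paged=2\" >Pagina successiva &raquo;</a>";
        byte[] before = served.getBytes(StandardCharsets.UTF_8);
        byte[] after = copy(before);
        // Entities are ordinary characters to a Reader; nothing is encoded or decoded.
        System.out.println(Arrays.equals(before, after)); // prints "true"
    }
}
```

If the saved file contains &#038;, that is because the server sent &#038;.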

The second part of your question is identical to this previous Question:

Java: How to decode HTML character entities in Java like HttpUtility.HtmlDecode?

And hence the answer is also the same:

Use StringEscapeUtils.unescapeHtml(String) from the Apache Commons Lang project (the method is named unescapeHtml4 in Commons Lang 3 and in Commons Text).

The actual source of that page does say:

<a href="http://www.generation276.org/film/?m=200812&#038;paged=2" >Pagina successiva &raquo;</a>

and this is perfectly fine. &#038; is a valid character reference for a literal ampersand character in HTML, although the entity reference &amp; is generally more common.

<a href="http://www.generation276.org/film/?m=200812&paged=2" >Pagina successiva &raquo;</a>

This is invalid HTML.

When you save ‘HTML only’, Chrome saves the original HTML source without change. When you save ‘Complete’, it has to re-write the page to change references to other resources.

Unfortunately the serialisation process involved in this appears to have a bug in failing to &amp;-escape the ampersands in the URL. Whilst browsers typically let you get away with this, it will break (mangling your URL) if the word to the right of the ampersand happens to make a valid HTML entity name or character reference.

Other places where Chrome serialises attribute values, such as innerHTML, do not suffer from this rather poor bug.

ETA:

I have to "unescape" the &#038;... how can I do that?

If you try to scrape information out of the source using regex, you'd have to decode it manually using an HTML decoder. There isn't one built into Java, so you would need a third-party tool such as the one from Apache Commons linked by seanizer.
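With Apache Commons Lang 3 on the classpath, the decoding step would be a single call to StringEscapeUtils.unescapeHtml4(text). To show what such a decoder does without adding a dependency, here is a rough sketch that handles numeric character references and a deliberately tiny table of named entities. The class name MiniHtmlUnescape and the choice of entities are assumptions for illustration only; a real decoder knows the full HTML entity list and the parser's error-recovery rules.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, dependency-free illustration; prefer a library in real code.
final class MiniHtmlUnescape {

    // Only a handful of named entities; the HTML spec defines many more.
    private static final Map<String, String> NAMED = Map.of(
            "amp", "&", "lt", "<", "gt", ">", "quot", "\"", "raquo", "\u00BB");

    // Matches &#38; or &#038; (decimal), &#x26; (hex), or &name;
    private static final Pattern REF =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    static String unescape(String s) {
        Matcher m = REF.matcher(s);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#")) {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                int codePoint = Integer.parseInt(body.substring(hex ? 2 : 1), hex ? 16 : 10);
                // Leave out-of-range references untouched instead of throwing.
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : m.group();
            } else {
                // Unknown names are kept exactly as they appeared.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("http://www.generation276.org/film/?m=200812&#038;paged=2"));
        // prints http://www.generation276.org/film/?m=200812&paged=2
    }
}
```

The input is scanned once, so text that was escaped twice (for example &amp;#038;) is decoded by one level only, which is what you want when unescaping an attribute value.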

However, scraping with regex is crude and unreliable. I would strongly suggest using an HTML parser to load the file and pick out the data you want. It will deal with decoding attribute values and text content.
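As a sketch of the parser approach, the example below uses the small HTML parser that ships with the JDK (javax.swing.text.html) to collect href values; the parser hands back attribute values with character references already decoded. The class name LinkExtractor is an assumption for illustration, and the Swing parser targets old HTML 3.2-era markup, so for real-world pages a dedicated third-party HTML parser is usually a better fit.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper built on the JDK's bundled HTML parser.
final class LinkExtractor {

    // Returns the href of every <a> element, as decoded by the parser.
    static List<String> hrefs(String html) {
        List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        found.add(href.toString());
                    }
                }
            }
        };
        try {
            // The third argument tells the parser to ignore any charset declaration in the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.generation276.org/film/?m=200812&#038;paged=2\" >"
                + "Pagina successiva &raquo;</a>";
        // The printed URL should contain a plain & rather than &#038;.
        hrefs(html).forEach(System.out::println);
    }
}
```

No manual unescaping step is needed here, because decoding is part of parsing.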
