HTML Mixed Encodings?

2023-04-08 20:32 Asked by:

First I would like to say thank you for the help in advance.

I am currently writing a web crawler that parses HTML content, strips HTML tags, and then spell checks the text which is retrieved from the parsing.

Stripping HTML tags and spell checking have not caused any problems; I am using JSoup and the Google Spell Check API for those.

I am able to pull down content from a URL, pass it into a byte[], and ultimately convert it to a String so that it can be stripped and spell checked. However, I am running into a problem with character encoding.

For example when parsing http://www.testwareinc.com/...

Original Text: We’ve expanded our Mobile Web and Mobile App testing services.

... the page is using ISO-8859-1 according to meta tag...

ISO-8859-1 Parse: Weve expanded our Mobile Web and Mobile App testing services.

... then trying using UTF-8...

UTF-8 Parse: We�ve expanded our Mobile Web and Mobile App testing services.

Question: Is it possible for the HTML of a webpage to include a mix of encodings? And how can that be detected?

It looks like the apostrophe is encoded as a 0x92 byte, which according to Wikipedia is not a printable character in ISO-8859-1: it falls in the C1 control range (0x80 to 0x9F), where it is named Private Use Two.

From there on, it looks like the browser falls back by assuming it's a non-encoded 1-byte Unicode code point: U+0092 (Private Use Two), which appears to be represented as an apostrophe. No wait, if it's one byte, it's more probably CP1252, where 0x92 is the right single quotation mark (U+2019). Browsers must have a fallback strategy according to the advertised code page, such as ISO-8859-1 -> CP1252.

So there is no mix of encodings here but, as others said, a broken document, combined with a fallback heuristic that will sometimes help and sometimes not.
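To make the 0x92 discussion concrete, here is a minimal sketch that decodes the same five bytes under the three charsets mentioned in this thread and prints the code point that the 0x92 byte turns into. The class name ByteDecodeDemo and the helper decode are made up for illustration; only standard java.nio.charset calls are used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ByteDecodeDemo {
    // Hypothetical helper: decodes bytes under the given charset.
    // new String(...) replaces malformed input with U+FFFD instead of throwing.
    static String decode(byte[] bytes, Charset cs) {
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        // "We", then the 0x92 byte seen on the page, then "ve".
        byte[] raw = {'W', 'e', (byte) 0x92, 'v', 'e'};

        String latin1 = decode(raw, StandardCharsets.ISO_8859_1);
        String cp1252 = decode(raw, Charset.forName("windows-1252"));
        String utf8 = decode(raw, StandardCharsets.UTF_8);

        // Print the code point at index 2 under each interpretation.
        System.out.printf("ISO-8859-1:   U+%04X%n", (int) latin1.charAt(2));
        System.out.printf("windows-1252: U+%04X%n", (int) cp1252.charAt(2));
        System.out.printf("UTF-8:        U+%04X%n", (int) utf8.charAt(2));
    }
}
```

Strict ISO-8859-1 gives the invisible control character U+0092, windows-1252 gives the curly apostrophe U+2019, and UTF-8 gives the replacement character U+FFFD because a lone 0x92 is not a valid UTF-8 sequence.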

If you're curious enough, you may want to dive into the Firefox or Chrome source code to see exactly what they do in such a case.

Having more than one encoding in a document doesn't make it a mixed document; it makes it a broken document.

Unfortunately, there are a lot of web pages that use an encoding that doesn't match the one they declare, or that contain some data that is valid in the declared encoding and some that is invalid.

There is no good way to handle this. It is possible to try and guess the encoding of a document, but it is difficult and not 100% reliable. In cases like yours, the simplest solution is just to ignore parts of the document that can't be decoded.
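As a rough sketch of both options mentioned above (guessing and ignoring), the following uses java.nio's CharsetDecoder. The class and method names are hypothetical, and the "strict UTF-8 first, then windows-1252" policy is only one assumed heuristic, not a reliable detector.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LenientDecoder {
    // Decodes with the given error policy applied to malformed and unmappable input.
    static String decode(byte[] bytes, Charset cs, CodingErrorAction onError)
            throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(onError)
                .onUnmappableCharacter(onError)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // Drops anything that cannot be decoded in the given charset.
    static String decodeIgnoringErrors(byte[] bytes, Charset cs) {
        try {
            return decode(bytes, cs, CodingErrorAction.IGNORE);
        } catch (CharacterCodingException e) {
            throw new IllegalStateException(e); // not expected with IGNORE
        }
    }

    // Assumed heuristic: accept the bytes as UTF-8 only if they are fully valid,
    // otherwise fall back to windows-1252.
    static String decodeWithFallback(byte[] bytes) {
        try {
            return decode(bytes, StandardCharsets.UTF_8, CodingErrorAction.REPORT);
        } catch (CharacterCodingException e) {
            return new String(bytes, Charset.forName("windows-1252"));
        }
    }

    public static void main(String[] args) {
        byte[] raw = {'W', 'e', (byte) 0x92, 'v', 'e'};
        System.out.println(decodeIgnoringErrors(raw, StandardCharsets.UTF_8));
        System.out.println(decodeWithFallback(raw));
    }
}
```

Ignoring errors loses the apostrophe entirely, while the fallback recovers it as U+2019. The fallback can still guess wrong for pages that are really in some other single-byte encoding.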

Apache Tika has an encoding detector. There are also commercial alternatives if you need, say, something in C++ and are in a position to spend money.

I can pretty much guarantee that each web page is in one encoding, but it's easy to be mistaken about which one.

Seems like an issue with special characters. Check whether StringEscapeUtils.escapeHtml, or any other method in that class, helps.

Edited: added this logic as he was not able to get the code working.

import org.apache.commons.lang.StringEscapeUtils; // Apache Commons Lang 2.x

public class EscapeDemo {
    public static void main(String[] args) {
        String asd = "’"; // U+2019, right single quotation mark
        System.out.println(StringEscapeUtils.escapeXml(asd));  // output - &#8217;
        System.out.println(StringEscapeUtils.escapeHtml(asd)); // output - &rsquo;
    }
}
