What is the best way to determine/convert the encoding of an external HTML file?
I am parsing HTML from ~100 different domains. I could check what encoding each domain uses & do things that way, but that seems dumb.
Usually the encoding is in the header tags, yeah? But not always, I gather. So I may need to run some regex, or use some mb_ functions, or perhaps use cURL? All the examples I've found so far are for XML, and now I've got a headache.
Also, I am using the DOMDocument class to find what I want, and that is all working great, except for the encoding.
According to the W3C internationalization standards, you should follow these priorities in order to get the encoding of an HTML/XML document:
1. The Content-Type header (from the HTTP response)
2. An XML or XHTML declaration, e.g. <?xml version="1.0" encoding="utf-8" ?> (see the regex sketch after this list)
3. A meta tag with http-equiv="Content-Type" (from the HTML head)
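For item 2, a rough sketch of pulling the encoding out of the declaration could look like this (the regex and the $content variable are just illustrative, not part of the original answer):

// Sketch: $content is assumed to hold the raw bytes of the fetched document.
if (preg_match('/<\?xml[^>]*\bencoding=["\']?([\w.-]+)/i', $content, $m)) {
    $encoding = strtoupper($m[1]); // e.g. "UTF-8"
}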
In my experience, when all that fails, you can assume the encoding is most probably ISO-8859-1 or CP1252. You can then convert the content to UTF-8 with the iconv library, e.g. iconv("ISO-8859-1", "UTF-8", $content).
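A minimal sketch of that fallback (the CP1252-first ordering is my assumption, not something stated above):

// Sketch: nothing declared an encoding, so assume Windows-1252 (CP1252),
// which covers ISO-8859-1's printable range, and convert to UTF-8.
$utf8 = iconv('CP1252', 'UTF-8', $content);
if ($utf8 === false) {
    // CP1252 conversion failed on some byte; ISO-8859-1 always succeeds
    // because every byte value maps to a character.
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $content);
}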
If you use the cURL library to fetch the URLs, you can get the Content-Type header with curl_getinfo($ch, CURLINFO_CONTENT_TYPE). The other tags can be extracted with an XML/HTML parser.
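A rough sketch of that combination (the example URL and the charset-extraction regex are assumptions for illustration):

// Sketch: fetch the page and read the charset from the Content-Type
// response header, if the server sent one.
$ch = curl_init('http://example.com/');          // hypothetical URL
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // return the body instead of printing it
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);  // follow redirects
$content = curl_exec($ch);
$contentType = curl_getinfo($ch, CURLINFO_CONTENT_TYPE); // e.g. "text/html; charset=UTF-8"
curl_close($ch);

$encoding = null;
if ($contentType && preg_match('/charset\s*=\s*["\']?([\w.-]+)/i', $contentType, $m)) {
    $encoding = strtoupper($m[1]);
}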
You can also parse a meta tag, which any responsible programmer should have included in the <head> element:
<meta http-equiv="content-type"
content="text/html;charset=utf-8" />
You can also choose to reject any HTML that does not declare a charset in the header or in a meta tag.