Guess charset encoding in PHP
I'm trying to write my own web crawler with cURL in PHP.
[...]
mb_internal_encoding('UTF-8');
mb_language('uni');
$this->_curl = curl_init();
curl_setopt($this->_curl, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($this->_curl, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($this->_curl, CURLOPT_FOLLOWLOCATION, false);
curl_setopt($this->_curl, CURLOPT_MAXREDIRS, 0);
curl_setopt($this->_curl, CURLOPT_TIMEOUT, 10);
curl_setopt($this->_curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; de; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10');
curl_setopt($this->_curl, CURLOPT_HEADER, true);
curl_setopt($this->_curl, CURLOPT_RETURNTRANSFER, true);
$header = array(
"Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3",
"Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
"Keep-Alive: 115",
"Connection: keep-alive",
);
curl_setopt($this->_curl, CURLOPT_HTTPHEADER, $header);
curl_setopt($this->_curl, CURLOPT_URL, $url);
curl_setopt($this->_curl, CURLOPT_POST, false);
curl_setopt($this->_curl, CURLOPT_POSTFIELDS, array());
curl_setopt($this->_curl, CURLOPT_HTTPGET, true);
$page = curl_exec($this->_curl);
[...]
The problem is the charset of the website. As you can see on
http://blog.163.com/drewes_4711/blog/static/179317021201151624826557/
there is a header "Content-Type: ...;charset=GBK"
so I can do mb_convert_encoding($content, "UTF-8", "GBK");
but what should I do with
http://tech.hexun.com/2011-06-21/130756909.html
It seems to use the same charset, but it is not given in the HTTP header. So I have massive problems with German umlauts, Chinese, and other Asian languages. Is there any module or snippet I can use to determine the charset of ANY HTML page downloaded with cURL?
That second link contains:
<meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
Everything before that tag looks like plain ASCII. So if the HTTP headers give no clue, you could try parsing the document as plain ASCII (not as UTF-8, which is likely to break) until you find that meta tag.
This is obviously not guaranteed to work. If the server doesn't send the encoding and the page doesn't contain that meta tag either, you're out of luck. There is no universal way to detect the encoding of an arbitrary piece of data.
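A rough sketch of that two-step lookup, in case it helps. The function name guess_charset, the 4 KB scan window, and both regular expressions are my own assumptions rather than anything standard; it only reads what the server or page declares and returns null when nothing is declared.

```php
<?php
// Hypothetical helper (not part of cURL or mbstring): look for a declared
// charset, first in the HTTP headers, then in a <meta> tag of the body.
function guess_charset(string $rawHeaders, string $body): ?string
{
    // 1. Prefer the charset parameter of the HTTP Content-Type header.
    if (preg_match('/^Content-Type:[^\r\n]*?charset\s*=\s*["\']?([\w.:-]+)/mi', $rawHeaders, $m)) {
        return strtoupper($m[1]);
    }

    // 2. Fall back to a <meta> declaration near the top of the document.
    //    The body is scanned as raw bytes (no /u flag), i.e. it is treated
    //    as ASCII-compatible rather than as UTF-8. The 4096-byte window is
    //    an arbitrary assumption.
    $head = substr($body, 0, 4096);
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([\w.:-]+)/i', $head, $m)) {
        return strtoupper($m[1]);
    }

    // 3. Nothing declared: the caller has to pick a default or give up.
    return null;
}
```

Since the question sets CURLOPT_HEADER to true, $page holds headers and body together; curl_getinfo($this->_curl, CURLINFO_HEADER_SIZE) gives the header length, so substr() can split the two before calling the helper. If a charset comes back, it can be passed as the third argument to mb_convert_encoding(), provided mbstring recognizes that name.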