Detect remote charset in php

I would like to determine a remote page's encoding by detecting the Content-Type meta tag

<meta http-equiv="Content-Type" content="text/html; charset=XXXXX" />

if present.

I retrieve the remote page and try to use a regex to find the required setting, if present. I am still learning, hence the problem below. Here is what I have:

    $EncStart = 'charset=';
    $EncEnd = '" \/\>';
    preg_match( "/$EncStart(.*)$EncEnd/s", $RemoteContent, $RemoteEncoding );
    echo $RemoteEncoding[ 1 ];

The above does echo the name of the encoding, but it does not know where to stop, so it prints the rest of the line and then most of the rest of the remote page. For example, when testing a remote Russian page it printed:

windows-1251" />

rest of page ....

Which means that $EncStart was okay, but the $EncEnd part of the regex failed to stop the matching. This meta tag usually ends in one of three ways after the name of the encoding:

"> | "/> | " />

I do not know whether this can be used to mark the end of the match and, if so, how to escape it. I tried different ways of doing it, but none worked.

Thank you in advance for lending a hand.


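Add a question mark to your pattern to make it non-greedy (there is also no need for the 's' modifier). A greedy (.*) extends to the last possible match of the end pattern, not the first, which is why you get most of the page: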

    preg_match( "/charset=\"(.+?)\"/", $RemoteContent, $RemoteEncoding );
    echo $RemoteEncoding[ 1 ];

Note that this won't handle charset = "..." or charset='...' and many other combinations.
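For the variants mentioned above, a more tolerant pattern is possible. The following is only a sketch: the function name detectMetaCharset is hypothetical, and the set of characters accepted in a charset name is an assumption that covers common names such as windows-1251, UTF-8 and ISO-8859-1. It allows whitespace around the "=" and either quote style (or none), and it stops at the first character that cannot belong to a charset name, so the ending of the tag no longer matters.

```php
<?php
// Hypothetical helper (not part of the original answer): find the charset
// declared in a <meta> tag, tolerating spaces around "=" and either quote style.
function detectMetaCharset(string $html): ?string
{
    // "<meta", any attributes, then: charset, optional spaces, "=", optional
    // spaces, an optional quote, and finally the charset name itself.
    // The character class for the name is an assumption covering common names.
    $pattern = '/<meta\b[^>]*?\bcharset\s*=\s*["\']?\s*([A-Za-z0-9][A-Za-z0-9._:+-]*)/i';

    if (preg_match($pattern, $html, $m) === 1) {
        // Charset names are case-insensitive, so normalise to lower case.
        return strtolower($m[1]);
    }

    return null; // no charset declaration found
}
```

Because the capture group only accepts name characters, it ends at the closing quote whether the tag finishes with ">, "/> or " />. It is still a regex over HTML, so unusual markup (for example a charset mentioned inside a comment) can fool it.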


Take a look at Simple HTML Dom Parser. With it, you can easily find the charset from the head without resorting to cumbersome regexes. But as David already commented, you should also examine the headers for the same information and prioritize it if found.

Tested example:

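    require_once 'simple_html_dom.php';

    // file_get_contents() over HTTP also populates $http_response_header.
    $source = file_get_contents('http://www.google.com');
    $dom = str_get_html($source);

    // Charset declared in the document's <meta http-equiv="Content-Type"> tag.
    $src_charset = null;
    $meta = $dom ? $dom->find('meta[http-equiv=content-type]', 0) : null;
    if ($meta && ($pos = stripos($meta->content, 'charset=')) !== false) {
        $src_charset = trim(substr($meta->content, $pos + 8));
    }

    // Charset declared in the HTTP Content-Type response header.
    $hdr_charset = null;
    foreach ($http_response_header as $header) {
        $parts = explode(':', $header, 2);
        if (count($parts) === 2 && strtolower(trim($parts[0])) === 'content-type'
            && ($pos = stripos($parts[1], 'charset=')) !== false) {
            $hdr_charset = trim(substr($parts[1], $pos + 8));
            break;
        }
    }

    var_dump(
        $hdr_charset,
        $src_charset
    );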
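
To make the "prioritize the header" advice concrete, here is a minimal sketch that works on plain strings, so it can be tried without fetching anything. All three function names are hypothetical. It assumes the header lines are in the same "Name: value" format as $http_response_header, and it takes the last Content-Type header on the assumption that, after redirects, the final response's headers come last.

```php
<?php
// Hypothetical helpers: names and structure are illustrative assumptions.

// Pull the charset parameter out of a Content-Type value such as
// "text/html; charset=UTF-8". Returns null when no charset is given.
function charsetFromContentType(string $value): ?string
{
    if (preg_match('/charset\s*=\s*["\']?([^"\'\s;]+)/i', $value, $m) === 1) {
        return strtolower($m[1]);
    }
    return null;
}

// Scan raw response header lines (the format of $http_response_header)
// and return the charset of the last Content-Type header that has one.
function charsetFromHeaders(array $headerLines): ?string
{
    $charset = null;
    foreach ($headerLines as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && strtolower(trim($parts[0])) === 'content-type') {
            // Keep the previous value if this header carries no charset.
            $charset = charsetFromContentType($parts[1]) ?? $charset;
        }
    }
    return $charset;
}

// Prefer the HTTP header; fall back to the value found in the <meta> tag.
function pickCharset(array $headerLines, ?string $metaCharset): ?string
{
    return charsetFromHeaders($headerLines) ?? $metaCharset;
}
```

With the variables from the example above, the call would look like pickCharset($http_response_header, $src_charset). A null result means neither source declared a charset, in which case something like mb_detect_encoding() could be tried as a last resort.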