PHP DOMDocument: remove special characters

I'm using DOMDocument's getElementsByTagName to retrieve a website's title.

Here is my code:

$doc = new DOMDocument();
@$doc->loadHTML($strData);
$doc->encoding = 'utf-8';
$doc->saveHTML();
$titleNode = $doc->getElementsByTagName("title");

It works fine, but when there is a special character in the title, the retrieved data is not accurate. I'm getting "Some More Google Plus Invite Workarounds #wrapper { background:url(/) no-repeat 50% 0; } body { background:#CFD8E2; }" instead.

I did the following to replace the special characters, but it didn't work:

// Replace all special characters with spaces
$specialChars = array('~','`','!','@','#','$','%','^','&','*','(',')','-','_','=','+','|','\\',']','[','}','{','"','\'',':',';','/','?','.',',','>','<');
foreach ($specialChars as $a) {
    $titleNode = str_replace($a, ' ', $titleNode);
}

I'm getting an empty title instead. The <title> value is something like this:

<title>Some More Google Plus Invite Workarounds  < Communication, Social Networking < PC World India News < PC World.in</title>

So what should I be doing?


It looks like your HTML is not well formed. If you have a stray < in the title, I'm surprised that you're not getting Warning: DOMDocument::loadHTML(): error parsing attribute name in Entity, line: 1 in <path> on line <line> (although the @ in front of loadHTML would suppress it).

As for replacing: if you replace all of the < and > characters in an HTML document, you will not be able to retrieve elements from it, because there will not be any elements left:

<head><title>Foo</title></head>

Becomes

headtitleFoo/title/head

Unfortunately, not much can be done to fix this: bad HTML is bad HTML. If you know to expect that type of problem ahead of time, you might be able to do something with preg_replace (maybe preg_replace('#\s<\s#', ' &lt; ', $input);), with preg_match (maybe preg_match('#<title[^>]*>(.*?)</title>#is', $input, $matches);), or with substr, but you might just be up a creek. Note that PHP's PCRE functions have no g modifier; preg_replace already replaces every match.
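
The preg_match idea could look roughly like the sketch below. It is only an illustration: extractTitle is a hypothetical helper name and the sample $input is made up to resemble the title in the question. It assumes the page has a single <title> element and that the title text never contains a literal </title>.

```php
<?php
// Hypothetical helper: pull the <title> text out of raw, possibly malformed HTML
// without handing the markup to DOMDocument at all.
function extractTitle(string $html): ?string
{
    // 'i' = case-insensitive, 's' = let '.' match newlines.
    // The lazy '.*?' stops at the first closing </title> tag.
    if (!preg_match('#<title\b[^>]*>(.*?)</title\s*>#is', $html, $matches)) {
        return null; // no <title> element found
    }
    // Collapse runs of whitespace, then decode entities such as &amp;.
    $title = preg_replace('/\s+/', ' ', $matches[1]);
    return trim(html_entity_decode($title, ENT_QUOTES | ENT_HTML5, 'UTF-8'));
}

// Made-up sample input with stray "<" characters inside the title.
$input = '<html><head><title>Some More Google Plus Invite Workarounds  < Communication, '
       . 'Social Networking < PC World.in</title>'
       . '<style>#wrapper { background:url(/) no-repeat 50% 0; }</style></head>'
       . '<body></body></html>';

echo extractTitle($input), "\n";
```

Because the regex only looks for the opening and closing title tags, the stray < characters in between are returned as plain text.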


I had a look at the site, and the problem is that they don't use proper HTML entities in the title:

<title>Some More Google Plus Invite Workarounds  < Communication, Social Networking < PC World India News < PC World.in</title>

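I assume that DOMDocument has an issue with that and thinks this is where the tag ends. As a workaround, you could add '< ' (a < followed by a space) to $specialChars to dodge this problem, as long as the replacement is applied to the raw HTML string before loadHTML() is called rather than to $titleNode.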
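
A variation on that workaround is to escape the stray < characters in the raw HTML instead of blanking them out, so the title text survives intact. The sketch below is an assumption-laden illustration, not a general HTML repair tool: escapeStrayLt is a hypothetical helper, the sample $strData is made up, and the regex would also rewrite a < inside inline <script> code (for example in a comparison), which may or may not matter for your pages.

```php
<?php
// Hypothetical helper: escape "<" characters that cannot start a tag.
function escapeStrayLt(string $html): string
{
    // A "<" that is not followed by a letter, "/", "!" or "?" cannot open
    // a tag, closing tag, comment/doctype, or processing instruction.
    return preg_replace('#<(?![a-zA-Z/!?])#', '&lt;', $html);
}

// Made-up sample input resembling the page in the question.
$strData = '<html><head><title>Workarounds < Communication < PC World.in</title>'
         . '<style>#wrapper { background:url(/) no-repeat 50% 0; }</style></head>'
         . '<body><p>Hello</p></body></html>';

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // collect parser warnings instead of using @
$doc->loadHTML(escapeStrayLt($strData));
libxml_clear_errors();

// getElementsByTagName returns a DOMNodeList; take the first node's text.
$titleNode = $doc->getElementsByTagName('title')->item(0);
echo $titleNode !== null ? trim($titleNode->textContent) : '(no title)', "\n";
```

Note that the text has to be read from the node (for example via textContent); str_replace cannot operate on the DOMNodeList itself.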


$fp = fsockopen("www.domain.com", 80, $errno, $errstr, 30);
if (!$fp) {
    echo "$errstr ($errno)<br />\n";
} else {
    $out = "GET / HTTP/1.1\r\n";    
    $out .= "Host: www.domain.com\r\n";
    $out .= "Connection: Close\r\n\r\n";
    fwrite($fp, $out);
    $buffer = '';
    while (!feof($fp)) {
        $buffer .= fgets($fp, 128);
    }
    fclose($fp);
    // Match the <title> element directly; 'i' ignores case, 's' lets '.' span newlines.
    preg_match('#<title[^>]*>(.*?)</title>#is', $buffer, $matches);
    var_dump($matches);
}