Problem using PHP DOM to parse HTML

I use PHP's DOMDocument to parse HTML source fetched via cURL. cURL works fine, but a problem occurs when I parse the result with DOM. See the code.

    <?php
    $url = "http://www.google.com.vn/advanced_search?hl=en";
    $ch = curl_init($url);
    $header = array();
    $header[0]  = "Accept: text/xml,application/xml,application/xhtml+xml,";
    $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
    $header[]   = "Cache-Control: max-age=0";
    $header[]   = "Connection: keep-alive";
    $header[]   = "Keep-Alive: 300";
    $header[]   = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
    $header[]   = "Accept-Language: en-us,en;q=0.5";
    $header[]   = "Pragma: "; // browsers keep this blank.

    curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; vi; rv:1.9.2.3)  Gecko/20100401 Firefox/3.6.3 FirePHP/0.5');
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    //curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    //curl_setopt($ch, CURLOPT_HEADER, 1);
    curl_setopt($ch, CURLINFO_HEADER_OUT, 1);
    $html = curl_exec($ch); 
    /*
     * If I do:
     * echo $html;
     * exit;    // <-- it works fine:
     *  the number of <td> tags equals the number of </td> tags
     */
     $dom = new DOMDocument();
     @$dom->loadHTML($html);
     $html = $dom->saveHTML();
     echo $html; // <-- the output HTML is not well-formed: there are more <td> tags than </td> tags.

    ?>

Is this a programming error or a DOMDocument bug?


When you remove the error suppression (the `@`), you will see that DOMDocument emits a couple of warnings like these:

Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: form and tr
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: div and tr
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: td and tr

In order to parse the markup into a DOM tree, loadHTML will try to fix as much as it can, so that's likely why you think it's a bug. It really isn't. The Google markup is just invalid.
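
As an alternative to silencing the warnings with `@`, the parse errors can be collected and inspected. The following is only a minimal sketch: the `$html` string is a made-up malformed fragment standing in for the fetched page, and the exact messages (or whether any are reported at all) depend on the libxml2 version PHP was built against.

```php
<?php
// Hypothetical malformed markup standing in for the cURL response:
// the <div> inside the cell is never closed.
$html = '<html><body><table><tr><td><div>cell</td></tr></table></body></html>';

// Buffer libxml errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// Fetch whatever the parser complained about, then reset the buffer
// and restore the previous error-handling mode.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// Serialize the repaired tree; the parser closes the open <div> itself.
$fixed = $dom->saveHTML();
echo $fixed;
```

Comparing `$fixed` with the input shows where the parser had to repair the markup, which helps tell invalid source HTML apart from a real parser problem.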

On a side note: why do you need to scrape that page anyway? Google has an API for searching. Use that instead.
