Problem using PHP DOM to parse HTML
I use PHP's DOMDocument to parse HTML source fetched via cURL. cURL works fine, but a problem occurs when I parse the result with DOM. See the code:
<?php
$url = "http://www.google.com.vn/advanced_search?hl=en";
$ch = curl_init($url);
$header = array();
$header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
$header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
$header[] = "Cache-Control: max-age=0";
$header[] = "Connection: keep-alive";
$header[] = "Keep-Alive: 300";
$header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header[] = "Accept-Language: en-us,en;q=0.5";
$header[] = "Pragma: "; // browsers keep this blank.
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.1; vi; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3 FirePHP/0.5');
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
//curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLINFO_HEADER_OUT, 1);
$html = curl_exec($ch);
/*
* If I do:
* echo $html;
* exit; // <-- it works fine:
* the number of <td> tags equals the number of </td> tags
*/
$dom = new DOMDocument();
@$dom->loadHTML($html);
$html = $dom->saveHTML();
echo $html; // <-- output is not valid HTML: there are more <td> tags than </td> tags.
?>
Is this a programming error or a DOMDocument bug?
When you remove the error suppression (the @ operator), you will see that DOMDocument
emits a number of warnings like these:
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: form and tr
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: div and tr
Warning: DOMDocument::loadHTML(): Opening and ending tag mismatch: td and tr
In order to parse the markup into a DOM tree, loadHTML
will try to fix as much as it can, so what saveHTML returns is no longer identical to the input. That's likely why you think it's a bug. It really isn't. The Google markup is just invalid.
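If you want to inspect those parser complaints instead of hiding them with @, libxml can collect them for you. The following is only a minimal sketch: the $sample string is a made-up piece of broken markup standing in for the Google page, and the exact messages (or whether any are reported at all) depend on the libxml version your PHP is built against.

```php
<?php
// Hypothetical sample of invalid markup (not the real Google page):
// <form>, <div> and <td> are still open when </tr> arrives.
$sample = '<table><tr><td><form><div>cell</tr></table>';

// Collect libxml parse errors in a buffer instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($sample);

// Fetch whatever the parser reported, then empty the buffer.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with line, level and message.
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// The repaired tree, serialized back to HTML.
$fixed = $dom->saveHTML();
echo $fixed;
```

Compared with @, this keeps the parser's diagnostics available to your code, so you can log them or decide whether the page is too broken to trust.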
On a side note: why do you need to scrape that page anyway? Google has an API for searching. Use that instead.