I need a good HTML parser for PHP.
I found this one, http://simplehtmldom.sourceforge.net/, but it failed to work.
I tried fetching this page, http://php.net/manual/en/function.curl-setopt.php,
and parsing it to plain HTML; it failed and returned only a partial HTML page.
This is what I want to do: go to an HTML page and get its components individually (the contents of all div and p elements, in a hierarchy). I like the features of simplehtmldom, so I am looking for a similar parser that copes with all kinds of markup, from the best to the worst.
I often use DOMDocument::loadHTML, which works reasonably well in the general case, and I like querying the documents with XPath once they are loaded as DOM.
Unfortunately, I suppose that in some cases, if the HTML page is really too badly formed, some parsing problems can occur... That's when you start understanding that respecting web standards is a great idea...
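As a rough illustration of the approach described above, here is a minimal sketch that loads markup with DOMDocument::loadHTML and lists every div and p together with its position in the tree. The sample markup and the variable names ($html, $rows) are made up for this example; libxml_use_internal_errors() is used here only so that warnings from imperfect markup are collected instead of printed.

```php
<?php
// Hypothetical sample markup, standing in for a fetched page.
$html = '<html><body><div id="main"><p>First</p><p>Second</p></div></body></html>';

// Collect libxml parse warnings internally instead of emitting them.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Select every div and p element; the union is returned in document order.
$rows = [];
foreach ($xpath->query('//div | //p') as $node) {
    // getNodePath() gives the element's location in the hierarchy.
    $rows[] = [$node->getNodePath(), trim($node->textContent)];
}

foreach ($rows as [$path, $text]) {
    echo $path, ' => ', $text, "\n";
}
```

Because each result carries its node path, the hierarchy the question asks about can be rebuilt from the paths, or walked directly through each node's parentNode and childNodes.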
Building on Pascal MARTIN's response...
I use a combination of cURL and XPath. Below is a function I use in one of my classes.
protected function _get_xpath($url) {
    $referer = 'http://www.whatever.com/';
    $useragent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    // create curl resource
    $ch = curl_init();
    // set the user agent, referer and url
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_URL, $url);
    // return the transfer as a string and follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    // $output contains the output string
    $output = curl_exec($ch);
    //echo htmlentities($output);
    if (curl_errno($ch)) {
        echo 'Curl error: ' . curl_error($ch);
    } else {
        $dom = new DOMDocument();
        // @ suppresses the warnings raised for badly formed markup
        @$dom->loadHTML($output);
        $this->xpath = new DOMXPath($dom);
        $this->html = $output;
    }
    // close curl resource to free up system resources
    curl_close($ch);
}
You can then query the document structure using evaluate and extract the information you want:
$resultDom = $this->xpath->evaluate("//span[@id='headerResults']/strong");
$this->results = $resultDom->item(0)->nodeValue;
I found the best one for my use; here it is: http://querypath.org/