Need a good HTML parser for PHP

2022-12-13 11:57

I found this one, http://simplehtmldom.sourceforge.net/, but it failed to work.

When I used it to fetch this page, http://php.net/manual/en/function.curl-setopt.php, and parse it to plain HTML, it failed and returned only a partial HTML page.

This is what I want to do: go to an HTML page and get its components individually (the contents of all div and p elements, in a hierarchy). I like the features of simplehtmldom, so I need a similar parser that copes with all kinds of markup, from the best to the worst.

I often use DOMDocument::loadHTML, which works reasonably well in the general case, and I like querying the documents with XPath once they are loaded as DOM.

Unfortunately, I suppose that in some cases, if the HTML page is really too badly formed, some parsing problems can occur... That's when you start to understand that respecting web standards is a great idea...
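As a rough illustration of the approach above, here is a minimal sketch that loads deliberately sloppy markup with DOMDocument::loadHTML and lists every div and p with its nesting depth. The sample markup and the helper name outline_blocks are made up for this example; how the parser repairs the broken nesting depends on the libxml version, so the exact indentation may differ between setups.

```php
<?php
// Hypothetical sample markup: an unclosed <p> and a stray </span>.
$html = '<html><body><div id="main"><p>First<div class="inner">'
      . '<p>Second</p></span></div></div></body></html>';

// Collect libxml parse errors instead of letting them surface as PHP warnings.
$previous = libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
$errors = libxml_get_errors();   // array of LibXMLError objects, may be empty
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Hypothetical helper: walk the tree and record each div and p,
// indented by how many div/p ancestors it has.
function outline_blocks(DOMNode $node, int $depth = 0): array
{
    $lines = [];
    foreach ($node->childNodes as $child) {
        if (!$child instanceof DOMElement) {
            continue; // skip text nodes, comments, etc.
        }
        $isBlock = in_array($child->nodeName, ['div', 'p'], true);
        if ($isBlock) {
            $lines[] = str_repeat('  ', $depth)
                     . $child->nodeName . ': ' . trim($child->textContent);
        }
        // Only div and p increase the depth; other elements are passed through.
        $lines = array_merge(
            $lines,
            outline_blocks($child, $isBlock ? $depth + 1 : $depth)
        );
    }
    return $lines;
}

$outline = outline_blocks($dom->documentElement);
echo implode("\n", $outline), "\n";
echo count($errors), " parse problem(s) reported\n";
```

Inspecting $errors is one way to see how badly formed a page is before trusting the resulting tree.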

Building on Pascal MARTIN's response...

I use a combination of cURL and XPath. Below is a function I use in one of my classes.

protected function _get_xpath($url) {
    $referer = 'http://www.whatever.com/';
    $useragent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';

    // create curl resource
    $ch = curl_init();

    // set the user agent, the referer and the URL to fetch
    curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
    curl_setopt($ch, CURLOPT_REFERER, $referer);
    curl_setopt($ch, CURLOPT_URL, $url);

    // return the transfer as a string and follow redirects
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

    // $output contains the response body, or false on failure
    $output = curl_exec($ch);
    //echo htmlentities($output);

    if ($output === false) {
        echo 'Curl error: ' . curl_error($ch);
    } else {
        $dom = new DOMDocument();
        // @ suppresses the warnings raised for malformed markup
        @$dom->loadHTML($output);
        $this->xpath = new DOMXPath($dom);
        $this->html = $output;
    }

    // close curl resource to free up system resources
    curl_close($ch);
}

You can then query the document structure with evaluate() and extract the information you want:

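$resultDom = $this->xpath->evaluate("//span[@id='headerResults']/strong");
// item(0) is null when nothing matches, so check before reading nodeValue
$this->results = $resultDom->length > 0 ? $resultDom->item(0)->nodeValue : null;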
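To try the evaluate() call without fetching anything, here is a small self-contained sketch run against an in-memory string. The markup is invented for the example (it only reuses the headerResults id from the snippet above). It also shows that evaluate() returns a node list for node-set expressions and a plain PHP value for scalar expressions such as count() or string().

```php
<?php
// Hypothetical markup standing in for a fetched page.
$html = '<html><body>'
      . '<span id="headerResults">Found <strong>42</strong> results</span>'
      . '<p>one</p><p>two</p>'
      . '</body></html>';

$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// A node-set expression yields a DOMNodeList.
$nodes = $xpath->evaluate("//span[@id='headerResults']/strong");
$first = $nodes->length > 0 ? $nodes->item(0)->nodeValue : null;

// Scalar expressions yield typed PHP values instead of node lists.
$paragraphs = $xpath->evaluate('count(//p)');                      // a number
$label = $xpath->evaluate("string(//span[@id='headerResults'])");  // a string

var_dump($first, $paragraphs, $label);
```

If only node lists are needed, DOMXPath::query() accepts the same node-set expressions.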

I found the one that works best for my use case. Here it is: http://querypath.org/
