
Strip text that is not inside HTML tags

I'm currently scraping a website and have all the useful data I need, although it comes with a bit of data that I don't want.

Example:

<h2>Heading</h2>
<p>Useful <a href="/foo">data</a></p>
Rubbish <a href="/bar">data</a>
<h2>heading</h2>

So essentially I want to remove all text that is not enclosed by either h2 or p tags.

Is there an easy function or regular expression (preg) for this?


The laziest solution would be using phpQuery or QueryPath with just:

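// "body > *" matches only the direct children of <body>, so nested
// elements such as the <a> inside <p> are not echoed a second time.
foreach (qp($html)->find("body > *") as $node) {
    echo $node->html(), "\n";
}

It iterates over the elements directly below body and implicitly skips text nodes, so you just have to collect the resulting ->html() snippets. Note that a stray element outside h2 or p (like the link after "Rubbish") is still a tag and is kept; a narrower selector such as "body > h2, body > p" would drop it too.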


The nicest way to do it is with PHP's DOMDocument class. This is very similar to mario's answer, except that it doesn't require a whole new library.

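$doc = new DOMDocument;
// Wrap the fragment in a single root element so it parses as one document.
// loadXML() only accepts well-formed markup.
$doc->loadXML('<root>' . $yourContent . '</root>');

// Top-level nodes of the fragment: elements and the loose text between them.
$nodes = $doc->documentElement->childNodes;

$output = '';
foreach ($nodes as $node) {
    // Keep everything except bare text nodes.
    if ($node->nodeType !== XML_TEXT_NODE) {
        $output .= $doc->saveXML($node);
    }
}

echo $output;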
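
Real scraped pages are often not well-formed XML, and the loop above also keeps stray elements that sit outside h2 or p. As a hedged sketch of one way around both, the variant below parses with loadHTML() and selects only h2 and p children through XPath. The function name keepHeadingsAndParagraphs and the wrapper div with id "wrapper" are assumptions made for this illustration and are not part of the original answer; character encoding is not handled here.

```php
<?php
// Hypothetical helper: keep only the top-level <h2> and <p> elements of a fragment.
function keepHeadingsAndParagraphs(string $html): string
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The wrapper <div> (an assumption of this sketch) gives the fragment one known parent.
    $doc->loadHTML('<div id="wrapper">' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $output = '';
    // Select only <h2> and <p> elements that are direct children of the wrapper.
    foreach ($xpath->query('//div[@id="wrapper"]/*[self::h2 or self::p]') as $node) {
        $output .= $doc->saveHTML($node) . "\n";
    }

    return $output;
}

$html = "<h2>Heading</h2>\n"
      . "<p>Useful <a href=\"/foo\">data</a></p>\n"
      . "Rubbish <a href=\"/bar\">data</a>\n"
      . "<h2>heading</h2>";

echo keepHeadingsAndParagraphs($html);
```

With the sample from the question, the loose "Rubbish" text and its link should both be dropped, while links nested inside a kept p element survive.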


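Results are a little better with a regular expression (the s modifier lets . match newlines, so elements that span several lines are caught as well):

preg_match_all('~<h2>.*?</h2>|<p>.*?</p>~is', $str, $new);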
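
For completeness, a small sketch of how the matches might be collected afterwards. It uses the sample markup from the question and a variant of the pattern with the s modifier; the variable name $clean is an assumption for this illustration. The pattern only matches h2 and p tags without attributes.

```php
<?php
$str = <<<HTML
<h2>Heading</h2>
<p>Useful <a href="/foo">data</a></p>
Rubbish <a href="/bar">data</a>
<h2>heading</h2>
HTML;

// $new[0] holds every full match, in the order it was found.
preg_match_all('~<h2>.*?</h2>|<p>.*?</p>~is', $str, $new);

// Join the kept elements back into one string.
$clean = implode("\n", $new[0]);

echo $clean, "\n";
// Prints:
// <h2>Heading</h2>
// <p>Useful <a href="/foo">data</a></p>
// <h2>heading</h2>
```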