Strip HTML that is not in tags
I'm currently scraping a website and have all the useful data I need, although it comes with a bit of data that I don't want.
Example:
<h2>Heading</h2>
<p>Useful <a href="/foo">data</a></p>
Rubbish <a href="/bar">data</a>
<h2>heading</h2>
So essentially I want to remove all text that is not enclosed by either h2 or p tags.
Is there an easy function or regex (preg) for this?
The laziest solution would be using phpQuery or QueryPath with just:
foreach (qp($html)->find("body *") as $node) {
    echo $node->html(), "\n";
}
It iterates over all tags below body and skips text nodes implicitly, so you just have to collect the resulting ->html() snippets.
The nicest way to do it is with PHP's DOMDocument class. This is very similar to mario's answer, except that it doesn't require a whole new library.
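$doc = new DOMDocument;
// Wrap the fragment in a single root element; loadXML() needs well-formed XML.
$doc->loadXML('<root>' . $yourContent . '</root>');
$nodes = $doc->firstChild->childNodes;
$output = '';
for ($i = 0; $i < $nodes->length; $i++) {
    $node = $nodes->item($i);
    // Keep everything except bare text that sits directly under the root.
    if ($node->nodeType !== XML_TEXT_NODE) {
        $output .= $doc->saveXML($node);
    }
}
echo $output;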
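Note that the loop above keeps every top-level element, so the `<a href="/bar">` link from the "Rubbish" line would survive with only its leading text removed. If the goal is to keep nothing but h2 and p, a stricter variant might look like the sketch below. The function name `keepHeadingsAndParagraphs` is made up for illustration, and like the code above it assumes the content is well-formed XML.

```php
<?php
// Hypothetical helper (not from the answers above):
// keep only the top-level <h2> and <p> elements of a fragment.
function keepHeadingsAndParagraphs(string $content): string
{
    $doc = new DOMDocument();
    // Assumption: $content is well-formed XML once wrapped in a root element.
    $doc->loadXML('<root>' . $content . '</root>');

    $output = '';
    foreach ($doc->documentElement->childNodes as $node) {
        // Skip text nodes and any element that is not an h2 or a p.
        if ($node->nodeType === XML_ELEMENT_NODE
            && in_array(strtolower($node->nodeName), ['h2', 'p'], true)) {
            $output .= $doc->saveXML($node) . "\n";
        }
    }
    return $output;
}

$html = <<<EOT
<h2>Heading</h2>
<p>Useful <a href="/foo">data</a></p>
Rubbish <a href="/bar">data</a>
<h2>heading</h2>
EOT;

echo keepHeadingsAndParagraphs($html);
```

With the example from the question, this should print the two headings and the paragraph and drop the whole "Rubbish" line, link included.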
The results are a little better with a regular expression:
preg_match_all('~<h2>.*?<\/h2>|<p>.*?<\/p>~i', $str, $new);
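As a rough usage sketch for the pattern above: preg_match_all() stores the full matches in `$new[0]`, so they can be joined back into a string. The `s` modifier added here is my own assumption, so that `.` also matches newlines inside a multi-line paragraph. The pattern still only matches tags without attributes, so something like `<p class="x">` would be missed.

```php
<?php
$str = <<<EOT
<h2>Heading</h2>
<p>Useful <a href="/foo">data</a></p>
Rubbish <a href="/bar">data</a>
<h2>heading</h2>
EOT;

// "i" ignores tag case; "s" (an addition to the original pattern)
// lets "." match newlines as well.
preg_match_all('~<h2>.*?</h2>|<p>.*?</p>~is', $str, $new);

// $new[0] holds the full matches in document order.
$clean = implode("\n", $new[0]);
echo $clean, "\n";
```

Everything outside the matched h2 and p elements, including the "Rubbish" line, is simply never captured.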