How to get a string from HTML with regex?
I'm trying to parse a block from an HTML page, so I tried to preg_match
this block with PHP:
if( preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t))
but it doesn't work.
</div>
blablabla
blablabla
blablabla
<div class="adsdiv">
I want to grab only the blablabla blablabla words.
Any help?
Regex isn't the right tool for this. Here is how to do it with DOM:
$html = <<< HTML
<div class="parent">
<div>
<p>previous div</p>
</div>
blablabla
blablabla
blablabla
<div class="adsdiv">
<p>other content</p>
</div>
</div>
HTML;
Content in an HTML document consists of text nodes; tags are element nodes. Your text node with the content blablabla has to have a parent node. To fetch the text node's value, we will assume you want all the text nodes of the parent node of the div whose class attribute is adsdiv:
$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//div[@class="adsdiv"]');
foreach($nodes as $node) {
foreach($node->parentNode->childNodes as $child) {
if($child instanceof DOMText) {
echo $child->nodeValue;
}
}
}
Yes, it's not a funky one-liner, but it's also much less of a headache and gives you solid control over the HTML document. Harnessing the query power of XPath, we could have shortened the above to:
$nodes = $xPath->query('//div[@class="adsdiv"]/../text()');
foreach($nodes as $node) {
echo $node->nodeValue;
}
I kept it deliberately verbose to illustrate how to use DOM, though.
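Real-world pages are rarely well-formed, and the text nodes usually carry surrounding whitespace. The following is a minimal sketch of the same XPath approach wrapped in a helper; the function name siblingTextOfAdsDiv is hypothetical (not from the answer above), and the choice to suppress libxml warnings and trim each text node is an assumption about what you want, not a requirement.

```php
<?php
// Hypothetical helper: returns the trimmed, non-empty text nodes that are
// siblings of <div class="adsdiv"> (i.e. text children of its parent).
function siblingTextOfAdsDiv(string $html): array
{
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xPath = new DOMXPath($dom);
    $texts = [];
    foreach ($xPath->query('//div[@class="adsdiv"]/../text()') as $node) {
        $value = trim($node->nodeValue);
        if ($value !== '') { // skip whitespace-only nodes between tags
            $texts[] = $value;
        }
    }
    return $texts;
}

$html = <<<HTML
<div class="parent">
<div>
<p>previous div</p>
</div>
blablabla
blablabla
blablabla
<div class="adsdiv">
<p>other content</p>
</div>
</div>
HTML;

print_r(siblingTextOfAdsDiv($html));
```

With the sample markup this should yield a single entry holding the three blablabla lines; whitespace-only nodes between the tags are filtered out.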
Apart from what has been said above, also add the /s modifier so that . will match newlines. (Edit: as Alan kindly pointed out, [^<]+ will match newlines anyway.)
I always use /U as well, since in these cases you normally want minimal matching by default (it will be faster as well). And /i, since people write <div>, <DIV>, or even <Div>...
if (preg_match('/<\/div>([^<]+)<div class="adsdiv">/Usi', $data, $match))
{
echo "Found: ".$match[1]."<br>";
} else {
echo "Not found<br>";
}
Edit: made it a little more explicit.
From the PHP Manual:
s (PCRE_DOTALL) - If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.
So, the following should work:
if (preg_match('~<\/div>(.*?)<div class="adsdiv">~s', $data, $t))
The ~ characters are there to delimit the regular expression.
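To see the effect of the s modifier in isolation, here is a small sketch. The $data string is an assumed stand-in for the question's input, with the wanted text spread over several lines.

```php
<?php
// Assumed sample input mirroring the question: the text spans several lines.
$data = "</div>\nblablabla\nblablabla\nblablabla\n<div class=\"adsdiv\">";

// Without /s the dot does not match newlines, so the pattern cannot match.
$withoutS = preg_match('~</div>(.*?)<div class="adsdiv">~', $data, $t1);

// With /s the dot matches newlines too, so the capture spans all three lines.
$withS = preg_match('~</div>(.*?)<div class="adsdiv">~s', $data, $t2);

var_dump($withoutS);     // int(0)
var_dump($withS);        // int(1)
echo trim($t2[1]), "\n"; // the three blablabla lines
```

Because ~ is the delimiter here, the slash in </div> does not need escaping.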
You need to delimit your regex; use /<\/div>(.*?)<div class="adsdiv">/ instead.
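A quick sketch of why the missing delimiters matter, using an assumed single-line $data so that the s modifier is not needed. Without delimiters, PHP takes the leading < as an opening delimiter and fails to compile the pattern, so preg_match returns false (the warning is suppressed with @ here purely for the demonstration).

```php
<?php
// Assumed single-line sample so the dot does not have to cross newlines.
$data = '</div>blablabla<div class="adsdiv">';

// Undelimited pattern from the question: compilation fails, result is false.
$bad = @preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t);
var_dump($bad); // bool(false)

// Same pattern wrapped in / delimiters: it compiles and matches.
$good = preg_match('/<\/div>(.*?)<div class="adsdiv">/', $data, $t);
var_dump($good);  // int(1)
echo $t[1], "\n"; // blablabla
```

For multi-line input like the question's, the delimited pattern still needs the s modifier (or a negated class such as [^<]+) to get past the newlines.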