How to get string from HTML with regex?

2023-01-08 18:19 问答作者：

I'm trying to parse block from html page so i try to preg_match this block with php

if( preg_match('<\/div>(.*?)<div class="adsdiv">', $data, $t))

but doesn't work

</div>

blablabla

blablabla

blablabla

<div class="adsdiv">

i want grep only blab开发者_C百科labla blablabla words any help

Regex aint the right tool for this. Here is how to do it with DOM

$html = <<< HTML
<div class="parent">
    <div>
        <p>previous div<p>
    </div>
    blablabla
    blablabla
    blablabla
    <div class="adsdiv">
        <p>other content</p>
    </div>
</div>
HTML;

Content in an HTML Document is TextNodes. Tags are ElementNodes. Your TextNode with the content of blablabla has to have a parent node. For fetching the TextNode value, we will assume you want all the TextNode of the ParentNode of the div with class attribute of adsdiv

$dom = new DOMDocument;
$dom->loadHTML($html);
$xPath = new DOMXPath($dom);
$nodes = $xPath->query('//div[@class="adsdiv"]');
foreach($nodes as $node) {
    foreach($node->parentNode->childNodes as $child) {
        if($child instanceof DOMText) {
            echo $child->nodeValue;
        }
    };
}

Yes, it's not a funky one liner, but it's also much less of a headache and gives you solid control over the HTML document. Harnessing the Query Power of XPath, we could have shortened the above to

$nodes = $xPath->query('//div[@class="adsdiv"]/../text()');
foreach($nodes as $node) {
    echo $node->nodeValue;
}

I kept it deliberatly verbose to illustrate how to use DOM though.

Apart from what has been said above, also add the /s modifier so . will match newlines. (edit: as Alan kindly pointed out, [^<]+ will match newlines anyway)

I always use /U as well since in these cases you normally want minimal matching by default. (will be faster as well). And /i since people say <div>, <DIV>, or even <Div>...

if (preg_match('/<\/div>([^<]+)<div class="adsdiv">/Usi', $data, $match))
{
    echo "Found: ".$match[1]."<br>";
} else {
    echo "Not found<br>";
}

edit made it a little more explicit!

From the PHP Manual:

s (PCRE_DOTALL) - If this modifier is set, a dot metacharacter in the pattern matches all characters, including newlines. Without it, newlines are excluded. This modifier is equivalent to Perl's /s modifier. A negative class such as [^a] always matches a newline character, independent of the setting of this modifier.

So, the following should work:

if (preg_match('~<\/div>(.*?)<div class="adsdiv">~s', $data, $t))

The ~ are there to delimit the regular expression.

You need to delimit your regex; use /<\/div>(.*?)<div class="adsdiv">/ instead.

继续阅读：html-parsing php regex

How to get string from HTML with regex?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？