Capture content inside html tags with regex

2022-12-10 02:20

First off, I'm aware this is bad practice, and I have answered many questions saying so. To clarify: I am forced to use regex because this application stores regexes in a database and only functions this way. I absolutely cannot change the functionality

Now that we've got that out of the way: because I always use DOM methods, I'm not used to doing this with regular expressions.

I want to capture everything inside the intro-content div, up to the first closing div tag. I don't care if the regex fails on nested divs. I need to capture whitespace (newline) characters too.

<div class="intro-content">
<p>blah</p>
<br/>
<strong>test</strong>
</div>

Regex so far:

<div\s*class="intro-content">(.*)</div>

This obviously doesn't work because the . character will not match space characters.

I do realize hundreds of similar questions have been asked, but the ones I visited only had relatively simple answers (excluding the DOM suggestion answers) where a (.*) would not suffice because it doesn't account for newlines, and some regexes were too greedy.

I'm not looking for a perfect, clean solution that will account for every possibility (like that's even possible) - I just want a quick solution that will work for this case so I can move on and work on more modern applications that aren't so horribly coded.

It sounds like you need to enable the "dot all" (s) flag. This will make . match all characters including line breaks. For example:

preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $matches);
// $matches[1] now holds the captured content
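A minimal sketch of what the s modifier changes, assuming the markup from the question is held in a variable (the name $html and the other variable names here are only illustrative):

```php
<?php
// Sample markup from the question; $html is an illustrative variable name.
$html = <<<'HTML'
<div class="intro-content">
<p>blah</p>
<br/>
<strong>test</strong>
</div>
HTML;

$pattern = '<div\s*class="intro-content">(.*)<\/div>';

// Without the s modifier, . stops at line breaks, so the pattern cannot
// get from the opening tag to the closing tag and nothing matches.
$withoutFlag = preg_match('/' . $pattern . '/', $html, $m1);

// With the s modifier, . also matches line breaks, so group 1 spans all lines.
$withFlag = preg_match('/' . $pattern . '/s', $html, $m2);

var_dump($withoutFlag); // int(0)
var_dump($withFlag);    // int(1)
echo $m2[1];
```

The only difference between the two calls is the trailing s after the closing delimiter.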

You should not use regexes to parse HTML like this. div tags can be nested, and since a regex doesn't have any context, there is no way to parse that. Use an HTML parser instead. For example:

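$doc = new DOMDocument();
$doc->loadHTML($html);
// DOMDocument has no getElementsByClassName(), so fetch the divs by tag name
// and check the class attribute on each one.
foreach ($doc->getElementsByTagName("div") as $div) {
  if ($div->getAttribute("class") === "intro-content") {
    var_dump($div);
  }
}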

See: DOMDocument
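For completeness, a rough sketch of pulling out the inner HTML of that div with the parser approach, assuming an exact class attribute of "intro-content" and a single-line sample input (both are assumptions made for illustration, as are the variable names):

```php
<?php
// Illustrative input; the real document would come from elsewhere.
$html = '<div class="intro-content"><p>blah</p><br/><strong>test</strong></div>';

// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Select divs whose class attribute is exactly "intro-content".
$xpath = new DOMXPath($doc);
$inner = '';
foreach ($xpath->query('//div[@class="intro-content"]') as $div) {
    // Serialize each child node to build the element's inner HTML.
    foreach ($div->childNodes as $child) {
        $inner .= $doc->saveHTML($child);
    }
}

echo $inner;
```

Unlike the regex, this keeps working when the div contains nested divs, because the parser tracks which closing tag belongs to which opening tag.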

Edit:

And then I saw your note:

I am forced to use regex because this application stores regexes in a database and only functions this way. I absolutely cannot change the functionality

Well. At least make sure that you match non-greedily. That way it'll match correctly as long as there are no nested div tags:

preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $matches);
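To illustrate why the question mark matters, here is a small sketch with a hypothetical input that has a second, unrelated div after the intro block (the footer div and the variable names are made up for the example):

```php
<?php
// Hypothetical input: the intro block followed by another div.
$html = <<<'HTML'
<div class="intro-content">
<p>blah</p>
</div>
<div class="footer">later</div>
HTML;

// Greedy: (.*) runs on to the LAST </div> in the input,
// so the capture swallows the footer div's opening tag and text as well.
preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $greedy);

// Lazy: (.*?) stops at the FIRST </div> it can reach.
preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $lazy);

var_dump($greedy[1]);
var_dump($lazy[1]);
```

The lazy capture contains only the paragraph, while the greedy one also contains the text up to the final closing tag.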

This obviously doesn't work because the . character will not match space characters.

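It should do for ordinary spaces, but . does not match line breaks by default, so we can just add the whitespace characters in as an alternative. (They can't share the square brackets with the dot, because a dot inside [ and ] only matches a literal dot.)

<div\s*class="intro-content">((?:.|[ \t\r\n])*)</div>

You then need to make it lazy, so it captures everything up to the first </div> and not the last. We do this by adding a question mark:

<div\s*class="intro-content">((?:.|[ \t\r\n])*?)</div>

There. Give that a shot. You might be able to replace the whitespace characters ( \t\r\n) between [ and ] with a single \s too.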
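If the stored pattern cannot carry a modifier such as s, two modifier-free spellings may be worth trying. The sketch below assumes the application wraps each stored pattern in delimiters itself; the "~" delimiter, the $stored array, and the sample input are all hypothetical:

```php
<?php
$html = "<div class=\"intro-content\">\n<p>blah</p>\n<br/>\n<strong>test</strong>\n</div>\n<div>other</div>";

// Hypothetical stored patterns: no delimiters, no trailing modifiers.
$stored = [
    // \s and \S together cover every character, line breaks included.
    '<div\s*class="intro-content">([\s\S]*?)</div>',
    // (?s) switches on "dot all" from inside the pattern itself.
    '(?s)<div\s*class="intro-content">(.*?)</div>',
];

$results = [];
foreach ($stored as $pattern) {
    // Assumed wrapper: the application adds the delimiters.
    // With "~" as the delimiter, the / in </div> needs no escaping.
    preg_match('~' . $pattern . '~', $html, $m);
    $results[] = $m[1];
}

var_dump($results[0] === $results[1]); // bool(true)
```

Both patterns are lazy, so each capture ends at the first </div> and the trailing div is left out.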
