Capture content inside HTML tags with regex
First off, I'm aware this is bad practice, and I have answered many questions saying so myself. To clarify: I am forced to use regex because this application stores regexes in a database and only functions this way. I absolutely cannot change the functionality.
Now that we've got that out of the way: because I always use DOM methods, I'm not used to doing this with regular expressions.
I want to capture everything inside the intro-content div, up to the first closing div tag. I don't care if the regex fails on nested divs. I need to capture whitespace (newline) characters too.
<div class="intro-content">
<p>blah</p>
<br/>
<strong>test</strong>
</div>
Regex so far:
<div\s*class="intro-content">(.*)</div>
This obviously doesn't work because the . character will not match space characters.
I do realize hundreds of similar questions have been asked, but the ones I visited only had relatively simple answers (excluding the DOM suggestions) where a (.*) would not suffice because it doesn't account for newlines, and some regexes were too greedy.
I'm not looking for a perfect, clean solution that will account for every possibility (like that's even possible) - I just want a quick solution that will work for this situation so I can move on and work on more modern applications that aren't so horribly coded.
It sounds like you need to enable the "dot all" (s) flag. This will make . match all characters including line breaks. For example:
preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $matches);
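To see what the flag changes, here is a minimal sketch that runs the same pattern with and without s against the markup from the question. The variable names ($pattern, $withoutFlag, $withFlag) are made up for illustration.

```php
<?php
// Sample markup from the question; the div body spans several lines.
$html = "<div class=\"intro-content\">\n<p>blah</p>\n<br/>\n<strong>test</strong>\n</div>";

$pattern = '<div\s*class="intro-content">(.*)<\/div>';

// Without the s flag, . stops at the first newline, so nothing matches.
$withoutFlag = preg_match('/' . $pattern . '/', $html, $m1);

// With the s flag, . also matches newlines and group 1 holds the body.
$withFlag = preg_match('/' . $pattern . '/s', $html, $m2);

var_dump($withoutFlag); // int(0)
var_dump($withFlag);    // int(1)
echo $m2[1];            // the captured body, including its newlines
```

The captured text lands in index 1 of the array passed as the third argument to preg_match.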
You should not use regexps to parse HTML like this. div tags can be nested, and since a regexp like this has no notion of nesting, there is no way to parse that. Use an HTML parser instead. For example:
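$doc = new DOMDocument();
$doc->loadHTML($html);
// DOMDocument has no getElementsByClassName(), so fetch the divs and filter by class
foreach ($doc->getElementsByTagName("div") as $div) {
    if ($div->getAttribute("class") === "intro-content") {
        var_dump($doc->saveHTML($div));
    }
}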
See: DOMDocument
Edit:
And then I saw your note:
I am forced to use regex because this application stores regexes in a database and only functions this way. I absolutely cannot change the functionality
Well. At least make sure that you match non-greedily. That way it'll match correctly as long as there are no nested tags:
preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $matches);
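The difference only shows up when more than one closing div follows the opening tag. The sketch below assumes a hypothetical page where a second, unrelated div comes after the intro division, and compares the greedy and non-greedy captures.

```php
<?php
// Hypothetical input: the intro division followed by a second, unrelated div.
$html = "<div class=\"intro-content\">\n<p>blah</p>\n</div>\n<div class=\"other\">more</div>";

// Greedy: (.*) runs to the end of the input and backtracks to the LAST </div>.
preg_match('/<div\s*class="intro-content">(.*)<\/div>/s', $html, $greedy);

// Lazy: (.*?) grows one character at a time and stops at the FIRST </div>.
preg_match('/<div\s*class="intro-content">(.*?)<\/div>/s', $html, $lazy);

var_dump($greedy[1]); // spills over into the second div
var_dump($lazy[1]);   // only the body of the intro division
```

The lazy version still stops too early if the intro division itself contains a nested div, which is the limitation already accepted in the question.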
This obviously doesn't work because the . character will not match space characters.
It should match spaces; what it won't match by default is newlines. We can work around that without any flag. Note that a . inside [ and ] is a literal dot, so simply adding it to [ \t\r\n] would only allow whitespace and dots. Instead, pair \s with its complement \S to get a class that matches any character, newlines included:
<div\s*class="intro-content">([\s\S]*)</div>
You then need to make it lazy, so it captures everything up to the first </div> and not the last. We do this by adding a question mark:
<div\s*class="intro-content">([\s\S]*?)</div>
There. Give that a shot.
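One caveat about character classes is easy to check by running it. This is a rough sketch in PHP, to match the other snippets in this thread, with illustrative variable names; it compares a class that contains a dot against [\s\S], both without the s flag.

```php
<?php
// Markup from the question, plus a hypothetical trailing div.
$html = "<div class=\"intro-content\">\n<p>blah</p>\n<br/>\n<strong>test</strong>\n</div>\n<div>later</div>";

// Inside [...], the . is a literal dot, so this class only allows
// whitespace and dots; the letters and tags in the body make it fail.
$literalDot = preg_match('/<div\s*class="intro-content">([ \t\r\n.]*?)<\/div>/', $html);

// [\s\S] means "whitespace or non-whitespace", i.e. any character,
// so it crosses newlines without needing the s flag.
$anyChar = preg_match('/<div\s*class="intro-content">([\s\S]*?)<\/div>/', $html, $m);

var_dump($literalDot); // int(0)
var_dump($anyChar);    // int(1)
echo $m[1];            // body of the intro division only
```

Because [\s\S] does not depend on a modifier, it may suit a setup where only the pattern text is stored and flags cannot be set.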