regex: put text outside <p> inside <p>
I have some broken HTML code that I would like to fix with a regex.
The html might be something like this:
<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<p>text5</p>
But there can be many more paragraphs, and other HTML elements too.
I want to turn it into:
<p>text1</p>
<p>text2</p>
<p>text3</p>
<p>text4</p>
<p>text5</p>
Is this possible with a regex? I'm using PHP, if that matters.
No, this is generally a bad idea with regexes. Regexes don't do stateful parsing, and HTML has implicit tags, so state must be kept to parse it.
HTML also has lots of quirks. Writing an HTML parser is hard because you not only have to keep track of how things should be, but also have to account for the broken markup seen in the wild.
Regexes are the wrong tool for this job.
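For comparison, here is a minimal sketch of the parser-based route using PHP's bundled DOM extension. It is only an illustration under some assumptions: `wrapLooseText()` is a hypothetical helper name, the input is a fragment rather than a full page, only text sitting directly at the top level of the fragment is wrapped, and character-encoding handling is left out.

```php
<?php
// A minimal sketch, not a hardened implementation.
// wrapLooseText() is a hypothetical helper name, not a library function.
function wrapLooseText(string $html): string
{
    $doc = new DOMDocument();

    // Real-world HTML triggers libxml warnings; collect them instead of printing.
    $previous = libxml_use_internal_errors(true);
    // The wrapper <div> gives the fragment a single known parent to iterate over.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The wrapper is the first <div> in document order.
    $root = $doc->getElementsByTagName('div')->item(0);

    // Copy the live child list first, because the loop modifies the tree.
    foreach (iterator_to_array($root->childNodes) as $node) {
        if ($node instanceof DOMText && trim($node->nodeValue) !== '') {
            $p = $doc->createElement('p');
            $p->appendChild($doc->createTextNode(trim($node->nodeValue)));
            $root->replaceChild($p, $node);
        }
    }

    // Serialize only the wrapper's children, so the helper <div> is dropped.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

$html = "<p>text1</p>\n<p>text2</p>\ntext3\n<p>text4</p>\n<p>text5</p>";
echo wrapLooseText($html);
```

Whitespace between the elements may come out differently from the input, since the parser and serializer decide how to lay it out.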
Could http://htmlpurifier.org/ help you?
Regexes are not the best solution for this kind of job, but this code works for the example you gave (it might not be optimal!):
<?php
$text = '<p>text1</p>
<p>text2</p>
text3
<p>text4</p>
<p>text5</p>';
// Group 1: one or more <p>...</p> blocks before the loose text.
// Group 3: the loose text itself.
// Group 4: one or more <p>...</p> blocks after the loose text.
$regex = '|(([\r\n ]*<p>[a-zA-Z0-9 \r\n]+</p>[\r\n ]*)+)([\r\n ]*[a-zA-Z0-9 ]+)(([\r\n ]*<p>[a-zA-Z0-9 \r\n]+</p>[\r\n ]*)+)|i';
$replacement = '${1}<p>${3}</p>${4}';
$replacedText = preg_replace($regex, $replacement, $text);
echo $replacedText;
Note that the replacement string uses groups 1, 3 and 4 to get the correct sub-matches (groups 2 and 5 are the inner repeated groups). If you want to match other HTML tags as well, you can use this regex:
$regex = '|(([\r\n ]*<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>[\r\n ]*)+)([\r\n ]*[a-zA-Z0-9 ]+)(([\r\n ]*<[a-z0-6]+>[a-zA-Z0-9 \r\n]+</[a-z0-6]+>[\r\n ]*)+)|i';
But be aware that it can mess things up, because the closing tag is not required to match the opening tag.
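One way to tighten this might be a backreference, so that each closing tag has to repeat the name of its opening tag. The following is only a sketch under the same assumptions as the regex above (plain alphanumeric content, no attributes, no nesting); `wrapBetweenMatchedTags()` is a hypothetical helper name. Capturing the tag names adds two groups, so the group numbers in the replacement shift.

```php
<?php
// Sketch only: wrapBetweenMatchedTags() is a hypothetical helper name.
function wrapBetweenMatchedTags(string $text): string
{
    // Same structure as the regex above, but each tag name is captured
    // (groups 3 and 7) and the closing tag must repeat it via \3 or \7.
    // The loose text therefore moves from group 3 to group 4,
    // and the trailing blocks from group 4 to group 5.
    $regex = '|(([\r\n ]*<([a-z0-6]+)>[a-zA-Z0-9 \r\n]+</\3>[\r\n ]*)+)'
           . '([\r\n ]*[a-zA-Z0-9 ]+)'
           . '(([\r\n ]*<([a-z0-6]+)>[a-zA-Z0-9 \r\n]+</\7>[\r\n ]*)+)|i';
    return preg_replace($regex, '${1}<p>${4}</p>${5}', $text);
}

echo wrapBetweenMatchedTags("<h1>title</h1>\nloose\n<p>text</p>"), "\n";
// A mismatched pair such as <h1>...</p> no longer matches,
// so the input comes back unchanged.
echo wrapBetweenMatchedTags("<h1>title</p>\nloose\n<p>text</p>"), "\n";
```

This still breaks on attributes, nested elements and punctuation in the text, so it does not change the general advice against regexes here.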