How to remove any html tags with nothing but optional whitespace between them

I need to clean up some VERY ugly HTML (think <span></span> <em></em> <em> </em> <strong></strong>) over and over again...

I'm looking for a nice and easy preg_replace to eliminate any HTML tag pairs that contain nothing but optional whitespace between them. Your assistance is greatly appreciated!

Oh, and just found this beauty:

<p><strong><strong></strong></strong></p>

Looks like this will need to live in a while loop as well.


It's funny how this topic keeps coming up.

Don't go with regex. Try HTML Tidy instead.


If you are looking to really clean up some markup, I'd suggest the Tidy class in PHP. There are some examples that might help get you started. (Note that this is a front end to HTML Tidy.)


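If you really want a regex, here's one (written as a Perl-style substitution):

s:<(\w+)>\s*<\/\1>::g

Run it repeatedly, until nothing more matches, to eliminate nested cases. Note that it only matches tags that carry no attributes.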
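
Since the question asked for preg_replace, here is a minimal sketch of how that substitution might be wrapped in a loop in PHP. The function name remove_empty_tags is hypothetical, not something from this thread, and the pattern is the one above unchanged, so tags with attributes are still left alone.

```php
<?php
// Hypothetical helper (name is illustrative only): repeatedly strips tag
// pairs that contain nothing but optional whitespace.
function remove_empty_tags(string $html): string
{
    // Same pattern as the substitution above: <tag>, optional whitespace, </tag>.
    $pattern = '~<(\w+)>\s*</\1>~';

    do {
        $result = preg_replace($pattern, '', $html, -1, $count);
        if ($result === null) {
            // Regex engine error: keep the last good value and stop.
            break;
        }
        $html = $result;
        // Removing an inner pair can leave its parent empty, so keep going
        // until a pass removes nothing.
    } while ($count > 0);

    return $html;
}
```

Each pass removes the innermost empty pairs, so a nested case such as <p><strong><strong></strong></strong></p> takes several passes before it disappears. Whitespace that sat between the removed pairs is left in place.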


Well, it looks like tidy WAS the answer:

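function cleanupcrap($html) {
    $tidy_config = array(
        'clean' => true,          // replace presentational markup with style rules
        'output-xhtml' => true,   // emit XHTML
        'show-body-only' => true, // return only the contents of <body>
        'wrap' => 0,              // disable line wrapping
    );

    // Parse as UTF-8, repair the markup, and return the cleaned string.
    $tidy = tidy_parse_string($html, $tidy_config, 'UTF8');
    $tidy->cleanRepair();
    return $tidy->value;
}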
