
Regex for selective stripping of HTML

I'm trying to parse some HTML with PHP as an exercise, outputting it as just text, and I've hit a snag. I'd like to remove any tags that are hidden with style="display: none;" - bearing in mind that the tag may contain other attributes and style properties.

The code I have so far is this:

$page = preg_replace("#<([a-z]+).*?style=\".*?display:\s*none[^>]*>.*?</\1>#s","",$page);

This code returns NULL with a PREG_BACKTRACK_LIMIT_ERROR.

I tried this instead:

$page = preg_replace("#<([a-z]+)[^>]*?style=\"[^\"]*?display:\s*none[^>]*>.*?</\1>#s","",$page);

But now it's just not replacing any tags.

Any help would be much appreciated. Thanks!


Using DOMDocument, you can try something like this:

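$doc = new DOMDocument;
$doc->loadHTMLFile("foo.html");
// Collect matches first: the node list is live, so removing nodes while iterating over it skips elements.
$hidden = array();
foreach ($doc->getElementsByTagName('*') as $node) {
    // \s* also matches "display:none" written without a space.
    if (preg_match('/display:\s*none/i', $node->getAttribute('style'))) {
        $hidden[] = $node;
    }
}
foreach ($hidden as $node) {
    // A node has to be removed through its own parent, not through the document.
    if ($node->parentNode) {
        $node->parentNode->removeChild($node);
    }
}
$doc->saveHTMLFile("foo.html");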
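
Since the goal in the question is plain text rather than a rewritten file, the same idea can be sketched on an in-memory string. This is only an illustration: the function name visibleText and the sample markup are made up for the example, and it assumes the hidden elements carry an inline style attribute (rules from a stylesheet are not detected).

```php
<?php
// Hypothetical helper (not from the original answer): returns the text of
// $html with every element styled "display: none" removed.
function visibleText(string $html): string
{
    $doc = new DOMDocument();
    // Real-world markup is rarely valid, so collect libxml warnings silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    // Snapshot the live node list before modifying the tree.
    $hidden = [];
    foreach ($doc->getElementsByTagName('*') as $node) {
        if (preg_match('/display:\s*none/i', $node->getAttribute('style'))) {
            $hidden[] = $node;
        }
    }
    foreach ($hidden as $node) {
        if ($node->parentNode !== null) {
            $node->parentNode->removeChild($node);
        }
    }

    // textContent concatenates the text nodes that are left.
    $root = $doc->documentElement;
    return $root === null ? '' : trim($root->textContent);
}

// Sample input is an assumption for illustration only.
$sample = '<p>Shown</p>'
        . '<div class="x" style="color: red; display:none"><b>Hidden</b></div>'
        . '<p>Also shown</p>';
echo visibleText($sample), "\n";
```

Because the style attribute is read from the parsed element, other attributes and other style properties on the same tag do not get in the way.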


You should never parse HTML with regex. That makes your eyes bleed. HTML is not a regular language in any form, so it should be parsed with a DOM parser.
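
To make the point concrete, here is a small sketch of what goes wrong with nested tags. The pattern is the asker's second attempt, written in single quotes so that \1 reaches PCRE as a backreference (inside a double-quoted PHP string, \1 is an octal escape for the byte 0x01, which is one plausible reason that attempt matched nothing). The sample markup is an assumption for illustration.

```php
<?php
// In a double-quoted string, "\1" is the single byte 0x01, not a backreference.
var_dump("\1" === "\x01"); // bool(true)

// Same pattern as the second attempt in the question, but single-quoted.
$pattern = '#<([a-z]+)[^>]*?style="[^"]*?display:\s*none[^>]*>.*?</\1>#s';

// Sample input (made up): a hidden div that contains another div.
$html = '<div style="display: none;"><div>secret</div> still hidden</div><p>visible</p>';

// The lazy .*? stops at the FIRST </div>, which closes the inner div,
// so the tail of the hidden element leaks into the output.
$result = preg_replace($pattern, '', $html);
var_dump($result);
```

The regex cannot count opening and closing tags, so the match ends at the inner closing tag and " still hidden</div>" survives. A DOM parser removes the whole subtree regardless of nesting depth.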

Parse HTML to DOM with PHP
