Remove all HTML tags and content except for a div with a given class

I want to remove all HTML content from a string except for a div with a specific class: <div class="toto">blablabla</div>

Should I use a regex or a DOM parser?

To answer drachenstern:

It's comment content with BBCode. The HTML in this div is generated by GeSHi (a code highlighter), so I don't want to delete it.

For example, a visitor can enter <script></script> inside a [code][/code] BBCode tag. All HTML outside the [code][/code] BBCode tag must be deleted, no?


It's almost impossible to use a regex to reliably extract data from a web page, so I would suggest using a dedicated HTML parser instead. Some problems are just too big to tackle yourself every time, so using an external library is considered perfectly acceptable.

If you only need the one div, I would like to know what you're trying to do with the particular div. Can it work clientside or must it happen serverside? Are you trying to write a scraper?


I think you already know the answer :-)

Seriously though, in this case it depends on the complexity of the surrounding string. If it varies a lot, or the snippet you want to keep is nested deeply, then a DOM parser makes sense. If the surrounding content is fairly predictable, then a regex may do. I would probably use the DOM either way because it's going to be easier to work with and the most failsafe.


Unless you can guarantee that this section of the string is EXACTLY of the form <div class="toto".....</div>, regex can't do it. By "exactly" I mainly mean no other attributes before class and no other div element inside this one. Upper/lowercase, whitespace, and single/double quotes can all be handled by a regex.
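To illustrate those constraints, here is a rough sketch of a regex that might do under exactly the guarantees listed above. The function name `extract_toto_div` and the sample input are made up for illustration. It assumes class is the only attribute and that no div is nested inside the target.

```php
<?php
// Sketch only: relies on the strict assumptions stated above
// (class is the sole attribute, no <div> nested inside the target).
function extract_toto_div(string $html): string
{
    // i = case-insensitive, s = let "." match newlines.
    // (["\']) captures the opening quote, \1 requires the same closing quote.
    $pattern = '/<div\s+class\s*=\s*(["\'])toto\1\s*>.*?<\/div\s*>/is';
    preg_match_all($pattern, $html, $matches);
    return implode('', $matches[0]);
}

// Case, whitespace and quote style are tolerated:
echo extract_toto_div("<p>Hi</p><DIV class = 'toto'>blablabla</DIV><script>x()</script>"), "\n";
// prints: <DIV class = 'toto'>blablabla</DIV>

// A nested div breaks it: the lazy match stops at the first </div>.
echo extract_toto_div('<div class="toto"><div>a</div>b</div>'), "\n";
// prints: <div class="toto"><div>a</div>
```

The second call shows why the "no nested div" guarantee matters: the result is truncated markup.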

Since you probably can't make those guarantees (it would be an exceptional case if you could), you need a DOM parser.

Even if you do make regex work for this very specific case, if you are going to have possible changes or perform similar tasks elsewhere then the DOM parser will quickly become worth it.
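For comparison, a minimal sketch of the DOM approach using PHP's bundled DOMDocument and DOMXPath. The helper name `keep_divs_with_class` is hypothetical. It assumes the class name is a plain token without quotes, and that target divs are not nested inside each other (a nested match would be emitted twice).

```php
<?php
// Hypothetical helper: returns the outer HTML of every <div> whose class
// list contains $class and drops everything else.
function keep_divs_with_class(string $html, string $class): string
{
    if (trim($html) === '') {
        return ''; // loadHTML() rejects empty input
    }

    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Match $class as a whole word inside the class attribute,
    // so "toto" does not match "tototo".
    $query = sprintf(
        '//div[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $class
    );

    $kept = '';
    foreach ($xpath->query($query) as $div) {
        $kept .= $doc->saveHTML($div); // serialize just this node
    }
    return $kept;
}

echo keep_divs_with_class(
    '<p>Hello</p><script>alert(1)</script><div class="toto">blablabla</div>',
    'toto'
), "\n";
// prints: <div class="toto">blablabla</div>
```

Unlike the regex route, this keeps working when the div carries extra attributes or contains other divs.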

EDIT - had mistakenly categorized whitespace, fixed.


(I'm assuming from the PHP tag that the following would be useful...)

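strip_tags() gets close to this: it removes every tag except the ones you allow. Note that it keeps the text inside the removed tags and cannot filter by class. The sample code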

<?php
$text = '<p>Test paragraph.</p> <div class="blah">Other text</div>';

echo strip_tags($text, '<div>');
echo "\n";
?>

produces the output

Test paragraph. <div class="blah">Other text</div>
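One caveat worth checking against the original requirement: strip_tags() filters by tag name only. The short sketch below (the input string is made up) suggests that the text inside a removed script tag survives and that every div is kept, whatever its class.

```php
<?php
// Made-up input: a script tag, a div with another class, and the wanted div.
$text = '<script>alert(1)</script><div class="other">x</div><div class="toto">y</div>';

// Only the <script> tags are removed. Their text content stays,
// and both divs pass through regardless of class.
$result = strip_tags($text, '<div>');
echo $result, "\n";
// prints: alert(1)<div class="other">x</div><div class="toto">y</div>
```

So strip_tags() alone may not be enough when only the div with class "toto" should remain.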


WordPress and others still use kses, and it seems to be quite a flexible filter. Have a look at it here: http://sourceforge.net/projects/kses/

Example:

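// Allow only <div>, and on it only a class attribute that carries a value.
// kses checks attributes by name, not by a specific value such as 'toto'.
$string = kses($string, array('div' => array('class' => array('valueless' => 'n'))));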