Remove all html tags and content except for a div class

2023-02-03 11:54 Q&A author:

I want to remove all HTML content from a string except for a div with a specific class: <div class="toto">blablabla</div>

Should I use a Regex or DOM Parser?

To answer drachenstern:

It's comment content with BBCode. The HTML in this div is generated with GeSHi (a code highlighter), so I don't want to delete it.

For example, a visitor can enter <script></script> inside a [code][/code] BBCode tag. All HTML outside the [code][/code] BBCode tag must be deleted, no?

It's almost impossible to use a regex to reliably extract data from a webpage, so I would suggest using a dedicated HTML parser instead. Some problems are just too big to tackle yourself every time, so it's considered fine and acceptable to use an external library.

If you only need the one div, I would like to know what you're trying to do with the particular div. Can it work clientside or must it happen serverside? Are you trying to write a scraper?

I think you already know the answer :-)

Seriously though... in this case it depends on the complexity of the surrounding string. If it varies a lot, or the snippet you want to keep is nested deeply, then a DOM parser makes sense. If the surrounding contents are pretty predictable, then a regex may do. I would probably use the DOM either way because it is going to be easier to work with and the most failsafe.

Unless you can guarantee that this section of the string is EXACTLY of the form <div class="toto".....</div>, regex can't do it. By exactly I mean mainly no other attributes before class and no other div element inside this one. Upper/lowercase, white space, and single/double quotes should be able to be handled by regex.
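To make those guarantees concrete, here is a minimal regex sketch that only holds when they do: class is the first attribute and no other div sits inside the one being kept. The helper name keepDivsByRegex() and the sample string are made up for illustration.

```php
<?php
// Sketch only: valid under the guarantees above (class is the first attribute,
// no <div> nested inside the kept one). keepDivsByRegex() is a hypothetical name.
function keepDivsByRegex(string $html, string $class): string
{
    // i = case-insensitive, s = let "." span newlines.
    // (["\']) ... \1 accepts single or double quotes, as long as they match.
    $pattern = '~<div\s+class\s*=\s*(["\'])' . preg_quote($class, '~')
             . '\1[^>]*>.*?</div\s*>~is';
    preg_match_all($pattern, $html, $matches);
    // $matches[0] holds every full match; everything else is discarded.
    return implode('', $matches[0]);
}

$html = "<p>intro</p><DIV class='toto' id=\"x\">keep <b>me</b></DIV><i>tail</i>";
echo keepDivsByRegex($html, 'toto'), "\n";
// prints: <DIV class='toto' id="x">keep <b>me</b></DIV>
```

The lazy .*? stops at the first closing div tag, which is exactly why a nested div breaks this approach: the match would end too early.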

Since you probably can't make those guarantees (it would be an exceptional case if you could), you need a DOM parser.

Even if you do make regex work for this very specific case, if you are going to have possible changes or perform similar tasks elsewhere then the DOM parser will quickly become worth it.
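For the DOM parser route, a minimal sketch using PHP's bundled DOMDocument and DOMXPath could look like the following. The helper name keepDivsByClass() and the sample comment string are assumptions for illustration, and the class name is assumed to be a trusted, simple value since it is pasted into the XPath expression.

```php
<?php
// keepDivsByClass() is a hypothetical helper, not part of PHP or any library.
// Assumes $class is a trusted, simple class name (it is pasted into the XPath).
function keepDivsByClass(string $html, string $class): string
{
    $doc = new DOMDocument();

    // User-submitted markup is often malformed; collect warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    // The XML declaration prefix is a widely used trick to make libxml read the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Match divs whose class list contains $class as a whole word.
    $match = sprintf('contains(concat(" ", normalize-space(@class), " "), " %s ")', $class);
    // Skip matching divs nested inside another matching div, so nothing is emitted twice.
    $query = sprintf('//div[%1$s][not(ancestor::div[%1$s])]', $match);

    $xpath = new DOMXPath($doc);
    $kept = '';
    foreach ($xpath->query($query) as $div) {
        // Serialize the div together with everything inside it.
        $kept .= $doc->saveHTML($div);
    }
    return $kept;
}

$comment = '<p>Hello</p><script>alert(1)</script>'
         . '<div class="toto">echo <span class="kw1">1</span>;</div>';
echo keepDivsByClass($comment, 'toto'), "\n";
```

Unlike the regex, this does not care about attribute order, quoting style, or divs nested inside the kept one, because the parser has already resolved all of that before the query runs.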

EDIT - had mistakenly categorized whitespace, fixed.

(I'm assuming from the PHP tag that the following would be useful...)

strip_tags() does exactly this. The sample code

<?php
$text = '<p>Test paragraph.</p> <div class="blah">Other text</div>';

echo strip_tags($text, '<div>');
echo "\n";
?>

produces the output

Test paragraph. <div class="blah">Other text</div>
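Two details of that output matter for the original question, and the short sketch below (with a made-up input string) shows both: strip_tags() removes the tags but keeps the text that was between them, and the allow-list works per tag name, so every div survives regardless of its class.

```php
<?php
// Made-up input: a script, a div with another class, and the div we want.
$text = '<script>alert(1)</script><div class="other">x</div><div class="toto">y</div>';

// Only <div> tags are allowed through; attributes are not inspected.
$result = strip_tags($text, '<div>');

echo $result, "\n";
// prints: alert(1)<div class="other">x</div><div class="toto">y</div>
```

So strip_tags() alone neither drops the content outside the div nor filters on class="toto"; it would need to be combined with one of the other approaches for that.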

WordPress and others still use kses, and it seems to be quite a flexible filter. Have a look at it here: http://sourceforge.net/projects/kses/

Example:

$string = kses($string, array('div' => array('valueless' => 'n', 'value' => 'toto')));

Tags: dom php regex
