Perl HTML::Strip whitelist

Posted: 2023-04-03 11:50

Is there a way to give HTML::Strip a whitelist of tags that it should preserve?

For example, this input:

<div><b>test</b></div>

Stripped with this code:

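use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Slurp the whole file; a plain <FILE> in scalar context reads only the first line
open my $fh, '<', 'test.markup' or die "Cannot open test.markup: $!";
my $raw_html = do { local $/; <$fh> };
my $clean_text = $hs->parse($raw_html);
$hs->eof;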

Produces this output:

test

However, with the <b> tag whitelisted, I would like to get this output instead:

<b>test</b>

EDIT: ONE SOLUTION

Using HTML::StripScripts::Parser:

use HTML::StripScripts::Parser;

my $hss = HTML::StripScripts::Parser->new(
    {
        Context        => 'Inline',
        EscapeFiltered => 0,
        BanAllBut      => [qw(i b u)],
    },
    strict_comment => 0,
    strict_names   => 0,
);

$hss->filter_html("<div><b>test</b></div>");
my $cooked = $hss->filtered_document;
$cooked =~ s/<!--filtered-->//g;    # drop the markers left where tags were filtered
print $cooked;                      # <b>test</b>

Having read both the Perl wrapper and the underlying XS code of HTML::Strip, I can say that it has no whitelist capability.

Adding one is possible, though not entirely trivial: the code already checks tag names against a list of "strip" tags such as <script>, and it is only about 200 lines long.

As another approach, the regular expressions book from O'Reilly has a recipe for stripping HTML tags that includes whitelist support.
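To illustrate the regex idea, here is a minimal sketch in plain Perl. It is not the recipe from the book, and strip_tags_except is a hypothetical helper name, not part of HTML::Strip. It assumes reasonably well-formed markup: a ">" inside an attribute value, or the contents of <script> and <style> elements, would need extra handling.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): remove every tag whose
# name is not whitelisted, keeping the text between the tags.
sub strip_tags_except {
    my ($html, @allowed) = @_;
    my %keep = map { lc($_) => 1 } @allowed;

    # Remove comments first so a '>' inside one cannot confuse the tag pattern.
    $html =~ s/<!--.*?-->//gs;

    # Match an opening or closing tag and capture the slash and the tag name.
    # Whitelisted tags are re-emitted without attributes; all others are dropped.
    $html =~ s{<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>}{
        $keep{ lc $2 } ? "<$1$2>" : ''
    }ge;

    return $html;
}

print strip_tags_except('<div><b>test</b></div>', qw(b i u)), "\n";   # <b>test</b>
```

Re-emitting whitelisted tags without their attributes is a deliberate choice in this sketch, since attributes such as onclick can carry script. If attributes must survive, they would need their own whitelist.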

If you would rather not work with regular expressions, try HTML::StripScripts::Parser, which appears to use whitelists.

Tags: html-parsing, perl, perl-module
