Perl HTML::Strip whitelist

Is there a way to give the module a whitelist so that it preserves certain tags?

Given the markup below:

<div><b>test</b></div>

stripped with this code:

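use strict;
use warnings;
use HTML::Strip;

my $hs = HTML::Strip->new();

# Read the whole file, not just its first line
open my $fh, '<', 'test.markup' or die "Cannot open test.markup: $!";
my $raw_html = do { local $/; <$fh> };
close $fh;
my $clean_text = $hs->parse( $raw_html );
$hs->eof;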

it produces the output below:

test

However, I would like the <b> tag to be whitelisted, so that the output is:

<b>test</b>

EDIT: one solution

Using HTML::StripScripts::Parser

use HTML::StripScripts::Parser;

my $hss = HTML::StripScripts::Parser->new(
     {
         Context => 'Inline',
         EscapeFiltered  => 0,
         BanAllBut       => [qw(i b u)],   # the whitelist
     },
     strict_comment => 0,
     strict_names   => 0,
);

$hss->filter_html("<div><b>test</b></div>");
my $cooked = $hss->filtered_document;
# Remove the placeholder comments left where tags were filtered out
$cooked =~ s/<!--filtered-->//g;
print $cooked;  # <b>test</b>


Having read both the Perl wrapper and the underlying XS code, I can say that HTML::Strip has no whitelist capability.

It is possible to add one, though it is not 100% trivial: the code already checks tag names against "strip" tags like <script>, and it is only 200 lines of code.


As another approach, the regex book from O'Reilly has a regular expression recipe that can strip HTML tags (including whitelist capability).
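To give a rough idea of what a regex-based whitelist can look like, here is a minimal sketch in core Perl. It is not the recipe from the book, and strip_tags_except is a hypothetical helper name made up for this illustration. It assumes simple, well-formed markup: it does not handle comments, a ">" inside an attribute value, or the contents of <script> and <style>, and it drops all attributes from the tags it keeps.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Strip): remove every tag
# except those whose names appear in the whitelist array ref.
sub strip_tags_except {
    my ($html, $allowed) = @_;

    # Case-insensitive lookup table of whitelisted tag names
    my %keep = map { lc($_) => 1 } @$allowed;

    # $1 is the optional closing slash, $2 is the tag name.
    # Whitelisted tags are re-emitted without attributes;
    # everything else is replaced with the empty string.
    $html =~ s{<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>}
              { $keep{ lc($2) } ? "<$1$2>" : "" }ge;

    return $html;
}

print strip_tags_except('<div><b>test</b></div>', ['b']), "\n";
# prints: <b>test</b>
```

For untrusted input a real parser is the safer choice; a regex like this is only reasonable when you control the markup.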


If you'd rather not mess with regexes, try HTML::StripScripts::Parser; it appears to use whitelists.
