How can I rewrite the text parts of HTML using HTML::Parser without altering the <script> and <head> sections?

2023-01-08 06:14 问答作者：

The following code is a shortened version of an example from HTML::Parser

#!/usr/bin/perl -w
use strict;
my $code = shift || usage();
sub edit_print { local $_ = shift; tr/a-z/n-za-m/; print } 
use HTML::开发者_如何学运维Parser 3.05;
my $p = HTML::Parser->new(unbroken_text => 1,
     default_h => [ sub { print @_; }, "text" ],
     text_h    => [ \&edit_print,      "text" ],
);
my $file = shift;
$p->parse_file($file)

This code works pretty well, but it has a disadvantage that it also rewrites the text inside <script> and also <head> sections. I've adapted the example above to do what I want, but unfortunately there is a remaining bug where it rewrites things like the text inside the <title> tag which I don't want to rewrite.

Does anyone know how to write something like the above, but without mangling the JavaScript, <title>, or other sections? I'm happy to use another module apart from HTML::Parser if necessary.

Add start and end handlers to your parser, and have them record the ancestry of the current element. When the ancestry contains <head> or <script>, disable rewriting.

Keep your front matter

#! /usr/bin/perl

use warnings;
use strict;

use HTML::Parser 3.05;

sub edit_print { local $_ = shift; tr/a-z/n-za-m/; print }

and use the following sub to create a new parser:

sub create_parser {
  my @tags;
  my $start = sub {
    my($text,$tagname) = @_;
    push @tags => $tagname;
    print $text;
  };
  my $end = sub {
    my($text,$tagname) = @_;
    die "$0: expected </$tags[-1]>, got </$tagname>"
      unless $tagname eq $tags[-1];
    pop @tags;
    print $text;
  };
  my $edit_print = sub {
    if (grep /^(head|script)$/, @tags) { print @_ }
    else                               { edit_print @_ }
  };

  HTML::Parser->new(
    unbroken_text => 1,
    default_h     => [ sub { print @_ }, "text" ],
    text_h        => [ $edit_print,      "text" ],
    start_h       => [ $start,           "text,tagname" ],
    end_h         => [ $end,             "text,tagname" ],
  );
}

The reason for creating it inside a sub is the handler callbacks are closures that share private state in @tags. This implementation allows you to instantiate multiple parsers without worrying about them stomping on each other's data.

my $p = create_parser;
$p->parse_file(\*DATA);

__DATA__
foo
<html>
<head>
<title>My Title</title>
<style type="text/css">
  /* don't change me */
</style>
</head>
<body>
<script type="text/javascript">
  // or me
</script>
<h1>My Document</h1>
<p>Yo.</p>
</body>
</html>

Output:

sbb
<html>
<head>
<title>My Title</title>
<style type="text/css">
  /* don't change me */
</style>
</head>
<body>
<script type="text/javascript">
  // or me
</script>
<h1>Ml Dbphzrag</h1>
<p>Yb.</p>
</body>
</html>

Looking at your existing code, I'm not sure where you're stuck:

add a stack of Booleans
```
my @do_edit = (0)
```
in edit_print, don't edit if $do_edit[0] is 0
add start_h and end_h handlers to shift/unshift values for certain element names

继续阅读：perl

How can I rewrite the text parts of HTML using HTML::Parser without altering the <script> and <head> sections?

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？