开发者

How can I render HTML as text using Perl as Lynx does? [duplicate]

This question already has answers here: Closed 12 years ago.

Possible Duplicate:

Which CPAN module would you recommend for turning HTML into plain text? 开发者_运维百科

Question:

  • Is there a module to render HTML, specifically to gather the text, while adhering to font-style tags, such as <tt>, <b>, <i>, etc and break-line <br>, similar to Lynx.

For example:

# cat test.html

<body>  
<div id="foo" class="blah">  
<tt>test<br>
<b>test</b><br>
whatever<br>
test</tt>
</div>
</body>

# lynx.exe --dump test.html

test
test
whatever
test

Note: the second line should be bold.


Lynx is a big program and its html rendering will be non trivial.

How about this:

my $lynx = '/path/to/lynx';
my $html = [ html here ];
my $txt = `$lynx --dump --width 9999 -stdin <<EOF\n$html\nEOF\n`;


Go to search.cpan.org and search for HTML text which will give you lots of options to suit your particular needs. HTML::FormatText is a good baseline, and then branch out into specific variations of it, for example HTML::FormatText::WithLinks if you want to preserve links as footnotes.


I am on Windows so I cannot fully test this but you can adapt htext that comes with HTML::Parser:

#!/usr/bin/perl

use strict; use warnings;

use HTML::Parser;
use Term::ANSIColor;

use HTML::Parser 3.00 ();

my %inside;

sub tag {
   my($tag, $num) = @_;
   $inside{$tag} += $num;
   print " ";  # not for all tags
}

sub text {
    return if $inside{script} || $inside{style};
    my $esc = 1;
    if ( $inside{b} or $inside{strong} ) {
        print color 'blue';
    }
    elsif ( $inside{i} or $inside{em} ) {
        print color 'yellow';
    }
    else {
        $esc = 0;
    }
    print $_[0];
    print color 'reset' if $esc;
}

HTML::Parser->new(api_version => 3,
    handlers => [
        start => [\&tag, "tagname, '+1'"],
        end   => [\&tag, "tagname, '-1'"],
        text  => [\&text, "dtext"],
    ],
    marked_sections => 1,
)->parse_file(shift) || die "Can't open file: $!\n";;
0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜