perl HTML::TableExtract get stripped text

2023-03-21 03:46

The rows of my HTML table look like this:

<TR bgcolor="#FFFFFF" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#FFFFFF';">
   <TD  class="dlfont">07/01/2011 10:33 AM EDT</B>&nbsp;</TD>
   <TD  class="dlfont">DRB</B>&nbsp;</TD><TD  class="dlfont">Blah</B>&nbsp;</TD>
   <TD  class="dlfont">PPD</B>&nbsp;</TD><TD  class="dlfont"> </B>&nbsp;</TD>
   <TD  class="dlfont">07/01/2011</B>&nbsp;</TD>
   <TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTDRBPPD')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>


<TR bgcolor="#EEEEEE" onmouseover="this.bgColor='#DBE9FF';" onmouseout="this.bgColor='#EEEEEE';">
    <TD  class="dlfont">07/01/2011 10:33 AM EDT</B>&nbsp;</TD>
    <TD  class="dlfont">WHPSF</B>&nbsp;</TD>
    <TD  class="dlfont">Blah</B>&nbsp;</TD>
    <TD  class="dlfont"> </B>&nbsp;</TD>
    <TD  class="dlfont"> </B>&nbsp;</TD>
    <TD  class="dlfont">07/01/2011</B>&nbsp;</TD>  
    <TD width=50 align=center><A HREF="javascript:parent.nav.details('0701201110:33AMEDTWHPSF')"><IMG border='0' src='/images/view.gif' height=10 width=19></A></TD>
</TR>

When I extract the rows using HTML::TableExtract, the extra characters </B>&nbsp; also appear at the end of each cell and show up as some kind of special character. How can I get rid of them?

I would keep two things in mind when using HTML::TableExtract with the badly formed HTML in your question:

Use keep_html => 1 in the HTML::TableExtract constructor.
Use a regex to remove the </B>&nbsp;, carefully.

Here's some Perl code I wrote to prune the </B>&nbsp; out of the table cells. Note that it could turn validly formed HTML into badly formed HTML if you apply it blindly in all cases.

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TableExtract;

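my ($f) = @ARGV;
die "usage: $0 file.html\n" unless defined $f;

# Read the whole file into one string
open my $fh, '<', $f or die "Cannot open $f: $!";
my $html = join '', <$fh>;
close $fh;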

### Your HTML didn't include headers, so I added a first table row whose
### cells contain the text time, a, b, c, d, e, f to help HTML::TableExtract
### find the table in the file $f.
my $te = HTML::TableExtract->new(
    keep_html=>1,
    headers=>[qw/ time a b c d e f/]);

$te->parse($html);

for my $ts($te->tables)
{
    print "Table(", join(',', $ts->coords), "):\n";
    for my $row ($ts->rows)
    {
        for my $cell (@$row)
        {
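            # Skip missing or empty cells, but keep a cell whose text is "0"
            next unless defined $cell && length $cell;
            ## Maybe add $ at the end of the regex, or some other test here, to make
            ## sure valid cases of <B>...</B>&nbsp; are not affected.
            $cell =~ s/<\/B>&nbsp;//i;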
            print $cell."\n";
        }
    }
}
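
The comment in the loop above suggests anchoring the regex so that valid <B>...</B>&nbsp; markup is left alone. Here is a minimal sketch of one way to do that, using only core Perl. The helper name clean_cell is hypothetical and not part of HTML::TableExtract. It assumes the stray </B> always sits at the very end of the cell, and that the non-breaking space may arrive either as the literal entity &nbsp; or as an already decoded U+00A0 character.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::TableExtract): remove a stray
# </B> plus non-breaking space, but only at the very end of a cell.
sub clean_cell {
    my ($cell) = @_;
    return $cell unless defined $cell;

    # A cell with an opening <B ...> tag has a legitimate closing </B>,
    # so leave it untouched. \b keeps <BR> from counting as <B>.
    return $cell if $cell =~ /<B\b/i;

    # Strip </B> followed by &nbsp; (or a decoded U+00A0), anchored at
    # the end of the string with \z.
    $cell =~ s{</B>\s*(?:&nbsp;|\x{A0})\s*\z}{}i;
    return $cell;
}

# Sample cells taken from the rows in the question, plus one valid case.
for my $cell ('DRB</B>&nbsp;', ' </B>&nbsp;', '<B>bold</B>&nbsp;') {
    print '[', clean_cell($cell), "]\n";
}
```

In the loop above, calling $cell = clean_cell($cell); in place of the unanchored substitution would leave any cell with a real <B> element as it is.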
