Single line regular expression needed for a pattern in Perl

2023-02-19 13:58 问答作者：

I need to read many HTML files containing similar structure using perl.

The structure consists of STRRRR...E

S=html header just before table begins
T=unique table start structure in the html file(I can identify it)
R=Group of html elements(those are tr's, I can identify it too)
E=All remaining - singnifies end R's

I want to extract all R's in array using single line "m" perlop.

I'm looking for something li开发者_运维技巧ke this:

@all_Rs = $htmlfile=~m{ST(R)*E}gs;

But it has never worked out.

Until now I've been doing round about way to do it like using deleting unwanted text, for loop etc. I want to extract all rows from this page: http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx and there are many such pages.

Regex is the wrong tool. Use an HTML parser.

use HTML::TreeBuilder::XPath;
my $tree= HTML::TreeBuilder::XPath->new_from_content(<<'END_OF_HTML');
<html>
    <table>
        <tr>1
        <tr>2
        <tr>3
        <tr>4
        <tr>5
    </table>
</html>
END_OF_HTML

print $_->as_text for $tree->findnodes('//tr');

HTML::TreeBuilder::XPath inherits from HTML::TreeBuilder.

daxim is right about using a real parser. My personal choice is XML::LibXML.

use XML::LibXML
my $parser = XML::LibXML->new();
$parser->recover(1);                 # don't fail on parsing errors
my $doc = do { 
    local $SIG{__WARN__} = sub {};   # silence warning about parsing errors
    $parser->parse_html_file('http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx');
};

print $_->toString() for $doc->findnodes('//tr[td[1][@class="td_background"]]');

This gets me each station row from that page.

For a bit more work we can have a nice data structure to hold the text in each cell.

use Data::Dumper;
my @data = map {
    my $row = $_;
    [ map {
        $_->findvalue('normalize-space(text())');
    } $row->findnodes('td') ]
} $doc->findnodes('//tr[td[1][@class="td_background"]]');
print Dumper \@data;

If you want to process an HTML table, consider using a module that knows how to process HTML tables!

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;


my $html = get 'http://www.trainenquiry.com/StaticContent/Railway_Amnities/Enquiry%20-%20North/STATIONS.aspx';
$html =~ s/&nbsp;/ /g;

my $te = new HTML::TableExtract( depth => 1, count => 2 );
$te->parse($html);
foreach my $ts ($te->table_states) {
   foreach my $row ($ts->rows) {
      next if $row->[0] =~ /^\s*(Next|Station)/;
      next if $row->[4] =~ /^\s*(ARR\/DEP|RESERVATION)/;
      foreach my $cell (@$row) {
          $cell =~ s/^\s+//;
          $cell =~ s/\s+$//;
          print "$cell\n";
      }
      print "\n";
   }
}

继续阅读：perl regex

Single line regular expression needed for a pattern in Perl

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？