How can I extract the contents of a specific table from HTML source using Perl?

2023-01-20 10:09

I have to parse 5000 files that all look nearly identical.

I would like to use HTML::TokeParser::Simple and DBI to do the parsing and store the results.

I have little experience with HTML::TokeParser::Simple, and this task goes over my head. Note: I have also had a look at the XPath ideas, which seem to be an appropriate way as well. At the moment, though, I have trouble getting the corresponding XPath expressions: I tried to determine the expressions that need to be filled in in the Perl program.

This is what I have right now:

use strict;

use HTML::TreeBuilder::XPath;

my $tree = HTML::TreeBuilder::XPath->new;

#use real file name here
open(my $fh, "<", "file.html") or die $!;

$tree->parse_file($fh);

my ($name)       = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($type)       = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress)     = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($adress_two) = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($telephone)  = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($fax)        = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($internet)   = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($officer)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($employees)  = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($offices)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($worker)     = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($country)    = $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});
my ($the_council)= $tree->findnodes(qq{/html/body/table/tr[1]/td[2]});


print $name->as_text;
print $type->as_text;
print $adress->as_text;
print $adress_two->as_text;
print $telephone->as_text;
print $fax->as_text;
print $internet->as_text;
print $officer->as_text;
print $employees->as_text;
print $offices->as_text;
print $worker->as_text;
print $country->as_text;
print $the_council->as_text;

Is this all right? Note: I want to store this in a database.

BTW: See one of the example sites:

http://www.kultusportal-bw.de/servlet/PB/menu/1188427/index.html?COMPLETEHREF=http://www.kultus-bw.de/did_abfrage/detail.php?id=04313488

In the grey-shaded block you can see the wanted information: 17 lines in total. Note: I have 5000 different HTML files, all structured in exactly the same way!

That means I would be happy to have a template that can be run with HTML::TokeParser::Simple and DBI.

Can I make use of the above-mentioned code, or do I have to change it?

Love to hear from you! That would be great!!
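
A note on the XPath attempt above: all thirteen expressions are identical (tr[1]/td[2]), so every variable would receive the same cell. If each row of the table holds one label and one value, the row index is the part that has to vary. The following is only a minimal sketch under that assumption; the inline HTML is a made-up miniature of the page, and the %field hash is a name chosen for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical miniature of the page: one label/value pair per table row.
my $html = <<'HTML';
<html><body><table>
<tr><td>Telefon:</td><td>07361/680040</td></tr>
<tr><td>Kreis:</td><td>Ostalbkreis</td></tr>
</table></body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

my %field;    # label => value (illustrative name)

# Walk every row instead of hard-coding tr[1], tr[2], ...
for my $tr ( $tree->findnodes('//table//tr') ) {
    my ( $label, $value ) = map { $_->as_text } $tr->findnodes('./td');
    next unless defined $value;    # skip rows with fewer than two cells
    $label =~ s/:\s*\z//;          # drop the trailing colon of the label
    $field{$label} = $value;
}
$tree->delete;

print "$_ => $field{$_}\n" for sort keys %field;
```

Looping over the rows keeps the script independent of the exact number of lines in the block, which may be helpful if some of the 5000 files have a row more or less.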

Use some HTML::TableExtract magic:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::TableExtract;

my $te = HTML::TableExtract->new( attribs => {
    border => 0,
    bgcolor => '#EFEFEF',
    leftmargin => 15,
    topmargin => 5,
});

$te->parse_file('kultus-bw.html');
my ($table) = $te->tables
    or die "No matching table found\n";

for my $row ( $table->rows ) {
    cleanup(@$row);
    print "@$row\n";
}

sub cleanup {
    for ( @_ ) {
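        s/\A\s+//;        # strip leading whitespace only
        s/[\xa0 ]+\z//;   # strip trailing spaces and non-breaking spaces
        s/\s+/ /g;        # collapse remaining whitespace runs to one space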
    }
}

Output:

Schul-/Behördenname: Abendgymnasium Ostwürttemberg
Schulart: Privatschule (04313488)
Hausadressse: Friedrichstr.70, 73430 Aalen
Postfachadresse: Keine Angabe
Telefon: 07361/680040
Fax: 07361/680040
E-Mail: Keine Angabe
Internet: www.abendgymnasium-ostwuerttemberg.de 
ÜbergeordneteDienststelle: Regierungspräsidium Stuttgart Abteilung 7 Schule und Bildung 
Schulleitung: Keine Angabe
Stellv.Schulleitung: Keine Angabe
AnzahlSchüler: 259
AnzahlKlassen: 8
AnzahlLehrer: Keine Angabe
Kreis: Ostalbkreis
Schulträger: <Verband/Verein> (Verband/Verein)

Of course, I saved a local copy of the page before running the script.
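
Since the question also asks about storing the results with DBI and about handling all 5000 files, here is a rough sketch of how the same extraction could be wrapped in a loop. Everything specific in it is an assumption for illustration: SQLite through DBD::SQLite, the database file schools.db, the table school_fields, the directory pages/ and the helper clean() are not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use HTML::TableExtract;

# Assumed setup: SQLite via DBD::SQLite and saved pages in ./pages/.
my $dbh = DBI->connect( 'dbi:SQLite:dbname=schools.db', '', '',
    { RaiseError => 1, AutoCommit => 0 } );

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS school_fields (
    file  TEXT NOT NULL,
    label TEXT NOT NULL,
    value TEXT
)
SQL

my $sth = $dbh->prepare(
    'INSERT INTO school_fields (file, label, value) VALUES (?, ?, ?)');

for my $file ( glob 'pages/*.html' ) {
    # A fresh extractor per file, so tables from earlier files do not pile up.
    my $te = HTML::TableExtract->new( attribs => {
        border     => 0,
        bgcolor    => '#EFEFEF',
        leftmargin => 15,
        topmargin  => 5,
    });
    $te->parse_file($file);

    my ($table) = $te->tables;
    unless ($table) {
        warn "No matching table in $file\n";
        next;
    }

    for my $row ( $table->rows ) {
        my ( $label, $value ) = map { clean($_) } @$row;
        next unless length $label;    # skip rows without a label
        $label =~ s/:\z//;            # "Telefon:" becomes "Telefon"
        $sth->execute( $file, $label, $value );
    }
}

$dbh->commit;
$dbh->disconnect;

# Normalise one cell: non-breaking spaces become spaces, whitespace runs
# collapse to one space, leading and trailing space is removed.
sub clean {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/\xa0/ /g;
    $text =~ s/\s+/ /g;
    $text =~ s/\A | \z//g;
    return $text;
}
```

One row per label/value pair keeps the schema independent of which fields a page happens to contain; a wide table with one column per field would work as well if the set of labels is fixed.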
