LWP::Simple - how to implement a loop into it [with live demo]
Good evening, dear community!
I want to process multiple web pages, much like a web spider or crawler would. I have some of the pieces, but I now need better spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50
This page has more than 6000 results. How do I get all of them? I use the module LWP::Simple, and I need to work out which arguments to pass in order to fetch all 6150 records.
Attempt: Here are the first 5 page URLs:
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200
We can see that the "s" parameter in the URL starts at 0 for page 1, then increases by 50 for each page thereafter. We can use this information to create a loop:
my $i_first    = 0;
my $i_last     = 6100;
my $i_interval = 50;
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
    # process $pageurl
}
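As a quick sanity check on those loop bounds, here is a minimal sketch that builds the full list of page URLs up front and counts them. It assumes the page size stays at 50 and the total stays at 6150 records, as stated above; the helper name page_urls is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: build one URL per result page.
# $total and $per_page are assumptions taken from the question (6150 and 50).
sub page_urls {
    my ($total, $per_page) = @_;
    my $base = 'http://192.68.214.70/km/asps/schulsuche.asp?q=e';
    my @urls;
    for (my $s = 0; $s < $total; $s += $per_page) {
        push @urls, "$base&a=$per_page&s=$s";
    }
    return @urls;
}

my @urls = page_urls(6150, 50);
printf "%d pages, last one: %s\n", scalar @urls, $urls[-1];
```

With 6150 records at 50 per page this gives 123 pages, and the last offset is 6100, which matches the $i_last value used in the loop.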
tadmc (a very, very supportive user) has created a great script that outputs CSV-formatted results. I have built this loop into the code. (Note: I guess something has gone wrong. See the musings below, with the code snippets and the error messages.)
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $i_first = "0";
my $i_last = "6100";
my $i_interval = "50";
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
#process pageurl
}
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d; # strip the carriage returns
$html =~ s/ / /g; # expand the spaces
my $te = new HTML::TableExtract();
$te->parse($html);
my @cols = qw(
rownum
number
name
phone
type
website
);
my @fields = qw(
rownum
number
name
street
postal
town
phone
fax
type
website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
foreach my $row ($ts->rows) {
trim leading/trailing whitespace from base fields
s/^s+//, s/\s+$// for @$row;
load the fields into the hash using a "hash slice"
my %h;
@h{@cols} = @$row;
derive some fields from base fields, again using a hash slice
@h{qw/name street postal town/} = split /n+/, $h{name};
@h{qw/phone fax/} = split /n+/, $h{phone};
trim leading/trailing whitespace from derived fields
s/^s+//, s/\s+$// for @h{qw/name street postal town/};
$csv->combine(@h{@fields});
print $csv->string, "\n";
}
}
There have been some issues. I have made a mistake, and I guess the error is here:
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
#process pageurl
}
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d; # strip the carriage returns
$html =~ s/ / /g; # expand the spaces
I seem to have written the fetching code twice. I probably need to leave out one part, namely this one:
my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d; # strip the carriage returns
$html =~ s/ / /g; # expand the spaces
See the results on the command line:
martin@suse-linux:~> cd perl
martin@suse-linux:~/perl> perl bavaria_all_.pl
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
syntax error at bavaria_all_.pl line 59, near "/,"
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Substitution replacement not terminated at bavaria_all_.pl line 63.
martin@suse-linux:~/perl>
What do you think? I look forward to hearing from you.
By the way, here is the code created by tadmc, without any improved spider logic. It runs very nicely, without any issue, and prints nicely formatted CSV output:
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;
my $html = get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;        # strip the carriage returns
$html =~ s/&nbsp;/ /g;    # expand the spaces
my $te = HTML::TableExtract->new();
$te->parse($html);
my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);
my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);
my $csv = Text::CSV->new({ binary => 1 });
foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {
        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;
        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;
        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/} = split /\n+/, $h{phone};
        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};
        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
}
Note: the code above runs nicely and prints CSV-formatted output.
A different approach to achieve paging is to extract all URLs from the page and detect the pager URLs.
...
for (@urls) {
    if (is_pager_url($_) and not exists $seen{$_}) {
        push @pager_url, $_;
        $seen{$_}++;
    }
}
...
sub is_pager_url {
    my ($url) = @_;
    # Escape the dot so it matches literally; "&" needs no escaping.
    return $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+} ? 1 : 0;
}
This way you don't have to deal with incrementing counters or establishing the total number of pages. It will also work for different values of a and s. By keeping a %seen hash, you can cheaply avoid differentiating between prev and next pages.
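The fragments above can be put together into a small runnable sketch. The @urls list here is made-up sample data standing in for the links extracted from a fetched page (the example.com link is invented), and how you extract the links is left open.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# True if the URL looks like one of the result-pager links.
sub is_pager_url {
    my ($url) = @_;
    return $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+} ? 1 : 0;
}

# Sample data (assumption): links as they might appear on one result page,
# including a repeated pager link and an unrelated link.
my @urls = (
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0',
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50',
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0',
    'http://example.com/unrelated.html',
);

my (%seen, @pager_url);
for my $url (@urls) {
    next unless is_pager_url($url);
    next if $seen{$url}++;    # skip links that are already queued
    push @pager_url, $url;
}

print "$_\n" for @pager_url;
```

Only the two distinct pager links end up in @pager_url; the repeated link and the unrelated one are dropped.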
Excellent! I was waiting for you to figure out how to get the multiple pages on your own!
1) put my code inside of the page-getting loop (move the "}" way down to the end).
2) $html = get $pageurl; # change this to use your new URL
3) put my backslash back where I had it: tr/\r//d;
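Taken together, the three steps amount to moving the fetch and the processing inside the loop. Here is a minimal sketch of that shape. To keep it runnable without network access, the fetcher and the per-page processing are passed in as code references; the name for_each_page and the stand-in fetcher are hypothetical and not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: fetch every result page and hand its HTML to $process.
# $fetch is any code ref that takes a URL and returns the page content,
# or undef on failure. Returns the number of pages processed.
sub for_each_page {
    my ($fetch, $process, $first, $last, $step) = @_;
    my $pages = 0;
    for (my $i = $first; $i <= $last; $i += $step) {
        my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
        my $html = $fetch->($pageurl);    # step 2: use the new URL
        unless (defined $html) {
            warn "could not fetch $pageurl\n";
            next;
        }
        $html =~ tr/\r//d;                # step 3: backslash restored
        $process->($html, $pageurl);      # step 1: processing inside the loop
        $pages++;
    }
    return $pages;
}

# Stand-in fetcher so the sketch runs offline.
my $fake_fetch = sub { my ($url) = @_; return "<html>$url\r\n</html>" };
my $count = for_each_page($fake_fetch, sub { }, 0, 100, 50);
print "processed $count pages\n";
```

With LWP::Simple loaded, passing \&LWP::Simple::get as the fetcher and the table-extraction code as $process should give the real crawl. Creating a fresh HTML::TableExtract object for each page is probably the safest choice.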