Improving LWP::Simple Perl performance
Alas, I have yet another question:
I have been tasked with reading a webpage and extracting links from that page (easy stuff with HTML::TokeParser). My boss then insists that I read from these links, grab some details from each of those pages, and parse ALL of that information into an XML file, which can later be read.
So, I can set this up fairly simply like so:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
require HTML::TokeParser;
$|=1; # unbuffer output
my $base = 'http://www.something_interesting/';
my $path = 'http://www.something_interesting/Default.aspx';
my $rawHTML = get($path); # attempt to d/l the page to mem
my $p = HTML::TokeParser->new(\$rawHTML) || die "Can't open: $!";
open (my $out, '>', 'output.xml') or die "Can't open output.xml: $!";
while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    if ($url =~ /event\.aspx\?eventid=(\d+)/) {
        ( my $event_id = $url ) =~ s/event\.aspx\?eventid=(\d+)/$1/;
        my $text = $p->get_trimmed_text("/a");
        print $out $event_id,"\n";
        print $out $text,"\n";
        my $details = $base.$url;
        my $contents = get($details);
        # now set up another HTML::TokeParser, and parse each of those files.
    }
}
This would probably be OK if there were maybe 5 links on this page. However, I'm trying to read from ~600 links and grab info from each of those pages. So, needless to say, my method is taking a LONG time... I honestly don't know how long, since I've never let it finish.
It was my idea to simply write something that only gets the information as needed (e.g., a Java app that looks up the information from the link that you want)... however, this doesn't seem to be acceptable, so I'm turning to you guys :)
Is there any way to improve on this process?
You will probably see a speed boost -- at the expense of less simple code -- if you execute your get()s in parallel instead of sequentially.
Parallel::ForkManager is where I would start (and it even includes an LWP::Simple get() example in its documentation), but there are plenty of other alternatives to be found on CPAN, including the fairly dated LWP::Parallel::UserAgent.
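For illustration, a minimal sketch of that approach, assuming the ~600 detail-page URLs have already been collected into @urls (the limit of 10 concurrent children is arbitrary):
use strict;
use warnings;
use LWP::Simple;
use Parallel::ForkManager;

my @urls = @ARGV;                          # assume the event URLs were gathered earlier
my $pm   = Parallel::ForkManager->new(10); # run at most 10 downloads at once

for my $url (@urls) {
    $pm->start and next;    # fork; the parent immediately moves on to the next URL
    my $content = get($url);
    # ... parse $content and write its piece of the output here ...
    $pm->finish;            # child exits
}
$pm->wait_all_children;
Because each child is a separate process, results are not shared through ordinary variables; writing per-child output files, or passing data back through Parallel::ForkManager's run_on_finish callback, is the usual workaround.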
If you want to fetch more than one item from the same server and do so speedily, use keep-alive connections. Drop the simplistic LWP::Simple and use the regular LWP::UserAgent with the keep_alive option. That sets up a connection cache, so you will not incur the TCP connection set-up overhead when fetching more pages from the same host.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;
my @urls = @ARGV or die 'URLs!';
my %opts = ( keep_alive => 10 ); # cache 10 connections
my $ua = LWP::UserAgent->new( %opts );
for ( @urls ) {
    my $req = HEAD $_;
    print $req->as_string;
    my $rsp = $ua->request( $req );
    print $rsp->as_string;
}
my $cache = $ua->conn_cache;
my @conns = $cache->get_connections;
# has methods of Net::HTTP, IO::Socket::INET, IO::Socket
WWW::Mechanize is a great piece of work to start with, and if you are looking at modules, I'd also suggest Web::Scraper.
Both have docs at the links I provided and should help you get going quickly.
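As a rough sketch of what the link-gathering step could look like with WWW::Mechanize (the URL and the eventid pattern are taken from the question; the rest is illustrative):
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.something_interesting/Default.aspx');

# Find every link whose URL matches the event pattern from the question.
my @links = $mech->find_all_links( url_regex => qr/event\.aspx\?eventid=\d+/ );

for my $link (@links) {
    print $link->url_abs, "\t", $link->text, "\n";
    # $mech->get( $link->url_abs ) would fetch the detail page for further parsing
}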
There's a good chance it's blocking on the HTTP GET request while it waits for the response from the network. Use an asynchronous HTTP library and see if it helps.
use strict;
use warnings;

use threads;              # or: use forks;
use Thread::Queue qw( );

use constant MAX_WORKERS => 10;

# @urls, process_request() and process_response() are assumed to be defined elsewhere.
my $request_q  = Thread::Queue->new();
my $response_q = Thread::Queue->new();

# Create the workers.
my @workers;
for (1..MAX_WORKERS) {
    push @workers, async {
        while (my $url = $request_q->dequeue()) {
            $response_q->enqueue(process_request($url));
        }
    };
}

# Submit work to the workers.
$request_q->enqueue(@urls);

# Signal the workers they are done.
for (1..@workers) {
    $request_q->enqueue(undef);
}

# Wait for the workers to finish.
$_->join() for @workers;

# Collect the results. All workers have finished, so drain the queue with
# dequeue_nb() to avoid blocking forever once it is empty.
while (defined( my $item = $response_q->dequeue_nb() )) {
    process_response($item);
}
Your issue is that the scraping is more CPU-intensive than I/O-intensive. While most people here would suggest using more CPUs, I'll try to show a great advantage of Perl being used as a "glue" language. Everyone agrees that libxml2 is an excellent XML/HTML parser, and libcurl is an awesome download agent. However, in the Perl universe, many scrapers are based on LWP::UserAgent and HTML::TreeBuilder::XPath (which is similar to HTML::TokeParser, while being XPath-compliant). In that case, you can use drop-in replacement modules to handle the downloads and the HTML parsing via libcurl/libxml2:
use LWP::Protocol::Net::Curl;
use HTML::TreeBuilder::LibXML;
HTML::TreeBuilder::LibXML->replace_original();
I saw an average 5x speed increase just by prepending these 3 lines in several scrapers I used to maintain. But since you're using HTML::TokeParser, I'd recommend trying Web::Scraper::LibXML instead (plus LWP::Protocol::Net::Curl, which affects both LWP::Simple and Web::Scraper).
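A rough sketch of that combination, assuming Web::Scraper::LibXML exposes the same scraper/process DSL as Web::Scraper (the URL and eventid pattern come from the question; everything else is illustrative):
use strict;
use warnings;
use URI;
use LWP::Protocol::Net::Curl;   # route LWP (and LWP::Simple) downloads through libcurl
use Web::Scraper::LibXML;       # Web::Scraper backed by libxml2

# Declare what to extract: every event link's href and link text.
my $events = scraper {
    process '//a[contains(@href, "event.aspx?eventid=")]',
        'events[]' => { url => '@href', title => 'TEXT' };
};

my $res = $events->scrape( URI->new('http://www.something_interesting/Default.aspx') );

for my $e ( @{ $res->{events} || [] } ) {
    print "$e->{url}\t$e->{title}\n";
}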