
Improving LWP::Simple Perl performance

Alas, I have yet another question:

I have been tasked with reading a web page and extracting the links from it (easy stuff with HTML::TokeParser). My boss then insists that I follow each of those links, grab some details from each of those pages, and parse ALL of that information into an XML file, which can later be read.

So, I can set this up fairly simply like so:

#!/usr/bin/perl -w

use     strict;
use     LWP::Simple; 
require HTML::TokeParser; 

$|=1;                        # unbuffer output

my $base = 'http://www.something_interesting/';
my $path = 'http://www.something_interesting/Default.aspx';
my $rawHTML = get($path); # attempt to d/l the page to mem

my $p = HTML::TokeParser->new(\$rawHTML) || die "Can't open: $!";

open (my $out, '>', 'output.xml') or die "Can't open output.xml: $!";

while (my $token = $p->get_tag("a")) {

    my $url = $token->[1]{href} || "-";

    if ($url =~ /event\.aspx\?eventid=(\d+)/) {
        ( my $event_id = $url ) =~ s/event\.aspx\?eventid=(\d+)/$1/;
        my $text = $p->get_trimmed_text("/a");
        print $out $event_id,"\n";
        print $out $text,"\n";

        my $details = $base.$url;
        my $contents = get($details);

        # now set up another HTML::TokeParser, and parse each of those files.

    }
}

This would probably be OK if there were maybe 5 links on this page. However, I'm trying to read from ~600 links and grab info from each of those pages. So, needless to say, my method is taking a LONG time... I honestly don't know how long, since I've never let it finish.

My idea was to simply write something that only fetches the information as needed (e.g., a Java app that looks up the information for the link you want)... however, this doesn't seem to be acceptable, so I'm turning to you guys :)

Is there any way to improve on this process?


You will probably see a speed boost -- at the expense of less simple code -- if you execute your get()s in parallel instead of sequentially.

Parallel::ForkManager is where I would start (and even includes an LWP::Simple get() example in its documentation), but there are plenty of other alternatives to be found on CPAN, including the fairly dated LWP::Parallel::UserAgent.
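As a rough illustration, here is a minimal sketch of the forking approach; the URL source and the worker count of 10 are placeholders, not taken from the question:

use strict;
use warnings;
use LWP::Simple;
use Parallel::ForkManager;

my @urls = @ARGV;                           # the event-page URLs you extracted
my $pm   = Parallel::ForkManager->new(10);  # at most 10 children at a time

for my $url (@urls) {
    $pm->start and next;     # parent forks a child and moves on
    my $content = get($url);
    # ... parse $content with HTML::TokeParser and record the result ...
    $pm->finish;             # child exits here
}
$pm->wait_all_children;

Keep in mind that each get() runs in a separate process, so results have to be passed back to the parent (for example via the run_on_finish callback) or written out by each child and merged afterwards.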


If you want to fetch more than one item from a server and do so speedily, use TCP Keep-Alive. Drop the simplistic LWP::Simple and use the regular LWP::UserAgent with the keep_alive option. That will set up a connection cache, so you will not incur the TCP connection build-up overhead when fetching more pages from the same host.

use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;

my @urls = @ARGV or die 'URLs!';
my %opts = ( keep_alive => 10 ); # cache 10 connections
my $ua = LWP::UserAgent->new( %opts );
for ( @urls ) {
        my $req = HEAD $_;
        print $req->as_string;
        my $rsp = $ua->request( $req );
        print $rsp->as_string;
}

my $cache = $ua->conn_cache;
my @conns = $cache->get_connections;
# has methods of Net::HTTP, IO::Socket::INET, IO::Socket


WWW::Mechanize is a great piece of work to start with, and if you are looking at modules, I'd also suggest Web::Scraper.

Both are well documented on CPAN and should help you get going quickly.
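For example, a sketch of the WWW::Mechanize route, reusing the site URL and eventid pattern from the question's code:

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );
$mech->get('http://www.something_interesting/Default.aspx');

# Pull out only the event links, then visit each one.
my @links = $mech->find_all_links( url_regex => qr/event\.aspx\?eventid=\d+/ );
for my $link (@links) {
    $mech->get( $link->url_abs );
    # ... scrape $mech->content here, e.g. with Web::Scraper ...
}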


There's a good chance it's blocking on the HTTP GET request while it waits for the response from the network. Use an asynchronous HTTP library and see if it helps.
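For example, a sketch using AnyEvent::HTTP (one of several asynchronous HTTP clients on CPAN; the answer doesn't name a specific one, and the URL source is a placeholder):

use strict;
use warnings;
use AnyEvent;
use AnyEvent::HTTP;

my @urls = @ARGV;        # the event-page URLs
my $cv   = AE::cv;

for my $url (@urls) {
    $cv->begin;          # one more outstanding request
    http_get $url, sub {
        my ($body, $headers) = @_;
        # ... parse $body here; all requests are in flight concurrently ...
        $cv->end;
    };
}
$cv->recv;               # block until every callback has fired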


use strict;
use warnings;

use threads;  # or: use forks;

use Thread::Queue qw( );

use constant MAX_WORKERS => 10;

my $request_q  = Thread::Queue->new();
my $response_q = Thread::Queue->new();

# Create the workers.
my @workers;
for (1..MAX_WORKERS) {
   push @workers, async {
      while (my $url = $request_q->dequeue()) {
         # process_request() is your own fetch-and-parse routine; it should
         # return whatever you want collected at the end.
         $response_q->enqueue(process_request($url));
      }
   };
}

# Submit work to workers.
my @urls = @ARGV;  # or however you collected the event-page URLs
$request_q->enqueue(@urls);

# Signal the workers they are done.    
for (1..@workers) {
   $request_q->enqueue(undef);
}

# Wait for the workers to finish.
$_->join() for @workers;

# Collect the results. The workers have already been joined, so use the
# non-blocking dequeue to avoid hanging once the queue is empty.
while (defined(my $item = $response_q->dequeue_nb())) {
   process_response($item);
}


Your issue is that scraping is more CPU-intensive than I/O-intensive. While most people here would suggest using more CPUs, I'll try to show a great advantage of Perl being used as a "glue" language. Everyone agrees that libxml2 is an excellent XML/HTML parser, and libcurl is an awesome download agent. However, in the Perl universe, many scrapers are based on LWP::UserAgent and HTML::TreeBuilder::XPath (which is similar to HTML::TokeParser, while being XPath-compliant). In those cases, you can use drop-in replacement modules to handle downloads and HTML parsing via libcurl/libxml2:

use LWP::Protocol::Net::Curl;
use HTML::TreeBuilder::LibXML;
HTML::TreeBuilder::LibXML->replace_original();

I saw an average 5x speed increase just by prepending these three lines in several scrapers I used to maintain. But since you're using HTML::TokeParser, I'd recommend trying Web::Scraper::LibXML instead (plus LWP::Protocol::Net::Curl, which affects both LWP::Simple and Web::Scraper).
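A sketch of what that combination might look like, assuming Web::Scraper::LibXML exports the usual Web::Scraper DSL; the selector and URL are illustrative placeholders, not taken from the original page:

use strict;
use warnings;
use LWP::Protocol::Net::Curl;   # downloads via libcurl; affects LWP::Simple too
use Web::Scraper::LibXML;       # same DSL as Web::Scraper, backed by libxml2
use URI;

my $links = scraper {
    process 'a', 'hrefs[]' => '@href';   # collect every href on the page
};
my $result = $links->scrape( URI->new('http://www.something_interesting/Default.aspx') );
print "$_\n" for @{ $result->{hrefs} || [] };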
