Improving LWP::Simple Perl performance
Alas, I have yet another question:
I have been tasked with reading a webpage and extracting links from that page (easy stuff with HTML::TokeParser). My boss then insists that I read from these links, grab some details from each of those pages, and parse ALL of that information into an XML file, which can later be read.
So, I can set this up fairly simply like so:
#!/usr/bin/perl -w
use strict;
use LWP::Simple;
require HTML::TokeParser;
$|=1; # unbuffer output
my $base = 'http://www.something_interesting/';
my $path = 'http://www.something_interesting/Default.aspx';
my $rawHTML = get($path); # attempt to d/l the page to mem
my $p = HTML::TokeParser->new(\$rawHTML) || die "Can't open: $!";
open (my $out, '>', 'output.xml') or die "Can't open output.xml: $!";
while (my $token = $p->get_tag("a")) {
    my $url = $token->[1]{href} || "-";
    if ($url =~ /event\.aspx\?eventid=(\d+)/) {
        ( my $event_id = $url ) =~ s/event\.aspx\?eventid=(\d+)/$1/;
        my $text = $p->get_trimmed_text("/a");
        print $out $event_id,"\n";
        print $out $text,"\n";
        my $details = $base.$url;
        my $contents = get($details);
        # now set up another HTML::TokeParser, and parse each of those files.
    }
}
This would probably be OK if there were maybe 5 links on this page. However, I'm trying to read from ~600 links and grab info from each of those pages. So, needless to say, my method is taking a LONG time... I honestly don't know how long, since I've never let it finish.
It was my idea to simply write something that only gets the information as needed (e.g., a Java app that looks up the information from the link that you want)... however, this doesn't seem to be acceptable, so I'm turning to you guys :)
Is there any way to improve on this process?
You will probably see a speed boost -- at the expense of less simple code -- if you execute your get()s in parallel instead of sequentially.
Parallel::ForkManager is where I would start (and it even includes an LWP::Simple get() example in its documentation), but there are plenty of other alternatives to be found on CPAN, including the fairly dated LWP::Parallel::UserAgent.
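For illustration, a minimal sketch of that approach, assuming the ~600 detail-page URLs have already been collected into @urls (the limit of 10 concurrent children is arbitrary):
use strict;
use warnings;
use LWP::Simple;
use Parallel::ForkManager;

my @urls = @ARGV;                          # assume the event URLs were gathered earlier
my $pm   = Parallel::ForkManager->new(10); # run at most 10 downloads at once

for my $url (@urls) {
    $pm->start and next;    # fork; the parent immediately moves on to the next URL
    my $content = get($url);
    # ... parse $content and write its piece of the output here ...
    $pm->finish;            # child exits
}
$pm->wait_all_children;
Because each child is a separate process, results are not shared through ordinary variables; writing per-child output files, or passing data back through Parallel::ForkManager's run_on_finish callback, is the usual workaround.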
If you want to fetch more than one item from the same server and do so speedily, use keep-alive connections. Drop the simplistic LWP::Simple and use the regular LWP::UserAgent with the keep_alive option. That sets up a connection cache, so you will not incur the TCP connection set-up overhead when fetching more pages from the same host.
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common;
my @urls = @ARGV or die 'URLs!';
my %opts = ( keep_alive => 10 ); # cache 10 connections
my $ua = LWP::UserAgent->new( %opts );
for ( @urls ) {
    my $req = HEAD $_;
    print $req->as_string;
    my $rsp = $ua->request( $req );
    print $rsp->as_string;
}
my $cache = $ua->conn_cache;
my @conns = $cache->get_connections;
# has methods of Net::HTTP, IO::Socket::INET, IO::Socket
WWW::Mechanize is a great piece of work to start with, and if you are looking at modules, I'd also suggest Web::Scraper.
Both have docs at the links I provided and should help you get going quickly.
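As a rough sketch of what the link-gathering step could look like with WWW::Mechanize (the URL and the eventid pattern are taken from the question; the rest is illustrative):
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://www.something_interesting/Default.aspx');

# Find every link whose URL matches the event pattern from the question.
my @links = $mech->find_all_links( url_regex => qr/event\.aspx\?eventid=\d+/ );

for my $link (@links) {
    print $link->url_abs, "\t", $link->text, "\n";
    # $mech->get( $link->url_abs ) would fetch the detail page for further parsing
}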
There's a good chance it's blocking on the HTTP GET request while it waits for the response from the network. Use an asynchronous HTTP library and see if it helps.
use strict;
use warnings;

use threads;              # or: use forks;
use Thread::Queue qw( );

use constant MAX_WORKERS => 10;

# @urls, process_request() and process_response() are assumed to be defined elsewhere.
my $request_q  = Thread::Queue->new();
my $response_q = Thread::Queue->new();

# Create the workers.
my @workers;
for (1..MAX_WORKERS) {
    push @workers, async {
        while (my $url = $request_q->dequeue()) {
            $response_q->enqueue(process_request($url));
        }
    };
}

# Submit work to the workers.
$request_q->enqueue(@urls);

# Signal the workers they are done.
for (1..@workers) {
    $request_q->enqueue(undef);
}

# Wait for the workers to finish.
$_->join() for @workers;

# Collect the results. All workers have finished, so drain the queue with
# dequeue_nb() to avoid blocking forever once it is empty.
while (defined( my $item = $response_q->dequeue_nb() )) {
    process_response($item);
}
Your issue is that the scraping is more CPU-intensive than I/O-intensive. While most people here would suggest using more CPUs, I'll try to show a great advantage of Perl being used as a "glue" language. Everyone agrees that libxml2 is an excellent XML/HTML parser, and libcurl is an awesome download agent. However, in the Perl universe, many scrapers are based on LWP::UserAgent and HTML::TreeBuilder::XPath (which is similar to HTML::TokeParser, while being XPath-compliant). In that case, you can use drop-in replacement modules to handle the downloads and the HTML parsing via libcurl/libxml2:
use LWP::Protocol::Net::Curl;
use HTML::TreeBuilder::LibXML;
HTML::TreeBuilder::LibXML->replace_original();
I saw an average 5x speed increase just by prepending these 3 lines in several scrapers I used to maintain. But since you're using HTML::TokeParser, I'd recommend trying Web::Scraper::LibXML instead (plus LWP::Protocol::Net::Curl, which affects both LWP::Simple and Web::Scraper).
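A rough sketch of that combination, assuming Web::Scraper::LibXML exposes the same scraper/process DSL as Web::Scraper (the URL and eventid pattern come from the question; everything else is illustrative):
use strict;
use warnings;
use URI;
use LWP::Protocol::Net::Curl;   # route LWP (and LWP::Simple) downloads through libcurl
use Web::Scraper::LibXML;       # Web::Scraper backed by libxml2

# Declare what to extract: every event link's href and link text.
my $events = scraper {
    process '//a[contains(@href, "event.aspx?eventid=")]',
        'events[]' => { url => '@href', title => 'TEXT' };
};

my $res = $events->scrape( URI->new('http://www.something_interesting/Default.aspx') );

for my $e ( @{ $res->{events} || [] } ) {
    print "$e->{url}\t$e->{title}\n";
}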