Threads application terminates unexpectedly
I have little scraping application and trying to add multithreading to it. Here is code (MyMech is WWW::Mechanize subclass used to process HTTP errors):
#!/usr/bin/perl
use strict;
use MyMech;
use File::Basename;
use File::Path;
use HTML::Entities;
use threads;
use threads::shared;
use Thread::Queue;
use List::Util qw( max sum );
my $page = 1;
my %CONFIG = read_config();
my $mech = MyMech->new( autocheck => 1 );
$mech->quiet(0);
$mech->get( $CONFIG{BASE_URL} . "/site-map.php" );
my @championship_links =
$mech->find_all_links( url_regex => qr/\d{4}-\d{4}\/$/ );
foreach my $championship_link (@championship_links) {
my @threads;
my $queue = Thread::Queue->new;
my $queue_processed = Thread::Queue->new;
my $url = sprintf $championship_link->url_abs();
print $url, "\n";
next unless $url =~ m{soccer}i;
$mech->get($url);
my ( $last_round_loaded, $current_round ) =
find_current_round( $mech->content() );
unless ($last_round_loaded) {
print "\tLoading rounds data...\n";
$mech->submit_form(
form_id => "leagueForm",
fields => {
round => $current_round,
},
);
}
my @match_links =
$mech->find_all_links( url_regex => qr/matchdetails\.php\?matchid=\d+$/ );
foreach my $link (@match_links) {
$queue->enqueue($link);
}
print "Starting printing thread...\n";
my $printing_thread = threads->create(
sub { printing_thread( scalar(@match_links), $queue_proces开发者_Python百科sed ) } )
->detach;
push @threads, $printing_thread;
print "Starting threads...\n";
foreach my $thread_id ( 1 .. $CONFIG{NUMBER_OF_THREADS} ) {
my $thread = threads->create(
sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
->join;
push @threads, $thread;
}
undef $queue;
undef $queue_processed;
foreach my $thread ( threads->list() ) {
if ( $thread->is_running() ) {
print $thread->tid(), "\n";
}
}
#sleep 5;
}
print "Finished!\n";
sub printing_thread {
my ( $number_of_matches, $queue_processed ) = @_;
my @fields =
qw (
championship
year
receiving_team
visiting_team
score
average_home
average_draw
average_away
max_home
max_draw
max_away
date
url
);
while ($number_of_matches) {
if ( my $match = $queue_processed->dequeue_nb ) {
open my $fh, ">>:encoding(UTF-8)", $CONFIG{RESULT_FILE} or die $!;
print $fh join( "\t", @{$match}{@fields} ), "\n";
close $fh;
$number_of_matches--;
}
}
threads->exit();
}
sub scrape_match {
my ( $thread_id, $queue, $queue_processed ) = @_;
while ( my $match_link = $queue->dequeue_nb ) {
my $url = sprintf $match_link->url_abs();
print "\t$url", "\n";
my $mech = MyMech->new( autocheck => 1 );
$mech->quiet(0);
$mech->get($url);
my $match = parse_match( $mech->content() );
$match->{url} = $url;
$queue_processed->enqueue($match);
}
return 1;
}
And i have some strange things with this code. Sometimes it run but sometimes it exit with no errors (at the ->detach
point). I know that @match_links contain data but threads are not created and it just close. Usually it terminates after processing second $championship_link
entry.
May be i'm doing something wrong?
Update
Here is code for find_current_round
subroutine (but i'm sure it's not related to the question):
sub find_current_round {
my ($html) = @_;
my ($select_html) = $html =~ m{
<select\s+name="round"[^>]+>\s*
(.+?)
</select>
}isx;
my ( $option_html, $current_round ) = $select_html =~ m{
(<option\s+value="\d+"(?:\s+ selected="selected")?>(\d+)</option>)\Z
}isx;
my ($last_round_loaded) = $option_html =~ m{selected};
return ( $last_round_loaded, $current_round );
}
First off - don't use dequeue_nb(). This is a bad idea, because if a queue is temporarily empty, it'll return undef and your thread will exit.
Use instead dequeue
and and end
. dequeue
will block, but once you end
your queue, the while will exit.
You're also doing some decidedly odd things with your threads - I would suggest that you rarely want to detach
a thread. You're just assuming your thread is going to complete before your program, which isn't a good plan.
Likewise this;
my $thread = threads->create(
sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
->join;
You're spawning a thread, and then instantly joining it. And so that join
call will... block waiting for your thread to exit. You don't need threads at all to do that...
You also scope your queues within your foreach loop. I don't think that's a good plan. I would suggest instead - scope them externally, and spawn a defined number of 'worker' threads (and one 'printing' thread).
And then just feed them through the queue mechanism. Otherwise you'll end up creating multiple queue instances, because they're lexically scoped.
And once you've finished queuing stuff, issue a $queue -> end
which'll terminate the while loop.
You also don't need to give a thread a $thread_id
because ... they already have one. Try: threads -> self -> tid();
instead.
精彩评论