Threads application terminates unexpectedly

I have a small scraping application and I'm trying to add multithreading to it. Here is the code (MyMech is a WWW::Mechanize subclass used to process HTTP errors):

#!/usr/bin/perl

use strict;
use MyMech;
use File::Basename;
use File::Path;
use HTML::Entities;
use threads;
use threads::shared;
use Thread::Queue;
use List::Util qw( max sum );

my $page   = 1;
my %CONFIG = read_config();

my $mech = MyMech->new( autocheck => 1 );
$mech->quiet(0);

$mech->get( $CONFIG{BASE_URL} . "/site-map.php" );

my @championship_links =
  $mech->find_all_links( url_regex => qr/\d{4}-\d{4}\/$/ );

foreach my $championship_link (@championship_links) {

    my @threads;

    my $queue           = Thread::Queue->new;
    my $queue_processed = Thread::Queue->new;

    my $url = sprintf $championship_link->url_abs();

    print $url, "\n";

    next unless $url =~ m{soccer}i;

    $mech->get($url);

    my ( $last_round_loaded, $current_round ) =
      find_current_round( $mech->content() );

    unless ($last_round_loaded) {

        print "\tLoading rounds data...\n";

        $mech->submit_form(

            form_id => "leagueForm",
            fields  => {

                round => $current_round,
            },
        );
    }

    my @match_links =
      $mech->find_all_links( url_regex => qr/matchdetails\.php\?matchid=\d+$/ );

    foreach my $link (@match_links) {

        $queue->enqueue($link);
    }

    print "Starting printing thread...\n";

    my $printing_thread = threads->create(
        sub { printing_thread( scalar(@match_links), $queue_processed ) } )
      ->detach;

    push @threads, $printing_thread;

    print "Starting threads...\n";

    foreach my $thread_id ( 1 .. $CONFIG{NUMBER_OF_THREADS} ) {

        my $thread = threads->create(
            sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
          ->join;
        push @threads, $thread;
    }

    undef $queue;
    undef $queue_processed;

    foreach my $thread ( threads->list() ) {

        if ( $thread->is_running() ) {

            print $thread->tid(), "\n";
        }
    }

    #sleep 5;
}

print "Finished!\n";

sub printing_thread {

    my ( $number_of_matches, $queue_processed ) = @_;

    my @fields =
      qw (
          championship
          year
          receiving_team
          visiting_team
          score
          average_home
          average_draw
          average_away
          max_home
          max_draw
          max_away
          date
          url
         );

    while ($number_of_matches) {

        if ( my $match = $queue_processed->dequeue_nb ) {

            open my $fh, ">>:encoding(UTF-8)", $CONFIG{RESULT_FILE} or die $!;

            print $fh join( "\t", @{$match}{@fields} ), "\n";
            close $fh;

            $number_of_matches--;
        }
    }

    threads->exit();
}

sub scrape_match {

    my ( $thread_id, $queue, $queue_processed ) = @_;

    while ( my $match_link = $queue->dequeue_nb ) {

        my $url = sprintf $match_link->url_abs();

        print "\t$url", "\n";

        my $mech = MyMech->new( autocheck => 1 );
        $mech->quiet(0);

        $mech->get($url);

        my $match = parse_match( $mech->content() );
        $match->{url} = $url;

        $queue_processed->enqueue($match);
    }

    return 1;
}

I'm seeing some strange behaviour with this code. Sometimes it runs, but sometimes it exits with no errors (at the ->detach point). I know that @match_links contains data, but the threads are not created and the program just closes. It usually terminates after processing the second $championship_link entry.

Maybe I'm doing something wrong?

Update: Here is the code for the find_current_round subroutine (but I'm sure it's not related to the question):

sub find_current_round {

    my ($html) = @_;

    my ($select_html) = $html =~ m{

    <select\s+name="round"[^>]+>\s*
    (.+?)
    </select>
    }isx;

    my ( $option_html, $current_round ) = $select_html =~ m{

    (<option\s+value="\d+"(?:\s+ selected="selected")?>(\d+)</option>)\Z
    }isx;

    my ($last_round_loaded) = $option_html =~ m{selected};

    return ( $last_round_loaded, $current_round );
}


First off - don't use dequeue_nb(). This is a bad idea, because if the queue is temporarily empty, it returns undef, your while loop ends and the thread exits.

Use dequeue and end instead. dequeue blocks while the queue is empty, but once you end the queue and the remaining items have been taken, dequeue returns undef and the while loop exits.
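A minimal sketch of that dequeue/end pattern, assuming a threads-enabled perl and a Thread::Queue recent enough to provide end (3.01 or later). The queue names and the doubling "work" are made up for illustration and are not part of the original program. The loop tests defined(...) so that a false but valid item such as 0 does not stop the worker early.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue   = Thread::Queue->new;    # items waiting to be processed
my $results = Thread::Queue->new;    # processed items

# Worker: dequeue() blocks while the queue is empty and returns undef
# only once end() has been called and the queue has been drained.
my $worker = threads->create(
    sub {
        while ( defined( my $item = $queue->dequeue ) ) {
            $results->enqueue( $item * 2 );    # stand-in for real work
        }
    }
);

$queue->enqueue( 1 .. 5 );
$queue->end;       # signal that no more items will be added
$worker->join;     # wait until the worker has drained the queue
$results->end;     # nothing more will be produced either

# Collect the results in the main thread.
my @doubled;
while ( defined( my $r = $results->dequeue ) ) {
    push @doubled, $r;
}
print "@doubled\n";
```

With a single worker the results come out in the order the items were queued.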

You're also doing some decidedly odd things with your threads - I would suggest that you rarely want to detach a thread. A detached thread can't be joined, and it is killed when the main program exits, so you're just assuming your thread is going to complete before your program does, which isn't a good plan.

Likewise this:

    my $thread = threads->create(
        sub { scrape_match( $thread_id, $queue, $queue_processed ) } )
      ->join;

You're spawning a thread, and then instantly joining it. So that join call will block until the thread exits, and the next thread isn't created until it has. The workers therefore run one after another, and you don't need threads at all to do that.
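A small sketch of the alternative, assuming a threads-enabled perl: create all the threads first, keep the thread objects, and join them afterwards. The squaring sub is a placeholder for the real work, not code from the question.

```perl
use strict;
use warnings;
use threads;

my @workers;

# Start every thread first; nothing blocks here, so they run concurrently.
for my $n ( 1 .. 4 ) {
    push @workers, threads->create( sub { return $n * $n } );
}

# Only now wait for them, collecting each thread's return value.
my @squares;
push @squares, $_->join for @workers;

print "@squares\n";
```

Because the results are collected in the order the threads were created, the output order does not depend on which thread finishes first.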

You also scope your queues within your foreach loop. I don't think that's a good plan. I would suggest instead that you scope them outside the loop, and spawn a fixed number of 'worker' threads (and one 'printing' thread) once, before the loop starts.

Then just feed them through the queue mechanism. Otherwise you'll end up creating a new pair of queues and a new set of threads on every iteration, because the queues are lexically scoped to the loop body.

Once you've finished queuing everything, call $queue->end, which terminates the while loop in each worker.

You also don't need to give a thread a $thread_id because they already have one. Try threads->self->tid() instead.
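Put together, the layout could look roughly like the sketch below. It assumes a threads-enabled perl and a Thread::Queue that provides end; $NUMBER_OF_WORKERS, the queue names and the numeric "jobs" are placeholders standing in for the configuration value, the match links and the parsed matches of the original program.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $NUMBER_OF_WORKERS = 3;    # placeholder for $CONFIG{NUMBER_OF_THREADS}

# Both queues live outside any loop, so every thread shares the same pair.
my $work_q   = Thread::Queue->new;
my $result_q = Thread::Queue->new;

# One 'printing' thread: consumes results until $result_q is ended.
# It returns what it saw so the main thread can inspect it after join.
my $printer = threads->create(
    { context => 'list' },
    sub {
        my @seen;
        while ( defined( my $line = $result_q->dequeue ) ) {
            push @seen, $line;    # the real program would write to a file
        }
        return @seen;
    }
);

# A fixed pool of workers, each identifying itself through its own tid.
my @workers;
for ( 1 .. $NUMBER_OF_WORKERS ) {
    push @workers, threads->create(
        sub {
            while ( defined( my $job = $work_q->dequeue ) ) {
                $result_q->enqueue( "job $job handled by thread " . threads->tid );
            }
        }
    );
}

# Feed work in batches (one batch per championship in the original).
for my $batch ( [ 1 .. 3 ], [ 4 .. 6 ] ) {
    $work_q->enqueue(@$batch);
}

$work_q->end;              # no more work: workers leave their loops
$_->join for @workers;     # wait for all workers to finish
$result_q->end;            # no more results: printer leaves its loop
my @seen = $printer->join;

print scalar(@seen), " results\n";
```

The shutdown order matters here: the result queue is ended only after every worker has been joined, so the printing thread cannot stop while results are still being produced.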
