My site crawler dies while it is running

I wrote a site crawler that collects links and images in order to build a site map, but it gets killed while running. This is not my whole class:

class pageCrawler {

    .......

    private $links = array();

    public function __construct ( $url ) {

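        ignore_user_abort ( true );
        set_time_limit ( 0 );
        register_shutdown_function ( array ( $this, 'callRegisteredShutdown' ) );
        // assumed: $urlParts comes from parse_url(); the excerpt does not show where it is set
        $urlParts = parse_url ( $url );
        $this->host = $urlParts [ 'host' ];
        $this->crawlingUrl ( $url );
        $this->doCrawlLinks ();

    }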

$this->crawlingUrl ( $url ): at the start, the main address is passed to this method (e.g. http://www.mysite.com).

getUrl(): connects to the URL with fsockopen and fetches its contents.

findLinks(): returns the a href and img src values and stores them in $this->links[]. Then I echo something to flush the output, followed by this code:

echo str_pad ( " ", 5000 );
flush ();

$this->doCrawlLinks(): checks $this->links and repeats the process described above for the first element of $this->links.

It then shifts that first element off and doCrawlLinks() runs again: it fetches the content of the new first element and shifts it off, and so on until $this->links is empty.


That is the general flow of my class. It works, but then it suddenly crashes. I set set_time_limit(0) so it can run forever, but my process does not finish, because my shutdown function is never executed. I am confused about where my problem is.


Wild guess: do you have recursion in doCrawlLinks()? Deep recursion can simply crash the process. Or it can crash by hitting the per-process memory limit.
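
If recursion does turn out to be the cause, one way around it is to drive the crawl from a loop with an explicit queue, so the call stack stays flat no matter how many links are found. What follows is only a minimal sketch under assumptions: crawlIteratively(), its $maxPages cap, and the $fetchLinks callback (standing in for getUrl() plus findLinks()) are hypothetical names, not part of the class in the question, and the demo uses a fake in-memory site instead of real requests.

```php
<?php
// Hypothetical helper: crawl with an explicit queue instead of recursion.
// $fetchLinks is an assumed callback returning the links found on a URL.
function crawlIteratively(string $startUrl, callable $fetchLinks, int $maxPages = 1000): array
{
    $queue = new SplQueue();
    $queue->enqueue($startUrl);
    $seen = [$startUrl => true];   // URLs already queued, so nothing is fetched twice
    $visited = [];                 // URLs in the order they were processed

    while (!$queue->isEmpty() && count($visited) < $maxPages) {
        $url = $queue->dequeue();
        $visited[] = $url;

        foreach ($fetchLinks($url) as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue->enqueue($link);
            }
        }
    }

    return $visited;
}

// Demo with a fake site: each key maps to the links "found" on that page.
$site = [
    'http://www.mysite.com/'  => ['http://www.mysite.com/a', 'http://www.mysite.com/b'],
    'http://www.mysite.com/a' => ['http://www.mysite.com/', 'http://www.mysite.com/b'],
    'http://www.mysite.com/b' => [],
];

$order = crawlIteratively('http://www.mysite.com/', function (string $url) use ($site): array {
    return $site[$url] ?? [];
});

print_r($order);
```

The $seen array also matters for memory: without it, pages that link to each other would be queued again and again and $this->links would never become empty.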

From my experience, it is very helpful to keep the list of links in a database with a pending/processed flag on each one, so you can shut down and resume your crawler any time you want (or, in your case, resume it after a crash).
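
A rough sketch of such a link table, assuming PDO with the SQLite driver is available. The table name, column names and helper functions (openQueue, addLink, nextPending, markProcessed) are made up for illustration, and the demo uses an in-memory database; a real crawler would pass a file-based DSN so the list survives a crash.

```php
<?php
// Sketch only: table layout and function names are assumptions.
function openQueue(string $dsn): PDO
{
    $db = new PDO($dsn);
    $db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    $db->exec("CREATE TABLE IF NOT EXISTS links (
        url    TEXT PRIMARY KEY,
        status TEXT NOT NULL DEFAULT 'pending'
    )");
    return $db;
}

function addLink(PDO $db, string $url): void
{
    // INSERT OR IGNORE leaves an already known URL (and its status) untouched.
    $stmt = $db->prepare('INSERT OR IGNORE INTO links (url) VALUES (?)');
    $stmt->execute([$url]);
}

function nextPending(PDO $db): ?string
{
    $url = $db->query("SELECT url FROM links WHERE status = 'pending' ORDER BY rowid LIMIT 1")
              ->fetchColumn();
    return $url === false ? null : $url;
}

function markProcessed(PDO $db, string $url): void
{
    $stmt = $db->prepare("UPDATE links SET status = 'processed' WHERE url = ?");
    $stmt->execute([$url]);
}

// Demo: in-memory database; use a file-based SQLite DSN to be able to resume.
$db = openQueue('sqlite::memory:');
addLink($db, 'http://www.mysite.com/');
addLink($db, 'http://www.mysite.com/a');

$processed = [];
while (($url = nextPending($db)) !== null) {
    // Here the crawler would fetch $url and call addLink() for every link found.
    markProcessed($db, $url);
    $processed[] = $url;
}

print_r($processed);
```

After a crash, running the same loop again picks up at the first row still marked pending, and nothing already processed is fetched a second time.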
