
multi curl unable to handle more than 200 requests at a time

Could you please tell me, is there any limit on the number of requests that can be sent using curl_multi? When I tried to send more than 200 requests at once, they timed out.

See the code below:

foreach($newUrlArry as $url){
    $gatherUrl[] = $url['url'];
}

/* Slice the URL list into batches of 10 and fetch each batch in turn */
$totalUrlRequest = count($gatherUrl);
if($totalUrlRequest > 10){
    $offset = 10;
    $index = 0;
    $matchedAnchors = array();
    $dom = new DOMDocument;
    $noOfBatches = ceil($totalUrlRequest / $offset);
    for($sl = 0; $sl < $noOfBatches; $sl++){
        $output = array_slice($gatherUrl, $index, $offset);
        $index += $offset;
        $responseAction = $this->multiRequestAction($output);
        $k = 0;
        foreach($responseAction as $responseHtml){
            @$dom->loadHTML($responseHtml);
            $documentLinks = $dom->getElementsByTagName("a");
            for($i = 0; $i < $documentLinks->length; $i++){
                $documentLink = $documentLinks->item($i);
                // $match is defined elsewhere; keep only links whose href starts with it
                if($documentLink->hasAttribute('href') && substr($documentLink->getAttribute('href'), 0, strlen($match)) == $match){
                    $childFlag = false; // reset per link so the fallback below works for every link
                    foreach($documentLink->childNodes as $words){
                        $name = trim($words->nodeName);
                        if($name == 'em' || $name == 'b' || $name == 'span' || $name == 'p'){
                            if(!empty($words->nodeValue)){
                                $matchedAnchors[$sl][$k]['anchor'] = trim($words->nodeValue);
                                $matchedAnchors[$sl][$k]['img']    = 0;
                                $matchedAnchors[$sl][$k]['rel']    = $documentLink->hasAttribute('rel') ? 'Y' : 'N';
                                $childFlag = true;
                                break;
                            }
                        }
                        elseif($name == 'img'){
                            $alt = $words->getAttribute('alt');
                            if(!empty($alt)){
                                $matchedAnchors[$sl][$k]['anchor'] = trim($alt);
                                $matchedAnchors[$sl][$k]['img']    = 1;
                                $matchedAnchors[$sl][$k]['rel']    = $documentLink->hasAttribute('rel') ? 'Y' : 'N';
                                $childFlag = true;
                                break;
                            }
                        }
                    }
                    // fall back to the link's own text if no suitable child was found
                    if(!$childFlag){
                        $matchedAnchors[$sl][$k]['anchor'] = $documentLink->nodeValue;
                        $matchedAnchors[$sl][$k]['img']    = 0;
                        $matchedAnchors[$sl][$k]['rel']    = $documentLink->hasAttribute('rel') ? 'Y' : 'N';
                    }
                }
            }
            $k++;
        }
    }
}
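
The multiRequestAction() helper isn't shown above. For reference, here is a minimal sketch of what such a method might look like, assuming it takes an array of URLs, fetches them in parallel with curl_multi, and returns their response bodies in the same order (the method name comes from the code above; the body is an assumption, not the asker's actual code):

<?php
// Hypothetical sketch of multiRequestAction(): fetch a batch of URLs in
// parallel with curl_multi and return their response bodies.
function multiRequestAction(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();

    foreach($urls as $key => $url){
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 25); // per-request timeout
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Drive all transfers until they finish.
    $running = null;
    do {
        curl_multi_exec($mh, $running);
        curl_multi_select($mh); // block briefly instead of busy-waiting
    } while($running > 0);

    // Collect the bodies and release the handles.
    $responses = array();
    foreach($handles as $key => $ch){
        $responses[$key] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);

    return $responses;
}
?>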


Both @Phliplip & @lunixbochs have mentioned common cURL pitfalls (max execution time and being denied by the target server).

When sending that many cURL requests to the same server I try to "be nice" and place voluntary sleep periods so I don't bombard the host. For a low-end site, 1000+ requests could feel like a mini DDoS!

Here's code that's worked for me. I used it to scrape a client's product data from their old site, since the data was locked in a proprietary database system with no export function.

<?php
header('Content-type: text/html; charset=utf-8', true);
set_time_limit(0);
$urls = array(
    'http://www.example.com/cgi-bin/product?id=500', 
    'http://www.example.com/cgi-bin/product?id=501',  
    'http://www.example.com/cgi-bin/product?id=502',  
    'http://www.example.com/cgi-bin/product?id=503',  
    'http://www.example.com/cgi-bin/product?id=504', 
);
$i = 0;
foreach($urls as $url){
    echo $url."\n";
    $curl = curl_init($url);
    $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
    curl_setopt($curl, CURLOPT_USERAGENT, $userAgent);
    curl_setopt($curl, CURLOPT_AUTOREFERER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt($curl, CURLOPT_TIMEOUT, 25 );
    $html = curl_exec($curl);
    $html = @mb_convert_encoding($html, 'HTML-ENTITIES', 'utf-8');  
    curl_close($curl);
    // now do something with info returned by curl 
    $i++;
    if($i%10==0){
        sleep(20); // longer break after every 10th request
    } else {
        sleep(2);  // short pause between consecutive requests
    }
}
?>

The main features are:

  • no max execution time
  • voluntary sleeping between requests
  • new curl init & exec for each request.

In my experience, calling sleep() will stop servers from denying you. However, if by "different different server" you mean that you are sending a small number of requests to a large number of servers, for example:

$urls = array(
    'http://www.example-one.com/', 
    'http://www.example-two.com/', 
    'http://www.example-three.com/', 
    'http://www.example-four.com/', 
    'http://www.example-five.com/', 
    'http://www.example-six.com/'
);

and you are using set_time_limit(0), then an error may be causing your code to die; try

ini_set('display_errors',1); 
error_reporting(E_ALL);

And tell us the error message you are getting.


PHP doesn't place a restriction on the number of connections using curl_multi_init, but memory usage and time limits will be an issue.

Check your memory_limit setting in your php.ini and try to increase it to see if that helps you.
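
If memory or the time limit is what's killing the 200+ request run, one common workaround is a "rolling" window: keep only a small, fixed number of handles registered with curl_multi at any moment, and add the next URL each time a transfer finishes, rather than adding all 200+ handles up front. A minimal sketch of that pattern (the $urls array and the response handling are placeholders, not code from the question):

<?php
set_time_limit(0);
$urls   = array(/* ... hundreds of URLs ... */);
$window = 10; // max transfers in flight at once
$queue  = $urls;
$mh     = curl_multi_init();

// Helper to register one URL with the multi handle.
$add = function($mh, $url){
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 25);
    curl_multi_add_handle($mh, $ch);
};

// Prime the window.
for($i = 0; $i < $window && $queue; $i++){
    $add($mh, array_shift($queue));
}

$running = null;
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh);

    // Reap finished transfers and top the window back up.
    while($info = curl_multi_info_read($mh)){
        $ch   = $info['handle'];
        $html = curl_multi_getcontent($ch);
        // ... process $html here ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);

        if($queue){
            $add($mh, array_shift($queue));
            $running++; // keep the loop alive; the next curl_multi_exec recounts
        }
    }
} while($running > 0);
curl_multi_close($mh);
?>

Because finished handles are closed as soon as they complete, memory stays bounded by the window size rather than growing with the total number of requests.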
