开发者

Is this site blocking/ignoring my HTTP requests?

I'm only able to fetch from a site when I use cURL with a proxy. cURL without a proxy and file_get_contents() return nothing开发者_如何转开发 (cURL HTTP code "0" and curl_error() Empty reply from server). I'm able to fetch other sites just fine without a proxy.

Aside from being blocked, is there any other possible explanation of why I can only access this site via proxy?


Did you set a USER AGENT in cURL? Sometimes websites will block you if your USER AGENT isn't set or if your HTTP request looks suspicious.

To set your USER AGENT in PHP:

curl_setopt($curl, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");


Is this from your workplace or something? Many companies disable file_get_contents() on shared PHP installs, as it's quite risky.

The site probably has user agent detection. You can fake that in your curl call but I don't believe that's possible with file_get_contents(). Another method sites use is to only display content once a cookie has been set so site scrapers will never see the data.

Try this:

function curl_scrape($url,$data,$proxy,$proxystatus)
{
    $fp = fopen("cookie.txt", "w");
    fclose($fp);
    $ch = curl_init();

    curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
    curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
    curl_setopt($ch, CURLOPT_TIMEOUT, 40);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);

    if ($proxystatus == 'on')
    {
        curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, FALSE);
        curl_setopt($ch, CURLOPT_HTTPPROXYTUNNEL, TRUE);
        curl_setopt($ch, CURLOPT_PROXY, $proxy);
    }

    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_HEADER, TRUE);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_POST, TRUE);
    curl_setopt($ch, CURLOPT_POSTFIELDS, $data);

    ob_start(); // prevent any output
    return curl_exec ($ch); // execute the curl command
    ob_end_clean(); // stop preventing output
    curl_close ($ch);
    unset($ch);
}


I'm guessing I was truly blocked. Using proxy now and it works fine.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜