
Tracking page headers and redirects with php-libcurl

I was writing a script to track headers, especially redirects and cookies, for a URL. Often when I open a URL it redirects to another URL, sometimes more than one, and also stores some cookies. But when I ran the script with the URL

http://en.wikipedia.org/

my script didn't save any cookies and showed only one redirect. When I browsed the URL in Firefox, however, it saved cookies, and Live HTTP Headers showed multiple GET requests as well as Set-Cookie headers.

<?php

$url="http://en.wikipedia.org/";
$userAgent="Mozilla/5.0 (Windows NT 5.1; rv:2.0)Gecko/20100101 Firefox/4.0";
$accept="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$encoding="gzip, deflate";
//CURLOPT_HTTPHEADER expects a list of "Name: value" strings, not an associative array
$header=array(
    "Accept: ".$accept,
    "Accept-Language: en-us,en;q=0.5",
    "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7",
    "Connection: keep-alive",
    "Keep-Alive: 115",
);
$i=1;
$flag=1;        //0 if there is no redirect, i.e. no Location header to follow; used to control the while loop below

while($flag!=0) {
    $ch=curl_init();
    curl_setopt($ch,CURLOPT_URL,$url);
    curl_setopt($ch,CURLOPT_USERAGENT,$userAgent);
    curl_setopt($ch,CURLOPT_ENCODING,$encoding);
    curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
    curl_setopt($ch,CURLOPT_FOLLOWLOCATION,0);
    curl_setopt($ch,CURLOPT_HEADER,1);
    curl_setopt($ch,CURLOPT_NOBODY,1);      //sends a HEAD request; a server may answer HEAD differently from GET
    curl_setopt($ch,CURLOPT_AUTOREFERER,true);
    curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . "/cookie.txt");
    curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookie.txt");
    curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
    $pageHeader[$i]=curl_exec($ch);
    curl_close($ch);
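    //match the header at the start of a line, case-insensitively, and stop before the trailing \r\n
    $flag=preg_match('/^Location:[ \t]*(\S+)/mi',$pageHeader[$i],$location[$i]);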
    if($flag==1) {      //if there is a location header    
        if(preg_match('@^https?://@i',$location[$i][1])==1) {      //if it is an absolute url
            $url=$location[$i][1];
        } else {
            if(preg_match('@^/(.*)@',$location[$i][1],$tempurl)==1) {   //if the url is relative to the server's root
                preg_match('@^https?://[^/]+@i',$url,$domain);
                $url=$domain[0].$tempurl[0];        //$domain is a match array; element 0 holds scheme plus host
            } else {        //if the url is relative to the current directory
                $url=preg_replace('@/[^/]*$@',"/".$location[$i][1],$url);
            }
        }
        }
        $location[$i]=$url;
        //a response can carry several Set-Cookie headers, so collect all of them
        preg_match_all('/^Set-Cookie:[ \t]*([^\r\n]*)/mi',$pageHeader[$i],$cookie[$i]);
        $i++;
    }

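}       //end of while: the loop stops once a response has no Location header

//the final iteration leaves an empty match array in $location, so keep only the resolved url strings
$loc=implode("\n",array_filter($location,'is_string'))."\n";

$allHeaders=implode("\n\n\n",$pageHeader);
file_put_contents(dirname(__FILE__) . "/location.txt",$loc);
file_put_contents(dirname(__FILE__) . "/header.txt",$allHeaders);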
?>

Here the files location.txt and header.txt are created, but cookie.txt is not. If I change the URL to google.com, location.txt shows the redirect to google.co.in and one cookie is saved in cookie.txt. But when I open google.com in Firefox it saves three cookies. What could be wrong? I suspect some JavaScript on the page is setting the cookies, so curl cannot see them. Any suggestions for improving the code above are also welcome.
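One way to test the JavaScript suspicion is to list every Set-Cookie line in the raw header text the script already stores. Cookies that Firefox shows but that never appear in such a list did not arrive as HTTP headers, and curl does not execute page scripts. The sketch below is only an illustration: extract_set_cookies and the $sample string are hypothetical and not part of the script above.

```php
<?php
// Hypothetical helper: return every Set-Cookie value found in a raw header block.
function extract_set_cookies($rawHeaders) {
    // /m lets ^ anchor at the start of each header line;
    // /i because header names are case-insensitive.
    preg_match_all('/^Set-Cookie:[ \t]*([^\r\n]*)/mi', $rawHeaders, $matches);
    return $matches[1];
}

// Made-up header text standing in for one entry of $pageHeader.
$sample = "HTTP/1.1 200 OK\r\n"
        . "Set-Cookie: a=1; path=/\r\n"
        . "Content-Type: text/html\r\n"
        . "set-cookie: b=2\r\n\r\n";

$cookies = extract_set_cookies($sample);
```

Running this over each stored header block, including the final non-redirect response, gives a count that can be compared with what the browser reports.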


Your Location:-following code is broken. As you should have seen, most HTTP redirects are relative, so you can't just use that string as the URL in the subsequent request.
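A rough sketch of resolving a Location value against the URL that produced it, using parse_url rather than hand-written regexes. The name resolve_redirect is hypothetical, and the sketch makes simplifying assumptions: it handles absolute, protocol-relative, root-relative and same-directory values, but not "../" segments or query-only values such as "?x=1".

```php
<?php
// Hypothetical helper: turn a Location header value into an absolute URL.
function resolve_redirect($base, $location) {
    $location = trim($location);

    // Already absolute (has a scheme): use it unchanged.
    if (preg_match('@^[a-z][a-z0-9+.-]*://@i', $location)) {
        return $location;
    }

    $parts  = parse_url($base);
    $scheme = isset($parts['scheme']) ? $parts['scheme'] : 'http';
    $host   = isset($parts['host']) ? $parts['host'] : '';
    $origin = $scheme . '://' . $host
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative: //host/path reuses the scheme of the base URL.
    if (strpos($location, '//') === 0) {
        return $scheme . ':' . $location;
    }

    // Root-relative: /path replaces the whole path of the base URL.
    if (strpos($location, '/') === 0) {
        return $origin . $location;
    }

    // Otherwise relative to the directory of the base path.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}
```

For example, resolve_redirect('http://en.wikipedia.org/', '/wiki/Main_Page') yields 'http://en.wikipedia.org/wiki/Main_Page'. Alternatively, setting CURLOPT_FOLLOWLOCATION to 1 lets libcurl do this resolution itself, at the cost of not seeing each hop as a separate request in your own loop.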
