Tracking page headers and redirects with PHP libcurl
I was writing a script to track headers, especially redirects and cookies, for a URL. Often when I open a URL it redirects to another URL, sometimes to more than one, and also stores some cookies. But when I ran the script with the URL
http://en.wikipedia.org/
my script showed only one redirect and did not store any cookies. When I browsed the URL in Firefox, however, it saved cookies, and when I inspected it with Live HTTP Headers it showed multiple GET requests. Live HTTP Headers also shows that there are Set-Cookie headers.
<?php
$url="http://en.wikipedia.org/";
$userAgent="Mozilla/5.0 (Windows NT 5.1; rv:2.0)Gecko/20100101 Firefox/4.0";
$accept="text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
$encoding="gzip, deflate";
$header['lang']="en-us,en;q=0.5";
$header['charset']="ISO-8859-1,utf-8;q=0.7,*;q=0.7";
$header['conn']="keep-alive";
$header['keep-alive']=115;
$i=1;
$flag=1; //0 if there is no redirect i.e. no location header to follow. used here to to control the while loop below
while($flag!=0) {
$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_USERAGENT,$userAgent);
curl_setopt($ch,CURLOPT_ENCODING,$encoding);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,0);
curl_setopt($ch,CURLOPT_HEADER,1);
curl_setopt($ch,CURLOPT_NOBODY,1);
curl_setopt($ch,CURLOPT_AUTOREFERER,true);
curl_setopt($ch, CURLOPT_COOKIEJAR, dirname(__FILE__) . "/cookie.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, dirname(__FILE__) . "/cookie.txt");
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$pageHeader[$i]=curl_exec($ch);
curl_close($ch);
$flag=preg_match('/Location: (.*)\s/',$pageHeader[$i],$location[$i]);
if($flag==1) { //if there is a location header
if(preg_match('@^(http://|www.)@',$location[$i][1],$tempurl)==1) { //if it is an absolute url
$url=$location[$i][1];
} else {
if(preg_match('@^/(.*)@',$location[$i][1],$tempurl)==1) { //if the url corresponds to url relative to server's root
preg_match('@^((http://)|(www.))[^/]+@',$url,$domain);
$url=$domain.$tempurl[0];
} else { //if the url is relative to current directory
$url=preg_replace('@(/[^/]+)$@',"/".$location[$i][1],$url);
}
}
$location[$i]=$url;
preg_match('/Set-Cookie: (.*)\s/',$pageHeader[$i],$cookie[$i]);
$i++;
} // end if: a Location header was found
} // end while
foreach($location as $l)
$loc=$loc.$l."\n";
$header=implode("\n\n\n",$pageHeader);
file_put_contents(dirname(__FILE__) . "/location.txt",$loc);
file_put_contents(dirname(__FILE__) . "/header.txt",$header);
?>
Here the files location.txt and header.txt are created, but cookie.txt is not.
If I change the URL to google.com, it shows the redirect to google.co.in in location.txt and saves one cookie in cookie.txt. But when I open google.com in Firefox, it saves three cookies. What can be wrong?
I think some JavaScript on the page is setting the cookies, so curl is not able to see them.
Also, any suggestions for improving the above code are welcome.
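One way to test the JavaScript theory is to list every Set-Cookie header that curl actually received. The preg_match call in the script keeps only the first match per response, so a response carrying several cookies would look like it carried one. A minimal sketch, assuming the raw header text is available as a string (extractSetCookies is a hypothetical helper name, not part of the original script):

```php
<?php
// Hypothetical helper: return the value of every Set-Cookie header
// found in a raw HTTP header block (such as one entry of $pageHeader).
function extractSetCookies(string $rawHeaders): array
{
    // m: ^ and $ match at each line; i: header names are case-insensitive.
    preg_match_all('/^Set-Cookie:[ \t]*(.*?)[ \t\r]*$/mi', $rawHeaders, $matches);
    return $matches[1];
}

// Made-up sample response, for illustration only.
$sample = "HTTP/1.1 200 OK\r\n"
        . "Set-Cookie: a=1; path=/\r\n"
        . "Set-Cookie: b=2; path=/\r\n"
        . "Content-Type: text/html\r\n\r\n";

print_r(extractSetCookies($sample));
```

If this list is shorter than what Firefox stores, the remaining cookies are probably set by later requests or by script running in the browser.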
Your code for following the Location: header is completely broken. As you should have seen, most HTTP redirects are relative, so you can't just use that string as the URL in the subsequent request.
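To illustrate the point about relative redirects, here is a minimal sketch of resolving a Location value against the URL that returned it. The function name resolveLocation is hypothetical, and the sketch assumes an http or https base URL. It does not normalize ../ or ./ segments or handle query-only redirects, so treat it as a starting point rather than a full RFC 3986 resolver.

```php
<?php
// Hypothetical helper: turn a Location header value into an absolute URL,
// using the URL of the request that produced it as the base.
function resolveLocation(string $base, string $location): string
{
    $location = trim($location);

    // Already absolute: starts with a scheme such as http:// or https://.
    if (preg_match('@^[a-z][a-z0-9+.-]*://@i', $location)) {
        return $location;
    }

    $parts  = parse_url($base);
    $scheme = $parts['scheme'] ?? 'http';
    $origin = $scheme . '://' . ($parts['host'] ?? '')
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Protocol-relative: //host/path reuses the base scheme.
    if (strpos($location, '//') === 0) {
        return $scheme . ':' . $location;
    }

    // Relative to the server root: /path
    if (strpos($location, '/') === 0) {
        return $origin . $location;
    }

    // Relative to the current directory: keep the base path up to
    // and including its last slash, then append the new part.
    $path = $parts['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}

echo resolveLocation('http://en.wikipedia.org/', '/wiki/Main_Page'), "\n";
```

Inside the loop this would replace the chain of preg_match branches with something like $url = resolveLocation($url, $location[$i][1]);.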