Post data to another site using PHP and save the output
I'm trying to save info from the http://www.woorank.com search results. The site caches data for popular sites, but for most you need to do a search before it returns a report. So I tried this:
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.woorank.com/en/report/generate');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('url'=>'hellothere.com'));
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);
It seems (based on curl output) to redirect to http://www.woorank.com/en/www/hellothere.com, as it should after you search, but it doesn't generate a report and simply states there is no report yet (just as it would when you visit the url directly).
Am I doing something wrong? Or is it not possible to retrieve this info?
Update
Request headers: http://pastebin.com/3ijZfMmF
POST /en/report/generate HTTP/1.1
Host: www.woorank.com
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.3) Gecko/20100401 Firefox/3.6.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Referer: http://www.woorank.com/
Cookie: __utma=201458455.1161920622.1291713267.1291747441.1291773488.4; __utmc=201458455; __utmz=201458455.1291713267.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __utmb=201458455.1.10.1291773488
Content-Type: application/x-www-form-urlencoded
Content-Length: 16
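One difference worth checking against the browser request above: Firefox sends the form as application/x-www-form-urlencoded, whereas passing a PHP array to CURLOPT_POSTFIELDS makes cURL send multipart/form-data. A minimal sketch of sending a URL-encoded string instead follows; the helper names buildFormBody and preparePost are made up for illustration, and whether the site actually cares about the content type is an assumption.

```php
<?php
// Encode the fields the way a browser form submission does.
function buildFormBody(array $fields): string
{
    return http_build_query($fields);
}

// Hypothetical helper: prepares (but does not execute) the POST handle.
function preparePost(string $endpoint, array $fields)
{
    $ch = curl_init($endpoint);
    curl_setopt($ch, CURLOPT_POST, true);
    // A string body is sent as application/x-www-form-urlencoded;
    // an array body would be sent as multipart/form-data.
    curl_setopt($ch, CURLOPT_POSTFIELDS, buildFormBody($fields));
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    return $ch;
}
```

For example, buildFormBody(array('url' => 'somesite.com')) yields the 16-byte body url=somesite.com, which matches the Content-Length a browser would send for that domain.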
I'm not sure how to get the request headers from the test script, but using this:
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLINFO_HEADER_OUT, true);
$headers = curl_getinfo($ch);
The $headers variable contains:
Array
(
    [url] => http://www.woorank.com/en/www/someothersite.com
    [content_type] => text/html; charset=UTF-8
    [http_code] => 200
    [header_size] => 841
    [request_size] => 280
    [filetime] => -1
    [ssl_verify_result] => 0
    [redirect_count] => 1
    [total_time] => 0.904581
    [namelookup_time] => 3.2E-5
    [connect_time] => 3.3E-5
    [pretransfer_time] => 3.7E-5
    [size_upload] => 155
    [size_download] => 5297
    [speed_download] => 5855
    [speed_upload] => 171
    [download_content_length] => 5297
    [upload_content_length] => 0
    [starttransfer_time] => 0.242975
    [redirect_time] => 0.577306
    [request_header] => GET /en/www/someothersite.com HTTP/1.1
        Host: www.woorank.com
        Accept: */*
)
It seems to me that this is the redirect that happens after the search form is submitted. But I'm not sure whether no POST happens at all or whether it just isn't visible in these headers. Since it doesn't work, I'm guessing it's the former.
The output from curl_exec is simply the HTML from http://www.woorank.com/en/www/someothersite.com.
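A note on reading that dump: with CURLOPT_FOLLOWLOCATION enabled, request_header holds only the last request sent, and cURL switches to GET when it follows a 302 by default. The redirect_count of 1 and size_upload of 155 suggest the POST did go out and the GET shown is the follow-up. One way to check is to turn off redirect following and look at the first request on its own. This is a sketch; the helper names firstRequestLine and sendWithoutFollowing are hypothetical.

```php
<?php
// Return the request line (e.g. "POST /path HTTP/1.1") from a raw header string.
function firstRequestLine(string $requestHeader): string
{
    $lines = preg_split('/\r?\n/', $requestHeader, 2);
    return $lines[0];
}

// Hypothetical helper: send the POST but stop at the 302 instead of following it.
function sendWithoutFollowing(string $endpoint, array $fields): array
{
    $ch = curl_init($endpoint);
    curl_setopt($ch, CURLOPT_POST, true);
    curl_setopt($ch, CURLOPT_POSTFIELDS, http_build_query($fields));
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false); // keep the first request visible
    curl_setopt($ch, CURLOPT_HEADER, true);          // include response headers in the output
    curl_setopt($ch, CURLINFO_HEADER_OUT, true);     // record the request headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($ch);
    // curl_getinfo() is only meaningful after curl_exec().
    $request = (string) curl_getinfo($ch, CURLINFO_HEADER_OUT);
    $status  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array('request' => $request, 'status' => $status, 'response' => $response);
}
```

If the POST is being sent, firstRequestLine($result['request']) should start with POST and $result['status'] should be 302.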
Update 2
I tried adding some of the headers to the curl request using:
curl_setopt($ch, CURLOPT_HTTPHEADER, $headers);
and e.g.
$headers = array(
"Host: www.woorank.com",
"Referer: http://www.woorank.com/"
);
This doesn't make it POST the form, but now the curl_exec output shows the response headers. Here's the difference:
Firefox, response headers from site:
HTTP/1.1 302 Found
Date: Wed, 08 Dec 2010 02:19:18 GMT
Server: Apache/2.2.9 (Fedora)
X-Powered-By: PHP/5.2.6
Set-Cookie: language=en; expires=Wed, 08-Dec-2010 03:19:18 GMT; path=/
Set-Cookie: generate=somesite.com; expires=Wed, 08-Dec-2010 03:19:19 GMT; path=/
Location: /en/www/somesite.com
Cache-Control: max-age=1
Expires: Wed, 08 Dec 2010 02:19:19 GMT
Vary: Accept-Encoding,User-Agent
Content-Encoding: gzip
Content-Length: 20
Keep-Alive: timeout=1, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
and from test.php:
HTTP/1.1 302 Found
Date: Wed, 08 Dec 2010 02:27:21 GMT
Server: Apache/2.2.9 (Fedora)
X-Powered-By: PHP/5.2.6
Set-Cookie: language=en; expires=Wed, 08-Dec-2010 03:27:21 GMT; path=/
Set-Cookie: generate=someothersite.com; expires=Wed, 08-Dec-2010 03:27:22 GMT; path=/
Location: /en/www/someothersite.com
Cache-Control: max-age=1
Expires: Wed, 08 Dec 2010 02:27:22 GMT
Vary: Accept-Encoding,User-Agent
Content-Length: 0
Keep-Alive: timeout=1, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
The only differences I notice are that Content-Encoding gzip and Content-Length 20 are missing in the test. I don't know what that means, but when I add "Content-Length: 20" to the headers it says "HTTP/1.1 413 Request Entity Too Large" and doesn't do anything; adding "Content-Encoding: gzip" makes it return the HTML gzipped (I assume, since it looks like this: "‹ÍXésÚ8ÿœüZíì&]ìºG “æè1 MmÚ...").
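Both of those headers describe a message body, so copying them from the browser's response into the cURL request is unlikely to help. In a request, Content-Length is the size of the POST body, which cURL computes itself from CURLOPT_POSTFIELDS; a hand-set value of 20 that does not match the real body is a plausible cause of the 413. To ask for a compressed response and have it decoded automatically, the usual route is CURLOPT_ENCODING, which sets Accept-Encoding. The garbled output is consistent with raw gzip data, whose second magic byte 0x8B displays as "‹" in Windows-1252. A sketch, with hypothetical helper names:

```php
<?php
// Ask cURL to advertise compression and decode the response itself.
function enableCompressedResponses($ch): void
{
    // An empty string means "every encoding this cURL build supports".
    curl_setopt($ch, CURLOPT_ENCODING, '');
}

// Fallback for a body that was already fetched raw: decode it only if it
// starts with the gzip magic bytes 0x1F 0x8B. Requires the zlib extension.
function decodeIfGzipped(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) === 0) {
        $decoded = gzdecode($body);
        return $decoded === false ? $body : $decoded;
    }
    return $body;
}
```

With enableCompressedResponses($ch) called before curl_exec, the returned HTML should already be plain text and decodeIfGzipped is not needed.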
Hope this info helps.
You want to make sure you're matching the necessary headers. Make the request you want to emulate in a browser and post its headers here; a plugin like HTTPFox for Firefox, or a similar tool, will capture them. Then we can see whether your cURL request matches them.
ANSWER: I looked at the site myself and found that it uses cookies to make sure you're not a simple robot before generating reports. You can get around this by updating your cURL script to send the right cookies.
There may also be other simple checks that you'd have to pass (e.g. Referer, User-Agent). You can do all of it with cURL, though.
However, they probably use this kind of cookie protection because they don't want people scraping their data. If you're going to get past that restriction, you should have the courtesy to ask the admin for permission to download the site. While you're not at legal risk (they have no ToS), it'd be a nice thing to do.
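As a rough sketch of the cookie approach described above: visit the front page first so the server can set its cookies, then submit the form on the same handle with a Referer and User-Agent. The cookie file path, the User-Agent string, and the helper names reportOptions and fetchReport are all made up for illustration, and it is an assumption that server-set cookies are enough; cookies that the browser creates through JavaScript would not be reproduced this way.

```php
<?php
// Base URL and form field come from the question; everything else is assumed.
const BASE_URL = 'http://www.woorank.com';

// Options for the form submission, kept separate so they can be inspected.
function reportOptions(string $domain, string $cookieJar): array
{
    return array(
        CURLOPT_URL            => BASE_URL . '/en/report/generate',
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query(array('url' => $domain)),
        CURLOPT_REFERER        => BASE_URL . '/',
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; example-script)',
        CURLOPT_COOKIEFILE     => $cookieJar, // read cookies from this file
        CURLOPT_COOKIEJAR      => $cookieJar, // write cookies back when the handle closes
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_RETURNTRANSFER => true,
    );
}

function fetchReport(string $domain, string $cookieJar)
{
    $ch = curl_init();
    // Step 1: load the front page so any Set-Cookie headers are stored.
    curl_setopt_array($ch, array(
        CURLOPT_URL            => BASE_URL . '/',
        CURLOPT_COOKIEFILE     => $cookieJar,
        CURLOPT_COOKIEJAR      => $cookieJar,
        CURLOPT_RETURNTRANSFER => true,
    ));
    curl_exec($ch);
    // Step 2: submit the form; the same handle still holds the cookies.
    curl_setopt_array($ch, reportOptions($domain, $cookieJar));
    $html = curl_exec($ch);
    curl_close($ch);
    return $html; // HTML string, or false on failure
}
```

A call such as fetchReport('hellothere.com', '/tmp/woorank_cookies.txt') would then return the page that the redirect lands on.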
Maybe something like this? I'm especially curious what you get as output from print_r.
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.woorank.com/en/report/generate');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, array('url'=>'hellothere.com'));
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
$result = curl_exec ($ch);
print_r($result); // output?
curl_close($ch);