Scraping from a website that requires a login?
Can this be done, and if so, how? I want to scrape data from xbox.com, but the pages I need to scrape only appear after a successful login.
Most login forms will set a cookie, so you should use an HTTP class like Zend_Http that can store cookies for further requests. It's presumably as simple as:
$client = new Zend_Http_Client();
$client->setCookieJar(); // this is the crucial part for "logging in"
// make login request
$client->setUri("http://xbox.com/login");
$client->setParameterPost("login", "hackz0r");
$result = $client->request('POST');
// go scraping
...
You will have to go through the required login transaction by sending POST data with your cURL requests. That said, scraping data from behind a login is a bad idea: the site kept that information out of public view for a reason, and scraping it might constitute copyright infringement.
It can be done in theory, provided you have a web-fetching class that supports cookies. It looks like HTTP_Request2 from PEAR can send cookies if you provide the cookie information as part of the request. All you should need to do is:
- Send a login request
- Extract the cookie data from the HTTP headers of the response to the above request
- Set this cookie data on subsequent requests
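The cookie-handling part of those steps can be sketched without any network access. This is only an illustration, not HTTP_Request2's API: the helper names `extractCookies` and `buildCookieHeader` are hypothetical, and the sample response headers are made up to stand in for whatever a real login response returns.

```php
<?php
// Hypothetical helper: collect name => value pairs from raw response header lines.
function extractCookies(array $headerLines): array
{
    $cookies = [];
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so match without regard to case.
        if (stripos($line, 'Set-Cookie:') !== 0) {
            continue;
        }
        // Keep only "name=value"; drop attributes such as Path or HttpOnly.
        $pair = explode(';', substr($line, strlen('Set-Cookie:')), 2)[0];
        $parts = explode('=', trim($pair), 2);
        if (count($parts) === 2 && $parts[0] !== '') {
            $cookies[$parts[0]] = $parts[1];
        }
    }
    return $cookies;
}

// Hypothetical helper: build the Cookie request header for later requests.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Made-up headers standing in for the response to a login request.
$loginResponseHeaders = [
    'HTTP/1.1 302 Found',
    'Set-Cookie: PHPSESSID=abc123; Path=/; HttpOnly',
    'Set-Cookie: remember=1; Max-Age=3600',
    'Location: /dashboard',
];

$cookies = extractCookies($loginResponseHeaders);
$cookieHeader = buildCookieHeader($cookies);
echo $cookieHeader, "\n"; // prints "Cookie: PHPSESSID=abc123; remember=1"
```

The resulting header line would then be sent with each subsequent request, for example via cURL's CURLOPT_HTTPHEADER option. A real client should also honour cookie attributes such as expiry and path, which this sketch ignores.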
Note that many sites will have anti-scraping techniques of varying degrees of sophistication, and may make this more difficult. It may also be illegal, immoral or contrary to the site user agreement.
There are several ways to log in automatically, some more complicated than others. xbox.com probably uses the Windows Live API, so you'll have to look into the documentation for that.
The PHP library PGBrowser can get this done pretty easily. Below is a demo code snippet taken from the companion blog. I believe this won't work with the Xbox website because Microsoft now uses SSO, but it is still applicable to other websites with content behind login forms.
require 'pgbrowser.php';
$b = new PGBrowser();
$b->useCache = true;
$page = $b->get('http://yoursite.com/login'); // Retrieve login web page
$form = $page->forms(1); // Retrieve form
// Note the form field names have to be specified
$form->set('username', "your_username_or_email");
$form->set('password', "your_password");
$page = $form->submit(); // Submit form
echo $page->html; // This shows the web page normally displayed after successful login, e.g. dashboard
//initial request with login data
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/login.php');
curl_setopt($ch, CURLOPT_USERAGENT,'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/32.0.1700.107 Chrome/32.0.1700.107 Safari/537.36');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, "username=XXXXX&password=XXXXX");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_COOKIESESSION, true);
curl_setopt($ch, CURLOPT_COOKIEJAR, 'cookie-name'); // file cookies are saved to; can be empty, but that causes problems on some hosts
curl_setopt($ch, CURLOPT_COOKIEFILE, '/var/www/ip4.x/file/tmp'); // file cookies are read from; can be empty, but that causes problems on some hosts
$answer = curl_exec($ch);
if (curl_error($ch)) {
    echo curl_error($ch);
}
//another request preserving the session
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/list');
curl_setopt($ch, CURLOPT_HTTPGET, true); // switch back to GET; setting CURLOPT_POSTFIELDS again, even to "", would make this a POST
$answer = curl_exec($ch);
if (curl_error($ch)) {
    echo curl_error($ch);
}
curl_close($ch);