
Scraping an ASP.NET website with POST variables in PHP

For the past few days I have been trying to scrape a website but so far with no luck.

The situation is as follows: the website I am trying to scrape requires data from a previously submitted form. I have identified the variables the web app requires and have looked at the HTTP headers the original web app sends.

Since I have pretty much zero knowledge of ASP.NET, I thought I'd just ask whether I am missing something here.

I have tried different methods (cURL, file_get_contents and the Snoopy class). Here's my code for the cURL method:

<?php
$url = 'http://www.urltowebsite.com/Default.aspx';
$fields = array('__VIEWSTATE' => 'averylongvar',
                '__EVENTVALIDATION' => 'anotherverylongvar',
                'A few' => 'other variables');

$fields_string = http_build_query($fields);

$curl = curl_init($url);

curl_setopt_array
(
    $curl,
    array
    (
        CURLOPT_RETURNTRANSFER  =>    true,
        CURLOPT_SSL_VERIFYPEER  =>    0,  //    Not supported in PHP
        CURLOPT_SSL_VERIFYHOST  =>    0,  //        at this time.
        CURLOPT_HTTPHEADER      =>
            array
            (
                'Content-type: application/x-www-form-urlencoded; charset=utf-8',
                'Set-Cookie: ASP.NET_SessionId='.uniqid().'; path: /; HttpOnly'
            ),
        CURLOPT_POST            =>    true,
        CURLOPT_POSTFIELDS      =>    $fields_string,
        CURLOPT_FOLLOWLOCATION => 1
    )
);

$response = curl_exec($curl);
curl_close($curl);

echo $response;
?>

The original request in the browser sent the following headers:

  • Request URL: http://www.urltowebsite.com/default.aspx
  • Request Method:POST
  • Status Code: 200 OK

Request Headers

  • Accept:application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
  • Content-Type:application/x-www-form-urlencoded
  • User-Agent:Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_4; en-us) AppleWebKit/533.18.1 (KHTML, like Gecko) Version/5.0.2 Safari/533.18.5

Form Data

  • A lot of form fields

Response Headers

  • Cache-Control:private
  • Content-Length:30168
  • Content-Type:text/html; charset=utf-8
  • Date:Thu, 09 Sep 2010 17:22:29 GMT
  • Server:Microsoft-IIS/6.0
  • X-Aspnet-Version:2.0.50727
  • X-Powered-By:ASP.NET

When I inspect the request sent by the cURL script I wrote, it somehow does not include the form data, and the request method is not set to POST either. This seems to be where things go wrong, but I'm not sure.

Any help is appreciated!!!

EDIT: I forgot to mention that the result of the scraping is a custom session expired page of the remote website.


Since __VIEWSTATE and __EVENTVALIDATION are Base64-encoded strings, I've used urlencode() for those fields:

$fields = array('__VIEWSTATE' => urlencode( $averylongvar ),
                '__EVENTVALIDATION' => urlencode( $anotherverylongvar),
                'A few' => 'other variables');

That worked fine for me.
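
One caveat, shown as a minimal sketch: http_build_query(), which the question's code uses, already percent-encodes every value, so wrapping a value in urlencode() first encodes it twice. Calling urlencode() by hand is what you want when the POST body is concatenated manually. The sample value below is made up for illustration; real __VIEWSTATE strings are far longer.

```php
<?php
// Made-up stand-in for a Base64 value containing '+', '/' and '='.
$viewstate = 'dDwt+MTA4/w==';

// http_build_query() percent-encodes the special characters on its own.
$encodedOnce = http_build_query(['__VIEWSTATE' => $viewstate]);

// Calling urlencode() first means the '%' signs get encoded a second time.
$encodedTwice = http_build_query(['__VIEWSTATE' => urlencode($viewstate)]);

// Building the body by hand is the case where urlencode() is needed.
$manual = '__VIEWSTATE=' . urlencode($viewstate);

echo $encodedOnce, "\n";
echo $encodedTwice, "\n";
echo $manual, "\n";
```

So either pass the raw values to http_build_query(), or urlencode() them and join the pairs yourself, but not both.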


VIEWSTATE holds the state of the page at a particular moment, all encoded into one big, apparently messy string. So you cannot assume that a value you scraped earlier will still be valid for your "mock" request (I'm quite sure it will not be ;) ).
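
If you stay with plain PHP, one way to act on this is to request the form page first and read the current hidden fields out of the returned HTML instead of hard-coding them. A minimal sketch, assuming the fields are ordinary hidden input elements; the function name extract_hidden_fields and the sample markup are made up for illustration:

```php
<?php
/**
 * Collect name => value pairs of all hidden <input> elements in an HTML page.
 * Hypothetical helper for this sketch, not part of any library.
 */
function extract_hidden_fields(string $html): array
{
    if ($html === '') {
        return [];
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $fields = [];
    foreach ($doc->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '' && strtolower($input->getAttribute('type')) === 'hidden') {
            $fields[$name] = $input->getAttribute('value');
        }
    }
    return $fields;
}

// Made-up markup standing in for the page returned by a GET request.
$sample = '<form>'
        . '<input type="hidden" name="__VIEWSTATE" value="dDwt+MTA4/w==" />'
        . '<input type="hidden" name="__EVENTVALIDATION" value="/wEWAgK=" />'
        . '<input type="text" name="query" value="x" />'
        . '</form>';

$fields = extract_hidden_fields($sample);
print_r($fields);
```

The returned array can be merged with your own form values and sent in the POST, so the state always comes from the page the server just gave you.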

If you really have to deal with the VIEWSTATE and EVENTVALIDATION params, my advice is to take another approach: scrape the content via Selenium or an HtmlUnit-like library (unfortunately I don't know whether something similar exists for PHP).
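
If a browser-driven tool is not an option, a cURL-only route may still work as long as cURL handles the session cookie itself. Set-Cookie is a response header, so sending it from the client with a random uniqid() value does not create a session on the server, which may be why the site answers with its session-expired page. A rough sketch, assuming the site only needs its own cookie sent back; session_curl_options, request_with_session and $cookieFile are hypothetical names for this example:

```php
<?php
/**
 * Build cURL options that share one cookie file across requests, so the
 * ASP.NET_SessionId cookie issued by the server is stored and sent back.
 */
function session_curl_options(string $cookieFile, ?array $postFields = null): array
{
    $options = [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_COOKIEJAR      => $cookieFile, // received cookies are saved here
        CURLOPT_COOKIEFILE     => $cookieFile, // stored cookies are sent from here
    ];
    if ($postFields !== null) {
        // Setting both options makes cURL issue a form-encoded POST.
        $options[CURLOPT_POST]       = true;
        $options[CURLOPT_POSTFIELDS] = http_build_query($postFields);
    }
    return $options;
}

/** Perform a GET (no fields) or POST (with fields) using the shared cookie file. */
function request_with_session(string $url, string $cookieFile, ?array $postFields = null): string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, session_curl_options($cookieFile, $postFields));
    $body  = curl_exec($curl);
    $error = curl_error($curl);
    unset($curl); // releasing the handle writes the cookies to the jar

    if ($body === false) {
        throw new RuntimeException('Request failed: ' . $error);
    }
    return $body;
}

// Intended flow (not executed here, since it needs the real site):
// $jar  = tempnam(sys_get_temp_dir(), 'cookies');
// $page = request_with_session('http://www.urltowebsite.com/Default.aspx', $jar);
// ... read the current hidden fields from $page and add your own values ...
// $result = request_with_session('http://www.urltowebsite.com/Default.aspx', $jar, $fields);
```

The first request lets the server issue a real session cookie; the second sends that cookie back together with the form fields.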
