
Screen scrape an ASP.NET Page not working

I am trying to bring back the calendar events on the page at the following site: http://www.wphospital.org/News-Events/Calendar-of-Events.aspx

Notice that this site has a link called "Month". I need to be able to POST data requesting the calendar events for a particular month, but I cannot get this to work. Here is the code:

private static void GetData(ref string buf)
{
    try
    {
        //First, request the search form to get the viewstate value 
        HttpWebRequest webRequest = default(HttpWebRequest);
        webRequest = (HttpWebRequest)System.Net.WebRequest.Create("http://www.wphospital.org/News-Events/Calendar-of-Events.aspx");
        StreamReader responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());
        string responseData = responseReader.ReadToEnd();
        responseReader.Close();

        //Extract the viewstate value and build out POST data 
        string viewState = ExtractViewState(responseData);
        string eventValidation = ExtractEventValidation(responseData);
        string postData = null;

        postData = String.Format("ctl00$manScript={0}&__EVENTTARGET=&__EVENTARGUMENT&__LASTFOCUS=&__VIEWSTATE={1}&lng={2}&__EVENTVALIDATION={3}&ctl00$searchbox1$txtWord={4}&textfield2={5}&ctl00$plcMain$lstbxCategory={6}&ctl00$plcMain$lstbxSubCategory={7}", "ctl00$plcMain$updMonthNav|ctl00$plcMain$btnNextMonth", viewState, "en-US", eventValidation, "Search", "your search here", 0, 0);

        var encoding = new ASCIIEncoding();
        byte[] data = encoding.GetBytes(postData);

        //Now post to the search form 
        webRequest = (HttpWebRequest)System.Net.WebRequest.Create("http://www.wphospital.org/News-Events/Calendar-of-Events.aspx");
        webRequest.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)";
        webRequest.Method = "POST";
        webRequest.ContentType = "application/x-www-form-urlencoded";
        webRequest.ContentLength = data.Length;

        var newStream = webRequest.GetRequestStream();
        newStream.Write(data, 0, data.Length);
        newStream.Close();

        responseReader = new StreamReader(webRequest.GetResponse().GetResponseStream());

        //And read the response 
        responseData = responseReader.ReadToEnd();
        responseReader.Close();
        buf = responseData;
    }
    catch (WebException ex)
    {
        if (ex.Status == WebExceptionStatus.ProtocolError)
        {
            Console.Write("The server returned protocol error ");
            // Get HttpWebResponse so that you can check the HTTP status code.
            HttpWebResponse httpResponse = (HttpWebResponse)ex.Response;
            int sc = (int)httpResponse.StatusCode;
            string strsc = httpResponse.StatusCode.ToString();
        }
    }
}

private static string ExtractViewState(string s)
{
    string viewStateNameDelimiter = "__VIEWSTATE";
    string valueDelimiter = "value=\"";

    int viewStateNamePosition = s.IndexOf(viewStateNameDelimiter);
    int viewStateValuePosition = s.IndexOf(valueDelimiter, viewStateNamePosition);

    int viewStateStartPosition = viewStateValuePosition + valueDelimiter.Length;
    int viewStateEndPosition = s.IndexOf("\"", viewStateStartPosition);

    return HttpUtility.UrlEncodeUnicode(s.Substring(viewStateStartPosition, viewStateEndPosition - viewStateStartPosition));
}

Can anyone point me in the right direction?


This may or may not solve your problem because I don't know exactly what the problem is when you say it's not working. But as "Al W" noted - the response from an async postback is not going to look like a straight HTML stream. So if your problem is parsing it afterwards then this might help.

I had the "opportunity" to discover this recently because I needed to rewrite that output. I'm working on a C# jQuery port and found that I was breaking WebForms pages when I tried to re-render the output stream during an async postback. I went through the client script that parses the response and figured out the format of the response.

Each panel that is updated will return a block of data that is formatted like:

"Length|Type|ID|Content"

There could be any number of these strung together. Type is "updatePanel" for UpdatePanels, ID is the UniqueID of the control, and Content is the actual HTML data. Length is the length of Content (counted in characters, since the client script slices the response as a string), and you need to use it to parse each block, because the separator character may appear inside Content itself. So if you want to rewrite this data before sending it back to an ASP.NET page (like I did), you need to update Length to reflect the final length of your content.

The code I used to parse and rewrite it is in Server/CsQueryHttpContext.
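
To make the block format concrete, here is a minimal sketch of a length-driven parser. It is not the CsQuery code mentioned above; the DeltaBlock and DeltaParser names are made up for illustration, and it assumes that Length counts characters and that each block may be followed by one trailing "|".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical container for one "Length|Type|ID|Content" block.
public sealed class DeltaBlock
{
    public string Type { get; set; }
    public string Id { get; set; }
    public string Content { get; set; }
}

public static class DeltaParser
{
    // Splits an async-postback response into its blocks.
    // Content is cut out by Length, never by searching for '|',
    // because '|' may legitimately appear inside Content.
    public static List<DeltaBlock> Parse(string response)
    {
        var blocks = new List<DeltaBlock>();
        int pos = 0;

        while (pos < response.Length)
        {
            // Locate the three separators that end Length, Type and ID.
            int p1 = response.IndexOf('|', pos);
            if (p1 < 0) break;
            int p2 = response.IndexOf('|', p1 + 1);
            if (p2 < 0) break;
            int p3 = response.IndexOf('|', p2 + 1);
            if (p3 < 0) break;

            int length;
            if (!int.TryParse(response.Substring(pos, p1 - pos), out length) || length < 0)
                throw new FormatException("Block at offset " + pos + " has no valid length.");

            int contentStart = p3 + 1;
            if (contentStart + length > response.Length)
                throw new FormatException("Block at offset " + pos + " is truncated.");

            blocks.Add(new DeltaBlock
            {
                Type = response.Substring(p1 + 1, p2 - p1 - 1),
                Id = response.Substring(p2 + 1, p3 - p2 - 1),
                Content = response.Substring(contentStart, length)
            });

            pos = contentStart + length;

            // Assumption: a single '|' may terminate each block; skip it if present.
            if (pos < response.Length && response[pos] == '|') pos++;
        }

        return blocks;
    }
}
```

For scraping, you would call DeltaParser.Parse on the response text, pick the blocks whose Type is "updatePanel", and run your HTML extraction on their Content.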


For POST operations, you want the body to be UTF-8 encoded, so just replace the two encoding lines:

        //var encoding = new ASCIIEncoding();
        //byte[] data = encoding.GetBytes(postData);
        //do this instead.....
        byte[] data = Encoding.UTF8.GetBytes(postData);

and see if this helps you out


Below is the network trace I get in Chrome when clicking the Month button. Notice the __EVENTTARGET:ctl00$plcMain$monthBtn entry. ASP.NET emits client script that calls __doPostBack() when that link is clicked, and that call sets the event target. That's basically how ASP.NET WebForms knows which event to fire on a postback. One tricky thing here is that the page uses an UpdatePanel, so you might not get a true HTML response. If you can get your request to look something like this, then you should get a successful response. Hope this helps.

Request URL:http://www.wphospital.org/News-Events/Calendar-of-Events.aspx
Request Method:POST
Status Code:200 OK
Request Headers
Accept-Charset:ISO-8859-1,utf-8;q=0.7,*;q=0.3
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Cache-Control:no-cache
Content-Length:9718
Content-Type:application/x-www-form-urlencoded
Cookie:CMSPreferredCulture=en-US; ASP.NET_SessionId=h2nval45vq0q5yb0cp233huc; __utma=101137351.234148951.1312486481.1312486481.1312486481.1; __utmb=101137351.1.10.1312486481; __utmc=101137351; __utmz=101137351.1312486481.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __unam=ef169fe-131964a5f2a-24ec879b-1
Host:www.wphospital.org
Origin:http://www.wphospital.org
Proxy-Connection:keep-alive
Referer:http://www.wphospital.org/News-Events/Calendar-of-Events.aspx
User-Agent:Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.124 Safari/534.30
X-MicrosoftAjax:Delta=true
Form Data
ctl00$manScript:ctl00$plcMain$updTab|ctl00$plcMain$monthBtn
__EVENTTARGET:ctl00$plcMain$monthBtn
__EVENTARGUMENT:
__LASTFOCUS:
__VIEWSTATE:<removed for brevity>
lng:en-US
__EVENTVALIDATION:/wEWEgLbj/nSDgKt983zDgKWlOLbAQKr3LqFAwKL3uqpBwK9kfRnArDHltMCAuTk0eAHAsfniK0DAteIosMPAsiIosMPAsmIosMPAsuIosMPAoD0ookDApCbiOcPAo biOcPAombiOcPAoubiOcPyfqRx8FdqYzlnnkXcJEJZzzopJY=
ctl00$searchbox1$txtWord:Search
textfield2:Enter your search here
ctl00$plcMain$lstbxCategory:0
ctl00$plcMain$lstbxSubCategory:0
ctl00$plcMain$hdnEventCount:2
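
As a rough sketch of how a request shaped like the trace above could be built in C#: the PostbackRequest class and its method names are invented for illustration, the field names and fixed values are copied from the trace and may differ on another page or another day, and viewState and eventValidation are assumed to be the raw hidden-field values from the initial GET (not already URL-encoded, unlike what ExtractViewState in the question returns).

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class PostbackRequest
{
    // URL-encodes every name and value, so '$', '|', '+', '/' and '='
    // in control IDs and base64 payloads survive the trip intact.
    public static string BuildFormBody(IEnumerable<KeyValuePair<string, string>> fields)
    {
        var sb = new StringBuilder();
        foreach (var field in fields)
        {
            if (sb.Length > 0) sb.Append('&');
            sb.Append(Uri.EscapeDataString(field.Key));
            sb.Append('=');
            sb.Append(Uri.EscapeDataString(field.Value ?? string.Empty));
        }
        return sb.ToString();
    }

    // Form fields as they appear in the trace for the Month button.
    public static List<KeyValuePair<string, string>> MonthFields(string viewState, string eventValidation)
    {
        return new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("ctl00$manScript", "ctl00$plcMain$updTab|ctl00$plcMain$monthBtn"),
            new KeyValuePair<string, string>("__EVENTTARGET", "ctl00$plcMain$monthBtn"),
            new KeyValuePair<string, string>("__EVENTARGUMENT", ""),
            new KeyValuePair<string, string>("__LASTFOCUS", ""),
            new KeyValuePair<string, string>("__VIEWSTATE", viewState),
            new KeyValuePair<string, string>("lng", "en-US"),
            new KeyValuePair<string, string>("__EVENTVALIDATION", eventValidation),
            new KeyValuePair<string, string>("ctl00$searchbox1$txtWord", "Search"),
            new KeyValuePair<string, string>("textfield2", "Enter your search here"),
            new KeyValuePair<string, string>("ctl00$plcMain$lstbxCategory", "0"),
            new KeyValuePair<string, string>("ctl00$plcMain$lstbxSubCategory", "0"),
            new KeyValuePair<string, string>("ctl00$plcMain$hdnEventCount", "2")
        };
    }

    // Prepares the POST with the headers seen in the trace. The caller still
    // has to write the body to GetRequestStream() and read the response.
    public static HttpWebRequest Create(string url, int bodyLength, CookieContainer cookies)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.ContentLength = bodyLength;
        request.Referer = url;
        request.CookieContainer = cookies;                   // reuse the container from the initial GET
        request.Headers["X-MicrosoftAjax"] = "Delta=true";   // marks the request as an async postback
        return request;
    }
}
```

The body bytes would come from Encoding.UTF8.GetBytes(PostbackRequest.BuildFormBody(PostbackRequest.MonthFields(viewState, eventValidation))). Sharing one CookieContainer between the GET and the POST keeps the ASP.NET_SessionId cookie consistent. If the delta format is inconvenient, dropping the X-MicrosoftAjax header and the ctl00$manScript field should, as far as I know, make the server treat this as an ordinary postback and return the full page, though I have not verified that against this site.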