Screen-scraping a secure page of any site on https:// with ASP.NET in C#
I've scraped a secure page of a site over http with the code below:
string cookiedata = "fsfsfsdfsfsfsfsfsdf";
NetworkCredential credential = new NetworkCredential("xxx", "xxx");
HttpWebRequest request = HttpWebRequest.Create("https://ysats.com") as HttpWebRequest;
//set the user agent so it looks like IE to not raise suspicion
request.UserAgent = "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)";
request.Method = "POST";
//set the cookie in the request header
request.Headers.Add("Cookie", cookiedata);
request.Credentials = credential;
//get the response from the server
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
using (Stream stream = response.GetResponseStream())
{
using (StreamReader reader = new StreamReader(stream))
{
string pagedata = reader.ReadToEnd();
//now we can scrape the contents of the secure page as needed
//since the page contents is now stored in our pagedata string
Response.Write(pagedata);
}
}
response.Close();
But when I try to scrape a site on https:// with this code, I always get the login page instead of the secure page I want.
Please advise what I should do to scrape a secure page of a site on https.
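You need to send a POST request with the login details for the website, then scrape the page that follows the login. You'd also have to make sure your WebClient or HttpWebRequest keeps cookies around between the two requests (with HttpWebRequest, by sharing one CookieContainer).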
This will inevitably vary from site to site (what the fields are called, what information is required, etc.), so you won't be able to develop a blanket solution. You'd also have to check whether the login failed, or you'd end up scraping the login page again.
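As a rough sketch of the steps above, assuming a plain form-based login: the URLs, the field names "username" and "password", and the helper names are all hypothetical and would have to be replaced after inspecting the real login form. Some sites also require hidden form fields to be posted back, which this sketch does not handle.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class SecurePageScraper
{
    // Hypothetical URLs; replace with the real login and target pages.
    const string LoginUrl = "https://example.com/login";
    const string SecureUrl = "https://example.com/secure/page";

    // Builds a form-encoded body. The field names are assumptions;
    // check the login form's HTML for the names the site expects.
    public static string BuildLoginBody(string user, string password)
    {
        return "username=" + Uri.EscapeDataString(user) +
               "&password=" + Uri.EscapeDataString(password);
    }

    // Crude heuristic: a page that contains a password input is
    // probably still the login page. Adjust per site.
    public static bool LooksLikeLoginPage(string html)
    {
        return html.IndexOf("type=\"password\"",
                            StringComparison.OrdinalIgnoreCase) >= 0;
    }

    // Sends the request and returns the response body as a string.
    static string ReadBody(HttpWebRequest request)
    {
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    public static string Scrape(string user, string password)
    {
        // One container shared by both requests, so cookies set by
        // the login response are sent along with the second request.
        var cookies = new CookieContainer();

        // Step 1: POST the login details.
        var login = (HttpWebRequest)WebRequest.Create(LoginUrl);
        login.Method = "POST";
        login.ContentType = "application/x-www-form-urlencoded";
        login.CookieContainer = cookies;
        byte[] body = Encoding.UTF8.GetBytes(BuildLoginBody(user, password));
        login.ContentLength = body.Length;
        using (Stream stream = login.GetRequestStream())
        {
            stream.Write(body, 0, body.Length);
        }
        ReadBody(login);

        // Step 2: GET the secure page with the same cookies.
        var page = (HttpWebRequest)WebRequest.Create(SecureUrl);
        page.CookieContainer = cookies;
        string html = ReadBody(page);

        // Step 3: check that we did not just get the login page again.
        if (LooksLikeLoginPage(html))
        {
            throw new InvalidOperationException("Login appears to have failed.");
        }
        return html;
    }
}
```

Calling SecurePageScraper.Scrape("xxx", "xxx") would then return the page contents, or throw if the result still looks like a login form.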
See also this duplicate question.