Retrieving HTML pages from a 3rd-party login website with ASP.NET
Our situation: Our team needs to retrieve log information from a 3rd-party website. Specifically, this log information is call logs: our client rents an 866 number, and when calls come in, they assist callers and need to make corresponding notes in our application for the current call. Our client has a web account with the 3rd party that allows them to view the current call logs (date/time, phone number, amount of time on each call, etc.).
I contacted the developer of their website and inquired about an API or any other means of syncing our database with their constantly updating database. They currently DO NOT support an API. I informed them of my situation and they are perfectly fine with any way we can retrieve the information (bot/crawler). The 3rd party said that they are working on an API but could not give us a general timeline as to when it will be available, and as with every client, they need to start production ASAP.
I completely understand that if the 3rd party were to change their HTML layout, it may cause a slight headache for us (sorting the data from the webpage). That being said, this is a temporary solution to a long term issue. Once they implement their API, we will switch them over to it.
So my question is this: what is the best way to log into the 3rd-party website (see image: http://i903.photobucket.com/albums/ac239/jreedinc/customtf.jpg) and retrieve certain HTML pages? We have reviewed the source code of several web crawlers, but none of them have the ability to store cookies and post login information back to the website. We would prefer to do this in ASP.NET.
Is there another way to accomplish logging on to the website, then retrieving said information?
The classes you'll need are in the System.Net namespace. Below is some quick and dirty proof-of-concept code that logs in to a site using a form login plus cookies for security, and then scrapes the HTML output of a page.
In order to parse the HTML results you'll need an additional tool. Possible HTML parsing tools (a short parsing sketch follows the sample usage below):
SgmlReader converts HTML to XML; you then use .NET's XML features to extract data from the XML.
http://code.msdn.microsoft.com/SgmlReader
HTML Agility Pack allows XPath queries against HTML documents.
http://htmlagilitypack.codeplex.com/
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class WebWorker {
    /// <summary>
    /// Cookies for use by the web worker
    /// </summary>
    private System.Collections.Generic.List<System.Net.Cookie> cookies = new List<System.Net.Cookie>();

    public string GetWebPageContent(string url) {
        System.Net.HttpWebRequest request = (System.Net.HttpWebRequest) System.Net.WebRequest.Create(url);
        System.Net.CookieContainer cookieContainer = new System.Net.CookieContainer();
        request.CookieContainer = cookieContainer;
        request.Method = "GET";

        // add cookies to maintain session state
        foreach (System.Net.Cookie c in this.cookies) {
            cookieContainer.Add(c);
        }

        System.Net.HttpWebResponse response = request.GetResponse() as System.Net.HttpWebResponse;
        System.IO.Stream responseStream = response.GetResponseStream();
        System.IO.StreamReader sReader = new System.IO.StreamReader(responseStream);
        // read the response once and reuse it; calling ReadToEnd twice would return an empty string
        string content = sReader.ReadToEnd();
        System.Diagnostics.Debug.WriteLine("Content:\n" + content);
        return content;
    }

    public string Login(string url, string userIdFormFieldName, string userIdValue, string passwordFormFieldName, string passwordValue) {
        System.Net.HttpWebRequest request = (System.Net.HttpWebRequest) System.Net.WebRequest.Create(url);
        System.Net.CookieContainer cookieContainer = new System.Net.CookieContainer();
        request.CookieContainer = cookieContainer;
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.AllowAutoRedirect = false; // allowing the auto-redirect seems to lose the cookies

        string postData = System.Web.HttpUtility.UrlEncode(userIdFormFieldName) + "=" + System.Web.HttpUtility.UrlEncode(userIdValue) +
            "&" + System.Web.HttpUtility.UrlEncode(passwordFormFieldName) + "=" + System.Web.HttpUtility.UrlEncode(passwordValue);
        byte[] postDataBytes = System.Text.Encoding.UTF8.GetBytes(postData);
        request.ContentLength = postDataBytes.Length; // length in bytes, not characters

        System.IO.Stream requestStream = request.GetRequestStream();
        requestStream.Write(postDataBytes, 0, postDataBytes.Length);
        requestStream.Close();

        System.Net.HttpWebResponse response = request.GetResponse() as System.Net.HttpWebResponse;
        System.IO.Stream responseStream = response.GetResponseStream();
        System.IO.StreamReader sReader = new System.IO.StreamReader(responseStream);
        System.Diagnostics.Debug.WriteLine("Content:\n" + sReader.ReadToEnd());

        // capture the session cookies returned by the login so they can be
        // replayed on subsequent requests
        this.cookies.Clear();
        if (response.Cookies.Count > 0) {
            for (int i = 0; i < response.Cookies.Count; i++) {
                this.cookies.Add(response.Cookies[i]);
            }
        }
        return "OK";
    }
} // end class
// sample usage of the class
WebWorker worker = new WebWorker();
worker.Login("http://localhost/test/default.aspx", "uid", "bob", "pwd", "secret");
worker.GetWebPageContent("http://localhost/test/default.aspx");
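To illustrate the parsing step mentioned above, here is a minimal sketch using HTML Agility Pack. The table id "callLog", the column order, and the CallLogParser class name are placeholder assumptions; you would adjust the XPath and indexes to whatever markup the 3rd-party page actually uses.

using HtmlAgilityPack; // from http://htmlagilitypack.codeplex.com/

class CallLogParser {
    // Pulls rows out of a hypothetical <table id="callLog"> element.
    // Adjust the XPath and column indexes to match the real page.
    public static void ParseCallLog(string html) {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table[@id='callLog']//tr");
        if (rows == null) return; // table not found (layout may have changed)
        foreach (HtmlNode row in rows) {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 3) continue; // skip header rows
            string callDateTime = cells[0].InnerText.Trim();
            string phoneNumber = cells[1].InnerText.Trim();
            string duration = cells[2].InnerText.Trim();
            System.Diagnostics.Debug.WriteLine(callDateTime + " | " + phoneNumber + " | " + duration);
        }
    }
}

You would feed it the string returned by GetWebPageContent above.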
I recently used a tool called WebQL (it's a web scraping tool that lets the developer use SQL-like syntax to scrape information from web pages).
WebQL on Wikipedia
This is actually a relatively simple operation. What you need to do is find the page that the login form in the screenshot posts back to (something like login.php, etc.) and then construct a web request to that page with the login data you have. The response will hand back cookies (collected in a CookieContainer) that carry your login session for all subsequent requests.
You can look at this MSDN article for the basics of how to do it, but their write-up is somewhat confusing. Look at the community comments at the end for an example of how to post back page variables (like the username and password). You will need to make sure you pass the CookieContainer around on subsequent requests.
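Here is a minimal sketch of that idea, keeping one CookieContainer for both the login POST and the follow-up GET. The class name, URLs, and form data are placeholders, not the actual 3rd-party page details.

using System.IO;
using System.Net;
using System.Text;

class SessionScraper {
    // One container shared by every request keeps the login cookie alive.
    private CookieContainer jar = new CookieContainer();

    public void Login(string loginUrl, string postData) {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(loginUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = jar; // cookies set by the server land here
        byte[] bytes = Encoding.UTF8.GetBytes(postData);
        request.ContentLength = bytes.Length;
        using (Stream s = request.GetRequestStream()) {
            s.Write(bytes, 0, bytes.Length);
        }
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse()) { }
    }

    public string GetPage(string url) {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = jar; // replay the login cookie
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream())) {
            return reader.ReadToEnd();
        }
    }
}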
Unfortunately, .NET does not natively have something like WWW::Mechanize, but WebClient does have an UploadValues method which might make this easier. You will still have to manually inspect the login page to figure out which form fields you need to pass.
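A sketch of the WebClient approach follows. Note that WebClient does not track cookies on its own, so this subclasses it to attach a shared CookieContainer (a common workaround, not something the answer above spells out); the URLs and the "uid"/"pwd" field names are placeholders you would replace after inspecting the real login form.

using System;
using System.Collections.Specialized;
using System.Net;

// WebClient with cookie support: override GetWebRequest to attach a container.
class CookieAwareWebClient : WebClient {
    private readonly CookieContainer jar = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address) {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null) {
            httpRequest.CookieContainer = jar;
        }
        return request;
    }
}

class Example {
    static void Main() {
        using (CookieAwareWebClient client = new CookieAwareWebClient()) {
            NameValueCollection form = new NameValueCollection();
            form["uid"] = "bob";    // placeholder field name and value
            form["pwd"] = "secret"; // placeholder field name and value
            // UploadValues posts the form; the login cookie is captured in the container
            client.UploadValues("http://localhost/test/login.aspx", form);
            // subsequent requests reuse the same cookies
            string callLogHtml = client.DownloadString("http://localhost/test/default.aspx");
            Console.WriteLine(callLogHtml.Length);
        }
    }
}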