Retrieving HTML pages from a 3rd-party login website with ASP.NET
Our situation: Our team needs to retrieve log information from a 3rd-party website. Specifically, this log information is call logs: our client rents an 866 number, and when calls come in, they assist callers and need to make corresponding notes in our application for the current call. Our client has a web account with the 3rd party that allows them to view the current call logs (date/time, phone number, amount of time on each call, etc.).
I contacted the developer of their website and inquired about an API or any other means of syncing our database with their constantly updating database. They currently DO NOT support an API. I informed them of my situation and they are perfectly fine with any way we can retrieve the information (bot/crawler). The 3rd party said that they are working on an API but could not give us a general timeline as to when it will be available, and as with every client, they need to start production ASAP.
I completely understand that if the 3rd party were to change their HTML layout, it may cause a slight headache for us (sorting the data from the webpage). That being said, this is a temporary solution to a long term issue. Once they implement their API, we will switch them over to it.
So my question is this: what is the best way to log into the 3rd-party website (see image: http://i903.photobucket.com/albums/ac239/jreedinc/customtf.jpg) and retrieve certain HTML pages? We have reviewed the source code of several web crawlers, but none of them have the ability to store cookies and post login information back to the website. We would prefer to do this in ASP.NET.
Is there another way to accomplish logging on to the website, then retrieving said information?
The classes you'll need are in the System.Net namespace. Below is some quick and dirty proof-of-concept code that logs in to a site using a form login plus cookies for security, and then scrapes the HTML output of a page.
In order to parse the HTML results you'll need an additional tool. Possible HTML parsing tools (a short parsing sketch follows the sample usage below):
SgmlReader converts HTML to XML; you then use .NET's XML features to extract data from the XML.
http://code.msdn.microsoft.com/SgmlReader
HTML Agility Pack allows XPath queries against HTML documents.
http://htmlagilitypack.codeplex.com/
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
class WebWorker {
    /// <summary>
    /// Cookies for use by the web worker
    /// </summary>
    private System.Collections.Generic.List<System.Net.Cookie> cookies = new List<System.Net.Cookie>();

    public string GetWebPageContent(string url) {
        System.Net.HttpWebRequest request = (System.Net.HttpWebRequest) System.Net.WebRequest.Create(url);
        System.Net.CookieContainer cookieContainer = new System.Net.CookieContainer();
        request.CookieContainer = cookieContainer;
        request.Method = "GET";

        // add cookies to maintain session state
        foreach (System.Net.Cookie c in this.cookies) {
            cookieContainer.Add(c);
        }

        System.Net.HttpWebResponse response = request.GetResponse() as System.Net.HttpWebResponse;
        System.IO.Stream responseStream = response.GetResponseStream();
        System.IO.StreamReader sReader = new System.IO.StreamReader(responseStream);
        // read the response once and reuse it; calling ReadToEnd twice would return an empty string
        string content = sReader.ReadToEnd();
        System.Diagnostics.Debug.WriteLine("Content:\n" + content);
        return content;
    }

    public string Login(string url, string userIdFormFieldName, string userIdValue, string passwordFormFieldName, string passwordValue) {
        System.Net.HttpWebRequest request = (System.Net.HttpWebRequest) System.Net.WebRequest.Create(url);
        System.Net.CookieContainer cookieContainer = new System.Net.CookieContainer();
        request.CookieContainer = cookieContainer;
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.AllowAutoRedirect = false; // allowing the auto-redirect seems to lose the cookies

        string postData = System.Web.HttpUtility.UrlEncode(userIdFormFieldName) + "=" + System.Web.HttpUtility.UrlEncode(userIdValue) +
            "&" + System.Web.HttpUtility.UrlEncode(passwordFormFieldName) + "=" + System.Web.HttpUtility.UrlEncode(passwordValue);
        byte[] postDataBytes = System.Text.Encoding.UTF8.GetBytes(postData);
        request.ContentLength = postDataBytes.Length; // length in bytes, not characters

        System.IO.Stream requestStream = request.GetRequestStream();
        requestStream.Write(postDataBytes, 0, postDataBytes.Length);
        requestStream.Close();

        System.Net.HttpWebResponse response = request.GetResponse() as System.Net.HttpWebResponse;
        System.IO.Stream responseStream = response.GetResponseStream();
        System.IO.StreamReader sReader = new System.IO.StreamReader(responseStream);
        System.Diagnostics.Debug.WriteLine("Content:\n" + sReader.ReadToEnd());

        // capture the session cookies returned by the login so they can be
        // replayed on subsequent requests
        this.cookies.Clear();
        if (response.Cookies.Count > 0) {
            for (int i = 0; i < response.Cookies.Count; i++) {
                this.cookies.Add(response.Cookies[i]);
            }
        }
        return "OK";
    }
} // end class
// sample usage of the class
WebWorker worker = new WebWorker();
worker.Login("http://localhost/test/default.aspx", "uid", "bob", "pwd", "secret");
worker.GetWebPageContent("http://localhost/test/default.aspx");
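To illustrate the parsing step mentioned above, here is a minimal sketch using HTML Agility Pack. The table id "callLog", the column order, and the CallLogParser class name are placeholder assumptions; you would adjust the XPath and indexes to whatever markup the 3rd-party page actually uses.

using HtmlAgilityPack; // from http://htmlagilitypack.codeplex.com/

class CallLogParser {
    // Pulls rows out of a hypothetical <table id="callLog"> element.
    // Adjust the XPath and column indexes to match the real page.
    public static void ParseCallLog(string html) {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//table[@id='callLog']//tr");
        if (rows == null) return; // table not found (layout may have changed)
        foreach (HtmlNode row in rows) {
            HtmlNodeCollection cells = row.SelectNodes("td");
            if (cells == null || cells.Count < 3) continue; // skip header rows
            string callDateTime = cells[0].InnerText.Trim();
            string phoneNumber = cells[1].InnerText.Trim();
            string duration = cells[2].InnerText.Trim();
            System.Diagnostics.Debug.WriteLine(callDateTime + " | " + phoneNumber + " | " + duration);
        }
    }
}

You would feed it the string returned by GetWebPageContent above.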
I recently used a tool called WebQL (it's a web scraping tool that lets the developer use SQL-like syntax to scrape information from web pages).
WebQL on Wikipedia
This is actually a relatively simple operation. What you need to do is find the page that the login form in the screenshot posts back to (something like login.php, etc.) and then construct a web request to that page with the login data you have. The response will hand back cookies (collected in a CookieContainer) that carry your login session for all subsequent requests.
You can look at this MSDN article for the basics of how to do it, but their write-up is somewhat confusing. Look at the community comments at the end for an example of how to post back page variables (like the username and password). You will need to make sure you pass the CookieContainer around on subsequent requests.
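Here is a minimal sketch of that idea, keeping one CookieContainer for both the login POST and the follow-up GET. The class name, URLs, and form data are placeholders, not the actual 3rd-party page details.

using System.IO;
using System.Net;
using System.Text;

class SessionScraper {
    // One container shared by every request keeps the login cookie alive.
    private CookieContainer jar = new CookieContainer();

    public void Login(string loginUrl, string postData) {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(loginUrl);
        request.Method = "POST";
        request.ContentType = "application/x-www-form-urlencoded";
        request.CookieContainer = jar; // cookies set by the server land here
        byte[] bytes = Encoding.UTF8.GetBytes(postData);
        request.ContentLength = bytes.Length;
        using (Stream s = request.GetRequestStream()) {
            s.Write(bytes, 0, bytes.Length);
        }
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse()) { }
    }

    public string GetPage(string url) {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.CookieContainer = jar; // replay the login cookie
        using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream())) {
            return reader.ReadToEnd();
        }
    }
}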
Unfortunately, .NET does not natively have something like WWW::Mechanize, but WebClient does have an UploadValues method which might make this easier. You will still have to manually inspect the login page to figure out which form fields you need to pass.
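A sketch of the WebClient approach follows. Note that WebClient does not track cookies on its own, so this subclasses it to attach a shared CookieContainer (a common workaround, not something the answer above spells out); the URLs and the "uid"/"pwd" field names are placeholders you would replace after inspecting the real login form.

using System;
using System.Collections.Specialized;
using System.Net;

// WebClient with cookie support: override GetWebRequest to attach a container.
class CookieAwareWebClient : WebClient {
    private readonly CookieContainer jar = new CookieContainer();

    protected override WebRequest GetWebRequest(Uri address) {
        WebRequest request = base.GetWebRequest(address);
        HttpWebRequest httpRequest = request as HttpWebRequest;
        if (httpRequest != null) {
            httpRequest.CookieContainer = jar;
        }
        return request;
    }
}

class Example {
    static void Main() {
        using (CookieAwareWebClient client = new CookieAwareWebClient()) {
            NameValueCollection form = new NameValueCollection();
            form["uid"] = "bob";    // placeholder field name and value
            form["pwd"] = "secret"; // placeholder field name and value
            // UploadValues posts the form; the login cookie is captured in the container
            client.UploadValues("http://localhost/test/login.aspx", form);
            // subsequent requests reuse the same cookies
            string callLogHtml = client.DownloadString("http://localhost/test/default.aspx");
            Console.WriteLine(callLogHtml.Length);
        }
    }
}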