Scraping data from a secure website or automating a mundane task
I have a website where I need to log in with a username, password, and captcha.
Once in, I have a control panel that has bookings. For each booking there is a link to a details page with the email address of the person who made the booking.
Each day I need a list of all these email addresses to send an email to them.
I know how to scrape sites in .NET to get these types of details but not for websites where I need to be logged in.
I've seen an article saying I can pass the cookie as a header and that should do the trick, but that would require me to view the cookie in Firebug and copy and paste it over.
This would be used by a non-technical person, so that's not really the best option.
The other thing I was thinking of is a script they can run that automates this in the browser. Any tips on how to do this?
There's something you should know no matter whether you're querying the web through HtmlAgilityPack or using the HttpWebRequest class directly (HtmlAgilityPack uses it internally): how to handle cookies.
Here are basically the steps you should follow (a rough sketch of the whole flow appears further down):
- Load the page where you log in.
- Submit the required info to log in using the POST method (username, password, or whatever the page requests).
- Save the cookies from the response, and use those cookies from now on.
- Request the page with those cookies and parse it with HtmlAgilityPack.
Here's something I always do when using HtmlAgilityPack: send the request to the website with the HttpWebRequest class instead of using the Load(..) method of the HtmlWeb class.
Keep in mind that one of the overloads of the Load method in the HtmlDocument class takes a Stream. All you have to do is pass the response stream (obtained from request.GetResponse().GetResponseStream()) and you will have the HtmlDocument object you need.
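For instance, a minimal sketch of that stream-to-document step might look like this (the URL is a placeholder; cookie handling is added in the fuller sketch further down):

```csharp
using System.Net;
using HtmlAgilityPack;

// Minimal sketch: fetch a page with HttpWebRequest and load the response
// stream straight into an HtmlDocument. The URL is a placeholder.
var request = (HttpWebRequest)WebRequest.Create("https://example.com/bookings");
using (var response = (HttpWebResponse)request.GetResponse())
using (var stream = response.GetResponseStream())
{
    var doc = new HtmlDocument();
    doc.Load(stream);
    // doc.DocumentNode is now ready for XPath queries.
}
```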
I suggest installing Fiddler. It is a really great tool for inspecting HTTP requests/responses, either from your browser or from your application.
Run Fiddler, try to log on to the site through the browser, and see what the browser sends to the page and what the page returns; that's exactly what you need to emulate using the HttpWebRequest class.
Edit:
The idea isn't just to pass a static cookie in the header. It must be the cookie returned by the page after logging in.
To handle cookies, take a look at the HttpWebRequest.CookieContainer property. It's easier than you think. All you need to do is declare an empty CookieContainer and assign it to that property before sending any request to the website. When the website responds, the cookies are added to that container automatically, so you will be able to use them the next time you request the website.
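Here's a rough sketch of the whole flow under those assumptions. The URLs, form field names, and body encoding are guesses; check the real login request with Fiddler first. Note it doesn't deal with the captcha, which is one reason the browser-automation route below may be the more practical answer:

```csharp
using System;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class BookingScraper
{
    static void Main()
    {
        var cookies = new CookieContainer();

        // 1) POST the login form; the container collects the session cookies.
        var loginRequest = (HttpWebRequest)WebRequest.Create("https://example.com/login");
        loginRequest.Method = "POST";
        loginRequest.ContentType = "application/x-www-form-urlencoded";
        loginRequest.CookieContainer = cookies;

        byte[] body = Encoding.UTF8.GetBytes("username=me&password=secret");
        loginRequest.ContentLength = body.Length;
        using (var requestStream = loginRequest.GetRequestStream())
            requestStream.Write(body, 0, body.Length);
        using (loginRequest.GetResponse()) { } // cookies now live in the container

        // 2) Request a protected page with the same container and parse it.
        var pageRequest = (HttpWebRequest)WebRequest.Create("https://example.com/bookings");
        pageRequest.CookieContainer = cookies;
        using (var response = (HttpWebResponse)pageRequest.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            var doc = new HtmlDocument();
            doc.Load(stream);
            var links = doc.DocumentNode.SelectNodes("//a");
            Console.WriteLine(links == null ? 0 : links.Count);
        }
    }
}
```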
Edit 2:
If all you need is a script to automate it through your browser, take a look at the WatiN library. I'm sure you will be able to get it running by yourself after you see one or two examples of how to use it ;-)
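A WatiN script is roughly this shape (the locators and URL are made up for illustration). Because it drives a real browser window, the non-technical user could even solve the captcha by hand before the script carries on:

```csharp
using System;
using WatiN.Core;

class LoginScript
{
    [STAThread]
    static void Main()
    {
        // Hypothetical field names and URL; adjust to the real login form.
        using (var browser = new IE("https://example.com/login"))
        {
            browser.TextField(Find.ByName("username")).TypeText("me");
            browser.TextField(Find.ByName("password")).TypeText("secret");

            // Pause so a person can solve the captcha in the open window.
            Console.WriteLine("Solve the captcha in the browser, then press Enter...");
            Console.ReadLine();

            browser.Button(Find.ByValue("Log in")).Click();
            // From here you can navigate to the bookings page and read values out.
        }
    }
}
```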
To scrape a web site in .NET, there is the Html Agility Pack.
And here is a link that explains how to log in with it: Using HtmlAgilityPack to GET and POST web forms
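As a taste of the parsing side, something like this pulls an email address out of a page once you can fetch it. The URL and XPath are placeholders, since the real markup isn't shown in the question; for pages behind a login you would fetch the HTML with the HttpWebRequest/CookieContainer approach above and call doc.Load(stream) instead of HtmlWeb.Load:

```csharp
using System;
using HtmlAgilityPack;

class ExtractEmail
{
    static void Main()
    {
        // Placeholder URL for a booking details page.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load("https://example.com/booking/123");

        // Guess at the markup: grab the first mailto: link on the page.
        var node = doc.DocumentNode.SelectSingleNode("//a[starts-with(@href,'mailto:')]");
        if (node != null)
            Console.WriteLine(node.InnerText.Trim());
    }
}
```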
For automating screen scraping, Selenium is a good tool. There are two things to install: 1) Selenium IDE (works only in Firefox), and 2) the Selenium RC Server.
After starting Selenium IDE, go to the site that you are trying to automate and start recording the events you perform on the site. Think of it as recording a macro in the browser. Afterwards, you get the code output for the language you want.
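The C# output from a recording is roughly this shape (the locators and URL below are made-up stand-ins for whatever the IDE records):

```csharp
using Selenium; // Selenium RC .NET client

class RecordedLogin
{
    static void Main()
    {
        // Assumes the Selenium RC server is running locally on port 4444.
        ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "https://example.com/");
        selenium.Start();
        selenium.Open("/login");
        selenium.Type("id=username", "me");
        selenium.Type("id=password", "secret");
        selenium.Click("id=loginButton");
        selenium.WaitForPageToLoad("30000");
        selenium.Stop();
    }
}
```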
Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.
I've uploaded a PPT that I made a while back. This should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html
At the above link, select the regular download option.
I spent a good amount of time figuring it out, so I thought it might save somebody else's time.