
Scraping data from a secure website or automating a mundane task

I have a website where I need to log in with a username, password, and captcha.

Once in, I have a control panel that has bookings. For each booking there is a link to a details page that has the email address of the person who made the booking.

Each day I need a list of all these email addresses to send an email to them.

I know how to scrape sites in .NET to get these types of details, but not for websites where I need to be logged in.

I've seen an article saying I can pass the cookie as a header and that should do the trick, but that would require me to view the cookie in Firebug and copy and paste it over.

This would be used by a non-technical person, so that's not really the best option.

The other thing I was thinking of is a script they can run that automates this in the browser. Any tips on how to do this?


There's something you should know whether you're querying the web through HtmlAgilityPack or using the HttpWebRequest class directly (HtmlAgilityPack uses it underneath): how to handle cookies.

Here are, basically, the steps you should follow (a sketch appears after the list):

  • Load the page where you need to log in.
  • Submit the required info using the POST method (username, password, or whatever else the page requests).
  • Save the cookies from the response, and use those cookies from now on.
  • Request the page with those cookies and parse it with HtmlAgilityPack.
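
A minimal sketch of the first three steps, assuming the login form posts username and password fields to a /login.aspx URL (the field names and the URL are placeholders; check the real form with Fiddler first). The captcha mentioned in the question would still block a purely programmatic login, which is where the browser-automation suggestions further down come in:

    using System;
    using System.IO;
    using System.Net;
    using System.Text;

    class LoginExample
    {
        // One container shared by every request so the session cookies persist.
        static readonly CookieContainer Cookies = new CookieContainer();

        static void LogIn()
        {
            // Hypothetical URL and form fields -- inspect the real login form first.
            var request = (HttpWebRequest)WebRequest.Create("https://example.com/login.aspx");
            request.Method = "POST";
            request.ContentType = "application/x-www-form-urlencoded";
            request.CookieContainer = Cookies;

            byte[] body = Encoding.UTF8.GetBytes(
                "username=" + Uri.EscapeDataString("myuser") +
                "&password=" + Uri.EscapeDataString("mypassword"));
            request.ContentLength = body.Length;
            using (Stream stream = request.GetRequestStream())
                stream.Write(body, 0, body.Length);

            // Reading the response lets the container capture the session cookies.
            using (var response = (HttpWebResponse)request.GetResponse())
            using (var reader = new StreamReader(response.GetResponseStream()))
                reader.ReadToEnd();
        }
    }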

Here's something I always do when using HtmlAgilityPack: send the request to the website with HttpWebRequest instead of using the Load(..) method of the HtmlWeb class.

Keep in mind that one of the overloads of the Load method of the HtmlDocument class takes a Stream. All you have to do is pass the response stream (obtained from request.GetResponseStream()) and you will have the HtmlDocument object you need.
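
Continuing the sketch above (the same CookieContainer from the login step is passed in), the response stream goes straight into HtmlDocument.Load. The URL and the XPath for pulling out the addresses are only guesses at the markup; adapt them to the real details page:

    using System.Collections.Generic;
    using System.Net;
    using HtmlAgilityPack;

    static class Scraper
    {
        // Request a page with the cookies saved at login and parse it.
        public static HtmlDocument GetPage(string url, CookieContainer cookies)
        {
            var request = (HttpWebRequest)WebRequest.Create(url);
            request.CookieContainer = cookies;   // reuse the session cookies

            using (var response = (HttpWebResponse)request.GetResponse())
            {
                var doc = new HtmlDocument();
                doc.Load(response.GetResponseStream());   // the Load(Stream) overload
                return doc;
            }
        }

        // Pull email addresses out of "mailto:" links on a details page.
        public static IEnumerable<string> GetEmails(HtmlDocument doc)
        {
            var links = doc.DocumentNode.SelectNodes("//a[starts-with(@href,'mailto:')]");
            if (links == null) yield break;
            foreach (HtmlNode link in links)
                yield return link.GetAttributeValue("href", "").Replace("mailto:", "");
        }
    }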

I suggest installing Fiddler. It is a really great tool for inspecting HTTP requests/responses, whether they come from your browser or from your application.

Run Fiddler, try to log on to the site through the browser, and watch what the browser sends to the page and what the page returns; that's exactly what you need to emulate with the HttpWebRequest class.

Edit:

The idea isn't just to pass a static cookie in the header. It must be the cookie returned by the page after you've logged in.

To handle cookies, take a look at the HttpWebRequest.CookieContainer property. It's easier than you think. All you need to do is declare an (empty) CookieContainer variable and assign it to that property before sending any request to the website. When the website responds, the cookies are added to that container automatically, so you will be able to use them the next time you request the website.

Edit 2:

If all you need is a script to automate it through your browser, take a look at WatiN library. I'm sure you will be able to run it by yourself after you see one or two examples of how to use it ;-)
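
A rough idea of what that looks like with WatiN, so a non-technical person could just run a small console app. The login URL and the field names (username, password, "Log in") are placeholders, and the captcha still has to be typed by the person watching the browser:

    using System;
    using WatiN.Core;

    class Program
    {
        [STAThread]  // WatiN's IE automation requires a single-threaded apartment
        static void Main()
        {
            using (var browser = new IE("https://example.com/login"))
            {
                // Field names are assumptions -- check the page's HTML.
                browser.TextField(Find.ByName("username")).TypeText("myuser");
                browser.TextField(Find.ByName("password")).TypeText("mypassword");

                // Pause so the person running it can fill in the captcha by hand.
                Console.WriteLine("Type the captcha in the browser, then press Enter here...");
                Console.ReadLine();

                browser.Button(Find.ByValue("Log in")).Click();
                browser.GoTo("https://example.com/bookings");
                // From here you can walk browser.Links, open each details page, etc.
            }
        }
    }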


To scrape a web site in .NET, there is the Html Agility Pack.

And here is the link that explains how to do login with it: Using HtmlAgilityPack to GET and POST web forms


For automating screen scraping, Selenium is a good tool. There are two things to do: 1) install Selenium IDE (works only in Firefox), and 2) install the Selenium RC server.

After starting Selenium IDE, go to the site that you are trying to automate and start recording the events you perform on the site. Think of it as recording a macro in the browser. Afterwards, you can export the recording as code in the language you want.
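
For reference, the exported C# (using the Selenium RC .NET client driver, namespace Selenium) looks roughly like the sketch below; the locators and URLs are made up, and it assumes the RC server is running locally on port 4444:

    using Selenium;  // Selenium RC .NET client driver

    class BookingScript
    {
        static void Main()
        {
            ISelenium selenium = new DefaultSelenium(
                "localhost", 4444, "*firefox", "https://example.com/");
            selenium.Start();

            selenium.Open("/login");
            selenium.Type("id=username", "myuser");     // locators are placeholders
            selenium.Type("id=password", "mypassword");
            selenium.Click("css=input[type=submit]");
            selenium.WaitForPageToLoad("30000");

            // ...navigate to the bookings page and read out the details here...

            selenium.Stop();
        }
    }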

Just so you know, Browsermob uses Selenium for load testing and for automating tasks in the browser.

I've uploaded a PPT that I made a while back; this should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html

At the above link, select the regular download option.

I spent a good amount of time figuring it out, so I thought it might save somebody else some time.
