
Screen Scraping [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

Questions asking us to recommend or find a book, tool, software library, tutorial or other off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.

Closed 8 years ago.


Just curious: What do you find to be your best tools for creating automated screen scrapes these days? Is the .NET HTML Agility Pack a good option? What do you do about scraping sites that use a lot of AJAX?


I find that if the page has a fairly static layout, the HTML Agility Pack is perfect for getting all the data I need. I've yet to run into a page it couldn't handle or that didn't give me the results I wanted.
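To give a concrete idea, here is a minimal sketch of the kind of thing I do with it; the URL and XPath expression are placeholders, not from any real site:

    // Minimal HTML Agility Pack sketch: download a static page and pull values out with XPath.
    using System;
    using HtmlAgilityPack;

    class Scraper
    {
        static void Main()
        {
            var web = new HtmlWeb();
            HtmlDocument doc = web.Load("https://example.com/products"); // placeholder URL

            // SelectNodes returns null when nothing matches, so guard against it.
            var titles = doc.DocumentNode.SelectNodes("//h2[@class='title']");
            if (titles != null)
            {
                foreach (HtmlNode node in titles)
                    Console.WriteLine(node.InnerText.Trim());
            }
        }
    }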

If you find that the page is rendered with a great deal of dynamic code, you're going to have to do more than just download it; you'll have to actually execute it.

To do that, you'll need something like the WebKit .NET library (a .NET wrapper around the WebKit rendering engine), which will let you download the page and actually execute its JavaScript as well. Then, once you are sure the document has rendered completely, you can pull out the page details.
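I can't vouch for the exact WebKit .NET API from memory, but the same idea can be sketched with the stock WinForms WebBrowser control: navigate, let the page and its scripts run, then read the rendered DOM. Everything below, including the URL, is illustrative only.

    // Sketch of "execute the page, then scrape the rendered DOM" using the
    // built-in WinForms WebBrowser control (not WebKit .NET, whose API may differ).
    using System;
    using System.Windows.Forms;

    class DynamicScraper
    {
        [STAThread]
        static void Main()
        {
            var browser = new WebBrowser { ScriptErrorsSuppressed = true };
            browser.DocumentCompleted += (sender, e) =>
            {
                // Scripts have had a chance to run; for heavy AJAX pages you may
                // still need to wait or poll for a specific element to appear.
                Console.WriteLine(browser.Document.Body.InnerHtml);
                Application.ExitThread();
            };
            browser.Navigate("https://example.com/ajax-page"); // placeholder URL
            Application.Run(); // pump messages so the control can render
        }
    }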


For the very basics I use:

  • Asynchronous HTTP Client - notably faster than the standard HttpWeb* classes (preliminary tests showed it was about 25% faster); a baseline sketch with the standard classes follows this list.
  • Majestic 12 HTML Parser - about 50-100% faster than HTML Agility Pack.
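For reference, this is roughly the standard HttpWebRequest asynchronous pattern that the client above is being measured against; the URL is a placeholder:

    // Baseline: asynchronous download with the standard HttpWebRequest classes.
    using System;
    using System.IO;
    using System.Net;
    using System.Threading;

    class AsyncDownload
    {
        static void Main()
        {
            var done = new ManualResetEvent(false);
            var request = (HttpWebRequest)WebRequest.Create("https://example.com/"); // placeholder

            request.BeginGetResponse(ar =>
            {
                using (var response = (HttpWebResponse)request.EndGetResponse(ar))
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    string html = reader.ReadToEnd();
                    Console.WriteLine("Downloaded {0} characters", html.Length);
                }
                done.Set();
            }, null);

            done.WaitOne(); // block until the callback finishes
        }
    }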

I don't have JavaScript support in place yet, but I'm planning to use Google's V8 JavaScript engine. That requires calling into unmanaged code, but V8's performance justifies it.


For automating screen scraping, Selenium is a good tool. There are two things to set up: 1) install the Selenium IDE (it works only in Firefox), and 2) install the Selenium RC server.

After starting the Selenium IDE, go to the site you are trying to automate and start recording the events you perform on it. Think of it as recording a macro in the browser. Afterwards, you can export the recording as code in the language you want; a rough idea of the C# export is sketched below.
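The export looks roughly like the following for the C# (Selenium RC) client; the browser string, URL, and locators are made-up placeholders:

    // Approximate shape of a Selenium IDE export for the C# (Selenium RC) client.
    // Assumes the Selenium RC server is already running on localhost:4444.
    using System;
    using Selenium;

    class SeleniumExample
    {
        static void Main()
        {
            ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox", "https://example.com/");
            selenium.Start();

            selenium.Open("/search");
            selenium.Type("id=query", "screen scraping"); // a recorded "type" event
            selenium.Click("id=submit");                  // a recorded "click" event
            selenium.WaitForPageToLoad("30000");

            Console.WriteLine(selenium.GetTitle());
            selenium.Stop();
        }
    }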

Just so you know, BrowserMob uses Selenium for load testing and for automating tasks in the browser.

I've uploaded a PPT I made a while back; it should save you a good amount of time: http://www.4shared.com/get/tlwT3qb_/SeleniumInstructions.html

At that link, select the regular download option.

I spent a good amount of time figuring this out, so I thought it might save somebody else's.


The best tool "these days" is one that not only gives you the desired features (JavaScript support, automation) but also one that you don't have to run yourself. I am, of course, alluding to using a cloud service. This approach saves you network bandwidth, delivers results faster (because it can scale better than the custom solution you'd likely end up developing) and, most importantly, spares you the IT and maintenance headache.

On that note, check out a scraping solution called Bobik (http://usebobik.com). I've written an article about it at http://zscraper.wordpress.com/2012/07/03/a-comparison-shopping-android-app-without-backend/.

Hope this helps.
