Looking for a free alternative to Webzinc .NET, screen scraping, web automation libraries for .NET [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 7 years ago.

I came across this .NET library:

http://www.webzinc.com/online/faq.aspx

However, I was wondering if there was a free alternative out there?


Building robots isn't that hard, and there are a number of books that describe the general algorithm for doing so (a simple Google search will turn up a number of algorithms).

The gist of it from a .NET perspective is to recursively:

  • Download pages - This is done through the HttpWebRequest/HttpWebResponse classes or the WebClient class. You can also use the new WCF Web API from CodePlex, which is a vast improvement over the above. It is meant specifically for producing and consuming REST content, and it works wonderfully for spidering purposes, mainly because of its extensibility.

  • Parse the downloaded content - I highly recommend the Html Agility Pack as well as the Fizzler extension for the Html Agility Pack. The Html Agility Pack will handle malformed HTML and allow you to query HTML elements using XPath (or a subset of it). Additionally, Fizzler will allow you to use CSS selectors, which will feel familiar if you have used them in jQuery.

  • Once you have the HTML in a structured format, scan the structure for the content that is relevant to you and process it.

    • Scan the structured format for external links and place them in a queue to be processed (subject to whatever constraints you want for your app; you aren't indexing the entire web, are you?).

    • Get the next item in the queue, and repeat the process again.
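
For illustration, here is a minimal sketch of the loop described above, using WebClient for the download step and the Html Agility Pack for parsing. The class name `CrawlerSketch`, the `maxPages` limit, the same-host constraint, and the starting URL are all assumptions made for the example, not part of any library. The "process" step just prints each page's title as a stand-in for whatever you actually want to extract.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack; // NuGet package: HtmlAgilityPack

static class CrawlerSketch // hypothetical name, for illustration only
{
    // Breadth-first crawl from `start`. Stops after `maxPages` downloads and
    // only follows links on the starting host (both are assumed constraints).
    public static void Crawl(Uri start, int maxPages)
    {
        var queue = new Queue<Uri>();
        var seen = new HashSet<Uri> { start }; // avoids queueing a URL twice
        queue.Enqueue(start);
        int visited = 0;

        using (var client = new WebClient())
        {
            while (queue.Count > 0 && visited < maxPages)
            {
                Uri page = queue.Dequeue();

                // 1. Download the page.
                string html;
                try
                {
                    html = client.DownloadString(page);
                }
                catch (WebException)
                {
                    continue; // skip pages that fail to download
                }
                visited++;

                // 2. Parse it; the Html Agility Pack tolerates malformed HTML.
                var doc = new HtmlDocument();
                doc.LoadHtml(html);

                // 3. Process the content you care about (here: the title).
                HtmlNode title = doc.DocumentNode.SelectSingleNode("//title");
                Console.WriteLine("{0} -> {1}", page, title?.InnerText.Trim());

                // 4. Queue the links. SelectNodes returns null when nothing matches.
                HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
                if (anchors == null) continue;

                foreach (HtmlNode a in anchors)
                {
                    string href = WebUtility.HtmlDecode(a.GetAttributeValue("href", ""));

                    // Resolve relative links against the current page, then
                    // apply the constraints before queueing.
                    if (Uri.TryCreate(page, href, out Uri link)
                        && (link.Scheme == Uri.UriSchemeHttp || link.Scheme == Uri.UriSchemeHttps)
                        && link.Host == start.Host
                        && seen.Add(link))
                    {
                        queue.Enqueue(link);
                    }
                }
                // 5. The while loop takes the next item from the queue and repeats.
            }
        }
    }

    static void Main()
    {
        // Placeholder URL; replace with the site you want to crawl.
        Crawl(new Uri("http://example.com/"), 10);
    }
}
```

If you prefer CSS selectors, the Fizzler package adds a `QuerySelectorAll` extension method on `HtmlNode` (via `using Fizzler.Systems.HtmlAgilityPack;`), so the link query could be written as `doc.DocumentNode.QuerySelectorAll("a[href]")` instead of the XPath above. A real robot would also want to respect robots.txt and throttle its requests, which this sketch leaves out.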
