Any Good Open Source Web Crawling Framework in C#
I am building a shopping comparison engine and I need to build a crawling engine to perform the daily data collection process.
I have decided to build the crawler in C#. I have had a lot of bad experiences with the HttpWebRequest/HttpWebResponse classes, and they are known to be highly buggy and unstable for large crawls, so I have decided NOT to build on them. Even in Framework 4.0 they are buggy.
I speak from my own personal experience.
I would like opinions from experts here who have coded crawlers: do you know of any good open source crawling frameworks for C#, the way Java has Nutch and Apache Commons, which are very stable and highly robust libraries?
If there are some already existing crawling frameworks in C#, I shall go ahead and build my application on top of them.
If not, I am planning to take this solution from CodeProject and extend it:
http://www.codeproject.com/KB/IP/Crawler.aspx
If anyone can suggest a better path, I would be really thankful.
EDIT: Some of the sites I have to crawl render their pages with very complex JavaScript. This adds complexity to my crawler, since I need to be able to crawl pages rendered by JavaScript. If someone has used a C# library that can crawl JavaScript-rendered pages, please share. I have used WatiN, which I don't prefer, and I also know about Selenium. If you know of anything other than these, please share it with me and the community.
PhantomJS + HtmlAgilityPack
I know this topic is a bit old, but I've had the best results by far with PhantomJS. There is a NuGet package for it, and combining it with HtmlAgilityPack makes for a pretty decent fetching & scraping toolkit.
This example just uses PhantomJS's built-in parsing capabilities. It worked with a very old version of the library; since it still seems to be under active development, it's safe to assume that even more capabilities have been added.
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

void Test()
{
    var linkText = @"Help Spread DuckDuckGo!";
    // GoToUrl needs an absolute URL, including the scheme.
    Console.WriteLine(GetHyperlinkUrl("https://duckduckgo.com", linkText));
    // at the time of writing, this printed 'https://duckduckgo.com/spread'
}

/// <summary>
/// Loads pageUrl, finds a hyperlink containing searchLinkText, returns
/// its URL if found, otherwise an empty string.
/// </summary>
public string GetHyperlinkUrl(string pageUrl, string searchLinkText)
{
    using (IWebDriver phantom = new PhantomJSDriver())
    {
        phantom.Navigate().GoToUrl(pageUrl);
        // FindElement throws NoSuchElementException when nothing matches,
        // so use FindElements and check for an empty result instead.
        var links = phantom.FindElements(By.PartialLinkText(searchLinkText));
        if (links.Count > 0)
            return links[0].GetAttribute("href");
    }
    return string.Empty;
}
Abot C# Web Crawler
The description at http://code.google.com/p/abot/ says: "Abot is an open source C# web crawler built for speed and flexibility. It takes care of the low level plumbing (multithreading, http requests, scheduling, link parsing, etc..). You just hook into key events to process data or plugin your own implementations of core interfaces to take complete control over the crawl process."
I haven't used it, though.
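To make "low level plumbing" a bit more concrete, here is a minimal sketch of one such piece: the scheduling queue that decides which URLs still need to be fetched. This is not Abot's API. CrawlFrontier and its members are hypothetical names made up for illustration, and the sketch assumes you only want http(s) links on the seed's host, with "#fragment" parts ignored when checking for duplicates. It uses only the base class library and does no network I/O.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Abot or any other library.
public sealed class CrawlFrontier
{
    private readonly Queue<Uri> _pending = new Queue<Uri>();
    private readonly HashSet<string> _seen = new HashSet<string>(StringComparer.Ordinal);
    private readonly string _host;

    public CrawlFrontier(Uri seed)
    {
        _host = seed.Host;
        TryAdd(seed, seed.AbsoluteUri);
    }

    // Number of URLs still waiting to be fetched.
    public int Count => _pending.Count;

    // Resolves href against the page it was found on and queues it only if it
    // is an http(s) URL on the seed host that has not been queued before.
    public bool TryAdd(Uri page, string href)
    {
        if (!Uri.TryCreate(page, href, out var resolved)) return false;

        bool isWeb = resolved.Scheme == Uri.UriSchemeHttp
                  || resolved.Scheme == Uri.UriSchemeHttps;
        if (!isWeb) return false;
        if (!string.Equals(resolved.Host, _host, StringComparison.OrdinalIgnoreCase)) return false;

        // Scheme + host + path + query, i.e. the URL without its "#fragment".
        string key = resolved.GetLeftPart(UriPartial.Query);
        if (!_seen.Add(key)) return false;

        _pending.Enqueue(new Uri(key));
        return true;
    }

    // Returns the next URL to fetch; check Count first, this throws when empty.
    public Uri Next() => _pending.Dequeue();
}
```

A crawl loop would call Next() while Count is greater than zero, fetch the page, and feed every href it finds back through TryAdd. A framework adds politeness delays, retries and multithreading on top of this, which is the part you would otherwise have to write yourself.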
I know of something called NCrawler, available on CodePlex. I have not used it personally, but a colleague says it works OK.
arachnode.net can process JavaScript.
NCrawler does not support JavaScript, but it looks like a very good, easy-to-use solution if you don't need JavaScript execution.
I understand this topic is very old, but I made a solution for writing crawlers quickly, and it may be useful to someone else. The package name is
Laraue.Crawling.Dynamic.PuppeterSharp
The main idea is that you first describe the model you want to receive:
public class User
{
    // Properties are public so the schema expressions below can access them.
    public string Name { get; set; }
    public int Age { get; set; }
    public string[] ImageLinks { get; set; }
}
Then you describe how to fill its values:
var schema = new PuppeterSharpSchemaBuilder<User>()
    .HasProperty(x => x.Name, ".name")
    .HasProperty(x => x.Age, ".age")
    .HasArrayProperty(
        x => x.ImageLinks,
        ".links a",
        async handle => await handle.GetAttributeValueAsync("href"))
    .Build();
Then this schema can be parsed. The library uses the PuppeteerSharp package internally:
// Download browser and open the page
await new BrowserFetcher().DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions());
var page = await browser.NewPageAsync();
var response = await page.GoToAsync(link);
// Parse the page using described schema
var parser = new PuppeterSharpParser(new LoggerFactory());
var model = await parser.RunAsync(schema, await page.QuerySelectorAsync("body"));
When JS rendering is not required, the library also supports static crawling via the AngleSharp library, through the package
Laraue.Crawling.Static.AngleSharp
The schema is described the same way.