Any Good Open Source Web Crawling Framework in C#

Posted: 2023-01-28 06:11

I am building a shopping comparison engine, and I need a crawling engine to perform the daily data collection.

I have decided to build the crawler in C#. I have had a lot of bad experiences with the HttpWebRequest/HttpWebResponse classes, which I have found to be buggy and unstable for large crawls, so I have decided NOT to build on them. Even in .NET Framework 4.0 they are buggy.

I speak from personal experience.

I would like opinions from people here who have written crawlers: do you know of any good open source crawling frameworks for C#, comparable to Nutch and Apache Commons in Java, which are very stable and robust libraries?

If crawling frameworks already exist in C#, I will build my application on top of them.

If not, I am planning to extend this solution from CodeProject:

http://www.codeproject.com/KB/IP/Crawler.aspx

If anyone can suggest a better path, I would be really thankful.

EDIT: Some of the sites I have to crawl render their pages with very complex JavaScript. This adds complexity to my crawler, since it must be able to crawl JavaScript-rendered pages. If anyone has used a C# library that can crawl JavaScript-rendered pages, please share. I have used WatiN, which I don't like, and I also know about Selenium. If you know of anything other than these, please share it with me and the community.

PhantomJS + HtmlAgilityPack

I know this topic is a bit old, but I've had the best results by far with PhantomJS. There is a NuGet package for it, and combining it with HtmlAgilityPack makes for a pretty decent fetching & scraping toolkit.

This example uses only PhantomJS's built-in parsing capabilities. It worked with a very old version of the library; the project still seemed to be under active development at the time, so newer versions have probably added more capabilities.

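using System;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.PhantomJS;

void Test()
{
    var linkText = "Help Spread DuckDuckGo!";
    // The URL needs an explicit scheme for the driver to load it.
    Console.WriteLine(GetHyperlinkUrl("https://duckduckgo.com", linkText));
    // At the time of writing, this printed 'https://duckduckgo.com/spread'
}

/// <summary>
/// Loads pageUrl, finds a hyperlink containing searchLinkText, returns
/// its URL if found, otherwise an empty string.
/// </summary>
public string GetHyperlinkUrl(string pageUrl, string searchLinkText)
{
    using (IWebDriver phantom = new PhantomJSDriver())
    {
        // Navigate() is a method on IWebDriver, not a property.
        phantom.Navigate().GoToUrl(pageUrl);
        // FindElement throws NoSuchElementException when nothing matches,
        // so use FindElements, which returns an empty collection instead.
        var link = phantom.FindElements(By.PartialLinkText(searchLinkText)).FirstOrDefault();
        if (link != null)
            return link.GetAttribute("href");
    }
    return string.Empty;
}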
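
The example above never touches HtmlAgilityPack, so here is a rough sketch of the "combining" part: hand the rendered HTML to HtmlAgilityPack and pull data out with XPath. The LinkExtractor class and ExtractLinks method are names made up for this illustration, not part of either library, and the XPath is just one possible query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class LinkExtractor
{
    /// <summary>
    /// Returns the href value of every anchor that has one, in document order.
    /// (Hypothetical helper for illustration only.)
    /// </summary>
    public static List<string> ExtractLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
            return new List<string>();

        return anchors
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();
    }
}
```

With a driver like the one above, the rendered markup should be available as phantom.PageSource, so a call might look like LinkExtractor.ExtractLinks(phantom.PageSource). The values come back exactly as written in the page, so relative links still need to be resolved against the page URL.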

Abot C# Web Crawler

The description from http://code.google.com/p/abot/ says: Abot is an open source C# web crawler built for speed and flexibility. It takes care of the low level plumbing (multithreading, http requests, scheduling, link parsing, etc.). You just hook into key events to process data or plug in your own implementations of core interfaces to take complete control over the crawl process.

I haven't used it, though.

I know of something called NCrawler, available on CodePlex. I have not used it personally, but a colleague says it works OK.

arachnode.net can process JavaScript.

NCrawler does not support JavaScript, but it looks very good and is an easy-to-use solution if you don't need JavaScript execution.

I understand this topic is very old, but I built a solution for writing crawlers quickly, and it may be useful to someone else. The package name is

Laraue.Crawling.Dynamic.PuppeterSharp

The main idea is that you first describe the model you want to receive:

public class User
{
    // Properties must be public so the schema builder's lambdas can reach them.
    public string Name { get; set; }
    public int Age { get; set; }
    public string[] ImageLinks { get; set; }
}

Then you describe how to fill its values:

var schema = new PuppeterSharpSchemaBuilder<User>()
    .HasProperty(x => x.Name, ".name")
    .HasProperty(x => x.Age, ".age")
    .HasArrayProperty(
        x => x.ImageLinks,
        ".links a",
        async handle => await handle.GetAttributeValueAsync("href"))
    .Build();

A page can then be parsed with this schema. The library uses the PuppeteerSharp package internally:

// Download browser and open the page
await new BrowserFetcher().DownloadAsync();
await using var browser = await Puppeteer.LaunchAsync(new LaunchOptions());
var page = await browser.NewPageAsync();
var response = await page.GoToAsync(link);

// Parse the page using described schema
var parser = new PuppeterSharpParser(new LoggerFactory());
var model = await parser.RunAsync(schema, await page.QuerySelectorAsync("body"));

When JS rendering is not required, the library also supports static crawling via the AngleSharp library, through the package

Laraue.Crawling.Static.AngleSharp

The schema is described the same way.
