
Async Design C#

I am writing a crawler in C# that starts with a set of known URLs in a file. I want to pull the pages down asynchronously. My question is: what is the best pattern for this? E.g., read the file into a List/array of URLs and create an array to store completed URLs? Should I create a two-dimensional array to track the status of threads and completion? Some other considerations are retries (if the first request is slow or dead) and automatic restarts (after an app/system crash).


foreach (var url in File.ReadAllLines("urls.txt"))
{
    var client = new WebClient();
    client.DownloadStringCompleted += (sender, e) => 
    {
        if (e.Error == null)
        {
            // e.Result will contain the downloaded HTML
        }
        else
        {
            // some error occurred: analyze e.Error property
        }
    };
    client.DownloadStringAsync(new Uri(url));
}


I recommend that you pull from a Queue and fetch each URL in a separate thread, peeling them off the Queue until you reach the maximum number of simultaneous threads that you want to allow. Each thread invokes a callback method that reports whether it finished successfully or encountered a problem.

As you start each thread, put its ManagedThreadId into a Dictionary, with the key being the id and the value being the thread status. The callback method should report its id and completion status. Delete each thread from the Dictionary as it completes and launch the next waiting thread. If a URL didn't finish successfully, add it back to the queue.

The Dictionary's Count property tells you how many threads are in flight, and the callback can also be used to update your UI or check for a pause or halt signal. If you need to persist your results in case of a system crash, then you should consider using database tables in lieu of memory-resident collections, such as manitra describes.

This approach has worked very well for me for lots of simultaneous threads.
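Here is a minimal sketch of that queue + Dictionary pattern. The names (MaxWorkers, Pending, InFlight, Crawl, OnCompleted) are mine, not from the answer above, and it uses a blocking DownloadString inside each worker thread rather than DownloadStringAsync:

using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Threading;

class CrawlerSketch
{
    const int MaxWorkers = 8;                       // hypothetical throttle
    static readonly object Sync = new object();
    static readonly Queue<string> Pending = new Queue<string>();
    // key = ManagedThreadId, value = the URL that thread is working on
    static readonly Dictionary<int, string> InFlight = new Dictionary<int, string>();

    static void Main()
    {
        foreach (var url in File.ReadAllLines("urls.txt"))
            Pending.Enqueue(url);

        while (true)
        {
            lock (Sync)
            {
                if (Pending.Count == 0 && InFlight.Count == 0)
                    break;                          // everything done

                // launch workers until we hit the throttle
                while (InFlight.Count < MaxWorkers && Pending.Count > 0)
                {
                    var url = Pending.Dequeue();
                    var worker = new Thread(() => Crawl(url));
                    InFlight[worker.ManagedThreadId] = url;
                    worker.Start();
                }
            }
            Thread.Sleep(100);                      // crude wait; a real app would Pulse/Wait
        }
    }

    static void Crawl(string url)
    {
        bool ok = false;
        try
        {
            using (var client = new WebClient())
            {
                string html = client.DownloadString(url);
                // ... persist html somewhere ...
                ok = true;
            }
        }
        catch (WebException) { /* timed out, DNS failure, etc. */ }

        OnCompleted(Thread.CurrentThread.ManagedThreadId, url, ok);
    }

    // The callback the answer describes: report id + status, free the slot,
    // and put failed URLs back on the queue for a retry.
    static void OnCompleted(int threadId, string url, bool success)
    {
        lock (Sync)
        {
            InFlight.Remove(threadId);
            if (!success)
                Pending.Enqueue(url);               // note: a real crawler should cap retries
        }
    }
}

In a real crawler you would also cap the number of retries per URL and replace the Sleep polling with Monitor.Wait/Pulse or a semaphore.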


Here is my opinion about storing the data

I would suggest using a relational database to store the page list, because it makes things easier (a code sketch of these operations follows the schema below):

  • retrieving the pages to crawl (basically the N pages with the oldest LastSuccessfulCrawlDate)
  • adding newly discovered pages
  • marking pages as crawled (setting LastSuccessfulCrawlDate)
  • in case of a program crash, your data is already safe
  • you could add a column storing the number of retries, to automatically discard pages that failed more than N times

An example of a relational model would be:

//this would contain all the crawled pages
table Pages {
    Id bigint,
    Url nvarchar(2000),
    Created DateTime,
    LastSuccessfulCrawlDate DateTime,
    NumberOfRetry int,     //increment this on each failure; if it reaches 10, set Ignored to True
    Title nvarchar(200),   //this is where you would put the page title
    Content nvarchar(max), //this is where you would put the html
    Ignored Bool           //set it to True to ignore this page
}
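The bullet points above map fairly directly onto a few queries. Here is a rough ADO.NET sketch assuming SQL Server; the connection string, the retry cap of 10 and the method names are placeholders, and the column names follow the schema above:

using System;
using System.Data.SqlClient;

class PageStoreSketch
{
    const string ConnStr = "...";   // assumed connection string, not from the post

    // Fetch the N pages that have gone longest without a successful crawl.
    public static void PrintNextBatch(int n)
    {
        using (var conn = new SqlConnection(ConnStr))
        using (var cmd = new SqlCommand(
            @"SELECT TOP (@n) Id, Url FROM Pages
              WHERE Ignored = 0
              ORDER BY LastSuccessfulCrawlDate ASC", conn))
        {
            cmd.Parameters.AddWithValue("@n", n);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine("{0}: {1}", reader.GetInt64(0), reader.GetString(1));
        }
    }

    // Mark a page as successfully crawled and store what was downloaded.
    public static void MarkCrawled(long id, string title, string html)
    {
        Execute(@"UPDATE Pages
                  SET LastSuccessfulCrawlDate = GETUTCDATE(),
                      Title = @title, Content = @html, NumberOfRetry = 0
                  WHERE Id = @id",
            cmd => { cmd.Parameters.AddWithValue("@id", id);
                     cmd.Parameters.AddWithValue("@title", title);
                     cmd.Parameters.AddWithValue("@html", html); });
    }

    // On failure: bump the retry counter and give up after 10 attempts.
    public static void MarkFailed(long id)
    {
        Execute(@"UPDATE Pages
                  SET NumberOfRetry = NumberOfRetry + 1,
                      Ignored = CASE WHEN NumberOfRetry + 1 >= 10 THEN 1 ELSE 0 END
                  WHERE Id = @id",
            cmd => cmd.Parameters.AddWithValue("@id", id));
    }

    static void Execute(string sql, Action<SqlCommand> bind)
    {
        using (var conn = new SqlConnection(ConnStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            bind(cmd);
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}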

You could also handle referrers with a table with this structure:

//this would contain the links between pages (which page linked to which)
table Referer {
    ParentId bigint,
    ChildId bigint
}

That would allow you to implement your very own PageRank :p
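As a very rough starting point, you could rank pages by counting inbound links in the Referer table (real PageRank iterates over the link graph, so this is only a first approximation; the query and method name below are mine):

using System;
using System.Data.SqlClient;

class LinkRankSketch
{
    // Naive "rank": count inbound links per page from the Referer table.
    public static void PrintInboundLinkCounts(string connStr)
    {
        const string sql =
            @"SELECT p.Url, COUNT(*) AS InboundLinks
              FROM Referer r
              JOIN Pages p ON p.Id = r.ChildId
              GROUP BY p.Url
              ORDER BY InboundLinks DESC";

        using (var conn = new SqlConnection(connStr))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
                while (reader.Read())
                    Console.WriteLine("{0} <- {1} links", reader.GetString(0), reader.GetInt32(1));
        }
    }
}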
