Most efficient way to download thousands of webpages
I have a few thousand items. For each item I need to download a webpage and process it. The processing itself is not processor-intensive.
Right now I'm doing it synchronously using the WebClient class, but it takes too long. I'm sure it can easily be parallelized or made asynchronous, but I am looking for the most resource-efficient way to do it. There may be limits on the number of active web requests, so I don't like the idea of creating thousands of WebClients and starting an asynchronous operation on each one, unless that is not actually a problem.
Is it possible to use Parallel Extensions and Task class in C# 4?
Edit: Thanks for the answers. I was hoping for something that uses asynchronous operations, because running synchronous operations in parallel will only block those threads.
You want to use a structure called a producer/consumer queue. You queue up all your URLs for processing, and assign consumer threads to dequeue each URL (with appropriate locking) and then download and process it.
This allows you to control and tune the number of consumers for what works best in your situation. In most cases, you'll find the optimum throughput for network operations is achieved with between 5 and 20 active connections. Beyond that, you start worrying about congestion on the wire or context switching among your threads. Of course, it varies depending on your circumstances: a server with a lot of cores and a fat pipe might be able to push this number much higher, but an old P4 on dialup might do best with just a couple going at a time. That's why the ability to tune is so important.
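As a rough illustration of the queue described above, here is a minimal sketch using BlockingCollection from .NET 4, which does the locking for you. The names DownloadQueue, Run, consumerCount, download and process are made up for this example and are not from the question.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

static class DownloadQueue
{
    // Hypothetical helper: starts `consumerCount` workers that drain a shared
    // queue of URLs, calling `download` and then `process` for each one.
    public static void Run(string[] urls, int consumerCount,
                           Func<string, string> download,
                           Action<string, string> process)
    {
        using (var queue = new BlockingCollection<string>())
        {
            // Producer side: enqueue everything, then mark the queue as complete
            // so consumers stop once it is empty.
            foreach (var url in urls) queue.Add(url);
            queue.CompleteAdding();

            // Consumer side: each worker blocks on the queue until an item is
            // available or the queue is finished.
            var consumers = Enumerable.Range(0, consumerCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    foreach (var url in queue.GetConsumingEnumerable())
                        process(url, download(url));
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            // Rethrows any worker failure wrapped in an AggregateException.
            Task.WaitAll(consumers);
        }
    }
}
```

It could be called as DownloadQueue.Run(urls, 10, url => { using (var wc = new WebClient()) return wc.DownloadString(url); }, (url, html) => Console.WriteLine(url + ": " + html.Length)), with consumerCount being the number you tune. On the .NET Framework you may also need to raise ServicePointManager.DefaultConnectionLimit, since by default it allows only 2 concurrent connections per host in client applications.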
Try using Parallel.ForEach([list of items], x => YourDownloadFunction(x))
It will handle concurrency automatically and efficiently, using thread pools and the whole lot.
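If you go this route, you can still cap the number of simultaneous downloads through ParallelOptions. A minimal sketch follows; ParallelDownload, Run, maxParallel and downloadAndProcess are placeholder names for this example, and the right cap is something you would have to measure.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class ParallelDownload
{
    // Hypothetical helper: runs `downloadAndProcess` for every URL, with at
    // most `maxParallel` invocations in flight at the same time.
    public static void Run(IEnumerable<string> urls, int maxParallel,
                           Action<string> downloadAndProcess)
    {
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        // Blocks until every item has been handled; exceptions thrown by the
        // delegate surface here as an AggregateException.
        Parallel.ForEach(urls, options, downloadAndProcess);
    }
}
```

Note that each call still blocks a thread-pool thread for the duration of its download, which is the drawback mentioned in the question's edit.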
Use threads. Parallel.ForEach limits the number of threads based on the number of cores/CPUs you have. Fetching a website doesn't keep a thread busy for the whole operation: there are delays between requests (images, static content, etc.). So use threads to maximize speed. Start with 50 threads, then go up from there to see how much your computer can handle.
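Since a thread that is fetching a page spends most of its time waiting, the same cap on concurrency can also be had without tying up a thread per request, which is what the question's edit asks for. The sketch below swaps dedicated threads for async/await with a SemaphoreSlim. It assumes C# 5 / .NET 4.5 or later, so it is not available in C# 4, and AsyncDownload, RunAsync, maxConcurrent and downloadAsync are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

static class AsyncDownload
{
    // Hypothetical helper: starts one asynchronous download per URL but lets
    // only `maxConcurrent` of them run at once. Results come back in the same
    // order as the input URLs.
    public static async Task<string[]> RunAsync(IEnumerable<string> urls,
                                                int maxConcurrent,
                                                Func<string, Task<string>> downloadAsync)
    {
        using (var gate = new SemaphoreSlim(maxConcurrent))
        {
            var tasks = urls.Select(async url =>
            {
                // Waits without blocking a thread until a slot is free.
                await gate.WaitAsync();
                try { return await downloadAsync(url); }
                finally { gate.Release(); }
            }).ToList();

            return await Task.WhenAll(tasks);
        }
    }
}
```

With a shared HttpClient named client, a call might look like await AsyncDownload.RunAsync(urls, 50, url => client.GetStringAsync(url)); the 50 is just the starting point suggested above and should be tuned.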