
Fast internet crawler

I'd like to perform data mining on a large scale. For this, I need a fast crawler. All I need is something to download a web page, extract links and follow them recursively, but without visiting the same URL twice. Basically, I want to avoid looping.

I already wrote a crawler in Python, but it's too slow. I'm not able to saturate a 100Mbit line with it. The top speed is ~40 URLs/sec, and for some reason it's hard to do better than that. It seems to be a problem with Python's multithreading/sockets. I also ran into problems with Python's garbage collector, but those were solvable. CPU isn't the bottleneck, btw.

So, what should I use to write a crawler that is as fast as possible, and what's the best solution to avoid looping while crawling?

EDIT: The solution was to combine the multiprocessing and threading modules. Spawn multiple processes with multiple threads per process for best effect. Spawning multiple threads in a single process is not effective, and multiple processes with just one thread each consume too much memory.
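For illustration, here is a minimal sketch of that layout, assuming a shared work queue of URLs; the names (NPROCS, NTHREADS, fetch, worker) and the seed URL are placeholders, not the original crawler's code:

```python
import threading
from multiprocessing import JoinableQueue, Process
from urllib.request import urlopen

NPROCS = 4     # multiple processes sidestep the GIL for parsing work
NTHREADS = 32  # threads per process hide network latency cheaply

def fetch(url_queue):
    # Thread worker: pull URLs off the shared queue and download them.
    while True:
        url = url_queue.get()
        try:
            urlopen(url, timeout=10).read()
        except Exception:
            pass  # a real crawler would log and maybe retry
        finally:
            url_queue.task_done()

def worker(url_queue):
    # Process worker: start NTHREADS downloader threads and wait.
    threads = [threading.Thread(target=fetch, args=(url_queue,), daemon=True)
               for _ in range(NTHREADS)]
    for t in threads:
        t.start()
    url_queue.join()  # returns once every queued URL has been processed

if __name__ == "__main__":
    url_queue = JoinableQueue()
    url_queue.put("http://example.com/")  # placeholder seed URL
    procs = [Process(target=worker, args=(url_queue,)) for _ in range(NPROCS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```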


Why not use something already tested for crawling, like Scrapy? I managed to reach almost 100 pages per second on a low-end VPS with limited RAM (about 400MB), while network speed was around 6-7 Mb/s (i.e., below 100Mbps).
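As a starting point, here is a minimal spider sketch (the spider name, seed URL, and concurrency value are placeholders, not the configuration I used on the VPS):

```python
import scrapy

class FastSpider(scrapy.Spider):
    name = "fast_crawler"
    start_urls = ["http://example.com/"]
    custom_settings = {"CONCURRENT_REQUESTS": 100}

    def parse(self, response):
        # Scrapy's scheduler deduplicates requests by default, which also
        # takes care of the "don't visit the same URL twice" requirement.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Run it with `scrapy runspider spider.py`.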

Another improvement you can make is to use urllib3 (especially when crawling many pages from a single domain, since it reuses connections); I did a brief comparison of it some time ago.

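To illustrate the connection-reuse point, a minimal urllib3 sketch (the host and paths are placeholders):

```python
import urllib3

# One PoolManager reuses keep-alive connections across requests,
# which is what makes repeated requests to one domain cheap.
http = urllib3.PoolManager(maxsize=10)  # up to 10 pooled connections per host

for path in ["/", "/a", "/b"]:  # placeholder paths
    r = http.request("GET", "http://example.com" + path)
    print(r.status, len(r.data))
```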

UPDATE:

Scrapy now uses the Requests library, which in turn uses urllib3. That makes Scrapy the go-to tool for scraping. Recent versions also support deploying projects, so scraping from a VPS is easier than ever.


Around two years ago I developed a crawler that could download almost 250 URLs per second. You could follow my steps:

  1. Optimize your file handle usage. Try to keep the number of open file pointers to a minimum.
  2. Don't write your data out on every page. Buffer your results and dump them after storing around 5,000 or 10,000 URLs.
  3. For robustness you don't need different configurations. Use a log file, and when you want to resume, just read the log file back and restart the crawler from where it left off.
  4. Split all your web crawler's tasks into separate stages and process them pipeline-style, in intervals (see the sketch after this list):

    a. downloader

    b. link extractor

    c. URLSeen

    d. ContentSeen
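
A hedged sketch of that four-stage pipeline follows; the stage names mirror the list above, while everything else (the queue, the regex link extraction, the SHA-1 content hashing, the seed URL) is an assumed implementation, not the original crawler:

```python
import hashlib
import queue
import re
import threading
import time
from urllib.parse import urljoin
from urllib.request import urlopen

url_queue = queue.Queue()
seen_urls = set()       # URLSeen: never fetch the same URL twice
seen_content = set()    # ContentSeen: skip byte-identical duplicate pages
lock = threading.Lock()

LINK_RE = re.compile(rb'href="([^"#]+)"')  # crude; a real crawler should parse HTML

def extract_links(base_url, html):
    """Link extractor stage: resolve relative hrefs against the page URL."""
    return [urljoin(base_url, m.decode("ascii", "ignore"))
            for m in LINK_RE.findall(html)]

def downloader():
    """Downloader stage: fetch a page, dedupe its content, enqueue new links."""
    while True:
        url = url_queue.get()
        try:
            html = urlopen(url, timeout=10).read()
        except Exception:
            continue
        digest = hashlib.sha1(html).hexdigest()
        with lock:
            if digest in seen_content:   # ContentSeen check
                continue
            seen_content.add(digest)
        for link in extract_links(url, html):
            with lock:
                if link in seen_urls:    # URLSeen check
                    continue
                seen_urls.add(link)
            url_queue.put(link)

if __name__ == "__main__":
    seed = "http://example.com/"  # placeholder seed URL
    seen_urls.add(seed)
    url_queue.put(seed)
    for _ in range(16):
        threading.Thread(target=downloader, daemon=True).start()
    time.sleep(30)  # let the pipeline run for a while
```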


I have written a simple multithreaded crawler. It is available on GitHub as Discovering Web Resources, and I've written a related article: Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website. You can change the number of threads used via the NWORKERS class variable. Don't hesitate to ask further questions if you need extra help.
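This is not the repository's actual code, but a minimal sketch of the same pattern, where an NWORKERS class attribute controls the thread-pool size (the class, methods, and URL are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

class Crawler:
    NWORKERS = 8  # tune the number of worker threads here

    def fetch(self, url):
        return urlopen(url, timeout=10).read()

    def run(self, urls):
        # Map fetch() over the URLs with a pool of NWORKERS threads.
        with ThreadPoolExecutor(max_workers=self.NWORKERS) as pool:
            return list(pool.map(self.fetch, urls))

pages = Crawler().run(["http://example.com/"])  # placeholder URL
```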


It sounds like you have a design problem more than a language problem. Look into the multiprocessing module for accessing more sites at the same time, rather than threads. Also, consider using some kind of table to store your previously visited sites (a database, maybe?).
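As a sketch of the "table of previously visited sites" idea, here is one way to do it with Python's built-in sqlite3 (the schema and file name are assumptions; any persistent store works):

```python
import sqlite3

conn = sqlite3.connect("visited.db")
conn.execute("CREATE TABLE IF NOT EXISTS visited (url TEXT PRIMARY KEY)")

def mark_visited(url):
    """Return True if the URL is new, False if it was already visited."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute("INSERT INTO visited (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:  # PRIMARY KEY already present
        return False

print(mark_visited("http://example.com/"))  # True  (first visit)
print(mark_visited("http://example.com/"))  # False (already seen)
```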


Impossible to tell what your limitations are. Your problem is similar to the C10K problem: read up on it first; don't optimize straight away. Go for the low-hanging fruit: you will most probably get significant performance improvements by analyzing your application design. Don't start out massively multithreaded or massively multiprocessed.

I'd use Twisted to write the networking part; it can be very fast. In general, I/O on the machine has to keep up: you have to write your data either to disk or to another machine, and not every notebook supports 10MByte/s of sustained database writes. Lastly, if you have an asymmetric internet connection, it might simply be that your upstream is saturated. ACK prioritization helps here (OpenBSD example).
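For reference, a minimal sketch of a non-blocking fetch with Twisted's Agent (the URL is a placeholder):

```python
from twisted.internet import reactor
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def fetched(body):
    print("got", len(body), "bytes")
    reactor.stop()

def failed(failure):
    print(failure)
    reactor.stop()

# Agent.request() is non-blocking: it returns a Deferred that fires once
# the response headers arrive; readBody() then collects the body.
d = agent.request(b"GET", b"http://example.com/")
d.addCallback(readBody)
d.addCallbacks(fetched, failed)
reactor.run()
```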
