Web crawler update strategy
I want to crawl useful resources (like background pictures) from certain websites. It is not a hard job, especially with the help of wonderful projects like scrapy.
The problem is that I don't want to crawl these sites just ONE TIME. I also want to keep my crawl running long-term and pick up updated resources. So I want to know: is there any good strategy for a web crawler to get updated pages?
Here's a coarse algorithm I've thought of. I divide the crawl process into rounds: in each round, the URL repository gives the crawler a certain number (say, 10000) of URLs to crawl, and then the next round begins. The detailed steps are (a rough code sketch follows the list):
- the crawler adds the start URLs to the URL repository
- the crawler asks the URL repository for at most N URLs to crawl
- the crawler fetches the URLs and updates certain information in the URL repository, like the page content, the fetch time, and whether the content has changed
- go back to step 2
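To make the rounds concrete, here is a minimal sketch of that loop in Python. `URLRepository` and `fetch` are hypothetical names made up for illustration; a real repository would be persistent (e.g., a database) and the fetch function would come from your crawling framework:

```python
import time

class URLRepository:
    """Hypothetical in-memory URL repository; a real one would be persistent."""

    def __init__(self):
        self.records = {}  # url -> {"last_fetch": ..., "content_hash": ...}

    def add(self, urls):
        for url in urls:
            self.records.setdefault(url, {"last_fetch": None, "content_hash": None})

    def get_batch(self, n):
        # Naive ordering: never-fetched URLs first, then oldest fetch time.
        pending = sorted(self.records.items(),
                         key=lambda kv: (kv[1]["last_fetch"] is not None,
                                         kv[1]["last_fetch"] or 0))
        return [url for url, _ in pending[:n]]

    def update(self, url, content_hash):
        rec = self.records[url]
        rec["changed"] = (content_hash != rec["content_hash"])
        rec["content_hash"] = content_hash
        rec["last_fetch"] = time.time()

def crawl_rounds(repo, fetch, start_urls, batch_size=10000):
    """fetch(url) -> (content_hash, links) is supplied by the caller."""
    repo.add(start_urls)                      # step 1: seed the repository
    while True:
        batch = repo.get_batch(batch_size)    # step 2: at most N URLs per round
        for url in batch:
            content_hash, links = fetch(url)  # step 3: fetch and record results
            repo.update(url, content_hash)
            repo.add(links)                   # newly discovered URLs join the pool
        # step 4: loop back for the next round
```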
To make this concrete, I still need to solve the following question: how do I decide the "refresh-ness" of a web page, i.e. the probability that the page has been updated?
Since this is an open question, hopefully it will bring some fruitful discussion here.
The "batch" algorithm you describe is a common way to implement this, I have worked on a few such implementations with scrapy.
The approach I took is to initialize the spider's start URLs from the next batch to crawl and output the data (resources + links) as normal, then process that output however you choose to generate the next batch. It is possible to parallelize all of this, so you have many spiders crawling different batches at once; if you put URLs belonging to the same site in the same batch, then scrapy will take care of politeness (with some configuration for your preferences).
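As a rough illustration of that wiring, here is a minimal Scrapy spider that reads its batch at startup. `load_next_batch` is a placeholder for however you persist batches between runs, not part of Scrapy's API:

```python
import hashlib
import scrapy

def load_next_batch():
    """Placeholder: read the next batch of URLs produced by the previous
    round (e.g., from a file or database). Not part of Scrapy."""
    with open("next_batch.txt") as f:
        return [line.strip() for line in f if line.strip()]

class BatchSpider(scrapy.Spider):
    name = "batch"

    def start_requests(self):
        # Seed the spider with this round's batch instead of fixed start_urls.
        for url in load_next_batch():
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Emit the data needed to decide the next batch: a content
        # fingerprint plus the outgoing links discovered on the page.
        yield {
            "url": response.url,
            "content_hash": hashlib.sha1(response.body).hexdigest(),
            "links": response.css("a::attr(href)").getall(),
        }
```

Running one spider process per batch gives the parallelism mentioned above, and Scrapy's standard settings (e.g., CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY) handle the per-site politeness.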
An interesting tweak is to split the scheduling into a short-term part (within a single batch, inside scrapy) and a long-term part (between crawl batches). This gives some of the advantages of a more incremental approach while keeping things a little simpler.
There are many approaches to the crawl ordering problem (how to decide the "refresh-ness") you mention, and the best approach depends on what your priorities are (freshness vs. comprehensiveness, whether some resources are more important than others, etc.).
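For example, one simple policy (a sketch, not any article's exact method) is to estimate each page's change rate from its fetch history and prioritize URLs by the probability they have changed since their last fetch, under a Poisson model of changes. The record fields below are assumptions about what your URL repository stores:

```python
import math
import time

def change_probability(record, now=None):
    """Probability that a page has changed since its last fetch, assuming
    changes arrive as a Poisson process. `record` is assumed to carry the
    number of observed changes, the total seconds the URL has been
    monitored, and the last fetch time (hypothetical field names)."""
    now = now if now is not None else time.time()
    if not record["last_fetch"] or record["observed_seconds"] <= 0:
        return 1.0  # never fetched: treat as maximally stale
    # Add-half smoothing so pages that have never changed still get revisited.
    lam = (record["changes"] + 0.5) / record["observed_seconds"]
    elapsed = now - record["last_fetch"]
    # P(at least one change in `elapsed` seconds) = 1 - e^(-lambda * elapsed)
    return 1.0 - math.exp(-lam * elapsed)

def next_batch(records, n):
    """Pick the n URLs most likely to have changed."""
    scored = sorted(records.items(),
                    key=lambda kv: change_probability(kv[1]),
                    reverse=True)
    return [url for url, _ in scored[:n]]
```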
I would like to recommend this Web Crawling article by Christopher Olston and Marc Najork. It's a great survey and covers the topics you are interested in (the batch crawling model and crawl ordering).