Web crawler update strategy
I want to crawl useful resources (like background pictures) from certain websites. It is not a hard job, especially with the help of wonderful projects like scrapy.
The problem is that I don't want to crawl these sites just ONE TIME. I also want to keep my crawl running long-term and pick up updated resources. So I want to know: is there any good strategy for a web crawler to get updated pages?
Here's a coarse algorithm I've thought of. I divide the crawl process into rounds: in each round, the URL repository gives the crawler a certain number (say, 10000) of URLs to crawl, and then the next round begins. The detailed steps are (a rough code sketch follows the list):
- the crawler adds the start URLs to the URL repository
- the crawler asks the URL repository for at most N URLs to crawl
- the crawler fetches the URLs and updates certain information in the URL repository, like the page content, the fetch time, and whether the content has changed
- go back to step 2
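To make the rounds concrete, here is a minimal sketch of that loop in Python. `URLRepository` and `fetch` are hypothetical names made up for illustration; a real repository would be persistent (e.g., a database) and the fetch function would come from your crawling framework:

```python
import time

class URLRepository:
    """Hypothetical in-memory URL repository; a real one would be persistent."""

    def __init__(self):
        self.records = {}  # url -> {"last_fetch": ..., "content_hash": ...}

    def add(self, urls):
        for url in urls:
            self.records.setdefault(url, {"last_fetch": None, "content_hash": None})

    def get_batch(self, n):
        # Naive ordering: never-fetched URLs first, then oldest fetch time.
        pending = sorted(self.records.items(),
                         key=lambda kv: (kv[1]["last_fetch"] is not None,
                                         kv[1]["last_fetch"] or 0))
        return [url for url, _ in pending[:n]]

    def update(self, url, content_hash):
        rec = self.records[url]
        rec["changed"] = (content_hash != rec["content_hash"])
        rec["content_hash"] = content_hash
        rec["last_fetch"] = time.time()

def crawl_rounds(repo, fetch, start_urls, batch_size=10000):
    """fetch(url) -> (content_hash, links) is supplied by the caller."""
    repo.add(start_urls)                      # step 1: seed the repository
    while True:
        batch = repo.get_batch(batch_size)    # step 2: at most N URLs per round
        for url in batch:
            content_hash, links = fetch(url)  # step 3: fetch and record results
            repo.update(url, content_hash)
            repo.add(links)                   # newly discovered URLs join the pool
        # step 4: loop back for the next round
```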
To make this concrete, I still need to solve the following question: how do I decide the "refresh-ness" of a web page, i.e. the probability that the page has been updated?
Since this is an open question, hopefully it will bring some fruitful discussion here.
The "batch" algorithm you describe is a common way to implement this, I have worked on a few such implementations with scrapy.
The approach I took is to initialize the spider's start URLs from the next batch to crawl and output the data (resources + links) as normal, then process that output however you choose to generate the next batch. It is possible to parallelize all of this, so you have many spiders crawling different batches at once; if you put URLs belonging to the same site in the same batch, then scrapy will take care of politeness (with some configuration for your preferences).
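As a rough illustration of that wiring, here is a minimal Scrapy spider that reads its batch at startup. `load_next_batch` is a placeholder for however you persist batches between runs, not part of Scrapy's API:

```python
import hashlib
import scrapy

def load_next_batch():
    """Placeholder: read the next batch of URLs produced by the previous
    round (e.g., from a file or database). Not part of Scrapy."""
    with open("next_batch.txt") as f:
        return [line.strip() for line in f if line.strip()]

class BatchSpider(scrapy.Spider):
    name = "batch"

    def start_requests(self):
        # Seed the spider with this round's batch instead of fixed start_urls.
        for url in load_next_batch():
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Emit the data needed to decide the next batch: a content
        # fingerprint plus the outgoing links discovered on the page.
        yield {
            "url": response.url,
            "content_hash": hashlib.sha1(response.body).hexdigest(),
            "links": response.css("a::attr(href)").getall(),
        }
```

Running one spider process per batch gives the parallelism mentioned above, and Scrapy's standard settings (e.g., CONCURRENT_REQUESTS_PER_DOMAIN, DOWNLOAD_DELAY) handle the per-site politeness.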
An interesting tweak is to split the scheduling into a short-term part (within a single batch, inside scrapy) and a long-term part (between crawl batches). This gives some of the advantages of a more incremental approach while keeping things a little simpler.
There are many approaches to the crawl ordering problem (how to decide the "refresh-ness") you mention, and the best approach depends on what your priorities are (freshness vs. comprehensiveness, whether some resources are more important than others, etc.).
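For example, one simple policy (a sketch, not any article's exact method) is to estimate each page's change rate from its fetch history and prioritize URLs by the probability they have changed since their last fetch, under a Poisson model of changes. The record fields below are assumptions about what your URL repository stores:

```python
import math
import time

def change_probability(record, now=None):
    """Probability that a page has changed since its last fetch, assuming
    changes arrive as a Poisson process. `record` is assumed to carry the
    number of observed changes, the total seconds the URL has been
    monitored, and the last fetch time (hypothetical field names)."""
    now = now if now is not None else time.time()
    if not record["last_fetch"] or record["observed_seconds"] <= 0:
        return 1.0  # never fetched: treat as maximally stale
    # Add-half smoothing so pages that have never changed still get revisited.
    lam = (record["changes"] + 0.5) / record["observed_seconds"]
    elapsed = now - record["last_fetch"]
    # P(at least one change in `elapsed` seconds) = 1 - e^(-lambda * elapsed)
    return 1.0 - math.exp(-lam * elapsed)

def next_batch(records, n):
    """Pick the n URLs most likely to have changed."""
    scored = sorted(records.items(),
                    key=lambda kv: change_probability(kv[1]),
                    reverse=True)
    return [url for url, _ in scored[:n]]
```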
I would like to recommend this Web Crawling article by Christopher Olston and Marc Najork. It's a great survey and covers the topics you are interested in (the batch crawling model and crawl ordering).