What is a decent update interval for a web crawler?
I am currently working on my own little web crawler thingy and was wondering...
What is a decent interval for a web crawler to visit the same sites again?
Should you revisit them once a day? Once per hour? I really do not know... has anybody some experience in this matter? Perhaps someone can point me in the right direction?
I think your crawler's visits need to be organic.
I'd start by crawling the list once a week,
and when a site's content changes, set that one to crawl twice a week,
[and then] when you see more frequent changes, you crawl more frequently.
The algorithm would need to be smart enough to know the difference between one off edits and frequent site changes.
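As a rough sketch of that idea (not a tested policy): start at one week, shorten the interval only after changes are seen on two visits in a row so that a one-off edit is ignored, and back off again when the site goes quiet. The class name SiteSchedule, the one-day floor, and the halving/1.5x factors are all assumptions picked for illustration.

```python
from dataclasses import dataclass

WEEK = 7 * 24 * 3600          # starting interval: once a week, in seconds
MIN_INTERVAL = 24 * 3600      # assumed floor: never revisit more than once a day
MAX_INTERVAL = WEEK           # assumed ceiling: never slower than once a week


@dataclass
class SiteSchedule:
    """Per-site revisit interval that adapts to observed changes (hypothetical helper)."""
    interval: float = MAX_INTERVAL
    consecutive_changes: int = 0

    def record_visit(self, changed: bool) -> float:
        """Update the interval after a visit and return the new value in seconds."""
        if changed:
            self.consecutive_changes += 1
            # A single change may be a one-off edit; speed up only on a streak.
            if self.consecutive_changes >= 2:
                self.interval = max(MIN_INTERVAL, self.interval / 2)
        else:
            # Nothing changed: reset the streak and slowly back off.
            self.consecutive_changes = 0
            self.interval = min(MAX_INTERVAL, self.interval * 1.5)
        return self.interval
```

You would keep one SiteSchedule per site and call record_visit(changed) after each crawl, where "changed" could come from comparing a hash of the page body with the previous one.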
Also, never forget to pay attention to robots.txt... that's the first page you should hit in a crawl, and you need to respect its contents above all else.
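For what it's worth, Python's standard library can do the robots.txt check for you. The sketch below parses an inline example file so it runs without network access; the "MyCrawler" user agent and the example.com URLs are placeholders, not anything from a real site. Against a live site you would call set_url() and read() on the parser instead of parse().

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Made-up robots.txt content, for illustration only.
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rules = load_rules(EXAMPLE_ROBOTS)

# Ask before every fetch whether this user agent may request the URL.
allowed_private = rules.can_fetch("MyCrawler", "https://example.com/private/page")
allowed_index = rules.can_fetch("MyCrawler", "https://example.com/index.html")

# Crawl-delay is a non-standard but common directive; None if the site sets none.
delay_seconds = rules.crawl_delay("MyCrawler")
```

Here allowed_private comes out False and allowed_index True, and delay_seconds is the number of seconds the site asks you to wait between requests.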
It's going to depend on the sites you are crawling and what you are doing with the results.
Some will not object to a fairly frequent visitation rate, but others might restrict you to one visit every day, for example.
A lot of sites are keen to protect their content (witness Murdoch and News International railing against Google and putting the Times (UK) behind a paywall), so they view crawlers with distrust.
If you are only going to crawl a few sites, then it would be worth contacting the site owners, explaining what you want to do, and seeing what they reply. If they do reply, respect their wishes, and always obey the robots.txt file.
Even an hour can be impolite depending on what sites you are spidering and how intensely. I assume you are doing this as an exercise, so help save the world and limit yourself to sites that are built to handle huge loads, and even then request only the HTTP headers first to see whether you need to fetch the page at all.
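A minimal sketch of the headers-first idea, assuming the server sends an ETag or Last-Modified header (many do, some don't): issue a HEAD request and compare those validators with the ones you stored on the previous visit. The function names and the "MyCrawler/0.1" user agent are made up for this example.

```python
import urllib.request


def head_headers(url: str, user_agent: str = "MyCrawler/0.1", timeout: float = 10) -> dict:
    """Send a HEAD request and return the response headers with lowercase keys."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {key.lower(): value for key, value in response.headers.items()}


def needs_refetch(cached: dict, current: dict) -> bool:
    """Decide from stored and fresh headers (lowercase keys) whether to download the page."""
    for key in ("etag", "last-modified"):
        if key in cached and key in current:
            # Same validator as last time means the page has not changed.
            return cached[key] != current[key]
    # No comparable validator on both sides: be safe and fetch the page.
    return True
```

Typical use would be something like needs_refetch(stored_headers, head_headers(url)), and only doing the full GET when it returns True. An alternative is a conditional GET with If-None-Match / If-Modified-Since, where the server answers 304 Not Modified when nothing changed.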
Even more polite would be to spider a limited set first with wget, store it locally and crawl against your cache.
If you aren't doing this as an exercise, there is no reason to do it, as it has been done to death and the interwebz doesn't need another one.