Generating db_gone URLs for fetch
In my crawler system, I have set the fetch interval to 30 days. I initially set my user agent to a placeholder (say "...."), and as a result many URLs were rejected. After changing the user agent to an appropriate name, I want to re-fetch the URLs that were rejected initially. The problem is that URLs with db_gone status get a retry interval of 45 days, so the generator won't pick them up. In this case, how can I fetch those URLs with db_gone status?
Does Nutch have any built-in option to crawl only those db_gone URLs?
Or do I need to write a separate map-reduce program to collect those URLs and use freegen to generate segments for them?
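One possible sketch of the freegen route, assuming a Nutch 1.x install and a crawldb at crawl/crawldb (both the install path and the -status filter support in readdb are assumptions about your setup, so treat this as an outline rather than a tested recipe):

```shell
# Sketch: dump db_gone entries from the crawldb, reduce them to a plain URL
# list, and hand that list to freegen, which builds a fetch segment without
# applying the generator's retry-interval check.
NUTCH=${NUTCH_HOME:-/opt/nutch}/bin/nutch   # assumed install location

if [ -x "$NUTCH" ]; then
  # In recent Nutch 1.x releases, readdb -dump accepts a -status filter;
  # the csv format puts the URL in the first column (assumed layout).
  "$NUTCH" readdb crawl/crawldb -dump gone_dump -status db_gone -format csv
  mkdir -p gone_urls
  cut -d',' -f1 gone_dump/part-* > gone_urls/urls.txt
  # freegen fetch-lists every URL in the input dir, ignoring fetch intervals
  "$NUTCH" freegen gone_urls crawl/segments
else
  echo "nutch binary not found; the steps above are a sketch"
fi
```

The resulting segment can then be fetched, parsed, and updated back into the crawldb as usual.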
You just need to configure a different re-fetch interval in nutch-site.xml.
Addition:
<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what is its status.
  </description>
</property>
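The value is expressed in seconds. A small helper (hypothetical, just for the conversion) shows where 7776000 comes from and what the 45-day db_gone retry interval from the question works out to:

```shell
# Hypothetical helper: convert a number of days to the seconds value
# that db.fetch.interval.max expects.
days_to_seconds() {
  echo $(( $1 * 24 * 60 * 60 ))
}

days_to_seconds 90   # 7776000, the value in the property above
days_to_seconds 45   # 3888000, the db_gone retry interval from the question
```

Setting db.fetch.interval.max below 45 days (e.g. 2592000 for 30 days) should make the gone pages become eligible for the generator sooner, since after that period every page is re-tried regardless of status.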