
Generating db_gone urls for fetch

In my crawler system, I set the fetch interval to 30 days. I initially set my user agent to a placeholder (say "....") and many URLs were rejected. After changing my user agent to an appropriate name, I want to re-fetch the URLs that were rejected initially. The problem is that those URLs with db_gone status have a retry interval of 45 days, so the generator won't pick them up. How can I fetch those db_gone URLs in this case?

Does Nutch have any built-in option to crawl only those db_gone URLs?

Or do I need to write a separate MapReduce job to collect those URLs and use freegen to generate segments for them?


You just need to configure nutch-site.xml with a different refetch interval.

ADDITION

<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what its status is.</description>
</property>
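If you don't want to raise the interval globally, the freegen route from the question can be sketched as a shell pipeline: dump the CrawlDb, keep only the URLs whose record carries db_gone, and hand that list to FreeGenerator. This is a minimal sketch; the file names (dump.txt, gone_urls.txt) are illustrative, and the heredoc below only mimics the layout of a `bin/nutch readdb <crawldb> -dump` text dump (URL line followed by a "Status:" line).

```shell
# Simulated CrawlDb dump: each record starts with a URL line, followed by
# a "Status:" line. A real dump comes from: bin/nutch readdb crawldb -dump out
cat > dump.txt <<'EOF'
http://example.com/a	Version: 7
Status: 3 (db_gone)
Fetch time: Mon Jan 01 00:00:00 UTC 2024
http://example.com/b	Version: 7
Status: 2 (db_fetched)
Fetch time: Mon Jan 01 00:00:00 UTC 2024
EOF

# Keep the URL line immediately preceding each db_gone status line.
grep -B1 'db_gone' dump.txt | awk -F'\t' '/^http/ {print $1}' > gone_urls.txt
cat gone_urls.txt
# → http://example.com/a
```

The resulting list can then be fed to FreeGenerator (e.g. `bin/nutch freegen gone_urls.txt segments`) to build a segment containing just those URLs, bypassing the generator's retry-interval check.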
