
Nutch: get current crawl depth in the plugin

I want to write my own HTML parser plugin for Nutch. I am doing focused crawling by generating outlinks that fall only within a specific XPath. In my use case, I want to fetch different data from the HTML pages depending on the current depth of the crawl, so I need to know the current depth in the HtmlParser plugin for each piece of content I am parsing.

Is this possible with Nutch? I see that CrawlDatum does not carry crawl-depth information. I was thinking of keeping a map of this information in another data structure. Does anybody have a better idea?

Thanks


Crawl.java has a NutchConfiguration object, which is passed in while initializing all the components. I set a property for the crawl depth before creating a new Fetcher:

conf.setInt("crawl.depth", i+1);   // record the current cycle (1-based) so plugins can read it
new Fetcher(conf).fetch(segs[0], threads,
          org.apache.nutch.fetcher.Fetcher.isParsing(conf));  // fetch it
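
For context, here is a minimal sketch of where that call would sit in the main generate/fetch/update loop of a Nutch 1.x style Crawl.java. The loop structure and the names depth, i, segs, threads, generator and crawlDbTool follow the stock driver, but treat this as an illustration of the idea rather than the exact upstream code:

// Sketch of the main loop in a Nutch 1.x style Crawl.java driver.
// `conf` is the NutchConfiguration object shared by all components.
for (int i = 0; i < depth; i++) {            // one iteration per generate/fetch/update cycle
  Path[] segs = generator.generate(crawlDb, segments, -1, topN,
      System.currentTimeMillis());           // generate a fetch list for this cycle
  if (segs == null) {
    break;                                   // nothing left to fetch, stop early
  }

  // Expose the current cycle number (1-based) so plugins can read it via getConf()
  conf.setInt("crawl.depth", i + 1);

  new Fetcher(conf).fetch(segs[0], threads,
      org.apache.nutch.fetcher.Fetcher.isParsing(conf));   // fetch (and parse, if parsing while fetching)

  crawlDbTool.update(crawlDb, segs, true, true);           // update the crawldb with the new status
}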

The HtmlParser plugin can then read it like this:

LOG.info("Current depth: " + getConf().getInt("crawl.depth", -1));
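
A minimal sketch of the plugin side, assuming an HtmlParseFilter-style plugin on Nutch 1.x; the class name DepthAwareParseFilter, the SLF4J logging, and the depth-based branching are illustrative and not part of the original answer:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.w3c.dom.DocumentFragment;

/** Illustrative parse filter that varies its extraction by crawl depth. */
public class DepthAwareParseFilter implements HtmlParseFilter {

  private static final Logger LOG = LoggerFactory.getLogger(DepthAwareParseFilter.class);

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
                            HTMLMetaTags metaTags, DocumentFragment doc) {
    // Read the depth that the crawl driver stored in the configuration; -1 means "not set".
    int depth = conf.getInt("crawl.depth", -1);
    LOG.info("Current depth: " + depth);

    if (depth == 1) {
      // e.g. extract category/navigation data from the seed-level pages
    } else {
      // e.g. extract detail data from deeper pages
    }
    return parseResult;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}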

This approach doesn't force me to break the map-reduce flow. Thanks, Nayn


With Nutch, "depth" represents the number of generate/fetch/update cycles run successively. For example, if you are at depth 4, it means you are in the fourth cycle. When you say that you want to go no further than depth 10, it means that you want to stop after 10 cycles.

Within each cycle, the number of previous cycles run before it (the "depth") is unknown. That information is not kept.

If you have your own version of Crawl.java, you could keep track of the current "depth" and pass that information to your HTML parser plugin.
