Run Nutch on existing Hadoop cluster
We have a Hadoop cluster (Hadoop 0.20) and I want to use Nutch 1.2 to import some files over HTTP into HDFS, but I couldn't get Nutch running on the cluster.
I've updated the $HADOOP_HOME/bin/hadoop script to add the Nutch jars to the classpath (actually I copied the classpath setup from the $NUTCH_HOME/bin/nutch script, minus the part that adds $NUTCH_HOME/lib/* to the classpath) and then I tried running the following command to inject URLs:
hadoop jar nutch*.jar org.apache.nutch.crawl.Injector -conf conf/nutch-site.xml crawl_path urls_path
but I got java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
The $NUTCH_HOME/conf/nutch-site.xml configuration file sets the property
<property>
<name>mapreduce.job.jar.unpack.pattern</name>
<value>(?:classes/|lib/|plugins/).*</value>
</property>
as a workaround to force unpacking of the plugins/ directory, as suggested in "When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)", but it seems that for me it didn't work.
Has anybody encountered this problem? Do you have a step-by-step tutorial on how to run Nutch on an existing Hadoop cluster?
Thanks in advance,
mihaela

Finally I ran the Nutch MapReduce jobs (Injector, Generator and Fetcher) using the bin/hadoop script without any Nutch-specific modifications to it.
The problem is with the org.apache.hadoop.util.RunJar class (the class that runs a Hadoop job jar when you call hadoop jar <jobfile> jobClass): from the job jar it adds only the classes/ and lib/ subdirectories to the classpath, whereas Nutch job jars also have a plugins subfolder that contains the plugins used at runtime. I tried overriding the mapreduce.job.jar.unpack.pattern property with the value (?:classes/|lib/|plugins/).* so that RunJar would also add the plugins to the classpath, but it didn't work.
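You can see the layout for yourself by listing the contents of the job jar (a Nutch job jar is an ordinary zip/jar archive; <path to nutch job file> is whatever your build produced):

jar tf <path to nutch job file>

The listing shows top-level classes/, lib/ and plugins/ entries, and RunJar only puts the first two on the classpath.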
After looking at the Nutch code I saw that it uses a property, plugin.folders, that controls where the plugins can be found. So what I did, and it worked, was to copy the plugins subfolder from the job jar to a shared drive and set the plugin.folders property to that path each time I run a Nutch job. For example:
hadoop jar <path to nutch job file> org.apache.nutch.fetcher.Fetcher -conf ../conf/nutch-default.xml -Dplugin.folders=<path to plugins folder> <segment path>
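For completeness, extracting the plugins folder from the job jar and copying it to the shared location could look like the sketch below (the /mnt/shared path is only an example; use whatever path is visible to all the task nodes):

unzip <path to nutch job file> 'plugins/*' -d /tmp/nutch-job   # extract only the plugins/ subfolder from the job jar
cp -r /tmp/nutch-job/plugins /mnt/shared/nutch-plugins         # /mnt/shared is a hypothetical shared mount

Then every job is launched with -Dplugin.folders=/mnt/shared/nutch-plugins, as in the Fetcher command above.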
In the conf/nutch-default.xml file I have set some properties such as the agent name, proxy host and port, timeout, content limit, etc.
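As a rough illustration, those overrides look like the snippet below (the property names are the standard Nutch ones; the values are placeholders, not the ones I actually used):

<!-- placeholder values for illustration only -->
<property>
  <name>http.agent.name</name>
  <value>my-crawler</value>
</property>
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.timeout</name>
  <value>10000</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>65536</value>
</property>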
I also tried creating the Nutch job jar with the plugins subfolder inside the lib subfolder and then setting the plugin.folders property to lib/plugins, but it didn't work.
I ran Nutch on an existing Hadoop cluster by modifying the bin/nutch script and then copying the Nutch config files into the Hadoop folders, modifying the TS and NS parameters. Did you try it that way?