Run Nutch on existing Hadoop cluster

We have a Hadoop cluster (Hadoop 0.20) and I want to use Nutch 1.2 to import some files over HTTP into HDFS, but I couldn't get Nutch running on the cluster.

I've updated the $HADOOP_HOME/bin/hadoop script to add the Nutch jars to the classpath (actually, I copied the classpath setup from the $NUTCH_HOME/bin/nutch script, minus the part that adds $NUTCH_HOME/lib/* to the classpath).
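
Roughly, the addition looked like this sketch (paths are illustrative; $NUTCH_HOME is assumed to point at the Nutch 1.2 install directory):

# classpath additions copied from $NUTCH_HOME/bin/nutch into $HADOOP_HOME/bin/hadoop
NUTCH_HOME=${NUTCH_HOME:-/opt/nutch-1.2}   # assumed install location
CLASSPATH=${CLASSPATH}:$NUTCH_HOME/conf
for f in $NUTCH_HOME/nutch-*.jar; do
  CLASSPATH=${CLASSPATH}:$f
done
# the loop over $NUTCH_HOME/lib/*.jar from bin/nutch was deliberately left out

I then tried running the following command to inject URLs: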

hadoop jar nutch*.jar org.apache.nutch.crawl.Injector -conf conf/nutch-site.xml crawl_path urls_path

but I got java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.

The $NUTCH_HOME/conf/nutch-site.xml configuration file sets the property

<property>
    <name>mapreduce.job.jar.unpack.pattern</name>
    <value>(?:classes/|lib/|plugins/).*</value>
</property>

as a workaround to force unpacking of the plugins/ directory, as suggested in: "When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)", but for me it didn't work.

Has anybody encountered this problem? Is there a step-by-step tutorial on how to run Nutch on an existing Hadoop cluster?

Thanks in advance,

mihaela


In the end I ran the Nutch MapReduce jobs (Injector, Generator and Fetcher) using the stock bin/hadoop script, with no Nutch-specific modifications.

The problem is with the org.apache.hadoop.util.RunJar class (the class that runs a Hadoop job jar when you call hadoop jar <jobfile> jobClass): it only adds the classes/ and lib/ subdirectories of the job jar to the classpath, while the Nutch job jar also has a plugins/ subfolder containing the plugins used at runtime. I tried overriding the mapreduce.job.jar.unpack.pattern property with the value (?:classes/|lib/|plugins/).* so that RunJar would also put the plugins on the classpath, but it didn't work.
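
For the record, the override can also be passed as a generic option on the command line (which should be equivalent to setting it in nutch-site.xml); in my case neither form had any effect:

hadoop jar <path to nutch job file> org.apache.nutch.crawl.Injector \
  -Dmapreduce.job.jar.unpack.pattern='(?:classes/|lib/|plugins/).*' \
  <crawl path> <urls path>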

After looking at the Nutch code I saw that it uses a property plugin.folders, which controls where the plugins can be found. So what I did, and what worked, was to copy the plugins subfolder from the job jar to a shared drive and set the plugin.folders property to that path each time I run a Nutch job. For example:

 hadoop jar <path to nutch job file> org.apache.nutch.fetcher.Fetcher -conf ../conf/nutch-default.xml -Dplugin.folders=<path to plugins folder> <segment path>
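
To get the plugins subfolder out of the job file and onto the shared drive, something along these lines does it (the job file is an ordinary zip archive; /mnt/shared/nutch is just an example target path):

# extract only the plugins/ tree from the Nutch job file
unzip <path to nutch job file> 'plugins/*' -d /mnt/shared/nutch

# each job run then passes: -Dplugin.folders=/mnt/shared/nutch/plugins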

In the conf/nutch-default.xml file I have set some properties like the agent name, proxy host and port, timeout, content limit, etc.
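
For reference, these are the standard Nutch property names for those settings (the values below are just placeholders):

<property>
  <name>http.agent.name</name>
  <value>my-crawler</value>
</property>
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.timeout</name>
  <value>10000</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>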

I have also tried creating the Nutch job jar with the plugins subfolder inside the lib subfolder and then setting the plugin.folders property to lib/plugins, but that didn't work either.


I ran Nutch on an existing Hadoop cluster by modifying the bin/nutch script and then copying the Nutch config files into the Hadoop folders, adjusting the TS and NS parameters. Did you try it that way?
