Run Nutch on existing Hadoop cluster
We have a Hadoop cluster (Hadoop 0.20) and I want to use Nutch 1.2 to import some files over HTTP into HDFS, but I couldn't get Nutch running on the cluster.
I've updated the $HADOOP_HOME/bin/hadoop script to add the Nutch jars to the classpath (actually I copied the classpath setup from the $NUTCH_HOME/bin/nutch script, minus the part that adds $NUTCH_HOME/lib/* to the classpath) and then I tried running the following command to inject URLs:
hadoop jar nutch*.jar org.apache.nutch.crawl.Injector -conf conf/nutch-site.xml crawl_path urls_path
but I got java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
The $NUTCH_HOME/conf/nutch-site.xml configuration file sets the property
<property>
<name>mapreduce.job.jar.unpack.pattern</name>
<value>(?:classes/|lib/|plugins/).*</value>
</property>
as a workaround to force unpacking of the plugins/ directory, as suggested in "When nutch is run on hadoop > 0.20.2 (or cdh) it will not find plugins because MapReduce will not unpack plugin/ directory from the job's pack (due to MAPREDUCE-967)", but it seems that for me it didn't work.
Has anybody encountered this problem? Do you have a step-by-step tutorial on how to run Nutch on an existing Hadoop cluster?
Thanks in advance,
mihaela

Finally I ran the Nutch MapReduce jobs (Injector, Generator and Fetcher) using the bin/hadoop script without any Nutch-specific modifications to it.
The problem is with the org.apache.hadoop.util.RunJar class (the class that runs a Hadoop job jar when you call hadoop jar <jobfile> jobClass): from the job jar it adds only the classes/ and lib/ subdirectories to the classpath, whereas Nutch job jars also have a plugins subfolder that contains the plugins used at runtime. I tried overriding the mapreduce.job.jar.unpack.pattern property with the value (?:classes/|lib/|plugins/).* so that RunJar would also add the plugins to the classpath, but it didn't work.
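You can see the layout for yourself by listing the contents of the job jar (a Nutch job jar is an ordinary zip/jar archive; <path to nutch job file> is whatever your build produced):

jar tf <path to nutch job file>

The listing shows top-level classes/, lib/ and plugins/ entries, and RunJar only puts the first two on the classpath.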
After looking at the Nutch code I saw that it uses a property, plugin.folders, that controls where the plugins can be found. So what I did, and it worked, was to copy the plugins subfolder from the job jar to a shared drive and set the plugin.folders property to that path each time I run a Nutch job. For example:
hadoop jar <path to nutch job file> org.apache.nutch.fetcher.Fetcher -conf ../conf/nutch-default.xml -Dplugin.folders=<path to plugins folder> <segment path>
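For completeness, extracting the plugins folder from the job jar and copying it to the shared location could look like the sketch below (the /mnt/shared path is only an example; use whatever path is visible to all the task nodes):

unzip <path to nutch job file> 'plugins/*' -d /tmp/nutch-job   # extract only the plugins/ subfolder from the job jar
cp -r /tmp/nutch-job/plugins /mnt/shared/nutch-plugins         # /mnt/shared is a hypothetical shared mount

Then every job is launched with -Dplugin.folders=/mnt/shared/nutch-plugins, as in the Fetcher command above.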
In the conf/nutch-default.xml file I have set some properties such as the agent name, proxy host and port, timeout, content limit, etc.
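As a rough illustration, those overrides look like the snippet below (the property names are the standard Nutch ones; the values are placeholders, not the ones I actually used):

<!-- placeholder values for illustration only -->
<property>
  <name>http.agent.name</name>
  <value>my-crawler</value>
</property>
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>8080</value>
</property>
<property>
  <name>http.timeout</name>
  <value>10000</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>65536</value>
</property>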
I also tried creating the Nutch job jar with the plugins subfolder inside the lib subfolder and then setting the plugin.folders property to lib/plugins, but it didn't work.
I ran Nutch on an existing Hadoop cluster by modifying the bin/nutch script and then copying the Nutch config files into the Hadoop folders, modifying the TS and NS parameters. Did you try it that way?