Hadoop Streaming Problems
I ran into these issues while using Hadoop Streaming. I'm writing code in python
1) Aggregate library package
According to the hadoop streaming docs ( http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29 ), there is an inbuilt Aggregate class which can work both as a mapper and a reducer.
Here is the command:
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -combiner aggregate -reducer NONE -input input_files -output output_path
Executing this command fails the mapper with this error:
java.io.IOException: Cannot run program "aggregate": java.io.IOException: error=2, No such file or directory
However, if you run this command using aggregate as the reducer and not the combiner, the job works fine.
shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -reduce aggregate -reducer NONE -input input_files -output output_path
Does this mean I cannot use the aggregate class as the combiner?
2) Cannot use | as a seperator for the generic options
This is an example command from the above link
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2\
-D mapred.reduce.tasks=12
You cannot use | as an argument for map.output.key.field.separator. The error is
-D: command not found
11/08/03 10:48:02 ERROR streaming.StreamJob: Missing required options: input, output
(Update)You have to escape the | with a \ like this
-D stream.map.output.field.separator=\|
3) Cannot specify the -D options at the end of the command just like in the example. The Error is
-D: command开发者_运维百科 not found
11/08/03 10:50:23 ERROR streaming.StreamJob: Unrecognized option: -D
Is the documentation flawed or I'm doing something wrong?
Any insight on what I'm doing wrong is appreciated. Thnx
This question was asked 3 years ago, but today I still got the problem with -D option so I will add a little information for other people if they have the same problem.
According to the manual of hadoop streaming:
bin/hadoop command [genericOptions] [commandOptions]
-D is a genereic option so you have to put it before any other options. So in this case, the command should look like:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2\
-D mapred.reduce.tasks=12
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
精彩评论