Hadoop Streaming Problems

I ran into these issues while using Hadoop Streaming. I'm writing my code in Python.

1) Aggregate library package

According to the hadoop streaming docs ( http://hadoop.apache.org/common/docs/r0.20.0/streaming.html#Working+with+the+Hadoop+Aggregate+Package+%28the+-reduce+aggregate+option%29 ), there is an inbuilt Aggregate class which can work both as a mapper and a reducer.

Here is the command:

shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -combiner aggregate -reducer NONE -input input_files -output output_path

Executing this command fails the mapper with this error:

java.io.IOException: Cannot run program "aggregate": java.io.IOException: error=2, No such file or directory

However, if you run this command using aggregate as the reducer and not the combiner, the job works fine.

shell> hadoop jar hadoop-streaming.jar -file mapper.py -mapper mapper.py -reducer aggregate -input input_files -output output_path

Does this mean I cannot use the aggregate class as the combiner?
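
For reference, the Aggregate package expects each mapper output line in the form function:key<TAB>value (for example LongValueSum:word<TAB>1), as described in the streaming docs linked above. Below is a minimal sketch of such a mapper.py, assuming tab-separated input whose first field is the key to count. The helper names to_aggregate_record and map_lines are hypothetical, not part of Hadoop.

```python
#!/usr/bin/env python3
import sys


def to_aggregate_record(key, value=1, function="LongValueSum"):
    """Format one record as '<function>:<key><TAB><value>'."""
    return "%s:%s\t%s" % (function, key, value)


def map_lines(lines):
    """Yield one aggregate record per non-empty input line.

    Assumption: the first tab-separated field is the key to count.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        yield to_aggregate_record(fields[0])


if __name__ == "__main__":
    # Hadoop Streaming feeds input records on stdin and reads results from stdout.
    for record in map_lines(sys.stdin):
        print(record)
```

With -reducer aggregate, the records emitted for the same key are summed by the built-in reducer.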

2) Cannot use | as a separator for the generic options

This is an example command from the above link:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12

You cannot use | as the value of map.output.key.field.separator. The error is:

-D: command not found
11/08/03 10:48:02 ERROR streaming.StreamJob: Missing required options: input, output

(Update) The shell treats an unescaped | as a pipe, so you have to escape it with a \ like this:

-D stream.map.output.field.separator=\|
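
To see why the escape works, here is a small Python sketch using the standard shlex module, which approximates POSIX shell quoting rules (it is not the shell itself). The helper name shell_safe is made up for illustration.

```python
import shlex


def shell_safe(argv):
    """Render an argument list as one shell command line, quoting where needed."""
    return " ".join(shlex.quote(arg) for arg in argv)


args = ["-D", "stream.map.output.field.separator=|"]

# shlex.quote wraps the argument containing | in single quotes.
print(shell_safe(args))

# The quoted form parses back to the original two arguments.
assert shlex.split(shell_safe(args)) == args

# The backslash form from the update parses to the same arguments.
assert shlex.split(r"-D stream.map.output.field.separator=\|") == args
```

Either form, the backslash or single quotes around the value, should keep the | from being read as a pipe.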

3) Cannot specify the -D options at the end of the command, as the example does. The error is:

-D: command not found
11/08/03 10:50:23 ERROR streaming.StreamJob: Unrecognized option: -D

Is the documentation flawed, or am I doing something wrong?

Any insight into what I'm doing wrong is appreciated. Thanks.


This question was asked 3 years ago, but I still hit the problem with the -D option today, so I will add a little information for other people who run into it.

According to the Hadoop Streaming manual:

bin/hadoop command [genericOptions] [commandOptions]

-D is a generic option, so you have to put it before any other options. In this case, the command should look like:

$HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
-D stream.map.output.field.separator=. \
-D stream.num.map.output.key.fields=4 \
-D map.output.key.field.separator=. \
-D mapred.text.key.partitioner.options=-k1,2 \
-D mapred.reduce.tasks=12 \
-input myInputDirs \
-output myOutputDir \
-mapper org.apache.hadoop.mapred.lib.IdentityMapper \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
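
If the job is launched from a Python script, one way to keep the order right is to build the argument list in code so that the -D options always come first. This is only a sketch: build_streaming_argv is a hypothetical helper, and the jar name and option values are taken from the example in the question.

```python
def build_streaming_argv(jar, generic_opts, command_opts, hadoop="hadoop"):
    """Return an argv list with generic -D options before command options.

    generic_opts: dict mapping property name to value (emitted as -D name=value)
    command_opts: list of (flag, value) pairs such as ("-input", "myInputDirs")
    """
    argv = [hadoop, "jar", jar]
    # Generic options go directly after the jar, before any streaming option.
    for name, value in generic_opts.items():
        argv += ["-D", "%s=%s" % (name, value)]
    # Streaming (command) options follow.
    for flag, value in command_opts:
        argv += [flag, value]
    return argv


argv = build_streaming_argv(
    "hadoop-streaming.jar",
    {
        "stream.map.output.field.separator": ".",
        "stream.num.map.output.key.fields": 4,
        "mapred.reduce.tasks": 12,
    },
    [
        ("-input", "myInputDirs"),
        ("-output", "myOutputDir"),
        ("-mapper", "org.apache.hadoop.mapred.lib.IdentityMapper"),
        ("-reducer", "org.apache.hadoop.mapred.lib.IdentityReducer"),
    ],
)
print(" ".join(argv))
```

The resulting list can be handed to subprocess.run(argv) without a shell, in which case characters such as | need no escaping.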