Hadoop streaming and AMAZON EMR

2023-01-22 09:28 问答作者：

I have been attempting to use Hadoop streaming in Amazon EMR to do a simple word count for a bunch of text files. In order to get a handle on hadoop streaming and on Amazon's EMR I took a very simplified data set too. Each text file had only one line of text in it (the line could contain arbitrarily large number of words).

The mapper is an R script, that splits the line into words and spits it back to the stream.

cat(wordList[i],"\t1\n")

I decided to use the LongValueSum Aggregate reducer for adding the counts together, so I had to prefix my mapper output by LongValueSum

cat("LongValueSum:",wordList[i],"\t1\n")

and specify the reducer to be "aggregate"

The questions I have now are the following:

The intermediate stage between mapper and reducer, just sorts the stream. It does not really combine by the keys. Am I right? I ask this because If I do not use "LongValueSum" as a prefix to the words output by the mapper, at the reducer I just receive the streams sorted by the keys, but not aggregated. That is I just receive ordered by K, as opposed to (K, list(Values开发者_如何转开发)) at the reducer. Do I need to specify a combiner in my command?
How are other aggregate reducers used. I see, a lot of other reducers/aggregates/combiners available on http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html

How are these combiners and reducer specified in an AMAZON EMR set up?

I believe an issue of this kind has been filed and fixed in Hadoop streaming for a combiner, but I am not sure what version AMAZON EMR is hosting, and the version in which this fix is available.

How about custom input formats and record readers and writers. There are bunch of libraries written in Java. Is it sufficient to specify the java class name for each of these options?

The intermediate stage between mapper and reducer, just sorts the stream. It does not really combine by the keys. Am I right?

The aggregate reducer in streaming does implement the relevant combiner interfaces so Hadoop will use it if it sees fit [1]

That is I just receive ordered by K, as opposed to (K, list(Values)) at the reducer.

With the streaming interface you always receive K,V value pairs; you'll never receive (K,list(values))

How are other aggregate reducers used.

Which of them are you unsure about? The link you specified has a quick summary of the behaviour of each

I believe an issue of this kind has been filed and fixed

What issue are you thinking of?

not sure what version AMAZON EMR is hosting

EMR is based on Hadoop 0.20.2

Is it sufficient to specify the java class name for each of these options?

Do you mean in the context of streaming? or the aggregate framework?

[1] http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html

继续阅读：amazon-emr

Hadoop streaming and AMAZON EMR

更多精彩内容

精彩评论

最新问答

央视是哪个频道？

请问买过的朋友，舒提啦旅行箱实际使用体验如何？？

检查不孕不育需要的费用？

海信ULED电视画质有什么不同的地方?？

钉子可以挂的住画框幕布吗？

问答排行榜

河神2九牛入海钓河妖是第几集河妖什么来历可活吞牛？

性激素六项检查的最佳时间是多久？多少钱？？

Easiest way to get words of one line from istream into a vector?

《梦在燃烧 (《三国演义》动画片主题曲)》MP3歌词-汤子星？

抽烟只抽炫赫门？