Hadoop streaming and AMAZON EMR
I have been attempting to use Hadoop streaming in Amazon EMR to do a simple word count for a bunch of text files. In order to get a handle on hadoop streaming and on Amazon's EMR I took a very simplified data set too. Each text file had only one line of text in it (the line could contain arbitrarily large number of words).
The mapper is an R script, that splits the line into words and spits it back to the stream.
cat(wordList[i],"\t1\n")
I decided to use the LongValueSum Aggregate reducer for adding the counts together, so I had to prefix my mapper output by LongValueSum
cat("LongValueSum:",wordList[i],"\t1\n")
and specify the reducer to be "aggregate"
The questions I have now are the following:
The intermediate stage between mapper and reducer, just sorts the stream. It does not really combine by the keys. Am I right? I ask this because If I do not use "LongValueSum" as a prefix to the words output by the mapper, at the reducer I just receive the streams sorted by the keys, but not aggregated. That is I just receive ordered by K, as opposed to (K, list(Values开发者_如何转开发)) at the reducer. Do I need to specify a combiner in my command?
How are other aggregate reducers used. I see, a lot of other reducers/aggregates/combiners available on http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html
How are these combiners and reducer specified in an AMAZON EMR set up?
I believe an issue of this kind has been filed and fixed in Hadoop streaming for a combiner, but I am not sure what version AMAZON EMR is hosting, and the version in which this fix is available.
- How about custom input formats and record readers and writers. There are bunch of libraries written in Java. Is it sufficient to specify the java class name for each of these options?
The intermediate stage between mapper and reducer, just sorts the stream. It does not really combine by the keys. Am I right?
The aggregate
reducer in streaming does implement the relevant combiner interfaces so Hadoop will use it if it sees fit [1]
That is I just receive ordered by K, as opposed to (K, list(Values)) at the reducer.
With the streaming interface you always receive K,V value pairs; you'll never receive (K,list(values))
How are other aggregate reducers used.
Which of them are you unsure about? The link you specified has a quick summary of the behaviour of each
I believe an issue of this kind has been filed and fixed
What issue are you thinking of?
not sure what version AMAZON EMR is hosting
EMR is based on Hadoop 0.20.2
Is it sufficient to specify the java class name for each of these options?
Do you mean in the context of streaming? or the aggregate framework?
[1] http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapred/lib/aggregate/package-summary.html
精彩评论