
hadoop streaming ensuring one key per reducer

I have a mapper that, while processing data, classifies output into 3 different types (type is the output key). My goal is to create 3 different csv files via the reducers, each with all of the data for one key with a header row.

The key values can change and are text strings.

Now, ideally, I would like to have 3 different reducers, and each reducer would get only one key with its entire list of values.

Except, this doesn't seem to work because the keys don't get mapped to specific reducers.

The answer to this in other places has been to write a custom partitioner class that would map each desired key value to a specific reducer. This would be great, except that I need to use streaming with Python and I am not able to include a custom streaming jar in my job, so that doesn't seem to be an option.

I see in the Hadoop docs that there is an alternate partitioner class available that can enable secondary sorts, but it isn't immediately obvious to me that it is possible, using either the default or the key-field-based partitioner, to ensure that each key ends up on its own reducer without writing a Java class and using a custom streaming jar.

Any suggestions would be much appreciated.

Examples:

mapper output:

csv2\tfieldA,fieldB,fieldC
csv1\tfield1,field2,field3,field4
csv3\tfieldRed,fieldGreen
...

The problem is that if I have 3 reducers, I end up with a key distribution like this:

reducer1        reducer2        reducer3
csv1            csv2
csv3

One reducer gets two different key types and one reducer gets no data sent to it at all. This is because hash(key csv1) mod 3 and hash(key csv3) mod 3 result in the same value.
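
For reference, here is a stripped-down sketch of the mapper that produces output in the shape above (classify() and the field handling are just placeholders for my real logic):

#!/usr/bin/env python
# Simplified mapper sketch: emit "type\tcsv-row" so the output key is the csv type.
# classify() stands in for the real classification logic.
import sys

def classify(fields):
    # placeholder: pretend the record type is decided by the field count
    if len(fields) == 4:
        return "csv1"
    if len(fields) == 3:
        return "csv2"
    return "csv3"

for line in sys.stdin:
    fields = line.rstrip("\n").split(",")
    key = classify(fields)
    sys.stdout.write(key + "\t" + ",".join(fields) + "\n")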


I'm pretty sure MultipleOutputFormat [1] can be used under streaming. That'll solve most of your problems.

http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/mapred/lib/MultipleOutputFormat.html


If you are stuck with streaming, and can't include any external jars for a custom partitioner, then this is probably not going to work the way you want it to without some hacks.

If these are absolute requirements, you can get around this, but it's messy.

Here's what you can do:

Hadoop, by default, uses a hashing partitioner, like this:

key.hashCode() % numReducers

So you can pick key names whose hashes land in three different buckets, i.e. three values of key.hashCode() whose remainders mod 3 are 0, 1, and 2. This is a nasty hack, and I wouldn't suggest it unless you have no other options.
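
To check whether a set of candidate key names actually spreads across all three reducers, you can precompute the buckets yourself. A rough sketch, assuming the default HashPartitioner over Text keys, whose hashCode is (as far as I recall) a 31-based rolling hash of the key's bytes starting at 1 (verify against your Hadoop version before relying on it):

# Rough sketch: estimate which reducer a streaming key lands on under the
# default HashPartitioner. Assumes Text.hashCode() is a 31-based rolling hash
# over the key's bytes, starting at 1, with Java's signed 32-bit overflow.

def text_hash(key):
    h = 1
    for b in bytearray(key.encode("utf-8")):
        if b >= 128:               # Java bytes are signed
            b -= 256
        h = (31 * h + b) & 0xFFFFFFFF
    # reinterpret as a signed 32-bit int, like Java's hashCode()
    return h - 0x100000000 if h >= 0x80000000 else h

def reducer_for(key, num_reducers):
    # Hadoop masks off the sign bit before taking the modulus
    return (text_hash(key) & 0x7FFFFFFF) % num_reducers

for k in ("csv1", "csv2", "csv3"):
    print("%s -> reducer %d" % (k, reducer_for(k, 3)))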


If you want custom output to different csv files, you can write directly to HDFS (via the API) from your reduce code. As you know, Hadoop passes each key and its associated list of values to a single reduce task. In your reduce code, keep writing to the same file while the key stays the same; when another key comes along, manually create a new file and write into it. It does not matter how many reducers you have.
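
A minimal sketch of that reducer side, assuming the files are written to the reducer's local working directory and copied into HDFS afterwards (or piped to hadoop fs -put); the HEADERS map is only an illustration of the per-type header rows:

#!/usr/bin/env python
# Sketch of a streaming reducer that opens a new csv file whenever the key changes.
# Assumes mapper output is "key\tcomma,separated,fields"; HEADERS is a placeholder
# mapping of key -> header row, for illustration only.
import sys

HEADERS = {
    "csv1": "field1,field2,field3,field4",
    "csv2": "fieldA,fieldB,fieldC",
    "csv3": "fieldRed,fieldGreen",
}

current_key = None
out = None

for line in sys.stdin:
    key, _, value = line.rstrip("\n").partition("\t")
    if key != current_key:
        if out:
            out.close()
        out = open(key + ".csv", "w")           # or stream to hadoop fs -put
        out.write(HEADERS.get(key, "") + "\n")  # header row for this file
        current_key = key
    out.write(value + "\n")

if out:
    out.close()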
