Hadoop: supporting multiple outputs for MapReduce jobs

2023-04-06 21:22 Asked by:

It seems this is supported in Hadoop (reference), but I don't know how to use it.

I want to:

a.) Map - read a huge XML file, load the relevant data, and pass it on to reduce  
b.) Reduce - write two .sql files for different tables

I am choosing map/reduce because I have to do this for over 100k (maybe many more) XML files residing on disk. Any better suggestions are welcome.

Any resources or tutorials explaining how to use this are appreciated.

I am using Python and would like to learn how to achieve this using streaming.

Thank you

This might not be an elegant solution, but you could create two templates that convert the output of the reduce tasks into the required formats once the job is complete. Much of this could be automated with a shell script that looks for the reduce outputs and applies the templates to them. With the shell script, however, the transformation happens sequentially on one machine and does not take advantage of the n machines in the cluster.
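A rough sketch of that post-processing step follows. It is written in Python rather than shell, since that is what the question uses. It assumes the reduce outputs are `part-*` files containing `key<TAB>value` lines. The table names, column names, and templates are hypothetical placeholders, not anything taken from the question.

```python
import glob
import os
from string import Template

# Hypothetical templates: table and column names are assumptions for illustration.
TEMPLATES = {
    "table_a.sql": Template("INSERT INTO table_a (id, value) VALUES ('$key', '$value');"),
    "table_b.sql": Template("INSERT INTO table_b (id) VALUES ('$key');"),
}


def sql_quote(text):
    """Escape single quotes for use inside a SQL string literal."""
    return text.replace("'", "''")


def apply_templates(reduce_dir, out_dir):
    """Render every record found in reduce_dir/part-* through each template."""
    os.makedirs(out_dir, exist_ok=True)
    outputs = {name: open(os.path.join(out_dir, name), "w") for name in TEMPLATES}
    try:
        for path in sorted(glob.glob(os.path.join(reduce_dir, "part-*"))):
            with open(path) as part:
                for line in part:
                    line = line.rstrip("\n")
                    if not line:
                        continue
                    # Assumed record layout: key, a tab, then the value.
                    key, _, value = line.partition("\t")
                    fields = {"key": sql_quote(key), "value": sql_quote(value)}
                    for name, template in TEMPLATES.items():
                        outputs[name].write(template.substitute(fields) + "\n")
    finally:
        for handle in outputs.values():
            handle.close()
```

Calling `apply_templates("reduce_output", "sql_output")` (both directory names are made up) on a local copy of the job output would produce one `.sql` file per template. As noted above, this runs on a single machine after the job has finished.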

Alternatively, the reduce tasks could write both output formats into a single file, separated by some delimiter, and you could split them afterwards using that delimiter. Because the transformation happens in the reduce phase, it is spread across all the nodes in the cluster.
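A minimal sketch of the delimiter approach with a Python streaming reducer might look like the following. It assumes the reducer receives `key<TAB>value` lines on stdin. The tags `A` and `B`, the table names, and the function names are all hypothetical, chosen only for illustration.

```python
import sys

DELIMITER = "\t"  # assumption: the tag itself never contains a tab


def tag_records(lines):
    """Reducer side: turn each 'key<TAB>value' line into two tagged SQL lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        key, _, value = line.partition("\t")
        # Escape single quotes so the values are safe inside SQL string literals.
        key, value = key.replace("'", "''"), value.replace("'", "''")
        yield "A" + DELIMITER + "INSERT INTO table_a (id, value) VALUES ('%s', '%s');" % (key, value)
        yield "B" + DELIMITER + "INSERT INTO table_b (id) VALUES ('%s');" % key


def split_by_tag(lines, targets):
    """Post-job side: route each tagged line to the file object registered for its tag."""
    for line in lines:
        tag, _, statement = line.rstrip("\n").partition(DELIMITER)
        if tag in targets:
            targets[tag].write(statement + "\n")


def run_reducer(stdin=sys.stdin, stdout=sys.stdout):
    """Entry point for use as the streaming reducer: stdin in, tagged lines out."""
    for tagged in tag_records(stdin):
        stdout.write(tagged + "\n")
```

The reducer script would call `run_reducer()` at the bottom. Once the job has finished, `split_by_tag` can be run over the combined output with something like `{"A": open("table_a.sql", "w"), "B": open("table_b.sql", "w")}` as `targets`. Only this cheap split runs on one machine. The SQL itself is generated on the cluster.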
