开发者

hadoop : supporting multiple outputs for Map Reduce jobs

Seems like it is supported in Hado开发者_StackOverflow中文版op(reference), but I dont know how to use this.

I want to :

a.) Map - Read a huge XML file and load the relevant data and pass on to reduce  
b.) Reduce - write two .sql files for different tables  

Why I am choosing map/reduce is because I have to do this for over 100k(may be many more) xml files residing ondisk. any better suggestions are welcome

Any resources/tutorials explaining how to use this is appreciated.

I am using Python and would want to learn how to achieve this using streaming

Thank you


Might not be an elegant solution, but you could create two templates to convert the output of the reduce tasks into the required format once the job is complete. Much could be automated by writing a shell script which would look for the reduce outputs and apply the templates on them. With the shell script the transformation happens in sequence and doesn't take care of the n machines in the cluster.

Or else in the reduce tasks you could create the two output formats into a single file with some delimiter and split them later using the delimiter. In this approach since the transformation happens in the reduce, the transformation is spread across all the nodes in the cluster.

0

上一篇:

下一篇:

精彩评论

暂无评论...
验证码 换一张
取 消

最新问答

问答排行榜