Hadoop: supporting multiple outputs for MapReduce jobs
It seems like this is supported in Hadoop
(reference), but I don't know how to use it.
I want to:
a.) Map - read a huge XML file, load the relevant data, and pass it on to reduce
b.) Reduce - write two .sql files for different tables
I am choosing map/reduce because I have to do this for over 100k (maybe many more)
XML files residing on disk. Any better suggestions are welcome.
Any resources/tutorials explaining how to use this would be appreciated.
I am using Python
and would like to learn how to achieve this using streaming.
Thank you
It might not be an elegant solution, but you could create two templates to convert the output of the reduce tasks into the required format once the job is complete. Much of this could be automated by writing a shell script that looks for the reduce outputs and applies the templates to them. With the shell script the transformation happens sequentially and doesn't take advantage of the n machines in the cluster.
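For that post-processing route, a small script along these lines could apply a template to each reduce output file (written here in Python rather than shell, since that is what you are already using). The part-* file naming, the INSERT template, the table name, and the column list are all illustrative assumptions, not anything from your actual job:

    # Hypothetical post-job step: turn each reduce output (part-*) file into a .sql file
    # by applying a simple string template. Adjust the template and field handling to
    # match your real schema.
    import glob
    import sys

    TEMPLATE = "INSERT INTO {table} ({columns}) VALUES ({values});"

    def transform(part_path, out_path, table, columns):
        with open(part_path) as src, open(out_path, "w") as dst:
            for line in src:
                # Streaming reduce output is assumed here to be tab-separated fields.
                fields = line.rstrip("\n").split("\t")
                values = ", ".join("'%s'" % f for f in fields)
                dst.write(TEMPLATE.format(table=table, columns=columns, values=values) + "\n")

    if __name__ == "__main__":
        # e.g. python apply_template.py /path/to/job_output users "id, name"
        out_dir, table, columns = sys.argv[1], sys.argv[2], sys.argv[3]
        for i, part in enumerate(sorted(glob.glob(out_dir + "/part-*"))):
            transform(part, "%s_%05d.sql" % (table, i), table, columns)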
Alternatively, in the reduce tasks you could write both output formats into a single file, separated by a delimiter, and split them apart later using that delimiter. Since the transformation happens in the reduce phase in this approach, it is spread across all the nodes in the cluster.
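A minimal sketch of that delimiter idea, assuming a streaming reducer that receives tab-separated key/value lines on stdin; the table names, the '|' delimiter, and the SQL statements are made-up placeholders for whatever your real tables need:

    #!/usr/bin/env python
    # reducer() tags every emitted SQL line with its target table so both formats can
    # share one output file; split_output() later routes each tagged line into the
    # matching .sql file.
    import sys

    DELIM = "|"

    def reducer(stdin=sys.stdin, stdout=sys.stdout):
        for line in stdin:
            key, _, value = line.rstrip("\n").partition("\t")
            # Emit one statement per target table, prefixed with the table tag.
            stdout.write("table_a%sINSERT INTO table_a (id, payload) VALUES ('%s', '%s');\n"
                         % (DELIM, key, value))
            stdout.write("table_b%sINSERT INTO table_b (id) VALUES ('%s');\n" % (DELIM, key))

    def split_output(part_path):
        # Post-job step: split the combined output on the leading tag.
        handles = {}
        with open(part_path) as src:
            for line in src:
                table, _, sql = line.rstrip("\n").partition(DELIM)
                if table not in handles:
                    handles[table] = open(table + ".sql", "a")
                handles[table].write(sql + "\n")
        for fh in handles.values():
            fh.close()

    if __name__ == "__main__":
        reducer()

Running split_output over each part file after the job completes (or in one final pass) then gives you the two .sql files per table.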