Hadoop: what should be mapped and what should be reduced?
This is my first time using MapReduce. I want to write a program that processes a large log file. For example, if I were processing a log file whose records consist of {Student, College, GPA} and wanted to sort all students by college, what would be the 'map' part and what would be the 'reduce' part? I am having some difficulty with the concept, despite having gone over a number of tutorials and examples.
Thanks!
Technically speaking, Hadoop MapReduce treats everything as key-value pairs; you just need to define what the keys are and what the values are. The signatures of map and reduce are
map: (K1 x V1) -> (K2 x V2) list
reduce: (K2 x V2 list) -> (K3 x V3) list
with sorting taking place on the K2 keys in the intermediate shuffle phase between map and reduce. Note that reduce receives each K2 key together with the list of all V2 values that share that key.
If your inputs are of the form
Student x (College x GPA)
then your mapper should do nothing more than promote the College value to the key:
map: (s, (c, g)) -> [(c, (s, g))]
With college as the new key, Hadoop will sort by college for you. Your reducer is then just a plain old identity reducer.
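Here is a minimal sketch of that pipeline in plain Python (no Hadoop required) in the spirit of a streaming job: the mapper promotes college to the key, the shuffle is simulated with a sort-and-group, and the reducer is the identity. The field names and sample records are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    student, college, gpa = record
    # Promote the college to the key; the framework sorts on this key.
    yield (college, (student, gpa))

def identity_reducer(key, values):
    # Emit each value unchanged; grouping by key already happened.
    for v in values:
        yield (key, v)

records = [
    ("alice", "MIT", 3.9),
    ("bob", "Stanford", 3.5),
    ("carol", "MIT", 3.7),
]

# Simulate the shuffle: collect map output, sort by key, group by key.
intermediate = sorted(
    (kv for r in records for kv in mapper(r)),
    key=itemgetter(0),
)
output = [
    pair
    for key, group in groupby(intermediate, key=itemgetter(0))
    for pair in identity_reducer(key, (v for _, v in group))
]
# output is now ordered by college: the MIT records come before Stanford.
```

In a real streaming job the mapper and reducer would be separate scripts reading lines from stdin, but the data flow is the same.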
If you are carrying out a sorting operation in practice (that is, this isn't a homework problem), then check out Hive or Pig. These systems drastically simplify this kind of task; sorting on a particular column becomes trivial. However, it is always educational to write, say, a Hadoop Streaming job for a task like the one you describe here, to get a better understanding of mappers and reducers.
Yahoo has sorted terabytes and petabytes of data. Others (including Google) do it on a regular basis; you can search for the sort benchmarks on the internet. Yahoo has published a paper on how they did it.
Ray's approach has to be modified a bit to get the final output globally sorted. The input data has to be sampled and a custom partitioner written to send each key range to a particular reducer; the outputs of the N reducers then just need to be concatenated. The Yahoo paper explains this in more detail.
The 'org.apache.hadoop.examples.terasort' package has sample code for sorting data.
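The sample-then-partition idea can be sketched in plain Python. This simulates what InputSampler and TotalOrderPartitioner do; the cut-point selection below is a simplified stand-in for Hadoop's actual implementation, and the integer keys are made up for illustration.

```python
import bisect
import random

def pick_cut_points(sample_keys, num_reducers):
    """Choose num_reducers - 1 boundary keys from a sorted sample."""
    s = sorted(sample_keys)
    step = len(s) // num_reducers
    return [s[(i + 1) * step] for i in range(num_reducers - 1)]

def partition(key, cut_points):
    """Route a key to the reducer owning its key range."""
    return bisect.bisect_right(cut_points, key)

random.seed(0)
keys = [random.randrange(1000) for _ in range(10000)]

# Sample a subset of the input to estimate the key distribution.
cuts = pick_cut_points(random.sample(keys, 100), num_reducers=4)

# Each "reducer" receives one contiguous key range.
buckets = [[] for _ in range(4)]
for k in keys:
    buckets[partition(k, cuts)].append(k)

# Each reducer sorts its own bucket; concatenating the bucket
# outputs in reducer order yields a globally sorted result.
result = [k for b in buckets for k in sorted(b)]
assert result == sorted(keys)
```

Because the cut points come from a sample, the buckets are only approximately equal in size, which is exactly the trade-off the Hadoop classes make.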
If you are new to MapReduce, I suggest watching the following videos. They are a bit lengthy, but worth watching.
http://www.youtube.com/watch?v=yjPBkvYh-ss
http://www.youtube.com/watch?v=-vD6PUdf3Js
http://www.youtube.com/watch?v=5Eib_H_zCEY
http://www.youtube.com/watch?v=1ZDybXl212Q
http://www.youtube.com/watch?v=BT-piFBP4fE
Edit: Found some more information at the Cloudera blog here. There are some built-in classes to make sorting easier.
Total order partitioning (HADOOP-3019): as a spin-off from the TeraSort record, Hadoop now has library classes for efficiently producing globally sorted output. InputSampler is used to sample a subset of the input data, and TotalOrderPartitioner is then used to partition the map outputs into approximately equal-sized partitions. Very neat stuff, and well worth a look even if you don't need to use it.