
Hadoop: what should be mapped and what should be reduced?

This is my first time using map/reduce. I want to write a program that processes a large log file. For example, if I was processing a log file that had records consisting of {Student, College, and GPA}, and wanted to sort all students by college, what would be the 'map' part and what would be the 'reduce' part? I am having some difficulty with the concept, despite having gone over a number of tutorials and examples.

Thanks!


Technically speaking, Hadoop MapReduce treats everything as key-value pairs; you just need to define what the keys are and what the values are. The signatures of map and reduce are

map: (K1 x V1) -> (K2 x V2) list
reduce: (K2 x V2) list -> (K3 x V3) list

with sorting taking place on the K2 keys in the intermediate shuffle phase between map and reduce.
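To make those signatures concrete, here is a toy in-memory model of the map → shuffle/sort → reduce pipeline in plain Python. The `simulate_mapreduce` helper and the sample records are purely illustrative, not part of the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def simulate_mapreduce(records, map_fn, reduce_fn):
    """Toy in-memory model of map -> shuffle/sort -> reduce."""
    # Map phase: each input record may emit any number of (K2, V2) pairs.
    intermediate = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: sort by K2 so each reduce call sees one key's values together.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one call per distinct K2, producing (K3, V3) pairs.
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, [value for _, value in group]))
    return output

# Hypothetical input of the form Student x (College x GPA):
records = [("Alice", ("MIT", 3.9)), ("Bob", ("CMU", 3.5)), ("Carol", ("MIT", 3.7))]

def map_fn(record):
    student, (college, gpa) = record
    return [(college, (student, gpa))]  # promote college to the key

def reduce_fn(key, values):
    return [(key, value) for value in values]  # identity reducer

result = simulate_mapreduce(records, map_fn, reduce_fn)
```

After the simulated shuffle, `result` comes back grouped and sorted by college, which is exactly the behavior the real framework gives you for free.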

If your inputs are of the form

Student x (College x GPA)

then your mapper should do nothing more than promote the College value to the key:

map: (s, (c, g)) -> [(c, (s, g))]

With college as the new key, Hadoop will sort by college for you. Your reducer is then just a plain old "identity reducer."
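As a sketch, a Hadoop Streaming version of this job could look roughly like the following Python. The tab-separated `student<TAB>college<TAB>gpa` input format is an assumption; in a real streaming job each function would read `sys.stdin` and `print` its output:

```python
def mapper(lines):
    """Streaming-style mapper: re-key each record by college.

    Assumes tab-separated input lines: student, college, gpa.
    """
    for line in lines:
        student, college, gpa = line.rstrip("\n").split("\t")
        # Emit college first so the shuffle sorts records by college.
        yield f"{college}\t{student}\t{gpa}"

def reducer(lines):
    """Identity reducer: the shuffle has already grouped/sorted by college."""
    for line in lines:
        yield line

# Simulate the shuffle locally with a plain sort:
mapped = sorted(mapper(["Alice\tMIT\t3.9", "Bob\tCMU\t3.5"]))
reduced = list(reducer(mapped))
```

Wired up as `hadoop jar hadoop-streaming.jar -mapper ... -reducer ...`, the framework performs the sort between the two stages, so the reducer genuinely has nothing to do.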

If you are carrying out a sorting operation in practice (that is, this isn't a homework problem), then check out Hive or Pig. These systems drastically simplify such tasks; sorting on a particular column becomes quite trivial. However, it is always educational to write, say, a Hadoop Streaming job for tasks like the one you describe here, to build a better understanding of mappers and reducers.


Yahoo has sorted petabytes and terabytes of data. Others (including Google) do it on a regular basis; you can search for the sort benchmarks on the internet. Yahoo has published a paper on how they did it.

Ray's approach has to be modified a bit to get the final output sorted. The input data has to be sampled and a custom partitioner written to send each key range to a particular reducer. Then the outputs of the N reducers just need to be concatenated. The Yahoo paper explains this in more detail.
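A rough Python sketch of that sample-and-partition idea (function names like `total_order_sort` are mine for illustration, not Hadoop's):

```python
import random

def make_partitioner(key_sample, num_reducers):
    """Derive split points from a key sample so each reducer gets a key range."""
    sample = sorted(key_sample)
    step = len(sample) / num_reducers
    # num_reducers - 1 evenly spaced split points from the sample.
    splits = [sample[int(step * i)] for i in range(1, num_reducers)]

    def partition(key):
        # A linear scan keeps the idea obvious; bisect would be faster.
        for i, split in enumerate(splits):
            if key < split:
                return i
        return num_reducers - 1

    return partition

def total_order_sort(records, key_fn, num_reducers, sample_size=100):
    """Globally sorted output: range-partition, sort each part, concatenate."""
    sample = [key_fn(r) for r in random.sample(records, min(sample_size, len(records)))]
    partition = make_partitioner(sample, num_reducers)
    parts = [[] for _ in range(num_reducers)]
    for record in records:
        parts[partition(key_fn(record))].append(record)
    # Each "reducer" sorts only its own key range; because the ranges are
    # disjoint and ordered, simple concatenation is globally sorted.
    return [r for part in parts for r in sorted(part, key=key_fn)]

random.seed(0)
records = [(f"student{i}", random.randint(0, 999)) for i in range(500)]
out = total_order_sort(records, key_fn=lambda r: r[1], num_reducers=4)
```

If the sample represents the key distribution well, each reducer also receives a roughly equal share of the data, which is the point of sampling rather than splitting the key space blindly.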

The 'org.apache.hadoop.examples.terasort' package has sample code for sorting data.

If you are new to MapReduce I suggest watching the following videos. They are a bit lengthy, but worth it.

http://www.youtube.com/watch?v=yjPBkvYh-ss
http://www.youtube.com/watch?v=-vD6PUdf3Js
http://www.youtube.com/watch?v=5Eib_H_zCEY
http://www.youtube.com/watch?v=1ZDybXl212Q
http://www.youtube.com/watch?v=BT-piFBP4fE

Edit: I found some more information on the Cloudera blog. There are some built-in classes to make sorting easier.

Total order partitions HADOOP-3019. As a spin-off from the TeraSort record, Hadoop now has library classes for efficiently producing a globally sorted output. InputSampler is used to sample a subset of the input data, and then TotalOrderPartitioner is used to partition the map outputs into approximately equal-sized partitions. Very neat stuff — well worth a look, even if you don’t need to use it.

