Hadoop: what should be mapped and what should be reduced?
This is my first time using MapReduce. I want to write a program that processes a large log file. For example, if I were processing a log file whose records consist of {Student, College, GPA} and wanted to sort all students by college, what would be the 'map' part and what would be the 'reduce' part? I am having some difficulty with the concept, despite having gone over a number of tutorials and examples.
Thanks!
Technically speaking, Hadoop MapReduce treats everything as key-value pairs; you just need to define what the keys are and what the values are. The signatures of map and reduce are
map: (K1 x V1) -> (K2 x V2) list
reduce: (K2 x V2 list) -> (K3 x V3) list
with sorting taking place on the K2 keys in the intermediate shuffle phase between map and reduce. Note that reduce receives each K2 key together with the list of all V2 values that share that key.
If your inputs are of the form
Student x (College x GPA)
then your mapper should do nothing more than promote the College value to the key:
map: (s, (c, g)) -> [(c, (s, g))]
With college as the new key, Hadoop will sort by college for you. Your reducer is then just a plain old identity reducer.
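Here is a minimal sketch of that pipeline in plain Python (no Hadoop required) in the spirit of a streaming job: the mapper promotes college to the key, the shuffle is simulated with a sort-and-group, and the reducer is the identity. The field names and sample records are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    student, college, gpa = record
    # Promote the college to the key; the framework sorts on this key.
    yield (college, (student, gpa))

def identity_reducer(key, values):
    # Emit each value unchanged; grouping by key already happened.
    for v in values:
        yield (key, v)

records = [
    ("alice", "MIT", 3.9),
    ("bob", "Stanford", 3.5),
    ("carol", "MIT", 3.7),
]

# Simulate the shuffle: collect map output, sort by key, group by key.
intermediate = sorted(
    (kv for r in records for kv in mapper(r)),
    key=itemgetter(0),
)
output = [
    pair
    for key, group in groupby(intermediate, key=itemgetter(0))
    for pair in identity_reducer(key, (v for _, v in group))
]
# output is now ordered by college: the MIT records come before Stanford.
```

In a real streaming job the mapper and reducer would be separate scripts reading lines from stdin, but the data flow is the same.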
If you are carrying out a sorting operation in practice (that is, this isn't a homework problem), then check out Hive or Pig. These systems drastically simplify this kind of task; sorting on a particular column becomes trivial. However, it is always educational to write, say, a Hadoop Streaming job for a task like the one you describe here, to get a better understanding of mappers and reducers.
Yahoo has sorted terabytes and petabytes of data. Others (including Google) do it on a regular basis; you can search for the sort benchmarks on the internet. Yahoo has published a paper on how they did it.
Ray's approach has to be modified a bit to get the final output globally sorted. The input data has to be sampled and a custom partitioner written to send each key range to a particular reducer; the outputs of the N reducers then just need to be concatenated. The Yahoo paper explains this in more detail.
The 'org.apache.hadoop.examples.terasort' package has sample code for sorting data.
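The sample-then-partition idea can be sketched in plain Python. This simulates what InputSampler and TotalOrderPartitioner do; the cut-point selection below is a simplified stand-in for Hadoop's actual implementation, and the integer keys are made up for illustration.

```python
import bisect
import random

def pick_cut_points(sample_keys, num_reducers):
    """Choose num_reducers - 1 boundary keys from a sorted sample."""
    s = sorted(sample_keys)
    step = len(s) // num_reducers
    return [s[(i + 1) * step] for i in range(num_reducers - 1)]

def partition(key, cut_points):
    """Route a key to the reducer owning its key range."""
    return bisect.bisect_right(cut_points, key)

random.seed(0)
keys = [random.randrange(1000) for _ in range(10000)]

# Sample a subset of the input to estimate the key distribution.
cuts = pick_cut_points(random.sample(keys, 100), num_reducers=4)

# Each "reducer" receives one contiguous key range.
buckets = [[] for _ in range(4)]
for k in keys:
    buckets[partition(k, cuts)].append(k)

# Each reducer sorts its own bucket; concatenating the bucket
# outputs in reducer order yields a globally sorted result.
result = [k for b in buckets for k in sorted(b)]
assert result == sorted(keys)
```

Because the cut points come from a sample, the buckets are only approximately equal in size, which is exactly the trade-off the Hadoop classes make.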
If you are new to MapReduce, I suggest watching the following videos. They are a bit lengthy, but worth watching.
http://www.youtube.com/watch?v=yjPBkvYh-ss
http://www.youtube.com/watch?v=-vD6PUdf3Js
http://www.youtube.com/watch?v=5Eib_H_zCEY
http://www.youtube.com/watch?v=1ZDybXl212Q
http://www.youtube.com/watch?v=BT-piFBP4fE
Edit: Found some more information at the Cloudera blog here. There are some built-in classes to make sorting easier.
Total order partitioning (HADOOP-3019): as a spin-off from the TeraSort record, Hadoop now has library classes for efficiently producing globally sorted output. InputSampler is used to sample a subset of the input data, and TotalOrderPartitioner is then used to partition the map outputs into approximately equal-sized partitions. Very neat stuff, and well worth a look even if you don't need to use it.