Hadoop MapReduce with already sorted files
I'm working with Hadoop MapReduce. I have data in HDFS, and the data in each file is already sorted. Is it possible to force MapReduce not to re-sort the data after the map phase? I've tried changing map.sort.class to a no-op, but it didn't work (i.e. the data wasn't sorted as I'd expected). Has anyone tried something similar and managed to achieve it?
I think it depends on whether you need the result to be sorted or not.
If you need the result to be sorted, I don't think Hadoop is suitable for this. There are two reasons:
- The input data is stored in separate blocks (if it is large enough) and divided into multiple splits. Each split is assigned to one map task, and the output of all map tasks is gathered (after being partitioned, sorted, combined, copied, and merged) as the reducers' input. It is hard to preserve key order across these stages.
- Sorting does not happen only after the map function in the map task. The merge step in the reduce task also sorts.
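To illustrate the two points above, here is a minimal plain-Java sketch (not Hadoop's actual merge code; the class `SortedRunMerge` and its methods are hypothetical names made up for this example). It treats each map output as an already sorted run of integer keys and shows that simply concatenating the runs loses the global order, so some kind of merge is still needed on the reduce side:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only: this is NOT Hadoop code.
public class SortedRunMerge {

    // Tracks the read position inside one sorted run (think: one map task's output).
    private static final class Cursor {
        final List<Integer> run;
        int pos = 0;
        Cursor(List<Integer> run) { this.run = run; }
        int head() { return run.get(pos); }
    }

    // What you get with no merge at all: the runs are just appended one after another.
    public static List<Integer> concatenate(List<List<Integer>> runs) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> run : runs) {
            out.addAll(run);
        }
        return out;
    }

    // K-way merge: repeatedly take the smallest head among all runs.
    public static List<Integer> merge(List<List<Integer>> runs) {
        PriorityQueue<Cursor> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(a.head(), b.head()));
        for (List<Integer> run : runs) {
            if (!run.isEmpty()) {
                heap.add(new Cursor(run));
            }
        }
        List<Integer> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();       // run with the smallest current key
            out.add(c.head());
            c.pos++;                      // advance while the cursor is outside the heap
            if (c.pos < c.run.size()) {
                heap.add(c);              // re-insert with its new head
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three runs, each sorted on its own (as the files in the question are).
        List<List<Integer>> runs = Arrays.asList(
                Arrays.asList(1, 4, 9),
                Arrays.asList(2, 3, 10),
                Arrays.asList(5, 6));
        System.out.println(concatenate(runs)); // [1, 4, 9, 2, 3, 10, 5, 6]
        System.out.println(merge(runs));       // [1, 2, 3, 4, 5, 6, 9, 10]
    }
}
```

Each run is sorted, but the concatenation is not, which is why the sort cannot just be switched off if the final result has to be ordered.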
If you do not need the result to be sorted, I think this patch may be what you want:
Support no sort dataflow in map output and reduce merge phrase : https://issues.apache.org/jira/browse/MAPREDUCE-3397