Extreme amount of overhead in simple MapReduce job
I'm experimenting with Hadoop and created a very simple map and reduce job. The input is a 30-line text file and the output is only 3 lines (it's an excerpt of a log file, where the map extracts a page name and the execution time, while the reduce calculates the min, max and avg execution times).
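For reference, a minimal sketch of what a mapper/reducer like this might look like; the log-line format, the field positions and the class names (PageTimingMapper, PageTimingReducer) are assumptions, not my actual code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (page, executionTimeMillis) for each log line.
public class PageTimingMapper
        extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumed line format: "2011-08-25 18:00:01 /index.html 123"
        String[] fields = value.toString().split("\\s+");
        if (fields.length >= 4) {
            context.write(new Text(fields[2]),
                          new LongWritable(Long.parseLong(fields[3])));
        }
    }
}

// Reducer: computes min, max and avg execution time per page.
class PageTimingReducer
        extends Reducer<Text, LongWritable, Text, Text> {
    @Override
    protected void reduce(Text page, Iterable<LongWritable> times, Context context)
            throws IOException, InterruptedException {
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0, count = 0;
        for (LongWritable t : times) {
            long v = t.get();
            if (v < min) min = v;
            if (v > max) max = v;
            sum += v;
            count++;
        }
        context.write(page, new Text("min=" + min + " max=" + max + " avg=" + (sum / count)));
    }
}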
This simple job takes some 36 seconds to execute on Hadoop in pseudo-distributed mode (fs.default.name=hdfs://localhost, dfs.replication=1, mapred.job.tracker=localhost:8021). This is on a 2.93 GHz Nehalem with 8 GB of memory and an X25-E SSD, running Ubuntu 10.04.
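And a hypothetical driver that wires up those pseudo-distributed settings in code rather than in the *-site.xml files; only the property names and values come from my setup above, everything else (class and job names) is assumed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageTimingJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same pseudo-distributed settings as above, set in code instead of
        // core-site.xml / hdfs-site.xml / mapred-site.xml:
        conf.set("fs.default.name", "hdfs://localhost");
        conf.set("dfs.replication", "1");
        conf.set("mapred.job.tracker", "localhost:8021");

        Job job = new Job(conf, "page timings");
        job.setJarByClass(PageTimingJob.class);
        job.setMapperClass(PageTimingMapper.class);
        job.setReducerClass(PageTimingReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}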
I added debug output to the mapper and reducer for each call, which made clear that every call to the mapper and the reducer happens within the same second. In other words, the overhead is before and after Hadoop actually calls my code.
This is the final output:
18:16:59 INFO input.FileInputFormat: Total input paths to process : 1
18:16:59 INFO mapred.JobClient: Running job: job_201108251550_0023
18:17:00 INFO mapred.JobClient: map 0% reduce 0%
18:17:15 INFO mapred.JobClient: map 100% reduce 0%
18:17:30 INFO mapred.JobClient: map 100% reduce 100%
18:17:35 INFO mapred.JobClient: Job complete: job_201108251550_0023
18:17:35 INFO mapred.JobClient: Counters: 25
18:17:35 INFO mapred.JobClient: Job Counters
18:17:35 INFO mapred.JobClient: Launched reduce tasks=1
18:17:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=11305
18:17:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
18:17:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
18:17:35 INFO mapred.JobClient: Launched map tasks=1
18:17:35 INFO mapred.JobClient: Data-local map tasks=1
18:17:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=12762
18:17:35 INFO mapred.JobClient: File Output Format Counters
18:17:35 INFO mapred.JobClient: Bytes Written=140
18:17:35 INFO mapred.JobClient: FileSystemCounters
18:17:35 INFO mapred.JobClient: FILE_BYTES_READ=142
18:17:35 INFO mapred.JobClient: HDFS_BYTES_READ=7979
18:17:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=41899
18:17:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=140
18:17:35 INFO mapred.JobClient: File Input Format Counters
18:17:35 INFO mapred.JobClient: Bytes Read=7877
18:17:35 INFO mapred.JobClient: Map-Reduce Framework
18:17:35 INFO mapred.JobClient: Reduce input groups=3
18:17:35 INFO mapred.JobClient: Map output materialized bytes=142
18:17:35 INFO mapred.JobClient: Combine output records=0
18:17:35 INFO mapred.JobClient: Map input records=30
18:17:35 INFO mapred.JobClient: Reduce shuffle bytes=0
18:17:35 INFO mapred.JobClient: Reduce output records=3
18:17:35 INFO mapred.JobClient: Spilled Records=12
18:17:35 INFO mapred.JobClient: Map output bytes=124
18:17:35 INFO mapred.JobClient: Combine input records=0
18:17:35 INFO mapred.JobClient: Map output records=6
18:17:35 INFO mapred.JobClient: SPLIT_RAW_BYTES=102
18:17:35 INFO mapred.JobClient: Reduce input records=6
As can be seen, it takes 16 seconds until the map is done, then another 15 seconds until the reduce is done, and then still 6 more seconds until the job is considered done.
Here's the syslog of just the reduce task:
18:17:14,894 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
18:17:15,007 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists!
18:17:15,088 INFO org.apache.hadoop.mapred.ReduceTask: ShuffleRamManager: MemoryLimit=130514944, MaxSingleShuffleLimit=32628736
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for merging on-disk files
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for merging in memory files
18:17:15,093 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread waiting: Thread for merging on-disk files
18:17:15,094 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Need another 1 map output(s) where 0 is already in progress
18:17:15,094 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Thread started: Thread for polling Map Completion Events
18:17:15,095 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Scheduled 0 outputs (0 slow hosts and0 dup hosts)
18:17:20,095 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201108251550_0023_r_000000_0 Scheduled 1 outputs (0 slow hosts and0 dup hosts)
18:17:21,103 INFO org.apache.hadoop.mapred.ReduceTask: GetMapEventsThread exiting
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: getMapsEventsThread joined.
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: Closed ram manager
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: Interleaved on-disk merge complete: 0 files left.
18:17:21,104 INFO org.apache.hadoop.mapred.ReduceTask: In-memory merge complete: 1 files left.
18:17:21,110 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
18:17:21,110 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 138 bytes
18:17:21,113 INFO org.apache.hadoop.mapred.ReduceTask: Merged 1 segments, 138 bytes to disk to satisfy reduce memory limit
18:17:21,114 INFO org.apache.hadoop.mapred.ReduceTask: Merging 1 files, 142 bytes from disk
18:17:21,114 INFO org.apache.hadoop.mapred.ReduceTask: Merging 0 segments, 0 bytes from memory into reduce
18:17:21,114 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments
18:17:21,117 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 138 bytes
18:17:21,163 INFO org.apache.hadoop.mapred.Task: Task:attempt_201108251550_0023_r_000000_0 is done. And is in the process of commiting
18:17:24,170 INFO org.apache.hadoop.mapred.Task: Task attempt_201108251550_0023_r_000000_0 is allowed to commit now
18:17:24,180 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: Saved output of task 'attempt_201108251550_0023_r_000000_0' to output.txt
18:17:27,088 INFO org.apache.hadoop.mapred.Task: Task 'attempt_201108251550_0023_r_000000_0' done.
18:17:27,091 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
18:17:27,110 INFO org.apache.hadoop.io.nativeio.NativeIO: Initialized cache for UID to User mapping with a cache timeout of 14400 seconds.
18:17:27,110 INFO org.apache.hadoop.io.nativeio.NativeIO: Got UserName akira for UID 1000 from the native implementation
My actual reducer code is called at 18:17:21.
What gives? Is this normal?
When your file is 300 million lines, those 30-odd seconds of overhead will be negligible.
This is all normal. But why do you care?
I hope you're prototyping something for a much larger dataset; if not, Hadoop is definitely not your solution.
Hadoop was made for running on hundreds of nodes, not one, and with the expectation of all kinds of failure modes. Systems of that size need very loose coupling to be resilient, and this includes things such as tasktrackers polling for tasks, etc. None of these things are optimised for fast startup/shutdown of task attempts.
I'd recommend drawing yourself a picture from the time you submit your job to the time the job is done that illustrates what daemons are involved and what machines they might be running on.
Scalability != Performance
The way I look at Hadoop is that its primary design goal was to scale horizontally. That design goal means Hadoop was built to handle huge volumes efficiently. If you have a very small problem (it fits in the RAM of one system and doesn't take very long), then a normal application will always outperform Hadoop. In such cases, the overhead needed for scaling is just that: overhead.
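To illustrate: a plain, single-process Java program doing the same min/max/avg aggregation (assuming the same hypothetical log format as the sketch in the question) gets through a 30-line file in milliseconds:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class LocalPageTimings {
    public static void main(String[] args) throws IOException {
        // page -> {min, max, sum, count}
        Map<String, long[]> stats = new HashMap<String, long[]>();
        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] f = line.trim().split("\\s+");
            if (f.length < 4) continue;                 // skip malformed lines
            long ms = Long.parseLong(f[3]);
            long[] s = stats.get(f[2]);
            if (s == null) {
                s = new long[] { Long.MAX_VALUE, Long.MIN_VALUE, 0, 0 };
                stats.put(f[2], s);
            }
            if (ms < s[0]) s[0] = ms;                   // min
            if (ms > s[1]) s[1] = ms;                   // max
            s[2] += ms;                                 // sum
            s[3]++;                                     // count
        }
        in.close();
        for (Map.Entry<String, long[]> e : stats.entrySet()) {
            long[] s = e.getValue();
            System.out.println(e.getKey() + " min=" + s[0] + " max=" + s[1]
                    + " avg=" + (s[2] / s[3]));
        }
    }
}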
Hadoop really only scales up from about 5 nodes (namenode + secondary namenode + 3 datanodes), which gives you RAID5-like safety (3 replicas). On a single node all you can do is test your code, nothing more. Once you have tested and move to a "real" cluster, you have to know how large your dataset will be (to choose a block size), and for a first sizing the namenode's RAM is an important parameter as well (each block consumes RAM: more data means more blocks means more RAM).
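To make the namenode-RAM point concrete, a back-of-envelope estimate; the ~150 bytes of heap per block object is just a commonly quoted rule of thumb, and the dataset and block sizes below are example assumptions:

public class NamenodeMemoryEstimate {
    public static void main(String[] args) {
        // Rough rule of thumb: on the order of 150 bytes of namenode heap per
        // block entry; the constant and the sizes below are assumptions.
        long datasetBytes       = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB of data
        long blockSizeBytes     = 64L * 1024 * 1024;               // 64 MB blocks
        long bytesPerBlockEntry = 150;

        long blocks    = datasetBytes / blockSizeBytes;            // 163,840 blocks
        long heapBytes = blocks * bytesPerBlockEntry;              // ~23 MB of heap

        System.out.println(blocks + " blocks -> ~"
                + heapBytes / (1024 * 1024) + " MB of namenode heap");
    }
}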