
Where should Map put temporary files when running under Hadoop

I am running Hadoop 0.20.1 under SLES 10 (SUSE).

My Map task takes a file and generates a few more, I then generate my results from these files. I would like to know where I should place these files, so that performance is good and there are no collisions. If Hadoop can delete the directory automatically - that would be nice.

Right now, I am using the temp folder and task id, to create a unique folder, and then working within subfolders of that folder.

// Build a unique scratch folder from the task id inside the job's temp dir
reduceTaskId = job.get("mapred.task.id");
reduceTempDir = job.get("mapred.temp.dir");
String myTemporaryFoldername = reduceTempDir + File.separator + reduceTaskId + File.separator;
File diseaseParent = new File(myTemporaryFoldername + REDUCE_WORK_FOLDER);

The problem with this approach is that I am not sure it is optimal; I also have to delete each new folder myself or I start to run out of space. Thanks, akintayo.

(edit) I found that the best place to keep files that you don't want beyond the life of the map task is job.get("job.local.dir"), which provides a path that will be deleted when the map task finishes. I am not sure whether the delete is done on a per-key basis or once per tasktracker.
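For illustration, a minimal sketch of that scratch-folder pattern using only the JDK. The directory and task-id strings here stand in for the values of job.get("job.local.dir") and job.get("mapred.task.id"); the class and method names are hypothetical, not from Hadoop:

```java
import java.io.File;
import java.io.IOException;

public class TaskScratchDir {

    // Create a unique scratch folder under the task-local directory.
    // localDir stands in for job.get("job.local.dir") and taskId for
    // job.get("mapred.task.id") -- both are assumptions in this sketch.
    static File createScratch(String localDir, String taskId) throws IOException {
        File scratch = new File(localDir, taskId);
        if (!scratch.mkdirs() && !scratch.isDirectory()) {
            throw new IOException("Could not create " + scratch);
        }
        return scratch;
    }

    // Recursively delete the scratch folder so local disk is not exhausted
    // when the framework does not clean it up for us.
    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        f.delete();
    }
}
```

Calling deleteRecursively at the end of the task (or from close()) keeps each attempt's folder from accumulating across task attempts on the same node.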


The problem with that approach is that the sort and shuffle is going to move your data away from where that data was localized.

I do not know much about your data, but the distributed cache might work well for you:

${mapred.local.dir}/taskTracker/archive/ : The distributed cache. This directory holds the localized distributed cache, which is therefore shared among all the tasks and jobs on that node.

http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

"It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.

The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back."
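As a rough sketch of how the DistributedCache is typically wired up in the 0.20-era API (the HDFS path and variable names here are assumptions for illustration, not from the question):

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Driver side: register a file already on HDFS so the framework copies it
// to every task node before the tasks start. "/lookup/table.txt" is a
// hypothetical path.
JobConf conf = new JobConf();
DistributedCache.addCacheFile(new URI("/lookup/table.txt"), conf);

// Task side (e.g. in Mapper.configure): locate the localized copies and
// read them like ordinary local files.
Path[] cached = DistributedCache.getLocalCacheFiles(conf);
```

Note the cache is read-only from the task's point of view; it suits lookup tables shipped to tasks, not per-task intermediate output like the files described in the question.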

