
Where should Map put temporary files when running under Hadoop

I am running Hadoop 0.20.1 under SLES 10 (SUSE).

My Map task takes a file and generates a few more, I then generate my results from these files. I would like to know where I should place these files, so that performance is good and there are no collisions. If Hadoop can delete the directory automatically - that would be nice.

Right now, I am using the temp folder and task id, to create a unique folder, and then working within subfolders of that folder.

// Build a unique scratch folder from the task id inside the job's temp dir
reduceTaskId = job.get("mapred.task.id");
reduceTempDir = job.get("mapred.temp.dir");
String myTemporaryFoldername = reduceTempDir + File.separator + reduceTaskId + File.separator;
File diseaseParent = new File(myTemporaryFoldername + REDUCE_WORK_FOLDER);

The problem with this approach is that I am not sure it is optimal; I also have to delete each new folder myself or I start to run out of space. Thanks, akintayo.

(edit) I found that the best place to keep files that you don't want beyond the life of the map task is job.get("job.local.dir"), which provides a path that will be deleted when the map task finishes. I am not sure whether the delete is done on a per-key basis or once per tasktracker.
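For illustration, a minimal sketch of that scratch-folder pattern using only the JDK. The directory and task-id strings here stand in for the values of job.get("job.local.dir") and job.get("mapred.task.id"); the class and method names are hypothetical, not from Hadoop:

```java
import java.io.File;
import java.io.IOException;

public class TaskScratchDir {

    // Create a unique scratch folder under the task-local directory.
    // localDir stands in for job.get("job.local.dir") and taskId for
    // job.get("mapred.task.id") -- both are assumptions in this sketch.
    static File createScratch(String localDir, String taskId) throws IOException {
        File scratch = new File(localDir, taskId);
        if (!scratch.mkdirs() && !scratch.isDirectory()) {
            throw new IOException("Could not create " + scratch);
        }
        return scratch;
    }

    // Recursively delete the scratch folder so local disk is not exhausted
    // when the framework does not clean it up for us.
    static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        f.delete();
    }
}
```

Calling deleteRecursively at the end of the task (or from close()) keeps each attempt's folder from accumulating across task attempts on the same node.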


The problem with that approach is that the sort and shuffle is going to move your data away from where that data was localized.

I do not know much about your data, but the distributed cache might work well for you:

${mapred.local.dir}/taskTracker/archive/ : The distributed cache. This directory holds the localized distributed cache, which is therefore shared among all the tasks and jobs on that node.

http://www.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/

"It is common for a MapReduce program to require one or more files to be read by each map or reduce task before execution. For example, you may have a lookup table that needs to be parsed before processing a set of records. To address this scenario, Hadoop’s MapReduce implementation includes a distributed file cache that will manage copying your file(s) out to the task execution nodes.

The DistributedCache was introduced in Hadoop 0.7.0; see HADOOP-288 for more detail on its origins. There is a great deal of existing documentation for the DistributedCache: see the Hadoop FAQ, the MapReduce Tutorial, the Hadoop Javadoc, and the Hadoop Streaming Tutorial. Once you’ve read the existing documentation and understand how to use the DistributedCache, come on back."
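As a rough sketch of how the DistributedCache is typically wired up in the 0.20-era API (the HDFS path and variable names here are assumptions for illustration, not from the question):

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Driver side: register a file already on HDFS so the framework copies it
// to every task node before the tasks start. "/lookup/table.txt" is a
// hypothetical path.
JobConf conf = new JobConf();
DistributedCache.addCacheFile(new URI("/lookup/table.txt"), conf);

// Task side (e.g. in Mapper.configure): locate the localized copies and
// read them like ordinary local files.
Path[] cached = DistributedCache.getLocalCacheFiles(conf);
```

Note the cache is read-only from the task's point of view; it suits lookup tables shipped to tasks, not per-task intermediate output like the files described in the question.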

