
Hadoop Streaming Multiple Files per Map Job

I have a Hadoop streaming setup that works, but there is some overhead when initializing the mappers, which happens once per file. Since I am processing many files, I notice I'm spending a lot of time in initialization.

Is there a way, without writing any Java, to specify that I want to reuse the same mapper instance for multiple files to amortize the initialization cost?
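For context, a streaming mapper already does its setup once per task and then loops over all lines on stdin, so the per-file cost the question describes comes from launching a new task (and JVM) per file. A minimal sketch of this pattern, where `expensive_setup` is a hypothetical placeholder for the costly initialization:

```python
import sys


def expensive_setup():
    # Placeholder for costly one-time work (loading a dictionary, a model, etc.).
    return {"multiplier": 2}


def run_mapper(lines, out):
    state = expensive_setup()  # runs once per mapper task, not once per input line
    for line in lines:
        word = line.strip()
        if word:
            # Emit tab-separated key/value pairs, the streaming convention.
            out.write(f"{word}\t{state['multiplier']}\n")


if __name__ == "__main__":
    run_mapper(sys.stdin, sys.stdout)
```

The setup runs once per task, so amortizing it further means reusing the same JVM across tasks, which is what the answer below configures.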


In $HADOOP_HOME/conf/mapred-site.xml, add or edit the following property:

<property>
    <name>mapred.job.reuse.jvm.num.tasks</name>
    <value>#</value>
</property>

Replace # with the number of tasks the JVM should be reused for (the default is 1), or with -1 for no limit on reuse.

It's also possible to specify this per job by setting mapred.job.reuse.jvm.num.tasks in the job configuration to the desired value.
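For a streaming job, the per-job value can be passed on the command line with -D (generic options must come before the streaming-specific ones). A sketch, where the jar path, input/output paths, and script names are illustrative placeholders:

```shell
# Reuse each JVM for an unlimited number of tasks in this job (-1 = no limit).
# Paths and script names below are placeholders, not real values from the question.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -D mapred.job.reuse.jvm.num.tasks=-1 \
    -input /path/to/input \
    -output /path/to/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```

Setting it per job avoids changing the cluster-wide default in mapred-site.xml.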
