Hadoop Streaming Multiple Files per Map Job
I have a Hadoop streaming setup that works, but there is some overhead when initializing the mappers. Initialization happens once per file, and since I am processing many files, I'm spending a lot of time in initialization.
Is there a way, without writing any Java, to specify that I want to reuse the same mapper instance for multiple files to amortize the initialization cost?
In $HADOOP_HOME/conf/mapred-site.xml, add or edit the following property:
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>#</value>
</property>
The # can be set to the number of times the JVM should be reused (the default is 1), or to -1 for no limit on the amount of reuse.
It's also possible to specify it per job by setting mapred.job.reuse.jvm.num.tasks in the job configuration to the desired value.
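For example, the per-job setting can be passed with -D on the streaming command line. This is a sketch; the streaming jar path and the mapper/reducer script names are assumptions and will differ in your setup:

```shell
# Hypothetical streaming invocation; adjust the jar path and script
# names to match your installation.
# -D mapred.job.reuse.jvm.num.tasks=-1 reuses each JVM for an
# unlimited number of tasks in this job, amortizing mapper
# initialization across many input files.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.job.reuse.jvm.num.tasks=-1 \
    -input /path/to/input \
    -output /path/to/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py
```

Note that generic options such as -D must appear before the streaming-specific options like -input and -mapper.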