How to tell MapReduce how many mappers to use at the same time?
I am writing an indexing app for MapReduce. I was able to split the input with NLineInputFormat, and now I've got a few hundred mappers in my app. However, only 2 of them per machine are active at the same time; the rest are "PENDING". I believe this slows the app down significantly.
How do I make Hadoop run at least 100 of those at the same time per machine?
I am using the old Hadoop API syntax. Here's what I've tried so far:
conf.setNumMapTasks(1000);
conf.setNumTasksToExecutePerJvm(500);
Neither of those seems to have any effect.
Any ideas how I can make the mappers actually RUN in parallel?
JobConf.setNumMapTasks() is only a hint to the MR framework, and I am not sure what effect calling it has. In your case the total number of map tasks across the whole job should equal the total number of lines in the input divided by the number of lines per split configured for NLineInputFormat. You can find more details on the total number of map/reduce tasks across the whole job here.
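As a rough illustration of that division, here is a small sketch in plain Java. It assumes a single input file and rounds up, since a trailing partial group of lines still needs its own split. The class and method names are made up for this example and are not part of Hadoop.

```java
// Hypothetical helper, not a Hadoop class: estimates how many map tasks
// NLineInputFormat would produce for ONE input file.
final class MapTaskEstimate {

    // Returns ceil(totalLines / linesPerSplit) using integer arithmetic.
    static long numMapTasks(long totalLines, int linesPerSplit) {
        if (totalLines < 0 || linesPerSplit <= 0) {
            throw new IllegalArgumentException("totalLines must be >= 0 and linesPerSplit > 0");
        }
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        // Example figures are invented for illustration only.
        System.out.println(numMapTasks(1000, 1)); // one line per mapper
        System.out.println(numMapTasks(1000, 3)); // last split holds the leftover line
    }
}
```

With several input files the same calculation would apply to each file separately, so the job total is the sum of the per-file results.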
The description for mapred.tasktracker.map.tasks.maximum says
The maximum number of map tasks that will be run simultaneously by a task tracker.
To change how many map tasks the task tracker runs in parallel on a particular node, you need to configure mapred.tasktracker.map.tasks.maximum, which defaults to 2. I could not find the documentation for 0.20.2, so I am not sure whether the parameter exists in that release or whether it goes by the same name.
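To see why the per-node limit matters more than the number of map tasks, here is a back-of-the-envelope sketch in plain Java. It assumes every node has the same number of map slots and that tasks run in full "waves"; the class, method names, and figures are hypothetical. As far as I know the limit is read by each task tracker from its own configuration (typically mapred-site.xml) at startup, so setting it on the job's JobConf would not be expected to change it.

```java
// Hypothetical helper, not a Hadoop class: relates the per-node slot limit
// (mapred.tasktracker.map.tasks.maximum) to how many maps run at once.
final class MapSlotEstimate {

    // Maps that can run simultaneously across the cluster,
    // assuming every node has the same slot limit.
    static long concurrentMaps(int nodes, int slotsPerNode) {
        if (nodes <= 0 || slotsPerNode <= 0) {
            throw new IllegalArgumentException("nodes and slotsPerNode must be > 0");
        }
        return (long) nodes * slotsPerNode;
    }

    // Number of scheduling "waves" needed to run all map tasks: ceil(tasks / slots).
    static long waves(long mapTasks, long concurrentSlots) {
        if (mapTasks < 0 || concurrentSlots <= 0) {
            throw new IllegalArgumentException("mapTasks must be >= 0 and concurrentSlots > 0");
        }
        return (mapTasks + concurrentSlots - 1) / concurrentSlots;
    }

    public static void main(String[] args) {
        long tasks = 300; // invented figure standing in for "a few hundred mappers"
        System.out.println(waves(tasks, concurrentMaps(1, 2)));   // default of 2 slots per node
        System.out.println(waves(tasks, concurrentMaps(1, 100))); // limit raised to 100
    }
}
```

Raising the limit only helps up to what the machine's cores and memory can actually sustain, so a value like 100 per node is worth measuring rather than assuming.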