Running Hadoop MapReduce, is it possible to call external executables outside of HDFS?
Within my mapper I'd like to call external software installed on the worker node, outside of HDFS. Is this possible? What is the best way to do this?
I understand that this may take away some of the advantages/scalability of MapReduce, but I'd like to interact with HDFS and also call compiled/installed external software code within my mapper to process some data.
Mappers (and reducers) are like any other process on the box: as long as the TaskTracker user has permission to run the executable, there is no problem doing so. There are a few ways to call external processes, but since we are already in Java, ProcessBuilder seems a logical place to start.
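A minimal sketch of that approach, assuming a hypothetical tool at /usr/local/bin/mytool that takes one record as an argument and prints its result to stdout (the path and argument convention are placeholders for whatever is installed on your nodes):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ExternalToolMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Launch the locally installed executable, passing the record as an argument.
        ProcessBuilder pb = new ProcessBuilder("/usr/local/bin/mytool", value.toString());
        pb.redirectErrorStream(true); // merge stderr into stdout for simpler reading
        Process p = pb.start();

        // Collect everything the tool writes to stdout.
        StringBuilder output = new StringBuilder();
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
        }

        // Fail the task if the tool reports an error.
        int exitCode = p.waitFor();
        if (exitCode != 0) {
            throw new IOException("mytool failed with exit code " + exitCode);
        }
        context.write(new Text(key.toString()), new Text(output.toString()));
    }
}
```

Note that spawning a process per record can be expensive; for heavyweight tools you may want to batch records or keep a long-running process alive across map() calls.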
EDIT: Just found that Hadoop has a class explicitly for this purpose: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Shell.html
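For example, Shell.execCommand runs a command on the local node and returns its stdout as a String, surfacing a nonzero exit code as an exception. A quick sketch (the command here is just an illustration):

```java
import java.io.IOException;

import org.apache.hadoop.util.Shell;

public class ShellDemo {
    public static void main(String[] args) throws IOException {
        // Runs the command locally and captures its stdout;
        // a nonzero exit code is thrown as an IOException.
        String output = Shell.execCommand("uname", "-a");
        System.out.println(output);
    }
}
```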
This is certainly doable. You may find it best to work with Hadoop Streaming. As the Hadoop Streaming documentation says:
Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.
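A typical invocation looks something like this (the jar location and the mapper/reducer scripts are placeholders for whatever you have installed; -file ships each script out to the worker nodes):

```bash
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
    -input  /user/me/input \
    -output /user/me/output \
    -mapper  my_mapper.sh \
    -reducer my_reducer.sh \
    -file    my_mapper.sh \
    -file    my_reducer.sh
```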
I tend to start with external code via Hadoop Streaming. Whatever your language, there are likely good examples of using it with Streaming; once you are inside your language of choice, you can usually pipe data out to yet another program if desired. I have had several layers of programs in different languages playing nicely together with no more effort than running them on a normal Linux box, beyond just getting the outer layer working with Hadoop Streaming.