Managing dependencies with Hadoop Streaming?
I have a quick Hadoop Streaming question. If I'm using Python streaming and I have Python packages that my mappers/reducers require but aren't installed by default do I need to install those on all the Hadoop machines as well or is there some sort of serialization that sends them to the remote machines?
If they're not installed on your task boxes, you can send them with -file. If you need a package or other directory structure, you can send a zipfile, which will be unpacked for you. Here's a Hadoop 0.17 invocation:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.17.0-streaming.jar -mapper mapper.py -reducer reducer.py -input input/foo -output output -file /tmp/foo.py -file /tmp/lib.zip
However, see this issue for a caveat:
https://issues.apache.org/jira/browse/MAPREDUCE-596
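On the Python side, here is a minimal sketch of how a mapper might pick up the shipped archive. It assumes lib.zip (or a directory unpacked from it under the same name) ends up in the task's working directory and contains only pure-Python modules; the helper names add_archive_to_path and map_lines are made up for illustration and are not part of Hadoop Streaming.

```python
import os
import sys


def add_archive_to_path(archive_name="lib.zip", search_dir="."):
    """Prepend a shipped archive (zip file or unpacked directory) to sys.path.

    Returns the absolute path that was added, or None if nothing was found.
    Hypothetical helper: assumes the archive sits in the task's working dir.
    """
    archive_path = os.path.abspath(os.path.join(search_dir, archive_name))
    if not os.path.exists(archive_path):
        return None
    if archive_path not in sys.path:
        # Python can import pure-Python modules straight from a zip on sys.path.
        sys.path.insert(0, archive_path)
    return archive_path


def map_lines(lines):
    """Yield tab-separated "word<TAB>1" records, one per input word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word
```

In mapper.py you would call add_archive_to_path() before importing anything that lives in the archive, then print each record from map_lines(sys.stdin). Compiled extension modules cannot be imported from a zip this way, so this only helps for pure-Python dependencies.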
If you use Dumbo you can use -libegg to distribute egg files and auto-configure the Python runtime:
https://github.com/klbostee/dumbo/wiki/Short-tutorial#wiki-eggs_and_jars
https://github.com/klbostee/dumbo/wiki/Configuration-files