Copying to and from HDFS within Hadoop Streaming
I asked a similar question earlier, but after some exploring I now have a better understanding of what's going on, and I'd like to see if other people have alternative solutions to my approach.
Problem
Suppose you're trying to write a Hadoop Streaming job that gzips a bunch of really large files on HDFS. The Hadoop Streaming guide suggests that you write a mapper that copies the file from HDFS to the local node, does its work, and then copies the file back to HDFS. Here's a small script, with some extra code that's explained inline, that does a slightly more basic task: simply renaming a file.
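For context, a mapper script like this would typically be launched with a Hadoop Streaming command along these lines; the streaming jar path, the input/output locations, and the script name rename.sh are assumptions for illustration, not details from my actual setup:

# Map-only streaming job: each input line triggers one pass through the script's loop.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -input /path/to/some/trigger/input \
    -output /tmp/rename-job-output \
    -mapper rename.sh \
    -file rename.sh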
The Script
#!/bin/bash

# Remove "s from the environment variable to work around a stupid bug in hadoop.
export HADOOP_CLIENT_OPTS=`echo $HADOOP_CLIENT_OPTS | tr -d '"'`

# Get just the size of the file on the local disk.
function localSize() {
    ls -l $1 | awk '{ print $5 }'
}

# Get just the size of the file on HDFS. Oddly, the first command includes a
# new line at the start of the size, so we remove it by using a substring.
function hdfsSize() {
    s=`hadoop dfs -ls /some/other/path/$1 | awk '{ print $5 }'`
    echo ${s:1}
}

while read line
do
    ds=ourFile.dat

    # Copy the file from HDFS to local disk.
    hadoop dfs -copyToLocal /path/to/some/large/file/$ds $ds

    # Spin until the file is fully copied.
    while [ ! -f $ds ]
    do
        echo "spin"
        sleep 1
    done

    # Delete the renamed version of the file and copy it.
    hadoop dfs -rm /some/other/path/blah
    hadoop dfs -copyFromLocal $ds /some/other/path/blah

    # Print out the sizes of the file on local disk and HDFS; they *should* be equal.
    localSize $ds
    hdfsSize blah

    # If they aren't equal, spin until they are.
    while [ "`localSize $ds`" != "`hdfsSize blah`" ]
    do
        echo "copy spin"
        sleep 1
    done

    # Print out the file size at the end, just for fun.
    hadoop dfs -ls /some/other/path/blah
done
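As an aside, you can sanity-check the mapper outside of Hadoop Streaming by piping a single dummy line into it, since each line read from stdin just triggers one pass through the loop (saving the script as rename.sh is my own assumption here):

chmod +x rename.sh
echo "dummy input line" | ./rename.sh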
Output
After running the script, we get this output:
spin
spin
spin
Deleted hdfs://kracken:54310/some/other/path/blah
200890778
67108864
copy spin
Found 1 items
-rw-r--r-- 3 hadoop supergroup 200890778 2011-10-06 16:00 /home/stevens35/blah
The Issue
It seems clear that hadoop dfs -copyToLocal and hadoop dfs -copyFromLocal are returning before the relevant files have finished transferring, as shown by the spin and copy spin output. My guess is that the Hadoop Streaming JVM adopts the threads created by the hadoop dfs command, so the file transfer keeps running even after hadoop dfs exits, but this is just a guess. This becomes particularly annoying when the file is large and Hadoop Streaming exits before the last file has finished copying; the transfer seems to die midway through, leaving you with a partial file on HDFS. My hack seems to at least ensure that the files finish copying.
I should note that I'm using Cloudera's Hadoop, version 0.20.2+737.
Has anyone encountered this problem? What alternative workarounds have you found? And has the issue been fixed in any newer releases of Hadoop?