Copying to and from HDFS within Hadoop Streaming

I asked a similar question to this earlier, but after doing some exploring I have a better understanding of what's going on, and I'd like to see if other people have alternative solutions to my approach.

Problem

Suppose you're trying to write a Hadoop streaming job that gzips a bunch of really large files on HDFS. The Hadoop Streaming guide suggests that you write a mapper to copy the file from HDFS onto the local node, do your work, then copy the file back to HDFS. Here's a small script, with some extra code that's explained inline, that does a slightly more basic task: simply renaming a file.
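
For context, a script like this would run as the mapper of a streaming job launched roughly as follows. The jar location, input, and output paths here are made up, and the exact options vary a little between releases, so treat this as a sketch rather than the exact command:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
  -input /path/to/list/of/filenames \
  -output /tmp/rename-job-output \
  -mapper rename.sh \
  -file rename.sh \
  -numReduceTasks 0

Each mapper receives the lines of its input split on stdin, which is why the script below reads lines in a loop even though it never uses their contents.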

The Script

#!/bin/bash

# Remove "s from the environment variable to work around a stupid bug in hadoop.
export HADOOP_CLIENT_OPTS=`echo $HADOOP_CLIENT_OPTS | tr -d '"'`

# Get just the size of the file on the local disk.
function localSize() {
 ls -l $1 | awk '{ print $5 }'
}

# Get just the size of the file on HDFS.  Oddly, the output includes a
# new line at the start of the size (the "Found 1 items" header line has
# no fifth field, so awk prints an empty line for it), which we remove by
# taking a substring.
function hdfsSize() {
 s=`hadoop dfs -ls /some/other/path/$1 | awk '{ print $5 }'`
 echo ${s:1}
}

while read line
do
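 # A real mapper would use $line to decide which file to process; for
 # this test the file name is simply hardcoded.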
 ds=ourFile.dat
 # Copy the file from HDFS to local disk.
 hadoop dfs -copyToLocal /path/to/some/large/file/$ds $ds
 # Spin until the file is fully copied.
 while [ ! -f $ds ]
 do 
  echo "spin"
  sleep 1 
 done

 # Delete any old copy at the destination, then copy the file back up under its new name.
 hadoop dfs -rm /some/other/path/blah
 hadoop dfs -copyFromLocal $ds /some/other/path/blah
 # Print out the sizes of the file on local disk and hdfs, they *should* be equal
 localSize $ds
 hdfsSize blah
 # If they aren't equal, spin until they are.
 while [ "`localSize $ds`" != "`hdfsSize blah`" ]
 do
  echo "copy spin"
  sleep 1
 done
 # Print out the file size at the end, just for fun.
 hadoop dfs -ls /some/other/path/blah
done

Output

After running the script, we get this output:

spin
spin
spin
Deleted hdfs://kracken:54310/some/other/path/blah
200890778
67108864
copy spin
Found 1 items   
-rw-r--r--   3 hadoop supergroup  200890778 2011-10-06 16:00 /home/stevens35/blah

The Issue

It seems clear that hadoop dfs -copyToLocal and hadoop dfs -copyFromLocal return before the relevant files have finished transferring, as shown by the spin and copy spin output. (Notice that the size reported right after the copy, 67108864 bytes, is exactly 64MB, the default HDFS block size, suggesting only the first block had landed at that point.) My guess is that the Hadoop streaming JVM adopts the thread created by the hadoop dfs command, so the file transfer threads keep running even after hadoop dfs exits, but this is just a guess. This becomes particularly annoying when the file is large and Hadoop streaming exits before the last file has finished copying; the transfer seems to die midway through, and you're left with a partial file on HDFS. This hack of mine seems to at least ensure that the files finish copying.
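
One alternative I'd consider, rather than spinning on file existence: do the copy, then poll the size a bounded number of times, and redo the whole copy if it stalls, so a transfer that dies midway gets retried instead of leaving a partial file behind. This is an untested sketch; copyAndVerify is a name I made up, and the 30-check retry policy is arbitrary:

# Copy a local file to HDFS and poll until the sizes match; if they still
# disagree after 30 checks, assume the transfer died, delete the partial
# file, and start the copy over.
function copyAndVerify() {
 src=$1
 dest=$2
 hadoop dfs -rm $dest
 hadoop dfs -copyFromLocal $src $dest
 i=0
 while [ "`ls -l $src | awk '{ print $5 }'`" != "`hadoop dfs -ls $dest | awk 'NF >= 5 { print $5 }'`" ]
 do
  i=$((i + 1))
  if [ $i -ge 30 ]
  then
   echo "copy stalled, retrying" >&2
   hadoop dfs -rm $dest
   hadoop dfs -copyFromLocal $src $dest
   i=0
  fi
  sleep 1
 done
}

The awk condition NF >= 5 skips the "Found 1 items" header line (it has no fifth field), which also sidesteps the leading-newline quirk that hdfsSize works around with a substring.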

I should note that I'm using Cloudera's hadoop version 0.20.2+737.

Has anyone encountered this problem? What alternative workarounds have you found? And has the issue been fixed in any newer releases of Hadoop?
