Copying to and from HDFS within Hadoop Streaming
I asked a similar question earlier, but after some exploring I now have a better understanding of what's going on, and I'd like to see if other people have alternative solutions to my approach.
Problem
Suppose you're trying to write a Hadoop Streaming job that gzips a bunch of really large files on HDFS. The Hadoop Streaming guide suggests that you write a mapper that copies the file from HDFS to the local node, does its work, and then copies the file back to HDFS. Here's a small script, with some extra code that's explained inline, that does a slightly more basic task: simply renaming a file.
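For context, a mapper script like this would typically be launched with a Hadoop Streaming command along these lines; the streaming jar path, the input/output locations, and the script name rename.sh are assumptions for illustration, not details from my actual setup:

# Map-only streaming job: each input line triggers one pass through the script's loop.
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=0 \
    -input /path/to/some/trigger/input \
    -output /tmp/rename-job-output \
    -mapper rename.sh \
    -file rename.sh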
The Script
#!/bin/bash

# Remove "s from the environment variable to work around a stupid bug in hadoop.
export HADOOP_CLIENT_OPTS=`echo $HADOOP_CLIENT_OPTS | tr -d '"'`

# Get just the size of the file on the local disk.
function localSize() {
    ls -l $1 | awk '{ print $5 }'
}

# Get just the size of the file on HDFS. Oddly, the first command includes a
# new line at the start of the size, so we remove it by using a substring.
function hdfsSize() {
    s=`hadoop dfs -ls /some/other/path/$1 | awk '{ print $5 }'`
    echo ${s:1}
}

while read line
do
    ds=ourFile.dat

    # Copy the file from HDFS to local disk.
    hadoop dfs -copyToLocal /path/to/some/large/file/$ds $ds

    # Spin until the file is fully copied.
    while [ ! -f $ds ]
    do
        echo "spin"
        sleep 1
    done

    # Delete the renamed version of the file and copy it.
    hadoop dfs -rm /some/other/path/blah
    hadoop dfs -copyFromLocal $ds /some/other/path/blah

    # Print out the sizes of the file on local disk and HDFS; they *should* be equal.
    localSize $ds
    hdfsSize blah

    # If they aren't equal, spin until they are.
    while [ "`localSize $ds`" != "`hdfsSize blah`" ]
    do
        echo "copy spin"
        sleep 1
    done

    # Print out the file size at the end, just for fun.
    hadoop dfs -ls /some/other/path/blah
done
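As an aside, you can sanity-check the mapper outside of Hadoop Streaming by piping a single dummy line into it, since each line read from stdin just triggers one pass through the loop (saving the script as rename.sh is my own assumption here):

chmod +x rename.sh
echo "dummy input line" | ./rename.sh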
Output
After running the script, we get this output:
spin
spin
spin
Deleted hdfs://kracken:54310/some/other/path/blah
200890778
67108864
copy spin
Found 1 items
-rw-r--r-- 3 hadoop supergroup 200890778 2011-10-06 16:00 /home/stevens35/blah
The Issue
It seems clear that hadoop dfs -copyToLocal and hadoop dfs -copyFromLocal are returning before the relevant files have finished transferring, as shown by the spin and copy spin output. My guess is that the Hadoop Streaming JVM adopts the threads created by the hadoop dfs command, so the file transfer keeps running even after hadoop dfs exits, but this is just a guess. This becomes particularly annoying when the file is large and Hadoop Streaming exits before the last file has finished copying; the transfer seems to die midway through, leaving you with a partial file on HDFS. My hack seems to at least ensure that the files finish copying.
I should note that I'm using Cloudera's Hadoop, version 0.20.2+737.
Has anyone encountered this problem? What alternative workarounds have you found? And has the issue been fixed in any newer releases of Hadoop?