How do I prevent `hadoop fs -rmr <uri>` from creating $folder$ files?
We're using Amazon's Elastic MapReduce to perform some large file-processing jobs. As part of our workflow, we occasionally need to remove files from S3 that may already exist. We do so using the hadoop fs interface, like this:
hadoop fs -rmr s3://mybucket/a/b/myfile.log
This removes the file from S3 appropriately, but in its place leaves an empty file named "s3://mybucket/a/b_$folder$". As described in this question, Hadoop's Pig is unable to handle these files, so later steps in the workflow can choke on this file.
(Note: it doesn't seem to matter whether we use `-rmr` or `-rm`, or whether we use `s3://` or `s3n://` as the scheme; all of these exhibit the described behavior.)
How do I use the `hadoop fs` interface to remove files from S3 without leaving these troublesome files behind?
I wasn't able to figure out if it's possible to use the hadoop fs interface in this way. However, the s3cmd interface does the right thing (but only for one key at a time):
s3cmd del s3://mybucket/a/b/myfile.log
This requires configuring a `~/.s3cfg` file with your AWS credentials first; running `s3cmd --configure` will interactively help you create this file.
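Putting it together, a cleanup step in our workflow might look roughly like this (the bucket and key names are just the ones from the example above, so treat this as a sketch rather than a drop-in script):

# One-time setup: interactively writes ~/.s3cfg with your AWS credentials
s3cmd --configure

# Delete the data file directly through the S3 API -- no _$folder$ marker is created
s3cmd del s3://mybucket/a/b/myfile.log

# If an earlier hadoop fs -rmr already left a marker behind, it can be removed the same way
# (quoted so the shell doesn't try to expand $folder)
s3cmd del 's3://mybucket/a/b_$folder$'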
This is a consequence of how the native S3 support is implemented in Hadoop; see http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/s3native/NativeS3FileSystem.html.
So use s3cmd.
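And since s3cmd del only handles one key at a time, a simple shell loop covers steps that need to clear several files; the key names below are purely hypothetical:

# Hypothetical list of keys a job step needs to clear before it runs
for key in a/b/part-00000.log a/b/part-00001.log; do
    s3cmd del "s3://mybucket/$key"
done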