Multiple files as input on Amazon Elastic MapReduce
I'm trying to run a job on Elastic MapReduce (EMR) with a custom JAR, processing about 1,000 files in a single directory. When I submit my job with the input parameter s3n://bucketname/compressed/*.xml.gz, I get a "matched 0 files" error. If I pass the absolute path to a single file (e.g. s3n://bucketname/compressed/00001.xml.gz), it runs fine, but then only that one file is processed. I also tried passing just the directory name (s3n://bucketname/compressed/), hoping that the files within it would be processed, but that just passes the directory itself to the job.
At the same time, I have a smaller local Hadoop installation. There, when I submit the same job with a wildcard (/path/to/dir/on/hdfs/*.xml.gz), all 1,000 files are listed and processed correctly.
How do I get EMR to list all my files?
I don't know how EMR expands input paths internally, but here's a piece of code which works for me. It takes the directory itself as args[0] (e.g. s3n://bucketname/compressed/, no wildcard), lists its contents, and registers each file as an input path:

import java.net.URI;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Resolve the filesystem (S3 or HDFS) from the URI scheme of the input argument.
FileSystem fs = FileSystem.get(URI.create(args[0]), job.getConfiguration());

// List every file directly inside the input directory.
FileStatus[] files = fs.listStatus(new Path(args[0]));

// Add each listed file as a separate input path for the job.
for (FileStatus sfs : files) {
    FileInputFormat.addInputPath(job, sfs.getPath());
}

This lists every file in the input directory, and you can then do with them whatever you need.
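If you'd rather keep the wildcard pattern instead of listing a directory, Hadoop's FileSystem also exposes globStatus(), which expands a glob (e.g. *.xml.gz) the same way your local cluster does. A minimal sketch of a helper along those lines, assuming the pattern arrives as the job's first argument; the class and method names here are made up for illustration:

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class GlobInput {
    // Expand a glob like s3n://bucketname/compressed/*.xml.gz client-side,
    // then register every matching file as an input path.
    public static void addGlobInputs(Job job, String pattern) throws IOException {
        FileSystem fs = FileSystem.get(URI.create(pattern), job.getConfiguration());

        // globStatus returns the files matching the pattern,
        // or null/empty if nothing matches.
        FileStatus[] matches = fs.globStatus(new Path(pattern));
        if (matches == null || matches.length == 0) {
            throw new IOException("matched 0 files: " + pattern);
        }
        for (FileStatus status : matches) {
            FileInputFormat.addInputPath(job, status.getPath());
        }
    }
}
```

This keeps the job's command line identical on EMR and on the local cluster, since the glob is expanded by your own code rather than by whatever path handling the platform applies.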