multiple outputs hadoop [duplicate]
Possible Duplicate:
MultipleOutputFormat in hadoop
How can I change the WordCount.java program from the examples so that the word counts for each input file are written to a separate file, instead of a single word count for all files in the default part-00000 file? Also, the output file is always named part-00000 or something along those lines. Can I choose the output filename myself, and if so, how?
I imagine I have to configure this in main somehow, but I have searched and can't find how to do it.
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
Any help appreciated, Ted
Look at MultipleOutputFormat or MultipleOutputs, which are part of the Hadoop API. Either of those should solve your problem. MultipleOutputFormat lets you choose the output file path based on any part (or combination) of the key and value. Here's an example from Hadoop: The Definitive Guide:
protected String generateFileNameForKeyValue(NullWritable key, Text value,
    String name) {
  parser.parse(value);
  return parser.getStationId() + "/" + parser.getYear();
}
For MultipleOutputFormat, all files will have the same format but your results will be written to different files depending on the generated filename.
For MultipleOutputs, your results can be saved using multiple different output formats as well. For instance, if you're processing a server log that had warnings, info messages and errors, you could save each type of message to a different file (via MultipleOutputs) and format each type of output differently.
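To make the MultipleOutputFormat idea concrete for the word count case, here is a minimal sketch in plain Java with no Hadoop dependency. It only illustrates the routing logic: a function derives a file name from each key, and records with the same generated name end up in the same file. The class name OutputRouting, the method names, and the "<source file>:<word>" key layout are all assumptions made for this illustration and are not part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OutputRouting {
    // Hypothetical stand-in for generateFileNameForKeyValue. The key is
    // assumed to look like "<source file>:<word>"; the part before the
    // colon becomes the output file name. Keys without a colon fall back
    // to the default name.
    static String generateFileName(String key, int value, String defaultName) {
        int sep = key.indexOf(':');
        return sep < 0 ? defaultName : key.substring(0, sep);
    }

    // Groups "word<TAB>count" lines by the generated file name, the way an
    // output format would direct each record to a different file.
    static Map<String, List<String>> route(Map<String, Integer> counts) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            String key = e.getKey();
            String file = generateFileName(key, e.getValue(), "part-00000");
            // Strip the "<source file>:" prefix, if present, to recover the word.
            String word = key.substring(key.indexOf(':') + 1);
            files.computeIfAbsent(file, f -> new ArrayList<>())
                 .add(word + "\t" + e.getValue());
        }
        return files;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a.txt:hello", 2);
        counts.put("b.txt:hello", 1);
        counts.put("a.txt:world", 3);
        // Prints one entry per generated file name with its lines.
        route(counts).forEach((file, lines) -> System.out.println(file + " -> " + lines));
    }
}
```

In a real job the mapper would have to put the input file name into the key (or value) so that the file name can be recovered when the output is written; how that is done depends on which Hadoop API version is in use.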
There will be one part-XXXXX output file for every reducer you have. Set mapred.reduce.tasks to an appropriate number (usually a multiple of the number of machines you have), or call job.setNumReduceTasks(n) in main, and that's how many output files you'll get.
As far as choosing names for your files, the easiest way to go is to rename them after the job is done.
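As a rough sketch of the rename-afterwards approach, the following plain Java program renames every part-* file in a directory once the job has finished. It assumes the job output has already been copied to the local filesystem (for example with hadoop fs -get); the class name RenameParts, the method renameParts, and the ".txt" suffix are made up for this example. For output that stays on HDFS, the same idea would go through hadoop fs -mv or the Hadoop FileSystem API instead of java.nio.file.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class RenameParts {
    // Renames each "part-XXXXX" file in outputDir to "<baseName>-XXXXX.txt"
    // and returns the number of files renamed. Other files in the directory
    // (such as _SUCCESS) are left alone.
    static int renameParts(Path outputDir, String baseName) throws IOException {
        // Collect matches first so the directory is not modified mid-iteration.
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        for (Path part : parts) {
            // Keep whatever follows "part-" so files from different reducers stay distinct.
            String suffix = part.getFileName().toString().substring("part-".length());
            Files.move(part, outputDir.resolve(baseName + "-" + suffix + ".txt"));
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: RenameParts <local output dir> <base name>");
            System.exit(2);
        }
        int renamed = renameParts(Paths.get(args[0]), args[1]);
        System.out.println("Renamed " + renamed + " file(s)");
    }
}
```

Keeping the numeric suffix means the result is still one file per reducer, just with a friendlier name.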
 