
Make use of the relation name/table name/file name in Hadoop's MapReduce

Is there a way to use the relation name in MapReduce's Map and Reduce? I am trying to do Set difference using Hadoop's MapReduce.

Input: two files, R and S, each containing a list of terms. (I will use t to denote a term.)

Objective: To find R - S, i.e. terms in R and not in S

Approach:

Mapper: Emits t -> R or t -> S, depending on whether t comes from R or S. So the map output has t as the key and the file name as the value.

Reducer: If the value list for a t contains only R, then output t -> t.
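
To make the approach concrete, here is a minimal plain-Java sketch of the tag-and-filter logic, with no Hadoop involved. The class and method names (SetDifferenceSketch, mapPhase, reducePhase, difference) are made up for illustration, and the in-memory TreeMap only stands in for the shuffle that groups values by key; it is not how Hadoop implements it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical, Hadoop-free simulation of the approach described above.
class SetDifferenceSketch {

    // "Map" step: tag every term with the name of the relation it came from.
    // Grouping by term (the key) imitates what the shuffle would do.
    static void mapPhase(String relation, List<String> terms,
                         Map<String, Set<String>> grouped) {
        for (String t : terms) {
            grouped.computeIfAbsent(t, k -> new HashSet<>()).add(relation);
        }
    }

    // "Reduce" step: keep a term only if it was tagged R and never tagged S.
    static List<String> reducePhase(Map<String, Set<String>> grouped) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : grouped.entrySet()) {
            Set<String> tags = e.getValue();
            if (tags.contains("R") && !tags.contains("S")) {
                out.add(e.getKey());
            }
        }
        return out;
    }

    // Computes R - S by running the two steps in sequence.
    static List<String> difference(List<String> r, List<String> s) {
        Map<String, Set<String>> grouped = new TreeMap<>(); // keys kept sorted
        mapPhase("R", r, grouped);
        mapPhase("S", s, grouped);
        return reducePhase(grouped);
    }

    public static void main(String[] args) {
        List<String> r = Arrays.asList("apple", "banana", "cherry");
        List<String> s = Arrays.asList("banana", "date");
        System.out.println(difference(r, s)); // prints [apple, cherry]
    }
}
```

The sketch shows why the tag matters: without knowing which relation each occurrence of t came from, the reduce step cannot tell "only in R" apart from "only in S".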

Do I need to somehow tag the terms with the file name? Or is there another way?

Below is the source code for something I wrote for set union (which doesn't need the file name anywhere). I'm including it only to illustrate that the file name isn't available in the Mapper.

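import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class Union {
        public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

                // Emit each input line (a term) as both key and value.
                public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {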
                        output.collect(value, value);
                }
        }

        public static class Reduce extends MapReduceBase implements Reducer<Text, Text, Text, Text> {

                public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException{
                        // Emit each distinct term once, however many times it occurred.
                        if (values.hasNext())
                        {
                                output.collect(key, values.next());
                        }
                }
        }

        public static void main(String[] args) throws Exception {
                JobConf conf = new JobConf(Union.class);
                conf.setJobName("Union");
                conf.setOutputKeyClass(Text.class);
                conf.setOutputValueClass(Text.class);

                conf.setMapperClass(Map.class);
                conf.setCombinerClass(Reduce.class);
                conf.setReducerClass(Reduce.class);
                conf.set("mapred.job.queue.name", "myQueue");
                conf.setNumReduceTasks(5);

                conf.setInputFormat(TextInputFormat.class);
                conf.setOutputFormat(TextOutputFormat.class);

                FileInputFormat.setInputPaths(conf, new Path(args[0]));
                FileOutputFormat.setOutputPath(conf, new Path(args[1]));

                JobClient.runJob(conf);
        }
}

As you can see, I can't tell which key -> value pair (the input to the Mapper) came from which file. Am I overlooking something simple here?

Thanks much.


I would implement it just the way you described. That is how MapReduce is meant to be used.
I guess your actual problem was writing the same value n times to HDFS?

EDIT: Pasted from my comment below

Ah I got it ;) I'm not really familiar with the "old" API, but you can "query" your Reporter with:

reporter.getInputSplit();

This returns an object of the interface type InputSplit. For file-based input formats you can cast it to FileSplit. From the FileSplit object you can obtain the Path with split.getPath(), and on the Path object you just need to call the getName() method.

So this snippet should work for you:

FileSplit fsplit = (FileSplit) reporter.getInputSplit(); // the cast is required: getInputSplit() returns InputSplit
String yourFileName = fsplit.getPath().getName();