
Any suggestions for reading two different datasets into Hadoop at the same time?

Dear hadoopers: I'm new to Hadoop, and I recently tried to implement an algorithm.

This algorithm needs to calculate a matrix that represents the rating difference between every pair of songs. I have already done this, and the output is a 600000*600000 sparse matrix which I stored in my HDFS. Let's call this dataset A (size = 160G).

Now, I need to read the users' profiles to predict their rating for a specific song. So I need to read the users' profiles first (about 5G in size); let's call this dataset B, and then do the calculation using dataset A.

But I don't know how to read the two datasets from a single Hadoop program. Or can I read dataset B into RAM and then do the calculation? (I guess I can't, because HDFS is a distributed file system, and I can't read dataset B into a single machine's memory.)

Any suggestions?


You can use two map functions; each map function can process one dataset if you want to implement different processing. You need to register each mapper with your job conf. For example:

    // Imports needed by the classes below (the enclosing FullOuterJoin class is not shown):
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;
    import org.apache.hadoop.mapred.lib.MultipleInputs;
    import org.apache.hadoop.util.GenericOptionsParser;

    // Mapper for the first dataset: each line is "person_name,book_title".
    // Tags every emitted value with "person_book#" so the reducer can tell the sources apart.
    public static class FullOuterJoinStdDetMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text>
    {
            private String person_name, book_title, file_tag = "person_book#";
            private String emit_value = "";

            public void map(LongWritable key, Text values, OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException
            {
                    String line = values.toString();
                    try
                    {
                            String[] person_detail = line.split(",");
                            person_name = person_detail[0].trim();
                            book_title = person_detail[1].trim();
                    }
                    catch (ArrayIndexOutOfBoundsException e)
                    {
                            person_name = "student name missing";
                    }
                    emit_value = file_tag + person_name;
                    output.collect(new Text(book_title), new Text(emit_value));
            }
    }


    // Mapper for the second dataset: each line is "book_title,author_name".
    // Tags every emitted value with "auth_book#".
    public static class FullOuterJoinResultDetMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text>
    {
            private String author_name, book_title, file_tag = "auth_book#";
            private String emit_value = "";

            public void map(LongWritable key, Text values, OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException
            {
                    String line = values.toString();
                    try
                    {
                            String[] author_detail = line.split(",");
                            author_name = author_detail[1].trim();
                            book_title = author_detail[0].trim();
                    }
                    catch (ArrayIndexOutOfBoundsException e)
                    {
                            author_name = "Not Appeared in Exam";
                    }
                    emit_value = file_tag + author_name;
                    output.collect(new Text(book_title), new Text(emit_value));
            }
    }


    public static void main(String args[]) throws Exception
    {
            if (args.length != 3)
            {
                    System.out.println("Input/output paths missing");
                    System.exit(-1);
            }

            Configuration conf = new Configuration();
            String[] argum = new GenericOptionsParser(conf, args).getRemainingArgs();
            conf.set("mapred.textoutputformat.separator", ",");

            // Pass conf to the JobConf so the separator setting above is actually applied.
            JobConf mrjob = new JobConf(conf);
            mrjob.setJobName("Full_Outer_Join");
            mrjob.setJarByClass(FullOuterJoin.class);

            // Register a different mapper for each input path.
            MultipleInputs.addInputPath(mrjob, new Path(argum[0]), TextInputFormat.class, FullOuterJoinStdDetMapper.class);
            MultipleInputs.addInputPath(mrjob, new Path(argum[1]), TextInputFormat.class, FullOuterJoinResultDetMapper.class);

            FileOutputFormat.setOutputPath(mrjob, new Path(argum[2]));
            mrjob.setReducerClass(FullOuterJoinReducer.class);

            mrjob.setOutputKeyClass(Text.class);
            mrjob.setOutputValueClass(Text.class);

            JobClient.runJob(mrjob);
    }
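
The job above registers FullOuterJoinReducer, which is not shown in the answer. A minimal sketch of what it could look like, assuming it simply pairs the values tagged with person_book# and auth_book# for each book title (this reducer is my own illustration, not part of the original answer):

    // Also requires: import java.util.ArrayList; import java.util.Iterator; import java.util.List;
    public static class FullOuterJoinReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text>
    {
            public void reduce(Text key, Iterator<Text> values,
                            OutputCollector<Text, Text> output, Reporter reporter)
                            throws IOException
            {
                    List<String> persons = new ArrayList<String>();
                    List<String> authors = new ArrayList<String>();

                    // Split the incoming values by the tag each mapper attached.
                    while (values.hasNext())
                    {
                            String value = values.next().toString();
                            if (value.startsWith("person_book#"))
                                    persons.add(value.substring("person_book#".length()));
                            else if (value.startsWith("auth_book#"))
                                    authors.add(value.substring("auth_book#".length()));
                    }

                    // Full outer join: keep a row even when one side has no match.
                    if (persons.isEmpty()) persons.add("NONE");
                    if (authors.isEmpty()) authors.add("NONE");
                    for (String p : persons)
                            for (String a : authors)
                                    output.collect(key, new Text(p + "," + a));
            }
    }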


Hadoop allows you to use different map input formats for different folders. So you can read from several datasources and then cast to specific type in Map function i.e. in one case you got (String,User) in other (String,SongSongRating) and you Map signature is (String,Object). The second step is selection recommendation algorithm, join those data in some way so aggregator will have least information enough to calculate recommendation.
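
To make that concrete for the question above, here is a rough sketch of two mappers for dataset A (the 160G song-song matrix) and dataset B (the 5G user profiles) that tag their output by source and key it on song id. The line formats, class names, and tags are assumptions for illustration only, not from the original post:

    // Assumed line format for dataset A: "songA,songB,score"
    public static class SimilarityMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text>
    {
            public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException
            {
                    String[] f = value.toString().split(",");
                    if (f.length == 3)
                            output.collect(new Text(f[0]), new Text("sim#" + f[1] + "," + f[2]));
            }
    }

    // Assumed line format for dataset B: "userId,songId,rating"
    public static class ProfileMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text>
    {
            public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter)
                    throws IOException
            {
                    String[] f = value.toString().split(",");
                    if (f.length == 3)
                            output.collect(new Text(f[1]), new Text("user#" + f[0] + "," + f[2]));
            }
    }

Each mapper would be registered against its own input path with MultipleInputs.addInputPath, as in the main method above; the reducer then receives all similarity entries and user ratings for one song together, which is enough to compute a predicted rating.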
