
Hadoop, map/reduce output file (part-00000) and the distributed cache

The value output from my map/reduce is a BytesWritable array, which is written to the output file part-00000 (Hadoop does so by default). I need this array for my next map function, so I wanted to keep it in the distributed cache. Can somebody tell me how I can read from the output file (part-00000), which may not be a text file, and store it in the distributed cache?
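One common pattern worth noting alongside the answer below: you don't have to convert part-00000 to text at all. A minimal sketch, assuming the old `org.apache.hadoop.mapred` API and that part-00000 was written as a SequenceFile with BytesWritable values (the HDFS path here is hypothetical):

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.ReflectionUtils;

public class CacheSketch {

    // In the driver of the second job: register the first job's binary
    // output file in the distributed cache as-is.
    public static void addToCache(JobConf conf) throws Exception {
        // Hypothetical path -- use the output directory of your first job.
        DistributedCache.addCacheFile(
                new URI("hdfs:///user/me/job1-output/part-00000"), conf);
    }

    // In the mapper's configure(): open the locally cached copy as a
    // SequenceFile and read the BytesWritable values back.
    public static void readCached(JobConf conf) throws IOException {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(
                FileSystem.getLocal(conf), cached[0], conf);
        Writable key = (Writable) ReflectionUtils.newInstance(
                reader.getKeyClass(), conf);
        BytesWritable value = new BytesWritable();
        while (reader.next(key, value)) {
            // Only the first value.getLength() bytes of the backing
            // array are valid data.
            byte[] bytes = value.getBytes();
            // ... stash the array for use in map()
        }
        reader.close();
    }
}
```

This avoids a second job entirely when all you need is the array available on each task node.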


My suggestion:

Create a new Hadoop job with the following properties:

  • Input the directory with all the part-... files.
  • Create a custom OutputFormat class that writes to your distributed cache.
  • Configure the job to look essentially like this:

    // Read the binary part-... files back as SequenceFiles
    conf.setInputFormat(SequenceFileInputFormat.class);
    // Pass every record through unchanged
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    // Your custom OutputFormat from the step above
    conf.setOutputFormat(DistributedCacheOutputFormat.class);
    
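A skeleton of what such a custom OutputFormat could look like — a sketch only, assuming the old `org.apache.hadoop.mapred` API; `DistributedCacheOutputFormat` is not a stock Hadoop class, and everything beyond the `OutputFormat` contract here is an assumption you would fill in:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.OutputFormat;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class DistributedCacheOutputFormat
        implements OutputFormat<Writable, BytesWritable> {

    @Override
    public RecordWriter<Writable, BytesWritable> getRecordWriter(
            FileSystem ignored, JobConf job, String name,
            Progressable progress) throws IOException {
        return new RecordWriter<Writable, BytesWritable>() {
            @Override
            public void write(Writable key, BytesWritable value)
                    throws IOException {
                // Write the bytes to a side location that the driver
                // later registers with DistributedCache.addCacheFile(...).
            }

            @Override
            public void close(Reporter reporter) throws IOException {
                // Flush and close the side file here.
            }
        };
    }

    @Override
    public void checkOutputSpecs(FileSystem ignored, JobConf job)
            throws IOException {
        // No output preconditions for this sketch.
    }
}
```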

Have a look at the Yahoo Hadoop tutorial because it has some examples on this point: http://developer.yahoo.com/hadoop/tutorial/module5.html#outputformat

HTH

