
How do I concatenate a lot of files into one inside Hadoop, with no mapping or reduction?

I'm trying to combine multiple files from multiple input directories into a single file, for various odd reasons I won't go into. My initial try was to write a 'null' mapper and reducer that just copied input to output, but that failed. My latest try is:

vcm_hadoop lester jar /vcm/home/apps/hadoop/contrib/streaming/hadoop-*-streaming.jar -input /cruncher/201004/08/17/00 -output /lcuffcat9 -mapper /bin/cat -reducer NONE

but I end up with multiple output files anyway. Anybody know how I can coax everything into a single output file?


Keep the cat mappers and use a single cat reducer. Make sure you set the number of reducers to one. Note that the output will also have gone through the sort phase.
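As a rough sketch, assuming the same wrapper and streaming jar as in the question, pinning the reducer count to one might look like this (the -D generic option has to appear before the streaming-specific options; very old streaming releases used -jobconf mapred.reduce.tasks=1 instead):

vcm_hadoop lester jar /vcm/home/apps/hadoop/contrib/streaming/hadoop-*-streaming.jar -D mapred.reduce.tasks=1 -input /cruncher/201004/08/17/00 -output /lcuffcat9 -mapper /bin/cat -reducer /bin/cat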

You need a reduce stage because the number of mappers is only a hint to Hadoop, whereas the number of reducers you request is actually honored.

If you don't want the content reordered by the sort, you could instead have your mappers take filenames as input. Each mapper reads the named file and emits the filename plus line number as the key and the line itself as the value; the reducer then throws away the key and outputs only the value. A rough sketch of that approach follows.
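A minimal sketch of that variant, assuming a Hadoop streaming job whose mapper input is a list of HDFS file paths and whose scripts shell out to hadoop fs -cat (mapper.py and reducer.py are hypothetical names, not from the question):

#!/usr/bin/env python
# mapper.py -- each input line is an HDFS file path; emit "path:lineno" as the
# key and the file's line as the value.
import subprocess
import sys

for raw in sys.stdin:
    path = raw.strip().split('\t')[-1]  # drop any key the input format prepends
    if not path:
        continue
    # Stream the file out of HDFS with the hadoop CLI.
    proc = subprocess.Popen(['hadoop', 'fs', '-cat', path],
                            stdout=subprocess.PIPE, universal_newlines=True)
    for lineno, line in enumerate(proc.stdout):
        # Zero-padded line numbers keep each file's lines in order after the sort.
        sys.stdout.write('%s:%010d\t%s' % (path, lineno, line))
    proc.wait()

#!/usr/bin/env python
# reducer.py -- discard the key, emit only the value.
import sys

for raw in sys.stdin:
    key, tab, value = raw.partition('\t')
    sys.stdout.write(value if tab else raw)

You would ship these with the job via -file mapper.py -file reducer.py, point -mapper and -reducer at them, and again run with a single reducer.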
