
How to merge 2 bzip2'ed files?

I want to merge 2 bzip2'ed files. I tried appending one to the other: cat file1.bzip2 file2.bzip2 > out.bzip2, which seems to work (the file decompresses correctly), but when I use it as a Hadoop input file, I get errors about corrupted blocks.

What's the best way to merge 2 bzip2'ed files without decompressing them?
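For reference, plain concatenation does produce a valid bzip2 file: the format allows multiple compressed streams back to back, and bunzip2 decompresses all of them. A minimal sketch of the situation described above (file1/file2 are sample names made up for the demo; this assumes bzip2 is installed):

```shell
# Create two small sample files and compress each one:
printf 'from file1\n' > file1
printf 'from file2\n' > file2
bzip2 -f file1 file2                 # produces file1.bz2 and file2.bz2

# Plain byte-wise concatenation yields a multi-stream bzip2 file:
cat file1.bz2 file2.bz2 > out.bz2

# bunzip2 handles all streams, so the merged file is valid bzip2:
bunzip2 -c out.bz2
# from file1
# from file2
```

So the merged file itself is fine; the errors come from the reader (older Hadoop versions only handled the first stream), as the answers below explain.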


Handling concatenated bzip2 files is fixed on trunk, or should be: https://issues.apache.org/jira/browse/HADOOP-4012. There are examples of it working: https://issues.apache.org/jira/browse/MAPREDUCE-477?focusedCommentId=12871993&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12871993 Make sure you're running a recent version of Hadoop and you should be fine.


You could compress (well, store) them both into a new archive? It'd mean you'd have to do 3 decompressions to get the contents of the 2 archives, but it might work for your scenario.
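A rough sketch of this workaround. Since bzip2 itself can't hold multiple members, this sketch uses tar as the container (a substitution, not what the answer literally names); the already-compressed files are stored without recompression, at the cost of the extra unpack step when reading:

```shell
# Assumes file1.bz2 and file2.bz2 already exist (hypothetical names).
# Pack both compressed files into a plain, uncompressed tar archive:
tar -cf merged.tar file1.bz2 file2.bz2

# Reading a member back takes the extra step the answer mentions:
tar -xOf merged.tar file1.bz2 | bunzip2 -c
```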


This question is quite old, but I came across it just now, so for anyone else searching for this: here is what I found to join multiple bz2 files in HDFS into one without using the local filesystem. This works for any text files, too.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input foo \
-output foo_merged \
-mapper /bin/cat \
-reducer /bin/cat

This joins all the files in folder foo and writes a single file (part-00000) to folder foo_merged.

You can use wildcards for the input folder, or use as many -input options as you need to include all the files that are going to be joined.

The output file will be uncompressed. If you want the output also compressed in bz2, you should specify these two options:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input foo \
-output foo_merged \
-mapper /bin/cat \
-reducer /bin/cat

Replace BZip2Codec with whichever codec you want to use.

More info here.


You don't necessarily have to merge the files to use them as Hadoop input:

  • file_name* - a glob pattern;
  • file_name_1,file_name_2 - a comma-separated list of inputs.

And Hadoop will handle it.

Otherwise you can use Hadoop Streaming to merge them (with decompression).

You could produce the list of files from a pattern like:

FILES_LIST=`ls -m template*.bz2`

INPUT_FILE=`echo $FILES_LIST | tr -d ' '`

Note that the inner quotes are backquotes (command substitution). You can pass $INPUT_FILE as a variable to your script via the CLI.
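As a runnable illustration of those two lines (template1.bz2 and template2.bz2 are sample names created just for the demo): ls -m prints a comma-plus-space separated list, and tr strips the spaces, giving the comma-separated form that -input expects.

```shell
# Create sample files matching the pattern:
touch template1.bz2 template2.bz2

# Build the comma-separated list:
FILES_LIST=`ls -m template*.bz2`          # "template1.bz2, template2.bz2"
INPUT_FILE=`echo $FILES_LIST | tr -d ' '`

echo "$INPUT_FILE"
# template1.bz2,template2.bz2
```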

Also consider the CombineFileInputFormat class as the InputFormat.

