
How to merge 2 bzip2'ed files?

I want to merge 2 bzip2'ed files. I tried appending one to the other: cat file1.bzip2 file2.bzip2 > out.bzip2, which seems to work (the file decompresses correctly), but when I use it as a Hadoop input file, I get errors about corrupted blocks.

What's the best way to merge 2 bzip2'ed files without decompressing them?
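For reference, plain concatenation does produce a valid bzip2 file: the format allows multiple compressed streams back to back, and bunzip2 decompresses all of them. A minimal sketch of the situation described above (file1/file2 are sample names made up for the demo; this assumes bzip2 is installed):

```shell
# Create two small sample files and compress each one:
printf 'from file1\n' > file1
printf 'from file2\n' > file2
bzip2 -f file1 file2                 # produces file1.bz2 and file2.bz2

# Plain byte-wise concatenation yields a multi-stream bzip2 file:
cat file1.bz2 file2.bz2 > out.bz2

# bunzip2 handles all streams, so the merged file is valid bzip2:
bunzip2 -c out.bz2
# from file1
# from file2
```

So the merged file itself is fine; the errors come from the reader (older Hadoop versions only handled the first stream), as the answers below explain.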


Handling concatenated bzip2 files is fixed on trunk, or should be: https://issues.apache.org/jira/browse/HADOOP-4012. There are examples of it working: https://issues.apache.org/jira/browse/MAPREDUCE-477?focusedCommentId=12871993&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12871993 Make sure you're running a recent version of Hadoop and you should be fine.


You could compress (well, store) them both into a new archive? It'd mean you'd have to do 3 decompressions to get the contents of the 2 archives, but it might work for your scenario.
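A rough sketch of this workaround. Since bzip2 itself can't hold multiple members, this sketch uses tar as the container (a substitution, not what the answer literally names); the already-compressed files are stored without recompression, at the cost of the extra unpack step when reading:

```shell
# Assumes file1.bz2 and file2.bz2 already exist (hypothetical names).
# Pack both compressed files into a plain, uncompressed tar archive:
tar -cf merged.tar file1.bz2 file2.bz2

# Reading a member back takes the extra step the answer mentions:
tar -xOf merged.tar file1.bz2 | bunzip2 -c
```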


This question is quite old, but I came across it just now, so for anyone else searching for this: here is what I found to join multiple bz2 files in HDFS into one without using the local filesystem. This works for any text files, too.

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-input foo \
-output foo_merged \
-mapper /bin/cat \
-reducer /bin/cat

This joins all the files in folder foo and writes a single file (part-00000) to folder foo_merged.

You can use wildcards for the input folder, or use as many -input options as you need to include all the files that are going to be joined.

The output file will be uncompressed. If you want the output also compressed in bz2, you should specify these two options:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
-D mapred.output.compress=true \
-D mapred.output.compression.codec=org.apache.hadoop.io.compress.BZip2Codec \
-input foo \
-output foo_merged \
-mapper /bin/cat \
-reducer /bin/cat

Replace BZip2Codec with whichever codec you want to use.

More info here.


You don't necessarily have to merge the files to use them as Hadoop input:

  • file_name* - a glob pattern;
  • file_name_1,file_name_2 - a comma-separated list of inputs.

And Hadoop will handle it.

Otherwise you can use Hadoop Streaming to merge them (with decompression).

You could produce the list of files from a pattern like:

FILES_LIST=`ls -m template*.bz2`

INPUT_FILE=`echo $FILES_LIST | tr -d ' '`

Note that the inner quotes are backquotes (command substitution). You can pass $INPUT_FILE as a variable to your script via the CLI.
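As a runnable illustration of those two lines (template1.bz2 and template2.bz2 are sample names created just for the demo): ls -m prints a comma-plus-space separated list, and tr strips the spaces, giving the comma-separated form that -input expects.

```shell
# Create sample files matching the pattern:
touch template1.bz2 template2.bz2

# Build the comma-separated list:
FILES_LIST=`ls -m template*.bz2`          # "template1.bz2, template2.bz2"
INPUT_FILE=`echo $FILES_LIST | tr -d ' '`

echo "$INPUT_FILE"
# template1.bz2,template2.bz2
```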

Also consider the CombineFileInputFormat class as the InputFormat.

