What is the easiest way to combine small HDFS files?

I'm collecting logs with Flume into HDFS. For the test case I end up with small files (~300 kB), because the log collection process is scaled for real production volume.

Is there any easy way to combine these small files into larger ones which are closer to the HDFS block size (64MB)?


GNU coreutils split can do the job.

If the source data consist of lines - in my case they do - and one line is around 84 bytes, then a 64 MB HDFS block can hold around 800,000 lines:

hadoop dfs -cat /sourcedir/* | split --lines=800000 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/

or with the --line-bytes option (67108864 bytes = 64 MB), which packs whole lines into output files of at most one block each:

hadoop dfs -cat /sourcedir/* | split --line-bytes=67108864 - joined_
hadoop dfs -copyFromLocal ./joined_* /destdir/


My current solution is to write a MapReduce job that effectively does nothing, but has a limited number of reducers. Each reducer produces one output file, so together they concatenate the small inputs into a few large files. You can prepend the name of the original file to each line to show where it came from.
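
For illustration, a minimal sketch of that kind of pass-through job, written against the Hadoop 2.x mapreduce API, is below. The class names, the tab separator, and the reducer count are illustrative choices of mine, not part of the original setup, and the output ends up sorted as a side effect of the shuffle.

// Minimal sketch of a "do nothing" merge job: the mapper tags each line with
// its source file name, the reducers just write the lines back out. With N
// reducers you get N large output files instead of many small ones.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFileMerger {

    public static class TagWithFileMapper
            extends Mapper<Object, Text, Text, NullWritable> {
        private String fileName;

        @Override
        protected void setup(Context context) {
            // Name of the small file this split came from.
            fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
        }

        @Override
        protected void map(Object key, Text line, Context context)
                throws IOException, InterruptedException {
            // Prepend the source file name so each merged line keeps its origin.
            context.write(new Text(fileName + "\t" + line.toString()),
                          NullWritable.get());
        }
    }

    public static class PassThroughReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            // Write the line once per occurrence; duplicate lines are preserved.
            for (NullWritable ignored : values) {
                context.write(key, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "merge small files");
        job.setJarByClass(SmallFileMerger.class);
        job.setMapperClass(TagWithFileMapper.class);
        job.setReducerClass(PassThroughReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        job.setNumReduceTasks(2);  // few reducers -> few large output files
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Run it like any other Hadoop job, e.g. hadoop jar merger.jar SmallFileMerger /sourcedir /destdir (the jar name is hypothetical); the part-r-* files in the output directory are the merged result.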

I'm still interested in hearing if there is a standard or proven best way of doing this that I am not aware of.


You should take a look at File Crusher, open-sourced by media6degrees. It might be a little outdated, but you can download the source, make your own changes, and/or contribute. The JAR and source are at: http://www.jointhegrid.com/hadoop_filecrush/index.jsp

This is essentially a map-reduce technique for merging small files.
