Amazon Elastic MapReduce: Does input fragment size matter?

Suppose I need to process 20 GB of input using 10 instances. Does it make a difference whether I have 10 input files of 2 GB each versus 4 input files of 5 GB each? In the latter case, can Amazon Elastic MapReduce automatically distribute the load of the 4 input files across all 10 instances? (I'm using the Streaming method, as my mapper is written in Ruby.)


The only thing that matters is whether the files are splittable.

If the files are uncompressed plain text, or compressed with LZO, then Hadoop will sort out the splitting.

5 x 2 GB files will result in ~80 splits, and hence ~80 map tasks (10 GB / 128 MB (the EMR block size) = 80).

10 x 1 GB files will again result in ~80 splits and hence, again, ~80 map tasks.
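To make the arithmetic concrete, here is a rough back-of-the-envelope sketch in Ruby (the question's mapper language), assuming splittable input and EMR's 128 MB block size:

```ruby
# Rough split estimate for splittable input, assuming a 128 MB block size.
BLOCK_SIZE_MB = 128

def estimated_splits(file_sizes_mb)
  # Each file is split independently; a file smaller than one block
  # still yields at least one split (hence the ceil).
  file_sizes_mb.sum { |size| (size.to_f / BLOCK_SIZE_MB).ceil }
end

puts estimated_splits([2048] * 5)   # 5 x 2 GB files          => 80
puts estimated_splits([1024] * 10)  # 10 x 1 GB files         => 80
puts estimated_splits([2048] * 10)  # question's 10 x 2 GB    => 160
puts estimated_splits([5120] * 4)   # question's 4 x 5 GB     => 160
```

Note that both of the question's scenarios (10 x 2 GB and 4 x 5 GB) come out to the same split count, since the total input is the same.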

If the files are gzip or bzip2 compressed, then Hadoop (at least the version running on EMR) will not split the files.

5 x 2 GB files will result in only 5 splits (and hence only 5 map tasks).

10 x 1 GB files will result in only 10 splits (and hence only 10 map tasks).
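Since the question mentions a Ruby mapper under Streaming: the split count only controls how many map tasks run; each task simply pipes its split's lines through the mapper's stdin. A minimal sketch of that contract (the word-splitting logic is an illustrative placeholder, not the asker's actual mapper):

```ruby
#!/usr/bin/env ruby
# Minimal Hadoop Streaming mapper: read raw input lines on stdin and
# emit tab-separated key/value pairs on stdout. The word-count emission
# below is only a placeholder for the real per-record logic.
STDIN.each_line do |line|
  line.split.each { |word| puts "#{word}\t1" }
end
```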

Mat
