Parallel Copy to HDFS

What is the best and fastest way to copy files in parallel from an NFS mount into Hadoop? We have a mount with a huge number of files, and we need to copy them into HDFS.

Some options:

  1. Run copyFromLocal from multiple threads.
  2. Use distcp in an isolated way.
  3. Write a map-only job to do the copy.

Regards, JD


I think the key question is: what is on the source side of the NFS link? If it is a NAS, you are likely to be better off with several client machines running copyFromLocal at the same time (one each). Even high-performance NAS devices tend to struggle when a single client issues more than 5-10 simultaneous disk reads. I would benchmark the following configurations (all with copyFromLocal):

  • NAS -> 1 Client -> 5, 10, 50, 100 parallel processes
  • NAS -> 5 Clients -> 5, 10, 50, 100 parallel processes each
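For the multi-client case, each client needs its own disjoint slice of the files so that nothing is copied twice. One possible way to split the work, sketched here in Python, is to hash each path and take it modulo the number of clients. The helper names `shard_for` and `files_for_client` are hypothetical, and hashing is only one assumption about how to divide the files; splitting by top-level directory would work too.

```python
import zlib


def shard_for(path, num_clients):
    """Map a file path to a client index in [0, num_clients).

    CRC32 is used because it is stable across machines and runs,
    unlike Python's built-in hash() for strings.
    """
    return zlib.crc32(path.encode("utf-8")) % num_clients


def files_for_client(paths, client_index, num_clients):
    """Return only the paths that this client is responsible for."""
    return [p for p in paths if shard_for(p, num_clients) == client_index]
```

Each client would run the same listing of the mount, call `files_for_client` with its own index, and copy only what comes back.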

I would definitely avoid M/R, as the process startup cost is too high. Even distcp won't do as well, because you won't be able to control how heavily the source NAS is hit, and the NAS will be your bottleneck.
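To keep the load on the NAS under control, the number of simultaneous copies on a client can be capped explicitly. The following is a minimal sketch in Python, assuming the `hdfs` command is on the client's PATH. The function names and the `runner` parameter are hypothetical and exist only so the copy step can be swapped out; `workers` is the knob to vary between 5, 10, 50 and 100 when benchmarking.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_copy_command(local_path, hdfs_dir):
    """Build the CLI call that copies one local file into an HDFS directory."""
    return ["hdfs", "dfs", "-copyFromLocal", str(local_path), hdfs_dir]


def copy_one(local_path, hdfs_dir, runner=subprocess.run):
    """Copy a single file and report whether the command exited with 0."""
    result = runner(build_copy_command(local_path, hdfs_dir))
    return local_path, result.returncode == 0


def parallel_copy(files, hdfs_dir, workers=5, runner=subprocess.run):
    """Copy files with at most `workers` copyFromLocal processes at a time.

    Returns the list of paths whose copy failed, so they can be retried.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: copy_one(p, hdfs_dir, runner), files))
    return [path for path, ok in results if not ok]
```

Threads are enough here because each one only waits on an external `hdfs` process. Every invocation still starts its own JVM, so for very large numbers of small files that per-file cost is worth measuring as part of the benchmark.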
