Map and Reduce with large datasets: how does it work in practice?
I would be thankful for advice:
http://en.wikipedia.org/wiki/MapReduce states: "...a large server farm can use MapReduce to sort a petabyte of data in only a few hours..." and "...The master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes..."
I completely do NOT understand how this works in practice. Say I have a SAN (storage) with 1 petabyte of data. How can I distribute that amount of data efficiently through the "master" to the slaves? That is something I cannot understand. Given a 10 Gbit connection from the SAN to the master, and 1 Gbit from the master to each slave, I can "spread" at most 10 Gbit at a time. How can I process petabytes within several hours if I first have to transfer the data to the reducer/worker nodes?
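(Back-of-the-envelope, using the numbers above: 1 PB is about 10^15 bytes = 8 × 10^15 bits; pushed through a single 10 Gbit/s link that is 8 × 10^15 / 10^10 = 8 × 10^5 seconds, i.e. roughly 220 hours of raw transfer time before any processing even starts.)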
Thanks very much! Jens
Actually, in a full-blown MapReduce framework such as Hadoop, the data storage itself is distributed. Hadoop, for example, has HDFS, a distributed file system that provides both redundancy and high throughput. The file system nodes can double as computing nodes, or they can be dedicated storage nodes, depending on how the framework has been deployed.
Usually, when computing times are quoted in this context, it is assumed that the input data already resides in the cluster's distributed storage. The master node merely feeds the computing nodes the data ranges to process, not the data itself.
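As a minimal sketch of what "data ranges, not data" means in practice (using Hadoop's Java FileSystem API; the path /data/input/part-00000 is just a hypothetical example), a client can ask the NameNode which DataNodes physically hold each block of a file, and the scheduler uses exactly this information to run map tasks on the nodes that already store the data:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocality {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster address from core-site.xml / hdfs-site.xml on the classpath.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical input file that already lives in HDFS.
            FileStatus status = fs.getFileStatus(new Path("/data/input/part-00000"));

            // Ask the NameNode which hosts store each block of the file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }

Scheduling map tasks where their blocks already sit is what turns the aggregate disk bandwidth of hundreds or thousands of nodes into the effective I/O rate, instead of the bandwidth of any single link through the master.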
I believe it's because the master node does the management, not the data transfer.
The data is stored on a distributed file system and brought in from several nodes simultaneously. (There's no reason for the data to go through the master node.)
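To make that concrete, what the master actually ships to a worker is a tiny split descriptor rather than the bytes themselves. A sketch (plain Java; the field layout is a hypothetical simplification loosely modeled on Hadoop's FileSplit):

    // Sketch of the metadata a master sends a worker for one map task.
    // The worker opens the file on the distributed file system itself and
    // reads only the range [offset, offset + length), ideally from a local disk.
    public class SplitDescriptor {
        final String path;     // e.g. "hdfs:///data/input/part-00042"
        final long offset;     // start of this worker's slice, in bytes
        final long length;     // typically one HDFS block, e.g. 128 MB
        final String[] hosts;  // nodes that already hold this block locally

        SplitDescriptor(String path, long offset, long length, String[] hosts) {
            this.path = path;
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

A descriptor like this is a few hundred bytes, so even a petabyte of input yields only on the order of millions of descriptors for the master to hand out; the heavy reads never touch it.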