Hadoop Data Replication Strategy
Posted by sranka on October 17, 2013
Replication and fault tolerance are built-in features of Hadoop, and I was always curious to know how blocks are replicated. I found this information while reading “Hadoop: The Definitive Guide”, 3rd Edition, in Chapter 3, “The Hadoop Distributed Filesystem”, and thought it would be interesting to share.
- How does the namenode choose which datanodes to store replicas on?
Hadoop’s default strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random. The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
The placement description above is taken from Chapter 3 of “Hadoop: The Definitive Guide”, 3rd Edition.
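To make the placement rules quoted above a little more concrete, here is a minimal Python sketch of them. It is not Hadoop's actual code: the function name choose_replica_nodes, the rack-to-nodes dictionary, and the fallback behaviour when a rack has no free node are all my own assumptions for illustration. It also ignores the "too full or too busy" check, and it approximates "avoid placing too many replicas on the same rack" by preferring the racks that currently hold the fewest replicas.

```python
import random
from collections import Counter


def choose_replica_nodes(topology, replication=3, client_node=None, rng=None):
    """Pick datanodes for one block, following the rules described above.

    topology: dict mapping a rack name to a list of node names (hypothetical
    layout for this sketch, not a Hadoop data structure).
    """
    rng = rng or random.Random()
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    if replication > len(rack_of):
        raise ValueError("not enough nodes for the requested replication")
    chosen = []

    def pick(candidates):
        # Choose a random unused node from the candidates; if none is left
        # (for example a single-rack cluster), fall back to any unused node.
        pool = [n for n in candidates if n not in chosen]
        if not pool:
            pool = [n for n in rack_of if n not in chosen]
        chosen.append(rng.choice(pool))

    # 1st replica: the client's own node, or a random node if the client
    # is outside the cluster.
    if replication >= 1:
        if client_node in rack_of:
            chosen.append(client_node)
        else:
            pick(rack_of)

    # 2nd replica: a random node on a different rack from the first.
    if replication >= 2:
        first_rack = rack_of[chosen[0]]
        pick(n for n in rack_of if rack_of[n] != first_rack)

    # 3rd replica: same rack as the second, but a different node.
    if replication >= 3:
        second_rack = rack_of[chosen[1]]
        pick(topology[second_rack])

    # Further replicas: random nodes, preferring the least-loaded racks
    # (an assumed simplification of "avoid too many replicas per rack").
    while len(chosen) < replication:
        per_rack = Counter(rack_of[n] for n in chosen)
        free = [n for n in rack_of if n not in chosen]
        fewest = min(per_rack[rack_of[n]] for n in free)
        pick(n for n in free if per_rack[rack_of[n]] == fewest)

    return chosen


if __name__ == "__main__":
    cluster = {
        "rack1": ["a1", "a2", "a3"],
        "rack2": ["b1", "b2", "b3"],
        "rack3": ["c1", "c2"],
    }
    print(choose_replica_nodes(cluster, replication=3, client_node="a1"))
```

With a client running on a1, the first entry is always a1, and the second and third entries are two different nodes that share a rack other than rack1. Which rack and which nodes depends on the random choice.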
Hope this helps.
Sunil S Ranka
“Superior BI is the antidote to Business Failure”