Behind The Scene Of MapReduce Job
Posted by sranka on October 28, 2013
Recently I have spending most of my time on Big Data projects,using CDH 4.X. Understanding key component of hadoop infrastruture is very necessary, But the MapReduce (MR) is the most important for processing and aggregrating data. For getting the best of the performance, one needs to know the details of MapReduce job. After reading several white papers and few books, in my opinion below paragraph summarizes the MapReduce THE BEST !!!!!
All About Map Reduce
The execution of a MapReduce job is broken down into map tasks and reduce tasks. Subsequently, map task execution is divided into the phases: Read (reading map inputs), Map (map function pro-cessing), Collect (serializing to buffer and partitioning), Spill (sorting, combining, compressing, and writing map outputs to local disk), and Merge (merging sorted spill files). Reduce task execution is divided into the phases: Shuffle (transferring map outputs to reduce tasks, with decompression if needed), Merge (merging sorted map outputs), Reduce (reduce function processing), and Write (writing reduce outputs to the distributed file-system). Each phase represents an important part of
the job’s overall execution in Hadoop.
In the MapReduce model, computation is divided into a map function and a reduce function. The map function takes a key/value pair and produces one or more intermediate key/value pairs. The reduce function then takes these intermediate key/value pairs and merges all values corresponding to a single key. The map function can run independently on each key/value pair, exposing enormous amounts of parallelism. Similarly, the reduce function can run independently on each intermediate key, also exposing significant parallelism. In Hadoop, a centralized JobTracker service is responsible for splitting the input data into pieces for processing by independent map and reduce tasks, scheduling each task on a cluster node for execution, and recovering from failures by re-running tasks. On each node, a TaskTracker service runs MapReduce tasks and periodically contacts the JobTracker to report task completions and request new tasks. By default, when a new task is received, a new JVM instance will be spawned to execute it.
The about text is taken from
The Hadoop Distributed Filesystem: Balancing Portability and Performance by Jeffrey Shafer, Scott Rixner, and Alan L. Cox :Rice University
Technical Report : Hadoop Performance Models By Herodotos Herodotou
Hope This helps
Sunil S Ranka
“Superior BI is the antidote to Business Failure”