According to the paper, processing large amounts of data used to be difficult: the computation had to be spread across many machines so that it could run in parallel and finish in a reasonable amount of time, but deciding how to partition the data among the machines and what work each one should perform was hard. To solve this problem, the technique of mapping and reducing was introduced: the mapper links each record with a key, and the reducer then uses that key to combine all records that share it. This technique is simple, yet it enables automatic parallelization and distribution of large-scale computations.
MapReduce is a library in which the user defines both the map function and the reduce function, and the system takes care of running them. The map function takes an input pair and produces a set of intermediate key/value pairs; the library groups together all intermediate values associated with the same key and passes them to the reduce function, which accepts an intermediate key and its set of values and merges them to form a possibly smaller output. Example: counting URL access frequency. The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all the values with the same URL and outputs <URL, Total Count>. Other areas where MapReduce works effectively include distributed grep, reverse web-link graphs, distributed sort, etc.
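The URL access frequency example above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not the Google library: the log lines, `map_fn`, and `reduce_fn` are hypothetical names, and the grouping step that the real library performs across machines is done here with a simple dictionary.

```python
from collections import defaultdict

# Hypothetical request log lines; in practice these would come from
# web server logs.
logs = [
    "GET /index.html",
    "GET /about.html",
    "GET /index.html",
]

def map_fn(line):
    """Map: emit <URL, 1> for each page request."""
    _, url = line.split()
    yield (url, 1)

def reduce_fn(url, counts):
    """Reduce: add together all values with the same URL."""
    return (url, sum(counts))

# Group intermediate pairs by key, as the library would do between
# the map and reduce phases.
intermediate = defaultdict(list)
for line in logs:
    for key, value in map_fn(line):
        intermediate[key].append(value)

results = dict(reduce_fn(k, v) for k, v in intermediate.items())
print(results)  # {'/index.html': 2, '/about.html': 1}
```

The same structure works for the other examples mentioned: only the bodies of `map_fn` and `reduce_fn` change.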
At Google, large clusters of commodity PCs are connected with switched Ethernet. The machines are typically dual-processor boxes running Linux with up to 4GB of memory per machine. The library splits the input file into M pieces and starts many copies of the program on the cluster. One copy of the program is special: the master, which assigns work to the rest of the copies. The intermediate key/value pairs produced by the map function are buffered in memory and partitioned into R regions by the partitioning function. When a reduce worker is notified by the master about the locations of this data, it uses remote procedure calls to read it and then sorts it by the intermediate keys. The output of the reduce function is appended to a final output file for that reduce partition. When all tasks are complete, the MapReduce call returns to the user code.
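The partitioning step described above can be sketched as follows. The paper's default partitioning function is hash(key) mod R; this sketch uses `zlib.crc32` as the hash so the result is stable across runs, and the choice of R = 4 and the sample pairs are arbitrary illustrations.

```python
import zlib

R = 4  # number of reduce partitions (arbitrary for illustration)

def partition(key: str, R: int) -> int:
    """Default partitioning function: hash(key) mod R."""
    return zlib.crc32(key.encode()) % R

# Intermediate pairs buffered by a map worker get bucketed into R
# regions, one per reduce worker.
pairs = [("apple", 1), ("banana", 1), ("apple", 1), ("cherry", 1)]
regions = {r: [] for r in range(R)}
for key, value in pairs:
    regions[partition(key, R)].append((key, value))
```

Because the partition depends only on the key, every pair with the same key lands in the same region, which is what lets a single reduce worker see all values for a given key.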
Since a cluster is made up of thousands of machines, machine failure is common. To detect failures, the master pings every worker periodically; if a worker does not respond, it is marked as failed. Completed map tasks on a failed worker are re-executed, because their output is stored on that machine's local disk, while completed reduce tasks do not need to be re-executed, because their output is stored in a global file system. If the master fails, a new copy can be started from the last checkpointed state. Also, because MapReduce runs on a distributed system, debugging can be difficult; to ease this, the library supports sequential execution of the map tasks on a local machine.
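The re-execution rule above can be sketched as a small decision function. The task records and worker names here are illustrative, not the paper's data structures; the point is the asymmetry between map and reduce tasks when a worker fails.

```python
def tasks_to_reexecute(tasks, failed_worker):
    """Decide which tasks must be rescheduled when a worker fails.

    Completed map tasks are redone because their output lived on the
    failed worker's local disk; completed reduce tasks are not, because
    their output is already in the global file system.
    """
    redo = []
    for t in tasks:
        if t["worker"] != failed_worker:
            continue
        if t["type"] == "map":
            # Both in-progress and completed map output is lost.
            redo.append(t["id"])
        elif t["state"] == "in_progress":
            # Only in-progress reduce tasks need rescheduling.
            redo.append(t["id"])
    return redo

tasks = [
    {"id": "m1", "type": "map", "state": "completed", "worker": "w1"},
    {"id": "m2", "type": "map", "state": "completed", "worker": "w2"},
    {"id": "r1", "type": "reduce", "state": "completed", "worker": "w1"},
    {"id": "r2", "type": "reduce", "state": "in_progress", "worker": "w1"},
]
print(tasks_to_reexecute(tasks, "w1"))  # ['m1', 'r2']
```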
The MapReduce library has been successful at Google due to its ease of use, its ability to express a wide variety of problems, and its ability to scale to clusters of thousands of machines. It hides the details of parallelization, fault tolerance, locality optimization, and load balancing. Google has used MapReduce for web search, sorting, data mining, machine learning, and many other systems. The implementation makes efficient use of machine resources and is well suited to large computational problems. The work has taught the importance of restricting the programming model to make computations easy to parallelize and distribute, of optimizing to reduce network bandwidth, and of using redundant execution to handle machine failures and data loss.