Wednesday, September 28, 2011

"MapReduce: Simplified Data Processing on Large Clusters"

MapReduce provides a programming model that simplifies the distributed processing and generation of large data sets. The main idea is to split a program into a map phase (a function that processes key/value pairs to generate a set of intermediate key/value pairs) and a reduce phase (a function that merges all intermediate values associated with the same key). This abstraction lets programmers focus on the task they are trying to solve without dealing with the complexities of distributed computation: launching programs on many nodes, keeping track of nodes, handling failures, and so on are all managed by the framework itself. The paper shows that many common tasks can be parallelized in this fashion, including counts, grep, sorts, indexing, and more. While I/O is a limiting factor for many MapReduce tasks (loading huge data sets takes a long time!), the master node is smart about assigning workers to data that is local (or nearby) to reduce transfer time. Fault tolerance is achieved by re-executing the tasks of failed workers, and stragglers (workers that take too long to finish their map tasks, holding up the reduce phase for everyone) are handled by scheduling backup copies of their tasks on other workers.
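To make the model concrete, here is a minimal sketch of the paper's canonical word-count example in plain Python. This is not Google's actual API; the `run_mapreduce` driver just simulates in-process the grouping ("shuffle") step that the real framework performs across many machines.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all counts emitted for this word.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Simulated framework: in the real system this grouping step
    # is distributed across many worker machines.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    results = []
    for k, vs in sorted(intermediate.items()):
        results.extend(reduce_fn(k, vs))
    return results

if __name__ == "__main__":
    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog")]
    print(run_mapreduce(docs, map_fn, reduce_fn))
    # [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

Notice that the programmer writes only the two pure functions; everything between them (partitioning, scheduling, retries) belongs to the framework.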

While MapReduce is not appropriate for all algorithms, the solution grew out of the observation that many parallel tasks (at least at Google) could be easily converted into map/reduce-style programs. Of course, there are tradeoffs in this approach, since some algorithms cannot be expressed well in this functional style. Furthermore, the ease of programming in MapReduce might lead programmers to prefer the model even when it is not well suited to the problem (a tradeoff between ease of programming and the optimal solution for a given problem).

The experience with MapReduce at Google and the popularity of MapReduce-inspired implementations like Hadoop testify to the impact that this paper (and model) had on computing (especially cloud/distributed computing). I think that this model will certainly continue to influence research over the next decade, given the increasing amounts of data that need to be processed and the ease with which MapReduce accomplishes this task. 
