Friday, October 14, 2011

FlumeJava: Easy Efficient Data-Parallel Pipelines

FlumeJava is a Java library that simplifies the creation of MapReduce pipelines by providing a parallel-collections abstraction whose operations are executed in parallel as MR jobs. This is very similar to DryadLINQ (see the blog post below) in that it inserts a layer between the programmer and the actual execution dataflow graph (in this case, a pipeline of MR jobs). The main idea is that developers can process data and develop algorithms over large data sets without worrying about how the data is partitioned and processed in parallel; instead, a small set of primitive operations (e.g. parallelDo) is provided, developers plug in their own functions to modify/transform the data, and FlumeJava automatically parallelizes everything via MR.
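To make the programming model concrete, here is a minimal word-count-style sketch in the spirit of the example in the paper. The library is Google-internal, so the package, the unqualified helpers (readTextFileCollection, collectionOf, strings), and the exact class/method names (PCollection, PTable, DoFn, EmitFn, count, FlumeJava.run) are my reconstruction and may not match the real API exactly.

```java
// Hypothetical sketch; assumes the FlumeJava classes and helper functions below
// are available via (static) imports -- they are not a public library.
public class WordCount {
  public static void main(String[] args) {
    // Read the input as a parallel collection of lines.
    PCollection<String> lines = readTextFileCollection("/gfs/data/shakes/hamlet.txt");

    // parallelDo applies a user-defined DoFn to each element, in parallel.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      public void process(String line, EmitFn<String> emitFn) {
        for (String word : line.split("\\W+")) {
          emitFn.emit(word);
        }
      }
    }, collectionOf(strings()));

    // count() is a derived operation built from the primitives
    // (parallelDo, groupByKey, combineValues).
    PTable<String, Integer> wordCounts = words.count();

    // Nothing has executed yet; run() optimizes the deferred plan into MR jobs and runs them.
    FlumeJava.run();
  }
}
```

The key point is that the pipeline is built lazily as a deferred execution plan; only the call to run() triggers optimization and execution on MapReduce.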

FlumeJava takes the ParallelDo, GroupByKey, CombineValues, and Flatten operations in the execution plan and transforms them into a series of MapShuffleCombineReduce (MSCR) operations, a generalization of MapReduce. The optimizer performs several passes over the execution plan in order to reduce the number of MSCRs required.
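As an illustration of what one of those optimizer passes does, consider producer-consumer fusion of ParallelDo operations: two chained element-wise steps can be collapsed into a single DoFn so the intermediate collection is never materialized. The sketch below reuses the same hypothetical class and method names as the word-count example above; it is conceptual, not the real library.

```java
// Conceptual sketch of ParallelDo producer-consumer fusion (one optimizer pass
// described in the paper). All identifiers are assumptions mirroring the sketch above.
public class FusionExample {
  public static void main(String[] args) {
    PCollection<String> lines = readTextFileCollection("input.txt");

    // Two chained element-wise operations, written as separate parallelDo calls:
    PCollection<String> trimmed = lines.parallelDo(new DoFn<String, String>() {
      public void process(String s, EmitFn<String> emit) { emit.emit(s.trim()); }
    }, collectionOf(strings()));

    PCollection<String> lowered = trimmed.parallelDo(new DoFn<String, String>() {
      public void process(String s, EmitFn<String> emit) { emit.emit(s.toLowerCase()); }
    }, collectionOf(strings()));

    // At run() time the optimizer fuses the two DoFns into one
    // (conceptually s -> emit(s.trim().toLowerCase())), so `trimmed` is never
    // materialized and a single map pass suffices; the fused operations are then
    // grouped into MSCR stages.
    FlumeJava.run();
  }
}
```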

The trade-offs inherent in FlumeJava include the lack of UDF analysis (execution plans are determined independently of the user-defined functions passed to parallelDo) and the inability to optimize user code. FlumeJava makes it much simpler to write large programs that would previously have required a huge amount of developer work to tune the MR pipelines. In the hands of inexperienced users, this can lead to over-use of certain operations where they are unnecessary, because the cost in terms of MR jobs is not made explicit to the programmer (an inherent trade-off of the abstraction).

As I mentioned in the DryadLINQ write-up, it looks like interfaces between MR/execution engines and higher-level programs are becoming increasingly important, since they allow programmers to trivially parallelize their applications across huge data sets. I'm not certain that FlumeJava itself will be particularly influential, since it's proprietary to Google, but I could certainly see a FlumeJava clone on Hadoop becoming popular over the next decade. That said, Spark seems to do everything that FlumeJava + MapReduce does... and it's in Scala, which is more pleasant to write than Java.
