Friday, October 14, 2011

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

This paper presents DryadLINQ, a system that facilitates distributed computing by translating a high-level program (written in C#, VB, or F# using .NET and LINQ constructs) into a distributed execution plan that is passed to Dryad (see earlier blog post on Dryad). This is similar to Pig and Hive, which translate procedural and declarative programs, respectively, into MapReduce jobs, though Dryad (and therefore DryadLINQ) can take advantage of non-MR-style execution flows. This detail seems particularly relevant and useful, since MR is not ideal for all types of programs. As we saw in the Dryad paper, creating and optimizing execution DAGs in Dryad by hand is complex, so systems like DryadLINQ and SCOPE that do this for the programmer seem promising.

DryadLINQ converts LINQ expressions into execution plan graphs (EPGs), which are DAGs similar to database query plans. Static optimizations (greedy heuristics) are performed on the EPG, as are dynamic optimizations. For example, DryadLINQ takes advantage of hooks in Dryad to change the execution graph as information becomes available at run time, allowing dynamic optimizations that reduce I/O overhead, network transfer, etc. For Microsoft and companies that use .NET, DryadLINQ makes a lot of sense, since it integrates well with .NET constructs (e.g. users can define functions that make full use of .NET and have them applied to the data during the distributed execution). Dryad was engineered for batch applications on large datasets, so, as with MR, it's not appropriate for low-latency lookups. Likewise, it's better suited to streaming computations than to algorithms that require a lot of random accesses. Despite these trade-offs, Dryad's support for more general, arbitrary execution flows seems to make DryadLINQ more flexible than Pig or Hive, which are limited to MR.
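To make the idea of turning a query into a plan a bit more concrete, here is a toy sketch in Python. It is not DryadLINQ's code or API: the names Op, Stage, build_plan and run_plan are made up for illustration, and the only "optimization" shown is one assumed greedy heuristic, namely fusing adjacent per-record operators into a single stage so that a new stage (and a shuffle) starts only where the data has to be repartitioned by key. The plan here is a simple chain, which is a special case of a DAG.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical operator kinds for this toy; not DryadLINQ's real operator set.
PER_RECORD = {"select", "where"}   # can run record-at-a-time within a partition
REPARTITION = {"group_by"}         # needs records with equal keys co-located


@dataclass
class Op:
    kind: str
    fn: Callable


@dataclass
class Stage:
    ops: List[Op] = field(default_factory=list)
    needs_shuffle: bool = False  # True if input must be repartitioned by key


def build_plan(ops):
    """Greedily fuse per-record operators into the preceding stage."""
    stages = []
    for op in ops:
        if op.kind in REPARTITION:
            # A repartitioning operator always starts a new stage.
            stages.append(Stage([op], needs_shuffle=True))
        elif op.kind in PER_RECORD:
            if stages:
                stages[-1].ops.append(op)   # pipeline with the previous stage
            else:
                stages.append(Stage([op]))
        else:
            raise ValueError(f"unknown operator kind: {op.kind}")
    return stages


def run_stage_on_partition(stage, records):
    """Apply every operator of a stage to one partition, in order."""
    for op in stage.ops:
        if op.kind == "select":
            records = [op.fn(r) for r in records]
        elif op.kind == "where":
            records = [r for r in records if op.fn(r)]
        elif op.kind == "group_by":
            groups = {}
            for r in records:
                groups.setdefault(op.fn(r), []).append(r)
            records = list(groups.items())
    return records


def run_plan(plan, partitions):
    """Simulate the plan locally; each inner list stands in for one machine."""
    n = len(partitions)
    for stage in plan:
        if stage.needs_shuffle:
            key_fn = stage.ops[0].fn
            shuffled = [[] for _ in range(n)]
            for part in partitions:
                for r in part:
                    # Hash-partition so equal keys land in the same partition.
                    shuffled[hash(key_fn(r)) % n].append(r)
            partitions = shuffled
        partitions = [run_stage_on_partition(stage, p) for p in partitions]
    return partitions


# A word-count style query written as a chain of operators.
query = [
    Op("where", lambda w: w != ""),
    Op("select", str.lower),
    Op("group_by", lambda w: w),
    Op("select", lambda kv: (kv[0], len(kv[1]))),
]
plan = build_plan(query)
partitions = [["Dryad", "", "linq"], ["dryad", "LINQ", "dag"]]
counts = dict(r for part in run_plan(plan, partitions) for r in part)
print([[op.kind for op in stage.ops] for stage in plan])
print(counts)
```

The four operators collapse into two stages: the filter and the lowercase projection run together on the original partitions, and the grouping plus the final count run together after a single shuffle. A real system would pick among many more rewrites and, as described above, could also adjust the graph while the job is running, which this sketch does not attempt.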

I think the idea of flexible data flow graphs for distributed computations is going to be very influential over the next decade, and this necessitates a certain level of complexity that will open the way for a middle layer like DryadLINQ (in between the application and the distributed execution engine). However, I'm not sure how much influence DryadLINQ itself will have, since it seems that SCOPE is being used more widely at Microsoft now -- and the fact that Dryad is closed-source will likely limit its popularity.
