Resilient Distributed Datasets (RDDs), implemented in Spark, are a fault-tolerant distributed memory abstraction that enables in-memory computation and efficient data re-use on large clusters. An RDD is a read-only, partitioned collection of records, defined either by operations on data in stable storage or by transformations of other RDDs. RDDs provide coarse-grained transformations, where an operation (e.g. map, filter, join) is applied to many data items at once, producing a new RDD. Each RDD carries lineage information: the history of the operations on previous RDDs (and their partitions) used to create it. RDDs therefore don't need to be materialized at all times, since their contents can be recomputed from this lineage whenever any of the data needs to be retrieved or used in a computation. While writes are coarse-grained, fine-grained reads are still available.
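To make this concrete, here is a minimal Spark sketch in Scala (my own illustration, not from the paper); the file path and log layout are hypothetical. Each transformation only records lineage and returns a new RDD, and nothing runs until the count action at the end.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // Each step defines a new RDD in terms of its parent; only lineage is recorded here.
    val lines  = sc.textFile("hdfs://namenode:9000/logs/access.log")  // hypothetical path; RDD over data in storage
    val errors = lines.filter(_.contains("ERROR"))                    // coarse-grained transformation
    val codes  = errors.map(_.split(" ")(1))                          // another transformation (assumed log format)
    codes.persist()                                                   // ask Spark to keep this RDD in memory

    // Nothing above has executed yet; this action forces evaluation along the lineage chain.
    println(codes.count())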
Is the problem real?
The problem here, efficient data re-use and in-memory processing, is definitely real, and we have seen many other attempts at solving it (Piccolo, Pregel, Nectar & DryadLINQ, etc.).
What is the solution's main idea (nugget)?
The main nugget of this paper is that tracking lineage for in-memory immutable collections provides a powerful programming paradigm for many problems that require distributed memory and efficient re-use of data, while keeping fault tolerance nearly free: a lost partition can simply be recomputed from its lineage.
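Continuing the sketch above (again, just an illustration): Spark exposes this lineage directly, and a lost in-memory partition can be rebuilt by replaying the recorded chain of transformations on its parent partitions.

    // The lineage of `codes` from the earlier sketch: textFile -> filter -> map.
    // If a cached partition of `codes` is lost, Spark re-derives just that partition from this chain.
    println(codes.toDebugString)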
Why is solution different from previous work?
RDDs differ from previous work in that they are general-purpose (not specific to, say, graph algorithms like Pregel), they are immutable (unlike Piccolo, which provides a shared-memory abstraction with mutable state), and they allow re-use of cached data between jobs (unlike, say, Dryad and MapReduce programs).
Does the paper (or do you) identify any
fundamental/hard trade-offs?
Fundamentally, the immutability of RDDs is a trade-off: modifications to data must be expressed collection-wide, by transforming an entire RDD into a new one. This is less of a problem than it first seems, since lazy evaluation means these transformations aren't computed until the data is actually needed (see the sketch below). Another trade-off of Spark is its Scala interface, which, while convenient for its interoperability with Java, doesn't enjoy the same level of familiarity among programmers as other languages do.
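As a small illustration of the immutability point (my own sketch, reusing the sc defined earlier; the data and names are made up): there is no in-place update, so changing a single record is written as a transformation over the whole collection, and it costs nothing until an action forces evaluation.

    // A small RDD of (product, price) pairs.
    val prices = sc.parallelize(Seq(("widget-42", 10.0), ("gadget-7", 5.0)))

    // "Updating" one record means transforming the whole immutable collection into a new RDD.
    val adjusted = prices.map {
      case ("widget-42", p) => ("widget-42", p * 1.1)  // the one record we want to change
      case other            => other                   // everything else passes through unchanged
    }

    // `adjusted` exists only as lineage until an action such as collect() forces evaluation.
    println(adjusted.collect().mkString(", "))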
Do you think the work will be influential in 10 years?
I think this work has the
potential to be influential over the next decade: it provides a relatively
simple, elegant, and versatile solution to working with large datasets
effectively. Furthermore, with the growing trend of keeping datasets in memory
(RAM is getting cheaper), I think Spark is on the right track. We’ll see if
Spark itself catches on, but even if it doesn’t, it seems to be indicative of
the direction that we’re heading (in a post-MapReduce world).