Resilient Distributed Datasets (RDDs), implemented in Spark, are a fault-tolerant distributed memory abstraction that enables in-memory computation and efficient data re-use on large clusters. An RDD is a read-only, partitioned collection of records, defined either by operations on data in stable storage or by transformations of other RDDs. RDDs provide coarse-grained transformations, where an operation (e.g. map, filter, join) is applied to many data items at once, producing a new RDD. Each RDD carries lineage information: the history of the operations on previous RDDs (and their partitions) used to create it. RDDs therefore don't need to be materialized at all times, since their contents can be recomputed from this lineage whenever any of the data needs to be retrieved or used in a computation. While writes are coarse-grained, fine-grained reads are still available.
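To make this concrete, here is a minimal Spark sketch in Scala (my own illustration, not from the paper); the file path and log layout are hypothetical. Each transformation only records lineage and returns a new RDD, and nothing runs until the count action at the end.

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // Each step defines a new RDD in terms of its parent; only lineage is recorded here.
    val lines  = sc.textFile("hdfs://namenode:9000/logs/access.log")  // hypothetical path; RDD over data in storage
    val errors = lines.filter(_.contains("ERROR"))                    // coarse-grained transformation
    val codes  = errors.map(_.split(" ")(1))                          // another transformation (assumed log format)
    codes.persist()                                                   // ask Spark to keep this RDD in memory

    // Nothing above has executed yet; this action forces evaluation along the lineage chain.
    println(codes.count())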
Is the problem real?
The problem here, efficient data re-use and in-memory processing, is definitely real, and we have seen many other attempts at solving it (Piccolo, Pregel, Nectar & DryadLINQ, etc.).
What is the solution's main idea (nugget)?
The main nugget of this paper is that tracking lineage for in-memory immutable collections provides a powerful programming paradigm for many problems that require distributed memory and efficient re-use of data, while keeping fault tolerance nearly free: a lost partition can simply be recomputed from its lineage.
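Continuing the sketch above (again, just an illustration): Spark exposes this lineage directly, and a lost in-memory partition can be rebuilt by replaying the recorded chain of transformations on its parent partitions.

    // The lineage of `codes` from the earlier sketch: textFile -> filter -> map.
    // If a cached partition of `codes` is lost, Spark re-derives just that partition from this chain.
    println(codes.toDebugString)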
Why is solution different from previous work?
RDDs differ from previous work in that they are general-purpose (not specific to, say, graph algorithms like Pregel), they are immutable (unlike Piccolo, which provides a shared-memory abstraction with mutable state), and they allow re-use of cached data between jobs (unlike, say, Dryad and MapReduce programs).
Does the paper (or do you) identify any
fundamental/hard trade-offs?
Fundamentally, the immutability of RDDs is a trade-off: modifications to data must be expressed collection-wide, by transforming an entire RDD into a new one. This is less of a problem than it first seems, since lazy evaluation means these transformations aren't computed until the data is actually needed (see the sketch below). Another trade-off of Spark is its Scala interface, which, while convenient for its interoperability with Java, doesn't enjoy the same level of familiarity among programmers as other languages do.
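As a small illustration of the immutability point (my own sketch, reusing the sc defined earlier; the data and names are made up): there is no in-place update, so changing a single record is written as a transformation over the whole collection, and it costs nothing until an action forces evaluation.

    // A small RDD of (product, price) pairs.
    val prices = sc.parallelize(Seq(("widget-42", 10.0), ("gadget-7", 5.0)))

    // "Updating" one record means transforming the whole immutable collection into a new RDD.
    val adjusted = prices.map {
      case ("widget-42", p) => ("widget-42", p * 1.1)  // the one record we want to change
      case other            => other                   // everything else passes through unchanged
    }

    // `adjusted` exists only as lineage until an action such as collect() forces evaluation.
    println(adjusted.collect().mkString(", "))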
Do you think the work will be influential in 10 years?
I think this work has the
potential to be influential over the next decade: it provides a relatively
simple, elegant, and versatile solution to working with large datasets
effectively. Furthermore, with the growing trend of keeping datasets in memory
(RAM is getting cheaper), I think Spark is on the right track. We’ll see if
Spark itself catches on, but even if it doesn’t, it seems to be indicative of
the direction that we’re heading (in a post-MapReduce world).