Wednesday, October 19, 2011

Erlang: A survey of the language and its industrial applications

Erlang is a concurrent functional programming language designed for real-time control systems, so it aims to provide millisecond-order response times, portability, concurrency, IPC, and distribution, and to accommodate very large programs and non-stop systems. Erlang provides garbage collection, user-controlled code loading, error-detection mechanisms for robustness, and a port mechanism for communicating with outside processes. Concurrency is supported by the language environment itself and doesn't require programmers to interact with the underlying operating system.

Error handling in Erlang seems straightforward and particularly useful for distributed/parallel systems. The language provides mechanisms to catch run-time errors raised while evaluating an expression, to monitor the behavior of another process, and to raise exceptions (e.g. when an undefined function is called). When a process dies, processes linked to it are notified of why it died, and a tracking process can re-spawn another process to finish the job (since it's all functional, no mutable state needs to be reconstructed).
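To make that supervision pattern concrete, here is a minimal sketch (in Python rather than Erlang, with hypothetical names and purely illustrative behavior) of a tracking process that notices a worker's death and re-spawns it:

```python
# Minimal, hypothetical sketch (not Erlang) of the "supervisor" pattern:
# a tracking process notices a worker's death and re-spawns it with the
# same (purely functional) arguments, so no state has to be reconstructed.
import multiprocessing as mp

def worker(task):
    # A pure function of its input: safe to simply restart on failure.
    if task == "bad":
        raise RuntimeError("simulated crash")
    print(f"finished {task}")

def supervise(task, max_restarts=3):
    for attempt in range(max_restarts):
        p = mp.Process(target=worker, args=(task,))
        p.start()
        p.join()                      # analogous to being "linked" to the worker
        if p.exitcode == 0:
            return
        print(f"worker died (exit {p.exitcode}), restarting ({attempt + 1})")

if __name__ == "__main__":
    supervise("ok")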

Performance-wise, it looks like Erlang is good for soft real-time systems, where performance guarantees are not strictly required but highly desired (and it seems to have a good track record at Ericsson). I wasn't surprised that C outperforms Erlang on small benchmarks (to be expected given the GC overhead, etc.), but I found it interesting that Erlang does better than C on large systems.

One of the major trade-offs might be Erlang's unfamiliar syntax, which coupled with a functional programming model might impose a steep learning curve for many programmers.

As far as influence goes, it is clear that Erlang has remained influential over the past 25 years -- and continues to be an appropriate choice for applications that require high degrees of concurrency (users of Erlang include CouchDB, Riak, AWS SimpleDB, Membase, and others).


Monday, October 17, 2011

Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

COPS (Clusters of Order-Preserving Servers) addresses the same problem as PNUTS, namely, that online applications need to be scalable and provide high availability with low latency and partition tolerance, sacrificing some level of consistency. The authors present a wide-area distributed key-value store that provides "causal+" consistency, i.e. causal consistency with convergent conflict handling. Causal consistency ensures that causal dependencies between operations are preserved in the data store. Convergent conflict handling guarantees that replicas don't permanently diverge and that conflicting updates are handled identically at all sites (i.e. if two put() operations on the same key conflict, they are resolved the same way at every replica). Two versions of COPS are provided: the first provides causal+ consistency between individual keys, while the second (COPS-GT) also provides consistent views of multiple keys via get transactions, at greater cost.
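As a rough illustration (not the actual COPS implementation), dependency tracking might look something like the sketch below; it assumes versions are totally ordered (e.g. Lamport timestamps) and uses last-writer-wins as one possible convergent conflict handler:

```python
# Illustrative sketch of causal dependency tracking: each replicated put
# carries the versions it depends on, and a remote replica only applies the
# write after those dependencies are locally visible.
class Replica:
    def __init__(self):
        self.store = {}      # key -> (version, value)

    def satisfied(self, deps):
        return all(key in self.store and self.store[key][0] >= ver
                   for key, ver in deps)

    def apply_remote(self, key, version, value, deps, pending):
        if self.satisfied(deps):
            # Convergent conflict handling: last-writer-wins on version here.
            if key not in self.store or version > self.store[key][0]:
                self.store[key] = (version, value)
        else:
            pending.append((key, version, value, deps))  # retry later

r = Replica()
pending = []
r.apply_remote("photo", 1, "img.jpg", deps=[], pending=pending)
# This write causally depends on the photo already being visible:
r.apply_remote("album", 2, ["photo"], deps=[("photo", 1)], pending=pending)
```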

The main trade-offs here are complexity versus performance, and the COPS approach opts for more complexity in general (given the stronger consistency guarantees). I don't think there is very much new here -- this seems like a rehash of ideas we saw in PNUTS and Dynamo, though I agree that the multi-key causal consistency seems valuable (though expensive performance-wise). That said, it seems like Yahoo has been able to get along pretty well without this, so it might not be worth the extra overhead. I doubt this paper specifically will have significant influence over the next decade, simply because there isn't much novelty here, but the general idea of highly available, scalable, low-latency, geographically distributed data stores is certainly relevant, especially for web applications.

PNUTS: Yahoo's Hosted Data Serving Platform

PNUTS is a distributed database designed by Yahoo especially for web applications that require scalability, high availability/fault tolerance, and low response time, while tolerating relaxed consistency guarantees. The main idea here is that all high-latency operations are performed asynchronously, while supporting record-level replication and mastering. A pub/sub message system is used instead of a traditional database log to coordinate replication. Data is automatically horizontally partitioned into groups of records called tablets. These tablets are small (several can be stored on each node) and can automatically be rearranged among nodes for load balancing. The system is implemented as a single centrally-managed hosted service, which is queried by applications.

Different parts of an application often require different consistency guarantees -- slightly stale data might be acceptable in some places, while the latest version is needed in others (e.g. to show a user that their update has been made). PNUTS uses a per-record timeline consistency model, which guarantees that all replicas of a record apply updates in the same order. Applications can then specify what kind of consistency they need on a per-record basis, using less expensive consistency requirements where stale data is acceptable.
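The paper exposes this choice through per-record read calls with different guarantees (read-any, read-critical, read-latest). Here is a toy sketch of how an application might use them; the client class and its internals are hypothetical:

```python
# Toy sketch of choosing a per-read consistency level in the spirit of
# PNUTS' record-level API. Not the real client library.
class Record:
    def __init__(self, value, version):
        self.value, self.version = value, version

class PnutsLikeClient:
    def __init__(self, local_replica, master_replica):
        self.local = local_replica      # nearby, possibly stale: key -> Record
        self.master = master_replica    # the record's master copy, up to date

    def read_any(self, key):
        return self.local[key]          # cheapest: staleness is acceptable

    def read_critical(self, key, required_version):
        # Must see at least this version, e.g. the user's own recent update.
        rec = self.local[key]
        return rec if rec.version >= required_version else self.master[key]

    def read_latest(self, key):
        return self.master[key]         # most expensive: always the latest

local = {"profile:42": Record("old bio", 3)}
master = {"profile:42": Record("new bio", 4)}
client = PnutsLikeClient(local, master)
print(client.read_critical("profile:42", required_version=4).value)  # "new bio"
```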

It looks like PNUTS does a good job of balancing trade-offs between differing consistency requirements and latency in a geographically distributed database system by letting applications explicitly set their per-record requirements. The per-record replication and timeline consistency could be problematic when a query depends on the mutual consistency of several records (though this is something that the COPS paper addresses). That said, I think this approach of letting applications decide what kinds of consistency guarantees to require--and when--makes sense and should be very influential over the next decade as web services continue to geographically distribute their data across the world, while trying to maintain stringent latency policies.

Friday, October 14, 2011

FlumeJava: Easy, Efficient Data-Parallel Pipelines

FlumeJava is a Java library that simplifies the creation of MapReduce pipelines by providing a parallel-collections abstraction whose operations are executed in parallel via MR jobs. This is very similar to DryadLINQ (see blog post below) in that it provides a layer between the programmer and the actual execution data-flow graph (in this case, a pipeline of MR jobs). The main idea here is that developers can process data and develop algorithms over large data sets without actually worrying about how the data is partitioned and processed in parallel; instead, a small number of operations (e.g. parallelDo) are defined, developers plug in their own functions to modify/transform the data, and all of this is automatically parallelized via FlumeJava and MR.
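A rough sketch of the deferred-evaluation idea behind such a library (purely illustrative, in Python rather than Java, with a local sequential "executor" standing in for the optimizer and the MR pipeline):

```python
# Hypothetical sketch: operations on a "parallel collection" only build a
# plan; run() decides how to execute it (in a real system, as a pipeline of
# MapReduce/MSCR jobs after optimization).
class PCollection:
    def __init__(self, plan):
        self.plan = plan                           # list of (op, fn) steps

    def parallel_do(self, fn):
        return PCollection(self.plan + [("parallel_do", fn)])

    def group_by_key(self):
        return PCollection(self.plan + [("group_by_key", None)])

    def run(self, data):
        # Local, sequential stand-in for the distributed execution.
        for op, fn in self.plan:
            if op == "parallel_do":
                data = [fn(x) for x in data]
            elif op == "group_by_key":
                grouped = {}
                for k, v in data:
                    grouped.setdefault(k, []).append(v)
                data = list(grouped.items())
        return data

words = PCollection([]).parallel_do(lambda w: (w, 1)).group_by_key()
print(words.run(["a", "b", "a"]))   # [('a', [1, 1]), ('b', [1])]
```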

FlumeJava takes ParallelDo, GroupByKey, CombineValues, and Flatten operations and transforms them into a series of MapShuffleCombineReduce (MSCR) operations, a generalization of MapReduce. The optimizer performs several passes over the execution plan in order to reduce the number of MSCRs required.

The trade-offs inherent in FlumeJava include the lack of UDF analysis (execution plans are determined independently of the user-defined functions passed to parallelDo) and the inability to optimize user code. FlumeJava makes it much simpler to write large programs that would previously have required a huge amount of developer work (to tune the MR pipelines). In the hands of inexperienced users, this can lead to over-use of certain operations (where unnecessary) because the cost in terms of MR jobs is not explicitly visible to the programmer (an inherent trade-off of the abstraction).

As I mentioned in the DryadLINQ write-up, it looks like interfaces between MR/execution engines and higher-level programs are becoming increasingly important, since they allow programmers to trivially parallelize their applications across huge data sets. I'm not certain that FlumeJava itself will be particularly influential, since it's proprietary to Google, but I could certainly see a FlumeJava clone on Hadoop becoming popular over the next decade. That said, Spark seems to do everything that FlumeJava + MapReduce does... and it's in Scala, which is more pleasant to write than Java.

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

This paper presents DryadLINQ, a system that facilitates distributed computing by translating a high-level program (in C#, VB, or F#, using .NET and LINQ constructs) into a distributed execution plan that is passed to Dryad (see earlier blog post on Dryad). This is similar to Pig and Hive, which translate procedural and declarative programs, respectively, into MapReduce jobs, though Dryad (and DryadLINQ) can take advantage of non-MR-style execution flows. This detail seems particularly relevant and useful, since MR is not ideal for all types of programs. As we saw in the Dryad paper, creating and optimizing execution DAGs in Dryad is complex, so systems like DryadLINQ (and SCOPE) that do this for the programmer seem promising.

DryadLINQ converts LINQ expressions into execution plan graphs (EPGs), which are DAGs similar to database query plans. Static optimizations (greedy heuristics) are performed, as are dynamic optimizations. For example, DryadLINQ takes advantage of hooks in Dryad to change the execution graph as information becomes available, allowing dynamic optimizations that reduce I/O overhead, network transfer, etc. For Microsoft and companies that use .NET, DryadLINQ makes a lot of sense, since it integrates well with .NET constructs (e.g. users can define functions that make full use of .NET and have them applied to the data during distributed execution). Dryad was engineered for batch applications on large datasets, so, as with MR, it's not appropriate for low-latency lookups. Likewise, it's better suited to streaming computations than to algorithms that require a lot of random accesses. Despite these trade-offs, Dryad's support for more generalized/arbitrary execution flows seems to make DryadLINQ more flexible than Pig or Hive, which are limited to MR.
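As a toy illustration of the dynamic-optimization idea (the function and the 256 MB target below are hypothetical, not DryadLINQ's actual policy), one could imagine picking the degree of parallelism of an aggregation stage only once the sizes of the upstream outputs are known at run time:

```python
# Hypothetical sketch: choose the number of vertices for a downstream
# aggregation stage based on observed upstream output sizes.
def plan_aggregation_stage(upstream_output_bytes, target_bytes_per_vertex=256 * 2**20):
    total = sum(upstream_output_bytes)
    # One aggregation vertex per ~256 MB of intermediate data, at least one.
    num_vertices = max(1, -(-total // target_bytes_per_vertex))  # ceiling division
    return num_vertices

# e.g. three upstream vertices produced 100 MB, 400 MB, and 50 MB of output:
print(plan_aggregation_stage([100 * 2**20, 400 * 2**20, 50 * 2**20]))  # -> 3
```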

I think the idea of flexible data flow graphs for distributed computations is going to be very influential over the next decade, and this necessitates a certain level of complexity that will open the way for a middle layer like DryadLINQ (in between the application and the distributed execution engine). However, I'm not sure how much influence DryadLINQ itself will have, since it seems that SCOPE is being used more widely at Microsoft now -- and the fact that Dryad is closed-source will likely limit its popularity.

Dynamo: Amazon's Highly Available Key-Value Store

Dynamo is a high-availability distributed key-value store developed by Amazon. The main problem they were trying to solve was building a key-value store that is always writeable, at the expense of some consistency under certain failure scenarios. Making the system always writeable was a priority for Amazon, since many of their user-facing applications (e.g. the shopping cart) need to be able to receive input from users, while read operations take lower priority (i.e. stale reads don't affect their business as much).

Amazon services have stringent latency requirements (measured at the 99.9th percentile) since, interestingly, the users with the longest histories (and hence the most loyal customers) often take the most resources to process. Dynamo scales and maintains high availability by using consistent hashing for partitioning and replication of data, with consistency facilitated by object versioning. Consistency among replicas during updates is maintained by quorum techniques and decentralized replica synchronization. Also important is that Dynamo can scale incrementally (one host at a time with minimal impact), and that it has symmetry (each node has the same responsibilities), decentralization, and heterogeneity (servers with different specifications can take on correspondingly sized workloads).

The interface to Dynamo is simple -- just get() and put() operations. Incremental scalability relies on consistent hashing via a "ring," so that addition/removal of nodes only affects immediate neighbors. Each node is mapped to several positions on the ring (virtual nodes) to ensure even load distribution.
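A minimal sketch of consistent hashing with virtual nodes, in the spirit of (but much simpler than) Dynamo's ring; replication to the next N-1 successors and preference lists are omitted:

```python
# Illustrative consistent-hashing ring with virtual nodes: a key is owned by
# the first node encountered walking clockwise from the key's hash position.
import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node gets several positions ("virtual nodes") on the ring.
        self.points = sorted((h(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes_per_node))
        self.hashes = [p for p, _ in self.points]

    def owner(self, key):
        # Walk clockwise from the key's position; wrap around at the end.
        i = bisect.bisect(self.hashes, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["A", "B", "C"])
print(ring.owner("shopping-cart:42"))
```

Adding or removing a node only shifts ownership of the key ranges adjacent to that node's positions, which is what makes one-host-at-a-time scaling cheap.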

So far, it looks like Amazon has had excellent results with Dynamo, with 99.9995% of requests successful and no data loss as of the paper's publication. This, coupled with the several Dynamo-inspired open source implementations (Voldemort, Cassandra, Riak, etc.), suggests that this model will be influential over the next decade at least. There is certainly room for key-value stores that have high write availability, especially in business applications and social networking. The main tradeoffs with Dynamo are consistency (it guarantees the A & P of the CAP theorem) and, possibly, the fact that it is *too* configurable (can be hard to tune).

Thursday, October 13, 2011

Dremel: Interactive Analysis of Web-Scale Datasets

This paper presents Dremel, a scalable, interactive ad-hoc query system for analysis of read-only nested data. The system is built on top of GFS and can function side-by-side with MapReduce. The main new contribution here is the combination of the following:
- column-based storage
- efficient storage and analysis of nested data structures 
- in-situ data access (data is analyzed in-place)
- SQL-like query language

Column-based storage allows for optimization on several fronts. First, queries that don't require all (or even a majority of) fields can avoid the overhead of loading unnecessary columns from disk during query execution. Furthermore, column data can be compressed for quicker loading, especially since values within a column tend to be similar or considerably repetitive. Nested data is particularly relevant to Google, since they have to deal with web pages, link schemas, etc. Representing nested data in a column-oriented data store seems tricky, but as the paper shows, not too hard.
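A small sketch of the column-store intuition on toy flat data (this deliberately ignores Dremel's actual record-shredding scheme, which also tracks repetition and definition levels for nested fields):

```python
# Records are split into per-field value streams; a query touching one field
# reads only that stream, and repetitive columns compress well (run-length
# encoding shown here as a stand-in).
from itertools import groupby

records = [
    {"url": "a.com", "lang": "en", "clicks": 3},
    {"url": "b.com", "lang": "en", "clicks": 7},
    {"url": "c.com", "lang": "en", "clicks": 1},
]

columns = {field: [r[field] for r in records] for field in records[0]}

def run_length_encode(values):
    return [(v, len(list(g))) for v, g in groupby(values)]

# A query over "lang" reads one small, highly compressible stream,
# never touching "url" or "clicks".
print(run_length_encode(columns["lang"]))   # [('en', 3)]
```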


Dremel is designed to handle aggregate queries (e.g. counts) and does so via a tree architecture. Queries are rewritten at every node of the tree (i.e. intermediate servers) to aggregate the results of other intermediate nodes or leaves (i.e. horizontal partitions of the data). This lets data be analyzed in place (at the servers where the data is actually located), which, coupled with the parallelization of queries, allows for quick, interactive execution. If absolute accuracy is not required, leaves that are taking a long time can be skipped, yielding a result with a known margin of error (and an even faster response time).
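A toy sketch of the serving-tree idea for a COUNT-style aggregate (the structure below is hypothetical; the real system rewrites SQL at each level of the tree):

```python
# Leaves scan their local partitions; each intermediate node merely combines
# the partial results of its children.
def leaf_count(partition, predicate):
    return sum(1 for row in partition if predicate(row))

def tree_count(node, predicate):
    if node["children"]:                       # intermediate server
        return sum(tree_count(c, predicate) for c in node["children"])
    return leaf_count(node["partition"], predicate)   # leaf server

tree = {"children": [
    {"children": [], "partition": [1, 5, 9]},
    {"children": [], "partition": [2, 8]},
]}
print(tree_count(tree, lambda x: x > 4))       # 3
```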

The various features of Dremel don't seem particularly innovative -- the authors admit that column-based storage, nested data structures, etc. have been done before. However, the novelty seems to lie in the combination of these features, which makes Dremel particularly adept at handling large data sets interactively. That said, a lot of what Dremel does seems similar to Pig and Hive, though those systems actually generate MR jobs, which might not be the best solution. Trade-offs are inherent in the accuracy vs. speed decisions (though it's very nice to be able to choose one or the other). More fundamentally, while Dremel is great at aggregate functions over huge sets of data, it seems that performance for more complex queries (e.g. joins) would suffer. Also, as the paper points out, when many columns are read, the column-based storage model suffers (due to data retrieval and record-assembly time).


As far as influence over the next decade, it looks like Dremel has already proven popular at Google (and they have been using it since 2006). I think the column-based model makes a lot of sense for many datasets and query types, though as always one has to weigh the trade-offs ahead of time.