Wednesday, October 19, 2011

Erlang: A survey of the language and its industrial applications

Erlang is a concurrent functional programming language designed for real-time control systems; accordingly, it aims to provide millisecond-order response times, portability, concurrency, inter-process communication, distribution, and support for very large programs and non-stop systems. Erlang uses garbage collection, lets users control how code is loaded, relies on error detection for robustness, and has a port mechanism for communicating with outside processes. Concurrency is supported by the language environment itself and doesn't require programmers to interact with the underlying operating system.

Error handling in Erlang seems straightforward and particularly useful for distributed/parallel systems. The language provides mechanisms to catch run-time errors while evaluating an expression, to monitor the behavior of other processes, and to handle calls to undefined functions. When a process dies, it notifies the processes linked to it about why it died, and a tracking process can re-spawn another process to finish the job (since it's all functional, no state information needs to be reconstructed).
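
To make that re-spawn pattern concrete, here is a minimal sketch of the idea in Python (not Erlang): a supervising process watches a worker and simply starts a fresh one if it exits abnormally. The `do_work`/`supervise` names and the polling of exit codes are my own illustration; in Erlang the runtime delivers exit signals to linked processes directly.

```python
import multiprocessing as mp
import time

def do_work(job):
    # Pretend to do the job; a real worker might crash partway through.
    if job == "bad-input":
        raise ValueError("cannot handle job")
    print(f"finished {job}")

def supervise(job, max_restarts=3):
    for attempt in range(max_restarts):
        child = mp.Process(target=do_work, args=(job,))
        child.start()
        child.join()
        if child.exitcode == 0:
            return True               # normal exit, nothing to do
        # Abnormal exit: since the worker is (conceptually) a pure function of
        # its arguments, the supervisor can simply re-spawn it with the same job.
        print(f"worker died (exit {child.exitcode}), restarting (attempt {attempt + 1})")
        time.sleep(0.1)
    return False

if __name__ == "__main__":
    supervise("good-input")
```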

Performance-wise, it looks like Erlang is good for soft real-time systems where performance guarantees are not required, but highly desired (and it seems to have a good track record at Ericsson). While I wasn't surprised that C outperforms Erlang on small benchmarks (to be expected with the gc overhead, etc.), I found it interesting that Erlang does better than C on large systems.

One of the major trade-offs might be Erlang's unfamiliar syntax, which coupled with a functional programming model might impose a steep learning curve for many programmers.

As far as influence goes, it is clear that Erlang has remained influential over the past 25 years -- and continues to be an appropriate choice for applications that require high degrees of concurrency (systems built with Erlang include CouchDB, Riak, Amazon SimpleDB, Membase, and others).


Monday, October 17, 2011

Don't Settle for Eventual: Scalable Causal Consistency for Wide-Area Storage with COPS

COPS (Clusters of Order-Preserving Servers) addresses the same problem as PNUTS, namely, that online applications need to be scalable and provide high availability with low latency and partition tolerance, sacrificing some level of consistency. The authors present a wide-area distributed key-value store that provides "causal+" consistency, i.e. causal consistency with convergent conflict handling. Causal consistency ensures that causal dependencies between operations are preserved in the data store. Convergent conflict handling guarantees that replicas don't permanently diverge and that conflicting updates are handled identically at all sites (i.e. if there are two put() operations on the same key that conflict, then they are resolved in the same way at all replicas). Two versions of COPS are provided: the first provides causal+ consistency between individual keys, while the second provides consistent views of multiple keys (more expensive).
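
To make the dependency-tracking idea concrete, here is a rough sketch (not the actual COPS API, and with plain integer versions standing in for COPS's version stamps) of a client library that records what it has read and attaches those dependencies to each write:

```python
class LocalStore:
    """Toy stand-in for the local cluster: key -> (value, version, deps)."""
    def __init__(self):
        self.data = {}
        self.clock = 0

    def read(self, key):
        value, version, _ = self.data[key]
        return value, version

    def write(self, key, value, deps):
        self.clock += 1
        self.data[key] = (value, self.clock, deps)
        return self.clock


class CausalClient:
    def __init__(self, store):
        self.store = store
        self.context = {}            # key -> version this client has observed

    def get(self, key):
        value, version = self.store.read(key)
        self.context[key] = version  # remember the read as a dependency
        return value

    def put(self, key, value):
        # The write causally depends on everything in the context; a remote
        # replica must not apply it until those versions are locally visible.
        version = self.store.write(key, value, deps=dict(self.context))
        self.context = {key: version}   # the new write subsumes earlier deps
        return version


store = LocalStore()
alice = CausalClient(store)
alice.put("photo", "beach.jpg")      # no dependencies yet
alice.put("album", "summer")         # depends on the photo write
print(store.data["album"])           # ('summer', 2, {'photo': 1})
```

Cross-data-center replication would then ship (key, value, version, deps) and delay applying each write until its dependencies are satisfied, which is the "causal" part; the "convergent" part comes from resolving concurrent writes to the same key deterministically (e.g. last-writer-wins or an application-supplied handler).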

The main tradeoffs here are complexity versus performance, and the COPS approach opts for more complexity in general (given the stronger consistency guarantees). I don't think there is very much new here -- this seems like a rehash of ideas we saw in PNUTS and Dynamo, though I agree that the multi-key causal consistency seems valuable (though expensive performance-wise). That said, it seems like Yahoo has been able to get along pretty well without this, so it might not be worth the extra overhead. I doubt this paper specifically will have significant influence over the next decade, simply because there isn't too much novelty here, but the general idea of highly available, scalable, low-latency, geographically distributed data stores is certainly relevant, especially for web applications.

PNUTS: Yahoo's Hosted Data Serving Platform

PNUTS is a distributed database designed by Yahoo specifically for web applications that require scalability, high availability/fault tolerance, and low response time, while tolerating relaxed consistency guarantees. The main idea here is that all high-latency operations are performed asynchronously, while supporting record-level replication/mastering. A pub/sub message system is used instead of a traditional database log to coordinate replication. Data is automatically horizontally partitioned into groups of records called tablets. These tablets are small (several can be stored on each node) and can automatically be rearranged among nodes for load balancing. The system is implemented as a single centrally-managed hosted service, which is queried by applications.

Different parts of an application often require varying consistency guarantees -- slightly stale data might be acceptable in some places, while the latest version is needed in others (e.g. to show a user that an update has been made). PNUTS uses a per-record timeline consistency model, which guarantees that all replicas of a record apply updates in the same order. Applications can then specify what kind of consistency they need on a per-record basis, using less expensive consistency requirements where possible (where stale data is acceptable).
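
As a sketch of how an application might use this, the calls below roughly follow the per-record API described in the paper (read-any, read-critical, read-latest, test-and-set-write); the `pnuts` client object, table names, and fields are hypothetical:

```python
def show_profile(pnuts, user_id):
    # Slightly stale data is fine when rendering someone else's profile page.
    return pnuts.read_any("users", user_id)

def show_own_settings(pnuts, user_id, version_just_written):
    # After the user saves a change, make sure we return at least that version.
    return pnuts.read_critical("users", user_id,
                               required_version=version_just_written)

def update_email(pnuts, user_id, new_email):
    # Read the latest copy (which may be served by the record's master region),
    # then write back only if nobody else modified the record in the meantime.
    record = pnuts.read_latest("users", user_id)
    record["email"] = new_email
    # '_version' is a made-up field standing in for the record's version metadata.
    return pnuts.test_and_set_write("users", user_id, record,
                                    required_version=record["_version"])
```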

It looks like PNUTS does a good job of balancing trade-offs between differing consistency requirements and latency in a geographically distributed database system by letting applications explicitly set their per-record requirements. The per-record replication and timeline consistency could be problematic when a query depends on the mutual consistency of several records (though this is something that the COPS paper addresses). That said, I think this approach of letting applications decide what kinds of consistency guarantees to require--and when--makes sense and should be very influential over the next decade as web services continue to geographically distribute their data across the world, while trying to maintain stringent latency policies.

Friday, October 14, 2011

FlumeJava: Easy Efficient Data-Parallel Pipelines

FlumeJava is a Java library that simplifies the creation of MapReduce pipelines using a parallel-collections abstraction whose operations are executed in parallel via MR jobs. This is very similar to DryadLINQ (see blog post below) in that it provides a layer between the programmer and the actual execution data-flow graph (in this case, a pipeline of MR jobs). The main idea here is that developers can process data and develop algorithms over large data sets without actually worrying about how the data is partitioned and processed in parallel; instead, a number of operations (e.g. parallelDo) are provided, and developers can add their own functions to modify/transform the data -- all of which is automatically parallelized via FlumeJava and MR.

FlumeJava takes ParallelDo, GroupByKey, CombineValues, and Flatten operations and transforms them into a series of MapShuffleCombineReduce (MSCR) operations, a generalization of MapReduce. The optimizer performs several passes over the execution plan in order to reduce the number of MSCRs required.
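
The actual FlumeJava API is Java (PCollection, parallelDo, groupByKey, combineValues, flatten); the Python sketch below just mimics three of those primitives eagerly on local data to show what a pipeline looks like from the programmer's point of view. In real FlumeJava the same calls only build a deferred execution plan, which is optimized into MSCRs and run as MR jobs.

```python
from collections import defaultdict

class PCollection:
    """Toy, eager, in-memory stand-in for FlumeJava's parallel collection."""
    def __init__(self, items):
        self.items = list(items)

    def parallel_do(self, fn):
        # ParallelDo: apply fn to every element; fn may emit zero or more outputs.
        out = []
        for item in self.items:
            out.extend(fn(item))
        return PCollection(out)

    def group_by_key(self):
        # GroupByKey: (k, v) pairs -> (k, [v1, v2, ...]).
        groups = defaultdict(list)
        for k, v in self.items:
            groups[k].append(v)
        return PCollection(groups.items())

    def combine_values(self, combiner):
        # CombineValues: fold each group's values with an associative combiner.
        return PCollection((k, combiner(vs)) for k, vs in self.items)

# Word count as a pipeline of these primitives:
lines = PCollection(["to be or not to be"])
counts = (lines
          .parallel_do(lambda line: [(word, 1) for word in line.split()])
          .group_by_key()
          .combine_values(sum))
print(dict(counts.items))   # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```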

The trade-offs inherent in FlumeJava include the lack of UDF analysis (execution plans are determined independently of the user-defined functions passed to parallelDo) and the inability to optimize user code. FlumeJava makes it much simpler to write large programs that would previously have required a huge amount of developer work (to tune the MR pipelines). In the hands of inexperienced users, this can lead to over-use of certain operations (where unnecessary) because the cost in terms of MR jobs is not explicitly known to the programmer (an inherent trade-off of the abstraction).

As I mentioned in the DryadLINQ write-up, it looks like interfaces between MR/execution engines and higher-level programs are becoming increasingly important, since they allow programmers to trivially parallelize their applications across huge data sets. I'm not certain that FlumeJava itself will be particularly influential, since it's proprietary to Google, but I could certainly see a FlumeJava clone on Hadoop becoming popular over the next decade. That said, Spark seems to do everything that FlumeJava + MapReduce does... and it's in Scala, which is more pleasant to write than Java.

DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language

This paper presents DryadLINQ, a system that facilitates distributed computing by translating a high-level program (in C#, VB, or F#, using .NET and LINQ constructs) into a distributed execution plan that is passed to Dryad (see earlier blog post on Dryad). This is similar to Pig and Hive, which translate procedural and declarative programs respectively into MapReduce jobs, though Dryad (and DryadLINQ) can take advantage of non-MR-style execution flows. This detail seems particularly relevant and useful, since MR is not ideal for all types of programs. As we saw in the Dryad paper, creating and optimizing execution DAGs in Dryad is complex, so systems like DryadLINQ (and SCOPE) that do this for the programmer seem promising.

DryadLINQ converts LINQ expressions into execution plan graphs (EPGs), which are DAGs similar to database query plans. Static optimizations (greedy heuristics) are performed, as are dynamic optimizations: for example, DryadLINQ takes advantage of hooks in Dryad to change the execution graph as information becomes available, reducing I/O overhead, network transfer, etc. For Microsoft and companies that use .NET, DryadLINQ makes a lot of sense, since it integrates well with .NET constructs (e.g. users can define functions that make full use of .NET and have them applied to the data during distributed execution). Dryad was engineered for batch applications on large datasets, so as with MR, it's not appropriate for low-latency lookups. Likewise, it's better for streaming computations than for algorithms that require a lot of random accesses. Despite these trade-offs, Dryad's support for more generalized/arbitrary execution flows seems to make DryadLINQ more flexible than Pig or Hive, which are limited to MR.
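
The sketch below is not DryadLINQ (that is C#/LINQ); it just illustrates, in Python, one of the simplest static optimizations an EPG-style planner can make: adjacent per-record operators are fused (pipelined) into a single vertex, and a data exchange (repartitioning) is inserted only where a grouping requires it. The operator names are placeholders.

```python
def build_stages(operators):
    """operators: list of ('map', description) or ('groupby', description) steps."""
    stages, pending = [], []
    for kind, desc in operators:
        if kind == "map":
            pending.append(desc)                      # fuse with previous map steps
        elif kind == "groupby":
            if pending:
                stages.append(("pipelined vertex", pending))
            stages.append(("repartition by", desc))   # requires a data exchange
            pending = []
    if pending:
        stages.append(("pipelined vertex", pending))
    return stages

plan = build_stages([
    ("map", "parse record"),
    ("map", "filter by predicate"),
    ("groupby", "key"),
    ("map", "aggregate each group"),
])
for stage in plan:
    print(stage)
# ('pipelined vertex', ['parse record', 'filter by predicate'])
# ('repartition by', 'key')
# ('pipelined vertex', ['aggregate each group'])
```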

I think the idea of flexible data flow graphs for distributed computations is going to be very influential over the next decade, and this necessitates a certain level of complexity that will open the way for a middle layer like DryadLINQ (in between the application and the distributed execution engine). However, I'm not sure how much influence DryadLINQ itself will have, since it seems that SCOPE is being used more widely at Microsoft now -- and the fact that Dryad is closed-source will likely limit its popularity.

Dynamo: Amazon's Highly Available Key-Value Store

Dynamo is a high-availability distributed key-value store developed by Amazon. The main problem that they were trying to solve was having a key-value store that is always writeable, at the expense of some consistency under certain failure scenarios. Making the system always writeable was a priority for Amazon, since many of their user-facing applications (e.g. shopping cart) need to be able to receive input from users, while read operations take lower priority (i.e. doesn't affect their business as much).

Amazon services have stringent latency requirements (measured at the 99.9th percentile) since, interestingly, the users with the longest histories (and hence the most loyal customers) often take the most resources to process. Dynamo scales and maintains high availability by using consistent hashing for partitioning and replication of data, with consistency facilitated by object versioning. Consistency among replicas during updates is maintained by quorums and decentralized replica synchronization. Also important is that Dynamo scales incrementally (one host at a time, with minimal impact), is symmetric (each node has the same responsibilities), is decentralized, and accommodates heterogeneity (servers with different specifications can take on correspondingly sized workloads).

The interface to Dynamo is simple -- just get() and put() operations. Incremental scalability relies on consistent hashing via a "ring", so that addition/removal of nodes only affects immediate neighbors. Each node is mapped to several spots on the ring (virtual nodes) to ensure even load distribution.
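
Here is a minimal consistent-hashing sketch in the spirit of that ring (not Amazon's code): each physical node is hashed to several positions (virtual nodes), and a key's preference list is the first N distinct nodes found walking clockwise from the key's hash. Real Dynamo layers sloppy quorums, hinted handoff, and vector clocks on top of this.

```python
import bisect
import hashlib

def _hash(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8, replicas=3):
        self.replicas = replicas
        # Each node gets `vnodes` positions on the ring, sorted by hash value.
        self.points = sorted((_hash(f"{n}#{i}"), n)
                             for n in nodes for i in range(vnodes))

    def preference_list(self, key):
        """First `replicas` distinct nodes clockwise from the key's position."""
        start = bisect.bisect(self.points, (_hash(key),))
        owners = []
        for i in range(len(self.points)):
            node = self.points[(start + i) % len(self.points)][1]
            if node not in owners:
                owners.append(node)
            if len(owners) == self.replicas:
                break
        return owners

ring = Ring(["A", "B", "C", "D"])
print(ring.preference_list("shopping-cart:alice"))
# e.g. ['C', 'A', 'B'] -- which nodes come first depends on the hashes
```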

So far, it looks like Amazon has had excellent results with Dynamo, with 99.9995% of requests successful and no data loss as of the paper's publication. This, coupled with the several Dynamo-inspired open source implementations (Voldemort, Cassandra, Riak, etc.), suggests that this model will be influential over the next decade at least. There is certainly room for key-value stores that have high write availability, especially in business applications and social networking. The main tradeoffs with Dynamo are consistency (it guarantees the A & P of the CAP theorem) and, possibly, the fact that it is *too* configurable (can be hard to tune).

Thursday, October 13, 2011

Dremel: Interactive Analysis of Web-Scale Datasets

This paper presents Dremel, a scalable, interactive ad-hoc query system for analysis of read-only nested data. The system is built on top of GFS and can function side-by-side with MapReduce. The main new contribution here is the combination of the following:
- column-based storage
- efficient storage and analysis of nested data structures 
- in-situ data access (data is analyzed in-place)
- SQL-like query language

Column-based storage allows for optimization on several fronts. First, queries that don't require all (or even a majority of) fields can avoid the overhead of loading unnecessary data (i.e. the remaining columns) from disk during execution. Furthermore, column data can be compressed for quicker loading -- especially since values within a column tend to be similar or considerably repetitive. Nested data is particularly relevant to Google, since they have to deal with web pages, link schemas, etc. Representing nested data in a column-oriented data store seems tricky, but as the paper shows, not too hard.


Dremel is designed to handle aggregate queries (e.g. counts) and does so via a tree architecture. Queries are re-written at every node of the tree (i.e. intermediate servers) to aggregate the results of other intermediate nodes or leaves (i.e. horizontal partitions of the data). This lets data be analyzed in place (at the servers where the data is actually located), which, coupled with the parallelization of queries, allows for quick, interactive execution. If absolute accuracy is not required, leaves that are taking a long time can be skipped, yielding a result with a known margin of error (and an even faster response time).
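
A toy version of that serving tree for a simple count looks like the sketch below (my own illustration, not Dremel's implementation): leaves scan their local tablets in place, and each intermediate server just sums the partial results of its subtree. Real Dremel rewrites richer aggregations (GROUP BY, TOP, etc.) the same way.

```python
def count_at_leaf(tablet, predicate):
    # Leaves scan their horizontal partition of the data in place.
    return sum(1 for record in tablet if predicate(record))

def count_at_node(children, tablets_by_leaf, predicate):
    # Intermediate servers just add up the partial counts from their subtrees.
    total = 0
    for child in children:
        if isinstance(child, str):                      # a leaf server name
            total += count_at_leaf(tablets_by_leaf[child], predicate)
        else:                                           # a subtree (list of children)
            total += count_at_node(child, tablets_by_leaf, predicate)
    return total

tablets = {"leaf1": [{"lang": "en"}, {"lang": "de"}],
           "leaf2": [{"lang": "en"}],
           "leaf3": [{"lang": "fr"}, {"lang": "en"}]}
tree = [["leaf1", "leaf2"], ["leaf3"]]                  # root -> 2 intermediates -> leaves
print(count_at_node(tree, tablets, lambda r: r["lang"] == "en"))   # 3
```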

The various features of Dremel don't seem particularly innovative -- and the authors admit that column-based storage, nested data structures, etc. have been done before. However, it seems that the novelty lies in the combination of the various features, which makes Dremel particularly adept at handling large data sets interactively. That said, a lot of what Dremel does seems similar to Pig and Hive, though those applications actually generate MR jobs, which might not be the best solution. Tradeoffs are inherent in the accuracy vs. speed decisions (though it's very nice to be able to choose one or the other). More fundamentally, while Dremel is great at aggregate functions over huge sets of data, it seems that performance for more complex queries (e.g. joins, etc.) would suffer. Also, as the article points out, when many columns are read, the column-based storage model suffers (due to data retrieval and record assembly time). 


As far as influence over the next decade, it looks like Dremel has already proven popular at Google (and they have been using it since 2006). I think the column-based model makes a lot of sense for many datasets and query types, though as always one has to weigh the trade-offs ahead of time. 

Wednesday, October 5, 2011

"Bigtable: A Distributed Storage System for Structured Data"

Bigtable was developed at Google as a scalable distributed storage system for structured data that is housed on commodity servers (on top of GFS). Google currently uses Bigtable for more than 60 applications, including web indexing, Google Earth, and Google Finance. Bigtable is similar to a database, but it doesn't support a full relational model, sacrificing this for performance and the ability to scale easily to petabytes of storage. Essentially, Bigtable is a sparse, distributed, persistent, multidimensional sorted map. It maps a row key, column key, and timestamp to an arbitrary array of bytes. Data is maintained in tables that are partitioned into row ranges called "tablets", which are used to distribute and load-balance the data. Locations of tablets are stored in multiple METADATA tablets, and Bigtable uses Chubby to keep track of tablet servers. Bigtable uses compression to save space, two levels of caching (a scan cache for key/value pairs and a block cache for blocks read from GFS), and a single commit log per tablet server instead of a separate commit log for each tablet.
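
The data model itself is easy to sketch. The toy class below (my own illustration, ignoring column families, tablets, SSTables, Chubby, and everything else that makes Bigtable real) shows the essential map from (row key, column key, timestamp) to bytes, where keeping rows in lexicographic order is what makes it natural to split a table into contiguous row ranges:

```python
class TinyTable:
    """(row key, column key, timestamp) -> uninterpreted bytes, rows kept sorted."""
    def __init__(self):
        self.cells = {}                          # (row, column) -> [(timestamp, value), ...]

    def put(self, row, column, timestamp, value):
        versions = self.cells.setdefault((row, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0])      # keep versions ordered by timestamp

    def get(self, row, column, timestamp=float("inf")):
        # Return the most recent version at or before the requested timestamp.
        versions = self.cells.get((row, column), [])
        match = [v for t, v in versions if t <= timestamp]
        return match[-1] if match else None

    def scan_row_range(self, start_row, end_row):
        # Because rows are ordered lexicographically, a tablet is just a
        # contiguous slice of this keyspace.
        for (row, column) in sorted(self.cells):
            if start_row <= row < end_row:
                yield row, column, self.cells[(row, column)][-1][1]

t = TinyTable()
t.put("com.cnn.www", "contents:", 1, b"<html>v1</html>")
t.put("com.cnn.www", "contents:", 3, b"<html>v3</html>")
print(t.get("com.cnn.www", "contents:"))                 # b'<html>v3</html>'
print(t.get("com.cnn.www", "contents:", timestamp=2))    # b'<html>v1</html>'
```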

The main nugget here is that many applications don't need all the features of a relational database, so a map with row-level atomicity on top of GFS can provide a powerful way to scale storage for many applications. I'm not quite sure what the major tradeoffs are for this specific design, though naturally you can't use it for everything (many applications still benefit from the schemas and ACID-style features of "real" databases). Since so many Google applications are using Bigtable, it seems that it definitely is having influence and meeting a need, at least within the company. Whether or not the Bigtable design becomes popular outside of Google, it seems that high-performance, scalable schema-free storage is becoming a good (i.e. easier than using a RDBMS) solution for huge amounts of data.

Monday, October 3, 2011

"Pig Latin: A Not-So-Foreign Language for Data Processing"

This paper describes Pig Latin, a new programming language, and Pig, a system that turns Pig Latin statements into MapReduce jobs. Pig and Pig Latin take a somewhat different approach to the same fundamental challenge that Hive tries to solve, namely, that the MapReduce programming model is very low-level and rigid, making it difficult to use for data analysis. At the same time, the declarative style of SQL can be unnatural for many programmers (especially as queries become more complex), so Pig aims for the middle ground between the rigid procedural style inherent in MapReduce and the declarative style of SQL (and HiveQL, etc.).

In Pig Latin, a user specifies a series of steps, where each step is a single high-level data transformation (as opposed to SQL where constraints for the desired result are specified declaratively). The Pig system then optimizes the steps (if necessary/allowed) and generates a logical plan, which is then compiled into a sequence of MapReduce tasks. Notably, Pig Latin has extensive support for UDFs, which can be used at any step, including grouping, filtering, joining, and per-tuple processing.
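
Pig Latin itself reads like `good_urls = FILTER urls BY pagerank > 0.2;` followed by a GROUP and a FOREACH ... GENERATE; the Python snippet below mimics that step-at-a-time style on a made-up `urls` data set, where each line binds a name to one high-level transformation of the previous result:

```python
from collections import defaultdict

urls = [
    {"url": "a.com", "category": "news",   "pagerank": 0.9, "visits": 120},
    {"url": "b.com", "category": "news",   "pagerank": 0.1, "visits": 300},
    {"url": "c.com", "category": "sports", "pagerank": 0.5, "visits": 80},
]

# Step 1: roughly FILTER urls BY pagerank > 0.2
good_urls = [u for u in urls if u["pagerank"] > 0.2]

# Step 2: roughly GROUP good_urls BY category
groups = defaultdict(list)
for u in good_urls:
    groups[u["category"]].append(u)

# Step 3: roughly FOREACH groups GENERATE category, AVG(pagerank)  (a UDF could go here)
output = [(cat, sum(u["pagerank"] for u in us) / len(us)) for cat, us in groups.items()]
print(output)   # [('news', 0.9), ('sports', 0.5)]
```

In real Pig, each of those steps (including any UDFs) would be folded into the map, shuffle, and reduce phases of one or more MapReduce jobs.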

From a programmer's perspective, the Pig (Latin) programming model's more procedural style seems very appealing, as SQL statements can become extremely complex and difficult to write correctly. That said, many types of queries can be easily expressed in SQL-like syntax, and many data analysts are already familiar with SQL (and/or already have statements written for the kinds of queries that they want to run). Likewise, for interactive and ad-hoc querying, a SQL-based CLI (like the one provided by Hive) seems like a more convenient interface. Due to these tradeoffs, I'm not convinced that Pig will be especially influential over the next decade. While the programming interface is simpler and more convenient than MapReduce, Spark seems to do much of this as well or better (though, admittedly, it isn't as popular yet). Hive also seems to be gaining ground on Pig, and I think the big reason for this is the familiar SQL-style interface. Nevertheless, the idea of building layers on top of MapReduce that make it easier to generate MapReduce jobs (and to convert tasks into MapReduce terms) seems like a dominant trend that will continue to influence research into the next decade.

"SCADS: Scale-Independent Storage for Social Computing Applications"

As web applications grow, they encounter significant challenges in storing and querying increasingly large data sets. While parallel databases have been able to help address this issue for data that can be easily partitioned, many web applications have data that is based on social networks with a high degree of interaction between users. This makes the data hard to partition along traditional lines (e.g. users) since joins and other types of queries would often be necessary across these partitions.

At the same time, many web applications are more concerned with high performance / low latency than with complete consistency of data. This tradeoff is something that SCADS aims to exploit by letting developers state their own consistency requirements, using cloud computing for quick scaling up and down, and using machine learning to anticipate performance problems. High query volumes with low latency are made possible by the constraint that any query must be a lookup over a bounded, contiguous range of an index (and indices are automatically maintained according to the developer-specified consistency requirements). Requirements will be specified as an SLA, such that the developer can stipulate that a certain percentage of queries must complete within a certain response time. SCADS will be implemented on top of Cassandra as a column store, and will implement asynchronous index updates, session guarantees, and automatic provisioning.
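
A tiny sketch of what the query constraint means in practice (my own illustration, not SCADS code): the index is a sorted list of (key, record) pairs, and a query is only admitted if the contiguous slice it touches stays within its declared bound, so the work per query cannot grow with the size of the site.

```python
import bisect

MAX_ENTRIES = 50   # developer-declared bound on how much of the index one query may touch

def range_lookup(index, start_key, end_key):
    """index: list of (key, record) pairs sorted by key."""
    keys = [k for k, _ in index]
    lo = bisect.bisect_left(keys, start_key)
    hi = bisect.bisect_right(keys, end_key)
    if hi - lo > MAX_ENTRIES:
        raise ValueError("query exceeds its declared bound; not scale-independent")
    return [record for _, record in index[lo:hi]]

# e.g. "posts by user 42" as a bounded scan of a (user_id, timestamp) index
posts = sorted([((42, t), f"post {t}") for t in range(10)] +
               [((7, t), f"other {t}") for t in range(5)])
print(range_lookup(posts, (42, 0), (42, 9)))   # the ten posts by user 42
```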

I like this paper's emphasis on the unique characteristics of web-application data that doesn't lend itself to easy partitioning, as I think this is obviously a growing trend, especially on the web. The tradeoffs here seem to be in two areas: consistency (some consistency must be sacrificed for speed, but this is usually acceptable for web apps) and query constraints (queries must cover a contiguous section of an index). I'm not sure that SCADS itself will be influential over the next decade, but the problem that it addresses (data that is not easily partitioned but doesn't require high consistency) is a real one and will certainly drive a lot of database research over the next decade.

"HIVE: Data Warehousing & Analytics on Hadoop"

Hive is an open-source data warehousing solution built on top of Hadoop and developed by a team at Facebook. While Hadoop and the MapReduce model are effective for analyzing enormous data sets of the scale encountered at Facebook, MapReduce is still relatively hard to program. On the other hand, users are much more familiar with SQL, bash, Python, etc., so Hive was designed as an intermediate layer for querying data through SQL-like syntax, which is compiled into a series of MapReduce jobs that run on Hadoop. Among other tasks, Facebook uses Hive for summarizing impressions/clicks and other statistics, ad hoc analysis, data mining, spam detection, and ad optimization.

Hive consists of several components on top of Hadoop/HDFS. There is both a CLI and a Thrift API through which HiveQL statements can be submitted. The Metastore is a catalog with table/column/partition information, etc. The Driver manages the compilation, optimization, and execution of HiveQL statements, invoking the Compiler to translate a query into a DAG of MapReduce jobs, which are then submitted to the Execution Engine (and sent to Hadoop).
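
As a sketch of what that compilation amounts to for a simple query (my own illustration, not Hive's actual plan), a statement like `SELECT status, COUNT(1) FROM status_updates GROUP BY status` boils down to a map phase that emits (status, 1) pairs and a reduce phase that sums the counts per status; the table and field names here are made up.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(rows):
    # Emit one (key, 1) pair per row, keyed by the GROUP BY column.
    for row in rows:
        yield row["status"], 1

def reduce_phase(pairs):
    # The shuffle between the phases groups pairs by key; here we sort instead.
    pairs = sorted(pairs, key=itemgetter(0))
    for status, group in groupby(pairs, key=itemgetter(0)):
        yield status, sum(count for _, count in group)

rows = [{"status": "active"}, {"status": "idle"}, {"status": "active"}]
print(list(reduce_phase(map_phase(rows))))   # [('active', 2), ('idle', 1)]
```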

The main 'nugget' of this project is that MapReduce is hard to program -- SQL is a much more convenient interface, and a certain subset of queries can be converted into MapReduce jobs. I think that Hive's interface is fairly intuitive and provides easy access to the power of MapReduce for processing data, so we will probably continue to see its popularity grow (along with that of similar frameworks like Pig). That said, the Hive/Hadoop stack does have some performance problems, especially with loading data into memory on every job, but we hope to address this issue by using a driver for Spark and caching data in RDDs.