Wednesday, September 28, 2011

"Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks"

Dryad addresses the problem of executing distributed data-parallel applications by using a directed acyclic graph (DAG), modeling computation in the vertices and communication/data transfer in the edges. The choice of this model is motivated by the desire to provide a simple(r) programming model for distributed applications while still allowing reliability, efficiency, and scalability from multi-core computers up to clusters. The graph model gives developers far more control over data flow than other parallel programming environments (e.g. MapReduce or GPU shader languages) allow.
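To make the vertex-and-edge picture concrete, here is a toy Python sketch of the graph model (my own illustration under assumed names, not Dryad's actual C++ API): each vertex is a computation, each edge carries one vertex's output to another's input, and the graph is executed in dependency order.

# Toy illustration of the DAG model (hypothetical; not Dryad's C++ API).
# Vertices are computations; edges carry data from producers to consumers.
from graphlib import TopologicalSorter

# Each vertex maps a list of upstream outputs to its own output.
vertices = {
    "read":   lambda _:   [3, 1, 4, 1, 5],
    "square": lambda ins: [x * x for x in ins[0]],
    "double": lambda ins: [2 * x for x in ins[0]],
    "merge":  lambda ins: sorted(ins[0] + ins[1]),
}
# edges[v] lists the vertices whose outputs flow into v.
edges = {"square": ["read"], "double": ["read"], "merge": ["square", "double"]}

outputs = {}
for v in TopologicalSorter(edges).static_order():   # run in dependency order
    upstream = [outputs[u] for u in edges.get(v, [])]
    outputs[v] = vertices[v](upstream)
print(outputs["merge"])

In the real system each vertex would run as a process on some cluster machine and each edge would be a file, TCP pipe, or shared-memory channel, which is exactly the kind of data-flow control that MapReduce hides from the programmer.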

The main tradeoff that I see with Dryad is that it seems to be considerably more complex than MapReduce, but the DAG approach also offers more flexibility. Programmers must first plan out their various subroutines and generate a data-flow graph that corresponds to their program, which is considerably more work than what MapReduce requires. At the same time, however, this design makes Dryad more general-purpose than MapReduce, allowing a variety of services to be built on top of the system (like DryadLINQ, which can then provide more "friendly" development interfaces). Still, the main takeaway I had from the paper was that there is a large tradeoff between complexity and flexibility inherent in this and other parallel execution engines.

It remains to be seen whether or not Dryad will remain influential over the next decade. It seems that Microsoft has followed through with a number of services built on top of Dryad (DryadLINQ, Scope, etc.). However, the fact that this is all enterprise-level / closed-source (?) makes it hard to envision them becoming popular with startups and the open source community. Performance-wise, the paper suggests that it scales well, which looks promising. The flexibility vs. complexity tradeoff seems significant, but I think we will see systems like Dryad being used for those applications that cannot be expressed very well in MapReduce terms. In reality, it seems likely that Hadoop might be extended to encompass some of the flexibility that Dryad provides.

"MapReduce: Simplified Data Processing on Large Clusters"

MapReduce provides a programming model that simplifies the distributed processing and generation of large data sets. The main idea is to split a program into a map phase (a function that processes key/value pairs to generate an intermediate set of key/value pairs) and a reduce phase (a function that merges all the values associated with a given intermediate key). This abstraction lets programmers focus on the task they are trying to solve without having to deal too much with the complexities of distributed computation: launching programs on many nodes, keeping track of nodes, handling failures, etc. are all taken care of by the framework itself. The paper shows that many common tasks can be parallelized in this fashion, including counts, grep, sorts, indexing, and so on. While I/O is a limiting factor for many MapReduce tasks (loading huge data takes a long time!), the master is smart about assigning workers to data that is local (or nearby) to reduce transfer time. Fault tolerance is achieved by re-executing the work of failed machines on other workers, and "straggler" workers that take too long to finish their map tasks (holding up the reduce phase for everyone) are handled by scheduling backup executions of the remaining tasks.
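To make the model concrete, here is a toy word-count sketch in Python (an in-process stand-in of my own, not Google's C++ framework): the sharding, shuffle, scheduling, and fault tolerance that the real system provides are collapsed into a single dictionary.

# Toy, single-process sketch of the map/reduce programming model (word count).
from collections import defaultdict

def map_fn(doc_name, text):
    # map: (input key, value) -> intermediate (key, value) pairs
    for word in text.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # reduce: (intermediate key, list of values for that key) -> output value
    return sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # The real framework shards the input across workers, shuffles the
    # intermediate pairs by key, and handles locality, retries, and stragglers.
    intermediate = defaultdict(list)
    for key, value in inputs.items():
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

docs = {"a.txt": "the quick brown fox", "b.txt": "the lazy dog"}
print(run_mapreduce(docs, map_fn, reduce_fn))   # {'the': 2, 'quick': 1, ...}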

While MapReduce is not appropriate for all algorithms, the design grew out of the observation that many parallel tasks (at least at Google) could be easily converted into map/reduce-style programs. Of course, there are tradeoffs in this approach, since some algorithms cannot be expressed well in this functional style. Furthermore, the ease of programming in MapReduce might make programmers prefer this model even when it isn't necessarily suitable (a tradeoff between ease of programming and the optimal solution for a given problem).

The experience with MapReduce at Google and the popularity of MapReduce-inspired implementations like Hadoop testify to the impact that this paper (and model) had on computing (especially cloud/distributed computing). I think that this model will certainly continue to influence research over the next decade, given the increasing amounts of data that need to be processed and the ease with which MapReduce accomplishes this task. 

Monday, September 26, 2011

“Megastore: Providing Scalable, Highly Available Storage for Interactive Services”

This paper describes a system developed at Google to provide a scalable storage system that achieves high reliability through synchronous replication and provides ACID semantics over partitions of data. The need for this arose because NoSQL datastores, while highly scalable, have limited APIs and loose consistency models that are not appropriate for all types of applications, and because consistent replication across datacenters is hard to achieve.

Megastore accomplishes this by partitioning the datastore and replicating each partition separately, providing ACID compliance within partitions, but only limited consistency guarantees across them, on the assumption that most Internet service data can be partitioned by user, etc. Write-ahead logging is used for transactions and concurrency control. Replication uses a low-latency implementation of Paxos that is optimized for reads, such that local reads can be performed on any up-to-date replica. 
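To illustrate the entity-group idea, here is a rough, hypothetical Python sketch (the names are mine, and the Paxos replication of each log entry is elided): a transaction commits by appending to the group's write-ahead log and then applying it, so ACID semantics hold within a group while only weaker guarantees hold across groups.

# Hypothetical sketch of per-entity-group transactions with a write-ahead log.
# In Megastore, appending a log entry is itself a Paxos round across the
# replicas; that replication step is omitted here.
class EntityGroup:
    def __init__(self):
        self.log = []     # write-ahead log: committed mutation batches, in order
        self.data = {}    # current state, obtained by applying the log

    def read(self, key):
        return self.data.get(key)

    def commit(self, mutations):
        self.log.append(mutations)   # commit point (a majority of replicas accept)
        self.data.update(mutations)  # apply the committed entry

# ACID holds within one group (e.g. one user's data); cross-group operations don't get it.
alice = EntityGroup()
alice.commit({"photo:1": "beach.jpg", "album:summer": ["photo:1"]})
print(alice.read("album:summer"))   # ['photo:1']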

Synchronous replication imposes some tradeoffs with the Paxos implementation: only a low rate of writes per entity group can be sustained before conflict rates become high. In general this isn’t a problem for Google, since write throughput can be scaled by sharding data into finer-grained entity groups, but it still seems like a significant inconvenience.

I thought the combination of NoSQL scalability and traditional DBMS guarantees was innovative, and it seems to provide good performance for many Internet-service-type applications, where data can easily be partitioned into entities (which need to be internally consistent). The emphasis on making data local to speed up reads seems particularly important, since most of these applications are read-heavy. I’m not sure that this paper or Megastore itself will remain influential over the next decade, but I do think we’ll see more of this kind of merging of NoSQL and traditional DBMS features as Internet app data continues to skyrocket.

"The Google File System"

This paper describes a scalable distributed file system designed for large distributed data-intensive applications. It shares the goals of previous DFSs, including performance, scalability, reliability, and availability, but also incorporates newer workload assumptions: component failures are the norm (a consequence of commodity hardware), files are huge, files are modified primarily by appending rather than overwriting, and co-designing the FS API and the applications increases flexibility. GFS provides the usual file operations, though not through a standard API such as POSIX, and adds snapshot and record-append operations. A GFS cluster has a single master and multiple chunkservers, accessed by multiple clients. Files are divided into 64MB chunks, and the metadata is stored in the master. Clients contact the master, are directed to the appropriate chunkserver, and then interact directly with that server. GFS has a relaxed consistency model, so applications rely on appends rather than overwrites, use checkpointing, and write self-validating, self-identifying records. The system focuses on high throughput instead of low latency.
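To make the client/master/chunkserver split concrete, here is a toy Python sketch of the read path under my own simplifications (in-process objects stand in for the RPCs, and replication, leases, and client caching of chunk locations are left out): the master serves only metadata, and the data itself flows directly from a chunkserver.

# Toy sketch of the GFS read path (illustration only, not the real protocol).
CHUNK_SIZE = 64 * 2**20   # files are divided into 64 MB chunks

class Master:
    # Holds only metadata: (file name, chunk index) -> (chunk handle, locations).
    def __init__(self, chunk_table):
        self.chunk_table = chunk_table
    def lookup(self, filename, chunk_index):
        return self.chunk_table[(filename, chunk_index)]

class Chunkserver:
    # Stores chunk data; clients read from it directly, bypassing the master.
    def __init__(self, chunks):
        self.chunks = chunks
    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def gfs_read(master, filename, offset, length):
    chunk_index = offset // CHUNK_SIZE               # translate file offset to chunk
    handle, servers = master.lookup(filename, chunk_index)
    return servers[0].read(handle, offset % CHUNK_SIZE, length)

cs = Chunkserver({"h1": b"hello, gfs"})
m = Master({("/logs/web.log", 0): ("h1", [cs])})
print(gfs_read(m, "/logs/web.log", 7, 3))   # b'gfs'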

This file system was obviously designed to meet Google’s somewhat specific goals and workload characteristics. Many of the particular constraints (e.g. append-only files) don’t necessarily meet the needs of many applications. However, this solution is still highly relevant, since it is simple to scale and good for MapReduce-type application data (as evidenced by the popularity of the Hadoop/HDFS combo). GFS was novel in its focus on tolerance of commodity hardware (and failures), its focus on large files, and the more relaxed consistency guarantees (trade-offs between simplicity, consistency & performance). As the experience at Google has demonstrated, the design works and works well (for their needs). I think this will certainly be relevant over the next decade, although modifications will no doubt be made to tailor it to other tasks. The single-master approach seems to have limitations (as we heard in the Cloudera lecture), so we might see modifications to this.

Wednesday, September 21, 2011

“Paxos Made Practical”


This paper makes the point that implementing Paxos is difficult to do, even though the concept is not all that complex in theory. Many systems claim to use Paxos, but their precise implementation details are often left undescribed. The Viewstamped Replication paper makes an effort to explain how to use Paxos in the real world, but uses distributed transactions (that add more complexity than all applications necessarily require) and doesn’t really address group membership changes.

The paper then goes on to describe the protocol using a C/C++ interface and state-machine replication (a deterministic service that accepts requests and produces replies). The level of detail in the implementation is very thorough and certainly sets the paper apart from the work on Paxos that had previously been published. As far as trade-offs go, the implementation can be made more efficient by broadcasting requests to backups. Likewise, there are hardware tradeoffs: the system requires at least three full replicas to survive a failure (so that the two remaining ones constitute a majority). The author proposes having the third machine possibly act as a “witness” that can participate to help replicas recover after a failure or partition.
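To show what state-machine replication means here, a toy Python sketch under my own naming (not the paper's C/C++ interface): as long as every replica applies the same ordered log of requests to the same deterministic service, all replicas produce identical replies and end up in identical states.

# Toy illustration of state-machine replication (hypothetical names).
class KVService:
    # A deterministic service: each request is (op, key, value) -> reply.
    def __init__(self):
        self.state = {}
    def execute(self, request):
        op, key, value = request
        if op == "put":
            self.state[key] = value
            return "ok"
        return self.state.get(key)   # "get"

log = [("put", "x", 1), ("put", "y", 2), ("get", "x", None)]   # agreed-upon order

replicas = [KVService() for _ in range(3)]
for request in log:                     # every replica applies the same log
    replies = [r.execute(request) for r in replicas]
print(replies)                                                # [1, 1, 1]
print(all(r.state == replicas[0].state for r in replicas))    # True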

The potential influence of this paper is hard to gauge: given the specificity of the implementation, the paper should always be helpful to people trying to implement Paxos, but I’m not sure that it presents any real new ideas that will sway the direction of further research on Paxos and consensus protocols.

“Paxos Made Simple”


Paxos addresses the issue of consensus among a group of actors (getting them all to accept the same value). The algorithm is roughly as follows:
- There are three basic roles for agents: proposers, acceptors, and learners

There are two phases: a proposal (prepare) phase and an acceptance phase

Proposal phase:
- First, a proposer selects a proposal number n and sends a prepare request with that number to a majority of acceptors
- If an acceptor receives a prepare request with a number n greater than that of any request it has already responded to, it responds with a promise not to accept any more proposals numbered less than n, along with the highest-numbered proposal (if any) that it has already accepted

Acceptance phase:
- If the proposer receives responses to its request from a majority of acceptors, it sends an accept request to the acceptors with the value v, where v is the value of the highest-numbered proposal among the responses (or any value of the proposer's choosing if the responses reported no accepted proposals)
- If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request with a number greater than n
Requiring a majority is important because any two majorities overlap: otherwise, under a network partition, two groups of nodes could each think they had reached consensus when they actually hadn’t. (A minimal sketch of the two phases follows below.)
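Here is a minimal, single-decree sketch of those two phases in Python (my own illustration; messaging, failures, retries, and learners are all omitted):

# Minimal single-decree Paxos sketch (illustration only).
class Acceptor:
    def __init__(self):
        self.promised_n = -1    # highest prepare request responded to
        self.accepted = None    # (n, value) of highest-numbered accepted proposal

    def prepare(self, n):
        # Phase 1b: promise not to accept proposals numbered < n, and report
        # the highest-numbered proposal accepted so far (if any).
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered prepare has been promised.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return "accepted"
        return "rejected"

def propose(acceptors, n, my_value):
    majority = len(acceptors) // 2 + 1
    # Phase 1a: send prepare(n); need promises from a majority.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for status, acc in promises if status == "promise"]
    if len(granted) < majority:
        return None   # retry later with a higher n
    # Phase 2a: use the value of the highest-numbered accepted proposal seen
    # in the promises, or our own value if none was reported.
    prior = [acc for acc in granted if acc is not None]
    value = max(prior)[1] if prior else my_value
    accepts = [a.accept(n, value) for a in acceptors]
    return value if accepts.count("accepted") >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, my_value="A"))   # 'A' is chosen
print(propose(acceptors, n=2, my_value="B"))   # still 'A': the choice is stable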

The problem that this paper addresses is certainly real: in a distributed environment it is very problematic for machines to agree on values. The main point of the paper was to explain and prove the algorithm in a straightforward way, since earlier papers made the algorithm seem overly hard (though I think it’s hard to argue that the algorithm/explanation the paper presents is “simple” by any stretch of the imagination!).  I found the organization of the paper a bit unintuitive--the proof leads up to a description of the algorithm, so you don’t necessarily know beforehand where everything is leading (unless you already know the algorithm!). A brief description of the way Paxos works at the beginning would have been helpful. The system doesn’t seem to have any inherent tradeoffs (it solves the problem it sets out to solve), except for complexity. That said, this paper and Paxos itself have been very influential (as we see in the other papers), so there is no question of the paper’s relevance.

Monday, September 19, 2011

"Cluster-Based Scalable Network Services"


This paper addresses the issue of scalability in network services, highlighting three fundamental advantages of using clusters:  scalability, availability, and cost-effectiveness. Scalability means an incremental and linear increase in hardware can maintain constant service performance as load increases. Availability means that the service must be available 24x7, regardless of failures (which are masked by software). Cost-effectiveness means that the service must be economical to scale (emphasis on commodity hardware). However, clusters also present challenges for administration, component vs. system replication (matching hardware to software scaling), partial failures, and lack of shared state among cluster nodes.

The paper proposes a layered architecture based on separating content from services, using a service-programming model based on workers that perform Transformation, Aggregation, Caching, and Customization (TACC) of content on top of a Scalable Network Service (SNS) layer. It then identifies the key technique of BASE (Basically Available, Soft State, Eventual Consistency), which allows partial failure in clusters with less complexity and cost than ACID. Many web services don't need the strong consistency or durability of data provided by ACID and instead place more value on high availability. By not requiring durability and consistency across partial failures, BASE reduces (or postpones) communication and disk activity, increasing availability.

The SNS layer provides incremental/absolute scalability of worker nodes (which are simple and stateless), worker load balancing by the manager, front-end availability/fault tolerance, and system monitoring. On top of this, the TACC layer provides an API for transformation of data objects, aggregation (combining data from various objects), customization, and caching.
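A hypothetical toy sketch of the TACC idea in Python (the function names and pipeline are mine, not the paper's API): simple, stateless worker functions are composed into a pipeline, and the cache holds soft state that can always be regenerated if a node is lost.

# Hypothetical sketch of TACC-style workers composed by an SNS-like front end.
cache = {}   # soft state: losing it costs recomputation, not correctness

def transform(page):            # e.g. distill/compress one object
    return page.lower()

def aggregate(pages):           # combine data from several objects
    return " | ".join(pages)

def customize(content, user):   # per-user customization
    return f"[{user}] {content}"

def serve(urls, user, fetch):
    pages = []
    for url in urls:
        if url not in cache:
            cache[url] = transform(fetch(url))   # workers themselves hold no state
        pages.append(cache[url])
    return customize(aggregate(pages), user)

print(serve(["a", "b"], "alice", fetch=lambda url: f"CONTENT OF {url}"))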

The paper then looks at the implementation of two network services that use BASE (or BASE-like) techniques: TranSend, a scalable transformation and caching proxy for Berkeley dialup users, and HotBot, the Inktomi search engine. In TranSend, web content and other application data exploited BASE semantics: load-balancing data was allowed to be stale (with timeouts to recover from extreme cases), transformed data was cached (i.e. soft state), and approximate objects were used by the distillers (which transform/aggregate content). However, some data (user profiles) did require the guarantees provided by an ACID-compliant database.

Some of the ideas in the paper are specific to the problems the authors were most concerned with at the time and have since become less relevant (e.g. caching and distillation of web content, etc.). Nevertheless, many of the general ideas presented in this paper have been very influential (or predictive of future trends). The layered architecture is very similar to (if narrower than) the WSCs described in the Google paper. Likewise, the use of commodity servers is a trend that has only increased in popularity (with the servers used becoming lower and lower-end). The BASE techniques and a general shift away from ACID (where immediate consistency, etc. is not absolutely necessary) have also proven influential--quite logically leading to Brewer's CAP theorem--leaving a significant impact on cloud computing.