Wednesday, September 28, 2011

"Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks"

Dryad addresses the problem of executing distributed data-parallel applications by using a directed acyclic graph (DAG), modeling computation as vertices and communication/data transfer as edges. The choice of this model is motivated by the desire to provide a simple(r) programming model for distributed applications while still offering reliability, efficiency, and scalability from multi-core computers up to clusters. The graph model gives developers much more control over data flow than other parallel programming environments (e.g. MapReduce or GPU shader languages) allow.
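To make the vertices-and-edges idea concrete, here is a toy sketch in Python (Dryad's actual interface is a C++ graph-composition API; all the names below are my own, hypothetical ones). It only shows the shape of the model: vertices hold computation, edges carry data between them.

```python
# A minimal, illustrative DAG of computation vertices and data edges.
# This stands in for Dryad's real graph API and scheduler, which are far richer.

from collections import defaultdict

class Graph:
    def __init__(self):
        self.vertices = {}              # name -> function(list of inputs) -> output
        self.edges = defaultdict(list)  # producer name -> list of consumer names

    def vertex(self, name, fn):
        self.vertices[name] = fn

    def edge(self, src, dst):
        self.edges[src].append(dst)

    def run(self, inputs):
        # Tiny topological execution: assumes vertices were added in dependency
        # order, which stands in for Dryad's real job manager/scheduler.
        results, pending = {}, defaultdict(list)
        for name, fn in self.vertices.items():
            args = pending[name] if pending[name] else [inputs.get(name)]
            results[name] = fn(args)
            for consumer in self.edges[name]:
                pending[consumer].append(results[name])
        return results

g = Graph()
g.vertex("read_a", lambda xs: [1, 2, 3])
g.vertex("read_b", lambda xs: [4, 5, 6])
g.vertex("join",   lambda xs: sum(xs[0]) + sum(xs[1]))  # consumes both inputs
g.edge("read_a", "join")
g.edge("read_b", "join")
print(g.run({}))   # {'read_a': [1, 2, 3], 'read_b': [4, 5, 6], 'join': 21}
```

The point of the flexibility argument is visible even in this toy: the "join" vertex has two incoming edges, a data-flow shape that does not map cleanly onto a single map/reduce pair.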

The main tradeoff that I see with Dryad is that it is considerably more complex than MapReduce, but the DAG approach also offers more flexibility. Programmers must first plan out their various subroutines and generate a data-flow graph that corresponds to their program, which is considerably more work than what MapReduce requires. At the same time, however, this design makes Dryad more general-purpose than MapReduce, allowing a variety of services to be built on top of the system (like DryadLINQ, which can then provide more "friendly" development interfaces). Still, the main takeaway I had from the paper was that there is a large tradeoff between complexity and flexibility inherent in this and other parallel execution engines.

It remains to be seen whether or not Dryad will remain influential over the next decade. Microsoft has followed through with a number of services built on top of Dryad (DryadLINQ, Scope, etc.). However, the fact that this is all enterprise-level / closed-source (?) makes it hard to envision these systems becoming popular with startups and the open source community. Performance-wise, the paper suggests that Dryad scales well, which looks promising. The flexibility vs. complexity tradeoff seems significant, but I think we will see systems like Dryad being used for those applications that cannot be expressed very well in MapReduce terms. In practice, it seems likely that Hadoop will be extended to encompass some of the flexibility that Dryad provides.

"MapReduce: Simplified Data Processing on Large Clusters"

MapReduce provides a programming model that simplifies the distributed processing and generation of large data sets. The main idea is to split programs into a map phase (a function that processes a set of key/value pairs to generate another set of key/value pairs) and a reduce phase (a function that merges the values associated with each key). This abstraction lets programmers focus on the tasks they are trying to solve without having to deal too much with the complexities of distributed computation: launching programs on many nodes, keeping track of nodes, handling failures, etc. are all handled by the framework itself. The paper shows that many common tasks can be parallelized in this fashion, including counts, grep, sorts, indexing, and so on. While I/O is a limiting factor for many MapReduce tasks (loading huge data sets takes a long time!), the master is smart about assigning workers to data that is local (or nearby) to reduce transfer time. Fault tolerance is achieved by re-executing the tasks of failed workers, and stragglers -- workers that are taking too long and holding up the reduce phase for everyone -- are handled by scheduling backup executions of their tasks on other machines.
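The canonical example is word count. Here's a toy, single-process sketch in Python of the programming model (the real framework shards the input, runs many mappers and reducers in parallel, and handles the shuffle, scheduling, and fault tolerance for you):

```python
from collections import defaultdict

# Toy, single-process illustration of the map/reduce programming model.
# The real system partitions the input, runs mappers and reducers on many
# machines, and groups intermediate keys via a distributed shuffle.

def map_fn(document):
    # Emit (word, 1) for every word in the input record.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Merge all values associated with a given key.
    return (word, sum(counts))

def run_mapreduce(documents):
    intermediate = defaultdict(list)
    for doc in documents:                  # "map phase"
        for key, value in map_fn(doc):
            intermediate[key].append(value)
    return [reduce_fn(k, vs) for k, vs in intermediate.items()]  # "reduce phase"

print(run_mapreduce(["the quick brown fox", "the lazy dog"]))
# [('the', 2), ('quick', 1), ('brown', 1), ('fox', 1), ('lazy', 1), ('dog', 1)]
```

The appeal is exactly that the programmer writes only map_fn and reduce_fn; everything else in run_mapreduce is what the framework replaces with distributed machinery.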

While MapReduce is not appropriate for all algorithms, the model grew out of the observation that many parallel tasks (at least at Google) could be easily converted into map/reduce-style programs. Of course, there are tradeoffs in this approach, since some algorithms cannot be expressed well in this functional style. Furthermore, the ease of programming in MapReduce might make programmers prefer this model even when it is not really suitable (a tradeoff between ease of programming and the optimal solution for a given problem).

The experience with MapReduce at Google and the popularity of MapReduce-inspired implementations like Hadoop testify to the impact that this paper (and model) had on computing (especially cloud/distributed computing). I think that this model will certainly continue to influence research over the next decade, given the increasing amounts of data that need to be processed and the ease with which MapReduce accomplishes this task. 

Monday, September 26, 2011

“Megastore: Providing Scalable, Highly Available Storage for Interactive Services”

This paper describes a system developed at Google to provide scalable storage that achieves high reliability through synchronous replication and provides ACID semantics over partitions of the data. The need for this arose because NoSQL datastores, while highly scalable, have limited APIs and loose consistency models that are not appropriate for all types of applications, while traditional databases offer rich semantics but make consistent replication across datacenters difficult.

Megastore accomplishes this by partitioning the datastore and replicating each partition separately, providing ACID compliance within partitions, but only limited consistency guarantees across them, on the assumption that most Internet service data can be partitioned by user, etc. Write-ahead logging is used for transactions and concurrency control. Replication uses a low-latency implementation of Paxos that is optimized for reads, such that local reads can be performed on any up-to-date replica. 
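As a rough illustration of the programming model (Python pseudocode of my own, not Megastore's actual API or schema language), the key idea is that a transaction is scoped to a single entity group and committed by appending to that group's replicated write-ahead log:

```python
# Hypothetical sketch of "ACID within an entity group, weaker across groups".
# Each entity group has its own write-ahead log; in Megastore, appending to
# that log is agreed on via Paxos across replicas (elided here).

class EntityGroup:
    def __init__(self):
        self.log = []      # replicated write-ahead log (one consensus round per slot)
        self.data = {}

    def transaction(self, updates):
        position = len(self.log)     # log position would be decided by Paxos
        self.log.append(updates)
        self.data.update(updates)    # apply the mutations after the log append
        return position

users = {"alice": EntityGroup(), "bob": EntityGroup()}

# Consistent within one group...
users["alice"].transaction({"photo:1": "beach.jpg"})

# ...but an operation touching two groups gets no single ACID transaction;
# Megastore would fall back to asynchronous messaging or two-phase commit.
users["bob"].transaction({"inbox": "alice shared a photo"})
```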

Synchronous replication imposes some tradeoffs with the Paxos implementation: the write rate per entity group has to stay low to avoid high conflict rates. In general this isn’t a problem for Google, since write throughput can be scaled by sharding data into finer-grained entity groups, but it still seems like a significant inconvenience.

I thought the combination of NoSQL scalability and traditional DBMS guarantees was innovative, and it seems to provide good performance for many Internet service-type applications, where data can easily be partitioned into entities (which need to be internally consistent). The emphasis on making data local to speed up reads seems particularly important, since most of these applications are read-heavy. I’m not sure that this paper or Megastore itself will remain influential over the next decade, but I do think we’ll see more of this kind of merging of NoSQL and traditional DBMS features as Internet app data continues to skyrocket.

"The Google File System"

This paper describes a scalable distributed file system designed for large distributed data-intensive applications. It shares the goals of previous DFSs, including performance, scalability, reliability, and availability, but it also incorporates newer workload trends: component failure is expected (commodity hardware), files are huge, files are modified primarily by appends, and co-designing the FS API and the applications increases flexibility. GFS provides the usual file operations, though not through a standard API, and adds snapshot and record-append operations. A GFS cluster has a single master and multiple chunkservers, accessed by multiple clients. Files are divided into 64MB chunks, and the metadata is stored on the master. Clients contact the master, are directed to the appropriate chunkserver, and then interact directly with that server. GFS has a relaxed consistency model, so applications rely on appends rather than overwrites, use checkpointing, and write self-validating, self-identifying records. The system favors high throughput over low latency.
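A hedged sketch of the read path as I understand it (Python, with made-up names and structures): the client asks the master only for metadata, then streams the bytes directly from a chunkserver, which is what keeps the single master from becoming a data bottleneck.

```python
# Illustrative-only sketch of a GFS-style read: metadata from the master,
# data from a chunkserver. All names here are hypothetical.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks

class Master:
    def __init__(self, chunk_locations):
        # file name -> list of (chunk_handle, [replica addresses]) per chunk index
        self.chunk_locations = chunk_locations

    def lookup(self, filename, offset):
        index = offset // CHUNK_SIZE                  # which chunk holds this offset
        return self.chunk_locations[filename][index]  # (handle, replicas)

def read(master, chunkservers, filename, offset, length):
    handle, replicas = master.lookup(filename, offset)   # metadata only
    server = chunkservers[replicas[0]]                    # pick a (nearby) replica
    start = offset % CHUNK_SIZE
    return server[handle][start:start + length]           # bytes come from the chunkserver

chunkservers = {"cs1": {"h42": b"hello, gfs!"}}
master = Master({"/logs/web.log": [("h42", ["cs1"])]})
print(read(master, chunkservers, "/logs/web.log", 0, 5))   # b'hello'
```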



This file system was obviously designed to meet Google’s somewhat specific goals and workload characteristics. Many of the particular constraints (e.g. append-only files) don’t necessarily meet the needs of many applications. However, this solution is still highly relevant, since it is simple to scale and good for MapReduce-type application data (as evidenced by the popularity of the Hadoop/HDFS combo). GFS was novel in its focus on tolerance of commodity hardware (and failures), its focus on large files, and the more relaxed consistency guarantees (trade-offs between simplicity, consistency & performance). As the experience at Google has demonstrated, the design works and works well (for their needs). I think this will certainly be relevant over the next decade, although modifications will no doubt be made to tailor it to other tasks. The single-master approach seems to have limitations (as we heard in the Cloudera lecture), so we might see modifications to this.

Wednesday, September 21, 2011

“Paxos Made Practical”


This paper makes the point that Paxos is difficult to implement, even though the concept is not all that complex in theory. Many systems claim to use Paxos, but their precise implementation details are often left undescribed. The Viewstamped Replication paper makes an effort to explain how to use Paxos in the real world, but it relies on distributed transactions (which add more complexity than all applications necessarily require) and doesn’t really address group membership changes.

The paper then goes on to describe the protocol in terms of a concrete C++ interface and state-machine replication (a deterministic service that accepts requests and produces replies). The level of implementation detail is very thorough and certainly sets the paper apart from the work on Paxos that had been published previously. As far as trade-offs, the implementation can be made more efficient by broadcasting requests to the backups. Likewise, there are hardware tradeoffs: the system requires at least three full replicas to survive a failure, so that the two remaining ones still constitute a majority (in general, 2f+1 replicas are needed to tolerate f failures). The author proposes that the third machine could instead act as a “witness” that participates only to help replicas recover after a failure or partition.

The potential influence of this paper is hard to gauge: given the specificity of the implementation, the paper should always be helpful to people trying to implement Paxos, but I’m not sure that it presents any real new ideas that will sway the direction of further research on Paxos and consensus protocols.

“Paxos Made Simple”


Paxos addresses the issue of consensus among a group of actors (getting them all to accept the same value). The algorithm is roughly as follows:
- There are three basic roles for agents: proposers, acceptors, and learners

There are two phases: a proposal (prepare) phase and an acceptance phase

Proposal phase:
- First, a proposer selects a proposal number n and sends a prepare request with that number to a majority of acceptors
- If an acceptor receives a prepare request with a number n greater than that of any request it has already responded to, it responds with a promise not to accept any more proposals numbered less than n, along with the highest-numbered proposal (if any) it has already accepted

Acceptance phase:
- If the proposer receives responses to its prepare request from a majority of acceptors, it sends an accept request to the acceptors with a value v, where v is the value of the highest-numbered proposal among the responses (or any value the proposer chooses, if no proposals were reported)
- If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request with a number greater than n
It’s important that a majority of nodes accept, because otherwise, under a network partition, two groups of nodes could each think they had reached consensus when they actually hadn’t. (A small sketch of this logic appears below.)
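Here is a small single-decree sketch in Python of the proposer/acceptor rules above. It is my own simplification (everything runs locally, with no networking, failures, learners, or leader election), so treat it as an illustration of the safety argument rather than a faithful implementation.

```python
# Single-decree Paxos sketch (my own simplification: all local, no failures).

class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest prepare number promised
        self.accepted = None        # (n, value) of highest-numbered accepted proposal

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        if n >= self.promised_n:    # has not promised a higher-numbered proposal
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    # Phase 1: prepare, wait for a majority of promises.
    promises = [a.prepare(n) for a in acceptors]
    granted = [acc for verdict, acc in promises if verdict == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None                                   # no majority, give up
    # If any acceptor already accepted a value, adopt the highest-numbered one.
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]
    # Phase 2: send accept requests; succeed only with a majority of acks.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="blue"))   # 'blue'
print(propose(acceptors, n=2, value="green"))  # still 'blue': the chosen value sticks
```

The second call is the interesting one: because the new proposer must first ask a majority what they have already accepted, it is forced to re-propose "blue" rather than "green", which is exactly how the algorithm keeps different rounds from choosing different values.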

The problem that this paper addresses is certainly real: in a distributed environment it is very problematic for machines to agree on values. The main point of the paper was to explain and prove the algorithm in a straightforward way, since earlier papers made the algorithm seem overly hard (though I think it’s hard to argue that the algorithm/explanation the paper presents is “simple” by any stretch of the imagination!).  I found the organization of the paper a bit unintuitive--the proof leads up to a description of the algorithm, so you don’t necessarily know beforehand where everything is leading (unless you already know the algorithm!). A brief description of the way Paxos works at the beginning would have been helpful. The system doesn’t seem to have any inherent tradeoffs (it solves the problem it sets out to solve), except for complexity. That said, this paper and Paxos itself have been very influential (as we see in the other papers), so there is no question of the paper’s relevance.

Monday, September 19, 2011

"Cluster-Based Scalable Network Services"


This paper addresses the issue of scalability in network services, highlighting three fundamental advantages of using clusters:  scalability, availability, and cost-effectiveness. Scalability means an incremental and linear increase in hardware can maintain constant service performance as load increases. Availability means that the service must be available 24x7, regardless of failures (which are masked by software). Cost-effectiveness means that the service must be economical to scale (emphasis on commodity hardware). However, clusters also present challenges for administration, component vs. system replication (matching hardware to software scaling), partial failures, and lack of shared state among cluster nodes.

The paper proposes a layered architecture based on the separation of content from services, using a service-programming model based on workers that perform Transformation, Aggregation, Caching, and Customization (TACC) of content on top of a Scalable Network Service (SNS) layer. It then identifies BASE (Basically Available, Soft State, Eventual Consistency) semantics, which allow partial failure in clusters with less complexity and cost than ACID. Many web services don't need the strong consistency or durability guarantees of ACID and instead place more value on high availability. By not requiring durability and consistency across partial failures, BASE reduces (or lets the system postpone) communication and disk activity, increasing availability.

The SNS layer provides incremental/absolute scalability of worker nodes (which are simple and stateless), worker load balancing by the manager, front-end availability/fault tolerance, and system monitoring. On top of this, the TACC layer provides an API for transformation of data objects, aggregation (combining data from various objects), customization, and caching.
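As a rough Python sketch of that programming model (my own illustration, not the paper's API): because TACC workers are stateless functions over content, the SNS layer can replicate and restart them freely, and a request is just a composition of workers.

```python
# Hypothetical illustration of TACC-style composition: each worker is a
# stateless transformation, so the SNS layer can run any number of copies
# and simply retry a request if a worker dies.

def transform_shrink(content, profile):
    # e.g. a distiller that degrades content for a slow dialup link
    return content[: profile["max_bytes"]]

def aggregate(parts, profile):
    return b"".join(parts)

def customize(content, profile):
    return profile["banner"] + content

def handle_request(sources, profile):
    parts = [transform_shrink(src, profile) for src in sources]   # Transformation
    return customize(aggregate(parts, profile), profile)          # Aggregation + Customization

profile = {"max_bytes": 8, "banner": b"[TranSend] "}
print(handle_request([b"<html>....lots of content....</html>"], profile))
# b'[TranSend] <html>..'
```

(Caching of the transformed results -- the soft state -- would sit in front of handle_request; it is omitted here.)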

The paper then looks at the implementation of two network services built on BASE (or BASE-like techniques): TranSend, a scalable transformation and caching proxy for Berkeley dialup users, and HotBot, the Inktomi search engine. In TranSend, web content and other application data exploited BASE semantics: load-balancing data was allowed to be stale (with timeouts to recover from extreme cases), transformed content was cached (i.e. soft state), and the distillers (which transform/aggregate content) produced approximate objects. However, some data -- user profiles -- did require the guarantees of an ACID-compliant database.

Some of the ideas in the paper are specific to the problems the authors were most concerned with at the time and have become less relevant (e.g. caching and distillation of web content). Nevertheless, many of the general ideas presented in this paper have been very influential (or predictive of future trends). The layered architecture is very similar to (if narrower than) the WSCs described in the Google paper. Likewise, the use of commodity servers is a trend that has only grown in popularity (with servers becoming more and more low-end). The BASE techniques and the general shift away from ACID (where immediate consistency is not absolutely necessary) have also proven influential -- quite logically leading to Brewer's CAP theorem -- and have left a significant impact on cloud computing.

"Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services"


The CAP theorem states that it is impossible for an Internet service to simultaneously provide consistency, availability, and partition tolerance; at most two of the three can be guaranteed. By consistency, Brewer meant atomic data objects, such that each operation appears to complete at a single instant (e.g. any read that occurs after a write returns the newest written value). Availability is the guarantee that every request received by a non-failing node eventually results in a response. Partition tolerance means that the system keeps satisfying these guarantees even when the network loses arbitrarily many messages sent from one node to another.

The proof of this theorem uses an asynchronous network model, where there are no clocks and nodes make decisions solely based on the messages they receive. First, the authors prove that in an asynchronous network it is impossible to provide a data object that guarantees both availability and atomic consistency when messages can be lost, since a read could return stale data written before an update whose messages were dropped. In synchronous (or partially synchronous) network models, a weaker form of consistency can be achieved using timeouts. In general, most web services have to settle for "most of the data, most of the time."
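A toy Python illustration of the intuition (mine, not the paper's formal proof): if the network drops the replication message and both sides keep answering requests, a reader on the other side of the partition sees stale data; the only way to stay atomic would be to refuse to answer, i.e. give up availability.

```python
# Two replicas of a single register, with a partition that drops all
# messages between them. Purely illustrative.

class Replica:
    def __init__(self):
        self.value = "v0"

    def write(self, value, peers, network_up):
        self.value = value
        if network_up:
            for p in peers:
                p.value = value       # replication message delivered
        # if the network is partitioned, the update is silently lost

    def read(self):
        return self.value             # always answers (stays "available")

a, b = Replica(), Replica()
a.write("v1", peers=[b], network_up=False)   # partition: b never hears about v1
print(a.read(), b.read())                    # v1 v0  -> not atomically consistent
```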

This theorem is highly relevant today and is becoming more and more important. Cloud computing and Internet services increasingly rely on distributed systems in which networks are not entirely reliable (even more so as applications begin to span several datacenters). The work done in the paper was new (i.e. the theorem had not previously been proved), but I'm not sure that a formal proof was necessary for people to understand that only two of the three properties can hold at once. For me, the most interesting part of the paper was the realization that services need to decide what sort of failure is acceptable (i.e. which of the CAP properties to relax) -- with web applications often valuing availability over consistency. These tradeoffs are fundamental decisions that application designers need to acknowledge and keep in mind.

Wednesday, September 14, 2011

"The Datacenter Needs an Operating System"

This paper encourages the development of an operating system for data centers. Taking fundamental OS concepts as applied to data centers, this boils down to a system that would manage resource sharing, data sharing, programming abstractions, and global debugging/monitoring tools. The argument is motivated by the observation that applications that run in data centers are now more general-purpose than they originally were, and the current tools are not necessarily well designed for general use (e.g. MapReduce provides resource sharing, but only among MR jobs -- what happens to applications that don't translate to MR easily?).

Resource sharing among diverse applications requires finer granularity than current systems provide (most cluster applications are written as standalone programs that assume exclusive use of their [sometimes virtual] hosts). Other resource-sharing questions include network/bandwidth sharing, service/application dependencies, scheduling, and virtualization.

Data sharing in DCs is usually accomplished via a distributed file system, but this still entails a lot of overhead for loading data (e.g. into MapReduce jobs). Resilient distributed datasets "remember" the transformations used to create them, which makes it cheap to re-run tasks -- and they can be kept in memory for extra speed. Other questions for the data-sharing component of a DC OS include a standardized interface for storage, streaming data, and performance isolation (e.g. live vs. non-live data).
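A minimal sketch of the lineage idea (my own Python, not Spark's API): a dataset records the transformation that produced it, so a lost or evicted partition can be recomputed from its parent instead of being reloaded from the distributed FS.

```python
# Minimal illustration of "resilient distributed dataset" lineage:
# each dataset remembers how it was derived, so it can be recomputed
# on demand (e.g. after losing its in-memory copy) instead of reread from disk.

class Dataset:
    def __init__(self, compute, parent=None):
        self.compute = compute        # function producing this dataset's records
        self.parent = parent          # lineage pointer
        self.cache = None             # optional in-memory materialization

    def map(self, fn):
        return Dataset(lambda: [fn(x) for x in self.collect()], parent=self)

    def collect(self):
        if self.cache is None:
            self.cache = self.compute()   # recompute from lineage if not cached
        return self.cache

    def evict(self):
        self.cache = None             # e.g. a node failed or memory was reclaimed

base = Dataset(lambda: list(range(5)))
squares = base.map(lambda x: x * x)
print(squares.collect())   # [0, 1, 4, 9, 16]
squares.evict()
print(squares.collect())   # recomputed from lineage: [0, 1, 4, 9, 16]
```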

New programming abstractions could help with developing applications more quickly and effectively: APIs for launching and monitoring tasks, standardized communication patterns, and fault-tolerant data structures. Debugging a DC application is geared toward both correctness and performance analysis, since incorrect outputs can result from performance issues (e.g., with the underlying distributed software) as well as from coding problems.

I think this paper does a good job of taking several of the core concepts of operating systems and applying them to cloud computing and data center applications. There doesn't seem to be much question that all of these issues are increasingly relevant to datacenter computing, and I liked seeing the parallels with classic operating systems, which suggest that much of what we already know can be applied to this new DC paradigm. With the rise of cloud computing, these issues will only become more relevant over the next decade. The idea of a unified system that addresses all of these issues is attractive, but I wonder whether it will really happen in practice, since many solutions for each of these issues (albeit not necessarily optimal ones) are already in place and it could be difficult to replace them.

Monday, September 12, 2011

“Data-Level Parallelism in Vector, SIMD, and GPU Architectures”

This chapter from Hennessy & Patterson offers a detailed overview of single instruction multiple data (SIMD) architectures, comparing two main types: graphics processing units (GPUs) and vector processors (e.g. Cray).

GPUs are already widespread thanks to graphics workloads, which greatly helps push their adoption for general-purpose computing. However, CUDA/OpenCL have significant learning curves. Not only is the jargon an issue, as the book points out (with helpful -- and somewhat humorous -- translation tables), but learning to program GPUs effectively is also non-trivial. They have many different layers of memory with different strengths and weaknesses (latency, scope of sharing, etc.), and choosing the optimal combination can be difficult for the programmer. Transferring data from main CPU memory to the GPU's internal memory is also very costly and limits overall throughput. The individual cores are not very powerful either, although speed can be improved using faster (but less precise) math libraries. At the same time, the vast number of cores available on GPUs provides an excellent opportunity for data-parallel applications to run faster.

Vector processors, on the other hand, are somewhat more intuitive to program and effective at data parallelism. However, this extra programmer-friendliness requires more architectural complexity, making them more expensive. GPUs are already so widespread (due to their graphics applications) that it seems most likely that they will be the dominant trend going forward. As far as their application to cloud computing, it seems likely that some combination of CPUs and GPUs will make its way into the servers in WSCs. The large number of cores makes them great for MapReduce-style and other data-parallel applications. Nevertheless, in addition to the complexity of programming, virtualization and scheduling issues remain obstacles to widespread adoption of GPUs in cloud computing.

"Amdahl's Law in the Multicore Era"


This short article analyzes Amdahl’s Law in the context of multicore processors. Amdahl’s Law says that the achievable speedup is limited by the fraction of the program that can be parallelized. Originally, the law was used to emphasize the importance of producing more powerful single-core chips, since parallel performance would always be limited by the sequential part of the code. However, massively parallel machines now allow computations on data that would previously have been considered impractical, greatly increasing the need for further development of parallel architectures.
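For reference, the law itself: if a fraction f of the work can be parallelized across n cores, the speedup is 1 / ((1 - f) + f/n). A quick Python check shows how fast the sequential fraction comes to dominate:

```python
# Amdahl's Law: speedup on n cores when a fraction f of the work parallelizes.
def speedup(f, n):
    return 1.0 / ((1.0 - f) + f / n)

for f in (0.5, 0.9, 0.99):
    print(f, round(speedup(f, 256), 1))
# 0.5  -> ~2.0   (half the program is sequential, so 256 cores barely help)
# 0.9  -> ~9.7
# 0.99 -> ~72.1
```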

To determine the law’s applicability to multicore processors, the authors examined three different types of multicore processors: symmetric, asymmetric, and dynamic. Symmetric multicore chips have cores that all use equal fractions of chip resources. Asymmetric chips have a single (or several) core that is more powerful than all the others. Dynamic chips could potentially allow cores to join forces for sequential portions or operate independently for parallel portions. The authors show that asymmetric and dynamic models offer the greatest potential for speedup (with dynamic even greater than asymmetric, due to its flexibility).
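Concretely, the paper's cost model gives each core some number of base-core-equivalent (BCE) resources r, with single-core performance perf(r) commonly approximated as sqrt(r). The formulas below are reconstructed from memory rather than copied from the article, so treat the exact expressions as an approximation of the paper's models; the qualitative ordering is the point.

```python
import math

# Reconstructed-from-memory versions of the symmetric/asymmetric/dynamic
# speedup models: n = total BCEs on the chip, r = BCEs spent on the big
# core(s), f = parallelizable fraction, perf(r) ~ sqrt(r).

def perf(r):
    return math.sqrt(r)

def symmetric(f, n, r):
    # n/r identical cores, each built from r BCEs
    return 1 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric(f, n, r):
    # one big core of size r plus (n - r) base cores
    return 1 / ((1 - f) / perf(r) + f / (perf(r) + n - r))

def dynamic(f, n, r):
    # big core for the sequential part, all n BCEs for the parallel part
    return 1 / ((1 - f) / perf(r) + f / n)

f, n, r = 0.975, 256, 16
print(round(symmetric(f, n, r)), round(asymmetric(f, n, r)), round(dynamic(f, n, r)))
# ~47 vs ~98 vs ~99: asymmetric beats symmetric, and dynamic does best of all
```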

This article’s approach is certainly significant and relevant, as the number of cores in multicore processors continues to increase. Likewise, asymmetric multicore processors (like Cell -- or the combination of CPUs and GPUs) offer opportunities for improving both sequential and parallelizable portions of programs. It seems plausible that hundreds or thousands of GPU-style cores could begin to complement CPUs in cloud computing, though there are drawbacks including the complexity of writing the programs to exploit both the CPU for sequential and GPU for parallel portions of code (in addition to scheduling and virtualization issues).

“Performance Modeling and Analysis of Flash-based Storage Devices”

As SSDs become more popular and less expensive, their widespread adoption necessitates a way to compare their performance both against one another and against traditional hard disk drives. This article uses a black-box model to evaluate SSD performance, with enough granularity to account for the particularities of SSDs -- especially slow writes (due to the slowness of erases) and slow in-place updates (which require a copy, an erase, and a rewrite).
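To see why in-place updates are so costly, here is a back-of-the-envelope Python model. The latencies and block geometry are illustrative placeholders of my own, not measurements from the paper; the point is only the relative magnitude.

```python
# Back-of-the-envelope model of an SSD update that triggers a block erase.
# All numbers below are made-up but representative placeholders.

READ_PAGE_US   = 25      # read one page
WRITE_PAGE_US  = 200     # program one page
ERASE_BLOCK_US = 1500    # erase a whole block
PAGES_PER_BLOCK = 64

def in_place_update_cost_us():
    # naive update: read the whole block, erase it, write every page back
    return (PAGES_PER_BLOCK * READ_PAGE_US
            + ERASE_BLOCK_US
            + PAGES_PER_BLOCK * WRITE_PAGE_US)

print(WRITE_PAGE_US, "us to write a clean page")
print(in_place_update_cost_us(), "us if the update forces a block erase")
# ~200 us vs ~15900 us: why flash translation layers avoid in-place updates
```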


The black-box model proposed by the authors seems reasonable, especially given the differing internal structure of SSDs (and the lack of knowledge thereof due to IP). As SSDs become the norm, there will need to be a way to differentiate between them and establish their relative performance. As the article shows, SSDs have two main shortcomings: slow erases (and thus writes) and the wearing out of the cells themselves, especially under write-intensive workloads. However, I think that instead of comparing the SSDs in this article to a 5400 rpm HDD, it would have been helpful to see how they compare to a high-end disk (15k rpm) and/or an array of HDDs in RAID. This would help identify which higher-end hardware options provide better performance (i.e., whether SSDs outperform fast disks, which they probably do, and by how much/on which operations).


That said, SSDs as they are now (or as they were when the article was written) seem like an especially good choice for accelerating applications (e.g. as a cheap substitute for a RAM-based cache) where durable storage is not really required. Likewise, read-intensive applications would benefit tremendously from SSD adoption. As SSDs improve, the wear issue might become less relevant. However, the highest barrier to widespread adoption currently seems to be price. As prices decrease, higher rates of SSD failure might simply not matter if SSDs are used in the distributed and redundant file/storage systems characteristic of cloud computing. That said, phase-change memory (PCM) will probably eventually overtake SSDs, given its higher endurance.

Wednesday, September 7, 2011

“The Datacenter as a Computer” (Chapters 3, 4, 7)


Chapter 3 focuses on hardware choices for WSCs, primarily the trade-off between using high-end servers (e.g. HP Integrity Superdome) and lower-end servers (e.g. HP ProLiant). While high-end servers have excellent internal communication speeds, as soon as the data sets get too large for a single server, network latency begins to severely limit the benefits of a cluster of high-end servers relative to their cost (especially for parallel tasks with high levels of inter-node communication). This motivates the prominence of commodity servers in datacenters, since their lower cost outweighs the small marginal benefit of using large servers. One major trade-off, however, is that software may need to be optimized to tolerate the higher request latencies caused by slower CPUs (among other deficiencies of low-end servers, e.g. lower memory and disk capacity).

Chapter 4 details the power supply and cooling systems for datacenters. Datacenters can be classified according to their level of redundancy, with most large commercial datacenters providing multiple power and cooling distribution paths (though not necessarily more than one active path). Power systems usually include an uninterruptible power supply (UPS) that feeds power distribution units on the datacenter floor. For cooling, racks sit on a raised floor and are cooled by circulated air, fluid-based cooling towers, and/or in-rack coolers. Datacenters choose their level of redundancy based on the amount of uptime they need to be able to provide.

Chapter 7 provides a detailed discussion of managing hardware failures in WSCs. The huge amount of hardware required for a WSC ensures that hardware will fail -- the question is how to tune the balance between the cost of failure (and dealing with it) and the cost of preventing the failure. WSCs should implement a fault-tolerant software infrastructure layer to reduce the cost of making the applications themselves tolerant to hardware failures. This also makes it easier to perform upgrades (since servers don’t always need to be running) and repair failed hardware. Categorizing service faults by severity allows one to determine which faults are tolerable (e.g. those that provide intermittent service degradation) and which ones must be prevented (e.g. unavailability of the entire service).  Interestingly, the experience at Google (as well as in other papers cited) suggests that most service disruption events are caused by human elements: configurations, software, and other human interference, while network and hardware failures are responsible for relatively fewer service interruptions.

Is the problem real?

All three of the issues discussed -- the trade-off between low-cost and high-performance hardware, the trade-off between power/cooling reliability and cost, and the trade-off between failures and the cost of preventing/repairing them -- are critical topics for large datacenters (and, consequently, WSCs).

What is the solution's main idea (nugget)?

The overarching idea of these three chapters is that for each of these questions (hardware, power/cooling, faults) a balance must be reached between the benefits of an ideal set-up and the cost benefits of allowing for failures and managing them as best as possible (with as little service degradation as possible). This balance depends on the WSC and its applications, but in general it is more cost-efficient to write fault-tolerant software.

Why is solution different from previous work?

The ideas presented here differ from previous solutions primarily in their emphasis on tolerance of failure and latency. With data sets much larger than any one computer can handle, hardware failures are inevitable, and it makes sense to spend more time developing software that can tolerate faults (at the various layers).

Does the paper (or do you) identify any fundamental/hard trade-offs?

All three chapters are essentially about the fundamental trade-offs presented by each issue (hardware quality, power/cooling redundancy, hardware faults) and striking a balance between the costs and benefits of varying levels of hardware reliability: see discussions above.

Do you think the work will be influential in 10 years?

Many details of the hardware and infrastructure will no doubt change over the next decade. However, the questions of hardware choices (high-end vs. low-end), power/cooling reliability, energy usage, and fault tolerance--as well as finding the right balance between the trade-offs implied by each question--will no doubt remain relevant as datacenters (or even groups of datacenters) expand and require more sophisticated infrastructure to accommodate surging amounts of data.