Wednesday, September 7, 2011

“The Datacenter as a Computer” (Chapters 3, 4, 7)


Chapter 3 focuses on hardware choices for WSCs, primarily the trade-off between high-end servers (e.g. the HP Integrity Superdome) and lower-end servers (e.g. the HP ProLiant). While high-end servers have excellent internal communication speeds, as soon as data sets grow too large for a single server, network latency severely limits the benefit of a cluster of high-end servers relative to its cost (especially for parallel tasks with heavy inter-node communication). This motivates the prominence of commodity servers in datacenters, since their lower cost outweighs the small marginal benefit of large servers. One major trade-off, however, is that software may need to be written to tolerate the higher request latencies caused by slower CPUs (among other deficiencies of low-end servers, such as smaller memory and disk capacity).
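
As a rough back-of-envelope sketch of the latency argument (my own illustration, not a model from the book, with purely hypothetical latency numbers), the snippet below shows how quickly performance degrades once a fraction of data accesses has to cross the network instead of staying inside a single high-end node:

```python
# Rough sketch (hypothetical numbers): how the advantage of fast local access
# erodes as more accesses must cross the datacenter network.

LOCAL_NS = 100        # hypothetical latency of a local (in-node) access, in ns
REMOTE_NS = 100_000   # hypothetical latency of a remote (cross-node) access, in ns

def avg_access_ns(remote_fraction: float) -> float:
    """Average access latency when `remote_fraction` of accesses leave the node."""
    return (1 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS

for frac in (0.0, 0.01, 0.05, 0.20):
    slowdown = avg_access_ns(frac) / LOCAL_NS
    print(f"{frac:4.0%} remote accesses -> {slowdown:6.1f}x average slowdown vs. all-local")
```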

Chapter 4 details the power supply and cooling systems for datacenters. Datacenters can be classified by their level of redundancy, with most large commercial datacenters providing multiple power and cooling distribution paths (though not necessarily more than one active path). Power systems usually include an uninterruptible power supply (UPS) that feeds power distribution units on the datacenter floor. For cooling, racks sit on a raised floor and are cooled by some combination of air circulation, fluid-based cooling towers, and in-rack coolers. Datacenters choose their level of redundancy based on the amount of uptime they need to provide.
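
To make the redundancy question concrete, the small sketch below translates an availability target into the downtime budget it allows per year (the targets shown are illustrative examples, not the tier definitions from the chapter):

```python
# Quick sketch: annual downtime permitted by a given availability target.
# The targets below are illustrative, not official datacenter tier levels.

MINUTES_PER_YEAR = 365 * 24 * 60

def annual_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR

for target in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{target:.5f} availability -> {annual_downtime_minutes(target):8.1f} minutes of downtime/year")
```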

Chapter 7 provides a detailed discussion of managing hardware failures in WSCs. The sheer amount of hardware in a WSC guarantees that some of it will fail; the question is how to balance the cost of failures (and of dealing with them) against the cost of preventing them. WSCs should implement a fault-tolerant software infrastructure layer to reduce the cost of making the applications themselves tolerant of hardware failures. This also makes it easier to perform upgrades (since servers don't always need to be running) and to repair failed hardware. Categorizing service faults by severity allows one to determine which faults are tolerable (e.g. those that cause only intermittent service degradation) and which must be prevented (e.g. unavailability of the entire service). Interestingly, the experience at Google (as well as other papers cited) suggests that most service disruptions are caused by human factors: configuration errors, software problems, and other human interference, while network and hardware failures account for relatively few service interruptions.
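
A small sketch helps show why failures are a statistical certainty at WSC scale (the fleet size and failure rate below are hypothetical, not Google's numbers):

```python
# Minimal sketch (hypothetical fleet size and failure rate): at WSC scale,
# some hardware failure on any given day is close to a certainty, which is
# why a fault-tolerant software layer pays off.

N_SERVERS = 10_000           # hypothetical fleet size
ANNUAL_FAILURE_RATE = 0.04   # hypothetical 4% chance a given server fails in a year

daily_failure_prob = ANNUAL_FAILURE_RATE / 365          # per-server, per-day
p_no_failures = (1 - daily_failure_prob) ** N_SERVERS   # whole fleet survives the day
print(f"P(at least one server failure today) = {1 - p_no_failures:.1%}")
```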

Is the problem real?

All three of the issues discussed (the trade-off between low-cost and high-performance hardware, the trade-off between power/cooling reliability and cost, and the trade-off between tolerating failures and the cost of preventing or repairing them) are critical topics for large datacenters (and, consequently, WSCs).

What is the solution's main idea (nugget)?

The overarching idea of these three chapters is that for each of these questions (hardware, power/cooling, faults), a balance must be struck between the benefits of an ideal set-up and the cost savings of accepting some failures and managing them as well as possible (with as little service degradation as possible). Where that balance lies depends on the WSC and its applications, but in general it is more cost-efficient to write fault-tolerant software.
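
As a toy illustration of that cost argument (every dollar figure below is hypothetical and only meant to make the trade-off concrete), one can compare the expected annual cost of premium hardware against commodity hardware plus a fault-tolerant software layer:

```python
# Toy comparison with hypothetical costs: premium hardware vs. commodity
# hardware plus software fault tolerance. The point is the structure of the
# trade-off, not the specific numbers.

# Option A: premium hardware, fewer failures, little extra engineering effort.
premium_hw_cost = 5_000_000       # hypothetical annual hardware spend
premium_failure_cost = 100_000    # hypothetical annual cost of residual outages

# Option B: commodity hardware plus a fault-tolerant software layer.
commodity_hw_cost = 2_000_000     # hypothetical annual hardware spend
software_layer_cost = 1_000_000   # hypothetical engineering cost of fault tolerance
commodity_failure_cost = 300_000  # hypothetical annual cost of residual outages

cost_a = premium_hw_cost + premium_failure_cost
cost_b = commodity_hw_cost + software_layer_cost + commodity_failure_cost
print(f"premium hardware:     ${cost_a:,}")
print(f"commodity + software: ${cost_b:,}")
```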

Why is the solution different from previous work?

The ideas presented here differ from previous solutions primarily in their emphasis on tolerating failure and latency. With datasets much larger than any one computer can handle, hardware failures are inevitable, and it makes sense to spend more effort developing software that can tolerate faults (at various layers).

Does the paper (or do you) identify any fundamental/hard trade-offs?

All three chapters are essentially about the fundamental trade-offs posed by each issue (hardware quality, power/cooling redundancy, hardware faults) and about striking a balance between the costs and benefits of varying levels of hardware reliability; see the discussions above.

Do you think the work will be influential in 10 years?

Many details of the hardware and infrastructure will no doubt change over the next decade. However, the questions of hardware choice (high-end vs. low-end), power/cooling reliability, energy usage, and fault tolerance, along with finding the right balance among the trade-offs each question implies, will remain relevant as datacenters (or even groups of datacenters) expand and require more sophisticated infrastructure to accommodate surging amounts of data.
