Wednesday, September 14, 2011

"The Datacenter Needs an Operating System"

This paper argues for the development of an operating system for data centers. Applying fundamental OS concepts to data centers, this boils down to a system that would manage resource sharing, data sharing, programming abstractions, and global debugging/monitoring tools. The motivation is the observation that applications running in data centers are now more general-purpose than they originally were, and the tools currently available are not necessarily well designed for general use (e.g. MapReduce provides resource sharing, but only among MR jobs -- what happens to other applications that don't translate to MR that easily?).

Resource sharing among diverse applications requires finer granularity than is currently available (most cluster apps are written as standalone programs that require exclusive use of their [sometimes virtual] hosts). Other resource sharing questions include network/bandwidth sharing, service/application dependencies, scheduling, and virtualization.

Data sharing in DCs is usually accomplished via a distributed file system, but this still entails a lot of overhead for loading data (e.g. into MapReduce jobs). Resilient distributed datasets "remember" the transformations used to create them, which makes it easy to re-run tasks and rebuild lost data -- and they can be kept in memory for extra speed. Other questions for the data sharing component of a DC OS include a standardized interface for storage, streaming data, and performance isolation (e.g. live vs. non-live data).
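
To make the "remembering transformations" idea concrete, here is a toy Python sketch of lineage-based recomputation. It is only an illustration under my own assumptions: the `LineageDataset` class and its `map`, `filter`, `collect`, and `evict` methods are hypothetical names, not the API from the paper or from any real system, and everything runs in one process rather than across a cluster.

```python
from typing import Callable, List, Optional


class LineageDataset:
    """Toy dataset that records how it was derived (its lineage).

    Hypothetical illustration only; not the API of any real system.
    """

    def __init__(self,
                 source: Optional[Callable[[], List]] = None,
                 parent: Optional["LineageDataset"] = None,
                 transform: Optional[Callable[[List], List]] = None):
        self._source = source        # loader for a root dataset
        self._parent = parent        # dataset this one was derived from
        self._transform = transform  # function applied to the parent's rows
        self._cache: Optional[List] = None  # in-memory copy, may be lost

    def map(self, fn: Callable) -> "LineageDataset":
        # Record the transformation; nothing is computed yet.
        return LineageDataset(parent=self,
                              transform=lambda rows: [fn(r) for r in rows])

    def filter(self, pred: Callable) -> "LineageDataset":
        return LineageDataset(parent=self,
                              transform=lambda rows: [r for r in rows if pred(r)])

    def collect(self) -> List:
        # Use the in-memory copy if present, otherwise rebuild from lineage.
        if self._cache is None:
            if self._parent is None:
                self._cache = list(self._source())
            else:
                self._cache = self._transform(self._parent.collect())
        return list(self._cache)

    def evict(self) -> None:
        # Simulate losing the in-memory copy (e.g. a node failure).
        self._cache = None


load_count = []  # tracks how often the (expensive) source load runs


def load_numbers() -> List[int]:
    load_count.append(1)
    return [1, 2, 3, 4, 5, 6]


base = LineageDataset(source=load_numbers)
even_squares = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = even_squares.collect()
even_squares.evict()             # lose the cached result
second = even_squares.collect()  # rebuilt from the recorded transformations
```

After the eviction, the second `collect()` rebuilds the result by replaying the recorded `map` on the still-cached parent, so the source is not reloaded. A real system would track this per partition and across machines; this sketch only shows the bookkeeping.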

New programming abstractions could help with developing applications more quickly and effectively: APIs for launching and monitoring tasks, standardized communication patterns, and fault-tolerant data structures. Debugging a DC app involves both correctness and performance analysis, since incorrect outputs could result from performance issues (e.g., with the underlying distributed software) as well as coding problems.

I think this paper does a good job of taking several of the core concepts of operating systems and applying them to cloud computing and data center applications. There doesn't seem to be much question that all of these issues are increasingly relevant to datacenter computing, and I liked seeing the parallels with classic operating systems, which suggest that much of what we already know can be applied to this new DC paradigm. With the rise of cloud computing, these issues will only become more relevant over the next decade. The idea of a unified system that addresses all of these issues seems attractive, but I wonder if this will really happen in practice, as many solutions for each of these issues (albeit not necessarily optimal ones) are already in place and could be difficult to replace.
