Monday, October 3, 2011

"SCADS: Scale-Independent Storage for Social Computing Applications"

As web applications grow, they encounter significant challenges in storing and querying increasingly large data sets. While parallel databases have helped address this issue for data that can be easily partitioned, many web applications have data that is based on social networks with a high degree of interaction between users. This makes the data hard to partition along traditional lines (e.g., by user), since joins and other types of queries would often have to cross partition boundaries.
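To make the partitioning problem concrete, here is a minimal sketch of my own (not from the paper). It assumes a hypothetical scheme that assigns users to partitions by user ID modulo the partition count, and it counts how many partitions a single "show me my friends' data" query would have to touch. The names `partition_of`, `partitions_touched`, and the sample friend list are all made up for illustration.

```python
def partition_of(user_id: int, num_partitions: int) -> int:
    """Hypothetical placement rule: user ID modulo the number of partitions."""
    return user_id % num_partitions


def partitions_touched(user_id: int, friends: dict, num_partitions: int) -> set:
    """Return the set of partitions holding data for this user's friends."""
    return {partition_of(f, num_partitions) for f in friends[user_id]}


# Made-up social graph: user 1 has five friends with unrelated IDs.
friends = {1: [2, 7, 12, 21, 36]}

# With 4 partitions, the friends land on every partition,
# so one friend-feed query fans out across the whole cluster.
touched = partitions_touched(1, friends, 4)
print(sorted(touched))
```

Even with only five friends, the query fans out to every partition, which is the kind of cross-partition work that partitioning by user was supposed to avoid.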

At the same time, many web applications care more about high performance and low latency than about complete consistency of data. SCADS aims to exploit this tradeoff by letting developers state their own consistency requirements, using cloud computing to scale up and down quickly, and using machine learning to anticipate performance problems. To sustain a high query volume at low latency, SCADS requires every query to be a lookup over a bounded, contiguous range of an index; the indices are maintained automatically according to the developer-specified consistency requirements. These requirements are expressed in the style of an SLA: for example, the developer can stipulate that a certain percentage of queries must complete within a certain response time. SCADS will be implemented on top of Cassandra as a column store, and will add asynchronous index updates, session guarantees, and automatic provisioning.
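The two ideas in that paragraph, bounded range lookups and percentile-style SLAs, can be sketched in a few lines. This is only my own illustration under stated assumptions, not the SCADS API: `BoundedIndex`, `range_lookup`, and `meets_sla` are hypothetical names, the index is an in-memory sorted list rather than anything built on Cassandra, and the sample keys and latencies are invented.

```python
import bisect
import math


class BoundedIndex:
    """Toy sorted index; every query is a capped scan of one contiguous key range."""

    def __init__(self, entries):
        # entries: iterable of (key, value) pairs; keys must be mutually comparable.
        self._entries = sorted(entries, key=lambda kv: kv[0])
        self._keys = [k for k, _ in self._entries]

    def range_lookup(self, start, end, limit):
        """Return at most `limit` entries with start <= key <= end, in key order."""
        if limit <= 0:
            raise ValueError("limit must be positive so the scan stays bounded")
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return self._entries[lo:min(hi, lo + limit)]


def meets_sla(latencies_ms, percentile, max_ms):
    """True if at least `percentile` percent of the latencies are <= max_ms."""
    ordered = sorted(latencies_ms)
    needed = math.ceil(percentile * len(ordered) / 100)
    return needed == 0 or ordered[needed - 1] <= max_ms


# Hypothetical index keyed by (user_id, timestamp) pointing at post IDs.
index = BoundedIndex([((1, 10), "a"), ((1, 20), "b"), ((1, 30), "c"), ((2, 5), "d")])

# The first two posts by user 1: one contiguous key range, capped at two results.
first_two = index.range_lookup((1, 0), (1, math.inf), limit=2)

# Made-up latency samples checked against "75% of queries under 100 ms".
ok = meets_sla([12, 15, 20, 250], percentile=75, max_ms=100)
```

Because every lookup is a capped slice of a sorted structure, the work per query stays bounded no matter how large the index grows, which is what makes a response-time promise plausible in the first place.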

I like this paper's emphasis on the unique characteristics of web application data that doesn't lend itself to easy partitioning, as I think this is a growing trend, especially on the web. The tradeoffs seem to fall in two areas: consistency (some consistency must be sacrificed for speed, which is usually acceptable for web apps) and query constraints (every query must cover a bounded, contiguous section of an index). I'm not sure that SCADS itself will be influential over the next decade, but the problem it addresses (data that is not easily partitioned but doesn't require strong consistency) is a real one, and one that will certainly drive a lot of database research over the next decade.
