Wednesday, October 5, 2011

"Bigtable: A Distributed Storage System for Structured Data"

Bigtable was developed at Google as a scalable distributed storage system for structured data, housed on commodity servers on top of GFS. Google currently uses Bigtable for a number of applications including web indexing, Google Earth, Google Finance, and over 60 others. Bigtable resembles a database, but it doesn't support a full relational model, sacrificing that for performance and the potential to scale easily to petabytes of storage. Essentially, Bigtable is a sparse, distributed, persistent, multidimensional sorted map: it maps a row key, column key, and timestamp to an arbitrary array of bytes. Data is maintained in tables that are partitioned into row ranges called "tablets", which are used to distribute and load balance the data. Locations of tablets are stored in multiple METADATA tablets, and Bigtable uses Chubby to keep track of tablet servers. Bigtable uses compression to save space, two levels of caching (a scan cache for key/value pairs and a block cache for blocks read from GFS), and a single commit log per tablet server instead of a separate commit log for each tablet.
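To make the data model concrete, here is a minimal sketch (my own illustration in Python, not Google's actual API) of the map the paper describes: (row key, column key, timestamp) -> bytes, kept sorted so that contiguous row ranges can be carved off into tablets. The usage at the bottom follows the paper's webtable example, where a page is stored under its reversed URL as the row key with timestamped versions in a "contents:" column.

    import bisect

    class ToyBigtable:
        # Toy, single-machine model of Bigtable's data model: a sorted map
        # from (row, column, timestamp) to bytes. Real Bigtable distributes
        # this map across tablet servers on top of GFS.
        def __init__(self):
            self._keys = []   # sorted list of (row, column, -timestamp) keys
            self._data = {}   # key -> value bytes

        def put(self, row, column, timestamp, value):
            key = (row, column, -timestamp)  # negate so the newest version sorts first
            if key not in self._data:
                bisect.insort(self._keys, key)
            self._data[key] = value

        def get(self, row, column):
            # Return the most recent value for (row, column), or None.
            i = bisect.bisect_left(self._keys, (row, column, float("-inf")))
            if i < len(self._keys) and self._keys[i][:2] == (row, column):
                return self._data[self._keys[i]]
            return None

        def scan(self, start_row, end_row):
            # Yield all cells with start_row <= row < end_row: the kind of
            # contiguous row range a single tablet holds.
            i = bisect.bisect_left(self._keys, (start_row, "", float("-inf")))
            while i < len(self._keys) and self._keys[i][0] < end_row:
                row, col, neg_ts = self._keys[i]
                yield row, col, -neg_ts, self._data[self._keys[i]]
                i += 1

    # Usage, following the paper's webtable example (reversed-URL row keys):
    t = ToyBigtable()
    t.put("com.cnn.www", "contents:", 100, b"<html>old</html>")
    t.put("com.cnn.www", "contents:", 200, b"<html>new</html>")
    assert t.get("com.cnn.www", "contents:") == b"<html>new</html>"

Row-level atomicity falls out naturally in this picture: since rows are the unit of sorting and a row lives entirely within one tablet, a read or write under a single row key touches only one server.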

The main nugget here is that many applications don't need all the features of a relational database, so a map with row-level atomicity on top of GFS can provide a powerful way to scale storage for many applications. I'm not quite sure what the major tradeoffs are for this specific design, though naturally you can't use it for everything (many applications still benefit from the schemas and ACID-style features of "real" databases). Since so many Google applications use Bigtable, it is clearly having an influence and meeting a need, at least within the company. Whether or not the Bigtable design becomes popular outside of Google, it seems that high-performance, scalable, schema-free storage is becoming a good (i.e. easier than using an RDBMS) solution for huge amounts of data.
