Thursday, October 13, 2011

Dremel: Interactive Analysis of Web-Scale Datasets

Dremel presents a scalable and interactive ad-hoc query system for analysis of read-only nested data. The system is built on top of GFS and can function side-by-side with MapReduce. The main new contribution here is the combination of the following:
- column-based storage
- efficient storage and analysis of nested data structures 
- in-situ data access (data is analyzed in-place)
- SQL-like query language

Column-based storage allows for optimization on several fronts. First, queries that don't require all (or even a majority of) fields can avoid the overhead of loading unnecessary data (i.e. remaining columns) from disk during. Furthermore, column data can be compressed for quicker loading -- especially since column values tend to be similar or there tends to be considerable repetition. Nested data is particularly relevant to Google, since they have to deal with web pages, link schemas, etc. Representing nested data in a column-oriented data store seems tricky, but as the paper shows, not too hard. 


Dremel is designed to handle aggregate queries (e.g. counts) and does so via a tree architecture. Queries are re-written at every node of the tree (i.e. intermediate servers) to aggregate the results of other intermediate nodes or leaves (i.e. horizontal partitions of the data). This lets data be analyzed in place (at the servers where the data is actually located), which, coupled with the parallelization of queries, allows for quick, interactive execution. If absolute accuracy is not required, leaves that are taking long time can be skipped, yielding a result with a known margin of error (and even faster response time). 

The various features of Dremel don't seem particularly innovative -- and the authors admit that column-based storage, nested data structures, etc. have been done before. However, it seems that the novelty lies in the combination of the various features, which makes Dremel particularly adept at handling large data sets interactively. That said, a lot of what Dremel does seems similar to Pig and Hive, though those applications actually generate MR jobs, which might not be the best solution. Tradeoffs are inherent in the accuracy vs. speed decisions (though it's very nice to be able to choose one or the other). More fundamentally, while Dremel is great at aggregate functions over huge sets of data, it seems that performance for more complex queries (e.g. joins, etc.) would suffer. Also, as the article points out, when many columns are read, the column-based storage model suffers (due to data retrieval and record assembly time). 


As far as influence over the next decade, it looks like Dremel has already proven popular at Google (and they have been using it since 2006). I think the column-based model makes a lot of sense for many datasets and query types, though as always one has to weigh the trade-offs ahead of time. 

No comments:

Post a Comment