Monday, October 3, 2011

"HIVE: Data Warehousing & Analytics on Hadoop"

Hive is an open-source data warehousing solution built on top of Hadoop, developed by a team at Facebook. While Hadoop and the MapReduce model are effective for analyzing enormous data sets at the scale encountered at Facebook, MapReduce is still relatively hard to program. Users are far more familiar with SQL, bash, Python, etc., so Hive was designed as an intermediate layer that accepts queries in a SQL-like language (HiveQL) and compiles them into a series of MapReduce jobs that run on Hadoop. Among other tasks, Facebook uses Hive for summarizing impression/click statistics, ad hoc analysis, data mining, spam detection, and ad optimization.
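
To give a flavor of the interface, here is a minimal HiveQL sketch (the ad_clicks table and its columns are hypothetical, not taken from the paper). A grouped aggregation like this compiles to roughly one MapReduce job, with the GROUP BY key serving as the shuffle key:

    -- count clicks per ad for one day of logs; Hive compiles this
    -- into a single MapReduce job grouped on ad_id
    SELECT ad_id, COUNT(1) AS clicks
    FROM ad_clicks
    WHERE dt = '2011-10-03'
    GROUP BY ad_id;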

Hive consists of several components on top of Hadoop/HDFS. There is both a CLI and a Thrift API through which HiveQL statements can be submitted. The Metastore is a system catalog holding table, column, and partition metadata. The Driver manages the compilation, optimization, and execution of HiveQL statements, invoking the Compiler to translate a query into a DAG of MapReduce jobs, which are then submitted to the Execution Engine and sent to Hadoop.
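
As a rough illustration of how these pieces fit together (again using the hypothetical ad_clicks table): the DDL below records table, column, and partition metadata in the Metastore, and EXPLAIN asks the Driver/Compiler to print the plan for a query -- the DAG of stages -- without executing it:

    -- CREATE TABLE records schema and partition info in the Metastore
    CREATE TABLE ad_clicks (ad_id BIGINT, user_id BIGINT)
    PARTITIONED BY (dt STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- EXPLAIN prints the compiled DAG of MapReduce stages without running them
    EXPLAIN
    SELECT ad_id, COUNT(1) FROM ad_clicks
    WHERE dt = '2011-10-03'
    GROUP BY ad_id;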

The main 'nugget' of this project is that MapReduce is hard to program -- SQL is a much more convenient interface, and a large class of queries can be converted into MapReduce jobs. I think that Hive's interface is fairly intuitive and provides easy access to the power of MapReduce for processing data, so we will probably continue to see its popularity grow (along with similar frameworks like Pig). That said, the Hive/Hadoop stack does have some performance problems, especially the need to reload data from disk into memory on every job, but we hope to address this by building a driver for Spark and caching data in RDDs.
