Monday, October 3, 2011

"Pig Latin: A Not-So-Foreign Language for Data Processing"

This paper describes Pig Latin, a new programming language, and Pig, a system that turns Pig Latin statements into MapReduce jobs. Pig and Pig Latin take a somewhat different approach to the same fundamental challenge that Hive tries to solve, namely, that the MapReduce programming model is very low-level and rigid, making it difficult to use for data analysis. At the same time, the declarative style of SQL can be unnatural for many programmers (especially as queries become more complex), so Pig aims for the middle ground between the rigid procedural style inherent in MapReduce and the declarative style of SQL (and HiveQL, etc.).

In Pig Latin, a user specifies a series of steps, where each step is a single high-level data transformation (as opposed to SQL, where constraints on the desired result are specified declaratively). The Pig system then optimizes the steps (where necessary and allowed) and generates a logical plan, which is then compiled into a sequence of MapReduce tasks. Notably, Pig Latin has extensive support for UDFs, which can be used at any step, including grouping, filtering, joining, and per-tuple processing.
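To make the step-at-a-time style a bit more concrete, here is a rough Python sketch of such a pipeline. It is not Pig Latin and not how Pig executes anything; the dataset, the field layout, and the `is_good` function (standing in for a UDF) are all made up for illustration. Each statement performs one transformation and names its intermediate result, which is the flavor the paper is going for.

```python
from collections import defaultdict

# Hypothetical input: (url, category, pagerank) tuples, invented for this sketch.
urls = [
    ("a.com", "news", 0.75),
    ("b.com", "news", 0.125),
    ("c.com", "sports", 0.5),
    ("d.com", "news", 0.5),
]


def is_good(pagerank):
    """Stand-in for a user-defined function (UDF) used as a filter predicate."""
    return pagerank > 0.2


# Step 1: filter, keeping only tuples that pass the UDF.
good_urls = [t for t in urls if is_good(t[2])]

# Step 2: group the surviving tuples by category.
groups = defaultdict(list)
for url, category, pagerank in good_urls:
    groups[category].append((url, category, pagerank))

# Step 3: per-group processing, here the average pagerank of each category.
output = {
    category: sum(t[2] for t in rows) / len(rows)
    for category, rows in groups.items()
}

print(output)
```

The equivalent SQL would state the whole result in a single query (a WHERE clause, a GROUP BY, and an AVG), whereas here each intermediate step is explicit and can be inspected or swapped out on its own.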

From a programmer's perspective, the Pig (Latin) programming model's more procedural style seems very appealing, as SQL statements can become extremely complex and difficult to write correctly. That said, many types of queries can be easily expressed in SQL-like syntax, and many data analysts are already familiar with SQL (and/or already have statements written for the types of queries that they want to run). Likewise, for interactive and ad-hoc querying, a SQL-based CLI (like the one provided by Hive) seems like a more convenient interface. Due to these tradeoffs, I'm not convinced that Pig will be especially influential over the next decade. While the programming interface is simpler and more convenient than MapReduce, Spark seems to do much of this as well or better (though, admittedly, it isn't as popular yet). Hive also seems to be gaining ground over Pig, and I think the big reason for this is the familiar SQL-style interface. Nevertheless, the idea of building layers on top of MapReduce to make it easier to generate MapReduce jobs (and easier to express tasks in MapReduce terms) seems like a dominant trend that will continue to influence research into the next decade.
