Apache Pig is an abstract language that puts data in the middle. Apache Pig is a “Data-flow” language. In contrast to SQL (and Hive), Pig goes an iterative way and lets data flow from one statement to another. This gives more powerful options when it comes to data. The language used for Apache Pig is called “PigLatin”. A key benefit of Apache Pig is that it abstracts complex tasks in MapReduce such as Joins to very easy functions in Apache Pig. Apache Pig is ways easier for Developers to write complex queries in Hadoop. Pig itself consists of two major components: PigLatin and a runtime environment.
When running Apache Pig, there are two possibilities: the first one is the stand alone mode which is intended to rather small datasets within a virtual machine. On processing Big Data, it is necessary to run Pig in the MapReduce Mode on top of HDFS. Pig applications are usually script files (with the extension .pig) that consist of a series of operations and transformations, that create output data from input data. Pig itself transforms these operations and transformations to MapReduce functions. The set of operations and transformations available by the language can easily be extended via custom code. When compared to the performance of “pure” MapReduce, Pig is a bit slower, but still very close to the native MapReduce performance. Especially for that not experienced in MapReduce, Pig is a great tool (and ways easier to learn than MapReduce)
When writing a Pig application, this application can easily be executed as a script in the Hadoop environment. Especially when using the previously demonstrated Hadoop VM’s, it is easy to get started. Another possibility is to work with Grunt, which allows us to execute Pig commands in the console. The third possibility to run Pig is to embed them in a Java application.
The question is, what differentiates Pig from SQL/Hive. First, Pig is a data-flow language. It is oriented on the data and how it is transformed from one statement to another. It works on a step-by-step iteration and transforms data. Another difference is that SQL needs a schema, but Pig doesn’t. The only dependency is that data needs to be able to work with it in parallel.
The table below will show a sample program. We will look at the possibilities within the next blog posts.