Apache Pig Tutorial

22 February 2024

Pig Architecture, Components, Features:

Apache Pig is one of the major components of the Hadoop ecosystem. Before starting this Apache Pig tutorial, let us understand why we need Apache Pig when we already have MapReduce for big data analytics. Writing MapReduce jobs in Java is a complex task and not everyone can do it. To write MapReduce tasks, the programmer needs solid knowledge of Java (other languages such as Python can be used only indirectly, through Hadoop Streaming). Moreover, it is a time-consuming process. These complexities led to the development of Apache Pig. In this Apache Pig tutorial, we are going to cover the following concepts:

1) What is Apache Pig?

2) Why do we need Apache Pig?

3) Apache Pig Architecture

4) Pig Execution modes

5) Process to download and Install Pig

6) Features of Pig

1) What is Apache Pig?

Apache Pig is a high-level platform designed for analyzing large data sets; its programs are written in a language called Pig Latin. In the MapReduce framework, programs have to be expressed as a sequence of Map and Reduce stages. Because MapReduce is a low-level programming model that most data analysts are not familiar with, handling tasks in it directly becomes complex. To remove this complexity, an abstraction called Pig was built on top of Hadoop. Pig provides the common data manipulation operations, such as filtering, joining, grouping and sorting.

Apache Pig allows developers to write data analysis programs using Pig Latin. This is a highly flexible language, and it lets users develop custom functions for reading, writing and processing data. It lets developers focus more on data analysis by minimizing the time spent writing MapReduce programs. Apache Pig got its name because, like pigs that eat anything, it can work with any type of data.

To analyze large volumes of data, programmers write scripts in the Pig Latin language. These scripts are then transformed internally into Map and Reduce tasks. Apache Pig comes with a component called the Pig engine, which takes scripts written in Pig Latin as input and converts them into MapReduce jobs.
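
As a rough illustration of what such a script looks like, the sketch below writes a small word-count program in Pig Latin to a file and runs it in local mode only if a pig command is available on the PATH. The directory pig_demo, the file names and the sample text are hypothetical and chosen just for this example.

```shell
# Create a scratch directory and a tiny input file (names are illustrative).
mkdir -p pig_demo
printf 'pig eats data\npig eats anything\n' > pig_demo/input.txt

# Write a word-count script in Pig Latin. The quoted 'EOF' stops the shell
# from expanding anything inside the script.
cat > pig_demo/wordcount.pig <<'EOF'
-- Load each line of the input file as a single chararray field.
lines   = LOAD 'pig_demo/input.txt' AS (line:chararray);
-- Split each line into words, one word per record.
words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Group identical words together and count each group.
grouped = GROUP words BY word;
counts  = FOREACH grouped GENERATE group AS word, COUNT(words) AS cnt;
-- Print the result to the console.
DUMP counts;
EOF

# Run the script in local mode only when Pig is installed.
if command -v pig >/dev/null 2>&1; then
  pig -x local pig_demo/wordcount.pig
else
  echo "pig not found on PATH; script saved to pig_demo/wordcount.pig"
fi
```

The five Pig Latin statements stand in for the mapper, reducer and driver classes that a hand-written Java MapReduce job would need; the Pig engine generates the equivalent jobs.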

2) Why do we need Apache Pig?

Developers who are not comfortable with Java often struggle while working with Hadoop, especially when writing and executing MapReduce tasks. Apache Pig is a good fit for such programmers.

> Pig Latin simplifies the work of programmers by eliminating the need to write complex Java code for performing MapReduce tasks.

> The multi-query approach of Apache Pig reduces the length of code considerably and shortens development time: an operation that needs many lines of Java can often be written in a few lines of Pig Latin.

> Pig Latin is similar to SQL, so if you are familiar with SQL it is easy to learn.
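
To make the comparison with SQL a little more concrete, the sketch below writes a Pig Latin script that mirrors a simple SQL query, shown in the script's comments. The employees.csv file, its columns and the salary threshold are assumptions made up for this illustration; the script is run only if pig is installed.

```shell
# Sample data: name, department, salary (all values are made up).
mkdir -p pig_demo
printf 'alice,sales,60000\nbob,sales,45000\ncarol,hr,70000\n' > pig_demo/employees.csv

cat > pig_demo/high_paid_by_dept.pig <<'EOF'
-- Rough SQL equivalent:
--   SELECT dept, COUNT(*) FROM employees WHERE salary > 50000 GROUP BY dept;
emp      = LOAD 'pig_demo/employees.csv' USING PigStorage(',')
           AS (name:chararray, dept:chararray, salary:int);
-- WHERE clause
highpaid = FILTER emp BY salary > 50000;
-- GROUP BY clause
by_dept  = GROUP highpaid BY dept;
-- SELECT list with the aggregate
result   = FOREACH by_dept GENERATE group AS dept, COUNT(highpaid) AS cnt;
DUMP result;
EOF

if command -v pig >/dev/null 2>&1; then
  pig -x local pig_demo/high_paid_by_dept.pig
else
  echo "pig not found on PATH; script saved to pig_demo/high_paid_by_dept.pig"
fi
```

Unlike a single declarative SQL statement, the Pig Latin version names each intermediate relation, so the script reads as a step-by-step data flow.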

3) Apache Pig Architecture

Apache Pig architecture comprises two important components:

1. Pig Latin

2. Pig Run-time Environment

A Pig Latin program comprises a series of transformations or operations that take input data and produce output. These operations describe a data flow, which the Pig execution environment translates into an executable representation. The result of this translation is a set of MapReduce jobs that the programmer does not need to know about. This is how Pig lets programmers focus on the data rather than on the nature of execution.

Components of Pig Architecture
As shown in the above diagram, Apache Pig consists of various components. Let us discuss the essential ones here.

Parser: All Pig scripts initially go through the parser. It performs various checks, including syntax checks of the script, type checking, and other miscellaneous checks. The parser produces a DAG (directed acyclic graph) that represents the Pig Latin statements and logical operators. In the DAG, the logical operators of the script are the nodes and the data flows are the edges.

Optimizer: The directed acyclic graph is passed to the logical optimizer, which performs logical optimizations such as projection and pushdown (for example, applying filters as early as possible).

Compiler: The compiler transforms the optimized logical plan into a sequence of MapReduce jobs.

Execution Engine: This component submits the MapReduce jobs to Hadoop in sorted order. Finally, the MapReduce jobs are executed on Apache Hadoop to produce the desired results.
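
One way to look at what these components produce is Pig Latin's EXPLAIN operator, which prints the logical, physical and MapReduce plans for a relation. The sketch below writes a short script that ends with EXPLAIN; the staff.csv file, its columns and the relation names are hypothetical, and the exact plan output depends on the Pig version.

```shell
# Sample data: name, department (made-up values).
mkdir -p pig_demo
printf 'alice,sales\nbob,sales\ncarol,hr\n' > pig_demo/staff.csv

cat > pig_demo/explain.pig <<'EOF'
staff   = LOAD 'pig_demo/staff.csv' USING PigStorage(',')
          AS (name:chararray, dept:chararray);
sales   = FILTER staff BY dept == 'sales';
by_dept = GROUP sales BY dept;
sizes   = FOREACH by_dept GENERATE group AS dept, COUNT(sales) AS cnt;
-- Show the plans built for "sizes" instead of printing its data.
EXPLAIN sizes;
EOF

if command -v pig >/dev/null 2>&1; then
  pig -x local pig_demo/explain.pig
else
  echo "pig not found on PATH; script saved to pig_demo/explain.pig"
fi
```

Reading the output from top to bottom roughly follows the pipeline described above: the logical plan comes from the parser and optimizer, and the MapReduce plan is what the compiler hands to the execution engine.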

4) Pig Execution modes
Pig can be executed in two different modes:

Local Mode: Here Pig uses the local file system and runs in a single JVM. Local mode is ideal for analyzing small data sets.

MapReduce Mode: In this mode, all the queries written in Pig Latin are converted into MapReduce jobs, and these jobs are run on a Hadoop cluster. MapReduce mode is well suited for running Pig on large data sets.
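
The mode is selected with the -x option of the pig command. As a minimal sketch, the helper below only prints the command line for each of the two modes; pig_command and the script name wordcount.pig are hypothetical names used for illustration and are not part of Pig.

```shell
# Print the pig command line for a given mode and script.
# pig_command is a hypothetical helper written for this tutorial.
pig_command() {
  mode="$1"
  script="$2"
  case "$mode" in
    local|mapreduce) echo "pig -x $mode $script" ;;
    *) echo "mode not covered in this tutorial: $mode" >&2; return 1 ;;
  esac
}

pig_command local wordcount.pig       # small data on the local file system
pig_command mapreduce wordcount.pig   # large data on a Hadoop cluster
```

Because the same script works in both modes, a common workflow is to test it in local mode on a small sample and then run it unchanged in MapReduce mode.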

5) The process to Download and Install Pig
Let us understand the step-by-step process to download and install Pig. Before beginning, make sure that you have installed Hadoop on your machine.

Step-1: First, open the following link to download the latest Apache Pig release: http://pig.apache.org/releases.html

Choose the tar.gz file and download it.

Step-2: After the download finishes, go to the directory containing the downloaded tar file (pig-0.12.1.tar.gz in this example) and move it to the location where you want to set up Pig. Here we move the file to /usr/local.

Move the tar file to /usr/local (run this from the download directory)
sudo mv pig-0.12.1.tar.gz /usr/local

Go to the directory that now contains the Pig tar file
cd /usr/local

Extract the contents of the tar file as shown below

sudo tar -xvf pig-0.12.1.tar.gz

Step-3: Make the appropriate changes to ~/.bashrc to add the variables related to the Pig environment

Open the ~/.bashrc file in a text editor and add the following lines

export PIG_HOME=<Installation directory of Pig>
export PATH=$PIG_HOME/bin:$HADOOP_HOME/bin:$PATH

Step-4: Use the below command to source this environment configuration
. ~/.bashrc

Step-5: At this step, we need to recompile Pig in order to support Hadoop 2.2.0
We need to follow the below steps to recompile Pig:

Go to PIG home directory
cd $PIG_HOME

Install Ant

sudo apt-get install ant

The download now starts, and the time it takes depends on your internet speed.

Recompile PIG
sudo ant clean jar-all -Dhadoopversion=23

Step-6: Once the Pig installation is complete, the next step is to test it using the command below.
pig -help
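
If pig -help fails, the cause is often the environment from Step-3 and Step-4. The sketch below is one possible sanity check, assuming the PIG_HOME variable from Step-3; check_pig_env is a hypothetical helper written for this tutorial and is not shipped with Pig.

```shell
# Report whether the Pig environment looks usable.
# check_pig_env is a hypothetical helper, not part of Pig.
check_pig_env() {
  if [ -z "${PIG_HOME:-}" ]; then
    echo "PIG_HOME is not set"
    return 1
  fi
  if [ ! -x "$PIG_HOME/bin/pig" ]; then
    echo "no executable found at $PIG_HOME/bin/pig"
    return 1
  fi
  echo "pig launcher found at $PIG_HOME/bin/pig"
}

check_pig_env || echo "revisit Step-3 and Step-4"

# When pig is on the PATH, print the installed release.
if command -v pig >/dev/null 2>&1; then
  pig -version
fi
```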

6) Features of Pig
Apache Pig has the following features:

A rich set of operators: It comes with many operators for performing operations such as sort, filter and join.

Ease of programming: Pig Latin is similar to SQL, so writing Pig scripts is easy if you are familiar with SQL.

Optimization: Apache Pig optimizes the execution of tasks automatically, which lets programmers focus on the semantics of the language.

Extensibility: This feature allows developers to write their own functions to read, write and process data.

UDFs: It also supports the creation of user-defined functions in other programming languages, such as Java and Python.

Handles diversified data sets: It can analyze all types of data, both structured and unstructured.
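
As a small illustration of the operators mentioned above, the sketch below writes a Pig Latin script that combines FILTER, JOIN and ORDER. The two CSV files, their columns and the salary threshold are assumptions invented for this example; the script is run only if pig is installed.

```shell
# Sample data (made-up values): employees and the departments they belong to.
mkdir -p pig_demo
printf 'alice,1,60000\nbob,1,35000\ncarol,2,70000\n' > pig_demo/emp.csv
printf '1,sales\n2,hr\n' > pig_demo/dept.csv

cat > pig_demo/operators.pig <<'EOF'
emp    = LOAD 'pig_demo/emp.csv' USING PigStorage(',')
         AS (name:chararray, dept_id:int, salary:int);
dept   = LOAD 'pig_demo/dept.csv' USING PigStorage(',')
         AS (dept_id:int, dept_name:chararray);
-- filter: keep employees above a salary threshold
paid   = FILTER emp BY salary > 40000;
-- join: attach the department name to each remaining employee
joined = JOIN paid BY dept_id, dept BY dept_id;
-- sort: highest salary first
sorted = ORDER joined BY paid::salary DESC;
DUMP sorted;
EOF

if command -v pig >/dev/null 2>&1; then
  pig -x local pig_demo/operators.pig
else
  echo "pig not found on PATH; script saved to pig_demo/operators.pig"
fi
```

After a JOIN, fields keep the name of the relation they came from as a prefix, which is why the sort key is written as paid::salary.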

Conclusion: Apache Pig is a core piece of technology in the Hadoop ecosystem. It consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with the infrastructure for evaluating these programs.