Unit 5 Lecture No-2 (PIG)
(KCS061)
Unit-5
3rd YEAR (6th Sem)
(2022-23)
Presented By:
Lalit K Tripathi
Asst. Prof.(CSE)
UCER, Prayagraj
APACHE PIG
What is Pig?
• Apache Pig is an abstraction over MapReduce.
• It is a tool/platform used to analyze large data sets by
representing them as data flows.
• Pig is generally used with Hadoop; we can perform all the data
manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language
known as Pig Latin.
• This language provides various operators using which programmers
can develop their own functions for reading, writing, and
processing data.
Pig Architecture & Components
• To analyze data using Apache Pig, programmers need to write
scripts using Pig Latin language.
• All these scripts are internally converted to Map and Reduce
tasks.
• Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into
MapReduce jobs.
Features of Pig
• Rich set of operators: It provides many operators to perform operations like join,
sort, filter, etc.
• Ease of programming: Pig Latin is similar to SQL and it is easy to write a Pig script
if you are good at SQL.
• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured
as well as unstructured. It stores the results in HDFS.
Apache Pig Vs Hive
• Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases
Hive operates on HDFS in much the same way as Apache Pig does.
Pig Latin – Data Model
Pig Execution Modes
• You can run Apache Pig in two modes.
• Local Mode
– In this mode, all the files are installed and run from your local host and
local file system. There is no need for Hadoop or HDFS. This mode is
generally used for testing purposes.
• MapReduce Mode
– MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we
execute the Pig Latin statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular operation on the data that
exists in the HDFS.
Invoking the Grunt Shell
• Local Mode
• $ pig -x local
• MapReduce mode
• $ pig -x mapreduce (or) pig
Execution Mechanisms
• Interactive Mode (Grunt shell) – You can run Apache Pig in interactive
mode using the Grunt shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump operator).
• Batch Mode (Script) – You can run Apache Pig in Batch mode by writing the
Pig Latin script in a single file with .pig extension.
• Embedded Mode (UDF) – Apache Pig provides the provision of defining our
own functions (User Defined Functions) in programming languages such as
Java, and using them in our script.
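As a rough illustration of batch mode, the statements typed in a Grunt session can be saved in a script file and run in one go. The sketch below assumes a hypothetical script name, sample_script.pig, and reuses the customers.txt path from the interactive example; both are placeholders rather than anything Pig prescribes.

```pig
-- sample_script.pig (hypothetical file name)
-- Run in local mode with:      pig -x local sample_script.pig
-- Run in MapReduce mode with:  pig -x mapreduce sample_script.pig

-- Load a comma-separated file; the path is an assumption for illustration.
customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',');

-- Print every tuple of the relation to the console.
DUMP customers;
```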
• Interactive Mode:
grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',');
grunt> dump customers;
----------------------------------------------------------------------------------------------------------
| grouped_records | group: int | filtered_records: bag({date: chararray,magnitude: float,location: int}) |
----------------------------------------------------------------------------------------------------------
|                 | 3          | {(2005-02-15, 5.3, 3), (2007-01-04, 5.8, 3)}                            |
----------------------------------------------------------------------------------------------------------
---------------------------------------------
| averages | group: int | double            |
---------------------------------------------
|          | 3          | 5.550000190734863 |
---------------------------------------------
Filtering
Filter Operator
• The FILTER operator is used to select the required tuples from
a relation based on a condition.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> filter_data = FILTER student_details BY city ==
'Chennai';
• grunt> dump filter_data;
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai
Output
(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
Distinct Operator
• The DISTINCT operator is used to remove redundant
(duplicate) tuples from a relation.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,
phone:chararray, city:chararray);
• grunt> distinct_data = DISTINCT student_details;
• grunt> dump distinct_data;
student_details.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai
Output (id, age, and city of each student in the eight-record student_details.txt):
(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
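The tuples above have the shape (id, age, city), which is what a projection over the eight-record student_details relation would produce, rather than a DISTINCT. A minimal sketch of such a projection, assuming the six-field schema from the Filter example and a hypothetical relation name foreach_data:

```pig
-- Load the eight-record file with the six-field schema used in the Filter example.
student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, age:int,
        phone:chararray, city:chararray);

-- Keep only the id, age and city columns of every tuple.
foreach_data = FOREACH student_details GENERATE id, age, city;

DUMP foreach_data;
```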
Sorting
Order By
• The ORDER BY operator is used to display the contents of a
relation in a sorted order based on one or more fields.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> order_by_data = ORDER student_details BY age DESC;
grunt> Dump order_by_data;
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
Limit Operator
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> limit_data = LIMIT student_details 4;
Grouping & Joining
Group Operator
• The GROUP operator is used to group the data in one or more
relations. It collects the data having the same key.
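The relation student_groupdata that is dumped next is not defined on the slide. A minimal sketch, assuming it groups the six-field student_details relation by age (which matches the keys 21 to 24 in the output):

```pig
-- Load the student records with the six-field schema used earlier.
student_details = LOAD '/home/cloudera/student_details.txt' USING PigStorage(',')
    AS (id:int, firstname:chararray, lastname:chararray, age:int,
        phone:chararray, city:chararray);

-- Collect all tuples that share the same age into one bag per age value.
-- Each result tuple has the form (group, {bag of matching tuples}).
student_groupdata = GROUP student_details BY age;
```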
• grunt> dump student_groupdata;
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})
Grouping by Multiple Columns
• grunt> student_details = LOAD '/home/cloudera/students.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);
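The grouping statement itself is missing from the slide. A minimal sketch, assuming the relation student_multiplegroup is built by grouping on the pair (age, city), which is what the composite keys in the output suggest:

```pig
-- Assumes student_details has been loaded as shown above.
-- Group on two fields at once; the group key becomes a tuple (age, city).
student_multiplegroup = GROUP student_details BY (age, city);
```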
• grunt> dump student_multiplegroup;
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})
Join Operator
• The JOIN operator is used to combine records from two or
more relations.
• Types of Joins:
– Self-join
– Inner-join
– Outer join : left join, right join, full join
Self Join
Self-join is used to join a table with itself as if the table were two relations,
temporarily renaming the relation.
Assume that we have two files namely customers.txt and orders.txt
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
• customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, address:chararray, salary:int);
• orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray,
customer_id:int, amount:int);
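Pig performs a self-join by loading the same data more than once under different aliases. The sketch below assumes the alias names customers1, customers2 and customers3, which do not appear on the slide:

```pig
-- Load the same file twice under two different aliases (assumed names).
customers1 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);
customers2 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);

-- Join the two aliases on id; each customer is paired with itself.
customers3 = JOIN customers1 BY id, customers2 BY id;

DUMP customers3;
```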
Output
It will produce the following output, displaying the contents of the self-joined relation.
(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)
Inner Join (equijoin)
• Inner Join is used quite frequently; it is also referred to as equijoin.
An inner join returns rows when there is a match in both tables.
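The join statement is not shown on the slide. A minimal sketch, assuming the customers and orders files from the Self Join example and a hypothetical result name customer_orders:

```pig
-- Load both relations with the schemas used in the Self Join example.
customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',')
    AS (oid:int, date:chararray, customer_id:int, amount:int);

-- Inner join: keep only customer/order pairs whose keys match.
customer_orders = JOIN customers BY id, orders BY customer_id;

DUMP customer_orders;
```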
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Left Outer Join
• The left outer Join operation returns all rows from the left table, even if
there are no matches in the right relation.
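A minimal sketch of the statement that builds outer_left, assuming the same customers and orders files as in the earlier join examples:

```pig
-- Load both relations with the schemas used in the Self Join example.
customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',')
    AS (id:int, name:chararray, age:int, address:chararray, salary:int);
orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',')
    AS (oid:int, date:chararray, customer_id:int, amount:int);

-- Keep every customer; order fields are null when no order matches.
outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;

DUMP outer_left;
```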
Output
It will produce the following output, displaying the contents of the
relation outer_left.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Right Outer Join
• The right outer join operation returns all rows from the right table, even if
there are no matches in the left table.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as
(oid:int, date:chararray, customer_id:int, amount:int);
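The join statement is missing from the slide. A minimal sketch, assuming a result name outer_right chosen by analogy with outer_left:

```pig
-- Assumes customers and orders have been loaded as shown above.
-- Keep every order; customer fields are null when no customer matches.
outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;

DUMP outer_right;
```

Every order in orders.txt refers to an existing customer, so for this data the result has the same tuples as the inner join.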
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
Full Outer Join
• The full outer join operation returns rows when there is a match in either of the
relations; unmatched rows from both sides are kept, with nulls for the missing fields.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as
(oid:int, date:chararray, customer_id:int, amount:int);
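A minimal sketch of the statement that builds outer_full from the two relations loaded above:

```pig
-- Assumes customers and orders have been loaded as shown above.
-- Keep all tuples from both sides, filling in nulls where there is no match.
outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

DUMP outer_full;
```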
Output
It will produce the following output, displaying the contents of the
relation outer_full.
(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Cross Operator
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING
PigStorage(',') as (oid:int, date:chararray, customer_id:int,
amount:int);
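The CROSS operator computes the cross product of two or more relations. A minimal sketch of the statement that builds cross_data from the relations loaded above:

```pig
-- Assumes customers and orders have been loaded as shown above.
-- Pair every customer tuple with every order tuple.
cross_data = CROSS customers, orders;

DUMP cross_data;
```

With 7 customers and 4 orders, the cross product contains 7 x 4 = 28 tuples, so CROSS should be used with care on large relations.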
Output
It will produce the following output, displaying the contents of the relation cross_data.
Combining & Splitting
Union Operator
• The UNION operator of Pig Latin is used to merge the content
of two relations. To perform UNION operation on two
relations, their columns and domains must be identical.
Assume that we have two files namely student_data1.txt and student_data2.txt in
the /pig_data/ directory of HDFS as shown below.
student_data1.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
student_data2.txt
7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai
• grunt> student1 = LOAD '/home/cloudera/student_data1.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
• grunt> student2 = LOAD '/home/cloudera/student_data2.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
• grunt> student = UNION student1, student2;
• grunt> dump student;
Output
It will display the following output, displaying the contents of the
relation student.
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
(7,Komal,Nayak,9848022334,trivendram)
(8,Bharathi,Nambiayar,9848022333,Chennai)
Pig Latin Built-In Functions
• Eval Functions
• String Functions
• Date-time Functions
• Math Functions
Eval Functions
• AVG()
• CONCAT()
• COUNT()
• COUNT_STAR()
• DIFF()
• MAX()
• MIN()
• SIZE()
• SUBTRACT()
• SUM()
AVG()
• Computes the average of the numeric values in a single-column bag.
• grunt> A = LOAD '/home/cloudera/student.txt' USING PigStorage(',') as (name:chararray, term:chararray,
gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> C = FOREACH B GENERATE A.name, AVG(A.gpa);
• grunt> DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
CONCAT()
• Concatenates two expressions of identical type.
• grunt> A = LOAD '/home/cloudera/data.txt' as (f1:chararray,
f2:chararray, f3:chararray);
• grunt> DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt> X = FOREACH A GENERATE CONCAT(f2,f3);
COUNT
• Computes the number of elements in a bag.
• Note: You cannot use the tuple designator (*) with COUNT;
that is, COUNT(*) will not work.
• Pig does not provide a built-in way of counting the columns
(fields) of a relation; one workaround is to store the data in
a file first and then count the number of delimiters in a line
of that file.
Assume that we have a file named student_details.txt
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad,89
002,siddarth,Battacharya,22,9848022338,Kolkata,78
003,Rajesh,Khanna,22,9848022339,Delhi,90
004,Preethi,Agarwal,21,9848022330,Pune,93
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75
006,Archana,Mishra,23,9848022335,Chennai,87
007,Komal,Nayak,24,9848022334,trivendram,83
008,Bharathi,Nambiayar,24,9848022333,Chennai,72
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray, gpa:int);
We can use the built-in function COUNT() (case sensitive) to calculate the number of
tuples in a relation. Let us group the relation student_details using the Group
All operator, and store the result in the relation named student_group_all as shown
below.
Verify the relation student_count using the DUMP operator as shown below.
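The two statements referred to above are not shown on the slide. A minimal sketch, using the relation names student_group_all and student_count from the text and assuming the count is taken over the id field:

```pig
-- Assumes student_details has been loaded with the seven-field schema shown above.
-- Put every tuple into a single group so that one count covers the whole relation.
student_group_all = GROUP student_details ALL;

-- Count the tuples in the grouped bag (COUNT skips tuples whose first field is null).
student_count = FOREACH student_group_all GENERATE COUNT(student_details.id);

DUMP student_count;
```

For the eight records listed in student_details.txt, the dump should show a single tuple, (8).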
SIZE
• Example
• In this example the number of characters in the first field is computed.
• grunt> DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt> X = FOREACH A GENERATE SIZE(f1);
• grunt> DUMP X;
(6L)
(6L)
(3L)
SUM
• Computes the sum of the numeric values in a single-column
bag. SUM requires a preceding GROUP ALL statement for
global sums and a GROUP BY statement for group sums.
• Example
• In this example the number of pets is computed.
• grunt> A = LOAD '/home/cloudera/data' AS (owner:chararray, pet_type:chararray,
pet_num:int);
• grunt> DUMP A;
(Alice,turtle,1)
(Alice,goldfish,5)
(Alice,cat,2)
(Bob,dog,2)
(Bob,cat,2)
• grunt> B = GROUP A BY owner;
• grunt> DUMP B;
(Alice,{(Alice,turtle,1),(Alice,goldfish,5),(Alice,cat,2)})
(Bob,{(Bob,dog,2),(Bob,cat,2)})
• grunt> X = FOREACH B GENERATE group, SUM(A.pet_num);
• grunt> DUMP X;
(Alice,8L)
(Bob,4L)
Practice Exercise
Consider the student data file (st.txt), with data in the following format: name,
district, age, gender.
a) Write a Pig script to display the names of all female students.
b) Write a Pig script to find the number of students from XXXX district.
c) Write a Pig script to display the district-wise count of all male students.
String Functions
• ENDSWITH
• STARTSWITH
• SUBSTRING
• EqualsIgnoreCase
• UPPER
• LOWER
• REPLACE
• TRIM, RTRIM, LTRIM
ENDSWITH, STARTSWITH
• ENDSWITH - This function accepts two string parameters. It
verifies whether the first string ends with the second string.
• STARTSWITH - This function accepts two string parameters. It
verifies whether the first string starts with the second.
• emp.txt
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
• grunt> emp_endswith = FOREACH emp_data GENERATE
(id,name), ENDSWITH(name, 'n');
• grunt> Dump emp_endswith;
• grunt> startswith_data = FOREACH emp_data GENERATE (id,name),
STARTSWITH(name, 'Ro');
• grunt> Dump startswith_data;
SUBSTRING()
• This function returns a substring from the given string.
• emp.txt
001,Robin,22,newyork
002,Stacy,25,Bhuwaneshwar
003,Kelly,22,Chennai
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING
PigStorage(',')as (id:int, name:chararray, age:int, city:chararray);
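The SUBSTRING call itself is not shown on the slide. SUBSTRING takes the string, a start index (counted from 0) and a stop index (exclusive). A minimal sketch, assuming a hypothetical relation name substring_data and an arbitrary choice of the first two characters of each name:

```pig
-- Assumes emp_data has been loaded as shown above.
-- Take the characters at positions 0 and 1 of each name (stop index 2 is exclusive).
substring_data = FOREACH emp_data GENERATE (id, name), SUBSTRING(name, 0, 2);

DUMP substring_data;
```

For the name Robin, this gives Ro.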
• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
• grunt> date_data = LOAD '/home/cloudera/date.txt' USING
PigStorage(',') as (id:int,date:chararray);
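The slide does not show which date-time function is applied to date_data. As one possible illustration, the sketch below parses the string with ToDate and then extracts parts of it with GetYear and GetDay; the relation names todate_data and date_parts and the field name date_time are assumptions.

```pig
-- Assumes date_data has been loaded as shown above, with date as a chararray.
-- Parse the string into a datetime using a pattern that matches date.txt.
todate_data = FOREACH date_data GENERATE id,
    ToDate(date, 'yyyy/MM/dd HH:mm:ss') AS date_time;

-- Extract the year and the day of the month from the parsed datetime.
date_parts = FOREACH todate_data GENERATE id, GetYear(date_time), GetDay(date_time);

DUMP date_parts;
```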
UDFs
User Defined Functions
• Apache Pig provides extensive support
for User Defined Functions (UDFs).
• Using these UDFs, we can define our own functions and use
them.
• UDF support is provided in six programming languages:
Java, Jython, Python, JavaScript, Ruby, and Groovy.
• In Pig, an eval UDF written in Java must:
– extend "org.apache.pig.EvalFunc"
– override the "exec" method.
Creating UDFs
• Open Eclipse and create a new project.
• Create the jar file and export it into the specific directory.
Java code
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;
emp1.txt
001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
007,Robert,22,newyork
008,Syam,23,Kolkata
009,Mary,25,Tokyo
010,Saran,25,London
011,Stacy,25,Bhuwaneshwar
012,Kelly,22,Chennai
Using the UDF
grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, city:chararray);
Let us now convert the names of the employees into upper case using the UDF sample_eval.
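The registration and call steps can be sketched as follows. The jar path and name are hypothetical, and the sketch assumes the exported jar contains a Java class named Sample_Eval (no package) that extends EvalFunc<String> and whose exec method returns its first argument in upper case; the relation name upper_case is also an assumption.

```pig
-- Register the jar exported from Eclipse (hypothetical path and jar name).
REGISTER '/home/cloudera/sample_udf.jar';

-- Give the UDF class a short alias; Sample_Eval is an assumed class name.
DEFINE sample_eval Sample_Eval();

-- Assumes emp_data has been loaded as shown above.
-- Apply the UDF to the name field of every employee tuple.
upper_case = FOREACH emp_data GENERATE sample_eval(name);

DUMP upper_case;
```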