Unit 5 Lecture No-2 (PIG)

Apache Pig is a platform used for analyzing large data sets by providing a high-level language called Pig Latin, which simplifies data manipulation in Hadoop. It features a rich set of operators for various data operations, supports user-defined functions, and can handle both structured and unstructured data. Pig can be executed in local or MapReduce modes, and it offers different execution mechanisms such as interactive, batch, and embedded modes.

Uploaded by

nickyjaiswal85

Big Data

(KCS061)
Unit-5
3rd YEAR (6th Sem)
(2022-23)
Presented By:
Lalit K Tripathi
Asst. Prof.(CSE)
UCER, Prayagraj
APACHE PIG
What is Pig?
• Apache Pig is an abstraction over MapReduce.
• It is a tool/platform which is used to analyze larger sets of data
representing them as data flows.
• Pig is generally used with Hadoop; we can perform all the data
manipulation operations in Hadoop using Apache Pig.
• To write data analysis programs, Pig provides a high-level language
known as Pig Latin.
• This language provides various operators using which programmers
can develop their own functions for reading, writing, and
processing data.
Pig Architecture & Components
• To analyze data using Apache Pig, programmers need to write
scripts using Pig Latin language.
• All these scripts are internally converted to Map and Reduce
tasks.
• Apache Pig has a component known as Pig Engine that accepts
the Pig Latin scripts as input and converts those scripts into
MapReduce jobs.
Features of Pig

• Rich set of operators: It provides many operators to perform operations like join,
sort, filter, etc.

• Ease of programming: Pig Latin is similar to SQL and it is easy to write a Pig script
if you are good at SQL.

• UDFs: Pig provides the facility to create User-defined Functions in other
programming languages such as Java and invoke or embed them in Pig Scripts.

• Handles all kinds of data: Apache Pig analyzes all kinds of data, both structured
as well as unstructured. It stores the results in HDFS.
Apache Pig Vs Hive
• Both Apache Pig and Hive are used to create MapReduce jobs. And in some cases,
Hive operates on HDFS in a similar way Apache Pig does.
Pig Latin – Data Model
Pig Execution Modes
• You can run Apache Pig in two modes.
• Local Mode
– In this mode, all the files are installed and run from your local host and
local file system. There is no need of Hadoop or HDFS. This mode is
generally used for testing purpose.
• MapReduce Mode
– MapReduce mode is where we load or process the data that exists in the
Hadoop File System (HDFS) using Apache Pig. In this mode, whenever we
execute the Pig Latin statements to process the data, a MapReduce job is
invoked in the back-end to perform a particular operation on the data that
exists in the HDFS.
Invoking the Grunt Shell
• Local Mode
• $ pig -x local
• MapReduce mode
• $ pig -x mapreduce (or) pig
Execution Mechanisms
• Interactive Mode (Grunt shell) – You can run Apache Pig in interactive
mode using the Grunt shell. In this shell, you can enter the Pig Latin
statements and get the output (using Dump operator).

• Batch Mode (Script) – You can run Apache Pig in Batch mode by writing the
Pig Latin script in a single file with .pig extension.

• Embedded Mode (UDF) – Apache Pig provides the provision of defining our
own functions (User Defined Functions) in programming languages such as
Java, and using them in our script.
• Interactive Mode:
grunt> customers= LOAD '/home/cloudera/customers.txt' USING
PigStorage(',');
grunt> dump customers;

• Batch Mode (Local):

[cloudera@quickstart ~]$ cat pig_samplescript_local.pig

customers= LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
dump customers;

[cloudera@quickstart ~]$ pig -x local pig_samplescript_local.pig

• Batch Mode (HDFS):
[cloudera@quickstart ~]$ cat pig_samplescript_global.pig
customers= LOAD '/training/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
dump customers;

[cloudera@quickstart ~]$ pig -x mapreduce pig_samplescript_global.pig

Pig Latin Basics
Types of PIG Operators
• Diagnostic/Debugging Operators:
DUMP
DESCRIBE
EXPLAIN
ILLUSTRATE

• Data Access Operators:

LOAD
STORE
FILTER
FOREACH
GROUP

• Data Transformation Operators:

UNION
SPLIT
LIMIT
CROSS
DISTINCT
ORDER BY
Diagnostic Operators
• The load statement will simply load the data into the specified
relation in Apache Pig. To verify the execution of
the Load statement, you have to use the Diagnostic Operators.
• Pig Latin provides four different types of diagnostic operators:
– Dump operator
– Describe operator
– Explain operator
– Illustrate operator
• Dump operator
• The Dump operator is used to run the Pig Latin statements and
display the results on the screen. It is generally used for
debugging purposes.

grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> dump customers;
• Describe operator
• The describe operator is used to view the schema of a
relation/bag.

grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> describe customers;
customers: {id: int,name: chararray,age: int,address: chararray,salary: int}
• Explain operator
• The explain operator is used to print the logical, physical, and
MapReduce execution plans of a relation/bag.

grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> explain customers;
• Illustrate operator
• The illustrate operator is used to review how data is
transformed through a sequence of Pig Latin statements.
ILLUSTRATE command is your best friend when it comes to
debugging a script.

grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as
(id:int,name:chararray,age:int,address:chararray,salary:int);
grunt> illustrate customers;
• ILLUSTRATE averages;
--------------------------------------------------------------------------
| records | date: bytearray | magnitude: bytearray | location: bytearray |
--------------------------------------------------------------------------
|         | 2005-02-15      | 5.3                  | 3                   |
|         | 2004-09-01      | 4.0                  | 3                   |
|         | 2007-01-04      | 5.8                  | 3                   |
--------------------------------------------------------------------------
----------------------------------------------------------------
| records | date: chararray | magnitude: float | location: int |
----------------------------------------------------------------
|         | 2005-02-15      | 5.3              | 3             |
|         | 2004-09-01      | 4.0              | 3             |
|         | 2007-01-04      | 5.8              | 3             |
----------------------------------------------------------------
-------------------------------------------------------------------------
| filtered_records | date: chararray | magnitude: float | location: int |
-------------------------------------------------------------------------
|                  | 2005-02-15      | 5.3              | 3             |
|                  | 2007-01-04      | 5.8              | 3             |
-------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------
| grouped_records | group: int | filtered_records: bag({date: chararray,magnitude: float,location: int}) |
----------------------------------------------------------------------------------------------------------
|                 | 3          | {(2005-02-15, 5.3, 3), (2007-01-04, 5.8, 3)}                            |
----------------------------------------------------------------------------------------------------------
-----------------------------------------------
| averages | group: int | double              |
-----------------------------------------------
|          | 3          | 5.550000190734863   |
-----------------------------------------------
Filtering
Filter Operator
• The FILTER operator is used to select the required tuples from
a relation based on a condition.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> filter_data = FILTER student_details BY city ==
'Chennai';
• grunt> dump filter_data;
Filter Operator
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);

grunt> filter_data = FILTER student_details BY city == 'Chennai';

grunt> Dump filter_data;

(6,Archana,Mishra,23,9848022335,Chennai)
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
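As a rough illustration of what FILTER computes, the selection can be emulated in plain Python. This is not Pig code: the list `student_details` copies the rows of student_details.txt shown above, and the helper name `filter_by_city` is hypothetical.

```python
# Plain-Python emulation of:
#   filter_data = FILTER student_details BY city == 'Chennai';
# Each tuple follows the schema (id, firstname, lastname, age, phone, city).
student_details = [
    (1, "Rajiv", "Reddy", 21, "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", 22, "9848022338", "Kolkata"),
    (3, "Rajesh", "Khanna", 22, "9848022339", "Delhi"),
    (4, "Preethi", "Agarwal", 21, "9848022330", "Pune"),
    (5, "Trupthi", "Mohanthy", 23, "9848022336", "Bhuwaneshwar"),
    (6, "Archana", "Mishra", 23, "9848022335", "Chennai"),
    (7, "Komal", "Nayak", 24, "9848022334", "trivendram"),
    (8, "Bharathi", "Nambiayar", 24, "9848022333", "Chennai"),
]


def filter_by_city(relation, city):
    """Keep only the tuples whose city field (index 5) equals the given value."""
    return [row for row in relation if row[5] == city]


filter_data = filter_by_city(student_details, "Chennai")
for row in filter_data:
    print(row)
```

The comparison is case sensitive, as it is for chararray values in Pig, so 'chennai' would select nothing here.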
Distinct Operator
• The DISTINCT operator is used to remove redundant
(duplicate) tuples from a relation.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,
phone:chararray, city:chararray);
• grunt> distinct_data = DISTINCT student_details;
• grunt> dump distinct_data;
Distinct Operator
student_details.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);

grunt> distinct_data = DISTINCT student_details;

grunt> Dump distinct_data;
(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
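The effect of DISTINCT can be sketched in plain Python, assuming a tuple counts as a duplicate only when every field matches. The rows copy the file shown above; the helper name `distinct` is hypothetical and this is an emulation, not Pig code.

```python
# Plain-Python emulation of: distinct_data = DISTINCT student_details;
# Schema: (id, firstname, lastname, phone, city).
student_details = [
    (1, "Rajiv", "Reddy", "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", "9848022338", "Kolkata"),
    (2, "siddarth", "Battacharya", "9848022338", "Kolkata"),
    (3, "Rajesh", "Khanna", "9848022339", "Delhi"),
    (3, "Rajesh", "Khanna", "9848022339", "Delhi"),
    (4, "Preethi", "Agarwal", "9848022330", "Pune"),
    (5, "Trupthi", "Mohanthy", "9848022336", "Bhuwaneshwar"),
    (6, "Archana", "Mishra", "9848022335", "Chennai"),
    (6, "Archana", "Mishra", "9848022335", "Chennai"),
]


def distinct(relation):
    """Drop duplicate tuples, comparing whole tuples field by field."""
    # Sorting is only for a stable display here; Pig's DISTINCT does not
    # preserve the original order of the input.
    return sorted(set(relation))


distinct_data = distinct(student_details)
for row in distinct_data:
    print(row)
```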
Foreach Operator
• The FOREACH operator is used to generate specified data
transformations based on the column data.
• grunt> student_details = LOAD '/home/cloudera/student_details.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray,age:int, phone:chararray, city:chararray);
• get the id, age, and city values of each student from the
relation student_details and store it into another relation
named foreach_data using the foreach operator.
• grunt> foreach_data = FOREACH student_details
GENERATE id,age,city;
• grunt> Dump foreach_data;
Foreach Operator
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad
002,siddarth,Battacharya,22,9848022338,Kolkata
003,Rajesh,Khanna,22,9848022339,Delhi
004,Preethi,Agarwal,21,9848022330,Pune
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar
006,Archana,Mishra,23,9848022335,Chennai
007,Komal,Nayak,24,9848022334,trivendram
008,Bharathi,Nambiayar,24,9848022333,Chennai

grunt> student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray,age:int, phone:chararray, city:chararray);
grunt> foreach_data = FOREACH student_details GENERATE id,age,city;
grunt> Dump foreach_data;

(1,21,Hyderabad)
(2,22,Kolkata)
(3,22,Delhi)
(4,21,Pune)
(5,23,Bhuwaneshwar)
(6,23,Chennai)
(7,24,trivendram)
(8,24,Chennai)
Sorting
Order By
• The ORDER BY operator is used to display the contents of a
relation in a sorted order based on one or more fields.
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> order_by_data = ORDER student_details BY age DESC;
Order By
grunt> Dump order_by_data;
(8,Bharathi,Nambiayar,24,9848022333,Chennai)
(7,Komal,Nayak,24,9848022334,trivendram)
(6,Archana,Mishra,23,9848022335,Chennai)
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)
(3,Rajesh,Khanna,22,9848022339,Delhi)
(2,siddarth,Battacharya,22,9848022338,Kolkata)
(4,Preethi,Agarwal,21,9848022330,Pune)
(1,Rajiv,Reddy,21,9848022337,Hyderabad)
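The descending sort can be emulated in plain Python as a quick check of the expected order. This is a sketch rather than Pig code: the tuples are cut down to (id, firstname, age) for brevity and the helper name `order_by_age_desc` is hypothetical.

```python
# Plain-Python emulation of:
#   order_by_data = ORDER student_details BY age DESC;
# Reduced schema for brevity: (id, firstname, age).
student_details = [
    (1, "Rajiv", 21), (2, "siddarth", 22), (3, "Rajesh", 22),
    (4, "Preethi", 21), (5, "Trupthi", 23), (6, "Archana", 23),
    (7, "Komal", 24), (8, "Bharathi", 24),
]


def order_by_age_desc(relation):
    """Sort tuples by the age field (index 2), largest first."""
    return sorted(relation, key=lambda row: row[2], reverse=True)


order_by_data = order_by_age_desc(student_details)
for row in order_by_data:
    print(row)
```

Python keeps tuples with equal ages in their input order, so rows 7 and 8 may appear the other way round compared with the Pig output above; the relative order of ties should not be relied on.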
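Limit Operator
• The LIMIT operator is used to get at most the given number of tuples
from a relation. Unless the relation has just been sorted with ORDER BY,
which tuples are returned is not guaranteed.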
• grunt> student_details = LOAD
'/home/cloudera/student_details.txt' USING PigStorage(',') as
(id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray);
• grunt> limit_data = LIMIT student_details 4;
Grouping & Joining

Group Operator
• The GROUP operator is used to group the data in one or more
relations. It collects the data having the same key.

• grunt> student_details = LOAD '/home/cloudera/students.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray,
age:int, phone:chararray, city:chararray);
• grunt> student_groupdata = GROUP student_details by age;

• grunt> dump student_groupdata;
(21,{(4,Preethi,Agarwal,21,9848022330,Pune),(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
(22,{(3,Rajesh,Khanna,22,9848022339,Delhi),(2,siddarth,Battacharya,22,9848022338,Kolkata)})
(23,{(6,Archana,Mishra,23,9848022335,Chennai),(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
(24,{(8,Bharathi,Nambiayar,24,9848022333,Chennai),(7,Komal,Nayak,24,9848022334,trivendram)})

• grunt> describe student_groupdata;

student_groupdata: {group: int,student_details: {(id: int,firstname: chararray,lastname: chararray,age: int,phone: chararray,city: chararray)}}
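The schema above says that each output tuple has two fields: `group` (the key) and a bag named after the input relation holding every tuple with that key. A plain-Python sketch of the same shape follows. It is not Pig code, the tuples are reduced to (id, firstname, age, city), and the helper name `group_by` is hypothetical.

```python
from collections import defaultdict

# Plain-Python emulation of: student_groupdata = GROUP student_details by age;
# Reduced schema for brevity: (id, firstname, age, city).
student_details = [
    (1, "Rajiv", 21, "Hyderabad"), (2, "siddarth", 22, "Kolkata"),
    (3, "Rajesh", 22, "Delhi"), (4, "Preethi", 21, "Pune"),
    (5, "Trupthi", 23, "Bhuwaneshwar"), (6, "Archana", 23, "Chennai"),
    (7, "Komal", 24, "trivendram"), (8, "Bharathi", 24, "Chennai"),
]


def group_by(relation, key_index):
    """Return (group, bag) pairs: one pair per distinct key value."""
    bags = defaultdict(list)
    for row in relation:
        bags[row[key_index]].append(row)  # the whole tuple goes into the bag
    return sorted(bags.items())           # sorted by key for display only


student_groupdata = group_by(student_details, key_index=2)
for group, bag in student_groupdata:
    print(group, bag)
```

Note that the grouping field still appears inside each tuple in the bag, just as age does in the Pig output.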

Grouping by Multiple Columns
• grunt> student_details = LOAD '/home/cloudera/students.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, age:int, phone:chararray, city:chararray);

• grunt> student_multiplegroup = GROUP student_details by (age, city);
• grunt> dump student_multiplegroup;
((21,Pune),{(4,Preethi,Agarwal,21,9848022330,Pune)})
((21,Hyderabad),{(1,Rajiv,Reddy,21,9848022337,Hyderabad)})
((22,Delhi),{(3,Rajesh,Khanna,22,9848022339,Delhi)})
((22,Kolkata),{(2,siddarth,Battacharya,22,9848022338,Kolkata)})
((23,Chennai),{(6,Archana,Mishra,23,9848022335,Chennai)})
((23,Bhuwaneshwar),{(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar)})
((24,Chennai),{(8,Bharathi,Nambiayar,24,9848022333,Chennai)})
((24,trivendram),{(7,Komal,Nayak,24,9848022334,trivendram)})

Join Operator
• The JOIN operator is used to combine records from two or
more relations.
• Types of Joins:
– Self-join
– Inner-join
– Outer join : left join, right join, full join

Self Join
Self-join is used to join a table with itself as if the table were two relations,
temporarily renaming the relation.
Assume that we have two files namely customers.txt and orders.txt
customers.txt
1,Ramesh,32,Ahmedabad,2000.00
2,Khilan,25,Delhi,1500.00
3,kaushik,23,Kota,2000.00
4,Chaitali,25,Mumbai,6500.00
5,Hardik,27,Bhopal,8500.00
6,Komal,22,MP,4500.00
7,Muffy,24,Indore,10000.00
orders.txt
102,2009-10-08 00:00:00,3,3000
100,2009-10-08 00:00:00,3,1500
101,2009-11-20 00:00:00,2,1560
103,2008-05-20 00:00:00,4,2060
Self Join
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as (oid:int, date:chararray,
customer_id:int, amount:int);

• grunt> customers1 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, address:chararray, salary:int);
• grunt> customers2 = LOAD '/home/cloudera/customers.txt' USING PigStorage(',') as (id:int,
name:chararray, age:int, address:chararray, salary:int);

• grunt> customers3 = JOIN customers1 BY id, customers2 BY id;

• grunt> Dump customers3;

Self Join
Output
It will produce the following output, displaying the contents of the relation customers3.

(1,Ramesh,32,Ahmedabad,2000,1,Ramesh,32,Ahmedabad,2000)
(2,Khilan,25,Delhi,1500,2,Khilan,25,Delhi,1500)
(3,kaushik,23,Kota,2000,3,kaushik,23,Kota,2000)
(4,Chaitali,25,Mumbai,6500,4,Chaitali,25,Mumbai,6500)
(5,Hardik,27,Bhopal,8500,5,Hardik,27,Bhopal,8500)
(6,Komal,22,MP,4500,6,Komal,22,MP,4500)
(7,Muffy,24,Indore,10000,7,Muffy,24,Indore,10000)

Inner Join (equijoin)
• Inner Join is used quite frequently; it is also referred to as equijoin.
An inner join returns rows when there is a match in both tables.

• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',')
as (oid:int, date:chararray, customer_id:int, amount:int);
• grunt> customer_orders = JOIN customers BY id, orders BY customer_id;
• grunt> dump customer_orders;
Inner Join (equijoin)
Output
You will get the following output, which shows the contents of the relation
named customer_orders.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)

Left Outer Join
• The left outer Join operation returns all rows from the left table, even if
there are no matches in the right relation.

• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as
(oid:int, date:chararray, customer_id:int, amount:int);
• grunt> outer_left = JOIN customers BY id LEFT OUTER, orders BY
customer_id;
• grunt> Dump outer_left;

Left Outer Join
Output
It will produce the following output, displaying the contents of the
relation outer_left.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
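The rows ending in empty fields are the customers without orders. A plain-Python sketch of the same logic is given below, with `None` standing in for Pig's null. It is an emulation rather than Pig code; the data copies customers.txt and orders.txt, and the helper name `left_outer_join` is hypothetical.

```python
# Plain-Python emulation of:
#   outer_left = JOIN customers BY id LEFT OUTER, orders BY customer_id;
customers = [  # (id, name, age, address, salary)
    (1, "Ramesh", 32, "Ahmedabad", 2000), (2, "Khilan", 25, "Delhi", 1500),
    (3, "kaushik", 23, "Kota", 2000), (4, "Chaitali", 25, "Mumbai", 6500),
    (5, "Hardik", 27, "Bhopal", 8500), (6, "Komal", 22, "MP", 4500),
    (7, "Muffy", 24, "Indore", 10000),
]
orders = [  # (oid, date, customer_id, amount)
    (102, "2009-10-08 00:00:00", 3, 3000), (100, "2009-10-08 00:00:00", 3, 1500),
    (101, "2009-11-20 00:00:00", 2, 1560), (103, "2008-05-20 00:00:00", 4, 2060),
]


def left_outer_join(left, right, left_key, right_key):
    """Keep every left tuple; pad with None when no right tuple matches."""
    right_width = len(right[0])
    joined = []
    for l_row in left:
        matches = [r_row for r_row in right if r_row[right_key] == l_row[left_key]]
        if matches:
            joined.extend(l_row + r_row for r_row in matches)
        else:
            joined.append(l_row + (None,) * right_width)
    return joined


outer_left = left_outer_join(customers, orders, left_key=0, right_key=2)
for row in outer_left:
    print(row)
```

Dropping the `else` branch turns this into an inner join, which is why the inner join output earlier contains only customers 2, 3 and 4.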
Right Outer Join
• The right outer join operation returns all rows from the right table, even if
there are no matches in the left table.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as
(oid:int, date:chararray, customer_id:int, amount:int);

• grunt> outer_right = JOIN customers BY id RIGHT OUTER, orders BY customer_id;

• grunt> Dump outer_right;

Right Outer Join
Output
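It will produce the following output, displaying the contents of the
relation outer_right. Every order in orders.txt has a matching customer, so
the rows are the same as those of the inner join.

(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)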
Full Outer Join
• The full outer join operation returns all rows from both relations; where a
row has no match, the fields of the other relation are left null.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, address:chararray, salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING PigStorage(',') as
(oid:int, date:chararray, customer_id:int, amount:int);

• grunt> outer_full = JOIN customers BY id FULL OUTER, orders BY customer_id;

• grunt> Dump outer_full;

Full Outer Join
Output
It will produce the following output, displaying the contents of the
relation outer_full.

(1,Ramesh,32,Ahmedabad,2000,,,,)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,,,,)
(6,Komal,22,MP,4500,,,,)
(7,Muffy,24,Indore,10000,,,,)
Cross Operator
• The CROSS operator computes the cross product (Cartesian product) of two or
more relations.
• grunt> customers = LOAD '/home/cloudera/customers.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, address:chararray,
salary:int);
• grunt> orders = LOAD '/home/cloudera/orders.txt' USING
PigStorage(',') as (oid:int, date:chararray, customer_id:int,
amount:int);

• grunt> cross_data = CROSS customers, orders;

• grunt> Dump cross_data;

Cross Operator
Output
It will produce the following output, displaying the contents of the relation cross_data.

(7,Muffy,24,Indore,10000,103,2008-05-20 00:00:00,4,2060)
(7,Muffy,24,Indore,10000,101,2009-11-20 00:00:00,2,1560)
(7,Muffy,24,Indore,10000,100,2009-10-08 00:00:00,3,1500)
(7,Muffy,24,Indore,10000,102,2009-10-08 00:00:00,3,3000)
(6,Komal,22,MP,4500,103,2008-05-20 00:00:00,4,2060)
(6,Komal,22,MP,4500,101,2009-11-20 00:00:00,2,1560)
(6,Komal,22,MP,4500,100,2009-10-08 00:00:00,3,1500)
(6,Komal,22,MP,4500,102,2009-10-08 00:00:00,3,3000)
(5,Hardik,27,Bhopal,8500,103,2008-05-20 00:00:00,4,2060)
(5,Hardik,27,Bhopal,8500,101,2009-11-20 00:00:00,2,1560)
(5,Hardik,27,Bhopal,8500,100,2009-10-08 00:00:00,3,1500)
(5,Hardik,27,Bhopal,8500,102,2009-10-08 00:00:00,3,3000)
(4,Chaitali,25,Mumbai,6500,103,2008-05-20 00:00:00,4,2060)
(4,Chaitali,25,Mumbai,6500,101,2009-11-20 00:00:00,2,1560)
(4,Chaitali,25,Mumbai,6500,100,2009-10-08 00:00:00,3,1500)
(4,Chaitali,25,Mumbai,6500,102,2009-10-08 00:00:00,3,3000)
(3,kaushik,23,Kota,2000,101,2009-11-20 00:00:00,2,1560)
(3,kaushik,23,Kota,2000,100,2009-10-08 00:00:00,3,1500)
(3,kaushik,23,Kota,2000,102,2009-10-08 00:00:00,3,3000)
(3,kaushik,23,Kota,2000,103,2008-05-20 00:00:00,4,2060)
(2,Khilan,25,Delhi,1500,101,2009-11-20 00:00:00,2,1560)
(2,Khilan,25,Delhi,1500,100,2009-10-08 00:00:00,3,1500)
(2,Khilan,25,Delhi,1500,102,2009-10-08 00:00:00,3,3000)
(2,Khilan,25,Delhi,1500,103,2008-05-20 00:00:00,4,2060)
(1,Ramesh,32,Ahmedabad,2000,101,2009-11-20 00:00:00,2,1560)
(1,Ramesh,32,Ahmedabad,2000,100,2009-10-08 00:00:00,3,1500)
(1,Ramesh,32,Ahmedabad,2000,102,2009-10-08 00:00:00,3,3000)
(1,Ramesh,32,Ahmedabad,2000,103,2008-05-20 00:00:00,4,2060)
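The size of this output follows from the input sizes: 7 customers times 4 orders gives 28 tuples of 5 + 4 = 9 fields each. A small plain-Python sketch (an emulation, not Pig code, with data copied from customers.txt and orders.txt) makes the pairing explicit.

```python
from itertools import product

# Plain-Python emulation of: cross_data = CROSS customers, orders;
customers = [  # (id, name, age, address, salary)
    (1, "Ramesh", 32, "Ahmedabad", 2000), (2, "Khilan", 25, "Delhi", 1500),
    (3, "kaushik", 23, "Kota", 2000), (4, "Chaitali", 25, "Mumbai", 6500),
    (5, "Hardik", 27, "Bhopal", 8500), (6, "Komal", 22, "MP", 4500),
    (7, "Muffy", 24, "Indore", 10000),
]
orders = [  # (oid, date, customer_id, amount)
    (102, "2009-10-08 00:00:00", 3, 3000), (100, "2009-10-08 00:00:00", 3, 1500),
    (101, "2009-11-20 00:00:00", 2, 1560), (103, "2008-05-20 00:00:00", 4, 2060),
]

# Every customer tuple is paired with every order tuple; no key is compared.
cross_data = [c_row + o_row for c_row, o_row in product(customers, orders)]

print(len(cross_data), "tuples of", len(cross_data[0]), "fields")
```

Because the result grows with the product of the input sizes, a cross product becomes expensive quickly on large relations.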

Combining & Splitting

Union Operator
• The UNION operator of Pig Latin is used to merge the content
of two relations. To perform UNION operation on two
relations, their columns and domains must be identical.

Union Operator
Assume that we have two files namely student_data1.txt and student_data2.txt in
the /pig_data/ directory of HDFS as shown below.

student_data1.txt
001,Rajiv,Reddy,9848022337,Hyderabad
002,siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai

student_data2.txt
7,Komal,Nayak,9848022334,trivendram
8,Bharathi,Nambiayar,9848022333,Chennai
• grunt> student1 = LOAD '/home/cloudera/student_data1.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
• grunt> student2 = LOAD '/home/cloudera/student_data2.txt'
USING PigStorage(',') as (id:int, firstname:chararray,
lastname:chararray, phone:chararray, city:chararray);
• grunt> student = UNION student1, student2;
• grunt> dump student;

Output
It will display the following output, displaying the contents of the
relation student.

(1,Rajiv,Reddy,9848022337,Hyderabad)
(2,siddarth,Battacharya,9848022338,Kolkata)
(3,Rajesh,Khanna,9848022339,Delhi)
(4,Preethi,Agarwal,9848022330,Pune)
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)
(6,Archana,Mishra,9848022335,Chennai)
(7,Komal,Nayak,9848022334,trivendram)
(8,Bharathi,Nambiayar,9848022333,Chennai)
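In plain Python the same merge is a simple concatenation, which also shows two properties of UNION in Pig: duplicate tuples are kept, and the order of the result should not be relied on. The sketch below is an emulation, not Pig code; the tuples copy the two files above and the extra relation `student_dup` is a hypothetical addition used only to show that duplicates survive.

```python
# Plain-Python emulation of: student = UNION student1, student2;
# Schema of both relations: (id, firstname, lastname, phone, city).
student1 = [
    (1, "Rajiv", "Reddy", "9848022337", "Hyderabad"),
    (2, "siddarth", "Battacharya", "9848022338", "Kolkata"),
    (3, "Rajesh", "Khanna", "9848022339", "Delhi"),
    (4, "Preethi", "Agarwal", "9848022330", "Pune"),
    (5, "Trupthi", "Mohanthy", "9848022336", "Bhuwaneshwar"),
    (6, "Archana", "Mishra", "9848022335", "Chennai"),
]
student2 = [
    (7, "Komal", "Nayak", "9848022334", "trivendram"),
    (8, "Bharathi", "Nambiayar", "9848022333", "Chennai"),
]

# UNION just merges the two relations.
student = student1 + student2

# Hypothetical extra input: unioning a relation with a copy of one of its
# tuples keeps both copies (use DISTINCT afterwards to remove them).
student_dup = student + [student2[0]]

print(len(student), len(student_dup))
```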
Pig Latin Built-In Functions
• Eval Functions
• String Functions
• Date-time Functions
• Math Functions
Eval Functions
• AVG()
• CONCAT()
• COUNT()
• COUNT_STAR()
• DIFF()
• MAX()
• MIN()
• SIZE()
• SUBTRACT()
• SUM()
AVG()
• Computes the average of the numeric values in a single-column bag.
• grunt> A = LOAD '/home/cloudera/student.txt' USING PigStorage(',') as (name:chararray, term:chararray,
gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> C = FOREACH B GENERATE A.name, AVG(A.gpa);
• grunt> DUMP C;
({(John),(John),(John),(John)},3.850000023841858)
({(Mary),(Mary),(Mary),(Mary)},3.925000011920929)
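The long decimals in the output (3.850000023841858 rather than 3.85) appear because gpa is declared as a 32-bit float while the average is reported as a 64-bit double. The plain-Python sketch below reproduces the effect under that assumption; it is not Pig code, and the helper names `to_float32` and `avg_as_double` are hypothetical.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def avg_as_double(values):
    """Average 32-bit float inputs using 64-bit arithmetic."""
    widened = [to_float32(v) for v in values]
    return sum(widened) / len(widened)


# GPA values per student, copied from the DUMP A output above.
john_gpa = [3.9, 3.7, 4.0, 3.8]
mary_gpa = [3.8, 3.9, 4.0, 4.0]

john_avg = avg_as_double(john_gpa)
mary_avg = avg_as_double(mary_gpa)
print(john_avg)
print(mary_avg)
```

Declaring the field as double in the LOAD schema would be one way to avoid the extra digits.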
CONCAT()
• Concatenates two expressions of identical type.
• grunt> A = LOAD '/home/cloudera/data.txt' as (f1:chararray,
f2:chararray, f3:chararray);
• grunt> DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt> X = FOREACH A GENERATE CONCAT(f2,f3);
COUNT
• Computes the number of elements in a bag.
• Note: You cannot use the tuple designator (*) with COUNT;
that is, COUNT(*) will not work.
• Pig does not provide a built-in way of counting columns (fields).
One workaround is to store the file first and then count the number of
delimiters in it.
COUNT
Assume that we have a file named student_details.txt
student_details.txt
001,Rajiv,Reddy,21,9848022337,Hyderabad,89
002,siddarth,Battacharya,22,9848022338,Kolkata,78
003,Rajesh,Khanna,22,9848022339,Delhi,90
004,Preethi,Agarwal,21,9848022330,Pune,93
005,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75
006,Archana,Mishra,23,9848022335,Chennai,87
007,Komal,Nayak,24,9848022334,trivendram,83
008,Bharathi,Nambiayar,24,9848022333,Chennai,72
COUNT
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING
PigStorage(',') as (id:int, firstname:chararray, lastname:chararray, age:int,
phone:chararray, city:chararray, gpa:int);

Calculating the Number of Tuples

We can use the built-in function COUNT() (case sensitive) to calculate the number of
tuples in a relation. Let us group the relation student_details using the Group
All operator, and store the result in the relation named student_group_all as shown
below.

grunt> student_group_all = Group student_details All;

COUNT
grunt> Dump student_group_all;
(all,{(8,Bharathi,Nambiayar,24,9848022333,Chennai,72),
(7,Komal,Nayak,24,9848022334,trivendram,83),
(6,Archana,Mishra,23,9848022335,Chennai,87),
(5,Trupthi,Mohanthy,23,9848022336,Bhuwaneshwar,75),
(4,Preethi,Agarwal,21,9848022330,Pune,93),
(3,Rajesh,Khanna,22,9848022339,Delhi,90),
(2,siddarth,Battacharya,22,9848022338,Kolkata,78),
(1,Rajiv,Reddy,21,9848022337,Hyderabad,89)})
COUNT
Let us now calculate number of tuples/records in the relation.

grunt> student_count = foreach student_group_all Generate COUNT(student_details.gpa);

Verify the relation student_count using the DUMP operator as shown below.

grunt> Dump student_count;

(8)
• grunt> A = LOAD '/home/cloudera/c.txt' USING PigStorage(',') as (f1:int, f2:int, f3:int);
• grunt> DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
• grunt> B = GROUP A BY f1;
• grunt> DUMP B;
(1,{(1,2,3)})
(4,{(4,2,1),(4,3,3)})
(7,{(7,2,5)})
(8,{(8,3,4),(8,4,3)})
COUNT_STAR
• Computes the number of elements in a bag.
• COUNT_STAR includes NULL values in the count computation. COUNT, by
contrast, ignores them: a tuple in the bag is not counted if the FIRST
FIELD in that tuple is NULL.
• Example
• In this example COUNT_STAR is used to count the tuples in a
bag.
• grunt>X = FOREACH B GENERATE COUNT_STAR(A);
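The difference between the two functions only shows up when the bag contains nulls. The plain-Python sketch below emulates both rules, with `None` standing in for null. It is not Pig code: the helper names `pig_count` and `pig_count_star` are hypothetical, and the sample bag with nulls is made up for illustration (the data files in these slides contain no nulls).

```python
def pig_count(bag):
    """Emulate COUNT: skip tuples whose first field is null (None)."""
    return sum(1 for row in bag if row[0] is not None)


def pig_count_star(bag):
    """Emulate COUNT_STAR: count every tuple, nulls included."""
    return len(bag)


# Hypothetical bag: the second tuple has a null first field,
# the third has a null in a later field.
bag = [(1, 2, 3), (None, 2, 1), (8, None, 4)]

print(pig_count(bag))       # the tuple starting with None is skipped
print(pig_count_star(bag))  # all three tuples are counted
```

When a single column is projected first, as in COUNT(student_details.gpa), that column is the first field of each tuple in the bag, so rows with a null gpa are the ones COUNT leaves out.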
DIFF
• Compares two fields in a tuple.
• The DIFF function takes two bags as arguments and compares them. Any tuples that are in one bag
but not the other are returned in a bag. If the bags match, an empty bag is returned.
• grunt> A = LOAD '/home/cloudera/data.txt' AS
(B1:bag{T1:tuple(t1:int,t2:int)},B2:bag{T2:tuple(f1:int,f2:int)});
• grunt> DUMP A;
({(8,9),(0,1)},{(8,9),(1,1)})
({(2,3),(4,5)},{(2,3),(4,5)})
({(6,7),(3,7)},{(2,2),(3,7)})
• grunt> DESCRIBE A;
A: {B1: {T1: (t1: int,t2: int)},B2: {T2: (f1: int,f2: int)}}
• grunt> X = FOREACH A GENERATE DIFF(B1,B2);
• grunt> dump X;
({(0,1),(1,1)})
({})
({(6,7),(2,2)})
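The three result bags can be checked by hand with a plain-Python emulation of the rule stated above (keep the tuples found in one bag but not the other). This is a sketch, not Pig code; the helper name `diff` is hypothetical and the input rows copy the DUMP A output.

```python
def diff(bag1, bag2):
    """Return the tuples that appear in exactly one of the two bags."""
    only_in_first = [row for row in bag1 if row not in bag2]
    only_in_second = [row for row in bag2 if row not in bag1]
    return only_in_first + only_in_second


# Each row of relation A holds two bags, B1 and B2.
A = [
    ([(8, 9), (0, 1)], [(8, 9), (1, 1)]),
    ([(2, 3), (4, 5)], [(2, 3), (4, 5)]),
    ([(6, 7), (3, 7)], [(2, 2), (3, 7)]),
]

X = [diff(b1, b2) for b1, b2 in A]
for result in X:
    print(result)
```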
MAX
• Computes the maximum of the numeric values or chararrays in
a single-column bag. MAX requires a preceding GROUP ALL
statement for global maximums and a GROUP BY statement for
group maximums.
• Example
– In this example the maximum GPA for all terms is computed for each
student (see the GROUP operator for information about the field
names in relation B).
• grunt> A = LOAD '/home/cloudera/student.txt' AS (name:chararray, session:chararray, gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> X = FOREACH B GENERATE group, MAX(A.gpa);
• grunt> DUMP X;
(John,4.0F)
(Mary,4.0F)
MIN
• Computes the minimum of the numeric values or chararrays in
a single-column bag. MIN requires a preceding GROUP ALL
statement for global minimums and a GROUP BY statement
for group minimums.
• Example
– In this example the minimum GPA for all terms is computed for each
student (see the GROUP operator for information about the field
names in relation B).
• grunt> A = LOAD '/home/cloudera/student.txt' AS (name:chararray, session:chararray, gpa:float);
• grunt> DUMP A;
(John,fl,3.9F)
(John,wt,3.7F)
(John,sp,4.0F)
(John,sm,3.8F)
(Mary,fl,3.8F)
(Mary,wt,3.9F)
(Mary,sp,4.0F)
(Mary,sm,4.0F)
• grunt> B = GROUP A BY name;
• grunt> DUMP B;
(John,{(John,fl,3.9F),(John,wt,3.7F),(John,sp,4.0F),(John,sm,3.8F)})
(Mary,{(Mary,fl,3.8F),(Mary,wt,3.9F),(Mary,sp,4.0F),(Mary,sm,4.0F)})
• grunt> X = FOREACH B GENERATE group, MIN(A.gpa);
• grunt> DUMP X;
(John,3.7F)
(Mary,3.8F)
SIZE
• Computes the number of elements based on any Pig data type.

• Example
• In this example the number of characters in the first field is computed.

• grunt> A = LOAD 'data' as (f1:chararray, f2:chararray, f3:chararray);
• grunt> DUMP A;
(apache,open,source)
(hadoop,map,reduce)
(pig,pig,latin)
• grunt> X = FOREACH A GENERATE SIZE(f1);
• grunt> DUMP X;
(6L)
(6L)
(3L)
SUM
• Computes the sum of the numeric values in a single-column
bag. SUM requires a preceding GROUP ALL statement for
global sums and a GROUP BY statement for group sums.
• Example
• In this example the number of pets is computed.
• grunt> A = LOAD '/home/cloudera/data' AS (owner:chararray, pet_type:chararray,
pet_num:int);
• grunt> DUMP A;
(Alice,turtle,1)
(Alice,goldfish,5)
(Alice,cat,2)
(Bob,dog,2)
(Bob,cat,2)
• grunt> B = GROUP A BY owner;
• grunt> DUMP B;
(Alice,{(Alice,turtle,1),(Alice,goldfish,5),(Alice,cat,2)})
(Bob,{(Bob,dog,2),(Bob,cat,2)})
• grunt> X = FOREACH B GENERATE group, SUM(A.pet_num);
• grunt> DUMP X;
(Alice,8L)
(Bob,4L)
Practice Exercise

Consider a student data file (st.txt) with records in the following format: Name,
District, age, gender.
a) Write a Pig script to display the names of all female students.
b) Write a Pig script to find the number of students from XXXX District.
c) Write a Pig script to display the district-wise count of all male students.
String Functions
• ENDSWITH
• STARTSWITH
• SUBSTRING
• EqualsIgnoreCase
• UPPER
• LOWER
• REPLACE
• TRIM, RTRIM, LTRIM
ENDSWITH , STARTSWITH
• ENDSWITH - This function accepts two string parameters. It
verifies whether the first string ends with the second.
• STARTSWITH - This function accepts two string parameters. It
verifies whether the first string starts with the second.
• emp.txt
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
• grunt> emp_endswith = FOREACH emp_data GENERATE
(id,name),ENDSWITH ( name, 'n' );
• grunt> Dump emp_endswith;
• grunt> startswith_data = FOREACH emp_data GENERATE (id,name),
STARTSWITH (name,'Ro');
• grunt> Dump startswith_data;
SUBSTRING()
• This function returns a substring from the given string.

• EMP.TXT
001,Robin,22,newyork
002,Stacy,25,Bhuwaneshwar
003,Kelly,22,Chennai
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING
PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);

• grunt> substring_data = FOREACH emp_data GENERATE (id,name),
SUBSTRING (name, 0, 3);

• grunt> Dump substring_data;

((1,Robin),Rob)
((2,Stacy),Sta)
((3,Kelly),Kel)
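The stop index of SUBSTRING is exclusive, which is easy to get wrong. The sketch below uses Python slicing as a stand-in, since it follows the same convention; the helper name substring is made up for illustration.

```python
def substring(text, start, stop):
    """Return characters from index start up to, but not including, stop."""
    return text[start:stop]

names = ["Robin", "Stacy", "Kelly"]

# Indexes 0, 1 and 2 are returned, so three characters come back.
first_three = [substring(n, 0, 3) for n in names]

# With a stop index of 2, only two characters come back.
first_two = substring("Robin", 0, 2)

print(first_three, first_two)
```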
EqualsIgnoreCase()
• The EqualsIgnoreCase() function compares two strings, ignoring case, and verifies whether they are equal. If they are equal, it returns the Boolean value true; otherwise it returns false.
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
• grunt> equals_data = FOREACH emp_data GENERATE (id,name), EqualsIgnoreCase(name, 'Robin');
• grunt> Dump equals_data;
((1,Robin),true)
((2,BOB),false)
((3,Maya),false)
((4,Sara),false)
((5,David),false)
((6,Maggy),false)
((7,Robert),false)
((8,Syam),false)
((9,Mary),false)
((10,Saran),false)
((11,Stacy),false)
((12,Kelly),false)
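Only the row for Robin is true in the output above, because no other name matches 'Robin' in any letter case. A rough plain-Python sketch of the comparison follows. It is an approximation for ASCII names only, and the helper name equals_ignore_case is made up.

```python
def equals_ignore_case(left, right):
    """Approximate a case-insensitive comparison by lower-casing both sides."""
    return left.lower() == right.lower()

print(equals_ignore_case("Robin", "Robin"))   # same text
print(equals_ignore_case("ROBIN", "robin"))   # differs only in case
print(equals_ignore_case("Robert", "Robin"))  # different text
```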
UPPER(), LOWER()
• UPPER - This function is used to convert all the characters in a string to uppercase.
• LOWER - This function is used to convert all the characters in a string to lowercase.
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
• grunt> upper_data = FOREACH emp_data GENERATE (id,name), UPPER(name);
• grunt> Dump upper_data;
• grunt> lower_data = FOREACH emp_data GENERATE (id,name), LOWER(name);
• grunt> Dump lower_data;
REPLACE()
• This function replaces every part of a string that matches a given pattern (a regular expression) with new characters. It takes three parameters: the string, the pattern, and the replacement.
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
• grunt> replace_data = FOREACH emp_data GENERATE (id,city), REPLACE(city,'Bhuwaneshwar','Bhuw');
• grunt> Dump replace_data;

((1,newyork),newyork)
((2,Kolkata),Kolkata)
((3,Tokyo),Tokyo)
((4,London),London)
((5,Bhuwaneshwar),Bhuw)
((6,Chennai),Chennai)
((7,newyork),newyork)
((8,Kolkata),Kolkata)
((9,Tokyo),Tokyo)
((10,London),London)
((11,Bhuwaneshwar),Bhuw)
((12,Chennai),Chennai)
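The second argument of REPLACE is treated as a regular expression. The sketch below uses Python's re.sub as a stand-in; for a plain literal such as 'Bhuwaneshwar' the behaviour is the same, but special characters in the replacement text are handled differently in Python and in Pig. The helper name pig_replace is made up.

```python
import re

def pig_replace(text, pattern, new_text):
    """Replace every match of the regular expression pattern in text."""
    if text is None:
        return None  # a null field stays null
    return re.sub(pattern, new_text, text)

cities = ["newyork", "Kolkata", "Tokyo", "London", "Bhuwaneshwar", "Chennai"]
shortened = [pig_replace(c, "Bhuwaneshwar", "Bhuw") for c in cities]
print(shortened)
```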
TRIM(), RTRIM(), LTRIM()
• The TRIM() function accepts a string and returns a copy with the leading and trailing spaces removed.
• The LTRIM() function is similar to TRIM(), but it removes spaces only from the left side of the given string (leading spaces).
• The RTRIM() function is similar to TRIM(), but it removes spaces only from the right side of the given string (trailing spaces).
• grunt> emp_data = LOAD '/home/cloudera/emp.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
• grunt> trim_data = FOREACH emp_data GENERATE (id,name), TRIM(name);
• grunt> ltrim_data = FOREACH emp_data GENERATE (id,name), LTRIM(name);
• grunt> rtrim_data = FOREACH emp_data GENERATE (id,name), RTRIM(name);
• grunt> Dump trim_data;
• grunt> Dump ltrim_data;
• grunt> Dump rtrim_data;
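The three functions differ only when a value carries extra spaces. The plain-Python sketch below uses a padded sample value, which is an assumption made for illustration, and strip, lstrip and rstrip as approximate counterparts of TRIM, LTRIM and RTRIM.

```python
# Hypothetical padded value: two spaces before and two after the name.
padded = "  Robin  "

trimmed = padded.strip()         # counterpart of TRIM: both sides
left_trimmed = padded.lstrip()   # counterpart of LTRIM: leading spaces only
right_trimmed = padded.rstrip()  # counterpart of RTRIM: trailing spaces only

# repr() makes the remaining spaces visible in the printed output.
print(repr(trimmed), repr(left_trimmed), repr(right_trimmed))
```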

Date-time Functions
• ToDate()
• GetDay()
• GetMonth()
• GetYear()
ToDate()
• This function is used to generate a DateTime object according
to the given parameters.

• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
• grunt> date_data = LOAD '/home/cloudera/date.txt' USING PigStorage(',') as (id:int, date:chararray);
• grunt> todate_data = foreach date_data generate ToDate(date,'yyyy/MM/dd HH:mm:ss') as (date_time:DateTime);
• grunt> Dump todate_data;
(1989-09-26T09:00:00.000+05:30)
(1980-06-20T10:22:00.000+05:30)
(1990-12-19T03:11:44.000+05:30)
GetDay()
• This function accepts a date-time object as a parameter and returns the day of the month of that date-time object.

• date.txt
001,1989/09/26 09:00:00
002,1980/06/20 10:22:00
003,1990/12/19 03:11:44
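GetDay(), GetMonth() and GetYear() each pull one field out of a date-time value. The sketch below traces the idea in plain Python on the date.txt rows. The format string '%Y/%m/%d %H:%M:%S' is the Python counterpart of the 'yyyy/MM/dd HH:mm:ss' pattern passed to ToDate(), and the helper name to_date is made up.

```python
from datetime import datetime

# Python counterpart of the 'yyyy/MM/dd HH:mm:ss' pattern used with ToDate().
PATTERN = "%Y/%m/%d %H:%M:%S"

# Rows of date.txt as (id, date string).
rows = [
    (1, "1989/09/26 09:00:00"),
    (2, "1980/06/20 10:22:00"),
    (3, "1990/12/19 03:11:44"),
]

def to_date(text):
    """Parse the text into a date-time object, as ToDate() does in Pig."""
    return datetime.strptime(text, PATTERN)

# For each row: day of month, month number and year.
parts = [(to_date(d).day, to_date(d).month, to_date(d).year) for _, d in rows]
print(parts)
```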
UDFs
User Defined Functions
• Apache Pig provides extensive support for User Defined Functions (UDFs).
• Using these UDFs, we can define our own functions and use them.
• UDF support is provided in six programming languages: Java, Jython, Python, JavaScript, Ruby and Groovy.
• In Java,
An eval UDF must extend "org.apache.pig.EvalFunc".
The class must override the "exec" method.
Creating UDFs
• Open Eclipse and create a new project.
• Create the jar file and export it into the specific directory.
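Java code
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Eval UDF that converts its first argument to upper case.
public class Sample_Eval extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Return null for a missing or empty input tuple.
        if (input == null || input.size() == 0)
            return null;
        String str = (String) input.get(0);
        // A null field yields null instead of a NullPointerException.
        return (str == null) ? null : str.toUpperCase();
    }
}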
Registering the Jar file
• grunt> REGISTER '/home/cloudera/sample_udf.jar';
• grunt> DEFINE sample_eval Sample_Eval();
• grunt> emp_data = LOAD '/home/cloudera/pigdata.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);
• grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);
Using the UDF
After defining the alias, you can use the UDF in the same way as the built-in functions. Suppose there is a file named emp1.txt in the HDFS /pig_data/ directory with the following content.

001,Robin,22,newyork
002,BOB,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai
007,Robert,22,newyork
008,Syam,23,Kolkata
009,Mary,25,Tokyo
010,Saran,25,London
011,Stacy,25,Bhuwaneshwar
012,Kelly,22,Chennai

Using the UDF
grunt> emp_data = LOAD 'hdfs://localhost:9000/pig_data/emp1.txt' USING PigStorage(',') as (id:int, name:chararray, age:int, city:chararray);

Let us now convert the names of the employees into upper case using the UDF sample_eval.

grunt> Upper_case = FOREACH emp_data GENERATE sample_eval(name);

Verify the contents of the relation Upper_case as shown below.

grunt> Dump Upper_case;

(ROBIN)
(BOB)
(MAYA)
(SARA)
(DAVID)
(MAGGY)
(ROBERT)
(SYAM)
(MARY)
(SARAN)
(STACY)
(KELLY)
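The same upper-case logic can also be written as a Python UDF, since Python is one of the supported UDF languages. The following is a minimal sketch and not part of the original lecture. It assumes the CPython (streaming_python) route, where the outputSchema decorator is imported from pig_util; the function name to_upper is made up, and the fallback decorator exists only so that the file also runs outside Pig.

```python
try:
    # Provided by Pig when the script runs as a streaming_python UDF.
    from pig_util import outputSchema
except ImportError:
    # Fallback for running this file outside Pig: a decorator that does nothing.
    def outputSchema(schema):
        def wrap(func):
            return func
        return wrap

@outputSchema("name:chararray")
def to_upper(name):
    """Return the name in upper case; a null input stays null."""
    if name is None:
        return None
    return name.upper()

print(to_upper("Robin"))
```

Assuming the file is saved as udfs.py (a hypothetical name), it would be registered with a statement such as REGISTER 'udfs.py' USING streaming_python AS myfuncs; and then called as myfuncs.to_upper(name).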
