Whirlwind tour of Pig
           Chris Wilkes
    cwilkes@seattlehadoop.org
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Why Pig?                                      Tired of boilerplate


•   Started off writing Mappers/Reducers in Java

    •   Fun at first

    •   Gets a little tedious

•   Need to do more than one MR step

    •   Write own flow control

    •   Utility classes to pass parameters / input paths

•   Go back and change a Reducer’s input type

    •   Did you change it in the Job setup?

    •   Processing two different input types in first job
Why Pig?                 Java MapReduce boilerplate example




•   Typical use case: have two different input types

    •   log files (timestamps and userids)

    •   database table dump (userids and names)

•   Want to combine the two together

•   Relatively simple, but tedious
Why Pig?                   Java MapReduce boilerplate example

The two inputs produce two different map output types, so you need
a single wrapper class that can handle both, designated with a "tag":
 Mapper<LongWritable,Text,TaggedKeyWritable,TaggedValueWritable>
 Reducer<TaggedKeyWritable,TaggedValueWritable,Text,PurchaseInfoWritable>

Inside the Mapper, check in setup() or run() which input Path the
split came from to decide if this is a log file or a database table:
 // sketch: with the new API the InputSplit is a FileSplit
 if (((FileSplit) context.getInputSplit()).getPath().toString().contains("logfile")) {
   inputType = "LOGFILE";
 } else {
   inputType = "DATABASE";
 }

Reducer: check the tag and then combine:
 if (key.getTag().equals("LOGFILE")) { logEntry = value.getValue(); }
 else if (key.getTag().equals("DATABASE")) { userInfo = value.getValue(); }
 context.write(userInfo.getId(), logEntry.getTime() + " " + userInfo.getName());
Where’s your shears?

"I was working on my thesis and realized I needed a reference. I'd
seen a post on comp.arch recently that cited a paper, so I fired up
gnus. While I was searching for the post, I came across another post
whose MIME encoding screwed up my ancient version of gnus, so I
stopped and downloaded the latest version of gnus.
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Data Types




•   From largest to smallest:

    •   Bag (relation / group)

    •   Tuple

    •   Field

•   A bag is a collection of tuples, tuples have fields
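
A minimal sketch showing all three at once (the same data appears
on the next slide):

 logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);  -- each field is typed
 bybook = GROUP logs BY bookid;  -- each output row is a tuple
 DESCRIBE bybook;                -- its second column, logs, is an inner bag of tuples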
Data Types                                                                   Bag

$ cat logs
101 1002       10.09
101 8912       5.96
102 1002       10.09
103 8912       5.96
103 7122       88.99

$ cat groupbooks.pig
logs = LOAD 'logs' AS
  (userid: int, bookid: long, price: double);
bookbuys = GROUP logs BY bookid;
DESCRIBE bookbuys;
DUMP bookbuys;

$ pig -x local groupbooks.pig
bookbuys: {group: long,logs: {userid: int,bookid: long,price: double}}

Each output line is a tuple; its second column is an inner bag of
tuples, and each value (such as 1002L or 10.09) is a field:
(1002L,{(101,1002L,10.09),(102,1002L,10.09)})
(7122L,{(103,7122L,88.99)})
(8912L,{(101,8912L,5.96),(103,8912L,5.96)})
Data Types                                               Tuple and Fields

$ cat booksexpensive.pig
logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
bookbuys = GROUP logs BY bookid;
expensive = FOREACH bookbuys {
      inside = FILTER logs BY price > 6.0;  -- "logs" refers to the inner bag
      GENERATE inside;
}
DESCRIBE expensive;
DUMP expensive;

$ pig -x local booksexpensive.pig
expensive: {inside: {userid: int,bookid: long,price: double}}
({(101,1002L,10.09),(102,1002L,10.09)})
({(103,7122L,88.99)})
({})

Note: fields can always be referred to positionally as $0, $1, etc.
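
A minimal sketch of positional references (same logs data as above):

 pairs = FOREACH logs GENERATE $0, $2;  -- same as GENERATE userid, price
 DESCRIBE pairs;                        -- pairs: {userid: int,price: double}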
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Operator                                                             Load

This will load all files under the logs/2010/05 directory
(or the logs/2010/05 file) and put them into clicklogs:
      clicklogs = LOAD 'logs/2010/05';


Naming the fields in the tuple "userid" and "url" avoids
having to refer to them as $0 and $1:

      clicklogs = LOAD 'logs/2010/05' AS (userid: int, url: chararray);


  Note: no actual loading occurs
  until a DUMP or STORE command
  is executed.
Operator                                           Load

By default LOAD splits on the tab character (the same as the
key/value separator in MapReduce jobs). You can also specify
your own delimiter:
         LOAD 'logs' USING PigStorage('~')

PigStorage implements LoadFunc -- implement this
interface to create your own loader, e.g. "RegExLoader"
from the Piggybank.
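
A minimal usage sketch, assuming the piggybank jar has been built
(jar path and regex are illustrative):

 REGISTER piggybank.jar;
 -- custom delimiter via the builtin PigStorage
 logs = LOAD 'logs' USING PigStorage('~') AS (userid: int, url: chararray);
 -- or a regex-based loader from the Piggybank
 raw = LOAD 'access.log' USING
   org.apache.pig.piggybank.storage.MyRegExLoader('(\\S+) (\\S+)');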
Operator                              Describe, Dump, and Store

"Describe" prints out that variable's schema:
    DESCRIBE combotimes;
    combotimes: {group: chararray,
      enter: {time: chararray,userid: chararray},
      exit: {time: chararray,userid: chararray,cost: double}}
To see output on the screen, type "DUMP varname":
    DUMP namesandaddresses;
To output to a file / directory, use STORE:
    STORE patienttrials INTO 'trials/2010';
Operator                                                           Group

 $ cat starnames
 1     Mintaka
 2     Alnitak
 3     Epsilon Orionis

 $ cat starpositions
 1     R.A. 05h 32m 0.4s, Dec. -00 17' 57"
 2     R.A. 05h 40m 45.5s, Dec. -01 56' 34"
 3     R.A. 05h 36m 12.8s, Dec. -01 12' 07"

    $ cat starsandpositions.pig
    names = LOAD 'starnames' as (id: int, name: chararray);
    positions = LOAD 'starpositions' as (id: int, position: chararray);
    nameandpos = GROUP names BY id, positions BY id;
    DESCRIBE nameandpos;
    DUMP nameandpos;

 nameandpos: {group: int,names: {id: int,name: chararray},
  positions: {id: int,position: chararray}}
 (1,{(1,Mintaka)},{(1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")})
 (2,{(2,Alnitak)},{(2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")})
 (3,{(3,Epsilon Orionis)},{(3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")})
Operator                                                                 Join

Just like GROUP but flatter
   $ cat starsandpositions2.pig
   names = LOAD 'starnames' as (id: int, name: chararray);
   positions = LOAD 'starpositions' as (id: int, position: chararray);
   nameandpos = JOIN names BY id, positions BY id;
   DESCRIBE nameandpos;
   DUMP nameandpos;

 nameandpos: {names::id: int,names::name: chararray,
 positions::id: int,positions::position: chararray}

 (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
 (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
 (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
Operator                                                          Flatten

Ugly looking output from before:
  expensive: {inside: {userid: int,bookid: long,price: double}}
  ({(101,1002L,10.09),(102,1002L,10.09)})
  ({(103,7122L,88.99)})
Use the FLATTEN operator
  expensive = FOREACH bookbuys {
        inside = FILTER logs BY price > 6.0;
        GENERATE group, FLATTEN (inside);
  }
  expensive: {group: long,inside::userid: int,inside::bookid:
  long,inside::price: double}
  (1002L,101,1002L,10.09)
  (1002L,102,1002L,10.09)
  (7122L,103,7122L,88.99)
Operator                                       Renaming in Foreach

 All the columns have cumbersome names:
  expensive: {group: long,inside::userid: int,inside::bookid:
  long,inside::price: double}
 Pick and rename:
  expensive = FOREACH bookbuys {
     inside = FILTER logs BY price > 6.0;
     GENERATE group AS userid,
       FLATTEN (inside.(bookid, price)) AS (bookid, price);
  }
                                               Kept the type!
 Now easy to use:
  expensive: {userid: long,bookid: long,price: double}
  (1002L,1002L,10.09)
  (1002L,1002L,10.09)
  (7122L,7122L,88.99)
Operator                                                          Split

When the input file mixes record types or needs separating:
  $ cat enterexittimes
  2010-05-10 12:55:12     user123 enter
  2010-05-10 13:14:23     user456 enter
  2010-05-10 13:16:53     user123 exit 23.79
  2010-05-10 13:17:49     user456 exit 0.50

  inandout = LOAD 'enterexittimes';
  SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';

  enter1:
  (2010-05-10 12:55:12,user123,enter)
  (2010-05-10 13:14:23,user456,enter)

  exit1:
  (2010-05-10 13:16:53,user123,exit,23.79)
  (2010-05-10 13:17:49,user456,exit,0.50)
Operator                                                          Split

If every line had the same schema we could specify it on LOAD; in
this case we need a FOREACH:
 enter = FOREACH enter1 GENERATE
   (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
 exit = FOREACH exit1 GENERATE
   (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray,
   (double)$3 AS cost:double;
 DESCRIBE enter;
 DESCRIBE exit;

 enter: {time: chararray,userid: chararray}
 exit: {time: chararray,userid: chararray,cost: double}
Operator                                                 Sample, Limit

For testing purposes, SAMPLE both large inputs:
   names1 = LOAD 'starnames' as (id: int, name: chararray);
   names = SAMPLE names1 0.3;
   positions1 = LOAD 'starpositions' as (id: int, position: chararray);
   positions = SAMPLE positions1 0.3;
Each run returns a different random subset of rows:
   (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
LIMIT returns only the first N results. Use it with ORDER BY
to return the top results:
   nameandpos1 = JOIN names BY id, positions BY id;
   nameandpos2 = ORDER nameandpos1 BY names::id DESC;
   nameandpos = LIMIT nameandpos2 2;
   (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
   (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
UDF

UDF: User Defined Function
Operates on single values or on a group
Simple example: IsEmpty (a FilterFunc)
   users = JOIN names BY id, addresses BY id;
   D = FOREACH users GENERATE names::id,
     (IsEmpty(names::firstName) ? 'none' : names::firstName);
Working over an aggregate, e.g. COUNT:
   users = GROUP names BY id, books BY buyerId;
   D = FOREACH users GENERATE group, COUNT(books);
Working on two values:
   distance1 = CROSS stars, stars2;  -- a relation cannot be CROSSed with itself
   distance = ...
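
A minimal sketch of wiring in a custom UDF, assuming a jar with a
hypothetical com.example.Distance EvalFunc (jar and class names are
illustrative):

 REGISTER myudfs.jar;
 DEFINE Distance com.example.Distance();
 stars2 = FOREACH stars GENERATE *;   -- second alias so we can CROSS
 pairs = CROSS stars, stars2;
 distance = FOREACH pairs GENERATE
   Distance(stars::position, stars2::position);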
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
LOAD and GROUP
logfiles = LOAD 'logs' AS (userid: int, bookid: long, price: double);
userinfo = LOAD 'users' AS (userid: int, name: chararray);
userpurchases = GROUP logfiles BY userid, userinfo BY userid;
DESCRIBE userpurchases;
DUMP userpurchases;

Inside {} are bags (unordered);
inside () are tuples (ordered lists of fields).

report = FOREACH userpurchases GENERATE
  FLATTEN(userinfo.name) AS name, group AS userid,
  FLATTEN(SUM(logfiles.price)) AS cost;
bybigspender = ORDER report BY cost DESC;
DUMP bybigspender;

(Bob,103,94.94999999999999)
(Joe,101,16.05)
(Cindy,102,10.09)
Entering and exiting recorded in same file:

2010-05-10 12:55:12 user123 enter
2010-05-10 13:14:23 user456 enter
2010-05-10 13:16:53 user123 exit 23.79
2010-05-10 13:17:49 user456 exit 0.50
inandout = LOAD 'enterexittimes';
SPLIT inandout INTO enter1
  IF $2 == 'enter', exit1 IF $2 == 'exit';

enter = FOREACH enter1 GENERATE
 (chararray)$0 AS time:chararray,
 (chararray)$1 AS userid:chararray;

exit = FOREACH exit1 GENERATE
 (chararray)$0 AS time:chararray,
 (chararray)$1 AS userid:chararray,
 (double)$3 AS cost:double;
combotimes = GROUP enter BY $1, exit BY $1;

purchases = FOREACH combotimes GENERATE
 group AS userid,
 FLATTEN(enter.$0) AS entertime,
 FLATTEN(exit.$0) AS exittime,
 FLATTEN(exit.$2);

DUMP purchases;
The schemas for inandout, enter1, and exit1 are unknown.

enter: {time: chararray,userid: chararray}
exit: {time: chararray,userid: chararray,cost: double}

combotimes: {group: chararray,
 enter: {time: chararray,userid: chararray},
 exit: {time: chararray,userid: chararray,cost: double}}

purchases: {userid: chararray,entertime: chararray,
 exittime: chararray,cost: double}
UDFs
• User Defined Function
• For doing an operation on data
• Already use several builtins:
  • COUNT
  • SUM
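
A minimal sketch of these builtins in use (schema as in the earlier
logs example):

 logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
 byuser = GROUP logs BY userid;
 totals = FOREACH byuser GENERATE
   group AS userid,
   COUNT(logs) AS purchases,  -- builtin: number of tuples in the bag
   SUM(logs.price) AS spent;  -- builtin: sum of a numeric field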
