Whirlwind tour of Pig
           Chris Wilkes
    cwilkes@seattlehadoop.org
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Why Pig?                                      Tired of boilerplate


•   Started off writing Mappers/Reducers in Java

    •   Fun at first

    •   Gets a little tedious

•   Need to do more than one MR step

    •   Write own flow control

    •   Utility classes to pass parameters / input paths

•   Go back and change a Reducer’s input type

    •   Did you change it in the Job setup?

    •   Processing two different input types in first job
Why Pig?                 Java MapReduce boilerplate example




•   Typical use case: have two different input types

    •   log files (timestamps and userids)

    •   database table dump (userids and names)

•   Want to combine the two together

•   Relatively simple, but tedious
Why Pig?                   Java MapReduce boilerplate example

The two inputs produce two different map output types, so you need
a single wrapper class that can handle both, designated with a "tag":
 Mapper<LongWritable,Text,TaggedKeyWritable,TaggedValueWritable>
 Reducer<TaggedKeyWritable,TaggedValueWritable,Text,PurchaseInfoWritable>

Inside the Mapper, check in setup() or run() which input Path the
split came from to decide if this is a log file or a database table:
 // sketch: with the new API the InputSplit is a FileSplit
 if (((FileSplit) context.getInputSplit()).getPath().toString().contains("logfile")) {
   inputType = "LOGFILE";
 } else {
   inputType = "DATABASE";
 }

Reducer: check the tag and then combine:
 if (key.getTag().equals("LOGFILE")) { logEntry = value.getValue(); }
 else if (key.getTag().equals("DATABASE")) { userInfo = value.getValue(); }
 context.write(userInfo.getId(), logEntry.getTime() + " " + userInfo.getName());
Where’s your shears?

"I was working on my thesis and realized I needed a reference. I'd
seen a post on comp.arch recently that cited a paper, so I fired up
gnus. While I was searching for the post, I came across another post
whose MIME encoding screwed up my ancient version of gnus, so I
stopped and downloaded the latest version of gnus.
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Data Types




•   From largest to smallest:

    •   Bag (relation / group)

    •   Tuple

    •   Field

•   A bag is a collection of tuples, tuples have fields
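
A minimal sketch showing all three at once (the same data appears
on the next slide):

 logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);  -- each field is typed
 bybook = GROUP logs BY bookid;  -- each output row is a tuple
 DESCRIBE bybook;                -- its second column, logs, is an inner bag of tuples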
Data Types                                                                   Bag

$ cat logs
101 1002       10.09
101 8912       5.96
102 1002       10.09
103 8912       5.96
103 7122       88.99

$ cat groupbooks.pig
logs = LOAD 'logs' AS
  (userid: int, bookid: long, price: double);
bookbuys = GROUP logs BY bookid;
DESCRIBE bookbuys;
DUMP bookbuys;

$ pig -x local groupbooks.pig
bookbuys: {group: long,logs: {userid: int,bookid: long,price: double}}

Each output line is a tuple; its second column is an inner bag of
tuples, and each value (such as 1002L or 10.09) is a field:
(1002L,{(101,1002L,10.09),(102,1002L,10.09)})
(7122L,{(103,7122L,88.99)})
(8912L,{(101,8912L,5.96),(103,8912L,5.96)})
Data Types                                               Tuple and Fields

$ cat booksexpensive.pig
logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
bookbuys = GROUP logs BY bookid;
expensive = FOREACH bookbuys {
      inside = FILTER logs BY price > 6.0;  -- "logs" refers to the inner bag
      GENERATE inside;
}
DESCRIBE expensive;
DUMP expensive;

$ pig -x local booksexpensive.pig
expensive: {inside: {userid: int,bookid: long,price: double}}
({(101,1002L,10.09),(102,1002L,10.09)})
({(103,7122L,88.99)})
({})

Note: fields can always be referred to positionally as $0, $1, etc.
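
A minimal sketch of positional references (same logs data as above):

 pairs = FOREACH logs GENERATE $0, $2;  -- same as GENERATE userid, price
 DESCRIBE pairs;                        -- pairs: {userid: int,price: double}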
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
Operator                                                             Load

This will load all files under the logs/2010/05 directory
(or the logs/2010/05 file) and put them into clicklogs:
      clicklogs = LOAD 'logs/2010/05';


Naming the fields in the tuple "userid" and "url" avoids
having to refer to them as $0 and $1:

      clicklogs = LOAD 'logs/2010/05' AS (userid: int, url: chararray);


  Note: no actual loading occurs
  until a DUMP or STORE command
  is executed.
Operator                                           Load

By default LOAD splits on the tab character (the same as the
key/value separator in MapReduce jobs). You can also specify
your own delimiter:
         LOAD 'logs' USING PigStorage('~')

PigStorage implements LoadFunc -- implement this
interface to create your own loader, e.g. "RegExLoader"
from the Piggybank.
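
A minimal usage sketch, assuming the piggybank jar has been built
(jar path and regex are illustrative):

 REGISTER piggybank.jar;
 -- custom delimiter via the builtin PigStorage
 logs = LOAD 'logs' USING PigStorage('~') AS (userid: int, url: chararray);
 -- or a regex-based loader from the Piggybank
 raw = LOAD 'access.log' USING
   org.apache.pig.piggybank.storage.MyRegExLoader('(\\S+) (\\S+)');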
Operator                              Describe, Dump, and Store

"Describe" prints out that variable's schema:
    DESCRIBE combotimes;
    combotimes: {group: chararray,
      enter: {time: chararray,userid: chararray},
      exit: {time: chararray,userid: chararray,cost: double}}
To see output on the screen, type "DUMP varname":
    DUMP namesandaddresses;
To output to a file / directory, use STORE:
    STORE patienttrials INTO 'trials/2010';
Operator                                                           Group

 $ cat starnames
 1     Mintaka
 2     Alnitak
 3     Epsilon Orionis

 $ cat starpositions
 1     R.A. 05h 32m 0.4s, Dec. -00 17' 57"
 2     R.A. 05h 40m 45.5s, Dec. -01 56' 34"
 3     R.A. 05h 36m 12.8s, Dec. -01 12' 07"

    $ cat starsandpositions.pig
    names = LOAD 'starnames' as (id: int, name: chararray);
    positions = LOAD 'starpositions' as (id: int, position: chararray);
    nameandpos = GROUP names BY id, positions BY id;
    DESCRIBE nameandpos;
    DUMP nameandpos;

 nameandpos: {group: int,names: {id: int,name: chararray},
  positions: {id: int,position: chararray}}
 (1,{(1,Mintaka)},{(1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")})
 (2,{(2,Alnitak)},{(2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")})
 (3,{(3,Epsilon Orionis)},{(3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")})
Operator                                                                 Join

Just like GROUP but flatter
   $ cat starsandpositions2.pig
   names = LOAD 'starnames' as (id: int, name: chararray);
   positions = LOAD 'starpositions' as (id: int, position: chararray);
   nameandpos = JOIN names BY id, positions BY id;
   DESCRIBE nameandpos;
   DUMP nameandpos;

 nameandpos: {names::id: int,names::name: chararray,
 positions::id: int,positions::position: chararray}

 (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
 (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
 (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
Operator                                                          Flatten

Ugly looking output from before:
  expensive: {inside: {userid: int,bookid: long,price: double}}
  ({(101,1002L,10.09),(102,1002L,10.09)})
  ({(103,7122L,88.99)})
Use the FLATTEN operator
  expensive = FOREACH bookbuys {
        inside = FILTER logs BY price > 6.0;
        GENERATE group, FLATTEN (inside);
  }
  expensive: {group: long,inside::userid: int,inside::bookid:
  long,inside::price: double}
  (1002L,101,1002L,10.09)
  (1002L,102,1002L,10.09)
  (7122L,103,7122L,88.99)
Operator                                       Renaming in Foreach

 All the columns have cumbersome names:
  expensive: {group: long,inside::userid: int,inside::bookid:
  long,inside::price: double}
 Pick and rename:
  expensive = FOREACH bookbuys {
     inside = FILTER logs BY price > 6.0;
     GENERATE group AS userid,
       FLATTEN (inside.(bookid, price)) AS (bookid, price);
  }
                                               Kept the type!
 Now easy to use:
  expensive: {userid: long,bookid: long,price: double}
  (1002L,1002L,10.09)
  (1002L,1002L,10.09)
  (7122L,7122L,88.99)
Operator                                                          Split

When the input file mixes record types or needs separating:
  $ cat enterexittimes
  2010-05-10 12:55:12     user123 enter
  2010-05-10 13:14:23     user456 enter
  2010-05-10 13:16:53     user123 exit 23.79
  2010-05-10 13:17:49     user456 exit 0.50

  inandout = LOAD 'enterexittimes';
  SPLIT inandout INTO enter1 IF $2 == 'enter', exit1 IF $2 == 'exit';

  enter1:
  (2010-05-10 12:55:12,user123,enter)
  (2010-05-10 13:14:23,user456,enter)

  exit1:
  (2010-05-10 13:16:53,user123,exit,23.79)
  (2010-05-10 13:17:49,user456,exit,0.50)
Operator                                                          Split

If every line had the same schema we could specify it on LOAD; in
this case we need a FOREACH:
 enter = FOREACH enter1 GENERATE
   (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray;
 exit = FOREACH exit1 GENERATE
   (chararray)$0 AS time:chararray, (chararray)$1 AS userid:chararray,
   (double)$3 AS cost:double;
 DESCRIBE enter;
 DESCRIBE exit;

 enter: {time: chararray,userid: chararray}
 exit: {time: chararray,userid: chararray,cost: double}
Operator                                                 Sample, Limit

For testing purposes, SAMPLE both large inputs:
   names1 = LOAD 'starnames' as (id: int, name: chararray);
   names = SAMPLE names1 0.3;
   positions1 = LOAD 'starpositions' as (id: int, position: chararray);
   positions = SAMPLE positions1 0.3;
Each run returns a different random subset of rows:
   (1,Mintaka,1,R.A. 05h 32m 0.4s, Dec. -00 17' 57")
LIMIT returns only the first N results. Use it with ORDER BY
to return the top results:
   nameandpos1 = JOIN names BY id, positions BY id;
   nameandpos2 = ORDER nameandpos1 BY names::id DESC;
   nameandpos = LIMIT nameandpos2 2;
   (3,Epsilon Orionis,3,R.A. 05h 36m 12.8s, Dec. -01 12' 07")
   (2,Alnitak,2,R.A. 05h 40m 45.5s, Dec. -01 56' 34")
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
UDF

UDF: User Defined Function
Operates on single values or on a group
Simple example: IsEmpty (a FilterFunc)
   users = JOIN names BY id, addresses BY id;
   D = FOREACH users GENERATE names::id,
     (IsEmpty(names::firstName) ? 'none' : names::firstName);
Working over an aggregate, e.g. COUNT:
   users = GROUP names BY id, books BY buyerId;
   D = FOREACH users GENERATE group, COUNT(books);
Working on two values:
   distance1 = CROSS stars, stars2;  -- a relation cannot be CROSSed with itself
   distance = ...
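
A minimal sketch of wiring in a custom UDF, assuming a jar with a
hypothetical com.example.Distance EvalFunc (jar and class names are
illustrative):

 REGISTER myudfs.jar;
 DEFINE Distance com.example.Distance();
 stars2 = FOREACH stars GENERATE *;   -- second alias so we can CROSS
 pairs = CROSS stars, stars2;
 distance = FOREACH pairs GENERATE
   Distance(stars::position, stars2::position);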
Agenda


  1   Why Pig?
  2   Data types
  3   Operators
  4   UDFs
  5   Using Pig
LOAD and GROUP
logfiles = LOAD 'logs' AS (userid: int, bookid: long, price: double);
userinfo = LOAD 'users' AS (userid: int, name: chararray);
userpurchases = GROUP logfiles BY userid, userinfo BY userid;
DESCRIBE userpurchases;
DUMP userpurchases;

Inside {} are bags (unordered);
inside () are tuples (ordered lists of fields).

report = FOREACH userpurchases GENERATE
  FLATTEN(userinfo.name) AS name, group AS userid,
  FLATTEN(SUM(logfiles.price)) AS cost;
bybigspender = ORDER report BY cost DESC;
DUMP bybigspender;

(Bob,103,94.94999999999999)
(Joe,101,16.05)
(Cindy,102,10.09)
Entering and exiting recorded in same file:

2010-05-10 12:55:12 user123 enter
2010-05-10 13:14:23 user456 enter
2010-05-10 13:16:53 user123 exit 23.79
2010-05-10 13:17:49 user456 exit 0.50
inandout = LOAD 'enterexittimes';
SPLIT inandout INTO enter1
  IF $2 == 'enter', exit1 IF $2 == 'exit';

enter = FOREACH enter1 GENERATE
 (chararray)$0 AS time:chararray,
 (chararray)$1 AS userid:chararray;

exit = FOREACH exit1 GENERATE
 (chararray)$0 AS time:chararray,
 (chararray)$1 AS userid:chararray,
 (double)$3 AS cost:double;
combotimes = GROUP enter BY $1, exit BY $1;

purchases = FOREACH combotimes GENERATE
 group AS userid,
 FLATTEN(enter.$0) AS entertime,
 FLATTEN(exit.$0) AS exittime,
 FLATTEN(exit.$2);

DUMP purchases;
The schemas for inandout, enter1, and exit1 are unknown.

enter: {time: chararray,userid: chararray}
exit: {time: chararray,userid: chararray,cost: double}

combotimes: {group: chararray,
 enter: {time: chararray,userid: chararray},
 exit: {time: chararray,userid: chararray,cost: double}}

purchases: {userid: chararray,entertime: chararray,
 exittime: chararray,cost: double}
UDFs
• User Defined Function
• For doing an operation on data
• Already use several builtins:
  • COUNT
  • SUM
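
A minimal sketch of these builtins in use (schema as in the earlier
logs example):

 logs = LOAD 'logs' AS (userid: int, bookid: long, price: double);
 byuser = GROUP logs BY userid;
 totals = FOREACH byuser GENERATE
   group AS userid,
   COUNT(logs) AS purchases,  -- builtin: number of tuples in the bag
   SUM(logs.price) AS spent;  -- builtin: sum of a numeric field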
