Pig, Making Hadoop Easy
Alan F. Gates, Yahoo!
Who Am I?
Pig committer
Hadoop PMC member
An architect on the Yahoo! grid team
Or, as one coworker put it, “the lipstick on the Pig”
Who are you?
Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In Map Reduce
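The body of this slide is an image of the equivalent Java MapReduce code, which the text export does not capture. As a rough illustration only (not the slide's actual listing), here is a minimal sketch of just the mapper for the first of several jobs such a pipeline needs: tag user and page records so a reducer can join them on name, applying the age filter in the map. The tab-delimited field layout and the file-name check are assumptions made for the example.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch only: first job of the pipeline, a reduce-side join of users and pages on name.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    if (fromUsersFile(context)) {
      // Users file: name, age. Apply the age filter here and tag the record "U".
      int age = Integer.parseInt(fields[1]);
      if (age >= 18 && age <= 25) {
        context.write(new Text(fields[0]), new Text("U"));
      }
    } else {
      // Pages file: user, url. Tag the record "P" and carry the url along.
      context.write(new Text(fields[0]), new Text("P\t" + fields[1]));
    }
  }

  // Assumed convention: the input file name tells us which data set a split came from.
  private boolean fromUsersFile(Context context) {
    FileSplit split = (FileSplit) context.getInputSplit();
    return split.getPath().getName().startsWith("users");
  }
}

A matching reducer, plus further jobs for the group/count, the ordering, and the top-5 limit, would each add a similar amount of code, which is the contrast the Pig Latin version on the next slide draws.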
In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
Performance
Chart (not reproduced): Pig performance across releases 0.1, 0.2, 0.3, 0.4/0.5, and 0.6/0.7.
Why not SQL?
Diagram (not reproduced): data flows from Data Collection into the Data Factory, where Pig handles pipelines, iterative processing, and research, and then into the Data Warehouse, where Hive serves BI tools and analysis.
Pig Highlights
User defined functions (UDFs) can be written for column transformation (TOUPPER) or aggregation (SUM); a sketch follows this list.
UDFs can be written to take advantage of the combiner.
Four join implementations built in: hash, fragment-replicate, merge, skewed.
Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned.
Order by provides total ordering across reducers in a balanced way.
Writing load and store functions is easy once an InputFormat and OutputFormat exist.
Piggybank, a collection of user-contributed UDFs.
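As a concrete illustration of the UDF point above, here is a minimal sketch of an eval UDF following the EvalFunc pattern described in the speaker notes: input always arrives as a Tuple, so the function must check what it actually received. The class name UPPER and the assumption that the first field is a chararray are made up for the example.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Column-transformation UDF: upper-cases its first argument.
public class UPPER extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Input is always a Tuple, so the UDF must check what it was actually passed.
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

In a script the containing jar would be registered with register, and the class is then called like any built-in, e.g. generate UPPER(name).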
Who uses Pig for What?
70% of production jobs at Yahoo! (tens of thousands per day)
Also used by Twitter, LinkedIn, eBay, AOL, …
Used to:
Process web logs
Build user behavior models
Process images
Build maps of the web
Do research on raw data sets
Accessing Pig
Submit a script directly
Grunt, the Pig shell
PigServer Java class, a JDBC-like interface
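A minimal sketch of the PigServer route, assuming MapReduce execution mode; the query and the output path are made up for the example.

import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Embed Pig in a Java program: register Pig Latin statements, then store an alias.
public class EmbeddedPigExample {
  public static void main(String[] args) throws IOException {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("Users = load 'users' as (name, age);");
    pig.registerQuery("Fltrd = filter Users by age >= 18 and age <= 25;");
    // store() triggers planning and launches the MapReduce job(s).
    pig.store("Fltrd", "young_users");
  }
}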
Components
Pig resides on the user machine; the job executes on the Hadoop cluster.
No need to install anything extra on your Hadoop cluster.
How It Works
Pig Latin:
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
pig.jar:
parses
checks
optimizes
plans execution

submits jar to Hadoop
monitors job progress
Execution plan: Map: Filter, Count. Combine/Reduce: Sum.
Upcoming Features
In 0.8 (plan to branch end of August, release this fall):
Runtime statistics collection
UDFs in scripting languages (e.g. Python)
Ability to specify a custom partitioner
Many string and math functions added as Pig-supported UDFs
Post 0.8:
Branches, loops, functions, and modules
Usability: better error messages, fixing ILLUSTRATE, improved integration with workflow systems
Learn More
Read the online documentation: http://hadoop.apache.org/pig/
Online tutorials: from Yahoo, http://developer.yahoo.com/hadoop/tutorial/; from Cloudera, http://www.cloudera.com/hadoop-training
Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
A couple of Hadoop books are available that include chapters on Pig; search at your favorite bookstore.
Join the mailing lists: [email protected] for users, [email protected] for developers, [email protected] for Howl.

Editor's Notes

  • Slide #4: How many have used Pig? How many have looked at it and have a basic understanding of it?
  • Slide #15: Demo script.
    Show the group query first; talk about: load and schema (none, declared, or taken from the data); data types; data sources need not be HDFS or even files; the parallel clause and how parallelism is determined for maps; how grouping works in Pig Latin.
    So far this is a simple join/group query; now look at something less straightforward in SQL. People often want to group data a number of different ways. Look at the multiquery script: note how there is now a branch in the logic.
    People also often want to operate on each record produced by a previous statement. Look at the top5 query: the nested foreach lets you operate on each record coming out of the group by; since the result of group by is a bag in each record, operators (currently order, distinct, filter, and limit) can be applied to that bag; note the use of flatten at the end and the use of positional parameters.
    There will always be logic you need that Pig Latin cannot express, which is where the rich UDF support comes in. Look at the session query: note the UDF registration; the UDF is then called like any other Pig builtin function (in fact Pig builtins are implemented as UDFs).
    Look at SessionAnalysis.java: the class name is the UDF name; the input to a UDF is always a Tuple, which avoids declaring the expected input but means the UDF has to check what it gets; talk about how projection of bags works and how EvalFunc is templatized on the return type.
    It is also easy to write load and store functions to fit your data needs.