Pig, Making Hadoop Easy
Alan F. Gates, Yahoo!
Who Am I?
Pig committer
Hadoop PMC member
An architect on the Yahoo! grid team
Or, as one coworker put it, “the lipstick on the Pig”
Who are you?
Motivation By Example
Suppose you have user data in one file, website data in another, and you need to find the top 5 most visited pages by users aged 18-25.
Load Users
Load Pages
Filter by age
Join on name
Group on url
Count clicks
Order by clicks
Take top 5
In Map Reduce
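The body of this slide is an image of the equivalent Java MapReduce code, which the text export does not capture. As a rough illustration only (not the slide's actual listing), here is a minimal sketch of just the mapper for the first of several jobs such a pipeline needs: tag user and page records so a reducer can join them on name, applying the age filter in the map. The tab-delimited field layout and the file-name check are assumptions made for the example.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch only: first job of the pipeline, a reduce-side join of users and pages on name.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");
    if (fromUsersFile(context)) {
      // Users file: name, age. Apply the age filter here and tag the record "U".
      int age = Integer.parseInt(fields[1]);
      if (age >= 18 && age <= 25) {
        context.write(new Text(fields[0]), new Text("U"));
      }
    } else {
      // Pages file: user, url. Tag the record "P" and carry the url along.
      context.write(new Text(fields[0]), new Text("P\t" + fields[1]));
    }
  }

  // Assumed convention: the input file name tells us which data set a split came from.
  private boolean fromUsersFile(Context context) {
    FileSplit split = (FileSplit) context.getInputSplit();
    return split.getPath().getName().startsWith("users");
  }
}

A matching reducer, plus further jobs for the group/count, the ordering, and the top-5 limit, would each add a similar amount of code, which is the contrast the Pig Latin version on the next slide draws.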
In Pig Latin
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';
Performance
Chart (not reproduced): Pig performance across releases 0.1, 0.2, 0.3, 0.4/0.5, and 0.6/0.7.
Why not SQL?
Diagram (not reproduced): data flows from Data Collection into the Data Factory, where Pig handles pipelines, iterative processing, and research, and then into the Data Warehouse, where Hive serves BI tools and analysis.
Pig Highlights
User defined functions (UDFs) can be written for column transformation (TOUPPER) or aggregation (SUM); a sketch follows this list.
UDFs can be written to take advantage of the combiner.
Four join implementations built in: hash, fragment-replicate, merge, skewed.
Multi-query: Pig will combine certain types of operations together in a single pipeline to reduce the number of times data is scanned.
Order by provides total ordering across reducers in a balanced way.
Writing load and store functions is easy once an InputFormat and OutputFormat exist.
Piggybank, a collection of user-contributed UDFs.
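As a concrete illustration of the UDF point above, here is a minimal sketch of an eval UDF following the EvalFunc pattern described in the speaker notes: input always arrives as a Tuple, so the function must check what it actually received. The class name UPPER and the assumption that the first field is a chararray are made up for the example.

import java.io.IOException;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Column-transformation UDF: upper-cases its first argument.
public class UPPER extends EvalFunc<String> {
  @Override
  public String exec(Tuple input) throws IOException {
    // Input is always a Tuple, so the UDF must check what it was actually passed.
    if (input == null || input.size() == 0 || input.get(0) == null) {
      return null;
    }
    return ((String) input.get(0)).toUpperCase();
  }
}

In a script the containing jar would be registered with register, and the class is then called like any built-in, e.g. generate UPPER(name).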
Who uses Pig for What?
70% of production jobs at Yahoo! (tens of thousands per day)
Also used by Twitter, LinkedIn, eBay, AOL, …
Used to:
Process web logs
Build user behavior models
Process images
Build maps of the web
Do research on raw data sets
Accessing Pig
Submit a script directly
Grunt, the Pig shell
PigServer Java class, a JDBC-like interface
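A minimal sketch of the PigServer route, assuming MapReduce execution mode; the query and the output path are made up for the example.

import java.io.IOException;

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

// Embed Pig in a Java program: register Pig Latin statements, then store an alias.
public class EmbeddedPigExample {
  public static void main(String[] args) throws IOException {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("Users = load 'users' as (name, age);");
    pig.registerQuery("Fltrd = filter Users by age >= 18 and age <= 25;");
    // store() triggers planning and launches the MapReduce job(s).
    pig.store("Fltrd", "young_users");
  }
}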
Components
Pig resides on the user machine; the job executes on the Hadoop cluster.
No need to install anything extra on your Hadoop cluster.
How It Works
Pig Latin:
A = LOAD 'myfile' AS (x, y, z);
B = FILTER A BY x > 0;
C = GROUP B BY x;
D = FOREACH C GENERATE group, COUNT(B);
STORE D INTO 'output';
pig.jar:
parses
checks
optimizes
plans execution

submits jar to Hadoop
monitors job progress
Execution plan: Map: Filter, Count. Combine/Reduce: Sum.
Upcoming Features
In 0.8 (plan to branch end of August, release this fall):
Runtime statistics collection
UDFs in scripting languages (e.g. Python)
Ability to specify a custom partitioner
Many string and math functions added as Pig-supported UDFs
Post 0.8:
Branches, loops, functions, and modules
Usability: better error messages, fixing ILLUSTRATE, improved integration with workflow systems
Learn More
Read the online documentation: http://hadoop.apache.org/pig/
Online tutorials: from Yahoo, http://developer.yahoo.com/hadoop/tutorial/; from Cloudera, http://www.cloudera.com/hadoop-training
Using Pig on EC2: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2728
A couple of Hadoop books are available that include chapters on Pig; search at your favorite bookstore.
Join the mailing lists: [email protected] for users, [email protected] for developers, [email protected] for Howl.

Editor's Notes

  • Slide #4: How many have used Pig? How many have looked at it and have a basic understanding of it?
  • Slide #15: Demo script.
    Show the group query first; talk about: load and schema (none, declared, or taken from the data); data types; data sources need not be HDFS or even files; the parallel clause and how parallelism is determined for maps; how grouping works in Pig Latin.
    So far this is a simple join/group query; now look at something less straightforward in SQL. People often want to group data a number of different ways. Look at the multiquery script: note how there is now a branch in the logic.
    People also often want to operate on each record produced by a previous statement. Look at the top5 query: the nested foreach lets you operate on each record coming out of the group by; since the result of group by is a bag in each record, operators (currently order, distinct, filter, and limit) can be applied to that bag; note the use of flatten at the end and the use of positional parameters.
    There will always be logic you need that Pig Latin cannot express, which is where the rich UDF support comes in. Look at the session query: note the UDF registration; the UDF is then called like any other Pig builtin function (in fact Pig builtins are implemented as UDFs).
    Look at SessionAnalysis.java: the class name is the UDF name; the input to a UDF is always a Tuple, which avoids declaring the expected input but means the UDF has to check what it gets; talk about how projection of bags works and how EvalFunc is templatized on the return type.
    It is also easy to write load and store functions to fit your data needs.