Cascading 
or, “was it worth three days 
out of the office?”
Agenda 
What is Cascading? 
Building cascades and flows 
How does this fit our needs? 
Advantages/disadvantages 
Q&A
What is Cascading anyway?
Cascading 101 
JVM framework and SDK for creating abstracted data 
flows 
Translates data flows into actual Hadoop/RDBMS/local 
jobs
Huh? 
Okay, let’s back up a bit.
Data flows 
Think of an ETL: Extract-Transform-Load 
In simple terms, take data from a source, change it 
somehow, and stick the result into something (a “sink”) 
[Diagram: Data source → (Extract) → Transformation(s) → (Load) → Data sink]
Data flow implementation 
Pretty much everything we do is some flavor of this 
Sources: Games, Hadoop, Hive/MySQL, Couchbase, 
web service 
Transformations: Aggregations, group-bys, combined 
fields, filtering, etc. 
Sinks: Hadoop, Hive/MySQL, Couchbase
Cascading 101 (Part Deux) 
JVM data flow framework 
Models data flows as abstractions: 
Separates details of where and how we get data 
from what we do with it 
Implements transform operations as SQL or 
MapReduce or whatever
In other words… 
An ETL framework. 
A Pentaho we can program.
Building cascades 
and flows
Cascading terminology 
Flow: A path for data with some number of inputs, 
some operations, and some outputs 
Cascade: A series of connected flows
More terminology 
Operation: A function applied to data, yielding new 
data 
Pipe: Moves data from someplace to some other place 
Tap: Feeds data from outside the flow into it 
and writes data from inside the flow out of it
Simplest possible flow 
// create the source tap
Tap inTap = new Hfs(new TextDelimited(true, "\t"), inPath);

// create the sink tap
Tap outTap = new Hfs(new TextDelimited(true, "\t"), outPath);

// specify a pipe to connect the taps
Pipe copyPipe = new Pipe("copy");

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
    .addSource(copyPipe, inTap)
    .addTailSink(copyPipe, outTap);

// run the flow
flowConnector.connect(flowDef).complete();
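For reference, the flowConnector above isn’t shown on the slide; setting one up for the Hadoop planner looks roughly like this (a sketch following the usual Cascading 2.x pattern, where Main.class stands in for whatever class holds this job):

// package this application's jar and hand it to the Hadoop planner
Properties properties = new Properties();
AppProps.setApplicationJarClass(properties, Main.class);
FlowConnector flowConnector = new HadoopFlowConnector(properties);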
We already 
have that. 
It’s called ‘cp’.
Actually… 
Runs entirely in the cluster 
Works fine on megabytes, gigabytes, terabytes or 
petabytes; i.e., IT SCALES 
Completely testable outside of the cluster 
Who gets shell access to a namenode to run the Bash 
or Python equivalent?
Reliability is 
ESSENTIAL 
if we, and our system, are to 
be taken srsly. 
Reliability is a feature, 
not a goal.
Let’s do something more 
interesting.
Real world use case: 
Word counting 
Read a simple file format 
Count the occurrence of every word in the file 
Output a list of all words and their counts
doc_id text 
doc01 A rain shadow is a dry area on the lee back side 
doc02 This sinking, dry air produces a rain shadow, or 
doc03 A rain shadow is an area of dry land that lies on 
doc04 This is known as the rain shadow effect and is the 
doc05 Two Women. Secrets. A Broken Land. [DVD Australia] 
Newline-delimited entries 
ID and text fields, separated by tabs 
Plan: Split each line into words, then count each word across the whole file
Flow I/O 
Tap docTap = new Hfs(new TextDelimited(true, "\t"), docPath); 
Tap wcTap = new Hfs(new TextDelimited(true, "\t"), wcPath); 
No surprises here: 
docTap reads a file from HDFS 
wcTap will write the results to a different HDFS file
File parsing 
Fields token = new Fields("token"); 
Fields text = new Fields("text"); 
RegexSplitGenerator splitter = 
    new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]"); 
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS); 
Fields are names for the tuple elements 
RegexSplitGenerator applies the regex to input and 
yields matches under the “token” field 
docPipe applies the splitter to each line’s text and emits 
one tuple per “token” (see the example below)
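To make that concrete, here is a hypothetical input tuple and what the splitter emits for it (illustrative only):

// input tuple:   doc_id = "doc01", text = "A rain shadow is a dry area"
// docPipe emits: token = "A", token = "rain", token = "shadow", token = "is", ...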
Count the tokens (words) 
Pipe wcPipe = new Pipe("wc", docPipe); 
wcPipe = new GroupBy(wcPipe, token); 
wcPipe = new Every(wcPipe, Fields.ALL, new Count(), Fields.ALL); 
wcPipe connects to docPipe, using it for input 
Attach a GroupBy to wcPipe, grouping by the 
token field (the actual words) 
For every group (every distinct word), count its 
occurrences and output the result
Create and run the flow 
FlowDef flowDef = FlowDef.flowDef() 
.setName("wc") 
.addSource(docPipe, docTap) 
.addTailSink(wcPipe, wcTap); 
Flow wcFlow = flowConnector.connect(flowDef); 
wcFlow.complete(); 
Define a new flow with name “wc” 
Feed the docTap (the original text file) into the 
docPipe 
Send the output of wcPipe (the word counts) into 
the wcTap 
Connect to the flowConnector (Hadoop) and go!
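The result written to wcTap is just a tab-delimited file of (token, count) pairs; for the sample docs it would look something like this (counts illustrative only, not computed from the full file):

token   count
rain    4
shadow  4
...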
Cascading flow 
100% Java 
Databases and processing 
are behind class 
abstractions 
Automatically scalable 
Easily testable
How could this help us?
Testing 
Create flows entirely in code on a local machine 
Write tests for controlled sample data sets 
Run tests as regular old Java without needing access 
to actual Hadoopery or databases 
Local machine and CI testing are easy!
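A sketch of what such a test looks like, assuming the cascading-local module is on the test classpath (the paths here are made up):

// build the same pipes as production, but bind the FlowDef to local file taps
Tap docTap = new FileTap(new TextDelimited(true, "\t"), "src/test/resources/docs.tsv");
Tap wcTap = new FileTap(new TextDelimited(true, "\t"), "target/test-output/wc.tsv");
new LocalFlowConnector().connect(flowDef).complete();
// then assert on wc.tsv with plain JUnit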
Reusability 
Pipe assemblies are designed for reuse 
Once created and tested, use them in other flows 
Write logic to do something only once 
This is *essential* for data integrity as well as 
good programming
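For example, the word-count pipes from earlier could live in one shared helper and be dropped into any flow (a hypothetical sketch; Cascading also provides a SubAssembly base class for packaging this more formally):

// reusable pipe assembly: give it any pipe that carries a "token" field
public static Pipe wordCount(Pipe tokens, Fields token) {
    Pipe wc = new GroupBy(tokens, token);
    return new Every(wc, Fields.ALL, new Count(), Fields.ALL);
}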
Common code base 
Infrastructure writes its MR-style jobs in Cascading; the 
warehouse team writes its data manipulations in Cascading 
Everybody uses the same terms and same tech 
Teams understand each other’s code 
Can be modified by anyone, not just tool experts
Simpler stack 
Cascading builds the DAG of dependent jobs for us 
Removes most of the need for Oozie (ew) 
Keeps track of where a flow fails and can rerun from 
that point on failure
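Chaining flows into that DAG is one line (a sketch, assuming two existing flows; cleanseFlow is hypothetical, wcFlow is from the word-count example):

Cascade cascade = new CascadeConnector().connect(cleanseFlow, wcFlow);
cascade.complete(); // runs the flows in dependency order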
Disadvantages 
“silver bullets are not a thing”
Some bad news 
JVM, which means Java (or Scala (or CLOJURE :) :) 
Argument: Java is the platform for big data, so we 
can’t avoid embracing it. 
PyCascading uses Jython, which kinda sucks
Some other bad news 
Doesn’t have a job scheduler 
Can figure out dependency graph for jobs, but 
nothing to run them on a regular interval 
We still need Jenkins or Quartz 
Concurrent is doing proprietary products (read: $) 
for this kind of thing, but they’re months away
Other bad news 
No real built-in monitoring 
Easy to have a flow report what it has done; 
hard to watch it in progress 
We’d have to roll our own (but we’d have to do that 
anyway, so whatevs)
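The hooks for rolling our own are there, though; a flow takes a listener (a sketch with placeholder reporting bodies):

// push flow lifecycle events into whatever monitoring we end up building
wcFlow.addListener(new FlowListener() {
    public void onStarting(Flow flow) { /* report start */ }
    public void onStopping(Flow flow) { /* report stop */ }
    public void onCompleted(Flow flow) { /* report flow.getFlowStats() */ }
    public boolean onThrowable(Flow flow, Throwable t) { return false; /* not handled here */ }
});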
Recommendations 
“Enough already!”
Yes, we should try it. 
It’s not everything we need, but it’s a lot 
Possibly replace MapReduce and Sqoop 
Proven tech; this isn’t bleeding edge work 
We need an ETL framework and we don’t have time 
to write one from scratch.
Let’s prototype a couple of jobs and 
see what people other than me think.
Questions? 
Satisfactory answers 
not guaranteed.
