Introducing Apache
Airflow (Incubating)
Sid Anand (@r39132)
Data Day Seattle 2016
1
About Me
2
Work [ed | s] @
Maintainer on
Reports to
Co-Chair for
Apache Airflow
3
What is it?
4
Apache Airflow : What is it?
Airflow is a platform to programmatically author,
schedule and monitor workflows (a.k.a. DAGs)
5
Apache Airflow : What is it?
Airflow is a platform to programmatically author,
schedule and monitor workflows (a.k.a. DAGs)
It ships with
• DAG Scheduler
• Web application (UI)
• Powerful CLI
6
Apache Airflow : What is it?
Airflow - Authoring DAGs
7
Airflow: Visualizing a DAG
8
Airflow: Author DAGs in Python! No need to bundle many XML files!
Airflow - Authoring DAGs
9
Airflow: The Tree View offers a view of DAG Runs over time!
Airflow - Authoring DAGs
Airflow - Performance Insights
10
Airflow: Gantt charts reveal the slowest tasks for a run!
11
Airflow: …And we can easily see performance trends over time
Airflow - Performance Insights
12
Apache Airflow : What is it?
When would you use a Workflow Scheduler like
Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
• Fraud Detection, Scoring/Ranking, Classification,
Recommender System, etc…
• General Job Scheduling (e.g. Cron)
• DB Back-ups, Scheduled code/config deployment
13
Apache Airflow : What is it?
What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
• where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
• E.g. Alerting if time or correctness SLAs are not met
• Scale
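To make "Workflow = A DAG of Tasks" concrete, here is a minimal sketch in plain Python (standard library only; this is not Airflow's scheduler). It orders tasks so that each one runs only after the tasks it depends on. The task names are hypothetical, loosely borrowed from the message scoring pipeline described later in this talk.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Map each task to the set of tasks it depends on (hypothetical names).
workflow = {
    "wait_for_s3": set(),
    "run_spark": {"wait_for_s3"},
    "verify_db_record": {"run_spark"},
    "wait_for_empty_queue": {"run_spark"},
    "send_report": {"verify_db_record", "wait_for_empty_queue"},
}


def execution_order(dag):
    """Return one valid run order; raises graphlib.CycleError if the graph has a cycle."""
    return list(TopologicalSorter(dag).static_order())


order = execution_order(workflow)
print(order)
```

The two middle tasks have no dependency on each other, so a scheduler is free to run them in either order or in parallel.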
14
Apache Airflow : What is it?
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
Apache Airflow
15
Incubating
16
Apache Airflow : Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime
Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow entered the Apache Incubator
Today
• 166+ Contributors
• 300+ Users
• 40+ companies officially using it!
• 9 Committers/Maintainers <— We’re growing here
Agari
17
What We Do!
Agari : What We Do
18
19
Agari : What We Do
20
Agari : What We Do
21
Agari : What We Do
22
Agari : What We Do
23
Enterprise
Customers
email
metadata
apply
trust
models
email md
+ trust
score
Agari’s Current Product
Agari : What We Do
24
email
metadata
apply
trust
models
email md
+ trust
score
Agari’s Future Product
Enterprise
Customers
Agari : What We Do
Apache Airflow @ Agari
How Do We Use It?
25
Classes of Orchestration
26
apply trust
models
(message
scoring)
build trust
models
cron++
(general
job
scheduler)
New Product
(Enterprise Protect)
Operational
Automation
BI / ETL
N / A
Classes of Orchestration
27
apply trust
models
(message
scoring)
build trust
models
cron++
(general
job
scheduler)
New Product
(Enterprise Protect)
Operational
Automation
BI / ETL
N / A
This Talk
Use-Case : Message
Scoring
Batch Pipeline Architecture
28
Use-Case : Message Scoring
29
enterprise A
enterprise B
enterprise C
S3
S3 uploads every 15
minutes
Use-Case : Message Scoring
30
enterprise A
enterprise B
enterprise C
S3
Airflow kicks off a Spark
message scoring job
every hour
Use-Case : Message Scoring
31
enterprise A
enterprise B
enterprise C
S3
Spark job writes scored
messages and stats to
another S3 bucket
S3
Use-Case : Message Scoring
32
enterprise A
enterprise B
enterprise C
S3
This triggers SNS/SQS
message events
S3
SNS
SQS
Use-Case : Message Scoring
33
enterprise A
enterprise B
enterprise C
S3
An Autoscale Group
(ASG) of Importers spins
up when it detects SQS
messages
S3
SNS
SQS
Importers
ASG
34
enterprise A
enterprise B
enterprise C
S3
The importers rapidly ingest scored
messages and aggregate statistics into
the DB
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
35
enterprise A
enterprise B
enterprise C
S3
Users receive alerts of
untrusted emails &
can review them in
the web app
S3
SNS
SQS
Importers
ASG
DB
Use-Case : Message Scoring
36
enterprise A
enterprise B
enterprise C
S3 S3
SNS
SQS
Importers
ASG
DB
Airflow manages the entire process
Use-Case : Message Scoring
Use-Case : Message
Scoring
Airflow DAG
37
38
Airflow DAG
39
5 minute wait for S3
eventual consistency
Airflow DAG
40
In one hourly run per day,
we also build new
models
Airflow DAG
41
build
models
Airflow DAG
42
dummy needed for
branch operator
Airflow DAG
43
• trigger_rule:
one_success
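Why the join task needs `one_success` can be sketched in plain Python. This is a simplified model, not Airflow's implementation: it only looks at final upstream states, whereas Airflow's `one_success` fires as soon as one parent succeeds. The task names `build_models` and `dummy` follow the slide labels; the function name `should_run` is an assumption made for illustration.

```python
def should_run(trigger_rule, upstream_states):
    """Decide whether a join task may run, given the final states of its upstream tasks."""
    states = list(upstream_states)
    if trigger_rule == "all_success":
        return all(state == "success" for state in states)
    if trigger_rule == "one_success":
        return any(state == "success" for state in states)
    raise ValueError(f"unsupported trigger rule: {trigger_rule}")


# On a typical hour the branch skips model building and follows the dummy task.
upstream = {"build_models": "skipped", "dummy": "success"}

print(should_run("all_success", upstream.values()))  # False
print(should_run("one_success", upstream.values()))  # True
```

With the default all-parents-must-succeed behavior, the skipped branch would keep the downstream task from running, which is why the join relaxes the rule.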
Airflow DAG
44
Prep Spark
Run
Airflow DAG
45
• Run Spark
• Verify a record is
written to the DB
• Wait for the SQS
queue to empty
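The talk does not show how the "wait for the SQS queue to empty" step is implemented. A minimal sketch of the idea, assuming a caller-supplied `get_depth` function that returns the current queue depth, might look like the following. The names `wait_until_empty`, `get_depth`, `poke_interval`, and `timeout` are illustrative assumptions, and no AWS call is made here.

```python
import time


def wait_until_empty(get_depth, poke_interval=30.0, timeout=3600.0, sleep=time.sleep):
    """Poll get_depth() until it returns 0; return True on success, False on timeout."""
    waited = 0.0
    while True:
        if get_depth() == 0:
            return True
        if waited >= timeout:
            return False
        sleep(poke_interval)
        waited += poke_interval


# Stand-in for a real queue-depth lookup: the queue drains over three polls.
depths = iter([12, 4, 0])
drained = wait_until_empty(lambda: next(depths), poke_interval=0.01, timeout=1.0)
print(drained)
```

Returning False on timeout, rather than blocking forever, lets the caller treat a queue that never drains as a failure to alert on.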
Airflow DAG
46
• Compute
discrepancies
• Send email report
• Update monitoring
graphs
• Raise SLA
(correctness) alerts
Airflow DAG
SLAs & Insights
Airflow
47
48
Desirable Qualities of a Resilient
Data Pipeline
Correctness
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness
• All output within time-bound SLAs (e.g. 1 hour)
Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Scalable/Available
• ASGs, SQS, SNS, S3
49
Desirable Qualities of a Resilient
Data Pipeline
Correctness (SLA)
• Data Integrity (no loss, etc…)
• Expected data distributions
Timeliness (SLA)
• All output within time-bound SLAs (e.g. 1 hour)
Operability
• Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs
• Quick Recoverability
Scalable/Available
• ASGs, SQS, SNS, S3
50
Correctness : Email Reporting
orgs
51
Correctness : Email Reporting
For each org, we check for duplicate or missing data
as a count & percentage
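The per-org check can be sketched as follows. This is an illustration under stated assumptions, not Agari's actual report code: it assumes each message carries a unique ID and compares the IDs that entered the pipeline with the IDs that reached the DB. The function name, field names, and sample data are all hypothetical.

```python
from collections import Counter


def discrepancy_report(sent_ids, imported_ids):
    """Count missing and duplicated messages, as a count and a percentage of sent."""
    sent = set(sent_ids)
    imported = Counter(imported_ids)
    total = len(sent)

    def pct(n):
        # Percentage of sent messages; 0.0 when nothing was sent.
        return round(100.0 * n / total, 2) if total else 0.0

    missing = len(sent - set(imported))
    duplicates = sum(count - 1 for count in imported.values() if count > 1)
    return {
        "missing": missing,
        "missing_pct": pct(missing),
        "duplicates": duplicates,
        "duplicate_pct": pct(duplicates),
    }


# Hypothetical sample data: (IDs sent, IDs found in the DB) per org.
per_org = {
    "enterprise A": (["m1", "m2", "m3", "m4"], ["m1", "m2", "m2", "m4"]),
    "enterprise B": (["m5", "m6"], ["m5", "m6"]),
}
for org, (sent, imported) in per_org.items():
    print(org, discrepancy_report(sent, imported))
```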
orgs
52
Correctness : Email Reporting
These are the 3 stages of the pipeline. We can detect where a
discrepancy is coming from - often related to a code push!
orgs
53
Correctness : Monitoring
54
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
55
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA
miss
56
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA
miss
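from airflow import DAG

# DAG_NAME, default_args and sla_alert_func are defined elsewhere in the DAG file
dag = DAG(DAG_NAME,
          schedule_interval='@hourly',
          default_args=default_args,
          sla_miss_callback=sla_alert_func)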
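The slide does not show the body of `sla_alert_func`. A minimal sketch of what such a callback could look like is below. It assumes the five-argument callback signature documented for Airflow 1.x/2.x (check the documentation for your version), and `send_page` is a hypothetical stand-in for a PagerDuty/VictorOps client call. The DAG id `message_scoring` is also made up.

```python
def send_page(summary, details):
    """Hypothetical stand-in for a PagerDuty/VictorOps client call."""
    print(f"PAGE: {summary}\n{details}")
    return {"summary": summary, "details": details}


def sla_alert_func(dag, task_list, blocking_task_list, slas, blocking_tis):
    """Build a timeliness alert from the arguments passed on an SLA miss."""
    dag_id = getattr(dag, "dag_id", str(dag))
    summary = f"Timeliness SLA miss in DAG {dag_id}"
    details = f"Tasks past SLA:\n{task_list}\nBlocked by:\n{blocking_task_list}"
    return send_page(summary, details)


# Exercise the callback with fake arguments; no Airflow install is needed for this.
class FakeDag:
    dag_id = "message_scoring"


alert = sla_alert_func(FakeDag(), "run_spark", "wait_for_s3", [], [])
```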
57
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness SLA
miss
Correctness
SLA miss
58
Airflow: …And easy to integrate with Ops tools!
Correctness & Timeliness : Alerting
Timeliness &
Correctness SLA
misses sent to
PagerDuty/VictorOps
Use-Case : Model Building v2
For Both Batch & Near Realtime Scoring
Pipelines
59
60
Airflow DAG
61
Model Building DAG
Launch an
EMR cluster
62
Run model
building as
EMR steps
Model Building DAG
63
Validate
models
Send email
notification if
tests fail
Model Building DAG
64
Terminate
EMR cluster
Model Building DAG
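The four stages above (launch, run steps, validate and notify, terminate) can be sketched in plain Python. The slides show them as separate DAG tasks and do not say what happens to the cluster when an earlier stage fails; this sketch assumes you would want the cluster terminated either way. All callables are hypothetical stand-ins, and no EMR API is called.

```python
def build_models(launch_cluster, run_steps, validate, notify, terminate_cluster):
    """Run the model-building stages; always terminate the cluster once launched."""
    cluster_id = launch_cluster()
    try:
        run_steps(cluster_id)
        ok = validate()
        if not ok:
            notify("Model validation failed")
        return ok
    finally:
        # Runs on success, on validation failure, and on exceptions alike.
        terminate_cluster(cluster_id)


# Dry run with stand-in callables that only record what happened.
events = []
result = build_models(
    launch_cluster=lambda: events.append("launch") or "cluster-1",
    run_steps=lambda cid: events.append(f"steps on {cid}"),
    validate=lambda: False,
    notify=lambda msg: events.append(f"email: {msg}"),
    terminate_cluster=lambda cid: events.append(f"terminate {cid}"),
)
print(events)
```

In an Airflow DAG the same effect is usually obtained by giving the terminate task a trigger rule such as `all_done`, so it runs regardless of upstream failures.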
Apache Airflow Next Steps
65
Areas for Improvement
66
Apache Airflow Next Steps
Improvement Areas
• Security
• API (though we do have a CLI)
• Deployment / Versioning
• Execution Scale Out
• On-demand Execution
Acknowledgments
67
None of this work would be possible without the
contributions of the strong team below
• Vidur Apparao
• Stephen Cattaneo
• Jon Chase
• Andrew Flury
• William Forrester
• Chris Haag
• Mike Jones
• Scot Kennedy
• Thede Loder
• Paul Lorence
• Kevin Mandich
• Gabriel Ortiz
• Jacob Rideout
• Josh Yang
• Julian Mehnle
Questions? (@r39132)
68

Introduction to Apache Airflow - Data Day Seattle 2016