
Intro To Apache Airflow

Apache Airflow is an open-source platform for authoring, scheduling, and monitoring workflows or data pipelines. It allows users to define workflows as directed acyclic graphs (DAGs) of tasks and automates the process of creating, scheduling, and updating data pipelines. Airflow addresses the challenges of using cron jobs to run workflows by providing a rich web UI and APIs to programmatically author, schedule, monitor, debug, and scale workflows across multiple tasks and data sources.

  • Introduction: Introduces Apache Airflow, an open-source workflow management platform, establishing the context and purpose of the presentation.
  • Agenda: Lists the main topics covered in the presentation to give the audience a roadmap of the content.
  • What is Airflow: Explains Apache Airflow as a tool for scheduling and managing workflows, highlighting its key features.
  • What is a Workflow?: Defines what constitutes a workflow in the context of Apache Airflow, describing tasks and triggers.
  • Example of an Airflow Workflow: Provides a step-by-step example of creating a workflow with Apache Airflow, illustrating the processes involved.
  • Background: Discusses the motivation for using Apache Airflow by examining traditional job scheduling methods.
  • Challenges with Cron Jobs: Outlines issues associated with using cron jobs for task scheduling, which Airflow aims to solve.
  • Airflow Advantages: Highlights the benefits of using Apache Airflow for developers, including improved workflow management and scalability.
  • Airflow Terms: Defines key terminology used in Apache Airflow, providing a foundational understanding for new users.
  • Airflow DAG: Explains Directed Acyclic Graphs (DAGs) in Airflow, detailing their role in defining workflow relationships.
  • Airflow Core Components: Describes the core components of Apache Airflow, including the web server and scheduler.
  • Airflow Usage: Discusses various use cases for Apache Airflow, such as ETL pipelines and general-purpose scheduling.
  • Airflow Architecture: Explains the architecture of Apache Airflow, detailing the interactions between its components and the roles they play in workflow management.

Apache Airflow

Introduction
Agenda
● What is Airflow
● What is a workflow
● Example of an Airflow workflow
● Background and the world before Airflow
● Purpose
● Terminologies
● Core Components
● Usages
● Demo
What is Airflow
● Apache Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows.

● It allows you to define and manage complex data pipelines as directed acyclic graphs (DAGs) of tasks and to automate the process of creating and updating data pipelines. It provides a rich web-based interface for setting up, monitoring, and managing workflow execution, and an API for triggering and monitoring workflows.
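For instance, assuming an Airflow 2.x deployment with the stable REST API enabled and basic authentication configured, an external system could trigger a workflow programmatically; the host, credentials, and DAG name below are illustrative placeholders, not taken from the slides.

# Hedged sketch: triggering a DAG run through Airflow's stable REST API (2.x).
# Host, credentials, and dag_id are placeholders.
import requests

AIRFLOW_HOST = "http://localhost:8080"
DAG_ID = "my_pipeline"

response = requests.post(
    f"{AIRFLOW_HOST}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=("admin", "admin"),   # placeholder basic-auth credentials
    json={"conf": {}},         # optional run-time configuration for the run
)
response.raise_for_status()
print(response.json()["dag_run_id"])  # identifier of the newly created DAG run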
What is a Workflow?

● A sequence of tasks
● Started on a schedule or triggered by an event
● Frequently used to handle big data processing pipelines

Example of an Airflow workflow

1. Download data
2. Send data to processing
3. Monitor processing
4. Generate report
5. Send email
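A minimal sketch of how these five steps might be expressed as an Airflow DAG; the DAG id, schedule, and task callables are illustrative placeholders rather than code from the slides.

# Hedged sketch of the five-step example workflow as an Airflow 2.x DAG.
# All callables below are placeholders for the real work.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def download_data():
    print("downloading data")           # placeholder: fetch the raw input


def send_to_processing():
    print("submitting for processing")  # placeholder: hand the data off for processing


def monitor_processing():
    print("monitoring processing")      # placeholder: poll until processing finishes


def generate_report():
    print("generating report")          # placeholder: build the report


def send_email():
    print("sending email")              # placeholder: notify the recipients


with DAG(
    dag_id="example_report_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",          # run once a day
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="download_data", python_callable=download_data)
    t2 = PythonOperator(task_id="send_to_processing", python_callable=send_to_processing)
    t3 = PythonOperator(task_id="monitor_processing", python_callable=monitor_processing)
    t4 = PythonOperator(task_id="generate_report", python_callable=generate_report)
    t5 = PythonOperator(task_id="send_email", python_callable=send_email)

    # each step runs only after the previous one has succeeded
    t1 >> t2 >> t3 >> t4 >> t5

Once a file like this is placed in the configured dags/ folder, the scheduler picks it up and runs it on the declared schedule.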
Background

A developer wants to run a job on schedule


● Cron job (Job scheduling)
● Python or bash script

[Diagram: Start → Extract data from data source A to storage B → End]
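A sketch of that pre-Airflow setup, assuming a hypothetical extract_a_to_b.py script driven by a crontab entry; the URL, paths, and schedule are illustrative.

#!/usr/bin/env python3
# extract_a_to_b.py -- hypothetical pre-Airflow extraction script.
# Scheduled with a crontab entry such as:
#   0 2 * * * /usr/bin/python3 /opt/jobs/extract_a_to_b.py >> /var/log/extract_a_to_b.log 2>&1
import os
import urllib.request


def extract_from_source_a() -> bytes:
    # placeholder source A: an HTTP endpoint returning a CSV file
    with urllib.request.urlopen("https://example.com/source_a.csv") as resp:
        return resp.read()


def load_into_storage_b(payload: bytes) -> None:
    # placeholder storage B: a directory on local disk
    os.makedirs("/tmp/storage_b", exist_ok=True)
    with open("/tmp/storage_b/source_a.csv", "wb") as fh:
        fh.write(payload)


if __name__ == "__main__":
    # no retries, no monitoring, no dependency handling --
    # exactly the gaps the following slides call out
    load_into_storage_b(extract_from_source_a())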
Background
Business demands more data extractions from various sources

Solution: develop more cron jobs

[Diagram: n independent cron jobs, each Start → Extract data from a data source (A, C, …, E) to a storage target (B, D, …, F) → End]
Challenges with cron jobs
● Hard to scale
● Hard to monitor
● Hard to maintain
● Hard to maintain dependencies
● Hard to manage job failures and timeouts
● Hard to manage deployments
Airflow advantages

Developers can programmatically:


● Author workflows
● Schedule workflows
● Monitor workflows
● Debug
● Scale easily
Airflow Terms
● Task

[Diagram: two pipelines, each Start → Extract data from data source A to storage B → End; each box between Start and End is a task]
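In code, a task is an operator instance with a unique task_id inside a DAG. A small sketch with illustrative names, using one PythonOperator and one BashOperator (the DAG id, task ids, and command are assumptions, not from the slides):

# Hedged sketch: declaring tasks as operator instances inside a DAG.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_a_to_b():
    print("extracting data from source A into storage B")  # placeholder work


with DAG(dag_id="task_examples", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    extract = PythonOperator(task_id="extract_a_to_b", python_callable=extract_a_to_b)
    notify = BashOperator(task_id="notify", bash_command="echo 'extraction finished'")

    extract >> notify  # notify runs only after extract has succeeded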
Airflow DAG
● A DAG (Directed Acyclic Graph) is used to define a workflow as a series of
tasks and how they interact with each other.
● Each task in a DAG represents a single operation in your workflow, such as
running a query, sending an email, or uploading a file.
● The relationships between tasks are defined by dependencies, where one task can only run after another task has completed.
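A sketch of how those dependencies can be declared, reusing the query/email/upload examples above as illustrative task ids; EmptyOperator (a no-op task type available in Airflow 2.3+) stands in for real work.

# Hedged sketch of DAG relationships: a task can fan out to several downstream
# tasks, and a task only runs once all of its upstream tasks have completed.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # no-op placeholder tasks

with DAG(
    dag_id="dag_relationships",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_query = EmptyOperator(task_id="run_query")
    send_email = EmptyOperator(task_id="send_email")
    upload_file = EmptyOperator(task_id="upload_file")
    cleanup = EmptyOperator(task_id="cleanup")

    # run_query must finish before send_email and upload_file start;
    # cleanup waits for both of them to complete
    run_query >> [send_email, upload_file] >> cleanup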
Airflow Core components

[Diagram: Webserver and Web UI, Scheduler, Workers, Metadata database, and task execution logs]
Airflow usage
● Run and automate ETL pipelines
● Data ingestion pipelines
● Machine learning pipelines
● Predictive data pipelines
● General purpose scheduling
Airflow architecture
● Scheduler: Triggers scheduled workflows and submits tasks to the executor to run
● Executor: Manages how and where tasks are run
● Worker: Runs the tasks
● Webserver: Serves the user interface
● Metadata database: Stores information about DAGs and tasks
