Version 1.0
Airflow and Google Dataproc
In Data Engineer's Lunch #76, Arpan Patel will cover how to
connect Airflow and Google Dataproc with a demo using an Airflow
DAG to create a Dataproc cluster, submit an Apache Spark job to
Dataproc, and destroy the Dataproc cluster upon completion.
Arpan Patel
Engineer @ Anant
Google Dataproc
● Fully managed and highly scalable service for running
Apache Spark, Apache Flink, Presto, and 30+ open source
tools and frameworks
○ Lets you take advantage of open source data tools
for batch processing, querying, streaming, and
machine learning
● Dataproc clusters are quick to start, scale, and shut down,
with each of these operations taking 90 seconds or less,
on average
● Built-in integration with other Google Cloud Platform
services, such as BigQuery, Cloud Storage, Cloud
Bigtable, Cloud Logging, and Cloud Monitoring
● Can easily interact with clusters and Spark or Hadoop
jobs through the Google Cloud console, the Cloud SDK, or
the Dataproc REST API
Google Dataproc
● https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-version-clusters
○ https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-2.0
○ https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.5
● Can run on GCE / GKE
● Dataproc Serverless for Spark
Google Dataproc + DataStax Astra
● Cluster Properties
○ dataproc:dataproc.conscrypt.provider.enable=false
● Job Properties
○ spark.jars.packages → com.datastax.spark:spark-cassandra-connector_2.12:3.1.0
● Mapping GCP REST API fields to DAG parameters
○ Convert camelCase field names to snake_case. For example, masterConfig -> master_config
○ To create the Dataproc cluster on GKE instead of GCE, pass virtual_cluster_config in place of
cluster_config
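The camelCase-to-snake_case mapping above can be sketched as a small helper. This is a minimal illustration, not code from the talk: the function names, machine types, and instance counts are assumptions, and the helper deliberately leaves free-form maps such as `properties` untouched so that keys like `dataproc:dataproc.conscrypt.provider.enable` pass through unchanged.

```python
import re

# Keys whose values are free-form string maps in the Dataproc API;
# their inner keys (property names, labels) must not be rewritten.
FREEFORM_KEYS = {"properties", "labels"}


def camel_to_snake(name: str) -> str:
    """Convert a camelCase REST field name to snake_case."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def snake_keys(value):
    """Recursively rename dict keys, skipping free-form maps."""
    if isinstance(value, dict):
        converted = {}
        for key, inner in value.items():
            new_key = camel_to_snake(key)
            # Keep property/label maps exactly as written.
            converted[new_key] = inner if new_key in FREEFORM_KEYS else snake_keys(inner)
        return converted
    if isinstance(value, list):
        return [snake_keys(item) for item in value]
    return value


# A cluster config written the way the REST API documents it.
# Machine types and instance counts are illustrative assumptions.
rest_style_config = {
    "masterConfig": {"numInstances": 1, "machineTypeUri": "n1-standard-4"},
    "workerConfig": {"numInstances": 2, "machineTypeUri": "n1-standard-4"},
    "softwareConfig": {
        "properties": {"dataproc:dataproc.conscrypt.provider.enable": "false"},
    },
}

# The same config in the form the Airflow operator's cluster_config expects.
cluster_config = snake_keys(rest_style_config)
```

Writing the dictionary directly in snake_case works just as well; the helper is only useful when copying a config from REST API examples.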
Demo
● Open repo on Gitpod
● Set GCP Connection and Variables
● Run the DAG, which will:
○ Spin up Dataproc Cluster on GCE
○ Submit Dataproc Spark Job to read from DataStax Astra
○ Destroy Cluster
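The three demo steps could look roughly like the DAG below. It is a sketch under stated assumptions, not the demo repository's code: the project ID, region, cluster name, bucket path, machine types, and the choice of a PySpark job are all hypothetical placeholders, and the operators come from the `apache-airflow-providers-google` package. The Airflow imports sit inside `build_dag()` so the configuration dictionaries can be inspected without Airflow installed.

```python
from datetime import datetime

# Hypothetical placeholders; replace with your own values.
PROJECT_ID = "my-gcp-project"
REGION = "us-central1"
CLUSTER_NAME = "airflow-dataproc-demo"

# Cluster config in snake_case, with the cluster property from the slides.
CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    "software_config": {
        "properties": {"dataproc:dataproc.conscrypt.provider.enable": "false"},
    },
}

# Job definition; a PySpark job and its GCS path are assumptions.
SPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": "gs://my-bucket/jobs/read_astra.py",
        "properties": {
            "spark.jars.packages": (
                "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0"
            ),
        },
    },
}


def build_dag():
    """Create the DAG: create cluster -> submit job -> delete cluster."""
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import (
        DataprocCreateClusterOperator,
        DataprocDeleteClusterOperator,
        DataprocSubmitJobOperator,
    )

    with DAG(
        dag_id="dataproc_astra_demo",
        start_date=datetime(2022, 1, 1),
        schedule=None,  # Airflow 2.4+; older versions use schedule_interval=None
        catchup=False,
    ) as dag:
        create_cluster = DataprocCreateClusterOperator(
            task_id="create_cluster",
            project_id=PROJECT_ID,
            region=REGION,
            cluster_name=CLUSTER_NAME,
            cluster_config=CLUSTER_CONFIG,
        )
        submit_job = DataprocSubmitJobOperator(
            task_id="submit_spark_job",
            project_id=PROJECT_ID,
            region=REGION,
            job=SPARK_JOB,
        )
        delete_cluster = DataprocDeleteClusterOperator(
            task_id="delete_cluster",
            project_id=PROJECT_ID,
            region=REGION,
            cluster_name=CLUSTER_NAME,
            trigger_rule="all_done",  # tear down even if the job fails
        )
        create_cluster >> submit_job >> delete_cluster
    return dag
```

In an actual DAG file, assign the result at module level (`dag = build_dag()`) so the scheduler can discover it. The operators use the default `google_cloud_default` connection unless `gcp_conn_id` is set, which is where the GCP connection from the previous step comes in.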
Strategy: Scalable Fast Data
Architecture: Cassandra, Spark, Kafka
Engineering: Node, Python, JVM, CLR
Operations: Cloud, Container
Rescue: Downtime!! I need help.
www.anant.us | solutions@anant.us | (855) 262-6826
3 Washington Circle, NW | Suite 301 | Washington, DC 20037

Data Engineer's Lunch #76: Airflow and Google Dataproc
