Running Spark on Kubernetes:
Best Practices and Pitfalls
Jean-Yves Stephan, Co-Founder & CEO @ Data Mechanics
Julien Dumazert, Co-Founder & CTO @ Data Mechanics
Who We Are
Jean-Yves “JY” Stephan
Co-Founder & CEO @ Data Mechanics
jy@datamechanics.co
Previously:
Software Engineer and
Spark Infrastructure Lead @ Databricks
Julien Dumazert
Co-Founder & CTO @ Data Mechanics
julien@datamechanics.co
Previously:
Lead Data Scientist @ ContentSquare
Data Scientist @ BlaBlaCar
Who Are You?
Poll: What is your experience with running Spark on Kubernetes?
● 61% - I’ve never used it, but I’m curious about it.
● 24% - I’ve prototyped using it, but I’m not using it in production.
● 15% - I’m using it in production.
This slide was edited after the conference to show the results for the poll.
You can see and take the poll at https://www.datamechanics.co/spark-summit-poll
Agenda
A quick primer on Data Mechanics
Spark on Kubernetes
Core Concepts & Setup
Configuration & Performance Tips
Monitoring & Security
Future Work
Conclusion: Should you get started?
Data Mechanics - A serverless Spark platform
● Applications start and autoscale in seconds.
● Seamless transition from local development to running at scale.
● Tunes the infra parameters and Spark configurations automatically for each pipeline to make them fast and stable.
https://www.datamechanics.co
Customer story: Impact of automated tuning on Tradelab
For details, watch our SSAI 2019 Europe talk
How to automate performance tuning for Apache Spark
● Stability: Automatic remediation of
OutOfMemory errors and timeouts
● 2x performance boost on average
(speed and cost savings)
We’re deployed on k8s in our customers’ cloud accounts
[Diagram labels: Gateway, Data engineers, Data scientists]
Spark on Kubernetes:
Core Concepts & Setup
Where does Kubernetes fit within Spark?
Kubernetes is a new cluster-manager/scheduler for Spark.
● Standalone
● Apache Mesos
● YARN
● Kubernetes (since version 2.3)
Spark on Kubernetes - Architecture
Source
Two ways to submit Spark applications on k8s
spark-submit
● “Vanilla” way from Spark main open source repo
● Configs spread between Spark config (mostly) and k8s manifests
● Little pod customization support before Spark 3.0
● App management is more manual
spark-on-k8s operator
● Open-sourced by Google (but works on any platform)
● Configs in k8s-style YAML with sugar on top (configmaps, volumes, affinities)
● Tooling to read logs, kill, restart, schedule apps
● Requires a long-running system pod
App management in practice
spark-submit
# Run an app
$ spark-submit --master k8s://https://<api-server> …
# List apps
$ kubectl get pods -l spark-role=driver
NAME            READY   STATUS      RESTARTS   AGE
my-app-driver   0/1     Completed   0          25h
# Read logs
$ kubectl logs my-app-driver
# Describe app
# No way to actually describe an app and its parameters…
spark-on-k8s operator
# Run an app
$ kubectl apply -f <app-manifest>.yaml
# List apps
$ kubectl get sparkapplications
NAME     AGE
my-app   2d22h
# Read logs
$ sparkctl log my-app
# Describe app
$ kubectl get sparkapplications my-app -o yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
spec:
  arguments:
  - gs://path/to/data.parquet
  mainApplicationFile: local:///opt/my-app/main.jar
  ...
status:
  applicationState:
    state: COMPLETED
  ...
Dependency Management Comparison
YARN
● Lack of isolation
○ Global Spark version
○ Global Python version
○ Global dependencies
● Lack of reproducibility
○ Flaky init scripts
○ Subtle differences in AMIs or system
Kubernetes
● Full isolation
○ Each Spark app runs in its own Docker container
● Control your environment
○ Package each app in a Docker image
○ Or build a small set of Docker images for major changes and specify your app code using URIs
Spark on Kubernetes:
Configuration & Performance Tips
A surprise when sizing executors on k8s
Assume you have a k8s cluster with 16GB-RAM 4-core instances.
Do one of these and you’ll never get an executor!
● Set spark.executor.cores=4
● Set spark.executor.memory=11g
k8s-aware executor sizing
What happened?
→ Only a fraction of capacity is available to Spark pods,
and spark.executor.cores=4 requests 4 cores!
Compute available resources
● Estimate node allocatable: usually 95%
● Measure what’s taken by your daemonsets (say 10%)
→ 85% of cores are available
Configure Spark
spark.executor.cores=4
spark.kubernetes.executor.request.cores=3400m
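The arithmetic behind the 3400m figure can be sketched as follows. This is a minimal illustration, assuming the slide's example percentages (95% allocatable, 10% taken by daemonsets, both relative to node capacity); the function name is hypothetical, and the real percentages should be measured on your own cluster.

```python
def executor_request_millicores(node_cores, allocatable_pct=95, daemonset_pct=10):
    """CPU request (in millicores) for one executor that fills a whole node.

    Both percentages are relative to total node capacity, as on the slide:
    allocatable (95%) minus daemonsets (10%) leaves 85% for Spark pods.
    """
    available_pct = allocatable_pct - daemonset_pct
    # Integer arithmetic avoids float rounding surprises in the result.
    return node_cores * 1000 * available_pct // 100


millicores = executor_request_millicores(4)
print("spark.executor.cores=4")
print(f"spark.kubernetes.executor.request.cores={millicores}m")
```

With the default arguments and a 4-core node, the second line printed is spark.kubernetes.executor.request.cores=3400m, matching the configuration above: Spark still schedules 4 tasks per executor, while the pod only requests the CPU that the node can actually offer.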
[Diagram: node capacity splits into resources reserved for k8s and system daemons plus node allocatable; node allocatable splits into resources requested by daemonsets plus the remaining space for Spark pods]
Dynamic allocation on Kubernetes
● Full dynamic allocation is not available: killing an executor pod may lose shuffle files that are expensive to recompute. There is ongoing work to enable it (JIRA: SPARK-24432).
● In the meantime, a soft dynamic allocation is available from Spark 3.0: only executors that do not hold active shuffle files can be scaled down.
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
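As a rough sketch, the two properties above can be turned into spark-submit arguments like this. The helper name and the executor bounds are assumptions added for illustration (spark.dynamicAllocation.minExecutors and maxExecutors are standard Spark properties, but the values are arbitrary).

```python
def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


dynamic_allocation_conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Assumed bounds for illustration only; tune them per application.
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
}

print(" ".join(to_submit_args(dynamic_allocation_conf)))
```

The printed string can be pasted into a spark-submit command line; with the spark-on-k8s operator, the same key/value pairs would go into the application manifest instead.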
Cluster autoscaling & dynamic allocation
k8s can be configured to autoscale when pending pods cannot be scheduled.
Autoscaling plays well with dynamic allocation:
● <10s to get a new executor if there is room in the cluster
● 1-2 min if the cluster needs to autoscale
The cluster autoscaler must be installed manually on AKS (Azure) and EKS (AWS). It is installed natively on GKE (GCP).
[Diagram: a k8s cluster running two Spark applications; dynamic allocation scales each application, cluster autoscaling scales the cluster]
Overprovisioning to speed up dynamic allocation
To further improve the speed of dynamic allocation, overprovision the cluster with low-priority pause pods:
● The pause pods force k8s to scale up
● Spark pods preempt the pause pods’ resources when needed
See the cluster autoscaler documentation on overprovisioning.
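The overprovisioning setup can be sketched as two Kubernetes manifests: a negative-priority PriorityClass and a Deployment of pause pods that use it. This is a minimal sketch, not a tested production setup; the resource names, replica count, resource requests and image tag are assumptions chosen for illustration.

```python
import json


def overprovisioning_manifests(replicas, cpu, memory):
    """Build a low-priority PriorityClass and a Deployment of pause pods."""
    priority_class = {
        "apiVersion": "scheduling.k8s.io/v1",
        "kind": "PriorityClass",
        "metadata": {"name": "overprovisioning"},  # hypothetical name
        # Below the default pod priority (0), so Spark pods can preempt these.
        "value": -1,
        "globalDefault": False,
        "description": "Placeholder pods that Spark pods may preempt.",
    }
    labels = {"run": "overprovisioning"}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "overprovisioning"},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "priorityClassName": "overprovisioning",
                    "containers": [{
                        "name": "pause",
                        # Assumed image tag; any pause image works.
                        "image": "registry.k8s.io/pause:3.9",
                        "resources": {"requests": {"cpu": cpu, "memory": memory}},
                    }],
                },
            },
        },
    }
    return [priority_class, deployment]


# Reserve room for two executor-sized pods (sizes are illustrative).
for manifest in overprovisioning_manifests(replicas=2, cpu="3400m", memory="10Gi"):
    print(json.dumps(manifest, indent=2))
```

Sizing each pause pod like one executor pod means that every preempted pause pod frees exactly the room a new executor needs, while the evicted pause pod becomes pending and triggers the cluster scale-up in the background.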
[Diagram: a k8s cluster with two Spark applications and a pause pod; dynamic allocation acts on each application, cluster autoscaling acts on the cluster]
Further cost reduction with spot instances
Spot (or preemptible) instances can reduce costs by up to 75%.
● If an executor is killed, Spark can recover
● If the driver is killed, game over!
Node selectors and affinities can be used to constrain drivers to non-preemptible nodes.
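One way to express such a constraint is a driver pod template with a node affinity, passed to Spark 3.0+ through spark.kubernetes.driver.podTemplateFile. The sketch below builds that template as a Python dict and prints it as JSON (which is also valid YAML). The function name is hypothetical, and the label key is an assumption: it is the label GKE is understood to set on preemptible nodes, so check which label your provider or node pool uses.

```python
import json


def driver_pod_template(preemptible_label_key):
    """Pod template that keeps the pod off nodes carrying the given label."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {
            "affinity": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [{
                            "matchExpressions": [{
                                # Only schedule on nodes WITHOUT this label.
                                "key": preemptible_label_key,
                                "operator": "DoesNotExist",
                            }]
                        }]
                    }
                }
            }
        },
    }


# Assumed label key (GKE preemptible nodes); adapt to your cluster.
template = driver_pod_template("cloud.google.com/gke-preemptible")
print(json.dumps(template, indent=2))
```

Executors need no such constraint: leaving them free to land on preemptible nodes is what yields the cost savings.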
[Diagram: drivers run on a non-preemptible node; executors run on preemptible nodes]
I/O with an object storage
Usually in Spark on Kubernetes, data is read from and written to an object store.
Cloud providers write optimized committers for their object stores, like the S3A Committers.
If that is not the case, use version 2 of the Hadoop committer bundled with Spark:
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
The performance boost may be up to 2x! (if you write many files)
Improve shuffle performance with volumes
I/O speed is critical in shuffle-bound workloads, because Spark uses local files as
scratch space.
Docker filesystem is slow → Use volumes to improve performance!
● emptyDir: use a temporary directory on the host (by default in Spark 3.0)
● hostPath: Leverage a fast disk mounted in the host (NVMe-based SSD)
● tmpfs: Use your RAM as local storage (⚠ dangerous)
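The three options map to Spark properties roughly as sketched below. The helper name and the /mnt/ssd path are hypothetical; the property names follow the Spark 3.0 "Running on Kubernetes" documentation, where a volume whose name starts with spark-local-dir- is used as executor scratch space.

```python
def local_dir_conf(host_path=None, use_tmpfs=False):
    """Spark properties selecting where executors keep shuffle scratch files."""
    if use_tmpfs:
        # RAM-backed scratch space: counts against the pod's memory.
        return {"spark.kubernetes.local.dirs.tmpfs": "true"}
    if host_path is None:
        # Nothing to set: Spark falls back to an emptyDir volume.
        return {}
    prefix = "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1"
    return {
        f"{prefix}.mount.path": host_path,    # path inside the executor container
        f"{prefix}.options.path": host_path,  # path of the fast disk on the host
    }


# Hypothetical mount point of an NVMe SSD on the node.
for key, value in local_dir_conf(host_path="/mnt/ssd").items():
    print(f"{key}={value}")
```

The tmpfs option is flagged as dangerous on the slide because shuffle data then consumes the executor pod's memory, so memory requests would need to grow accordingly.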
Performance
We ran performance benchmarks to compare Kubernetes and YARN.
Results will be published on our blog early July 2020.
(Sneak peek: There is no performance penalty for running on k8s if you follow our recommendations)
Spark on Kubernetes:
Monitoring & Security
Monitor pod resource usage with k8s tools
Workload-agnostic tools to monitor pod resource usage:
● Kubernetes dashboard (installation on EKS)
● The GKE console
Issues:
● Hard to reconcile with Spark jobs/stages/tasks
● Executor metadata is lost when the Spark app is completed
[Screenshots: GKE console, Kubernetes dashboard]
Spark history server
Setting up a history server is relatively easy:
● Direct your Spark event logs to S3/GCS/Azure Storage Account with the spark.eventLog.dir config
● Install the Spark history server Helm chart on your cluster
What’s missing: resource usage metrics!
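A minimal sketch of the matching configuration follows. Besides spark.eventLog.dir, applications also need spark.eventLog.enabled=true, and the history server reads the same location through spark.history.fs.logDirectory. The function name and the bucket path are hypothetical.

```python
def history_server_conf(event_log_dir):
    """Return (application properties, history server properties)."""
    app_conf = {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": event_log_dir,
    }
    # The history server must read from where the applications write.
    server_conf = {"spark.history.fs.logDirectory": event_log_dir}
    return app_conf, server_conf


# Hypothetical bucket; use your own S3/GCS/Azure location.
app_conf, server_conf = history_server_conf("s3a://my-bucket/spark-events")
for key, value in {**app_conf, **server_conf}.items():
    print(f"{key}={value}")
```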
“Spark Delight” - A Spark UI replacement
Note: This slide was added after the conference.
Sorry for the self-promotion. We look forward to the feedback from the community!
● We’re building a better Spark UI
○ better UX
○ new system metrics
○ automated performance
recommendations
○ free of charge
○ cross-platform
● Not released yet, but we’re working
on it! Learn more and leave us
feedback.
Export Spark metrics to a time-series database
Spark leverages the Dropwizard library to produce detailed metrics.
The metrics can be exported to a time-series database:
● InfluxDB (see spark-dashboard by Luca Canali)
● Prometheus
○ Spark has a built-in Prometheus servlet since version 3.0
○ The spark-operator provides a Docker image with a Prometheus Java agent for older versions
Use sparkMeasure to pipe task metrics and stage boundaries to the database.
[Screenshot: spark-dashboard by Luca Canali]
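For the built-in Prometheus servlet, the relevant properties can be sketched as below and rendered in spark-defaults.conf style. The property names follow the Spark 3.0 monitoring documentation and its metrics.properties template; the helper name is hypothetical, and the servlet was marked experimental in 3.0, so verify against your Spark version.

```python
prometheus_conf = {
    # Exposes executor metrics on the driver UI in Prometheus format.
    "spark.ui.prometheus.enabled": "true",
    # Registers the Prometheus servlet sink for all metric instances.
    "spark.metrics.conf.*.sink.prometheusServlet.class":
        "org.apache.spark.metrics.sink.PrometheusServlet",
    "spark.metrics.conf.*.sink.prometheusServlet.path": "/metrics/prometheus",
}


def to_properties_lines(conf):
    """Render properties as 'key value' lines, as in spark-defaults.conf."""
    return [f"{key} {value}" for key, value in conf.items()]


print("\n".join(to_properties_lines(prometheus_conf)))
```

A Prometheus scrape job would then target the driver UI port at the configured path.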
Security
Kubernetes security best practices apply to Spark on Kubernetes for free!
Access control
Strong built-in RBAC system in Kubernetes
Spark apps and pods benefit from it as native k8s resources
Secrets management
Kubernetes secrets as a first step
Integrations with solutions like HashiCorp Vault
Networking
Mutual TLS, Network policies (since v1.18)
Service mesh like Istio
Spark on Kubernetes:
Future Work
Features being worked on
● Shuffle improvements: Disaggregating storage and compute
○ Use remote storage for persisting shuffle data: SPARK-25299
○ Goal: Enable full dynamic allocation, and make Spark resilient to node loss (e.g. spot/preemptible VMs)
● Better handling for node shutdown
○ Copy shuffle and cache data during graceful decommissioning of a node: SPARK-20624
● Support local Python dependency upload (SPARK-27936)
● Job queues and resource management
Spark on Kubernetes:
Should You Get Started?
We chose Kubernetes for our platform - should you?
Pros
● Native containerization
● A single cloud-agnostic infrastructure for your entire tech stack with a rich ecosystem
● Efficient resource sharing guaranteeing both resource isolation and cost efficiency
Cons
● Learning curve if you’re new to Kubernetes
● A lot to set up yourself since most managed platforms do not support Kubernetes
● Marked as experimental (until 2.4) with missing features like the External Shuffle service.
For more details, read our blog post
The Pros and Cons of Running Apache Spark on Kubernetes
Checklist to get started with Spark-on-Kubernetes
● Set up the infrastructure
○ Create the Kubernetes cluster
○ Optional: Set up the spark operator
○ Create a Docker registry
○ Host the Spark history server
○ Set up monitoring for Spark application logs and metrics
● Configure your apps for success
○ Configure node pools and your pod sizes for optimal binpacking
○ Optimize I/O with proper libraries and volume mounts
○ Optional: Enable k8s autoscaling and Spark app dynamic allocation
○ Optional: Use spot/preemptible VMs for cost reduction
● Enjoy the ride!
Our platform helps with this, and we’re happy to help too!
The Simplest Way To Run Spark
https://www.datamechanics.co
Thank you!
Appendix
Cost reduction with cluster autoscaling
Configure two node pools for your k8s cluster
● Node pool of small instances for system pods (e.g. ingress controller, autoscaler, spark-operator)
● Node pool of larger instances for Spark applications
Since node pools can scale down to zero on all cloud providers,
● you have large instances at your disposal for Spark apps
● you only pay for a small instance when the cluster is idle!

Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Machine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack DetectionMachine Learning CI/CD for Email Attack Detection
Machine Learning CI/CD for Email Attack Detection
Databricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls

  • 1. Running Spark on Kubernetes: Best Practices and Pitfalls Jean-Yves Stephan, Co-Founder & CEO @ Data Mechanics Julien Dumazert, Co-Founder & CTO @ Data Mechanics
  • 2. Who We Are Jean-Yves “JY” Stephan Co-Founder & CEO @ Data Mechanics jy@datamechanics.co Previously: Software Engineer and Spark Infrastructure Lead @ Databricks Julien Dumazert Co-Founder & CTO @ Data Mechanics julien@datamechanics.co Previously: Lead Data Scientist @ ContentSquare Data Scientist @ BlaBlaCar
  • 3. Who Are You? Poll: What is your experience with running Spark on Kubernetes? ● 61% - I’ve never used it, but I’m curious about it. ● 24% - I’ve prototyped using it, but I’m not using it in production. ● 15% - I’m using it in production. This slide was edited after the conference to show the results for the poll. You can see and take the poll at https://www.datamechanics.co/spark-summit-poll
  • 4. Agenda A quick primer on Data Mechanics Spark on Kubernetes Core Concepts & Setup Configuration & Performance Tips Monitoring & Security Future Works Conclusion: Should you get started?
  • 5. Data Mechanics - A serverless Spark platform ● Applications start and autoscale in seconds. ● Seamless transition from local development to running at scale. ● Tunes the infra parameters and Spark configurations automatically for each pipeline to make them fast and stable. https://www.datamechanics.co
  • 6. Customer story: Impact of automated tuning on Tradelab For details, watch our SSAI 2019 Europe talk How to automate performance tuning for Apache Spark ● Stability: Automatic remediation of OutOfMemory errors and timeouts ● 2x performance boost on average (speed and cost savings)
  • 7. We’re deployed on k8s in our customers’ cloud account. (Diagram labels: Gateway, Data engineers, Data scientists)
  • 8. Spark on Kubernetes: Core Concepts & Setup
  • 9. Where does Kubernetes fit within Spark? Kubernetes is a new cluster manager/scheduler for Spark. ● Standalone ● Apache Mesos ● YARN ● Kubernetes (since Spark 2.3)
  • 10. Spark on Kubernetes - Architecture Source
  • 11. Two ways to submit Spark applications on k8s. Spark-submit: ● “Vanilla” way from Spark main open source repo ● Configs spread between Spark config (mostly) and k8s manifests ● Little pod customization support before Spark 3.0 ● App management is more manual. spark-on-k8s operator: ● Open-sourced by Google (but works on any platform) ● Configs in k8s-style YAML with sugar on top (configmaps, volumes, affinities) ● Tooling to read logs, kill, restart, schedule apps ● Requires a long-running system pod
  • 12. App management in practice
      Spark-submit:
        # Run an app
        $ spark-submit --master k8s://https://<api-server> …
        # List apps
        $ kubectl get pods -l "spark-role=driver"
        NAME           READY  STATUS     RESTARTS  AGE
        my-app-driver  0/1    Completed  0         25h
        # Read logs
        $ kubectl logs my-app-driver
        # Describe app
        # No way to actually describe an app and its parameters…
      spark-on-k8s operator:
        # Run an app
        $ kubectl apply -f <app-manifest>.yaml
        # List apps
        $ kubectl get sparkapplications
        NAME    AGE
        my-app  2d22h
        # Read logs
        $ sparkctl log my-app
        # Describe app
        $ kubectl get sparkapplications my-app -o yaml
        apiVersion: sparkoperator.k8s.io/v1beta2
        kind: SparkApplication
        arguments:
        - gs://path/to/data.parquet
        mainApplicationFile: local:///opt/my-app/main.jar
        ...
        status:
          applicationState:
            state: COMPLETED
        ...
  • 13. Dependency Management Comparison. YARN: ● Lack of isolation ○ Global Spark version ○ Global Python version ○ Global dependencies ● Lack of reproducibility ○ Flaky init scripts ○ Subtle differences in AMIs or system. Kubernetes: ● Full isolation ○ Each Spark app runs in its own Docker container ● Control your environment ○ Package each app in a Docker image ○ Or build a small set of Docker images for major changes and specify your app code using URIs
  • 15. A surprise when sizing executors on k8s Assume you have a k8s cluster with 16GB-RAM 4-core instances. Do one of these and you’ll never get an executor! ● Set spark.executor.cores=4 ● Set spark.executor.memory=11g
  • 16. k8s-aware executor sizing What happened? → Only a fraction of a node's capacity is available to Spark pods, and spark.executor.cores=4 requests all 4 cores! Compute available resources: ● Estimate node allocatable: usually 95% of capacity ● Measure what’s taken by your daemonsets (say 10%) → 85% of cores are available. Configure Spark: spark.executor.cores=4 spark.kubernetes.executor.request.cores=3400m (Diagram: node capacity = resources reserved for k8s and system daemons + node allocatable; node allocatable = resources requested by daemonsets + remaining space for Spark pods.) More configuration tips here
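The sizing arithmetic on this slide can be sketched in a few lines of Python. This is a minimal sketch that assumes the slide's rough figures (about 95% of node capacity is allocatable, about 10% of capacity is taken by daemonsets); the function names are hypothetical, and the real percentages should be measured on your own nodes.

```python
def executor_request_millicores(node_cores: int,
                                allocatable_pct: int = 95,
                                daemonset_pct: int = 10) -> int:
    """CPU, in millicores, left on one node for a single Spark executor pod.

    Integer arithmetic is used on purpose to avoid floating-point rounding.
    The default percentages are the slide's estimates, not measured values.
    """
    capacity = node_cores * 1000                      # 1 core = 1000 millicores
    allocatable = capacity * allocatable_pct // 100   # after k8s/system reservations
    daemonsets = capacity * daemonset_pct // 100      # requested by daemonset pods
    return allocatable - daemonsets


def executor_cpu_conf(node_cores: int) -> dict:
    """Spark settings for one executor per node that still fits on the node."""
    millicores = executor_request_millicores(node_cores)
    return {
        # Number of task slots Spark schedules on the executor.
        "spark.executor.cores": str(node_cores),
        # CPU actually requested from Kubernetes for the executor pod.
        "spark.kubernetes.executor.request.cores": f"{millicores}m",
    }


conf = executor_cpu_conf(4)
for key, value in conf.items():
    print(f"{key}={value}")
```

For the 4-core instances in the example, this reproduces the 3400m request shown on the slide.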
  • 17. Dynamic allocation on Kubernetes ● Full dynamic allocation is not available. When killing an executor pod, you may lose shuffle files that are expensive to recompute. There is ongoing work to enable it (JIRA: SPARK-24432). ● In the meantime, a soft dynamic allocation is available from Spark 3.0: only executors which do not hold active shuffle files can be scaled down. spark.dynamicAllocation.enabled=true spark.dynamicAllocation.shuffleTracking.enabled=true
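As an illustration, the two settings above can be turned into `--conf` flags for spark-submit. This is a sketch only: the executor bounds are example values that do not come from the slides, and `to_submit_args` is a hypothetical helper.

```python
import shlex

DYNAMIC_ALLOCATION_CONF = {
    # The two settings from the slide (Spark 3.0+).
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    # Assumed bounds, chosen only for illustration.
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
}


def to_submit_args(conf: dict) -> list:
    """Flatten a config mapping into ['--conf', 'key=value', ...]."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


args = to_submit_args(DYNAMIC_ALLOCATION_CONF)
# Print the flags as a shell-safe string to paste into a spark-submit command.
print(shlex.join(args))
```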
  • 18. Cluster autoscaling & dynamic allocation k8s can be configured to autoscale if pending pods cannot be allocated. Autoscaling plays well with dynamic allocation: ● <10s to get a new executor if there is room in the cluster ● 1-2 min if the cluster needs to autoscale. The cluster autoscaler must be installed on AKS (Azure) and EKS (AWS). It is natively installed on GKE (GCP). (Diagram: cluster autoscaling resizes the k8s cluster; dynamic allocation resizes each Spark application inside it.)
  • 19. Overprovisioning to speed up dynamic allocation To further improve the speed of dynamic allocation, overprovision the cluster with low-priority pause pods: ● The pause pods force k8s to scale up ● Spark pods preempt pause pods’ resources when needed. See the cluster autoscaler doc about overprovisioning. (Diagram: a k8s cluster with cluster autoscaling, two Spark applications using dynamic allocation, and a pause pod.)
  • 20. Further cost reduction with spot instances Spot (or preemptible) instances can reduce costs by up to 75%. ● If an executor is killed, Spark can recover ● If the driver is killed, game over! Node selectors and affinities can be used to constrain drivers to non-preemptible nodes. (Diagram: drivers on a non-preemptible node, executors on preemptible nodes.)
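One way to pin drivers to non-preemptible nodes is a driver pod template carrying a node selector, referenced through `spark.kubernetes.driver.podTemplateFile` (available from Spark 3.0). The sketch below builds such a template in Python. The node label and the file name are hypothetical, so substitute whatever label your node pools actually carry; the template is rendered as JSON, which YAML parsers generally accept.

```python
import json

# Hypothetical label; use the label that marks on-demand nodes in your cluster.
ON_DEMAND_LABEL = {"node-lifecycle": "on-demand"}


def driver_pod_template(node_selector: dict) -> dict:
    """Minimal pod template that restricts scheduling to matching nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "spec": {"nodeSelector": dict(node_selector)},
    }


template = driver_pod_template(ON_DEMAND_LABEL)

# Hypothetical path where the rendered template would be saved.
template_path = "driver-template.json"
spark_conf = {"spark.kubernetes.driver.podTemplateFile": template_path}

print(json.dumps(template, indent=2))
```

Executors are left unconstrained here, so they can still land on cheaper preemptible nodes.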
  • 21. I/O with an object storage Usually in Spark on Kubernetes, data is read from and written to an object store. Cloud providers write optimized committers for their object stores, like the S3A Committers. If none is available for yours, use version 2 of the Hadoop committer bundled with Spark: spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 The performance boost may be up to 2x (if you write many files).
  • 22. Improve shuffle performance with volumes I/O speed is critical in shuffle-bound workloads, because Spark uses local files as scratch space. Docker filesystem is slow → Use volumes to improve performance! ● emptyDir: use a temporary directory on the host (by default in Spark 3.0) ● hostPath: Leverage a fast disk mounted in the host (NVMe-based SSD) ● tmpfs: Use your RAM as local storage (⚠ dangerous) Performance We ran performance benchmarks to compare Kubernetes and YARN. Results will be published on our blog early July 2020. (Sneak peek: There is no performance penalty for running on k8s if you follow our recommendations)
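The volume options above map to Spark configuration keys of the form `spark.kubernetes.executor.volumes.[type].[name]...`. In Spark 3.0, a volume is used as local scratch space when its name starts with `spark-local-dir-`. The following is a sketch under those assumptions; the helper name and both paths are hypothetical.

```python
from typing import Optional


def local_dir_volume_conf(volume_type: str, index: int, mount_path: str,
                          host_path: Optional[str] = None) -> dict:
    """Spark settings that mount a volume as executor scratch space."""
    # The 'spark-local-dir-' prefix is what marks the volume as a local dir.
    name = f"spark-local-dir-{index}"
    prefix = f"spark.kubernetes.executor.volumes.{volume_type}.{name}"
    conf = {f"{prefix}.mount.path": mount_path}
    if host_path is not None:
        # Only hostPath volumes need a path on the node itself.
        conf[f"{prefix}.options.path"] = host_path
    return conf


# hostPath: assumes a fast local SSD is mounted at this (hypothetical) node path.
host_path_conf = local_dir_volume_conf(
    "hostPath", 1, "/tmp/spark-local", host_path="/mnt/disks/ssd0")

# tmpfs: backs the default emptyDir local dirs with RAM (counts against pod memory).
tmpfs_conf = {"spark.kubernetes.local.dirs.tmpfs": "true"}

for key, value in {**host_path_conf, **tmpfs_conf}.items():
    print(f"{key}={value}")
```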
  • 24. Monitor pod resource usage with k8s tools Workload-agnostic tools to monitor pod usage: ● Kubernetes dashboard (installation on EKS) ● The GKE console. Issues: ● Hard to reconcile with Spark jobs/stages/tasks ● Executor metadata is lost when the Spark app is completed. (Screenshots: GKE console, Kubernetes dashboard)
  • 25. Spark history server Setting up a History server is relatively easy: ● Direct your Spark event log file to S3/GCS/Azure Storage Account with the spark.eventLog.dir config ● Install the Spark history server Helm chart on your cluster What’s missing: resource usage metrics!
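A minimal sketch of the two sides of this setup: the application writes event logs to object storage, and the history server reads from the same location. The bucket URI is a made-up example and the helper name is hypothetical; credentials and filesystem connectors are out of scope here.

```python
def event_log_confs(log_dir: str):
    """Return (application conf, history server conf) sharing one log directory."""
    app_conf = {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
    }
    history_server_conf = {
        # The history server must read from where the applications write.
        "spark.history.fs.logDirectory": log_dir,
    }
    return app_conf, history_server_conf


# Hypothetical bucket; any S3/GCS/Azure storage URI supported by your setup works.
app_conf, history_conf = event_log_confs("s3a://my-bucket/spark-events")

for key, value in {**app_conf, **history_conf}.items():
    print(f"{key}={value}")
```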
  • 26. “Spark Delight” - A Spark UI replacement Note: This slide was added after the conference. Sorry for the self-promotion. We look forward to the feedback from the community! ● We’re building a better Spark UI ○ better UX ○ new system metrics ○ automated performance recommendations ○ free of charge ○ cross-platform ● Not released yet, but we’re working on it! Learn more and leave us feedback.
  • 27. Export Spark metrics to a time-series database Spark leverages the DropWizard library to produce detailed metrics. The metrics can be exported to a time-series database: ● InfluxDB (see spark-dashboard by Luca Canali) ● Prometheus ○ Spark has a built-in Prometheus servlet since version 3.0 ○ The spark-operator provides a Docker image with a Prometheus Java agent for older versions. Use sparkMeasure to pipe task metrics and stage boundaries to the database. (Screenshot: Luca Canali, spark-dashboard)
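For the built-in Prometheus servlet in Spark 3.0, the sink is declared in `metrics.properties`, and executor metrics are exposed through the UI when `spark.ui.prometheus.enabled` is set. The sketch below renders those settings from Python; `render_properties` is a hypothetical helper, and scrape configuration on the Prometheus side is not covered.

```python
# Sink declaration as it would appear in Spark's metrics.properties file.
PROMETHEUS_SERVLET_SINK = {
    "*.sink.prometheusServlet.class":
        "org.apache.spark.metrics.sink.PrometheusServlet",
    "*.sink.prometheusServlet.path": "/metrics/prometheus",
}

# Spark conf that additionally exposes executor metrics on the driver UI.
PROMETHEUS_UI_CONF = {"spark.ui.prometheus.enabled": "true"}


def render_properties(props: dict) -> str:
    """Render a mapping as Java-style 'key=value' properties text."""
    return "".join(f"{key}={value}\n" for key, value in props.items())


metrics_properties = render_properties(PROMETHEUS_SERVLET_SINK)
print(metrics_properties, end="")
```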
  • 28. Security Kubernetes security best practices apply to Spark on Kubernetes for free! Access control: strong built-in RBAC system in Kubernetes; Spark apps and pods benefit from it as native k8s resources. Secrets management: Kubernetes secrets as a first step; integrations with solutions like HashiCorp Vault. Networking: mutual TLS, network policies (since v1.18); service mesh like Istio.
  • 30. Features being worked on ● Shuffle improvements: Disaggregating storage and compute ○ Use remote storage for persisting shuffle data: SPARK-25299 ○ Goal: Enable full dynamic allocation, and make Spark resilient to node loss (e.g. spot/pvm) ▪ Better handling for node shutdown ▪ Copy shuffle and cache data during graceful decommissioning of a node: SPARK-20624 ▪ Support local Python dependency upload (SPARK-27936) ▪ Job queues and resource management
  • 31. Spark on Kubernetes: Should You Get Started?
  • 32. We chose Kubernetes for our platform - should you? Pros: ● Native containerization ● A single cloud-agnostic infrastructure for your entire tech stack with a rich ecosystem ● Efficient resource sharing guaranteeing both resource isolation and cost efficiency. Cons: ● Learning curve if you’re new to Kubernetes ● A lot to set up yourself since most managed platforms do not support Kubernetes ● Marked as experimental (until 2.4) with missing features like the External Shuffle service. For more details, read our blog post The Pros and Cons of Running Apache Spark on Kubernetes
  • 33. Checklist to get started with Spark-on-Kubernetes ● Set up the infrastructure ○ Create the Kubernetes cluster ○ Optional: Set up the spark operator ○ Create a Docker Registry ○ Host the Spark History Server ○ Set up monitoring for Spark application logs and metrics ● Configure your apps for success ○ Configure node pools and your pod sizes for optimal binpacking ○ Optimize I/O with proper libraries and volume mounts ○ Optional: Enable k8s autoscaling and Spark app dynamic allocation ○ Optional: Use spot/preemptible VMs for cost reduction ● Enjoy the ride! Our platform helps with this, and we’re happy to help too!
  • 34. The Simplest Way To Run Spark https://www.datamechanics.co Thank you!
  • 36. Cost reduction with cluster autoscaling Configure two node pools for your k8s cluster: ● Node pool of small instances for system pods (e.g. ingress controller, autoscaler, spark-operator) ● Node pool of larger instances for Spark applications. Since node pools can scale down to zero on all cloud providers, ● you have large instances at your disposal for Spark apps ● you only pay for a small instance when the cluster is idle!