Running Apache Spark Jobs Using Kubernetes

Yaron Haviv
CTO, Iguazio
Marcelo Litovsky
Solution Architect, Iguazio

85% of AI Projects Never Make it to Production
Research Environment
▪ Manual extraction
▪ In-mem analysis
▪ Small scale training
▪ Manual evaluation
Production Pipeline (build from scratch with a large team)
▪ Real-time ingestion
▪ Preparation at scale
▪ Train with many params & large data
▪ Real-time events & data features
Also shown in the diagram: ETL, Streaming, APIs, Sync

Spark Helps Us Scale the ML Pipeline
▪ Ingest: ETL, Streaming, Logs, Scrapers, ..
▪ Prepare: Join, Aggregate, Split, ..
▪ Train: with hyper-params, multiple algorithms
▪ Validate: selected model with test data
▪ Deploy ++: test, deploy, monitor model & API servers
Serverless: ML & Analytics Functions
Features/Data: Fast, Secure, Versioned
Artifacts passed between stages: base features, train + test datasets, model, report, metrics, RT features

Why Spark on Kubernetes?
▪ Unified management: avoid running two cluster management interfaces if your organization already uses Kubernetes elsewhere.
▪ Ability to isolate jobs: you can move models and ETL pipelines from dev to production without the headaches of dependency management.
▪ Resilient infrastructure: you don’t worry about sizing and building the cluster, manipulating Docker files or networking configurations.
▪ Vibrant, constantly evolving community

Goodbye Hadoop, Hello Cloud-Native
Eliminate complexity and inefficiency, gain cloud agility
Layers: Data, Orchestration, Middleware, Your Business Logic
Hadoop: HDFS, HBase, YARN, MapReduce, Pig, Hive, ..
Cloud-native: Managed Storage and Databases, Any Containerized Microservice
Diagram labels: Consume, Innovate

Spark on Kubernetes
Diagram and Bullet point Credit: https://spark.apache.org/docs/latest/running-on-kubernetes.html#prerequisites
• Spark creates a Spark driver running within
a Kubernetes pod.
• The driver creates executors which are also
running within Kubernetes pods and connects
to them, and executes application code.
• When the application completes, the executor
pods terminate and are cleaned up, but the
driver pod persists logs and remains in
“completed” state in the Kubernetes API until
it’s eventually garbage collected or manually
cleaned up.
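The submission flow described above can be sketched as a small helper that assembles a `spark-submit` command for Kubernetes cluster mode. This is a minimal sketch, not the presenters' demo code: the API server address, image name, and jar path are placeholder assumptions, and the helper name `build_spark_submit` is hypothetical. The flags and `spark.*` properties are the ones documented in "Running Spark on Kubernetes".

```python
import shlex


def build_spark_submit(api_server, image, app_file, name="spark-pi",
                       main_class=None, executors=2):
    """Return the spark-submit argument list for Kubernetes cluster mode."""
    args = [
        "spark-submit",
        # The k8s:// prefix tells Spark to use the Kubernetes scheduler.
        "--master", f"k8s://{api_server}",
        # Cluster mode: the driver itself runs in a pod.
        "--deploy-mode", "cluster",
        "--name", name,
    ]
    if main_class:
        args += ["--class", main_class]
    args += [
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
        app_file,
    ]
    return args


# All values below are placeholders (assumptions), not real endpoints.
cmd = build_spark_submit(
    api_server="https://kubernetes.example.com:6443",
    image="example.registry/spark:latest",
    app_file="local:///opt/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
)
print(shlex.join(cmd))
```

The printed command can be pasted into a shell that has `spark-submit` on its path. The `local://` scheme means the jar is already present inside the container image.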

How to Run Your Spark Job in Kubernetes?
▪ Cluster Mode: the client runs spark-submit; the Spark driver and Spark executors run in Kubernetes
▪ Client Mode: the client runs spark-submit with the Spark driver; the Spark executors run in Kubernetes
▪ K8S Operator: kubectl calls the Kubernetes API; the Spark Operator launches the Spark driver and Spark executors

Comparing Modes
Execution environment
▪ Client Mode: Driver runs on the job scheduling environment
▪ Cluster Mode: Driver runs in a Kubernetes pod
▪ K8S Operator: Driver runs in a Kubernetes pod
Driver pod communication
▪ Client Mode: User needs to define communications between driver and executors
▪ Cluster Mode: Kubernetes networking needs to be properly configured for driver and executor pods to communicate
▪ K8S Operator: The operator enables proper communication between driver and executors
Role-based access controls
▪ Client Mode: User needs direct access to Kubernetes with proper RBAC
▪ Cluster Mode: User needs direct access to Kubernetes with proper RBAC
▪ K8S Operator: The operator handles deployments. More flexibility configuring RBAC
Execution
▪ Client Mode: Driver could be located in a separate host/container
▪ Cluster Mode: Driver runs in the same Kubernetes cluster as executors
▪ K8S Operator: Driver runs in the same Kubernetes cluster as executors
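For the operator mode, the job is described declaratively as a SparkApplication custom resource rather than submitted with spark-submit. The sketch below builds such a manifest as a Python dictionary and serializes it to JSON, which `kubectl apply -f` accepts as well as YAML. It assumes the `sparkoperator.k8s.io/v1beta2` API version of the Spark Operator. The application name, image, file path, Spark version, and service account are illustrative assumptions, and the helper name `spark_application` is hypothetical.

```python
import json


def spark_application(name, image, main_file, namespace="default",
                      spark_version="3.1.1", executors=2,
                      service_account="spark"):
    """Build a SparkApplication manifest for the Spark Operator."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",  # the driver runs in a pod, as in the table
            "image": image,
            "mainApplicationFile": main_file,
            "sparkVersion": spark_version,
            "restartPolicy": {"type": "Never"},
            # The driver's service account carries the RBAC permissions
            # needed to create executor pods.
            "driver": {
                "cores": 1,
                "memory": "512m",
                "serviceAccount": service_account,
            },
            "executor": {
                "cores": 1,
                "instances": executors,
                "memory": "512m",
            },
        },
    }


# Placeholder values (assumptions), not taken from the demo repository.
manifest = spark_application(
    name="example-job",
    image="example.registry/spark-py:latest",
    main_file="local:///opt/spark/examples/src/main/python/pi.py",
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Saving the output to a file and running `kubectl apply -f` on it hands the job to the operator, so the user only needs permission to create SparkApplication objects.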

Demo I
See repo: http://github.com/marcelonyc/igz_sparkk8s
• Instructions to deploy Spark Operator on Docker Desktop
• Configuration commands and files
• Examples

DevOps Challenges Remain
▪ Per-job custom resources and configuration
▪ Specific runtime requirements and package dependencies
▪ Elastic scaling, resource limits & guarantees, ..
▪ ML Pipeline integration
▪ Coexistence/integration with other frameworks
▪ Resource and job monitoring
▪ …
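To make the per-job resource and scaling settings concrete, here is a minimal sketch of the Spark properties a single job might carry. The property names are standard Spark configuration keys, and shuffle tracking for dynamic allocation assumes Spark 3.0 or later. The helper names `resource_conf` and `to_flags` and all values are illustrative assumptions.

```python
def resource_conf(min_executors, max_executors, executor_memory="2g",
                  request_cores="1", limit_cores="2"):
    """Return per-job Spark properties for elastic scaling and resource bounds."""
    if min_executors > max_executors:
        raise ValueError("min_executors must not exceed max_executors")
    return {
        # Elastic scaling: let Spark add and remove executor pods.
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Guarantees (requests) and limits for each executor pod.
        "spark.executor.memory": executor_memory,
        "spark.kubernetes.executor.request.cores": request_cores,
        "spark.kubernetes.executor.limit.cores": limit_cores,
    }


def to_flags(conf):
    """Turn a property dictionary into spark-submit --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


conf = resource_conf(min_executors=1, max_executors=10)
flags = to_flags(conf)
print(" ".join(flags))
```

Every job needs its own variant of these settings, which is part of the operational burden listed above.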

Why Not Make Spark Serverless?
Serverless: resource elasticity (to zero), automated deployment and operations
Serverless Today vs. Data Prep and Training Jobs
▪ Task lifespan: millisecs to mins vs. secs to hours
▪ Scaling: load-balancer vs. partition, shuffle, reduce, hyper-params, RDD
▪ State: stateless vs. stateful
▪ Input: event vs. params, datasets
Time we extend Serverless to data science!

ML & Analytics Functions Architecture
ML Function: User Code OR ML service, on a Runtime / SaaS (e.g. Spark, Dask, Horovod, Nuclio, ..)
Connected to: Inputs, Outputs, Data / Feature stores, Secrets, Artifacts & Models, Ops, ML Pipeline

Serverless Spark ML Function Example
https://github.com/mlrun/mlrun/blob/master/examples/mlrun_sparkk8s.ipynb

Automating The Development & Tracking Workflow
▪ Write and test locally
▪ Specify runtime configuration (image, deps, cpu/gpu/mem, data, volumes, ..)
▪ Run/scale on the cluster
▪ Build (if needed)
▪ Document & Publish
▪ Run in a Pipeline
▪ Use published functions
Track experiments/runs, functions and data

Kubeflow + Serverless: Automated ML Pipelines
What is Kubeflow?
▪ Operators for ML frameworks
(lifecycle management, scale-out, ..)
▪ Managed notebooks
▪ ML Pipeline Automation
▪ With Serverless, we automate the
deployment, execution, scaling and
monitoring of our code

Case Study: Payoneer
Fraud Prevention
• 4M global customers
• 200 countries and territories, streaming global commerce
• Understanding illicit patterns of behavior in real time based on 90 different parameters
• Proactively preventing money laundering before it occurs
Want to move from fraud detection to prevention and cut time to production

Traditional Fraud-Detection
Architecture (Hadoop)
Flow: SQL Server operational database → ETL to the DWH every 30 min → Data warehouse mirror table → Offline processing (SQL) → Feature vector → Batch prediction using R Server
40 minutes to identify a suspicious money laundering account
40 precious minutes (detect fraud after the fact)
Long and complex process to production

Moving To Real-Time Fraud Prevention
Flow: SQL Server operational database → CDC (real-time) → Real-time ingestion → Online + Offline Feature Store → Model Training (sklearn) → Model Inferencing (Nuclio) → Block account!
Also shown in the diagram: Queue, Analysis
12 seconds to detect and prevent fraud!
Automated dev to production using a serverless approach

Demo II
Fully automated ML Pipeline with
Serverless Spark + Kubeflow

Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
