SlideShare a Scribd company logo
Project Hydrogen: Unifying State-of-the-art
Big Data and AI in Apache Spark
Xiangrui Meng
2018/07
Spark: unified analytics engine for big data
“Apache Spark is the Taylor Swift of big data software.” - Fortune
500,000
meetup
members
4,000
summit
attendees
AI: re-shaping the world
“Artificial Intelligence is the New Electricity.”  -  Andrew Ng
Transportation Healthcare and Genomics Internet of Things Fraud Prevention Personalization
Get started with Spark + AI?
Try Databricks Community Edition: http://databricks.com/try
Two communities: big data and AI
Significant progress has been made by both big data and AI
communities in recent years to push the cutting edge:
more complex
big data
scenarios
more complex
deep learning
scenarios
The cross?
6
Map/Reduce
CaffeOnSpark
TensorFlowOnSpark
DataFrame-based APIs
50+ Data Sources
Python/Java/R interfaces
Structured Streaming
ML Pipelines API
Continuous
Processing
RDD
Project Tungsten
Pandas UDF
TensorFrames
scikit-learn
pandas/numpy/scipy
LIBLINEAR
R
glmnet
xgboost
GraphLab
Caffe/PyTorch/MXNet
TensorFlow
Keras
Distributed
TensorFlow
Horovod
tf.data
tf.transform
AI/ML
??
TF XLA
Neither big data nor AI is easy
“Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015
Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the
small green box in the middle. The required surrounding infrastructure is vast and complex.
ML
Code
Configuration
Data Collection
Data
Verification
Feature
Extraction
Machine
Resource
Management
Analysis Tools
Process
Management Tools
Serving
Infrastructure
Monitoring
Big data for AI
A common goal of collecting, processing, and managing big data
is to extract value from it. AI/ML is the most mentioned approach
in our conversations with customers.
improved customer engagement and conversions by
improving image classification (case study)
increased customer satisfaction, retention, and lifetime value
by detecting abusive language in real-time (case study)
Big data for AI
There are many efforts from the Spark community to integrate
Spark with AI/ML frameworks:
● (Yahoo) CaffeOnSpark, TensorFlowOnSpark
● (Intel) BigDL
● (John Snow Labs) Spark-NLP
● (Databricks) spark-sklearn, tensorframes, spark-deep-learning
● … 80+ ML/AI packages on spark-packages.org
AI needs big data
One cannot apply AI techniques without data. And DL models
scale with amount of data. We have seen efforts from the AI
community to handle different data scenarios:
● tf.data, tf.Transform
● spark-tensorflow-connector
● ...
source: Andrew Ng
When AI goes distributed ...
When datasets get bigger and bigger, we see more and more
distributed training scenarios and open-source offerings, e.g.,
distributed TensorFlow, Horovod, and distributed MXNet.
This is where we see Spark and AI efforts overlap more.
The status quo: two simple stories
As a data scientist, I can:
● build a pipeline that fetches training events from a production
data warehouse and trains a DL model in parallel;
● apply a trained DL model to a distributed stream of events and
enrich it with predicted labels.
Distributed training
data warehouse load fit model
Required: Be able to read from
Databricks Delta, Parquet,
MySQL, Hive, etc.
Answer: Apache Spark
Required: distributed GPU cluster
for fast training
Answer: Horovod, Distributed
Tensorflow, etc
Databricks
Delta
Two separate data and AI clusters?
load using a Spark
cluster
fit on a GPU
cluster
model
save data
required: glue code
Streaming model inference
Kafka load predict model
required:
● save to stream sink
● GPU for fast inference
A hybrid Spark and AI cluster
load using a Spark
cluster w/ GPUs
fit a model
distributedly
on the same
cluster
model
load using a Spark
cluster w/ GPUs
predict w/ GPUs as
a Spark task
model
Unfortunately, it doesn’t work out of the box.
(Demo that a distributed DL job needs to run as a Spark job to
avoid crashing into other Spark workloads on the same cluster.)
18
Different execution models
Task 1
Task 2
Task 3
Spark
Tasks are independent of each other
Embarrassingly parallel & massively scalable
Distributed Training
Complete coordination among tasks
Optimized for communication
19
Different execution models
Task 1
Task 2
Task 3
Spark
Tasks are independent of each other
Embarrassingly parallel & massively scalable
If one crashes…
Distributed Training
Complete coordination among tasks
Optimized for communication
20
Different execution models
Task 1
Task 2
Task 3
Spark
Tasks are independent of each other
Embarrassingly parallel & massively scalable
If one crashes, rerun that one
Distributed Training
Complete coordination among tasks
Optimized for communication
If one crashes, must rerun all tasks
21
Project Hydrogen: Spark + AI
Barrier
Execution
Mode
Optimized
Data
Exchange
Accelerator
Aware
Scheduling
22
Barrier execution mode
We introduce gang scheduling to Spark on top of MapReduce
execution model. So a distributed DL job can run as a Spark job.
● It starts all tasks together.
● It provides sufficient info and tooling to run a hybrid distributed job.
● It cancels and restarts all tasks in case of failures.
Umbrella JIRA: SPARK-24374 (ETA: Spark 2.4, 3.0)
23
RDD.barrier()
RDD.barrier() tells Spark to launch the tasks together.
rdd.barrier().mapPartitions { iter =>
val context = BarrierTaskContext.get()
...
}
24
context.barrier()
context.barrier() places a global barrier and waits until all tasks in
this stage hit this barrier.
val context = BarrierTaskContext.get()
… // write partition data out
context.barrier()
25
context.getTaskInfos()
context.getTaskInfos() returns info about all tasks in this stage.
if (context.partitionId == 0) {
val hosts = context.getTaskInfos().map(_.host)
... // start a hybrid training job, e.g., via MPI
}
context.barrier() // wait until training finishes
26
Cluster manager support
YARN
SPARK-24723
Kubernetes
SPARK-24724
Mesos
SPARK-24725
In Spark standalone mode, users need passwordless SSH among
workers to run a hybrid MPI job. The community is working on the
support of other cluster managers.
27
Demo
(Demo barrier mode prototype, where a distributed DL job runs
as a Spark job and does not crash into other Spark workloads.)
28
Optimized data exchange (SPIP)
None of the integrations are possible without exchanging data
between Spark and AI frameworks. And performance matters. We
proposed using a standard interface for data exchange to simplify
the integrations without introducing much overhead.
SPIP JIRA: SPARK-24579 (pending vote, ETA 3.0)
29
Sources × destinations
Parquet
Avro
SequenceFile
MySQL
ORC
...
TFRecords
CSV
LIBSVM
HDF5
NetCDF
...
×
30
Sources + 2 + destinations
Parquet
Avro
SequenceFile
MySQL
ORC
...
TFRecords
CSV
LIBSVM
HDF5
NetCDF
...
+
Apache
Spark
+
Apache
Arrow
+
31
From source to destination
Parquet
file
Spark tf.Dataset
32
From source to destination: a long path
Parquet
file
Spark
Tungsten
format in
JVM
Arrow
buffer in
JVM
native
Parquet
reader
Arrow
buffer in
Python
data in
GPU
Pandastf.Dataset
33
Shortcuts
Parquet
file
Spark
Tungsten
format in
JVM
Arrow
buffer in
JVM
native
Parquet
reader
Arrow
buffer in
Python
data in
GPU
Pandastf.Dataset
34
Vectorized computation
Though Spark uses a columnar storage format, the public user
access interface is row-based.
df.toArrowRDD.map { batches =>
... // vectorized computation
}
35
Generalize Pandas UDF to Scala/Java
Pandas UDF was introduced in Spark 2.3, which uses Arrow for
data exchange and utilizes Pandas for vectorized computation.
36
Data pipelining
CPU GPU
t1 fetch batch #1
t2 fetch batch #2 process batch #1
t3 fetch batch #3 process batch #2
t4 process batch #3
CPU GPU
t1 fetch batch #1
t2 process batch #1
t3 fetch batch #2
t4 process batch #2
t5 fetch batch #3
t6 process batch #3
(pipelining)
37
Accelerator-aware scheduling (SPIP)
To utilize accelerators (GPUs, FPGAs) in a heterogeneous cluster
or to utilize multiple accelerators in a multi-task node, Spark
needs to understand the accelerators installed on each node.
SPIP JIRA: SPARK-24615 (pending vote, ETA: 3.0)
38
Request accelerators
With accelerator awareness, users can specify accelerators
constraints or hints (API pending discussion):
rdd.accelerated
.by(“/gpu/p100”)
.numPerTask(2)
.required // or .optional
39
Multiple tasks on the same node
When multiples tasks are scheduled on the same node with
multiple GPUs, each task knows which GPUs are assigned to avoid
crashing into each other (API pending discussion):
// inside a task closure
val gpus = context.getAcceleratorInfos()
40
Cluster manager support
In Spark standalone mode, we can list accelerators in worker conf
and aggregate the information for scheduling. We also need to
provide support for other cluster managers:
YARN Kubernetes Mesos
41
Timeline
Spark 2.4
● barrier execution mode (basic scenarios)
● (Databricks) horovod integration w/ barrier mode
Spark 3.0
● barrier execution mode
● optimized data exchange
● accelerator-aware scheduling
42
Acknowledgement
● Many ideas in Project Hydrogen are based on previous
community work: TensorFrames, BigDL, Apache Arrow,
Pandas UDF, Spark GPU support, MPI, etc.
● We would like to thank many Spark committers and
contributors who helped the project proposal, design, and
implementation.
Thank you!

More Related Content

What's hot (20)

Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Spark SQL
Spark SQLSpark SQL
Spark SQL
Joud Khattab
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
Julian Hyde
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Flink Forward
 
Better than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouseBetter than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouse
Altinity Ltd
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
Knoldus Inc.
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Oracle AHF Insights 23c - Deeper Diagnostic Insights for your Oracle Database...
Oracle AHF Insights 23c - Deeper Diagnostic Insights for your Oracle Database...Oracle AHF Insights 23c - Deeper Diagnostic Insights for your Oracle Database...
Oracle AHF Insights 23c - Deeper Diagnostic Insights for your Oracle Database...
Sandesh Rao
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
Knoldus Inc.
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 
Common Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta LakehouseCommon Strategies for Improving Performance on Your Delta Lakehouse
Common Strategies for Improving Performance on Your Delta Lakehouse
Databricks
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
CloudxLab
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Databricks
 
Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013Parquet Strata/Hadoop World, New York 2013
Parquet Strata/Hadoop World, New York 2013
Julien Le Dem
 
Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)Apache Calcite (a tutorial given at BOSS '21)
Apache Calcite (a tutorial given at BOSS '21)
Julian Hyde
 
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, UberDemystifying flink memory allocation and tuning - Roshan Naik, Uber
Demystifying flink memory allocation and tuning - Roshan Naik, Uber
Flink Forward
 
Better than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouseBetter than you think: Handling JSON data in ClickHouse
Better than you think: Handling JSON data in ClickHouse
Altinity Ltd
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
Aakashdata
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
Knoldus Inc.
 
Apache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & InternalsApache Spark in Depth: Core Concepts, Architecture & Internals
Apache Spark in Depth: Core Concepts, Architecture & Internals
Anton Kirillov
 
Apache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & librariesApache spark - Architecture , Overview & libraries
Apache spark - Architecture , Overview & libraries
Walaa Hamdy Assy
 
Oracle AHF Insights 23c - Deeper Diagnostic Insights for your Oracle Database...
Oracle AHF Insights 23c - Deeper Diagnostic Insights for your Oracle Database...Oracle AHF Insights 23c - Deeper Diagnostic Insights for your Oracle Database...
Oracle AHF Insights 23c - Deeper Diagnostic Insights for your Oracle Database...
Sandesh Rao
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
Databricks
 
Memory Management in Apache Spark
Memory Management in Apache SparkMemory Management in Apache Spark
Memory Management in Apache Spark
Databricks
 
Apache Spark overview
Apache Spark overviewApache Spark overview
Apache Spark overview
DataArt
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
Databricks
 
Apache Spark Internals
Apache Spark InternalsApache Spark Internals
Apache Spark Internals
Knoldus Inc.
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
Zahra Eskandari
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Databricks
 

Similar to Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark (20)

Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Databricks
 
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKBig Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Matt Stubbs
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
Animesh Chaturvedi
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
Martin Zapletal
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
Linaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Anyscale
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Databricks
 
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark w...
Databricks
 
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARKBig Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Big Data LDN 2018: PROJECT HYDROGEN: UNIFYING AI WITH APACHE SPARK
Matt Stubbs
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Apache Spark Introduction.pdf
Apache Spark Introduction.pdfApache Spark Introduction.pdf
Apache Spark Introduction.pdf
MaheshPandit16
 
An introduction To Apache Spark
An introduction To Apache SparkAn introduction To Apache Spark
An introduction To Apache Spark
Amir Sedighi
 
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A...
Databricks
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Paco Nathan
 
Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)Unified Big Data Processing with Apache Spark (QCON 2014)
Unified Big Data Processing with Apache Spark (QCON 2014)
Databricks
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
Animesh Chaturvedi
 
Lightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache SparkLightening Fast Big Data Analytics using Apache Spark
Lightening Fast Big Data Analytics using Apache Spark
Manish Gupta
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
Martin Zapletal
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64Data Analytics and Machine Learning: From Node to Cluster on ARM64
Data Analytics and Machine Learning: From Node to Cluster on ARM64
Ganesh Raju
 
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to ClusterBKK16-404B Data Analytics and Machine Learning- from Node to Cluster
BKK16-404B Data Analytics and Machine Learning- from Node to Cluster
Linaro
 
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to ClusterBKK16-408B Data Analytics and Machine Learning From Node to Cluster
BKK16-408B Data Analytics and Machine Learning From Node to Cluster
Linaro
 
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Project Hydrogen, HorovodRunner, and Pandas UDF: Distributed Deep Learning Tr...
Anyscale
 
Big data distributed processing: Spark introduction
Big data distributed processing: Spark introductionBig data distributed processing: Spark introduction
Big data distributed processing: Spark introduction
Hektor Jacynycz García
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
Benjamin Bengfort
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
inoshg
 
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache SparkRunning Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Running Emerging AI Applications on Big Data Platforms with Ray On Apache Spark
Databricks
 
Ad

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 
Ad

Recently uploaded (20)

Optimising Claims Management with Claims Processing Systems
Optimising Claims Management with Claims Processing SystemsOptimising Claims Management with Claims Processing Systems
Optimising Claims Management with Claims Processing Systems
Insurance Tech Services
 
Autoposting.ai Sales Deck - Skyrocket your LinkedIn's ROI
Autoposting.ai Sales Deck - Skyrocket your LinkedIn's ROIAutoposting.ai Sales Deck - Skyrocket your LinkedIn's ROI
Autoposting.ai Sales Deck - Skyrocket your LinkedIn's ROI
Udit Goenka
 
Custom Software Development: Types, Applications and Benefits.pdf
Custom Software Development: Types, Applications and Benefits.pdfCustom Software Development: Types, Applications and Benefits.pdf
Custom Software Development: Types, Applications and Benefits.pdf
Digital Aptech
 
Scalefusion Remote Access for Apple Devices
Scalefusion Remote Access for Apple DevicesScalefusion Remote Access for Apple Devices
Scalefusion Remote Access for Apple Devices
Scalefusion
 
zOS CommServer support for the Network Express feature on z17
zOS CommServer support for the Network Express feature on z17zOS CommServer support for the Network Express feature on z17
zOS CommServer support for the Network Express feature on z17
zOSCommserver
 
How to purchase, license and subscribe to Microsoft Azure_PDF.pdf
How to purchase, license and subscribe to Microsoft Azure_PDF.pdfHow to purchase, license and subscribe to Microsoft Azure_PDF.pdf
How to purchase, license and subscribe to Microsoft Azure_PDF.pdf
victordsane
 
War Story: Removing Offensive Language from Percona Toolkit
War Story: Removing Offensive Language from Percona ToolkitWar Story: Removing Offensive Language from Percona Toolkit
War Story: Removing Offensive Language from Percona Toolkit
Sveta Smirnova
 
How John started to like TDD (instead of hating it) (ViennaJUG, June'25)
How John started to like TDD (instead of hating it) (ViennaJUG, June'25)How John started to like TDD (instead of hating it) (ViennaJUG, June'25)
How John started to like TDD (instead of hating it) (ViennaJUG, June'25)
Nacho Cougil
 
ICDL FULL STANDARD 2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
ICDL FULL STANDARD  2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdfICDL FULL STANDARD  2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
ICDL FULL STANDARD 2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
M. Luisetto Pharm.D.Spec. Pharmacology
 
SQL-COMMANDS instructionsssssssssss.pptx
SQL-COMMANDS instructionsssssssssss.pptxSQL-COMMANDS instructionsssssssssss.pptx
SQL-COMMANDS instructionsssssssssss.pptx
Ashlei5
 
Agentic AI Desgin Principles in five slides.pptx
Agentic AI Desgin Principles in five slides.pptxAgentic AI Desgin Principles in five slides.pptx
Agentic AI Desgin Principles in five slides.pptx
MOSIUOA WESI
 
Content Mate Web App Triples Content Managers‘ Productivity
Content Mate Web App Triples Content Managers‘ ProductivityContent Mate Web App Triples Content Managers‘ Productivity
Content Mate Web App Triples Content Managers‘ Productivity
Alex Vladimirovich
 
Techdebt handling with cleancode focus and as risk taker
Techdebt handling with cleancode focus and as risk takerTechdebt handling with cleancode focus and as risk taker
Techdebt handling with cleancode focus and as risk taker
RajaNagendraKumar
 
Software Risk and Quality management.pptx
Software Risk and Quality management.pptxSoftware Risk and Quality management.pptx
Software Risk and Quality management.pptx
HassanBangash9
 
The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...
Prachi Desai
 
AI-ASSISTED METAMORPHIC TESTING FOR DOMAIN-SPECIFIC MODELLING AND SIMULATION
AI-ASSISTED METAMORPHIC TESTING FOR DOMAIN-SPECIFIC MODELLING AND SIMULATIONAI-ASSISTED METAMORPHIC TESTING FOR DOMAIN-SPECIFIC MODELLING AND SIMULATION
AI-ASSISTED METAMORPHIC TESTING FOR DOMAIN-SPECIFIC MODELLING AND SIMULATION
miso_uam
 
How a Staff Augmentation Company IN USA Powers Flutter App Breakthroughs.pdf
How a Staff Augmentation Company IN USA Powers Flutter App Breakthroughs.pdfHow a Staff Augmentation Company IN USA Powers Flutter App Breakthroughs.pdf
How a Staff Augmentation Company IN USA Powers Flutter App Breakthroughs.pdf
mary rojas
 
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdfBoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
Ortus Solutions, Corp
 
Issues in AI Presentation and machine learning.pptx
Issues in AI Presentation and machine learning.pptxIssues in AI Presentation and machine learning.pptx
Issues in AI Presentation and machine learning.pptx
Jalalkhan657136
 
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Prachi Desai
 
Optimising Claims Management with Claims Processing Systems
Optimising Claims Management with Claims Processing SystemsOptimising Claims Management with Claims Processing Systems
Optimising Claims Management with Claims Processing Systems
Insurance Tech Services
 
Autoposting.ai Sales Deck - Skyrocket your LinkedIn's ROI
Autoposting.ai Sales Deck - Skyrocket your LinkedIn's ROIAutoposting.ai Sales Deck - Skyrocket your LinkedIn's ROI
Autoposting.ai Sales Deck - Skyrocket your LinkedIn's ROI
Udit Goenka
 
Custom Software Development: Types, Applications and Benefits.pdf
Custom Software Development: Types, Applications and Benefits.pdfCustom Software Development: Types, Applications and Benefits.pdf
Custom Software Development: Types, Applications and Benefits.pdf
Digital Aptech
 
Scalefusion Remote Access for Apple Devices
Scalefusion Remote Access for Apple DevicesScalefusion Remote Access for Apple Devices
Scalefusion Remote Access for Apple Devices
Scalefusion
 
zOS CommServer support for the Network Express feature on z17
zOS CommServer support for the Network Express feature on z17zOS CommServer support for the Network Express feature on z17
zOS CommServer support for the Network Express feature on z17
zOSCommserver
 
How to purchase, license and subscribe to Microsoft Azure_PDF.pdf
How to purchase, license and subscribe to Microsoft Azure_PDF.pdfHow to purchase, license and subscribe to Microsoft Azure_PDF.pdf
How to purchase, license and subscribe to Microsoft Azure_PDF.pdf
victordsane
 
War Story: Removing Offensive Language from Percona Toolkit
War Story: Removing Offensive Language from Percona ToolkitWar Story: Removing Offensive Language from Percona Toolkit
War Story: Removing Offensive Language from Percona Toolkit
Sveta Smirnova
 
How John started to like TDD (instead of hating it) (ViennaJUG, June'25)
How John started to like TDD (instead of hating it) (ViennaJUG, June'25)How John started to like TDD (instead of hating it) (ViennaJUG, June'25)
How John started to like TDD (instead of hating it) (ViennaJUG, June'25)
Nacho Cougil
 
ICDL FULL STANDARD 2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
ICDL FULL STANDARD  2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdfICDL FULL STANDARD  2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
ICDL FULL STANDARD 2025 Luisetto mauro - Academia domani- 55 HOURS LONG pdf
M. Luisetto Pharm.D.Spec. Pharmacology
 
SQL-COMMANDS instructionsssssssssss.pptx
SQL-COMMANDS instructionsssssssssss.pptxSQL-COMMANDS instructionsssssssssss.pptx
SQL-COMMANDS instructionsssssssssss.pptx
Ashlei5
 
Agentic AI Desgin Principles in five slides.pptx
Agentic AI Desgin Principles in five slides.pptxAgentic AI Desgin Principles in five slides.pptx
Agentic AI Desgin Principles in five slides.pptx
MOSIUOA WESI
 
Content Mate Web App Triples Content Managers‘ Productivity
Content Mate Web App Triples Content Managers‘ ProductivityContent Mate Web App Triples Content Managers‘ Productivity
Content Mate Web App Triples Content Managers‘ Productivity
Alex Vladimirovich
 
Techdebt handling with cleancode focus and as risk taker
Techdebt handling with cleancode focus and as risk takerTechdebt handling with cleancode focus and as risk taker
Techdebt handling with cleancode focus and as risk taker
RajaNagendraKumar
 
Software Risk and Quality management.pptx
Software Risk and Quality management.pptxSoftware Risk and Quality management.pptx
Software Risk and Quality management.pptx
HassanBangash9
 
The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...The rise of e-commerce has redefined how retailers operate—and reconciliation...
The rise of e-commerce has redefined how retailers operate—and reconciliation...
Prachi Desai
 
AI-ASSISTED METAMORPHIC TESTING FOR DOMAIN-SPECIFIC MODELLING AND SIMULATION
AI-ASSISTED METAMORPHIC TESTING FOR DOMAIN-SPECIFIC MODELLING AND SIMULATIONAI-ASSISTED METAMORPHIC TESTING FOR DOMAIN-SPECIFIC MODELLING AND SIMULATION
AI-ASSISTED METAMORPHIC TESTING FOR DOMAIN-SPECIFIC MODELLING AND SIMULATION
miso_uam
 
How a Staff Augmentation Company IN USA Powers Flutter App Breakthroughs.pdf
How a Staff Augmentation Company IN USA Powers Flutter App Breakthroughs.pdfHow a Staff Augmentation Company IN USA Powers Flutter App Breakthroughs.pdf
How a Staff Augmentation Company IN USA Powers Flutter App Breakthroughs.pdf
mary rojas
 
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdfBoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
BoxLang-Dynamic-AWS-Lambda by Luis Majano.pdf
Ortus Solutions, Corp
 
Issues in AI Presentation and machine learning.pptx
Issues in AI Presentation and machine learning.pptxIssues in AI Presentation and machine learning.pptx
Issues in AI Presentation and machine learning.pptx
Jalalkhan657136
 
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Why Indonesia’s $12.63B Alt-Lending Boom Needs Loan Servicing Automation & Re...
Prachi Desai
 

Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark

  • 1. Project Hydrogen: Unifying State-of-the-art Big Data and AI in Apache Spark Xiangrui Meng 2018/07
  • 2. Spark: unified analytics engine for big data “Apache Spark is the Taylor Swift of big data software.” - Fortune 500,000 meetup members 4,000 summit attendees
  • 3. AI: re-shaping the world “Artificial Intelligence is the New Electricity.”  -  Andrew Ng Transportation Healthcare and Genomics Internet of Things Fraud Prevention Personalization
  • 4. Get started with Spark + AI? Try Databricks Community Edition: http://databricks.com/try
  • 5. Two communities: big data and AI Significant progress has been made by both big data and AI communities in recent years to push the cutting edge: more complex big data scenarios more complex deep learning scenarios
  • 6. The cross? 6 Map/Reduce CaffeOnSpark TensorFlowOnSpark DataFrame-based APIs 50+ Data Sources Python/Java/R interfaces Structured Streaming ML Pipelines API Continuous Processing RDD Project Tungsten Pandas UDF TensorFrames scikit-learn pandas/numpy/scipy LIBLINEAR R glmnet xgboost GraphLab Caffe/PyTorch/MXNet TensorFlow Keras Distributed TensorFlow Horovod tf.data tf.transform AI/ML ?? TF XLA
  • 7. Neither big data nor AI is easy “Hidden Technical Debt in Machine Learning Systems,” Google NIPS 2015 Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small green box in the middle. The required surrounding infrastructure is vast and complex. ML Code Configuration Data Collection Data Verification Feature Extraction Machine Resource Management Analysis Tools Process Management Tools Serving Infrastructure Monitoring
  • 8. Big data for AI A common goal of collecting, processing, and managing big data is to extract value from it. AI/ML is the most mentioned approach in our conversations with customers. improved customer engagement and conversions by improving image classification (case study) increased customer satisfaction, retention, and lifetime value by detecting abusive language in real-time (case study)
  • 9. Big data for AI There are many efforts from the Spark community to integrate Spark with AI/ML frameworks: ● (Yahoo) CaffeOnSpark, TensorFlowOnSpark ● (Intel) BigDL ● (John Snow Labs) Spark-NLP ● (Databricks) spark-sklearn, tensorframes, spark-deep-learning ● … 80+ ML/AI packages on spark-packages.org
  • 10. AI needs big data One cannot apply AI techniques without data. And DL models scale with amount of data. We have seen efforts from the AI community to handle different data scenarios: ● tf.data, tf.Transform ● spark-tensorflow-connector ● ... source: Andrew Ng
  • 11. When AI goes distributed ... When datasets get bigger and bigger, we see more and more distributed training scenarios and open-source offerings, e.g., distributed TensorFlow, Horovod, and distributed MXNet. This is where we see Spark and AI efforts overlap more.
  • 12. The status quo: two simple stories As a data scientist, I can: ● build a pipeline that fetches training events from a production data warehouse and trains a DL model in parallel; ● apply a trained DL model to a distributed stream of events and enrich it with predicted labels.
  • 13. Distributed training data warehouse load fit model Required: Be able to read from Databricks Delta, Parquet, MySQL, Hive, etc. Answer: Apache Spark Required: distributed GPU cluster for fast training Answer: Horovod, Distributed Tensorflow, etc Databricks Delta
  • 14. Two separate data and AI clusters? load using a Spark cluster fit on a GPU cluster model save data required: glue code
  • 15. Streaming model inference Kafka load predict model required: ● save to stream sink ● GPU for fast inference
  • 16. A hybrid Spark and AI cluster load using a Spark cluster w/ GPUs fit a model distributedly on the same cluster model load using a Spark cluster w/ GPUs predict w/ GPUs as a Spark task model
  • 17. Unfortunately, it doesn’t work out of the box. (Demo that a distributed DL job needs to run as a Spark job to avoid crashing into other Spark workloads on the same cluster.)
  • 18. 18 Different execution models Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable Distributed Training Complete coordination among tasks Optimized for communication
  • 19. 19 Different execution models Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable If one crashes… Distributed Training Complete coordination among tasks Optimized for communication
  • 20. 20 Different execution models Task 1 Task 2 Task 3 Spark Tasks are independent of each other Embarrassingly parallel & massively scalable If one crashes, rerun that one Distributed Training Complete coordination among tasks Optimized for communication If one crashes, must rerun all tasks
  • 21. 21 Project Hydrogen: Spark + AI Barrier Execution Mode Optimized Data Exchange Accelerator Aware Scheduling
  • 22. 22 Barrier execution mode We introduce gang scheduling to Spark on top of MapReduce execution model. So a distributed DL job can run as a Spark job. ● It starts all tasks together. ● It provides sufficient info and tooling to run a hybrid distributed job. ● It cancels and restarts all tasks in case of failures. Umbrella JIRA: SPARK-24374 (ETA: Spark 2.4, 3.0)
  • 23. 23 RDD.barrier() RDD.barrier() tells Spark to launch the tasks together. rdd.barrier().mapPartitions { iter => val context = BarrierTaskContext.get() ... }
  • 24. 24 context.barrier() context.barrier() places a global barrier and waits until all tasks in this stage hit this barrier. val context = BarrierTaskContext.get() … // write partition data out context.barrier()
  • 25. 25 context.getTaskInfos() context.getTaskInfos() returns info about all tasks in this stage. if (context.partitionId == 0) { val hosts = context.getTaskInfos().map(_.host) ... // start a hybrid training job, e.g., via MPI } context.barrier() // wait until training finishes
  • 26. 26 Cluster manager support YARN SPARK-24723 Kubernetes SPARK-24724 Mesos SPARK-24725 In Spark standalone mode, users need passwordless SSH among workers to run a hybrid MPI job. The community is working on the support of other cluster managers.
  • 27. 27 Demo (Demo barrier mode prototype, where a distributed DL job runs as a Spark job and does not crash into other Spark workloads.)
  • 28. 28 Optimized data exchange (SPIP) None of the integrations are possible without exchanging data between Spark and AI frameworks. And performance matters. We proposed using a standard interface for data exchange to simplify the integrations without introducing much overhead. SPIP JIRA: SPARK-24579 (pending vote, ETA 3.0)
  • 30. 30 Sources + 2 + destinations Parquet Avro SequenceFile MySQL ORC ... TFRecords CSV LIBSVM HDF5 NetCDF ... + Apache Spark + Apache Arrow +
  • 31. 31 From source to destination Parquet file Spark tf.Dataset
  • 32. 32 From source to destination: a long path Parquet file Spark Tungsten format in JVM Arrow buffer in JVM native Parquet reader Arrow buffer in Python data in GPU Pandastf.Dataset
  • 34. 34 Vectorized computation Though Spark uses a columnar storage format, the public user access interface is row-based. df.toArrowRDD.map { batches => ... // vectorized computation }
  • 35. 35 Generalize Pandas UDF to Scala/Java Pandas UDF was introduced in Spark 2.3, which uses Arrow for data exchange and utilizes Pandas for vectorized computation.
  • 36. 36 Data pipelining CPU GPU t1 fetch batch #1 t2 fetch batch #2 process batch #1 t3 fetch batch #3 process batch #2 t4 process batch #3 CPU GPU t1 fetch batch #1 t2 process batch #1 t3 fetch batch #2 t4 process batch #2 t5 fetch batch #3 t6 process batch #3 (pipelining)
  • 37. 37 Accelerator-aware scheduling (SPIP) To utilize accelerators (GPUs, FPGAs) in a heterogeneous cluster or to utilize multiple accelerators in a multi-task node, Spark needs to understand the accelerators installed on each node. SPIP JIRA: SPARK-24615 (pending vote, ETA: 3.0)
  • 38. 38 Request accelerators With accelerator awareness, users can specify accelerators constraints or hints (API pending discussion): rdd.accelerated .by(“/gpu/p100”) .numPerTask(2) .required // or .optional
  • 39. 39 Multiple tasks on the same node When multiples tasks are scheduled on the same node with multiple GPUs, each task knows which GPUs are assigned to avoid crashing into each other (API pending discussion): // inside a task closure val gpus = context.getAcceleratorInfos()
  • 40. 40 Cluster manager support In Spark standalone mode, we can list accelerators in worker conf and aggregate the information for scheduling. We also need to provide support for other cluster managers: YARN Kubernetes Mesos
  • 41. 41 Timeline Spark 2.4 ● barrier execution mode (basic scenarios) ● (Databricks) horovod integration w/ barrier mode Spark 3.0 ● barrier execution mode ● optimized data exchange ● accelerator-aware scheduling
  • 42. 42 Acknowledgement ● Many ideas in Project Hydrogen are based on previous community work: TensorFrames, BigDL, Apache Arrow, Pandas UDF, Spark GPU support, MPI, etc. ● We would like to thank many Spark committers and contributors who helped the project proposal, design, and implementation.