Foundations for Scaling ML in Apache Spark
Joseph K. Bradley
August 14, 2016
Who am I?
Apache Spark committer & PMC member
Software Engineer @ Databricks
Machine Learning Department @ Carnegie Mellon
Apache Spark
•  General engine for big data computing
•  Fast
•  Easy to use
•  APIs in Python, Scala, Java & R
Libraries: Spark SQL, Streaming, MLlib, GraphX
Largest cluster: 8000 nodes (Tencent)
Open source
•  Apache Software Foundation
•  1000+ contributors
•  200+ companies & universities
NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO
Source: Slide 5 of Spark Community Update
MLlib: Spark’s ML library
[Chart: commits per release, v0.8 through v2.0; y-axis from 0 to 1000]
Learning tasks
Classification
Regression
Recommendation
Clustering
Frequent itemsets
Data utilities
Featurization
Statistics
Linear algebra
Workflow utilities
Model import/export
Pipelines
DataFrames
Cross validation
Goals
Scale-out ML
Standard library
Extensible API
MLlib: original design
RDDs
Challenges for scalability
Resilient Distributed Datasets (RDDs)
[Diagram labels: Map, Reduce, master]
Resiliency
•  Lineage
•  Caching & checkpointing
ML on RDDs: the good
Flexible: GLMs, trees, matrix factorization, etc.
Scalable: E.g., Alternating Least Squares on Spotify data
•  50+ million users x 30+ million songs
•  50 billion ratings
Cost ~ $10
•  32 r3.8xlarge nodes (spot instances)
•  For rank 10 with 10 iterations, ~1 hour running time.
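The figures on this slide imply a few derived quantities that are easy to check. The sketch below only does arithmetic on the numbers quoted above. It assumes that "50+ million" and "30+ million" are exactly 50 and 30 million and that the "~1 hour" run took one hour, so the results are rough estimates rather than reported measurements.

```python
# Back-of-the-envelope arithmetic using only the numbers quoted on the slide.
# Assumption: the "+" and "~" figures are rounded to their stated values.
users = 50_000_000
songs = 30_000_000
ratings = 50_000_000_000
nodes = 32
hours = 1.0
total_cost_usd = 10.0

cells = users * songs                        # size of the full user x song matrix
density = ratings / cells                    # fraction of cells that hold a rating
node_hours = nodes * hours
implied_price = total_cost_usd / node_hours  # implied spot price per node-hour

print(f"matrix cells:        {cells:.2e}")
print(f"density:             {density:.2e}")
print(f"node-hours:          {node_hours:.0f}")
print(f"implied $/node-hour: {implied_price:.2f}")
```

Under these assumptions the ratings matrix is very sparse (about 3.3e-5 of its cells are filled), and the quoted cost corresponds to roughly $0.31 per node-hour.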
ML on RDDs: the challenges
Partitioning
•  Data partitioning impacts performance.
•  E.g., for Alternating Least Squares
Lineage
•  Iterative algorithms → long RDD lineage
•  Solvable via careful caching and checkpointing
JVM
•  Garbage collection (GC)
•  Boxed types
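The lineage point can be illustrated with a toy model. The sketch below is not Spark code: `ToyDataset`, `checkpoint` and `lineage_depth` are hypothetical names, and the toy computes values eagerly, which Spark does not. It only shows why an iterative algorithm accumulates a long derivation chain, and how periodic checkpointing bounds that chain.

```python
# Toy model of lineage growth in an iterative algorithm. This is NOT Spark's
# implementation: ToyDataset and its methods are hypothetical names.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyDataset:
    values: list
    parent: Optional["ToyDataset"] = None  # the dataset this one was derived from

    def map(self, fn) -> "ToyDataset":
        # Each transformation records its parent, extending the lineage chain.
        return ToyDataset([fn(v) for v in self.values], parent=self)

    def checkpoint(self) -> "ToyDataset":
        # Keep the data but forget how it was derived, truncating the lineage.
        return ToyDataset(list(self.values), parent=None)

    def lineage_depth(self) -> int:
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth


def iterate(data: ToyDataset, steps: int, checkpoint_every: Optional[int]) -> ToyDataset:
    for i in range(1, steps + 1):
        data = data.map(lambda v: v * 0.5 + 1.0)
        if checkpoint_every and i % checkpoint_every == 0:
            data = data.checkpoint()
    return data


start = ToyDataset([0.0, 4.0])
no_ckpt = iterate(start, steps=105, checkpoint_every=None)
with_ckpt = iterate(start, steps=105, checkpoint_every=10)
print(no_ckpt.lineage_depth(), with_ckpt.lineage_depth())  # 105 5
```

Both runs produce the same values; only the length of the retained derivation chain differs.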
MLlib: current status
DataFrame & Dataset integration
Pipelines API
Spark DataFrames & Datasets
dept   age   name
Bio    48    H Smith
CS     34    A Turing
Bio    43    B Jones
Chem   61    M Kennedy
Data grouped into named columns
DSL for common tasks
•  Project, filter, aggregate, join, …
•  100+ functions available
•  User-Defined Functions (UDFs)
data.groupBy("dept").avg("age")
Datasets: Strongly typed DataFrames
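To make the DSL expression concrete, here is a plain-Python sketch of what `data.groupBy("dept").avg("age")` computes on the example table from this slide. It deliberately uses no Spark APIs, so it shows the semantics of the query, not how Spark executes it.

```python
# Plain-Python illustration of what data.groupBy("dept").avg("age") computes.
# Uses no Spark APIs; the rows are the example table from the slide.
from collections import defaultdict

rows = [
    {"dept": "Bio", "age": 48, "name": "H Smith"},
    {"dept": "CS", "age": 34, "name": "A Turing"},
    {"dept": "Bio", "age": 43, "name": "B Jones"},
    {"dept": "Chem", "age": 61, "name": "M Kennedy"},
]

# Accumulate (sum, count) per department, then divide to get the mean.
totals = defaultdict(lambda: [0, 0])
for row in rows:
    acc = totals[row["dept"]]
    acc[0] += row["age"]
    acc[1] += 1

avg_age = {dept: s / n for dept, (s, n) in totals.items()}
print(avg_age)  # {'Bio': 45.5, 'CS': 34.0, 'Chem': 61.0}
```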
DataFrame optimizations
Catalyst query optimizer
•  Predicate pushdown
•  Join selection
•  …
Project Tungsten
•  Memory management: off-heap, avoids JVM GC, compressed format
•  Code generation: combines operations into single, efficient code blocks
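The idea of combining operations into a single code block can be sketched in plain Python. The example below is a hypothetical illustration and does not reflect the code Tungsten actually generates: it contrasts a chain of small operators, each pulling rows from its child, with one hand-fused loop that does the same work.

```python
# Toy contrast between chained operators and a single fused loop.
# Hypothetical illustration only; it does not reflect Tungsten's generated code.
from typing import Callable, Iterable, Iterator


def scan(rows: Iterable[dict]) -> Iterator[dict]:
    for row in rows:
        yield row


def filter_op(child: Iterator[dict], pred: Callable[[dict], bool]) -> Iterator[dict]:
    for row in child:
        if pred(row):
            yield row


def project_op(child: Iterator[dict], fn: Callable[[dict], float]) -> Iterator[float]:
    for row in child:
        yield fn(row)


def chained_sum(rows: list) -> float:
    # Each operator pulls rows one at a time from its child.
    filtered = filter_op(scan(rows), lambda r: r["age"] > 40)
    projected = project_op(filtered, lambda r: r["age"] * 1.0)
    return sum(projected)


def fused_sum(rows: list) -> float:
    # The same scan, filter, projection and aggregate collapsed into one loop.
    total = 0.0
    for row in rows:
        age = row["age"]
        if age > 40:
            total += age * 1.0
    return total


rows = [{"age": 48}, {"age": 34}, {"age": 43}, {"age": 61}]
print(chained_sum(rows), fused_sum(rows))  # 152.0 152.0
```

The two functions return the same result; the fused version avoids the per-row hand-off between operators.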
ML Pipelines
•  DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
Load data
[Pipeline diagram: Original dataset → Feature extraction → Predictive model → Evaluation]
Text                      Label
I bought the game...      4
Do NOT bother try...      1
this shirt is aweso...    5
never got it. Seller...   1
I ordered this to...      3
Extract features
Text                      Label  Words               Features
I bought the game...      4      "i", "bought",...   [1, 0, 3, 9, ...]
Do NOT bother try...      1      "do", "not",...     [0, 0, 11, 0, ...]
this shirt is aweso...    5      "this", "shirt"     [0, 2, 3, 1, ...]
never got it. Seller...   1      "never", "got"      [1, 2, 0, 0, ...]
I ordered this to...      3      "i", "ordered"      [1, 0, 0, 3, ...]
Fit a model
Text                      Label  Words               Features             Prediction  Probability
I bought the game...      4      "i", "bought",...   [1, 0, 3, 9, ...]    4           0.8
Do NOT bother try...      1      "do", "not",...     [0, 0, 11, 0, ...]   2           0.6
this shirt is aweso...    5      "this", "shirt"     [0, 2, 3, 1, ...]    5           0.9
never got it. Seller...   1      "never", "got"      [1, 2, 0, 0, ...]    1           0.7
I ordered this to...      3      "i", "ordered"      [1, 0, 0, 3, ...]    4           0.7
Evaluate
Text                      Label  Words               Features             Prediction  Probability
I bought the game...      4      "i", "bought",...   [1, 0, 3, 9, ...]    4           0.8
Do NOT bother try...      1      "do", "not",...     [0, 0, 11, 0, ...]   2           0.6
this shirt is aweso...    5      "this", "shirt"     [0, 2, 3, 1, ...]    5           0.9
never got it. Seller...   1      "never", "got"      [1, 2, 0, 0, ...]    1           0.7
I ordered this to...      3      "i", "ordered"      [1, 0, 0, 3, ...]    4           0.7
ML Pipelines
DataFrames: unified ML dataset API
•  Flexible types
•  Add & remove columns during Pipeline execution
•  Materialize columns lazily
•  Inspect intermediate results
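The column-adding behaviour described in these bullets can be mimicked with a small plain-Python pipeline. This is a toy sketch under stated assumptions: `Tokenizer`, `TermCounter` and `run_pipeline` are hypothetical names defined here (they are not the spark.ml classes), and the review texts and vocabulary are made up for illustration. Each stage adds one column to every row, and rows are produced lazily by a generator, so intermediate columns remain available for inspection.

```python
# Toy pipeline in plain Python: each stage adds a column to every row.
# Tokenizer, TermCounter and run_pipeline are hypothetical names, not spark.ml
# classes; the review texts and the vocabulary are made up for this sketch.
from collections import Counter


class Tokenizer:
    def __init__(self, input_col, output_col):
        self.input_col, self.output_col = input_col, output_col

    def transform(self, row):
        # Lowercase and split on whitespace; returns a new row with an extra column.
        return {**row, self.output_col: row[self.input_col].lower().split()}


class TermCounter:
    def __init__(self, input_col, output_col, vocabulary):
        self.input_col, self.output_col = input_col, output_col
        self.vocabulary = vocabulary

    def transform(self, row):
        # Count how often each vocabulary word occurs in the token list.
        counts = Counter(row[self.input_col])
        return {**row, self.output_col: [counts[w] for w in self.vocabulary]}


def run_pipeline(stages, rows):
    # A generator: rows are transformed only when the caller asks for them.
    for row in rows:
        for stage in stages:
            row = stage.transform(row)
        yield row


reviews = [
    {"text": "Do NOT bother do not", "label": 1},
    {"text": "this shirt is great", "label": 5},
]
stages = [
    Tokenizer("text", "words"),
    TermCounter("words", "features", vocabulary=["do", "not", "shirt"]),
]
results = list(run_pipeline(stages, reviews))
for out in results:
    print(out["label"], out["words"], out["features"])
```

The input rows are left unchanged; every output row carries the original columns plus `words` and `features`.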
Under the hood: optimizations
Current use of DataFrames
•  API
•  Transformations & predictions
Feature transformation & model prediction are phrased as User-Defined Functions (UDFs)
→ Catalyst query optimizer
→ Tungsten memory management + code generation
MLlib: future scaling
DataFrames for training
Potential benefits
•  Spilling to disk
•  Catalyst
•  Tungsten
Challenges remaining
Implementing ML on DataFrames
[Diagram labels: Map, Reduce, master]
Scalability
DataFrames automatically spill to disk
→ Addresses a classic pain point of RDDs: java.lang.OutOfMemoryError
Goal: Smoothly scale, without custom per-algorithm optimizations
Catalyst in ML
Key idea: automatic query (ML algorithm) optimization
•  DataFrame operations are lazy.
•  Express entire algorithm as DataFrame operations.
•  Let Catalyst reorganize the algorithm, data, etc.
→ Fewer manual optimizations
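The key idea above, lazy operations that an optimizer is free to reorder, can be sketched with a toy query plan. `LazyFrame` and its methods are hypothetical names invented for this example and are not Catalyst's API. The single rewrite rule moves a filter ahead of a column computation when the filter does not read the new column, which reduces how often the column function runs.

```python
# Toy lazy query plan with one rewrite rule (filter pushdown).
# LazyFrame and its methods are hypothetical; this is not Catalyst's API.
class LazyFrame:
    def __init__(self, rows, plan=None):
        self.rows = rows
        self.plan = plan or []          # recorded operations, not yet executed

    def with_column(self, name, fn):
        return LazyFrame(self.rows, self.plan + [("with_column", name, fn)])

    def filter(self, column, pred):
        return LazyFrame(self.rows, self.plan + [("filter", column, pred)])

    def optimized_plan(self):
        # Rule: move a filter in front of a preceding with_column when the
        # filter reads a different column than the one being added.
        plan = list(self.plan)
        changed = True
        while changed:
            changed = False
            for i in range(1, len(plan)):
                prev, cur = plan[i - 1], plan[i]
                if cur[0] == "filter" and prev[0] == "with_column" and cur[1] != prev[1]:
                    plan[i - 1], plan[i] = cur, prev
                    changed = True
        return plan

    def collect(self, optimize=True):
        plan = self.optimized_plan() if optimize else self.plan
        rows, calls = [dict(r) for r in self.rows], 0
        for op, name, fn in plan:
            if op == "filter":
                rows = [r for r in rows if fn(r[name])]
            else:
                calls += len(rows)      # count how often the column function runs
                rows = [{**r, name: fn(r)} for r in rows]
        return rows, calls


data = [{"x": float(i)} for i in range(10)]
query = (LazyFrame(data)
         .with_column("x_squared", lambda r: r["x"] ** 2)
         .filter("x", lambda x: x >= 8))
naive_rows, naive_calls = query.collect(optimize=False)
fast_rows, fast_calls = query.collect(optimize=True)
print(naive_calls, fast_calls)  # 10 2
```

Both executions return the same rows; the optimized plan simply computes the new column for 2 rows instead of 10, without the author of the query having to reorder anything by hand.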
Tungsten in ML
Tungsten: off-heap memory management
•  Avoids JVM GC
   Issue in ML: object creation during each iteration
•  Uses efficient storage formats
   Issue in ML: Array[(Int,Double,Double)]
•  Code generation
   Issue in ML: Volcano iterator model in MR/RDDs
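The storage-format issue, an array of tuples versus packed primitive columns, has a close analogue in Python. The sketch below is only an analogy: JVM object layouts and sizes differ, and all names are illustrative. It compares a list of `(int, float, float)` tuples, where every record is a separate object, with three packed arrays holding the same values.

```python
# Python analogy for boxed rows versus packed columns. The JVM details differ;
# only the layout idea carries over. All names here are illustrative.
import sys
from array import array

n = 10_000

# Row-oriented: one tuple object per record, each holding boxed numbers.
rows = [(i, i * 0.5, i * 0.25) for i in range(n)]

# Column-oriented: three packed arrays of primitive values.
ids = array("q", (r[0] for r in rows))        # signed 64-bit integers
a = array("d", (r[1] for r in rows))          # 64-bit floats
b = array("d", (r[2] for r in rows))

row_objects = len(rows)                       # separate tuple allocations
# Lower bound: counts the list and the tuples, not the numbers they point to.
row_bytes = sys.getsizeof(rows) + sum(sys.getsizeof(r) for r in rows)
col_bytes = sys.getsizeof(ids) + sys.getsizeof(a) + sys.getsizeof(b)

print(f"row layout:    {row_objects} tuple objects, >= {row_bytes} bytes")
print(f"column layout: 3 arrays, {col_bytes} bytes")
```

The columnar layout holds the same data in three allocations instead of thousands, which is the kind of saving that matters when an iterative algorithm rebuilds such structures on every pass.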
Prototyping ML on DataFrames
Currently:
•  Belief propagation
•  Connected components
Current challenges:
•  DataFrame query plans do not have iteration as a top-level concept
•  ML/Graph-specific optimizations for Catalyst query planner
Eventual goal: Port all ML algorithms to run on top of DataFrames
→ speed & scalability
To summarize...
MLlib on RDDs
•  Required custom optimizations
MLlib with a DataFrame-based API
•  Friendly API
•  Improvements for prediction
MLlib on DataFrames
•  Potential for even greater scaling for training
•  Simpler for non-experts to write new algorithms
Get started
Get involved
•  JIRA http://issues.apache.org
•  mailing lists http://spark.apache.org
•  GitHub http://github.com/apache/spark
•  Spark Packages http://spark-packages.org
Learn more
•  New in Apache Spark 2.0 http://databricks.com/blog/2016/06/01
•  MOOCs on edX http://databricks.com/spark/training
Try out Apache Spark 2.0 in Databricks Community Edition
http://databricks.com/ce
Many thanks to the community for contributions & support!
Databricks
Founded by the creators of Apache Spark
Offers hosted service
•  Spark on EC2
•  Notebooks
•  Visualizations
•  Cluster management
•  Scheduled jobs
We’re hiring!
Thank you!
Twitter: @jkbatcmu

Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
