SlideShare a Scribd company logo
Intro to Apache Spark
w/ML & Python
Lightning fast cluster computing with Python
For PyLadies Toronto 2016 :)
Who am I?
● Prefered pronouns are she/her
● I’m a Principal Software Engineer at IBM’s Spark Technology Center
● previously Alpine, Databricks, Google, Foursquare & Amazon
● co-author of Learning Spark & Fast Data processing with Spark
○ co-author of a new book focused on Spark performance coming out this year*
● @holdenkarau
● Slide share http://www.slideshare.net/hkarau
● Linkedin https://www.linkedin.com/in/holdenkarau
● Github https://github.com/holdenk
● Spark Videos http://bit.ly/holdenSparkVideos
Who I think you wonderful humans are?
● Nice* people
● Don’t mind pictures of cats
● Want to learn about using PySpark for distributed computing
● Don’t overly mind a grab-bag of topics
Lori Erickson
What we are going to explore together!
● What is Spark?
● Spark’s primary distributed collection
● Word count
● How PySpark works
● Machine Learning with PySpark
Ryan McGilchrist
Companion notebook funtimes:
● Small companion IJupyter notebook to explore with:
○ http://bit.ly/hkMLExample
● If you want to use it you will access to Apache Spark
○ Install from http://spark.apache.org
○ Or get access to one of the online notebook environments (IBM
BlueMix, DataBricks Cloud, Microsoft Spark HDInsights Cluster
Notebook, etc.)
David DeHetre
Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
Photo from Cocoa Dream
What is Spark?
How do I make MapReduce tasks faster?
What is Spark?
● General purpose distributed system
○ With a really nice API
● Apache project (one of the most active)
● Must faster than Hadoop Map/Reduce
● Has Python APIs
Bernhard Latzko
What is Spark?
● General purpose distributed system
○ With a really nice API
● Apache project (one of the most
active)
● Must faster than Hadoop
Map/Reduce
● Good when too big for a single
machine
● Built on top of two abstractions for
distributed data: RDDs & Datasets
The different pieces of Spark: 2.0+
Apache Spark
SQL &
DataFrames
Streaming
Language
APIs
Scala,
Java,
Python, &
R
Graph
Tools
Spark
ML
bagel &
Graph X
MLLib
Community
Packages
Structured
Streaming
SparkContext: entry to the world
● Can be used to create RDDs from many input sources
○ Native collections, local & remote FS
○ Any Hadoop Data Source
● Also create counters & accumulators
● Automatically created in the shells (called sc)
● Specify master & app name when creating
○ Master can be local[*], spark:// , yarn, etc.
○ app name should be human readable and make sense
● etc.
Petfu
l
RDDs: Spark’s Primary abstraction
RDD (Resilient Distributed Dataset)
● Distributed collection
● Recomputed on node failure
● Distributes data & work across the cluster
● Lazily evaluated (transformations & actions)
Helen Olney
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile(“output”)
Photo By: Will
Keightley
Word count (in python)
lines = sc.textFile(src)
words = lines.flatMap(lambda x: x.split(" "))
word_count =
(words.map(lambda x: (x, 1))
.reduceByKey(lambda x, y: x+y))
word_count.saveAsTextFile("output")
No data is read or
processed until after
this line
This is an “action”
which forces spark to
evaluate the RDD
daniilr
Some common transformations & actions
Transformations (lazy)
● map
● filter
● flatMap
● reduceByKey
● join
● cogroup
Actions (eager)
● count
● reduce
● collect
● take
● saveAsTextFile
● saveAsHadoop
● countByValue
Photo by Steve
Photo by Dan G
A detour into the internals
Photo by Bill Ward
Why lazy evaluation?
● Allows pipelining procedures
○ Less passes over our data, extra happiness
● Can skip materializing intermediate results which are
really really big*
● Figuring out where our code fails becomes a little
trickier
So what happens when we run this code?
Driver
Worker
Worker
Worker
HDFS /
Cassandra/
etc
So what happens when we run this code?
Driver
Worker
Worker
Worker
HDFS /
Cassandra/
etc
function
So what happens when we run this code?
Driver
Worker
Worker
Worker
HDFS /
Cassandra/
etc
read
read
read
So what happens when we run this code?
Driver
Worker
Worker
Worker
HDFS /
Cassandra/
etc
cached
cached
cached
counts
Spark in Scala, how does PySpark work?
● Py4J + pickling + magic
○ This can be kind of slow sometimes
● RDDs are generally RDDs of pickled objects
● Spark SQL (and DataFrames) avoid some of this
So what does that look like?
Driver
py4j
Worker 1
Worker K
pipe
pipe
Why should we consider Spark SQL?
● Performance
○ Smart optimizer
○ More efficient storage
○ Faster serialization
● Simplicity
○ Windowed operations
○ Multi-column & multi-type aggregates
● Integrated into the ML Pipeline API
Rikki's Refuge
How much faster can it be?
Andrew Skudder
Loading with sparkSQL & spark-csv
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.load("resources/adult.data")
Jess Johnson
What about other data formats?
● Built in
○ Parquet
○ JDBC
○ Json (which is amazing!)
○ Orc
○ Hive
● Available as packages
○ csv*
○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc.
○ +34 at
http://spark-packages.org/?q=tags%3A%22Data%20Sources%22
Michael Coghlan
*pre-2.0 package, 2.0+ built in hopefully
Lets explore training a Decision Tree
● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict
Data prep / cleaning
● We need to predict a double (can be 0.0, 1.0, but type
must be double)
● We need to train with a vector of features
Imports:
from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler,
StringIndexer
from pyspark.ml import Pipeline
Huang
Yun
Chung
Data prep / cleaning continued
# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
outputCol="feautres")
# String indexer converts a set of strings into doubles
indexer =
StringIndexer(inputCol="category")
.setOutputCol("category-index")
# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])
Huang
Yun
Chung
So a bit more about that pipeline
● Each of our previous components has “fit” & “transform”
stage
● Constructing the pipeline this way makes it easier to
work with (only need to call one fit & one transform)
● Can re-use the fitted model on future data
model=pipeline.fit(df)
prepared = model.transform(df)
Andrey
What does our pipeline look like so far?
Input Data Assembler
Input Data
+ Vectors StringIndexer
Input Data
+Cat ID
+ Vectors
While not an ML learning
algorithm this still needs to
be fit
This is a regular
transformer - no fitting
required.
Let's train a model on our prepared data:
# Specify model
dt = DecisionTreeClassifier(labelCol = "category-index",
featuresCol="features")
# Fit it
dt_model = dt.fit(prepared)
# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer,
dt])
pipeline_model = pipeline_and_model.fit(df)
And predict the results on the same data:
pipeline_model.transform(df).select("prediction",
"category-index").take(20)
Cross-validation
because saving a test set is effort
● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
○ (not in Python yet so skipping for now)
Jonathan Kotta
Pipeline API has many models:
● org.apache.spark.ml.classification
○ BinaryLogisticRegressionClassification, DecissionTreeClassification,
GBTClassifier, etc.
● org.apache.spark.ml.regression
○ DecissionTreeRegression, GBTRegressor, IsotonicRegression,
LinearRegression, etc.
● org.apache.spark.ml.recommendation
○ ALS
PROcarterse Follow
And the next book…..
First five chapters are available in “Early Release”*:
● Buy from O’Reilly - http://bit.ly/highPerfSpark
Get notified when updated & finished:
● http://www.highperformancespark.com
● https://twitter.com/highperfspark
* Early Release means extra mistakes, but also a chance to help us make a more awesome
book.
And some upcoming talks:
● September
○ This workshop (yay!)
○ New York City Strata Conf (Structured Streaming & Machine Learning)
● October
○ PyData DC - Making Spark go fast in Python (vroom vroom)
○ Salt Lake City Spark Meetup - TBD
○ London - OSCON - Getting Started Contributing to Spark
● December
○ Strata Singapore (Introduction to Datasets)
k thnx bye!
If you care about Spark testing and
don’t hate surveys:
http://bit.ly/holdenTestingSpark
Will tweet results
“eventually” @holdenkarau
Any PySpark Users: Have some
simple UDFs you wish ran faster
you are willing to share?:
http://bit.ly/pySparkUDF
Pssst: Have feedback on the presentation? Give me a
shout (holden@pigscanfly.ca) if you feel comfortable doing
so :)

More Related Content

What's hot (20)

Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
sparktc
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Holden Karau
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
Holden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
Holden Karau
 
Testing and validating distributed systems with Apache Spark and Apache Beam ...
Testing and validating distributed systems with Apache Spark and Apache Beam ...Testing and validating distributed systems with Apache Spark and Apache Beam ...
Testing and validating distributed systems with Apache Spark and Apache Beam ...
Holden Karau
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
Holden Karau
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Holden Karau
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
Holden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
Holden Karau
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
Holden Karau
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Holden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
Holden Karau
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Holden Karau
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Holden Karau
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Holden Karau
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Holden Karau
 
Holden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom ModelsHolden Karau - Spark ML for Custom Models
Holden Karau - Spark ML for Custom Models
sparktc
 
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Scaling with apache spark (a lesson in unintended consequences)   strange loo...Scaling with apache spark (a lesson in unintended consequences)   strange loo...
Scaling with apache spark (a lesson in unintended consequences) strange loo...
Holden Karau
 
Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016Streaming & Scaling Spark - London Spark Meetup 2016
Streaming & Scaling Spark - London Spark Meetup 2016
Holden Karau
 
Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016Beyond shuffling - Strata London 2016
Beyond shuffling - Strata London 2016
Holden Karau
 
Testing and validating distributed systems with Apache Spark and Apache Beam ...
Testing and validating distributed systems with Apache Spark and Apache Beam ...Testing and validating distributed systems with Apache Spark and Apache Beam ...
Testing and validating distributed systems with Apache Spark and Apache Beam ...
Holden Karau
 
A super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAMA super fast introduction to Spark and glance at BEAM
A super fast introduction to Spark and glance at BEAM
Holden Karau
 
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018Beyond Wordcount  with spark datasets (and scalaing) - Nide PDX Jan 2018
Beyond Wordcount with spark datasets (and scalaing) - Nide PDX Jan 2018
Holden Karau
 
Introduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines WorkshopIntroduction to Spark ML Pipelines Workshop
Introduction to Spark ML Pipelines Workshop
Holden Karau
 
A fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFsA fast introduction to PySpark with a quick look at Arrow based UDFs
A fast introduction to PySpark with a quick look at Arrow based UDFs
Holden Karau
 
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop seriesIntroducing Apache Spark's Data Frames and Dataset APIs workshop series
Introducing Apache Spark's Data Frames and Dataset APIs workshop series
Holden Karau
 
Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018Big Data Beyond the JVM - Strata San Jose 2018
Big Data Beyond the JVM - Strata San Jose 2018
Holden Karau
 
Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018Accelerating Big Data beyond the JVM - Fosdem 2018
Accelerating Big Data beyond the JVM - Fosdem 2018
Holden Karau
 
Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016Getting the best performance with PySpark - Spark Summit West 2016
Getting the best performance with PySpark - Spark Summit West 2016
Holden Karau
 
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...Beyond Shuffling  - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Beyond Shuffling - Effective Tips and Tricks for Scaling Spark (Vancouver Sp...
Holden Karau
 
Introduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at lastIntroduction to Spark Datasets - Functional and relational together at last
Introduction to Spark Datasets - Functional and relational together at last
Holden Karau
 
Improving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVMImproving PySpark performance: Spark Performance Beyond the JVM
Improving PySpark performance: Spark Performance Beyond the JVM
Holden Karau
 
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Sparkling pandas Letting Pandas Roam - PyData Seattle 2015
Holden Karau
 
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Apache Spark Structured Streaming for Machine Learning - StrataConf 2016
Holden Karau
 
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Making the big data ecosystem work together with Python & Apache Arrow, Apach...
Holden Karau
 
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Improving PySpark Performance - Spark Beyond the JVM @ PyData DC 2016
Holden Karau
 

Similar to Getting started with Apache Spark in Python - PyLadies Toronto 2016 (20)

An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
Data Con LA
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Holden Karau
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
Holden Karau
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
NigthClazz Spark - Machine Learning / Introduction à Spark et ZeppelinNigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
Zenika
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
Holden Karau
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Holden Karau
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Spark Summit
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Databricks
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
Databricks
 
An introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuckAn introduction into Spark ML plus how to go beyond when you get stuck
An introduction into Spark ML plus how to go beyond when you get stuck
Data Con LA
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Holden Karau
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
Holden Karau
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
Jose Quesada (hiring)
 
NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
NigthClazz Spark - Machine Learning / Introduction à Spark et ZeppelinNigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
NigthClazz Spark - Machine Learning / Introduction à Spark et Zeppelin
Zenika
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
Databricks
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
Databricks
 
A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...A really really fast introduction to PySpark - lightning fast cluster computi...
A really really fast introduction to PySpark - lightning fast cluster computi...
Holden Karau
 
Alpine academy apache spark series #1 introduction to cluster computing wit...
Alpine academy apache spark series #1   introduction to cluster computing wit...Alpine academy apache spark series #1   introduction to cluster computing wit...
Alpine academy apache spark series #1 introduction to cluster computing wit...
Holden Karau
 
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos LinardosApache Spark Workshop, Apr. 2016, Euangelos Linardos
Apache Spark Workshop, Apr. 2016, Euangelos Linardos
Euangelos Linardos
 
Introduction to Spark with Python
Introduction to Spark with PythonIntroduction to Spark with Python
Introduction to Spark with Python
Gokhan Atil
 
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)Sharing (or stealing) the jewels of python with big data & the jvm (1)
Sharing (or stealing) the jewels of python with big data & the jvm (1)
Holden Karau
 
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Building, Debugging, and Tuning Spark Machine Leaning Pipelines-(Joseph Bradl...
Spark Summit
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
Databricks
 
Jump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with DatabricksJump Start on Apache Spark 2.2 with Databricks
Jump Start on Apache Spark 2.2 with Databricks
Anyscale
 
Unified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache SparkUnified Big Data Processing with Apache Spark
Unified Big Data Processing with Apache Spark
C4Media
 
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Spark Machine Learning: Adding Your Own Algorithms and Tools with Holden Kara...
Databricks
 
Spark from the Surface
Spark from the SurfaceSpark from the Surface
Spark from the Surface
Josi Aranda
 
Large Scale Machine learning with Spark
Large Scale Machine learning with SparkLarge Scale Machine learning with Spark
Large Scale Machine learning with Spark
Md. Mahedi Kaysar
 
Strata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark communityStrata NYC 2015 - What's coming for the Spark community
Strata NYC 2015 - What's coming for the Spark community
Databricks
 
Ad

Recently uploaded (20)

GROUP 7 CASE STUDY Real Life Incident.pptx
GROUP 7 CASE STUDY Real Life Incident.pptxGROUP 7 CASE STUDY Real Life Incident.pptx
GROUP 7 CASE STUDY Real Life Incident.pptx
mardoglenn21
 
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
elinavihriala
 
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptxrefractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
KannanDamodaram
 
How to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing SoftwareHow to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing Software
skalatskayaek
 
GDPR Audit - GDPR gap analysis cost Data Protection People.pdf
GDPR Audit - GDPR gap analysis cost  Data Protection People.pdfGDPR Audit - GDPR gap analysis cost  Data Protection People.pdf
GDPR Audit - GDPR gap analysis cost Data Protection People.pdf
Data Protection People
 
9.-Composite-Dr.-B.-Nalini.pptxfdrtyuioklj
9.-Composite-Dr.-B.-Nalini.pptxfdrtyuioklj9.-Composite-Dr.-B.-Nalini.pptxfdrtyuioklj
9.-Composite-Dr.-B.-Nalini.pptxfdrtyuioklj
aishwaryavdcw
 
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptxArtificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
AbhijitPal87
 
egc.pdf tài liệu tiếng Anh cho học sinh THPT
egc.pdf tài liệu tiếng Anh cho học sinh THPTegc.pdf tài liệu tiếng Anh cho học sinh THPT
egc.pdf tài liệu tiếng Anh cho học sinh THPT
huyenmy200809
 
Chronic constipation presentaion final.ppt
Chronic constipation presentaion final.pptChronic constipation presentaion final.ppt
Chronic constipation presentaion final.ppt
DrShashank7
 
Alcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptxAlcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptx
DrShashank7
 
IST606_SecurityManagement-slides_ 4 pdf
IST606_SecurityManagement-slides_ 4  pdfIST606_SecurityManagement-slides_ 4  pdf
IST606_SecurityManagement-slides_ 4 pdf
nwanjamakane
 
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
Fwdays
 
Tableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdfTableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdf
elinavihriala
 
Internal Architecture of Database Management Systems
Internal Architecture of Database Management SystemsInternal Architecture of Database Management Systems
Internal Architecture of Database Management Systems
M Munim
 
Tableau Cloud - what to consider before making the move update 2025.pdf
Tableau Cloud - what to consider before making the move update 2025.pdfTableau Cloud - what to consider before making the move update 2025.pdf
Tableau Cloud - what to consider before making the move update 2025.pdf
elinavihriala
 
Arrays in c programing. practicals and .ppt
Arrays in c programing. practicals and .pptArrays in c programing. practicals and .ppt
Arrays in c programing. practicals and .ppt
Carlos701746
 
time_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptxtime_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptx
stefanopinto1113
 
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Designer
 
llm lecture 4 stanford blah blah blah blah
llm lecture 4 stanford blah blah blah blahllm lecture 4 stanford blah blah blah blah
llm lecture 4 stanford blah blah blah blah
saud140081
 
Chapter4.1.pptx you can come to the house and statistics
Chapter4.1.pptx you can come to the house and statisticsChapter4.1.pptx you can come to the house and statistics
Chapter4.1.pptx you can come to the house and statistics
SotheaPheng
 
GROUP 7 CASE STUDY Real Life Incident.pptx
GROUP 7 CASE STUDY Real Life Incident.pptxGROUP 7 CASE STUDY Real Life Incident.pptx
GROUP 7 CASE STUDY Real Life Incident.pptx
mardoglenn21
 
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
1022_ExtendEnrichExcelUsingPythonWithTableau_04_16+04_17 (1).pdf
elinavihriala
 
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptxrefractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
refractiveindexexperimentdetailed-250528162156-4516aa1c.pptx
KannanDamodaram
 
How to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing SoftwareHow to Choose the Right Online Proofing Software
How to Choose the Right Online Proofing Software
skalatskayaek
 
GDPR Audit - GDPR gap analysis cost Data Protection People.pdf
GDPR Audit - GDPR gap analysis cost  Data Protection People.pdfGDPR Audit - GDPR gap analysis cost  Data Protection People.pdf
GDPR Audit - GDPR gap analysis cost Data Protection People.pdf
Data Protection People
 
9.-Composite-Dr.-B.-Nalini.pptxfdrtyuioklj
9.-Composite-Dr.-B.-Nalini.pptxfdrtyuioklj9.-Composite-Dr.-B.-Nalini.pptxfdrtyuioklj
9.-Composite-Dr.-B.-Nalini.pptxfdrtyuioklj
aishwaryavdcw
 
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptxArtificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
Artificial-Intelligence-in-Autonomous-Vehicles (1)-1.pptx
AbhijitPal87
 
egc.pdf tài liệu tiếng Anh cho học sinh THPT
egc.pdf tài liệu tiếng Anh cho học sinh THPTegc.pdf tài liệu tiếng Anh cho học sinh THPT
egc.pdf tài liệu tiếng Anh cho học sinh THPT
huyenmy200809
 
Chronic constipation presentaion final.ppt
Chronic constipation presentaion final.pptChronic constipation presentaion final.ppt
Chronic constipation presentaion final.ppt
DrShashank7
 
Alcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptxAlcoholic liver disease slides presentation new.pptx
Alcoholic liver disease slides presentation new.pptx
DrShashank7
 
IST606_SecurityManagement-slides_ 4 pdf
IST606_SecurityManagement-slides_ 4  pdfIST606_SecurityManagement-slides_ 4  pdf
IST606_SecurityManagement-slides_ 4 pdf
nwanjamakane
 
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
"Machine Learning in Agriculture: 12 Production-Grade Models", Danil Polyakov
Fwdays
 
Tableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdfTableau Finland User Group June 2025.pdf
Tableau Finland User Group June 2025.pdf
elinavihriala
 
Internal Architecture of Database Management Systems
Internal Architecture of Database Management SystemsInternal Architecture of Database Management Systems
Internal Architecture of Database Management Systems
M Munim
 
Tableau Cloud - what to consider before making the move update 2025.pdf
Tableau Cloud - what to consider before making the move update 2025.pdfTableau Cloud - what to consider before making the move update 2025.pdf
Tableau Cloud - what to consider before making the move update 2025.pdf
elinavihriala
 
Arrays in c programing. practicals and .ppt
Arrays in c programing. practicals and .pptArrays in c programing. practicals and .ppt
Arrays in c programing. practicals and .ppt
Carlos701746
 
time_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptxtime_series_forecasting_constructor_uni.pptx
time_series_forecasting_constructor_uni.pptx
stefanopinto1113
 
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Glary Utilities Pro 5.157.0.183 Crack + Key Download [Latest]
Designer
 
llm lecture 4 stanford blah blah blah blah
llm lecture 4 stanford blah blah blah blahllm lecture 4 stanford blah blah blah blah
llm lecture 4 stanford blah blah blah blah
saud140081
 
Chapter4.1.pptx you can come to the house and statistics
Chapter4.1.pptx you can come to the house and statisticsChapter4.1.pptx you can come to the house and statistics
Chapter4.1.pptx you can come to the house and statistics
SotheaPheng
 
Ad

Getting started with Apache Spark in Python - PyLadies Toronto 2016

  • 1. Intro to Apache Spark w/ML & Python Lightning fast cluster computing with Python For PyLadies Toronto 2016 :)
  • 2. Who am I? ● Prefered pronouns are she/her ● I’m a Principal Software Engineer at IBM’s Spark Technology Center ● previously Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & Fast Data processing with Spark ○ co-author of a new book focused on Spark performance coming out this year* ● @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Spark Videos http://bit.ly/holdenSparkVideos
  • 3. Who I think you wonderful humans are? ● Nice* people ● Don’t mind pictures of cats ● Want to learn about using PySpark for distributed computing ● Don’t overly mind a grab-bag of topics Lori Erickson
  • 4. What we are going to explore together! ● What is Spark? ● Spark’s primary distributed collection ● Word count ● How PySpark works ● Machine Learning with PySpark Ryan McGilchrist
  • 5. Companion notebook funtimes: ● Small companion IJupyter notebook to explore with: ○ http://bit.ly/hkMLExample ● If you want to use it you will access to Apache Spark ○ Install from http://spark.apache.org ○ Or get access to one of the online notebook environments (IBM BlueMix, DataBricks Cloud, Microsoft Spark HDInsights Cluster Notebook, etc.) David DeHetre
  • 6. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455 Photo from Cocoa Dream
  • 7. What is Spark? How do I make MapReduce tasks faster?
  • 8. What is Spark? ● General purpose distributed system ○ With a really nice API ● Apache project (one of the most active) ● Must faster than Hadoop Map/Reduce ● Has Python APIs Bernhard Latzko
  • 9. What is Spark? ● General purpose distributed system ○ With a really nice API ● Apache project (one of the most active) ● Must faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  • 10. The different pieces of Spark: 2.0+ Apache Spark SQL & DataFrames Streaming Language APIs Scala, Java, Python, & R Graph Tools Spark ML bagel & Graph X MLLib Community Packages Structured Streaming
  • 11. SparkContext: entry to the world ● Can be used to create RDDs from many input sources ○ Native collections, local & remote FS ○ Any Hadoop Data Source ● Also create counters & accumulators ● Automatically created in the shells (called sc) ● Specify master & app name when creating ○ Master can be local[*], spark:// , yarn, etc. ○ app name should be human readable and make sense ● etc. Petfu l
  • 12. RDDs: Spark’s Primary abstraction RDD (Resilient Distributed Dataset) ● Distributed collection ● Recomputed on node failure ● Distributes data & work across the cluster ● Lazily evaluated (transformations & actions) Helen Olney
  • 13. Word count (in python) lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile(“output”) Photo By: Will Keightley
  • 14. Word count (in python) lines = sc.textFile(src) words = lines.flatMap(lambda x: x.split(" ")) word_count = (words.map(lambda x: (x, 1)) .reduceByKey(lambda x, y: x+y)) word_count.saveAsTextFile("output") No data is read or processed until after this line This is an “action” which forces spark to evaluate the RDD daniilr
  • 15. Some common transformations & actions Transformations (lazy) ● map ● filter ● flatMap ● reduceByKey ● join ● cogroup Actions (eager) ● count ● reduce ● collect ● take ● saveAsTextFile ● saveAsHadoop ● countByValue Photo by Steve Photo by Dan G
  • 16. A detour into the internals Photo by Bill Ward
  • 17. Why lazy evaluation? ● Allows pipelining procedures ○ Less passes over our data, extra happiness ● Can skip materializing intermediate results which are really really big* ● Figuring out where our code fails becomes a little trickier
  • 18. So what happens when we run this code? Driver Worker Worker Worker HDFS / Cassandra/ etc
  • 19. So what happens when we run this code? Driver Worker Worker Worker HDFS / Cassandra/ etc function
  • 20. So what happens when we run this code? Driver Worker Worker Worker HDFS / Cassandra/ etc read read read
  • 21. So what happens when we run this code? Driver Worker Worker Worker HDFS / Cassandra/ etc cached cached cached counts
  • 22. Spark in Scala, how does PySpark work? ● Py4J + pickling + magic ○ This can be kind of slow sometimes ● RDDs are generally RDDs of pickled objects ● Spark SQL (and DataFrames) avoid some of this
  • 23. So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  • 24. Why should we consider Spark SQL? ● Performance ○ Smart optimizer ○ More efficient storage ○ Faster serialization ● Simplicity ○ Windowed operations ○ Multi-column & multi-type aggregates ● Integrated into the ML Pipeline API Rikki's Refuge
  • 25. How much faster can it be? Andrew Skudder
  • 26. Loading with sparkSQL & spark-csv df = sqlContext.read .format("com.databricks.spark.csv") .option("header", "true") .option("inferSchema", "true") .load("resources/adult.data") Jess Johnson
  • 27. What about other data formats? ● Built in ○ Parquet ○ JDBC ○ Json (which is amazing!) ○ Orc ○ Hive ● Available as packages ○ csv* ○ Avro, Redshift, Mongo, Cassandra, Cloudant, Couchbase, etc. ○ +34 at http://spark-packages.org/?q=tags%3A%22Data%20Sources%22 Michael Coghlan *pre-2.0 package, 2.0+ built in hopefully
  • 28. Lets explore training a Decision Tree ● Step 1: Data loading (done!) ● Step 2: Data prep (select features, etc.) ● Step 3: Train ● Step 4: Predict
  • 29. Data prep / cleaning ● We need to predict a double (can be 0.0, 1.0, but type must be double) ● We need to train with a vector of features Imports: from pyspark.mllib.linalg import Vectors from pyspark.ml.classification import DecisionTreeClassifier from pyspark.ml.param import Param, Params from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer from pyspark.ml import Pipeline Huang Yun Chung
  • 30. Data prep / cleaning continued # Combines a list of double input features into a vector assembler = VectorAssembler(inputCols=["age", "education-num"], outputCol="feautres") # String indexer converts a set of strings into doubles indexer = StringIndexer(inputCol="category") .setOutputCol("category-index") # Can be used to combine pipeline components together pipeline = Pipeline().setStages([assembler, indexer]) Huang Yun Chung
  • 31. So a bit more about that pipeline ● Each of our previous components has “fit” & “transform” stage ● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform) ● Can re-use the fitted model on future data model=pipeline.fit(df) prepared = model.transform(df) Andrey
  • 32. What does our pipeline look like so far? Input Data Assembler Input Data + Vectors StringIndexer Input Data +Cat ID + Vectors While not an ML learning algorithm this still needs to be fit This is a regular transformer - no fitting required.
  • 33. Let's train a model on our prepared data: # Specify model dt = DecisionTreeClassifier(labelCol = "category-index", featuresCol="features") # Fit it dt_model = dt.fit(prepared) # Or as part of the pipeline pipeline_and_model = Pipeline().setStages([assembler, indexer, dt]) pipeline_model = pipeline_and_model.fit(df)
  • 34. And predict the results on the same data: pipeline_model.transform(df).select("prediction", "category-index").take(20)
  • 35. Cross-validation because saving a test set is effort ● Automagically* fit your model params ● Because thinking is effort ● org.apache.spark.ml.tuning has the tools ○ (not in Python yet so skipping for now) Jonathan Kotta
  • 36. Pipeline API has many models: ● org.apache.spark.ml.classification ○ BinaryLogisticRegressionClassification, DecissionTreeClassification, GBTClassifier, etc. ● org.apache.spark.ml.regression ○ DecissionTreeRegression, GBTRegressor, IsotonicRegression, LinearRegression, etc. ● org.apache.spark.ml.recommendation ○ ALS PROcarterse Follow
  • 37. And the next book….. First five chapters are available in “Early Release”*: ● Buy from O’Reilly - http://bit.ly/highPerfSpark Get notified when updated & finished: ● http://www.highperformancespark.com ● https://twitter.com/highperfspark * Early Release means extra mistakes, but also a chance to help us make a more awesome book.
  • 38. And some upcoming talks: ● September ○ This workshop (yay!) ○ New York City Strata Conf (Structured Streaming & Machine Learning) ● October ○ PyData DC - Making Spark go fast in Python (vroom vroom) ○ Salt Lake City Spark Meetup - TBD ○ London - OSCON - Getting Started Contributing to Spark ● December ○ Strata Singapore (Introduction to Datasets)
  • 39. k thnx bye! If you care about Spark testing and don’t hate surveys: http://bit.ly/holdenTestingSpark Will tweet results “eventually” @holdenkarau Any PySpark Users: Have some simple UDFs you wish ran faster you are willing to share?: http://bit.ly/pySparkUDF Pssst: Have feedback on the presentation? Give me a shout ([email protected]) if you feel comfortable doing so :)