Spark ML
Table of Contents
○ Classification
○ Regression
○ Clustering
○ Dimensionality Reduction
○ Collaborative Filtering
Spark MLlib
MLlib components: Algorithms, Featurization, Pipelines, Utilities
• There are two machine learning implementations in Spark: the RDD-based API (MLlib) and the DataFrame-based API (spark.ml)
● Using spark.ml is recommended because the DataFrame API is more versatile and flexible
● "Spark ML" is not an official name; it is commonly used to refer to the MLlib DataFrame-based API (spark.ml)
Spark ML Pipeline
• In machine learning, many transformation steps are performed to pre-process the data
• You may easily lose track of these transformations while working on huge projects
• To avoid this, pipelines were introduced; a pipeline holds every step that is performed to fit the data to a model
Spark ML Pipeline
• The Pipeline API in Spark chains multiple Transformers and Estimators to specify an ML workflow
• It is a high-level API for MLlib that lives under the spark.ml package
A Pipeline is composed of stages: Transformers and Estimators
• For example, logistic regression is an Estimator that trains on a dataset with labels and features and produces a logistic regression model
• The model acts as a Transformer that transforms the input dataset
For example, a logistic regression model can later be used to make predictions, which technically adds prediction columns (a transformation) to the dataset
Spark ML Component Flow
Component flow: Vector Assembler → Min-Max Scaler → Evaluate
• Spark ML data types are important to understand before we take a look at the different feature transformers
Spark ML Data Types
• Spark ML uses the following data types internally for machine learning algorithms
○ Vectors
○ Matrix
Spark ML Data Types
● Converting numerical, string, character, or categorical values into numerical features is called featurization
● Once converted to these data types, the data can be passed on to the ML algorithms in Spark
Spark ML Data Types
● Make a list of all the unique words, each appearing only once, e.g. [I, love, programming, Python, is, a, language, my, favourite, data, science, using]; each sentence can then be represented as a vector over this list:
I love programming = [1 1 1 0 0 0 0 0 0 0 0 0]
Python is a programming language = [0 0 1 1 1 1 1 0 0 0 0 0]
Python is my favourite programming language = [0 0 1 1 1 0 1 1 1 0 0 0]
Data science using Python = [0 0 0 1 0 0 0 0 0 1 1 1]
Spark ML Data Types
● Stacking these row vectors gives a matrix:
array([[1 1 1 0 0 0 0 0 0 0 0 0],
[0 0 1 1 1 1 1 0 0 0 0 0],
[0 0 1 1 1 0 1 1 1 0 0 0],
[0 0 0 1 0 0 0 0 0 1 1 1]])
● Now that you have understood vectors and matrices, let us focus on the different types of vectors
Please Note
The elements of vectors and matrices are NOT always 0s and 1s.
Spark ML Data Types - Local Vector
● A local vector has integer-typed and 0-based indices and double-typed values
● Dense Vectors:
○ A dense vector is a representation that stores every value explicitly; it is used when most values are non-zero (very few zero values)
○ For example, the vector (3.0, 5.0, 8.0, 0.0) can be represented in dense format as [3.0, 5.0, 8.0, 0.0]
Spark ML Data Types - Local Vector
● Sparse Vectors:
○ It stores the size of the vector, an array of indices, and an array of values
corresponding to those indices
○ For example, the vector (0.0, 3.0, 0.0, 8.0, 0.0) can be represented in sparse format as [5, [1,3], [3.0, 8.0]]
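A minimal sketch of how these two representations can be created with pyspark.ml.linalg (the values match the examples above):

    from pyspark.ml.linalg import Vectors

    # Dense vector: every value is stored explicitly
    dense = Vectors.dense([3.0, 5.0, 8.0, 0.0])

    # Sparse vector: size, indices of the non-zero entries, and their values
    sparse = Vectors.sparse(5, [1, 3], [3.0, 8.0])

    print(dense)             # [3.0,5.0,8.0,0.0]
    print(sparse)            # (5,[1,3],[3.0,8.0])
    print(sparse.toArray())  # [0. 3. 0. 8. 0.]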
Spark ML Data Types - Local Vector
The sparse format [5, [1,3], [3.0, 8.0]] consists of:
○ Total no. of elements (integer-typed)
○ Array of indices where non-zero elements are present (integer-typed)
○ Array of values corresponding to each index (double-typed)
The index and value arrays are parallel arrays
● A sparse vector is used for storing non-zero entries, for saving space
Spark ML Data Types - Local Vector
Each of these arrays is of the same size and the array elements are related to each other:
the i-th element of each array is closely related, and all i-th elements together represent an object or entity
Spark ML Data Types - Labeled Point
● A labeled point is a local vector associated with a label; for binary classification, the label is 0 or 1
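A minimal sketch of creating labeled points with the RDD-based API (pyspark.mllib); in the DataFrame-based API the same idea is expressed as a label column plus a features vector column:

    from pyspark.mllib.linalg import SparseVector
    from pyspark.mllib.regression import LabeledPoint

    # A positive example (label 1.0) with a dense feature vector
    pos = LabeledPoint(1.0, [1.0, 0.0, 3.0])

    # A negative example (label 0.0) with a sparse feature vector
    neg = LabeledPoint(0.0, SparseVector(3, [0, 2], [1.0, 3.0]))

    print(pos.label, pos.features)
    print(neg.label, neg.features)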
● Some of the common feature transformers that we use for model building are:
○ Binarizer
○ Bucketizer
○ StringIndexer
○ IndexToString
○ OneHotEncoder
○ VectorAssembler
○ VectorIndexer
○ StandardScaler
○ MinMaxScaler
● Most transformers are under the org.apache.spark.ml.feature package (pyspark.ml.feature in Python)
ML Feature Transformers
● Binarizer
○ Binarization is used for thresholding a numerical feature into a binary feature (0 or 1)
○ For example, we can create a new variable "BodyType" by binarizing the 'BMI' variable (1 = obese, 0 = healthy): if the BMI is 30.0 or higher, the BodyType falls in the obese range
ML Feature Transformers
● Code:
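A minimal sketch of the Binarizer, assuming a small hypothetical DataFrame with a 'BMI' column (the column names and values are made up for illustration):

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Binarizer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(18.5,), (24.9,), (31.2,), (35.0,)], ["BMI"])

    # Values greater than the threshold become 1.0 (obese), the rest 0.0 (healthy)
    binarizer = Binarizer(threshold=30.0, inputCol="BMI", outputCol="BodyType")
    binarizer.transform(df).show()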
● Bucketizer
○ Bucketization is used for grouping the values of a continuous feature into buckets
○ The splits you provide have to be in strictly increasing order, i.e. s0 < s1 < s2 < ... < sn
ML Feature Transformers
● Code:
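A minimal sketch of the Bucketizer, assuming a hypothetical 'age' column; -inf and inf can be used as outer splits to catch out-of-range values:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Bucketizer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(5.0,), (17.0,), (34.0,), (62.0,)], ["age"])

    # Splits must be strictly increasing; four buckets: (-inf, 13), [13, 20), [20, 60), [60, inf)
    splits = [float("-inf"), 13.0, 20.0, 60.0, float("inf")]
    bucketizer = Bucketizer(splits=splits, inputCol="age", outputCol="age_bucket")
    bucketizer.transform(df).show()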
● StringIndexer
○ StringIndexer converts a string column into a column of label indices
○ There can be a situation where the StringIndexer encounters a new label; this usually happens when you fit StringIndexer on one dataset and then use it to transform incoming data that contains a new label
○ You can handle this situation by setting the handleInvalid parameter to one of three strategies: 'error' (throw an exception, the default), 'skip' (drop the rows containing the unseen label), or 'keep' (put unseen labels in an extra bucket)
● Code:
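A minimal sketch of the StringIndexer, assuming a hypothetical 'colour' column; handleInvalid='keep' puts unseen labels into an extra index instead of failing:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("red",), ("blue",), ("red",), ("green",)], ["colour"])

    indexer = StringIndexer(inputCol="colour", outputCol="colour_index",
                            handleInvalid="keep")
    indexer_model = indexer.fit(df)   # StringIndexer is an Estimator
    indexer_model.transform(df).show()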
ML Feature Transformers
● Output:
ML Feature Transformers
● IndexToString
○ IndexToString converts a column of label indices back to a column containing
the original labels as strings
○ It is like the inverse of StringIndexer: You can retrieve the labels that were
transformed by StringIndexer
○ This transformer is mostly used after training a model where you can retrieve
the original labels from the prediction column
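A minimal sketch of IndexToString, assuming a hypothetical 'colour' column that is first indexed and then mapped back to the original strings; on a real model you would typically apply it to the prediction column:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StringIndexer, IndexToString

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("red",), ("blue",), ("green",)], ["colour"])

    indexer_model = StringIndexer(inputCol="colour", outputCol="colour_index").fit(df)
    indexed = indexer_model.transform(df)

    # Recover the original string labels from the index column
    converter = IndexToString(inputCol="colour_index", outputCol="original_colour",
                              labels=indexer_model.labels)
    converter.transform(indexed).show()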
Class Exercise
● OneHotEncoderEstimator
○ OneHotEncoderEstimator converts the label indices to binary vector
representation with at most a single one-value
○ It represents the presence of a specific feature value from among the set of all
feature values
● Code:
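A minimal sketch, assuming hypothetical index values; note that the class is named OneHotEncoderEstimator in Spark 2.3/2.4 and OneHotEncoder in Spark 3.x:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import OneHotEncoder  # OneHotEncoderEstimator in Spark 2.x

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(0.0,), (1.0,), (2.0,), (1.0,)], ["colour_index"])

    # The encoder is an Estimator: fit() learns the number of categories
    encoder = OneHotEncoder(inputCols=["colour_index"], outputCols=["colour_vec"])
    encoder.fit(df).transform(df).show(truncate=False)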
ML Feature Transformers
● Output:
Please Note
The one hot encoder in Spark works very differently from the one in sklearn (which creates dummy columns)
Only one feature column is created, representing the categorical indices as a sparse vector in each row
You may want to convert this sparse vector to a dense vector later for scaling, if required
Please Note
It is primarily used with linear models (e.g. logistic regression) to encode categorical features, since these algorithms expect continuous features
● VectorAssembler
○ MLlib expects all features to be contained within a single column
○ The output column represents the values of all the input columns in the form of a vector (DenseVector or SparseVector, depending on which uses the least memory)
ML Feature Transformers
● Code:
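A minimal sketch of the VectorAssembler, assuming hypothetical 'age', 'salary', and 'colour_index' columns merged into a single 'features' column:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(25.0, 60000.0, 1.0), (40.0, 85000.0, 0.0)],
                               ["age", "salary", "colour_index"])

    assembler = VectorAssembler(inputCols=["age", "salary", "colour_index"],
                                outputCol="features")
    assembler.transform(df).show(truncate=False)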
ML Feature Transformers
● Output:
It does not convert the vector into a dense vector during the merging process
You may want to convert this feature vector, if sparse, into a dense vector to perform
scaling
ML Feature Transformers
● VectorIndexer
○ VectorIndexer automatically identifies the categorical features from the feature
vector (output from VectorAssembler)
● Code:
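A minimal sketch of the VectorIndexer on a hypothetical assembled 'features' column; columns with at most maxCategories distinct values are treated as categorical and indexed:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorIndexer
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(Vectors.dense([1.0, 25.0]),),
                                (Vectors.dense([0.0, 40.0]),),
                                (Vectors.dense([2.0, 33.0]),)], ["features"])

    indexer = VectorIndexer(inputCol="features", outputCol="indexed_features",
                            maxCategories=4)
    indexer.fit(df).transform(df).show(truncate=False)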
ML Feature Transformers
● Output:
ML Feature Transformers
Using the StringIndexer output directly as a feature does not make sense: the underlying variable is nominal (it has no order), but the numeric indices imply an ordering that does not exist. Hence we one hot encode them
VectorIndexer lets us skip the one hot encoding stage for encoding the categorical features
As discussed earlier, we should not use one hot encoding on categorical variables for algorithms like decision trees and tree ensembles
● StandardScaler
○ StandardScaler standardizes each feature in the feature vector so that it has a mean of 0 and a standard deviation of 1 (subject to the parameters below)
○ It takes parameters:
■ withStd: True by default. Scales the data to unit standard deviation
■ withMean: False by default. Centers the data with mean before scaling
Please Note
To use scaling transformers, we need to assemble the features into a feature vector first
(using VectorAssembler)
They do not convert a sparse vector to a dense vector internally. It is therefore very important to convert a sparse vector to a dense vector before running this step: no error is thrown for a sparse input, but the results can be incorrect
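A minimal sketch of the StandardScaler on a hypothetical, already assembled (dense) 'features' column, with both centering and scaling enabled:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import StandardScaler
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(Vectors.dense([25.0, 60000.0]),),
                                (Vectors.dense([40.0, 85000.0]),),
                                (Vectors.dense([33.0, 72000.0]),)], ["features"])

    # withMean=True centres the data, withStd=True scales to unit standard deviation
    scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                            withMean=True, withStd=True)
    scaler.fit(df).transform(df).show(truncate=False)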
ML Feature Transformers
● Output:
ML Feature Transformers
● Output:
ML Feature Transformers
● MinMaxScaler
○ MinMaxScaler rescales each feature in the feature vector to the range between 0 and 1
○ Though (0, 1) is the default range, we can define our own min and max values as well
○ It takes parameters:
■ min: 0.0 by default. Lower bound value
■ max: 1.0 by default. Upper bound value
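A minimal sketch of the MinMaxScaler on a hypothetical assembled 'features' column, using the default (0, 1) range:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import MinMaxScaler
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(Vectors.dense([25.0, 60000.0]),),
                                (Vectors.dense([40.0, 85000.0]),)], ["features"])

    # min=0.0 and max=1.0 are the defaults; pass min=/max= to change the range
    scaler = MinMaxScaler(inputCol="features", outputCol="scaled_features")
    scaler.fit(df).transform(df).show(truncate=False)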
Class Exercise
● Normalizer
○ Normalizer rescales each feature vector (each row) to have unit norm, using the p-norm (p = 2 by default)
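A minimal sketch of the Normalizer, which is a pure Transformer (no fit step); p=2.0 rescales each row vector to unit Euclidean length:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import Normalizer
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(Vectors.dense([4.0, 3.0]),)], ["features"])

    normalizer = Normalizer(inputCol="features", outputCol="norm_features", p=2.0)
    normalizer.transform(df).show(truncate=False)  # [0.8, 0.6]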
• After you transform the dataframe with the model you built, additional prediction columns may be added, depending on the algorithm:
○ rawPrediction
○ probability
○ prediction
Understanding Output of a Model
• rawPrediction
- It stores the raw output of a classifier for each possible target variable label
- For example, for logistic regression the rawPrediction is calculated with the help of the logit
Understanding Output of a Model
• probability
- It stores the probability of a classifier for each possible target variable label given
the raw prediction
- For example, in logistic regression, probability is the result of applying the logistic function ( exp(x)/(1+exp(x)) ) to the rawPrediction
Understanding Output of a Model
• prediction
- It is the class that the model predicts for the given probability array
- It is the index at which the probability array takes its maximum value, i.e. the most probable label (a single number)
Interpretation
● As discussed earlier, all Spark ML models train on only one column of data
● You should extract the values from each row and pack them into a vector in a single column named features (the name is not compulsory)
● Only supervised learning models take a 'labelCol' parameter along with 'featuresCol'
Spark ML Algorithms
(Table of algorithm parameters: Parameter Name, Input Type, Description, Note)
● The fitted model object acts as a transformer that adds the prediction columns to the dataframe, as sketched below
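A minimal sketch, assuming a tiny hypothetical training DataFrame with 'label' and 'features' columns; after fitting, the model transforms the data and adds the rawPrediction, probability, and prediction columns:

    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.linalg import Vectors

    spark = SparkSession.builder.getOrCreate()
    train = spark.createDataFrame([(0.0, Vectors.dense([0.0, 1.1])),
                                   (1.0, Vectors.dense([2.0, 1.0])),
                                   (0.0, Vectors.dense([0.5, 0.3])),
                                   (1.0, Vectors.dense([2.5, 1.2]))],
                                  ["label", "features"])

    lr = LogisticRegression(featuresCol="features", labelCol="label")
    lr_model = lr.fit(train)                 # Estimator -> fitted model (Transformer)

    predictions = lr_model.transform(train)  # adds the prediction columns
    predictions.select("rawPrediction", "probability", "prediction").show(truncate=False)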
● Output
Interpretation
• rawPrediction: it is the raw output of the logistic regression classifier (array with
length equal to the number of classes)
• prediction: it is the index at which the probability array takes its maximum value, and it gives the most probable label (a single number)
Logistic Regression Model Evaluation
• Spark ML provides a suite of metrics for the purpose of evaluating the performance
of machine learning models
● You can also use model.summary for logistic regression to get the performance metrics
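A minimal sketch of evaluating a fitted logistic regression model, assuming the 'predictions' DataFrame and 'lr_model' from the sketch above; the evaluator reads the prediction columns, and the training summary exposes similar metrics directly:

    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # Computes a single metric from the rawPrediction and label columns
    evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
    print(evaluator.evaluate(predictions))

    # model.summary for logistic regression also exposes performance metrics
    print(lr_model.summary.areaUnderROC)
    print(lr_model.summary.accuracy)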
Model Evaluation
● Output
Spark ML Algorithms
● Linear Regression: pyspark.ml.regression.LinearRegression (sklearn equivalent: LinearRegression); output columns: predictionCol
● Logistic Regression: pyspark.ml.classification.LogisticRegression (sklearn equivalent: LogisticRegression); output columns: rawPredictionCol, probabilityCol, predictionCol
● Decision Tree Classification: pyspark.ml.classification.DecisionTreeClassifier (sklearn equivalent: DecisionTreeClassifier); output columns: probabilityCol, predictionCol
● Decision Tree Regression: pyspark.ml.regression.DecisionTreeRegressor (sklearn equivalent: DecisionTreeRegressor); output columns: predictionCol
● Random Forest Regression: pyspark.ml.regression.RandomForestRegressor (sklearn equivalent: RandomForestRegressor); output columns: predictionCol
● Gradient Boosted Trees Classification: pyspark.ml.classification.GBTClassifier (sklearn equivalent: GradientBoostingClassifier); output columns: rawPredictionCol, probabilityCol, predictionCol
● Gradient Boosted Trees Regression: pyspark.ml.regression.GBTRegressor (sklearn equivalent: GradientBoostingRegressor); output columns: predictionCol
● Naive Bayes: pyspark.ml.classification.NaiveBayes (sklearn equivalent: GaussianNB); output columns: rawPredictionCol, probabilityCol, predictionCol
Model Evaluation
● BinaryClassificationEvaluator: areaUnderROC, areaUnderPR
● MulticlassClassificationEvaluator: f1, accuracy, weightedPrecision, weightedRecall, weightedTruePositiveRate, weightedFalsePositiveRate, weightedFMeasure, truePositiveRateByLabel, falsePositiveRateByLabel, precisionByLabel, recallByLabel, fMeasureByLabel, logLoss, hammingLoss
● MultilabelClassificationEvaluator: subsetAccuracy, accuracy, hammingLoss, precision, recall, f1Measure, precisionByLabel, recallByLabel, f1MeasureByLabel, microPrecision, microRecall, microF1Measure
● ClusteringEvaluator: silhouette
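A minimal sketch of using one of these evaluators; 'predictions' is assumed to be a DataFrame produced by a fitted classifier with 'label' and 'prediction' columns:

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                  predictionCol="prediction",
                                                  metricName="f1")
    print(evaluator.evaluate(predictions))  # change metricName for accuracy, etc.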
Building Pipeline
Building Spark ML Pipeline
● These stages run in order, and the dataframe is transformed as it passes through each stage
● The Pipeline object is later fit on the raw training set, which creates a pipeline model (PipelineModel)
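A minimal sketch of building and fitting a pipeline; the stages, the column names ('colour', 'age', 'label'), and the train_df/test_df DataFrames are assumptions for illustration:

    from pyspark.ml import Pipeline
    from pyspark.ml.feature import StringIndexer, VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    indexer = StringIndexer(inputCol="colour", outputCol="colour_index")
    assembler = VectorAssembler(inputCols=["age", "colour_index"], outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    # The Pipeline itself is an Estimator; fitting it runs each stage in order
    pipeline = Pipeline(stages=[indexer, assembler, lr])
    pipeline_model = pipeline.fit(train_df)          # train_df: hypothetical training DataFrame

    predictions = pipeline_model.transform(test_df)  # test_df: hypothetical test DataFrame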
● Output
Model Persistence
● In real-life scenarios, you will produce an ML model and hand it over to the development team for deployment in a production environment
● Model persistence means saving your model to disk for later use, without the need to retrain it
Model Persistence
● It might happen that you wish to retrain your model and save it to the same place
● Use the PipelineModel class from the pyspark.ml package to load the persisted pipeline model
● The loaded model can then be used to perform predictions on test data
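A minimal sketch of persisting and reloading the fitted pipeline model; the path and the pipeline_model/test_df names are assumptions carried over from the sketch above:

    from pyspark.ml import PipelineModel

    # overwrite() lets you retrain the model and save it to the same location
    pipeline_model.write().overwrite().save("/models/my_pipeline")

    # Later (e.g. in production): load the persisted model and make predictions
    loaded_model = PipelineModel.load("/models/my_pipeline")
    predictions = loaded_model.transform(test_df)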
Model Persistence
● Output
Summary
● The Pipeline API chains Transformers and Estimators, each as a stage, to specify an ML workflow