
Spark ML

Table of Contents

● Intro to Spark MLlib


● Spark ML Pipeline
● Spark ML Component Flow
● Spark ML Data Types
● Spark ML Transformers
● Understanding Outputs
● Spark ML Algorithms
● Building Pipelines
● Model Persistence
Intro to Spark ML
Spark MLlib

● Spark MLlib is Apache Spark’s Machine Learning library

● It consists of algorithms like:

○ Classification

○ Regression

○ Clustering

○ Dimensionality Reduction

○ Collaborative Filtering
Spark MLlib
Algorithms
● Regression
● Classification
● Clustering
● Collaborative Filtering

Featurization
● Feature Extraction
● Feature Selection
● Transformation
● Dimensionality Reduction

Utilities
● Linear Algebra
● Statistics
● Data Handling

Pipeline
● Pipeline Construction
● Model Tuning
● Model Persistence
Spark MLlib and ML

• There are two machine learning implementations in Spark (MLlib and ML):

○ spark.mllib: the ML package built on top of the RDD API

○ spark.ml: the ML package built on top of the higher-level DataFrame API

● Using spark.ml is recommended because the DataFrame API is more versatile and flexible
“Spark ML” is not an official name, but is commonly used to refer to the MLlib
DataFrame-based API (spark.ml)
Spark ML Pipeline
Spark ML Pipeline

• In machine learning, many transformation steps are performed to pre-process the data

• We repeat the same steps while making predictions

• It is easy to lose track of these transformations while working on large projects

• To avoid this, pipelines were introduced: a pipeline holds every step that is performed to fit the data to a model
Spark ML Pipeline

• The Pipeline API in Spark chains multiple Transformers and an Estimator, specifying an ML workflow

• It is a high-level API for MLlib that lives under the spark.ml package

[Pipeline diagram: DataFrame → Transformer x → Transformer y → … → Transformer z → Estimator → DataFrame]
Transformers

• A Transformer takes a dataset as input and produces an augmented dataset as output

• It basically transforms one DataFrame into another DataFrame

Estimators

• An Estimator is fit on the input data and produces a model

• For example, logistic regression is an Estimator that trains on a dataset with labels and features and produces a logistic regression model

• The model acts as a Transformer that transforms the input dataset

For example, the logistic regression model can later be used to make predictions, which technically adds prediction columns (a transformation) to the dataset
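A minimal sketch of this fit/transform contract, assuming an active SparkSession named spark; the toy data and column names are illustrative only.

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# Toy training data: a vector column "features" and a numeric column "label"
train = spark.createDataFrame(
    [(Vectors.dense([0.0, 1.1]), 0.0),
     (Vectors.dense([2.0, 1.0]), 1.0)],
    ["features", "label"])

lr = LogisticRegression()                # Estimator
lr_model = lr.fit(train)                 # fit() returns a Model, which is a Transformer
predictions = lr_model.transform(train)  # transform() appends the prediction columns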
Spark ML Component Flow
Spark ML Component Flow

• The Pipeline API chains Transformers and an Estimator, each as a stage, to specify an ML workflow

• These stages are run in order

• The input DataFrame is transformed as it passes through each stage

• An Evaluator then evaluates the model performance


Spark ML Component Flow
[Pipeline diagram: Load Data → Min-Max Scaler → One Hot Encoder → Vector Assembler (Transformers) → Logistic Regression (Estimator) → Evaluate]
Spark ML Component Flow

[Diagram: the same sequence (Load Data → Min-Max Scaler → One Hot Encoder → Vector Assembler → Logistic Regression → Evaluate) shown twice, side by side: once as the pipeline of machine learning components, and once reusing the pipeline on new data]


Spark ML Component Flow

• Spark ML algorithms (Estimators) expect all features to be contained within a single column, in the form of a Vector

• It is one of the important Spark ML data types that you need to understand before we take a look at the different feature transformers
Spark ML Data Types
Spark ML Data Types

• Spark ML uses the following data types internally for machine learning algorithms

○ Vectors

○ Matrix
Spark ML Data Types

● These data types support a process called featurization

● Featurization is the conversion of numerical, string, character, or categorical values into numerical features

● Data converted to these types can then be passed to an ML algorithm in Spark
Spark ML Data Types

● For example, consider the following sentences:


○ I love programming
○ Python is a programming language
○ Python is my favourite programming language
○ Data science using Python
Spark ML Data Types

● Make a list of the words such that each word occurs only once; the list looks as follows:

[“I”, “love”, “programming”, “Python”, “is”, “a”, “language”, “my”, “favourite”, “Data”, “Science”, “using”]

● Now count the occurrences of each word in a sentence with respect to this list


Spark ML Data Types
● For example, the vector conversion of the sentence “Data science using Python” can be represented as:
“I” - 0
“love”- 0
“programming” - 0
“Python” - 1
“is” - 0
“a” - 0
“language” - 0
“my” - 0
“favourite” - 0
“Data” - 1
“Science” - 1
“using” - 1
Spark ML Data Types

● Following the same approach, the other vectors are as follows:

I love programming = [1 1 1 0 0 0 0 0 0 0 0 0]
Python is a programming language = [0 0 1 1 1 1 1 0 0 0 0 0]
Python is my favourite programming language = [0 0 1 1 1 0 1 1 1 0 0 0]
Data science using Python = [0 0 0 1 0 0 0 0 0 1 1 1]
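As an illustrative sketch, PySpark’s CountVectorizer builds such count vectors automatically (assuming an active SparkSession named spark). Note that CountVectorizer orders its vocabulary by corpus-wide term frequency, so the index positions will generally differ from the hand-built list above.

from pyspark.ml.feature import Tokenizer, CountVectorizer

sentences = spark.createDataFrame([
    ("I love programming",),
    ("Python is a programming language",),
    ("Python is my favourite programming language",),
    ("Data science using Python",),
], ["text"])

# Split each sentence into words, then count word occurrences per sentence
words = Tokenizer(inputCol="text", outputCol="words").transform(sentences)
cv_model = CountVectorizer(inputCol="words", outputCol="counts").fit(words)
cv_model.transform(words).select("counts").show(truncate=False)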
Spark ML Data Types

● The sentences can also be combined into a 4×12 matrix

array([[1 1 1 0 0 0 0 0 0 0 0 0],
[0 0 1 1 1 1 1 0 0 0 0 0],
[0 0 1 1 1 0 1 1 1 0 0 0],
[0 0 0 1 0 0 0 0 0 1 1 1]])

● Now that you have understood vectors and matrices, let us focus on the different types of vectors
Please Note

The elements of vectors and matrices are NOT always 0s and 1s.
Spark ML Data Types - Local Vector

● A local vector has integer-typed and 0-based indices and double-typed values

● It is stored on a single (local) machine

● A local vector can be represented as:


○ Dense Vector
○ Sparse Vector
Please Note

Dense and Sparse Vectors are vector representation of data


Spark ML Data Types - Local Vector

● Dense Vectors:

○ By definition, dense means closely compacted

○ A Dense Vector is a vector representation used when most of the values are non-zero (very few zero values)

○ It is basically an array of values

○ For example, a vector (3.0, 5.0, 8.0, 0.0) can be represented in dense format as [3.0, 5.0, 8.0, 0.0]
Spark ML Data Types - Local Vector

● Dense Vectors PySpark
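A minimal sketch of creating the dense vector from the example above (the slide’s original code screenshot is not reproduced here):

from pyspark.ml.linalg import Vectors

# A dense vector stores every value explicitly, including the zeros
dense_vec = Vectors.dense([3.0, 5.0, 8.0, 0.0])
print(dense_vec)   # [3.0,5.0,8.0,0.0]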


Spark ML Data Types - Local Vector

● Sparse Vectors:

○ By definition, sparse means thinly dispersed or scattered

○ If a vector has a majority of its elements as zero, it can be represented as a sparse vector

○ It stores the size of the vector, an array of indices, and an array of values
corresponding to those indices

○ For example, a vector (0.0, 3.0, 0.0, 8.0, 0.0) can be represented in sparse format as
[5, [1,3], [3.0, 8.0]]
Spark ML Data Types - Local Vector

[5, [1,3], [3.0, 8.0]] breaks down into three parallel arrays:

○ 5: the total number of elements (integer-typed)
○ [1, 3]: the indices where non-zero elements are present (integer-typed)
○ [3.0, 8.0]: the values corresponding to those indices (double-typed)

● A sparse vector is used for storing non-zero entries for saving space
Spark ML Data Types - Local Vector

● Sparse Vector PySpark
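A minimal sketch of creating the sparse vector from the example above (again, the original screenshot is not reproduced):

from pyspark.ml.linalg import Vectors

# size, indices of the non-zero entries, and the values at those indices
sparse_vec = Vectors.sparse(5, [1, 3], [3.0, 8.0])
print(sparse_vec)            # (5,[1,3],[3.0,8.0])
print(sparse_vec.toArray())  # [0. 3. 0. 8. 0.]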


In computer science, a Parallel Array is an implicit data structure that consists of multiple arrays

Each of these arrays is of the same size, and the array elements are related to each other

The i-th elements of all the arrays are closely related and together represent one object or entity
Spark ML Data Types - Labeled Point

● Labeled Point is a type of local vector

● It can either be dense or sparse

● It is associated with label/response variable

● Used in supervised learning algorithms

● For binary classification, a label should be either 0 (negative) or 1 (positive)

● For multi-class classification, labels should be class indices starting from zero: 0, 1, 2, …


Spark ML Data Types - Labeled Point

Type Label Values


Regression Decimal Values

Binary Classification 0 or 1

Multi-class Classification 0,1,2,3, ….
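A small illustrative sketch of the idea above. Note that the LabeledPoint class itself is exposed through the RDD-based pyspark.mllib API; in spark.ml the same idea is expressed as a DataFrame row with label and features columns.

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.linalg import Vectors

# A positive example with a dense feature vector
pos = LabeledPoint(1.0, Vectors.dense([1.0, 0.0, 3.0]))

# A negative example with a sparse feature vector
neg = LabeledPoint(0.0, Vectors.sparse(3, [0, 2], [1.0, 3.0]))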


Spark ML Transformers
ML Feature Transformers

● Feature building is a very important step in model building

● Some of the common feature transformers that we use for model building are:

○ Binarizer
○ Bucketizer
○ StringIndexer
○ IndexToString
○ OneHotEncoder
○ VectorAssembler
○ VectorIndexer
○ StandardScaler
○ MinMaxScaler

● Most transformers are under the org.apache.spark.ml.feature package (pyspark.ml.feature in PySpark)
ML Feature Transformers

● Binarizer
○ Binarization is used for thresholding a numerical feature to a binary feature (0 or 1)

○ Binarizer takes inputCol, outputCol, and the threshold for binarization as parameters

○ Values greater than the threshold are binarized to 1.0

○ Values less than or equal to the threshold are binarized to 0.0


ML Feature Transformers

● Consider the following dataframe:

We can create a new variable “BodyType” by binarizing the 'BMI' variable (1: obese, 0: healthy).
(If your BMI is 30.0 or higher, your BodyType falls in the obese range.)
ML Feature Transformers

● Code:
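A minimal sketch of this step, assuming an active SparkSession named spark and a toy DataFrame with a BMI column; the original code screenshot is not reproduced here.

from pyspark.ml.feature import Binarizer

df = spark.createDataFrame(
    [(22.5,), (31.2,), (27.8,), (35.0,)], ["BMI"])

# Values above the threshold become 1.0, the rest become 0.0
binarizer = Binarizer(threshold=30.0, inputCol="BMI", outputCol="BodyType")
binarizer.transform(df).show()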

Values above the threshold of 30 are set to 1.0 (denoting obesity) in the new column ‘BodyType’
ML Feature Transformers

● Bucketizer
○ Bucketization is used for grouping the values of a continuous feature into buckets

○ Bucketizer takes inputCol, outputCol, and splits (for mapping continuous features into buckets) as parameters

○ There are n buckets for n+1 splits

○ The splits you provide have to be in strictly increasing order, i.e. s0 < s1 < s2 < ... < sn
ML Feature Transformers
● Code:
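An illustrative sketch, assuming an active SparkSession named spark and a toy age column; the split points are made up for the example.

from pyspark.ml.feature import Bucketizer

df = spark.createDataFrame(
    [(5.0,), (18.0,), (35.0,), (60.0,), (85.0,)], ["age"])

# 5 split points produce 4 buckets (0, 1, 2, 3)
splits = [0.0, 18.0, 40.0, 65.0, float("inf")]
bucketizer = Bucketizer(splits=splits, inputCol="age", outputCol="age_bucket")
bucketizer.transform(df).show()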

You can check that there are 4 buckets (0, 1, 2, 3) for 5 splits


ML Feature Transformers

● StringIndexer
○ StringIndexer converts a string column to an index column

○ The most frequent label gets index 0

○ Labels are basically ordered by their frequencies


Please Note

There can be a situation where the StringIndexer encounters a new label

This usually happens when you fit StringIndexer on one dataset and then use it to transform incoming data that has a new label

You can use any of the following three strategies to handle this situation by setting the handleInvalid parameter to:

○ ‘error’: throw an exception (the default)
○ ‘skip’: skip the row containing the unseen label entirely
○ ‘keep’: put unseen labels in a special additional bucket, at index numLabels
ML Feature Transformers

● Code:
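A minimal sketch of StringIndexer, assuming an active SparkSession named spark; the colour column is a made-up example.

from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
    [("red",), ("blue",), ("red",), ("green",), ("red",)], ["colour"])

indexer = StringIndexer(inputCol="colour", outputCol="colour_index",
                        handleInvalid="keep")
indexed = indexer.fit(df).transform(df)
indexed.show()   # "red" is the most frequent label, so it gets index 0.0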
ML Feature Transformers

● Output:
ML Feature Transformers

● IndexToString
○ IndexToString converts a column of label indices back to a column containing
the original labels as strings

○ It is like the inverse of StringIndexer: You can retrieve the labels that were
transformed by StringIndexer

○ This transformer is mostly used after training a model where you can retrieve
the original labels from the prediction column
Class Exercise

Use IndexToString to convert the index column into its respective string values


ML Feature Transformers

● OneHotEncoderEstimator
○ OneHotEncoderEstimator converts label indices to a binary vector representation with at most a single one-value

○ It represents the presence of a specific feature value from among the set of all feature values

○ It encodes the features into a sparse vector


ML Feature Transformers

● Code:
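An illustrative sketch, assuming an active SparkSession named spark. In Spark 3.x the class is named OneHotEncoder (with inputCols/outputCols); on Spark 2.3/2.4 the equivalent class is OneHotEncoderEstimator.

from pyspark.ml.feature import StringIndexer, OneHotEncoder

df = spark.createDataFrame(
    [("red",), ("blue",), ("green",), ("red",)], ["colour"])

# One-hot encoding expects numeric indices, so index the strings first
indexed = StringIndexer(inputCol="colour", outputCol="colour_index") \
    .fit(df).transform(df)

encoder = OneHotEncoder(inputCols=["colour_index"], outputCols=["colour_vec"])
encoded = encoder.fit(indexed).transform(indexed)
encoded.show()   # colour_vec is a sparse vector with at most a single 1.0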
ML Feature Transformers

● Output:
Please Note

The one-hot encoder in Spark works very differently from the way it works in sklearn (the dummy-column creation style)

Only one feature column is created, representing the categorical indices in the form of a sparse vector in each row

You may want to convert this sparse vector to a dense vector later for scaling, if required
Please Note

It is primarily used with linear models (e.g. Logistic Regression) to encode categorical features, since these algorithms expect continuous features

Such a representation proves inefficient with algorithms that handle categorical features intrinsically
ML Feature Transformers

● VectorAssembler
○ MLlib expects all features to be contained within a single column

○ VectorAssembler combines multiple columns and gives a single column as output

○ The output column represents the values of all the input columns in the form of a vector (DenseVector or SparseVector, depending on which uses the least memory)
ML Feature Transformers

● Code:
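A minimal sketch, assuming an active SparkSession named spark; the column names are made up for the example.

from pyspark.ml.feature import VectorAssembler

df = spark.createDataFrame(
    [(25.0, 60000.0, 1.0), (40.0, 82000.0, 0.0)],
    ["age", "salary", "is_member"])

# Pack all feature columns into a single vector column named "features"
assembler = VectorAssembler(
    inputCols=["age", "salary", "is_member"], outputCol="features")
assembled = assembler.transform(df)
assembled.select("features").show(truncate=False)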
ML Feature Transformers

● Output:

If you notice, the features column contains a sparse vector


Please Note

VectorAssembler chooses the dense vs sparse output format based on whichever one uses less memory

It does not convert the vector into a dense vector during the merging process

You may want to convert this feature vector, if sparse, into a dense vector to perform scaling
ML Feature Transformers

● VectorIndexer
○ VectorIndexer automatically identifies the categorical features from the feature
vector (output from VectorAssembler)

○ It then indexes categorical features inside of a Vector

○ It is the vectorized version of StringIndexer

○ This step is mostly used after the VectorAssembler stage


ML Feature Transformers

● Code:
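An illustrative sketch, assuming an active SparkSession named spark; the columns and the maxCategories value are made up for the example.

from pyspark.ml.feature import VectorAssembler, VectorIndexer

df = spark.createDataFrame(
    [(0.0, 34.0), (1.0, 45.5), (2.0, 23.0), (1.0, 31.0)],
    ["colour_index", "age"])

assembled = VectorAssembler(
    inputCols=["colour_index", "age"], outputCol="features").transform(df)

# Features with <= maxCategories distinct values are treated as categorical
indexer = VectorIndexer(inputCol="features", outputCol="indexed_features",
                        maxCategories=3)
indexer.fit(assembled).transform(assembled).show(truncate=False)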
ML Feature Transformers

● Output:
ML Feature Transformers

Using the StringIndexer output directly as a feature often does not make sense, because the indices impose an artificial order on what is really a nominal variable (one with no inherent order). Hence we one-hot encode them.

VectorIndexer handles this in the backend: it identifies and indexes the categorical features inside the assembled feature vector.

Please Note

VectorIndexer lets us skip the one-hot encoding stage for encoding the categorical features

As discussed earlier, we should not use one-hot encoding on categorical variables for algorithms like decision trees and tree ensembles

VectorIndexer is chosen over OneHotEncoderEstimator in such scenarios, which allows these algorithms to treat categorical features appropriately
ML Feature Transformers

● StandardScaler
○ StandardScaler scales each value in the feature vector such that the mean is 0
and the standard deviation is 1

○ It takes parameters:
■ withStd: True by default. Scales the data to unit standard deviation

■ withMean: False by default. Centers the data to zero mean before scaling
Please Note

To use scaling transformers, we need to assemble the features into a feature vector first (using VectorAssembler)

They do not convert a sparse vector to a dense vector internally. Therefore, it is very important to convert the sparse vector to a dense vector before running this step, to avoid incorrect results, as no error is thrown for a sparse vector input
ML Feature Transformers

● Code: We first convert sparse vector into dense vector
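One way to do this conversion is with a small UDF, sketched below; it assumes the assembled DataFrame from the VectorAssembler sketch above, with its vector column named features.

from pyspark.sql.functions import udf
from pyspark.ml.linalg import DenseVector, VectorUDT

# UDF that converts any vector (sparse or dense) to a DenseVector
to_dense = udf(lambda v: DenseVector(v.toArray()), VectorUDT())

dense_df = assembled.withColumn("dense_features", to_dense("features"))
dense_df.select("dense_features").show(truncate=False)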


ML Feature Transformers

● Output:
ML Feature Transformers

● Code: We then apply StandardScaler on the dense vector
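A minimal sketch, continuing from the dense_df DataFrame produced in the previous step; the output column name is illustrative.

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="dense_features", outputCol="scaled_features",
                        withMean=True, withStd=True)
scaler_model = scaler.fit(dense_df)          # StandardScaler is an Estimator
scaled_df = scaler_model.transform(dense_df)
scaled_df.select("scaled_features").show(truncate=False)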


ML Feature Transformers

● Output:
ML Feature Transformers

● MinMaxScaler
○ MinMaxScaler scales each value in the feature vector between 0 and 1

○ Though (0, 1) is the default range, we can define our range of max and min
values as well

○ It takes parameters:
■ min: 0.0 by default. Lower bound value
■ max: 1.0 by default. Upper bound value
Class Exercise

Use MinMaxScaler to scale the dense features


ML Feature Transformers

○ Normalizer normalizes each feature vector to have unit norm

○ It takes a parameter p, which specifies the p-norm used for normalization. By default, the value of p is 2
Class Exercise

Use Normalizer to scale the dense features


Understanding Outputs
Understanding Output of a Model

• After you transform the dataframe with the model that you built, it may add
additional columns as predictions depending upon the algorithm:

○ rawPrediction
○ probability
○ prediction
Understanding Output of a Model

• rawPrediction

- It stores the raw output of a classifier for each possible target variable label

- The meaning of a “raw” prediction may vary between algorithms

- It gives a measure of confidence in each possible label (where larger = more confident)

- For example, for logistic regression the rawPrediction is calculated with the help of the logit
Understanding Output of a Model

• probability

- It stores the probability of a classifier for each possible target variable label given
the raw prediction

- For example, in logistic regression, probability is the result of applying the logistic function ( exp(x)/(1+exp(x)) ) to rawPrediction
Understanding Output of a Model

• prediction

- It is the corresponding class that the model has predicted for the given probability array

- It is the index at which the probability array takes its maximum value, i.e. the most probable label (a single number)
Interpretation

[Annotated output: the columns shown are rawPrediction for class 0, rawPrediction for class 1, probability for class 0, and probability for class 1; rows are marked according to whether the probability for class 0 or class 1 is greater]

Spark ML Algorithms
Spark ML Algorithms

● As discussed earlier, every Spark ML model trains on only one column of data

● You should extract the values from each row and pack them into a vector in a single column, conventionally named features (the name is not compulsory)

● Therefore, every Spark ML model has ‘featuresCol’ as a parameter

● Only supervised learning models have ‘labelCol’ along with ‘featuresCol’ as parameters
Spark ML Algorithms

Common Spark ML Parameters:

Parameter Name | Input Type | Description | Note
labelCol | Double | Target column | Only for supervised learning algorithms
featuresCol | Vector | Features vector | For all algorithms

Spark ML Algorithms Example: Logistic Regression

● We use LogisticRegression from the pyspark.ml package to train (fit) logistic regression on the features

● LogisticRegression.fit returns a LogisticRegressionModel object

● This object acts as a transformer that adds the prediction columns to the dataframe

● This is applicable to all the Spark ML algorithms


Spark ML Algorithms Example: Logistic Regression

● Code: Logistic Regression pyspark ml implementation
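A minimal sketch of the implementation; the original screenshot is not reproduced, and train_df/test_df are assumed to be DataFrames that already contain a features vector column and a label column (for example, from the VectorAssembler step).

from pyspark.ml.classification import LogisticRegression

# Assumed: train_df and test_df with "features" (Vector) and "label" columns
lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
lr_model = lr.fit(train_df)                # returns a LogisticRegressionModel

predictions = lr_model.transform(test_df)  # appends the prediction columns
predictions.select("label", "rawPrediction", "probability", "prediction").show(5)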


Spark ML Algorithms Example: Logistic Regression

● Output
Interpretation

• rawPrediction: the raw output of the logistic regression classifier (an array with length equal to the number of classes)

• probability: the result of applying the logistic function to rawPrediction (an array of the same length as rawPrediction)

• prediction: the index at which the probability array takes its maximum value; it gives the most probable label (a single number)
Logistic Regression Model Evaluation

• Spark ML provides a suite of metrics for the purpose of evaluating the performance
of machine learning models

• Let us evaluate the logistic regression model that we built using


BinaryClassificationEvaluator
Logistic Regression Model Evaluation

● Code: Evaluating model performance using BinaryClassificationEvaluator
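A minimal sketch, reusing the predictions DataFrame from the logistic regression sketch above; the metric name shown is one of the evaluator’s built-in options.

from pyspark.ml.evaluation import BinaryClassificationEvaluator

evaluator = BinaryClassificationEvaluator(
    labelCol="label", rawPredictionCol="rawPrediction",
    metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print("Area under ROC:", auc)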


Model Evaluation

● You can also use model.summary for logistic regression to get the performance metrics
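For example, a brief sketch using the fitted lr_model from above:

summary = lr_model.summary       # LogisticRegressionTrainingSummary
print(summary.accuracy)          # overall accuracy on the training data
print(summary.areaUnderROC)      # ROC AUC (binary classification only)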
Model Evaluation

● Output
Spark ML Algorithms

Algorithm | Spark ML Package | Spark ML | Sklearn Equivalent | Output Parameter(s)

Linear Regression | pyspark.ml.regression | LinearRegression | LinearRegression | predictionCol

Logistic Regression | pyspark.ml.classification | LogisticRegression | LogisticRegression | rawPredictionCol, probabilityCol, predictionCol

Decision Tree Classification | pyspark.ml.classification | DecisionTreeClassifier | DecisionTreeClassifier | rawPredictionCol, probabilityCol, predictionCol

Decision Tree Regression | pyspark.ml.regression | DecisionTreeRegressor | DecisionTreeRegressor | predictionCol
Spark ML Algorithms

Algorithm | Spark ML Package | Spark ML | Sklearn Equivalent | Output Parameter(s)

Random Forest Classification | pyspark.ml.classification | RandomForestClassifier | RandomForestClassifier | rawPredictionCol, probabilityCol, predictionCol

Random Forest Regression | pyspark.ml.regression | RandomForestRegressor | RandomForestRegressor | predictionCol

Gradient Boosted Trees Classification | pyspark.ml.classification | GBTClassifier | GradientBoostingClassifier | rawPredictionCol, probabilityCol, predictionCol

Gradient Boosted Trees Regression | pyspark.ml.regression | GBTRegressor | GradientBoostingRegressor | predictionCol
Spark ML Algorithms

Algorithm | Spark ML Package | Spark ML | Sklearn Equivalent | Output Parameter(s)

Support Vector Machines (SVM) | pyspark.ml.classification | LinearSVC | LinearSVC | rawPredictionCol, predictionCol

Naive Bayes | pyspark.ml.classification | NaiveBayes | GaussianNB | rawPredictionCol, probabilityCol, predictionCol

K-means | pyspark.ml.clustering | KMeans | KMeans | predictionCol


Model Evaluation

● The following evaluators are available in the pyspark.ml.evaluation package

Evaluator | Metrics Available

BinaryClassificationEvaluator | areaUnderROC, areaUnderPR

MulticlassClassificationEvaluator | f1, accuracy, weightedPrecision, weightedRecall, weightedTruePositiveRate, weightedFalsePositiveRate, weightedFMeasure, truePositiveRateByLabel, falsePositiveRateByLabel, precisionByLabel, recallByLabel, fMeasureByLabel, logLoss, hammingLoss
Model Evaluation

Evaluator | Metrics Available

RegressionEvaluator | rmse, mse, r2, mae, var

MultilabelClassificationEvaluator | subsetAccuracy, accuracy, hammingLoss, precision, recall, f1Measure, precisionByLabel, recallByLabel, f1MeasureByLabel, microPrecision, microRecall, microF1Measure

ClusteringEvaluator | silhouette
Building Pipeline
Building Spark ML Pipeline

● As discussed earlier, a spark pipeline is a sequence of Transformers and an Estimator

● These stages run in order and the dataframe is transformed as it passes through each
stage

● We will now see how to build a pipeline in pyspark


Building Spark ML Pipeline

● To build a pipeline, we import the Pipeline module from the pyspark.ml package

● Next, we create a pipeline object by passing all the transformers and an estimator as a list of stages

● This object is later fit on the raw training set, which creates a pipeline model

● This model is later used as a transformer to be applied on the test set to make predictions
Building Spark ML Pipeline

● Code: Building and implementing a spark ml pipeline
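A minimal sketch of such a pipeline, assuming Spark 3.x and DataFrames train_df/test_df with colour, age, salary, and label columns; all stage and column names are illustrative.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression

indexer = StringIndexer(inputCol="colour", outputCol="colour_index")
encoder = OneHotEncoder(inputCols=["colour_index"], outputCols=["colour_vec"])
assembler = VectorAssembler(inputCols=["age", "salary", "colour_vec"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

# The pipeline runs every stage in order when fit on the raw training data
pipeline = Pipeline(stages=[indexer, encoder, assembler, lr])
pipeline_model = pipeline.fit(train_df)

# The fitted pipeline model acts as a transformer on new data
predictions = pipeline_model.transform(test_df)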


Building Spark ML Pipeline

● Output
Model Persistence
Model Persistence

● In real-life scenarios, you will produce an ML model and hand it over to the development team for deployment in a production environment

● This becomes easier with model persistence

● Model persistence means saving your model to disk for later use, without the need to retrain your model
Model Persistence

● We use model.save(‘path’) to save our model at the desired location

● It might happen that you wish to retrain your model and save it at the same place

● In those cases, use model.write().overwrite().save(‘path’) to save your retrained model at the same place
Model Persistence

● Code - Saving the model
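A brief sketch, reusing the pipeline_model from the previous section; the path is illustrative.

# Save the fitted pipeline model to disk (the path is illustrative)
pipeline_model.save("/models/lr_pipeline")

# Or overwrite an existing model at the same path
pipeline_model.write().overwrite().save("/models/lr_pipeline")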


Model Persistence

● You can then load the model and perform predictions

● Use the PipelineModel module from the pyspark.ml package to load the persisted pipeline model

● The loaded model can then be used to perform predictions on test data
Model Persistence

● Code: Loading the model
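A brief sketch, assuming the illustrative path used above and a test_df DataFrame:

from pyspark.ml import PipelineModel

loaded_model = PipelineModel.load("/models/lr_pipeline")
predictions = loaded_model.transform(test_df)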


Model Persistence

● Output
Summary

● Spark MLlib is Apache Spark’s Machine Learning library

● The spark.mllib package is built on top of the RDD API

● The spark.ml package is built on top of the higher-level DataFrame API

● The Pipeline API chains Transformers and an Estimator, each as a stage, to specify an ML workflow

● The Spark ML library provides a number of transformers to preprocess the data

Thank You
