
Samsung Innovation Campus
Artificial Intelligence Course

Chapter 5. Machine Learning 1 – Supervised Learning


Chapter Description
Chapter objectives

‣ Be able to introduce machine learning-based data analysis according to the business objective, strategy, and policy, and manage the overall process.
‣ Be able to select and apply the machine learning algorithm that is most suitable for the given problem and perform hyperparameter tuning.
‣ Be able to design, maintain, and optimize a machine learning workflow for AI modeling using structured and unstructured data.

Chapter contents

‣ Unit 1. Machine Learning Based Data Analysis
‣ Unit 2. Application of Supervised Learning Model for Numerical Prediction
‣ Unit 3. Application of Supervised Learning Model for Classification
‣ Unit 4. Decision Tree
‣ Unit 5. Naïve Bayes Algorithm
‣ Unit 6. KNN Algorithm
‣ Unit 7. SVM Algorithm
‣ Unit 8. Ensemble Algorithm

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 3


Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 4


1.1. What is machine learning? UNIT
01

What is machine learning?


What is machine learning?

‣ A statistical model that learns from data.


‣ A rather simple model can make complex predictions.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 5


1.1. What is machine learning? UNIT
01

Samuel’s definition in the early phase of artificial intelligence


‣ “Programming Computers to learn from experience should eventually eliminate the need for much
of this detailed programming effort.” - Samuel, 1959

Modern definition
‣ “A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.” – Mitchell, 1997 (p.2)
‣ “Programming computers to optimize a performance criterion using example data or past
experience.”
–Alpaydin, 2010
‣ “Computational methods using experience to improve performance or to make accurate
predictions.” – Mohri, 2012

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 6


1.1. What is machine learning? UNIT
01

Mathematical definition
‣ Suppose that the x-axis is the invested advertising expenses (feature) while the y-axis is the sales (target).
‣ Question about prediction: what are the sales when arbitrary advertising expenses are given?
‣ Linear regression, with w and b as parameters:

  y = wx + b

  ('w' is commonly used as an abbreviation of 'weight.')
‣ Since the optimal value is unknown in the beginning, start with an arbitrary value and then reach the optimal value by gradually enhancing the performance.
  • In the figure, the fitted line starts from f1 and improves as f1 → f2 → f3.
  • The optimal fit is f3, where w = 0.5 and b = 2.0.

[Figure: candidate regression lines f1, f2, f3 for advertising expenses (x) vs. sales (y)]
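A minimal sketch of this idea with scikit-learn is shown below; the advertising/sales numbers are made up for illustration and are not the data behind the figure.

# A minimal sketch of fitting y = w*x + b with scikit-learn.
# The advertising/sales numbers below are made up for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[2], [4], [6], [8], [10]])   # advertising expenses (feature)
y = np.array([3.0, 4.1, 4.9, 6.2, 6.9])    # sales (target)

model = LinearRegression()
model.fit(X, y)                            # estimates the parameters from the data

print("w (weight):", model.coef_[0])       # slope learned from the data
print("b (bias):  ", model.intercept_)     # intercept learned from the data
print("prediction for x=7:", model.predict([[7]])[0])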

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 7


1.1. What is machine learning? UNIT
01

Statistics and machine learning from a data analysis perspective

[Figure: Venn diagram of machine learning and related fields — pattern recognition, artificial intelligence, statistics, deep learning, data mining, data science, databases, and computational neuroscience]

Connections among machine learning and related fields of study

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 8


1.1. What is machine learning? UNIT
01

Data mining and machine learning from a data analysis perspective

[Figure: the same Venn diagram as above, viewed from the data mining perspective]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 9


1.1. What is machine learning? UNIT
01

Types of machine learning according to methods of supervision

Machine Learning
‣ Supervised learning: the target pattern is given.
‣ Unsupervised learning: the target pattern must be found.
‣ Reinforcement learning: policy optimization.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 10


1.1. What is machine learning? UNIT
01

Machine learning workflow

‣ Problem definition: understanding the business, defining the problem, and exploring the data.
‣ Data preparation: data collection (raw data), pre-processing of data, and feature engineering.
‣ Modeling and optimization: model training on the data, using train / validate / test splits.
‣ Model performance evaluation: measuring performance metrics on the test data.
‣ Enhanced model performance and application to real life.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 11


1.1. What is machine learning? UNIT
01

Machine learning types:

Type                   Algorithm/Method
Unsupervised learning  Clustering; MDS, t-SNE; PCA, NMF; Association analysis
Supervised learning    Linear regression; Logistic regression; Tree, Random Forest, AdaBoost, XGBoost; Naïve Bayes; KNN; Support Vector Machine (SVM); Neural Network

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 12


1.1. What is machine learning? UNIT
01

Parameters vs. Hyperparameters


Parameters
‣ Learned from data by training and not manually set by the practitioner.
‣ Contain the data pattern.

Ex: Coefficients of linear regression
Ex: Weights of a neural network

Hyperparameters
‣ Can be set manually by the practitioner.
‣ Can be tuned to optimize the machine learning performance.

Ex: k (number of neighbors) in the KNN algorithm
Ex: Learning rate in a neural network
Ex: Maximum depth in a tree algorithm
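The sketch below illustrates the distinction, using the iris data as a stand-in: n_neighbors is set by the practitioner before training, while the linear regression coefficients are learned by fit.

# Hyperparameters are set before training; parameters are learned from the data.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LinearRegression

X, y = load_iris(return_X_y=True)

# Hyperparameter: k (n_neighbors) is chosen by the practitioner.
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X, y)

# Parameters: the coefficients of a linear regression are learned by fit().
reg = LinearRegression().fit(X[:, :3], X[:, 3])  # predict petal width from the other features
print("learned coefficients:", reg.coef_)        # parameters containing the data pattern
print("learned intercept:   ", reg.intercept_)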

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 13


Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 14


1.2. Python scikit-learn library for machine learning UNIT
01

Features of the scikit-learn library


Features
‣ Integrated library interface by applying the façade design pattern
‣ Ships with a wide range of machine learning algorithms, model selection, and data pre-processing functions
‣ Simple and efficient tools for predictive data analysis
‣ Based on NumPy, SciPy, and matplotlib
‣ Easily accessible and reusable in many different situations
‣ Highly compatible with other libraries
‣ Does not support GPU acceleration
‣ Open source and usable for commercial purposes

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 15


1.2. Python scikit-learn library for machine learning UNIT
01

Mechanism of scikit-learn
Scikit-learn is characterized by its intuitive and easy-to-use, high-level API.

[Diagram: Instance → Fit → Predict / Transform]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 16


1.2. Python scikit-learn library for machine learning UNIT
01

Estimator, Classifier, Regressor


‣ An estimator refers to an object that can fit a model and infer properties of new data based on the training data.
‣ A classifier refers to a class that implements a classification algorithm, while a regressor refers to a class that implements a regression algorithm.

Estimator
  Training: .fit
  Prediction: .predict

Classifier                  Regressor
DecisionTreeClassifier      LinearRegression
KNeighborsClassifier        KNeighborsRegressor
GradientBoostingClassifier  GradientBoostingRegressor
GaussianNB …                Ridge …

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 17


1.2. Python scikit-learn library for machine learning UNIT
01

Scikit-Learn Library
About the Scikit-Learn library
‣ It is a representative Python machine learning library.
‣ To import a machine learning algorithm as a class:
from sklearn.<family> import <machine learning algorithm>
Ex from sklearn.linear_model import LinearRegression
‣ Hyperparameters are specified when the machine learning object is instantiated:

Ex myModel = KNeighborsClassifier(n_neighbors=10) # KNN with k = 10

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 18


1.2. Python scikit-learn library for machine learning UNIT
01

About the Scikit-Learn library


‣ To train a supervised learning model: myModel.fit(X_train, Y_train)
‣ To train an unsupervised learning model: myModel.fit(X_train)
‣ To predict using an already trained model: myModel.predict(X_test)
‣ To import a preprocessor as a class: from sklearn.preprocessing import <a preprocessor>
‣ To split the dataset into a training set and a testing set:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)
‣ To calculate a performance metric (accuracy): metrics.accuracy_score(Y_test, Y_pred)
‣ To cross-validate and tune hyperparameters at the same time:
myGridCV = GridSearchCV(estimator, parameter_grid, cv=k)
myGridCV.fit(X_train, Y_train)
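These calls can be strung together as in the hedged sketch below; the iris data and the small parameter grid are placeholders chosen for illustration only.

# A sketch that combines the calls summarized above.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

X, Y = load_iris(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=123)

# Cross-validation and hyperparameter tuning at the same time.
parameter_grid = {"n_neighbors": [3, 5, 7, 9]}
myGridCV = GridSearchCV(KNeighborsClassifier(), parameter_grid, cv=5)
myGridCV.fit(X_train, Y_train)

Y_pred = myGridCV.predict(X_test)                     # predict with the best model found
print("best hyperparameters:", myGridCV.best_params_)
print("accuracy:", metrics.accuracy_score(Y_test, Y_pred))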

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 19


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn
‣ The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch
popular reference datasets. It also features some artificial data generators.

‣ Import data with the load_breast_cancer().

‣ The returned object is a Bunch: a container object exposing keys as attributes.

Bunch objects are sometimes used as an output for functions and methods. They extend dictionaries by enabling values to be accessed by key, bunch["value_key"], or by attribute, bunch.value_key.
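A short sketch of this is shown below; the printed keys are abbreviated in the comments.

# Loading a reference dataset and inspecting the returned Bunch object.
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.keys())                 # e.g. dict_keys(['data', 'target', 'target_names', ...])
print(data["feature_names"][:5])   # values can be accessed by key ...
print(data.target_names)           # ... or as attributes
print(data.data.shape)             # 569 observations, 30 features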

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 20


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 5
• This data becomes x (independent variable, data).

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 21


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 6
• This data becomes y (dependent variable, actual value).

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 22


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 1
• Provides details about the data.
• The help shows that the default value of test_size is 0.25.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 23


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 7
• From the total of 569 observations, the data is divided into training and evaluation sets at a ratio such as 7:3 or 8:2.
  The default ratio is 7.5:2.5 (test_size=0.25).

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 24


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn
‣ Use train_test_split() to split the data for making and evaluating the model.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 25


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 11
• 426 observed values (75%) out of total 569 observations are found.
Line 13
• 143 observed values (25%) out of total 569 observations are found.
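Assuming the breast cancer data introduced above, the split described on these slides might look like the sketch below; the default test_size of 0.25 gives 426 training and 143 test samples.

# Extracting X and y from the Bunch and splitting with the default test_size (0.25).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X = data.data        # independent variables (features)
y = data.target      # dependent variable (actual labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape)   # (426, 30) -> 75% of the 569 observations
print(X_test.shape)    # (143, 30) -> 25% of the 569 observations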

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 26


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn
‣ When instantiating, pass the model's hyperparameters as arguments. A hyperparameter is an option that must be set by a human and greatly affects model performance.

Line 1-5
• Loading the test data set

Line 1-8
• Instantiating the estimator and setting hyperparameters
• Model initialization using entropy for branching

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 27


1.2. Python scikit-learn library for machine learning UNIT
01

fit
‣ Use the fit method on an instantiated estimator for training. Pass the training data and label data together as arguments to the supervised learning algorithm.

predict
‣ The estimator instance that has completed training with fit can then be used with the predict method. predict returns the model's estimated results for the input data.

Line 2
• It is an estimated value, so the actual values for X_test may differ.
  Measure the accuracy by comparing the two.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 28


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 29


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 30


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 57
• Data frame shows a result where predicted value and actual value differ.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 31


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 32


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 33


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 66
• 133/143

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 34


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

‣ The model showed 93% accuracy. In practice, a further step to increase accuracy is often needed during data pre-processing, and standardization is one of the options. The following is a brief summary of standardization.
  • Standardization rescales values toward a standard normal distribution. Another term for standardization is z-transformation, and the standardized value is also referred to as the z-score. About 94% accuracy would be obtained from the KNN wine classification through standardization.
  • Standardization is widely used in data pre-processing in general, not only for KNN, and the equation is:

    z = (x − μ) / σ   (μ: mean, σ: standard deviation)

  • Standardization is available as the StandardScaler class in scikit-learn.
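A minimal sketch of standardization with StandardScaler, assuming the breast cancer split used earlier; the exact accuracy will depend on the data and the random split.

# Standardization (z-score) with StandardScaler before a KNN classifier.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)   # fit the scaler on the training data only
X_test_std = scaler.transform(X_test)         # reuse the same mean/std for the test data

knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_std, y_train)
print("accuracy after standardization:", knn.score(X_test_std, y_test))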

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 35


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 36


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 35
• Data frame before standardization

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 37


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 39
• The differences among column values are huge before standardization.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 38


1.2. Python scikit-learn library for machine learning UNIT
01

Practicing scikit-learn

Line 40
• After standardization, the column values do not significantly deviate from 0.
• Better performance would be possible compared to the performance before
standardization.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 39


1.2. Python scikit-learn library for machine learning UNIT
01

transform
‣ Feature processing is done with ‘transform’ to return the result.

Line 3-3
• Output before pre-processing

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 40


1.2. Python scikit-learn library for machine learning UNIT
01

transform
‣ Feature processing is done with ‘transform’ to return the result.

Line 3-7
• Pre-processing – apply scaling

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 41


1.2. Python scikit-learn library for machine learning UNIT
01

transform
‣ Feature processing is done with ‘transform’ to return the result.

Line 3-9
• Result check after pre-processing

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 42


1.2. Python scikit-learn library for machine learning UNIT
01

transform
‣ Feature processing is done with ‘transform’ to return the result.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 43


1.2. Python scikit-learn library for machine learning UNIT
01

fit_transform
‣ fit and transform are combined as fit_transform.

Line 4-1 & 4-5
• Before & after

Line 4-3
• Combination of fit and transform

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 44


1.2. Python scikit-learn library for machine learning UNIT
01

Major scikit modules

Classification        Module                      Embedded functions
Data example          sklearn.datasets            Data sets for practicing
Feature processing    sklearn.preprocessing       Pre-processing techniques (one-hot encoding, normalization, scaling, etc.)
                      sklearn.feature_selection   Techniques to search for and select features that have a significant impact on the model
                      sklearn.feature_extraction  Feature extraction from source data; the supporting API for image feature extraction is in the submodule image, while the supporting API for text feature extraction is in the submodule text
Dimension reduction   sklearn.decomposition       Algorithms related to dimension reduction (PCA, NMF, Truncated SVD, etc.)
Validation, hyper-    sklearn.model_selection     Validation, hyperparameter tuning, data separation, etc. (cross_validate, GridSearchCV, train_test_split, learning_curve, etc.)
parameter tuning,
data separation
Model evaluation      sklearn.metrics             Techniques to measure and evaluate model performance (accuracy, precision, recall, ROC curve, etc.)

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 45


1.2. Python scikit-learn library for machine learning UNIT
01

Major scikit modules

Classification              Module                 Embedded functions
Machine learning algorithm  sklearn.ensemble       Ensemble algorithms (Random Forest, AdaBoost, bagging, etc.)
                            sklearn.linear_model   Linear algorithms (linear regression, logistic regression, SGD, etc.)
                            sklearn.naive_bayes    Naive Bayes algorithms (BernoulliNB, GaussianNB, MultinomialNB, etc.)
                            sklearn.neighbors      Nearest neighbor algorithms (k-NN, etc.)
                            sklearn.svm            Support Vector Machine algorithms
                            sklearn.tree           Decision tree algorithms
                            sklearn.cluster        Unsupervised learning (clustering) algorithms (KMeans, DBSCAN, etc.)
Utility                     sklearn.pipeline       Serial composition of feature processing and machine learning algorithms, etc.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 46


Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 47


1.3. Preparation and division of data set UNIT
01

Preparation and division of data set


Chapter objectives
‣ Be able to understand the meaning and ripple effect of overfitting and generalization and design
data set division to solve issues.
‣ Be able to properly divide the training and test data sets for machine learning technique application
according to the analysis purpose and data set features.
‣ Be able to divide the training and validation data sets and decide the appropriate k-value for cross-
validation by deciding the necessity of cross-validation according to the issue and applied
technique. Be able to divide the data sets and perform sampling by considering the prediction
results based on data features and classified variable distribution.
‣ Be able to analyze differences among various sampling methods for data set division and apply
appropriate sampling methods.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 48


1.3. Preparation and division of data set UNIT
01

Necessity of data set division


‣ When analyzing machine learning-based data, especially when applying a supervised learning-
based model, do not analyze the overall data set but analyze by dividing the training and evaluation
(test) data sets.

Overall data set → division of the data set:
  • Training data set (perform k-fold cross-validation if necessary) → model → final model
  • Test data set → performance evaluation

[Figure: machine learning modeling process through division of training and test data sets]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 49


1.3. Preparation and division of data set UNIT
01

Overfitting and generalization of modeling


‣ Strictly speaking, the data included in the provided training data set can be considered values obtained by chance. Hence, a new data set obtained to predict the values of new objective (response) variables is not the same as the existing training data set.
‣ The chance that the patterns of the training data and new data agree perfectly is extremely low. Thus, when a machine learning model reflects the training data set too strongly, overfitting occurs: the model over-emphasizes the training data set's pattern, and generalization, i.e., accurate prediction on new data, suffers.
‣ To prevent such issues, the data is generally divided into training and test data sets. Measure how accurately the machine learning model learned from the training data set predicts the test data set's objective (response) variables. The result becomes the standard for evaluating model performance.
[Figure: polynomial fits of order 1, 2, 3, 4, and 12 to the same data — overfitting and underfitting]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 50


1.3. Preparation and division of data set UNIT
01

Overfitting and generalization of modeling


[Figure: polynomial fits of order 1, 2, 3, 4, and 12 — overfitting and underfitting]

‣ Even if machine learning finds the optimal solution for the data distribution, a wide margin of error occurs when the model has too small a capacity. Such a phenomenon is referred to as underfitting; the linear model in the leftmost figure above is an example.
‣ An easy alternative is to use higher-degree polynomials, which are non-linear models.
‣ The rightmost figure above uses a 12th-order polynomial.
‣ The model capacity is larger, and there are 13 parameters to estimate:

  y = w_{12} x^{12} + w_{11} x^{11} + w_{10} x^{10} + \cdots + w_1 x + w_0
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 51
1.3. Preparation and division of data set UNIT
01

Overfitting
‣ When choosing a 12th-order polynomial curve, it approximates the training set almost perfectly.
‣ However, an issue occurs when predicting new data.
  • The value around the red bar at x₀ should be predicted, but the red dot is predicted instead.
‣ The reason is the large capacity of the model.
  • Absorbing the noise during the learning process → overfitting
‣ Model selection is required to choose a model of adequate size.

[Figure: inaccurate prediction at x₀ by the overfitted 12th-order polynomial]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 52


1.3. Preparation and division of data set UNIT
01

Overfitting and generalization of modeling


‣ As the flexibility of the machine learning technique increases (in other words, as the order of the polynomial rises and the model can match the given data patterns more closely):
  • the root mean squared error (RMSE) of the training data set shows a monotone decreasing trend;
  • the RMSE of the test data set declines at first as the order of the polynomial rises, but increases again after a certain point.
‣ Summing up, the figure shows an overfitting trend that reflects the training data set pattern too strongly beyond the 4th-order polynomial. If there were no RMSE computed on the test data, the RMSE of the training data would keep declining as the polynomial order rises, leading to the selection of an overfitted model.
‣ Thus, test data is required.

[Figure: RMSE of the training set and test set plotted against flexibility (degree of the polynomial)]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 53


1.3. Preparation and division of data set UNIT
01

Method and process of data set division


Cross-Validation:
‣ The data should be split into a training set and a testing set.
‣ In principle, the testing set should be used only once! It should not be reused!
‣ If the training set is also used for evaluation, the errors can be unrealistically small.
‣ We would like to evaluate realistic errors while training by splitting the training data into two.

[Diagram: Training Data → train / cross-validate;  Testing Data → evaluate]

Cross-Validation and Hyperparameter optimization:


‣ As we can repeatedly evaluate errors while training, it is also possible to tune the hyperparameters.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 54


1.3. Preparation and division of data set UNIT
01

Cross-Validation:
1) Split the data into a training set and a testing set.
2) Further subdivide the training set into a smaller training and a validation set.
3) Train the model with the smaller training set.
4) Evaluate the errors with the validation set.
5) Repeat from step 2) a few times.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 55


1.3. Preparation and division of data set UNIT
01

Cross-Validation method: k-Fold

[Figure: k-fold cross-validation — the validation fold rotates across the k parts of the training data]

‣ Subdivide the training dataset into 𝑘 equal parts. Then, apply sequentially.
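A minimal sketch of k-fold cross-validation with scikit-learn's KFold and cross_val_score, using the iris data and k = 5 purely for illustration.

# 5-fold cross-validation: each of the 5 parts serves once as the validation fold.
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=kfold)

print("score per fold:", scores)
print("average score :", scores.mean())   # the final score is the average over the folds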

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 56


1.3. Preparation and division of data set UNIT
01

Cross-Validation method: Leave One Out (LOO)

[Figure: leave-one-out cross-validation — a single observation is held out for validation in each round]

‣ Leave only one observation for validation. Apply sequentially. More time consuming.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 57


1.3. Preparation and division of data set UNIT
01

Cross-Validation method: k-fold cross-validation in practice

‣ The measurement is repeated for k rounds (n = k; n = 10 most of the time).
‣ In each round, one of the k folds is used as the validation set and the remaining folds are used as training sets; the validation fold rotates from round 1 to round 10.
‣ An accuracy is obtained per round (e.g., 90%, 91%, 93%, ..., 95%).
‣ The final accuracy is the average over all rounds (Round 1, Round 2, ..., Round 10).

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 58


Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 59


1.4. Data pre-processing for making a good training data set UNIT
01

Missing value processing


Data cleansing for machine learning-based data analysis uses missing value and noise
processing to eliminate discrepancies in collected data.
‣ Missing value processing is done as follows.
‣ First, import the iris data for quick examination.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 60


1.4. Data pre-processing for making a good training data set UNIT
01

Missing value processing

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 61


1.4. Data pre-processing for making a good training data set UNIT
01

Missing value processing


1) Ignore the record (row)
• In data classification, ignore the record if the class label is not distinguished.

Ex In the case of the iris data, ignore the fourth row as shown on the table below.

     x1            x2           x3            x4           y
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
1    5.1           3.5          1.4           0.2          setosa
2    4.9           3.0          1.4           0.2          setosa
3    4.7           3.2          1.3           0.2          setosa
4    4.6           3.1          1.5           0.2          (missing)
5    5.0           3.6          1.4           0.2          setosa
6    5.4           3.9          1.7           0.4          setosa

• 'Ignore the record' is extremely inefficient if missing values occur frequently.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 62


1.4. Data pre-processing for making a good training data set UNIT
01

Missing value processing


2) Insert the missing value
• Enter a certain value like ‘unknown’ for missing value. Or enter the average value of data, such as
the overall average value, median value, or class that belongs to the same record.

     x1            x2           x3            x4           y
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
1    5.1           3.5          1.4           0.2          setosa
2    4.9           3.0          1.4           0.2          setosa
3    4.7           3.2          1.3           0.2          setosa
4    4.6           3.1          1.5           0.2          unknown
5    5.0           3.6          1.4           0.2          setosa
6    5.4           3.9          1.7           0.4          setosa

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 63


1.4. Data pre-processing for making a good training data set UNIT
01

Missing value processing


2) Insert the missing value
• The average value of Sepal.Length for iris is 5.843 as provided earlier, so insert 5.843.
     x1            x2           x3            x4           y
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
1    5.1           3.5          1.4           0.2          setosa
2    4.9           3.0          1.4           0.2          setosa
3    4.7           3.2          1.3           0.2          setosa
4    (missing)     3.1          1.5           0.2          setosa
5    5.0           3.6          1.4           0.2          setosa
6    5.4           3.9          1.7           0.4          setosa

     x1            x2           x3            x4           y
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
1    5.1           3.5          1.4           0.2          setosa
2    4.9           3.0          1.4           0.2          setosa
3    4.7           3.2          1.3           0.2          setosa
4    5.843         3.1          1.5           0.2          setosa
5    5.0           3.6          1.4           0.2          setosa
6    5.4           3.9          1.7           0.4          setosa
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 64
1.4. Data pre-processing for making a good training data set UNIT
01

Missing value processing


3) Manual entry
• A person in charge (or an expert) should check the data and modify it into an appropriate value.
• It requires a lot of time but provides high reliability.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 65


1.4. Data pre-processing for making a good training data set UNIT
01

Missing value processing


‣ Identify the missing value from the table type of data below and make appropriate processing.
• In Python, a missing value is represented as np.nan or a null value.
• 'nan' is an abbreviation of 'Not a Number.'

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 66


1.4. Data pre-processing for making a good training data set UNIT
01

Missing value processing

‣ The omitted value is changed to NaN. In this case, it is not problematic due to the low number of
data, but it is extremely inconvenient to manually find missing values from a huge data frame.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 67


1.4. Data pre-processing for making a good training data set UNIT
01

Missing value processing


‣ isnull() returns a data frame of Booleans, showing whether each cell holds a value (False) or the value is missing (True). Then, sum() is used to obtain the number of missing values per column.
‣ It is mandatory to check the number of missing values when importing data.

Line 17
• The number of missing values can be counted.
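A small sketch of this counting pattern on a made-up frame:

# Counting missing values per column with isnull() + sum().
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, np.nan],
                   "B": [5.0, np.nan, np.nan],
                   "C": [1.0, 2.0, 3.0]})

print(df.isnull())        # True where a value is missing
print(df.isnull().sum())  # number of missing values per column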

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 68


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value


‣ Use dropna() to completely delete a certain training sample (row) or feature (column).
‣ help(df.dropna) shows axis=0 is default.

Line 18
• axis=0 is default, so the row with the NaN value is deleted.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 69


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value


‣ To apply the deletion directly to the object, do not omit the inplace=True option.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 70


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value


‣ Delete the row with missing value.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 71


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 72


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value


‣ If every value in a row is NaN, use how='all' to delete it.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 73


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 74


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 75


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value


‣ Use thresh to keep only rows that have at least a given number of non-NaN values; rows below the threshold are deleted.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 76


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 77


1.4. Data pre-processing for making a good training data set UNIT
01

Removing the training sample or feature with missing value


‣ When deleting rows with NaN in a certain column, use subset.
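The dropna() options discussed on the preceding slides can be sketched as follows; the small frame and the thresh value are illustrative only.

# Illustrative dropna() variations on a small made-up frame.
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, np.nan, 4.0],
                   [5.0, 6.0, np.nan, 8.0],
                   [np.nan, np.nan, np.nan, np.nan]],
                  columns=["A", "B", "C", "D"])

print(df.dropna())               # axis=0 (default): drop rows containing any NaN
print(df.dropna(axis=1))         # drop columns containing any NaN
print(df.dropna(how="all"))      # drop only rows where every value is NaN
print(df.dropna(thresh=3))       # keep rows with at least 3 non-NaN values
print(df.dropna(subset=["C"]))   # drop rows with NaN in column 'C'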

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 78


1.4. Data pre-processing for making a good training data set UNIT
01

Imputation
‣ It is sometimes hard to delete a training sample or a certain column because too much useful data would be lost. In that case, estimate the missing values from the other training samples in the data set by interpolation (e.g., kriging). The most commonly used method is mean imputation, which replaces the missing value with the overall average of the column. In scikit-learn, use the SimpleImputer class.

‣ Impute using df.values.
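A minimal sketch of mean imputation with SimpleImputer on df.values; the frame below is made up for illustration.

# Mean imputation with scikit-learn's SimpleImputer.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame([[1.0, 2.0, 3.0],
                   [4.0, np.nan, 6.0],
                   [7.0, 8.0, np.nan]],
                  columns=["A", "B", "C"])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(df.values)                   # learn the column means from the data
imputed = imputer.transform(df.values)   # replace NaN with the learned means
print(imputed)

# strategy="median" or strategy="most_frequent" can be used instead of the mean.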

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 79


1.4. Data pre-processing for making a good training data set UNIT
01

Imputation

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 80


1.4. Data pre-processing for making a good training data set UNIT
01

Imputation

Line 45
• Check it is the average of the column.

‣ For strategy parameters, median and most_frequent can also be set.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 81


1.4. Data pre-processing for making a good training data set UNIT
01

Review on the scikit-learn estimator API


‣ In the previous section, the missing values of the data set were imputed using the SimpleImputer class of scikit-learn.
‣ The SimpleImputer class is a transformer class of scikit-learn that is used for data conversion.
‣ The two main methods of such an estimator are fit and transform.
‣ Use the fit method to learn the model parameters from the training data.
‣ Use the transform method to convert the data using the learned parameters.
‣ The data array to convert must have the same number of features as the data used for model learning.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 82


1.4. Data pre-processing for making a good training data set UNIT
01

Categorical data processing


‣ Data is generally classified into categorical and continuous scales depending on its features.
‣ In liberal arts and social science, questionnaires are mainly used to collect data.
‣ Categorical scale
  • A scale that distinguishes data into different categories; it is classified into the nominal and ordinal scales.
‣ Continuous scale
  • A scale for continuous data, divided according to the purpose of the survey; it is classified into the interval and ratio scales.
‣ Actual data sets often include more than one categorical feature. As explained earlier, categorical data is divided into ordered and unordered features. The ordinal scale can be referred to as an ordered categorical scale whose values can be arranged in a sequence.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 83


1.4. Data pre-processing for making a good training data set UNIT
01

Categorical data processing

‣ The data in the table has both ordered and unordered features. The size is ordered, but the color is not. Thus, size is classified as an ordinal scale, while color is a nominal scale.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 84


1.4. Data pre-processing for making a good training data set UNIT
01

Categorical data processing


‣ Convert ordered data into numerical values. The reason for changing text data into numerical data is to allow a computer to perform arithmetic operations.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 85


1.4. Data pre-processing for making a good training data set UNIT
01

Categorical data processing

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 86


1.4. Data pre-processing for making a good training data set UNIT
01

Class label encoding


‣ The class refers to the y value, a column with an actual value.
‣ Create a mapping to convert the class label from strings to integers.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 87


1.4. Data pre-processing for making a good training data set UNIT
01

Class label encoding


‣ 'enumerate' creates an object with an index.
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 88


1.4. Data pre-processing for making a good training data set UNIT
01

Class label encoding


‣ ‘enumerate’ creates an object with an index.

Line 62
• Change the class label from strings to integers.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 89


1.4. Data pre-processing for making a good training data set UNIT
01

Class label encoding


‣ Since the method in the previous slide is rather inconvenient, scikit-learn supports LabelEncoder for
easy conversion.
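A minimal sketch of LabelEncoder; the species strings below are illustrative.

# Class label encoding: strings -> integers and back.
from sklearn.preprocessing import LabelEncoder

labels = ["setosa", "versicolor", "virginica", "setosa", "virginica"]

le = LabelEncoder()
encoded = le.fit_transform(labels)     # strings converted to integers
print(encoded)                         # [0 1 2 0 2]
print(le.classes_)                     # ['setosa' 'versicolor' 'virginica']
print(le.inverse_transform(encoded))   # back to the original strings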

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 90


1.4. Data pre-processing for making a good training data set UNIT
01

Class label encoding


‣ 'enumerate' turns an iterable into an enumerated object with an index.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 91


1.4. Data pre-processing for making a good training data set UNIT
01

Application of one-hot encoding to unordered feature


There are cases when it is impossible to directly use categorical data in machine learning
algorithms such as regression analysis, etc. If so, conversion is required to be recognized by a
computer.
‣ In such cases, use a dummy variable expressed as 0 or 1. 0 or 1 does not represent how the number
is large or small but shows whether a certain feature is present.
‣ If a certain feature is present, it is expressed as 1; if it’s not found, it is classified as 0. Likewise, one-
hot encoding converts categorical data to a one-hot vector consisting of 0 or 1 that can be
recognized by a computer.
‣ Practice with the iris.target object.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 92


1.4. Data pre-processing for making a good training data set UNIT
01

The encoding is done with integers, so insert the iris ‘species’ value.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 93


1.4. Data pre-processing for making a good training data set UNIT
01

The encoding is done with integers, so insert the iris ‘species’ value.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 94


1.4. Data pre-processing for making a good training data set UNIT
01

Use the get_dummies() function of pandas to convert every unique value of a categorical variable into a new dummy variable.
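A minimal sketch of get_dummies() applied to the iris species labels.

# One dummy (0/1) column per unique species value.
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
species = pd.Series(iris.target_names[iris.target], name="species")

dummies = pd.get_dummies(species)   # one dummy column per unique value
print(dummies.head())
print(dummies.sum())                # 50 rows per species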

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 95


1.4. Data pre-processing for making a good training data set UNIT
01

Use the sklearn library to conveniently perform one-hot encoding. The result is given as a sparse matrix (a term from linear algebra). In a sparse matrix, most entries are 0. The opposite concept is a dense matrix.

[Figure: example of a sparse matrix — only 9 of its 35 entries are nonzero]
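A minimal sketch of OneHotEncoder on the iris target; by default the result is returned as a SciPy sparse matrix.

# One-hot encoding with scikit-learn; the output is sparse by default.
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

iris = load_iris()
species = iris.target.reshape(-1, 1)        # OneHotEncoder expects a 2-D array

encoder = OneHotEncoder()                   # returns a sparse matrix by default
onehot_sparse = encoder.fit_transform(species)
print(type(onehot_sparse))                  # a SciPy sparse matrix
print(onehot_sparse.toarray()[:3])          # convert to a dense array to inspect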

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 96


1.4. Data pre-processing for making a good training data set UNIT
01

OneHotEncoder

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 97


1.4. Data pre-processing for making a good training data set UNIT
01

OneHotEncoder

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 98


1.4. Data pre-processing for making a good training data set UNIT
01

OneHotEncoder

#Convert to a sparse matrix

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 99


1.4. Data pre-processing for making a good training data set UNIT
01

Conversion to the sparse matrix

Line 82
• (0, 0) is 1, thus setosa (the first 50 rows are setosa).

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 100


1.4. Data pre-processing for making a good training data set UNIT
01

Refer to the figure below for easier understanding.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 101


1.4. Data pre-processing for making a good training data set UNIT
01

Refer to the figure below for easier understanding.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 102


1.4. Data pre-processing for making a good training data set UNIT
01

Refer to the figure below for easier understanding.

Row index  Species     setosa  versicolor  virginica   Sparse matrix expression
0          setosa      1       0           0           (0, 0)
1          setosa      1       0           0           (1, 0)
...        setosa      1       0           0
49         setosa      1       0           0           (49, 0)
50         versicolor  0       1           0           (50, 1)
51         versicolor  0       1           0           (51, 1)
...        versicolor  0       1           0
100        versicolor  0       1           0
101        virginica   0       0           1           (101, 2)
102        virginica   0       0           1           (102, 2)
...        virginica   0       0           1
150        virginica   0       0           1           (150, 2)

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 103


1.4. Data pre-processing for making a good training data set UNIT
01

Using hold-out in real life that splits the data set into training and test data sets
‣ df_wine contains measurements of wines produced in Vinho Verde, a region adjacent to the Atlantic Ocean in the northwest of Portugal. The grade, taste, and acidity of 1,599 red wine and 4,898 white wine samples were measured and analyzed to create the data. If the data is not found at the following path, it is possible to import it locally after downloading it directly from the UCI repository.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 104


1.4. Data pre-processing for making a good training data set UNIT
01

Using hold-out in real life that splits the data set into training and test data sets

Line 85
• When it is not accessible to the wine data set of the UCI machine learning repository,
remove the remark of the following code and read the data set from the local route:
• df_wine = pd.read_csv(‘wine.data’, header=None)

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 105


1.4. Data pre-processing for making a good training data set UNIT
01

Using hold-out in real life that splits the data set into training and test data sets

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 106


1.4. Data pre-processing for making a good training data set UNIT
01

Using hold-out in real life that splits the data set into training and test data sets
‣ Data splitting is possible by using the train_test_split function provided in the model_selection module of scikit-learn. First, convert the features from index 1 to 13 into a NumPy array and assign it to variable X. The train_test_split function returns four arrays, so assign them to appropriately named variables.

‣ Randomly split X and y into training and test data sets. With test_size=0.3, 30% of the samples are assigned to X_test and y_test.
‣ Regarding the stratify parameter, if the class label array y is passed, the class ratio found in the training and test data sets is kept identical to that of the original data set.
‣ The most widely used ratios in real life are 6:4, 7:3, and 8:2, depending on the size of the data set. For large data sets, it is common and suitable to split the training and test data sets at a ratio of 9:1 or even 9.9:0.1.
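A hedged sketch of this hold-out split; it assumes wine.data (the UCI Wine data set, no header row, class label in column 0) has already been downloaded to the working directory.

# Hold-out split of the wine data with a stratified 7:3 ratio.
import pandas as pd
from sklearn.model_selection import train_test_split

df_wine = pd.read_csv("wine.data", header=None)

X = df_wine.iloc[:, 1:].values   # features: columns 1..13 as a NumPy array
y = df_wine.iloc[:, 0].values    # class label: column 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)  # keep the class ratio identical

print(X_train.shape, X_test.shape)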
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 107
1.4. Data pre-processing for making a good training data set UNIT
01

Arranging the scale between features (variables)


‣ Refer to the practical code (Chapter5_Unit1_Machine Learning-Based Data Analysis 1 (Supervised
Learning)) for detailed code.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 108


1.4. Data pre-processing for making a good training data set UNIT
01

Arranging the scale between features (variables)


‣ Refer to the practical code (Chapter5_Unit1_Machine Learning-Based Data Analysis 1 (Supervised
Learning)) for detailed code.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 109


1.4. Data pre-processing for making a good training data set UNIT
01

Arranging the scale between features (variables)


‣ Refer to the practical code (Chapter5_Unit1_Machine Learning-Based Data Analysis 1 (Supervised
Learning)) for detailed code.

# MaxAbsScaler divides each feature by its maximum absolute value, so the maximum absolute value of each feature becomes 1.
# Every feature is mapped into the [-1, 1] range.
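A small sketch comparing the scalers mentioned here on a single made-up feature.

# Comparing StandardScaler, MinMaxScaler, and MaxAbsScaler on one feature.
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, MaxAbsScaler

X = np.array([[-4.0], [-2.0], [0.0], [2.0], [10.0]])

print(StandardScaler().fit_transform(X).ravel())  # zero mean, unit variance
print(MinMaxScaler().fit_transform(X).ravel())    # rescaled to [0, 1]
print(MaxAbsScaler().fit_transform(X).ravel())    # divided by max |x|, so range [-1, 1]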

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 110


1.4. Data pre-processing for making a good training data set UNIT
01

Limiting model complexity through L1 and L2 regularizations


Bias and Variance
‣ If the predicted values are highly deviated from the actual target value in general, it is said that
there’s a high bias in the result. When the predicted values are scattered far away from one another,
it is said that it has a high variance in the result.
‣ The following figure expresses the predicted results of the model on the target. High bias refers to the
result when it significantly deviates from the target’s center. On the other hand, if the predicted
values are gathered around the target’s center, it has a low bias. Bias refers to the overall similarity
between the predicted values and actual values. In the figure, high variance is when the predicted
values are greatly apart from one another. On the contrary, low variance is when the predicted values
are closely gathered together. Thus, variance refers to the overall similarity among predicted values.
[Figure: dartboard illustration of the four combinations of low/high bias and low/high variance]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 111


1.4. Data pre-processing for making a good training data set UNIT
01

trade-off
‣ Bias and variance have a trade-off relationship in which when one increases, the other falls, and vice
versa. The model becomes complex at the beginning of learning; the overall error cost falls due to
decreased bias. However, at some point, the model keeps learning and becomes much more
complicated, which causes higher variance and increased overall error cost. In other words, the
model gets overfitted to the training data. One way to prevent overfitting is to stop learning at the
appropriate time. Regularization is a method to prevent overfitting by lowering variance. Still, it can
increase bias instead due to the trade-off relationship.

[Figure: total error, variance, and bias² as a function of model complexity; the optimum model complexity minimizes the total error]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 112


1.4. Data pre-processing for making a good training data set UNIT
01

Ridge Regression
‣ The ridge regression model is a technique that limits the L2 norm of the regression coefficient vector w. A constraint that minimizes the sum of squared weights is added to the cost function of linear regression. If the linear regression model is

  \hat{y} = wX, \quad w = (w_1, \ldots, w_M),

‣ then the cost function of the ridge regression model is as follows. N is the number of data points, and M is the number of elements of the regression coefficient vector. A constraint is added to the existing SSE (Sum of Squared Errors):

  \hat{w}^{ridge} = \operatorname{argmin}_{w} \left\{ \sum_{i=1}^{N} (y_i - wX)^2 + \lambda \sum_{j=1}^{M} w_j^2 \right\}
‣ λ is a hyperparameter to adjust the weight of existing SSE and added constraint.
When the λ is large, regularization is greatly applied, and the regression coefficients become lower.
When the λ becomes smaller, regularization gets weaker. When the λ equals 0, the constraint clause
also becomes 0, the same as the general linear regression model.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 113


1.4. Data pre-processing for making a good training data set UNIT
01

Ridge Regression
‣ The following is an example of simple linear regression model equation:

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 114


1.4. Data pre-processing for making a good training data set UNIT
01

Ridge Regression
‣ When drawing the cost function SSE (w1, w2) on the coordinate with the x-axis and y-axis, an ellipse
is created as provided in the following figure:
[Figure: in the (w1, w2) plane, the cost-function contours (ellipses, "minimize cost") and the L2 constraint region λ∥w∥² (circle, "minimize penalty"); the ridge solution lies where they meet ("minimize cost + penalty")]

‣ In the figure above, the ellipse drawn in a solid line is the cost function, which is the combination of
w1 and w2 with the same cost (SSE). The central point of the ellipse is when the cost becomes 0.
Outward of the ellipse is the combination of w1 and w2 with higher cost, which is the model with
higher error (consists of w1 and w2 weights). The colored circle refers to the constraint. The circle
becomes smaller when the λ gets larger, and vice versa. The point where the cost function (ellipse)
and constraint (colored circle) meet is the optimal solution where the cost of the ridge regression
model is minimum.
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 115
1.4. Data pre-processing for making a good training data set UNIT
01

Lasso Regression
‣ The Lasso (Least Absolute Shrinkage and Selection Operator) regression model is a technique that limits the L1 norm of the regression coefficient vector w. A constraint that minimizes the sum of the absolute values of the weights is added to the cost function of linear regression. The cost function of the Lasso regression model is:

  \hat{w}^{lasso} = \operatorname{argmin}_{w} \left\{ \sum_{i=1}^{N} (y_i - wX)^2 + \lambda \sum_{j=1}^{M} |w_j| \right\}

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 116


1.4. Data pre-processing for making a good training data set UNIT
01

Lasso Regression
‣ When drawing the cost function of the Lasso regression model in the (w1, w2) plane, the constraint region λ∥w∥₁ is a rhombus, as shown in the following figure:

[Figure: cost-function ellipses and the rhombus-shaped L1 constraint region; the Lasso solution minimizes cost + penalty]

‣ Since the constraint of the Lasso regression model is a rhombus, it is highly possible that the point
meeting the cost function is the vertex of the rhombus. The vertexes of the rhombus are always the
points where w1 or w2 are 0. Thus, the Lasso regression model results in 0 weight.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 117


1.4. Data pre-processing for making a good training data set UNIT
01

Elastic-net regression
‣ The Elastic-net regression model applies both the L2 norm and the L1 norm to the regression coefficient vector. The constraint includes both the sum of squared weights and the sum of absolute weight values. The cost function of Elastic-net is shown below; there are two hyperparameters, λ1 and λ2:

  \hat{w}^{elastic} = \operatorname{argmin}_{w} \left\{ \sum_{i=1}^{N} (y_i - wX)^2 + \lambda_1 \sum_{j=1}^{M} w_j^2 + \lambda_2 \sum_{j=1}^{M} |w_j| \right\}

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 118


1.4. Data pre-processing for making a good training data set UNIT
01

Elastic-net regression
‣ Elastic-net applies both the L2 norm and the L1 norm at the same time, so its constraint region lies somewhere between the two. It shrinks large weights while driving unimportant weights to 0.

[Figure: constraint regions of L1 (rhombus), L2 (circle), and elastic-net (in between) in the (w1, w2) plane]
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 119


Unit 1.

Machine Learning Based Data Analysis


1.1. What is machine learning?
1.2. Python scikit-learn library for machine learning
1.3. Preparation and division of data set
1.4. Data pre-processing for making a good training data set
1.5. Practicing to find an optimal method to solve problems with scikit-learn

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 120


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Finding an optimal method to solve problems with scikit-learn


Practicing
‣ Use the following problem-solving methodology and consider which algorithm to apply. Perform pre-processing and the overall process on the iris data and compare with the result code provided below.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 121


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Practicing

Sample (instance, observation):

      Sepal length  Sepal width  Petal length  Petal width  Class label
1     5.1           3.5          1.4           0.2          Setosa
2     4.9           3.0          1.4           0.2          Setosa
...
50    6.4           3.5          4.5           1.2          Versicolor
...
150   5.9           3.0          5.0           1.8          Virginica

Features (properties, measurement values, dimensions): sepal length, sepal width, petal length, petal width
Class label (target): species

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 122


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Considerations in machine learning


1) Define the problem of the business and check for solutions or best alternative plans.
2) Check if it is possible to define it as a supervised or unsupervised problem.
3) Check which method to use for measuring model performance.
4) Check if the performance index is linked with the business objective and confirm if the project
participants made an agreement regarding the performance.
‣ In this practice problem, define the problem as classifying iris species (supervised learning) and suppose
that the result is satisfactory if the model achieves 85% or higher classification accuracy.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 123


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Understanding the iris data

Domain knowledge of the iris data


‣ Data name: IRIS
‣ Number of data: 150
‣ Number of variables: 5
‣ Understanding variables
Sepal Length Length information of the sepal
Sepal Width Width information of the sepal
Petal Length Length information of the petal
Petal Width Width information of the petal
Species Flower species, classified into setosa / versicolor / virginica

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 124


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Understanding the iris data

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 125


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Understanding the iris data

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 126


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Line 3-1 ~ 4
• Import the libraries required for practicing.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 127


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Line 4-1
• Convert the current variable data to a NumPy ndarray and a pandas DataFrame.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 128


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 129


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 130


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 131


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Line 1
• Merge feature and target.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 132


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Line 3 ~ 5
• Change the column name.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 133


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Line 10
• Change the target value.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 134


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Line 11
• Check the missing value.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 135


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA


‣ Basic statistical analysis
• Perform basic statistical analysis to better understand the data by understanding the data size
(numbers of data), the shape of data (matrix shape), data type, data distribution, and the
relationship between features. It will also enhance the performance of machine learning.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 136


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

Line 13
• petal_length has the greatest standard deviation, while petal_width has a narrower range of
values than the other features. Because of the scale differences between features, it would be
better to apply feature scaling (normalization) after checking the model performance.
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 137
1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA


‣ Correlation analysis
• Use corr to analyze the relationship among features.

Line 14
• The correlation coefficient of petal_length and petal_width is 0.962865, which is
extremely high. Since highly correlated features may induce multicollinearity problems,
it is recommended to select one of the two variables to use.
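‣ A minimal sketch of this correlation analysis, assuming the merged DataFrame df with the renamed columns from the earlier pre-processing steps (the column names are assumptions):

# Pairwise correlation between the four numeric features.
print(df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']].corr())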

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 138


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA


‣ Aggregation analysis

Line 15
• The number of data points in each target class was counted using the aggregation function 'size,'
and it was confirmed that each class contains exactly 50 samples. Select between 'size' and
'count' depending on the purpose of the analysis: 'size' counts the number of rows including
missing values, while 'count' counts only the rows without missing values. In this case, there is
no difference between 'size' and 'count' because the iris data does not have any missing values.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 139


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA


‣ Data visualization
• Basic statistical analysis has been done on the data, but tables of numbers are hard to read, and
misreading a decimal point can lead to significant errors. Visualizing the data as graphs gives the
reader an intuitive understanding and is also an efficient way to explain data analysis results.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 140


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA


‣ Visualizing basic statistics and outlier

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 141


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA

[Box plots of sepal_length, sepal_width (cm), petal_length, and petal_width]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 142


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA


‣ Visualizing data distribution

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 143


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Iris data pre-processing and EDA


‣ For sepal_width, the frequency of the median class interval is high, and it becomes lower farther from
the center. In the box plot, the box of sepal_width is short because a lot of data is aggregated around
the median. In the case of petal_length, the frequency of the median class interval is high, but there are
also many data points in the lower (left) class intervals. In the box plot, the box of petal_length extends
far toward the bottom because there are many data points with low values.

[Histograms of sepal_length, sepal_width (cm), petal_length, and petal_width]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 144


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Visualizing correlation

[Correlation heatmap of sepal_length, sepal_width (cm), petal_length, and petal_width; color scale from -1.00 to 1.00]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 145


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Visualizing the correlation between features and data distribution using pairplot

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 146


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Visualizing the correlation between features and data distribution using pairplot
[Pair plot of sepal_length, sepal_width (cm), petal_length, and petal_width, colored by species: setosa, versicolor, virginica]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 147


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Visualizing the correlation between features and data distribution using pairplot
‣ setosa forms a cluster that is clearly separated from the other classes. It can be separated by drawing an
imaginary straight line, so a linear model will classify setosa well. For versicolor and virginica, it seems
difficult to separate them with a line in the plot of the sepal_width and sepal_length features because they
are mixed together. However, even if the boundary looks a little vague, they can be separated in the other
feature plots.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 148


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Visualizing the class ratio of the target

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 149


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Visualizing the class ratio of the target

[Pie chart of the target class ratio: setosa 33.3%, versicolor 33.3%, virginica 33.3%]

Line 15
• The data is evenly arranged in each target class.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 150


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Visualizing the class ratio of the target

‣ Before starting machine learning, split the data set into training data and performance test data. The final
objective of machine learning is to create a generalized model that can accurately predict new data. If
performance is evaluated with the data used for learning, the model is likely to get it right simply because it
is already familiar with that data. For a reliable evaluation, keep the performance test data set separate
from the training data set. Because part of the data is held out from training, this is referred to as the
hold-out method.
‣ Split the training and performance test data sets with the train_test_split function of sklearn. Label the
training data as 'train' and the performance test data as 'test.' X is the feature part of the data set, and y is
the target. For structured data analysis, DataFrames are indicated with capital letters and Series with lower-case
letters. The test_size=0.33 option separates 33% of the total data as the test set. random_state=42 is an
option used to make the results reproducible for this practice problem; if random_state is not specified, the
resulting split will differ every time. A minimal sketch is given below.
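‣ A minimal sketch of the hold-out split described above (150 samples with test_size=0.33 give 100 training and 50 test samples):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris features X and target y.
iris = load_iris()
X, y = iris.data, iris.target

# Hold out 33% of the data as the performance test set; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

print(X_train.shape, X_test.shape)   # (100, 4) (50, 4)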

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 151


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Algorithm selection

[scikit-learn algorithm cheat sheet: a flowchart that starts from the amount of labeled data and the task type
(classification, regression, clustering, dimensionality reduction) and suggests estimators such as SGD Classifier,
Linear SVC, KNeighbors Classifier, SVC, Ensemble Classifiers, Naive Bayes, SGD Regressor, Lasso, ElasticNet,
Ridge Regression, SVR (linear / rbf kernel), Ensemble Regressors, KMeans, MiniBatch KMeans, MeanShift, VBGMM,
Spectral Clustering, GMM, PCA, Isomap, Spectral Embedding, LLE, and kernel approximation.]
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 152


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Algorithm selection
‣ As seen in the pair plot, setosa is clearly separated from the other classes and can be separated with a
straight line, so a linear model is a reasonable starting point. versicolor and virginica are mixed in the plot
of the sepal_width and sepal_length features, but even if the boundary looks a little vague, they can be
separated in the other feature plots.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 153


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Algorithm selection

‣ # Regularization: constrains the degrees of freedom of the decision tree.
‣ # Lowering max_depth constrains the model and reduces the risk of overfitting.
‣ # min_samples_split: minimum number of samples a node must have before it can be split.
‣ # min_weight_fraction_leaf: same as min_samples_leaf, but expressed as a fraction of the total number of
weighted samples.
‣ # max_leaf_nodes: maximum number of leaf nodes.
‣ # max_features: maximum number of features that are evaluated for splitting at each node.
‣ # Increasing a parameter that starts with min_ or lowering a parameter that starts with max_
increases the model constraint (see the sketch below).
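‣ A minimal sketch of constraining a decision tree with these hyperparameters (the specific values are illustrative, not the ones used in the practice code):

from sklearn.tree import DecisionTreeClassifier

# A constrained (regularized) decision tree; every value below is an illustrative assumption.
tree_clf = DecisionTreeClassifier(
    max_depth=4,            # limit the depth of the tree
    min_samples_split=4,    # a node needs at least 4 samples before it can be split
    min_samples_leaf=2,     # every leaf must keep at least 2 samples
    max_leaf_nodes=16,      # cap the total number of leaf nodes
    max_features=2,         # evaluate at most 2 features per split
    random_state=42)

tree_clf.fit(X_train, y_train)   # X_train, y_train from the earlier hold-out split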

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 154


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Algorithm selection
‣ Gini impurity or entropy
‣ In practice, the difference between Gini impurity and entropy is small; both usually create similar trees.
‣ Gini impurity is faster to compute, so it is a reasonable default.
‣ However, when they do create different trees, Gini impurity tends to isolate the most frequent class in one
branch, while entropy tends to produce a slightly more balanced tree.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 155


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Model learning
‣ Perform model learning with the training data to check the model performance. The current model is
set with default hyperparameters except for random_state.

Score
‣ Evaluate the performance using the performance test data set. In scikit-learn, score means accuracy for
classifiers. Since the iris data set is well structured for practice, almost any model shows high performance
on it. A minimal sketch is given below.
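‣ A minimal sketch of this step, assuming the hold-out split from earlier (the exact scores depend on the split):

from sklearn.tree import DecisionTreeClassifier

# Default hyperparameters except random_state.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

print("train accuracy:", model.score(X_train, y_train))
print("test accuracy :", model.score(X_test, y_test))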

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 156


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Model generalization strategy


‣ Because machine learning performance is data-driven, a sufficient amount of data is required for good
performance. Insufficient data may result in overfitting, which means the model fits only the peculiarities of
the training data and predicts unseen data poorly. The following generalization strategies help the model
perform well on unseen data.
Validation set
‣ The performance test data set created with train_test_split is reserved for the final performance evaluation of
the model. Since it is also necessary to check model performance during training, hold out some of
the data from the training data set and use it as a validation set. Overfitting can be detected during
learning by using the validation set, and it is also used for tuning hyperparameters.

training set validation set test set

Model fitting Parameter selection Evaluation

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 157


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Cross validation
‣ This strategy makes several validation sets so that every data point is used for validation exactly once.
Divide the data set into k folds (k-fold). Use the first fold as the validation set and the other k-1 folds as
the training set, and measure the performance.
‣ Then use the second fold as the validation set and the other folds as the training set, and measure the
performance again. Repeat the same process for all the remaining folds so that all data is included in
training. Obtain k performance scores and average them to estimate the model performance. The following
figure is an example with k=5.

[Figure: 5-fold cross-validation — Splits 1–5 on the vertical axis (CV iterations), Folds 1–5 on the horizontal
axis; in each split, one fold is used as the test (validation) fold and the remaining folds are used for training.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 158


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Cross_val_score
‣ Cross validation can be easily performed using the cross_val_score function of scikit-learn.

("{}th cross validation score :


{}".format(i,_))
("\ncross validation final score:
{}".format(fin_result))
0th cross validation score : 0.9
1st cross validation score : 1.0
2nd cross validation score : 0.8
3rd cross validation score : 1.0
4th cross validation score : 0.8
5th cross validation score : 0.9
6th cross validation score : 1.0
7th cross validation score : 0.9
8th cross validation score : 1.0
9th cross validation score : 1.0
Final cross validation score :
0.93
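‣ A minimal sketch of how fold scores like these can be produced with cross_val_score (the scores above come from the practice code; the snippet below is illustrative):

import numpy as np
from sklearn.model_selection import cross_val_score

# 10-fold cross validation of the decision tree on the training data.
scores = cross_val_score(model, X_train, y_train, cv=10)

for i, _ in enumerate(scores):
    print("{}th cross validation score : {}".format(i, _))

print("\ncross validation final score: {}".format(np.mean(scores)))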

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 159


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

stratified
‣ Randomly splitting the data into training and validation sets during hold-out can leave the target classes
unevenly represented. In that case, the data distribution differs between the training and validation sets,
which affects learning. Machine learning rests on the premise that the training data distribution matches the
real-world data distribution; if this premise does not hold, model performance falls. To prevent this issue,
the stratified method is used to keep the target class ratio even across the splits.
The following figure gives an intuitive picture of how the stratified method splits the data.

[Figure: stratified cross-validation — Splits 1–3 on the vertical axis (CV iterations); the data points (0–150)
are grouped by class label (Class 0, Class 1, Class 2), and each split takes the same proportion of every class
for its training data and test data.]

‣ Stratified cross validation can be performed by passing a StratifiedKFold instance to the cv option of
cross_val_score, as sketched below.
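‣ A minimal sketch, assuming the same training data and model as before:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# Keep the target class ratio identical in every fold.
skfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

stratified_scores = cross_val_score(model, X_train, y_train, cv=skfold)
print(stratified_scores.mean())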

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 160


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

stratified

("{}th stratified cross validation score: {}".format(i,_))

0th stratified cross validation score: 0.9


1st stratified cross validation score: 0.9
2nd stratified cross validation score: 0.8
3rd stratified cross validation score: 0.9
4th stratified cross validation score: 1.0
5th stratified cross validation score: 1.0
6th stratified cross validation score: 0.9
7th stratified cross validation score: 0.8
8th stratified cross validation score: 1.0
9th stratified cross validation score: 1.0

("\nstratified Final stratified cross validation score: {}".format(fin)result))

Final stratified cross validation score:


0.9199999999999999

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 161


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Learning Curve
‣ !pip install scikit-plot

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 162


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Learning Curve
‣ !pip install scikit-plot
• The green line is the cross-validation score. Overfitting shows up when the green line rises to the right
and then starts to fall. The red line is the score on the data used for training. The red line may dip
momentarily when more data is added, but this is only temporary and the curve converges in the long run.
• The cv option is not specified, so 3-fold is applied as the default. There are 100 samples in the training
set and 33% of them are used for cross-validation, so the maximum value of the x-axis is 66. The curve is
cut off while the green line is still rising, so we cannot see what would happen with more data; at
this point, it is impossible to say whether there is enough data. The learning curve also looks different
depending on the algorithm, even with the same data. What this learning curve does tell us is that the
performance of the current decision tree model would be better if there were more data.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 163


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Learning Curve
‣ !pip install scikit-plot

[Learning curve plot: x-axis = training examples, y-axis = score]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 164


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Learning Curve
‣ If there is enough data, the same data distribution is maintained even when the training and validation
sets are randomly split. The cross-validation method is required when there is insufficient data. Draw
a learning curve to determine whether there is enough data. The learning curve shows how
performance changes as the amount of training data is gradually increased, by setting the x-axis as the
number of training samples and the y-axis as the performance score. The test score is calculated by
internal cross-validation.
‣ The learning curve can be drawn with the scikitplot library, which works alongside scikit-learn. Install
the library separately from scikit-learn to use it.
‣ scikitplot is not provided in Anaconda by default, so it needs to be installed using the package
management tools. Run the following code in the Jupyter Notebook to install the library. Note that the
install name (scikit-plot) and the import name (scikitplot) are different. A minimal sketch follows below.
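‣ A minimal sketch of drawing the learning curve with scikit-plot, assuming the training data from the hold-out split:

# !pip install scikit-plot   (the install name differs from the import name)
import scikitplot as skplt
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(random_state=42)

# Learning curve of the decision tree on the training data.
skplt.estimators.plot_learning_curve(model, X_train, y_train)
plt.show()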

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 165


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Model optimization strategy

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 166


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Model optimization strategy


‣ #Hyperparameter
• In machine learning, the machine learns from the data and finds the model parameters by itself.
Hyperparameters are the parameters that cannot be learned by the machine and must be set directly by a human.
In scikit-learn, hyperparameters are set when instantiating an algorithm.
‣ #Hyperparameter search using GridSearchCV
• In general, hyperparameters are chosen based on the analyst's expertise.
scikit-learn provides the GridSearchCV class, which searches for hyperparameters by listing every
combination of the given hyperparameter values on a grid and then training and measuring performance for
each combination. It may look like brute force, but the machine does the work automatically within the
ranges designated by the analyst. Although it takes some time, it makes finding hyperparameters much easier.
‣ Like the algorithms themselves, GridSearchCV is instantiated. When instantiating it, pass the instantiated
algorithm model as the estimator argument, and pass a dictionary of the hyperparameter values to test as the
param_grid argument, as sketched below.
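‣ A minimal sketch of a grid search over decision tree hyperparameters (the grid below is illustrative and much smaller than the 1,600-combination grid of the practice problem):

from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [4, 8, 12],
    'min_samples_split': [2, 4]
}

grid_cv = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),   # the instantiated algorithm
    param_grid=param_grid,                               # hyperparameter values to test
    cv=10,                                               # 10-fold cross validation per combination
    refit=True)                                          # retrain with the best parameters

grid_cv.fit(X_train, y_train)
print(grid_cv.best_params_, grid_cv.best_score_)
best_model = grid_cv.best_estimator_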

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 167


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

The total number of hyperparameter combinations that can be made with the parameters in the practice
problem is 1,600. Since k=10 in the k-fold cross-validation, 10 cross-validations were performed for each
combination, so a total of 16,000 training runs were done. The following table shows the hyperparameter
combinations in the practice problem.
‣ The optimal parameters and the optimized performance found with GridSearchCV are recorded in the
best_params_ and best_score_ attributes.
‣ If the refit option is set to True, the model is retrained with the optimal hyperparameters and recorded in the
best_estimator_ attribute.

                           0      1       2      3       4      ‧‧‧   1595     1596     1597     1598     1599
criterion                  gini   gini    gini   gini    gini   ‧‧‧   entropy  entropy  entropy  entropy  entropy
max_depth                  4      4       4      4       4      ‧‧‧   12       12       12       12       12
min_impurity_decrease      0      0       0      0       0      ‧‧‧   0.2      0.2      0.2      0.2      0.2
min_weight_fraction_leaf   0      0       0      0       0      ‧‧‧   0.3      0.3      0.3      0.3      0.3
random_state               7      7       23     23      42     ‧‧‧   42       78       78       142      142
splitter                   best   random  best   random  best   ‧‧‧   random   best     random   best     random

6 rows × 1600 columns

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 168


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 169


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Evaluation criteria and model evaluation


‣ Use the X_test and y_test from hold out for the final model evaluation. For an accurate evaluation, it is
important to be aware of different kinds of evaluation criteria.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 170


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Evaluation criteria and model evaluation


‣ #Limitations of accuracy
• So far, accuracy has been the only criterion used to validate the model, but it has a limitation: accuracy
alone is not enough to evaluate a model properly.
Ex Consider a model that always predicts 'setosa' no matter what data it receives; such a model is clearly
of doubtful quality.

• Now assume the test set contains 48 setosas, 1 versicolor, and 1 virginica. Evaluated on this test set,
the model above would reach 96% accuracy, yet not because its performance is great. Other evaluation
criteria must also be checked to evaluate the model performance accurately.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 171


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Confusion Matrix
‣ The following confusion matrix can be expressed with binary classification.

Predicted positive class Predicted negative class

Actual Positive TP (True Positive) FN (False Negative)


Actual Negative FP (False Positive) TN (True Negative)

‣ Evaluation scores, including precision, recall, f1-score, and others, can be made based on the
abovementioned concepts (TP, FP, TN, FN).
‣ Use the confusion matrix to analyze both right and wrong predicted results. The confusion matrix
can validate the performance differently to see how well the predicted and actual targets got right.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 172


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Multi label classification

                    Predicted setosa                          Predicted versicolor                          Predicted virginica
Actual setosa       actual setosa and predicted setosa        actual setosa but predicted versicolor        actual setosa but predicted virginica
Actual versicolor   actual versicolor but predicted setosa    actual versicolor and predicted versicolor    actual versicolor but predicted virginica
Actual virginica    actual virginica but predicted setosa     actual virginica but predicted versicolor     actual virginica and predicted virginica

‣ Since the iris data is a multi-label classification problem, it cannot be expressed in four different
concepts only as provided earlier. So, create three indices for each setosa, versicolor, virginica by
considering each as a binary classification problem. Take setosa, for example.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 173


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Confusion matrix of the iris data set

‣ With scikit-learn, it is possible to easily calculate the confusion matrix using confusion_matrix
function. Send the arguments to the actual class and then to the predicted class.
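‣ A minimal sketch, assuming y_test from the hold-out split and a trained model such as best_model from the grid search step:

from sklearn.metrics import confusion_matrix

y_pred = best_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)   # actual classes first, predicted classes second
print(cm)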

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 174


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Confusion matrix of the iris data set

[Heatmap of the confusion matrix: true label (setosa, versicolor, virginica) on the y-axis, predicted label (setosa, versicolor, virginica) on the x-axis]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 175


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Confusion matrix of the iris data set


‣ Use scikit-plot to visualize the confusion matrix as a more intuitive heatmap. The scikit-learn output above
has no x-axis or y-axis labels, whereas scikit-plot labels the axes, which makes the result easier to interpret.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 176


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

precision / recall / fall-out / f-score


‣ Evaluation scores differ in each target class in a multi-label classification problem.

                                    Predicted setosa       Not predicted setosa
                                                           (predicted versicolor or virginica)
Actual setosa                       TP (True Positive)     FN (False Negative)
Not actual setosa
(versicolor or virginica)           FP (False Positive)    TN (True Negative)
‣ Score the evaluation results based on the confusion matrix’s TP, TN, FP, and FN. These four concepts
are only possible in binary classification problems. For multi-label classification problems having N
target classes, such as iris data, consider each target class as binary classification and obtain N
confusion matrixes.
Ex Iris data
Consider each setosa, versicolor, and virginica as a binary classification problem and create
three confusion matrixes.
The following shows the confusion matrix of setosa.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 177


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

precision
‣ Precision is the ratio of correct predictions among the samples predicted as the target class.

$precision = \frac{TP}{TP + FP}$

(f"{target} precision:
{score}")
setosa precision: 1.0
versicolor precision: 0.9375
virginica precision: 1.0

Line 45
• In multi-label classification, average cannot be “binary.”
• “binary” is the average default.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 178


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

recall
‣ Also called sensitivity, recall is the correct prediction ratio among the actual target class.

$recall = \frac{TP}{TP + FN}$

(f"{target}sensitivity:
{score}")
setosa sensitivity: 1.0
versicolor sensitivity: 1.0
virginica sensitivity : 0.9375

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 179


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

fall-out
‣ Fall-out is the ratio of samples incorrectly predicted as the target class among the samples that are actually
not the target class. It is also expressed as 1 − specificity.

$fall\text{-}out = \frac{FP}{FP + TN}$

‣ scikit-learn does not provide a dedicated function for fall-out, but it can be computed from the confusion matrix, as sketched below.
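‣ A minimal sketch of computing fall-out for one class (setosa, index 0) from the confusion matrix cm obtained earlier (the index and variable names are assumptions):

# Treat class 0 (setosa) as the positive class of a binary problem.
i = 0
TP = cm[i, i]
FN = cm[i, :].sum() - TP
FP = cm[:, i].sum() - TP
TN = cm.sum() - TP - FN - FP

fall_out = FP / (FP + TN)
print("setosa fall-out:", fall_out)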

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 180


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

f-score
‣ Precision and recall have a trade-off relationship. The f-score is the weighted harmonic mean of
precision and recall. If β is less than 1, more weight is given to precision; if β is greater
than 1, more weight is given to recall. The f-score is used to understand the model
performance accurately when the data classes are imbalanced.

$F_\beta = (1 + \beta^2) \cdot \frac{precision \times recall}{\beta^2 \cdot precision + recall}$

‣ To weight precision and recall evenly, β is set to 1 most of the time, which is specifically referred to as
the f1-score.

$F_1 = 2 \cdot \frac{precision \times recall}{precision + recall}$

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 181


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

f-score
‣ #F1 measure – precision and recall are weighted equally. The F1 score is the harmonic mean of precision
and recall (sensitivity).
‣ # With a = precision and b = recall: F1 = 2ab / (a + b)
‣ #F0.5 measure – precision is weighted more than recall; recall gets half the weight of precision.
‣ #F2 measure – recall is weighted more; recall gets twice the weight of precision.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 182


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

f-score

(f"{target}fbetas score:
{score}")

(f"{target}f1 score:{score}")

setosa fbetas score : 1.0


versicolor fbetas score : 0.967741935483871
virginica fbetas score : 0.967741935483871
setosa f1 score : 1.0
versicolor f1 score : 0.967741935483871
virginica f1 score : 0.967741935483871

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 183


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

accuracy

$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 184


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

classification_report
‣ Use the classification_report function of scikit-learn to batch calculate precision, recall, and f1-score.
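‣ A minimal sketch, assuming y_test and y_pred from the earlier steps:

from sklearn.metrics import classification_report

# Precision, recall, and f1-score for every class in one report.
print(classification_report(y_test, y_pred, target_names=['setosa', 'versicolor', 'virginica']))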

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 185


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

ROC curve
‣ ROC curve has TPR (True Positive Rate) on the y-axis and FPR(False Positive Rate) on the x-axis.
TPR is recall, and FPR refers to fall-out.

$TPR = \frac{TP}{TP + FN} \qquad FPR = \frac{FP}{FP + TN}$

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 186


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

ROC curve

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 187


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

ROC curve

[ROC curves plot — x-axis: False Positive Rate, y-axis: True Positive Rate. Legend: ROC curve of class setosa
(area = 1.00), ROC curve of class versicolor (area = 0.99), ROC curve of class virginica (area = 0.99),
micro-average ROC curve (area = 0.99).]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 188


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

AUC (Area Under Curve)

[Four example ROC plots (true positive rate vs. false positive rate) with AUC = 0.4, 0.5, 0.6, and 0.85 —
from The Hundred-Page Machine Learning Book]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 189


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Poor performance: Return to the previous step


‣ Final model
‣ Save model – It can take days to train when there is a lot of data. It is extremely inefficient to retrain the
model every time a prediction is needed, so the best practice is to save the model for reuse. Use 'pickle' to
save the model, as sketched below.
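‣ A minimal sketch of saving and loading the model with pickle (the file name is illustrative):

import pickle

# Save the trained model to disk.
with open('iris_tree_model.pkl', 'wb') as f:
    pickle.dump(best_model, f)

# Load it back later for prediction.
with open('iris_tree_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict(X_test[:5]))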

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 190


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Poor performance: Return to the previous step


‣ Final model

Line 55
• Import model.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 191


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Poor performance: Return to the previous step


‣ Final model

Line 56
• Final prediction

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 192


1.5. Practicing to find an optimal method to solve problems with scikit-learn UNIT
01

Poor performance: Return to the previous step


‣ Final model

Line 57
• Save csv.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 193


Unit 2.

Application of the Supervised Learning


Model for Numerical Prediction
2.1. Training and Testing in Machine Learning
2.2. Linear Regression Basics
2.3. Linear Regression Diagnostics
2.4. Other Regression Types
2.5. Practicing the Supervised Learning Model for Numerical
Prediction
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 194
2.1. Training and Testing in Machine Learning UNIT
02

Machine Learning Types

Machine Learning
• Supervised: the target pattern is given.
• Unsupervised: the target pattern must be found out.
• Reinforcement: policy optimization.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 195


2.1. Training and Testing in Machine Learning UNIT
02

Training and Testing in Machine Learning


What is learning (human)?
‣ True learning is not just having a good memory.
‣ Learned material must be put to test.

Training and testing in machine learning


‣ A trained model must be put to test to generalize.
‣ Testing result is measured by errors.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 196


2.1. Training and Testing in Machine Learning UNIT
02

Error Types
Bias error (Underfitting error)
‣ Associated with simple/rigid/biased models.
‣ Prediction cannot account for the detailed data pattern.
‣ To lower this error type, increase the model complexity.

[Plot: data points with an overly simple predicted curve compared against the target pattern]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 197


2.1. Training and Testing in Machine Learning UNIT
02

Variance error (Overfitting error)


‣ Associated with models overly complex and sensitive to noise.
‣ Prediction performance is good while training but worsens when testing with a different dataset.
‣ To lower this error type, increase the amount of training data (*) or decrease the model complexity.

[Plot: data points with an overly complex predicted curve compared against the target pattern]

(*) By a data augmentation method, for example.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 198


2.1. Training and Testing in Machine Learning UNIT
02

Total error

[Plot: bias error, variance error, and total error as functions of model complexity]

‣ The goal is to minimize the Total error = Bias error + Variance error.
‣ Just enough complexity is required to “optimize” the model.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 199


2.1. Training and Testing in Machine Learning UNIT
02

Minimizing Errors
Optimized machine learning model
‣ Prediction performance should be good in both training and testing.
‣ Given a machine learning algorithm, there is a model with just enough complexity (*).

[Plot: training error (in sample) and testing error (out of sample) vs. model complexity — underfitting on the left, the optimum in the middle, overfitting on the right]

(*) Complexity is usually controlled by the number of parameters.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 200


2.1. Training and Testing in Machine Learning UNIT
02

Optimized machine learning model


‣ Prediction performance should be good in both training and testing.
‣ Given a machine learning algorithm, there is a model with just enough complexity (*).

[Plot: data points with a predicted curve of just enough complexity following the target pattern]

(*) Complexity is usually controlled by the number of parameters.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 201


2.1. Training and Testing in Machine Learning UNIT
02

Machine Learning Types

Error Metric

• Numeric Y: MSE, MAE, RMSE, correlation, etc.
• Categorical Y: accuracy, precision, recall, specificity, etc.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 202


Unit 2.

Application of the Supervised Learning


Model for Numerical Prediction
2.1. Training and Testing in Machine Learning
2.2. Linear Regression Basics
2.3. Linear Regression Diagnostics
2.4. Other Regression Types
2.5. Practicing the Supervised Learning Model for Numerical
Prediction
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 203
2.2. Linear Regression Basics UNIT
02

Linear Regression Basics


Supervised Learning

• Numeric Y: Y = 13.45, 73, 9.5, …
• Categorical Y: Y = red, green, blue, …

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 204


2.2. Linear Regression Basics UNIT
02

About linear regression

‣ There are one or more explanatory variables: $X_1, X_2, \ldots, X_k$

‣ There is one response variable: $Y$

‣ The variables $X_i$ and $Y$ are connected by a linear relation: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 205


2.2. Linear Regression Basics UNIT
02

Purpose of linear regression


a) By modeling, find out which explanatory variables have the most impact on the response variable.

Ex If real estate price is the response variable 𝑌, which are the most statistically meaningful
explanatory variables? Area, location, age, distance to business center, etc.

b) Predict the response given the conditions for the explanatory variables.

Ex What is the price of a 10-year-old apartment with an area of 100 and located 3 km away from
the business center? ← “predict” the value that is not open to the public yet.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 206


2.2. Linear Regression Basics UNIT
02

Historical background
‣ Term “regression” was coined by Francis Galton, 19th-century biologist.
‣ The heights of the descendants tend to regress towards the mean.

[Scatter plot of child height vs. parent height; portrait of Francis Galton]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 207


2.2. Linear Regression Basics UNIT
02

Pros
‣ Solid statistical and mathematical background
‣ Source of insights
‣ Fast training

Cons
‣ Many assumptions: linearity, normality, independence of the explanatory variables, etc.
‣ Sensitive to outliers
‣ Prone to multi-collinearity

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 208


2.2. Linear Regression Basics UNIT
02

Assumptions
‣ The response variable can be explained by a linear combination of the explanatory variables.
‣ There should be no multi-collinearity.
‣ Residuals should be normally distributed centered around 0.
‣ Residuals should be distributed with a constant variance. Residual analysis
‣ Residuals should be randomly distributed without a pattern.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 209


2.2. Linear Regression Basics UNIT
02

Linear model

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$

$Y$: response variable; $X_1, \ldots, X_k$: explanatory variables

‣ Variable values are given by data.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 210


2.2. Linear Regression Basics UNIT
02

Linear model

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$

$\beta_0, \beta_1, \ldots, \beta_k$: regression coefficients

‣ Regression coefficients are model parameters: capture the data patterns.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 211


2.2. Linear Regression Basics UNIT
02

Linear model

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$

‣ The error term should have zero mean and constant variance.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 212


2.2. Linear Regression Basics UNIT
02

Linear model
Ex

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$

$Y$ = MPG, $X_1$ = number of cylinders, $X_2$ = HP, $X_3$ = weight, $X_4$ = auto or manual

MPG can be explained by the other variables.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 213


2.2. Linear Regression Basics UNIT
02

Interpreting the regression coefficients

$\Delta Y = \beta_1 \Delta X_1 + \cdots + \beta_i \Delta X_i + \cdots + \beta_k \Delta X_k$

‣ If $X_1, X_2, \ldots, X_k$ change by $\Delta X_1, \Delta X_2, \ldots, \Delta X_k$, then the change in $Y$ is $\Delta Y$.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 214


2.2. Linear Regression Basics UNIT
02

Interpreting the regression coefficients

$\Delta Y = \beta_1 \Delta X_1 + \cdots + \beta_i \Delta X_i + \cdots + \beta_k \Delta X_k$

‣ $\beta_i$ can be interpreted as the change in $Y$ when $X_i$ is increased by one unit ($\Delta X_i = 1$).

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 215


2.2. Linear Regression Basics UNIT
02

Interpreting the regression coefficients

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$

Intercept: $\beta_0$

‣ The intercept 𝛽0 is the value of 𝑌 when all the 𝑋𝑖 = 0. It’s like a “base line.”

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 216


2.2. Linear Regression Basics UNIT
02

Interpreting the regression coefficients


Ex Wage survey of a company’s employees

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$

$Y$ = wage, $X_1$ = experience, $X_2$ = qualification

‣ 𝛽0 can be interpreted as the base wage when there is no experience or qualification.


‣ 𝛽1 can be interpreted as the change in wage when the experience is increased by a unit.
‣ 𝛽2 can be interpreted as the change in wage when the qualification is increased by a unit.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 217


2.2. Linear Regression Basics UNIT
02

Ordinary Least Squares (OLS) solution

$y_j = \beta_0 + \beta_1 x_{j,1} + \beta_2 x_{j,2} + \cdots + \beta_K x_{j,K} + \varepsilon_j, \quad j = 1, \ldots, n$  ⇒ over-determined!

‣ Now we can write the linear relation in terms of the actual data values.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 218


2.2. Linear Regression Basics UNIT
02

Ordinary Least Squares (OLS) solution

𝒀 = 𝑿 𝜷+𝜺
‣ A compact notation using matrices

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 219


2.2. Linear Regression Basics UNIT
02

Ordinary Least Squares (OLS) solution

𝒀 = 𝑿 𝜷+𝜺

$\boldsymbol{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}$

‣ A compact notation using matrices

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 220


2.2. Linear Regression Basics UNIT
02

Ordinary Least Squares (OLS) solution

𝒀 = 𝑿 𝜷+𝜺

$\boldsymbol{X} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,k} \\ 1 & x_{2,1} & \cdots & x_{2,k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,k} \end{bmatrix}$

‣ A compact notation using matrices

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 221


2.2. Linear Regression Basics UNIT
02

Ordinary Least Squares (OLS) solution

𝒀 = 𝑿 𝜷+𝜺

$\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}$

‣ A compact notation using matrices

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 222


2.2. Linear Regression Basics UNIT
02

Ordinary Least Squares (OLS) solution

𝒀 = 𝑿 𝜷+𝜺

$\boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$

‣ A compact notation using matrices

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 223


2.2. Linear Regression Basics UNIT
02

Ordinary Least Squares (OLS) solution


‣ As we have an over-determined system of linear equations, the exact solution does not exist.
‣ We can minimize $|\boldsymbol{\varepsilon}|^2$ and get the "best" solution $\boldsymbol{\beta}$.
‣ The minimization condition for $|\boldsymbol{\varepsilon}|^2$ is given by the derivative:

$\frac{d|\boldsymbol{\varepsilon}|^2}{d\boldsymbol{\beta}} = 0$

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 224


2.2. Linear Regression Basics UNIT
02

Ordinary Least Squares (OLS) solution


‣ The minimization condition for $|\boldsymbol{\varepsilon}|^2$ is given by the derivative, which can be expanded as follows:

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 225


2.2. Linear Regression Basics UNIT
02

Ordinary Least Squares (OLS) solution


‣ The solution 𝜷 from the previous slide is given by the following expression:

$\boldsymbol{\beta} = \left[ (\boldsymbol{X}^t \boldsymbol{X})^{-1} \boldsymbol{X}^t \right] \boldsymbol{Y}$

Pseudo-inverse: $(\boldsymbol{X}^t \boldsymbol{X})^{-1} \boldsymbol{X}^t$

‣ The matrix expression within the square parentheses is called “pseudo-inverse.”
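‣ A minimal sketch of the OLS solution in NumPy (the arrays are illustrative):

import numpy as np

# Design matrix with a leading column of ones (intercept) and the response vector.
X = np.array([[1, 1.0], [1, 2.0], [1, 3.0], [1, 4.0]])
Y = np.array([2.1, 3.9, 6.2, 8.1])

# beta = [(X^t X)^-1 X^t] Y, i.e., the pseudo-inverse applied to Y.
beta = np.linalg.inv(X.T @ X) @ X.T @ Y
beta_pinv = np.linalg.pinv(X) @ Y   # numerically safer equivalent

print(beta, beta_pinv)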

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 226


2.2. Linear Regression Basics UNIT
02

Regression training and prediction (testing)


1) Training step: use the training dataset and get a set of model parameters {𝛽𝑖 }.

𝑥𝑖
Training dataset

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 227


2.2. Linear Regression Basics UNIT
02

Regression training and prediction (testing)


2) Prediction step: when a new set of {𝑥𝑖′} is given, calculate the value of 𝑦′, which was unknown.

‣ The predicted value of $y'$ is denoted $\hat{y}$, which is a conditional expectation $\hat{y} = E[y \mid data]$.
‣ Given the values $x_1', x_2', \ldots, x_k'$, calculate $\hat{y} = \beta_0 + \beta_1 x_1' + \beta_2 x_2' + \cdots + \beta_K x_k'$.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 228


2.2. Linear Regression Basics UNIT
02

Regression training and prediction (testing)

[Diagram of the prediction step: the inputs $1, X_1, \ldots, X_k$ are multiplied by the parameters $\beta_0, \beta_1, \ldots, \beta_k$, summed over, and combined with the error $\varepsilon$ to give the output $Y$]

‣ Schematic view of the prediction step


‣ Notice some disagreement between the predicted 𝑦 ̂ and the true 𝑦.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 229


2.2. Linear Regression Basics UNIT
02

Error metrics

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 230


2.2. Linear Regression Basics UNIT
02

Confidence interval (95%) when there is only one explanatory variable

with

‣ 𝑞𝑢𝑎𝑛𝑡𝑖𝑙𝑒(𝛼,𝑘) is Student-t quantile of degree of freedom 𝑘.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 231


2.2. Linear Regression Basics UNIT
02

Role of categorical variables


‣ Without categorical variable: weight ~ height

¿ 𝑚𝑎𝑙𝑒
¿ 𝑓𝑒𝑚𝑎𝑙𝑒
weight

height

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 232


2.2. Linear Regression Basics UNIT
02

Role of categorical variables


‣ With a categorical variable gender: weight ~ height + gender

¿ 𝑚𝑎𝑙𝑒
¿ 𝑓𝑒𝑚𝑎𝑙𝑒
weight

height

‣ The categorical variable must be turned into dummy variable(s) first.


‣ Effectively raises or lowers the intercept.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 233


2.2. Linear Regression Basics UNIT
02

Role of categorical variables


‣ With a categorical variable gender that “interacts”: weight ~ height + gender + height  gender

¿ 𝑚𝑎𝑙𝑒
¿ 𝑓𝑒𝑚𝑎𝑙𝑒
weight

height

‣ Both the intercept and slope are dependent on the categorical variable.
‣ Further improves the error metrics.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 234


2.2. Linear Regression Basics UNIT
02

Coding Exercise #0301

Follow practice steps on 'ex_0301.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 235


Unit 2.

Application of the Supervised Learning


Model for Numerical Prediction
2.1. Training and Testing in Machine Learning
2.2. Linear Regression Basics
2.3. Linear Regression Diagnostics
2.4. Other Regression Types
2.5. Practicing the Supervised Learning Model for Numerical
Prediction
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 236
2.3. Linear Regression Diagnostics UNIT
02

Linear Regression Diagnostics


Linear regression diagnostic methods

1) Error metrics: MSE, RMSE, MAE, MAPE, etc.


2) Coefficient of determination or “r-squared” 𝑅2
3) F-test for overall significance of the linear model
4) t-test for significance of individual regression coefficients
5) Correlation between 𝑌 and 𝑌 ̂
6) Variance inflation factor (VIF)

7) Modelling: optimization of the information criteria AIC or BIC
8) Residual and leverage analysis

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 237


2.3. Linear Regression Diagnostics UNIT
02

Error metrics

‣ The smaller, the better!

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 238


2.3. Linear Regression Diagnostics UNIT
02

Coefficient of determination or 𝑅2

$R^2 = 1 - \frac{SSE}{SST}$ with $SSE = \sum_i (y_i - \hat{y}_i)^2$ and $SST = \sum_i (y_i - \bar{y})^2$

‣ 𝑅2 is bounded above and below: 0<𝑅2<1.


‣ 𝑅2 close to one means that the response variable is well explained.
‣ As more explanatory variables are added, 𝑅2 tends to increase spuriously: adjusted 𝑅2 introduced.
‣ If there is only one explanatory variable 𝑋, then:

$R^2 = Cor(X, Y)^2$

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 239


2.3. Linear Regression Diagnostics UNIT
02

F-test for overall significance of the linear model


‣ Null hypothesis 𝑯𝟎: 𝛽1=𝛽2=⋯=𝛽𝐾=0
‣ Alternate hypothesis 𝑯𝟏: At least a 𝛽𝑖 is non zero.

⇦ F-test statistic $= \frac{\text{Variance that can be explained}}{\text{Variance that cannot be explained}}$

⇦ If the p-value is below a reference (say 0.05), then 𝑯𝟎 is rejected in favor of 𝑯𝟏.
In this case, the linear model has an overall significance.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 240


2.3. Linear Regression Diagnostics UNIT
02

t-test for significance of the individual regression coefficients


‣ Null hypothesis 𝑯𝟎: 𝛽𝑖=0
‣ Alternate hypothesis 𝑯𝟏: 𝛽𝑖≠0

⇦ t-test statistic $= \frac{\hat{\beta}_i}{\text{Standard error of } \beta_i}$, where $\hat{\beta}_i$ = estimated coefficient.

⇦ If the p-value is below a reference (say 0.05), then 𝑯𝟎 is rejected in favor of 𝑯𝟏.

In this case, inclusion of the explanatory variable 𝑋𝑖 is justified.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 241


2.3. Linear Regression Diagnostics UNIT
02

Correlation between $Y$ and $\hat{Y}$

[Two scatter plots of $Y$ vs. $\hat{Y}$: weak positive correlation (left) < strong positive correlation (right)]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 242


2.3. Linear Regression Diagnostics UNIT
02

Variance inflation factor (VIF)


‣ Can measure the “seriousness” of multi-collinearity.
‣ In general, 𝑉𝐼𝐹>10 is considered serious. However, this reference point is rather subjective.
‣ In case serious multi-collinearity is detected, model simplification is required.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 243


2.3. Linear Regression Diagnostics UNIT
02

Variance inflation factor (VIF)


‣ VIF is calculated individually for each explanatory variable 𝑋𝑖.
1) Place 𝑋𝑖 as the response and then regress using the expression:

$X_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_{i-1} X_{i-1} + \beta_{i+1} X_{i+1} + \cdots + \varepsilon$

2) Using the corresponding $R_i^2$, calculate $VIF_i$:

$VIF_i = \frac{1}{1 - R_i^2}$
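‣ A minimal sketch of computing VIF with statsmodels, assuming the explanatory variables are in a DataFrame X_df (the variable name is an assumption):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Add the intercept column, then compute VIF for every column.
X_const = add_constant(X_df)
vif = pd.Series(
    [variance_inflation_factor(X_const.values, i) for i in range(X_const.shape[1])],
    index=X_const.columns)
print(vif)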

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 244


2.3. Linear Regression Diagnostics UNIT
02

Information criteria and modelling


‣ Akaike information criteria (AIC) with 𝑝=number of parameters:

$AIC = -2\,\frac{Log\,likelihood}{n} + 2\,\frac{p}{n}$
‣ Bayes information criteria (BIC) with 𝑝=number of parameters:

$BIC = -2\,\frac{Log\,likelihood}{n} + p\,\frac{Ln(n)}{n}$

‣ In the expressions above, we have:

$Log\,Likelihood = -\frac{n}{2}\left(1 + Ln(2\pi) + Ln\left(\frac{SSE}{n}\right)\right)$

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 245


2.3. Linear Regression Diagnostics UNIT
02

Information criteria and modelling

[Two plots: AIC and BIC as functions of complexity (~p)]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 246


2.3. Linear Regression Diagnostics UNIT
02

Information criteria and modelling

[Plot: AIC and BIC vs. complexity (~p)] ⇨ There is a minimum.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 247


2.3. Linear Regression Diagnostics UNIT
02

Residual analysis

‣ Residual is the difference between the predicted 𝑦 ̂ and the real 𝑦.
‣ We can easily detect outliers in 𝑌 that deviate substantially from the main trend.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 248


2.3. Linear Regression Diagnostics UNIT
02

Residual analysis
‣ Reasons for residual analysis:
1) To detect outliers in 𝑌.
2) To verify the assumptions of linear regression.

Residuals should be normally distributed centered around 0.


Residuals should be distributed with a constant variance.
Residuals should be randomly distributed without a pattern.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 249


2.3. Linear Regression Diagnostics UNIT
02

Leverage analysis
‣ Leverage of the i-th observation: $H_{ii}$, where $\boldsymbol{H} = \boldsymbol{X}(\boldsymbol{X}^t\boldsymbol{X})^{-1}\boldsymbol{X}^t$

‣ There is a "sum rule":

$\sum_{i=1}^{n} H_{ii} = p$

[Plot of the leverage values around the mean leverage, with some observations showing large leverage and others small leverage]

‣ Leverage tells how distant 𝑋 is from the center => Detection of outliers in 𝑋.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 250


Unit 2.

Application of the Supervised Learning


Model for Numerical Prediction
2.1. Training and Testing in Machine Learning
2.2. Linear Regression Basics
2.3. Linear Regression Diagnostics
2.4. Other Regression Types
2.5. Practicing the Supervised Learning Model for Numerical
Prediction
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 251
2.4. Other Regression Types UNIT
02

Regularized Regression
Bias-Variance trade off
[Plot of the Bias error, the Variance error, and the Total error as a function of model complexity.]

‣ Tradeoff relation between the Bias error and the Variance error.
‣ The goal should be to minimize the Total error = Bias error + Variance error.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 252


2.4. Other Regression Types UNIT
02

Ridge regression
‣ Useful when the usual linear regression overfits (bias error << variance error).
‣ We remember that the OLS solution consists in minimizing |𝜺|2.
‣ In the Ridge regression, we minimize the following “loss function”:

   𝑳 = |𝜺|² + 𝝀 ∑ᵢ₌₀ᵏ βᵢ²

‣ In the loss function, 𝝀 ∑ᵢ βᵢ² is included as penalty: “L2 regularization.”

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 253


2.4. Other Regression Types UNIT
02

Ridge regression
‣ Positive and larger λ further constrains the coefficients βᵢ, decreasing the variance (overfitting) error.

   𝑳 = |𝜺|² + 𝝀 ∑ᵢ₌₀ᵏ βᵢ²

   𝑌 = β₀ + β₁𝑋₁ + β₂𝑋₂ + ⋯ + βₖ𝑋ₖ + 𝜀

‣ However, too large λ can make the model too “biased.”
‣ Even when λ is large, the coefficients never become exactly equal to zero.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 254


2.4. Other Regression Types UNIT
02

Lasso regression
‣ Useful when the usual linear regression overfits (bias error << variance error).
‣ In the Lasso regression, we minimize the following “loss function”:

   𝑳 = |𝜺|² + 𝝀 ∑ᵢ₌₀ᵏ |βᵢ|

‣ In the loss function, 𝝀 ∑ᵢ |βᵢ| is included as penalty: “L1 regularization.”

‣ Positive and larger λ further constrains the coefficients, decreasing the variance (overfitting) error.
‣ However, too large λ can make the model too “biased.”
‣ When λ is large, the coefficients can become exactly equal to zero.

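A minimal scikit-learn sketch of both regularized regressions; alpha plays the role of λ above, and the values shown are only illustrative (X_train and y_train are assumed to be an already prepared training set):

```python
from sklearn.linear_model import Ridge, Lasso

# alpha is the penalty strength (lambda); the values here are arbitrary examples.
ridge = Ridge(alpha=1.0).fit(X_train, y_train)   # L2 penalty: coefficients shrink toward zero
lasso = Lasso(alpha=0.1).fit(X_train, y_train)   # L1 penalty: some coefficients can become exactly zero

print(ridge.coef_)   # typically all non-zero
print(lasso.coef_)   # may contain exact zeros
```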
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 255


2.4. Other Regression Types UNIT
02

Polynomial Regression
Polynomial regression
‣ Useful when the usual linear regression underfits (bias error >> variance error).
‣ We can model the relationship between 𝑋 and 𝑌 using the polynomials:

   𝑌 = β₀ + β₁𝑋 + β₂𝑋² + ⋯ + βₖ𝑋ᵏ + 𝜀

‣ We notice that there is only one explanatory variable 𝑋.
‣ When the polynomial power is too high, overfitting may occur.

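A short scikit-learn sketch of polynomial regression, assuming x is a single-column feature array and y the numeric response (degree=3 is just an example choice):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# x: shape (n_samples, 1), y: shape (n_samples,) -- assumed to be given.
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(x, y)
y_pred = model.predict(x)   # fitted polynomial curve; too high a degree risks overfitting
```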
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 256


2.4. Other Regression Types UNIT
02

Poisson Regression
Poisson regression
‣ Useful when we would like to model the response 𝑌 that represents counts or frequencies.

   𝐿𝑜𝑔(𝜆) = β₀ + β₁𝑋₁ + β₂𝑋₂ + ⋯ + βₖ𝑋ₖ + 𝜀

‣ We are assuming that Y follows the Poisson distribution:

   𝑃(𝑦) = 𝜆ʸ · 𝑒^(−𝜆) / 𝑦!

   a) Mean = 𝜆
   b) Variance = 𝜆
   c) Standard deviation = √𝜆

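A minimal statsmodels sketch of Poisson regression, assuming X holds the explanatory variables and y the count response:

```python
import statsmodels.api as sm

# X: explanatory variables, y: non-negative integer counts -- assumed to be given.
X_const = sm.add_constant(X)                           # adds the intercept beta_0
poisson_model = sm.GLM(y, X_const, family=sm.families.Poisson())
result = poisson_model.fit()
print(result.summary())                                # coefficients are on the Log(lambda) scale
```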
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 257


Unit 2.

Application of the Supervised Learning


Model for Numerical Prediction
2.1. Training and Testing in Machine Learning
2.2. Linear Regression Basics
2.3. Linear Regression Diagnostics
2.4. Other Regression Types
2.5. Practicing the Supervised Learning Model for Numerical Prediction

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 258


2.5. Practicing the Supervised Learning Model for Numerical Prediction UNIT
02

Practicing the Supervised Learning Model for Numerical


Prediction
Import modules.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 259


2.5. Practicing the Supervised Learning Model for Numerical Prediction UNIT
02

Coding Exercise #0302

Follow practice steps on 'ex_0302.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 260


2.5. Practicing the Supervised Learning Model for Numerical Prediction UNIT
02

Coding Exercise #0303

Follow practice steps on 'ex_0303.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 261


2.5. Practicing the Supervised Learning Model for Numerical Prediction UNIT
02

Coding Exercise #0304

Follow practice steps on 'ex_0304.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 262


2.5. Practicing the Supervised Learning Model for Numerical Prediction UNIT
02

Coding Exercise #0305

Follow practice steps on 'ex_0305.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 263


2.5. Practicing the Supervised Learning Model for Numerical Prediction UNIT
02

Coding Exercise #0306

Follow practice steps on 'ex_0306.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 264


Unit 3.

Application of Supervised Learning Model


for Classification
3.1. Training and Testing in Machine Learning
3.2. Logistic Regression Basics
3.3. Logistic Regression Performance Metrics

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 265


3.1. Training and Testing in Machine Learning UNIT
03

Training and Testing in Machine Learning


Select a machine learning algorithm suitable for the data.
[scikit-learn algorithm cheat-sheet flowchart: starting from the number of samples and the type of task, it branches into four areas: classification (Linear SVC, Naïve Bayes for text data, KNeighbors Classifier, SVC / Ensemble Classifiers, SGD Classifier, kernel approximation), regression (Lasso / ElasticNet, Ridge Regression, SVR with linear or rbf kernel, Ensemble Regressors, SGD Regressor), clustering (KMeans, MiniBatch KMeans, Spectral Clustering, GMM, MeanShift, VBGMM), and dimensionality reduction (Randomized PCA, Isomap, Spectral Embedding, LLE, kernel approximation).]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 266


3.1. Training and Testing in Machine Learning UNIT
03

Select a machine learning algorithm suitable for the data.


‣ Supervised learning covers machine learning tasks that try to predict or classify the values of objective variables (or response variables) in unseen data, because the training data set already contains the objective variable (or response variable) to be predicted (Y).
‣ It is called supervised learning because the algorithm is told to predict a certain target.
‣ Supervised learning is mainly classified into classification and regression problems depending on the
types of objective variables (or response variables).
‣ In other words, classification is applied if the objective variables (or response variables) are discrete
or nominal, such as ‘male/female,’ ‘spam mail/ham mail,’ ‘positive/neutral/negative,’
‘Seoul/Busan/Gyeonggi/Gangwon,’ etc. On the other hand, regression is applied if the objective
variables (or response variables) are numerical, such as 0~10, -500~500, -3.5~3.5, -∞~∞, etc.
‣ When the given data does not have classification labels or objective variables (or response variables), and the task is not to predict a target value, unsupervised learning is applied.
‣ Unsupervised learning is classified into clustering, association, dimension reduction, and others
depending on the analysis purpose and methods. Further details will be explained in other chapters.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 267


3.1. Training and Testing in Machine Learning UNIT
03

Select a machine learning algorithm suitable for the data.

‣ As shown in the figure above, neural networks can be applied if there are correct answers and it is
for classification purposes or has a lot of data. If not, it is possible to use a decision tree or SVC
algorithm.
‣ Also, the Naïve Bayes algorithm can be used if the data is in text.
‣ The KNN method is used if it’s not text data. It is also possible to use the ensemble method with a
better performance.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 268


3.1. Training and Testing in Machine Learning UNIT
03

Machine learning for classification


‣ Classification is used if the objective variables (or response variables) can be classified into certain
categories, such as discrete or nominal type. It is the most common and frequently found in machine
learning-based data analysis.
‣ Machine learning algorithms for classification can be applied to extensive daily and business problems.

Ex Frequently used examples
(1) Classifying spam mails
(2) Prediction of corporate bankruptcy
(3) Prediction of customer loss
(4) Classifying customers’ credit rating
(5) Prediction of occurrence of a certain disease (e.g. cancer, heart disease,
etc.)
(6) Prediction of customer reaction to a specific marketing event
(7) Prediction of customer’s purchase
‣ The examples above are only a small sample of real-life applications; classification-type machine learning is applied in countless fields of business, and new applications keep appearing as R&D and business methods evolve.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 269


3.1. Training and Testing in Machine Learning UNIT
03

Types of machine learning algorithm for classification (1/2)


‣ There are many different classification-type machine learning algorithms. Some of those can be used
for regression problems. The following table provides descriptions of frequently used classification
methods.
‣ K-Nearest Neighbor (Note: Lazy Learning): A classification method that applies the majority rule to the k closest objective variable values (or response variables), based on the distance between the coordinates of a given data point and the surrounding data points.
‣ Naïve Bayes (Note: Probability model, Bayes’ theorem based conditional probability): A method based on Bayes’ theorem that classifies towards the higher probability by expressing the conditional probability of the objective variables (or response variables) as a multiplication of the prior probability and the likelihood function. All observed values are assumed to occur statistically independently of the other observed values. (Referred to as a “Naïve” model since the assumption is made without confidence.)
‣ Logistic Regression (Note: Probability model, maximum likelihood estimation): A method to estimate the probability of the objective variables through maximum likelihood estimation, by assuming that, when the explanatory values are given, the probability of the objective variable being in a certain category follows the logistic function shape.
‣ Decision Tree (Note: Divide & Conquer): A method to create classification rules by splitting the branches towards lower impurity or entropy, in the order of the variables most associated with the objective variables.
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 270
3.1. Training and Testing in Machine Learning UNIT
03

Types of machine learning algorithm for classification (2/2)


‣ There are many different classification-type machine learning algorithms. Some of those can be used
for regression problems. The following table provides descriptions of frequently used classification
methods.
‣ Artificial Neural Network (Note: Black box test): A method inspired by the human neural network. Comprised of input nodes, hidden nodes, and output nodes, this analysis method is used to solve complicated classification or value prediction problems.
‣ Support Vector Machine (Note: Linear and non-linear, Kernel trick): A method to classify data by finding a plane that maximizes the margin between data in different categories.
‣ Random Forest (Note: Ensemble model): An ensemble method that decides the final classification result by building various decision trees from the given data and aggregating the predicted results of each decision tree through voting.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 271


Unit 3.

Application of Supervised Learning Model


for Classification
3.1. Training and Testing in Machine Learning
3.2. Logistic Regression Basics
3.3. Logistic Regression Performance Metrics

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 272


3.2. Logistic Regression Basics UNIT
03

Logistic Regression Basics


What is logistic regression analysis?
‣ In contrast to general linear regression analysis used for numerical prediction, logistic regression
analysis is used to classify the category of objective variables (y) to be predicted. In other words,
what is being predicted is not the y value which is an objective variable. It is P(Y=i), which is the
probability of the objective variable y becoming a certain category (i).
[Diagram: Supervised Learning branches into two cases, Numeric 𝑌 (e.g., Y = 13.45, 73, 9.5, …) and Categorical 𝑌 (e.g., Y = red, green, blue, …).]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 273


3.2. Logistic Regression Basics UNIT
03

About logistic regression


‣ There are one or more explanatory variables: 𝑋₁, 𝑋₂, …, 𝑋ₖ.
‣ There is one response variable: 𝑌.
‣ The response variable is binary where possible values are {0,1}, {𝐹𝑎𝑙𝑠𝑒,𝑇𝑟𝑢𝑒}, {𝑁𝑜,𝑌𝑒𝑠}, etc.
‣ One of the most basic classification algorithms.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 274


3.2. Logistic Regression Basics UNIT
03

Pros
‣ Simple and relatively easy to implement
‣ Source of intuitive insights
‣ Fast training

Cons
‣ Not among the most accurate classification algorithms
‣ Assumes that the explanatory variables are independent without multi-collinearity

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 275


3.2. Logistic Regression Basics UNIT
03

About logistic regression


‣ The linear combination of the variables 𝑋ᵢ is the so-called “Logit,” denoted here as 𝑆.

   𝑆 = β₀ + β₁𝑋₁ + β₂𝑋₂ + ⋯ + βₖ𝑋ₖ

‣ The conditional probability of 𝑌 being equal to 1 is denoted as 𝑃(𝑌=1|𝑋).
‣ The “Sigmoid” or “Logistic” function connects the probability with the logit:

   𝑓(𝑆) = 𝑒ˢ / (1 + 𝑒ˢ)

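A tiny NumPy sketch showing that the sigmoid and the logit are inverses of each other (illustrative only):

```python
import numpy as np

def sigmoid(s):
    # maps a logit s in (-inf, inf) to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-s))

def logit(p):
    # inverse of the sigmoid: maps a probability back to the logit scale
    return np.log(p / (1.0 - p))

p = sigmoid(0.8)
print(p, logit(p))   # logit(sigmoid(0.8)) recovers 0.8
```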
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 276


3.2. Logistic Regression Basics UNIT
03

About logistic regression


‣ If 𝑝 = probability of 𝑌=1, we can define the logit 𝑆 = 𝐿𝑜𝑔(𝑝 / (1 − 𝑝)).
‣ The logistic function is the inverse of the logit (and vice versa):

   𝑆 = 𝐿𝑜𝑔(𝑝 / (1 − 𝑝))  ⟺  𝑝 = 𝑒ˢ / (1 + 𝑒ˢ)

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 277


3.2. Logistic Regression Basics UNIT
03

About logistic regression


‣ For logistic regression, the categories of objective variable (Y) to be predicted are 0 and 1 (binomial
logistic regression model). The possibility of the objective variable Y category becoming 1 is
expressed as p(Y=1) = P(Y). Start by expressing the description as a regression equation. The left
side of the equation is commonly referred to as ’odds.’
‣ The ‘odds’ signify dividend rate, which categorizes into success/failure probabilities. If the success
probability is high, it becomes 1; if it is low, it becomes 0. In other words, it is a success rate. Adding
a log to odds becomes logit, and this function is logistic.
   𝑃(𝑌) / (1 − 𝑃(𝑌)) = exp(β₀ + β₁𝑋)

‣ The left side of the equation is a ratio of probabilities; the right side is an exponential function with a (0, ∞) range. Add a log on both sides of the equation to give a range of (−∞, ∞) on both sides.

   log(𝑃(𝑌) / (1 − 𝑃(𝑌))) = β₀ + β₁𝑋

‣ When looking at the equation in detail, β₀ + β₁𝑋 on the right side is a linear model with a range (−∞, ∞); the left side also has a range (−∞, ∞). The term log(𝑃(𝑌) / (1 − 𝑃(𝑌))) on the left side of the equation is called the logit function.
‣ Or add ‘exp’ to both sides of the equation and arrange it regarding P(Y) to get the following
equation.
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 278
3.2. Logistic Regression Basics UNIT
03

About logistic regression

   𝑃(𝑌=1) = 𝑒^(β₀+β₁𝑋) / (1 + 𝑒^(β₀+β₁𝑋)) = 1 / (1 + 𝑒^(−(β₀+β₁𝑋)))

‣ To sum up, logistic regression analysis is an algorithm that builds a model with the logistic function above and estimates, from the training data, the probability P(Y=1) = P(Y) that the objective variable Y takes the categorical value 1. Maximum likelihood estimation is generally used for estimating the parameters β₀ and β₁. Since a direct analytical calculation is difficult, the estimation starts from certain initial values and adjusts them through repeated numerical calculations.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 279


3.2. Logistic Regression Basics UNIT
03

About logistic regression


‣ The following figure shows a graph of the logistic function. As shown in the graph, the range of the X-axis is (−∞, ∞). The Y-axis is the probability of the objective variable Y occurring, which ranges between 0 and 1. Thus, the function P(Y=1) shows an S-shaped curve from 0 to 1 as the x value increases.

[Logistic regression graph: P(Y=1) on the Y-axis versus the predictor on the X-axis, forming an S-shaped curve.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 280


3.2. Logistic Regression Basics UNIT
03

Logistic regression training and prediction (testing)


2) Prediction step: when a new set of explanatory variables {𝑥ᵢ} is given, calculate the value of 𝑃(𝑌=1|𝑑𝑎𝑡𝑎), which was unknown.

Compare the conditional probability with a cutoff.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 281


3.2. Logistic Regression Basics UNIT
03

Logistic regression training and prediction (testing)

[Diagram: the inputs 1, 𝑋₁, 𝑋₂, …, 𝑋ₖ are combined with the parameters β₀, β₁, …, βₖ into the logit 𝑆, which is passed through the sigmoid to give the output 𝑃(𝑌=1|𝑑𝑎𝑡𝑎) (e.g., 0.3, 0.72, 0.12, …); this is compared with the observed 𝑌 values (e.g., 0, 0, 1, 1, 0, …).]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 282


3.2. Logistic Regression Basics UNIT
03

Training by gradient descent algorithm


‣ We can get the parameter set {βᵢ} by minimizing a target function 𝐿.
By doing so, we minimize the difference between the real 𝑌 and the predicted 𝑌̂.
‣ We can think of 𝐿 as a “loss” or “cost.”
‣ 𝐿 is a function of the parameter set {βᵢ}, which can be represented by a vector 𝜷.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 283


3.2. Logistic Regression Basics UNIT
03

Training by gradient descent algorithm


‣ 𝐿(𝜷) is the negative of the logarithmic likelihood, defined as follows:

   𝐿(𝜷) = ∑ᵢ₌₁ⁿ 𝐿𝑜𝑔(1 + 𝑒^(−𝑦ᵢ 𝜷ᵗ𝒙ᵢ))

‣ In the above relation, 𝑦ᵢ and 𝒙ᵢ represent values given by the training dataset.
‣ Here, we assumed the conversion of the labels from {0, 1} to {−1, +1}.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 284


3.2. Logistic Regression Basics UNIT
03

Training by gradient descent algorithm


‣ 𝐿(𝜷) is the negative of the logarithmic likelihood, defined as follows:

   𝐿(𝜷) = ∑ᵢ₌₁ⁿ 𝐿𝑜𝑔(1 + 𝑒^(−𝑦ᵢ 𝜷ᵗ𝒙ᵢ))

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 285


3.2. Logistic Regression Basics UNIT
03

Training by gradient descent algorithm


‣ The gradient of 𝐿 is denoted as 𝜵𝐿(𝜷), which is also a function of the parameter set {βᵢ} or 𝜷.
‣ The expression for the gradient is:

   𝜵𝐿(𝜷) = − ∑ᵢ₌₁ᴺ 𝑦ᵢ 𝒙ᵢ 𝑒^(−𝑦ᵢ 𝜷ᵗ𝒙ᵢ) / (1 + 𝑒^(−𝑦ᵢ 𝜷ᵗ𝒙ᵢ))

‣ The growth of 𝐿 is steepest along the direction of 𝜵𝐿(𝜷).
⇨ The descent of 𝐿 is steepest along the direction of −𝜵𝐿(𝜷).

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 286


3.2. Logistic Regression Basics UNIT
03

Training by gradient descent algorithm


‣ The expression for the gradient is:

   𝜵𝐿(𝜷) = − ∑ᵢ₌₁ᴺ 𝑦ᵢ 𝒙ᵢ 𝑒^(−𝑦ᵢ 𝜷ᵗ𝒙ᵢ) / (1 + 𝑒^(−𝑦ᵢ 𝜷ᵗ𝒙ᵢ))

‣ The gradient is obtained by calculating the partial derivatives of 𝐿 with respect to each parameter:

   𝜵𝐿(𝜷) = [ 𝜕𝐿/𝜕β₀, 𝜕𝐿/𝜕β₁, …, 𝜕𝐿/𝜕βₖ ]ᵗ

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 287


3.2. Logistic Regression Basics UNIT
03

Training by gradient descent algorithm


‣ 𝐿(𝜷) is minimized iteratively in small “steps” pushing 𝜷 along the direction −𝜵𝐿(𝜷).
1) 𝜷 is randomly initialized.
2) Calculate the gradient 𝜵𝐿(𝜷).
3) Update 𝜷 by one step: 𝜷 ← 𝜷 − 𝜂 𝜵𝐿(𝜷).
Convergence speed is controlled by the “Learning rate” 𝜂.
4) Repeat from step 2) a fixed number of times (epochs).

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 288


3.2. Logistic Regression Basics UNIT
03

Training by gradient descent algorithm


‣ 𝐿(𝜷) is minimized iteratively in small “steps” pushing 𝜷 along the direction −𝜵𝐿(𝜷).

[Plot: the loss 𝐿(𝜷) descends step by step toward the optimized minimum.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 289


3.2. Logistic Regression Basics UNIT
03

Training by gradient descent algorithm


‣ The gradient function in Python
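A minimal NumPy sketch of such a gradient function together with the descent loop, assuming the labels y are coded as −1/+1 and the design matrix X already contains a leading column of 1s for the intercept:

```python
import numpy as np

def gradient(X, y, beta):
    # X: (n, k+1) design matrix, y: (n,) labels in {-1, +1}, beta: (k+1,) parameters
    s = y * (X @ beta)                          # y_i * beta^t x_i for every observation
    w = np.exp(-s) / (1.0 + np.exp(-s))         # per-observation factor from the loss
    return -(X.T @ (y * w))                     # gradient of the negative log likelihood

def gradient_descent(X, y, learning_rate=0.01, epochs=1000):
    beta = 0.01 * np.random.randn(X.shape[1])   # 1) random initialization
    for _ in range(epochs):                     # 4) repeat a fixed number of epochs
        beta -= learning_rate * gradient(X, y, beta)   # 2)-3) one step along -gradient
    return beta
```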

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 290


3.2. Logistic Regression Basics UNIT
03

Training by gradient descent algorithm


‣ The gradient function in Python

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 291


Unit 3.

Application of Supervised Learning Model


for Classification
3.1. Training and Testing in Machine Learning
3.2. Logistic Regression Basics
3.3. Logistic Regression Performance Metrics

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 292


3.3. Logistic Regression Performance Metrics UNIT
03

Logistic Regression Performance Metrics


Confusion matrix
‣ In the classification machine learning method, the most common method for evaluating the analysis
model’s result is the metric calculation, including classification accuracy, using a confusion matrix.
‣ The confusion matrix refers to the matrix that makes a crosstable of the predicted classification
category from the analysis model and the actual classified category of data.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 293


3.3. Logistic Regression Performance Metrics UNIT
03

Confusion matrix

                                 Predicted Y                     Predicted N
Actual categorical value Y       O (TP: True Positive)           X (FN: False Negative)
Actual categorical value N       X (FP: False Positive)          O (TN: True Negative)

(O = the prediction matches the actual category, X = it does not)

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 294


3.3. Logistic Regression Performance Metrics UNIT
03

Confusion matrix
‣ Confusion matrix can be made with a 2X2 crosstable, along with 3X3 and higher crosstables. For
convenience, we will only use a 2X2 confusion matrix in this chapter.
‣ In the confusion matrix from the previous slide, the diagonally placed ‘O’ cases mean the predicted and actual categorical values are the same. In other words, the classification machine learning
predicted the results properly.
‣ On the other hand, if the predicted and actual categorical values differ, the machine learning model
has made incorrect predictions.
‣ The categories that the analysis is mostly interested in are positive categories; the others are called
negative categories. Depending on the accuracy of prediction (true or false) regarding positive and
negative categories, the accurate classification of interested categories is called TP (True
Positive). The accurate classification of uninterested categories is called TN (True Negative).
‣ The inaccurate classification of uninterested categories into interested categories is called FP
(False Positive). The inaccurate classification of interested categories into uninterested categories
is called FN (False Negative).
‣ There are various metrics based on different combinations of TP, TN, FP, and FN of the confusion
matrix for evaluating the analysis result of classification machine learning methods.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 295


3.3. Logistic Regression Performance Metrics UNIT
03

Metric
‣ Major metrics calculated from the confusion matrix include accuracy, error rate = 1-accuracy,
sensitivity (also referred to as recall, hit ratio, TP rate, etc.), specificity, FP rate, precision, and
others. Among those, accuracy, sensitivity, and precision are the most frequently used metrics.
‣ Also, there are F-Measure (or F1-Score) that combines sensitivity, precision, and Kappa Statistics,
where the predicted and actual values of the analysis model are exactly the same. The calculation
formulas and definitions of various metrics are in the following table.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 296


3.3. Logistic Regression Performance Metrics UNIT
03

Metric

‣ accuracy = (TP+TN) / (TP+TN+FP+FN): Ratio of accurate prediction of the actual classification category (ratio of TP and TN among all predictions)
‣ error rate = (FP+FN) / (TP+TN+FP+FN): Ratio of inaccurate prediction of the actual classification category (identical to 1 − accuracy)
‣ sensitivity = TP Rate = (TP) / (TP+FN): Ratio of accurate prediction to ‘positive’ from the actual ‘positive’ categories (True Positive; also referred to as Recall, Hit Ratio, and TP Rate)
‣ specificity = (TN) / (TN+FP): Ratio of accurate prediction to ‘negative’ from the actual ‘negative’ categories (True Negative)
‣ FP Rate = (FP) / (TN+FP): Ratio of inaccurate prediction to ‘positive’ from the actual ‘negative’ categories = 1 − specificity
‣ precision = (TP) / (TP+FP): Ratio of actual ‘positive’ cases among the cases predicted ‘positive’
‣ F-Measure (F1-Score): the harmonic mean of precision and sensitivity (recall). Ranged between 0~1; if both precision and sensitivity are high, the F-Measure also tends to have a larger value.
‣ Kappa Statistic: the value after eliminating coincidental agreement between the predicted and actual values of the model. Ranged between 0~1; when the value is closer to 1, the predicted and actual values of the model accurately coincide, and they do not coincide when the value gets closer to 0.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 297
3.3. Logistic Regression Performance Metrics UNIT
03

Metric
‣ Among the metrics from the previous slide, sensitivity signifies how well the actual ‘positive’
category is predicted ‘positive.’ The precision is the index showing the ratio of actual ‘positive’ from
the predicted ‘positive’ categories. Thus, they are metrics that directly explain how well the
classification machine learning analysis model classifies interested categorical values of objective
variables.
‣ Sensitivity and precision are the most significant and frequently used metrics for classification
machine learning results in real life.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 298


3.3. Logistic Regression Performance Metrics UNIT
03

Confusion matrix
Ex Actual 0 Actual 1
Predicted 0 120 5
Predicted 1 15 20

‣ Confusion matrix is a contingency table that counts the frequencies of the actual vs. predicted.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 299


3.3. Logistic Regression Performance Metrics UNIT
03

Confusion matrix: Accuracy


Ex Actual 0 Actual 1
Predicted 0 120 5
Predicted 1 15 20

Accuracy = (120 + 20) / (120 + 5 + 15 + 20) = 140 / 160 = 0.875

‣ Accuracy is the ratio between the diagonal sum and the total sum.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 300


3.3. Logistic Regression Performance Metrics UNIT
03

Confusion matrix: Sensitivity


Ex Actual 0 Actual 1
Predicted 0 120 5
Predicted 1 15 20

Sensitivity = 20 / (20 + 5) = 0.8

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 301


3.3. Logistic Regression Performance Metrics UNIT
03

Confusion matrix: Specificity


Ex Actual 0 Actual 1
Predicted 0 120 5
Predicted 1 15 20

Specificity = 120 / (120 + 15) ≈ 0.889

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 302


3.3. Logistic Regression Performance Metrics UNIT
03

Confusion matrix: Precision


Ex Actual 0 Actual 1
Predicted 0 120 5
Predicted 1 15 20

Precision = 20 / (20 + 15) ≈ 0.571

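A short sketch that reproduces the four metrics from the confusion matrix above (plain Python, just to verify the arithmetic):

```python
# Counts taken from the example table: rows = predicted, columns = actual.
TN, FN = 120, 5      # predicted 0: actual 0, actual 1
FP, TP = 15, 20      # predicted 1: actual 0, actual 1

accuracy    = (TP + TN) / (TP + TN + FP + FN)   # 0.875
sensitivity = TP / (TP + FN)                    # 0.8
specificity = TN / (TN + FP)                    # ~0.889
precision   = TP / (TP + FP)                    # ~0.571
print(accuracy, sensitivity, specificity, precision)
```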
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 303


3.3. Logistic Regression Performance Metrics UNIT
03

Note
‣ Accuracy alone is not sufficient for testing.

Ex If frauds constitute only 1% of all transactions, the accuracy of a fraud detection system
(FDS) that predicts non-fraud in all transactions would be quite high at 99%.
However, such FDS would be useless because it misses the 1% that really matters.

‣ We should also consider metrics other than just the accuracy.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 304


3.3. Logistic Regression Performance Metrics UNIT
03

Terminology

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Sensitivity = TP / (TP + FN)

Specificity = TN / (TN + FP)

Precision = TP / (TP + FP)

Cohen’s kappa = (Accuracy − 𝑝ₑ) / (1 − 𝑝ₑ), where 𝑝ₑ is the probability of correct prediction by chance.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 305


3.3. Logistic Regression Performance Metrics UNIT
03

Terminology

True Positive Rate = Sensitivity

True Negative Rate = Specificity

False Positive Rate = FP / (TN + FP) = 1 − Specificity

False Negative Rate = FN / (TP + FN) = 1 − Sensitivity

Positive Predicted Value = Precision

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 306


3.3. Logistic Regression Performance Metrics UNIT
03

ROC curve

[ROC curve plot: the true positive rate on the Y-axis versus the false positive rate on the X-axis.]

‣ ROC curve is a parametric plot with respect to the cutoff probability.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 307


3.3. Logistic Regression Performance Metrics UNIT
03

ROC curve

As the cutoff increases (closer to 1)


Performance Metric: Increase/Decrease
• True Positive Rate (Sensitivity): decreases
• Specificity: increases
• False Positive Rate (1 − Specificity): decreases
• Precision: increases (generally)

[ROC curve plot: as the cutoff moves toward 1, the operating point moves toward the lower-left of the curve.]
‣ ROC curve is a parametric plot with respect to the cutoff probability.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 308


3.3. Logistic Regression Performance Metrics UNIT
03

ROC curve

As the cutoff decreases (closer to 0)


Performance Metric: Increase/Decrease
• True Positive Rate (Sensitivity): increases
• Specificity: decreases
• False Positive Rate (1 − Specificity): increases
• Precision: decreases (generally)

[ROC curve plot: as the cutoff moves toward 0, the operating point moves toward the upper-right of the curve.]
‣ ROC curve is a parametric plot with respect to the cutoff probability.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 309


3.3. Logistic Regression Performance Metrics UNIT
03

ROC curve

[ROC curve plot of the true positive rate versus the false positive rate, with the area under the curve (AUC) shaded.]

‣ AUC stands for Area Under the Curve.


‣ AUC closer to 1 means good overall performance.
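A minimal scikit-learn sketch of obtaining the ROC curve and AUC from a fitted classifier's predicted probabilities (the fitted model, X_test, and y_test are assumed to exist):

```python
from sklearn.metrics import roc_curve, roc_auc_score

# model is an already fitted classifier, e.g., a LogisticRegression instance.
y_prob = model.predict_proba(X_test)[:, 1]       # predicted probability of class 1
fpr, tpr, cutoffs = roc_curve(y_test, y_prob)    # the curve is parametrized by the cutoff
print(roc_auc_score(y_test, y_prob))             # AUC closer to 1 = better overall performance
```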

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 310


3.3. Logistic Regression Performance Metrics UNIT
03

Interpretation using Bayes’ theorem


Ex There are 100 observations. In 6 cases, the actual response is 1.
In the rest of the 94 cases, the actual response is 0. It is known that for a given logistic
regression model,
the sensitivity = 0.92 and the specificity = 0.90.
For a new observation (only explanatory variables), this model predicts 1 as the response.
What is the probability of this prediction being correct?

a) We have 𝑃(1) = 0.06, 𝑃(0) = 0.94.

𝑃(predicted 1 | actual 1) = 0.92 ⇦ “Sensitivity”
𝑃(predicted 0 | actual 0) = 0.90 ⇦ “Specificity”
We can also derive 𝑃(predicted 1 | actual 0) = 1 − 0.90 = 0.10.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 311


3.3. Logistic Regression Performance Metrics UNIT
03

Interpretation using Bayes’ theorem


Ex There are 100 observations. In 6 cases, the actual response is 1.
In the rest of the 94 cases, the actual response is 0. It is known that for a given logistic
regression model,
the sensitivity = 0.92 and the specificity = 0.90.
For a new observation (only explanatory variables), this model predicts 1 as the response.
What is the probability of this prediction being correct?

b) The answer we are seeking is given by Bayes’ theorem:

𝑃(actual 1 | predicted 1) = 𝑃(predicted 1 | 1)·𝑃(1) / [𝑃(predicted 1 | 1)·𝑃(1) + 𝑃(predicted 1 | 0)·𝑃(0)]
                          = (0.92 × 0.06) / (0.92 × 0.06 + 0.10 × 0.94) ≈ 0.37

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 312


3.3. Logistic Regression Performance Metrics UNIT
03

Coding Exercise #0307

Follow practice steps on 'ex_0307.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 313


Unit 4.

Decision Tree
4.1. Tree Algorithm

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 314


4.1. Tree Algorithm UNIT
04

Overview of Decision Tree


Definition
‣ A classification model that analyzes data collected from the past and expresses the patterns found
(characteristics of each category) as a combination of features

Purpose
‣ Classification of unseen data and prediction of categorical values
‣ Extraction of generalized knowledge in a tree structure from the data

Classification depending on the objective variable types


‣ Categorical variable: Classification Tree
‣ Continuous variable: Regression Tree

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 315


4.1. Tree Algorithm UNIT
04

Composition
‣ Node, Branch, Depth

[Decision tree diagram: the root node at the top (X₁ < .47, yes/no) branches into further nodes (X₂ < .39, X₂ < .84, X₁ < .87); a node that is split further is the parent node of the nodes below it (its child nodes), and the nodes at the bottom that are not split any further are the terminal (leaf) nodes.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 316


4.1. Tree Algorithm UNIT
04

How to construct a decision tree model


Construction of a decision tree

1) Construction of a decision tree: making a decision tree by designating an appropriate split criterion and stopping rules according to the purpose and data structure of the analysis
2) Branching: removing branches that have a high risk of error or inappropriate rules
3) Validity evaluation: evaluation of the decision tree through cross-validation using the gain chart, risk chart, or test data
4) Interpretation and prediction: interpretation of the decision tree and setting up a prediction model

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 317


4.1. Tree Algorithm UNIT
04

Split criterion of the decision tree


Analysis process of a decision tree
‣ Repetitive splitting: Repetitive splitting of the dimensional space of independent variables using
training data
‣ Branching: Branching with evaluation data
Split criterion
‣ Create the classification tree so that the purity of the child node is greater than that of the parent
node.

[Illustration: a parent node containing the two classes in a 40:60 ratio. One candidate split produces child nodes that keep roughly the same mixture (no improvement in purity), while another produces child nodes that each contain only one class (a perfect split).]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 318


4.1. Tree Algorithm UNIT
04

Repetitive splitting process


Purpose
‣ Divide the entire space into rectangles and make each as pure or homogeneous as possible.
‣ Definition of ‘pure’:
• Divide the area into pure or homogenous rectangular spaces as much as possible
• All variables in the final rectangle belong to the same group.
[Scatter plot: Income (1,000 USD, roughly 20 to 120) on the X-axis versus size of the housing site (in 1,000s, roughly 13 to 25) on the Y-axis, with points labeled Owner and Non-owner; repetitive splitting divides this space into rectangles that are as homogeneous as possible.]
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 319
4.1. Tree Algorithm UNIT
04

Repetitive splitting process


1) Select xi, one of the variables, and the xi value (si as a split criterion) is designated to split the p-
dimension space into two.
2) xi → {xi<=si} ∪ {xi>si}
3) Select a variable again and split it in the same way.
4) Repeat the process until it reaches desired purity.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 320


4.1. Tree Algorithm UNIT
04

Split criterion
Discrete objective variables
‣ Chi squared statistic – p-value: Creates child nodes with a predictor variable with the lowest p-value
and the optimal partitioning
‣ Gini index: Selects child nodes with a predictor variable that reduces the Gini index and the optimal
partitioning
‣ Entropy measure: Creates child nodes with a predictor variable with the lowest entropy measure and
the optimal partitioning

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 321


4.1. Tree Algorithm UNIT
04

Split criterion
Continuous objective variables
‣ F statistic in ANOVA: Creates child nodes with a predictor variable with the lowest p-value and the
optimal partitioning
‣ Variance reduction: Creates child nodes with the optimal partitioning that maximizes variance
reduction

Selection of algorithms and classification variables

‣ CHAID (multiway space partitioning): Chi squared statistic for discrete objective variables / ANOVA F statistic for continuous objective variables
‣ CART (binary space partitioning): Gini index for discrete objective variables / Variance reduction for continuous objective variables
‣ C4.5: Entropy measure for discrete objective variables

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 322


4.1. Tree Algorithm UNIT
04

Impurity measure
Gini index
‣ Selects child nodes with a predictor variable that reduces the Gini index and the optimal partitioning
‣ If the T data set is split into k categories and the category performance ratios are p1, …, pk, it is
expressed as the following equation.
   𝐺𝑖𝑛𝑖(𝑇) = 1 − ∑ₗ₌₁ᵏ 𝑝ₗ²

Ex High impurity (diversity), low purity:
   GI = 1 − (3/8)² − (3/8)² − (1/8)² − (1/8)² ≈ .69

Ex Low impurity (diversity), high purity:
   GI = 1 − (6/7)² − (1/7)² ≈ .24
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 323
4.1. Tree Algorithm UNIT
04

Entropy measure
‣ In thermodynamics, entropy measures the degree of disorder.
‣ Creates child nodes with a predictor variable with the lowest entropy measure and the optimal
partitioning.
‣ If the T data set is split into k categories and the category performance ratios are p1, …, pk, it is
expressed as the following equation.
   𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑇) = − ∑ₗ₌₁ᵏ 𝑝ₗ log₂ 𝑝ₗ

Ex If 4 categories consist of ratios of 0.25, 0.25, 0.25, 0.25 (T0):

   𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑇₀) = −(0.25·log₂0.25 + 0.25·log₂0.25 + 0.25·log₂0.25 + 0.25·log₂0.25) = 2

Ex If 4 categories consist of ratios of 0.5, 0.25, 0.25, 0 (T1):

   𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑇₁) = −(0.5·log₂0.5 + 0.25·log₂0.25 + 0.25·log₂0.25) = 1.5

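A small sketch that computes both impurity measures from a list of class proportions, reproducing the examples above:

```python
import numpy as np

def gini(p):
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    p = np.asarray(p)
    p = p[p > 0]                               # 0 * log2(0) is treated as 0
    return -np.sum(p * np.log2(p))

print(entropy([0.25, 0.25, 0.25, 0.25]))       # 2.0
print(entropy([0.5, 0.25, 0.25, 0.0]))         # 1.5
print(round(gini([3/8, 3/8, 1/8, 1/8]), 2))    # 0.69
print(round(gini([6/7, 1/7]), 2))              # 0.24
```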
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 324


4.1. Tree Algorithm UNIT
04

Stopping criteria
A rule to designate the current node as a terminal node without further splitting
‣ Designates the depth of the decision tree
‣ Designates the minimum number of records in the terminal node

Branching criteria
Application of test data
‣ Application of the test data to the constructed model
‣ Reviewing the predictive value of the constructed model through test data
‣ Removing the branches that have a high risk of error rate or inappropriate rule of
inference

By an expert
‣ An expert reviewing the validity of rules suggested in the constructed model
‣ Removing rules without validity

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 325


4.1. Tree Algorithm UNIT
04

Overfitting problem
Overfitting problem graph

[Graph: error rate versus number of split nodes. The training data error keeps decreasing as the tree grows, while the evaluation data error starts to rise again beyond a certain number of splits (overfitting).]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 326


4.1. Tree Algorithm UNIT
04

Pros
‣ Creation of understandable rules (can be expressed with SQL)
‣ Useful in classification prediction
‣ Able to work with both continuous and discrete variables
‣ Shows a more relatively significant variable

Cons
‣ Not suitable to predict continuous variable values
‣ Unable to perform time series analysis
‣ Not stable

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 327


4.1. Tree Algorithm UNIT
04

Tree Algorithm
Pros
‣ Intuitive and easy to understand
‣ No assumptions about the variables
‣ No need to scale or normalize data
‣ Not that sensitive to the outliers

Cons
‣ Not that powerful in the most basic form
‣ Prone to overfitting. Thus, “pruning” is often required.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 328


4.1. Tree Algorithm UNIT
04

Classification Tree

Ex

[Example tree: the root node asks “Do you eat meat?”. If no, the next question is “Do you eat dairy?” (no → Vegan, yes → Vegetarian); if yes, the next question is “Regularly?” (no → Flexitarian, yes → Meat lover). Each leaf node holds a predicted response.]

‣ Training step creates an inverted tree structure as above.


‣ Conditions are evaluated at the nodes and then branch out.
‣ Each leaf node corresponds to a region in the configurational space.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 329


4.1. Tree Algorithm UNIT
04

Classification Tree
‣ The tree structure is trained by minimizing the Gini impurity (or entropy).

   𝐺ₘ = ∑ₖ₌₁ᴷ p̂ₘₖ(1 − p̂ₘₖ) = 1 − ∑ₖ₌₁ᴷ p̂ₘₖ²

   or

   𝐸𝑛𝑡𝑟𝑜𝑝𝑦ₘ = − ∑ₖ₌₁ᴷ p̂ₘₖ 𝐿𝑜𝑔(p̂ₘₖ)

• 𝐺ₘ is the Gini impurity in the leaf node 𝑚. The smaller, the better.
• 𝐸𝑛𝑡𝑟𝑜𝑝𝑦ₘ is the entropy in the leaf node 𝑚.
• Here, p̂ₘₖ is the proportion of the class 𝑘 in the leaf node 𝑚.
• 𝐾 is the total number of possible classes.
• The class with the largest proportion is the prediction at that leaf node.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 330


4.1. Tree Algorithm UNIT
04

Classification Tree: Procedure


a) Make a basic tree.
b) Prune branches that do not provide better performance.
Pruning can be done during the cross-validation step.
c) Predict with the optimized tree.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 331


4.1. Tree Algorithm UNIT
04

Regression Tree

Ex

[Example tree: the root node asks “Experience < 5?”. If yes, the predicted wage is 30,000. If no, the next node asks “Performance < 80%?” (yes → wage 35,000, no → wage 40,000). Each leaf node holds a predicted numeric response.]

‣ Predicts numeric values rather than categories or classes.


‣ Conditions are evaluated at the nodes and then branch out.
‣ Each leaf node corresponds to a region in the configurational space.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 332


4.1. Tree Algorithm UNIT
04

Regression Tree

Ex
[Plot: the configurational space with Experience (0 to 20 years) on the X-axis and Performance (0% to 100%) on the Y-axis, partitioned into three rectangular regions 𝑅₁, 𝑅₂, 𝑅₃ by the splits at Experience = 5 years and Performance = 80%.]
‣ Each leaf node corresponds to a region in the configurational space.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 333


4.1. Tree Algorithm UNIT
04

Regression Tree
‣ The configurational space is split into regions: {𝑅_1,𝑅_2,⋯,𝑅_𝐽 }.
‣ The tree structure is trained by minimizing the RSS (residual sum of squares):

   𝑅𝑆𝑆 = ∑ⱼ₌₁ᴶ ∑ᵢ∈𝑅ⱼ (𝑦ᵢ − ŷ_𝑅ⱼ)²

• 𝐽 is the total number of leaf nodes.
• The predicted value ŷ_𝑅ⱼ in the 𝑗-th region is given by the average of the responses in that region.
• For the observations belonging to the region 𝑅ⱼ, the same predicted response is assigned, analogous to the classification tree.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 334


4.1. Tree Algorithm UNIT
04

Scikit-Learn DecisionTreeClassifier/Regressor Hyperparameters:

Hyperparameter Explanation
max_depth The maximum depth of a tree

min_samples_leaf The minimum number of sample points required to be at a leaf node

min_samples_split The minimum number of sample points required to split an internal node

max_features The number of features to consider when looking for the best split

max_leaf_nodes The maximum number of leaf nodes in the tree

‣ Need to be tuned for optimized performance.


‣ More information can be found at:
Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
Regressor: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html
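A short scikit-learn sketch using some of these hyperparameters; the specific values are illustrative only (X_train, y_train, X_test are assumed to exist):

```python
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=4,            # limit the depth of the tree
    min_samples_leaf=5,     # each leaf must keep at least 5 samples
    min_samples_split=10,   # a node needs at least 10 samples to be split
    random_state=0,
)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
```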

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 335


4.1. Tree Algorithm UNIT
04

Decision Tree
‣ Decision tree refers to a modeling technique as the shape of a tree branching out from root to leaf
nodes. It depends on reference values of independent variables (explanatory or input variables) that
affect the classification or prediction of objective variables.
‣ In the decision tree, each node is split in the form of if-then depending on explanatory variables’
characteristics or reference values. When following the tree structure, it is possible to easily
understand how the attribute value of data is classified into the category.
‣ The figure provided below is a typical form of the decision tree. From the example, ‘age’ is the root.
It can be inferred that ‘age’ is the most significant variable when deciding the loan approval.
‣ The squared shape node at the end of each branch is the leaf node.
[Example tree for loan approval: the root node splits on Age (≤ 35 / > 35); lower nodes split on Monthly income (around 2,000,000 KRW), Occupation (Worker / Unemployed / Others), and Family income (around 3,000,000 KRW); the leaf nodes are Loan approved or Loan denied.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 336


4.1. Tree Algorithm UNIT
04

Split Criteria of Decision Tree


‣ Decision tree gives a question to each node and separates data by branching according to the
response.
‣ To evaluate how well the data is separated, a specific criterion is required. In general, impurity is the
evaluation criterion. The impurity becomes higher as various classifications are mixed in the node; it
is the lowest when there’s only one classification.
‣ Thus, if each node’s impurity is low after node splitting, we can say that the decision tree is well
classified.
‣ Impurity measures include Gini impurity and entropy.

‣ For a binary split with class proportion 𝑝, the impurity index is 0 when 𝑝 = 0 or 𝑝 = 1, and it is largest when 𝑝 = 0.5, thus making a parabola. In other words, the impurity index is lowest when the node contains only one classification or none of that classification at all. In contrast, the impurity becomes largest when many classifications are mixed in the same node.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 337


4.1. Tree Algorithm UNIT
04

Coding Exercise #0308

Follow practice steps on 'ex_0308.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 338


Unit 5.

Naïve Bayes Algorithm


5.1. Naïve Bayes Algorithm

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 339


5.1. Naïve Bayes Algorithm UNIT
05

Naïve Bayes Algorithm


About Naïve Bayes algorithm
‣ It is a straightforward application of Bayes’ theorem.

Pros
‣ Intuitive and simple
‣ Not that sensitive to the noise and outliers
‣ Fast

Cons
‣ Assumes that the features are independent, which may not be strictly true
‣ Not among the best-performing algorithms

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 340


5.1. Naïve Bayes Algorithm UNIT
05

About Naïve Bayes algorithm


‣ Bayes’ theorem:

   𝑃(𝐴|𝐵) = 𝑃(𝐵|𝐴) 𝑃(𝐴) / 𝑃(𝐵)

‣ Now we take 𝐴 = 𝐶𝑙𝑎𝑠𝑠 and 𝐵 = 𝐷𝑎𝑡𝑎, then:

   𝑃ₚₒₛₜ(𝐶𝑙𝑎𝑠𝑠) = 𝑃(𝐶𝑙𝑎𝑠𝑠|𝐷𝑎𝑡𝑎) = 𝑃(𝐷𝑎𝑡𝑎|𝐶𝑙𝑎𝑠𝑠) 𝑃ₚᵣᵢₒᵣ(𝐶𝑙𝑎𝑠𝑠) / 𝑃(𝐷𝑎𝑡𝑎)

‣ We are more interested in comparing relative probabilities:

   𝑃ₚₒₛₜ(𝐶𝑙𝑎𝑠𝑠) ∝ 𝑃(𝐷𝑎𝑡𝑎|𝐶𝑙𝑎𝑠𝑠) 𝑃ₚᵣᵢₒᵣ(𝐶𝑙𝑎𝑠𝑠)

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 341


5.1. Naïve Bayes Algorithm UNIT
05

About Naïve Bayes algorithm


‣ We can approximate 𝑃(𝐷𝑎𝑡𝑎│𝐶𝑙𝑎𝑠𝑠) by a Gaussian (normal distribution):

   𝑃ₚₒₛₜ(𝐶𝑙𝑎𝑠𝑠) ∝ (1 / (√(2𝜋) 𝜎ⱼ)) · exp(−(𝑥 − 𝜇ⱼ)² / (2𝜎ⱼ²)) · 𝑃ₚᵣᵢₒᵣ(𝐶𝑙𝑎𝑠𝑠)

where the parameters 𝜇_𝑗 and 𝜎_𝑗 are “learned” from the training data.

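A minimal Gaussian Naïve Bayes sketch with scikit-learn (X_train, y_train, X_test are assumed to be prepared beforehand):

```python
from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(X_train, y_train)          # learns the class priors and the per-class mu_j, sigma_j
y_pred = nb.predict(X_test)       # picks the class with the largest posterior probability
print(nb.predict_proba(X_test)[:5])
```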
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 342


5.1. Naïve Bayes Algorithm UNIT
05

Random variable
‣ The variable whose values are unknown until the outcome
‣ Independent events
• If the probability of the simultaneous occurrence of two cases is identical to the multiplication of
the probabilities of each event to occur, then the two events are independent of each other.

‣ Dice

[Venn diagram and table: Event A = odd numbers {1, 3, 5}, P(A) = 0.5; Event B = the numbers {1, 2, 3}, P(B) = 0.5.]

Conditional probability
• Events of B (1, 2, 3) when the A events (1, 3, 5) occur: {1, 3}
• P(B | A) = 2/3 ≈ 0.6666667, which differs from P(B) = 0.5

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 343
5.1. Naïve Bayes Algorithm UNIT
05

Random variable
‣ The variable whose values are unknown until the outcome
‣ Independent events
• Not affected
‣ Dice

[Venn diagram and table: Event A = odd numbers {1, 3, 5}, P(A) = 0.5; Event B = multiples of 3 {3, 6}, P(B) = 0.333.]

Conditional probability
• Events of B (3, 6) when the A events (1, 3, 5) occur: {3}
• P(B | A) = 1/3 ≈ 0.333333333, which equals P(B), so B is not affected by A
6
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 344
5.1. Naïve Bayes Algorithm UNIT
05

Random variable
‣ The variable whose values are unknown until the outcome
‣ Exclusive events
• Intersection is the null set.
‣ Dice

[Venn diagram: events A and B on a die share no outcomes; their intersection is the null set.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 345


5.1. Naïve Bayes Algorithm UNIT
05

Introduction to Bayesian statistics


1) Differentiate groups of people for ‘shopping’ and ‘window shopping’ through Bayesian
interpretation.
• Setting prior probability for different types
• Prior probability: Probability before getting certain information
[Diagram: the prior probabilities of the two groups, Shopping (A): 0.2 and Window shopping (B): 0.8.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 346


5.1. Naïve Bayes Algorithm UNIT
05

Introduction to Bayesian statistics


2) Set the conditional probability of ‘initiating a conversation’ for each type.
                        Probability of initiating      Probability of not initiating      Total
                        a conversation                 a conversation
Shopping                0.9                            0.1                                1
Window shopping         0.3                            0.7                                1

[Area diagram: the widths 0.2 (Shopping) and 0.8 (Window shopping) are split by the conditional probabilities 0.9 / 0.1 and 0.3 / 0.7 into four rectangles.]
• The probability of each section is the area of each rectangle. (0.18, 0.02, 0.24, 0.56)

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 347


5.1. Naïve Bayes Algorithm UNIT
05

Introduction to Bayesian statistics


3) Remove ‘impossible event’ from the observations.
• The customer initiated a conversation. (additional information)

[Area diagram: once we know the customer initiated a conversation, only the “initiating a conversation” rectangles remain, 0.2 × 0.9 = 0.18 for Shopping and 0.8 × 0.3 = 0.24 for Window shopping; the “not initiating” sections become impossible and are removed.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 348


5.1. Naïve Bayes Algorithm UNIT
05

Introduction to Bayesian statistics


4) Calculate Bayesian probability of the shopping group.

[Diagram: the remaining areas 0.18 (Shopping) and 0.24 (Window shopping) are normalized to 3/7 and 4/7; given the result “initiating a conversation,” the probability of the cause being the Shopping group is 3/7 ≈ 43%.]

• The result is 0.18:0.24, which is 3:4.


• Thus, it can be estimated that the probability of the customer who initiated a conversation being
in the shopping group is 3/7 (43%). It is either called Bayesian statistics or posterior probability.
• Chooses if a person who initiated a conversation is in the shopping or window shopping group
based on probability.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 349


5.1. Naïve Bayes Algorithm UNIT
05

Introduction to Bayesian statistics


‣ The probability of a customer who initiated a conversation making a purchase is now different.
‣ Bayesian updating for customer type

Final summary
• The prior probability of being in the shopping group: 0.2
• Observation: the customer initiated a conversation
• The posterior probability of being in the shopping group: 3/7 ≈ 0.428571

‣ The customer wouldn’t be in the shopping group for sure, but the probability is doubled.
‣ Bayesian estimation is ‘performing Bayesian updating to the posterior probability based
on the behavioral observation of the prior probability.’
‣ Such an estimation method is called Bayesian statistics.
‣ Bayes’ theorem
• Obtaining the posterior probability based on the prior probability

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 350


5.1. Naïve Bayes Algorithm UNIT
05

Coding Exercise #0309

Follow practice steps on 'ex_0309.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 351


Unit 6.

KNN Algorithm
6.1. KNN Algorithm

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 352


6.1. KNN Algorithm UNIT
06

KNN Algorithm
About KNN (K-Nearest Neighbors)
‣ One of the simplest algorithms
‣ Prediction is based on the 𝑘 nearest neighboring points.
‣ There are classification and regression variants.
• Classification: prediction decided by the majority class of the nearest neighbors.
• Regression: prediction given by the average of the nearest neighbors.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 353


6.1. KNN Algorithm UNIT
06

About KNN (K-Nearest Neighbors)


‣ KNN is a classification and prediction technique for a data set with the unknown category of the
objective variable. It designates the most similar category of the surrounding data set.
‣ For KNN, specific criteria are required for measuring the ‘analogy’ between the certain data set and
the surrounding one and how many data sets will be used for the final classification of the objective
variable category.
1) Analogy measurement
• There are many ways to measure the analogy between data. In general, analogy calculation is done
by inversing the squared Euclidean distance between two points or using the Pearson correlation
coefficient. If the points are discrete variables, use the Jaccard coefficient.
2) Objective variable classification criteria
• K in the KNN refers to the number of surrounding data points being used for classifying objective
variables of the specific data point after measuring the analogy between the specific data point and
other surrounding data sets.
Ex Take movie recommendations as an example. It recommends a movie that is similar to the
customer’s favorite movie and that the customer hasn’t watched yet. When considering only
one movie that is the most similar to the customer’s favorite, it is ‘1-nearest neighbor.’ When
considering three similar movies, it would be ‘3-nearest neighbor.’

• Likewise, K-nearest neighbor is a technique to determine a new category according to the principle
of majority rule. It finds k surrounding data points similar to the specific data point and considers
the classification category of the specific data point.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 354


6.1. KNN Algorithm UNIT
06

About KNN (K-Nearest Neighbors)

[Scatter plot in the (X1, X2) plane: a new point ‘☆’ surrounded by points of two classes (‘□’ and ‘○’), with circles marking the neighborhoods used for K=1, K=3, and K=5.]

‣ The figure above conceptualizes the change in objective variables depending on the K value setting
in the K-nearest neighbor method. The ‘☆’ point is the data value that needs to be classified; the
points shaped in a square and circle are other data points present around.
‣ When setting K=1, since the closest data point to the star is a circle, the objective variable of ‘☆’ will
be classified as ‘○.’ On the other hand, if K=3, three points closest to ‘☆’ will be considered,
including two squares and one circle. In this case, ‘☆’ will be classified as ‘□.’ Also, if K=5, since
there are more circles around ‘☆,’ it will be again classified as ‘○.’

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 355


6.1. KNN Algorithm UNIT
06

About KNN (K-Nearest Neighbors)

[The same scatter plot as above, showing the K=1, K=3, and K=5 neighborhoods around ‘☆’.]

‣ Thus, different K values significantly change the K-nearest neighbor’s predicted result of the objective
variable category. This is why setting an appropriate K value is important in the K-nearest neighbor
method.
‣ However, there are no clear theoretical or statistical standards for an appropriate K value. Different K
values are usually set for repeated tests, and the final K value is set once it shows the optimal
classification performance. Generally, a random initial K value is designated between K=3 and K=9,
which is then used to test the classification performance with training and evaluation data to select the
optimal K value.
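A short scikit-learn sketch of choosing K in the 3–9 range by cross-validation, as described above (X_train and y_train are assumed to be prepared):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_neighbors": [3, 5, 7, 9]}               # candidate K values
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)                               # the K with the best validated accuracy
```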

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 356


6.1. KNN Algorithm UNIT
06

Pros
‣ Simple and intuitive
‣ No model parameters to calculate. So, there is no training step.

Cons
‣ Since there is no “model,” little insight can be extracted.
‣ No model parameters that store the learned pattern. The training dataset is required for prediction.
‣ Prediction is not efficient ⇨ “Lazy algorithm.”

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 357


6.1. KNN Algorithm UNIT
06

KNN classification algorithm

[Illustration: when 𝑘 is fixed, a new observation is classified as the majority class among its 𝑘 nearest neighbors.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 358


6.1. KNN Algorithm UNIT
06

KNN classification algorithm


‣ When 𝑘 is too small, overfitting may happen.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 359


6.1. KNN Algorithm UNIT
06

Coding Exercise #0310

Follow practice steps on 'ex_0310.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 360


Unit 7.

SVM Algorithm
7.1. SVM Algorithm

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 361


7.1. SVM Algorithm UNIT
07

SVM Algorithm
About SVM (Support Vector Machine)
‣ Enhanced classification accuracy by maximizing the margin
‣ Effective non-linear classification boundary by the “kernel” transformation

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 362


7.1. SVM Algorithm UNIT
07

About SVM (Support Vector Machine)


‣ The support vector machine is a model that classifies data by finding the line (or hyperplane) for which the distance (margin) between the data of different categories becomes maximum.
‣ As shown in the figure below, the support vector machine finds the hyperplane that splits the data into two categories with the maximum margin.
‣ Many lines or planes could split the data into two categories. However, if the boundary passes too close to some points, observations in the evaluation data or in unknown future data may cross the classification boundary when the points near the boundary shift slightly.

[Figure: two classes plotted on axes X1 and X2, separated by a maximum-margin boundary; the points touching the margin are the support vectors]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 363


7.1. SVM Algorithm UNIT
07

About SVM (Support Vector Machine)


‣ To minimize such a possibility, the line (or hyperplane) that yields the maximum margin between the data categories should be found. In other words, the aim is to find a hyperplane that separates the classes as widely as possible so that the model generalizes to future data, not merely the current training data.
‣ The points of each class that lie closest to the boundary are called support vectors; each class has at least one support vector.

[Figure: the same maximum-margin plot as above]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 364


7.1. SVM Algorithm UNIT
07

About SVM (Support Vector Machine)


‣ However, it is not always possible to classify all data linearly. Sometimes the classification boundary must be a curve or a more complex non-linear surface.
‣ If so, use the kernel trick: map the given data into a higher dimension and find a hyperplane that can classify the data in the transformed space.
‣ Instead of actually converting the data to a higher dimension, the kernel trick uses a kernel function that returns the same value as the inner product of the vectors in that higher-dimensional space. This yields a non-linear classification without explicitly moving the data to a higher dimension.
‣ The major Kernel functions include polynomial, Gaussian, and sigmoid kernels.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 365


7.1. SVM Algorithm UNIT
07

Pros
‣ Not very sensitive to the outliers.
‣ Performance is good.

Cons
‣ Training is relatively slow. Performs poorly for large data.
‣ The kernel and the hyperparameter set should be carefully optimized.
‣ Not much insight can be gained.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 366


7.1. SVM Algorithm UNIT
07

SVM classification algorithm

[Figure: classification boundary between two classes]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 367


7.1. SVM Algorithm UNIT
07

SVM classification algorithm

[Figure: classification boundary between two classes]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 368


7.1. SVM Algorithm UNIT
07

SVM classification algorithm

[Figure: classification boundary with its margin; the points lying on the margin are the support vectors]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 369


7.1. SVM Algorithm UNIT
07

Hyperplane
‣ For a 𝑘-dimensional feature space, a hyperplane has dimension 𝑘−1.
Ex For a two-dimensional space, a hyperplane is a line that bisects the plane and can be parametrized as:

𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ = 0

The two-dimensional space is subdivided into two half-spaces:

𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ > 0   and   𝛽₀ + 𝛽₁𝑋₁ + 𝛽₂𝑋₂ < 0

An observation belongs to either one of them ⇨ binary classification!

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 370


7.1. SVM Algorithm UNIT
07

Margin optimization (for the binary 𝑦)

‣ The boundary margin 𝑀 is maximized.
‣ If a clear margin is not possible to establish, an error is allowed within a limit.
‣ The target is to maximize 𝑀 by optimizing the parameters 𝛽₀, 𝛽₁, …, 𝛽_𝑝 and 𝜀₁, …, 𝜀_𝑛 subject to the constraints:
• Constraint 1: ∑_𝑗 𝛽_𝑗² = 1
• Constraint 2: 𝑦_𝑖 (𝛽₀ + 𝛽₁𝑥_𝑖1 + ⋯ + 𝛽_𝑝 𝑥_𝑖𝑝) ≥ 𝑀(1 − 𝜀_𝑖),
𝑦_𝑖 is either -1 or +1.
• Constraint 3: 𝜀_𝑖 ≥ 0 and ∑_𝑖 𝜀_𝑖 ≤ 𝐶
⇦ 𝐶 is a hyperparameter related to the misclassification errors.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 371


7.1. SVM Algorithm UNIT
07

Kernel
‣ Mapping to a higher dimension using the “kernel” functions.
‣ Kernel functions introduce an effective non-linear classification boundary.

Ex Polynomial kernel

[Figure: 1D data on the 𝑋 axis mapped to 2D by adding 𝑋² as a second coordinate, making the two classes linearly separable]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 372


7.1. SVM Algorithm UNIT
07

Kernel
‣ Mapping to a higher dimension using the “kernel” functions.
‣ Kernel functions introduce an effective non-linear classification boundary.

Ex Polynomial kernel

[Figure: 2D data (𝑋₁, 𝑋₂) mapped to 3D by adding 𝑋₁² + 𝑋₂² as a third coordinate, making the two classes linearly separable]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 373


7.1. SVM Algorithm UNIT
07

Kernel
‣ Effective mapping to a higher dimension by giving the inner product of two vectors 𝒙 and 𝒙′:
• Linear: 𝐾(𝒙, 𝒙′) = 𝒙 ⋅ 𝒙′
• Polynomial: 𝐾(𝒙, 𝒙′) = (𝛾 𝒙 ⋅ 𝒙′ + 𝑟)^𝑑, where 𝛾 > 0
• Sigmoid: 𝐾(𝒙, 𝒙′) = tanh(𝛾 𝒙 ⋅ 𝒙′ + 𝑟), where 𝛾 > 0
• Radial basis function (rbf): 𝐾(𝒙, 𝒙′) = exp(−𝛾 ‖𝒙 − 𝒙′‖²), where 𝛾 > 0
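
‣ These kernels map directly onto the kernel parameter of scikit-learn's SVC. A minimal sketch on a toy data set that is not linearly separable (make_circles); the noise level, C, and gamma values are illustrative assumptions:

from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles: not linearly separable in the original 2D space.
X, y = make_circles(n_samples=300, noise=0.1, factor=0.4, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

for kernel in ['linear', 'poly', 'rbf', 'sigmoid']:
    svm = SVC(kernel=kernel, C=1.0, gamma='scale')
    svm.fit(X_train, y_train)
    print(kernel, 'test accuracy:', round(svm.score(X_test, y_test), 3))
# The rbf kernel typically separates the circles well, while the linear kernel cannot.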

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 374


7.1. SVM Algorithm UNIT
07

Coding Exercise #0311

Follow practice steps on 'ex_0311.ipynb’ file.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 375


Unit 8.

Ensemble Algorithm
8.1. The concept of Ensemble Algorithm and
Voting
8.2. Bagging & Random Forest
8.3. Boosting

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 376


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

Ensemble algorithms
About ensemble algorithms

‣ A strong predictive model built from weaker learners.


‣ Voting type:
• A collection of basic learners that "vote"
• An ensemble of different kinds of learners  Ex Combine Tree, KNN, SVM, etc.
‣ Bagging type:
• A collection of independent weak learners that "vote"
• An ensemble of the same kind of weak learners  Ex Random Forest
‣ Boosting type:
• A series of weak learners that adaptively learn and predict  Ex AdaBoost, GBM, XGBoost, etc.
• The series is grown by adding new learners multiplied by the "boosting weights."

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 377


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

About ensemble algorithms


‣ Making better decisions by appropriately combining multiple opinions obtained from different experts
‣ Majority Voting – Analysis using many different classification models with one identical training set
‣ Bagging (different training sets)
• Bootstrap + aggregating
• Random sampling with replacement (random bootstrap)
Ex Random Forest
‣ Boosting
• Build the next sample so that roughly 50% of it consists of previously misclassified data, or use weights for sample selection.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 378


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

Bootstrap
‣ Estimation of unknown statistics
‣ An easy and effective estimation method when the distribution of a model parameter or sample statistic is unknown
‣ The process of recalculating the statistics and the model for each sample obtained by additional sampling with replacement from the current sample
• No assumption is required that the parameters or sample statistics follow a normal distribution.
‣ Bootstrap sample – a set of observations drawn with replacement from the observed data (feature values and dependent variable)
Related terms (covered in more detail later):
• Resampling – also includes permutation procedures and sampling without replacement
• Bootstrap aggregation – producing a result by aggregating the predicted values obtained from different bootstrap samples

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 379


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

What is Ensemble Learning?
‣ Searching for the word 'ensemble' on a wiki gives the following definition:
• In statistical mechanics, an ensemble of a system refers to the collection of systems equivalent to that system.
‣ Simply put, it is a collection of similar groups.
‣ That is, instead of expecting performance from a single model, ensemble learning uses the collective intelligence of multiple models, for example by averaging the predictions of several single models or by deciding through a majority vote, in order to draw a better result.
‣ There are many different ensemble methods that use such collective intelligence:
• Voting – Drawing results through voting
• Bagging – Bootstrap AGGregatING (duplicated creation of various samples)
• Boosting – Weighting by supplementing previous errors
• Stacking – A meta-model built on top of different models
‣ Since ensemble learning is a general methodology rather than a single technique, further variations exist. Yet, the four methods listed above are the most representative ensemble techniques, and they are already implemented in the sklearn library.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 380


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

Voting Ensemble
Voting ensemble

‣ Can be applied to classification and regression.


‣ The learners that form an ensemble should be of different kinds for a good performance.
‣ Two voting methods for the classifier:
• Hard: predicted class label is given by the majority rule voting.
• Soft: predicted class label is given by the argmax of the sum of the predicted probabilities.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 381


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

Voting
‣ As the word suggests, voting makes a decision through votes. Voting is similar to bagging in that both aggregate results by voting, but they are clearly differentiated as follows:
• Voting: combines different algorithm models.
• Bagging: uses different sample combinations within the same algorithm.
‣ Voting selects the final result by holding a final vote over the results produced by the different algorithms.
‣ Voting is classified into hard voting and soft voting.
• Hard voting: decides the final value by majority vote over the predicted labels.
• Soft voting: averages the predicted probability of each class over all models and selects the class with the highest average probability.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 382


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

Voting

[Diagram: a single training set is fed to several different classification models (e.g., logistic regression, decision tree); each model produces a prediction P1, P2, …, Pn; the predictions are combined by voting into the final prediction PI.]

‣ Use a single training set.
‣ Use different classification models.
‣ If the predictive values of the individual models differ, voting chooses the result with the most votes.
• Hard voting: majority vote over the predicted labels.
• Soft voting: predict the class with the highest probability obtained by averaging the predictions of the individual classifiers.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 383


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

Voting
‣ Hard Voting
• Take classification as an example. Suppose the predicted values are 1, 0, 0, 1, 1. Since 1 has three votes and 0 has two votes, 1 becomes the final predictive value under hard voting.
‣ Soft Voting
• The soft voting method averages the class probabilities predicted by each model and then selects the class with the highest average probability.
• Suppose the five models' probabilities for class 0 are (0.4, 0.9, 0.9, 0.4, 0.4) and for class 1 are (0.6, 0.1, 0.1, 0.6, 0.6). The average probability of class 0 is (0.4+0.9+0.9+0.4+0.4) / 5 = 0.6, and the average probability of class 1 is (0.6+0.1+0.1+0.6+0.6) / 5 = 0.4. Soft voting therefore selects class 0, which differs from the hard-voting result (class 1, since three of the five models individually favor class 1); the sketch below verifies these numbers.
‣ In general, the soft voting method is considered more reasonable than the hard voting method in competitions because it usually gives better actual performance.
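
‣ A minimal NumPy sketch (illustrative only) that reproduces the two toy calculations above; the probability values come directly from the example:

import numpy as np

# Per-model probabilities for [class 0, class 1] from five classifiers.
proba = np.array([[0.4, 0.6],
                  [0.9, 0.1],
                  [0.9, 0.1],
                  [0.4, 0.6],
                  [0.4, 0.6]])

hard_votes = proba.argmax(axis=1)                # each model's predicted label: [1, 0, 0, 1, 1]
hard_result = np.bincount(hard_votes).argmax()   # majority vote -> class 1

soft_result = proba.mean(axis=0).argmax()        # average probabilities (0.6, 0.4) -> class 0

print('hard voting:', hard_result, '| soft voting:', soft_result)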

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 384


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

Voting ensemble in Scikit-Learn


‣ To import the voting ensemble classes:

from sklearn.ensemble import VotingClassifier  # For classification
from sklearn.ensemble import VotingRegressor   # For regression

‣ To instantiate an object that implements a voting ensemble:

Ex
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

myKNN = KNeighborsClassifier(n_neighbors=3)
myLL = LogisticRegression()
myVotingEnsemble = VotingClassifier(estimators=[('lr', myLL), ('knn', myKNN)], voting='hard')

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 385


8.1. The concept of Ensemble Algorithm and Voting UNIT
08

Voting ensemble in Scikit-Learn


‣ We can train and predict just like any other estimator:
Ex myVotingEnsemble.fit(X_train, Y_train)
myVotingEnsemble.predict(X_test)

Scikit-Learn VotingClassifier/Regressor Hyperparameters:

Hyperparameter Explanation
estimators The list of basic learner objects
voting Either ‘soft’ or ‘hard’ (for classifier only)

‣ More information can be found at:


Classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingClassifier.html
Regressor: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.VotingRegressor.html

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 386


Unit 8.

Ensemble Algorithm
8.1. The concept of Ensemble Algorithm and
Voting
8.2. Bagging & Random Forest
8.3. Boosting

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 387


8.2. Bagging & Random Forest UNIT
08

Bagging Ensemble: Random Forest


About Random Forest

‣ An ensemble algorithm based on trees


‣ Can be applied to classification and regression

Pros

‣ Powerful
‣ Few assumptions
‣ Little or no concern about the overfitting problem

Cons

‣ Training is time consuming.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 388


8.2. Bagging & Random Forest UNIT
08

Random Forest algorithm

‣ Trees built from randomly chosen variables and observations form a forest.


‣ Prediction is decided by the majority vote.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 389


8.2. Bagging & Random Forest UNIT
08

Random Forest algorithm


1) Make trees with randomly selected variables and observations.
2) Keep only those with the lowest Gini impurity (or entropy).
3) Repeat from step 1) a given number of times.
4) Using the trees gathered during the training step, we can make predictions by majority vote.
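
A minimal scikit-learn sketch of the forest-and-vote idea described in these steps; the data set (breast cancer) and the hyperparameter values are illustrative assumptions, not prescribed settings:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# 100 trees, each grown on a bootstrap sample with a random subset of features per split.
rf = RandomForestClassifier(n_estimators=100, max_depth=5, min_samples_leaf=3, random_state=123)
rf.fit(X_train, y_train)
print('test accuracy:', round(rf.score(X_test, y_test), 3))   # prediction = majority vote of the trees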

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 390


8.2. Bagging & Random Forest UNIT
08

Scikit-Learn RandomForestClassifier/Regressor Hyperparameters:

Hyperparameter Explanation
n_estimators The number of trees in the forest

max_depth The maximum depth of a tree

min_samples_leaf The minimum number of sample points required to be at a leaf node

min_samples_split The minimum number of sample points required to split an internal node

max_features The number of features to consider when looking for the best split

‣ Except for “n_estimators,” the rest are analogous to those of DecisionTreeClassifier/Regressor.

‣ More information can be found at:


Classifier:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Regressor:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 391


8.2. Bagging & Random Forest UNIT
08

Bagging
‣ Bagging-based ensemble method
Ex Random Forest algorithm
• Easy to use since it is well constructed in the Sklearn library
• Relatively fast execution speed
• High performance

‣ The ensemble method has been widely used as it raises the performance level and is easy to use.
The bagging-based ensemble method is commonly found in the high-ranked solutions in Kaggle.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 392


8.2. Bagging & Random Forest UNIT
08

Bagging

[Diagram: the training set (data + labels) is split into bootstrap samples 1, 2, 3, … (duplication is allowed because bootstrap samples are drawn with replacement); a decision tree model is trained on each sample; the predictions P1, P2, …, Pn are combined by voting for classification, or by averaging for regression, into the final prediction PI of the bagged ensemble.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 393


8.2. Bagging & Random Forest UNIT
08

Bagging
‣ Bagging is an abbreviation of Bootstrap Aggregating.
‣ Bootstrap = Sample
‣ Aggregating = Adding up
‣ Bootstrap refers to a sampling method in which the data is split into samples that are allowed to overlap, because each sample is drawn with replacement.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 394


8.2. Bagging & Random Forest UNIT
08

Bagging
Ex Random Forest is a typical bagging method algorithm.

‣ It creates multiple decision trees and performs sampling of different data sets while allowing
overlapped data sets.
‣ If the data set consists of [1, 2, 3, 4, 5]:
• Group 1 = [1, 2, 3]
• Group 2 = [1, 3, 4]
• Group 3 = [2, 3, 5]
‣ This is the bootstrap method. In the classification problem, voting is done on each tree trained with
different sampling for the final prediction result.
‣ In regression problems, the average of each obtained value is calculated.
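
‣ A minimal NumPy sketch of the overlapping-group idea above; the data set [1, 2, 3, 4, 5] follows the example, and the sample size of 3 and the random seed are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(123)
data = np.array([1, 2, 3, 4, 5])

# Draw three bootstrap samples: sampling WITH replacement, so a value may repeat
# within a sample and the same value may appear in several samples.
for i in range(3):
    sample = rng.choice(data, size=3, replace=True)
    print('Group', i + 1, ':', sample)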

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 395


8.2. Bagging & Random Forest UNIT
08

Bagging
‣ “Bagging”: Bootstrap AGGregatING

[Diagram: data + labels → bootstrap samples 1, 2, 3, … → model 1, 2, 3, … → bagged ensemble → vote]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 396


8.2. Bagging & Random Forest UNIT
08

Differences Between Bagging and Voting


‣ The greatest difference between the bagging and voting methods is whether a single algorithm is trained on multiple resampled data sets (bagging) or various algorithms are applied to the same data set (voting).
‣ In general, the bagging method trains a single algorithm on differently sampled data sets and then votes. It offers relatively better usability than the voting method because it uses only one algorithm; what matters most, therefore, is that algorithm's hyperparameters.
‣ Taking the Random Forest algorithm as an example again, it is suitable for obtaining a baseline score because it requires only simple hyperparameter settings, such as how many trees to use (n_estimators), the maximum depth (max_depth), and the minimum number of samples at a leaf (min_samples_leaf).

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 397


8.2. Bagging & Random Forest UNIT
08

Advantages of the Bagging Ensemble


‣ The bagging method can reduce variance compared to making a prediction with a single model. The three major sources of training error in a model are variance, noise, and bias. (Of course, other significant factors such as overfitting/underfitting and various preprocessing issues also matter, but assume these have already been handled.)
‣ The ensemble method reduces variance, thus enhancing the performance of the final result.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 398


8.2. Bagging & Random Forest UNIT
08

sklearn.ensemble.BaggingClassifer/BaggingRegressor
‣ The sklearn library package provides the wrapper class called BaggingClassifier/BaggingRegressor.
‣ When designating the base algorithm to the base_estimator parameter, the
BaggingClassifier/BaggingRegressor performs bagging ensemble.
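
‣ A minimal sketch of the wrapper described above; the data set and settings are illustrative, and the base algorithm is passed as the first argument (recent scikit-learn versions renamed the base_estimator keyword to estimator, so the positional form below works across versions):

from sklearn.datasets import load_wine
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# The base algorithm (here a decision tree) is wrapped by the bagging ensemble;
# each of the 50 copies is trained on its own bootstrap sample.
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=3),
                        n_estimators=50, random_state=123)
bag.fit(X_train, y_train)
print('test accuracy:', round(bag.score(X_test, y_test), 3))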

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 399


Unit 8.

Ensemble Algorithm
8.1. The concept of Ensemble Algorithm and
Voting
8.2. Bagging & Random Forest
8.3. Boosting

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 400


8.3. Boosting UNIT
08

Boosting Ensemble: AdaBoost


About AdaBoost

‣ A sequence of weak learners such as trees


‣ A weighted ensemble of weak learners
• One set of weights that increase the importance of the better-performing learners
• One set of weights that increase the importance of the wrongly classified observations
‣ Similar pros and cons as the Random Forest

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 401


8.3. Boosting UNIT
08

AdaBoost classification algorithm


‣ Let’s suppose 𝑛 observations for the training step: 𝒙𝑖 and 𝑦𝑖.
We also suppose that 𝑦𝑖∈{−1,+1}. (binary 𝑦)
‣ We will make a series of weak learners 𝐺𝑚 (𝒙) with 𝑚= 1,…,𝑀.
‣ The ensemble classifier is made up of a linear combination of these weak learners:

𝐺_𝑒𝑛𝑠𝑒𝑚𝑏𝑙𝑒(𝒙) = 𝑠𝑖𝑔𝑛( ∑_{𝑚=1}^{𝑀} 𝛼_𝑚 𝐺_𝑚(𝒙) )

where 𝛼_𝑚 are the "boosting weights" that need to be calculated.

‣ Heavier weight is given to a better performing learner.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 402


8.3. Boosting UNIT
08

AdaBoost classification algorithm

1) For the first step (𝑚=1), equal weight is assigned to the observations: 𝑤_𝑖(1) = 1/𝑛
2) For the boost sequence 𝑚 = 1, …, 𝑀:
a) Train the learner 𝐺_𝑚(𝒙) using observations weighted by 𝑤_𝑖(𝑚).
b) Calculate the error ratio:

𝜀_𝑚 = ∑_𝑖 𝑤_𝑖(𝑚) 𝐼(𝑦_𝑖 ≠ 𝐺_𝑚(𝒙_𝑖)) / ∑_𝑖 𝑤_𝑖(𝑚)

⇨ 𝐼(𝑦_𝑖 ≠ 𝐺_𝑚(𝒙_𝑖)) gives 1 for an incorrect prediction, else 0.


Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 403


8.3. Boosting UNIT
08

AdaBoost classification algorithm


3) For the boost sequence 𝑚 = 1, …, 𝑀 (continued):
c) Calculate the boosting weight:

𝛼_𝑚 = (1/2) 𝑙𝑜𝑔((1 − 𝜀_𝑚) / 𝜀_𝑚)

⇨ As 𝜀_𝑚 → 0, 𝛼_𝑚 is a large positive number. The learner is given more importance!
⇨ As 𝜀_𝑚 → 0.5, 𝛼_𝑚 → 0.
⇨ As 𝜀_𝑚 → 1, 𝛼_𝑚 is a large negative number.

d) For the next step, the weights of the wrongly predicted observations are rescaled by a factor 𝑒^{𝛼_𝑚}. This can be compactly expressed as:

𝑤_𝑖(𝑚+1) = 𝑤_𝑖(𝑚) ⋅ exp(𝛼_𝑚 𝐼(𝑦_𝑖 ≠ 𝐺_𝑚(𝒙_𝑖))), where 𝐼(𝑦_𝑖 ≠ 𝐺_𝑚(𝒙_𝑖)) gives 1 for an incorrect prediction, else 0.

⇨ In the next sequence step, the wrongly predicted observations receive heavier weight.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 404


8.3. Boosting UNIT
08

AdaBoost classification algorithm


4) The ensemble classifier is made up of a linear combination of weak learners:

𝐺_𝑒𝑛𝑠𝑒𝑚𝑏𝑙𝑒(𝒙) = 𝑠𝑖𝑔𝑛( ∑_{𝑚=1}^{𝑀} 𝛼_𝑚 𝐺_𝑚(𝒙) )

5) For a new testing condition 𝒙′, we can predict 𝑦′ by 𝐺_𝑒𝑛𝑠𝑒𝑚𝑏𝑙𝑒 (𝒙′).
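
Scikit-learn implements this procedure in AdaBoostClassifier. A minimal sketch in which the weak learners 𝐺_𝑚 are the default depth-1 decision trees ("stumps"); the data set and hyperparameter values are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# M = n_estimators weak learners are combined with the boosting weights alpha_m.
ada = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=123)
ada.fit(X_train, y_train)
print('test accuracy:', round(ada.score(X_test, y_test), 3))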

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 405


8.3. Boosting UNIT
08

AdaBoost classification algorithm

[Figure: toy example of the AdaBoost steps. At each step the misclassified points (labeled 1 or 2) receive heavier weight, and the weak learners' boundaries from the individual steps are combined into the final ensemble.]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 406


8.3. Boosting UNIT
08

Scikit-Learn AdaBoostClassifier/Regressor Hyperparameters

Hyperparameter Explanation
base_estimator The base estimator with which the boosted ensemble is built

n_estimators The maximum number of estimators at which boosting is terminated

learning_rate The rate by which the contribution of each learner is shrunken

algorithm Either ‘SAMME’ or ‘SAMME.R’

‣ “base_estimator” is by default None, which means DecisionTreeClassifier(max_depth=1).

‣ More information can be found at:


Classifier:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html
Regressor:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 407


8.3. Boosting UNIT
08

Boosting Ensemble: GBM & XGBoost


Gradient Boosting Machine (GBM)

‣ GBM is also made up of a sequence of weak learners similar to AdaBoost.


‣ The disagreement between the predicted ŷ and the true 𝑦 can be represented by a "loss" function.
‣ In the sequence of weak learners, the learner at step 𝑚 can be obtained by moving the previous-step learner in the direction that decreases the loss defined above.
‣ These updating movements are restricted to a set of base learners.
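
‣ A minimal sketch of the gradient boosting machine in scikit-learn; setting subsample below 1.0 makes each weak learner see only a fraction of the data. The data set and values are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# Each boosting step fits a small tree in the direction that decreases the loss.
gbm = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                 subsample=0.8, random_state=123)
gbm.fit(X_train, y_train)
print('test accuracy:', round(gbm.score(X_test, y_test), 3))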

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 408


8.3. Boosting UNIT
08

Scikit-Learn GradientBoostingClassifier/Regressor Hyperparameters

Hyperparameter Explanation
loss The loss function

n_estimators The number of boosting steps (weak learners)

learning_rate The contribution of each weak learner

subsample The fraction of data that will be used by individual weak learner

‣ Need to be tuned for optimized performance.

‣ More information can be found at:


Classifier:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
Regressor:
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 409


8.3. Boosting UNIT
08

Extreme Gradient Boosting (XGBoost)


‣ Improves upon GBM in the execution speed.
‣ More resistant to overfitting than GBM.
‣ Not included in the Scikit-Learn library. Requires installation of the “xgboost” library.
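
‣ A minimal sketch using the sklearn-style wrapper from the separately installed xgboost package (pip install xgboost); the data set and hyperparameter values are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier   # requires the separately installed "xgboost" library

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

xgb = XGBClassifier(n_estimators=200, learning_rate=0.1, subsample=0.8)
xgb.fit(X_train, y_train)
pred = xgb.predict(X_test)
print('test accuracy:', round(accuracy_score(y_test, pred), 3))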

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 410


8.3. Boosting UNIT
08

XGBClassifier/Regressor Hyperparameters

Hyperparameter Explanation
booster gbtree or gblinear

n_estimators The number of boosting steps (weak learners)

learning_rate The contribution of each weak learner

subsample The fraction of data that will be used by individual weak learner

‣ Need to be tuned for optimized performance.


‣ More information can be found at: https://xgboost.readthedocs.io/en/latest/python/index.htm

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 411


8.3. Boosting UNIT
08

XGBoost
‣ XGBoost is a library that implements the gradient boosting algorithm for use in distributed systems.
‣ It supports both regression and classification problems. This popular algorithm features good performance and resource efficiency.
‣ Gradient boosting is a representative boosting algorithm, and XGBoost is a library that implements it with support for parallel training.
‣ It has recently seen wide use thanks to its high performance and efficient use of computing resources, and it became even more popular through frequent use by top rankers on Kaggle.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 412


8.3. Boosting UNIT
08

Light GBM
‣ While the tree is vertically expanded with Light GBM, other algorithms expand the tree horizontally.
In other words, while the Light GBM is leaf-wise, other algorithms are level-wise.
‣ For expansion, the leaf with max delta loss is selected. When expanding the same leaf, the leaf-wise
algorithm can reduce more loss than the level-wise algorithm.
‣ The following diagram shows how LightGBM grows its trees, which differs from how other boosting algorithms are implemented.

[Figure: leaf-wise tree growth, the LightGBM operation method]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 413


8.3. Boosting UNIT
08

Boosting
‣ The boosting algorithm is also ensemble learning. After sequentially learning weak learning
machines, it supplements errors by adding weight to inaccurately predicted data from the previous
learning.
‣ The difference from other ensemble methods is that it performs sequential learning and supplements errors by adding weight. However, one disadvantage is that it is difficult to parallelize due to its sequential nature, leading to a longer learning time compared to other ensembles.
Single estimator, Bagging, Boosting

• Single estimator: one model trained in a single pass
• Bagging: weak learners trained in parallel
• Boosting: weak learners trained sequentially

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 414


8.3. Boosting UNIT
08

Learning a series of predictors by supplementing previous models


‣ AdaBoost
• Train the first classifier (e.g., a decision tree) on the training set and make a prediction.
• Slightly increase the weights of the training samples that the algorithm classified inaccurately.
• The second classifier is trained on the training set using the updated weights and makes a prediction again.
• The weights are updated again, and the same process is repeated.
‣ Gradient boosting
• Similar to AdaBoost, gradient boosting sequentially adds predictors to the ensemble to correct the previous errors.
• However, instead of repeatedly modifying the sample weights as AdaBoost does, it fits a new predictor to the residual errors made by the previous predictor.
• For regression problems – gradient boosted regression tree (GBRT)

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 415


8.3. Boosting UNIT
08

AdaBoost (Adaptive Boosting)


‣ AdaBoost (Adaptive Boosting) is an adaptive boosting method, one of the most representative
boosting algorithms.
‣ As explained, the AdaBoost sequentially learns weak learning machines and supplements errors by
applying weights to inaccurately predicted data.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 416


8.3. Boosting UNIT
08

Principle of the AdaBoost

[Figure: four boxes showing the same + / − data; Boxes 1–3 each show one weak learner's decision line (D1, D2, D3), and Box 4 shows their weighted combination.]

‣ Box 1 is the result of a weak learning machine that classifies along the D1 line. However, the error rate is relatively high because some data points marked + fall into the red sector.
‣ The D2 line in Box 2 moves to the right to compensate for the errors from Box 1. Now some data points marked − fall into the blue sector; the performance is better, but it is not satisfactory yet.
‣ The D3 line in Box 3 is drawn horizontally near the top. However, some data points marked − are still classified incorrectly.
‣ By combining the previous Boxes 1, 2, and 3, Box 4 finds the most ideal combination. It shows much better performance than the three individual learning machines.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 417


8.3. Boosting UNIT
08

Principle of the AdaBoost


‣ How is the weight applied for combination?

Ex Suppose that the following weight will be applied to the performance of Box 1~3.
• Performance of Box 1: weight = 0.2
• Performance of Box 2: weight = 0.5
• Performance of Box 3: weight = 0.6
It can be expressed as the following formula:

0.2 * Box 1 + 0.5 * Box 2 + 0.6 * Box 3 = Box 4

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 418


8.3. Boosting UNIT
08

Gradient Descent
‣ The key to the boosting method is to supplement errors from the previous learning.
‣ AdaBoost and gradient descent-based boosting are slightly different in how they supplement errors.
‣ Gradient descent uses differentiation to minimize the difference between the predicted value and
actual data.
• Weight
• Input_data = feature data (input data)
• Bias
• Y_actual = actual data value
• Y_predict = predicted value
• Loss = error rate
‣ Y_predict = weight*input_data+bias
• The predictive value can be obtained from the above formula.
Calculating the difference with actual data will result in the total error rate.
‣ Loss = Y_predict – Y_actual
• (There are many different formulas for defining the error rate, including root mean square error and mean absolute error, but the above definition is used here for convenience.)
• The purpose of gradient descent is to find the weight that makes the loss closest to 0.
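
‣ The idea above can be sketched in a few lines of NumPy: repeatedly nudge the weight and bias in the direction that reduces a squared-error loss. The toy data, learning rate, and number of steps are illustrative assumptions:

import numpy as np

# Toy 1D data generated from y = 2x + 1 (the values the model should recover).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_actual = 2.0 * x + 1.0

weight, bias, lr = 0.0, 0.0, 0.05          # start from zero; lr = learning rate

for step in range(2000):
    y_predict = weight * x + bias           # Y_predict = weight * input_data + bias
    error = y_predict - y_actual
    loss = np.mean(error ** 2)              # mean squared error
    grad_w = np.mean(2 * error * x)         # gradient of the loss w.r.t. the weight
    grad_b = np.mean(2 * error)             # gradient of the loss w.r.t. the bias
    weight -= lr * grad_w                   # move against the gradient
    bias -= lr * grad_b

print(round(weight, 3), round(bias, 3), round(loss, 6))   # approaches 2, 1, 0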
Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 419
8.3. Boosting UNIT
08

Gradient Boosting Machine (GBM)


‣ The boosting algorithm based on gradient descent is called the gradient boosting machine, abbreviated as GBM. It is provided in the sklearn package and applicable to both classification and regression problems.
• GradientBoostingClassifier
• GradientBoostingRegressor
‣ It is extremely easy to apply in actual problems.
‣ XGBoost and LightGBM are two major ML packages mostly used in Kaggle.
‣ XGBoost and LightGBM are not provided in the existing sklearn package. However, both offer sklearn-style wrapper classes, so fit & predict can be used in the same way as with the existing sklearn ML classes.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 420


8.3. Boosting UNIT
08

Gradient Boosting Machine (GBM)


‣ Together with XGBoost, LightGBM is one of the most highlighted boosting algorithms. Although XGBoost has excellent performance, its learning time is long.
‣ Advantages of the LightGBM
• Short learning time
• Relatively small memory use
• Automatic conversion and optimal splitting of categorical features

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 421


8.3. Boosting UNIT
08

Characteristics of the LightGBM


‣ The existing tree-based algorithm uses a level-wise method. It splits but maintains a balanced tree
as much as possible, so the tree depth becomes minimum. A disadvantage of the level-wise method
is that it takes time to make a balanced tree.
‣ On the other hand, LightGBM uses a leaf-wise method. It does not balance the tree; instead, it keeps splitting the leaf node with the maximum loss, so the tree becomes deeper and asymmetrical. Repeatedly splitting the leaf with the maximum loss reduces the prediction error more than splitting a balanced tree does.

[Figure: level-wise tree growth]

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 422


8.3. Boosting UNIT
08

Hyperparameter tuning methods


‣ The increased Num_leaves value raises accuracy. However, it also increases the tree depth and
makes the model complex, thus increasing the probability of overfitting.
‣ Min_data_in_leaf is a significant parameter for controlling overfitting. Its best value depends on num_leaves and the size of the training data, but in general, setting it to a larger value prevents the tree from growing too deep.
‣ Max_depth constrains the tree depth and mitigates overfitting when combined with the two parameters above, as in the sketch below.
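
‣ A minimal sketch using the sklearn-style wrapper from the separately installed lightgbm package (pip install lightgbm), with the three parameters discussed above; the data set and values are illustrative assumptions (min_data_in_leaf is exposed as min_child_samples in the wrapper):

from lightgbm import LGBMClassifier   # requires the separately installed "lightgbm" library
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

# num_leaves, min_child_samples (min_data_in_leaf), and max_depth together control overfitting.
lgbm = LGBMClassifier(n_estimators=200, num_leaves=31, min_child_samples=20, max_depth=-1)
lgbm.fit(X_train, y_train)
print('test accuracy:', round(accuracy_score(y_test, lgbm.predict(X_test)), 3))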

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 423


8.3. Boosting UNIT
08

LightGBM hyperparameters

Parameter           Default value   Description
num_iterations      100             Designates the number of trees built during the repeated boosting work. Overfitting occurs if the value is too high.
learning_rate       0.1             The update rate applied as the boosting steps are repeated. The value is designated between 0 and 1.
max_depth           -1              Identical to the max_depth of tree-based algorithms. No restriction is applied to the tree depth when the value is smaller than 0.
min_data_in_leaf    20              Identical to the min_samples_leaf of a decision tree. Used as a parameter to control overfitting.
num_leaves          31              Signifies the maximum number of leaves for one tree.
boosting            gbdt            The boosting type.
bagging_fraction    1.0             Designates the ratio for data sampling. Used to control overfitting.
feature_fraction    1.0             The ratio of features randomly selected for learning each individual tree.
lambda_l1           0.0             The value for L1 regularization.
lambda_l2           0.0             The value for L2 regularization.

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 424
8.3. Boosting UNIT
08

Coding Exercise #0312

Follow practice steps on 'ex_0312.ipynb' file

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 425


8.3. Boosting UNIT
08

Coding Exercise #0313

Follow practice steps on 'ex_0313.ipynb' file

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 426


8.3. Boosting UNIT
08

Coding Exercise #0314

Follow practice steps on 'ex_0314.ipynb' file

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 427


End of
Document

Samsung Innovation Campus Chapter 5. Machine Learning 1 – Supervised Learning 428/44


ⓒ2022 SAMSUNG. All rights reserved.
Samsung Electronics Corporate Citizenship Office holds the copyright of book.
This book is a literary property protected by copyright law so reprint and reproduction without permission are prohibited.
To use this book other than the curriculum of Samsung Innovation Campus or to use the entire or part of this book, you must receive written
consent from copyright holder.
