SIC - AI - Chapter 5. Machine Learning 1 - v2.1
Samsung Innovation Campus
Artificial Intelligence Course
Chapter 5. Machine Learning 1 – Supervised Learning
Be able to introduce machine learning-based data analysis according to the business objective,
strategy, and policy and manage the overall process.
Be able to select and apply a machine learning algorithm that is the most suitable to the given
problem and perform hyperparameter tuning.
Be able to design, maintain, and optimize a machine learning workflow for AI modeling using
structured and unstructured data.
Chapter contents
Modern definition
‣ “A computer program is said to learn from experience E with respect to some class of tasks T and
performance measure P, if its performance at tasks in T, as measured by P, improves with
experience E.” – Mitchell, 1997 (p.2)
‣ “Programming computers to optimize a performance criterion using example data or past
experience.”
–Alpaydin, 2010
‣ “Computational methods using experience to improve performance or to make accurate
predictions.” – Mohri, 2012
Mathematical definition
‣ Suppose that the x-axis is the invested advertising expense (feature) while the y-axis is sales (target).
‣ Question about prediction – What are the sales when an arbitrary advertising expense is given?
‣ Linear regression: y = wx + b
• w and b are the parameters.
• 'w' is commonly used as an abbreviation of 'weight.'
‣ Since the optimal value is unknown in the beginning, start with an arbitrary value and then reach the optimal value by gradually enhancing the performance.
• In the graph, the fit starts from f1 and continues as f1 → f2 → f3.
• The optimal value is f3, where w = 0.5 and b = 2.0.
[Figure: sales vs. advertising expenses (x = 2, 4, 6, 8, 10) with candidate regression lines f1, f2, f3]
[Figure: Venn diagram placing Machine Learning at the overlap of Artificial Intelligence, Statistics, Pattern Recognition, Data Mining, Data Science, Deep Learning, Databases, and Computational Neuroscience]
Machine Learning
| Supervised learning | Unsupervised learning | Reinforcement learning |
| Target pattern is given. | Target pattern must be found out. | Policy optimization |
Machine learning workflow
‣ Problem Definition: understanding the business and defining the problem
‣ Data Preparation: data collection, searching, pre-processing of raw data, and feature engineering
‣ Modeling and optimization: model training for the data using Train / Validate / Test splits
‣ Model performance evaluation: performance metrics
‣ Enhanced model performance and application to real life
| Type | Algorithm/Method |
| Unsupervised learning | Clustering |
| | MDS, t-SNE |
| | PCA, NMF |
| | Association analysis |
| Supervised learning | Linear regression |
| | Logistic regression |
| | Tree, Random Forest, AdaBoost, XGBoost |
| | Naïve Bayes |
| | KNN |
| | Support vector machine (SVM) |
| | Neural Network |
Hyperparameters
‣ Can be set manually by the practitioner.
‣ Can be tuned to optimize the machine learning performance.
Ex k (the number of neighbors) in the KNN algorithm
Ex Learning rate in a neural network
Ex Maximum depth in a tree algorithm
Mechanism of scikit-learn
Scikit-learn is characterized by its intuitive and easy interface, complete with high-level API.
‣ An estimator instance is trained with Fit and then used with Predict (or Transform).
Estimator
• Training: .fit
• Prediction: .predict
| Classifier | Regressor |
| DecisionTreeClassifier | LinearRegression |
| KNeighborsClassifier | KNeighborsRegressor |
| GradientBoostingClassifier | GradientBoostingRegressor |
| GaussianNB, … | Ridge, … |
Scikit-Learn Library
About the Scikit-Learn library
‣ It is a representative Python machine learning library.
‣ To import a machine learning algorithm as a class:
from sklearn.<family> import <machine learning algorithm>
Ex from sklearn.linear_model import LinearRegression
‣ Hyperparameters are specified when the machine learning object is instantiated:
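A minimal sketch (the estimator and n_neighbors value are illustrative, not from the original slides):
from sklearn.neighbors import KNeighborsClassifier
# The hyperparameter is passed when the object is instantiated
knn = KNeighborsClassifier(n_neighbors=5)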
Practicing scikit-learn
‣ The sklearn.datasets module includes utilities to load datasets, including methods to load and fetch
popular reference datasets. It also features some artificial data generators.
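For example, a sketch assuming the breast cancer data set used later in this practice (569 observations):
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X, y = data.data, data.target   # features and labels; 569 observations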
Practicing scikit-learn
Line 5
• This data becomes x (independent variable, data).
Practicing scikit-learn
Line 6
• This data becomes y (dependent variable, actual value).
Practicing scikit-learn
Line 1
• Provides details about the data.
• The help shows that the default value of test_size is 0.25.
Practicing scikit-learn
Line 7
• From the total of 569 observed values, divide the data into training and evaluation sets at a ratio such as 7:3 or 8:2.
• 7.5:2.5 is the default split.
Practicing scikit-learn
‣ Use train_test_split() to split the data for making and evaluating the model.
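A minimal sketch of the split (variable names assumed from the surrounding practice):
from sklearn.model_selection import train_test_split
# test_size defaults to 0.25, i.e., a 7.5:2.5 split
X_train, X_test, y_train, y_test = train_test_split(X, y)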
Practicing scikit-learn
Line 11
• 426 observed values (75%) out of total 569 observations are found.
Line 13
• 143 observed values (25%) out of total 569 observations are found.
Practicing scikit-learn
‣ For instantiating, pass the model's hyperparameters as arguments. A hyperparameter is an option that requires a human setting and greatly affects the model performance.
Line 1-5 • Loading the test data set
Line 1-8 • Instantiating the estimator and setting hyperparameters
fit
‣ Use the fit method of an instantiated estimator for training. For a supervised learning algorithm, pass the training data and the label data together as arguments.
predict
‣ After training with fit, the estimator can be used with the predict method. predict returns the model's estimated results for the entered data.
Line 2
• It is an estimated value, so it may differ from the actual value for X_test.
Measure the accuracy by comparing the two values.
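Continuing the earlier sketch (estimator and variable names assumed):
knn.fit(X_train, y_train)       # train with training data and labels
y_pred = knn.predict(X_test)    # estimated labels for the evaluation data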
Practicing scikit-learn
Line 57
• The data frame shows the rows where the predicted value and the actual value differ.
Practicing scikit-learn
Line 66
• 133/143 ≈ 0.93 (133 correct predictions out of 143 test observations)
Practicing scikit-learn
‣ It showed 93% accuracy, which is a fairly good result. In fact, a process of increasing data accuracy is required during data pre-processing, and standardization is one of the options. The following is a brief summary of standardization.
• Standardization rescales values according to the standard normal distribution. Another term for standardization is z-transformation, and the standardized value is also referred to as the z-score. About 94% accuracy would be obtained from KNN wine classification through standardization.
• Standardization is widely used in data pre-processing in general, not only for KNN. The equation is:

$$z = \frac{x - \mu}{\sigma} \quad (\mu: \text{mean},\ \sigma: \text{standard deviation})$$
Practicing scikit-learn
Line 35
• Data frame before standardization
Practicing scikit-learn
Line 39
• The differences among column values are huge before standardization.
Practicing scikit-learn
Line 40
• After standardization, the column values do not significantly deviate from 0.
• Better performance would be possible compared to the performance before
standardization.
transform
‣ Feature processing is done with transform, which returns the processed result.
Line 3-3 • Output before pre-processing
Line 3-7 • Pre-processing: apply scaling
Line 3-9 • Result check after pre-processing
fit_transform
‣ fit and transform are combined in fit_transform.
Line 4-3 • Combination of fit and transform
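A minimal sketch using StandardScaler (the scaler choice is assumed for illustration):
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # fit and transform in one call
X_test_std = scaler.transform(X_test)        # reuse the statistics fitted on the training data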
Division of the data set
‣ The overall data set is divided into a training data set and a test data set.
‣ The training data set is used to build the model (perform k-fold cross validation if necessary).
‣ The test data set is used for performance evaluation, which yields the final model.
Overfitting and underfitting
[Figure: fits of increasing model capacity over the same (x, y) data, from underfitting (leftmost, linear) to overfitting (rightmost, 12th-order polynomial)]
‣ Even if machine learning finds the optimal solution in data distribution, a wide margin of error
occurs since the model has a small capacity. Such a phenomenon is referred to as underfitting; the
linear equation model on the leftmost figure above is an example.
‣ An easy alternative is to use higher-degree polynomials, which are non-linear equations.
‣ The rightmost figure provided above is applied with the 12th-order polynomial.
‣ The model capacity got larger, and there are 13 parameters for estimation.
$$y = w_{12}x^{12} + w_{11}x^{11} + w_{10}x^{10} + \cdots + w_1 x + w_0$$
1.3. Preparation and division of data set
Overfitting
‣ When choosing a 12th-order polynomial curve, it approximates almost perfectly to the training set.
‣ However, an issue occurs when predicting new data.
• The region around the red bar at x₀ should be predicted, but the red dot is predicted instead.
‣ The reason is because of the large capacity of the model.
• Accepting the noise during the learning process → Overfitting
‣ Model selection is required to select an adequate size model.
[Figure: a 12th-order polynomial fit making an inaccurate prediction at x₀ (inaccurate prediction in overfitting)]
Cross-Validation:
1) Split the data into a training set and a testing set.
2) Further subdivide the training set into a smaller training and a validation set.
3) Train the model with the smaller training set.
4) Evaluate the errors with the validation set.
5) Repeat from step 2) a few times.
k-fold cross-validation
‣ Subdivide the training dataset into k equal parts, then use each part in turn as the validation set and the rest for training.
Leave-one-out cross-validation
‣ Leave only one observation for validation and apply sequentially. More time-consuming.
k-fold cross-validation example (k = 10)
‣ With n = k = 10 folds, rounds 1 through 10 each use a different fold as the validation set and the remaining nine folds as the training set.
‣ The accuracy is measured for each round (e.g., 90%, 91%, 93%, …, 95%), and the repeated measurements are averaged.
Ex In the case of the iris data, the Species value of the fourth row is omitted, as shown in the table below.

| | x1 Sepal.Length | x2 Sepal.Width | x3 Petal.Length | x4 Petal.Width | y Species |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |

After reading the data, the omitted value appears as 'unknown':

| | x1 Sepal.Length | x2 Sepal.Width | x3 Petal.Length | x4 Petal.Width | y Species |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4 | 4.6 | 3.1 | 1.5 | 0.2 | unknown |
| 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 6 | 5.4 | 3.9 | 1.7 | 0.4 | setosa |
‣ The omitted value is changed to NaN. With this small amount of data that is not problematic, but manually finding missing values in a huge data frame is extremely inconvenient.
Line 17
• The number of missing values can be counted.
Line 18
• axis=0 is default, so the row with the NaN value is deleted.
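A minimal sketch (assuming the data is loaded into a DataFrame df):
df.isnull().sum()        # Line 17: count the missing values per column
df = df.dropna(axis=0)   # Line 18: axis=0 is the default, so rows with NaN are deleted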
Imputation
‣ It is sometimes hard to delete a training sample or a certain column because too much useful data would be lost. If so, estimate missing values from the other training samples in the data set using interpolation techniques. The most commonly used method is mean imputation, which replaces the missing value with the overall average of the column. In scikit-learn, use the SimpleImputer class.
Imputation
Line 45
• Check that the imputed value is the average of the column.
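A minimal sketch of mean imputation with scikit-learn (the DataFrame name is assumed):
import numpy as np
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputed = imputer.fit_transform(df.values)   # NaN entries replaced with column means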
‣ The data in the table has both ordered and unordered features. The size is ordered, but the color is not. Thus, the size is classified as an ordinal scale, while the color is a nominal scale.
Line 62
• Change the class label from strings to integers. Since the encoding is done with integers, insert the iris 'species' value.
Use the get_dummies() function of pandas to convert every value of a categorical variable into a new dummy variable.
Use the sklearn library to conveniently process one-hot encoding. The result is given as a sparse matrix in the linear-algebra sense: most of its entries are 0. The opposite concept to a sparse matrix is a dense matrix.
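A minimal sketch of both approaches (the column name 'species' is assumed from the iris practice):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
dummies = pd.get_dummies(df['species'])          # one dummy column per category
encoder = OneHotEncoder()
sparse = encoder.fit_transform(df[['species']])  # returned as a sparse matrix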
OneHotEncoder
Line 82
• (0, 0) is 1, thus setosa (the first 50 rows are setosa).

| Row index | Species | setosa | versicolor | virginica | Sparse matrix expression |
| 0 | setosa | 1 | 0 | 0 | (0, 0) |
| 1 | setosa | 1 | 0 | 0 | (1, 0) |
| … | setosa | 1 | 0 | 0 | |
| 49 | setosa | 1 | 0 | 0 | (49, 0) |
| 50 | versicolor | 0 | 1 | 0 | (50, 1) |
| 51 | versicolor | 0 | 1 | 0 | (51, 1) |
| … | versicolor | 0 | 1 | 0 | |
| 100 | versicolor | 0 | 1 | 0 | (100, 1) |
| 101 | virginica | 0 | 0 | 1 | (101, 2) |
| 102 | virginica | 0 | 0 | 1 | (102, 2) |
| … | virginica | 0 | 0 | 1 | |
| 150 | virginica | 0 | 0 | 1 | (150, 2) |
Using hold-out in practice: splitting the data set into training and test data sets
‣ df_wine contains measurements of wines produced in Vinho Verde, a region adjacent to the Atlantic Ocean in northwestern Portugal. The grade, taste, and acidity of 1,599 red wine and 4,898 white wine samples were measured and analyzed to create the data. If the data is not found at the following path, it is possible to download it directly from the UCI repository and import it locally.
Line 85
• When the wine data set of the UCI machine learning repository is not accessible, uncomment the following code and read the data set from the local path:
• df_wine = pd.read_csv('wine.data', header=None)
‣ Data splitting is possible using the train_test_split function in scikit-learn's model_selection module. First, convert the features in columns 1 to 13 to a NumPy array and assign it to variable X. train_test_split returns four arrays as a tuple, so assign them to appropriately named variables.
‣ Randomly split X and y into training and test data sets. test_size=0.3, so 30% of the sample is
assigned to X_test and y_test.
‣ Regarding the stratify parameter, if the class label array y is sent, the class ratio found in the
training and test data sets is identically maintained with the original data set.
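A minimal sketch matching this description (df_wine is assumed loaded as above):
from sklearn.model_selection import train_test_split
X = df_wine.iloc[:, 1:].values   # features in columns 1 to 13
y = df_wine.iloc[:, 0].values    # class label in column 0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)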
‣ The most widely used ratios in real life are 6:4, 7:3, and 8:2, depending on the size of the data set. It
is common and suitable for large data sets to split the training data set and test data set into the
ratio of 9:1 or 9.9:0.1.
1.4. Data pre-processing for making a good training data set
# MaxAbsScaler divides each feature by its maximum absolute value.
# Thus, the maximum absolute value of each feature becomes 1,
# and the overall feature range changes to [-1, 1].
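A minimal usage sketch:
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
X_scaled = scaler.fit_transform(X_train)   # each feature now lies in [-1, 1]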
Bias-variance trade-off
[Figure: scattered predictions illustrating low bias vs. high variance]
‣ Bias and variance have a trade-off relationship in which when one increases, the other falls, and vice
versa. The model becomes complex at the beginning of learning; the overall error cost falls due to
decreased bias. However, at some point, the model keeps learning and becomes much more
complicated, which causes higher variance and increased overall error cost. In other words, the
model gets overfitted to the training data. One way to prevent overfitting is to stop learning at the
appropriate time. Regularization is a method to prevent overfitting by lowering variance. Still, it can
increase bias instead due to the trade-off relationship.
[Figure: error vs. model complexity: variance rises and squared bias falls as complexity grows]
Ridge Regression
‣ The ridge regression model is a technique that limits the L2 norm of the regression coefficient vector w. A constraint is added to the cost function of linear regression to minimize the sum of squared weights, for the linear regression model ŷ = wX with w = (w₁, …, w_M).
‣ Then, the cost function of the ridge regression model is as follows. N is the number of data, and M is
the number of elements of the regression coefficient vector. A constraint is added to the existing
SSE (Sum of Squared Errors).
$$\hat{w}^{\,\mathrm{ridge}} = \operatorname*{argmin}_{w} \left\{ \sum_{i=1}^{N} \left( y_i - wX_i \right)^2 + \lambda \sum_{j=1}^{M} w_j^2 \right\}$$
‣ λ is a hyperparameter to adjust the weight of existing SSE and added constraint.
When the λ is large, regularization is greatly applied, and the regression coefficients become lower.
When the λ becomes smaller, regularization gets weaker. When the λ equals 0, the constraint clause
also becomes 0, the same as the general linear regression model.
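A minimal scikit-learn sketch, where the alpha parameter plays the role of λ:
from sklearn.linear_model import Ridge
ridge = Ridge(alpha=1.0)   # larger alpha = stronger regularization
ridge.fit(X_train, y_train)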
Ridge Regression
‣ The following is an example of a simple linear regression model equation:
Ridge Regression
‣ When drawing the cost function SSE(w₁, w₂) on the plane with w₁ on the x-axis and w₂ on the y-axis, an ellipse is created, as in the following figure:
[Figure: SSE contours (ellipses, "minimize cost") and the L2 constraint λ‖w‖² (circle) in the (w₁, w₂) plane]
‣ In the figure above, the ellipse drawn with a solid line is the cost function: the set of combinations of w₁ and w₂ with the same cost (SSE). At the central point of the ellipse, the cost becomes 0. Moving outward, the ellipses contain combinations of w₁ and w₂ with higher cost, i.e., models (weights w₁ and w₂) with higher error. The colored circle is the constraint; it becomes smaller when λ gets larger, and vice versa. The point where the cost function (ellipse) and the constraint (colored circle) meet is the optimal solution, where the cost of the ridge regression model is minimal.
Lasso Regression
‣ The Lasso (Least Absolute Shrinkage and Selection Operator) regression model is a technique that limits the L1 norm of the regression coefficient vector w. A constraint is added to the cost function of linear regression to minimize the sum of the absolute values of the weights. The cost function of the Lasso regression model is:

$$\hat{w}^{\,\mathrm{lasso}} = \operatorname*{argmin}_{w} \left\{ \sum_{i=1}^{N} \left( y_i - wX_i \right)^2 + \lambda \sum_{j=1}^{M} |w_j| \right\}$$
Lasso Regression
‣ When drawing the cost function of the Lasso regression model on the (w₁, w₂) plane, the constraint region is a rhombus, as in the following figure:
[Figure: SSE contours and the L1 constraint λ‖w‖₁ (rhombus) in the (w₁, w₂) plane]
‣ Since the constraint of the Lasso regression model is a rhombus, the point meeting the cost function is likely to be a vertex of the rhombus. The vertices of the rhombus are always points where w₁ or w₂ is 0. Thus, the Lasso regression model tends to produce weights that are exactly 0.
Elastic-net regression
‣ The Elastic-net regression model applies both the L2 norm and the L1 norm to the regression coefficient vector. The constraint includes both the sum of squared weights and the sum of absolute weight values. Elastic-net has two hyperparameters, λ₁ and λ₂. Its cost function is:

$$\hat{w}^{\,\mathrm{elastic}} = \operatorname*{argmin}_{w} \left\{ \sum_{i=1}^{N} \left( y_i - wX_i \right)^2 + \lambda_1 \sum_{j=1}^{M} w_j^2 + \lambda_2 \sum_{j=1}^{M} |w_j| \right\}$$
Elastic-net regression
‣ Elastic-net applies both the L2 norm and the L1 norm at the same time, so its constraint region lies somewhere in the middle. It reduces large weights while making unimportant weights 0.
[Figure: constraint regions in the (w₁, w₂) plane: L1 (rhombus), L2 (circle), and elastic-net (in between)]
Practicing
Sample (instance, observation):

| | Sepal length | Sepal width | Petal length | Petal width | Class label |
| 1 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
| 2 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
| … | | | | | |
| 50 | 6.4 | 3.5 | 4.5 | 1.2 | Versicolor |
| … | | | | | |
| 150 | 5.9 | 3.0 | 5.0 | 1.8 | Virginica |
Line 3-1 ~ 3-4
• Import the libraries required for the practice.
Line 4-1
• Convert the data variable to a NumPy ndarray and a DataFrame.
Line 1
• Merge feature and target.
Line 3 ~ 5
• Change the column name.
Line 10
• Change the target value.
Line 11
• Check the missing value.
Line 13
• petal_length has the greatest standard deviation. Compared to other features, petal_width seems to have a narrower range of values. Because of the scale differences between features, it would be better to consider scaling (standardization) after checking the model performance.
1.5. Practicing to find an optimal method to solve problems with scikit-learn
Line 14
• The correlation coefficient of petal_length and petal_width is 0.962865, which is
extremely high. Since highly correlated features may induce multicollinearity problems,
it is recommended to select one of the two variables to use.
Line 15
• The number of data points in each target class was counted using the aggregation function 'size,' confirming 50 observations in each class. Select between 'size' and 'count' depending on the purpose of the analysis: 'size' counts the number of data points including missing values, while 'count' counts only non-missing values. In this case, there is no difference between 'size' and 'count' because the iris data has no missing values.
Visualizing correlation
[Figure: correlation heatmap of the iris features (including petal_length and petal_width) with a color scale from -1.00 to 1.00]
Visualizing the correlation between features and data distribution using pairplot
[Figure: pairplot of sepal_length, sepal_width, petal_length, and petal_width, colored by species (setosa, versicolor, virginica)]
‣ setosa forms a cluster clearly separated from the other classes. It can be separated by drawing an imaginary line, so setosa can be classified with a linear model. For versicolor and virginica, it seems difficult to separate them with a line in the graph of the sepal_width and sepal_length features, where they are mixed. However, even if the boundary seems a little vague, they can be separated in other graphs.
[Figure: pie chart of the target classes: setosa 33.3%, versicolor 33.3%, virginica 33.3%]
Line 15
• The data is evenly arranged in each target class.
‣ Before starting machine learning, split the data set into training and performance test data. The final
objective of machine learning is to create a generalized model so that it can accurately predict new
data. If evaluating the performance with data used in learning, the possibility of getting it right is high
since the model is already familiar with the given data feature. For reliable evaluation, separate the
performance test data set from the training data set. Because it is the separation of data, it is referred
to as the hold out method.
‣ Split the training and performance test data sets with the train_test_split function of sklearn. Label the training data 'train' and the performance test data 'test.' X is the features of the data set, and y is the target. For structured data analysis, a DataFrame is conventionally indicated with a capital letter and a Series with lower case. The test_size=0.33 option separates 33% of the total data as a test set. random_state=42 is used to make the results reproducible for the practice problem; without a fixed random_state, the split will differ every time.
Algorithm selection
[Figure: scikit-learn algorithm selection cheat-sheet flowchart (classification, regression, clustering, and dimensionality-reduction branches, including MeanShift, VBGMM, and kernel approximation)]
Algorithm selection
‣ Gini impurity or entropy
‣ The difference between Gini impurity and entropy is negligible in practice; both create similar trees.
‣ Gini impurity is quicker to calculate, so it is recommended as a default.
‣ However, when they do create different trees, Gini impurity tends to isolate the most frequent class on one side, while entropy results in a more balanced tree.
Model learning
‣ Perform model learning with the training data to check the model performance. The current model is set with default hyperparameters except for random_state.
Score
‣ Evaluate the performance using the performance test data set. In scikit-learn, a classifier's score refers to accuracy. Since the iris data set is well-structured for practice, it generally shows high performance with any model.
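A minimal sketch of training and scoring (the estimator is assumed from the practice):
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42)  # default hyperparameters otherwise
model.fit(X_train, y_train)
print(model.score(X_test, y_test))               # score = accuracy for classifiers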
Cross validation
‣ This strategy makes many validation sets so that every data point is used for validation once. Divide the data set into k folds (k-fold). Use the first fold as a validation set and the other k-1 folds as a training set, and measure the performance.
‣ Use the second fold as a test set and other folds as a training set for learning, and then measure the
performance. Repeat the same process for all the other folds so that all data can be included in the
training. Obtain k performance evaluation results and then average out to predict the model
performance. The following figure is an example of when k=5.
cross_validation
[Figure: 5-fold cross-validation: in each of splits 1-5 (CV iterations), a different one of folds 1-5 serves as the test set and the rest as the training set]
Cross_val_score
‣ Cross validation can be easily performed using the cross_val_score function of scikit-learn.
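A minimal sketch (model, X, y are assumed from the practice):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())   # average of the k fold scores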
stratified
‣ Randomly splitting the train set and validation set in hold-out can result in an uneven target class ratio. If so, the data distribution differs between the training and validation sets, affecting learning. Machine learning rests on the premise that the training data distribution and the real-life data distribution are the same; if the premise does not hold, the learning model's performance falls. To prevent such an issue, the stratified method is used to keep the target class ratio even across the splits. The following figure provides an intuitive picture of how the stratified method splits the data.
Stratified Cross-validation
[Figure: stratified cross-validation: in each of splits 1-3, the training and test data are drawn so that class 0, class 1, and class 2 keep the same proportions across data points 0-140]
Learning Curve
‣ !pip install scikit-plot
• The green line is the cross-validation result. Overfitting occurs when the green line rises to the right but then starts to fall. The red line is the score on the data used for training. The red line may momentarily dip as data is added; this phenomenon is only temporary, and the curve converges in the long term.
• The cv option is not specified, so 3-fold is applied as the default. There are 100 data points in the training set, and 33% is held out for cross-validation, so the maximum value of the x-axis is 66. The curve is cut off while the green line is still increasing, so at this point it is impossible to know whether there is enough data. The learning curve is drawn differently depending on the algorithm, even with the same data. What can be read from this curve is that the current decision tree model's performance would likely be better with more data.
Learning Curve
[Figure: learning curve with the number of training examples on the x-axis and the score on the y-axis]
Learning Curve
‣ If there is enough data, identical data distribution is maintained even when training and validation
sets are randomly split. The cross-validation method is required when there is insufficient data. Draw
a learning curve to determine whether there is enough data. The learning curve shows how
performance changes when slightly increasing the amount of training data by setting the x-axis as the
number of training data and the y-axis as the performance score. The test score is calculated by
internal cross-validation.
‣ The learning curve can be drawn using the scikitplot library, which complements scikit-learn. Install it separately from scikit-learn to use it.
‣ scikitplot is not included in Anaconda by default, so it needs to be installed using a package management tool. Run the following code in a Jupyter Notebook to install the library. Note that the install name (scikit-plot) and the import name (scikitplot) differ.
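A minimal sketch, assuming scikit-plot is installed:
import scikitplot as skplt
import matplotlib.pyplot as plt
skplt.estimators.plot_learning_curve(model, X, y)   # cross-validates internally
plt.show()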
The total number of hyperparameter combinations that can be made from the parameters in the practice problem is 1,600. Since k=10 in the k-fold cross-validation, 10 cross-validations were performed for each combination, for a total of 16,000 training runs. The following table shows the hyperparameter combinations in the practice problem.
‣ The optimal parameters and optimized performance found with GridSearchCV are recorded in the best_params_ and best_score_ attributes.
‣ If the refit option is set to True, the model is retrained with the optimal hyperparameters and recorded in the best_estimator_ attribute.
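A minimal sketch (the parameter grid below is illustrative, not the 1,600-combination grid from the practice):
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth': [2, 3, 4, 5], 'min_samples_split': [2, 4, 8]}
grid = GridSearchCV(model, param_grid, cv=10, refit=True)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
best_model = grid.best_estimator_   # model refit with the optimal hyperparameters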
• However, let’s assume that there are 48 setosas, 1 versicolor, and 1 virginica in the test set. When
making an evaluation using this test set, a problem is that it would have 96% accuracy.
Nevertheless, it’s not because the model’a performance is great. It would be necessary to check
other evaluation criteria as well to accurately evaluate the model performance.
Confusion Matrix
‣ The following confusion matrix can be expressed with binary classification.
‣ Evaluation scores, including precision, recall, f1-score, and others, can be made based on the
abovementioned concepts (TP, FP, TN, FN).
‣ Use the confusion matrix to analyze both right and wrong predicted results. The confusion matrix
can validate the performance differently to see how well the predicted and actual targets got right.
| | Predicted setosa | Predicted versicolor | Predicted virginica |
| Actual setosa | Actual setosa and predicted setosa | Actual setosa but predicted versicolor | Actual setosa but predicted virginica |
| Actual versicolor | Actual versicolor but predicted setosa | Actual versicolor and predicted versicolor | Actual versicolor but predicted virginica |
| Actual virginica | Actual virginica but predicted setosa | Actual virginica but predicted versicolor | Actual virginica and predicted virginica |
‣ Since the iris data is a multi-class classification problem, it cannot be expressed with only the four concepts provided earlier. So, create three indices, one for each of setosa, versicolor, and virginica, by considering each as a binary classification problem. Take setosa, for example.
‣ With scikit-learn, it is possible to easily calculate the confusion matrix using confusion_matrix
function. Send the arguments to the actual class and then to the predicted class.
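A minimal sketch (the label arrays are assumed from the practice):
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)   # actual classes first, then predicted classes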
Confusion Matrix
[Figure: confusion-matrix heatmap with the true labels setosa, versicolor, virginica on the axes]
precision
‣ Precision is the ratio of correct predictions among the instances predicted as the target class.
$$precision = \frac{TP}{TP + FP}$$

print(f"{target} precision: {score}")
setosa precision: 1.0
versicolor precision: 0.9375
virginica precision: 1.0
Line 45
• In multi-class classification, average cannot be "binary."
• "binary" is the default value of the average parameter.
recall
‣ Also called sensitivity, recall is the correct prediction ratio among the actual target class.
$$recall = \frac{TP}{TP + FN}$$

print(f"{target} sensitivity: {score}")
setosa sensitivity: 1.0
versicolor sensitivity: 1.0
virginica sensitivity: 0.9375
fall-out
‣ Fall-out is the ratio of incorrect predictions among the actual non-target class. Also expressed as 1 − specificity.
$$fall\text{-}out = \frac{FP}{FP + TN}$$

‣ scikit-learn does not provide a function to calculate fall-out directly.
f-score
‣ Precision and recall have a trade-off relationship. The f-score is the weighted harmonic mean of precision and recall. If β is less than 1, more weight is given to precision; if β is greater than 1, more weight is given to recall. The f-score is used to understand the model performance accurately when the data classes are imbalanced.

$$F_\beta = (1+\beta^2)\,\frac{precision \times recall}{\beta^2 \cdot precision + recall}$$

‣ For even weighting of precision and recall, β is set to 1 most of the time, which is specifically referred to as the f1-score.

$$F_1 = 2 \cdot \frac{precision \times recall}{precision + recall}$$
f-score
‣ # F1 measure – precision and recall are equally weighted. The F1 score is the harmonic mean of precision and recall (sensitivity).
‣ # With a = precision and b = recall: F1 = 2ab / (a + b)
‣ # F0.5 measure – precision is weighted more than recall: recall receives half the weight of precision.
‣ # F2 measure – recall is weighted more: recall is weighted 2 times more than precision.
f-score
print(f"{target} fbeta score: {score}")
print(f"{target} f1 score: {score}")
accuracy

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$
classification_report
‣ Use the classification_report function of scikit-learn to batch calculate precision, recall, and f1-score.
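A minimal sketch:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))   # precision, recall, f1-score per class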
ROC curve
‣ The ROC curve has TPR (True Positive Rate) on the y-axis and FPR (False Positive Rate) on the x-axis. TPR is recall, and FPR is fall-out.

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$
[Figure: ROC curves plotting the true positive rate against the false positive rate]
Line 55
• Import the model.
Line 56
• Final prediction.
Line 57
• Save the result to a CSV file.
Error Types
Bias error (Underfitting error)
‣ Associated with simple/rigid/biased models.
‣ Prediction cannot account for the detailed data pattern.
‣ To lower this error type, increase the model complexity.
[Figure: predicted curve vs. target over the data, for a too-simple model (bias error) and a too-complex model (variance error)]
[Figure: total error vs. model complexity]
‣ The goal is to minimize the Total error = Bias error + Variance error.
‣ Just enough complexity is required to “optimize” the model.
Minimizing Errors
Optimized machine learning model
‣ Prediction performance should be good in both training and testing.
‣ Given a machine learning algorithm, there is a model with just enough complexity (*).
[Figure: a model of just enough complexity, whose predicted curve tracks the target over the data]
Error Metric
[Table: error metrics for numeric Y vs. categorical Y]
‣ The variables Xᵢ and Y are connected by a linear relation: Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_kX_k + ε
a) Identify the statistically meaningful explanatory variables.
Ex If real estate price is the response variable Y, which are the most statistically meaningful explanatory variables? Area, location, age, distance to business center, etc.
b) Predict the response given the conditions for the explanatory variables.
Ex What is the price of a 10-year-old apartment with an area of 100 located 3 km away from the business center? ← "predict" the value that is not open to the public yet.
Historical background
‣ The term "regression" was coined by Francis Galton, a 19th-century biologist.
‣ The heights of descendants tend to regress towards the mean.
[Figure: child height vs. parent height]
Pros
‣ Solid statistical and mathematical background
‣ Source of insights
‣ Fast training
Cons
‣ Many assumptions: linearity, normality, independence of the explanatory variables, etc.
‣ Sensitive to outliers
‣ Prone to multi-collinearity
Assumptions
‣ The response variable can be explained by a linear combination of the explanatory variables.
‣ There should be no multi-collinearity.
‣ Residuals should be normally distributed, centered around 0.
‣ Residuals should be distributed with a constant variance.
‣ Residuals should be randomly distributed without a pattern. (→ Residual analysis)
Linear model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$

‣ Y is the response variable; X₁, …, X_k are the explanatory variables; β₀, …, β_k are the regression coefficients.
‣ The error term ε should have zero mean and constant variance.
Linear model
Ex

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$$

‣ A model without an intercept: Y = β₁X₁ + ⋯ + βᵢXᵢ + ⋯ + β_kX_k
‣ The intercept β₀ is the value of Y when all the Xᵢ = 0. It is like a "base line."
Ex Wage model: Y = β₀ + β₁X₁ + β₂X₂ + ε, with Y = wage, X₁ = experience, X₂ = qualification.
‣ For each observation j of the actual data:

$$y_j = \beta_0 + \beta_1 x_{j,1} + \beta_2 x_{j,2} + \cdots + \beta_k x_{j,k} + \varepsilon_j$$

With more observations than coefficients, the system is over-determined!
‣ Now, we can write the linear relation in terms of the actual data values, in a compact notation using matrices:

$$\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

$$\boldsymbol{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \boldsymbol{X} = \begin{bmatrix} 1 & x_{1,1} & \cdots & x_{1,k} \\ 1 & x_{2,1} & \cdots & x_{2,k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n,1} & \cdots & x_{n,k} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix}, \quad \boldsymbol{\varepsilon} = \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix}$$

‣ Minimizing the squared error gives the condition

$$\frac{d\,|\boldsymbol{\varepsilon}|^2}{d\boldsymbol{\beta}} = 0 \quad \Rightarrow \quad \boldsymbol{\beta} = \left[ (\boldsymbol{X}^{t}\boldsymbol{X})^{-1}\boldsymbol{X}^{t} \right] \boldsymbol{Y}$$

where (XᵗX)⁻¹Xᵗ is the pseudo-inverse.
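A minimal NumPy sketch of this pseudo-inverse solution (X and y are assumed to be data arrays):
import numpy as np
# Design matrix with a leading column of ones for the intercept beta_0
X_design = np.column_stack([np.ones(len(X)), X])
# beta = (X^t X)^(-1) X^t Y, computed via the pseudo-inverse
beta = np.linalg.pinv(X_design.T @ X_design) @ X_design.T @ y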
[Figure: training dataset points xᵢ and a new input xᵢ′]
‣ The predicted value of y′ is denoted ŷ, which is a conditional expectation ŷ = E[y | data].
‣ Given the values x₁′, x₂′, …, x_k′, calculate ŷ = β₀ + β₁x₁′ + β₂x₂′ + ⋯ + β_k x_k′.
[Figure: the linear model as a single unit: inputs 1, X₁, X₂, …, X_k weighted by the parameters β₀, β₁, β₂, …, β_k, summed over, with error ε, producing the output Y]
Error metrics with a categorical variable
[Figure: three weight-vs-height scatter plots with separate regression lines for male and female groups]
‣ Both the intercept and the slope can depend on the categorical variable.
‣ This further improves the error metrics.
Error metrics
Coefficient of determination, R²
‣ For simple linear regression, R² = Cor(X, Y)².
⇦ If the p-value is below a reference (say 0.05), then H₀ is rejected in favor of H₁. In this case, the linear model has overall significance.

$$t\text{-test statistic} = \frac{\hat{\beta}_i}{\text{Standard error of } \hat{\beta}_i}, \quad \text{where } \hat{\beta}_i = \text{estimated coefficient}$$

⇦ If the p-value is below a reference (say 0.05), then H₀ is rejected in favor of H₁.
[Figure: actual Y vs. predicted Ŷ scatter plots: weak positive correlation (left) < strong positive correlation (right)]
‣ Akaike information criterion (AIC) with p = number of parameters:

$$AIC = -2\,\frac{Log\ likelihood}{n} + 2\,\frac{p}{n}$$

‣ Bayes information criterion (BIC) with p = number of parameters:

$$BIC = -2\,\frac{Log\ likelihood}{n} + p\,\frac{Ln(n)}{n}$$

$$Log\ Likelihood = -\frac{n}{2}\left( 1 + Ln(2\pi) + Ln\left( \frac{SSE}{n} \right) \right)$$

[Figure: AIC and BIC vs. model complexity ⇨ there is a minimum]
Residual analysis
‣ Residual is the difference between the predicted ŷ and the real y.
‣ We can easily detect outliers in Y that deviate substantially from the main trend.
Residual analysis
‣ Reasons for residual analysis:
1) To detect outliers in 𝑌.
2) To verify the assumptions of linear regression.
Leverage analysis
‣ The leverage of the i-th observation is the i-th diagonal element of the hat matrix H = X(XᵗX)⁻¹Xᵗ.
‣ The mean leverage is (k+1)/n.
‣ Leverage tells how distant X is from the center ⇒ detection of outliers in X.
Regularized Regression
Bias-Variance trade off
[Figure: bias error, variance error, and total error vs. model complexity]
‣ Tradeoff relation between the Bias error and the Variance error.
‣ The goal should be to minimize the Total error = Bias error + Variance error.
Ridge regression
‣ Useful when the usual linear regression overfits (bias error << variance error).
‣ Recall that the OLS solution consists in minimizing |ε|².
‣ In Ridge regression, for the model Y = β₀ + β₁X₁ + β₂X₂ + ⋯ + β_kX_k + ε, we minimize the following "loss function":

$$\boldsymbol{L} = |\boldsymbol{\varepsilon}|^2 + \lambda \sum_{i=0}^{k} \beta_i^2$$

‣ A positive and larger λ further constrains the coefficients βᵢ, decreasing the variance (overfitting) error.
Lasso regression
‣ Useful when the usual linear regression overfits (bias error << variance error).
‣ In Lasso regression, we minimize the following "loss function":

$$\boldsymbol{L} = |\boldsymbol{\varepsilon}|^2 + \lambda \sum_{i=0}^{k} |\beta_i|$$
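A minimal scikit-learn sketch, where alpha plays the role of λ:
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=0.1)   # the L1 penalty can drive some coefficients exactly to 0
lasso.fit(X_train, y_train)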
Polynomial Regression
‣ Useful when the usual linear regression underfits (bias error >> variance error).
‣ We can model the relationship between X and Y using polynomials: Y = β₀ + β₁X + β₂X² + ⋯ + β_dX^d + ε.
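A minimal sketch using a pipeline (the degree is chosen for illustration):
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X_train, y_train)   # fits a cubic polynomial in the features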
Poisson Regression
Poisson regression
‣ Useful when we would like to model the response 𝑌 that represents counts or frequencies.
$$Log(\lambda) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_K X_K + \varepsilon$$

‣ We are assuming that Y follows the Poisson distribution:

$$P(y) = \frac{\lambda^y e^{-\lambda}}{y!}$$

a) Mean = λ
b) Variance = λ
c) Standard deviation = √λ
[Figure: scikit-learn algorithm cheat-sheet, clustering and dimensionality-reduction branches: with <10k samples and a known number of categories, try KMeans, then SpectralClustering or GMM if not working; with more samples, MiniBatchKMeans, MeanShift, or VBGMM; for dimensionality reduction, randomized PCA, then Isomap, LLE, spectral embedding, or kernel approximation]
‣ As shown in the figure above, neural networks can be applied if there are correct answers (labels), the purpose is classification, and there is a lot of data. Otherwise, it is possible to use a decision tree or the SVC algorithm.
‣ Also, the Naïve Bayes algorithm can be used if the data is text.
‣ The KNN method is used for non-text data. It is also possible to use an ensemble method for better performance.
Pros
‣ Simple and relatively easy to implement
‣ Source of intuitive insights
‣ Fast training
Cons
‣ Not among the most accurate classification algorithms
‣ Assumes that the explanatory variables are independent without multi-collinearity
$$S = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k$$

‣ The conditional probability of Y being equal to 1 is denoted as p = P(Y = 1).
‣ The "sigmoid" or "logistic" function connects the probability with the logit:

$$f(S) = \frac{e^S}{1 + e^S}$$

‣ The logistic function is the inverse of the logit (and vice versa):

$$S = Log\left( \frac{p}{1 - p} \right), \qquad p = \frac{e^S}{1 + e^S}$$
$$\log\left( \frac{P(Y)}{1 - P(Y)} \right) = \beta_0 + \beta_1 X$$

‣ Looking at the equation in detail, the β₀ + β₁X on the right side is a linear model with range (−∞, ∞); the left side also has range (−∞, ∞). The log(P(Y)/(1−P(Y))) on the left side of the equation is called the logit function.
‣ Alternatively, apply 'exp' to both sides of the equation and arrange it with respect to P(Y) to get the following equation.
3.2. Logistic Regression Basics
$$P(Y=1) = \frac{e^{(\beta_0 + \beta_1 X)}}{1 + e^{(\beta_0 + \beta_1 X)}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 X)}}$$
‣ To sum up, logistic regression is an algorithm that builds a model using the logistic function above and estimates the parameters from the probability P(Y=1) = P(Y) that the objective variable Y takes the categorical value 1 in the training data. Maximum likelihood estimation is generally used for estimating the parameters β₀ and β₁. Since a direct analytical calculation is difficult, estimation starts from an initial value and adjusts it numerically through repeated calculations.
[Figure: logistic regression as a single unit: inputs 1, X₁, X₂, …, X_k weighted by the parameters β₀, β₁, β₂, …, β_k are summed to S, which is passed through the sigmoid to output P(Y = 1 | data), e.g., 0.3, 0.72, 0.12, …, to be compared with the labels 0, 0, 1, 1, 0, …]
‣ In the likelihood, xᵢ and yᵢ represent values given by the training dataset.
‣ Here, we assume the conversion from the likelihood (a product) to the log-likelihood (a sum).
‣ The log-likelihood L(β) is optimized by following its gradient:

$$\nabla L(\boldsymbol{\beta}) = \begin{bmatrix} \dfrac{\partial L}{\partial \beta_0} \\ \dfrac{\partial L}{\partial \beta_1} \\ \vdots \\ \dfrac{\partial L}{\partial \beta_K} \end{bmatrix}$$
Confusion matrix

| | Predicted Y | Predicted N |
| Actual Y | O (TP: True Positive) | X (FN: False Negative) |
| Actual N | X (FP: False Positive) | O (TN: True Negative) |
Confusion matrix
‣ A confusion matrix can be a 2×2 cross table, or a 3×3 or higher cross table. For convenience, we will only use the 2×2 confusion matrix in this chapter.
‣ In the confusion matrix on the previous slide, the diagonally placed 'O' cases mean the predicted and actual categorical values are the same. In other words, the classification model predicted the results properly.
‣ On the other hand, if the predicted and actual categorical values differ, the machine learning model has made incorrect predictions.
‣ The categories that the analysis is mostly interested in are positive categories; the others are called
negative categories. Depending on the accuracy of prediction (true or false) regarding positive and
negative categories, the accurate classification of interested categories is called TP (True
Positive). The accurate classification of uninterested categories is called TN (True Negative).
‣ The inaccurate classification of uninterested categories into interested categories is called FP
(False Positive). The inaccurate classification of interested categories into uninterested categories
is called FN (False Negative).
‣ There are various metrics based on different combinations of TP, TN, FP, and FN of the confusion
matrix for evaluating the analysis result of classification machine learning methods.
Metric
‣ Major metrics calculated from the confusion matrix include accuracy, error rate = 1-accuracy,
sensitivity (also referred to as recall, hit ratio, TP rate, etc.), specificity, FP rate, precision, and
others. Among those, accuracy, sensitivity, and precision are the most frequently used metrics.
‣ There are also the F-Measure (or F1-Score), which combines sensitivity and precision, and the Kappa statistic, which measures how closely the model's predictions match the actual values. The calculation formulas and definitions of the various metrics are in the following table.
Metric
‣ Among the metrics from the previous slide, sensitivity signifies how well the actual ‘positive’
category is predicted ‘positive.’ The precision is the index showing the ratio of actual ‘positive’ from
the predicted ‘positive’ categories. Thus, they are metrics that directly explain how well the
classification machine learning analysis model classifies interested categorical values of objective
variables.
‣ Sensitivity and precision are the most significant and frequently used metrics for classification
machine learning results in real life.
Confusion matrix
Ex
| | Actual 0 | Actual 1 |
| Predicted 0 | 120 | 5 |
| Predicted 1 | 15 | 20 |

‣ The confusion matrix is a contingency table that counts the frequencies of actual vs. predicted values. Taking class 1 as positive gives TP = 20, TN = 120, FP = 15, FN = 5.
Accuracy = (120 + 20) / 160 = 0.875
‣ Accuracy is the ratio between the diagonal sum and the total sum.
Sensitivity = 20 / (20 + 5) = 0.8
Specificity = 120 / (120 + 15) ≈ 0.889
Precision = 20 / (20 + 15) ≈ 0.571
Note
‣ Accuracy alone is not sufficient for testing.
Ex If frauds constitute only 1% of all transactions, the accuracy of a fraud detection system
(FDS) that predicts non-fraud in all transactions would be quite high at 99%.
However, such FDS would be useless because it misses the 1% that really matters.
Terminology
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Precision = TP / (TP + FP)
ROC curve
[Figure: ROC curves and precision plots for comparing classifiers]
Decision Tree
4.1. Tree Algorithm
Purpose
‣ Classification of unseen data and prediction of categorical values
‣ Extraction of generalized knowledge in a tree structure from the data
Composition
‣ Node, Branch, Depth
[Figure: decision tree with a root node (X₁ < .47, yes/no), internal parent and child nodes (X₂ < .39, X₂ < .84, X₁ < .87), and terminal (leaf) nodes predicting 0 or 1]
| Step | Description |
| Pruning | Removing branches that have a high risk of error or inappropriate rules |
| Validity evaluation | Evaluation of the decision tree through cross-validation using the gain chart, risk chart, or test data |
| Interpretation and prediction | Interpretation of the decision tree and setting up a prediction model |
[Figure: example of recursive splits separating owner vs. non-owner data points on a two-dimensional plot]
Split criterion
Discrete objective variables
‣ Chi squared statistic – p-value: Creates child nodes with a predictor variable with the lowest p-value
and the optimal partitioning
‣ Gini index: Selects child nodes with a predictor variable that reduces the Gini index and the optimal
partitioning
‣ Entropy measure: Creates child nodes with a predictor variable with the lowest entropy measure and
the optimal partitioning
Split criterion
Continuous objective variables
‣ F statistic in ANOVA: Creates child nodes with a predictor variable with the lowest p-value and the
optimal partitioning
‣ Variance reduction: Creates child nodes with the optimal partitioning that maximizes variance
reduction
| Algorithm | Discrete objective variable | Continuous objective variable |
| CHAID (multiway space partitioning) | Chi-squared statistic | ANOVA F statistic |
| CART (binary space partitioning) | Gini index | Variance reduction |
Impurity measure
Gini index
‣ Selects child nodes with a predictor variable that reduces the Gini index and the optimal partitioning
‣ If the T data set is split into k categories and the category performance ratios are p1, …, pk, it is
expressed as the following equation.
$$Gini(T) = 1 - \sum_{l=1}^{k} p_l^2$$

Ex GI = 1 − (3/8)² − (3/8)² − (1/8)² − (1/8)² ≈ 0.69 → high impurity (diversity), low purity
Ex GI = 1 − (6/7)² − (1/7)² ≈ 0.24 → low impurity (diversity), high purity
Entropy measure
‣ In thermodynamics, entropy measures the degree of disorder.
‣ Creates child nodes with a predictor variable with the lowest entropy measure and the optimal
partitioning.
‣ If the T data set is split into k categories and the category performance ratios are p1, …, pk, it is
expressed as the following equation.
$$Entropy(T) = -\sum_{l=1}^{k} p_l \log_2 p_l$$

Ex If 4 categories consist of ratios 0.5, 0.25, 0.25, 0 (T₁):

$$Entropy(T_1) = -(0.5 \log_2 0.5 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25) = 1.5$$
Stopping criteria
A rule to designate the current node as a terminal node without further splitting
‣ Designates the depth of the decision tree
‣ Designates the minimum number of records in the terminal node
Branching criteria
Application of test data
‣ Application of the test data to the constructed model
‣ Reviewing the predictive value of the constructed model through test data
‣ Removing the branches that have a high risk of error rate or inappropriate rule of
inference
By an expert
‣ An expert reviewing the validity of rules suggested in the constructed model
‣ Removing rules without validity
Overfitting problem
Overfitting problem graph
[Figure: error rate as the tree grows: the training-data error keeps decreasing while the evaluation-data error starts rising, indicating overfitting]
Pros
‣ Creation of understandable rules (can be expressed with SQL)
‣ Useful in classification prediction
‣ Able to work with both continuous and discrete variables
‣ Shows the relatively more significant variables
Cons
‣ Not suitable to predict continuous variable values
‣ Unable to perform time series analysis
‣ Not stable
Tree Algorithm
Pros
‣ Intuitive and easy to understand
‣ No assumptions about the variables
‣ No need to scale or normalize data
‣ Not that sensitive to the outliers
Cons
‣ Not that powerful in the most basic form
‣ Prone to overfitting. Thus, “pruning” is often required.
Classification Tree
Ex
[Figure: classification tree: the root question "Do you eat meat?" (and further splits) leads to the predicted responses Vegan, Vegetarian, Flexitarian, and Meat lover]
Classification Tree
‣ The tree structure is trained by minimizing the Gini impurity (or entropy).
$$G_m = \sum_{k=1}^{K} \hat{p}_{mk} \left( 1 - \hat{p}_{mk} \right) = 1 - \sum_{k=1}^{K} \hat{p}_{mk}^2$$

or

$$Entropy_m = -\sum_{k=1}^{K} \hat{p}_{mk} \, Log(\hat{p}_{mk})$$

• G_m is the Gini impurity in the leaf node m. The smaller, the better.
• Entropy_m is the entropy in the leaf node m.
• Here, p̂_mk is the proportion of class k in the leaf node m.
• K is the total number of possible classes.
• The class with the largest proportion is the prediction at that leaf node.
Regression Tree
Ex
[Figure: regression tree: root "Experience < 5?"; if no, predicted wage = 40,000; if yes, split on "Performance < 80%?" giving predicted wages 35,000 (no) and 30,000 (yes)]
Regression Tree
Ex
[Figure: the (Experience, Performance) plane split into regions R₁, R₂, R₃ by the cut points 5 years (of 0-20 years) and 80% (of 0-100%)]
‣ Each leaf node corresponds to a region in the configurational space.
Regression Tree
‣ The configurational space is split into regions {R₁, R₂, ⋯, R_J}.
‣ The tree structure is trained by minimizing the RSS (residual sum of squares):

$$RSS = \sum_{j=1}^{J} \sum_{i \in R_j} \left( y_i - \hat{y}_{R_j} \right)^2$$
| Hyperparameter | Explanation |
| max_depth | The maximum depth of a tree |
| min_samples_split | The minimum number of sample points required to split an internal node |
| max_features | The number of features to consider when looking for the best split |
Decision Tree
‣ A decision tree is a modeling technique shaped like a tree branching from the root to the leaf nodes, based on the reference values of the independent variables (explanatory or input variables) that affect the classification or prediction of the objective variables.
‣ In the decision tree, each node is split in the form of if-then depending on explanatory variables’
characteristics or reference values. When following the tree structure, it is possible to easily
understand how the attribute value of data is classified into the category.
‣ The figure provided below is a typical form of the decision tree. From the example, ‘age’ is the root.
It can be inferred that ‘age’ is the most significant variable when deciding the loan approval.
‣ The squared shape node at the end of each branch is the leaf node.
[Figure: loan-approval decision tree: root node 'Age' (≤ 35 / > 35); branches split on 'Monthly income' (threshold 2,000,000 KRW), 'Occupation' (Unemployed/Others vs. Worker), and 'Family income' (threshold 3,000,000 KRW), ending in 'Loan approved' / 'Loan denied' leaf nodes]
‣ The impurity index is 0 when p = 0 or p = 1, and it is largest when p = 0.5, forming a parabola. In other words, the impurity index is lowest when the node contains only one classification or is completely free of it. In contrast, the impurity is largest when many classifications are mixed in the same node.
Pros
‣ Intuitive and simple
‣ Not that sensitive to the noise and outliers
‣ Fast
Cons
‣ Assumes that the features are independent, which may not be strictly true
‣ Not among the best-performing algorithms
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

‣ Now we take A = Class and B = Data; then:

$$P_{post}(Class) \propto \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left( -\frac{1}{2\sigma_j^2} \left( x - \mu_j \right)^2 \right) P_{prior}(Class)$$

where the parameters μⱼ and σⱼ are "learned" from the training data.
Random variable
‣ A variable whose value is unknown until the outcome
‣ Independent events
• If the probability of the simultaneous occurrence of two events is identical to the product of the probabilities of each event, the two events are independent of each other.
‣ Dice example (conditional probability)
• Event A: an odd number, A = {1, 3, 5}, P(A) = 0.5
• Event B: a number 3 or less, B = {1, 2, 3}, P(B) = 0.5
• Probability of B when A occurs: P(B|A) = P(A∩B) / P(A) = (2/6) / (3/6) = 0.6666667
5.1. Naïve Bayes Algorithm
Random variable
‣ A variable whose value is unknown until the outcome
‣ Independent events
• Not affected by each other
[Figure: dice Venn diagram of two independent events A and B]
Random variable
‣ A variable whose value is unknown until the outcome
‣ Exclusive events
• The intersection is the null set.
[Figure: dice Venn diagram of two mutually exclusive events A and B]
Final summary (result ← cause: initiating a conversation ← a certain group)
• The prior probability of being in the shopping group: 0.2
• Observation: the customer initiates a conversation
• The posterior probability of being in the shopping group: 3/7 ≈ 0.428571
‣ The customer wouldn’t be in the shopping group for sure, but the probability is doubled.
‣ Bayesian estimation is ‘performing Bayesian updating to the posterior probability based
on the behavioral observation of the prior probability.’
‣ Such an estimation method is called Bayesian statistics.
‣ Bayes’ theorem
• Obtaining the posterior probability based on the prior probability
KNN Algorithm
6.1. KNN Algorithm
KNN Algorithm
About KNN (K-Nearest Neighbors)
‣ One of the simplest algorithms
‣ Prediction is based on the 𝑘 nearest neighboring points.
‣ There are classification and regression variants.
• Classification: prediction decided by the majority class of the nearest neighbors.
• Regression: prediction given by the average of the nearest neighbors.
• Likewise, k-nearest neighbors determines a new point's category by majority rule: it finds the k data points most similar to the specific data point and decides its category from their classification categories.
[Figure: KNN neighborhoods of sizes K = 1, 3, 5 around a query point ☆ in the (X₁, X₂) plane]
‣ The figure above conceptualizes the change in objective variables depending on the K value setting
in the K-nearest neighbor method. The ‘☆’ point is the data value that needs to be classified; the
points shaped in a square and circle are other data points present around.
‣ When setting K=1, since the closest data point to the star is a circle, the objective variable of ‘☆’ will
be classified as ‘○.’ On the other hand, if K=3, three points closest to ‘☆’ will be considered,
including two squares and one circle. In this case, ‘☆’ will be classified as ‘□.’ Also, if K=5, since
there are more circles around ‘☆,’ it will be again classified as ‘○.’
‣ Thus, different K values significantly change the K-nearest neighbor’s predicted result of the objective
variable category. This is why setting an appropriate K value is important in the K-nearest neighbor
method.
‣ However, there are no clear theoretical or statistical standards for an appropriate K value. Different K
values are usually set for repeated tests, and the final K value is set once it shows the optimal
classification performance. Generally, a random initial K value is designated between K=3 and K=9,
which is then used to test the classification performance with training and evaluation data to select the
optimal K value.
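A minimal sketch of K selection via repeated tests (the fold count and K range are illustrative):
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
for k in range(3, 10, 2):   # try K = 3, 5, 7, 9
    knn = KNeighborsClassifier(n_neighbors=k)
    print(k, cross_val_score(knn, X_train, y_train, cv=5).mean())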
Pros
‣ Simple and intuitive
‣ No model parameters to calculate. So, there is no training step.
Cons
‣ Since there is no “model,” little insight can be extracted.
‣ No model parameters that store the learned pattern. The training dataset is required for prediction.
‣ Prediction is not efficient ⇨ “Lazy algorithm.”
SVM Algorithm
7.1. SVM Algorithm
SVM Algorithm
About SVM (Support Vector Machine)
‣ Enhanced classification accuracy by maximizing the margin
‣ Effective non-linear classification boundary by the “kernel” transformation
[Figure: SVM classification boundary in the (X₁, X₂) plane with support vectors on both sides and the maximum margin between them]
Pros
‣ Not very sensitive to the outliers.
‣ Performance is good.
Cons
‣ Training is relatively slow. Performs poorly for large data.
‣ The kernel and the hyperparameter set should be carefully optimized.
‣ Not much insight can be gained.
[Figure: candidate boundaries between two classes; the chosen boundary maximizes the margin defined by the support vectors]
Hyperplane
‣ For a 𝑘 dimensional configurational space, the hyperplane has the dimension 𝑘−1.
Ex For a two-dimensional space, a hyperplane is a bisecting line that can be parametrized as β₀ + β₁X₁ + β₂X₂ = 0. The two-dimensional space is subdivided into the two half-spaces β₀ + β₁X₁ + β₂X₂ > 0 and β₀ + β₁X₁ + β₂X₂ < 0.
Kernel
‣ Mapping to a higher dimension using the “kernel” functions.
‣ Kernel functions introduce an effective non-linear classification boundary.
Ex Polynomial kernel
[Figure: 1D data X mapped to 2D as (X, X²), making the classes separable]
Kernel
‣ Mapping to a higher dimension using the “kernel” functions.
‣ Kernel functions introduce an effective non-linear classification boundary.
Ex Polynomial kernel
𝑋2 𝑋2
𝑋 21 + 𝑋 22
𝑋1 𝑋1
2D 3D
Kernel
‣ Effective mapping to a higher dimension by defining the inner product of two vectors x and y:
• Linear: K(x, y) = xᵗy
• Polynomial: K(x, y) = (γ xᵗy + r)^d, where γ > 0
• Sigmoid: K(x, y) = tanh(γ xᵗy + r), where γ > 0
• Radial basis function (rbf): K(x, y) = exp(−γ |x − y|²), where γ > 0
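A minimal scikit-learn sketch (the kernel and hyperparameter values are illustrative):
from sklearn.svm import SVC
svc = SVC(kernel='rbf', gamma=0.1, C=1.0)   # the kernel and hyperparameters should be tuned
svc.fit(X_train, y_train)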
Ensemble Algorithm
8.1. The concept of Ensemble Algorithm and
Voting
8.2. Bagging & Random Forest
8.3. Boosting
Ensemble algorithms
About ensemble algorithms
Bootstrap
‣ A method for estimating unknown statistics.
‣ An easy and effective estimation method when the distribution of a model parameter or sample statistic is unknown.
‣ The process of recalculating the statistics and the model for each sample drawn by additional sampling with replacement from the current sample.
• No assumption is required that the parameters or sample statistics follow a normal distribution.
‣ Bootstrap sample – a set of observations drawn with replacement from the observed data (feature values and dependent variable).
Related terms covered later:
• Resampling – includes permutation and sampling without replacement.
• Bootstrap aggregation – producing a result by aggregating the predicted values obtained from different bootstrap samples.
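‣ A minimal sketch of bootstrap estimation of a statistic (here the mean), assuming numpy and a synthetic observed sample chosen for illustration:

# Resample the observed data with replacement and recompute the statistic each time.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=2, size=50)       # the observed sample
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(1000)]                 # 1000 bootstrap samples
print(np.mean(boot_means), np.std(boot_means))      # estimate and its spread

No normality assumption is needed; the spread of the bootstrap means directly estimates the uncertainty of the sample mean.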
What is Ensemble Learning?
‣ Searching for the term ‘ensemble’ in a wiki gives the following definition:
• In statistical mechanics, an ensemble of a system refers to the collection of systems equivalent to that system.
‣ Put simply, an ensemble is an assembly of similar groups.
‣ That is, instead of expecting the performance of a single model, ensemble learning aims to draw a better result from the collective intelligence of several models, for example by averaging the outputs of many single models or by making a decision based on a majority vote.
‣ There are various ensemble techniques that use this collective intelligence:
• Voting – Drawing the result through a vote among the models
• Bagging – Bootstrap Aggregating (creating multiple samples that may overlap)
• Boosting – Weighting observations to compensate for the previous errors
• Stacking – Training a meta-model on top of several base models
‣ Since ‘ensemble’ is, literally, a general technique/methodology, additional methods may exist beyond these. However, the four listed above are the most representative ensemble techniques, and they are already implemented in the sklearn library.
Voting Ensemble
Voting
‣ As the word itself suggests, voting makes a decision through votes. Voting is similar to bagging in that both combine results by vote, but they differ clearly:
• Voting: combines models built from different algorithms.
• Bagging: combines models of the same algorithm trained on different sample combinations.
‣ Voting selects the final result by holding a final vote over the results produced by the different algorithms.
‣ Voting is classified into hard voting and soft voting:
• Hard voting: decides the final value by a majority vote over the predicted classes.
• Soft voting: averages the predicted class probabilities of all models and selects the class with the highest average probability.
Voting
[Figure: predictive values P1, P2, …, Pn from the individual models are combined by voting]
‣ Hard Voting
• Take classification as an example: suppose the predicted values of the five models are 1, 0, 0, 1, 1. Since 1 has three votes and 0 has two votes, 1 becomes the final predictive value under hard voting.
‣ Soft Voting
• The soft voting method calculates the average of each class probability and then selects the class with the highest average probability.
• Suppose the five models give the probability of class 0 as (0.4, 0.9, 0.9, 0.4, 0.4) and the probability of class 1 as (0.6, 0.1, 0.1, 0.6, 0.6). The average probability of class 0 is (0.4+0.9+0.9+0.4+0.4) / 5 = 0.6; the average probability of class 1 is (0.6+0.1+0.1+0.6+0.6) / 5 = 0.4. Soft voting therefore selects class 0, which differs from the hard voting result above.
‣ In competitions, using soft voting is generally considered more reasonable than hard voting because it tends to give better actual performance.
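‣ A minimal sketch of hard vs. soft voting with scikit-learn's VotingClassifier (the three base models and the synthetic data are illustrative choices; soft voting requires that every base model supports predict_proba, which all three here do):

# Compare hard and soft voting over the same set of base models.
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('knn', KNeighborsClassifier()),
              ('tree', DecisionTreeClassifier(random_state=0))]
for v in ['hard', 'soft']:
    clf = VotingClassifier(estimators=estimators, voting=v)
    print(v, cross_val_score(clf, X, y, cv=5).mean())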
Hyperparameter Explanation
estimators The list of (name, estimator) tuples for the base learners
voting Either ‘soft’ or ‘hard’ (for the classifier only)
Ensemble Algorithm
8.1. The concept of Ensemble Algorithm and
Voting
8.2. Bagging & Random Forest
8.3. Boosting
Pros
‣ Powerful
‣ Few assumptions
‣ Little or no concern about the overfitting problem
Cons
‣ Less interpretable than a single decision tree, and slower to train and predict as the number of trees grows.
Hyperparameter Explanation
n_estimators The number of trees in the forest
min_samples_split The minimum number of sample points required to split an internal node
max_features The number of features to consider when looking for the best split
Bagging
‣ Bagging-based ensemble method
Ex Random Forest algorithm
• Easy to use since it is well implemented in the sklearn library
• Relatively fast training
• High performance
‣ The ensemble method has been widely used because it raises the performance level and is easy to use. Bagging-based ensemble methods are commonly found among the high-ranked solutions on Kaggle.
Bagging
[Figure: bagged ensemble – predictive values P1, P2, …, Pn from models trained on different bootstrap samples are combined into the final prediction by voting for classification and by averaging for regression]
Bagging
‣ Bagging is an abbreviation of Bootstrap Aggregating.
‣ Bootstrap = sampling
‣ Aggregating = adding up the results
‣ Bootstrap refers to a method that samples with replacement, so the same data points may appear in more than one of the split data sets.
Bagging
Ex Random Forest is a typical bagging algorithm.
‣ It creates multiple decision trees, drawing a different sample of the data for each tree while allowing the samples to overlap.
‣ If the data set consists of [1, 2, 3, 4, 5]:
• Group 1 = [1, 2, 3]
• Group 2 = [1, 3, 4]
• Group 3 = [2, 3, 5]
‣ This is the bootstrap method. In classification problems, the trees, each trained on a different sample, vote to produce the final prediction.
‣ In regression problems, the average of the values predicted by each tree is calculated.
Bagging
‣ “Bagging”: Bootstrap AGGregatING
sklearn.ensemble.BaggingClassifier/BaggingRegressor
‣ The sklearn library provides the wrapper classes BaggingClassifier and BaggingRegressor.
‣ When the base algorithm is designated through the base_estimator parameter, BaggingClassifier/BaggingRegressor performs bagging ensemble with it.
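‣ A minimal sketch bagging a decision tree and comparing it with Random Forest (synthetic data for illustration; note that the keyword for the base model is base_estimator in older scikit-learn versions and was renamed to estimator in recent ones, so it is passed positionally here):

# Bag 100 decision trees, then compare with the equivalent Random Forest.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0)
print('bagging      ', cross_val_score(bag, X, y, cv=5).mean())
print('random forest', cross_val_score(rf, X, y, cv=5).mean())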
Ensemble Algorithm
8.1. The concept of Ensemble Algorithm and
Voting
8.2. Bagging & Random Forest
8.3. Boosting
AdaBoost classification algorithm
1) For the first step ($m=1$), equal weight is assigned to the observations: $w_i^{(1)} = 1/N$.
2) For the boost sequence $m = 1, \ldots, M$:
a) Train the learner $G_m(\mathbf{x})$ using observations weighted by $w_i^{(m)}$.
b) Compute the weighted error rate: $err_m = \sum_{i=1}^{N} w_i^{(m)} \mathbb{1}(y_i \ne G_m(\mathbf{x}_i)) \big/ \sum_{i=1}^{N} w_i^{(m)}$.
c) Compute the learner weight: $\alpha_m = \log\left((1 - err_m)/err_m\right)$.
d) For the next step, the weights of the wrongly predicted observations are rescaled by a factor $e^{\alpha_m}$. This can be compactly expressed as: $w_i^{(m+1)} = w_i^{(m)}\, e^{\alpha_m \mathbb{1}(y_i \ne G_m(\mathbf{x}_i))}$.
⇨ In the next sequence step, the wrongly predicted observations receive heavier weight.
3) The final ensemble prediction is $G_{ensemble}(\mathbf{x}) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(\mathbf{x})\right)$.
[Figure: three boosting rounds on a two-class data set (classes 1 and 2); each round reweights the misclassified points, and the weak learners are combined into the ensemble $G_{ensemble}(\mathbf{x}) = \mathrm{sign}\left(\sum_{m=1}^{M} \alpha_m G_m(\mathbf{x})\right)$]
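‣ A minimal sketch of AdaBoost in scikit-learn, using a depth-1 decision tree (a “stump”) as the base learner on synthetic data chosen for illustration:

# Boost 100 decision stumps; each round reweights the misclassified points.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=0.5, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())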
Hyperparameter Explanation
base_estimator The base estimator with which the boosted ensemble is built
GradientBoostingClassifier/Regressor Hyperparameters
Hyperparameter Explanation
loss The loss function
subsample The fraction of data that will be used by each individual weak learner
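‣ A minimal sketch of scikit-learn's GradientBoostingClassifier, using the subsample hyperparameter from the table above (synthetic data for illustration):

# Gradient boosting with 100 trees, each fit on 80% of the data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                subsample=0.8, random_state=0)
print(cross_val_score(gb, X, y, cv=5).mean())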
XGBClassifier/Regressor Hyperparameters
Hyperparameter Explanation
booster gbtree or gblinear
subsample The fraction of data that will be used by each individual weak learner
XGBoost
‣ XGBoost is a library that implements the gradient boosting algorithm for use in distributed systems.
‣ It supports both regression and classification problems. This popular algorithm features good performance and efficient use of resources.
‣ Gradient boosting is a representative algorithm using the boosting method; XGBoost is a library that implements this algorithm with support for parallel training.
‣ It has recently been used a lot due to its high performance and efficient use of computing resources, and it became more popular as it was frequently used by top rankers on Kaggle.
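‣ A minimal sketch, assuming the xgboost package is installed; its XGBClassifier follows the scikit-learn fit/predict interface, so it plugs into cross_val_score directly (data and hyperparameter values are illustrative):

# XGBoost classifier using the subsample and learning-rate hyperparameters above.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = make_classification(n_samples=300, random_state=0)
xgb = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3,
                    subsample=0.8, random_state=0)
print(cross_val_score(xgb, X, y, cv=5).mean())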
Light GBM
‣ Light GBM expands the tree vertically, while other algorithms expand the tree horizontally. In other words, Light GBM grows trees leaf-wise, whereas other algorithms grow them level-wise.
‣ For expansion, the leaf with the maximum delta loss is selected. When growing the same leaf, the leaf-wise algorithm can reduce more loss than the level-wise algorithm.
[Figure: leaf-wise tree growth – how Light GBM operates, in contrast to the level-wise growth of other boosting algorithms]
Boosting
‣ The boosting algorithm is also ensemble learning. It trains weak learners sequentially, supplementing errors by adding weight to the data inaccurately predicted in the previous round.
‣ The difference from other ensemble methods is this sequential learning with error-correcting weights. A disadvantage is that the sequential nature makes parallel processing difficult, leading to a longer training time than other ensembles.
[Figure: comparison of a single estimator, bagging, and boosting]
Ex Suppose that the following weights are applied to the performance of weak learners (boxes) 1~3:
• Performance of Box 1: weight = 0.2
• Performance of Box 2: weight = 0.5
• Performance of Box 3: weight = 0.6
It can be expressed in the sign-sum form of the ensemble above: $G(\mathbf{x}) = \mathrm{sign}\left(0.2\, G_1(\mathbf{x}) + 0.5\, G_2(\mathbf{x}) + 0.6\, G_3(\mathbf{x})\right)$
Gradient Descent
‣ The key to the boosting method is supplementing the errors from the previous learning round.
‣ AdaBoost and the gradient descent method differ slightly in how they supplement errors.
‣ Gradient descent uses differentiation to minimize the difference between the predicted value and the actual data.
• weight
• input_data = feature data (input data)
• bias
• Y_actual = actual data value
• Y_predict = predicted value
• loss = error
‣ Y_predict = weight * input_data + bias
• The predicted value is obtained from the above formula. Calculating the difference from the actual data gives the total error.
‣ Loss = Y_predict − Y_actual
• (There are many different functions to define the error, including root mean square error and mean absolute error, but the above definition is used for convenience.)
• The purpose of gradient descent is to find the weight that makes the loss closest to 0, as sketched below.
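‣ A minimal sketch of gradient descent for the model Y_predict = weight * input_data + bias, minimizing the mean squared loss with numpy (the synthetic data and its true weight/bias values are illustrative assumptions):

# Fit weight and bias by repeatedly stepping against the gradient of the loss.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y_actual = 1.5 * x - 3.0 + rng.normal(0, 0.1, 100)   # synthetic data: w=1.5, b=-3.0

weight, bias, lr = 0.0, 0.0, 0.01                    # arbitrary starting values
for _ in range(2000):
    y_predict = weight * x + bias
    error = y_predict - y_actual
    weight -= lr * (2 * error * x).mean()            # d(mean squared loss)/d(weight)
    bias   -= lr * (2 * error).mean()                # d(mean squared loss)/d(bias)
print(weight, bias)                                  # should approach 1.5 and -3.0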
[Figure: level-wise tree growth – how XGBoost and other boosting algorithms operate]
Light GBM Hyperparameters
Parameter (default value) – Description
num_iterations (100) – Designates the number of trees built over the repeated boosting steps. Overfitting can occur if the value is too high.
learning_rate (0.1) – The update step size applied at each repeated boosting step. The value is designated between 0 and 1.
max_depth (-1) – Identical to the max_depth of tree-based algorithms. No restriction is applied to tree depth when a value smaller than 0 is entered.
min_data_in_leaf (20) – Identical to the min_samples_leaf of a decision tree. Used as a parameter to control overfitting.
num_leaves (31) – The maximum number of leaves for one tree.
boosting (gbdt) – The boosting type to use.
bagging_fraction (1.0) – Designates the ratio for data sampling. Used to control overfitting.
feature_fraction (1.0) – The ratio of features randomly selected for training each individual tree.
lambda_l1 (0.0) – The value for L1 regularization.
lambda_l2 (0.0) – The value for L2 regularization.
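‣ A minimal sketch, assuming the lightgbm package is installed; its scikit-learn-style wrapper exposes the parameters above under sklearn-style aliases (e.g. n_estimators for num_iterations), and the data set here is illustrative:

# LightGBM classifier with the default leaf count from the table above.
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)
lgbm = LGBMClassifier(n_estimators=100, learning_rate=0.1,
                      num_leaves=31, random_state=0)
print(cross_val_score(lgbm, X, y, cv=5).mean())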