ML-unit-3
UNIT-II
Learning with Trees: Decision Trees, Constructing Decision Trees, Classification and
Regression Trees.
Ensemble Learning: Boosting, Bagging, Different ways to combine classifiers, Basic Statistics,
Gaussian Mixture Models, Nearest Neighbour Methods.
Unsupervised Learning: K Means Algorithm.
Decision Trees:
Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make decisions and have multiple branches, whereas
leaf nodes are the outputs of those decisions and do not contain any further branches.
The decisions or the tests are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
[Figure: the general structure of a decision tree — a root node splits into decision nodes and sub-trees, which end in leaf nodes]
Example:
[FIGURE 12.1 A simple decision tree to decide how you will spend the evening — the root asks 'Party?' and the leaves are 'Go to party', 'Go to pub', 'Watch TV' and 'Study']
• One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if ... then rules) that then go into program code very simply.
Ex: if there is a party then go to it
    if there is not a party and you have an urgent deadline then study
Constructing Decision Trees:
Types of Decision Tree Algorithms:
• ID3: This algorithm measures how mixed up the data is at a node using something called entropy. It then chooses the feature that helps to clarify the data the most.
• C4.5: This is an improved version of ID3 that can handle missing data and continuous attributes.
• CART: This algorithm uses a different measure called Gini impurity to decide how to split the data. It can be used for both classification (sorting data into categories) and regression (predicting continuous values) tasks.
ID3 Algorithm:
Entropy(p) = -\sum_{i} p_i \log_2 p_i
• where the logarithm is base 2 because we are imagining that we encode everything using binary digits (bits), and we define 0 log 0 = 0.
[FIGURE 12.2 A graph of entropy against the proportion of positive examples, detailing how much information is available from finding out another piece of information given what you already know]
• If all of the examples are positive, then we don't get any extra information from knowing the value of the feature for any particular example, since whatever the value of the feature, the example will be positive. Thus, the entropy of that feature is 0.
• However, if the feature separates the examples into 50% positive and 50% negative, then the amount of entropy is at a maximum, and knowing about that feature is very useful to us.
• For our decision tree, the best feature to pick as the one to classify on now is the one that gives you the most information, i.e., the one with the highest entropy.
Information Gain:
• It is defined as the entropy of the whole set minus the entropy when a particular feature is chosen.
Gain(S, F) = \mathrm{Entropy}(S) - \sum_{f \in \mathrm{values}(F)} \frac{|S_f|}{|S|}\, \mathrm{Entropy}(S_f)    (12.2)
• The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value.
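As a concrete illustration, the following minimal Python sketch computes the entropy of a set of labels and the information gain of a feature; the function names and the data layout (a list of feature dictionaries plus a list of labels) are illustrative choices, not part of the original notes.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels, in bits (0 log 0 is treated as 0).
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    # Gain(S, F) = Entropy(S) - sum over f of |S_f|/|S| * Entropy(S_f), as in Equation (12.2).
    # `examples` is a list of dicts mapping feature names to values; `labels` are the classes.
    total = len(labels)
    gain = entropy(labels)
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain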
The ID3 Algorithm
• If all examples have the same label:
  – return a leaf with that label
• Else if there are no features left to test:
  – return a leaf with the most common label
• Else:
  – choose the feature F̂ that maximises the information gain of S to be the next node, using Equation (12.2)
  – add a branch from the node for each possible value f in F̂
  – for each branch:
    * calculate S_f by removing F̂ from the set of features
    * recursively call the algorithm with S_f to compute the gain relative to the current set of examples
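The pseudocode above translates fairly directly into a recursive function. The sketch below is one possible rendering (it assumes the entropy and information_gain helpers from the previous sketch, and represents the tree as nested dictionaries):

from collections import Counter

def id3(examples, labels, features):
    # If all examples have the same label, return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # If there are no features left to test, return a leaf with the most common label.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise choose the feature that maximises the information gain (Equation 12.2).
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    # Add a branch for each possible value of the chosen feature and recurse.
    for value in set(ex[best] for ex in examples):
        branch = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == value]
        tree[best][value] = id3([ex for ex, _ in branch],
                                [lab for _, lab in branch],
                                remaining)
    return tree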
C4.5 Algorithm:
Classification Example: construct the decision tree to decide what to do in the evening
The entropy of the whole set S is:
Entropy(S) = -p_{party}\log_2 p_{party} - p_{study}\log_2 p_{study} - p_{pub}\log_2 p_{pub} - p_{TV}\log_2 p_{TV}
= -\frac{5}{10}\log_2\frac{5}{10} - \frac{3}{10}\log_2\frac{3}{10} - \frac{1}{10}\log_2\frac{1}{10} - \frac{1}{10}\log_2\frac{1}{10}
= 0.5 + 0.5211 + 0.3322 + 0.3322 = 1.6855    (12.11)
The information gain of each feature over the whole set is then:
Gain(S, Deadline) = 1.6855 - \frac{3}{10}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) - \frac{4}{10}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) - \frac{3}{10}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right)
= 1.6855 - 0.2755 - 0.6 - 0.2755 = 0.5345    (12.12)
Gain(S, Party) = 1.6855 - \frac{5}{10}\times 0 - \frac{5}{10}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{1}{5}\log_2\frac{1}{5} - \frac{1}{5}\log_2\frac{1}{5}\right)
= 1.6855 - 0 - 0.6855 = 1.0    (12.13)
Gain(S, Lazy) = 0.21 (computed in the same way).
Therefore, the root node will be the party feature, which has two feature values (‘yes’
and ‘no’), so it will have two branches coming out of it.
[Figure: the root node of the tree — Party? with 'Yes' and 'No' branches]
When we look at the 'yes' branch, we see that in all five cases where there was a party we went to it, so we just put a leaf node there, saying 'party'.
For the ‘no’ branch, out of the five cases there are three different outcomes, so now we
need to choose another feature.
The five cases we are looking at are:
Deadline? | Is there a party? | Lazy? | Activity
Urgent    | No                | Yes   | Study
None      | No                | Yes   | Pub
Near      | No                | No    | Study
Near      | No                | Yes   | TV
Urgent    | No                | Yes   | Study
We've used the party feature, so we just need to calculate the information gain of the other two over these five examples:
Gain(S, Deadline) = 1.371 - \frac{2}{5}\times 0 - \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) - \frac{1}{5}\times 0
= 1.371 - 0 - 0.4 - 0 = 0.971    (12.15)
Gain(S, Lazy) = 1.371 - \frac{4}{5}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) - \frac{1}{5}\times 0
= 1.371 - 1.2 - 0 = 0.171    (12.16)
Here, the Deadline feature has the maximum information gain, so we select the Deadline feature for splitting the data at this node.
[Figure: the final decision tree — Party? 'Yes' → Go to party; 'No' → Deadline? 'Urgent' → Study, 'None' → Go to pub, 'Near' → Lazy? 'Yes' → Watch TV, 'No' → Study]
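The arithmetic for the 'no party' subset can be checked with the entropy and information_gain helpers sketched earlier; the encoding of the five rows below follows the table above.

subset = [
    {"Deadline": "Urgent", "Lazy": "Yes"},
    {"Deadline": "None",   "Lazy": "Yes"},
    {"Deadline": "Near",   "Lazy": "No"},
    {"Deadline": "Near",   "Lazy": "Yes"},
    {"Deadline": "Urgent", "Lazy": "Yes"},
]
activities = ["Study", "Pub", "Study", "TV", "Study"]

print(entropy(activities))                               # ~1.371
print(information_gain(subset, activities, "Deadline"))  # ~0.971 (the largest gain)
print(information_gain(subset, activities, "Lazy"))      # ~0.171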
CART is another well-known tree-based algorithm; as its name (Classification and Regression Trees) indicates, it can be used for both classification and regression.
Gini Impurity:
Gini(D) = 1 - \sum_{i=1}^{k} p_i^2
The node with uniform class distribution has the highest impurity.
The minimum impurity is obtained when all records belong to the same class.
Node   | Class 0 count | Class 1 count | p_0 | p_1 | Gini impurity
Node A | 0             | 10            | 0   | 1   | 1 - 0^2 - 1^2 = 0
Node C | 5             | 5             | 0.5 | 0.5 | 1 - 0.5^2 - 0.5^2 = 0.5
An attribute with the smallest Gini Impurity is selected for splitting the node.
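A small sketch (illustrative names) that reproduces the Gini values for nodes A and C above:

def gini(counts):
    # Gini impurity 1 - sum(p_i^2), computed from a list of per-class counts.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([0, 10]))  # Node A: 1 - 0^2 - 1^2     = 0.0 (pure node, minimum impurity)
print(gini([5, 5]))   # Node C: 1 - 0.5^2 - 0.5^2 = 0.5 (uniform node, maximum impurity for two classes)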
[Figure: Gini impurity plotted against the fraction of class k (p_k); the impurity is 0 at p_k = 0 or 1 and peaks at p_k = 0.5]
Regression in Trees:
A Regression tree is an algorithm where the target variable is continuous and the tree
is used to predict its value.
[Figure: a regression tree — internal nodes test conditions and each leaf stores a predicted value]
Regression Tree works by splitting the training data recursively into smaller subsets
based on specific criteria.
The objective is to split the data in a way that minimizes the residual error (sum of squared errors) in each subset.
Residual Reduction - Residual reduction is a measure of how much the average squared difference between the predicted values and the actual values for the target variable is reduced by splitting the subset. The greater the residual reduction, the better the split fits the data.
Splitting Criteria- CART evaluates every possible split at each node and selects the
one that results in the greatest reduction of residual error in the resulting subsets. This
process is repeated until a stopping criterion is met, such as reaching the maximum tree
depth or having too few instances in a leaf node.
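As an illustration of this splitting criterion, the sketch below (illustrative names, a single numeric feature, and made-up data) scans every candidate threshold and keeps the one that gives the greatest reduction in the sum of squared errors:

import numpy as np

def best_split(x, y):
    # Return the threshold on feature x that maximises the reduction in SSE.
    def sse(values):
        return ((values - values.mean()) ** 2).sum() if len(values) else 0.0

    parent_sse = sse(y)
    best_threshold, best_reduction = None, 0.0
    for threshold in np.unique(x)[:-1]:          # candidate split points
        left, right = y[x <= threshold], y[x > threshold]
        reduction = parent_sse - (sse(left) + sse(right))
        if reduction > best_reduction:
            best_threshold, best_reduction = threshold, reduction
    return best_threshold, best_reduction

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 4.8, 5.2, 5.0])
print(best_split(x, y))   # splits at x <= 3, where the SSE reduction is largest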
Ensemble Learning:
Ensemble methods combine the predictions of several base models (for example by boosting, bagging, or stacking) to produce a single, more accurate model.
Boosting:
Boosting is an ensemble technique that combines multiple weak learners to create a
strong learner.
[Figure: boosting — classifiers (Classifier 1, Classifier 2, ...) are trained in sequence on re-weighted versions of the training data (BS1, BS2, ...)]
• The ensemble of weak models is trained in series, so that each model that comes next tries to correct the errors of the previous model, until the entire training dataset is predicted correctly.
• One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting).
AdaBoost:
Steps in AdaBoost:
1. Weight Initialization
At the start, every training instance is assigned an identical weight. These weights determine the importance of each example.
2. Model Training
A weak learner is trained on the dataset, with the aim of minimizing the classification error.
3. Weighted Error Calculation
The weighted error is then calculated by summing up the weights of the misclassified instances. This step emphasizes the importance of the samples that are hard to classify.
4. Learner Weight Calculation
The weight of the weak learner is calculated based on its performance in classifying the training data. Models that perform well are assigned higher weights, indicating that they are more reliable.
5. Update Instance Weights
The instance weights are updated to give more weight to the samples misclassified in the previous step.
6. Repeat
Steps 2 through 5 are repeated for a predefined number of iterations or until a specified performance threshold is met.
7. Final Model
The final strong model (also referred to as the ensemble) is created by combining the weighted outputs of all the weak learners.
8. Classification
To make predictions on new data, AdaBoost uses the final ensemble model.
Algorithm:
AdaBoost Algorithm
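A minimal sketch of the steps described above — discrete AdaBoost for two-class labels coded as +1/-1, with scikit-learn decision stumps as the weak learners (the function names are illustrative, and this is a simplified version rather than the exact listing from the notes):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    y = np.asarray(y)
    n = len(y)
    weights = np.full(n, 1.0 / n)                      # 1. identical initial weights
    learners, alphas = [], []
    for _ in range(n_rounds):                          # 6. repeat for a fixed number of rounds
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)         # 2. train a weak learner
        pred = stump.predict(X)
        err = weights[pred != y].sum()                 # 3. weighted error
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)          # 4. weight of this weak learner
        weights *= np.exp(-alpha * y * pred)           # 5. boost the misclassified samples
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # 7./8. the final model is the weighted (signed) vote of all the weak learners
    return np.sign(sum(a * clf.predict(X) for clf, a in zip(learners, alphas)))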
Bagging:
• Bagging (bootstrap aggregating) is a supervised learning technique that can be used for both regression and classification tasks. Each model in the ensemble is trained on a bootstrap sample of the training data (a sample drawn with replacement), and the individual predictions are then combined.
[Figure: bagging — bootstrap samples BS1, BS2, ... are drawn from the training data and a separate classifier (Classifier 1, Classifier 2, ...) is trained on each]
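A minimal sketch of the same idea in code — bootstrap samples drawn with replacement, one decision tree per sample, and a majority vote at prediction time (scikit-learn also provides this as BaggingClassifier; the names below are illustrative and integer class labels are assumed):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(y), size=len(y))     # a bootstrap sample (with replacement)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])   # one row of predictions per tree
    # majority vote per example (assumes integer class labels 0, 1, 2, ...)
    return np.array([np.bincount(col).argmax() for col in votes.T])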
Random Forest:
The idea is largely that if one tree is good, then many trees (a forest) should be better,
provided that there is enough variety between them.
It works by creating a number of Decision Trees during the training phase.
Each tree is constructed from a random subset of the data set, and a random subset of the features is considered at each split.
This randomness introduces variability among individual trees, reducing the risk of
overfitting and improving overall prediction performance.
In prediction, the algorithm aggregates the results of all trees, either by voting (for
classification tasks) or by averaging (for regression tasks)
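In practice this is usually done with a library implementation; a brief scikit-learn example (the dataset and parameter values are just for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample, with a random subset of features tried at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # mean accuracy on the held-out data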
[Figure: random forest — each tree predicts a class and the final prediction (here Class A) is chosen by majority voting]
Stacking:
• Stacking trains several different base classifiers and then trains a further classifier (a meta-learner) on their outputs to produce the final prediction.
[Figure: stacking — the outputs of the base classifiers are combined by a meta-level classifier]
Different ways to combine classifiers:
• If the number of classifiers is odd and the classifiers are each independent of each other, then majority voting will return the correct label if more than half of the classifiers agree.
Assuming that each individual classifier has a success rate of p, the probability of the
ensemble getting the correct answer is a binomial distribution of the form:
\sum_{k=\lfloor T/2\rfloor + 1}^{T} \binom{T}{k}\, p^k (1-p)^{T-k}
where T is the number of classifiers.
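For example, the sum can be evaluated directly for hypothetical values of T and p:

from math import comb

def majority_vote_accuracy(T, p):
    # Probability that more than half of T independent classifiers are correct.
    return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(T // 2 + 1, T + 1))

print(majority_vote_accuracy(1, 0.6))    # 0.60  - a single classifier
print(majority_vote_accuracy(11, 0.6))   # ~0.75 - an ensemble of 11 such classifiers does better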
For regression problems, rather than taking the majority vote, it is common to take the
mean of the outputs.
However, the mean is heavily affected by outliers, with the result that the median is a
more common average to use.
It is the use of the median that produces the 'bragging' algorithm, a name which is meant to imply 'robust bagging'.
There is another algorithm for combining classifiers, known as the mixture of experts.
Mixture of Experts
Mixture of experts (MoE) is a machine learning technique that uses multiple specialized
models to solve a problem. MoE is a type of ensemble learning that combines predictions
from multiple models to improve accuracy
Working:
1. An input is evaluated by a gating network that determines which experts to activate
2. The selected experts are assigned weights and their outputs are combined to produce a
final result
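A tiny numerical sketch of these two steps, with two logistic experts and a softmax gating network (all of the weight values are made up for illustration):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_of_experts(x, expert_weights, gate_weights):
    # 1. the gating network looks at the input and produces one weight per expert
    gate = softmax(gate_weights @ x)
    # 2. each (logistic) expert produces an output; the gate weights combine them
    expert_outputs = np.array([1.0 / (1.0 + np.exp(-w @ x)) for w in expert_weights])
    return gate @ expert_outputs

x = np.array([1.0, 0.5])
expert_weights = [np.array([2.0, -1.0]), np.array([-1.0, 3.0])]   # one weight vector per expert
gate_weights = np.array([[1.0, 0.0], [0.0, 1.0]])                 # one row of gating weights per expert
print(mixture_of_experts(x, expert_weights, gate_weights))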
[Figure: a mixture of experts network, with expert outputs combined through gating networks]
Algorithm:
The Mixture of Experts Algorithm
• For each expert, calculate the probability of the input belonging to each possible class by computing (where the w_i are the weights for that classifier):
  o_i(x, W) = \frac{1}{1 + \exp(-w_i \cdot x)}    (13.6)
• For each gating network up the tree:
  – compute:    (13.7)
• Pass as input to the next level gates (where the sum is over the relevant inputs to that gate):
  \sum_k g_k o_k    (13.8)
Basic Statistics:
Mean:
• The mean (average) of a dataset is the sum of all the values divided by the number of values.
Median:
• It is calculated by arranging the values in the dataset in order and finding the value that lies in the middle.
• If there are an even number of values in the dataset, the median is the average of the two middle values.
• The median is a useful measure of central tendency because it is not affected by outliers, meaning that extreme values do not significantly affect the value of the median.
Variance:
• Variance is a measure of how much the data for a variable varies from its mean.
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2
Where,
• x_i is the i-th observation,
• \bar{x} is the mean, and
• N is the number of observations
Covariance:
• Covariance measures how two variables vary together; a positive covariance means they tend to increase together, while a negative covariance means one tends to increase as the other decreases.
\mathrm{cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})
Standard Deviation:
• The standard deviation is the square root of the variance, and is expressed in the same units as the data itself.
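All of these quantities are available directly in NumPy; a small illustration with made-up data:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 6.0, 8.0])

print(np.mean(x))           # mean: 5.0
print(np.median(x))         # median: 4.5
print(np.var(x))            # variance (average squared deviation from the mean): 4.0
print(np.std(x))            # standard deviation (square root of the variance): 2.0
print(np.cov(x, y)[0, 1])   # covariance of x and y (np.cov uses the sample version, dividing by N - 1)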
The Normal (Gaussian) Distribution:
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
where
\mu = mean of x
\sigma = standard deviation of x
\pi \approx 3.14159...
e \approx 2.71828...
Bias and Variance:
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.
Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data.
A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before.
As a result, such models perform very well on training data but have high error rates on test data.
If our model is too simple and has very few parameters then it may have high bias and
low variance.
On the other hand, if our model has a large number of parameters then it is going to have high variance and low bias.
So we need to find the right balance, without overfitting or underfitting the data.
[Figure: the bias-variance trade-off — as model complexity increases, squared bias falls and variance rises; the generalization (test) error is their sum plus the irreducible error, so it is high in both the underfitting and overfitting zones]
Gaussian Mixture Models:
A Gaussian Mixture Model (GMM) models the data as coming from a mixture of several Gaussian distributions, one per cluster.
[Figure: data points grouped into clusters (Cluster 1, Cluster 3, ...), each modelled by its own Gaussian component]
Mathematical Function:
The overall distribution is a weighted sum of multiple Gaussian distributions
The probability density function for a GMM
p(x) = \sum_{m=1}^{M} \alpha_m\, \phi(x; \mu_m, \Sigma_m)
Here \phi(x; \mu_m, \Sigma_m) is a Gaussian density with mean \mu_m and covariance matrix \Sigma_m, and the \alpha_m are the weights, with the constraint that \sum_{m=1}^{M} \alpha_m = 1.
The probability that a given data point x_i belongs to Gaussian component m is:
P(x_i \in C_m) = \frac{\alpha_m\, \phi(x_i; \mu_m, \Sigma_m)}{\sum_{k=1}^{M} \alpha_k\, \phi(x_i; \mu_k, \Sigma_k)}
The challenge is determining the weights \alpha_m, which is done by maximizing the likelihood function.
The likelihood function is the probability of the observed data given the model parameters. To
simplify computations, we take the log-likelihood because:
e Probabilities are small, leading to numerical stability issues.
e The logarithm turns the product of probabilities into a sum, making optimization easier.
Since we do not know which Gaussian a given data point belongs to, we introduce a latent variable f that indicates the Gaussian component:
P(y) = \pi\, \phi(y; \mu_1, \sigma_1) + (1 - \pi)\, \phi(y; \mu_2, \sigma_2)
where \pi is the mixing weight of the first component, and \mu_1, \sigma_1 and \mu_2, \sigma_2 are the means and standard deviations of the two Gaussian components.
The expectation step computes the expected value of the latent variable f given the observed data:
\hat{\gamma}_i = \frac{\hat{\pi}\, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1)}{\hat{\pi}\, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1) + (1 - \hat{\pi})\, \phi(y_i; \hat{\mu}_2, \hat{\sigma}_2)}
• This gives the probability that a given data point comes from Gaussian 1.
• Given the responsibilities computed in the E-step, the parameters (\mu_m, \sigma_m, \pi) are updated by maximizing the expected log-likelihood:
(M-step 1) \hat{\mu}_1 = \sum_i \hat{\gamma}_i y_i / \sum_i \hat{\gamma}_i
(M-step 2) \hat{\mu}_2 = \sum_i (1 - \hat{\gamma}_i) y_i / \sum_i (1 - \hat{\gamma}_i)
(M-step 3) \hat{\sigma}_1^2 = \sum_i \hat{\gamma}_i (y_i - \hat{\mu}_1)^2 / \sum_i \hat{\gamma}_i
Algorithm:
• Initialisation
  – set \hat{\mu}_1 and \hat{\mu}_2 to be randomly chosen values from the dataset
  – set \hat{\sigma}_1^2 = \hat{\sigma}_2^2 = \sum_{i=1}^{N}(y_i - \bar{y})^2 / N (where \bar{y} is the mean of the entire dataset)
  – set \hat{\pi} = 0.5
• Repeat until convergence:
  – (E-step) \hat{\gamma}_i = \frac{\hat{\pi}\, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1)}{\hat{\pi}\, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1) + (1 - \hat{\pi})\, \phi(y_i; \hat{\mu}_2, \hat{\sigma}_2)}
  – (M-step 1) \hat{\mu}_1 = \sum_i \hat{\gamma}_i y_i / \sum_i \hat{\gamma}_i
  – (M-step 2) \hat{\mu}_2 = \sum_i (1 - \hat{\gamma}_i) y_i / \sum_i (1 - \hat{\gamma}_i)
  – (M-step 3) \hat{\sigma}_1^2 = \sum_i \hat{\gamma}_i (y_i - \hat{\mu}_1)^2 / \sum_i \hat{\gamma}_i
  – (M-step 4) \hat{\sigma}_2^2 = \sum_i (1 - \hat{\gamma}_i)(y_i - \hat{\mu}_2)^2 / \sum_i (1 - \hat{\gamma}_i)
  – (M-step 5) \hat{\pi} = \sum_i \hat{\gamma}_i / N
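A compact NumPy sketch of this EM loop for a one-dimensional, two-component mixture (the initialisation, the fixed number of iterations, and the variable names are simplifications for illustration):

import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(y, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu1, mu2 = rng.choice(y, size=2, replace=False)   # means start at random data points
    s1 = s2 = np.sqrt(np.var(y))                      # both spreads start at the overall spread
    pi = 0.5                                          # mixing weight of component 1
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point
        p1 = pi * normal_pdf(y, mu1, s1)
        p2 = (1 - pi) * normal_pdf(y, mu2, s2)
        gamma = p1 / (p1 + p2)
        # M-step: re-estimate the means, standard deviations and mixing weight
        mu1 = (gamma * y).sum() / gamma.sum()
        mu2 = ((1 - gamma) * y).sum() / (1 - gamma).sum()
        s1 = np.sqrt((gamma * (y - mu1) ** 2).sum() / gamma.sum())
        s2 = np.sqrt(((1 - gamma) * (y - mu2) ** 2).sum() / (1 - gamma).sum())
        pi = gamma.mean()
    return mu1, s1, mu2, s2, pi

# two made-up clusters of one-dimensional points
rng = np.random.default_rng(42)
y = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
print(em_two_gaussians(y))   # recovers means near 0 and 5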