ML-unit-3
UNIT-II
Learning with Trees: Decision Trees, Constructing Decision Trees, Classification and
Regression Trees.
Ensemble Learning: Boosting, Bagging, Different ways to combine classifiers, Basic Statistics,
Gaussian Mixture Models, Nearest Neighbour Methods.
Unsupervised Learning: K Means Algorithm.
Decision Trees:
Decision Tree is a Supervised learning technique that can be used for both classification
and Regression problems, but mostly it is preferred for solving Classification problems.
It is a tree-structured classifier, where internal nodes represent the features of a dataset,
branches represent the decision rules and each leaf node represents the outcome.
In a Decision tree, there are two types of nodes: the Decision Node and the Leaf Node.
Decision nodes are used to make decisions and have multiple branches, whereas
leaf nodes are the outputs of those decisions and do not contain any further branches.
The decisions or the tests are performed on the basis of features of the given dataset.
It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
[Figure: the general structure of a decision tree — a root node splits into decision nodes and sub-trees, which end in leaf nodes]
Example:
[FIGURE 12.1 A simple decision tree to decide how you will spend the evening — the root asks 'Party?' and the leaves are 'Go to party', 'Go to pub', 'Watch TV' and 'Study']
• One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if ... then rules) that then go into program code very simply.
Ex: if there is a party then go to it
    if there is not a party and you have an urgent deadline then study
Constructing Decision Trees:
Types of Decision Tree Algorithms:
• ID3: This algorithm measures how mixed up the data is at a node using something called entropy. It then chooses the feature that helps to clarify the data the most.
• C4.5: This is an improved version of ID3 that can handle missing data and continuous attributes.
• CART: This algorithm uses a different measure called Gini impurity to decide how to split the data. It can be used for both classification (sorting data into categories) and regression (predicting continuous values) tasks.
ID3 Algorithm:
Entropy(p) = -\sum_{i} p_i \log_2 p_i
• where the logarithm is base 2 because we are imagining that we encode everything using binary digits (bits), and we define 0 log 0 = 0.
[FIGURE 12.2 A graph of entropy against the proportion of positive examples, detailing how much information is available from finding out another piece of information given what you already know]
• If all of the examples are positive, then we don't get any extra information from knowing the value of the feature for any particular example, since whatever the value of the feature, the example will be positive. Thus, the entropy of that feature is 0.
• However, if the feature separates the examples into 50% positive and 50% negative, then the amount of entropy is at a maximum, and knowing about that feature is very useful to us.
• For our decision tree, the best feature to pick as the one to classify on now is the one that gives you the most information, i.e., the one with the highest entropy.
Information Gain:
• It is defined as the entropy of the whole set minus the entropy when a particular feature is chosen.
Gain(S, F) = \mathrm{Entropy}(S) - \sum_{f \in \mathrm{values}(F)} \frac{|S_f|}{|S|}\, \mathrm{Entropy}(S_f)    (12.2)
• The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value.
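As a concrete illustration, the following minimal Python sketch computes the entropy of a set of labels and the information gain of a feature; the function names and the data layout (a list of feature dictionaries plus a list of labels) are illustrative choices, not part of the original notes.

import math
from collections import Counter

def entropy(labels):
    # Entropy of a list of class labels, in bits (0 log 0 is treated as 0).
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, labels, feature):
    # Gain(S, F) = Entropy(S) - sum over f of |S_f|/|S| * Entropy(S_f), as in Equation (12.2).
    # `examples` is a list of dicts mapping feature names to values; `labels` are the classes.
    total = len(labels)
    gain = entropy(labels)
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain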
The ID3 Algorithm
• If all examples have the same label:
  – return a leaf with that label
• Else if there are no features left to test:
  – return a leaf with the most common label
• Else:
  – choose the feature F̂ that maximises the information gain of S to be the next node, using Equation (12.2)
  – add a branch from the node for each possible value f in F̂
  – for each branch:
    * calculate S_f by removing F̂ from the set of features
    * recursively call the algorithm with S_f to compute the gain relative to the current set of examples
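The pseudocode above translates fairly directly into a recursive function. The sketch below is one possible rendering (it assumes the entropy and information_gain helpers from the previous sketch, and represents the tree as nested dictionaries):

from collections import Counter

def id3(examples, labels, features):
    # If all examples have the same label, return a leaf with that label.
    if len(set(labels)) == 1:
        return labels[0]
    # If there are no features left to test, return a leaf with the most common label.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise choose the feature that maximises the information gain (Equation 12.2).
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    # Add a branch for each possible value of the chosen feature and recurse.
    for value in set(ex[best] for ex in examples):
        branch = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == value]
        tree[best][value] = id3([ex for ex, _ in branch],
                                [lab for _, lab in branch],
                                remaining)
    return tree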
C4.5 Algorithm:
Classification Example: construct the decision tree to decide what to do in the evening
The entropy of the whole set S is:
Entropy(S) = -p_{party}\log_2 p_{party} - p_{study}\log_2 p_{study} - p_{pub}\log_2 p_{pub} - p_{TV}\log_2 p_{TV}
= -\frac{5}{10}\log_2\frac{5}{10} - \frac{3}{10}\log_2\frac{3}{10} - \frac{1}{10}\log_2\frac{1}{10} - \frac{1}{10}\log_2\frac{1}{10}
= 0.5 + 0.5211 + 0.3322 + 0.3322 = 1.6855    (12.11)
The information gain of each feature over the whole set is then:
Gain(S, Deadline) = 1.6855 - \frac{3}{10}\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) - \frac{4}{10}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) - \frac{3}{10}\left(-\frac{2}{3}\log_2\frac{2}{3} - \frac{1}{3}\log_2\frac{1}{3}\right)
= 1.6855 - 0.2755 - 0.6 - 0.2755 = 0.5345    (12.12)
Gain(S, Party) = 1.6855 - \frac{5}{10}\times 0 - \frac{5}{10}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{1}{5}\log_2\frac{1}{5} - \frac{1}{5}\log_2\frac{1}{5}\right)
= 1.6855 - 0 - 0.6855 = 1.0    (12.13)
Gain(S, Lazy) = 0.21 (computed in the same way).
Therefore, the root node will be the party feature, which has two feature values (‘yes’
and ‘no’), so it will have two branches coming out of it.
[Figure: the root node of the tree — Party? with 'Yes' and 'No' branches]
When we look at the 'yes' branch, we see that in all five cases where there was a party we went to it, so we just put a leaf node there, saying 'party'.
For the ‘no’ branch, out of the five cases there are three different outcomes, so now we
need to choose another feature.
The five cases we are looking at are:
Deadline? | Is there a party? | Lazy? | Activity
Urgent    | No                | Yes   | Study
None      | No                | Yes   | Pub
Near      | No                | No    | Study
Near      | No                | Yes   | TV
Urgent    | No                | Yes   | Study
We've used the party feature, so we just need to calculate the information gain of the other two over these five examples:
Gain(S, Deadline) = 1.371 - \frac{2}{5}\times 0 - \frac{2}{5}\left(-\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2}\right) - \frac{1}{5}\times 0
= 1.371 - 0 - 0.4 - 0 = 0.971    (12.15)
Gain(S, Lazy) = 1.371 - \frac{4}{5}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{4}\log_2\frac{1}{4}\right) - \frac{1}{5}\times 0
= 1.371 - 1.2 - 0 = 0.171    (12.16)
Here, the Deadline feature has the maximum information gain, so we select the Deadline feature for splitting the data at this node.
[Figure: the final decision tree — Party? 'Yes' → Go to party; 'No' → Deadline? 'Urgent' → Study, 'None' → Go to pub, 'Near' → Lazy? 'Yes' → Watch TV, 'No' → Study]
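The arithmetic for the 'no party' subset can be checked with the entropy and information_gain helpers sketched earlier; the encoding of the five rows below follows the table above.

subset = [
    {"Deadline": "Urgent", "Lazy": "Yes"},
    {"Deadline": "None",   "Lazy": "Yes"},
    {"Deadline": "Near",   "Lazy": "No"},
    {"Deadline": "Near",   "Lazy": "Yes"},
    {"Deadline": "Urgent", "Lazy": "Yes"},
]
activities = ["Study", "Pub", "Study", "TV", "Study"]

print(entropy(activities))                               # ~1.371
print(information_gain(subset, activities, "Deadline"))  # ~0.971 (the largest gain)
print(information_gain(subset, activities, "Lazy"))      # ~0.171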
CART is another well-known tree-based algorithm; as its name (Classification and Regression Trees) indicates, it can be used for both classification and regression.
Gini Impurity:
Gini(D) = 1 - \sum_{i=1}^{k} p_i^2
The node with uniform class distribution has the highest impurity.
The minimum impurity is obtained when all records belong to the same class.
Node   | Class 0 count | Class 1 count | p_0 | p_1 | Gini impurity
Node A | 0             | 10            | 0   | 1   | 1 - 0^2 - 1^2 = 0
Node C | 5             | 5             | 0.5 | 0.5 | 1 - 0.5^2 - 0.5^2 = 0.5
An attribute with the smallest Gini Impurity is selected for splitting the node.
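A small sketch (illustrative names) that reproduces the Gini values for nodes A and C above:

def gini(counts):
    # Gini impurity 1 - sum(p_i^2), computed from a list of per-class counts.
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([0, 10]))  # Node A: 1 - 0^2 - 1^2     = 0.0 (pure node, minimum impurity)
print(gini([5, 5]))   # Node C: 1 - 0.5^2 - 0.5^2 = 0.5 (uniform node, maximum impurity for two classes)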
[Figure: Gini impurity plotted against the fraction of class k (p_k); the impurity is 0 at p_k = 0 or 1 and peaks at p_k = 0.5]
Regression in Trees:
A Regression tree is an algorithm where the target variable is continuous and the tree
is used to predict its value.
[Figure: a regression tree — internal nodes test conditions and each leaf stores a predicted value]
Regression Tree works by splitting the training data recursively into smaller subsets
based on specific criteria.
The objective is to split the data in a way that minimizes the residual error (sum of squared errors) in each subset.
Residual Reduction - Residual reduction is a measure of how much the average squared difference between the predicted values and the actual values for the target variable is reduced by splitting the subset. The greater the residual reduction, the better the split fits the data.
Splitting Criteria- CART evaluates every possible split at each node and selects the
one that results in the greatest reduction of residual error in the resulting subsets. This
process is repeated until a stopping criterion is met, such as reaching the maximum tree
depth or having too few instances in a leaf node.
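As an illustration of this splitting criterion, the sketch below (illustrative names, a single numeric feature, and made-up data) scans every candidate threshold and keeps the one that gives the greatest reduction in the sum of squared errors:

import numpy as np

def best_split(x, y):
    # Return the threshold on feature x that maximises the reduction in SSE.
    def sse(values):
        return ((values - values.mean()) ** 2).sum() if len(values) else 0.0

    parent_sse = sse(y)
    best_threshold, best_reduction = None, 0.0
    for threshold in np.unique(x)[:-1]:          # candidate split points
        left, right = y[x <= threshold], y[x > threshold]
        reduction = parent_sse - (sse(left) + sse(right))
        if reduction > best_reduction:
            best_threshold, best_reduction = threshold, reduction
    return best_threshold, best_reduction

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 0.9, 1.0, 4.8, 5.2, 5.0])
print(best_split(x, y))   # splits at x <= 3, where the SSE reduction is largest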
Ensemble Learning:
Ensemble methods combine the predictions of several base models (for example by boosting, bagging, or stacking) to produce a single, more accurate model.
Boosting:
Boosting is an ensemble technique that combines multiple weak learners to create a
strong learner.
[Figure: boosting — classifiers (Classifier 1, Classifier 2, ...) are trained in sequence on re-weighted versions of the training data (BS1, BS2, ...)]
• The ensemble of weak models is trained in series, so that each model that comes next tries to correct the errors of the previous model, until the entire training dataset is predicted correctly.
• One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting).
AdaBoost:
Steps in AdaBoost:
1. Weight Initialization
At the start, every training instance is assigned an identical weight. These weights determine the importance of each example.
2. Model Training
A weak learner is trained on the dataset, with the aim of minimizing the classification error.
3. Weighted Error Calculation
The weighted error is then calculated by summing up the weights of the misclassified instances. This step emphasizes the importance of the samples that are hard to classify.
4. Learner Weight Calculation
The weight of the weak learner is calculated based on its performance in classifying the training data. Models that perform well are assigned higher weights, indicating that they are more reliable.
5. Update Instance Weights
The instance weights are updated to give more weight to the samples misclassified in the previous step.
6. Repeat
Steps 2 through 5 are repeated for a predefined number of iterations or until a specified performance threshold is met.
7. Final Model
The final strong model (also referred to as the ensemble) is created by combining the weighted outputs of all the weak learners.
8. Classification
To make predictions on new data, AdaBoost uses the final ensemble model.
Algorithm:
AdaBoost Algorithm
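A minimal sketch of the steps described above — discrete AdaBoost for two-class labels coded as +1/-1, with scikit-learn decision stumps as the weak learners (the function names are illustrative, and this is a simplified version rather than the exact listing from the notes):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=10):
    y = np.asarray(y)
    n = len(y)
    weights = np.full(n, 1.0 / n)                      # 1. identical initial weights
    learners, alphas = [], []
    for _ in range(n_rounds):                          # 6. repeat for a fixed number of rounds
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=weights)         # 2. train a weak learner
        pred = stump.predict(X)
        err = weights[pred != y].sum()                 # 3. weighted error
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)          # 4. weight of this weak learner
        weights *= np.exp(-alpha * y * pred)           # 5. boost the misclassified samples
        weights /= weights.sum()
        learners.append(stump)
        alphas.append(alpha)
    return learners, alphas

def adaboost_predict(learners, alphas, X):
    # 7./8. the final model is the weighted (signed) vote of all the weak learners
    return np.sign(sum(a * clf.predict(X) for clf, a in zip(learners, alphas)))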
Bagging:
• Bagging (bootstrap aggregating) is a supervised learning technique that can be used for both regression and classification tasks. Each model in the ensemble is trained on a bootstrap sample of the training data (a sample drawn with replacement), and the individual predictions are then combined.
[Figure: bagging — bootstrap samples BS1, BS2, ... are drawn from the training data and a separate classifier (Classifier 1, Classifier 2, ...) is trained on each]
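A minimal sketch of the same idea in code — bootstrap samples drawn with replacement, one decision tree per sample, and a majority vote at prediction time (scikit-learn also provides this as BaggingClassifier; the names below are illustrative and integer class labels are assumed):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_estimators=10, seed=0):
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    models = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(y), size=len(y))     # a bootstrap sample (with replacement)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])   # one row of predictions per tree
    # majority vote per example (assumes integer class labels 0, 1, 2, ...)
    return np.array([np.bincount(col).argmax() for col in votes.T])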
Random Forest:
The idea is largely that if one tree is good, then many trees (a forest) should be better,
provided that there is enough variety between them.
It works by creating a number of Decision Trees during the training phase.
Each tree is constructed from a random subset of the data set, and a random subset of the features is considered at each split.
This randomness introduces variability among individual trees, reducing the risk of
overfitting and improving overall prediction performance.
In prediction, the algorithm aggregates the results of all trees, either by voting (for
classification tasks) or by averaging (for regression tasks)
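In practice this is usually done with a library implementation; a brief scikit-learn example (the dataset and parameter values are just for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 trees, each grown on a bootstrap sample, with a random subset of features tried at each split
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))   # mean accuracy on the held-out data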
[Figure: random forest — each tree predicts a class and the final prediction (here Class A) is chosen by majority voting]
Stacking:
• Stacking trains several different base classifiers and then trains a further classifier (a meta-learner) on their outputs to produce the final prediction.
[Figure: stacking — the outputs of the base classifiers are combined by a meta-level classifier]
Different ways to combine classifiers:
• If the number of classifiers is odd and the classifiers are each independent of each other, then majority voting will return the correct label if more than half of the classifiers agree.
Assuming that each individual classifier has a success rate of p, the probability of the
ensemble getting the correct answer is a binomial distribution of the form:
\sum_{k=\lfloor T/2\rfloor + 1}^{T} \binom{T}{k}\, p^k (1-p)^{T-k}
where T is the number of classifiers.
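For example, the sum can be evaluated directly for hypothetical values of T and p:

from math import comb

def majority_vote_accuracy(T, p):
    # Probability that more than half of T independent classifiers are correct.
    return sum(comb(T, k) * p**k * (1 - p)**(T - k) for k in range(T // 2 + 1, T + 1))

print(majority_vote_accuracy(1, 0.6))    # 0.60  - a single classifier
print(majority_vote_accuracy(11, 0.6))   # ~0.75 - an ensemble of 11 such classifiers does better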
For regression problems, rather than taking the majority vote, it is common to take the
mean of the outputs.
However, the mean is heavily affected by outliers, with the result that the median is a
more common average to use.
It is the use of the median that produces the 'bragging' algorithm, a name which is meant to imply 'robust bagging'.
There is another algorithm for combining classifiers, known as the mixture of experts.
Mixture of Experts
Mixture of experts (MoE) is a machine learning technique that uses multiple specialized
models to solve a problem. MoE is a type of ensemble learning that combines predictions
from multiple models to improve accuracy
Working:
1. An input is evaluated by a gating network that determines which experts to activate
2. The selected experts are assigned weights and their outputs are combined to produce a
final result
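A tiny numerical sketch of these two steps, with two logistic experts and a softmax gating network (all of the weight values are made up for illustration):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mixture_of_experts(x, expert_weights, gate_weights):
    # 1. the gating network looks at the input and produces one weight per expert
    gate = softmax(gate_weights @ x)
    # 2. each (logistic) expert produces an output; the gate weights combine them
    expert_outputs = np.array([1.0 / (1.0 + np.exp(-w @ x)) for w in expert_weights])
    return gate @ expert_outputs

x = np.array([1.0, 0.5])
expert_weights = [np.array([2.0, -1.0]), np.array([-1.0, 3.0])]   # one weight vector per expert
gate_weights = np.array([[1.0, 0.0], [0.0, 1.0]])                 # one row of gating weights per expert
print(mixture_of_experts(x, expert_weights, gate_weights))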
[Figure: a mixture of experts network, with expert outputs combined through gating networks]
Algorithm:
The Mixture of Experts Algorithm
• For each expert, calculate the probability of the input belonging to each possible class by computing (where the w_i are the weights for that classifier):
  o_i(x, W) = \frac{1}{1 + \exp(-w_i \cdot x)}    (13.6)
• For each gating network up the tree:
  – compute:    (13.7)
• Pass as input to the next level gates (where the sum is over the relevant inputs to that gate):
  \sum_k g_k o_k    (13.8)
Basic Statistics:
Mean:
• The mean (average) of a dataset is the sum of all the values divided by the number of values.
Median:
• It is calculated by arranging the values in the dataset in order and finding the value that lies in the middle.
• If there are an even number of values in the dataset, the median is the average of the two middle values.
• The median is a useful measure of central tendency because it is not affected by outliers, meaning that extreme values do not significantly affect the value of the median.
Variance:
• Variance is a measure of how much the data for a variable varies from its mean.
\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2
Where,
• x_i is the i-th observation,
• \bar{x} is the mean, and
• N is the number of observations
Covariance:
• Covariance measures how two variables vary together; a positive covariance means they tend to increase together, while a negative covariance means one tends to increase as the other decreases.
\mathrm{cov}(x, y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})(y_i - \bar{y})
Standard Deviation:
• The standard deviation is the square root of the variance, and is expressed in the same units as the data itself.
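All of these quantities are available directly in NumPy; a small illustration with made-up data:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 6.0, 8.0])

print(np.mean(x))           # mean: 5.0
print(np.median(x))         # median: 4.5
print(np.var(x))            # variance (average squared deviation from the mean): 4.0
print(np.std(x))            # standard deviation (square root of the variance): 2.0
print(np.cov(x, y)[0, 1])   # covariance of x and y (np.cov uses the sample version, dividing by N - 1)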
The Normal (Gaussian) Distribution:
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
where
\mu = mean of x
\sigma = standard deviation of x
\pi \approx 3.14159...
e \approx 2.71828...
Bias and Variance:
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.
Variance is the variability of model prediction for a given data point, or a value which tells us the spread of our data.
A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before.
As a result, such models perform very well on training data but have high error rates on test data.
If our model is too simple and has very few parameters then it may have high bias and
low variance.
On the other hand, if our model has a large number of parameters then it is going to have high variance and low bias.
So we need to find the right balance, without overfitting or underfitting the data.
[Figure: the bias-variance trade-off — as model complexity increases, squared bias falls and variance rises; the generalization (test) error is their sum plus the irreducible error, so it is high in both the underfitting and overfitting zones]
Gaussian Mixture Models:
A Gaussian Mixture Model (GMM) models the data as coming from a mixture of several Gaussian distributions, one per cluster.
[Figure: data points grouped into clusters (Cluster 1, Cluster 3, ...), each modelled by its own Gaussian component]
Mathematical Function:
The overall distribution is a weighted sum of multiple Gaussian distributions
The probability density function for a GMM
p(x) = \sum_{m=1}^{M} \alpha_m\, \phi(x; \mu_m, \Sigma_m)
Here \phi(x; \mu_m, \Sigma_m) is a Gaussian density with mean \mu_m and covariance matrix \Sigma_m, and the \alpha_m are the weights, with the constraint that \sum_{m=1}^{M} \alpha_m = 1.
The probability that a given data point x_i belongs to Gaussian component m is:
P(x_i \in C_m) = \frac{\alpha_m\, \phi(x_i; \mu_m, \Sigma_m)}{\sum_{k=1}^{M} \alpha_k\, \phi(x_i; \mu_k, \Sigma_k)}
The challenge is determining the weights \alpha_m, which is done by maximizing the likelihood function.
The likelihood function is the probability of the observed data given the model parameters. To
simplify computations, we take the log-likelihood because:
e Probabilities are small, leading to numerical stability issues.
e The logarithm turns the product of probabilities into a sum, making optimization easier.
Since we do not know which Gaussian a given data point belongs to, we introduce a latent variable f that indicates the Gaussian component:
P(y) = \pi\, \phi(y; \mu_1, \sigma_1) + (1 - \pi)\, \phi(y; \mu_2, \sigma_2)
where \pi is the mixing weight of the first component, and \mu_1, \sigma_1 and \mu_2, \sigma_2 are the means and standard deviations of the two Gaussian components.
The expectation step computes the expected value of the latent variable f given the observed data:
\hat{\gamma}_i = \frac{\hat{\pi}\, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1)}{\hat{\pi}\, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1) + (1 - \hat{\pi})\, \phi(y_i; \hat{\mu}_2, \hat{\sigma}_2)}
• This gives the probability that a given data point comes from Gaussian 1.
• Given the responsibilities computed in the E-step, the parameters (\mu_m, \sigma_m, \pi) are updated by maximizing the expected log-likelihood:
(M-step 1) \hat{\mu}_1 = \sum_i \hat{\gamma}_i y_i / \sum_i \hat{\gamma}_i
(M-step 2) \hat{\mu}_2 = \sum_i (1 - \hat{\gamma}_i) y_i / \sum_i (1 - \hat{\gamma}_i)
(M-step 3) \hat{\sigma}_1^2 = \sum_i \hat{\gamma}_i (y_i - \hat{\mu}_1)^2 / \sum_i \hat{\gamma}_i
Algorithm:
• Initialisation
  – set \hat{\mu}_1 and \hat{\mu}_2 to be randomly chosen values from the dataset
  – set \hat{\sigma}_1^2 = \hat{\sigma}_2^2 = \sum_{i=1}^{N}(y_i - \bar{y})^2 / N (where \bar{y} is the mean of the entire dataset)
  – set \hat{\pi} = 0.5
• Repeat until convergence:
  – (E-step) \hat{\gamma}_i = \frac{\hat{\pi}\, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1)}{\hat{\pi}\, \phi(y_i; \hat{\mu}_1, \hat{\sigma}_1) + (1 - \hat{\pi})\, \phi(y_i; \hat{\mu}_2, \hat{\sigma}_2)}
  – (M-step 1) \hat{\mu}_1 = \sum_i \hat{\gamma}_i y_i / \sum_i \hat{\gamma}_i
  – (M-step 2) \hat{\mu}_2 = \sum_i (1 - \hat{\gamma}_i) y_i / \sum_i (1 - \hat{\gamma}_i)
  – (M-step 3) \hat{\sigma}_1^2 = \sum_i \hat{\gamma}_i (y_i - \hat{\mu}_1)^2 / \sum_i \hat{\gamma}_i
  – (M-step 4) \hat{\sigma}_2^2 = \sum_i (1 - \hat{\gamma}_i)(y_i - \hat{\mu}_2)^2 / \sum_i (1 - \hat{\gamma}_i)
  – (M-step 5) \hat{\pi} = \sum_i \hat{\gamma}_i / N
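A compact NumPy sketch of this EM loop for a one-dimensional, two-component mixture (the initialisation, the fixed number of iterations, and the variable names are simplifications for illustration):

import numpy as np

def normal_pdf(y, mu, sigma):
    return np.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def em_two_gaussians(y, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu1, mu2 = rng.choice(y, size=2, replace=False)   # means start at random data points
    s1 = s2 = np.sqrt(np.var(y))                      # both spreads start at the overall spread
    pi = 0.5                                          # mixing weight of component 1
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point
        p1 = pi * normal_pdf(y, mu1, s1)
        p2 = (1 - pi) * normal_pdf(y, mu2, s2)
        gamma = p1 / (p1 + p2)
        # M-step: re-estimate the means, standard deviations and mixing weight
        mu1 = (gamma * y).sum() / gamma.sum()
        mu2 = ((1 - gamma) * y).sum() / (1 - gamma).sum()
        s1 = np.sqrt((gamma * (y - mu1) ** 2).sum() / gamma.sum())
        s2 = np.sqrt(((1 - gamma) * (y - mu2) ** 2).sum() / (1 - gamma).sum())
        pi = gamma.mean()
    return mu1, s1, mu2, s2, pi

# two made-up clusters of one-dimensional points
rng = np.random.default_rng(42)
y = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(5.0, 1.0, 200)])
print(em_two_gaussians(y))   # recovers means near 0 and 5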