
R22 Machine Learning Lecture Notes

UNIT-II

Learning with Trees: Decision Trees, Constructing Decision Trees, Classification and
Regression Trees.
Ensemble Learning: Boosting, Bagging, Different ways to combine classifiers, Basic Statistics,
Gaussian Mixture Models, Nearest Neighbour Methods.
Unsupervised Learning: K Means Algorithm.

Decision Trees:

• Decision Tree is a supervised learning technique that can be used for both classification and regression problems, but it is mostly preferred for solving classification problems.
• It is a tree-structured classifier, where internal nodes represent the features of a dataset, branches represent the decision rules, and each leaf node represents the outcome.
• A decision tree contains two types of nodes: decision nodes and leaf nodes.
• Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches.
• The decisions or tests are performed on the basis of the features of the given dataset.
• It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.
• It is called a decision tree because, similar to a tree, it starts with the root node, which expands into further branches and constructs a tree-like structure.

[Figure: the general structure of a decision tree — a root decision node branching into sub-trees, further decision nodes, and leaf nodes.]



Example:

[FIGURE 12.1: A simple decision tree to decide how you will spend the evening — root node 'Party?'; if yes, go to party; otherwise check the deadline: urgent → study, none → go to pub, near → check laziness (lazy → watch TV, not lazy → study).]

• One of the reasons that decision trees are popular is that we can turn them into a set of logical disjunctions (if ... then rules) that then go into program code very simply.
  Ex: if there is a party then go to it
      if there is not a party and you have an urgent deadline then study
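To make this concrete, the tree of Figure 12.1 can be written directly as nested if/then rules. The sketch below is only an illustration; the function name and the feature encodings (party, deadline, lazy) are my own choices, not part of any library.

def choose_activity(party: bool, deadline: str, lazy: bool) -> str:
    """Hand-coded version of the decision tree in Figure 12.1."""
    if party:                      # root node: Party?
        return "go to party"
    if deadline == "urgent":       # second level: Deadline?
        return "study"
    if deadline == "none":
        return "go to pub"
    # deadline == "near": the decision depends on laziness
    return "watch TV" if lazy else "study"

print(choose_activity(party=False, deadline="near", lazy=True))  # watch TV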
Constructing Decision Trees:
Types of Decision Tree Algorithms:
• ID3: This algorithm measures how mixed up the data is at a node using something called entropy. It then chooses the feature that helps to clarify the data the most.
• C4.5: This is an improved version of ID3 that can handle missing data and continuous attributes.
• CART: This algorithm uses a different measure called Gini impurity to decide how to split the data. It can be used for both classification (sorting data into categories) and regression (predicting continuous values) tasks.

ID3 Algorithm:

Entropy in Information Theory:

• Entropy measures the amount of impurity in a set of features.


• The entropy H of a set of probabilities p_i is:

	Entropy(p) = −Σ_i p_i log₂ p_i

• where the logarithm is base 2 because we are imagining that we encode everything using binary digits (bits), and we define 0 log 0 = 0.
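The formula is short enough to check in a few lines of Python. This helper is only an illustration of the definition above (the function name is my own):

import math
from collections import Counter

def entropy(labels):
    """Entropy in bits of a list of class labels, with 0*log2(0) taken as 0."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values() if c > 0)

# A 50/50 split gives the maximum entropy of 1 bit; a pure set gives 0 bits.
print(entropy(["+", "+", "-", "-"]))   # 1.0
print(entropy(["+", "+", "+", "+"]))   # -0.0, i.e. zero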
R22 Machine Learning Lecture Notes

[FIGURE 12.2 (x-axis: proportion of positive examples): A graph of entropy, detailing how much information is available from finding out another piece of information given what you already know.]
• If all of the examples are positive, then we don't get any extra information from knowing the value of the feature for any particular example, since whatever the value of the feature, the example will be positive. Thus, the entropy of that feature is 0.
• However, if the feature separates the examples into 50% positive and 50% negative, then the amount of entropy is at a maximum, and knowing about that feature is very useful to us.
• For our decision tree, the best feature to pick as the one to classify on now is the one that gives us the most information, i.e., the one with the highest entropy.
Information Gain:
• It is defined as the entropy of the whole set minus the entropy when a particular feature is chosen:

	Gain(S, F) = Entropy(S) − Σ_{f ∈ values(F)} (|S_f| / |S|) · Entropy(S_f)		(12.2)

• The ID3 algorithm computes this information gain for each feature and chooses the one that produces the highest value.
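Building on the entropy helper sketched earlier, Equation (12.2) can be written as a short function. This is an illustrative sketch under my own data layout (examples as a list of dicts), not code from the notes:

def information_gain(examples, labels, feature):
    """Gain(S, F) = Entropy(S) - sum over f of |S_f|/|S| * Entropy(S_f).

    `examples` is a list of dicts mapping feature names to values,
    `labels` the corresponding class labels, `feature` the feature name.
    """
    total = len(examples)
    gain = entropy(labels)
    for value in set(ex[feature] for ex in examples):
        subset = [lab for ex, lab in zip(examples, labels) if ex[feature] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain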

The ID3 Algorithm
• If all examples have the same label:
  – return a leaf with that label
• Else if there are no features left to test:
  – return a leaf with the most common label
• Else:
  – choose the feature F that maximises the information gain of S to be the next node, using Equation (12.2)
  – add a branch from the node for each possible value f of F
  – for each branch:
    * calculate S_f by removing F from the set of features
    * recursively call the algorithm with S_f to compute the gain relative to the current set of examples
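The pseudocode above translates into a compact recursive implementation. The sketch below is one possible reading of it, reusing the entropy and information_gain helpers from earlier and returning the tree as nested dictionaries; it is illustrative, not the reference ID3 code.

from collections import Counter

def id3(examples, labels, features):
    """Recursively build a decision tree as nested dicts {feature: {value: subtree}}."""
    if len(set(labels)) == 1:                 # all examples share one label
        return labels[0]
    if not features:                          # no features left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(examples, labels, f))
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        sub_examples = [examples[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        remaining = [f for f in features if f != best]
        tree[best][value] = id3(sub_examples, sub_labels, remaining)
    return tree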

C4.5 Algorithm:

• It is an improved version of ID3.
• Pruning is another method that can help us avoid overfitting.
• It helps in improving the performance of the decision tree by cutting the nodes or sub-nodes which are not significant.
• Additionally, it removes the branches which have very low importance.
• There are mainly two ways of pruning:
• Pre-pruning – we can stop growing the tree earlier, which means we can prune/remove/cut a node if it has low importance while growing the tree.
• Post-pruning – once our tree is built to its full depth, we can start pruning the nodes based on their significance.
• C4.5 uses a different method called rule post-pruning.
• This consists of taking the tree generated by ID3, converting it to a set of if-then rules, and then pruning each rule by removing preconditions if the accuracy of the rule increases without them.
• The rules are then sorted according to their accuracy on the training set and applied in order.
• The advantages of dealing with rules are that they are easier to read and that their order in the tree does not matter, just their accuracy in the classification.
• For continuous variables, the simplest solution is to discretise the continuous variable.
• The computational complexity of building a decision tree is O(d n log n), where n is the number of data points and d is the number of dimensions.

Classification Example: construct the decision tree to decide what to do in the evening

Deadline? | Is there a party? | Lazy? | Activity
----------|-------------------|-------|---------
Urgent    | Yes               | Yes   | Party
Urgent    | No                | Yes   | Study
Near      | Yes               | Yes   | Party
None      | Yes               | No    | Party
None      | No                | Yes   | Pub
None      | Yes               | No    | Party
Near      | No                | No    | Study
Near      | No                | Yes   | TV
Near      | Yes               | Yes   | Party
Urgent    | No                | No    | Study

We start by asking which feature should be selected as the root node.


Compute the entropy of S:

	Entropy(S) = −p_party log₂ p_party − p_study log₂ p_study − p_pub log₂ p_pub − p_TV log₂ p_TV
	           = −(5/10) log₂(5/10) − (3/10) log₂(3/10) − (1/10) log₂(1/10) − (1/10) log₂(1/10)
	           = 0.5 + 0.5211 + 0.3322 + 0.3322 = 1.6855		(12.11)

Next, find which feature has the maximal information gain:

	Gain(S, Deadline) = 1.6855 − (3/10)·Entropy(S_urgent) − (4/10)·Entropy(S_near) − (3/10)·Entropy(S_none)
	                  = 1.6855 − (3/10)·(−(1/3) log₂(1/3) − (2/3) log₂(2/3))
	                           − (4/10)·(−(2/4) log₂(2/4) − (1/4) log₂(1/4) − (1/4) log₂(1/4))
	                           − (3/10)·(−(2/3) log₂(2/3) − (1/3) log₂(1/3))
	                  = 1.6855 − 0.2755 − 0.6 − 0.2755
	                  = 0.5345		(12.12)

	Gain(S, Party) = 1.6855 − (5/10)·Entropy(S_yes) − (5/10)·Entropy(S_no)
	               = 1.6855 − (5/10)·0 − (5/10)·(−(3/5) log₂(3/5) − (1/5) log₂(1/5) − (1/5) log₂(1/5))
	               = 1.6855 − 0.6855 = 1.0		(12.13)

	Gain(S, Lazy) = 1.6855 − (6/10)·(−(3/6) log₂(3/6) − (1/6) log₂(1/6) − (1/6) log₂(1/6) − (1/6) log₂(1/6))
	                       − (4/10)·(−(2/4) log₂(2/4) − (2/4) log₂(2/4))
	              = 1.6855 − 1.0755 − 0.4 = 0.21		(12.14)
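Using the information_gain helper sketched earlier, these three numbers can be checked directly on the table above (the feature and label strings below are just the ones used in this example):

data = [
    ({"deadline": "urgent", "party": "yes", "lazy": "yes"}, "party"),
    ({"deadline": "urgent", "party": "no",  "lazy": "yes"}, "study"),
    ({"deadline": "near",   "party": "yes", "lazy": "yes"}, "party"),
    ({"deadline": "none",   "party": "yes", "lazy": "no"},  "party"),
    ({"deadline": "none",   "party": "no",  "lazy": "yes"}, "pub"),
    ({"deadline": "none",   "party": "yes", "lazy": "no"},  "party"),
    ({"deadline": "near",   "party": "no",  "lazy": "no"},  "study"),
    ({"deadline": "near",   "party": "no",  "lazy": "yes"}, "tv"),
    ({"deadline": "near",   "party": "yes", "lazy": "yes"}, "party"),
    ({"deadline": "urgent", "party": "no",  "lazy": "no"},  "study"),
]
examples, labels = zip(*data)
for f in ("deadline", "party", "lazy"):
    print(f, round(information_gain(list(examples), list(labels), f), 4))
# deadline 0.5345, party 1.0, lazy 0.21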

Therefore, the root node will be the party feature, which has two feature values (‘yes’
and ‘no’), so it will have two branches coming out of it.

[FIGURE 12.6: The decision tree after one step of the algorithm — the root node 'Party?' with branches for 'Yes' and 'No'.]

When we look at the 'yes' branch, we see that in all five cases where there was a party we went to it, so we just put a leaf node there, saying 'party'.
For the 'no' branch, out of the five cases there are three different outcomes, so now we need to choose another feature.
The five cases we are looking at are:

Deadline? | Is there a party? | Lazy? | Activity
----------|-------------------|-------|---------
Urgent    | No                | Yes   | Study
None      | No                | Yes   | Pub
Near      | No                | No    | Study
Near      | No                | Yes   | TV
Urgent    | No                | No    | Study
We've used the party feature, so we just need to calculate the information gain of the other two over these five examples:

	Gain(S, Deadline) = 1.371 − (2/5)·Entropy(S_urgent) − (2/5)·Entropy(S_near) − (1/5)·Entropy(S_none)
	                  = 1.371 − (2/5)·0 − (2/5)·(−(1/2) log₂(1/2) − (1/2) log₂(1/2)) − (1/5)·0
	                  = 1.371 − 0 − 0.4 − 0 = 0.971		(12.15)

	Gain(S, Lazy) = 1.371 − (4/5)·(−(2/4) log₂(2/4) − (1/4) log₂(1/4) − (1/4) log₂(1/4)) − (1/5)·0
	              = 1.371 − 1.2 − 0 = 0.171		(12.16)

Here, the Deadline feature has the maximum information gain, so we select the Deadline feature to split the data on.

[FIGURE 12.7: The tree after another step — the 'Yes' branch of 'Party?' leads to 'Go to party', while the 'No' branch now splits on 'Deadline?' (urgent, near, none).]

Finally, we will get the following decision tree.

[Final decision tree: 'Party?' — yes → go to party; no → 'Deadline?' — urgent → study, none → go to pub, near → 'Lazy?' — yes → watch TV, no → study.]

Classification and Regression Trees (CART):

CART is another well-known tree-based algorithm; as its name indicates, it can be used for both classification and regression.

Gini Impurity:

It is the probability of misclassifying a randomly chosen element in a set.


The ‘impurity’ in the name suggests that the aim of the decision tree is to have each
leaf node represent a set of data points that are in the same class, so that there are no
mismatches. This is known as purity.
If a leaf is pure then all of the training data within it have just one class.
Consider a dataset D that contains samples from k classes.
The probability of a sample belonging to class i at a given node can be denoted as p_i.
Then the Gini impurity of D is defined as:

	Gini(D) = 1 − Σ_{i=1}^{k} p_i²

The node with uniform class distribution has the highest impurity.
The minimum impurity is obtained when all records belong to the same class.

       | Count     | Probability | Gini Impurity
       | n1  | n2  | p1   | p2   | 1 − p1² − p2²
Node A | 0   | 10  | 0    | 1    | 1 − 0² − 1² = 0
Node B | 3   | 7   | 0.3  | 0.7  | 1 − 0.3² − 0.7² = 0.42
Node C | 5   | 5   | 0.5  | 0.5  | 1 − 0.5² − 0.5² = 0.5

An attribute with the smallest Gini Impurity is selected for splitting the node.
[Figure: Gini impurity plotted against the fraction of class k (p_k); it is 0 at p_k = 0 and p_k = 1 and reaches its maximum of 0.5 at p_k = 0.5.]
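A few lines of Python reproduce the Gini values in the table above; this is just an illustrative helper, not library code.

def gini(counts):
    """Gini impurity 1 - sum(p_i^2) for a list of per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini([0, 10]))  # Node A: 0.0  (pure node)
print(gini([3, 7]))   # Node B: 0.42
print(gini([5, 5]))   # Node C: 0.5  (maximum for two classes)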

Regression in Trees:

A Regression tree is an algorithm where the target variable is continuous and the tree
is used to predict its value.

[Figure: a regression tree — internal nodes hold split conditions and leaf nodes hold predicted values.]

A regression tree works by splitting the training data recursively into smaller subsets based on specific criteria.
The objective is to split the data in a way that minimizes the residual (sum of squared error) within each subset, i.e., maximizes the residual reduction; see the sketch after this paragraph.
Residual Reduction – Residual reduction is a measure of how much the average squared difference between the predicted values and the actual values for the target variable is reduced by splitting the subset. The greater the residual reduction, the better the model fits the data.
Splitting Criteria – CART evaluates every possible split at each node and selects the one that results in the greatest reduction of residual error in the resulting subsets. This process is repeated until a stopping criterion is met, such as reaching the maximum tree depth or having too few instances in a leaf node.
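As a concrete illustration of the splitting criterion, the sketch below scans candidate thresholds for a single numeric feature and picks the one that maximizes the reduction in sum of squared errors. The function names, data, and structure are my own assumptions, not the CART reference implementation.

import numpy as np

def sse(y):
    """Sum of squared errors around the mean prediction of a leaf."""
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Return (threshold, SSE reduction) of the best split x <= t for one feature."""
    parent = sse(y)
    best_t, best_gain = None, 0.0
    for t in np.unique(x)[:-1]:                      # candidate thresholds
        left, right = y[x <= t], y[x > t]
        gain = parent - (sse(left) + sse(right))     # residual reduction of this split
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.8, 5.0])
print(best_split(x, y))   # (3.0, 24.0): splitting at x <= 3 removes almost all the squared error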

Ensemble Learning:

• Ensemble learning refers to the approach of combining multiple ML models to produce a more accurate and robust prediction compared to any individual model.
• The conventional ensemble methods include bagging, boosting, and stacking-based methods.

Ensemble Methods

Boosting:
Boosting is an ensemble technique that combines multiple weak learners to create a
strong learner.

[Figure: boosting — weak classifiers are trained sequentially on reweighted samples (BS1, BS2, ...), with each classifier focusing on the examples the previous one got wrong.]


• The ensemble of weak models is trained in series such that each model that comes next tries to correct the errors of the previous model, until the entire training dataset is predicted correctly.
• One of the most well-known boosting algorithms is AdaBoost (Adaptive Boosting).

AdaBoost:

• AdaBoost, short for Adaptive Boosting, is an ensemble learning technique used in machine learning for classification and regression problems.
• The main idea behind AdaBoost is to iteratively train weak classifiers on the training dataset, with each successive classifier giving more weight to the data points that were misclassified.
• The final AdaBoost model is obtained by combining all the weak classifiers used during training, with weights assigned to the models according to their accuracies.
• The model with the highest accuracy is given the highest weight, while the model with the lowest accuracy is given a lower weight.

Steps in AdaBoost:

1. Weight Initialization

At the start, every instance is assigned an identical weight. These weights determine the importance of each example.

2. Model Training

A weak learner is trained on the dataset, with the aim of minimizing the weighted classification error.

3. Weighted Error Calculation

The weighted error is then calculated by summing up the weights of the misclassified instances. This step emphasizes the importance of the samples that are hard to classify.

4. Model Weight Calculation

The weight of the weak learner is calculated based on its performance in classifying the training data. Models that perform well are assigned higher weights, indicating that they are more reliable.

5. Update Instance Weights

The instance weights are updated to give more weight to the samples misclassified in the previous step.

6. Repeat

Steps 2 through 5 are repeated for a predefined number of iterations or until a specified performance threshold is met.


7. Final Model Creation

The final strong model (also referred to as the ensemble) is created by combining the weighted outputs of all the weak learners.

8. Classification

To make predictions on new data, AdaBoost uses the final ensemble model.

Algorithm:

AdaBoost Algorithm

• Initialise all weights to 1/N, where N is the number of datapoints
• While 0 < ε_t < 1/2 (and t < T, some maximum number of iterations):
  – train a classifier on {x, w^(t)}, getting hypotheses h_t(x_n) for the datapoints x_n
  – compute the training error ε_t = Σ_n w_n^(t) I(y_n ≠ h_t(x_n))
  – set α_t = log((1 − ε_t) / ε_t)
  – update the weights using:
		w_n^(t+1) = w_n^(t) exp(α_t I(y_n ≠ h_t(x_n))) / Z_t		(13.1)
    where Z_t is a normalisation constant
• Output f(x) = sign(Σ_t α_t h_t(x))
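The sketch below is one possible from-scratch reading of this algorithm, using depth-1 decision trees ("stumps") from scikit-learn as the weak learners; variable names are my own, and scikit-learn also ships a ready-made AdaBoostClassifier if a sketch is not needed.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_train(X, y, T=20):
    """y must be labelled +1/-1. Returns a list of (alpha, stump) pairs."""
    N = len(y)
    w = np.full(N, 1.0 / N)                 # initialise all weights to 1/N
    ensemble = []
    for _ in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        miss = (pred != y).astype(float)    # indicator I(y_n != h_t(x_n))
        eps = np.sum(w * miss)              # weighted training error
        if eps <= 0 or eps >= 0.5:          # stop if perfect or no better than chance
            break
        alpha = np.log((1 - eps) / eps)
        w = w * np.exp(alpha * miss)        # boost the weights of misclassified points
        w /= w.sum()                        # normalise (the Z_t constant)
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    scores = sum(alpha * stump.predict(X) for alpha, stump in ensemble)
    return np.sign(scores)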


Bagging:

• Bagging (bootstrap aggregating) is a supervised learning technique that can be used for both regression and classification tasks.

[Figure: bagging — bootstrap samples (BS1, BS2, ...) are drawn from the training data, a separate classifier is trained on each, and their predictions are combined.]

• Here is an overview of the steps in the Bagging classifier algorithm:
• Bootstrap Sampling: ‘N’ subsets are drawn from the original training data by randomly sampling rows with replacement. This step ensures that the base models are trained on diverse subsets of the data.
• Base Model Training: For each bootstrapped sample, train a base model independently on that subset of data. These weak models are trained in parallel to increase computational efficiency and reduce time consumption.


• Prediction Aggregation: To make a prediction on test data, combine the predictions of all base models. For classification tasks this can be majority voting or weighted voting, while for regression it involves averaging the predictions.
• Out-of-Bag (OOB) Evaluation: Some samples are excluded from the training subset of particular base models during the bootstrapping process. These “out-of-bag” samples can be used to estimate the model’s performance without the need for cross-validation, as shown in the sketch below.
• Final Prediction: After aggregating the predictions from all the base models, bagging produces a final prediction for each instance.
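If scikit-learn is available, these steps are wrapped up in BaggingClassifier. A minimal sketch with arbitrarily chosen data and parameters (by default the base model is a decision tree; a different one can be passed via the estimator argument, called base_estimator in older scikit-learn versions):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
bag = BaggingClassifier(
    n_estimators=50,     # number of bootstrap samples / base models
    oob_score=True,      # use out-of-bag samples to estimate accuracy
    random_state=0,
)
bag.fit(X, y)
print("OOB accuracy estimate:", bag.oob_score_)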

Random Forest:

• The idea is largely that if one tree is good, then many trees (a forest) should be better, provided that there is enough variety between them.
• It works by creating a number of decision trees during the training phase.
• Each tree is constructed from a random subset of the data set, and a random subset of the features is considered at each split.
• This randomness introduces variability among individual trees, reducing the risk of overfitting and improving overall prediction performance.
• In prediction, the algorithm aggregates the results of all trees, either by voting (for classification tasks) or by averaging (for regression tasks).

[Figure: the Random Forest algorithm in machine learning — several decision trees are trained on different samples and their class predictions are combined by majority voting.]

The Basic Random Forest Training Algorithm

• For each of N trees:
  – create a new bootstrap sample of the training set
  – use this bootstrap sample to train a decision tree
  – at each node of the decision tree, randomly select m features, and compute the information gain (or Gini impurity) only on that set of features, selecting the optimal one
  – repeat until the tree is complete
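In practice this algorithm is available off the shelf in scikit-learn; a short usage sketch with arbitrarily chosen data and parameters:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(
    n_estimators=100,       # number of trees (N in the algorithm above)
    max_features="sqrt",    # m features considered at each split
    random_state=0,
)
forest.fit(X, y)
print(forest.predict(X[:5]))   # class predictions by majority vote over the trees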


Stacking:

• Stacking combines many base learners (which may themselves be ensemble methods) in order to build a meta-learner.
• Stacking has two levels of learning: 1) base learning and 2) meta-learning.
• In the first one, the base learners are trained with the training data set.
• Once trained, the base learners create a new data set for a meta-learner.
• The meta-learner is then trained with that new training data set.
• Finally, the trained meta-learner is used to classify new instances.

[Figure: stacking — the predictions of the base learners form a new dataset that is fed to the meta-learner.]
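scikit-learn provides this two-level scheme directly as StackingClassifier; a minimal sketch with arbitrarily chosen base learners and meta-learner:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)), ("svm", SVC())],
    final_estimator=LogisticRegression(),   # the meta-learner
)
stack.fit(X, y)          # base learners are trained first, then the meta-learner
print(stack.predict(X[:5]))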

Different ways to combine classifiers:

If the number of classifiers is odd and the classifiers are each independent of each
other, then majority voting will return the correct label if more than half of the
classifiers agree.
Assuming that each individual classifier has a success rate of p, the probability of the ensemble of T classifiers getting the correct answer is given by the binomial distribution:

	Σ_{k = T/2 + 1}^{T} (T choose k) p^k (1 − p)^(T − k)
For regression problems, rather than taking the majority vote, it is common to take the mean of the outputs.
However, the mean is heavily affected by outliers, with the result that the median is a more common average to use.
It is the use of the median that produces the 'bragging' algorithm, a name which is meant to imply 'robust bagging'.
There is another algorithm for combining classifiers, known as the mixture of experts.

Mixture of Experts
Mixture of experts (MoE) is a machine learning technique that uses multiple specialized models to solve a problem. MoE is a type of ensemble learning that combines predictions from multiple models to improve accuracy.
Working:
1. An input is evaluated by a gating network that determines which experts to activate.
2. The selected experts are assigned weights and their outputs are combined to produce a final result.


[Figure: a mixture of experts — each expert network receives the inputs, and a gating network weights and combines the experts' outputs.]

Algorithm:
The Mixture of Experts Algorithm

• For each expert:
  – calculate the probability of the input belonging to each possible class by computing (where the w_i are the weights for that classifier):

		o_i(x, w_i) = 1 / (1 + exp(−w_i · x))		(13.6)

• For each gating network up the tree:
  – compute the gating weights g_i (Equation 13.7)
• Pass as input to the next-level gates (where the sum is over the relevant inputs to that gate):

		Σ_k g_k o_k		(13.8)
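The sketch below shows one forward pass through a single-level mixture of experts in NumPy. The softmax gating function and all the names here are my own assumptions for illustration; the equations (13.6) and (13.8) above correspond to the logistic experts and the weighted combination of their outputs.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max()          # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x, expert_weights, gate_weights):
    """x: input vector; expert_weights, gate_weights: arrays of shape (n_experts, d)."""
    expert_outputs = sigmoid(expert_weights @ x)   # one o_i per expert, cf. (13.6)
    gates = softmax(gate_weights @ x)              # gating network (softmax assumed)
    return np.sum(gates * expert_outputs)          # weighted combination, cf. (13.8)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
print(moe_forward(x, rng.normal(size=(3, 4)), rng.normal(size=(3, 4))))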

Basic Statistics:

Mean:

e The "mean" is the average value of a dataset.


e It is calculated by adding up all the values in the dataset and dividing by the number
of observations.
e The mean is a useful measure of central tendency because it is sensitive to outliers,
meaning that extreme values can significantly affect the value of the mean.

Median:

e The "median" is the middle value in a dataset.


• It is calculated by arranging the values in the dataset in order and finding the value that lies in the middle.
• If there are an even number of values in the dataset, the median is the average of the two middle values.
• The median is a useful measure of central tendency because it is not affected by outliers, meaning that extreme values do not significantly affect the value of the median.

e The "mode" is the most common value in a dataset.


e Tt is calculated by finding the value that occurs most frequently in the dataset.
e Ifthere are multiple values that occur with the same frequency, the dataset is said to be
bimodal, trimodal, or multimodal.
e The mode is a useful measure of central tendency because it can identify the most
common value in a dataset.
e However, it is not a good measure of central tendency for datasets with a wide range of
values or datasets with no repeating values.

Variance:

• Variance is a measure of how much the data for a variable varies from its mean.

	Variance (s²) = Σ_i (x_i − x̄)² / (N − 1)

  where:
  • x_i is the i-th observation,
  • x̄ is the mean, and
  • N is the number of observations.

Covariance:

• Covariance is a measure of the relationship between two variables that is scale dependent, i.e. how much one variable will change when another variable changes.

	Covariance (x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / (N − 1)

  where:
  • x_i is the i-th observation of variable x,
  • x̄ is the mean of variable x,
  • y_i is the i-th observation of variable y,
  • ȳ is the mean of variable y, and
  • N is the number of observations.

Standard Deviation:

• The square root of the variance is known as the standard deviation.
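These summary statistics are easy to compute with NumPy and the standard library; a small sketch on made-up numbers (note that np.var, np.std and np.cov use the N − 1 denominator only when ddof=1 is passed):

import numpy as np
from statistics import mode

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 6.0, 8.0])

print("mean    :", x.mean())
print("median  :", np.median(x))
print("mode    :", mode(x))                    # most frequent value
print("variance:", x.var(ddof=1))              # N-1 denominator
print("std dev :", x.std(ddof=1))              # square root of the variance
print("cov(x,y):", np.cov(x, y, ddof=1)[0, 1])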

The Gaussian / Normal Distribution:

• The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetric about the mean, depicting that data near the mean are more frequent in occurrence than data far from the mean.


	f(x) = (1 / (σ√(2π))) · e^(−(x − μ)² / (2σ²))

	where:
	μ = mean of x
	σ = standard deviation of x
	π ≈ 3.14159...
	e ≈ 2.71828...

FIGURE 2.14 Plot of the one-dimensional Gaussian curve.
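The density above can be evaluated directly, or (assuming SciPy is available) via scipy.stats.norm; a quick sanity check with arbitrary values:

import numpy as np
from scipy.stats import norm

mu, sigma, x = 0.0, 1.0, 0.5
manual = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
print(manual, norm.pdf(x, loc=mu, scale=sigma))   # both are approximately 0.3521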

The Bias and Variance Trade-off:

• Bias is the difference between the average prediction of our model and the correct value which we are trying to predict.
• A model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.
• Variance is the variability of model prediction for a given data point, i.e. a value which tells us the spread of our predictions.
• A model with high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before.
• As a result, such models perform very well on training data but have high error rates on test data.
• If our model is too simple and has very few parameters, then it may have high bias and low variance.
• On the other hand, if our model has a large number of parameters, then it is going to have high variance and low bias.
• So we need to find the right balance, without overfitting or underfitting the data.


[Figure: the bias-variance trade-off — as model complexity increases, squared bias decreases while variance increases; the generalization (test) error is their sum plus the irreducible error, and is lowest between the underfitting and overfitting zones.]

Gaussian Mixture Models (GMMs)


Clustering is a foundational technique in machine learning, used to group data into distinct categories
based on patterns or similarities. Among the many clustering methods, Gaussian Mixture Models
(GMMs) stand out for their probabilistic approach to clustering.
A Gaussian mixture model is a soft clustering technique used in unsupervised learning to determine the probability that a given data point belongs to a cluster. It is composed of several Gaussians, each identified by m ∈ {1, ..., M}, where M is the number of clusters in a data set.
Each Gaussian m in the mixture is described by the following parameters:
• A mean μ that defines its center.
• A covariance Σ that defines its width. This would be equivalent to the dimensions of an ellipsoid in a multivariate scenario.
• A mixing probability π that defines how big or small the Gaussian function will be.

Let’s illustrate these parameters graphically:


[Figure: three Gaussian clusters (Cluster 1, Cluster 2, Cluster 3), each with its own mean, covariance, and mixing probability.]

Mathematical Function:
The overall distribution is a weighted sum of multiple Gaussian distributions.
The probability density function for a GMM is:

	f(x) = Σ_{m=1}^{M} α_m · φ(x; μ_m, Σ_m)

Here φ(x; μ_m, Σ_m) is a Gaussian density with mean μ_m and covariance matrix Σ_m, and the α_m are the weights, with the constraint that Σ_{m=1}^{M} α_m = 1.
The probability that a given data point x_i belongs to Gaussian component m is:

	P(x_i ∈ C_m) = α_m φ(x_i; μ_m, Σ_m) / Σ_{k=1}^{M} α_k φ(x_i; μ_k, Σ_k)

The challenge is determining the weights α_m, which is done by maximizing the likelihood function.
The likelihood function is the probability of the observed data given the model parameters. To simplify computations, we take the log-likelihood because:
• Probabilities are small, leading to numerical stability issues.


• The logarithm turns the product of probabilities into a sum, making optimization easier.

The Expectation-Maximization (EM) algorithm is used to maximize this log-likelihood iteratively.


Step 1: Introduce Latent Variables

Since we do not know which Gaussian a given data point belongs to, we introduce a latent variable f that indicates the Gaussian component:

• f = 1 means the data point comes from Gaussian 1.
• f = 0 means the data point comes from Gaussian 2.
• The probability of observing a data point y is then given by:

	P(y) = π φ(y; μ₁, σ₁) + (1 − π) φ(y; μ₂, σ₂)

where:
• π is the mixing probability for Gaussian 1.
• (1 − π) is the mixing probability for Gaussian 2.

Step 2: Expectation Step (E-Step)

The expectation step computes the expected value of the latent variable f given the observed data.

The responsibility of Gaussian 1 for a data point y_i is:

	γ_i = π φ(y_i; μ̂₁, σ̂₁) / (π φ(y_i; μ̂₁, σ̂₁) + (1 − π) φ(y_i; μ̂₂, σ̂₂))

• This gives the probability that the given data point comes from Gaussian 1.
• The expectation of f_i is exactly this responsibility γ_i.

Step 3: Maximization Step (M-Step)

Given the responsibilities computed in the E-step, the parameters (μ_m, σ_m, π) are updated by maximizing the expected log-likelihood:

	(M-step 1) μ̂₁ = Σ_i γ_i y_i / Σ_i γ_i
	(M-step 2) μ̂₂ = Σ_i (1 − γ_i) y_i / Σ_i (1 − γ_i)
	(M-step 3) σ̂₁ = Σ_i γ_i (y_i − μ̂₁)² / Σ_i γ_i

Algorithm:


The Gaussian Mixture Model EM Algorithm

• Initialisation
  – set μ̂₁ and μ̂₂ to be randomly chosen values from the dataset
  – set σ̂₁ = σ̂₂ = Σ_{i=1}^{N} (y_i − ȳ)² / N (where ȳ is the mean of the entire dataset)
• Repeat until convergence:
  – (E-step) γ_i = π̂ φ(y_i; μ̂₁, σ̂₁) / (π̂ φ(y_i; μ̂₁, σ̂₁) + (1 − π̂) φ(y_i; μ̂₂, σ̂₂)) for i = 1 ... N
  – (M-step 1) μ̂₁ = Σ_i γ_i y_i / Σ_i γ_i
  – (M-step 2) μ̂₂ = Σ_i (1 − γ_i) y_i / Σ_i (1 − γ_i)
  – (M-step 3) σ̂₁ = Σ_i γ_i (y_i − μ̂₁)² / Σ_i γ_i
  – (M-step 4) σ̂₂ = Σ_i (1 − γ_i)(y_i − μ̂₂)² / Σ_i (1 − γ_i)
  – (M-step 5) π̂ = Σ_i γ_i / N
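A compact NumPy sketch of this two-Gaussian, one-dimensional EM loop, following the update equations above (here "var" plays the role of σ̂, interpreted as a variance, and π̂ is initialised to 0.5 as an extra assumption). It is an illustrative reading only; for the general multivariate case sklearn.mixture.GaussianMixture is the usual choice.

import numpy as np

def normal_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_two_gaussians(y, iters=100):
    rng = np.random.default_rng(0)
    mu1, mu2 = rng.choice(y, size=2, replace=False)   # random points from the dataset
    var1 = var2 = np.mean((y - y.mean()) ** 2)        # variance of the whole dataset
    pi = 0.5                                          # assumed initial mixing probability
    for _ in range(iters):
        # E-step: responsibility of Gaussian 1 for each point
        p1 = pi * normal_pdf(y, mu1, var1)
        p2 = (1 - pi) * normal_pdf(y, mu2, var2)
        gamma = p1 / (p1 + p2)
        # M-step: weighted parameter updates
        mu1 = np.sum(gamma * y) / np.sum(gamma)
        mu2 = np.sum((1 - gamma) * y) / np.sum(1 - gamma)
        var1 = np.sum(gamma * (y - mu1) ** 2) / np.sum(gamma)
        var2 = np.sum((1 - gamma) * (y - mu2) ** 2) / np.sum(1 - gamma)
        pi = gamma.mean()
    return mu1, mu2, var1, var2, pi

rng_data = np.random.default_rng(1)
y = np.concatenate([rng_data.normal(0, 1, 200), rng_data.normal(5, 1, 200)])
print(em_two_gaussians(y))   # means close to 0 and 5 (in either order)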

Expectation Maximization (EM)


Expectation Maximization (EM) estimates mixture models for a variety of applications, enhancing clustering and anomaly detection.
EM iteratively computes and maximizes the likelihood to improve model accuracy, ensuring reliable clustering results.
Expectation maximization is an iterative method. It starts with an initial parameter guess. The parameter values are used to compute the likelihood of the current model. This is the Expectation step. The parameter values are then recomputed to maximize the likelihood. This is the Maximization step. The new parameter estimates are used to compute a new expectation, and then they are optimized again to maximize the likelihood. This iterative process continues until the model converges.


The General Expectation-Maximisation (EM) Algorithm

• Initialisation
  – guess initial parameters θ̂^(0)
• Repeat until convergence:
  – (E-step) compute the expectation Q(θ′, θ̂^(t)), the expected complete-data log-likelihood of candidate parameters θ′ given the current estimate θ̂^(t)
  – (M-step) estimate the new parameters θ̂^(t+1) as max_{θ′} Q(θ′, θ̂^(t))
