
The EM algorithm

The goal is to estimate the parameter $\theta$ for a family $p(x, y; \theta)$ of distributions on a product space $S_1 \times S_2$. Assume that if we observed the full data $(X^{(n)}, Y^{(n)})$, $n = 1, \dots, N$, finding the maximum likelihood estimate would be a simple computation. However, we only get to observe an i.i.d. random sample $X^{(1)}, \dots, X^{(N)}$ (the $Y^{(n)}$'s remain hidden). The best we can do is maximize the marginal log-likelihood:

$$\max_{\theta}\, \ell(X^{(1)}, \dots, X^{(N)}; \theta) = \max_{\theta} \sum_{n=1}^{N} \log p_X(X^{(n)}; \theta). \tag{1}$$

We develop an iterative algorithm which can be used to find local maxima of this marginal
log-likelihood.

The Kullback-Leibler distance and entropy

The Kullback-Leibler distance between two distributions $q$ and $p$ on a set $S$ is defined as
$$K(q, p) = \sum_{y \in S} q(y) \log \frac{q(y)}{p(y)}.$$

Since the log function is concave (equivalently, $-\log$ is convex), Jensen's inequality gives
$$-K(q, p) = \sum_{y} q(y) \log \frac{p(y)}{q(y)} \le \log \sum_{y} q(y)\,\frac{p(y)}{q(y)} = \log \sum_{y} p(y) = 0.$$

Since $p$ and $q$ are distributions that sum to $1$, equality can only be achieved if $p(y) = q(y)$ for all $y$. The entropy of a distribution $q$, denoted $H(q)$, is given by
$$H(q) = -\sum_{y} q(y) \log q(y).$$
We return to the properties of this quantity later. Note that
$$K(q, p) = -H(q) - E_q \log p.$$
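
As a quick numerical check of these definitions (not part of the original notes; it assumes numpy, and the distributions $q$ and $p$ below are arbitrary illustrative values), the identity $K(q, p) = -H(q) - E_q \log p$ can be verified directly:

import numpy as np

# Two arbitrary distributions on a 4-point space (illustrative values only).
q = np.array([0.1, 0.2, 0.3, 0.4])
p = np.array([0.25, 0.25, 0.25, 0.25])

def entropy(q):
    # H(q) = -sum_y q(y) log q(y)
    return -np.sum(q * np.log(q))

def kl(q, p):
    # K(q, p) = sum_y q(y) log(q(y) / p(y))
    return np.sum(q * np.log(q / p))

lhs = kl(q, p)
rhs = -entropy(q) - np.sum(q * np.log(p))
print(lhs, rhs)     # the two values agree
print(kl(q, q))     # K(q, q) = 0, and K(q, p) >= 0 in general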

The Gibbs Variational formula

Let $\psi(y)$ be a function on a discrete state space $S$ and define the probability $p$ on $S$ as

$$p(y) = e^{\psi(y)}/Z = e^{\psi(y) - B}, \tag{2}$$
where $Z = \sum_{y \in S} e^{\psi(y)}$ is the normalizing constant and $B = \log Z$. Then
$$B = \max_{q}\,[H(q) + E_q \psi], \tag{3}$$
where the unique maximizer is $p$. This follows directly from the properties of the KL distance. For any $q$, since $\log p(y) = \psi(y) - B$,
$$0 \le K(q, p) = -H(q) - E_q \psi + B,$$
so that
$$B \ge H(q) + E_q \psi,$$
and equality is achieved only when $q = p$.
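
The following small numerical check (not in the notes; it assumes numpy, and the potential $\psi$ below is an arbitrary illustrative choice) confirms that $H(q) + E_q \psi$ never exceeds $B$ and that $q = p$ attains it:

import numpy as np

rng = np.random.default_rng(0)

# An arbitrary potential psi on a 5-point space (illustrative values only).
psi = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
Z = np.sum(np.exp(psi))
B = np.log(Z)
p = np.exp(psi - B)              # the Gibbs distribution of equation (2)

def gibbs_objective(q):
    # H(q) + E_q psi, the quantity maximized in equation (3)
    return -np.sum(q * np.log(q)) + np.sum(q * psi)

for _ in range(5):
    q = rng.dirichlet(np.ones(len(psi)))     # a random competitor distribution
    assert gibbs_objective(q) <= B + 1e-12
print(B, gibbs_objective(p))     # these agree: the maximizer is q = p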

A variational formula for the log-marginal

Consider a joint distribution $p(x, y)$ on a product space of two sets $S_1 \times S_2$ with $x \in S_1$ and $y \in S_2$. Write
$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x, y)}{p_X(x)} = \frac{e^{\log p(x,y)}}{Z},$$
where the normalizing constant $Z$ is given by the marginal $p_X(x)$. For fixed $x$, set $\psi(y) = \log p(x, y)$; then using the Gibbs variational formula we have

$$\log p_X(x) = \max_{q}\,[H(q) + E_q \log p(x, y)],$$
where $q$ runs over probability distributions on $S_2$. The maximum is achieved at $q(y) = p_{Y|X}(y|x)$.
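
This is worth checking numerically once (again not in the notes; numpy is assumed and the joint table below is randomly generated for illustration): plugging the conditional $q(y) = p_{Y|X}(y|x)$ into $H(q) + E_q \log p(x, y)$ recovers $\log p_X(x)$ exactly.

import numpy as np

rng = np.random.default_rng(1)

# A random joint distribution on a 3 x 4 product space (illustrative only).
joint = rng.random((3, 4))
joint /= joint.sum()

x = 0                            # condition on a fixed x
p_x = joint[x].sum()             # marginal p_X(x)
q = joint[x] / p_x               # conditional p_{Y|X}(. | x)

value = -np.sum(q * np.log(q)) + np.sum(q * np.log(joint[x]))
print(np.log(p_x), value)        # both equal log p_X(x)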

Derivation of the EM iteration

Given observations $X^{(1)}, \dots, X^{(N)}$, define
$$\mathcal{J}(q_1, \dots, q_N, \theta) = \sum_{n=1}^{N} \big[ H(q_n) + E_{q_n} \log p(X^{(n)}, \cdot\,; \theta) \big]. \tag{4}$$
Since
$$\log p(X^{(n)}; \theta) = \max_{q}\,[H(q) + E_q \log p(X^{(n)}, \cdot\,; \theta)],$$
we have
$$\max_{\theta}\, \ell(X^{(1)}, \dots, X^{(N)}; \theta) = \max_{\theta} \max_{q_1, \dots, q_N} \mathcal{J}(q_1, \dots, q_N, \theta). \tag{5}$$
This suggests an iterative scheme to maximize (1): maximize in the $q_n$'s for fixed $\theta$, and then maximize in $\theta$ for fixed $q_n$'s.

Initialize. Set $t = 0$ and pick $\theta^{(0)}$.

E. $q_n^{(t)} = \operatorname{argmax}_{q} \big[ H(q) + E_q \log p(X^{(n)}, \cdot\,; \theta^{(t)}) \big]$.

M. $\theta^{(t+1)} = \operatorname{argmax}_{\theta} \sum_{n=1}^{N} \big[ H(q_n^{(t)}) + E_{q_n^{(t)}} \log p(X^{(n)}, \cdot\,; \theta) \big]$.
For the E step we know already that
$$q_n^{(t)}(y) = p(y \mid X^{(n)}; \theta^{(t)}) = \frac{p(X^{(n)}, y; \theta^{(t)})}{p(X^{(n)}; \theta^{(t)})} = \frac{p(X^{(n)}, y; \theta^{(t)})}{\sum_{y' \in S_2} p(X^{(n)}, y'; \theta^{(t)})}. \tag{6}$$

In the M step we estimate:
$$\theta^{(t+1)} = \operatorname{argmax}_{\theta} \sum_{n=1}^{N} \big[ H(q_n^{(t)}) + E_{q_n^{(t)}} \log p(X^{(n)}, \cdot\,; \theta) \big]. \tag{7}$$
The first term is a constant with respect to the varying $\theta$ and can be ignored in the maximization. Writing out the expectation, we get
$$\theta^{(t+1)} = \operatorname{argmax}_{\theta} \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)}) \log p(X^{(n)}, y; \theta).$$

To avoid confusion, we reiterate that $\theta^{(t)}$ in the above expression is a constant value, different from the variable $\theta$ being maximized.
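
To see the alternation in action, here is a brute-force sketch (not from the notes; it assumes numpy, and the toy model below is an arbitrary illustrative choice): $\theta$ is restricted to a finite grid, the E step evaluates equation (6) exactly, and the M step maximizes the posterior-weighted complete-data log-likelihood of equation (7) by scanning the grid.

import numpy as np

rng = np.random.default_rng(2)

# Toy model (illustrative): Y ~ Bernoulli(0.5), X | Y = 0 ~ Bernoulli(0.2) with a
# fixed parameter, and X | Y = 1 ~ Bernoulli(theta) with theta unknown.
def joint(x, y, theta):
    # p(x, y; theta) = P(Y = y) P(X = x | Y = y; theta)
    p1 = 0.2 if y == 0 else theta
    return 0.5 * (p1 if x == 1 else 1.0 - p1)

theta_grid = np.linspace(0.01, 0.99, 99)               # the M step scans this grid
X = rng.binomial(1, 0.5 * 0.2 + 0.5 * 0.7, size=200)   # data generated with theta = 0.7

theta = 0.4                                            # initial guess theta^(0)
for t in range(50):
    # E step, equation (6): q_n(y) = p(y | X^(n); theta^(t)).
    q = np.array([[joint(x, y, theta) for y in (0, 1)] for x in X])
    q /= q.sum(axis=1, keepdims=True)

    # M step, equation (7): maximize the posterior-weighted complete-data
    # log-likelihood over the grid of candidate parameter values.
    scores = [np.sum(q * np.log([[joint(x, y, th) for y in (0, 1)] for x in X]))
              for th in theta_grid]
    theta = theta_grid[int(np.argmax(scores))]

print(theta)    # roughly recovers theta = 0.7

In any real application the grid search in the M step would be replaced by a closed-form or numerical maximization, as in the mixture-model example below.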
The above maximization has the following interpretation: the value of $\theta^{(t+1)}$ is what would be obtained by maximum likelihood estimation if, in addition to the observed data $\{X^{(n)}\}_{n=1}^{N}$, we had access to a large number of $Y$ observations for each $X^{(n)}$, drawn from the conditional $p(y \mid X^{(n)}; \theta^{(t)})$. Now, suppose we could easily perform maximum likelihood estimation given full data by computing a sufficient statistic, such as an empirical average of some function $T$ over the full data, i.e.
$$\theta_{ML} = \frac{1}{N} \sum_{n=1}^{N} T(X^{(n)}, Y^{(n)}).$$

Then, in the present context of unobserved data, $\theta^{(t+1)}$ reduces to the average of $T$ over our hypothetical pool of full data: $Y$ observations for each $X^{(n)}$, drawn from the conditional. Thus,
$$\theta^{(t+1)} = \frac{1}{N} \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)})\, T(X^{(n)}, y). \tag{8}$$
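
As a concrete instance of equation (8) (a sketch, not from the notes; numpy is assumed and the two-component model below is an arbitrary illustrative choice), take a mixture of two known unit-variance Gaussian components where only the mixing weight $\theta = P(Y = 1)$ is unknown. Then $T(x, y) = 1_{[y=1]}$, the complete-data MLE is the average of $T$, and equation (8) averages the posterior probabilities instead:

import numpy as np

rng = np.random.default_rng(5)

# Data: Y ~ Bernoulli(theta_true), X | Y = 1 ~ N(2, 1), X | Y = 0 ~ N(-2, 1).
N, theta_true = 1000, 0.7
Y = rng.binomial(1, theta_true, N)
X = rng.normal(np.where(Y == 1, 2.0, -2.0), 1.0)       # only X is observed

def dens(x, mean):
    # unit-variance Gaussian density
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

theta = 0.5
for t in range(50):
    # Equation (8): average T(X^(n), y) = 1[y = 1] over y drawn from the
    # conditional, i.e. average the posterior probability that Y^(n) = 1.
    post1 = theta * dens(X, 2.0) / (theta * dens(X, 2.0) + (1 - theta) * dens(X, -2.0))
    theta = np.mean(post1)

print(theta)    # close to theta_true = 0.7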

More generally, if the joint distribution is of an exponential family and $\theta_{ML}$ is the solution to an equation
$$E_{\theta}\, T(X, Y) = \frac{1}{N} \sum_{n=1}^{N} T(X^{(n)}, Y^{(n)}),$$
then in the M step $\theta^{(t+1)}$ is obtained as a solution to
$$E_{\theta}\, T(X, Y) = \frac{1}{N} \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)})\, T(X^{(n)}, y). \tag{9}$$

Some properties of the iterated maximization

Suppose the iterations reach a fixed point $(q^*, \theta^*) = (q_1^*, \dots, q_N^*, \theta^*)$. At a fixed point, by definition, the E and M steps look like
$$q_n^* = \operatorname{argmax}_{q} \big[ H(q) + E_q \log p(X^{(n)}, \cdot\,; \theta^*) \big], \tag{10}$$
$$\theta^* = \operatorname{argmax}_{\theta} \sum_{n=1}^{N} \big[ H(q_n^*) + E_{q_n^*} \log p(X^{(n)}, \cdot\,; \theta) \big]. \tag{11}$$
Assume also that $(q^*, \theta^*)$ is a local maximum of $\mathcal{J}$ in some neighborhood $U$ of $q^*$ and $V$ of $\theta^*$.
Let $\theta$ be in $V$ and let $q_\theta = (q_{1,\theta}, \dots, q_{N,\theta})$ be the maximizing distributions in equation (10) with $\theta$ in place of $\theta^*$. Assume $\theta$ is close enough to $\theta^*$ so that $q_\theta$ is in $U$; this is possible if $p(x, y; \theta)$ is smooth in $\theta$. Then using equation (5) we have
$$\ell(X^{(1)}, \dots, X^{(N)}; \theta^*) = \mathcal{J}(q_{1,\theta^*}, \dots, q_{N,\theta^*}, \theta^*) \ge \mathcal{J}(q_{1,\theta}, \dots, q_{N,\theta}, \theta) = \ell(X^{(1)}, \dots, X^{(N)}; \theta).$$
Thus $\theta^*$ is a local maximum of the marginal log-likelihood.

EM for Mixture models

In this problem, the hidden variable $Y$ takes values in $S_2 = \{1, \dots, K\}$. Let us set up some notation:
$$\pi_k = P(Y = k), \qquad p_k(x; \theta_k) = P(x \mid Y = k; \theta_k).$$
So the full parameter is $\theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K)$. The marginal distribution over $S_1$ is given by the mixture:
$$p(x; \theta) = \sum_{k=1}^{K} P(x, Y = k; \theta) = \sum_{k=1}^{K} \pi_k\, p_k(x; \theta_k).$$

If we have a fully observed sample $(X^{(1)}, Y^{(1)}), \dots, (X^{(N)}, Y^{(N)})$, then it is easy to see that maximum likelihood leads to separating the sample into the $K$ groups according to the value of $Y^{(n)}$ and computing
$$\hat{\theta}_k = \operatorname{argmax}_{\theta_k} \sum_{n=1}^{N} 1_{[Y^{(n)} = k]} \log p_k(X^{(n)}; \theta_k),$$
$$\hat{\pi}_k = \frac{1}{N} \sum_{n=1}^{N} 1_{[Y^{(n)} = k]}. \tag{12}$$
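
For concreteness (a sketch, not from the notes; numpy is assumed and Gaussian components are an illustrative choice, since the notes keep $p_k$ generic), the fully observed case of equation (12) amounts to grouping by label and estimating each group separately:

import numpy as np

rng = np.random.default_rng(3)

# Fully observed sample from a K = 2 Gaussian mixture with unit variances.
N, pis, mus = 500, np.array([0.3, 0.7]), np.array([-2.0, 1.0])
Y = rng.choice(2, size=N, p=pis)
X = rng.normal(mus[Y], 1.0)

# Equation (12): split by the observed label and estimate each group on its own.
pi_hat = np.array([np.mean(Y == k) for k in range(2)])
mu_hat = np.array([X[Y == k].mean() for k in range(2)])
print(pi_hat, mu_hat)    # close to (0.3, 0.7) and (-2, 1)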

Using Bayes' rule, the E step, which involves computing $p(k \mid X^{(n)}; \theta^{(t)})$ for each $k$ and $n$, becomes:
$$q_n^{(t)}(k) = \frac{\pi_k^{(t)}\, p_k(X^{(n)}; \theta_k^{(t)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t)}\, p_{k'}(X^{(n)}; \theta_{k'}^{(t)})}.$$

The M step reduces to equation (12) with the indicator $1_{[Y^{(n)} = k]}$ replaced by the posterior weight $q_n^{(t)}(k)$:
$$\theta_k^{(t+1)} = \operatorname{argmax}_{\theta_k} \sum_{n=1}^{N} q_n^{(t)}(k) \log p_k(X^{(n)}; \theta_k), \tag{13}$$
$$\pi_k^{(t+1)} = \frac{1}{N} \sum_{n=1}^{N} q_n^{(t)}(k). \tag{14}$$
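
Putting the pieces together, here is a sketch of the full EM iteration for a one-dimensional Gaussian mixture (not worked out in the notes, which keep $p_k$ generic; numpy is assumed). For Gaussian components the argmax in (13) is a $q$-weighted mean and variance, and (14) is the average responsibility:

import numpy as np

rng = np.random.default_rng(4)

# Hidden-label data from a two-component Gaussian mixture; only X is observed.
N = 500
Y = rng.choice(2, size=N, p=[0.3, 0.7])           # labels, discarded after sampling
X = rng.normal(np.array([-2.0, 1.0])[Y], 1.0)

K = 2
pi = np.full(K, 1.0 / K)                          # initial theta^(0)
mu = np.array([-1.0, 2.0])
var = np.ones(K)

for t in range(100):
    # E step: q_n(k) proportional to pi_k p_k(X^(n); theta_k), normalized over k.
    dens = np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    q = pi * dens
    q /= q.sum(axis=1, keepdims=True)

    # M step, equations (13)-(14): weighted mean and variance for each component,
    # and the average responsibility for the mixing weights.
    Nk = q.sum(axis=0)
    pi = Nk / N
    mu = (q * X[:, None]).sum(axis=0) / Nk
    var = (q * (X[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)    # roughly (0.3, 0.7), (-2, 1), (1, 1), up to label ordering

Each pass through the loop performs one E step and one M step; both can only increase the objective $\mathcal{J}$ of equation (4).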
