
The EM algorithm

The goal is to estimate the parameter $\theta$ for a family $p(x, y; \theta)$ of distributions on a product space $S_1 \times S_2$. Assume that if we observed the full data $(X^{(n)}, Y^{(n)})$, $n = 1, \dots, N$, finding the maximum likelihood estimate would be a simple computation. However, we only get to observe an i.i.d. random sample $X^{(1)}, \dots, X^{(N)}$ (the $Y^{(n)}$'s remain hidden). The best we can do is maximize the marginal log-likelihood:

$$\max_{\theta}\, \ell(X^{(1)}, \dots, X^{(N)}; \theta) = \max_{\theta} \sum_{n=1}^{N} \log p_X(X^{(n)}; \theta). \tag{1}$$

We develop an iterative algorithm which can be used to find local maxima of this marginal
log-likelihood.

The Kullback-Leibler distance and entropy

The Kullback-Leibler distance between two distributions $q$ and $p$ on a set $S$ is defined as
$$K(q, p) = \sum_{y \in S} q(y) \log \frac{q(y)}{p(y)}.$$

Since the log function is concave (equivalently, $-\log$ is convex), Jensen's inequality gives
$$-K(q, p) = \sum_{y} q(y) \log \frac{p(y)}{q(y)} \le \log \sum_{y} q(y)\,\frac{p(y)}{q(y)} = \log \sum_{y} p(y) = 0.$$

Since $p$ and $q$ are distributions that sum to $1$, equality can only be achieved if $p(y) = q(y)$ for all $y$. The entropy of a distribution $q$, denoted $H(q)$, is given by
$$H(q) = -\sum_{y} q(y) \log q(y).$$
We return to the properties of this quantity later. Note that
$$K(q, p) = -H(q) - E_q \log p.$$
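
As a quick numerical check of these definitions (not part of the original notes; it assumes numpy, and the distributions $q$ and $p$ below are arbitrary illustrative values), the identity $K(q, p) = -H(q) - E_q \log p$ can be verified directly:

import numpy as np

# Two arbitrary distributions on a 4-point space (illustrative values only).
q = np.array([0.1, 0.2, 0.3, 0.4])
p = np.array([0.25, 0.25, 0.25, 0.25])

def entropy(q):
    # H(q) = -sum_y q(y) log q(y)
    return -np.sum(q * np.log(q))

def kl(q, p):
    # K(q, p) = sum_y q(y) log(q(y) / p(y))
    return np.sum(q * np.log(q / p))

lhs = kl(q, p)
rhs = -entropy(q) - np.sum(q * np.log(p))
print(lhs, rhs)     # the two values agree
print(kl(q, q))     # K(q, q) = 0, and K(q, p) >= 0 in general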

The Gibbs Variational formula

Let $\psi(y)$ be a function on a discrete state space $S$ and define the probability $p$ on $S$ as

$$p(y) = e^{\psi(y)}/Z = e^{\psi(y) - B}, \tag{2}$$
where $Z = \sum_{y \in S} e^{\psi(y)}$ is the normalizing constant and $B = \log Z$. Then
$$B = \max_{q}\,[H(q) + E_q \psi], \tag{3}$$
where the unique maximizer is $p$. This follows directly from the properties of the KL distance. For any $q$, since $\log p(y) = \psi(y) - B$,
$$0 \le K(q, p) = -H(q) - E_q \psi + B,$$
so that
$$B \ge H(q) + E_q \psi,$$
and equality is achieved only when $q = p$.
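
The following small numerical check (not in the notes; it assumes numpy, and the potential $\psi$ below is an arbitrary illustrative choice) confirms that $H(q) + E_q \psi$ never exceeds $B$ and that $q = p$ attains it:

import numpy as np

rng = np.random.default_rng(0)

# An arbitrary potential psi on a 5-point space (illustrative values only).
psi = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
Z = np.sum(np.exp(psi))
B = np.log(Z)
p = np.exp(psi - B)              # the Gibbs distribution of equation (2)

def gibbs_objective(q):
    # H(q) + E_q psi, the quantity maximized in equation (3)
    return -np.sum(q * np.log(q)) + np.sum(q * psi)

for _ in range(5):
    q = rng.dirichlet(np.ones(len(psi)))     # a random competitor distribution
    assert gibbs_objective(q) <= B + 1e-12
print(B, gibbs_objective(p))     # these agree: the maximizer is q = p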

A variational formula for the log-marginal

Consider a joint distribution $p(x, y)$ on a product space of two sets $S_1 \times S_2$ with $x \in S_1$ and $y \in S_2$. Write
$$p_{Y|X}(y|x) = \frac{p_{X,Y}(x, y)}{p_X(x)} = \frac{e^{\log p(x,y)}}{Z},$$
where the normalizing constant $Z$ is given by the marginal $p_X(x)$. For fixed $x$, set $\psi(y) = \log p(x, y)$; then using the Gibbs variational formula we have

$$\log p_X(x) = \max_{q}\,[H(q) + E_q \log p(x, y)],$$
where $q$ runs over probability distributions on $S_2$. The maximum is achieved at $q(y) = p_{Y|X}(y|x)$.
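
This is worth checking numerically once (again not in the notes; numpy is assumed and the joint table below is randomly generated for illustration): plugging the conditional $q(y) = p_{Y|X}(y|x)$ into $H(q) + E_q \log p(x, y)$ recovers $\log p_X(x)$ exactly.

import numpy as np

rng = np.random.default_rng(1)

# A random joint distribution on a 3 x 4 product space (illustrative only).
joint = rng.random((3, 4))
joint /= joint.sum()

x = 0                            # condition on a fixed x
p_x = joint[x].sum()             # marginal p_X(x)
q = joint[x] / p_x               # conditional p_{Y|X}(. | x)

value = -np.sum(q * np.log(q)) + np.sum(q * np.log(joint[x]))
print(np.log(p_x), value)        # both equal log p_X(x)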

Derivation of the EM iteration

Given observations $X^{(1)}, \dots, X^{(N)}$, define
$$\mathcal{J}(q_1, \dots, q_N, \theta) = \sum_{n=1}^{N} \big[ H(q_n) + E_{q_n} \log p(X^{(n)}, \cdot\,; \theta) \big]. \tag{4}$$
Since
$$\log p(X^{(n)}; \theta) = \max_{q}\,[H(q) + E_q \log p(X^{(n)}, \cdot\,; \theta)],$$
we have
$$\max_{\theta}\, \ell(X^{(1)}, \dots, X^{(N)}; \theta) = \max_{\theta} \max_{q_1, \dots, q_N} \mathcal{J}(q_1, \dots, q_N, \theta). \tag{5}$$
This suggests an iterative scheme to maximize (1): maximize in the $q_n$'s for fixed $\theta$, and then maximize in $\theta$ for fixed $q_n$'s.

Initialize. Set $t = 0$ and pick $\theta^{(0)}$.

E. $q_n^{(t)} = \operatorname{argmax}_{q} \big[ H(q) + E_q \log p(X^{(n)}, \cdot\,; \theta^{(t)}) \big]$.

M. $\theta^{(t+1)} = \operatorname{argmax}_{\theta} \sum_{n=1}^{N} \big[ H(q_n^{(t)}) + E_{q_n^{(t)}} \log p(X^{(n)}, \cdot\,; \theta) \big]$.
For the E step we know already that
$$q_n^{(t)}(y) = p(y \mid X^{(n)}; \theta^{(t)}) = \frac{p(X^{(n)}, y; \theta^{(t)})}{p(X^{(n)}; \theta^{(t)})} = \frac{p(X^{(n)}, y; \theta^{(t)})}{\sum_{y' \in S_2} p(X^{(n)}, y'; \theta^{(t)})}. \tag{6}$$

In the M step we estimate:
$$\theta^{(t+1)} = \operatorname{argmax}_{\theta} \sum_{n=1}^{N} \big[ H(q_n^{(t)}) + E_{q_n^{(t)}} \log p(X^{(n)}, \cdot\,; \theta) \big]. \tag{7}$$
The first term is a constant with respect to the varying $\theta$ and can be ignored in the maximization. Writing out the expectation, we get
$$\theta^{(t+1)} = \operatorname{argmax}_{\theta} \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)}) \log p(X^{(n)}, y; \theta).$$

To avoid confusion, we reiterate that $\theta^{(t)}$ in the above expression is a constant value, different from the variable $\theta$ being maximized.
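
To see the alternation in action, here is a brute-force sketch (not from the notes; it assumes numpy, and the toy model below is an arbitrary illustrative choice): $\theta$ is restricted to a finite grid, the E step evaluates equation (6) exactly, and the M step maximizes the posterior-weighted complete-data log-likelihood of equation (7) by scanning the grid.

import numpy as np

rng = np.random.default_rng(2)

# Toy model (illustrative): Y ~ Bernoulli(0.5), X | Y = 0 ~ Bernoulli(0.2) with a
# fixed parameter, and X | Y = 1 ~ Bernoulli(theta) with theta unknown.
def joint(x, y, theta):
    # p(x, y; theta) = P(Y = y) P(X = x | Y = y; theta)
    p1 = 0.2 if y == 0 else theta
    return 0.5 * (p1 if x == 1 else 1.0 - p1)

theta_grid = np.linspace(0.01, 0.99, 99)               # the M step scans this grid
X = rng.binomial(1, 0.5 * 0.2 + 0.5 * 0.7, size=200)   # data generated with theta = 0.7

theta = 0.4                                            # initial guess theta^(0)
for t in range(50):
    # E step, equation (6): q_n(y) = p(y | X^(n); theta^(t)).
    q = np.array([[joint(x, y, theta) for y in (0, 1)] for x in X])
    q /= q.sum(axis=1, keepdims=True)

    # M step, equation (7): maximize the posterior-weighted complete-data
    # log-likelihood over the grid of candidate parameter values.
    scores = [np.sum(q * np.log([[joint(x, y, th) for y in (0, 1)] for x in X]))
              for th in theta_grid]
    theta = theta_grid[int(np.argmax(scores))]

print(theta)    # roughly recovers theta = 0.7

In any real application the grid search in the M step would be replaced by a closed-form or numerical maximization, as in the mixture-model example below.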
The above maximization has the following interpretation: the value of $\theta^{(t+1)}$ is what would be obtained by maximum likelihood estimation if, in addition to the observed data $\{X^{(n)}\}_{n=1}^{N}$, we had access to a large number of $Y$ observations for each $X^{(n)}$, drawn from the conditional $p(y \mid X^{(n)}; \theta^{(t)})$. Now, suppose we could easily perform maximum likelihood estimation given full data by computing a sufficient statistic, such as an empirical average of some function $T$ over the full data, i.e.
$$\theta_{ML} = \frac{1}{N} \sum_{n=1}^{N} T(X^{(n)}, Y^{(n)}).$$

Then, in the present context of unobserved data, $\theta^{(t+1)}$ reduces to the average of $T$ over our hypothetical pool of full data: $Y$ observations for each $X^{(n)}$, drawn from the conditional. Thus,
$$\theta^{(t+1)} = \frac{1}{N} \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)})\, T(X^{(n)}, y). \tag{8}$$
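
As a concrete instance of equation (8) (a sketch, not from the notes; numpy is assumed and the two-component model below is an arbitrary illustrative choice), take a mixture of two known unit-variance Gaussian components where only the mixing weight $\theta = P(Y = 1)$ is unknown. Then $T(x, y) = 1_{[y=1]}$, the complete-data MLE is the average of $T$, and equation (8) averages the posterior probabilities instead:

import numpy as np

rng = np.random.default_rng(5)

# Data: Y ~ Bernoulli(theta_true), X | Y = 1 ~ N(2, 1), X | Y = 0 ~ N(-2, 1).
N, theta_true = 1000, 0.7
Y = rng.binomial(1, theta_true, N)
X = rng.normal(np.where(Y == 1, 2.0, -2.0), 1.0)       # only X is observed

def dens(x, mean):
    # unit-variance Gaussian density
    return np.exp(-0.5 * (x - mean) ** 2) / np.sqrt(2 * np.pi)

theta = 0.5
for t in range(50):
    # Equation (8): average T(X^(n), y) = 1[y = 1] over y drawn from the
    # conditional, i.e. average the posterior probability that Y^(n) = 1.
    post1 = theta * dens(X, 2.0) / (theta * dens(X, 2.0) + (1 - theta) * dens(X, -2.0))
    theta = np.mean(post1)

print(theta)    # close to theta_true = 0.7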

More generally, if the joint distribution is of an exponential family and $\theta_{ML}$ is the solution to an equation
$$E_{\theta}\, T(X, Y) = \frac{1}{N} \sum_{n=1}^{N} T(X^{(n)}, Y^{(n)}),$$
then in the M step $\theta^{(t+1)}$ is obtained as a solution to
$$E_{\theta}\, T(X, Y) = \frac{1}{N} \sum_{n=1}^{N} \sum_{y \in S_2} p(y \mid X^{(n)}; \theta^{(t)})\, T(X^{(n)}, y). \tag{9}$$

Some properties of the iterated maximization

Suppose the iterations reach a fixed point $(q^*, \theta^*) = (q_1^*, \dots, q_N^*, \theta^*)$. At a fixed point, by definition, the E and M steps look like
$$q_n^* = \operatorname{argmax}_{q} \big[ H(q) + E_q \log p(X^{(n)}, \cdot\,; \theta^*) \big], \tag{10}$$
$$\theta^* = \operatorname{argmax}_{\theta} \sum_{n=1}^{N} \big[ H(q_n^*) + E_{q_n^*} \log p(X^{(n)}, \cdot\,; \theta) \big]. \tag{11}$$
Assume also that $(q^*, \theta^*)$ is a local maximum of $\mathcal{J}$ in some neighborhood $U$ of $q^*$ and $V$ of $\theta^*$.
Let $\theta$ be in $V$ and let $q_\theta = (q_{1,\theta}, \dots, q_{N,\theta})$ be the maximizing distributions in equation (10) with $\theta$ in place of $\theta^*$. Assume $\theta$ is close enough to $\theta^*$ so that $q_\theta$ is in $U$; this is possible if $p(x, y; \theta)$ is smooth in $\theta$. Then using equation (5) we have
$$\ell(X^{(1)}, \dots, X^{(N)}; \theta^*) = \mathcal{J}(q_{1,\theta^*}, \dots, q_{N,\theta^*}, \theta^*) \ge \mathcal{J}(q_{1,\theta}, \dots, q_{N,\theta}, \theta) = \ell(X^{(1)}, \dots, X^{(N)}; \theta).$$
Thus $\theta^*$ is a local maximum of the marginal log-likelihood.

EM for Mixture models

In this problem, the hidden variable $Y$ takes values in $S_2 = \{1, \dots, K\}$. Let us set up some notation:
$$\pi_k = P(Y = k), \qquad p_k(x; \theta_k) = P(x \mid Y = k; \theta_k).$$
So the full parameter is $\theta = (\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K)$. The marginal distribution over $S_1$ is given by the mixture:
$$p(x; \theta) = \sum_{k=1}^{K} P(x, Y = k; \theta) = \sum_{k=1}^{K} \pi_k\, p_k(x; \theta_k).$$

If we have a fully observed sample $(X^{(1)}, Y^{(1)}), \dots, (X^{(N)}, Y^{(N)})$, then it is easy to see that maximum likelihood leads to separating the sample into the $K$ groups according to the value of $Y^{(n)}$ and computing
$$\hat{\theta}_k = \operatorname{argmax}_{\theta_k} \sum_{n=1}^{N} 1_{[Y^{(n)} = k]} \log p_k(X^{(n)}; \theta_k),$$
$$\hat{\pi}_k = \frac{1}{N} \sum_{n=1}^{N} 1_{[Y^{(n)} = k]}. \tag{12}$$
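
For concreteness (a sketch, not from the notes; numpy is assumed and Gaussian components are an illustrative choice, since the notes keep $p_k$ generic), the fully observed case of equation (12) amounts to grouping by label and estimating each group separately:

import numpy as np

rng = np.random.default_rng(3)

# Fully observed sample from a K = 2 Gaussian mixture with unit variances.
N, pis, mus = 500, np.array([0.3, 0.7]), np.array([-2.0, 1.0])
Y = rng.choice(2, size=N, p=pis)
X = rng.normal(mus[Y], 1.0)

# Equation (12): split by the observed label and estimate each group on its own.
pi_hat = np.array([np.mean(Y == k) for k in range(2)])
mu_hat = np.array([X[Y == k].mean() for k in range(2)])
print(pi_hat, mu_hat)    # close to (0.3, 0.7) and (-2, 1)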

Using Bayes' rule, the E step, which involves computing $p(k \mid X^{(n)}; \theta^{(t)})$ for each $k$ and $n$, becomes:
$$q_n^{(t)}(k) = \frac{\pi_k^{(t)}\, p_k(X^{(n)}; \theta_k^{(t)})}{\sum_{k'=1}^{K} \pi_{k'}^{(t)}\, p_{k'}(X^{(n)}; \theta_{k'}^{(t)})}.$$

The M step reduces to equation (12) with the indicator $1_{[Y^{(n)} = k]}$ replaced by the posterior weight $q_n^{(t)}(k)$:
$$\theta_k^{(t+1)} = \operatorname{argmax}_{\theta_k} \sum_{n=1}^{N} q_n^{(t)}(k) \log p_k(X^{(n)}; \theta_k), \tag{13}$$
$$\pi_k^{(t+1)} = \frac{1}{N} \sum_{n=1}^{N} q_n^{(t)}(k). \tag{14}$$
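
Putting the pieces together, here is a sketch of the full EM iteration for a one-dimensional Gaussian mixture (not worked out in the notes, which keep $p_k$ generic; numpy is assumed). For Gaussian components the argmax in (13) is a $q$-weighted mean and variance, and (14) is the average responsibility:

import numpy as np

rng = np.random.default_rng(4)

# Hidden-label data from a two-component Gaussian mixture; only X is observed.
N = 500
Y = rng.choice(2, size=N, p=[0.3, 0.7])           # labels, discarded after sampling
X = rng.normal(np.array([-2.0, 1.0])[Y], 1.0)

K = 2
pi = np.full(K, 1.0 / K)                          # initial theta^(0)
mu = np.array([-1.0, 2.0])
var = np.ones(K)

for t in range(100):
    # E step: q_n(k) proportional to pi_k p_k(X^(n); theta_k), normalized over k.
    dens = np.exp(-0.5 * (X[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    q = pi * dens
    q /= q.sum(axis=1, keepdims=True)

    # M step, equations (13)-(14): weighted mean and variance for each component,
    # and the average responsibility for the mixing weights.
    Nk = q.sum(axis=0)
    pi = Nk / N
    mu = (q * X[:, None]).sum(axis=0) / Nk
    var = (q * (X[:, None] - mu) ** 2).sum(axis=0) / Nk

print(pi, mu, var)    # roughly (0.3, 0.7), (-2, 1), (1, 1), up to label ordering

Each pass through the loop performs one E step and one M step; both can only increase the objective $\mathcal{J}$ of equation (4).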
