Problem Sheet 1
1. (25 points) In this problem, we illustrate the Expectation-Maximization (EM) algorithm using concrete examples. Suppose that the complete dataset consists of Z = (X, Y), where X is observed but Y is unobserved. The log-likelihood for Z is denoted by l(θ; X, Y), where θ denotes the unknown parameter vector. We repeat the E-Step and M-Step below until the sequence of θnew's converges (convergence to a local maximum can be guaranteed in some cases).
E-Step (Expectation Step): We compute the expected value of l(θ; X, Y), using (a) the information gained from the observed data X, and (b) the current parameter estimate θold. More precisely, let
\[
Q(\theta; \theta_{\mathrm{old}}) := E\!\left[\, l(\theta; \mathcal{X}, \mathcal{Y}) \mid \mathcal{X}, \theta_{\mathrm{old}} \right]
= \int l(\theta; \mathcal{X}, y)\, p(y \mid \mathcal{X}, \theta_{\mathrm{old}})\, dy. \tag{1}
\]
M-Step (Maximization Step): We maximize the conditional expectation (1) over θ. We simply set θnew := argmaxθ Q(θ; θold), and afterwards let θold = θnew.
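Schematically, the iteration can be organized as in the Python sketch below. The callables e_step and m_step stand in for the problem-specific construction of Q in (1) and its maximization; the names, default tolerance, and stopping rule are illustrative choices rather than part of the problem statement.

import numpy as np

def em(theta_init, e_step, m_step, tol=1e-8, max_iter=500):
    """Generic EM skeleton (a sketch): alternate E- and M-steps until theta stabilizes.

    e_step(theta_old) should return whatever object represents Q( . ; theta_old),
    and m_step(q) should return argmax_theta of that quantity.
    """
    theta_old = np.asarray(theta_init, dtype=float)
    for _ in range(max_iter):
        q = e_step(theta_old)                            # E-step: build Q( . ; theta_old)
        theta_new = np.asarray(m_step(q), dtype=float)   # M-step: maximize Q over theta
        if np.max(np.abs(theta_new - theta_old)) < tol:  # stop once the estimates stop moving
            break
        theta_old = theta_new
    return theta_new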
(a) We now derive the algorithm above. Let p(· | ·) denote an arbitrary conditional probability density function. Show that
\[
l(\theta; \mathcal{X}) = \ln p(\mathcal{X} \mid \theta)
= \ln \int p(\mathcal{X}, y \mid \theta)\, dy
\;\geq\; Q(\theta; \theta_{\mathrm{old}}) - E\!\left[ \ln p(\mathcal{Y} \mid \mathcal{X}, \theta_{\mathrm{old}}) \mid \mathcal{X}, \theta_{\mathrm{old}} \right]. \tag{2}
\]
(b) Denote the rightmost side of (2) by g (θ | θold ). It is clear that l(θ; X ) ≥ g (θ | θold ). Prove that
we have equality when θ = θold . Why does this imply that the EM algorithm is reasonable for
maximizing likelihood?
(c) Now, consider the multinomial distribution with four classes, Mult(n, πθ), where
\[
\pi_\theta = \left( \tfrac{1}{2} + \tfrac{1}{4}\theta,\; \tfrac{1}{4}(1-\theta),\; \tfrac{1}{4}(1-\theta),\; \tfrac{1}{4}\theta \right).
\]
Let x := (x1, x2, x3, x4) be a sample from this distribution. Write down the likelihood L(θ; x) and the log-likelihood l(θ; x) for the sample x.
(d) We will maximize l(θ; x) over θ using the EM algorithm (other algorithms will receive no marks), as a toy example. To use EM, we assume that the complete data Z is given by y := (y1, y2, y3, y4, y5) and that y has a 5-class Mult(n, πθ∗) distribution, where
\[
\pi_\theta^{*} = \left( \tfrac{1}{2},\; \tfrac{1}{4}\theta,\; \tfrac{1}{4}(1-\theta),\; \tfrac{1}{4}(1-\theta),\; \tfrac{1}{4}\theta \right).
\]
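Comparing πθ with πθ∗ suggests the intended correspondence x1 = y1 + y2, x2 = y3, x3 = y4, x4 = y5, so that only the first observed cell is split by the hidden data. Under that reading, one possible shape of the resulting updates is the Python sketch below; the function name, starting value, and stopping rule are illustrative assumptions, and the sketch is not a substitute for the derivation the problem asks for.

def em_multinomial(x, theta=0.5, tol=1e-10, max_iter=1000):
    """EM for the toy 4-class multinomial, assuming x1 splits into hidden cells
    with probabilities 1/2 and theta/4 (a sketch, not the official solution)."""
    x1, x2, x3, x4 = x
    for _ in range(max_iter):
        # E-step: expected value of the hidden count y2 given x1 and the current theta
        y2 = x1 * (theta / 4) / (1 / 2 + theta / 4)
        # M-step: maximizer of the expected complete-data log-likelihood over theta
        theta_new = (y2 + x4) / (y2 + x2 + x3 + x4)
        if abs(theta_new - theta) < tol:
            return theta_new
        theta = theta_new
    return theta

Note that the θ-dependent part of Q(θ; θold) only requires the expected hidden count E[y2 | x, θold], which is why a single scalar E-step suffices here.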
2. (15 points) As we saw in class, k-means clustering minimizes the average square distance distortion
\[
J_{\mathrm{avg}^2} = \sum_{j=1}^{k} \sum_{x \in C_j} d(x, m_j)^2, \tag{3}
\]
where d(x, x′ ) = ∥x − x′ ∥, and Cj is the set of points belonging to cluster j. Another distortion function
that we mentioned is the intra-cluster sum of squared distances,
\[
J_{\mathrm{IC}} = \sum_{j=1}^{k} \frac{1}{|C_j|} \sum_{x \in C_j} \sum_{x' \in C_j} d(x, x')^2.
\]
(a) Given that in k-means, mj = (1/|Cj|) Σx∈Cj x, show that JIC = 2 Javg2. (A numerical sanity check of this identity is sketched after part (b).)
(b) Let γi ∈ {1, . . . , k} be the cluster assignment of the i-th data point xi, and let n be the total number of data points. Then
\[
J_{\mathrm{avg}^2}(\gamma_1, \ldots, \gamma_n, m_1, \ldots, m_k) = \sum_{i=1}^{n} d\!\left(x_i, m_{\gamma_i}\right)^{2}.
\]
Show that step 1 minimizes Javg2 w.r.t. the assignments (holding {mj } fixed), and step 2 minimizes
Javg2 w.r.t. the centroids (holding the assignments fixed).
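Before proving the identity in part (a), it can be checked numerically. The NumPy sketch below draws arbitrary points, assigns them to three clusters at random, takes each centroid to be the cluster mean as in part (a), and compares the two distortions; all data and names here are purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                # arbitrary 2-D points
labels = rng.integers(0, 3, size=len(X))     # arbitrary assignment into k = 3 clusters

J_avg2 = 0.0
J_IC = 0.0
for j in range(3):
    C = X[labels == j]
    m_j = C.mean(axis=0)                     # centroid = cluster mean, as assumed in part (a)
    J_avg2 += ((C - m_j) ** 2).sum()
    diffs = C[:, None, :] - C[None, :, :]    # all pairwise differences within the cluster
    J_IC += (diffs ** 2).sum() / len(C)

print(J_IC / J_avg2)                         # the ratio should equal 2 up to round-off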
3. (10 points) Implement the k-means algorithm in a language of your choice, initializing the cluster
centers randomly. The algorithm terminates when no further change in cluster assignments or centroids
occurs.
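A minimal NumPy version of the required algorithm might look like the sketch below: the centers are initialized as k distinct randomly chosen data points, and the loop stops once the assignments no longer change. All names are illustrative, and empty clusters are handled only crudely (their centroid is simply left in place).

import numpy as np

def kmeans(X, k, rng=None):
    """Plain k-means (a sketch): k random data points as initial centers,
    stopping when the cluster assignments no longer change."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    assignments = np.full(len(X), -1)
    distortions = []   # J_avg2 after each assignment step, for the distortion-vs-iteration plots below
    while True:
        # Step 1: assign every point to its nearest centroid; squared distances are computed
        # as ||x||^2 - 2 x.c + ||c||^2 to avoid a huge broadcast on high-dimensional data
        d2 = ((X ** 2).sum(axis=1, keepdims=True)
              - 2.0 * X @ centroids.T
              + (centroids ** 2).sum(axis=1))
        new_assignments = d2.argmin(axis=1)
        distortions.append(d2[np.arange(len(X)), new_assignments].sum())
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
        # Step 2: move every centroid to the mean of its assigned points
        for j in range(k):
            if np.any(assignments == j):
                centroids[j] = X[assignments == j].mean(axis=0)
    return assignments, centroids, distortions

For part (a) below, one would load the points (for instance with np.loadtxt('toydata.txt')), call kmeans(X, 3) twenty times, and plot the returned distortion curves.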
(a) Use the toy dataset toydata.txt (500 points in R2 , from 3 well-separated clusters). Plot the final
clustering assignments (by color or symbol) and also, on a separate figure, plot the distortion value
vs. iteration for 20 separate runs. Comment on whether you get the “correct” clusters each time,
and on the variability of results across runs.
(b) Implement k-means++ initialization and repeat part (a). Compare convergence (speed and final distortion) to the original random initialization. (A sketch of the seeding step is given after part (c).)
(c) Run both the original and k-means++ algorithms on the MNIST dataset (images are 28 × 28
pixels, i.e. 784-dimensional vectors). Compare how they converge and how results differ for k = 10
vs. k = 16. You can download MNIST via
from torchvision import datasets
mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=None)
mnist_testset = datasets.MNIST(root='./data', train=False, download=True, transform=None)
Explain any differences you observe in speed, distortion, or cluster quality.
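For part (b), the k-means++ seeding could be implemented as in the sketch below and used in place of the random initialization; the function name and interface are illustrative assumptions.

import numpy as np

def kmeanspp_init(X, k, rng=None):
    """k-means++ seeding (a sketch): each new center is drawn with probability
    proportional to its squared distance from the nearest center chosen so far."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=float)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        C = np.array(centers)
        d2 = ((X ** 2).sum(axis=1, keepdims=True)
              - 2.0 * X @ C.T
              + (C ** 2).sum(axis=1)).min(axis=1)
        d2 = np.maximum(d2, 0.0)             # guard against tiny negative values from round-off
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

For part (c), transform=None leaves the images unprocessed, so one way (an assumption, not the only choice) to obtain the 784-dimensional vectors mentioned above is to flatten the raw tensor:

import numpy as np
from torchvision import datasets

mnist_trainset = datasets.MNIST(root='./data', train=True, download=True, transform=None)
# .data holds the images as a (60000, 28, 28) uint8 tensor; flatten to 784-dim float vectors in [0, 1]
X_mnist = mnist_trainset.data.numpy().reshape(-1, 784).astype(np.float32) / 255.0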
(a) Given an i.i.d. sample (x1, z1), . . . , (xn, zn) from the model, write down the complete-data log-likelihood ℓ(θ), ignoring additive constants that do not affect optimization.
(b) Let pi,j = P (zi = j | xi ). Give an expression for pi,j in terms of the mixture parameters.
(c) Derive the expected complete-data log-likelihood ℓθold(θ), where the expectation is taken with respect to these posterior probabilities pi,j. (The generic form this takes for a mixture model is sketched after part (d).)
(d) Show that maximizing ℓθold(θ) under the constraint Σj πj = 1 gives
\[
\pi_j \leftarrow \frac{1}{n} \sum_{i=1}^{n} p_{i,j}.
\]
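For orientation, assuming a generic mixture model with mixing weights πj and component densities p(x | θj) (the specific component family depends on the model in question), the quantities in parts (b) and (c) take the standard forms
\[
p_{i,j} = \frac{\pi_j\, p(x_i \mid \theta_j)}{\sum_{l} \pi_l\, p(x_i \mid \theta_l)},
\qquad
\ell_{\theta_{\mathrm{old}}}(\theta) = \sum_{i=1}^{n} \sum_{j} p_{i,j}\,\bigl[\ln \pi_j + \ln p(x_i \mid \theta_j)\bigr],
\]
from which the update for πj in part (d) follows by a Lagrange-multiplier argument. This is only a sketch of the general shape, not the requested derivation.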