
STAT 37710/CAAM 37710/CMSC 35400 Homework 1

University of Chicago, Spring 2025. Due Thursday, April 10 at 11:59pm.

1. (25 points) In this problem, we illustrate the Expectation-Maximization (EM) algorithm using concrete
examples. Suppose that the complete dataset consists of Z = (X, Y), where X is observed but Y is
unobserved. The log-likelihood for Z is then written l(θ; X, Y), where θ is the unknown parameter
vector. We repeat the E-Step and M-Step below until the sequence of θ_new's converges (in some cases,
convergence to a local maximum of the likelihood can be guaranteed).
E-Step (Expectation Step): We compute the expected value of l(θ; X, Y), using (a) the information
gained from the observed data X, and (b) the current parameter estimate θ_old. More precisely, let

    Q(θ; θ_old) := E[ l(θ; X, Y) | X, θ_old ] = ∫ l(θ; X, y) p(y | X, θ_old) dy,    (1)

where p(· | X, θ_old) is the conditional density of Y given the observed data X.

M-Step (Maximization Step): We maximize the conditional expectation (1) over θ. We simply set
θ_new := arg max_θ Q(θ; θ_old), and afterwards let θ_old = θ_new.
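
For reference, the overall iteration can be sketched in code as follows; compute_Q and maximize_Q are
hypothetical placeholders for the model-specific E-step and M-step, and this sketch is not part of the problem.

    import numpy as np

    def em(X, theta_init, compute_Q, maximize_Q, tol=1e-8, max_iter=1000):
        # Generic EM loop. compute_Q(X, theta) returns the function
        # theta' -> Q(theta'; theta) (E-step); maximize_Q(Q) returns
        # argmax_theta' Q(theta') (M-step). Both are model-specific placeholders.
        theta = np.asarray(theta_init, dtype=float)
        for _ in range(max_iter):
            Q = compute_Q(X, theta)                             # E-step
            theta_new = np.asarray(maximize_Q(Q), dtype=float)  # M-step
            if np.linalg.norm(theta_new - theta) < tol:
                return theta_new
            theta = theta_new
        return theta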

(a) We now derive the algorithm above. Let p(· | ·) denote an arbitrary conditional probability density
function. Show that

    l(θ; X) = ln p(X | θ) = ln ∫ p(X, y | θ) dy ≥ Q(θ; θ_old) − E[ ln p(Y | X, θ_old) | X, θ_old ].    (2)

(b) Denote the rightmost side of (2) by g(θ | θ_old). It is clear that l(θ; X) ≥ g(θ | θ_old). Prove that
we have equality when θ = θ_old. Why does this imply that the EM algorithm is reasonable for
maximizing likelihood?
(c) Now, consider the multinomial distribution with four classes, Mult(n, π_θ), where

    π_θ = ( 1/2 + θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ).

Let x := (x_1, x_2, x_3, x_4) be a sample from this distribution. Write down the likelihood L(θ; x) and
the log-likelihood l(θ; x) for the sample x.
(d) As a toy example, we will maximize l(θ; x) over θ using the EM algorithm (solutions using other
algorithms will receive no marks). To use EM, we assume that the complete data Z is given by
y := (y_1, y_2, y_3, y_4, y_5) and that y has a 5-class Mult(n, π*_θ) distribution, where
 
    π*_θ = ( 1/2, θ/4, (1 − θ)/4, (1 − θ)/4, θ/4 ).

However, instead of observing y directly, we are only able to observe x = (y_1 + y_2, y_3, y_4, y_5).
Therefore, we let X = (y_1 + y_2, y_3, y_4, y_5) and Y = y_2, where Y remains unobserved. Write down
the E-Step and M-Step update equations, with derivations.

2. (15 points) As we saw in class, k-means clustering minimizes the average squared distance distortion

    J_avg2 = Σ_{j=1}^{k} Σ_{x ∈ C_j} d(x, m_j)^2,    (3)

where d(x, x′ ) = ∥x − x′ ∥, and Cj is the set of points belonging to cluster j. Another distortion function
that we mentioned is the intra-cluster sum of squared distances,
k
X 1 X X
JIC = d(x, x′ )2 .
|C j | ′
j=1 x∈Cj x ∈Cj
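
For concreteness, both distortions can be evaluated directly from these definitions. The sketch below
assumes the data points are the rows of a NumPy array X, gamma holds the integer cluster assignments,
and M holds the centroids as rows; all names are illustrative only.

    import numpy as np

    def J_avg2(X, gamma, M):
        # Sum over clusters of squared distances from each point to its centroid m_j.
        return sum(np.sum((X[gamma == j] - M[j]) ** 2) for j in range(len(M)))

    def J_IC(X, gamma, k):
        # For each cluster, the sum of all pairwise squared distances, divided by |C_j|.
        total = 0.0
        for j in range(k):
            C = X[gamma == j]
            if len(C) == 0:
                continue
            pairwise_sq = np.sum((C[:, None, :] - C[None, :, :]) ** 2, axis=-1)
            total += pairwise_sq.sum() / len(C)
        return total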

(a) Given that in k-means, m_j = (1/|C_j|) Σ_{x ∈ C_j} x, show that J_IC = 2 J_avg2.
(b) Let γ_i ∈ {1, . . . , k} be the cluster assignment of the i-th data point x_i, and let n be the total
number of data points. Then

    J_avg2(γ_1, . . . , γ_n, m_1, . . . , m_k) = Σ_{i=1}^{n} d(x_i, m_{γ_i})^2.

Recall that k-means clustering alternates the following two steps:


1. Update the cluster assignments:

       γ_i ← arg min_{j ∈ {1, . . . , k}} d(x_i, m_j),    ∀ i = 1, . . . , n.

2. Update the centroids:

       m_j ← (1/|C_j|) Σ_{i : γ_i = j} x_i,    j = 1, . . . , k.

Show that step 1 minimizes J_avg2 w.r.t. the assignments (holding {m_j} fixed), and step 2 minimizes
J_avg2 w.r.t. the centroids (holding the assignments fixed).
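
As a starting point for Question 3, the two updates above can be written in NumPy roughly as follows;
this is only a sketch (initialization, the termination check, and empty clusters are left to you).

    import numpy as np

    def update_assignments(X, M):
        # Step 1: assign each point to its nearest centroid (minimizes J_avg2 over the assignments).
        dists = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=-1)  # shape (n, k)
        return np.argmin(dists, axis=1)

    def update_centroids(X, gamma, k):
        # Step 2: set each centroid to the mean of its assigned points (minimizes J_avg2 over the centroids).
        return np.stack([X[gamma == j].mean(axis=0) for j in range(k)])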

3. (10 points) Implement the k-means algorithm in a language of your choice, initializing the cluster
centers randomly. The algorithm terminates when no further change in cluster assignments or centroids
occurs.

(a) Use the toy dataset toydata.txt (500 points in R^2, from 3 well-separated clusters). Plot the final
clustering assignments (by color or symbol) and also, on a separate figure, plot the distortion value
vs. iteration for 20 separate runs. Comment on whether you get the “correct” clusters each time,
and on the variability of results across runs.
(b) Implement k-means++ initialization and repeat part (a). Compare convergence (speed and final
distortion) to the original random initialization.
(c) Run both the original and k-means++ algorithms on the MNIST dataset (images are 28 × 28
pixels, i.e. 784-dimensional vectors). Compare how they converge and how results differ for k = 10
vs. k = 16. You can download MNIST via
    from torchvision import datasets

    mnist_trainset = datasets.MNIST(root='./data', train=True,
                                    download=True, transform=None)
    mnist_testset = datasets.MNIST(root='./data', train=False,
                                   download=True, transform=None)
Explain any differences you observe in speed, distortion, or cluster quality.
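
Continuing from the snippet above, one way (an assumption about your setup, not a requirement) to obtain
the 784-dimensional vectors is to flatten the raw image tensor:

    import numpy as np

    # mnist_trainset.data holds the training images as a (60000, 28, 28) uint8 tensor;
    # reshape each image into a 784-dimensional float vector before clustering.
    X_train = mnist_trainset.data.numpy().reshape(-1, 28 * 28).astype(np.float64)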

4. (50 points) Recall the Gaussian mixture model for clustering



    p(x, z) = π_z N(x; µ_z, Σ_z),

with parameters θ = ( {π_j}, {µ_j}, {Σ_j} )_{j=1}^{k}.
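
To make the generative model concrete, the sketch below samples pairs (x_i, z_i) from it; the parameter
values are illustrative placeholders only.

    import numpy as np

    def sample_gmm(n, pi, mus, Sigmas, seed=0):
        # Draw z_i ~ Categorical(pi), then x_i | z_i ~ N(mu_{z_i}, Sigma_{z_i}).
        rng = np.random.default_rng(seed)
        z = rng.choice(len(pi), size=n, p=pi)
        x = np.stack([rng.multivariate_normal(mus[j], Sigmas[j]) for j in z])
        return x, z

    # Illustrative two-component example in R^2 (placeholder parameters).
    pi = np.array([0.3, 0.7])
    mus = [np.zeros(2), np.array([3.0, 3.0])]
    Sigmas = [np.eye(2), 0.5 * np.eye(2)]
    X, Z = sample_gmm(500, pi, mus, Sigmas)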

(a) Given an i.i.d. sample {(x_i, z_i)}_{i=1}^{n} from the model, write down the complete-data log-likelihood
ℓ(θ), ignoring additive constants that do not affect optimization.
(b) Let p_{i,j} = P(z_i = j | x_i). Give an expression for p_{i,j} in terms of the mixture parameters.
(c) Derive the expected complete-data log-likelihood ℓ_{θ_old}(θ), where the expectation is taken with
respect to the posterior probabilities p_{i,j}.
(d) Show that maximizing ℓ_{θ_old}(θ) under the constraint Σ_j π_j = 1 gives

        π_j ← (1/n) Σ_{i=1}^{n} p_{i,j}.

(e) Similarly, derive the updates for µ_j and Σ_j.


(f) Compare these updates to the k-means updates from Question 2.
(g) Apply the mixture-of-Gaussians EM algorithm to the toy data and comment on how it clusters
the points vs. k-means (both accuracy and convergence speed).

5. (Extra Credit, up to 20 points)

Create a dataset for which k-means++ initialization leads to solutions whose final distortion is, on
average, at least 10 times lower than with random initialization. Provide the code used to generate
the data.
