
Density Estimation

with Gaussian Mixture Models¹


CS 2XX: Mathematics for AI and ML

Chandresh Kumar Maurya

IIT Indore
https://chandreshiit.github.io

November 17, 2024

¹ Slide credit: Yi, Yung
Warm-Up

Please watch this tutorial video by Luis Serrano on Gaussian Mixture Models:

https://www.youtube.com/watch?v=q71Niz856KE



Roadmap

(1) Gaussian Mixture Model


(2) Parameter Learning: MLE
(3) Latent-Variable Perspective for Probabilistic Modeling
(4) EM Algorithm



Density Estimation
• Represent data compactly using a density from a parametric family, e.g., a Gaussian or Beta distribution
• The parameters of those families can be found by MLE or MAP estimation
• However, in many cases a single simple distribution (e.g., one Gaussian) fails to approximate the data well



Mixture Models

• A more expressive family of distributions


• Idea: Let’s mix! A convex combination of K “base” distributions

      p(x) = Σ_{k=1}^{K} πk pk(x),      0 ≤ πk ≤ 1,      Σ_{k=1}^{K} πk = 1

• Multi-modal distributions: Can be used to describe datasets with multiple clusters


• Our focus: Gaussian mixture models
• We want to find the parameters using MLE, but no closed-form solution exists (even for a mixture of Gaussians) → some iterative method is needed



Gaussian Mixture Model

      p(x|θ) = Σ_{k=1}^{K} πk N(x|µk, Σk),      0 ≤ πk ≤ 1,      Σ_{k=1}^{K} πk = 1,

where the parameters are θ := {µk, Σk, πk : k = 1, . . . , K}
• Example. p(x|θ) = 0.5N(x|−2, 1/2) + 0.2N(x|1, 2) + 0.3N(x|4, 1)
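
As an illustration, here is a minimal Python sketch (assuming the second argument of N(x|·, ·) above denotes the variance) that evaluates this example density on a grid and checks that it integrates to roughly 1:

```python
import numpy as np
from scipy.stats import norm

weights = np.array([0.5, 0.2, 0.3])    # mixture weights pi_k
means = np.array([-2.0, 1.0, 4.0])     # component means mu_k
variances = np.array([0.5, 2.0, 1.0])  # component variances (1-D case)

x = np.linspace(-6.0, 8.0, 1000)

# p(x|theta) = sum_k pi_k * N(x | mu_k, Sigma_k)
p = sum(w * norm.pdf(x, loc=m, scale=np.sqrt(v))
        for w, m, v in zip(weights, means, variances))

print(np.trapz(p, x))  # ~1.0, as expected for a density
```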



Roadmap

(1) Gaussian Mixture Model


(2) Parameter Learning: MLE
(3) Latent-Variable Perspective for Probabilistic Modeling
(4) EM Algorithm



Parameter Learning: Maximum Likelihood

• Given an i.i.d. dataset X = {x1, . . . , xN}, the log-likelihood is:


      L(θ) = log p(X|θ) = Σ_{n=1}^{N} log p(xn|θ) = Σ_{n=1}^{N} log Σ_{k=1}^{K} πk N(xn|µk, Σk)

• θML = arg minθ (−L(θ))


• Necessary condition for θML:  dL/dθ |_{θ=θML} = 0
• However, no closed-form solution for θML exists, so we rely on an iterative algorithm (the EM algorithm).
• We present the algorithm first, and then discuss how it is derived.
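
As a numerical aside, here is a minimal sketch of evaluating L(θ) above on a toy 1-D dataset (the data and parameters are illustrative; the log-sum-exp trick is used for numerical stability):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

X = np.array([-2.3, -1.8, 0.5, 1.2, 3.9, 4.4])  # toy dataset x_1, ..., x_N
pi = np.array([0.5, 0.2, 0.3])                   # mixture weights
mu = np.array([-2.0, 1.0, 4.0])                  # means
var = np.array([0.5, 2.0, 1.0])                  # variances (1-D case)

# log(pi_k * N(x_n | mu_k, Sigma_k)) for every n, k: shape (N, K)
log_terms = np.log(pi) + norm.logpdf(X[:, None], loc=mu, scale=np.sqrt(var))

# L(theta) = sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)
L = logsumexp(log_terms, axis=1).sum()
print(L)
```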



Responsibilities

• Definition (Responsibilities). Given the n-th data point xn and the parameters (µk, Σk, πk : k = 1, . . . , K),

      rnk = πk N(xn|µk, Σk) / Σ_{j=1}^{K} πj N(xn|µj, Σj)

• How much is each component k responsible for xn, if xn is sampled from the current mixture model?
• rn = (rnk : k = 1, . . . , K) is a probability distribution, so Σ_{k=1}^{K} rnk = 1
• Soft assignment of xn to the K mixture components
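
A minimal sketch of computing the responsibility vector rn for a single data point (2-D components; the parameters and the point xn are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

pis = np.array([0.5, 0.5])                          # mixture weights pi_k
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]  # component means
Sigmas = [np.eye(2), np.eye(2)]                     # component covariances

x_n = np.array([2.0, 2.5])

# Unnormalized responsibilities: pi_k * N(x_n | mu_k, Sigma_k)
unnorm = np.array([pi * multivariate_normal.pdf(x_n, mean=mu, cov=S)
                   for pi, mu, S in zip(pis, mus, Sigmas)])

r_n = unnorm / unnorm.sum()  # normalize so that sum_k r_nk = 1
print(r_n)                   # soft assignment of x_n to the components
```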



EM Algorithm: MLE in Gaussian Mixture Models
EM for MLE in Gaussian Mixture Models
S1. Initialize µk , Σk , πk
S2. E-step: Evaluate responsibilities rnk for every data point xn using the current µk , Σk , πk :
      rnk = πk N(xn|µk, Σk) / Σ_{j=1}^{K} πj N(xn|µj, Σj),      Nk = Σ_{n=1}^{N} rnk

S3. M-step: Reestimate parameters µk , Σk , πk using the current responsibilities rnk :


      µk = (1/Nk) Σ_{n=1}^{N} rnk xn,      Σk = (1/Nk) Σ_{n=1}^{N} rnk (xn − µk)(xn − µk)^T,      πk = Nk / N,

and go to S2.

- The M-step update equations may look mysterious at this point; their derivation is covered later.
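
The following is a minimal NumPy sketch of S1-S3 for a 1-D mixture. The toy data, K = 2, and the fixed number of iterations are illustrative; a practical implementation would also monitor the log-likelihood for convergence and guard against collapsing variances.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy 1-D data from two well-separated clusters.
X = np.concatenate([rng.normal(-2.0, 0.7, 150), rng.normal(4.0, 1.0, 100)])
N, K = X.shape[0], 2

# S1. Initialize mu_k, Sigma_k (variances, since the data is 1-D), pi_k.
mu = rng.choice(X, size=K, replace=False)
var = np.full(K, X.var())
pi = np.full(K, 1.0 / K)

for _ in range(50):
    # S2. E-step: responsibilities r_nk (shape N x K) and N_k = sum_n r_nk.
    weighted = np.stack([pi[k] * norm.pdf(X, loc=mu[k], scale=np.sqrt(var[k]))
                         for k in range(K)], axis=1)
    r = weighted / weighted.sum(axis=1, keepdims=True)
    Nk = r.sum(axis=0)

    # S3. M-step: re-estimate the parameters from the responsibilities.
    mu = (r * X[:, None]).sum(axis=0) / Nk
    var = (r * (X[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / N

print("means:", mu, "variances:", var, "weights:", pi)
```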



Example: EM Algorithm

(The original slide shows a figure illustrating the EM iterations; it is not reproduced here.)



M-Step: Towards the Zero Gradient
• Given X and the rnk from the E-step, the new values of µk, Σk, πk should satisfy the zero-gradient conditions:

      ∂L/∂µk = 0^T  ⇐⇒  Σ_{n=1}^{N} ∂ log p(xn|θ)/∂µk = 0^T

      ∂L/∂Σk = 0   ⇐⇒  Σ_{n=1}^{N} ∂ log p(xn|θ)/∂Σk = 0

      ∂L/∂πk = 0   ⇐⇒  Σ_{n=1}^{N} ∂ log p(xn|θ)/∂πk = 0
• Nice property: the new values of µk, Σk, πk can all be expressed in terms of the responsibilities [rnk]
• Let’s take a look at them one by one!



M-Step: Update of µk

      µk^new = ( Σ_{n=1}^{N} rnk xn ) / ( Σ_{n=1}^{N} rnk ),      k = 1, . . . , K
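
This is the solution of the zero-gradient condition ∂L/∂µk = 0^T from the previous slide. A sketch of the derivation, using ∂N(xn|µk, Σk)/∂µk = N(xn|µk, Σk)(xn − µk)^T Σk^{-1}:

```latex
\frac{\partial \log p(x_n|\theta)}{\partial \mu_k}
  = \frac{\pi_k \mathcal{N}(x_n|\mu_k,\Sigma_k)}{\sum_{j=1}^{K} \pi_j \mathcal{N}(x_n|\mu_j,\Sigma_j)}
    \,(x_n-\mu_k)^{\top}\Sigma_k^{-1}
  = r_{nk}\,(x_n-\mu_k)^{\top}\Sigma_k^{-1}
```

Setting the sum over n of this gradient to zero and right-multiplying by Σk gives

```latex
\sum_{n=1}^{N} r_{nk}\,(x_n-\mu_k)^{\top} = 0^{\top}
\;\Longrightarrow\;
\mu_k^{\text{new}} = \frac{\sum_{n=1}^{N} r_{nk}\,x_n}{\sum_{n=1}^{N} r_{nk}}
```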



M-Step: Update of Σk

      Σk^new = (1/Nk) Σ_{n=1}^{N} rnk (xn − µk)(xn − µk)^T,      k = 1, . . . , K



M-Step: Update of πk

      πk^new = ( Σ_{n=1}^{N} rnk ) / N = Nk / N,      k = 1, . . . , K



Roadmap

(1) Gaussian Mixture Model


(2) Parameter Learning: MLE
(3) Latent-Variable Perspective for Probabilistic Modeling
(4) EM Algorithm



Latent-Variable Perspective
• Justifies some ad hoc decisions made earlier
• Allows for a concrete interpretation of the responsibilities as posterior distributions
• The iterative algorithm for updating the model parameters can be derived in a principled manner



Generative Process
• Latent variable z: a one-hot random vector z = [z1, . . . , zK]^T consisting of K − 1 zeros and exactly one 1.
• The indicator zk = 1 means that the k-th component was used to generate the data sample x.
• p(x|zk = 1) = N (x|µk , Σk )
• Prior for z, with πk = p(zk = 1):

      p(z) = π = [π1, . . . , πK]^T,      Σ_{k=1}^{K} πk = 1
• Sampling procedure (ancestral sampling; see the sketch below):
  1. Sample which component to use: z^(i) ∼ p(z)
  2. Sample a data point from the chosen Gaussian: x^(i) ∼ p(x|z^(i))
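
A minimal sketch of this two-step sampling procedure for a 1-D mixture, reusing the parameters of the earlier example:

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.5, 0.2, 0.3])   # p(z_k = 1)
mu = np.array([-2.0, 1.0, 4.0])  # component means
var = np.array([0.5, 2.0, 1.0])  # component variances (1-D case)

def sample_gmm(n_samples):
    # 1. Sample the component index z ~ p(z).
    z = rng.choice(len(pi), size=n_samples, p=pi)
    # 2. Sample x from the chosen Gaussian p(x|z).
    return rng.normal(loc=mu[z], scale=np.sqrt(var[z]))

samples = sample_gmm(10000)
print(samples.mean())  # close to sum_k pi_k * mu_k = 0.4 for these parameters
```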



Joint Distribution, Likelihood, and Posterior (1)
• Joint distribution

      p(x, z) = [ p(x, z1 = 1), . . . , p(x, zK = 1) ]^T
              = [ p(x|z1 = 1)p(z1 = 1), . . . , p(x|zK = 1)p(zK = 1) ]^T
              = [ π1 N(x|µ1, Σ1), . . . , πK N(x|µK, ΣK) ]^T
• Likelihood for an arbitrary single data point x: by summing out the latent variable z²,

      p(x|θ) = Σ_z p(x|θ, z)p(z|θ) = Σ_{k=1}^{K} p(x|θ, zk = 1)p(zk = 1|θ) = Σ_{k=1}^{K} πk N(x|µk, Σk)
• For all the data samples X, the log-likelihood is:

      log p(X|θ) = Σ_{n=1}^{N} log p(xn|θ) = Σ_{n=1}^{N} log Σ_{k=1}^{K} πk N(xn|µk, Σk)

  (compare with the GMM likelihood defined earlier)

² In probabilistic PCA, z was continuous, so we integrated it out.
Joint Distribution, Likelihood, and Posterior (2)

• Posterior for the k-th indicator zk, given an arbitrary single data point x:


      p(zk = 1|x) = p(zk = 1)p(x|zk = 1) / Σ_{j=1}^{K} p(zj = 1)p(x|zj = 1) = πk N(x|µk, Σk) / Σ_{j=1}^{K} πj N(x|µj, Σj)

• Now, for all data samples X, each data point xn has its own zn = [zn1, . . . , znK]^T, but with the same prior π:

      p(znk = 1|xn) = p(znk = 1)p(xn|znk = 1) / Σ_{j=1}^{K} p(znj = 1)p(xn|znj = 1) = πk N(xn|µk, Σk) / Σ_{j=1}^{K} πj N(xn|µj, Σj) = rnk

• Responsibilities are mathematically interpreted as posterior distributions.



Roadmap

(1) Gaussian Mixture Model


(2) Parameter Learning: MLE
(3) Latent-Variable Perspective for Probabilistic Modeling
(4) EM Algorithm



Revisiting EM Algorithm for MLE

The algorithm:

  S1. Initialize µk, Σk, πk
  S2. E-step:

          rnk = πk N(xn|µk, Σk) / Σ_{j=1}^{K} πj N(xn|µj, Σj)

  S3. M-step: Update µk, Σk, πk using rnk and go to S2.

The general EM view:

  • E-step. Expectation over z|x, θ^(t): given the current θ^(t) = (µk, Σk, πk), calculate the expected log-likelihood

          Q(θ|θ^(t)) = E_{z|x,θ^(t)}[log p(x, z|θ)] = ∫ log p(x, z|θ) p(z|x, θ^(t)) dz

  • M-step. Maximize Q(θ|θ^(t)) over θ to obtain the new model parameters.
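
For the GMM specifically, a sketch of what this expectation looks like, using log p(xn, zn|θ) = Σ_{k} znk (log πk + log N(xn|µk, Σk)) and E[znk | xn, θ^(t)] = rnk; maximizing it over θ reproduces the M-step updates above:

```latex
Q(\theta \mid \theta^{(t)})
  = \sum_{n=1}^{N} \mathbb{E}_{z_n \mid x_n, \theta^{(t)}}
    \Big[ \sum_{k=1}^{K} z_{nk}\,\big(\log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\big) \Big]
  = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk}\,\big(\log \pi_k + \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\big)
```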

• Only a local optimum is guaranteed, because the original optimization problem is not necessarily convex. L7(4)



Other Issues
• Model selection for finding a good K , e.g., using nested cross-validation
• Application: Clustering
◦ K-means: treat the means in the GMM as cluster centers and ignore the covariances.
◦ K-means gives a hard assignment, whereas the GMM gives a soft assignment (see the sketch after this list).
• EM algorithm: Highly generic in the sense that it can be used for parameter
learning in general latent-variable models
• The standard criticisms of MLE, such as overfitting, apply here as well. A fully Bayesian approach that places priors on the parameters is also possible, but is not covered in these notes.
• Other density estimation methods
◦ Histogram-based method: non-parametric method
◦ Kernel-density estimation: non-parametric method
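
A minimal sketch of the soft vs. hard assignment contrast using scikit-learn (assuming scikit-learn is available; the toy data and K = 2 are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy 2-D data from two overlapping clusters.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(gmm.predict_proba(X[:3]))  # soft assignment: responsibilities r_nk per point
print(kmeans.predict(X[:3]))     # hard assignment: one cluster index per point
```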



Questions?



Review Questions

1)

