Probabilistic Models with Latent Variables
Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Density Estimation Problem
• Learning from unlabeled data
• Unsupervised learning, density estimation
• Empirical distribution typically has multiple modes
Density Estimation Problem
[Figures from http://yulearning.blogspot.co.uk and http://courses.ee.sun.ac.za/Pattern_Recognition_813]
Density Estimation Problem
• Convex combination of unimodal pdfs gives a multimodal pdf:
  $p(x) = \sum_{k=1}^{K} \pi_k \, p_k(x)$, where $\pi_k \ge 0$ and $\sum_{k=1}^{K} \pi_k = 1$
• Physical interpretation
  • Sub-populations of the data
Latent Variables
• Introduce a new latent variable $z_i$ for each observation $x_i$
• Latent / hidden: not observed in the data
• Probabilistic interpretation
  • Mixing weights: $\pi_k = p(z_i = k)$
  • Mixture densities: $p_k(x_i) = p(x_i \mid z_i = k)$
Generative Mixture Model
• Generative process:
  For $i = 1, \dots, N$:
    $z_i \sim \mathrm{Cat}(\pi)$
    $x_i \sim p(x \mid z_i)$
• Marginalizing out $z_i$ recovers the mixture distribution
  $p(x_i) = \sum_{k=1}^{K} \pi_k \, p(x_i \mid z_i = k)$
• Plate notation: $z_i \rightarrow x_i$, replicated $N$ times
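To make the generative process concrete, here is a minimal NumPy sketch that samples from a two-component univariate Gaussian mixture; the particular parameter values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up mixture parameters for illustration
pi = np.array([0.3, 0.7])      # mixing weights, sum to 1
mu = np.array([-2.0, 3.0])     # component means
sigma = np.array([0.5, 1.0])   # component standard deviations

N = 1000
# Sample z_i ~ Cat(pi) for each data point
z = rng.choice(len(pi), size=N, p=pi)
# Sample x_i ~ N(mu[z_i], sigma[z_i]^2)
x = rng.normal(loc=mu[z], scale=sigma[z])

# The empirical distribution of x is multimodal even though
# each component density is unimodal.
```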
Tasks in a Mixture Model
• Inference: compute the posterior over the latent variables, $p(z_i = k \mid x_i, \theta)$
• Parameter estimation
  • Find parameters that, e.g., maximize the likelihood $\prod_i \sum_k \pi_k \, p(x_i \mid z_i = k, \theta)$
  • Does not decouple according to classes: the sum over components sits inside the log
  • Non-convex, many local optima
Example: Gaussian Mixture Model
• Model:
  For $i = 1, \dots, N$: \; $z_i \sim \mathrm{Cat}(\pi)$, \; $x_i \mid z_i = k \sim \mathcal{N}(\mu_k, \Sigma_k)$
• Inference (responsibilities):
  $r_{ik} = p(z_i = k \mid x_i, \theta) = \dfrac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$
  • Has the form of a soft-max function
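A minimal sketch of this inference step for a GMM using SciPy; the function name `responsibilities` and its interface are choices made for illustration, not taken from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, pi, mu, Sigma):
    """r[i, k] = p(z_i = k | x_i) for a Gaussian mixture.

    X: (N, D) data, pi: (K,) weights, mu: (K, D) means,
    Sigma: (K, D, D) covariance matrices.
    """
    N, K = X.shape[0], len(pi)
    r = np.zeros((N, K))
    for k in range(K):
        # unnormalized posterior: pi_k * N(x_i | mu_k, Sigma_k)
        r[:, k] = pi[k] * multivariate_normal.pdf(X, mean=mu[k], cov=Sigma[k])
    r /= r.sum(axis=1, keepdims=True)  # normalize over components (soft-max form)
    return r
```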
Example: Gaussian Mixture Model
• Log-likelihood:
  $\ell(\theta) = \sum_{i=1}^{N} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$
• Which training instance comes from which component?
• No closed-form solution for maximizing $\ell(\theta)$
  • Possibility 1: gradient descent, etc.
  • Possibility 2: Expectation Maximization
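A sketch of this log-likelihood computed in log space with the log-sum-exp trick for numerical stability; again, the function name and interface are illustrative assumptions.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, pi, mu, Sigma):
    """sum_i log sum_k pi_k N(x_i | mu_k, Sigma_k), computed stably in log space."""
    N, K = X.shape[0], len(pi)
    log_p = np.zeros((N, K))
    for k in range(K):
        # log of pi_k * N(x_i | mu_k, Sigma_k)
        log_p[:, k] = np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma[k])
    return logsumexp(log_p, axis=1).sum()
```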
Expectation Maximization Algorithm
• Observation: if the values of the latent variables $z_i$ were known, the likelihood would be easy to maximize
• Key idea: iterative updates
• Given parameter estimates, “infer” all variables
• Given inferred variables, maximize wrt parameters
• Questions
• Does this converge?
• What does this maximize?
Expectation Maximization Algorithm
• Complete log-likelihood:
  $\ell_c(\theta) = \sum_{i=1}^{N} \log p(x_i, z_i \mid \theta)$
• Problem: the $z_i$ are not known
• Possible solution: replace $\ell_c(\theta)$ with its conditional expectation
• Expected complete log-likelihood:
  $Q(\theta, \theta^{t-1}) = \mathbb{E}_{p(Z \mid X,\, \theta^{t-1})}\!\left[\ell_c(\theta)\right]$, where $\theta^{t-1}$ are the current parameters
Expectation Maximization Algorithm
• For mixture models this expands to
  $Q(\theta, \theta^{t-1}) = \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \log \pi_k + \sum_{i=1}^{N} \sum_{k=1}^{K} r_{ik} \log p(x_i \mid \theta_k)$,
  where $r_{ik} = p(z_i = k \mid x_i, \theta^{t-1})$
• Compare with the likelihood for a generative classifier (see below)
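To make the comparison concrete (this elaboration is mine, following the standard argument): with labeled data the responsibilities become hard indicators $r_{ik} = \mathbb{1}[y_i = k]$, and the objective reduces to
$\ell(\theta) = \sum_{i=1}^{N} \log \pi_{y_i} + \sum_{i=1}^{N} \log p(x_i \mid \theta_{y_i})$,
which decouples across classes and has closed-form ML estimates; in the mixture model the soft weights $r_{ik}$ couple the components.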
Expectation Maximization Algorithm
• Expectation (E) step
  • Update the responsibilities $r_{ik}$ using the current parameters
• Maximization (M) step
  • Maximize $Q(\theta, \theta^{t-1})$ w.r.t. the parameters
• Overall algorithm (a sketch follows below)
  • Initialize all latent variables
  • Iterate until convergence:
    • M step
    • E step
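A minimal sketch of this overall loop for a GMM, assuming the illustrative `responsibilities`, `gmm_log_likelihood`, and `m_step` helpers sketched around the neighbouring slides; the Dirichlet initialization of the responsibilities and the convergence test on the log-likelihood change are my own choices.

```python
import numpy as np

def em_gmm(X, K, n_iter=100, tol=1e-6, seed=0):
    """Run EM for a K-component Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    # Initialize the latent variables: random soft assignments r_{ik}
    r = rng.dirichlet(np.ones(K), size=N)

    prev_ll = -np.inf
    for _ in range(n_iter):
        pi, mu, Sigma = m_step(X, r)             # M step: maximize Q w.r.t. parameters
        r = responsibilities(X, pi, mu, Sigma)   # E step: update responsibilities
        ll = gmm_log_likelihood(X, pi, mu, Sigma)
        if ll - prev_ll < tol:                   # log-likelihood is non-decreasing
            break
        prev_ll = ll
    return pi, mu, Sigma, r
```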
Example: EM for GMM
• The E step remains the same for all mixture models: compute the responsibilities $r_{ik}$
• M step (closed form for a GMM; a sketch follows below):
  $\pi_k = \frac{1}{N} \sum_{i} r_{ik}$, \quad $\mu_k = \frac{\sum_{i} r_{ik} \, x_i}{\sum_{i} r_{ik}}$, \quad $\Sigma_k = \frac{\sum_{i} r_{ik} (x_i - \mu_k)(x_i - \mu_k)^{\top}}{\sum_{i} r_{ik}}$
• Compare with a generative classifier: the same updates with hard labels in place of $r_{ik}$
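A minimal NumPy sketch of these M-step updates; the `m_step` name and the small ridge added to each covariance for numerical stability are illustrative choices.

```python
import numpy as np

def m_step(X, r, reg=1e-6):
    """Closed-form GMM M-step given responsibilities r of shape (N, K)."""
    N, D = X.shape
    Nk = r.sum(axis=0)                 # effective count per component
    pi = Nk / N                        # mixing weights
    mu = (r.T @ X) / Nk[:, None]       # responsibility-weighted means
    Sigma = np.zeros((len(Nk), D, D))
    for k in range(len(Nk)):
        Xc = X - mu[k]                 # center data at the component mean
        Sigma[k] = (r[:, k, None] * Xc).T @ Xc / Nk[k] + reg * np.eye(D)
    return pi, mu, Sigma
```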
Analysis of EM Algorithm
• Expected complete LL is a lower bound on LL
• EM iteratively maximizes this lower bound
• Converges to a local maximum of the loglikelihood
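Why the bound holds (a standard derivation via Jensen's inequality, included here for completeness; $q_i(z_i) = p(z_i \mid x_i, \theta^{t-1})$ is the posterior used in the E step):

$\ell(\theta) = \sum_{i} \log \sum_{z_i} q_i(z_i) \frac{p(x_i, z_i \mid \theta)}{q_i(z_i)} \;\ge\; \sum_{i} \sum_{z_i} q_i(z_i) \log \frac{p(x_i, z_i \mid \theta)}{q_i(z_i)} \;=\; Q(\theta, \theta^{t-1}) + \sum_{i} \mathbb{H}(q_i) \;\ge\; Q(\theta, \theta^{t-1})$

The entropies $\mathbb{H}(q_i)$ are non-negative and do not depend on $\theta$, and the bound is tight at $\theta = \theta^{t-1}$; hence each M step cannot decrease the log-likelihood.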
Bayesian / MAP Estimation
• ML estimation within EM can overfit (e.g., a Gaussian component can collapse onto a single data point)
• Possible to perform MAP instead of MLE in M-step
• EM is partially Bayesian
• Posterior distribution over latent variables
• Point estimate over parameters
• Fully Bayesian approach is called Variational Bayes
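As one concrete instance of a MAP M-step (a standard result stated here for illustration, not taken from the slides): with a $\mathrm{Dir}(\alpha)$ prior on the mixing weights, the update becomes

$\pi_k = \dfrac{\sum_{i} r_{ik} + \alpha_k - 1}{N + \sum_{j} \alpha_j - K}$,

and conjugate priors on $(\mu_k, \Sigma_k)$ similarly regularize the mean and covariance updates, which guards against components collapsing onto single data points.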
(Lloyd’s) K Means Algorithm
• Hard EM for a Gaussian Mixture Model
  • Point estimate of parameters (as usual)
  • Point estimate of the latent variables: $z_i^* = \arg\max_k p(z_i = k \mid x_i, \theta)$
  • Spherical Gaussian mixture components: $\Sigma_k = \sigma^2 I$,
    where the hard assignment reduces to $z_i^* = \arg\min_k \lVert x_i - \mu_k \rVert^2$
• Most popular “hard” clustering algorithm (a sketch follows below)
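A minimal sketch of Lloyd's algorithm under these assumptions (hard assignment to the nearest mean, then mean updates); the initialization by random data points is a common choice, not prescribed by the slides.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm: hard EM for a spherical, equal-weight GMM."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize means at random points
    for _ in range(n_iter):
        # "E" step (hard): assign each point to the nearest mean
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        z = d2.argmin(axis=1)
        # "M" step: each mean becomes the average of its assigned points
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, z
```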
K Means Problem
• Given data $x_1, \dots, x_N$, find $K$ “means” $\mu_1, \dots, \mu_K$ and data assignments $z_1, \dots, z_N$ such that the distortion
  $J(\mu, Z) = \sum_{i=1}^{N} \lVert x_i - \mu_{z_i} \rVert^2$
  is minimized
• Note: each $z_i$ can equivalently be written as a $K$-dimensional binary (one-hot) vector
Model selection: Choosing K for GMM
• Cross-validation (a sketch follows below)
  • Plot the likelihood on the training set and on a validation set for increasing values of K
  • Likelihood on the training set keeps improving
  • Likelihood on the validation set drops after the “optimal” K
• Does not work for K-means! Why?
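A minimal sketch of this procedure, reusing the hypothetical `em_gmm` and `gmm_log_likelihood` helpers from the earlier sketches with a simple train/validation split:

```python
import numpy as np

def select_K(X, K_values, val_frac=0.2, seed=0):
    """Fit a GMM for each K on a training split and score held-out log-likelihood."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(val_frac * len(X))
    X_val, X_train = X[idx[:n_val]], X[idx[n_val:]]

    scores = {}
    for K in K_values:
        pi, mu, Sigma, _ = em_gmm(X_train, K)
        scores[K] = gmm_log_likelihood(X_val, pi, mu, Sigma) / n_val
    return max(scores, key=scores.get), scores  # K with the best held-out likelihood
```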
Principal Component Analysis: Motivation
• Dimensionality reduction
• Reduces #parameters to estimate
• Data often resides in a much lower dimension, e.g., on a line in a 3D space
• Provides “understanding”
• Mixture models very restricted
• Latent variables restricted to small discrete set
• Can we “relax” the latent variable?
Classical PCA: Motivation
• Revisit K-means as a matrix factorization: $X \approx Z W^{\top}$
  • W: $D \times K$ matrix whose columns are the cluster means
  • Z: $N \times K$ matrix whose rows are the (one-hot) cluster-membership vectors
• How can we relax Z and W?
Classical PCA: Problem
• Minimize the reconstruction error $J(W, Z) = \lVert X - Z W^{\top} \rVert_F^2$
  • X: $N \times D$ data matrix
  • Arbitrary Z of size $N \times L$, the low-dimensional scores
  • Orthonormal W of size $D \times L$ (i.e., $W^{\top} W = I_L$)
Classical PCA: Optimal Solution
• Empirical covariance matrix $\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^{\top}$
  • Computed on scaled and centered data
• Optimal solution: $\hat{W} = V_L$, where $V_L$ contains the $L$ eigenvectors for the $L$ largest eigenvalues of $\hat{\Sigma}$, and $\hat{z}_i = \hat{W}^{\top} x_i$
• Alternative solution via the Singular Value Decomposition (SVD) of $X$ (sketched below)
• W contains the “principal components” that capture the largest variance in the data
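A minimal NumPy sketch of classical PCA via the SVD of the centered data matrix; the function name and the choice to also return the projected scores are mine.

```python
import numpy as np

def pca(X, L):
    """Return the top-L principal components W (D x L) and scores Z (N x L)."""
    Xc = X - X.mean(axis=0)                     # center the data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:L].T                                # principal directions (eigenvectors of the covariance)
    Z = Xc @ W                                  # low-dimensional scores
    return W, Z
```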
Probabilistic PCA
• Generative model:
  $z_i \sim \mathcal{N}(0, I_L)$, \; $x_i \mid z_i \sim \mathcal{N}(W z_i + \mu, \Psi)$, with $\Psi$ forced to be diagonal
• Latent linear models
  • Factor analysis: general diagonal $\Psi$
  • Special case, (probabilistic) PCA: $\Psi = \sigma^2 I$
Visualization of Generative Process
[Figure from Bishop, PRML]
Relationship with Gaussian Density
• Marginalizing out $z_i$ gives $p(x_i) = \mathcal{N}(x_i \mid \mu, W W^{\top} + \Psi)$
• Why does $\Psi$ need to be restricted (e.g., to be diagonal)?
• Intermediate low-rank parameterization of the Gaussian covariance matrix, between full-rank and diagonal
• Compare #parameters (see below)
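For concreteness (standard counts, added here as an illustration): a full covariance has $D(D+1)/2$ free parameters and a diagonal covariance has $D$, while the low-rank parameterization $W W^{\top} + \Psi$ has roughly $DL + D$ for factor analysis and $DL + 1$ for PPCA (ignoring rotational redundancy in $W$), which grows only linearly in $D$ for fixed $L$.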
EM for PCA: Rod and Springs
[Figure from Bishop, PRML]
Advantages of EM
• Simpler than gradient methods w/ constraints
• Handles missing data
• Easy path for handling more complex models
• Not always the fastest method
Summary of Latent Variable Models
• Learning from unlabeled data
• Latent variables
• Discrete: Clustering / Mixture models ; GMM
• Continuous: Dimensionality reduction ; PCA
• Summary / “Understanding” of data
• Expectation Maximization Algorithm