Finite Mixture Modelling
Model Specification, Estimation & Application

Bettina Grün
Department of Statistics and Mathematics
Research Seminar, November 23, 2007


Finite mixture models

The finite mixture distribution is given by

  H(y \mid x, \Theta) = \sum_{k=1}^{K} \pi_k F_k(y \mid x, \theta_k)

with

  \sum_{k=1}^{K} \pi_k = 1, \qquad \pi_k > 0 \;\; \forall k.

In the following it is assumed that the component-specific density functions f_k exist and determine the mixture density h.
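As a quick illustration of the definition, the following R sketch evaluates and plots the density of a two-component Gaussian mixture; the weights, means and standard deviations are illustrative values chosen for this example, not taken from the talk.

## evaluate h(y) = sum_k pi_k f_k(y) for a univariate Gaussian mixture
mixture_density <- function(y, prior, mu, sigma) {
  ## for each component k, compute pi_k * f_k(y), then sum over the K columns
  rowSums(sapply(seq_along(prior),
                 function(k) prior[k] * dnorm(y, mu[k], sigma[k])))
}
y <- seq(-4, 8, length.out = 200)
h <- mixture_density(y, prior = c(0.3, 0.7), mu = c(0, 4), sigma = c(1, 1.5))
plot(y, h, type = "l", ylab = "mixture density h(y)")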
Finite mixture models

Types of applications:
  semi-parametric tool to estimate general distribution functions
  modeling unobserved heterogeneity

Special cases:
  model-based clustering
  mixtures of regression models
[Figure: scatter plots of example data, the response yn plotted against the covariate x.]
Estimation

Maximum likelihood: Expectation-Maximization (EM) algorithm (Dempster, Laird and Rubin, 1977)
  General method for ML estimation in models with unobserved latent variables: the complete likelihood containing the observed and the unobserved data is easier to estimate.
  Iterates between the
    E-step, which computes the expectation of the complete likelihood, and the
    M-step, where the expected complete likelihood is maximized.

Bayesian: Gibbs sampling (Diebolt and Robert, 1994)
  Markov chain Monte Carlo algorithm.
  Applicable when the joint posterior distribution is not known explicitly, but the conditional posterior distributions of each variable/subset of variables are known.
Missing data

The component-label vectors z_n = (z_{nk})_{k=1,...,K} are treated as missing data. It holds that
  z_{nk} \in \{0, 1\} and
  \sum_{k=1}^{K} z_{nk} = 1 for all n = 1, ..., N.

The complete log-likelihood is given by

  \log L_c(\Theta) = \sum_{k=1}^{K} \sum_{n=1}^{N} z_{nk} \left[ \log \pi_k + \log f_k(y_n \mid x_n, \theta_k) \right].

EM algorithm: E-step

Given the current parameter estimates \Theta^{(i)}, replace the missing data z_{nk} by the estimated a-posteriori probabilities

  \hat{z}_{nk}^{(i)} = P(k \mid y_n, x_n, \Theta^{(i)})
                     = \frac{\pi_k^{(i)} f_k(y_n \mid x_n, \theta_k^{(i)})}{\sum_{u=1}^{K} \pi_u^{(i)} f_u(y_n \mid x_n, \theta_u^{(i)})}.

The conditional expectation of \log L_c(\Theta) at the i-th step is given by

  Q(\Theta; \Theta^{(i)}) = E_{\Theta^{(i)}}\left[ \log L_c(\Theta) \mid y, x \right]
                          = \sum_{k=1}^{K} \sum_{n=1}^{N} \hat{z}_{nk}^{(i)} \left[ \log \pi_k + \log f_k(y_n \mid x_n, \theta_k) \right].
EM algorithm: M-step

The next parameter estimate is given by:

  \Theta^{(i+1)} = \arg\max_{\Theta} Q(\Theta; \Theta^{(i)}).

The estimates for the prior class probabilities are given by:

  \pi_k^{(i+1)} = \frac{1}{N} \sum_{n=1}^{N} \hat{z}_{nk}^{(i)}.

The component-specific parameter estimates are determined by:

  \theta_k^{(i+1)} = \arg\max_{\theta_k} \sum_{n=1}^{N} \hat{z}_{nk}^{(i)} \log f_k(y_n \mid x_n, \theta_k),

i.e. weighted ML estimation of the component-specific model.

M-step: Mixtures of Gaussian distributions

The solutions for the M-step are given in closed form:

  \mu_k^{(i+1)} = \frac{\sum_{n=1}^{N} \hat{z}_{nk}^{(i)} y_n}{\sum_{n=1}^{N} \hat{z}_{nk}^{(i)}},
  \qquad
  \Sigma_k^{(i+1)} = \frac{\sum_{n=1}^{N} \hat{z}_{nk}^{(i)} (y_n - \mu_k^{(i+1)})(y_n - \mu_k^{(i+1)})'}{\sum_{n=1}^{N} \hat{z}_{nk}^{(i)}}.
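To make the E- and M-steps concrete, here is a minimal R sketch of the EM iterations for a univariate mixture of two Gaussians on artificial data; the variable names (prior, mu, sigma, z) are chosen for this example, and the closed-form updates mirror the formulas above.

set.seed(42)
y <- c(rnorm(100, 0, 1), rnorm(100, 4, 1.5))   # artificial data, N = 200
K <- 2
prior <- rep(1 / K, K); mu <- sample(y, K); sigma <- rep(sd(y), K)
for (i in 1:200) {
  ## E-step: a-posteriori probabilities z_nk
  dens <- sapply(1:K, function(k) prior[k] * dnorm(y, mu[k], sigma[k]))
  z <- dens / rowSums(dens)
  ## M-step: closed-form updates (weighted means and standard deviations)
  Nk <- colSums(z)
  prior <- Nk / length(y)
  mu <- colSums(z * y) / Nk
  sigma <- sqrt(colSums(z * outer(y, mu, "-")^2) / Nk)
}
rbind(prior = prior, mu = mu, sigma = sigma)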
Estimation: EM algorithm

Advantages:
  The likelihood is increased in each step, hence the EM algorithm converges for bounded likelihoods.
  Relatively easy to implement:
    Different mixture models require only different M-steps.
    Weighted ML estimation of the component-specific model is sometimes already available.

Disadvantages:
  Standard errors have to be determined separately, as the information matrix is not required during the algorithm.
  Convergence only to a local optimum.
  Slow convergence; variants such as Stochastic EM (SEM) or Classification EM (CEM) exist.

EM algorithm: Number of components

Information criteria: e.g. AIC, BIC, ICL.
Likelihood ratio test statistic: comparison of nested models where the smaller model is derived by fixing one parameter at the border of the parameter space.
  Regularity conditions are not fulfilled.
  The asymptotic null distribution is not the usual chi-squared distribution with degrees of freedom equal to the difference between the numbers of parameters under the null and alternative hypotheses.
  Remedies: distributional results for special cases, bootstrapping (see the sketch below).
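As referenced above, a parametric bootstrap can approximate the null distribution of the LRT statistic. The following R sketch is one way this could look on artificial data: the one-component Gaussian is fitted directly, while stepFlexmix() from flexmix is used for the two-component alternative; sample sizes and the number of bootstrap replications are illustrative.

library(flexmix)
set.seed(1)
y <- c(rnorm(60, 0, 1), rnorm(40, 3, 1))            # artificial data
loglik1 <- function(y)                               # ML fit of a single Gaussian
  sum(dnorm(y, mean(y), sqrt(mean((y - mean(y))^2)), log = TRUE))
loglik2 <- function(d)                               # best of several EM runs with K = 2
  max(logLik(stepFlexmix(y ~ 1, data = d, k = 2, nrep = 3, verbose = FALSE)))
obs <- 2 * (loglik2(data.frame(y = y)) - loglik1(y)) # observed LRT statistic
lrt0 <- replicate(49, {                              # simulate under the fitted K = 1 null
  ystar <- rnorm(length(y), mean(y), sqrt(mean((y - mean(y))^2)))
  2 * (loglik2(data.frame(y = ystar)) - loglik1(ystar))
})
mean(c(lrt0, obs) >= obs)                            # bootstrap p-value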
Bayesian estimation

Determine the posterior density using Bayes' theorem:

  p(\Theta \mid Y, X) \propto h(Y \mid X, \Theta)\, p(\Theta),

where p(\Theta) is the prior and Y = (y_n)_n and X = (x_n)_n.

Standard prior distributions:
  Proper priors: improper priors give improper posteriors.
  Independent priors for the component weights and the component-specific parameters.
  Conjugate priors for the complete likelihood:
    Dirichlet distribution D(e_{0,1}, ..., e_{0,K}) for the component weights, which is the conjugate prior for the multinomial distribution.
    Priors on the component-specific parameters depend on the underlying distribution family.
  Invariant priors, e.g. the parameter of the Dirichlet prior is constant over all components: e_{0,k} \equiv e_0.

Estimation: Gibbs sampling

Starting with Z^{(0)} = (z_n^{(0)})_{n=1,...,N}, repeat the following steps for i = 1, ..., I_0, ..., I + I_0.

1. Parameter simulation conditional on the classification Z^{(i-1)}:
   (a) Sample \pi_1, ..., \pi_K from D((\sum_{n=1}^{N} z_{nk}^{(i-1)} + e_{0,k})_{k=1,...,K}).
   (b) Sample the component-specific parameters from the complete-data posterior p(\theta_1, ..., \theta_K \mid Z^{(i-1)}, Y).
   Store the actual values of all parameters \Theta^{(i)} = (\pi_k^{(i)}, \theta_k^{(i)})_{k=1,...,K}.
2. Classification of each observation (y_n, x_n) conditional on knowing \Theta^{(i)}:
   Sample z_n^{(i)} from the multinomial distribution with parameter equal to the posterior probabilities.

After discarding the burn-in draws, the draws I_0 + 1, ..., I + I_0 can be used to approximate all quantities of interest.
Example: Gaussian distribution

Assume an independence prior

  p(\mu_k, \Sigma_k^{-1}) \propto f_N(\mu_k; b_0, B_0)\, f_W(\Sigma_k^{-1}; c_0, C_0).

1. Parameter simulation conditional on the classification Z^{(i-1)}:
   (a) Sample \pi_1^{(i)}, ..., \pi_K^{(i)} from D((\sum_{n=1}^{N} z_{nk}^{(i-1)} + e_{0,k})_{k=1,...,K}).
   (b) Sample (\Sigma_k^{-1})^{(i)} in each group k from a Wishart W(c_k(Z^{(i-1)}), C_k(Z^{(i-1)})) distribution.
   (c) Sample \mu_k^{(i)} in each group k from a N(b_k(Z^{(i-1)}), B_k(Z^{(i-1)})) distribution.
2. Classification of each observation y_n conditional on knowing \Theta^{(i)}:

  P(z_{nk}^{(i)} = 1 \mid y_n, \Theta^{(i)}) \propto \pi_k^{(i)} f_N(y_n; \mu_k^{(i)}, \Sigma_k^{(i)}).

Estimation: Gibbs sampling

Advantages:
  Relatively easy to implement:
    Different mixture models differ only in the parameter simulation step.
    Parameter simulation conditional on the classification is sometimes already available.

Disadvantages:
  Might fail to escape the attraction area of one mode, so not all posterior modes are visited.
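A minimal R sketch of this sampler for a univariate analogue of the Gaussian example above (artificial data; a normal prior on the means and a gamma prior on the precisions in place of the multivariate Wishart, a Dirichlet prior on the weights; all hyperparameter values are illustrative assumptions):

set.seed(2)
y <- c(rnorm(60, 0, 1), rnorm(40, 4, 1)); N <- length(y); K <- 2
e0 <- 1; b0 <- mean(y); B0 <- var(y); c0 <- 1; C0 <- var(y)  # illustrative hyperparameters
z <- sample(1:K, N, replace = TRUE)                          # initial classification Z^(0)
mu <- tapply(y, z, mean); prec <- rep(1 / var(y), K)
draws <- matrix(NA, 1000, 2 * K,
                dimnames = list(NULL, c("pi1", "pi2", "mu1", "mu2")))
for (i in 1:1000) {
  nk <- tabulate(z, K)
  ## (a) weights from the Dirichlet full conditional (via normalised gamma draws)
  g <- rgamma(K, nk + e0); pi_k <- g / sum(g)
  ## (b), (c) precisions and means from their full conditionals, given z
  for (k in 1:K) {
    yk <- y[z == k]
    prec[k] <- rgamma(1, c0 + nk[k] / 2, C0 + sum((yk - mu[k])^2) / 2)
    Bk <- 1 / (1 / B0 + nk[k] * prec[k])
    mu[k] <- rnorm(1, Bk * (b0 / B0 + prec[k] * sum(yk)), sqrt(Bk))
  }
  ## classification step: sample each z_n from its posterior probabilities
  p <- sapply(1:K, function(k) pi_k[k] * dnorm(y, mu[k], 1 / sqrt(prec[k])))
  z <- apply(p, 1, function(pr) sample(1:K, 1, prob = pr))
  draws[i, ] <- c(pi_k, mu)
}
colMeans(draws[-(1:200), ])   # posterior means after discarding the burn-in draws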
Gibbs sampling: Number of components

Bayes factors.
Sampling schemes with a varying number of components:
  reversible-jump MCMC
  inclusion of birth-and-death processes

Label switching

The posterior distribution is invariant under a permutation of the components with the same component-specific model.
Determine a unique labelling for component-specific inference:
  Impose a suitable ordering constraint, e.g. \mu_s < \mu_t for all s, t \in \{1, ..., K\} with s < t (see the sketch below).
  Minimize the distance to the Maximum-A-Posteriori (MAP) estimate.
  Fix the component membership for some observations.
  Relabelling algorithms.
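For the ordering-constraint approach, a minimal post-processing sketch in R; mu_draws and pi_draws are assumed to be (iterations x K) matrices of stored posterior draws, such as the draws matrix from the Gibbs sketch above.

## relabel stored draws so that the component means are increasing in each iteration
relabel_by_mean <- function(mu_draws, pi_draws) {
  for (i in seq_len(nrow(mu_draws))) {
    ord <- order(mu_draws[i, ])        # permutation that sorts the component means
    mu_draws[i, ] <- mu_draws[i, ord]
    pi_draws[i, ] <- pi_draws[i, ord]  # apply the same permutation to the weights
  }
  list(mu = mu_draws, pi = pi_draws)
}
## e.g. applied to the Gibbs sketch: relabel_by_mean(draws[, 3:4], draws[, 1:2])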
Initialization

Construct a suitable parameter vector \Theta^{(0)}:
  random
  other estimation methods: e.g. moment estimators
Classify observations / assign a-posteriori probabilities to each observation:
  random
  cluster analysis results: e.g. hierarchical clustering, k-means

Extensions and special cases

Model-based clustering:
  latent class analysis: multivariate discrete observations where the marginal distributions in the components are independent
  mixtures of factor analyzers
  mixtures of t-distributions
Mixtures of regressions:
  mixtures of generalized linear models
  mixtures of generalized linear mixed models
Covariates for the component sizes: concomitant variable models (see the sketch after this list).
Impose equality constraints between component-specific parameters.
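For the concomitant variable extension, a minimal flexmix sketch: a hypothetical data frame dat with response y, regressor x and concomitant variable w is assumed, and the component weights are modelled by a multinomial logit in w via FLXPmultinom().

library(flexmix)
mix <- flexmix(y ~ x, data = dat, k = 2,
               model = FLXMRglm(family = "gaussian"),
               concomitant = FLXPmultinom(~ w))
summary(mix)
parameters(mix, which = "concomitant")   # coefficients of the concomitant model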
Software in R

Model-based clustering: mclust (Fraley and Raftery, 2002) for Gaussian mixtures:
  specify different models depending on the structure of the variance-covariance matrices (volume, shape, orientation):

    \Sigma_k = \lambda_k D_k \operatorname{diag}(a_k) D_k'

  initialize the EM algorithm with the solution from an agglomerative hierarchical clustering algorithm (a usage sketch follows below)
Clusterwise regression: flexmix (Leisch, 2004)
See also the CRAN Task View "Cluster Analysis & Finite Mixture Models".
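A minimal sketch of the mclust interface, using the diabetes data that also appears in the clustering example later in the talk; Mclust() selects both the covariance parameterisation and the number of components by BIC.

library(mclust)
data("diabetes", package = "mclust")
fit <- Mclust(diabetes[, 2:4], G = 1:5)   # G: candidate numbers of components
summary(fit)
fit$modelName                             # selected covariance structure, e.g. "VVV"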
Software: FlexMix

The function flexmix() provides the E-step and all data handling.
The M-step is supplied by the user, similar to glm() families.
Multiple independent responses from different families.
Currently bindings to several GLM families exist (Gaussian, Poisson, Gamma, Binomial).
Weighted, hard (CEM) and random (SEM) classification.
Components with prior probability below a user-specified threshold are automatically removed during iteration.
FlexMix Design

Primary goal is extensibility: ideal for trying out new mixture models.
Not a replacement for specialized mixture packages like mclust, but a complement.
Usage of S4 classes and methods.
Formula-based interface.
Multivariate responses (a sketch follows below):
  combination of univariate families: assumption of independence (given x); each response may have its own model formula, i.e., a different set of regressors
  multivariate families: if a family handles the multivariate response directly, then arbitrary multivariate response distributions are possible
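A minimal sketch of the first case, independent univariate families with response-specific formulas; the data frame dat with responses y1, y2 and regressor x is a hypothetical example, and the list-of-models interface follows the flexmix documentation (Leisch, 2004).

library(flexmix)
mix <- flexmix(. ~ x, data = dat, k = 2,
               model = list(FLXMRglm(y1 ~ ., family = "gaussian"),          # y1 ~ x
                            FLXMRglm(y2 ~ . + I(x^2), family = "poisson"))) # y2 ~ x + x^2
summary(mix)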
Example: Clustering

[Figure: pairwise scatter plots of the diabetes data (glucose, insulin, sspg).]

> library("flexmix")
> data("diabetes", package = "mclust")
> diabetes_data <- as.matrix(diabetes[, 2:4])
> (mix <- stepFlexmix(diabetes_data ~ 1, k = 1:5,
+    model = FLXMCmvnorm(diag = FALSE),
+    nrep = 10))
1 : * * * * * * * * * *
2 : * * * * * * * * * *
3 : * * * * * * * * * *
4 : * * * * * * * * * *
5 : * * * * * * * * * *

Call:
stepFlexmix(diabetes_data ~ 1, model = FLXMCmvnorm(diag = FALSE),
    k = 1:5, nrep = 10)

  iter converged k k0    logLik      AIC      BIC      ICL
1    2      TRUE 1  1 -2545.833 5109.666 5136.456 5136.456
2   12      TRUE 2  2 -2354.674 4747.347 4803.905 4811.644
3   24      TRUE 3  3 -2303.557 4665.113 4751.439 4770.353
4   36      TRUE 4  4 -2287.605 4653.210 4769.302 4793.502
5   60      TRUE 5  5 -2274.655 4647.309 4793.169 4822.905

> plot(mix)
Example: Clustering

[Figure: AIC, BIC and ICL plotted against the number of components.]

> (mix_best <- getModel(mix))

Call:
stepFlexmix(diabetes_data ~ 1, model = FLXMCmvnorm(diag = FALSE),
    k = 3, nrep = 10)

Cluster sizes:
 1  2  3
82 28 35

convergence after 24 iterations

> summary(mix_best)

Call:
stepFlexmix(diabetes_data ~ 1, model = FLXMCmvnorm(diag = FALSE),
    k = 3, nrep = 10)

       prior size post>0 ratio
Comp.1 0.540   82    101 0.812
Comp.2 0.199   28     96 0.292
Comp.3 0.261   35    123 0.285

log Lik. -2303.557 (df=29)
AIC: 4665.113   BIC: 4751.439

> table(cluster(getModel(mix)), diabetes$class)

    chemical normal overt
  1       10     72     0
  2        1      0    27
  3       25      4     6

> parameters(mix_best, component = 1, simplify = FALSE)
$center
  glucose   insulin      sspg
 91.00937 358.19098 164.14443

$cov
         glucose   insulin      sspg
glucose 58.21456   80.1404   16.8295
insulin 80.14039 2154.9810  347.6972
sspg    16.82950  347.6972 2484.1538

> plot(mix_best, mark = 2)

[Figure: pairwise scatter plots of the diabetes data (glucose, insulin, sspg) with the fitted clusters; observations of component 2 are marked.]
Example: Clustering

[Figure: rootogram of the posterior probabilities > 1e-04 for Comp. 1, Comp. 2 and Comp. 3.]


Example: Regression

[Figure: aphids data, number of infected plants plotted against number of aphids.]
Example: Regression

> data("aphids", package = "mixreg")
> (mix <- stepFlexmix(n.inf ~ n.aphids, k = 2, data = aphids,
+    nrep = 10))
2 : * * * * * * * * * *

Call:
stepFlexmix(n.inf ~ n.aphids, data = aphids, k = 2, nrep = 10)

Cluster sizes:
 1  2
23 28

convergence after 17 iterations
Example: Regression

> posterior(mix)[1:4, ]
          [,1]        [,2]
[1,] 0.9949732 0.005026814
[2,] 0.9949769 0.005023128
[3,] 0.2098020 0.790198026
[4,] 0.2050383 0.794961704

> predict(mix, newdata = data.frame(n.aphids = c(0, 300)))
$Comp.1
       [,1]
1  3.458813
2 20.047842

$Comp.2
       [,1]
1 0.8679776
2 1.5740946
Example: Regression

[Figure: number of infected plants against number of aphids with the fitted component regressions.]

> refit(mix)
Call:
refit(mix)

Number of components: 2

$Comp.1
             Estimate Std. Error z value  Pr(>|z|)
(Intercept) 3.4585759  1.3730364  2.5189   0.01177
n.aphids    0.0552974  0.0090624  6.1019 1.048e-09

$Comp.2
             Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8679003  0.5017007  1.7299  0.08365
n.aphids    0.0023539  0.0035375  0.6654  0.50578

> plot(refit(mix))
Example: Regression

[Figure: plot(refit(mix)), estimated coefficients ((Intercept), n.aphids) for Comp. 1 and Comp. 2.]


Applications

Market segmentation: find groups of customers who share
  characteristics: e.g. groups of tourists with similar behaviours at their destination
  reactions: e.g. customers with similar price and other marketing mix elasticities in choice models
Account for heterogeneity between customers.
Develop segment-specific marketing strategies.
Monographs

D. Böhning. Computer Assisted Analysis of Mixtures and Applications: Meta-Analysis, Disease Mapping, and Others. Chapman & Hall/CRC, London, 1999.

S. Frühwirth-Schnatter. Finite Mixture and Markov Switching Models. Springer Series in Statistics. Springer, New York, 2006.

B. G. Lindsay. Mixture Models: Theory, Geometry, and Applications. The Institute for Mathematical Statistics, Hayward, California, 1995.

G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.

G. J. McLachlan and D. Peel. Finite Mixture Models. Wiley, 2000.

References

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B, 39:1-38, 1977.

J. Diebolt and C. P. Robert. Estimation of finite mixture distributions through Bayesian sampling. Journal of the Royal Statistical Society B, 56:363-375, 1994.

C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association, 97(458):611-631, 2002.

F. Leisch. FlexMix: A general framework for finite mixture models and latent class regression in R. Journal of Statistical Software, 11(8), 2004. URL http://www.jstatsoft.org/v11/i08/.