Unsupervised Learning of Distribution

The document discusses unsupervised learning of distributions, focusing on learning probability weight functions for discrete variables and probability density functions for continuous variables. It covers various models including parametric and non-parametric models, with detailed explanations of learning discrete distributions, Gaussian distributions, and Gaussian mixture models (GMM) using maximum likelihood and maximum a posteriori estimation. Additionally, it explains the re-estimation of GMM parameters using the Expectation-Maximization (EM) algorithm.


UNSUPERVISED LEARNING OF

DISTRIBUTION
Bor-shen Lin
[email protected]
http://www.cs.ntust.edu.tw/~bslin
LEARNING OF DISTRIBUTION

• Unsupervised Learning (without output labels)
• Given {Xi} → learn the distribution $p(x)$
• Discrete variable
  • Learn a probability weight function (p.w.f.)
  • $p(x)$ should satisfy $\sum_{x} p(x) = 1$
• Continuous variable
  • Learn a probability density function (p.d.f.)
  • $p(x)$ should satisfy $\int_{-\infty}^{\infty} p(x)\,dx = 1$
LEARNING MODELS

• Parametric Model
  • Discrete distribution
  • Gaussian distribution
  • Gaussian mixture model
• Non-parametric Model (Instance-based Learning)
  • Nearest Neighbor Model
  • Kernel Model
LEARNING OF PARAMETRIC MODEL
1. Learning of Discrete Distribution
2. Learning of Gaussian Distribution
3. Learning of Gaussian Mixture Model (GMM)
ESTIMATION OF PARAMETERS

• θ represents a set of parameters of the probability density/mass function P(X|θ)
• X is a set of i.i.d. observations $X_1, X_2, \ldots, X_N$

$\hat{\theta}_{ML} = \arg\max_{\theta} P(X \mid \theta)$   • Maximum Likelihood Estimation

$\hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} \frac{P(\theta, X)}{P(X)} = \arg\max_{\theta} P(\theta)\, P(X \mid \theta)$   • Maximum A Posteriori Estimation
1. LEARNING OF DISCRETE DISTRIBUTION

X consists of $X_1, X_2, \ldots, X_N$, i.i.d. with p.w.f.

$P(X = \omega_k \mid \theta) = w_k$, $k = 1, 2, \ldots, n$,
where $w_1 + w_2 + \cdots + w_n = 1$ and θ consists of $w_1, w_2, \ldots, w_n$.

Then $P(X \mid \theta) = \prod_{i=1}^{N} P(X_i \mid \theta) = P(X \mid w_1, w_2, \ldots, w_n) = w_1^{c_1}\, w_2^{c_2} \cdots w_n^{c_n}$,

where $c_k$ is the number of occurrences of $\omega_k$ in X.
1. ESTIMATION OF DISCRETE DISTRIBUTION
$\hat{\theta}_{ML} = \arg\max_{\theta} P(X \mid \theta)$ is an optimization problem with the constraint $w_1 + w_2 + \cdots + w_n = 1$,
which can be solved with a Lagrange multiplier: $L(X, \theta) \equiv P(X \mid \theta) + \lambda \left( \sum_{k=1}^{n} w_k - 1 \right)$

$\frac{\partial L}{\partial w_k} = 0 \;\; \forall k \;\Rightarrow\; \frac{c_k}{w_k}\, w_1^{c_1} w_2^{c_2} \cdots w_n^{c_n} = -\lambda, \quad k = 1, \ldots, n$

$\Rightarrow \frac{c_1}{w_1} = \frac{c_2}{w_2} = \cdots = \frac{c_n}{w_n}$; summing over $k$ gives $\sum_{k=1}^{n} c_k = N$ while $\sum_{k=1}^{n} w_k = 1$, so

$\hat{w}_k = \frac{c_k}{\sum_{k'=1}^{n} c_{k'}} = \frac{c_k}{N}$

• Use the occurrence counts to estimate the probability weights of a p.w.f.
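A minimal NumPy sketch of this count-and-normalize estimate; the data values and the pseudo-count α (an add-α smoothed variant, as would arise from a Dirichlet prior in the MAP setting mentioned earlier) are illustrative assumptions, not from the slides.

```python
import numpy as np

# Illustrative sample of a discrete variable with n = 3 symbols (0, 1, 2)
X = np.array([0, 2, 1, 0, 0, 2, 1, 0, 2, 2])

counts = np.bincount(X, minlength=3)      # c_k: occurrence count of each symbol
w_ml = counts / counts.sum()              # ML estimate: w_k = c_k / N

# Smoothed variant: add-alpha pseudo-counts (posterior mean under a symmetric Dirichlet prior)
alpha = 1.0
w_smoothed = (counts + alpha) / (counts.sum() + alpha * len(counts))

print("ML      :", w_ml)        # [0.4, 0.2, 0.4]
print("Smoothed:", w_smoothed)
```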
CONSTRAINED OPTIMIZATION WITH LAGRANGE MULTIPLIER

• Maximize P(w) subject to the constraint $\sum_{m} w_m = 1$

$\hat{w}_k = \frac{w_k \frac{\partial P(\mathbf{w})}{\partial w_k}}{\sum_{m=1}^{M} w_m \frac{\partial P(\mathbf{w})}{\partial w_m}}$
2. LEARNING OF GAUSSIAN DISTRIBUTION
$X_1, X_2, \ldots, X_N$ are i.i.d. with p.d.f. $P(X \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $\theta = (\mu, \sigma^2)$

Then $P(X \mid \theta) = \prod_{i=1}^{N} P(X_i \mid \theta)$, where X consists of $X_1, X_2, \ldots, X_N$

$2 \log P(X \mid \theta) = \sum_{i=1}^{N} 2 \log P(X_i \mid \theta) = \sum_{i=1}^{N} 2 \log\!\left( (\sqrt{2\pi}\,\sigma)^{-1} e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} \right) = -\sum_{i=1}^{N} \left[ \log(2\pi\sigma^2) + \frac{(X_i-\mu)^2}{\sigma^2} \right]$

$= -\sum_{i=1}^{N} \left[ \log(2\pi) + \log(\sigma^2) + \frac{(X_i-\mu)^2}{\sigma^2} \right] = -N \log(2\pi) - \sum_{i=1}^{N} \left[ \log(\sigma^2) + \frac{(X_i-\mu)^2}{\sigma^2} \right]$

$l(\theta) \equiv 2 \log P(X \mid \theta) + N \log(2\pi)$ is monotonic with $P(X \mid \theta)$
GAUSSIAN DISTRIBUTION

$N(x; \mu, \sigma^2) \equiv \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

$E(X) = \mu$, $\mathrm{Var}(X) = \sigma^2$

(Figure: the bell-shaped curve of N(x; μ, σ²), centered at x = μ.)
ESTIMATION OF GAUSSIAN DISTRIBUTION

$\hat{\theta}_{ML} = \arg\max_{\theta} P(X \mid \theta) = \arg\max_{\theta} l(\theta)$

$\frac{\partial l(\mu, \sigma^2)}{\partial \mu} = 0: \;\; -\frac{\partial}{\partial \mu} \sum_{i=1}^{N} \frac{(X_i - \mu)^2}{\sigma^2} = \sum_{i=1}^{N} \frac{2 (X_i - \mu)}{\sigma^2} = 0$

$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} X_i$   • Every observation $X_i$ contributes to the estimation of μ (weight 1/N)

$\frac{\partial l(\mu, \sigma^2)}{\partial \sigma^2} = 0: \;\; \frac{\partial}{\partial \sigma^2} \sum_{i=1}^{N} \left[ \log(\sigma^2) + \frac{(X_i - \mu)^2}{\sigma^2} \right] = 0$

$\sum_{i=1}^{N} \left[ \frac{1}{\sigma^2} - \frac{(X_i - \mu)^2}{\sigma^4} \right] = 0 \;\Rightarrow\; N \sigma^2 = \sum_{i=1}^{N} (X_i - \mu)^2$

$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \hat{\mu})^2$   • Every observation $X_i$ contributes to the estimation of σ² (weight 1/N)
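A minimal NumPy sketch of the two closed-form ML estimates above; the simulated data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=1000)    # illustrative i.i.d. sample

mu_hat = X.mean()                     # (1/N) * sum(X_i)
var_hat = ((X - mu_hat) ** 2).mean()  # (1/N) * sum((X_i - mu_hat)^2), the ML (biased) variance

print(mu_hat, var_hat)                # close to 5.0 and 4.0
```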
GAUSSIAN MIXTURE MODEL (GMM)

$p(x) = \sum_{k=1}^{M} c_k\, N(x; \mu_k, \sigma_k^2)$

$\int_{-\infty}^{\infty} p(x)\,dx = \sum_{k=1}^{M} c_k \int_{-\infty}^{\infty} N(x; \mu_k, \sigma_k^2)\,dx = \sum_{k=1}^{M} c_k = 1.0$

(Figure: p(x) as a weighted sum of Gaussian bumps centered at μ1, μ2, μ3, μ4.)
PARTITION GAUSSIAN MODEL (PGM)

$p(x) \equiv \max_{k} N(x; \mu_k, \sigma_k^2)$

$\int_{-\infty}^{\infty} p(x)\,dx \neq 1.0$ in general (the pointwise maximum of the component densities is not normalized)

(Figure: p(x) as the pointwise maximum over the Gaussians centered at μ1, μ2, μ3, μ4.)
MULTI-DIMENSIONAL GAUSSIAN DISTRIBUTION
$N(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^n |\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}$

$\boldsymbol{\mu} \equiv E(X) = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x}$  (n × 1)
$\boldsymbol{\Sigma} \equiv E\big( (X - \boldsymbol{\mu})(X - \boldsymbol{\mu})^t \big)$  (n × n)

• x: n-dimensional vector
• Each Gaussian: μ as mean vector, Σ as covariance matrix
• The dimensions are stochastically independent when Σ is diagonal
• Multi-dimensional GMM: θ = {(c_m, μ_m, Σ_m)}
• # of parameters: M(1 + n + n²)
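A sketch of evaluating the multi-dimensional Gaussian log-density above; using a Cholesky factorization instead of inverting Σ directly is a design choice of this sketch, not something prescribed by the slides, and the numbers are illustrative.

```python
import numpy as np

def log_gaussian(x, mu, Sigma):
    """log N(x; mu, Sigma) for an n-dimensional x, following the formula above."""
    n = mu.shape[0]
    diff = x - mu
    L = np.linalg.cholesky(Sigma)                  # Sigma = L L^t
    z = np.linalg.solve(L, diff)                   # so diff^t Sigma^-1 diff = ||z||^2
    log_det = 2.0 * np.sum(np.log(np.diag(L)))     # log |Sigma|
    return -0.5 * (n * np.log(2 * np.pi) + log_det + z @ z)

x = np.array([0.5, -1.0])
mu = np.zeros(2)
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(np.exp(log_gaussian(x, mu, Sigma)))          # the density value N(x; mu, Sigma)
```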
3. LEARNING OF GMM
WITH EXPECTATION MAXIMIZATION (EM)

$X_1, X_2, \ldots, X_N$ are i.i.d. with p.d.f. $P(X \mid \theta) = \sum_{m=1}^{M} c_m\, N(X; \mu_m, \sigma_m^2)$,

where $N(X; \mu_m, \sigma_m^2) = \frac{1}{\sqrt{2\pi}\,\sigma_m} e^{-\frac{(x-\mu_m)^2}{2\sigma_m^2}}$, $\theta = \{c_m, \mu_m, \sigma_m\}$ and $\sum_{m=1}^{M} c_m = 1$.

Then $P(X \mid \theta) = \prod_{i=1}^{N} P(X_i \mid \theta)$, where X consists of $X_1, X_2, \ldots, X_N$.

The Lagrange function is $L(\theta) = \log P(X \mid \theta) + \lambda \left( \sum_{m=1}^{M} c_m - 1 \right) = \sum_{i=1}^{N} \log P(X_i \mid \theta) + \lambda \left( \sum_{m=1}^{M} c_m - 1 \right)$.
RE-ESTIMATION OF GMM PARAMETERS
$\frac{\partial L(\theta)}{\partial \mu_m} = \sum_{i=1}^{N} \frac{c_m \frac{\partial}{\partial \mu_m} P(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} = \sum_{i=1}^{N} \frac{c_m\, P(X_i; \mu_m, \sigma_m^2)\, \frac{(X_i - \mu_m)}{\sigma_m^2}}{P(X_i \mid \theta)} = 0$

Define $P(X_i, C_m) \equiv c_m\, P(X_i; \mu_m, \sigma_m^2)$, $\theta_m \equiv (c_m, \mu_m, \sigma_m)$,

$l(m, i) \equiv \frac{P(X_i, C_m)}{P(X_i)} = P(C_m \mid X_i)$

$\Rightarrow \sum_{i=1}^{N} l(m, i)(X_i - \mu_m) = 0 \;\Rightarrow\; \mu_m \sum_{i=1}^{N} l(m, i) = \sum_{i=1}^{N} l(m, i)\, X_i$

$\Rightarrow \hat{\mu}_m = \frac{\sum_{i=1}^{N} l(m, i)\, X_i}{\sum_{i=1}^{N} l(m, i)}$

• l(m, i): estimated probability that $X_i$ is produced by the m-th mixture (used as a weight)
• The denominator is for normalization
RE-ESTIMATION OF GMM PARAMETERS

$\frac{\partial L(\theta)}{\partial \sigma_m^2} = \sum_{i=1}^{N} \frac{c_m \frac{\partial}{\partial \sigma_m^2} P(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} = 0$

$\Rightarrow \sum_{i=1}^{N} \frac{P(X_i \mid \theta_m) \left( 1 - \frac{(X_i - \mu_m)^2}{\sigma_m^2} \right)}{P(X_i \mid \theta)} = 0$

$\Rightarrow \sum_{i=1}^{N} l(m, i) = \sum_{i=1}^{N} l(m, i)\, \frac{(X_i - \mu_m)^2}{\sigma_m^2}$

$\Rightarrow \hat{\sigma}_m^2 = \frac{\sum_{i=1}^{N} l(m, i)(X_i - \mu_m)^2}{\sum_{i=1}^{N} l(m, i)}$
CONCEPT
• $\mu_m$ is an independent variable that may be adjusted freely regardless of the other parameters.
• $L(\theta)$ has a unique global maximum with respect to $\mu_m$.
• That global maximum is located where $\frac{\partial L(\theta)}{\partial \mu_m} = 0$.
• A better value of $\mu_m$ is therefore guaranteed through the iteration formula for $\hat{\mu}_m$.
• Both $\mu_m$ and $\sigma_m$ are independent variables.
• The $c_m$'s are dependent variables, since $\sum_m c_m = 1$.
RE-ESTIMATION OF GMM PARAMETERS

$\frac{\partial L(\theta)}{\partial c_m} = \sum_{i=1}^{N} \frac{P(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} + \lambda = 0$

$P(X_i \mid \theta_m) \equiv c_m\, P(X_i; \mu_m, \sigma_m^2)$, $\theta_m \equiv (c_m, \mu_m, \sigma_m)$, $l(m, i) \equiv \frac{P(X_i \mid \theta_m)}{P(X_i \mid \theta)}$

$\Rightarrow -\lambda = \sum_{i=1}^{N} \frac{P(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} = \sum_{i=1}^{N} \frac{P(X_i \mid \theta_m)}{c_m\, P(X_i \mid \theta)} = \frac{\sum_{i=1}^{N} l(m, i)}{c_m}$

$\Rightarrow 1 = \sum_{m=1}^{M} c_m = \sum_{m=1}^{M} \frac{\sum_{i=1}^{N} l(m, i)}{-\lambda} \;\Rightarrow\; -\lambda = \sum_{m=1}^{M} \sum_{i=1}^{N} l(m, i)$

$\Rightarrow \hat{c}_m = \frac{\sum_{i=1}^{N} l(m, i)}{-\lambda} = \frac{\sum_{i=1}^{N} l(m, i)}{\sum_{m=1}^{M} \sum_{i=1}^{N} l(m, i)}$
EM REESTIMATES OF GMM PARAMETERS
(MULTI-DIMENSIONAL)
• $\hat{\boldsymbol{\mu}}_m = \frac{\sum_i l(m, i)\, \mathbf{x}_i}{\sum_i l(m, i)}$

• $\hat{\boldsymbol{\Sigma}}_m = \frac{\sum_i l(m, i)\, (\mathbf{x}_i - \boldsymbol{\mu}_m)(\mathbf{x}_i - \boldsymbol{\mu}_m)^t}{\sum_i l(m, i)}$

• $\hat{c}_m = \frac{\sum_i l(m, i)}{\sum_m \sum_i l(m, i)}$
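The re-estimation formulas above, together with the responsibilities l(m, i), form the E-step and M-step of EM. Below is a minimal NumPy sketch for a one-dimensional GMM; the initialization, data, and function names (em_gmm_1d, gaussian_pdf) are illustrative assumptions, not from the slides.

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def em_gmm_1d(X, M, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N = len(X)
    c = np.full(M, 1.0 / M)                      # mixture weights c_m
    mu = rng.choice(X, size=M, replace=False)    # illustrative initialization
    var = np.full(M, X.var())
    for _ in range(iters):
        # E-step: l[m, i] = c_m N(X_i; mu_m, var_m) / sum_m' c_m' N(X_i; mu_m', var_m')
        joint = c[:, None] * gaussian_pdf(X[None, :], mu[:, None], var[:, None])
        l = joint / joint.sum(axis=0, keepdims=True)
        # M-step: the re-estimation formulas from the slides
        Nm = l.sum(axis=1)                       # sum_i l(m, i)
        mu = (l * X[None, :]).sum(axis=1) / Nm
        var = (l * (X[None, :] - mu[:, None]) ** 2).sum(axis=1) / Nm
        c = Nm / N                               # since sum_m Nm = N
    return c, mu, var

rng = np.random.default_rng(1)
X = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(3, 1.0, 700)])
print(em_gmm_1d(X, M=2))
```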
LEARNING FOR GMM

(Flow chart of GMM training)
• Initial model θ = {(c_m, μ_m, Σ_m)}, obtained randomly or by splitting
• Expectation: compute l(m, i) from the training data {Xi} and the current model
• Maximization: re-estimate $\hat{\theta} = \{\theta_m\}$
• If not converged, iterate; otherwise output the final model θ* = {(c_m, μ_m, Σ_m)}
STEPS OF GMM TRAINING

1. Set the initial mixture as the mean and covariance of all training data {Xi} (M = 1).
2. Split the largest cluster into two clusters (see the sketch after this list)
   • means offset to either side of the original mean, equal weights
   • Other algorithm: LBG (1 → 2 → 4 → 8 → ...)
3. Re-estimate the model parameters iteratively until convergence.
4. Repeat steps 2 and 3 until M mixtures are obtained.
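A hedged sketch of the splitting step (step 2) for a one-dimensional model; choosing the component with the largest weight as the "largest cluster" and the size of the perturbation are assumptions made for illustration only.

```python
import numpy as np

def split_largest(c, mu, var, eps=0.1):
    """Split one mixture into two: perturb its mean to either side, halve its weight.
    Taking the component with the largest weight as 'the largest cluster' is an
    illustrative assumption; the variance is simply copied to both children."""
    m = int(np.argmax(c))                  # the mixture to split
    delta = eps * np.sqrt(var[m])
    c_new  = np.concatenate([np.delete(c,  m), [c[m] / 2, c[m] / 2]])
    mu_new = np.concatenate([np.delete(mu, m), [mu[m] - delta, mu[m] + delta]])
    v_new  = np.concatenate([np.delete(var, m), [var[m], var[m]]])
    return c_new, mu_new, v_new

# Start from a single Gaussian (M = 1) fitted to all data, then split
c, mu, var = np.array([1.0]), np.array([0.7]), np.array([2.0])
print(split_largest(c, mu, var))           # M = 2, means offset around 0.7
```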
DISTANCE BETWEEN GAUSSIANS - 1

• Bhattacharyya Divergence $D_B(f, g)$
• $D_B(f, g)$ → estimation of the classification error

$D_B(f, g) \equiv -\log \int \sqrt{f(x)\, g(x)}\, dx = \frac{1}{4} (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g)^t (\boldsymbol{\Sigma}_f + \boldsymbol{\Sigma}_g)^{-1} (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g) + \frac{1}{2} \log \left| \frac{\boldsymbol{\Sigma}_f + \boldsymbol{\Sigma}_g}{2} \right| - \frac{1}{4} \log |\boldsymbol{\Sigma}_f \boldsymbol{\Sigma}_g|$

Then the Bayes error $B_e(f, g) \equiv \frac{1}{2} \int \min(f(x), g(x))\, dx \le \frac{1}{2} e^{-D_B(f, g)}$
DISTANCE BETWEEN GAUSSIANS - 2

• Kullback-Leibler Divergence (KLD)

$D_{KL}(f, g) \equiv \int f(x) \log \frac{f(x)}{g(x)}\, dx = \frac{1}{2} \left[ \log \frac{|\boldsymbol{\Sigma}_g|}{|\boldsymbol{\Sigma}_f|} + \mathrm{Tr}\!\left( \boldsymbol{\Sigma}_g^{-1} \boldsymbol{\Sigma}_f \right) - d + (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g)^t \boldsymbol{\Sigma}_g^{-1} (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g) \right]$

Then $D_{KL}(f, g) \ge 2\, D_B(f, g)$
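A sketch computing both closed-form distances above for multivariate Gaussians; the use of slogdet and solve for numerical stability is a design choice of this sketch, and the example parameters are illustrative.

```python
import numpy as np

def bhattacharyya(mu_f, S_f, mu_g, S_g):
    d = mu_f - mu_g
    S = 0.5 * (S_f + S_g)
    term1 = 0.125 * d @ np.linalg.solve(S, d)      # = 1/4 * d^t (S_f + S_g)^-1 d
    _, logdet_S = np.linalg.slogdet(S)
    _, logdet_f = np.linalg.slogdet(S_f)
    _, logdet_g = np.linalg.slogdet(S_g)
    return term1 + 0.5 * logdet_S - 0.25 * (logdet_f + logdet_g)

def kl_gaussians(mu_f, S_f, mu_g, S_g):
    dim = len(mu_f)
    d = mu_f - mu_g
    Sg_inv = np.linalg.inv(S_g)
    _, logdet_f = np.linalg.slogdet(S_f)
    _, logdet_g = np.linalg.slogdet(S_g)
    return 0.5 * (logdet_g - logdet_f + np.trace(Sg_inv @ S_f) - dim + d @ Sg_inv @ d)

mu1, S1 = np.zeros(2), np.eye(2)
mu2, S2 = np.array([1.0, 0.0]), 2 * np.eye(2)
print(bhattacharyya(mu1, S1, mu2, S2), kl_gaussians(mu1, S1, mu2, S2))
# D_KL(f, g) >= 2 * D_B(f, g) should hold for any such pair
```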
SIMILARITY BETWEEN TWO GMMS
(Figure: the mixtures of GMM 1 are compared pairwise with the mixtures of GMM 2.)
• Pairwise distances between the mixtures of the two GMMs
• Bhattacharyya distance for each pair of Gaussians
• Weighted average of the pairwise distances as the overall similarity
APPLICATIONS OF GMM
• Parametric model of continuous variables
  • Unsupervised learning of p(X)
• Clustering (unsupervised)
  • Regarding every mixture as a cluster
• Classification (supervised)
  • Train p(X|Cm) for all m's (classes)
• Examples
  • Language identification
  • Gender identification
  • Speaker recognition
  • Image classification/tagging
GMM-BASED CLUSTERING

1. Every mixture of a GMM is regarded as a cluster
   • Similar to k-means clustering; however, the variances are used to normalize the distances when computing the probability (in the exponential term), whereas simple k-means uses the plain Euclidean distance
   • A GMM is trained, and each point X is assigned to a cluster according to (see the sketch after this list):
     $k^* = \arg\max_k l(X, k) = \arg\max_k \frac{c_k\, p_k(X)}{\sum_i c_i\, p_i(X)}$
2. A GMM may also be regarded as a point
   • Clustering of GMMs based on the distances between them
   • Example: speaker clustering (grouping speakers by training a GMM for each speaker)
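A minimal sketch of the hard-assignment rule from item 1 above for a one-dimensional GMM; the parameter values are illustrative, and the shared denominator is omitted since it does not affect the argmax.

```python
import numpy as np

def assign_cluster(x, c, mu, var):
    """Hard assignment: k* = argmax_k c_k N(x; mu_k, var_k), i.e. argmax_k l(x, k),
    since the denominator sum_i c_i p_i(x) is the same for every k."""
    pdf = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return int(np.argmax(c * pdf))

c   = np.array([0.3, 0.7])
mu  = np.array([-2.0, 3.0])
var = np.array([0.25, 1.0])
print(assign_cluster(2.1, c, mu, var))   # -> 1 (the mixture centered at 3.0)
```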
GMM-BASED CLASSIFIER
• Train a GMM for p(x|Ci), i = 0, 1, respectively
• ML Detector
  C* = argmax_i p(x|Ci)
• MAP Detector (given the prior distribution)
  C* = argmax_i p(Ci|x) = argmax_i p(Ci) p(x|Ci)

(Diagram: { xi } of class Ck → ML estimation → GMM for p(x|Ck))
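A minimal sketch of the ML and MAP decision rules above; the interface (a list of per-class log-likelihood functions) and the single-Gaussian class models are illustrative assumptions, in practice each function would be an EM-trained GMM as in the earlier sketch.

```python
import numpy as np

def classify(x, class_loglik, priors=None):
    """ML detector: argmax_i log p(x|C_i); MAP detector additionally adds log p(C_i)."""
    scores = np.array([ll(x) for ll in class_loglik])
    if priors is not None:                        # MAP: add the log prior
        scores = scores + np.log(priors)
    return int(np.argmax(scores))

# Two illustrative single-Gaussian "class models"
def make_loglik(mu, var):
    return lambda x: -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

models = [make_loglik(0.0, 1.0), make_loglik(3.0, 1.0)]
print(classify(1.2, models))                      # ML decision
print(classify(1.2, models, priors=[0.9, 0.1]))   # MAP decision with a skewed prior
```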
DISCRIMINATIVE TRAINING FOR GMM

• ML training
  • The objective function to be maximized is the likelihood function of each class
  • Each GMM is trained with the data of its own class
  • A sample of class k influences only the distribution of that class, i.e. p(x|Ck)
• Minimum classification error (MCE) training
  • The objective function to be minimized is the overall classification error
  • The GMMs of the different classes are trained jointly instead of individually
  • Every sample influences the distributions of all classes, i.e. p(x|Cj) for all j
MCE TRAINING
• $p_k(x)$ is a GMM with parameters $\{(c_{km}, \mu_{km}, \Sigma_{km})\}$: $p_k(x) = \sum_{m=1}^{M} c_{km}\, N_{km}(x)$
• Discriminant: $g_k(x) = \log p_k(x)$
• Misclassification measure:
  $d_k(x) = -g_k(x) + g_{\bar{k}}(x) = -g_k(x) + \log \left[ \frac{1}{K-1} \sum_{j \ne k} e^{\eta g_j(x)} \right]^{1/\eta}$
• Loss (sigmoid): $l_k(x) = \frac{1}{1 + e^{-\gamma d_k + \theta}}$
• Overall loss: $L(X) = \sum_{k=1}^{K} \sum_{x_i \in C_k} l_k(x_i)$
• Minimizing L leads to the minimization of the classification errors
• The parameters can be obtained by generalized probabilistic descent (GPD): $d\Lambda = -\epsilon \nabla L$
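A small sketch of the misclassification measure d_k(x) and the sigmoid loss l_k(x) defined above, given per-class discriminants g_j(x); the parameter values (η, γ, θ) and the example numbers are illustrative, and the GPD gradient updates themselves are not implemented here.

```python
import numpy as np

def mce_loss(g, k, eta=1.0, gamma=1.0, theta=0.0):
    """g: array of discriminants g_j(x) = log p_j(x) for all classes; k: true class.
    Returns (d_k, l_k) following the definitions on the slide."""
    others = np.delete(g, k)
    # anti-discriminant: log[ (1/(K-1)) * sum_{j != k} exp(eta * g_j) ]^(1/eta)
    g_bar = (1.0 / eta) * np.log(np.mean(np.exp(eta * others)))
    d_k = -g[k] + g_bar
    l_k = 1.0 / (1.0 + np.exp(-gamma * d_k + theta))   # sigmoid loss in (0, 1)
    return d_k, l_k

g = np.array([-3.1, -4.0, -2.8])    # illustrative log-likelihoods of 3 classes
print(mce_loss(g, k=0))             # d_k < 0 indicates a correct decision margin
```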
MCE FORMULA – DIAGONAL COVARIANCE
• For $x_i \in C_k$, define $\theta_{jm} \equiv \frac{c_{jm}\, p_{jm}}{p_j}$ and $r_k \equiv \gamma\, l_k (1 - l_k)$
• Mean updates (l-th dimension of the m-th mixture):
  $d\mu_{kml} = \varepsilon\, r_k\, \theta_{km}\, \frac{x_l - \mu_{kml}}{\sigma_{kml}^2}$
  $d\mu_{jml} = -\varepsilon\, r_k\, \frac{p_j}{\sum_{n \ne k} p_n}\, \theta_{jm}\, \frac{x_l - \mu_{jml}}{\sigma_{jml}^2}$ for $j \ne k$
• Variance updates:
  $d\sigma_{kml} = \varepsilon\, r_k\, \theta_{km}\, \frac{1}{\sigma_{kml}} \left[ \frac{(x_l - \mu_{kml})^2}{\sigma_{kml}^2} - 1 \right]$
  $d\sigma_{jml} = -\varepsilon\, r_k\, \frac{p_j}{\sum_{n \ne k} p_n}\, \theta_{jm}\, \frac{1}{\sigma_{jml}} \left[ \frac{(x_l - \mu_{jml})^2}{\sigma_{jml}^2} - 1 \right]$ for $j \ne k$

• Reference: Minimum classification error rate methods for speech recognition, IEEE Trans. on Speech and Audio Processing, 1997.
INSTANCE-BASED LEARNING

• Weakness of the parametric model
  • A restricted family of functions might over-simplify the real world
• Non-parametric learning
  • All samples are stored and used as the model
  • Also called instance-based learning or memory-based learning
  • Complexity increases as the data set grows
• Estimation of p(x)
  • A type of unsupervised learning
1. Nearest Neighbor Model
2. Kernel Model
NEAREST-NEIGHBOR MODELS
• Estimation of density
  • Use the largest distance D among the k nearest neighbors (figure: k = 3)
  • The larger the distance, the lower the density at the point x
  • k low → p(x) highly variable; k large → p(x) smooth
  • $p(\mathbf{x}) \propto \frac{k}{D^n}$ (see the sketch after this list)
• Distance measure
  • Euclidean distance might not be appropriate (e.g. $D = \mathbf{d}^t \boldsymbol{\Sigma}^{-1} \mathbf{d}$)
  • Should consider the physical meanings of the different dimensions
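A sketch of the k-nearest-neighbor density estimate; normalizing k/D^n by the volume of the n-dimensional ball of radius D (and dividing by N) is an assumption that goes slightly beyond the proportionality stated above, and the Euclidean distance is used only for simplicity.

```python
import numpy as np
from math import gamma, pi

def knn_density(x, data, k):
    """p(x) ~= k / (N * V_D), where V_D is the volume of the n-dimensional ball whose
    radius D is the distance to the k-th nearest neighbor of x."""
    x = np.atleast_1d(x)
    n = data.shape[1]
    dists = np.linalg.norm(data - x, axis=1)
    D = np.sort(dists)[k - 1]                        # distance to the k-th neighbor
    volume = (pi ** (n / 2) / gamma(n / 2 + 1)) * D ** n
    return k / (len(data) * volume)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))                     # illustrative 2-D sample
print(knn_density(np.array([0.0, 0.0]), data, k=10)) # roughly 1/(2*pi) at the origin
```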
KERNEL MODELS
• p(x) is estimated with the normalized sum of the kernel functions over all training instances {xi}:
  $p(\mathbf{x}) = \frac{1}{N} \sum_i K(\mathbf{x}, \mathbf{x}_i)$
• $K(\mathbf{x}, \mathbf{x}_i)$ is a measure of similarity that depends on $D(\mathbf{x}, \mathbf{x}_i)$
• A popular kernel (see the sketch below): $K(\mathbf{x}, \mathbf{x}_i) = \frac{1}{(2\pi w^2)^{d/2}}\, e^{-\frac{D(\mathbf{x}, \mathbf{x}_i)^2}{2 w^2}}$
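A sketch of the kernel (Parzen-window) estimate above with the Gaussian kernel; the bandwidth w and the data are illustrative.

```python
import numpy as np

def kernel_density(x, data, w=0.5):
    """p(x) = (1/N) * sum_i K(x, x_i) with the Gaussian kernel
    K(x, x_i) = (2*pi*w^2)^(-d/2) * exp(-||x - x_i||^2 / (2 w^2))."""
    x = np.atleast_1d(x)
    d = data.shape[1]
    sq_dist = np.sum((data - x) ** 2, axis=1)
    K = (2 * np.pi * w ** 2) ** (-d / 2) * np.exp(-sq_dist / (2 * w ** 2))
    return K.mean()

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))                     # illustrative 2-D sample
print(kernel_density(np.array([0.0, 0.0]), data, w=0.5))
```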
