Unsupervised Learning of Distribution
Bor-shen Lin
[email protected]
http://www.cs.ntust.edu.tw/~bslin
LEARNING OF DISTRIBUTION
Continuous variable
Learn probability density function
$p(x)$ should satisfy $\int_{-\infty}^{\infty} p(x)\,dx = 1$
LEARNING MODELS
Parametric Model
Discrete distribution
Gaussian distribution
Gaussian mixture model
MAP estimation of the model parameters:
$\hat{\theta} = \arg\max_{\theta} P(\theta \mid X) = \arg\max_{\theta} \frac{P(\theta, X)}{P(X)} = \arg\max_{\theta} P(\theta)\, P(X \mid \theta)$
1. LEARNING OF DISCRETE DISTRIBUTION
$P(X \mid w_1, w_2, \ldots, w_n) = w_1^{c_1} w_2^{c_2} \cdots w_n^{c_n}$, where $c_k$ is the occurrence count of outcome $k$ and $N = \sum_{k=1}^{n} c_k$.

With the Lagrange function $L(X, \mathbf{w}) = \log P(X \mid \mathbf{w}) + \lambda\left(\sum_{k=1}^{n} w_k - 1\right)$, setting $\frac{\partial L(X, \mathbf{w})}{\partial w_k} = 0$ gives $\frac{c_k}{w_k} + \lambda = 0$, so
$\frac{c_1}{w_1} = \frac{c_2}{w_2} = \cdots = \frac{c_n}{w_n} = -\lambda$

$\frac{c_1}{w_1} = \cdots = \frac{c_n}{w_n} = \frac{\sum_{k=1}^{n} c_k}{\sum_{k=1}^{n} w_k} = \sum_{k=1}^{n} c_k = N \quad \left(\text{given } \sum_{k=1}^{n} w_k = 1\right)$

$\hat{w}_k = \frac{c_k}{N} = \frac{c_k}{\sum_{k=1}^{n} c_k}$
• Use the occurrence counts to estimate the probability weights of the p.m.f.
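A minimal Python sketch of this count-and-normalize estimate (the function name and the sample data are my own, for illustration):

```python
from collections import Counter

def estimate_pmf(samples):
    """ML estimate of a discrete distribution: w_k = c_k / N."""
    counts = Counter(samples)          # occurrence count c_k per outcome
    total = sum(counts.values())       # N = sum_k c_k
    return {outcome: c / total for outcome, c in counts.items()}

# Example with made-up categorical data
data = ["a", "b", "a", "c", "a", "b"]
print(estimate_pmf(data))              # {'a': 0.5, 'b': 0.333..., 'c': 0.166...}
```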
CONSTRAINED OPTIMIZATION WITH LAGRANGE MULTIPLIER
General re-estimation formula for weights subject to $\sum_m w_m = 1$:
$\hat{w}_k = \frac{w_k \frac{\partial P(\mathbf{w})}{\partial w_k}}{\sum_{m=1}^{M} w_m \frac{\partial P(\mathbf{w})}{\partial w_m}}$
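As a sanity check (my own sketch, not part of the slides), plugging the discrete-distribution likelihood $P(\mathbf{w}) = \prod_k w_k^{c_k}$ into this formula with a finite-difference gradient recovers $\hat{w}_k = c_k / \sum_k c_k$:

```python
import numpy as np

def reestimate_weights(P, w, eps=1e-6):
    """w_hat_k = w_k * dP/dw_k / sum_m w_m * dP/dw_m (finite-difference gradient)."""
    grad = np.zeros_like(w)
    for k in range(len(w)):
        w_plus = w.copy()
        w_plus[k] += eps
        grad[k] = (P(w_plus) - P(w)) / eps       # numerical dP/dw_k
    num = w * grad
    return num / num.sum()

# Likelihood of a discrete distribution with made-up counts c = [3, 1, 2]
c = np.array([3.0, 1.0, 2.0])
P = lambda w: np.prod(w ** c)
w0 = np.array([1/3, 1/3, 1/3])                   # any valid starting point
print(reestimate_weights(P, w0))                 # approx. [0.5, 0.167, 0.333] = c / c.sum()
```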
2. LEARNING OF GAUSSIAN DISTRIBUTION
$X_1, X_2, \ldots, X_N$ are i.i.d. with p.d.f. $P(X \mid \theta) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$, $\theta = (\mu, \sigma^2)$.

Then $P(X \mid \theta) = \prod_{i=1}^{N} P(X_i \mid \theta)$, where $X$ consists of $X_1, X_2, \ldots, X_N$.

$2\log P(X \mid \theta) = 2\log\prod_{i=1}^{N} P(X_i \mid \theta) = 2\sum_{i=1}^{N}\log P(X_i \mid \theta)$
$= 2\sum_{i=1}^{N}\log\left( (\sqrt{2\pi}\,\sigma)^{-1}\, e^{-\frac{(X_i-\mu)^2}{2\sigma^2}} \right) = -\sum_{i=1}^{N}\left[ \log(2\pi) + \log(\sigma^2) + \frac{(X_i-\mu)^2}{\sigma^2} \right]$
$= -N\log(2\pi) - N\log(\sigma^2) - \sum_{i=1}^{N}\frac{(X_i-\mu)^2}{\sigma^2}$

$l(\theta) \equiv 2\log P(X \mid \theta) + N\log(2\pi)$ is monotonic with $P(X \mid \theta)$.
GAUSSIAN DISTRIBUTION
$N(x; \mu, \sigma^2) \equiv \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

$E(X) = \mu, \quad \mathrm{Var}(X) = \sigma^2$

[Figure: bell-shaped curve of $N(x; \mu, \sigma^2)$ over $x$, centered at $\mu$]
ESTIMATION OF GAUSSIAN DISTRIBUTION
$\frac{\partial l(\mu, \sigma^2)}{\partial \mu} = 0: \quad \frac{\partial}{\partial \mu}\sum_{i=1}^{N}\frac{(X_i-\mu)^2}{\sigma^2} = -\frac{2}{\sigma^2}\sum_{i=1}^{N}(X_i-\mu) = 0$

$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} X_i$
• Every observation $X_i$ contributes to the estimation of $\mu$ (with weight $1/N$)

$\frac{\partial l(\mu, \sigma^2)}{\partial \sigma^2} = 0: \quad \frac{\partial}{\partial \sigma^2}\sum_{i=1}^{N}\left[\log(\sigma^2) + \frac{(X_i-\mu)^2}{\sigma^2}\right] = \sum_{i=1}^{N}\left[\frac{1}{\sigma^2} - \frac{(X_i-\mu)^2}{\sigma^4}\right] = 0, \quad N = \sum_{i=1}^{N}\frac{(X_i-\mu)^2}{\sigma^2}$

$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i-\mu)^2$
• Every observation $X_i$ contributes to the estimation of $\sigma^2$ (with weight $1/N$)
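A small numerical sketch of these two estimators (the sample data are generated here purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)    # made-up i.i.d. samples

mu_hat = X.mean()                                # mu_hat = (1/N) * sum_i X_i
var_hat = np.mean((X - mu_hat) ** 2)             # sigma^2_hat = (1/N) * sum_i (X_i - mu_hat)^2

print(mu_hat, var_hat)                           # close to 2.0 and 1.5**2 = 2.25
```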
GAUSSIAN MIXTURE MODEL (GMM)
$p(x) = \sum_{k=1}^{M} c_k\, N(x; \mu_k, \sigma_k^2), \quad \sum_{k=1}^{M} c_k = 1$

[Figure: $p(x)$ over $x$ as a weighted sum of Gaussian components centered at $\mu_1, \mu_2, \mu_3, \mu_4$]
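A minimal sketch of evaluating this mixture density at a point; the 4-component parameters below are invented:

```python
import numpy as np

def gmm_pdf(x, c, mu, var):
    """p(x) = sum_k c_k * N(x; mu_k, sigma_k^2) for scalar x."""
    comp = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.sum(c * comp)

# Illustrative 4-component mixture (parameters are made up)
c   = np.array([0.3, 0.3, 0.2, 0.2])
mu  = np.array([-2.0, 0.0, 2.0, 5.0])
var = np.array([0.5, 1.0, 0.8, 1.5])
print(gmm_pdf(1.0, c, mu, var))
```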
PARTITION GAUSSIAN MODEL(PGM)
$p(x) \equiv \max_k N(x; \mu_k, \sigma_k^2)$

$\int_{-\infty}^{\infty} p(x)\,dx \neq 1.0$

[Figure: $p(x)$ over $x$, each region covered by the dominant Gaussian centered at $\mu_1, \mu_2, \mu_3, \mu_4$]
MULTI-DIMENSIONAL GAUSSIAN DISTRIBUTION
$N(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{\sqrt{(2\pi)^n\, |\boldsymbol{\Sigma}|}}\, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^t \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}$

$\boldsymbol{\mu} \equiv E(X) = \int \mathbf{x}\, P(\mathbf{x})\, d\mathbf{x} \quad (n \times 1), \qquad \boldsymbol{\Sigma} \equiv E\left((X-\boldsymbol{\mu})(X-\boldsymbol{\mu})^t\right) \quad (n \times n)$
• x : n dimensional vector
• Each Gaussian: μ as mean vector, 𝜮 as covariance matrix
• The dimensions of x are stochastically independent when 𝜮 is diagonal
• Multi-dimensional GMM: θ = {(𝑐𝑚 , 𝝁𝑚 , 𝜮𝑚 )}
• # of parameters: M(1 + n + n²)
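A small numpy sketch of evaluating $N(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$; the test vector and parameters are arbitrary:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """N(x; mu, Sigma) for an n-dimensional x."""
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff          # (x-mu)^t Sigma^{-1} (x-mu)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

x     = np.array([0.5, -1.0])
mu    = np.array([0.0,  0.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_pdf(x, mu, Sigma))
```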
3. LEARNING OF GMM
WITH EXPECTATION MAXIMIZATION (EM)
$X_1, X_2, \ldots, X_N$ are i.i.d. with p.d.f. $P(X \mid \theta) = \sum_{m=1}^{M} c_m\, P(X; \mu_m, \sigma_m^2)$,

where $P(X; \mu_m, \sigma_m^2) = \frac{1}{\sqrt{2\pi\sigma_m^2}}\, e^{-\frac{(X-\mu_m)^2}{2\sigma_m^2}}$, $\theta = \{c_m, \mu_m, \sigma_m\}$ and $\sum_{m=1}^{M} c_m = 1$.

Then $P(X \mid \theta) = \prod_{i=1}^{N} P(X_i \mid \theta)$, where $X$ consists of $X_1, X_2, \ldots, X_N$.

The Lagrange function is $L(\theta) = \log P(X \mid \theta) + \lambda\left(\sum_{m=1}^{M} c_m - 1\right) = \sum_{i=1}^{N} \log P(X_i \mid \theta) + \lambda\left(\sum_{m=1}^{M} c_m - 1\right)$.
RE-ESTIMATION OF GMM PARAMETERS
$\frac{\partial L(\theta)}{\partial \mu_m} = \sum_{i=1}^{N} \frac{c_m \frac{\partial}{\partial \mu_m} P(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} = \sum_{i=1}^{N} \frac{c_m\, P(X_i; \mu_m, \sigma_m^2)\, \frac{(X_i - \mu_m)}{\sigma_m^2}}{P(X_i \mid \theta)} = 0$

Define $P(X_i, C_m) \equiv c_m\, P(X_i; \mu_m, \sigma_m^2)$, $\theta_m \equiv (c_m, \mu_m, \sigma_m)$, and
$l(m, i) \equiv P(C_m \mid X_i) = \frac{P(X_i, C_m)}{P(X_i)}$

$\sum_{i=1}^{N} l(m,i)(X_i - \mu_m) = 0 \;\Rightarrow\; \mu_m \sum_{i=1}^{N} l(m,i) = \sum_{i=1}^{N} l(m,i)\, X_i \;\Rightarrow\; \hat{\mu}_m = \frac{\sum_{i=1}^{N} l(m,i)\, X_i}{\sum_{i=1}^{N} l(m,i)}$

$\frac{\partial L(\theta)}{\partial \sigma_m^2} = \sum_{i=1}^{N} \frac{c_m \frac{\partial}{\partial \sigma_m^2} P(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} = \sum_{i=1}^{N} \frac{P(X_i \mid \theta_m)\, \frac{1}{2\sigma_m^2}\left(\frac{(X_i - \mu_m)^2}{\sigma_m^2} - 1\right)}{P(X_i \mid \theta)} = 0$

$\sum_{i=1}^{N} l(m,i) = \sum_{i=1}^{N} l(m,i)\, \frac{(X_i - \mu_m)^2}{\sigma_m^2} \;\Rightarrow\; \hat{\sigma}_m^2 = \frac{\sum_{i=1}^{N} l(m,i)\,(X_i - \mu_m)^2}{\sum_{i=1}^{N} l(m,i)}$
CONCEPT
$\mu_m$ is an independent variable that may be adjusted freely, no matter what the other parameters are.
Viewed as a function of $\mu_m$ alone, $L(\theta)$ has a unique global maximum.
That maximum is located where $\frac{\partial L(\theta)}{\partial \mu_m} = 0$.
A better value of $\mu_m$ is therefore guaranteed by the iteration formula for $\hat{\mu}_m$.
Both $\mu_m$ and $\sigma_m$ are independent variables in this sense.
$\frac{\partial L(\theta)}{\partial c_m} = \sum_{i=1}^{N} \frac{P(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} + \lambda = 0$

With $P(X_i \mid \theta_m) \equiv c_m\, P(X_i; \mu_m, \sigma_m^2)$, $\theta_m \equiv (c_m, \mu_m, \sigma_m)$, $l(m, i) \equiv \frac{P(X_i \mid \theta_m)}{P(X_i \mid \theta)}$:

$-\lambda = \sum_{i=1}^{N} \frac{P(X_i; \mu_m, \sigma_m^2)}{P(X_i \mid \theta)} = \sum_{i=1}^{N} \frac{P(X_i \mid \theta_m)}{c_m\, P(X_i \mid \theta)} = \frac{\sum_{i=1}^{N} l(m,i)}{c_m}$

$1 = \sum_{m=1}^{M} c_m = \sum_{m=1}^{M} \frac{\sum_{i=1}^{N} l(m,i)}{-\lambda} \;\Rightarrow\; -\lambda = \sum_{m=1}^{M}\sum_{i=1}^{N} l(m,i)$

$\hat{c}_m = \frac{\sum_{i=1}^{N} l(m,i)}{\sum_{m=1}^{M}\sum_{i=1}^{N} l(m,i)} = \frac{\sum_{i=1}^{N} l(m,i)}{N}$
EM REESTIMATES OF GMM PARAMETERS
(MULTI-DIMENSIONAL)
$\hat{\boldsymbol{\mu}}_m = \frac{\sum_i l(m,i)\, \mathbf{x}_i}{\sum_i l(m,i)}$

$\hat{\boldsymbol{\Sigma}}_m = \frac{\sum_i l(m,i)\, (\mathbf{x}_i - \boldsymbol{\mu}_m)(\mathbf{x}_i - \boldsymbol{\mu}_m)^t}{\sum_i l(m,i)}$

$\hat{c}_m = \frac{\sum_i l(m,i)}{\sum_m \sum_i l(m,i)}$
LEARNING FOR GMM
[Flowchart of EM training for a GMM]
1. Initial model $\theta = \{(c_m, \mu_m, \Sigma_m)\}$, chosen at random or by splitting.
2. Expectation: compute $l(m,i)$ from the training data $\{X_i\}$.
3. Maximization: re-estimate $\hat{\theta} = \{\theta_m\}$.
4. If not converged, return to step 2; otherwise output the final model $\theta^* = \{(c_m, \mu_m, \Sigma_m)\}$.
STEPS OF GMM TRAINING
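Below is a minimal, self-contained numpy sketch of the loop in the flowchart above (the function names, random initialization, and fixed iteration count standing in for a convergence test are my own choices; the data are invented):

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """N(x; mu, Sigma) evaluated for every row of X."""
    n = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(Sigma)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def train_gmm(X, M, iterations=50, seed=0):
    """EM training: E-step computes l(m,i); M-step applies the re-estimation formulas."""
    rng = np.random.default_rng(seed)
    N, n = X.shape
    # Initial model: random means from the data, identity covariances, uniform weights
    mu = X[rng.choice(N, M, replace=False)]
    Sigma = np.array([np.eye(n)] * M)
    c = np.full(M, 1.0 / M)
    for _ in range(iterations):
        # E-step: l(m,i) = c_m P(x_i; mu_m, Sigma_m) / sum_m' c_m' P(x_i; mu_m', Sigma_m')
        joint = np.array([c[m] * gaussian_pdf(X, mu[m], Sigma[m]) for m in range(M)])
        l = joint / joint.sum(axis=0, keepdims=True)           # shape (M, N)
        # M-step: re-estimation formulas for mu_m, Sigma_m, c_m
        for m in range(M):
            w = l[m]
            mu[m] = (w[:, None] * X).sum(axis=0) / w.sum()
            diff = X - mu[m]
            Sigma[m] = (w[:, None, None] * np.einsum('ij,ik->ijk', diff, diff)).sum(axis=0) / w.sum()
            Sigma[m] += 1e-6 * np.eye(n)                        # keep covariance invertible
        c = l.sum(axis=1) / l.sum()
    return c, mu, Sigma

# Illustrative 2-D data drawn from two invented clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (200, 2)),
               rng.normal([5, 5], 0.5, (200, 2))])
c, mu, Sigma = train_gmm(X, M=2)
print(c, mu, sep="\n")
```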
DISTANCE BETWEEN GAUSSIANS - 1
$D_B(f, g) \equiv -\log \int \sqrt{f(x)\, g(x)}\, dx$
$= \frac{1}{4}(\boldsymbol{\mu}_f - \boldsymbol{\mu}_g)^t (\boldsymbol{\Sigma}_f + \boldsymbol{\Sigma}_g)^{-1}(\boldsymbol{\mu}_f - \boldsymbol{\mu}_g) + \frac{1}{2}\log\left|\frac{\boldsymbol{\Sigma}_f + \boldsymbol{\Sigma}_g}{2}\right| - \frac{1}{4}\log\left(|\boldsymbol{\Sigma}_f|\,|\boldsymbol{\Sigma}_g|\right)$

Then the Bayes error $B_e(f, g) \equiv \frac{1}{2}\int \min(f(x), g(x))\, dx \leq \frac{1}{2}\, e^{-D_B(f,g)}$
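A small numpy sketch of this distance and the resulting Bayes-error bound (the two Gaussians are invented):

```python
import numpy as np

def bhattacharyya(mu_f, Sigma_f, mu_g, Sigma_g):
    """D_B between two multivariate Gaussians f and g."""
    d = mu_f - mu_g
    S = 0.5 * (Sigma_f + Sigma_g)                       # (Sigma_f + Sigma_g) / 2
    term1 = 0.125 * d @ np.linalg.inv(S) @ d            # = 1/4 * d^t (Sf + Sg)^{-1} d
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(Sigma_f) * np.linalg.det(Sigma_g)))
    return term1 + term2

mu_f, Sigma_f = np.array([0.0, 0.0]), np.eye(2)
mu_g, Sigma_g = np.array([2.0, 1.0]), np.array([[1.5, 0.2], [0.2, 0.8]])
DB = bhattacharyya(mu_f, Sigma_f, mu_g, Sigma_g)
print(DB, 0.5 * np.exp(-DB))                            # distance and Bayes-error bound
```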
DISTANCE BETWEEN GAUSSIANS - 2
$D_{KL}(f, g) \equiv \int f(x) \log\frac{f(x)}{g(x)}\, dx$
$= \frac{1}{2}\left[ \log\frac{|\boldsymbol{\Sigma}_g|}{|\boldsymbol{\Sigma}_f|} + \mathrm{Tr}\left(\boldsymbol{\Sigma}_g^{-1}\boldsymbol{\Sigma}_f\right) - d + (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g)^t\, \boldsymbol{\Sigma}_g^{-1}\, (\boldsymbol{\mu}_f - \boldsymbol{\mu}_g) \right]$

Then $D_{KL}(f, g) \geq 2\, D_B(f, g)$
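Correspondingly, a sketch of this closed-form KL divergence (again with invented Gaussians):

```python
import numpy as np

def kl_gaussians(mu_f, Sigma_f, mu_g, Sigma_g):
    """D_KL(f || g) between two d-dimensional Gaussians."""
    d = len(mu_f)
    inv_g = np.linalg.inv(Sigma_g)
    diff = mu_f - mu_g
    return 0.5 * (np.log(np.linalg.det(Sigma_g) / np.linalg.det(Sigma_f))
                  + np.trace(inv_g @ Sigma_f) - d
                  + diff @ inv_g @ diff)

mu_f, Sigma_f = np.array([0.0, 0.0]), np.eye(2)
mu_g, Sigma_g = np.array([2.0, 1.0]), np.array([[1.5, 0.2], [0.2, 0.8]])
print(kl_gaussians(mu_f, Sigma_f, mu_g, Sigma_g))
```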
SIMILARITY BETWEEN TWO GMMS
[Diagram: the similarity between GMM 1 and GMM 2 is computed from the pairwise Bhattacharyya distances between their Gaussian components, combined by a weighted average.]
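One possible reading of this scheme in code; since the slide does not specify the weighting, weighting each component pair by the product of the two mixture weights is my assumption:

```python
import numpy as np

def bhattacharyya(mu_f, Sigma_f, mu_g, Sigma_g):
    d = mu_f - mu_g
    S = 0.5 * (Sigma_f + Sigma_g)
    return (0.125 * d @ np.linalg.inv(S) @ d
            + 0.5 * np.log(np.linalg.det(S) /
                           np.sqrt(np.linalg.det(Sigma_f) * np.linalg.det(Sigma_g))))

def gmm_distance(gmm1, gmm2):
    """Weighted average of pairwise component distances between two GMMs.
    Each GMM is a list of (c, mu, Sigma) tuples; the c_i * c_j weighting is an assumption."""
    total = 0.0
    for c1, mu1, S1 in gmm1:
        for c2, mu2, S2 in gmm2:
            total += c1 * c2 * bhattacharyya(mu1, S1, mu2, S2)
    return total

gmm1 = [(0.6, np.zeros(2), np.eye(2)), (0.4, np.array([3.0, 0.0]), np.eye(2))]
gmm2 = [(0.5, np.array([0.5, 0.0]), np.eye(2)), (0.5, np.array([3.5, 0.5]), np.eye(2))]
print(gmm_distance(gmm1, gmm2))
```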
APPLICATIONS OF GMM
Parametric model of continuous variables
Unsupervised learning of p(X)
Clustering (unsupervised)
Regarding every mixture component as a cluster
Classification (supervised)
Train p(X|Cm) for all m’s (classes)
Examples
Language identification
Gender identification
Speaker recognition
Image classification/tagging
GMM-BASED CLUSTERING
ML detector:
$C^* = \arg\max_i\, p(x \mid C_i)$
MAP detector (given the prior distribution):
$C^* = \arg\max_i\, p(C_i \mid x) = \arg\max_i\, p(C_i)\, p(x \mid C_i)$
[Diagram: training data $\{x_i\}$ of class $C_k$ → ML estimation → GMM for $p(x \mid C_k)$]
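A small sketch of the two decision rules, assuming per-class scores $\log p(x \mid C_i)$ have already been computed by the class GMMs (the scores and priors below are placeholders):

```python
import numpy as np

def ml_decision(loglikes):
    """C* = argmax_i p(x|C_i), given log-likelihoods log p(x|C_i)."""
    return int(np.argmax(loglikes))

def map_decision(loglikes, priors):
    """C* = argmax_i p(C_i) p(x|C_i)."""
    return int(np.argmax(np.asarray(loglikes) + np.log(priors)))

# Placeholder scores for three classes at some observation x
loglikes = [-12.3, -10.8, -11.5]
priors = np.array([0.5, 0.2, 0.3])
print(ml_decision(loglikes), map_decision(loglikes, priors))
```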
DISCRIMINATIVE TRAINING FOR GMM
ML training
The objective function to be maximized is the likelihood function of each class
Every GMM is trained with the data of its own class
A sample of class k influences only the distribution of that class, i.e. p(x|Ck)
Minimum classification error (MCE) training
The objective function to be minimized is the overall classification error
The GMMs for different classes are trained jointly instead of individually
Every sample influences the distributions of all classes, i.e. p(x|Cj) for all j
MCE TRAINING
$p_k(x)$ is a GMM with parameters $\{(c_{km}, \mu_{km}, \Sigma_{km})\}$:
$p_k(x) = \sum_{m=1}^{M} c_{km}\, N_{km}(x)$
$g_k(x) = \log p_k(x)$
$d_k(x) = -g_k(x) + g_{\bar{k}}(x) = -g_k(x) + \log\left[\frac{1}{K-1}\sum_{j\neq k} e^{\eta\, g_j(x)}\right]^{1/\eta}$
$l_k(x) = \frac{1}{1 + e^{-\gamma d_k + \theta}}$ (sigmoid)
$L(X) = \sum_{k=1}^{K} \sum_{x_i \in C_k} l_k(x_i)$
Minimizing $L(X)$ leads to the minimization of classification errors.
The parameters can be obtained by generalized probabilistic descent (GPD): $d\Lambda = -\epsilon\, \nabla L$
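A rough numpy sketch of the misclassification measure and loss above (the values of $\gamma$, $\theta$, $\eta$ and the per-class scores are placeholders I chose for illustration):

```python
import numpy as np

def misclassification_measure(g, k, eta=2.0):
    """d_k(x) = -g_k(x) + log[ (1/(K-1)) * sum_{j != k} exp(eta * g_j(x)) ]^(1/eta)."""
    g = np.asarray(g, dtype=float)
    competitors = np.delete(g, k)
    anti = np.log(np.mean(np.exp(eta * competitors))) / eta
    return -g[k] + anti

def mce_loss(g, k, gamma=1.0, theta=0.0, eta=2.0):
    """l_k(x) = sigmoid(gamma * d_k - theta)."""
    d = misclassification_measure(g, k, eta)
    return 1.0 / (1.0 + np.exp(-gamma * d + theta))

# Placeholder discriminants g_j(x) = log p_j(x) for 3 classes; true class k = 0
g = [-10.2, -12.5, -13.1]
print(misclassification_measure(g, 0), mce_loss(g, 0))
```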
MCE FORMULA – DIAGONAL COVARIANCE
For $x_i \in C_k$: $\theta_{jm} \equiv \frac{c_{jm}\, p_{jm}}{p_j}$, $r_k \equiv \gamma\, l_k (1 - l_k)$

$d\mu_{kml} = \varepsilon\, r_k\, \theta_{km}\, \frac{x_l - \mu_{kml}}{\sigma_{kml}^2}$

$d\mu_{jml} = -\varepsilon\, r_k\, \frac{p_j}{\sum_{n \neq k} p_n}\, \theta_{jm}\, \frac{x_l - \mu_{jml}}{\sigma_{jml}^2} \quad \text{for } j \neq k$

$d\sigma_{kml} = \varepsilon\, r_k\, \theta_{km}\, \frac{1}{\sigma_{kml}}\left(\frac{(x_l - \mu_{kml})^2}{\sigma_{kml}^2} - 1\right)$

$d\sigma_{jml} = -\varepsilon\, r_k\, \frac{p_j}{\sum_{n \neq k} p_n}\, \theta_{jm}\, \frac{1}{\sigma_{jml}}\left(\frac{(x_l - \mu_{jml})^2}{\sigma_{jml}^2} - 1\right) \quad \text{for } j \neq k$