Hidden Markov Models
Machine Learning
CSx824/ECEx242
Bert Huang
Virginia Tech
Outline
• Hidden Markov models (HMMs)
• Forward-backward for HMMs
• Baum-Welch learning (expectation maximization)
Hidden State Transitions
[Figure: a submarine moves between grid locations over time; each location is marked "?" because the true state is hidden]
Hidden Markov Models
p(y_t | x_t): observation probability (SONAR noisiness)
p(x_t | x_{t-1}): transition probability (submarine locomotion)

p(X, Y) = p(x_1) \prod_{t=1}^{T-1} p(x_{t+1} | x_t) \prod_{t'=1}^{T} p(y_{t'} | x_{t'})

[Figure: chain-structured graphical model with hidden states x_1, ..., x_5 and observations y_1, ..., y_5]
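To make the model concrete, here is a minimal sketch (not from the slides) of evaluating this joint probability for discrete states and observations. It assumes a transition matrix A with A[i, j] = p(x_{t+1} = j | x_t = i), an emission matrix B with B[i, k] = p(y_t = k | x_t = i), and an initial distribution pi; these names and the toy numbers are illustrative only.

```python
import numpy as np

def joint_probability(x, y, pi, A, B):
    """p(X, Y) = p(x_1) * prod_t p(x_{t+1} | x_t) * prod_t p(y_t | x_t)."""
    prob = pi[x[0]]                    # p(x_1)
    for t in range(len(x) - 1):
        prob *= A[x[t], x[t + 1]]      # transition term p(x_{t+1} | x_t)
    for t in range(len(y)):
        prob *= B[x[t], y[t]]          # observation term p(y_t | x_t)
    return prob

# Toy example: 2 hidden states, 2 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
print(joint_probability([0, 0, 1], [0, 1, 1], pi, A, B))
```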
Hidden State Inference
p(X | Y)        p(x_t | Y)

\alpha_t(x_t) = p(x_t, y_1, ..., y_t)        \beta_t(x_t) = p(y_{t+1}, ..., y_T | x_t)

\alpha_t(x_t) \beta_t(x_t) = p(x_t, y_1, ..., y_t) p(y_{t+1}, ..., y_T | x_t) = p(x_t, Y) \propto p(x_t | Y)

Normalize to get the conditional probability.
Note: this is not the same as p(x_1, ..., x_T, Y).
Forward Inference
\alpha_t(x_t) = p(x_t, y_1, ..., y_t)

p(x_1, y_1) = p(x_1) p(y_1 | x_1) = \alpha_1(x_1)

p(x_2, y_1, y_2) = \sum_{x_1} p(x_1, y_1) p(x_2 | x_1) p(y_2 | x_2) = \alpha_2(x_2) = \sum_{x_1} \alpha_1(x_1) p(x_2 | x_1) p(y_2 | x_2)

p(x_{t+1}, y_1, ..., y_{t+1}) = \alpha_{t+1}(x_{t+1}) = \sum_{x_t} \alpha_t(x_t) p(x_{t+1} | x_t) p(y_{t+1} | x_{t+1})
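A minimal sketch of this forward recursion under the same assumed parameterization (transition matrix A, emission matrix B, initial distribution pi). Because Python indexes from zero, row alpha[t] stores \alpha_{t+1} in the slide's 1-based notation.

```python
import numpy as np

def forward(y, pi, A, B):
    """alpha[t, i] stores the forward message for step t+1 (zero-based rows)."""
    T, K = len(y), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, y[0]]                        # alpha_1(x_1) = p(x_1) p(y_1 | x_1)
    for t in range(T - 1):
        # alpha_{t+1}(x_{t+1}) = sum_{x_t} alpha_t(x_t) p(x_{t+1} | x_t) p(y_{t+1} | x_{t+1})
        alpha[t + 1] = (alpha[t] @ A) * B[:, y[t + 1]]
    return alpha
```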
Backward Inference
\beta_t(x_t) = p(y_{t+1}, ..., y_T | x_t)

p(\{\} | x_T) = 1 = \beta_T(x_T)

\beta_{t-1}(x_{t-1}) = p(y_t, ..., y_T | x_{t-1}) = \sum_{x_t} p(x_t | x_{t-1}) p(y_t, y_{t+1}, ..., y_T | x_t)
                     = \sum_{x_t} p(x_t | x_{t-1}) p(y_t | x_t) p(y_{t+1}, ..., y_T | x_t)
                     = \sum_{x_t} p(x_t | x_{t-1}) p(y_t | x_t) \beta_t(x_t)
Backward Inference
\beta_t(x_t) = p(y_{t+1}, ..., y_T | x_t)

\beta_T(x_T) = p(\{\} | x_T) = 1

\beta_{t-1}(x_{t-1}) = p(y_t, ..., y_T | x_{t-1}) = \sum_{x_t} p(x_t | x_{t-1}) p(y_t | x_t) \beta_t(x_t)
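The backward recursion under the same assumptions: beta is initialized to ones for the final step (\beta_T(x_T) = 1), and the loop applies the recursion above shifted to zero-based indexing, so beta[t] stores \beta_{t+1} in the slide's notation.

```python
import numpy as np

def backward(y, A, B):
    """beta[t, i] stores the backward message for step t+1 (zero-based rows)."""
    T, K = len(y), A.shape[0]
    beta = np.ones((T, K))                            # beta_T(x_T) = 1
    for t in range(T - 2, -1, -1):
        # beta_t(x_t) = sum_{x_{t+1}} p(x_{t+1} | x_t) p(y_{t+1} | x_{t+1}) beta_{t+1}(x_{t+1})
        beta[t] = A @ (B[:, y[t + 1]] * beta[t + 1])
    return beta
```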
Fusing the Messages
\alpha_t(x_t) = p(x_t, y_1, ..., y_t)        \beta_t(x_t) = p(y_{t+1}, ..., y_T | x_t)

\alpha_t(x_t) \beta_t(x_t) = p(x_t, y_1, ..., y_t) p(y_{t+1}, ..., y_T | x_t) = p(x_t, Y) \propto p(x_t | Y)

p(x_t, x_{t+1} | Y) = p(x_t, x_{t+1}, y_1, ..., y_t, y_{t+1}, y_{t+2}, ..., y_T) / p(Y)

                    = p(x_t, y_1, ..., y_t) p(x_{t+1} | x_t) p(y_{t+2}, ..., y_T | x_{t+1}) p(y_{t+1} | x_{t+1}) / \sum_{x_T} p(x_T, Y)

                    = \alpha_t(x_t) p(x_{t+1} | x_t) \beta_{t+1}(x_{t+1}) p(y_{t+1} | x_{t+1}) / \sum_{x_T} \alpha_T(x_T)
Forward-Backward Inference
\alpha_1(x_1) = p(x_1) p(y_1 | x_1)
\alpha_{t+1}(x_{t+1}) = \sum_{x_t} \alpha_t(x_t) p(x_{t+1} | x_t) p(y_{t+1} | x_{t+1})

\beta_T(x_T) = 1
\beta_{t-1}(x_{t-1}) = \sum_{x_t} p(x_t | x_{t-1}) p(y_t | x_t) \beta_t(x_t)

p(x_t, Y) = \alpha_t(x_t) \beta_t(x_t)
p(x_t | Y) = \alpha_t(x_t) \beta_t(x_t) / \sum_{x'_t} \alpha_t(x'_t) \beta_t(x'_t)

p(x_t, x_{t+1} | Y) = \alpha_t(x_t) p(x_{t+1} | x_t) \beta_{t+1}(x_{t+1}) p(y_{t+1} | x_{t+1}) / \sum_{x_T} \alpha_T(x_T)
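Fusing the two messages as above, a sketch of the node and pairwise posteriors. It reuses the forward and backward functions sketched earlier; gamma and xi are illustrative names for p(x_t | Y) and p(x_t, x_{t+1} | Y).

```python
import numpy as np

def posteriors(y, pi, A, B):
    """Node marginals p(x_t | Y) and pairwise marginals p(x_t, x_{t+1} | Y)."""
    alpha, beta = forward(y, pi, A, B), backward(y, A, B)
    p_Y = alpha[-1].sum()                  # p(Y) = sum_{x_T} alpha_T(x_T)
    gamma = alpha * beta / p_Y             # alpha_t(x_t) beta_t(x_t) / p(Y)
    T, K = alpha.shape
    xi = np.zeros((T - 1, K, K))
    for t in range(T - 1):
        # alpha_t(x_t) p(x_{t+1} | x_t) p(y_{t+1} | x_{t+1}) beta_{t+1}(x_{t+1}) / p(Y)
        xi[t] = (alpha[t][:, None] * A *
                 (B[:, y[t + 1]] * beta[t + 1])[None, :]) / p_Y
    return gamma, xi
```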
Normalization
To avoid underflow, re-normalize at each time step
\tilde{\alpha}_t(x_t) = \alpha_t(x_t) / \sum_{x'_t} \alpha_t(x'_t)        \tilde{\beta}_t(x_t) = \beta_t(x_t) / \sum_{x'_t} \beta_t(x'_t)
Exercise: why is this okay?
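A sketch of the re-normalized forward pass: each alpha is divided by its sum, and the per-step normalizers are kept so the likelihood can still be recovered if needed. Names are illustrative, as before.

```python
import numpy as np

def forward_normalized(y, pi, A, B):
    """Scaled forward pass; alpha_tilde[t] sums to one at every step."""
    T, K = len(y), len(pi)
    alpha_tilde = np.zeros((T, K))
    scale = np.zeros(T)
    a = pi * B[:, y[0]]
    scale[0] = a.sum()
    alpha_tilde[0] = a / scale[0]
    for t in range(T - 1):
        a = (alpha_tilde[t] @ A) * B[:, y[t + 1]]
        scale[t + 1] = a.sum()
        alpha_tilde[t + 1] = a / scale[t + 1]
    # The normalizers are not thrown away: log p(Y) = sum_t log scale[t].
    return alpha_tilde, scale
```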
Learning
• Parameterize and learn:
  - p(x_{t+1} | x_t): conditional probability table (transition matrix)
  - p(y_t | x_t): observation model (emission model)
• If fully observed, super easy: just count (see the sketch after this list)
• If x is hidden (most cases), treat it as a latent variable
• E.g., expectation maximization
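To make the fully observed case concrete, a brief sketch of maximum-likelihood estimation by counting, assuming discrete states and observations; K and M (numbers of states and symbols) and the variable names are illustrative.

```python
import numpy as np

def mle_fully_observed(x, y, K, M):
    """When the state sequence x is observed, ML estimation is just counting."""
    A = np.zeros((K, K))       # transition counts
    B = np.zeros((K, M))       # emission counts
    for t in range(len(x) - 1):
        A[x[t], x[t + 1]] += 1
    for t in range(len(y)):
        B[x[t], y[t]] += 1
    # Normalize each row into a conditional probability table.
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    return A, B
```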
Baum-Welch Algorithm
EM using forward-backward inference as the E-step
Baum-Welch Details
Compute p(x_t | Y) and p(x_t, x_{t+1} | Y) using forward-backward

Maximize weighted (expected) log-likelihood:

Initial state:  p(x_1) \leftarrow \frac{1}{T} \sum_{t=1}^{T} p(x_t | Y)   or   p(x_1 | Y)

Transitions:  p(x_{t'+1} = i | x_{t'} = j) \leftarrow \frac{\sum_{t=1}^{T-1} p(x_{t+1} = i, x_t = j | Y)}{\sum_{t=1}^{T-1} p(x_t = j | Y)}

Emissions (e.g., Gaussian mean):  \mu_x \leftarrow \frac{\sum_{t=1}^{T} p(x_t = x | Y) \, y_t}{\sum_{t'=1}^{T} p(x_{t'} = x | Y)}

Emissions (e.g., multinomial):  p(y | x) \leftarrow \frac{\sum_{t=1}^{T} p(x_t = x | Y) \, I(y_t = y)}{\sum_{t'=1}^{T} p(x_{t'} = x | Y)}
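Putting the updates together, a sketch of one Baum-Welch iteration for the multinomial-emission case. It reuses the posteriors sketch above for the E-step; in practice the re-normalized recursions would be used for numerical stability.

```python
import numpy as np

def baum_welch_step(y, pi, A, B):
    """One EM iteration: E-step via forward-backward, M-step from expected counts."""
    gamma, xi = posteriors(y, pi, A, B)              # p(x_t | Y), p(x_t, x_{t+1} | Y)
    K, M = B.shape
    pi_new = gamma[0]                                # p(x_1 | Y)
    # Transitions: expected pair counts over expected state counts (t = 1..T-1).
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # Multinomial emissions: weight each observed symbol by p(x_t = x | Y).
    B_new = np.zeros((K, M))
    for t, obs in enumerate(y):
        B_new[:, obs] += gamma[t]
    B_new /= gamma.sum(axis=0)[:, None]
    return pi_new, A_new, B_new
```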
Summary
• HMMs represent hidden states
• Transitions between adjacent states
• Observations based on states
• Forward-backward inference to incorporate all evidence
• Expectation maximization to train parameters (Baum-Welch)
• Treat states as latent variables