
UNCERTAINTY IN ARTIFICIAL INTELLIGENCE PROCEEDINGS 2000

Dynamic Bayesian Multinets

Jeff A. Bilmes
Department of Electrical Engineering
University of Washington
Seattle, WA 98195-2500

Abstract

In this work, dynamic Bayesian multinets are introduced where a Markov chain state at time t determines conditional independence patterns between random variables lying within a local time window surrounding t. It is shown how information-theoretic criterion functions can be used to induce sparse, discriminative, and class-conditional network structures that yield an optimal approximation to the class posterior probability, and therefore are useful for the classification task. Using a new structure learning heuristic, the resulting models are tested on a medium-vocabulary isolated-word speech recognition task. It is demonstrated that these discriminatively structured dynamic Bayesian multinets, when trained in a maximum likelihood setting using EM, can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters.

1 Introduction

While Markov chains are sometimes a useful model for sequences, such simple independence assumptions can lead to poor representations of real processes. An alternative and highly successful extension to the Markov chain allows random functions to be applied to each Markov state to yield the hidden Markov model (HMM). As is well known, an HMM is simply one type of dynamic Bayesian network (DBN) [12], or more generally a graphical model [21]. When HMMs are considered as one small instance in this enormous family of models, it is reasonable to suppose that the independence assumptions underlying an HMM can further be relaxed to yield better models still.

The structure of a graphical model, however, is sometimes chosen for an application without ensuring that it matches the underlying process the model is supposed to represent. Using a hidden Markov model to represent speech, for example, is one such instance of pre-specifying an arbitrary model structure for a domain. Such an approach has obvious computational and infrastructural advantages: if the model is kept simple, inference is guaranteed to stay tractable and software tools can be developed and reused many times. The conditional independence properties of a particular model, however, could be sub-optimal for a given task. With a more appropriate model, substantial improvements in classification accuracy, memory requirements, and computational demands could potentially be achieved. While it might be sufficient to hand-specify the model for a given application, a promising approach allows the data itself to determine or at least influence the model.

In the most general case, there are four distinct components of a graphical model: the semantics, the structure, the implementation, and the parameters. There are a variety of different semantics, including directed (Bayesian network) and undirected (Markov random field) models, chain graphs, and other more experimental frameworks. In general, each corresponds to a different family of probability distributions and, based on training data, a semantics could potentially be selected or perhaps even induced anew. Fixing the semantics, obtaining a good model structure is crucial, and is therefore a current active research focus [7, 16, 5, 20, 10]. Fixing the structure, there are a variety of ways to implement¹ the dependencies between random variables, such as conditional probability tables, neural networks, decision trees, Gaussian polynomial regression, and so on. And finally, fixing all of the above, a good assignment of all the parameters must be found. Of course, in each case a Bayesian approach can also be taken, where we use a (potentially uncountably infinite) probabilistically weighted mixture over multiple choices.

¹ While this is not standard terminology, a concise way to refer to the representation of the local conditional probability model is simply to use the term "implementation."

The task of learning graphical models can be seen as learning any or all of the above four components given a collection of data, and is akin to the model selection [22] problem known to the statistics community for years. In all cases, the underlying goal is to identify a system for probabilistic inference that is computationally efficient, accurate, and somehow informative about the given problem domain.

In this paper, a class of models called dynamic Bayesian multinets and a method to induce their structure for the classification task is described. In this work, an extension of [3], the problem domain is speech recognition, so it is necessary to use dynamic models. Also, since classification is the goal, it is beneficial to learn class-specific and (as we will see) discriminative structure. And to further improve sparsity (and therefore reduce computational and memory demands) and to represent class-conditional information only where necessary, Bayesian multinets (described in the next section) are used.

Section 2 provides a review of structure learning in Bayesian networks and of Bayesian multinets, and presents the idea of structural discriminability. Section 3 introduces the class of models considered in this work and analyzes their inferential complexity. Section 4 provides three information-theoretic criterion functions that can be used to learn structure, the last of which provably provides an optimal approximation to the local posterior probability. Section 5 introduces the improved pairwise algorithm, a heuristic developed because the above induction procedure is computationally infeasible. Section 6 evaluates this system on a medium-vocabulary speech corpus and shows that when structure is determined using the discriminative induction method and trained using EM, these networks can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters. But when structure is determined arbitrarily, or without using a discriminative method, the performance is dramatically worse. Finally, Section 7 concludes and discusses future work.

2 Background

2.1 Structure Learning

A fully-connected graphical model can represent any probability distribution representable by a sparsely structured one, but there are many important reasons for not using such a fully connected model. These include: 1) sparse network structures have fewer computational and memory requirements; 2) a sparse network is less susceptible to noise in training data (i.e., lower variance) and less prone to overfitting; and 3) the resulting structure might reveal high-level knowledge about the underlying problem domain that was previously drowned out by many extra dependencies. A graphical model should represent a dependence between two random variables only when necessary, where "necessary" depends on the current task. In essence, learning the structure in data is similar to developing an efficient code for the underlying random process, as efficient coding is analogous to probabilistic modeling.

Perhaps the earliest well-known work on structure learning in directed graphical models is [7]. More recent research on this topic may be found in [17, 5, 16, 25, 10, 20, 23, 13].² In general, the task of learning Bayesian networks can be grouped into four categories [10] depending on 1) whether the data is fully observable or contains missing values, and 2) whether it is assumed that the structure of the model is known or not. The easiest case is when the data is fully observable and the model structure is known, whereas the most difficult case is when the data is only partially observable and the structure is unknown or only partially known.

² See, especially, the reviews given in [16, 5, 20].

Note that a general optimization procedure can be used to learn many aspects of a graphical model. Often, learning needs only a maximum likelihood procedure, perhaps with an additional complexity penalty term such as MDL or BIC. Alternatively, a Bayesian approach to learning can be used where no single structure or set of parameters is chosen. For certain classes of networks, the prior and posterior are particularly simple [16]. Alternatively, a risk minimization approach [26] can be applied to the learning problem.

In principle, an optimization procedure could simultaneously cover all four components of a graphical model: semantics, structure, implementation, and parameters. There has, however, been little if any research on methods to learn the best implementation and semantics. The problem becomes inherently difficult because the quality of each component cannot be accurately evaluated without first obtaining good settings for the other three components. The problem becomes more arduous when one begins to consider multi-implementation and multi-semantic models. In practice, therefore, one or more components are typically fixed before any optimization begins.

2.2 Bayesian Multinets

An advantage of Bayesian networks is that they can specify dependencies only when necessary, leading to a significant reduction in the cost of inference. Bayesian multinets [15, 14] further generalize Bayesian networks and can further reduce computation. A multinet can be thought of as a network where edges can appear or disappear depending on the values of certain nodes in the graph, a notion that has been called asymmetric independence assertions [14].

Consider a network with four nodes A, B, C and Q. In a multinet, the conditional independence properties among A, B, and C might, for example, change for differing values of Q. If Q is binary, and C \perp A given {B, Q = 0} but C and A are dependent given {B, Q = 1}, then the joint probability could be written as

p(A,B,C) = \sum_q p(A,B,C | Q = q) p(Q = q)
         = p(C|B, Q=0) p(B|A, Q=0) p(A|Q=0) p(Q=0)
         + p(C|B,A, Q=1) p(B|A, Q=1) p(A|Q=1) p(Q=1).

Some examples of multinets include mixtures of tree-dependent distributions [23] and class-conditional naive Bayes classifiers [11, 10].

In general, the statistical dependencies in a multinet could be represented by a regular Bayesian network via specific values of the parameters [14] (e.g., for switching linear Gaussian models, certain parameters could be zero, or for discrete probability tables, hyperplanes could indicate independence between random variables only for certain values of other random variables). In other words, the family of probability distributions representable by Bayesian networks and by Bayesian multinets is the same. In practice, however, a multinet could result in a substantial savings in memory, computation, and necessary sample-size complexity relative to an equivalent Bayesian network.
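To make the asymmetric independence assertion above concrete, the following short sketch (an illustration only; the conditional probability tables are invented and not from the paper) evaluates p(A,B,C) for binary A, B, C by switching between the two Q-dependent factorizations: when Q=0 the edge from A to C is absent, and when Q=1 it is present.

```python
import numpy as np

# Hypothetical CPTs for binary A, B, C and binary Q.
p_Q = np.array([0.6, 0.4])                       # p(Q=q)
p_A_Q = np.array([[0.7, 0.3],                    # p(A=a | Q=0)
                  [0.2, 0.8]])                   # p(A=a | Q=1)
p_B_AQ = np.array([[[0.9, 0.1], [0.4, 0.6]],     # p(B=b | A=a, Q=0)
                   [[0.5, 0.5], [0.1, 0.9]]])    # p(B=b | A=a, Q=1)
p_C_BQ0 = np.array([[0.8, 0.2], [0.3, 0.7]])     # p(C=c | B=b, Q=0): no A -> C edge
p_C_ABQ1 = np.array([[[0.6, 0.4], [0.2, 0.8]],   # p(C=c | A=a, B=b, Q=1): A -> C present
                     [[0.9, 0.1], [0.5, 0.5]]])

def p_abc(a, b, c):
    """p(A,B,C): the factorization used depends on the value of Q."""
    total = p_C_BQ0[b, c] * p_B_AQ[0, a, b] * p_A_Q[0, a] * p_Q[0]          # Q = 0 term
    total += p_C_ABQ1[a, b, c] * p_B_AQ[1, a, b] * p_A_Q[1, a] * p_Q[1]     # Q = 1 term
    return total

# Sanity check: the mixture of the two factorizations is a proper joint distribution.
total = sum(p_abc(a, b, c) for a in range(2) for b in range(2) for c in range(2))
print(f"sum over all (a,b,c): {total:.6f}")      # 1.000000
print(f"p(A=1, B=0, C=1)   : {p_abc(1, 0, 1):.4f}")
```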

2.3 The Classification Task

Many papers on structure learning concentrate on producing networks that best represent statistical dependencies extant in data. When the goal is classification, however, this is not necessarily optimal. Indeed, the class posterior probability will be accurately approximated if sample and class label are considered together, and then jointly optimized in a maximum likelihood procedure, assuming sufficient data. Such a procedure might be wasteful, however, as likelihood scores are dominated by the term containing dependencies only between features (which has a much larger magnitude) rather than by the term containing the class posterior probability. It is this latter term that, according to Bayes decision theory, must be accurately modeled to achieve good classification performance. In [10], this issue was noticed, and both extended versions of naive Bayes classifiers and class-conditional Bayesian multinets were considered, both of which outperformed the naive Bayes classifier on classification tasks.

In a general classification task, additional reductions in computation and increases in sparsity can be achieved by learning a specific network structure for each class, where each class-conditional network represents nothing other than those dependencies, often unique to its class, that help approximate the class posterior probability. This property has been called structural discriminability [2].

3 The Model

In this work, we consider a class of models called dynamic Bayesian multinets (DBMs). They consist of hidden Markov chains that determine local class-conditional Bayesian networks over a window of observations. Equivalently, they consist of extensions to hidden Markov models (HMMs) where additional cross-observation dependencies have been added as a function of the underlying Markov chain. This model is also called a buried Markov model (BMM) [3] because the hidden Markov chain in an HMM is further hidden (buried) by additional cross-observation dependencies.

First, some notation: Q_t refers to a Markov state at time t, and Q_{1:T} \triangleq {Q_1, Q_2, ..., Q_T} refers to the entire chain.³ X_t will refer to the observation vector at time t, with X_{ti} referring to its i-th element. Using this notation, a hidden Markov model is a collection of hidden Q_{1:T} and observation X_{1:T} variables that possess the following conditional independence properties: {X_{t:T}, Q_{t:T}} \perp {Q_{1:t-2}, X_{1:t-1}} | Q_{t-1} and X_t \perp {Q_{\neg t}, X_{\neg t}} | Q_t for all t.

³ In general, X_{1:t} is matlab-like notation to refer to the set of variables with indices between 1 and t inclusive, X_{<t} \triangleq X_{1:(t-1)}, and X_{\neg t} = {X_{1:T} \ X_t}.

We generalize this model such that X_{ti} is no longer conditionally independent of all the surrounding observations given Q_t. Relative to an HMM, a DBM has been augmented with chain-conditional cross-observation dependencies between individual observation elements. The probability model becomes

p(x_{1:T}) = \sum_{q_{1:T}} \prod_t p(x_t | z_t(q_t), q_t) p(q_t | q_{t-1})

where z_t(q) \subseteq X_{<t} for all t and q. For example, it could be that z_t(q) = {x_{t-1,3}, x_{t-1,5}, x_{t-2,1}, x_{t-3,9}} and z_t(r) = {x_{t-4,2}, x_{t-9,0}} for r \neq q. A multinet occurs because z_t(q) is a function of q; if the Markov chain changes, so will the set of dependencies. Specifically, the conditional independence assumption among observation elements becomes X_{ti} \perp {X_{\neg t} \ z_{ti}(q)} | {q, z_{ti}(q)}. This class of model is depicted in Figure 1 for two instantiations of the Markov chain.

Figure 1: Two networks corresponding to two values of the Markov chain, q_{1:T} and q'_{1:T}. The individual elements of the observation vectors are shown explicitly as dark shaded nodes surrounded by boxes. Nodes with diagonal lines correspond to instantiated hidden variables.
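As a purely illustrative sketch of how the DBM probability model above can be evaluated, the code below runs a standard forward recursion over the Markov chain but lets each state q condition x_t on its own set of past scalar elements z_t(q). The per-state dependency sets, dimensions, and parameters here are invented; the conditional density is a simple diagonal Gaussian whose mean is a linear function of z_t(q), only one of many possible implementations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, T = 3, 4, 20                             # states, feature dimension, frames

# Markov chain parameters (hypothetical).
init = np.full(N, 1.0 / N)                     # p(q_1)
trans = rng.dirichlet(np.ones(N), size=N)      # trans[q_prev, q] = p(q_t | q_{t-1})

# Per-state dependency sets z_t(q): lists of (lag, element) pairs with lag >= 1.
z_sets = {0: [(1, 0), (1, 2)],                 # state 0 conditions on x_{t-1,0}, x_{t-1,2}
          1: [(2, 1)],                         # state 1 conditions on x_{t-2,1}
          2: []}                               # state 2 has no extra parents (HMM-like)

# Per-state conditional Gaussian: mean = mu[q] + B[q] @ z_t(q), diagonal variance.
mu = rng.normal(size=(N, D))
B = {q: 0.1 * rng.normal(size=(D, len(z_sets[q]))) for q in range(N)}
var = np.full((N, D), 1.0)

def gather_z(x, t, q):
    """Collect the parent vector z_t(q) from past observations (zeros before t=0)."""
    vals = [x[t - lag, i] if t - lag >= 0 else 0.0 for lag, i in z_sets[q]]
    return np.array(vals)

def obs_likelihood(x, t, q):
    """p(x_t | z_t(q), q) under the diagonal-Gaussian implementation."""
    z = gather_z(x, t, q)
    mean = mu[q] + (B[q] @ z if z.size else 0.0)
    diff = x[t] - mean
    return np.exp(-0.5 * np.sum(diff**2 / var[q])) / np.sqrt(np.prod(2 * np.pi * var[q]))

x = rng.normal(size=(T, D))                    # a dummy observation sequence

# Forward recursion: alpha[t, q] = p(x_{1:t}, Q_t = q); the inner double loop over
# states, plus the K parent lookups per frame, reflects the O(T N^2 K) cost.
alpha = np.zeros((T, N))
alpha[0] = init * np.array([obs_likelihood(x, 0, q) for q in range(N)])
for t in range(1, T):
    for q in range(N):
        alpha[t, q] = obs_likelihood(x, t, q) * np.dot(alpha[t - 1], trans[:, q])
print("p(x_{1:T}) =", alpha[-1].sum())
```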

In general, adding conditional dependencies in a DBN can significantly increase computational and memory complexity. For the junction tree algorithm, the complexity is O(\sum_{i=1}^{T} s(C_i)), where T is the number of resultant cliques in the junction tree and s(C_i) is the size of the state space for clique C_i. For an HMM with T time-steps and N states, there are O(T) cliques, each with a state space of size at most N^2, resulting in O(TN^2).

To determine the complexity of a DBM, first define⁴ an AR-HMM(K) as an HMM but with additional edges pointing from observations X_{t-l} to X_t for l = 1, ..., K (see Figure 2). Collapsing the vector graphical notation in Figure 1 into a single node, and including a dependency from observations X_tau to X_t in the AR-HMM if for any value of Q_t there exists a dependency between X_{tau,i} and X_{t,j} for any i, j, a DBM with a maximal dependency across K observations can be more generally represented by an AR-HMM(K).

⁴ AR-HMM stands for auto-regressive HMM. An AR-HMM can of course use any, possibly non-linear, implementation between observations.

Figure 2: Bayesian network for an AR-HMM(2).

There are three things to note about an AR-HMM(K). First, a moralized AR-HMM(K) is triangulated. This can be seen by induction: the base case is obvious, and the induction step follows because 1) a cycle containing only edges present in the previous step's graph must have a chord by induction, and 2) a cycle containing edges not in the previous step's graph (new edges) must also have a chord, because the portion of a cycle not containing the new edges must go through a clique containing nodes made adjacent by these new edges. Second, a triangulated-by-moralization AR-HMM(K) has at most two hidden variables in its cliques, since no node has more than one hidden variable as a parent, so moralization does not add edges between hidden variables. The remaining clique variables are observations, so the state-space size is only N^2. Third, such a triangulated AR-HMM(K) has only O(T) cliques. Therefore, the complexity of an AR-HMM(K), and therefore of a DBM, is again only O(TN^2) for any fixed K. There is, however, a constant cost associated with the number of additional dependency edges. Incorporating this cost, the complexity becomes O(TN^2 K), where K is the maximum number of dependency edges per observation. The extra dependency structure is sparse, however, so the computational and memory requirements of a DBM will in practice be much less than its O(TN^2 K) complexity suggests.

4 Structure Learning in DBMs

Structure learning consists of optimally choosing z_t(q) in p(x_t | z_t(q), q). In this section, three methods are considered. In each case, a fixed upper bound is assumed on the possible number of dependency variables; that is, it is assumed that |z_t(q)| <= c for some fixed c > 0. The phrases "choose dependency variables" and "choose dependencies" will be used synonymously. It is assumed that the reader is familiar with information-theoretic constructs [8].

The following theorem will be needed.

Theorem 4.1 (Mutual Information and Likelihood). Given three random variables X, Z^(a) and Z^(b), where I(X; Z^(a)) > I(X; Z^(b)), the likelihood of X given Z^(a) is higher than given Z^(b), for n, the sample size, large enough, i.e.,

(1/n) \sum_{i=1}^{n} log p(x_i | z_i^(a)) > (1/n) \sum_{i=1}^{n} log p(x_i | z_i^(b))    (1)

Proof. Under the assumption, it immediately follows that H(X | Z^(a)) < H(X | Z^(b)). Negating and expanding as integrals gives

\int p(x, z^(a)) log p(x | z^(a)) dx dz^(a) > \int p(x, z^(b)) log p(x | z^(b)) dx dz^(b),

or equivalently

E_{p(X, Z^(a))}[log p(X | Z^(a))] > E_{p(X, Z^(b))}[log p(X | Z^(b))],

where the expectations are taken under the respective joint distributions. Let (x_i, z_i^(k)) ~ p(X, Z^(k)) for k in {a, b}. The weak law of large numbers implies that for all eps > 0 there exist n_a and n_b such that for n > max(n_a, n_b),

| (1/n) \sum_{i=1}^{n} log p(x_i | z_i^(k)) + H(X | Z^(k)) | < eps

again for k in {a, b}. Choosing eps < |H(X | Z^(a)) - H(X | Z^(b))| / 2 then implies Equation 1. □
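The following small simulation (not from the paper; the jointly Gaussian setup, the linear-Gaussian model fit, and all constants are assumptions made for illustration) shows Theorem 4.1 numerically: Z_a is strongly correlated with X and Z_b only weakly, so I(X; Z_a) > I(X; Z_b), and the average held-out log-likelihood of X given each candidate conditioning variable orders the same way.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# X carries much more information about Z_a than about Z_b.
z_a = rng.normal(size=n)
z_b = rng.normal(size=n)
x = 0.9 * z_a + 0.1 * z_b + 0.3 * rng.normal(size=n)

def gaussian_mi(x, z):
    """I(X;Z) for a jointly Gaussian pair: -0.5 * log(1 - rho^2)."""
    rho = np.corrcoef(x, z)[0, 1]
    return -0.5 * np.log(1.0 - rho**2)

print("I(X;Z_a) ~", gaussian_mi(x, z_a), "  I(X;Z_b) ~", gaussian_mi(x, z_b))

def avg_cond_loglik(x, z, split=10000):
    """Fit p(x|z) as a linear-Gaussian model on one half, score the held-out half."""
    xt, zt, xh, zh = x[:split], z[:split], x[split:], z[split:]
    w = np.polyfit(zt, xt, 1)                  # slope and intercept of the regression
    s2 = (xt - np.polyval(w, zt)).var()        # residual variance estimate
    held = xh - np.polyval(w, zh)
    return np.mean(-0.5 * np.log(2 * np.pi * s2) - 0.5 * held**2 / s2)

print("avg log p(x|z_a):", avg_cond_loglik(x, z_a))   # higher (higher-MI conditioner)
print("avg log p(x|z_b):", avg_cond_loglik(x, z_b))   # lower
```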

Of course, the actual probability distribution p(x|z) is not known and only an approximation \hat{p}(x|z, theta) is available, where the parameters theta are estimated, often using a method such as maximum likelihood, that decreases D(p(x|z) || \hat{p}(x|z)), the KL-distance between the actual and approximate distributions. If \hat{p} is close enough to p, then the theorem above still holds. It is therefore assumed that the parameters of \hat{p} have been estimated well enough that any differences with p are negligible.

The above theorem is, of course, also true for conditional mutual information [8] such as I(X; Z|Q), or for a particular value of q, I(X; Z|Q = q). Therefore, if I(X; Z^(a)(q)|Q = q) > I(X; Z^(b)(q)|Q = q) for all q, then

(1/T) \sum_{t=1}^{T} log p(x_t | z_t^(a)(q_t), q_t) > (1/T) \sum_{t=1}^{T} log p(x_t | z_t^(b)(q_t), q_t).

These quantities can be viewed as likelihoods of the data given Viterbi paths q_t of modified HMMs. In the left case, the Viterbi path likelihood is higher. Note that, using a similar argument as in the theorem, and because H(X) >= H(X|Z),

(1/T) \sum_{t=1}^{T} log p(x_t | z_t(r_t), r_t) >= (1/T) \sum_{t=1}^{T} log p(x_t | r_t)

for some non-Viterbi path r_t and for n large enough. In other words, relative to an HMM, the likelihood of the data for paths other than the Viterbi path does not decrease when adding conditioning variables. The following theorem has therefore been shown.

Theorem 4.2. A DBM with edges added relative to an HMM according to conditional mutual information produces a higher likelihood score than before modification.

The DBM represents statistical relationships contained in the data that are not well represented before modification, which is the reason for the higher likelihood. Augmenting the dependencies according to conditional mutual information therefore defines the first dependency selection rule.

When the task is classification, however, a higher likelihood does not necessarily correspond to a lower error. Consider the two states q and r != q. To achieve a lower error, a modification of the q and r models should increase the average score of the q model in the context of a sample from q more than any increase in the r model in the context of q, for all r != q. The score increases can in fact be negative, thereby decreasing the likelihood of both models but potentially improving the classification accuracy. Accordingly, the score of a model in a different context can be evaluated using an extended form of conditional mutual information:⁵

I_r(X; Z | q) = \int p(x, z | r) log [ p(x, z | q) / ( p(x | q) p(z | q) ) ] dx dz

⁵ Called cross-context conditional mutual information in [2].

Therefore, I_q(X; Z(q) | q) should be large and I_q(X; Z(r) | r) should not be as large for each r. This suggests optimizing the following:⁶

S(X; Z | Q) \triangleq \sum_{q,r} p(q) (delta_{qr} - p(r)) I_q(X; Z(r) | r)

where Z = \cup_i Z(i) and delta_{qr} is a Kronecker delta.

⁶ Called discriminative conditional mutual information in [2].

This quantity can be further motivated by noticing that the expected class posterior probability can be expanded as follows:

E[log p(Q | X, Z)] = -H(Q | X, Z) = -H(X | Q, Z) - H(Q | Z) + H(X | Z)

so that

E[log p(Q | X, Z)] + H(X | Q) + H(Q) - H(X) = I(X; Z | Q) + I(Q; Z) - I(X; Z).

Furthermore, the conditional entropy can be bounded by

-H(X | Z) = \int log ( \sum_r p(x | r, z) p(r | z) ) dP(x, z, q)
          >= \int \sum_r p(r | z) log p(x | r, z) dP(x, z, q)
          \approx \sum_{r,q} p(r) p(q) \int log p(x | z, r) dP(x, z | q)

where the first inequality follows from Jensen's inequality, and the approximate equality is valid if I(Q; Z) is small.

From this, it can be shown [2] that choosing Z^(a) over Z^(b) when S(X; Z^(a) | Q) > S(X; Z^(b) | Q) will increase an upper bound on the expected class posterior probability, and therefore could potentially reduce the Bayes error. This defines a second dependency selection rule.

A generalization that does not require a small I(Q; Z) can be obtained by noticing that I(Q; Z) does not depend on p(x | z, q). Therefore, if Z = \cup_i Z(i) is chosen to maximize I(X; Z | Q) - I(X; Z), the average class posterior probability E[p(Q | X, Z)] will be maximized and therefore optimal for a fixed number of edges (see [2] for a proof). The quantity I(X; Z | Q) - I(X; Z) could be called the explaining-away residual (or the EAR measure), and it asks for edges that are more class-conditionally dependent than marginally independent. Moreover, a multinet can result when choosing class-conditional edges according to I(X; Z | Q = q) - I(X; Z) for each q.

Summarizing this section, there are three possible rules that could be used to choose new edges between elements of X_t and previous observations:

Z*(q) = argmax_{Z(q) \subseteq X_{<t}, |Z(q)| <= c}  I(X; Z(q) | q)          (2)
Z*    = argmax_{Z \subseteq X_{<t}, |Z| <= c}        S(X; Z | Q)             (3)
Z*    = argmax_{Z \subseteq X_{<t}, |Z| <= c}        I(X; Z | Q) - I(X; Z)   (4)

Equations 3 and 4 produce discriminatively structured networks, in that the underlying dependencies represented by the network are unique to each class. The resulting models achieve a high score in the presence of a sample from the right class, but get a low score in the presence of a different class. More importantly, this can be true even for non-optimal parameter settings since, via the structure, the networks are inherently less capable of achieving high scores for samples of the wrong class. Therefore, along with appropriate complexity penalties, it would be sufficient to learn parameters using likelihood-based methods rather than the more costly risk-minimization procedures [26, 1, 9, 18, 19].
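To make these criteria concrete, the toy sketch below (the generative process and all numbers are invented, not from the paper) estimates I(X;Z|Q), I(X;Z), the cross-context quantity I_q(X;Z|r), the discriminative score S, and the EAR for two binary candidate variables: Z1 is informative about X only class-conditionally (its relation to X flips with the class, so it is marginally almost independent of X), while Z2 is tied to X in a class-independent way. The EAR and S separate the two, whereas the marginal mutual information alone would favor Z2.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Toy generative process: class label q, two candidate dependency variables.
q = rng.integers(0, 2, n)
z1 = rng.integers(0, 2, n)
x = np.where(q == 0, z1, 1 - z1)                 # relation to z1 flips with the class
x = np.where(rng.random(n) < 0.1, 1 - x, x)      # 10% noise
z2 = np.where(rng.random(n) < 0.8, x, 1 - x)     # z2: class-independent copy of x

def joint3(q, x, z):
    """Empirical p(q, x, z) for binary variables."""
    p = np.zeros((2, 2, 2))
    np.add.at(p, (q, x, z), 1.0)
    return p / len(q)

def mi(pxz):
    """I(X;Z) from a 2x2 joint p(x,z)."""
    px, pz = pxz.sum(1, keepdims=True), pxz.sum(0, keepdims=True)
    return float(np.sum(pxz * np.log((pxz + 1e-12) / (px * pz + 1e-12))))

def cmi(pqxz):
    """I(X;Z|Q) = sum_q p(q) I(X;Z|Q=q)."""
    pq = pqxz.sum(axis=(1, 2))
    return float(sum(pq[c] * mi(pqxz[c] / pq[c]) for c in range(2)))

def cross_cmi(pqxz, ctx, cls):
    """I_ctx(X;Z|cls): expectation under p(x,z|ctx) of the cls-conditional log ratio."""
    pc = pqxz[ctx] / pqxz[ctx].sum()
    pk = pqxz[cls] / pqxz[cls].sum()
    px, pz = pk.sum(1, keepdims=True), pk.sum(0, keepdims=True)
    return float(np.sum(pc * np.log((pk + 1e-12) / (px * pz + 1e-12))))

def scores(z, name):
    p = joint3(q, x, z)
    pq = p.sum(axis=(1, 2))
    i_xz, i_xz_q = mi(p.sum(0)), cmi(p)
    ear = i_xz_q - i_xz                          # rule 4 (EAR)
    s = sum(pq[a] * ((a == b) - pq[b]) * cross_cmi(p, a, b)
            for a in range(2) for b in range(2)) # rule 3, for a single shared candidate
    print(f"{name}: I(X;Z|Q)={i_xz_q:.3f}  I(X;Z)={i_xz:.3f}  EAR={ear:.3f}  S={s:.3f}")

scores(z1, "Z1 (class-conditionally informative)")   # EAR and S clearly positive
scores(z2, "Z2 (marginally informative only)   ")    # high I(X;Z) but EAR, S near zero
```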

5 The Improved Pairwise Algorithm

The optimization suggested in the previous section is clearly impractical. In this section, a new computationally efficient heuristic, entitled the improved pairwise algorithm, is introduced. The algorithm approximates the desired quantities using only pairwise conditional mutual information between scalars.

The algorithm is presented in Figure 3 using rule 4. All candidate scalar random variables in X_{1:t} are given and indexed by Z_j. A total of M edges will be added separately for each value q and for each X_{ti}. Therefore, the algorithm might allow for some redundancy if intra-feature dependencies are already modeled. The algorithm first sorts the scalars in decreasing order of the function f(Z_j) = I(X_{ti}; Z_j | Q = q) - I(X_{ti}; Z_j). The output is Z_{qi}, a set of variables from which a link should be added to X_{ti} under the class q.

    INPUT: Z_j, q, M, X_{ti}
    OUTPUT: Z_{qi}
    Set Z_{qi} = {}
    Sort the Z_j so that f(Z_1) >= f(Z_2) >= ...
    For each Z_j in this order, until f(Z_j) falls below a threshold:
        If Z_j satisfies all of the following three criteria:
            1) I(X_{ti}; Z_j | q) is larger than a threshold
            2) For each Z in Z_{qi}, I(Z_j; Z | Q) < r I(Z_j; X_{ti} | Q)
            3) I(X_{ti}; Z_j) is smaller than a threshold
        then add Z_j to Z_{qi}, and break if |Z_{qi}| > M.

Figure 3: The improved pairwise heuristic: this algorithm chooses the dependency variables for the i-th feature position of X_t and for class q.

The algorithm uses three criteria to eliminate candidate edges. The first ensures that the edge to Z_j is actually informative about X_{ti} in the context of q. The second is a redundancy check: it asks for an edge from a variable that has little information in common with the variables already added, as depicted in Figure 4. The degree of allowed redundancy is determined using 0 < r < 1. The third and discriminative criterion ensures that the candidate variable does improve the model's scores in the general non-class-conditional case.

Figure 4: The redundancy check: the edge from Z_4 to X is being considered. It will only be added if I(Z_4; Z_i) for i < 4 is small, hopefully reducing the chance that a variable is added for which a class-conditional Markov blanket exists separating Z_4 from X and making Z_4 superfluous.

This algorithm is inexpensive, running in time O(|Q| |X| d^2), where |Q| is the number of classes, |X| is the size of the observation vector, and d <= |X_{1:t}| is the maximum number of variables that are considered in the selection.
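The sketch below re-expresses the Figure 3 pseudocode in Python. The threshold values, the dictionary-based data structures, and the function name are assumptions made for illustration; the pairwise (conditional) mutual information values are assumed to be estimated beforehand, e.g., from a bootstrap HMM system, and passed in.

```python
def improved_pairwise(candidates, q, M, i,
                      cmi_q,     # cmi_q[(j, i)]   ~ I(X_ti ; Z_j | Q = q)
                      cmi_all,   # cmi_all[(j, i)] ~ I(X_ti ; Z_j | Q)  (all classes)
                      cmi_zz,    # cmi_zz[(j, k)]  ~ I(Z_j ; Z_k | Q)
                      mi,        # mi[(j, i)]      ~ I(X_ti ; Z_j)      (marginal)
                      f_thresh=0.01, q_thresh=0.01, m_thresh=0.05, r=0.5):
    """Select up to M dependency variables Z_qi for feature position i and class q.

    The threshold values here are arbitrary placeholders; in practice they would be
    tuned for the task at hand.
    """
    # Sort candidates by decreasing f(Z_j) = I(X_ti; Z_j | Q=q) - I(X_ti; Z_j).
    f = lambda j: cmi_q[(j, i)] - mi[(j, i)]
    ordered = sorted(candidates, key=f, reverse=True)

    z_qi = []
    for j in ordered:
        if f(j) < f_thresh:                               # stop once f falls below threshold
            break
        informative = cmi_q[(j, i)] > q_thresh            # criterion 1
        non_redundant = all(cmi_zz[(j, k)] < r * cmi_all[(j, i)]
                            for k in z_qi)                # criterion 2 (redundancy check)
        discriminative = mi[(j, i)] < m_thresh            # criterion 3 (low marginal MI)
        if informative and non_redundant and discriminative:
            z_qi.append(j)
            if len(z_qi) >= M:
                break
    return z_qi
```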

6 Experimental Evaluation

In this section, DBMs are evaluated in the context of classifying speech utterances. All experiments are reported using the NYNEX Phonebook database [24]. Phonebook is a large-vocabulary "phonetically-rich isolated-word telephone-speech database." It contains a rich collection of vocabulary words, including poly-syllabic words such as "exhaustion," "immobilizing," "sluggishness," and "overambitious," as well as monosyllabic words such as "awe," "biff," and "his."

The quantities I(X; Z | Q = q) are obtained using an initial baseline HMM-based system. The following general training procedure is used for all of the results reported in this section:

1) Train a bootstrap HMM-based system using EM.
2) Compute I(X; Z | Q = q) and I(X; Z) for all pairs within a 200ms range surrounding the current time.
3) Run the improved pairwise heuristic.
4) Train the resulting models again with EM.
5) Test the result.

The reported results all use a mixture of Gaussian linear-regression models to implement the dependencies. In this case,

p(x | z(q), q) = \sum_{m=1}^{M} p(x | m, z(q), q) p(m | q)

where each underlying component is a Gaussian linear-regressive process on z using a sparse dependency matrix B_qm:

p(x | m, z, q) = |2 pi Sigma_qm|^{-1/2} exp( -(1/2) (x - B_qm z)' Sigma_qm^{-1} (x - B_qm z) ).

Therefore, p(x | m, z, q) is a Gaussian distribution with conditional mean B_qm z and covariance matrix Sigma_qm. This implementation can, to some extent, simulate conditional variance by using mixtures, but it avoids many training complexities since closed-form EM update equations can be derived. Complete details of the experimental setup, training procedure, definitions of test and training sets, topology of the Markov chains, and so on are described in [2].
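As an illustration of this implementation, the sketch below evaluates log p(x | z, q) for a mixture of Gaussian linear-regression components. All dimensions, sparsity patterns, and parameter values are invented; only the functional form follows the equations above.

```python
import numpy as np

rng = np.random.default_rng(3)
D, Dz, M = 4, 3, 2                               # observation dim, z dim, mixture size

# Hypothetical per-component parameters for one state q.
mix_w = np.array([0.6, 0.4])                     # p(m | q)
B = 0.2 * rng.normal(size=(M, D, Dz))            # regression matrices B_qm
B[:, :, 2] = 0.0                                 # a toy sparsity pattern: drop one parent
Sigma = np.stack([np.diag(rng.uniform(0.5, 1.5, D)) for _ in range(M)])

def log_component(x, z, m):
    """log N(x; B_qm z, Sigma_qm) for mixture component m."""
    diff = x - B[m] @ z
    _, logdet = np.linalg.slogdet(2 * np.pi * Sigma[m])
    return -0.5 * (logdet + diff @ np.linalg.solve(Sigma[m], diff))

def log_p_x_given_zq(x, z):
    """log p(x | z, q) = log sum_m p(m|q) N(x; B_qm z, Sigma_qm), via log-sum-exp."""
    logs = np.array([np.log(mix_w[m]) + log_component(x, z, m) for m in range(M)])
    a = logs.max()
    return a + np.log(np.exp(logs - a).sum())

x = rng.normal(size=D)                           # current observation vector
z = rng.normal(size=Dz)                          # its selected past dependencies z(q)
print("log p(x | z, q) =", log_p_x_given_zq(x, z))
```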
|Vocab|     75      150     300     600     Params / SP
HMM         5.0%    7.0%    9.2%    13.3%   157k / 1
DBM         4.6%    6.5%    8.7%    12.3%   157k / 1
HMM         2.4%    3.5%    5.6%    7.7%    182k / 2
DBM         2.3%    3.4%    5.4%    7.5%    182k / 2
HMM         1.7%    2.8%    4.6%    6.2%    163k / 3
DBM         1.5%    2.6%    4.4%    6.0%    166k / 3
HMM         1.5%    2.7%    4.3%    5.8%    200k / 4
DBM         1.4%    2.6%    4.2%    5.6%    207k / 4

Table 1: Word-error rate results comparing HMMs with DBMs. |Vocab| is the vocabulary size (number of test classes), Params is the number of system parameters, and SP is the number of Markov states per phoneme.

The results, given in Table 1, show the error rate for different topologies of Markov chain for each phone. The numbers are competitive with those reported using dynamic Bayesian networks on the same speech corpus and training/testing conditions [27], where the best achieved result was 2.7% with a 515k-parameter system on the 75-word vocabulary size case. The performance increases on average as more parameters are added to the system. In each case, however, the DBM outperforms an HMM with a comparable number of parameters and states per phone, and this is true across the different vocabulary sizes.

Case    Type    WER     Params / SP
1       CMI     32.0%   207k / 1
2       AR2     27.6%   207k / 1
3       AR1     20.9%   156k / 1
4       RAND    8.3%    207k / 1

Table 2: Performance of DBMs determined using alternate methods and evaluated on the 75-word vocabulary test case.

A second set of results is given in Table 2 and can be compared to those given in the 75-word column of Table 1. Case 1 shows the performance of a DBM created using Equation 2. This rule adds dependencies that increase the model scores but not the classification accuracy. In fact, the likelihood scores in this case were dramatically larger than before modification. As can be seen, however, the performance dramatically decreases, presumably because the models are not structurally discriminative.

Cases 2 and 3 show the performance when dependencies are added from the previous (the two previous for case 3) observations to the current observation; the result is therefore not a multinet. The performance is also very poor, indicating again that relaxing the wrong conditional independence properties can dramatically decrease classification accuracy.

Case 4 shows the performance when a different random set of dependencies between observation elements is added for each state. Interestingly, case 4 is much better than the previous cases, suggesting that the most harmful and anti-discriminative dependencies have not been added. The performance, however, is still worse than the baseline HMM.

Several general points can be made from the two tables. Cases 1-3 indicate that dependencies that are added to a model structure to increase a (likelihood) score can cause a dramatic decrease in classification accuracy, even if the structures are augmented in a class-conditional way, as in case 1 above. Note that the likelihood scores for these models are dramatically higher both for the training and testing data, suggesting that overfitting is not the problem. The goal of many model selection methods [6, 22] is to choose a model that provides the best description of the data, but the above suggests that this can be inappropriate for classification. Admittedly, model selection procedures typically include complexity penalty terms (e.g., MDL, BIC, and so on). But these penalties do not select for discriminative structures.

Second, dependencies in a network should not be added just because they are missing. Cases 2 and 3 add dependencies between adjacent observation vectors, an approach sometimes justified by noting that they are not directly represented by an HMM. But as the performance for these augmented models shows, adding missing dependencies can decrease classification performance.

Third, adding random dependencies does not produce as poor performance as in the previous cases, but neither is there any benefit. It is unlikely that choosing random dependencies, even if q-conditioned, will result in discriminative structure because the selection space is so large. The implications for structure learning methods that search over randomly chosen sets are clear: because of the large search space, it is unlikely that good sets of dependencies will be found in a reasonable amount of time. It seems crucial, therefore, to constrain the random search to those dependencies found to be useful in some way, as has been argued in the past [13].

Finally, as argued in [2], an HMM can approximate a distribution arbitrarily well given enough capacity, enough training data, and enough computation. The results in the tables support this claim, as increasing the number of parameters leads to improved accuracy. The performance improvement obtained by adding more hidden states is dramatic, but the additional discriminative DBM dependencies can provide further improvements.

The results show that the DBM achieves the same or better classification performance with the same parameters, thereby supporting the claim that they have achieved sparser, higher performing, but lower complexity networks.

7 Discussion

In this paper, a class of graphical models is considered that generalizes HMMs. Several methods to automatically learn structure were presented that have optimal properties, either by maximizing the likelihood score or (the EAR measure) by maximizing the class posterior probability. A dependency selection heuristic, the improved pairwise algorithm, is introduced, and an implementation was tested using a medium-vocabulary speech corpus, showing that appreciable gains can be obtained when the dependencies are chosen appropriately.

While this paper does not address the problem of learning the hidden structure in networks and uses only a simple Markov chain to represent dynamics [4, 12], for speech it is often sufficient to consider a Markov chain as a probabilistic sequencer over strings of phonetic units. The multinets, which are conditioned on each of these sequences, determine local structure. Ultimately, it is planned to use and learn more complex models of dynamic behavior for those classes of signals that can benefit from it. It is also planned to use the EAR measure to determine more general discriminatively structured Bayesian multinets.

Acknowledgements

This work has benefited from discussions with Nir Friedman, Steven Greenberg, Michael Jordan, Katrin Kirchhoff, Kevin Murphy, Stuart Russell, and Geoff Zweig. The author would like to acknowledge the International Computer Science Institute, at which many of the computational experiments were performed, and thank the three anonymous reviewers for their useful comments.

References

[1] L.R. Bahl, P.F. Brown, P.V. de Souza, and R.L. Mercer. Maximum mutual information estimation of HMM parameters for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, pages 49-52, Tokyo, Japan, December 1986.
[2] J. Bilmes. Natural Statistical Models for Automatic Speech Recognition. PhD thesis, U.C. Berkeley, Dept. of EECS, CS Division, 1999.
[3] J.A. Bilmes. Buried Markov models for speech recognition. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, Phoenix, AZ, March 1999.
[4] X. Boyen, N. Friedman, and D. Koller. Discovering the hidden structure of complex dynamic systems. 15th Conf. on Uncertainty in Artificial Intelligence, 1999.
[5] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8:195-210, 1994.
[6] K.P. Burnham and D.R. Anderson. Model Selection and Inference: A Practical Information-Theoretic Approach. Springer-Verlag, 1998.
[7] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14, 1968.
[8] T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, 1991.
[9] Y. Ephraim, A. Dembo, and L. Rabiner. A minimum discrimination information approach for HMM. IEEE Trans. Info. Theory, 35(5):1001-1013, September 1989.
[10] N. Friedman. The Bayesian structural EM algorithm. 14th Conf. on Uncertainty in Artificial Intelligence, 1998.
[11] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131-163, 1997.
[12] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. 14th Conf. on Uncertainty in Artificial Intelligence, 1998.
[13] N. Friedman, I. Nachman, and D. Pe'er. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. 15th Conf. on Uncertainty in Artificial Intelligence, 1999.
[14] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45-74, 1996.
[15] D. Heckerman. Probabilistic Similarity Networks. MIT Press, 1991.
[16] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft, 1995.
[17] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Technical Report MSR-TR-94-09, Microsoft, 1994.
[18] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Trans. on Speech and Audio Signal Processing, 5(3):257-265, May 1997.
[19] S. Katagiri, B.-H. Juang, and C.-H. Lee. Pattern recognition using a family of design algorithms based upon the generalized probabilistic descent method. Proceedings of the IEEE, 1998.
[20] P. Krause. Learning probabilistic networks. Philips Research Labs Tech. Report, 1998.
[21] S.L. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
[22] H. Linhart and W. Zucchini. Model Selection. Wiley, 1986.
[23] M. Meila. Learning with Mixtures of Trees. PhD thesis, MIT, 1999.
[24] J. Pitrelli, C. Fong, S.H. Wong, J.R. Spitz, and H.C. Leung. PhoneBook: A phonetically-rich isolated-word telephone-speech database. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1995.
[25] M. Sahami. Learning limited dependence Bayesian classifiers. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996.
[26] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[27] G. Zweig. Speech Recognition with Dynamic Bayesian Networks. PhD thesis, U.C. Berkeley, 1998.
