Dynamic Bayesian Multinets
Jeff A. Bilmes
Department of Electrical Engineering
University of Washington
Seattle, WA 98195-2500
In this paper, a class of models called dynamic Bayesian multinets is described, along with a method to induce their structure for the classification task. In this work, an extension of [3], the problem domain is speech recognition, so it is necessary to use dynamic models. Also, since classification is the goal, it is beneficial to learn class-specific and (as we will see) discriminative structure. And to further improve sparsity (and therefore reduce computational and memory demands) and to represent class-conditional information only where necessary, Bayesian multinets (described in the next section) are used.

Section 2 provides a review of structure learning in Bayesian networks and of Bayesian multinets, and presents the idea of structural discriminability. Section 3 introduces the class of models considered in this work and analyzes their inferential complexity. Section 4 provides three information-theoretic criterion functions that can be used to learn structure, the last of which provably provides an optimal approximation to the local posterior probability. Section 5 introduces the improved pairwise algorithm, a heuristic developed because the above induction procedure is computationally infeasible. Section 6 evaluates this system on a medium-vocabulary speech corpus and shows that when structure is determined using the discriminative induction method and trained using EM, these networks can outperform both HMMs and other dynamic Bayesian networks with a similar number of parameters. But when structure is determined arbitrarily, or without using a discriminative method, the performance is dramatically worse. Finally, Section 7 concludes and discusses future work.

2.1 Structure Learning

A fully-connected graphical model can represent any probability distribution representable by a sparsely structured one, but there are many important reasons for not using such a fully connected model. These include: 1) sparse network structures have fewer computational and memory requirements; 2) a sparse network is less susceptible to noise in training data (i.e., it has lower variance) and is less prone to over-fitting; and 3) the resulting structure might reveal high-level knowledge about the underlying problem domain that was previously drowned out by many extra dependencies. A graphical model should represent a dependence between two random variables only when necessary, where "necessary" is determined by the underlying goal: to identify a system for probabilistic inference that is computationally efficient, accurate, and somehow informative about the given problem domain.

Perhaps the earliest well-known work on structure learning in directed graphical models is [7]. More recent research on this topic may be found in [17, 5, 16, 25, 10, 20, 23, 13].

In general, the task of learning Bayesian networks can be grouped into four categories [10] depending on 1) whether the data is fully observable or contains missing values, and 2) whether the structure of the model is assumed to be known or not. The easiest case is when the data is fully observable and the model structure is known, whereas the most difficult case is when the data is only partially observable and the structure is unknown or only partially known.

Note that a general optimization procedure can be used to learn many aspects of a graphical model. Often, learning needs only a maximum-likelihood procedure, perhaps with an additional complexity penalty term such as MDL or BIC. Alternatively, a Bayesian approach to learning can be used in which no single structure or set of parameters is chosen. For certain classes of networks, the prior and posterior are particularly simple [16]. Alternatively, a risk minimization approach [26] can be applied to the learning problem.
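As a concrete instance of such a penalized-likelihood score — a standard sketch, not a formula taken from this paper — the BIC score of a candidate structure S with maximum-likelihood parameters θ̂_S, d_S free parameters, and n training samples is

$$\mathrm{score}_{\mathrm{BIC}}(S) \;=\; \log p(D \mid \hat\theta_S, S) \;-\; \frac{d_S}{2}\,\log n.$$

The penalty term discourages dense structures, consistent with the preference for sparse networks discussed above.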
In principle, an optimization procedure could simultaneously cover all four components of a graphical model: semantics, structure, implementation, and parameters. There has, however, been little if any research on methods to learn the best implementation and semantics. The problem is inherently difficult because the quality of each component cannot be accurately evaluated without first obtaining good settings for the other three components. The problem becomes more arduous when one begins to consider multi-implementation and multi-semantic models. In practice, therefore, one or more components are typically fixed before any optimization begins.

An advantage of Bayesian networks is that they can specify dependencies only when necessary, leading to a significant reduction in the cost of inference. Bayesian multinets [15, 14] further generalize Bayesian networks and can further reduce computation. A multinet can be thought of as a network where edges can appear or disappear depending on the values of certain nodes in the graph, a notion that has been called asymmetric independence assertions [14].

Consider a network with four nodes A, B, C, and Q. In a multinet, the conditional independence properties among A, B, and C might, for example, change for differing values of Q. If Q is binary, and C is independent of A given {B, Q = 0} but C is not independent of A given {B, Q = 1}, then the joint probability could be written differently for the two values of Q.
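For illustration — a sketch consistent with the stated independence assertions, not the paper's own equation — the two class-conditional factorizations could take the form

$$p(a, b, c \mid Q=0) \;=\; p(b \mid Q=0)\, p(a \mid b, Q=0)\, p(c \mid b, Q=0),$$

$$p(a, b, c \mid Q=1) \;=\; p(b \mid Q=1)\, p(a \mid b, Q=1)\, p(c \mid a, b, Q=1),$$

so the edge between A and C is present only in the network for Q = 1; this value-dependent presence or absence of edges is exactly what a multinet encodes.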
Figure 1: Two networks corresponding to two values of the Markov chain, q_{1:T} and q'_{1:T}. The individual elements of the observation vectors are shown explicitly as dark shaded nodes surrounded by boxes. Nodes with diagonal lines correspond to instantiated hidden variables.

An AR-HMM(K) can be viewed as an HMM with additional edges pointing from observations X_{t-ℓ} to X_t for ℓ = 1, ..., K (see Figure 2). Collapsing the vector graphical notation in Figure 1 into a single node, and including a dependency from observations X_r to X_t in the AR-HMM if, for any value of Q_t, there exists a dependency between X_{ri} and X_{tj} for any i, j, a DBM with a maximal dependency across K observations can be more generally represented by an AR-HMM(K).

Figure 2: Bayesian network for an AR-HMM(2).
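Assuming a first-order Markov chain over the hidden states, the graph in Figure 2 corresponds to a joint factorization of the following form (a sketch of the AR-HMM(K) factorization, with out-of-range indices omitted):

$$p(x_{1:T}, q_{1:T}) \;=\; \prod_{t=1}^{T} p(q_t \mid q_{t-1})\; p(x_t \mid x_{t-K:t-1},\, q_t).$$

In a dynamic Bayesian multinet, which elements of the previous observations actually appear as parents of x_t is allowed to depend on the value of q_t; that class-dependent parent set is the z_t(q) chosen below.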
Structure learning consists of optimally choosing z_t(q) in p(x_t | z_t(q), q). In this section, three methods are considered. In each case, a fixed upper bound is assumed on the possible number of dependency variables; that is, it is assumed that |z_t(q)| ≤ c for some fixed c > 0. The phrases "choose dependency variables" and "choose dependencies" will be used synonymously. It is assumed that the reader is familiar with information-theoretic constructs [8].
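As a toy sketch of this selection task (an illustration only, using discrete variables, a plug-in estimator, and invented helper names — not the induction procedure of this paper), one can score each candidate dependency variable separately for each class by an empirical mutual-information estimate and keep at most c of them:

```python
import numpy as np

def empirical_mi(x, z):
    """Plug-in estimate of I(X; Z) in nats for two discrete 1-D arrays."""
    x, z = np.asarray(x), np.asarray(z)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for zv in np.unique(z):
            pz = np.mean(z == zv)
            pxz = np.mean((x == xv) & (z == zv))
            if pxz > 0.0:
                mi += pxz * np.log(pxz / (px * pz))
    return mi

def choose_dependencies(x, candidates, q, c):
    """For each class value in q, keep the (at most) c candidate variables
    having the largest class-conditional mutual information with x."""
    chosen = {}
    for qv in np.unique(q):
        m = (q == qv)
        scores = {name: empirical_mi(x[m], z[m]) for name, z in candidates.items()}
        chosen[qv] = sorted(scores, key=scores.get, reverse=True)[:c]
    return chosen

# Tiny synthetic usage example: z1 predicts x only when q == 1.
rng = np.random.default_rng(0)
q = rng.integers(0, 2, size=2000)
z1 = rng.integers(0, 4, size=2000)
z2 = rng.integers(0, 4, size=2000)
x = np.where(q == 1, z1 % 2, rng.integers(0, 2, size=2000))
print(choose_dependencies(x, {"z1": z1, "z2": z2}, q, c=1))
```

A discriminative variant would additionally penalize candidates that are also informative under the other classes, along the lines of the criteria developed below.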
The following theorem will be needed:

Theorem 4.1 (Mutual Information and Likelihood). Given three random variables X, Z^(a), and Z^(b), where I(X; Z^(a)) > I(X; Z^(b)), the likelihood of X given Z^(a) is higher than the likelihood of X given Z^(b) for n, the sample size, large enough.
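A minimal sketch of why this holds, assuming i.i.d. samples (x_i, z_i) and the true conditional densities:

$$\frac{1}{n}\sum_{i=1}^{n}\log p(x_i \mid z_i) \;\longrightarrow\; E\big[\log p(X \mid Z)\big] \;=\; -H(X \mid Z) \;=\; I(X; Z) - H(X),$$

so by the law of large numbers the average log-likelihood of X given a conditioning variable Z approaches I(X;Z) - H(X); since H(X) does not depend on the choice of Z, the variable with the larger mutual information eventually yields the larger likelihood.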
Of course, the actual probability distribution p(x|z) is not known and only an approximation p̂(x|z, θ) is available, where the parameters θ are estimated, often using a method such as maximum likelihood, that decreases D(p(x|z) ‖ p̂(x|z)), the KL-distance between the actual and approximate distributions. If p̂ is close enough to p, then the theorem above still holds. It is therefore assumed that the parameters of p̂ have been estimated well enough that any differences with p are negligible.
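A standard identity makes the role of this assumption explicit (a sketch, with expectations taken under the true joint p(x, z)):

$$E_{p(x,z)}\big[\log \hat p(x \mid z)\big] \;=\; -H(X \mid Z) \;-\; E_{p(z)}\Big[D\big(p(\cdot \mid z)\,\|\,\hat p(\cdot \mid z)\big)\Big],$$

so the expected log-likelihood under the estimated model differs from the ideal value -H(X|Z) only by the expected KL-divergence; when that term is small for both candidate dependency sets, the comparison in Theorem 4.1 is unaffected.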
A similar result holds in the class-conditional case: if

$$I\big(X; Z^{(a)}(q) \mid Q = q\big) \;>\; I\big(X; Z^{(b)}(q) \mid Q = q\big) \quad \text{for all } q,$$

then

$$\frac{1}{T}\sum_{t=1}^{T} \log p\big(x_t \mid z_t^{(a)}(q_t), q_t\big) \;>\; \frac{1}{T}\sum_{t=1}^{T} \log p\big(x_t \mid z_t^{(b)}(q_t), q_t\big).$$

These quantities can be viewed as likelihoods of the data given Viterbi paths q_{1:T} of modified HMMs; in the first case, the Viterbi path likelihood is higher. Note that, using a similar argument as in the theorem, and because H(X) ≥ H(X|Z), ...

Negating and expanding as integrals gives terms of the form

$$I_r(X; Z \mid q) \;\triangleq\; \int p(x, z \mid r)\, \log \frac{p(x, z \mid q)}{p(x \mid q)\, p(z \mid q)}\; dx\, dz.$$

Therefore, I(X; Z^{(q)} | q) should be large and I_q(X; Z^{(r)} | r) should not be as large for each r. This suggests optimizing the following:

$$S(X; Z \mid Q) \;\triangleq\; \sum_{q,r} p(q)\,\big(\delta_{qr} - p(r)\big)\, I_q\big(X; Z^{(r)} \mid r\big),$$

where Z = ∪_i Z^{(i)} and δ_{qr} is a Kronecker delta.

Turning to the class posterior, E[log p(Q|X,Z)] = -H(Q|X,Z) = -H(X|Q,Z) - H(Q|Z) + H(X|Z), so that

$$E[\log p(Q \mid X, Z)] + H(X \mid Q) + H(Q) - H(X) \;=\; I(X; Z \mid Q) + I(Q; Z) - I(X; Z).$$

Furthermore, the conditional entropy can be bounded by ...
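Since H(X|Q), H(Q), and H(X) do not depend on which dependency variables Z are selected, the identity above implies

$$\arg\max_{Z}\; E[\log p(Q \mid X, Z)] \;=\; \arg\max_{Z}\;\big[\, I(X; Z \mid Q) + I(Q; Z) - I(X; Z) \,\big].$$

If the I(Q;Z) term is treated as negligible or roughly constant across candidate dependency sets — an assumption of this note — the objective reduces to the difference I(X;Z|Q) - I(X;Z), a discriminative quantity of the kind the Discussion refers to as the EAR measure.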
... already added, as depicted in Figure 4. The degree of allowed redundancy is determined using 0 < r < 1. The third ...

... baseline HMM-based system. The following general training procedure is used for all of the results reported in this ...
... past [13].

Finally, as argued in [2], an HMM can approximate a distribution arbitrarily well given enough capacity, enough training data, and enough computation. The results in the tables support this claim, as increasing the number of parameters leads to improved accuracy. The performance improvement obtained ... sparser, higher performing, but lower complexity networks.

7 Discussion

In this paper, a class of graphical models is considered that generalizes HMMs. Several methods to automatically learn structure were presented that have optimal properties, either by maximizing the likelihood score or (the EAR measure) by maximizing the class posterior probability. A dependency selection heuristic is introduced, and an implementation was tested using a medium-vocabulary speech corpus, showing that appreciable gains can be obtained when the dependencies are chosen discriminatively.

While this paper does not address the problem of learning the hidden structure in networks and uses only a simple Markov chain to represent dynamics [4, 12], for speech it is often sufficient to consider a Markov chain as a probabilistic sequencer over strings of phonetic units. The multinets, which are conditioned on each of these sequences, determine local structure. Ultimately, it is planned to use and learn more complex models of dynamic behavior for those classes of signals that can benefit from it. It is also planned to use the EAR measure to determine more general discriminatively structured Bayesian multinets.

References

[4] X. Boyen, N. Friedman, and D. Koller. Discovering the hidden structure of complex dynamic systems. 15th Conf. on Uncertainty in Artificial Intelligence, 1999.
[5] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, 8:195-210, 1994.
[6] K.P. Burnham and D.R. Anderson. Model Selection and Inference: A Practical Information-Theoretic Approach. Springer-Verlag, 1998.
[7] C.K. Chow and C.N. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Info. Theory, 14, 1968.
[12] N. Friedman, K. Murphy, and S. Russell. Learning the structure of dynamic probabilistic networks. 14th Conf. on Uncertainty in Artificial Intelligence, 1998.
[13] N. Friedman, I. Nachman, and D. Peer. Learning Bayesian network structure from massive datasets: The "sparse candidate" algorithm. 15th Conf. on Uncertainty in Artificial Intelligence, 1999.
[14] D. Geiger and D. Heckerman. Knowledge representation and inference in similarity networks and Bayesian multinets. Artificial Intelligence, 82:45-74, 1996.
[15] D. Heckerman. Probabilistic Similarity Networks. MIT Press, 1991.
[16] D. Heckerman. A tutorial on learning with Bayesian networks. Technical Report MSR-TR-95-06, Microsoft, 1995.
[17] D. Heckerman, D. Geiger, and D.M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Technical Report MSR-TR-94-09, Microsoft.
[18] B.-H. Juang, W. Chou, and C.-H. Lee. Minimum classification error rate methods for speech recognition. IEEE Trans. on Speech and Audio Signal Processing, 5(3):257-265, May 1997.
[19] S. Katagiri, B.-H. Juang, and C.-H. Lee. Pattern recognition using a family of design algorithms based upon the generalized probabilistic descent method. Proceedings of the IEEE.
[20] P. Krause. Learning probabilistic networks. Philips Research Labs Tech. Report, 1998.
[21] S.L. Lauritzen. Graphical Models. Oxford Science Publications, 1996.
[22] H. Linhart and W. Zucchini. Model Selection. Wiley, 1986.
[23] M. Meila. Learning with Mixtures of Trees. PhD thesis, MIT, 1999.
[24] J. Pitrelli, C. Fong, S.H. Wong, J.R. Spitz, and H.C. Leung. PhoneBook: A phonetically-rich isolated-word telephone-speech database. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 1995.
[25] M. Sahami. Learning limited dependence Bayesian classifiers. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996.
[26] V. Vapnik. Statistical Learning Theory. Wiley, 1998.
[27] G. Zweig. Speech Recognition with Dynamic Bayesian Networks. PhD thesis, U.C. Berkeley, 1998.