Learning Bayesian Networks with R
Susanne G. Bøttcher and Claus Dethlefsen
Abstract
deal is a software package freely available for use with R. It includes several
methods for analysing data using Bayesian networks with variables of discrete
and/or continuous types but restricted to conditionally Gaussian networks.
Construction of priors for network parameters is supported and their param-
eters can be learned from data using conjugate updating. The network score
is used as a metric to learn the structure of the network and forms the basis
of a heuristic search strategy. deal has an interface to Hugin.
1 Introduction
A Bayesian network is a graphical model that encodes the joint probability distri-
bution for a set of random variables. Bayesian networks are treated in e.g. Cowell,
Dawid, Lauritzen, and Spiegelhalter (1999) and have found application within many
fields, see Lauritzen (2003) for a recent overview.
Here we consider Bayesian networks with mixed variables, i.e. the random vari-
ables in a network can be of both discrete and continuous types. A method for
learning the parameters and structure of such Bayesian networks has recently been
described by Bøttcher (2001). We have developed a package called deal, writ-
ten in R (Ihaka and Gentleman, 1996), which provides these methods for learn-
ing Bayesian networks. In particular, the package includes procedures for defin-
ing priors, estimating parameters, calculating network scores, performing heuristic
search as well as simulating data sets with a given dependency structure. The
package can be downloaded from the Comprehensive R Archive Network (CRAN)
http://cran.R-project.org/ and may be used freely for non-commercial pur-
poses.
In Section 2 we define Bayesian networks for mixed variables. To learn a
Bayesian network, the user needs to supply a training data set and represent any
prior knowledge available as a Bayesian network. Section 3 discusses how to specify
a Bayesian network in terms of a directed acyclic graph and the local probability
distributions. deal uses the prior Bayesian network to deduce prior distributions
for all parameters in the model. Then, this is combined with the training data
to yield posterior distributions of the parameters. The parameter learning proce-
dure is treated in Section 4. Section 5 describes how to learn the structure of the
network. A network score is calculated and a search strategy is employed to find
the network with the highest score. This network gives the best representation of
data and we call it the posterior network. Section 6 describes how to transfer the
posterior network to Hugin (http://www.hugin.com). The Hugin graphical user
interface (GUI) can then be used for further inference in the posterior network.
2 Bayesian networks
Let D = (V, E) be a Directed Acyclic Graph (DAG), where V is a finite set of
nodes and E is a finite set of directed edges (arrows) between the nodes. The DAG
defines the structure of the Bayesian network. To each node $v \in V$ in the graph
corresponds a random variable $X_v$. The set of variables associated with the graph
$D$ is then $X = (X_v)_{v \in V}$. Often, we do not distinguish between a variable $X_v$ and
the corresponding node $v$. To each node $v$ with parents $\mathrm{pa}(v)$, a local probability
distribution $p(x_v \mid x_{\mathrm{pa}(v)})$ is attached. The set of local probability distributions for
all variables in the network is $\mathcal{P}$. A Bayesian network for a set of random variables
$X$ is then the pair $(D, \mathcal{P})$.
The possible lack of directed edges in D encodes conditional independencies
between the random variables X through the factorization of the joint probability
distribution,
$$p(x) = \prod_{v \in V} p\left(x_v \mid x_{\mathrm{pa}(v)}\right).$$
Here, we allow Bayesian networks with both discrete and continuous variables,
as treated in Lauritzen (1992), so the set of nodes V is given by V = ∆ ∪ Γ, where
∆ and Γ are the sets of discrete and continuous nodes, respectively. The set of
variables $X$ can then be denoted $X = (X_v)_{v \in V} = (I, Y) = ((I_\delta)_{\delta \in \Delta}, (Y_\gamma)_{\gamma \in \Gamma})$,
where $I$ and $Y$ are the sets of discrete and continuous variables, respectively. For a
discrete variable $\delta$, we let $\mathcal{I}_\delta$ denote its set of levels.
To ensure availability of exact local computation methods, we do not allow
discrete variables to have continuous parents. The joint probability distribution
then factorizes into a discrete part and a mixed part, so
$$p(x) = p(i, y) = \prod_{\delta \in \Delta} p\left(i_\delta \mid i_{\mathrm{pa}(\delta)}\right) \prod_{\gamma \in \Gamma} p\left(y_\gamma \mid i_{\mathrm{pa}(\gamma)}, y_{\mathrm{pa}(\gamma)}\right).$$
3 Specification of a Bayesian network
In deal, a Bayesian network is represented as an object of class network. The
primary attribute of a network is the list of nodes, in the example ksl.nw$nodes.
Each entry in the list is an object of class node representing a node in the graph,
which includes information associated with the node. Several methods for the net-
work class operate by applying an appropriate method for one or more nodes in the
list of nodes.
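As a brief sketch of how this looks in practice (the network() constructor and the
data(ksl) call are taken from the deal documentation; exact behaviour may differ
between versions of the package):

## Sketch: build a network object from a data frame and inspect
## its list of nodes.
library(deal)
data(ksl)               # the study data used in the example (Section 7)
ksl.nw <- network(ksl)  # one 'node' object per variable
names(ksl.nw$nodes)     # the primary attribute: the list of nodes
ksl.nw$nodes[[1]]       # information stored for a single node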
For a continuous node $\gamma$, the local probability distribution is conditionally
Gaussian, so that
$$Y_\gamma \mid i_{\mathrm{pa}(\gamma)}, y_{\mathrm{pa}(\gamma)}, \theta_{\gamma \mid i_{\mathrm{pa}(\gamma)}} \sim \mathcal{N}\left(m_{\gamma \mid i_{\mathrm{pa}(\gamma)}} + y_{\mathrm{pa}(\gamma)} \beta_{\gamma \mid i_{\mathrm{pa}(\gamma)}},\ \sigma^2_{\gamma \mid i_{\mathrm{pa}(\gamma)}}\right).$$
Define $z_{\mathrm{pa}(\gamma)} = (1, y_{\mathrm{pa}(\gamma)})$ and let $\eta_{\gamma \mid i_{\mathrm{pa}(\gamma)}} = (m_{\gamma \mid i_{\mathrm{pa}(\gamma)}}, \beta_{\gamma \mid i_{\mathrm{pa}(\gamma)}})$, where $m_{\gamma \mid i_{\mathrm{pa}(\gamma)}}$
is the intercept and $\beta_{\gamma \mid i_{\mathrm{pa}(\gamma)}}$ is the vector of coefficients. For a continuous variable $\gamma$,
the suggested local probability distribution $\mathcal{N}(z_{\mathrm{pa}(\gamma)} \eta_{\gamma \mid i_{\mathrm{pa}(\gamma)}},\ \sigma^2_{\gamma \mid i_{\mathrm{pa}(\gamma)}})$ is thus a
regression on the continuous parents for each configuration of the discrete parents.
For continuous variables, the joint distribution $\mathcal{N}(M_i, \Sigma_i)$ is determined for each
configuration of the discrete variables by applying a sequential algorithm developed
in Shachter and Kenley (1989).
In deal, we can assess these quantities from the network object.
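A hedged sketch, assuming the ksl.nw object from above; jointprior() is the
package's procedure for deducing the joint prior, and the imaginary database size
of 64 is an illustrative assumption:

## Sketch: deduce the joint prior distribution from the prior network.
## The imaginary database size (64) is an illustrative assumption.
ksl.prior <- jointprior(ksl.nw, 64)
names(ksl.prior)   # components of the deduced joint distribution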
4 Parameter learning
To estimate the parameters in the network, we use the Bayesian approach. We
encode our uncertainty about parameters θ in a prior distribution p(θ), use data d
to update this distribution, and hereby obtain the posterior distribution p(θ|d) by
using Bayes' theorem,
$$p(\theta \mid d) = \frac{p(d \mid \theta)\, p(\theta)}{p(d)}, \qquad \theta \in \Theta. \tag{1}$$
Here Θ is the parameter space, d is a random sample from the probability distri-
bution p(x|θ) and p(d|θ) is the joint probability distribution of d, also called the
likelihood of θ. We refer to this as parameter learning or just learning.
In deal, we assume that the parameters associated with one variable are inde-
pendent of the parameters associated with the other variables and, in addition, that
the parameters are independent for each configuration of the discrete parents, i.e.
$$p(\theta) = \prod_{\delta \in \Delta} \prod_{i_{\mathrm{pa}(\delta)} \in \mathcal{I}_{\mathrm{pa}(\delta)}} p\left(\theta_{\delta \mid i_{\mathrm{pa}(\delta)}}\right) \prod_{\gamma \in \Gamma} \prod_{i_{\mathrm{pa}(\gamma)} \in \mathcal{I}_{\mathrm{pa}(\gamma)}} p\left(\theta_{\gamma \mid i_{\mathrm{pa}(\gamma)}}\right). \tag{2}$$
The local parameter priors are deduced from the prior Bayesian network by the
master prior procedure:
1. From the prior Bayesian network and the size of an imaginary database, a
joint prior distribution for all parameters in the network is deduced.
2. From this joint prior distribution, the marginal distribution of all parameters
in the family consisting of the node and its parents can be determined. We
call this the master prior.
3. The local parameter priors are now determined by conditioning in the master
prior distribution.
This procedure ensures parameter independence. Further, it has the property
that if a node has the same set of parents in two different networks, then the local
parameter prior for this node will be the same in the two networks. Therefore,
we only have to deduce the local parameter prior for a node given the same set of
parents once. This property is called parameter modularity.
In the discrete case, the joint parameter prior is Dirichlet,
$$p(\Psi) \sim \mathcal{D}(\alpha),$$
and for a family $A = \delta \cup \mathrm{pa}(\delta)$ we let $\alpha_A = (\alpha_{i_A})_{i_A \in \mathcal{I}_A}$. Then the marginal distribution of $\Psi_A$ is Dirichlet,
$p(\Psi_A) \sim \mathcal{D}(\alpha_A)$. This is the master prior in the discrete case. The local parameter
priors can now be found by conditioning in these master prior distributions.
We cannot use these distributions to derive priors for other networks, so instead
we use the imaginary database to derive local master priors.
Define the notation
$$\rho_{i_{A\cap\Delta}} = \sum_{j\,:\,j_{A\cap\Delta} = i_{A\cap\Delta}} \rho_j,$$
and similarly for $\nu_{i_{A\cap\Delta}}$ and $\Phi_{i_{A\cap\Delta}}$. For the family $A = \gamma \cup \mathrm{pa}(\gamma)$, the local master
prior is then found as
$$\Sigma_{A\cap\Gamma \mid i_{A\cap\Delta}} \sim \mathcal{IW}\left(\rho_{i_{A\cap\Delta}},\ \tilde{\Phi}_{A\cap\Gamma \mid i_{A\cap\Delta}}\right),$$
$$M_{A\cap\Gamma \mid i_{A\cap\Delta}} \mid \Sigma_{A\cap\Gamma \mid i_{A\cap\Delta}} \sim \mathcal{N}\left(\bar{\mu}_{A\cap\Gamma \mid i_{A\cap\Delta}},\ \frac{1}{\nu_{i_{A\cap\Delta}}}\,\Sigma_{A\cap\Gamma \mid i_{A\cap\Delta}}\right),$$
where
$$\bar{\mu}_{i_{A\cap\Delta}} = \frac{\sum_{j\,:\,j_{A\cap\Delta} = i_{A\cap\Delta}} \mu_j \nu_j}{\nu_{i_{A\cap\Delta}}}$$
and
$$\tilde{\Phi}_{A\cap\Gamma \mid i_{A\cap\Delta}} = \Phi_{i_{A\cap\Delta}} + \sum_{j\,:\,j_{A\cap\Delta} = i_{A\cap\Delta}} \nu_j \left(\mu_j - \bar{\mu}_{i_{A\cap\Delta}}\right)\left(\mu_j - \bar{\mu}_{i_{A\cap\Delta}}\right)^{\top}.$$
Again, the local parameter priors can be found by conditioning in these local master
priors.
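In code, the parameter learning step might look like the following sketch; learn()
is the package's learning procedure, and returning the updated network in the $nw
component is an assumption based on common deal usage:

## Sketch: conjugate updating of all local parameter priors with the
## data, yielding the posterior distributions of the parameters.
ksl.nw <- learn(ksl.nw, ksl, ksl.prior)$nw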
5 Learning the structure
To learn the structure of the network, a network score is calculated for each
candidate DAG, and a search strategy is employed to find the DAG with the highest
score. Note that the network score is a product over terms involving only one node
and its parents. This property is called decomposability. It can be shown that the
network scores for two independence equivalent DAGs are equal. This property is
called likelihood equivalence and it is a property of the master prior procedure.
In deal we use, for computational reasons, the logarithm of the network score.
The log network score contribution of a node is evaluated whenever the node is
learned; the log network score is then updated and stored in the score attribute of
the network.
To compare two DAGs, $D$ and $D^*$, we consider the posterior odds,
$$\frac{p(D \mid d)}{p(D^* \mid d)} = \frac{p(D)}{p(D^*)} \times \frac{p(d \mid D)}{p(d \mid D^*)},$$
where $p(D)/p(D^*)$ is the prior odds and $p(d \mid D)/p(d \mid D^*)$ is the Bayes factor. At the
moment, the only option in deal for specifying a prior distribution over DAGs is to
let all DAGs be equally likely, so the prior odds are always equal to one. Therefore,
we use the Bayes factor for comparing two different DAGs.
In greedy search we compare models that differ only by a single arrow, either
added, removed or reversed. In these cases, the Bayes factor is especially simple,
because of decomposability of the network score.
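As a sketch of the search step (autosearch() is deal's greedy search; the argument
names follow the package documentation and should be checked against the
installed version):

## Sketch: greedy search, starting from the current network and
## adding/removing/reversing single arrows while the score improves.
ksl.search <- autosearch(ksl.nw, ksl, ksl.prior, trace = TRUE)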
To manually assess the network score of a network (e.g. to use as initial network
in a search), use
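a sketch like the following; that learning evaluates the score and stores it in the
score attribute follows the description above, while the $nw component of the
return value is an assumption that may vary between deal versions:

## Sketch: learning evaluates the log network score, which is then
## stored in the network's score attribute.
ksl.nw <- learn(ksl.nw, ksl, ksl.prior)$nw
ksl.nw$score   # the log network score of this structure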
6 Hugin interface
A network object may be written to a file in the Hugin .net language. Hugin
(http://www.hugin.com) is commercial software for inference in Bayesian net-
works. Hugin can learn networks with only discrete variables but cannot learn
purely continuous or mixed networks. deal may therefore be
used for this purpose and the result can then be transferred to Hugin. The pro-
cedure savenet() saves a network to a file. For each node, we use point estimates
of the parameters in the local probability distributions. The readnet() procedure
reads the network structure from a file but does not, however, read the probability
distributions. This is planned to be included in a future version of deal.
7 Example
In this section, we describe the analysis of the ksl data that has been used as
illustration throughout the paper. The data set, included in Badsberg (1995), is
from a study measuring health and social characteristics of representative samples
of Danish 70-year-old people, taken in 1967 and 1984. In total, 1083 cases have
been recorded and each case contains observations on nine different variables, see
Table 1.
Table 1: Variables in the ksl data set. The variables Fev, Kol and BMI are continuous;
the rest are discrete.

Fev    Forced ejection volume (lung function)    continuous
Kol    Cholesterol                               continuous
BMI    Body mass index                           continuous
Hyp    Hypertension                              discrete
Smok   Smoking                                   discrete
Alc    Alcohol consumption                       discrete
Work   Working                                   discrete
Sex    Gender                                    discrete
Year   Survey year (1967 or 1984)                discrete
The purpose of our analysis is to find dependency relations between the vari-
ables. One interest is to determine which variables influence the presence or absence
of hypertension. From a medical viewpoint, it is possible that hypertension is in-
fluenced by some of the continuous variables Fev, Kol and BMI. However, in deal
we do not allow continuous parents of discrete nodes, so we cannot describe such a
relation. A way to overcome this problem is to treat Hyp as a continuous variable,
even though this is obviously not the most natural choice. This is done in the analysis below.
Further, the initial data analysis indicates a transformation of BMI into log(BMI).
With these adjustments, the data set is ready for analysis in deal.
We have no prior knowledge about specific dependency relations, so for simplicity
we use the empty DAG as the prior DAG and let the probability distribution of
the discrete variables be uniform. The assessment of the probability distribution
for the continuous variables is based on data, as described in Section 3.1.
We do not allow arrows into Sex and Year, as none of the other variables can
influence these variables. So we create a ban list which is attached to the network.
The ban list is a matrix with two columns. Each row contains the directed edge
that is not allowed.
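A sketch of this specification; the node indices are assumptions that depend on
the column order of the data frame, and the banlist() replacement function follows
the deal documentation:

## Sketch: ban all arrows into Sex and Year. Each row of the matrix is
## a banned directed edge (from-node index, to-node index). The indices
## for Sex (8) and Year (9) are assumptions about the column order.
ban.into <- function(to, p = 9) cbind(setdiff(1:p, to), to)
mybanlist <- rbind(ban.into(8), ban.into(9))
banlist(ksl.nw) <- mybanlist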
Finally, the parameters in the network are learned and structural learning is
used with the prior DAG as starting point.
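These steps might look as follows (a sketch; heuristic() performs greedy search
with random restarts, and the restart/degree settings and the way the best network
is extracted are assumptions based on the deal documentation):

## Sketch: learn the parameters, then search for the best structure
## with the prior DAG as the starting point.
ksl.nw  <- learn(ksl.nw, ksl, ksl.prior)$nw
ksl.res <- heuristic(ksl.nw, ksl, ksl.prior,
                     restart = 2, degree = 10, trace = TRUE)
thebest <- ksl.res$nw[[1]]   # the highest-scoring network found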
Figure 1: The selected network thebest, with nodes Kol, Year, logBMI, Hyp, Sex,
FEV, Smok, Work and Alc.
The resulting network thebest is shown in Figure 1; it is the network with the
highest network score among the networks visited during the search.
In the result we see for the discrete variables that Alc, Smok and Work depend
directly on Sex and Year. In addition, Smok and Work also depend on Alc. These two
arrows are, however, not causal arrows, as Smok ← Alc → Work in the given DAG
represents the same probability distribution as the relations Smok ← Alc ← Work and
Smok → Alc → Work, i.e. the three DAGs are independence equivalent. Year and
Sex have no parents, as specified in the ban list. For the continuous
variables, all the arrows are causal arrows. We see that Fev depends directly on Year,
Sex and Smok. So given these variables, Fev is conditionally independent of the rest of
the variables. Kol depends directly on Year and Sex, and logBMI depends directly on
Kol and Sex. Given logBMI and Fev, the variable Hyp is conditionally independent of
the rest of the variables. So according to this study, hypertension can be determined
by the body mass index and the lung function forced ejection volume. However, as
Hyp is not continuous by nature, other analyses should be performed with Hyp as a
discrete variable, e.g. a logistic regression with Hyp as the response and the remaining
variables as explanatory variables. Such an analysis indicates that, in addition, Sex and
Smok may influence Hyp but otherwise identifies logBMI as the main predictor.
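Such a check can be run directly in R; a minimal sketch, assuming Hyp is recoded
as a factor (the formula uses all remaining variables and is illustrative, not the
authors' exact model):

## Sketch: logistic regression with Hyp as the response and the
## remaining variables as explanatory variables.
ksl$Hyp <- factor(ksl$Hyp)
hyp.glm <- glm(Hyp ~ ., data = ksl, family = binomial)
summary(hyp.glm)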
Acknowledgements
The work has been supported by Novo Nordisk A/S.
References
J.H. Badsberg. An Environment for Graphical Models. PhD thesis, Aalborg University, 1995.
S.G. Bøttcher. Learning Bayesian networks with mixed variables. In Proceedings of the Eighth International Workshop in Artificial Intelligence and Statistics, 2001.
R.G. Cowell, A.P. Dawid, S.L. Lauritzen, and D.J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer-Verlag, New York, 1999.
R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.
S.L. Lauritzen. Propagation of probabilities, means and variances in mixed graphical association models. Journal of the American Statistical Association, 87(420):1098–1108, 1992.
S.L. Lauritzen. Some modern applications of graphical models. In P.J. Green, N.L. Hjort, and S. Richardson, editors, Highly Structured Stochastic Systems. Oxford University Press, 2003.
R.D. Shachter and C.R. Kenley. Gaussian influence diagrams. Management Science, 35(5):527–550, 1989.
Corresponding author
Claus Dethlefsen
Dept. of Mathematical Sciences
Aalborg University
Fr. Bajers Vej 7G
9220 Aalborg, Denmark
E-mail: [email protected]