Learning Bayesian Networks

Richard E. Neapolitan
Northeastern Illinois University
Chicago, Illinois

In memory of my dad, a difficult but loving father, who raised me well.


Contents

Preface ix

I Basics 1
1 Introduction to Bayesian Networks 3
1.1 Basics of Probability Theory . . . . . . . . . . . . . . . . . . . . 5
1.1.1 Probability Functions and Spaces . . . . . . . . . . . . . . 6
1.1.2 Conditional Probability and Independence . . . . . . . . . 9
1.1.3 Bayes’ Theorem . . . . . . . . . . . . . . . . . . . . . . . 12
1.1.4 Random Variables and Joint Probability Distributions . . 13
1.2 Bayesian Inference . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.1 Random Variables and Probabilities in Bayesian Applica-
tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
1.2.2 A Definition of Random Variables and Joint Probability
Distributions for Bayesian Inference . . . . . . . . . . . . 24
1.2.3 A Classical Example of Bayesian Inference . . . . . . . . . 27
1.3 Large Instances / Bayesian Networks . . . . . . . . . . . . . . . . 29
1.3.1 The Difficulties Inherent in Large Instances . . . . . . . . 29
1.3.2 The Markov Condition . . . . . . . . . . . . . . . . . . . . 31
1.3.3 Bayesian Networks . . . . . . . . . . . . . . . . . . . . . . 40
1.3.4 A Large Bayesian Network . . . . . . . . . . . . . . . . . 43
1.4 Creating Bayesian Networks Using Causal Edges . . . . . . . . . 43
1.4.1 Ascertaining Causal Influences Using Manipulation . . . . 44
1.4.2 Causation and the Markov Condition . . . . . . . . . . . 51

2 More DAG/Probability Relationships 65


2.1 Entailed Conditional Independencies . . . . . . . . . . . . . . . . 66
2.1.1 Examples of Entailed Conditional Independencies . . . . . 66
2.1.2 d-Separation . . . . . . . . . . . . . . . . . . . . . . . . . 70
2.1.3 Finding d-Separations . . . . . . . . . . . . . . . . . . . . 76
2.2 Markov Equivalence . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.3 Entailing Dependencies with a DAG . . . . . . . . . . . . . . . . 92
2.3.1 Faithfulness . . . . . . . . . . . . . . . . . . . . . . . . . . 95


2.3.2 Embedded Faithfulness . . . . . . . . . . . . . . . . . . . 99


2.4 Minimality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
2.5 Markov Blankets and Boundaries . . . . . . . . . . . . . . . . . . 108
2.6 More on Causal DAGs . . . . . . . . . . . . . . . . . . . . . . . . 110
2.6.1 The Causal Minimality Assumption . . . . . . . . . . . . 110
2.6.2 The Causal Faithfulness Assumption . . . . . . . . . . . . 111
2.6.3 The Causal Embedded Faithfulness Assumption . . . . . 112

II Inference 121
3 Inference: Discrete Variables 123
3.1 Examples of Inference . . . . . . . . . . . . . . . . . . . . . . . . 124
3.2 Pearl’s Message-Passing Algorithm . . . . . . . . . . . . . . . . . 126
3.2.1 Inference in Trees . . . . . . . . . . . . . . . . . . . . . . . 127
3.2.2 Inference in Singly-Connected Networks . . . . . . . . . . 142
3.2.3 Inference in Multiply-Connected Networks . . . . . . . . . 153
3.2.4 Complexity of the Algorithm . . . . . . . . . . . . . . . . 155
3.3 The Noisy OR-Gate Model . . . . . . . . . . . . . . . . . . . . . 156
3.3.1 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . 156
3.3.2 Doing Inference With the Model . . . . . . . . . . . . . . 160
3.3.3 Further Models . . . . . . . . . . . . . . . . . . . . . . . . 161
3.4 Other Algorithms that Employ the DAG . . . . . . . . . . . . . . 161
3.5 The SPI Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 162
3.5.1 The Optimal Factoring Problem . . . . . . . . . . . . . . 163
3.5.2 Application to Probabilistic Inference . . . . . . . . . . . 168
3.6 Complexity of Inference . . . . . . . . . . . . . . . . . . . . . . . 170
3.7 Relationship to Human Reasoning . . . . . . . . . . . . . . . . . 171
3.7.1 The Causal Network Model . . . . . . . . . . . . . . . . . 171
3.7.2 Studies Testing the Causal Network Model . . . . . . . . 173

4 More Inference Algorithms 181


4.1 Continuous Variable Inference . . . . . . . . . . . . . . . . . . . . 181
4.1.1 The Normal Distribution . . . . . . . . . . . . . . . . . . 182
4.1.2 An Example Concerning Continuous Variables . . . . . . 183
4.1.3 An Algorithm for Continuous Variables . . . . . . . . . . 185
4.2 Approximate Inference . . . . . . . . . . . . . . . . . . . . . . . . 205
4.2.1 A Brief Review of Sampling . . . . . . . . . . . . . . . . . 205
4.2.2 Logic Sampling . . . . . . . . . . . . . . . . . . . . . . . . 211
4.2.3 Likelihood Weighting . . . . . . . . . . . . . . . . . . . . . 217
4.3 Abductive Inference . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.3.1 Abductive Inference in Bayesian Networks . . . . . . . . . 221
4.3.2 A Best-First Search Algorithm for Abductive Inference . . 224

5 Influence Diagrams 239


5.1 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
5.1.1 Simple Examples . . . . . . . . . . . . . . . . . . . . . . . 239
5.1.2 Probabilities, Time, and Risk Attitudes . . . . . . . . . . 242
5.1.3 Solving Decision Trees . . . . . . . . . . . . . . . . . . . . 245
5.1.4 More Examples . . . . . . . . . . . . . . . . . . . . . . . . 245
5.2 Influence Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . 259
5.2.1 Representing with Influence Diagrams . . . . . . . . . . . 259
5.2.2 Solving Influence Diagrams . . . . . . . . . . . . . . . . . 266
5.3 Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . 272
5.3.1 Dynamic Bayesian Networks . . . . . . . . . . . . . . . . 272
5.3.2 Dynamic Influence Diagrams . . . . . . . . . . . . . . . . 279

III Learning 291


6 Parameter Learning: Binary Variables 293
6.1 Learning a Single Parameter . . . . . . . . . . . . . . . . . . . . . 294
6.1.1 Probability Distributions of Relative Frequencies . . . . . 294
6.1.2 Learning a Relative Frequency . . . . . . . . . . . . . . . 303
6.2 More on the Beta Density Function . . . . . . . . . . . . . . . . . 310
6.2.1 Non-integral Values of a and b . . . . . . . . . . . . . . . 311
6.2.2 Assessing the Values of a and b . . . . . . . . . . . . . . . 313
6.2.3 Why the Beta Density Function? . . . . . . . . . . . . . . 315
6.3 Computing a Probability Interval . . . . . . . . . . . . . . . . . . 319
6.4 Learning Parameters in a Bayesian Network . . . . . . . . . . . . 323
6.4.1 Urn Examples . . . . . . . . . . . . . . . . . . . . . . . . 323
6.4.2 Augmented Bayesian Networks . . . . . . . . . . . . . . . 331
6.4.3 Learning Using an Augmented Bayesian Network . . . . . 336
6.4.4 A Problem with Updating; Using an Equivalent Sample
Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
6.5 Learning with Missing Data Items . . . . . . . . . . . . . . . . . 357
6.5.1 Data Items Missing at Random . . . . . . . . . . . . . . . 358
6.5.2 Data Items Missing Not at Random . . . . . . . . . . . . 363
6.6 Variances in Computed Relative Frequencies . . . . . . . . . . . . 364
6.6.1 A Simple Variance Determination . . . . . . . . . . . . . 364
6.6.2 The Variance and Equivalent Sample Size . . . . . . . . . 366
6.6.3 Computing Variances in Larger Networks . . . . . . . . . 372
6.6.4 When Do Variances Become Large? . . . . . . . . . . . . 373

7 More Parameter Learning 381


7.1 Multinomial Variables . . . . . . . . . . . . . . . . . . . . . . . . 381
7.1.1 Learning a Single Parameter . . . . . . . . . . . . . . . . 381
7.1.2 More on the Dirichlet Density Function . . . . . . . . . . 388
7.1.3 Computing Probability Intervals and Regions . . . . . . . 389
7.1.4 Learning Parameters in a Bayesian Network . . . . . . . . 392

7.1.5 Learning with Missing Data Items . . . . . . . . . . . . . 398


7.1.6 Variances in Computed Relative Frequencies . . . . . . . 398
7.2 Continuous Variables . . . . . . . . . . . . . . . . . . . . . . . . . 398
7.2.1 Normally Distributed Variable . . . . . . . . . . . . . . . 399
7.2.2 Multivariate Normally Distributed Variables . . . . . . . 413
7.2.3 Gaussian Bayesian Networks . . . . . . . . . . . . . . . . 425

8 Bayesian Structure Learning 441


8.1 Learning Structure: Discrete Variables . . . . . . . . . . . . . . . 441
8.1.1 Schema for Learning Structure . . . . . . . . . . . . . . . 442
8.1.2 Procedure for Learning Structure . . . . . . . . . . . . . . 445
8.1.3 Learning From a Mixture of Observational and Experi-
mental Data. . . . . . . . . . . . . . . . . . . . . . . . . . 449
8.1.4 Complexity of Structure Learning . . . . . . . . . . . . . 450
8.2 Model Averaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 451
8.3 Learning Structure with Missing Data . . . . . . . . . . . . . . . 452
8.3.1 Monte Carlo Methods . . . . . . . . . . . . . . . . . . . . 453
8.3.2 Large-Sample Approximations . . . . . . . . . . . . . . . 462
8.4 Probabilistic Model Selection . . . . . . . . . . . . . . . . . . . . 468
8.4.1 Probabilistic Models . . . . . . . . . . . . . . . . . . . . . 468
8.4.2 The Model Selection Problem . . . . . . . . . . . . . . . . 472
8.4.3 Using the Bayesian Scoring Criterion for Model Selection 473
8.5 Hidden Variable DAG Models . . . . . . . . . . . . . . . . . . . . 476
8.5.1 Models Containing More Conditional Independencies than
DAG Models . . . . . . . . . . . . . . . . . . . . . . . . . 477
8.5.2 Models Containing the Same Conditional Independencies
as DAG Models . . . . . . . . . . . . . . . . . . . . . . . . 479
8.5.3 Dimension of Hidden Variable DAG Models . . . . . . . . 484
8.5.4 Number of Models and Hidden Variables . . . . . . . . . . 486
8.5.5 Efficient Model Scoring . . . . . . . . . . . . . . . . . . . 487
8.6 Learning Structure: Continuous Variables . . . . . . . . . . . . . 491
8.6.1 The Density Function of D . . . . . . . . . . . . . . . . . 491
8.6.2 The Density function of D Given a DAG pattern . . . . . 495
8.7 Learning Dynamic Bayesian Networks . . . . . . . . . . . . . . . 505

9 Approximate Bayesian Structure Learning 511


9.1 Approximate Model Selection . . . . . . . . . . . . . . . . . . . . 511
9.1.1 Algorithms that Search over DAGs . . . . . . . . . . . . . 513
9.1.2 Algorithms that Search over DAG Patterns . . . . . . . . 518
9.1.3 An Algorithm Assuming Missing Data or Hidden Variables 529
9.2 Approximate Model Averaging . . . . . . . . . . . . . . . . . . . 531
9.2.1 A Model Averaging Example . . . . . . . . . . . . . . . . 532
9.2.2 Approximate Model Averaging Using MCMC . . . . . . . 533

10 Constraint-Based Learning 541


10.1 Algorithms Assuming Faithfulness . . . . . . . . . . . . . . . . . 542
10.1.1 Simple Examples . . . . . . . . . . . . . . . . . . . . . . . 542
10.1.2 Algorithms for Determining DAG patterns . . . . . . . . 545
10.1.3 Determining if a Set Admits a Faithful DAG Representation . . 552
10.1.4 Application to Probability . . . . . . . . . . . . . . . . . . 560
10.2 Assuming Only Embedded Faithfulness . . . . . . . . . . . . . . 561
10.2.1 Inducing Chains . . . . . . . . . . . . . . . . . . . . . . . 562
10.2.2 A Basic Algorithm . . . . . . . . . . . . . . . . . . . . . . 568
10.2.3 Application to Probability . . . . . . . . . . . . . . . . . . 590
10.2.4 Application to Learning Causal Influences1 . . . . . . . . 591
10.3 Obtaining the d-separations . . . . . . . . . . . . . . . . . . . . . 599
10.3.1 Discrete Bayesian Networks . . . . . . . . . . . . . . . . . 600
10.3.2 Gaussian Bayesian Networks . . . . . . . . . . . . . . . . 603
10.4 Relationship to Human Reasoning . . . . . . . . . . . . . . . . . 604
10.4.1 Background Theory . . . . . . . . . . . . . . . . . . . . . 604
10.4.2 A Statistical Notion of Causality . . . . . . . . . . . . . . 606

11 More Structure Learning 617


11.1 Comparing the Methods . . . . . . . . . . . . . . . . . . . . . . . 617
11.1.1 A Simple Example . . . . . . . . . . . . . . . . . . . . . . 618
11.1.2 Learning College Attendance Influences . . . . . . . . . . 620
11.1.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . 623
11.2 Data Compression Scoring Criteria . . . . . . . . . . . . . . . . . 624
11.3 Parallel Learning of Bayesian Networks . . . . . . . . . . . . . . 624
11.4 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 624
11.4.1 Structure Learning . . . . . . . . . . . . . . . . . . . . . . 625
11.4.2 Inferring Causal Relationships . . . . . . . . . . . . . . . 633

IV Applications 647

12 Applications 649
12.1 Applications Based on Bayesian Networks . . . . . . . . . . . . . 649
12.2 Beyond Bayesian networks . . . . . . . . . . . . . . . . . . . . . . 655

Bibliography 657

Index 686

1 The relationships in the examples in this section are largely fictitious.


Preface

Bayesian networks are graphical structures for representing the probabilistic
relationships among a large number of variables and doing probabilistic inference
with those variables. During the 1980’s, a good deal of related research was done
on developing Bayesian networks (belief networks, causal networks, influence
diagrams), algorithms for performing inference with them, and applications that
used them. However, the work was scattered throughout research articles. My
purpose in writing the 1990 text Probabilistic Reasoning in Expert Systems was
to unify this research and establish a textbook and reference for the field which
has come to be known as ‘Bayesian networks.’ The 1990’s saw the emergence
of excellent algorithms for learning Bayesian networks from data. However,
by 2000 there still seemed to be no accessible source for ‘learning Bayesian
networks.’ Similar to my purpose a decade ago, the goal of this text is to
provide such a source.
In order to make this text a complete introduction to Bayesian networks,
I discuss methods for doing inference in Bayesian networks and influence di-
agrams. However, there is no effort to be exhaustive in this discussion. For
example, I give the details of only two algorithms for exact inference with dis-
crete variables, namely Pearl’s message passing algorithm and D’Ambrosio and
Li’s symbolic probabilistic inference algorithm. It may seem odd that I present
Pearl’s algorithm, since it is one of the oldest. I have two reasons for doing
this: 1) Pearl’s algorithm corresponds to a model of human causal reasoning,
which is discussed in this text; and 2) Pearl’s algorithm extends readily to an
algorithm for doing inference with continuous variables, which is also discussed
in this text.
The content of the text is as follows. Chapters 1 and 2 cover basics. Specifi-
cally, Chapter 1 provides an introduction to Bayesian networks; and Chapter 2
discusses further relationships between DAGs and probability distributions such
as d-separation, the faithfulness condition, and the minimality condition. Chap-
ters 3-5 concern inference. Chapter 3 covers Pearl’s message-passing algorithm,
D’Ambrosio and Li’s symbolic probabilistic inference, and the relationship of
Pearl’s algorithm to human causal reasoning. Chapter 4 shows an algorithm for
doing inference with continuous variables, an approximate inference algorithm,
and finally an algorithm for abductive inference (finding the most probable
explanation). Chapter 5 discusses influence diagrams, which are Bayesian net-
works augmented with decision nodes and a value node, and dynamic Bayesian

networks and influence diagrams. Chapters 6-11 address learning. Chapters
6 and 7 concern parameter learning. Since the notation for these learning
algorithms is somewhat arduous, I introduce the algorithms by discussing binary
variables in Chapter 6. I then generalize to multinomial variables in Chapter 7.
Furthermore, in Chapter 7 I discuss learning parameters when the variables are
continuous. Chapters 8-11 concern structure learning. Chapter 8 shows the
Bayesian method for learning structure in the cases of both discrete and
continuous variables, Chapter 9 discusses approximate Bayesian structure
learning, and Chapter 10 presents the constraint-based method for learning
structure. Chapter 11 compares the Bayesian and constraint-based methods, and
it presents several real-world examples of learning Bayesian networks. The text
ends by referencing applications of Bayesian networks in Chapter 12.
This is a text on learning Bayesian networks; it is not a text on artificial
intelligence, expert systems, or decision analysis. However, since these are fields
in which Bayesian networks find application, they emerge frequently throughout
the text. Indeed, I have used the manuscript for this text in my course on expert
systems at Northeastern Illinois University. In one semester, I have found that
I can cover the core of the following chapters: 1, 2, 3, 5, 6, 7, 8, and 9.
I would like to thank those researchers who have provided valuable correc-
tions, comments, and dialog concerning the material in this text. They in-
clude Bruce D’Ambrosio, David Maxwell Chickering, Gregory Cooper, Tom
Dean, Carl Entemann, John Erickson, Finn Jensen, Clark Glymour, Piotr
Gmytrasiewicz, David Heckerman, Xia Jiang, James Kenevan, Henry Kyburg,
Kathryn Blackmond Laskey, Don Labudde, David Madigan, Christopher Meek,
Paul-André Monney, Scott Morris, Peter Norvig, Judea Pearl, Richard Scheines,
Marco Valtorta, Alex Wolpert, and Sandy Zabell. I thank Sue Coyle for helping
me draw the cartoon containing the robots.
Part I

Basics

Chapter 1

Introduction to Bayesian
Networks

Consider the situation where one feature of an entity has a direct influence on
another feature of that entity. For example, the presence or absence of a disease
in a human being has a direct influence on whether a test for that disease turns
out positive or negative. For decades, Bayes’ theorem has been used to perform
probabilistic inference in this situation. In the current example, we would use
that theorem to compute the conditional probability of an individual having a
disease when a test for the disease came back positive. Consider next the situ-
ation where several features are related through inference chains. For example,
whether or not an individual has a history of smoking has a direct influence
both on whether or not that individual has bronchitis and on whether or not
that individual has lung cancer. In turn, the presence or absence of each of these
diseases has a direct influence on whether or not the individual experiences fa-
tigue. Also, the presence or absence of lung cancer has a direct influence on
whether or not a chest X-ray is positive. In this situation, we would want to do
probabilistic inference involving features that are not related via a direct influ-
ence. We would want to determine, for example, the conditional probabilities
both of bronchitis and of lung cancer when it is known an individual smokes, is
fatigued, and has a positive chest X-ray. Yet bronchitis has no direct influence
(indeed no influence at all) on whether a chest X-ray is positive. Therefore,
these conditional probabilities cannot be computed using a simple application
of Bayes’ theorem. There is a straightforward algorithm for computing them,
but the probability values it requires are not ordinarily accessible; furthermore,
the algorithm has exponential space and time complexity.
Bayesian networks were developed to address these difficulties. By exploiting
conditional independencies entailed by influence chains, we are able to represent
a large instance in a Bayesian network using little space, and we are often able
to perform probabilistic inference among the features in an acceptable amount
of time. In addition, the graphical nature of Bayesian networks gives us a much
better intuitive grasp of the relationships among the features.

[Figure 1.1: A Bayesian network. The DAG has nodes H, B, L, F, and C with
edges H → B, H → L, B → F, L → F, and L → C, and the conditional probabilities
P(h1) = .2
P(b1|h1) = .25     P(b1|h2) = .05
P(l1|h1) = .003    P(l1|h2) = .00005
P(f1|b1,l1) = .75  P(f1|b1,l2) = .10
P(f1|b2,l1) = .5   P(f1|b2,l2) = .05
P(c1|l1) = .6      P(c1|l2) = .02]

Figure 1.1 shows a Bayesian network representing the probabilistic relation-
ships among the features just discussed. The values of the features in that
network represent the following:

Feature Value When the Feature Takes this Value


H h1 There is a history of smoking
h2 There is no history of smoking
B b1 Bronchitis is present
b2 Bronchitis is absent
L l1 Lung cancer is present
l2 Lung cancer is absent
F f1 Fatigue is present
f2 Fatigue is absent
C c1 Chest X-ray is positive
c2 Chest X-ray is negative

This Bayesian network is discussed in Example 1.32 in Section 1.3.3 after we


provide the theory of Bayesian networks. Presently, we only use it to illustrate
the nature and use of Bayesian networks. First, in this Bayesian network (called
a causal network) the edges represent direct influences. For example, there is
an edge from H to L because a history of smoking has a direct influence on the
presence of lung cancer, and there is an edge from L to C because the presence
of lung cancer has a direct influence on the result of a chest X-ray. There is no
1.1. BASICS OF PROBABILITY THEORY 5

edge from H to C because a history of smoking has an influence on the result


of a chest X-ray only through its influence on the presence of lung cancer. One
way to construct Bayesian networks is by creating edges that represent direct
influences as done here; however, there are other ways. Second, the probabilities
in the network are the conditional probabilities of the values of each feature given
every combination of values of the feature’s parents in the network, except in the
case of roots they are prior probabilities. Third, probabilistic inference among
the features can be accomplished using the Bayesian network. For example, we
can compute the conditional probabilities both of bronchitis and of lung cancer
when it is known an individual smokes, is fatigued, and has a positive chest
X-ray. This Bayesian network is discussed again in Chapter 3 when we develop
algorithms that do this inference.
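To make the kind of inference just described concrete, the short sketch below computes P(b1 | h1, f1, c1) by brute force: it multiplies the conditional probabilities in Figure 1.1 to obtain the joint distribution and then sums over the unobserved variable. This is only an illustration of what the algorithms in Chapter 3 compute far more efficiently, and the Python names are mine, not the text's.

```python
from itertools import product

# Conditional probabilities taken from Figure 1.1 (value 1 means "present"/"positive").
P_h = {1: .2, 2: .8}
P_b1_given_h = {(1, 1): .25, (1, 2): .05}         # key (1, h) -> P(b1 | h)
P_l1_given_h = {(1, 1): .003, (1, 2): .00005}     # key (1, h) -> P(l1 | h)
P_f1_given_bl = {(1, 1, 1): .75, (1, 1, 2): .10,  # key (1, b, l) -> P(f1 | b, l)
                 (1, 2, 1): .5,  (1, 2, 2): .05}
P_c1_given_l = {(1, 1): .6, (1, 2): .02}          # key (1, l) -> P(c1 | l)

def cond(table, value, *parents):
    """P(value | parents) for a binary variable, given P(value = 1 | parents)."""
    p1 = table[(1,) + parents]
    return p1 if value == 1 else 1 - p1

def joint(h, b, l, f, c):
    # By the Markov condition, the joint probability is the product of each
    # variable's conditional probability given its parents in the DAG.
    return (P_h[h]
            * cond(P_b1_given_h, b, h)
            * cond(P_l1_given_h, l, h)
            * cond(P_f1_given_bl, f, b, l)
            * cond(P_c1_given_l, c, l))

# P(b1 | h1, f1, c1): sum the joint over the unobserved variable L and normalize.
numerator = sum(joint(1, 1, l, 1, 1) for l in (1, 2))
denominator = sum(joint(1, b, l, 1, 1) for b, l in product((1, 2), repeat=2))
print("P(b1 | h1, f1, c1) =", numerator / denominator)
```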
The focus of this text is on learning Bayesian networks from data. For
example, given we had values of the five features just discussed (smoking his-
tory, bronchitis, lung cancer, fatigue, and chest X-ray) for a large number of
individuals, the learning algorithms we develop might construct the Bayesian
network in Figure 1.1. However, to make it a complete introduction to Bayesian
networks, it does include a brief overview of methods for doing inference in
Bayesian networks and using Bayesian networks to make decisions. Chapters 1
and 2 cover properties of Bayesian networks which we need in order to discuss
both inference and learning. Chapters 3-5 concern methods for doing inference
in Bayesian networks. Methods for learning Bayesian networks from data are
discussed in Chapters 6-11. A number of successful expert systems (systems
which make the judgements of an expert) have been developed which are based
on Bayesian networks. Furthermore, Bayesian networks have been used to learn
causal influences from data. Chapter 12 references some of these real-world ap-
plications. To see the usefulness of Bayesian networks, you may wish to review
that chapter before proceeding.
This chapter introduces Bayesian networks. Section 1.1 reviews basic con-
cepts in probability. Next, Section 1.2 discusses Bayesian inference and illus-
trates the classical way of using Bayes’ theorem when there are only two fea-
tures. Section 1.3 shows the problem in representing large instances and intro-
duces Bayesian networks as a solution to this problem. Finally, we discuss how
Bayesian networks can often be constructed using causal edges.

1.1 Basics of Probability Theory

The concept of probability has a rich and diversified history that includes many
different philosophical approaches. Notable among these approaches are the
notions of probability as a ratio, as a relative frequency, and as a degree of belief.
Next we review the probability calculus and, via examples, illustrate these three
approaches and how they are related.

1.1.1 Probability Functions and Spaces


In 1933 A.N. Kolmogorov developed the set-theoretic definition of probability,
which serves as a mathematical foundation for all applications of probability.
We start by providing that definition.
Probability theory has to do with experiments that have a set of distinct
outcomes. Examples of such experiments include drawing the top card from a
deck of 52 cards with the 52 outcomes being the 52 different faces of the cards;
flipping a two-sided coin with the two outcomes being ‘heads’ and ‘tails’; picking
a person from a population and determining whether the person is a smoker
with the two outcomes being ‘smoker’ and ‘non-smoker’; picking a person from
a population and determining whether the person has lung cancer with the
two outcomes being ‘having lung cancer’ and ‘not having lung cancer’; after
identifying 5 levels of serum calcium, picking a person from a population and
determining the individual’s serum calcium level with the 5 outcomes being
each of the 5 levels; picking a person from a population and determining the
individual’s serum calcium level with the infinite number of outcomes being
the continuum of possible calcium levels. The last two experiments illustrate
two points. First, the experiment is not well-defined until we identify a set of
outcomes. The same act (picking a person and measuring that person’s serum
calcium level) can be associated with many different experiments, depending on
what we consider a distinct outcome. Second, the set of outcomes can be infinite.
Once an experiment is well-defined, the collection of all outcomes is called the
sample space. Mathematically, a sample space is a set and the outcomes are
the elements of the set. To keep this review simple, we restrict ourselves to finite
sample spaces in what follows (You should consult a mathematical probability
text such as [Ash, 1970] for a discussion of infinite sample spaces.). In the case
of a finite sample space, every subset of the sample space is called an event. A
subset containing exactly one element is called an elementary event. Once a
sample space is identified, a probability function is defined as follows:

Definition 1.1 Suppose we have a sample space Ω containing n distinct elements. That is,
Ω = {e1 , e2 , . . . en }.
A function that assigns a real number P (E) to each event E ⊆ Ω is called
a probability function on the set of subsets of Ω if it satisfies the following
conditions:

1. 0 ≤ P ({ei }) ≤ 1 for 1 ≤ i ≤ n.
2. P ({e1 }) + P ({e2 }) + . . . + P ({en }) = 1.
3. For each event E = {ei1 , ei2 , . . . eik } that is not an elementary event,

P (E) = P ({ei1 }) + P ({ei2 }) + . . . + P ({eik }).

The pair (Ω, P ) is called a probability space.


We often just say P is a probability function on Ω rather than saying on the
set of subsets of Ω.
Intuition for probability functions comes from considering games of chance
as the following example illustrates.

Example 1.1 Let the experiment be drawing the top card from a deck of 52
cards. Then Ω contains the faces of the 52 cards, and using the principle of
indifference, we assign P ({e}) = 1/52 for each e ∈ Ω. Therefore, if we let kh
and ks stand for the king of hearts and king of spades respectively, P ({kh}) =
1/52, P ({ks}) = 1/52, and P ({kh, ks}) = P ({kh}) + P ({ks}) = 1/26.

The principle of indifference (a term popularized by J.M. Keynes in 1921)
says elementary events are to be considered equiprobable if we have no reason
to expect or prefer one over the other. According to this principle, when there
are n elementary events the probability of each of them is the ratio 1/n. This
is the way we often assign probabilities in games of chance, and a probability
so assigned is called a ratio.
The following example shows a probability that cannot be computed using
the principle of indifference.

Example 1.2 Suppose we toss a thumbtack and consider as outcomes the two
ways it could land. It could land on its head, which we will call ‘heads’, or
it could land with the edge of the head and the end of the point touching the
ground, which we will call ‘tails’. Due to the lack of symmetry in a thumbtack,
we would not assign a probability of 1/2 to each of these events. So how can
we compute the probability? This experiment can be repeated many times. In
1919 Richard von Mises developed the relative frequency approach to probability
which says that, if an experiment can be repeated many times, the probability of
any one of the outcomes is the limit, as the number of trials approaches infinity,
of the ratio of the number of occurrences of that outcome to the total number of
trials. For example, if m is the number of trials,
P ({heads}) = lim_{m→∞} #heads/m.
So, if we tossed the thumbtack 10, 000 times and it landed heads 3373 times, we
would estimate the probability of heads to be about .3373.
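As a quick illustration of the relative frequency approach (not part of the text), the following sketch simulates tosses of a thumbtack whose true probability of heads is assumed to be .3373 and prints the running relative frequency, which settles near that value as the number of trials grows.

```python
import random

random.seed(0)            # reproducible run
p_heads = .3373           # assumed true probability of heads, as in the example
heads = 0

for m in range(1, 100001):
    if random.random() < p_heads:
        heads += 1
    if m in (100, 1000, 10000, 100000):
        # running relative frequency  #heads / m
        print(f"m = {m:6d}   relative frequency = {heads / m:.4f}")
```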

Probabilities obtained using the approach in the previous example are called
relative frequencies. According to this approach, the probability obtained is
not a property of any one of the trials, but rather it is a property of the entire
sequence of trials. How are these probabilities related to ratios? Intuitively,
we would expect if, for example, we repeatedly shuffled a deck of cards and
drew the top card, the ace of spades would come up about one out of every 52
times. In 1946 J. E. Kerrich conducted many such experiments using games of
chance in which the principle of indifference seemed to apply (e.g. drawing a
card from a deck). His results indicated that the relative frequency does appear
to approach a limit and that limit is the ratio.
The next example illustrates a probability that cannot be obtained either
with ratios or with relative frequencies.
Example 1.3 If you were going to bet on an upcoming basketball game between
the Chicago Bulls and the Detroit Pistons, you would want to ascertain how
probable it was that the Bulls would win. This probability is certainly not a
ratio, and it is not a relative frequency because the game cannot be repeated
many times under the exact same conditions (Actually, with your knowledge
about the conditions the same.). Rather the probability only represents your
belief concerning the Bulls' chances of winning. Such a probability is called a
degree of belief or subjective probability. There are a number of ways
for ascertaining such probabilities. One of the most popular methods is the
following, which was suggested by D. V. Lindley in 1985. This method says an
individual should liken the uncertain outcome to a game of chance by considering
an urn containing white and black balls. The individual should determine for
what fraction of white balls the individual would be indifferent between receiving
a small prize if the uncertain outcome happened (or turned out to be true) and
receiving the same small prize if a white ball was drawn from the urn. That
fraction is the individual’s probability of the outcome. Such a probability can be
constructed using binary cuts. If, for example, you were indifferent when the
fraction was .75, for you P ({bullswin}) = .75. If I were indifferent when the
fraction was .6, for me P ({bullswin}) = .6. Neither of us is right or wrong.
Subjective probabilities are unlike ratios and relative frequencies in that they do
not have objective values upon which we all must agree. Indeed, that is why they
are called subjective.
Neapolitan [1996] discusses the construction of subjective probabilities fur-
ther. In this text, by probability we ordinarily mean a degree of belief. When
we are able to compute ratios or relative frequencies, the probabilities obtained
agree with most individuals’ beliefs. For example, most individuals would assign
a subjective probability of 1/13 to the top card being an ace because they would
be indifferent between receiving a small prize if it were the ace and receiving
that same small prize if a white ball were drawn from an urn containing one
white ball out of 13 total balls.
The following example shows a subjective probability more relevant to ap-
plications of Bayesian networks.
Example 1.4 After examining a patient and seeing the result of the patient’s
chest X-ray, Dr. Gloviak decides the probability that the patient has lung cancer
is .9. This probability is Dr. Gloviak’s subjective probability of that outcome.
Although a physician may use estimates of relative frequencies (such as the
fraction of times individuals with lung cancer have positive chest X-rays) and
experience diagnosing many similar patients to arrive at the probability, it is
still assessed subjectively. If asked, Dr. Gloviak may state that her subjective
probability is her estimate of the relative frequency with which patients, who
have these exact same symptoms, have lung cancer. However, there is no reason
to believe her subjective judgement will converge, as she continues to diagnose
patients with these exact same symptoms, to the actual relative frequency with
which they have lung cancer.
It is straightforward to prove the following theorem concerning probability
spaces.
Theorem 1.1 Let (Ω, P ) be a probability space. Then
1. P (Ω) = 1.
2. 0 ≤ P (E) ≤ 1 for every E ⊆ Ω.
3. For E and F ⊆ Ω such that E ∩ F = ∅,
P (E ∪ F) = P (E) + P (F).

Proof. The proof is left as an exercise.


The conditions in this theorem were labeled the axioms of probability
theory by A.N. Kolmogorov in 1933. When Condition (3) is replaced by countable
additivity, these conditions are used to define a probability
space in mathematical probability texts.

Example 1.5 Suppose we draw the top card from a deck of cards. Denote by
Queen the set containing the 4 queens and by King the set containing the 4 kings.
Then

P (Queen ∪ King) = P (Queen) + P (King) = 1/13 + 1/13 = 2/13

because Queen ∩ King = ∅. Next denote by Spade the set containing the 13
spades. The sets Queen and Spade are not disjoint; so their probabilities are not
additive. However, it is not hard to prove that, in general,
P (E ∪ F) = P (E) + P (F) − P (E ∩ F).
So
P (Queen ∪ Spade) = P (Queen) + P (Spade) − P (Queen ∩ Spade)
= 1/13 + 1/4 − 1/52 = 4/13.
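The same result can be checked by enumerating a 52-card sample space and counting; the card encoding below is illustrative only.

```python
from fractions import Fraction

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'hearts', 'diamonds', 'clubs']
deck = [(r, s) for r in ranks for s in suits]     # the 52-element sample space

queen = {c for c in deck if c[0] == 'Q'}
spade = {c for c in deck if c[1] == 'spades'}

def P(event):
    # Each elementary event has probability 1/52 by the principle of indifference.
    return Fraction(len(event), len(deck))

# Both sides of P(Queen ∪ Spade) = P(Queen) + P(Spade) - P(Queen ∩ Spade) give 4/13.
print(P(queen | spade), P(queen) + P(spade) - P(queen & spade))
```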

1.1.2 Conditional Probability and Independence


We have yet to discuss one of the most important concepts in probability theory,
namely conditional probability. We do that next.
Definition 1.2 Let E and F be events such that P (F) ≠ 0. Then the conditional
probability of E given F, denoted P (E|F), is given by

P (E|F) = P (E ∩ F)/P (F).

The initial intuition for conditional probability comes from considering prob-
abilities that are ratios. In the case of ratios, P (E|F), as defined above, is the
fraction of items in F that are also in E. We show this as follows. Let n be the
number of items in the sample space, nF be the number of items in F, and nEF
be the number of items in E ∩ F. Then
P (E ∩ F)/P (F) = (nEF /n)/(nF /n) = nEF /nF ,
which is the fraction of items in F that are also in E. As far as meaning, P (E|F)
means the probability of E occurring given that we know F has occurred.
Example 1.6 Again consider drawing the top card from a deck of cards, let
Queen be the set of the 4 queens, RoyalCard be the set of the 12 royal cards, and
Spade be the set of the 13 spades. Then
P (Queen) = 1/13

P (Queen|RoyalCard) = P (Queen ∩ RoyalCard)/P (RoyalCard) = (1/13)/(3/13) = 1/3

P (Queen|Spade) = P (Queen ∩ Spade)/P (Spade) = (1/52)/(1/4) = 1/13.
Notice in the previous example that P (Queen|Spade) = P (Queen). This
means that finding out the card is a spade does not make it more or less probable
that it is a queen. That is, the knowledge of whether it is a spade is irrelevant
to whether it is a queen. We say that the two events are independent in this
case, which is formalized in the following definition.
Definition 1.3 Two events E and F are independent if one of the following
holds:
1. P (E|F) = P (E) and P (E) ≠ 0, P (F) ≠ 0.
2. P (E) = 0 or P (F) = 0.
Notice that the definition states that the two events are independent even
though it is based on the conditional probability of E given F. The reason is
that independence is symmetric. That is, if P (E) ≠ 0 and P (F) ≠ 0, then
P (E|F) = P (E) if and only if P (F|E) = P (F). It is straightforward to prove that
E and F are independent if and only if P (E ∩ F) = P (E)P (F).
The following example illustrates an extension of the notion of independence.
Example 1.7 Let E = {kh, ks, qh}, F = {kh, kc, qh}, G = {kh, ks, kc, kd},
where kh means the king of hearts, ks means the king of spades, etc. Then
P (E) = 3/52
P (E|F) = 2/3
P (E|G) = 2/4 = 1/2
P (E|F ∩ G) = 1/2.
So E and F are not independent, but they are independent once we condition on
G.

In the previous example, E and F are said to be conditionally independent
given G. Conditional independence is very important in Bayesian networks and
will be discussed much more in the sections that follow. Presently, we have the
definition that follows and another example.

Definition 1.4 Two events E and F are conditionally independent given G
if P (G) ≠ 0 and one of the following holds:

1. P (E|F ∩ G) = P (E|G) and P (E|G) ≠ 0, P (F|G) ≠ 0.

2. P (E|G) = 0 or P (F|G) = 0.

Another example of conditional independence follows.

Example 1.8 Let Ω be the set of all objects in Figure 1.2. Suppose we assign
a probability of 1/13 to each object, and let Black be the set of all black objects,
White be the set of all white objects, Square be the set of all square objects, and
One be the set of all objects containing a ‘1’. We then have
P (One) = 5/13
P (One|Square) = 3/8

P (One|Black) = 3/9 = 1/3
P (One|Square ∩ Black) = 2/6 = 1/3

P (One|White) = 2/4 = 1/2
P (One|Square ∩ White) = 1/2.
So One and Square are not independent, but they are conditionally independent
given Black and given White.
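These conditional probabilities can be verified by listing the objects in Figure 1.2 and counting. The composition used below (how many objects of each color, shape, and number there are) is inferred from the probabilities stated in the example, so treat it as an assumption of this sketch.

```python
from fractions import Fraction

# (color, shape, number, how many such objects) -- composition inferred from Example 1.8.
counts = [('black', 'square', 1, 2), ('black', 'square', 2, 4),
          ('black', 'round',  1, 1), ('black', 'round',  2, 2),
          ('white', 'square', 1, 1), ('white', 'square', 2, 1),
          ('white', 'round',  1, 1), ('white', 'round',  2, 1)]
objects = [(c, s, n) for c, s, n, k in counts for _ in range(k)]    # 13 objects in all

def P(pred, given=lambda o: True):
    """Conditional probability by counting; each object has probability 1/13."""
    conditioning_set = [o for o in objects if given(o)]
    return Fraction(sum(pred(o) for o in conditioning_set), len(conditioning_set))

one    = lambda o: o[2] == 1
square = lambda o: o[1] == 'square'
black  = lambda o: o[0] == 'black'

print(P(one), P(one, square))                       # 5/13 and 3/8: not independent
print(P(one, black),                                # 1/3
      P(one, lambda o: black(o) and square(o)))     # 1/3: independent given Black
```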

[Figure 1.2: Thirteen objects, nine black and four white. Three of the nine black
objects and two of the four white objects contain a ‘1’; the rest contain a ‘2’.
Containing a ‘1’ and being a square are not independent, but they are
conditionally independent given the object is black and given it is white.]

Next we discuss a very useful rule involving conditional probabilities. Suppose
we have n events E1 , E2 , . . . En such that Ei ∩ Ej = ∅ for i ≠ j and
E1 ∪ E2 ∪ . . . ∪ En = Ω. Such events are called mutually exclusive and
exhaustive. Then the law of total probability says for any other event F,

P (F) = Σ_{i=1}^{n} P (F ∩ Ei ).     (1.1)

If P (Ei ) ≠ 0, then P (F ∩ Ei ) = P (F|Ei )P (Ei ). Therefore, if P (Ei ) ≠ 0 for all i,
the law is often applied in the following form:

P (F) = Σ_{i=1}^{n} P (F|Ei )P (Ei ).     (1.2)

It is straightforward to derive both the axioms of probability theory and
the rule for conditional probability when probabilities are ratios. However,
they can also be derived in the relative frequency and subjectivistic frameworks
(See [Neapolitan, 1990].). These derivations make the use of probability theory
compelling for handling uncertainty.

1.1.3 Bayes’ Theorem


For decades conditional probabilities of events of interest have been computed
from known probabilities using Bayes’ theorem. We develop that theorem next.

Theorem 1.2 (Bayes) Given two events E and F such that P (E) ≠ 0 and
P (F) ≠ 0, we have

P (E|F) = P (F|E)P (E)/P (F).     (1.3)

Furthermore, given n mutually exclusive and exhaustive events E1 , E2 , . . . En
such that P (Ei ) ≠ 0 for all i, we have for 1 ≤ i ≤ n,

P (Ei |F) = P (F|Ei )P (Ei ) / [P (F|E1 )P (E1 ) + P (F|E2 )P (E2 ) + · · · + P (F|En )P (En )].     (1.4)

Proof. To obtain Equality 1.3, we first use the definition of conditional proba-
bility as follows:
P (E|F) = P (E ∩ F)/P (F)   and   P (F|E) = P (F ∩ E)/P (E).
Next we multiply each of these equalities by the denominator on its right side to
show that
P (E|F)P (F) = P (F|E)P (E)
because they both equal P (E ∩ F). Finally, we divide this last equality by P (F)
to obtain our result.
To obtain Equality 1.4, we place the expression for P (F), obtained using the rule
of total probability (Equality 1.2), in the denominator of Equality 1.3.
Both of the formulas in the preceding theorem are called Bayes’ theorem
because they were originally developed by Thomas Bayes (published in 1763).
The first enables us to compute P (E|F) if we know P (F|E), P (E), and P (F), while
the second enables us to compute P (Ei |F) if we know P (F|Ej ) and P (Ej ) for
1 ≤ j ≤ n. Computing a conditional probability using either of these formulas
is called Bayesian inference. An example of Bayesian inference follows:
Example 1.9 Let Ω be the set of all objects in Figure 1.2, and assign each
object a probability of 1/13. Let One be the set of all objects containing a 1, Two
be the set of all objects containing a 2, and Black be the set of all black objects.
Then according to Bayes’ Theorem,
P (One|Black) = P (Black|One)P (One) / [P (Black|One)P (One) + P (Black|Two)P (Two)]
= (3/5)(5/13) / [(3/5)(5/13) + (6/8)(8/13)] = 1/3,

which is the same value we get by computing P (One|Black) directly.


The previous example is not a very exciting application of Bayes’ Theorem
as we can just as easily compute P (One|Black) directly. Section 1.2 discusses
useful applications of Bayes’ Theorem.
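For completeness, here is the arithmetic of Example 1.9 carried out with exact fractions; the numbers are exactly those appearing in the example, and the variable names are mine.

```python
from fractions import Fraction as F

# Bayes' theorem (Equality 1.4) with the numbers from Example 1.9.
p_one, p_two = F(5, 13), F(8, 13)                      # P(One), P(Two)
p_black_given_one, p_black_given_two = F(3, 5), F(6, 8)

posterior = (p_black_given_one * p_one) / (
    p_black_given_one * p_one + p_black_given_two * p_two)
print(posterior)    # 1/3, the same value obtained by computing P(One|Black) directly
```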

1.1.4 Random Variables and Joint Probability Distributions
We have one final concept to discuss in this overview, namely that of a random
variable. The definition shown here is based on the set-theoretic definition of
probability given in Section 1.1.1. In Section 1.2.2 we provide an alternative
definition which is more pertinent to the way random variables are used in
practice.
Definition 1.5 Given a probability space (Ω, P ), a random variable X is a
function on Ω.

That is, a random variable assigns a unique value to each element (outcome)
in the sample space. The set of values random variable X can assume is called
the space of X. A random variable is said to be discrete if its space is finite
or countable. In general, we develop our theory assuming the random variables
are discrete. Examples follow.
Example 1.10 Let Ω contain all outcomes of a throw of a pair of six-sided
dice, and let P assign 1/36 to each outcome. Then Ω is the following set of
ordered pairs:
Ω = {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 1), (2, 2), . . . (6, 5), (6, 6)}.
Let the random variable X assign the sum of each ordered pair to that pair, and
let the random variable Y assign ‘odd’ to each pair of odd numbers and ‘even’
to a pair if at least one number in that pair is an even number. The following
table shows some of the values of X and Y :
e X(e) Y (e)
(1, 1) 2 odd
(1, 2) 3 even
··· ··· ···
(2, 1) 3 even
··· ··· ···
(6, 6) 12 even
The space of X is {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, and that of Y is {odd, even}.
For a random variable X, we use X = x to denote the set of all elements
e ∈ Ω that X maps to the value of x. That is,
X =x represents the event {e such that X(e) = x}.
Note the difference between X and x. Small x denotes any element in the space
of X, while X is a function.
Example 1.11 Let Ω , P , and X be as in Example 1.10. Then
X=3 represents the event {(1, 2), (2, 1)} and
P (X = 3) = 1/18.
It is not hard to see that a random variable induces a probability function
on its space. That is, if we define PX ({x}) ≡ P (X = x), then PX is such a
probability function.
Example 1.12 Let Ω contain all outcomes of a throw of a single die, let P
assign 1/6 to each outcome, and let Z assign ‘even’ to each even number and
‘odd’ to each odd number. Then
PZ ({even}) = P (Z = even) = P ({2, 4, 6}) = 1/2
PZ ({odd}) = P (Z = odd) = P ({1, 3, 5}) = 1/2.
We rarely refer to PX ({x}). Rather we only reference the original probability
function P , and we call P (X = x) the probability distribution of the random
variable X. For brevity, we often just say ‘distribution’ instead of ‘probability
distribution’. Furthermore, we often use x alone to represent the event X = x,
and so we write P (x) instead of P (X = x) . We refer to P (x) as ‘the probability
of x’.
Let Ω, P , and X be as in Example 1.10. Then if x = 3,
P (x) = P (X = x) = 1/18.
Given two random variables X and Y , defined on the same sample space Ω,
we use X = x, Y = y to denote the set of all elements e ∈ Ω that are mapped
both by X to x and by Y to y. That is,
X = x, Y = y represents the event
{e such that X(e) = x} ∩ {e such that Y (e) = y}.
Example 1.13 Let Ω, P , X, and Y be as in Example 1.10. Then
X = 4, Y = odd represents the event {(1, 3), (3, 1)}, and
P (X = 4, Y = odd) = 1/18.
Clearly, two random variables induce a probability function on the Cartesian
product of their spaces. As is the case for a single random variable, we rarely
refer to this probability function. Rather we reference the original probability
function. That is, we refer to P (X = x, Y = y), and we call this the joint
probability distribution of X and Y . If A = {X, Y }, we also call this the
joint probability distribution of A. Furthermore, we often just say ‘joint
distribution’ or ‘probability distribution’.
For brevity, we often use x, y to represent the event X = x, Y = y, and
so we write P (x, y) instead of P (X = x, Y = y). This concept extends in a
straightforward way to three or more random variables. For example, P (X =
x, Y = y, Z = z) is the joint probability distribution function of the variables
X, Y , and Z, and we often write P (x, y, z).
Example 1.14 Let Ω, P , X, and Y be as in Example 1.10. Then if x = 4 and
y = odd,
P (x, y) = P (X = x, Y = y) = 1/18.
If, for example, we let A = {X, Y } and a = {x, y}, we use
A=a to represent X = x, Y = y,
and we often write P (a) instead of P (A = a). The same notation extends to
the representation of three or more random variables. For consistency, we set
P (∅ = ∅) = 1, where ∅ is the empty set of random variables. Note that if ∅
is the empty set of events, P (∅) = 0.

Example 1.15 Let Ω, P , X, and Y be as in Example 1.10. If A = {X, Y },
a = {x, y}, x = 4, and y = odd,

P (A = a) = P (X = x, Y = y) = 1/18.

This notation entails that if we have, for example, two sets of random vari-
ables A = {X, Y } and B = {Z, W }, then

A = a, B = b represents X = x, Y = y, Z = z, W = w.

Given a joint probability distribution, the law of total probability (Equality


1.1) implies the probability distribution of any one of the random variables
can be obtained by summing over all values of the other variables. It is left
as an exercise to show this. For example, suppose we have a joint probability
distribution P (X = x, Y = y). Then
P (X = x) = Σ_y P (X = x, Y = y),

where Σ_y means the sum as y goes through all values of Y . The probability
distribution P (X = x) is called the marginal probability distribution of X
because it is obtained using a process similar to adding across a row or column in
a table of numbers. This concept also extends in a straightforward way to three
or more random variables. For example, if we have a joint distribution P (X =
x, Y = y, Z = z) of X, Y , and Z, the marginal distribution P (X = x, Y = y) of
X and Y is obtained by summing over all values of Z. If A = {X, Y }, we also
call this the marginal probability distribution of A.

Example 1.16 Let Ω, P , X, and Y be as in Example 1.10. Then


P (X = 4) = Σ_y P (X = 4, Y = y)
= P (X = 4, Y = odd) + P (X = 4, Y = even) = 1/18 + 1/36 = 1/12.
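The values in Examples 1.11, 1.13, and 1.16 can be reproduced by treating X and Y literally as functions on the 36-outcome sample space, as in the sketch below (the helper P is mine, not the text's).

```python
from fractions import Fraction
from itertools import product

omega = list(product(range(1, 7), repeat=2))     # 36 equally likely ordered pairs
X = lambda e: e[0] + e[1]                        # the sum of the pair
Y = lambda e: 'odd' if e[0] % 2 == 1 and e[1] % 2 == 1 else 'even'

def P(pred):
    return Fraction(sum(pred(e) for e in omega), len(omega))

print(P(lambda e: X(e) == 3))                              # 1/18
print(P(lambda e: X(e) == 4 and Y(e) == 'odd'))            # 1/18
# The marginal P(X = 4) is obtained by summing the joint over the values of Y:
print(sum(P(lambda e, y=y: X(e) == 4 and Y(e) == y)
          for y in ('odd', 'even')))                       # 1/12
```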

The following example reviews the concepts covered so far concerning ran-
dom variables:

Example 1.17 Let Ω be a set of 12 individuals, and let P assign 1/12 to each
individual. Suppose the sexes, heights, and wages of the individuals are as fol-
lows:

Case Sex Height (inches) Wage ($)


1 female 64 30, 000
2 female 64 30, 000
3 female 64 40, 000
4 female 64 40, 000
5 female 68 30, 000
6 female 68 40, 000
7 male 64 40, 000
8 male 64 50, 000
9 male 68 40, 000
10 male 68 50, 000
11 male 70 40, 000
12 male 70 50, 000
Let the random variables S, H and W respectively assign the sex, height and
wage of an individual to that individual. Then the distributions of the three
variables are as follows (Recall that, for example, P (s) represents P (S = s).):

s P (s) h P (h) w P (w)


female 1/2 64 1/2 30, 000 1/4
male 1/2 68 1/3 40, 000 1/2
70 1/6 50, 000 1/4

The joint distribution of S and H is as follows:

s h P (s, h)
female 64 1/3
female 68 1/6
female 70 0
male 64 1/6
male 68 1/6
male 70 1/6

The following table also shows the joint distribution of S and H and illustrates
that the individual distributions can be obtained by summing the joint distribu-
tion over all values of the other variable:

h 64 68 70 Distribution of S
s
female 1/3 1/6 0 1/2
male 1/6 1/6 1/6 1/2

Distribution of H 1/2 1/3 1/6

The table that follows shows the first few values in the joint distribution of S,
H, and W . There are 18 values in all, of which many are 0.

s h w P (s, h, w)
female 64 30, 000 1/6
female 64 40, 000 1/6
female 64 50, 000 0
female 68 30, 000 1/12
··· ··· ··· ···

We have the following definition:

Definition 1.6 Suppose we have a probability space (Ω, P ), and two sets A and
B containing random variables defined on Ω. Then the sets A and B are said to
be independent if, for all values of the variables in the sets a and b, the events
A = a and B = b are independent. That is, either P (a) = 0 or P (b) = 0 or

P (a|b) = P (a).

When this is the case, we write

IP (A, B),

where IP stands for independent in P .

Example 1.18 Let Ω be the set of all cards in an ordinary deck, and let P
assign 1/52 to each card. Define random variables as follows:

Variable Value Outcomes Mapped to this Value


R r1 All royal cards
r2 All nonroyal cards
T t1 All tens and jacks
t2 All cards that are neither tens nor jacks
S s1 All spades
s2 All nonspades

Then we maintain the sets {R, T } and {S} are independent. That is,

IP ({R, T }, {S}).

To show this, we need show for all values of r, t, and s that

P (r, t|s) = P (r, t).

(Note that we do not show brackets to denote sets in our probabilistic expres-
sion because in such an expression a set represents the members of the set. See
the discussion following Example 1.14.) The following table shows this is the
case:

s r t P (r, t|s) P (r, t)


s1 r1 t1 1/13 4/52 = 1/13
s1 r1 t2 2/13 8/52 = 2/13
s1 r2 t1 1/13 4/52 = 1/13
s1 r2 t2 9/13 36/52 = 9/13
s2 r1 t1 3/39 = 1/13 4/52 = 1/13
s2 r1 t2 6/39 = 2/13 8/52 = 2/13
s2 r2 t1 3/39 = 1/13 4/52 = 1/13
s2 r2 t2 27/39 = 9/13 36/52 = 9/13
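A short loop over all value combinations confirms the table; the deck encoding and helper function below are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction
from itertools import product

ranks = ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']
suits = ['spades', 'hearts', 'diamonds', 'clubs']
deck = [(rk, st) for rk in ranks for st in suits]

R = lambda c: 'r1' if c[0] in ('J', 'Q', 'K') else 'r2'    # royal card or not
T = lambda c: 't1' if c[0] in ('10', 'J') else 't2'        # ten/jack or not
S = lambda c: 's1' if c[1] == 'spades' else 's2'           # spade or not

def P(pred, given=lambda c: True):
    cond = [c for c in deck if given(c)]
    return Fraction(sum(pred(c) for c in cond), len(cond))

# I_P({R, T}, {S}) requires P(r, t | s) = P(r, t) for every r, t, and s.
for r, t, s in product(('r1', 'r2'), ('t1', 't2'), ('s1', 's2')):
    assert P(lambda c: R(c) == r and T(c) == t,
             lambda c: S(c) == s) == P(lambda c: R(c) == r and T(c) == t)
print("I_P({R, T}, {S}) holds")
```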

Definition 1.7 Suppose we have a probability space (Ω, P ), and three sets A,
B, and C containing random variables defined on Ω. Then the sets A and B are
said to be conditionally independent given the set C if, for all values of
the variables in the sets a, b, and c, whenever P (c) ≠ 0, the events A = a and
B = b are conditionally independent given the event C = c. That is, either
P (a|c) = 0 or P (b|c) = 0 or

P (a|b, c) = P (a|c).

When this is the case, we write

IP (A, B|C).

Example 1.19 Let Ω be the set of all objects in Figure 1.2, and let P assign
1/13 to each object. Define random variables S (for shape), V (for value), and
C (for color) as follows:

Variable Value Outcomes Mapped to this Value


V v1 All objects containing a ‘1’
v2 All objects containing a ‘2’
S s1 All square objects
s2 All round objects
C c1 All black objects
c2 All white objects

Then we maintain that {V } and {S} are conditionally independent given {C}.
That is,
IP ({V }, {S}|{C}).
To show this, we need show for all values of v, s, and c that

P (v|s, c) = P (v|c).

The results in Example 1.8 show P (v1|s1, c1) = P (v1|c1) and P (v1|s1, c2) =
P (v1|c2). The table that follows shows the equality holds for the other values of
the variables too:

c s v P (v|s, c) P (v|c)
c1 s1 v1 2/6 = 1/3 3/9 = 1/3
c1 s1 v2 4/6 = 2/3 6/9 = 2/3
c1 s2 v1 1/3 3/9 = 1/3
c1 s2 v2 2/3 6/9 = 2/3
c2 s1 v1 1/2 2/4 = 1/2
c2 s1 v2 1/2 2/4 = 1/2
c2 s2 v1 1/2 2/4 = 1/2
c2 s2 v2 1/2 2/4 = 1/2

For the sake of brevity, we sometimes only say ‘independent’ rather than
‘conditionally independent’. Furthermore, when a set contains only one item,
we often drop the set notation and terminology. For example, in the preceding
example, we might say V and S are independent given C and write IP (V, S|C).
Finally, we have the chain rule for random variables, which says that given
n random variables X1 , X2 , . . . Xn , defined on the same sample space Ω,

P (x1 , x2 , . . .xn ) = P (xn |xn−1 , xn−2 , . . .x1 ) · · · P (x2 |x1 )P (x1 )

whenever P (x1 , x2 , . . .xn ) ≠ 0. It is straightforward to prove this rule using
the rule for conditional probability.
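
For instance, for n = 3 the chain rule is obtained by applying the rule for conditional probability twice (a brief sketch of the argument, not in the original):

\begin{align*}
P(x_1, x_2, x_3) &= P(x_3 \mid x_2, x_1)\, P(x_2, x_1) \\
                 &= P(x_3 \mid x_2, x_1)\, P(x_2 \mid x_1)\, P(x_1),
\end{align*}

and repeating the same step n − 1 times yields the general statement.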

1.2 Bayesian Inference


We use Bayes’ Theorem when we are not able to determine the conditional
probability of interest directly, but we are able to determine the probabilities
on the right in Equality 1.3. You may wonder why we wouldn’t be able to
compute the conditional probability of interest directly from the sample space.
The reason is that in these applications the probability space is not usually
developed in the order outlined in Section 1.1. That is, we do not identify a
sample space, determine probabilities of elementary events, determine random
variables, and then compute values in joint probability distributions. Instead, we
identify random variables directly, and we determine probabilistic relationships
among the random variables. The conditional probabilities of interest are often
not the ones we are able to judge directly. We discuss next the meaning of
random variables and probabilities in Bayesian applications and how they are
identified directly. After that, we show how a joint probability distribution can
be determined without first specifying a sample space. Finally, we show a useful
application of Bayes’ Theorem.

1.2.1 Random Variables and Probabilities in Bayesian Applications
Although the definition of a random variable (Definition 1.5) given in Section
1.1.4 is mathematically elegant and in theory pertains to all applications of
probability, it is not readily apparent how it applies to applications involving

Bayesian inference. In this subsection and the next we develop an alternative


definition that does.
When doing Bayesian inference, there is some entity which has features,
the states of which we wish to determine, but which we cannot determine for
certain. So we settle for determining how likely it is that a particular feature is
in a particular state. The entity might be a single system or a set of systems.
An example of a single system is the introduction of an economically beneficial
chemical which might be carcinogenic. We would want to determine the relative
risk of the chemical versus its benefits. An example of a set of entities is a set
of patients with similar diseases and symptoms. In this case, we would want to
diagnose diseases based on symptoms.
In these applications, a random variable represents some feature of the entity
being modeled, and we are uncertain as to the values of this feature for the
particular entity. So we develop probabilistic relationships among the variables.
When there is a set of entities, we assume the entities in the set all have the same
probabilistic relationships concerning the variables used in the model. When
this is not the case, our Bayesian analysis is not applicable. In the case of the
chemical introduction, features may include the amount of human exposure and
the carcinogenic potential. If these are our features of interest, we identify the
random variables HumanExposure and CarcinogenicPotential (For simplicity,
our illustrations include only a few variables. An actual application ordinarily
includes many more than this.). In the case of a set of patients, features of
interest might include whether or not a disease such as lung cancer is present,
whether or not manifestations of diseases such as a chest X-ray are present,
and whether or not causes of diseases such as smoking are present. Given these
features, we would identify the random variables ChestXray, LungCancer,
and SmokingHistory. After identifying the random variables, we distinguish a
set of mutually exclusive and exhaustive values for each of them. The possible
values of a random variable are the different states that the feature can take.
For example, the state of LungCancer could be present or absent, the state of
ChestXray could be positive or negative, and the state of SmokingHistory
could be yes or no. For simplicity, we have only distinguished two possible
values for each of these random variables. However, in general they could have
any number of possible values or they could even be continuous. For example,
we might distinguish 5 different levels of smoking history (one pack or more
for at least 10 years, two packs or more for at least 10 years, three packs or
more for at least ten years, etc.). The specification of the random variables and
their values not only must be precise enough to satisfy the requirements of the
particular situation being modeled, but it also must be sufficiently precise to
pass the clarity test, which was developed by Howard in 1988. That test
is as follows: Imagine a clairvoyant who knows precisely the current state of
the world (or future state if the model concerns events in the future). Would
the clairvoyant be able to determine unequivocally the value of the random
variable? For example, in the case of the chemical introduction, if we give
HumanExposure the values low and high, the clarity test is not passed because
we do not know what constitutes high or low. However, if we define high as

when the average (over all individuals) of the individual daily average skin
contact exceeds 6 grams of material, the clarity test is passed because the
clairvoyant can answer precisely whether the contact exceeds that. In the case
of a medical application, if we give SmokingHistory only the values yes and
no, the clarity test is not passed because we do not know whether yes means
smoking cigarettes, cigars, or something else, and we have not specified how
long smoking must have occurred for the value to be yes. On the other hand, if
we say yes means the patient has smoked one or more packs of cigarettes every
day during the past 10 years, the clarity test is passed.
After distinguishing the possible values of the random variables (i.e. their
spaces), we judge the probabilities of the random variables having their values.
However, in general we do not always determine prior probabilities; nor do we de-
termine values in a joint probability distribution of the random variables. Rather
we ascertain probabilities, concerning relationships among random variables,
that are accessible to us. For example, we might determine the prior probability
P (LungCancer = present), and the conditional probabilities P (ChestXray =
positive|LungCancer = present), P (ChestXray = positive|LungCancer =
absent), P (LungCancer = present| SmokingHistory = yes), and finally
P (LungCancer = present|SmokingHistory = no). We would obtain these
probabilities either from a physician or from data or from both. Thinking in
terms of relative frequencies, P (LungCancer = present|SmokingHistory =
yes) can be estimated by observing individuals with a smoking history, and de-
termining what fraction of these have lung cancer. A physician is used to judging
such a probability by observing patients with a smoking history. On the other
hand, one does not readily judge values in a joint probability distribution such as
P (LungCancer = present, ChestXray = positive, SmokingHistory = yes). If
this is not apparent, just think of the situation in which there are 100 or more
random variables (which there are in some applications) in the joint probability
distribution. We can obtain data and think in terms of probabilistic relation-
ships among a few random variables at a time; we do not identify the joint
probabilities of several events.
As to the nature of these probabilities, consider first the introduction of the
toxic chemical. The probabilities of the values of CarcinogenicP otential will
be based on data involving this chemical and similar ones. However, this is
certainly not a repeatable experiment like a coin toss, and therefore the prob-
abilities are not relative frequencies. They are subjective probabilities based
on a careful analysis of the situation. As to the medical application involv-
ing a set of entities, we often obtain the probabilities from estimates of rel-
ative frequencies involving entities in the set. For example, we might obtain
P (ChestXray = positive|LungCancer = present) by observing 1000 patients
with lung cancer and determining what fraction have positive chest X-rays.
However, as will be illustrated in Section 1.2.3, when we do Bayesian inference
using these probabilities, we are computing the probability of a specific individ-
ual being in some state, which means it is a subjective probability. Recall from
Section 1.1.1 that a relative frequency is not a property of any one of the trials
(patients), but rather it is a property of the entire sequence of trials. You may

feel that we are splitting hairs. Namely, you may argue the following: “This
subjective probability regarding a specific patient is obtained from a relative
frequency and therefore has the same value as it. We are simply calling it a
subjective probability rather than a relative frequency.” But even this is not
the case. Even if the probabilities used to do Bayesian inference are obtained
from frequency data, they are only estimates of the actual relative frequencies.
So they are subjective probabilities obtained from estimates of relative frequen-
cies; they are not relative frequencies. When we manipulate them using Bayes’
theorem, the resultant probability is therefore also only a subjective probability.
Once we judge the probabilities for a given application, we can often ob-
tain values in a joint probability distribution of the random variables. Theo-
rem 1.5 in Section 1.3.3 gives a way to do this when there are many vari-
ables. Presently, we illustrate the case of two variables. Suppose we only
identify the random variables LungCancer and ChestXray, and we judge the
prior probability P (LungCancer = present), and the conditional probabili-
ties P (ChestXray = positive|LungCancer = present) and P (ChestXray =
positive|LungCancer = absent). Probabilities of values in a joint probability
distribution can be obtained from these probabilities using the rule for condi-
tional probability as follows:

P (present, positive) = P (positive|present)P (present)

P (present, negative) = P (negative|present)P (present)


P (absent, positive) = P (positive|absent)P (absent)
P (absent, negative) = P (negative|absent)P (absent).
Note that we used our abbreviated notation. We see then that at the outset we
identify random variables and their probabilistic relationships, and values in a
joint probability distribution can then often be obtained from the probabilities
relating the random variables. So what is the sample space? We can think of the
sample space as simply being the Cartesian product of the sets of all possible
values of the random variables. For example, consider again the case where we
only identify the random variables LungCancer and ChestXray, and ascertain
probability values in a joint distribution as illustrated above. We can define the
following sample space:

Ω=

{(present, positive), (present, negative), (absent, positive), (absent, negative)}.

We can consider each random variable a function on this space that maps
each tuple into the value of the random variable in the tuple. For example,
LungCancer would map (present, positive) and (present, negative) each into
present. We then assign each elementary event the probability of its correspond-
ing event in the joint distribution. For example, we assign

P̂ ({(present, positive)}) = P (LungCancer = present, ChestXray = positive).



It is not hard to show that this does yield a probability function on Ω and
that the initially assessed prior probabilities and conditional probabilities are
the probabilities they notationally represent in this probability space (This is a
special case of Theorem 1.5.).
Since random variables are actually identified first and only implicitly be-
come functions on an implicit sample space, it seems we could develop the con-
cept of a joint probability distribution without the explicit notion of a sample
space. Indeed, we do this next. Following this development, we give a theorem
showing that any such joint probability distribution is a joint probability dis-
tribution of the random variables with the variables considered as functions on
an implicit sample space. Definition 1.1 (of a probability function) and Defi-
nition 1.5 (of a random variable) can therefore be considered the fundamental
definitions for probability theory because they pertain both to applications
where sample spaces are directly identified and ones where random variables
are directly identified.

1.2.2 A Definition of Random Variables and Joint Probability Distributions for Bayesian Inference
For the purpose of modeling the types of problems discussed in the previous
subsection, we can define a random variable X as a symbol representing any
one of a set of values, called the space of X. For simplicity, we will assume
the space of X is countable, but the theory extends naturally to the case where
it is not. For example, we could identify the random variable LungCancer as
having the space {present, absent}. We use the notation X = x as a primitive
which is used in probability expressions. That is, X = x is not defined in terms
of anything else. For example, in application LungCancer = present means the
entity being modeled has lung cancer, but mathematically it is simply a primi-
tive which is used in probability expressions. Given this definition and primitive,
we have the following direct definition of a joint probability distribution:

Definition 1.8 Let a set of n random variables V = {X1 , X2 , . . . Xn } be speci-


fied such that each Xi has a countable space. A function that assigns a
real number P (X1 = x1 , X2 = x2 , . . . Xn = xn ) to every combination of values
of the xi ’s such that the value of xi is chosen from the space of Xi , is called a
joint probability distribution of the random variables in V if it satisfies the
following conditions:

1. For every combination of values of the xi ’s,

0 ≤ P (X1 = x1 , X2 = x2 , . . . Xn = xn ) ≤ 1.

2. We have

   ∑_{x1, x2, ..., xn} P (X1 = x1 , X2 = x2 , . . . Xn = xn ) = 1.

The notation ∑_{x1, x2, ..., xn} means the sum as the variables x1 , . . . xn go
through all possible values in their corresponding spaces.

Note that a joint probability distribution, obtained by defining random vari-


ables as functions on a sample space, is one way to create a joint probability
distribution that satisfies this definition. However, there are other ways as the
following example illustrates:

Example 1.20 Let V = {X, Y }, let X and Y have spaces {x1, x2}¹ and {y1, y2}
respectively, and let the following values be specified:

P (X = x1) = .2 P (Y = y1) = .3
P (X = x2) = .8 P (Y = y2) = .7.

Next define a joint probability distribution of X and Y as follows:

P (X = x1, Y = y1) = P (X = x1)P (Y = y1) = (.2)(.3) = .06

P (X = x1, Y = y2) = P (X = x1)P (Y = y2) = (.2)(.7) = .14


P (X = x2, Y = y1) = P (X = x2)P (Y = y1) = (.8)(.3) = .24
P (X = x2, Y = y2) = P (X = x2)P (Y = y2) = (.8)(.7) = .56.
Since the values sum to 1, this is another way of specifying a joint probability
distribution according to Definition 1.8. This is how we would specify the joint
distribution if we felt X and Y were independent.

Notice that our original specifications, P (X = xi) and P (Y = yi), nota-


tionally look like marginal distributions of the joint distribution developed in
Example 1.20. However, Definition 1.8 only defines a joint probability distri-
bution P ; it does not mention anything about marginal distributions. So the
initially specified values do not represent marginal distributions of our joint dis-
tribution P according to that definition alone. The following theorem enables
us to consider them marginal distributions in the classical sense, and therefore
justifies our notation.

Theorem 1.3 Let a set of random variables V be given and let a joint proba-
bility distribution of the variables in V be specified according to Definition 1.8.
Let Ω be the Cartesian product of the sets of all possible values of the random
variables. Assign probabilities to elementary events in Ω as follows:

P̂ ({(x1 , x2 , . . . xn )}) = P (X1 = x1 , X2 = x2 , . . . Xn = xn ).

These assignments result in a probability function on Ω according to Definition
1.1. Furthermore, if we let X̂i denote a function (random variable in the classical
sense) on this sample space that maps each tuple in Ω to the value of xi in
that tuple, then the joint probability distribution of the X̂i ’s is the same as the
originally specified joint probability distribution.

¹We use subscripted variables Xi to denote different random variables. So we do not
subscript to denote a value of a random variable. Rather we write the index next to the
variable.
Proof. The proof is left as an exercise.
Example 1.21 Suppose we directly specify a joint probability distribution of X
and Y , with spaces {x1, x2} and {y1, y2} respectively, as done in Example
1.20. That is, we specify the following probabilities:
P (X = x1, Y = y1)
P (X = x1, Y = y2)
P (X = x2, Y = y1)
P (X = x2, Y = y2).

Next we let Ω = {(x1, y1), (x1, y2), (x2, y1), (x2, y2)}, and we assign

P̂ ({(xi, yj)}) = P (X = xi, Y = yj).

Then we let X̂ and Ŷ be functions on Ω defined by the following tables:

x    y    X̂((x, y))          x    y    Ŷ ((x, y))

x1   y1   x1                  x1   y1   y1
x1   y2   x1                  x1   y2   y2
x2   y1   x2                  x2   y1   y1
x2   y2   x2                  x2   y2   y2

Theorem 1.3 says the joint probability distribution of these random variables is
the same as the originally specified joint probability distribution. Let’s illustrate
this:

P̂ (X̂ = x1, Ŷ = y1) = P̂ ({(x1, y1), (x1, y2)} ∩ {(x1, y1), (x2, y1)})
= P̂ ({(x1, y1)})
= P (X = x1, Y = y1).

Due to Theorem 1.3, we need no postulates for probabilities of combinations


of primitives not addressed by Definition 1.8. Furthermore, we need no new
definition of conditional probability for joint distributions created according
to that definition. We can just postulate that both obtain values according
to the set theoretic definition of a random variable. For example, consider
Example 1.20. Due to Theorem 1.3, P̂ (X̂ = x1) is simply a value in a marginal
distribution of the joint probability distribution. So its value is computed as
follows:

P̂ (X̂ = x1) = P̂ (X̂ = x1, Ŷ = y1) + P̂ (X̂ = x1, Ŷ = y2)


= P (X = x1, Y = y1) + P (X = x1, Y = y2)
= P (X = x1)P (Y = y1) + P (X = x1)P (Y = y2)
= P (X = x1)[P (Y = y1) + P (Y = y2)]
= P (X = x1)[1] = P (X = x1),

which is the originally specified value. This result is a special case of Theorem
1.5.
Note that the specified probability values are not by necessity equal to the
probabilities they notationally represent in the marginal probability distribu-
tion. However, since we used the rule for independence to derive the joint
probability distribution from them, they are in fact equal to those values. For
example, if we had defined P (X = x1, Y = y1) = P (X = x2)P (Y = y1), this
would not be the case. Of course we would not do this. In practice, all specified
values are always the probabilities they notationally represent in the resultant
probability space (Ω, P̂ ). Since this is the case, we will no longer show carets
over P or X when referring to the probability function in this space or a random
variable on the space.

Example 1.22 Let V = {X, Y }, let X and Y have spaces {x1, x2} and {y1, y2}
respectively, and let the following values be specified:

P (X = x1) = .2 P (Y = y1|X = x1) = .3


P (X = x2) = .8 P (Y = y2|X = x1) = .7

P (Y = y1|X = x2) = .4
P (Y = y2|X = x2) = .6.

Next define a joint probability distribution of X and Y as follows:

P (X = x1, Y = y1) = P (Y = y1|X = x1)P (X = x1) = (.3)(.2) = .06

P (X = x1, Y = y2) = P (Y = y2|X = x1)P (X = x1) = (.7)(.2) = .14


P (X = x2, Y = y1) = P (Y = y1|X = x2)P (X = x2) = (.4)(.8) = .32
P (X = x2, Y = y2) = P (Y = y2|X = x2)P (X = x2) = (.6)(.8) = .48.
Since the values sum to 1, this is another way of specifying a joint probability
distribution according to Definition 1.8. As we shall see in Example 1.23 in the
following subsection, this is the way they are specified in simple applications of
Bayes’ Theorem.
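
The construction in Example 1.22 is easy to script. The following Python sketch is our own illustration (the dictionary layout is an assumption, not from the text); it multiplies the specified conditional distributions of Y given X by the specified distribution of X and checks that the resulting values satisfy Definition 1.8.

from fractions import Fraction as F

# Specified values from Example 1.22.
P_X = {"x1": F("0.2"), "x2": F("0.8")}
P_Y_given_X = {
    ("y1", "x1"): F("0.3"), ("y2", "x1"): F("0.7"),
    ("y1", "x2"): F("0.4"), ("y2", "x2"): F("0.6"),
}

# P(X = x, Y = y) = P(Y = y | X = x) P(X = x), as in the example.
joint = {(x, y): P_Y_given_X[(y, x)] * P_X[x]
         for x in P_X for y in ("y1", "y2")}

assert joint[("x1", "y1")] == F("0.06")
assert joint[("x2", "y1")] == F("0.32")
assert sum(joint.values()) == 1       # condition 2 of Definition 1.8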

In the remainder of this text, we will create joint probability distributions


using Definition 1.8. Before closing, we note that this definition pertains to any
application in which we model naturally occurring phenomena by identifying
random variables directly, which includes most applications of statistics.

1.2.3 A Classical Example of Bayesian Inference


The following examples illustrate how Bayes’ theorem has traditionally been
applied to compute the probability of an event of interest from known proba-
bilities.

Example 1.23 Suppose Joe has a routine diagnostic chest X-ray required of
all new employees at Colonial Bank, and the X-ray comes back positive for lung
cancer. Joe then becomes certain he has lung cancer and panics. But should
he? Without knowing the accuracy of the test, Joe really has no way of knowing
how probable it is that he has lung cancer. When he discovers the test is not
absolutely conclusive, he decides to investigate its accuracy and he learns that it
has a false negative rate of .4 and a false positive rate of .02. We represent this
accuracy as follows. First we define these random variables:

Variable Value When the Variable Takes This Value


Test positive X-ray is positive
negative X-ray is negative
LungCancer present Lung cancer is present
absent Lung cancer is absent

We then have these conditional probabilities:

P (Test = positive|LungCancer = present) = .6

P (Test = positive|LungCancer = absent) = .02.


Given these probabilities, Joe feels a little better. However, he then realizes he
still does not know how probable it is that he has lung cancer. That is, the prob-
ability of Joe having lung cancer is P (LungCancer = present|Test = positive),
and this is not one of the probabilities listed above. Joe finally recalls Bayes’
theorem and realizes he needs yet another probability to determine the probability
of his having lung cancer. That probability is P (LungCancer = present), which
is the probability of his having lung cancer before any information on the test
results were obtained. Even though this probability is not based on any informa-
tion concerning the test results, it is based on some information. Specifically, it
is based on all information (relevant to lung cancer) known about Joe before he
took the test. The only information about Joe, before he took the test, was that
he was one of a class of employees who took the test routinely required of new
employees. So, when he learns only 1 out of every 1000 new employees has lung
cancer, he assigns .001 to P (LungCancer = present). He then employs Bayes’
theorem as follows (Note that we again use our abbreviated notation):

P (present|positive)
   = P (positive|present)P (present) / [P (positive|present)P (present) + P (positive|absent)P (absent)]
   = (.6)(.001) / [(.6)(.001) + (.02)(.999)]
   = .029.

So Joe now feels that the probability of his having lung cancer is only about .03,
and he relaxes a bit while waiting for the results of further testing.

A probability like P (LungCancer = present) is called a prior probability


because, in a particular model, it is the probability of some event prior to
updating the probability of that event, within the framework of that model,
using new information. Do not mistakenly think it means a probability prior to
any information. A probability like P (LungCancer = present|Test = positive)
is called a posterior probability because it is the probability of an event
after its prior probability has been updated, within the framework of some
model, based on new information. The following example illustrates how prior
probabilities can change depending on the situation we are modeling.

Example 1.24 Now suppose Sam is having the same diagnostic chest X-ray
as Joe. However, he is having the X-ray because he has worked in the mines
for 20 years, and his employers became concerned when they learned that about
10% of all such workers develop lung cancer after many years in the mines.
Sam also tests positive. What is the probability he has lung cancer? Based on
the information known about Sam before he took the test, we assign a prior
probability of .1 to Sam having lung cancer. Again using Bayes’ theorem, we
conclude that P (LungCancer = present|Test = positive) = .769 for Sam. Poor
Sam concludes it is quite likely that he has lung cancer.
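
The contrast between Joe and Sam can be made concrete with a few lines of code. The following Python sketch is our own illustration (the function and parameter names are assumptions); it applies Bayes’ theorem with the shared test characteristics and the two different prior probabilities.

def posterior(prior_present, p_pos_given_present=0.6, p_pos_given_absent=0.02):
    """P(LungCancer = present | Test = positive) via Bayes' theorem."""
    numerator = p_pos_given_present * prior_present
    denominator = numerator + p_pos_given_absent * (1 - prior_present)
    return numerator / denominator

print(round(posterior(0.001), 3))   # Joe's prior .001 gives posterior 0.029
print(round(posterior(0.1), 3))     # Sam's prior .1 gives posterior 0.769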

The previous two examples illustrate that a probability value is relative to


one’s information about an event; it is not a property of the event itself. Both
Joe and Sam either do or do not have lung cancer. It could be that Joe has
it and Sam does not. However, based on our information, our degree of belief
(probability) that Sam has it is much greater than our degree of belief that Joe
has it. When we obtain more information relative to the event (e.g. whether
Joe smokes or has a family history of cancer), the probability will change.

1.3 Large Instances / Bayesian Networks


Bayesian inference is fairly simple when it involves only two related variables as
in Example 1.23. However, it becomes much more complex when we want to
do inference with many related variables. We address this problem next. After
discussing the difficulties inherent in representing large instances and in doing
inference when there are a large number of variables, we describe a relation-
ship, called the Markov condition, between graphs and probability distributions.
Then we introduce Bayesian networks, which exploit the Markov condition in
order to represent large instances efficiently.

1.3.1 The Difficulties Inherent in Large Instances


Recall the situation, discussed at the beginning of this chapter, where several
features (variables) are related through inference chains. We introduced the
following example of this situation: Whether or not an individual has a history
of smoking has a direct influence both on whether or not that individual has
bronchitis and on whether or not that individual has lung cancer. In turn, the

presence or absence of each of these features has a direct influence on whether


or not the individual experiences fatigue. Also, the presence or absence of
lung cancer has a direct influence on whether or not a chest X-ray is positive.
We noted that, in this situation, we would want to do probabilistic inference
involving features that are not related via a direct influence. We would want to
determine, for example, the conditional probabilities both of having bronchitis
and of having lung cancer when it is known an individual smokes, is fatigued,
and has a positive chest X-ray. Yet bronchitis has no influence on whether a
chest X-ray is positive. Therefore, this conditional probability cannot readily
be computed using a simple application of Bayes’ theorem. So how could we
compute it? Next we develop a straightforward algorithm for doing so, but we
will show it has little practical value. First we give some notation. As done
previously, we will denote random variables using capital letters such as X and
use the corresponding lower case letters x1, x2, etc. to denote the values in the
space of X. In the current example, we define the random variables that follow:

Variable Value When the Variable Takes this Value


H h1 There is a history of smoking
h2 There is no history of smoking
B b1 Bronchitis is present
b2 Bronchitis is absent
L l1 Lung cancer is present
l2 Lung cancer is absent
F f1 Fatigue is present
f2 Fatigue is absent
C c1 Chest X-ray is positive
c2 Chest X-ray is negative

Note that we presented this same table at the beginning of this chapter, but we
called the random variables ‘features’. We had not yet defined random variable
at that point; so we used the informal term feature. If we knew the joint
probability distribution of these five variables, we could compute the conditional
probability of an individual having bronchitis given the individual smokes, is
fatigued, and has a positive chest X-ray as follows:
P (b1|h1, f1, c1) = P (b1, h1, f1, c1) / P (h1, f1, c1)
                  = [ ∑_l P (b1, h1, f1, c1, l) ] / [ ∑_{b,l} P (b, h1, f1, c1, l) ],        (1.5)

where ∑_{b,l} means the sum as b and l go through all their possible values. There
are a number of problems here. First, as noted previously, the values in the joint
probability distribution are ordinarily not readily accessible. Second, there are
an exponential number of terms in the sums in Equality 1.5. That is, there
are 2^2 terms in the sum in the denominator, and, if there were 100 variables
in the application, there would be 2^97 terms in that sum. So, in the case
of a large instance, even if we had some means for eliciting the values in the

joint probability distribution, using Equality 1.5 simply requires determining


too many such values and doing too many calculations with them. We see that
this method has no practical value when the instance is large.
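
To see what the straightforward algorithm amounts to, here is a minimal sketch in Python (our own illustration; the representation of the joint distribution as a dictionary keyed by value tuples (h, b, l, f, c) is an assumption, and the joint itself is taken as given). It computes P (b1|h1, f1, c1) exactly as in Equality 1.5; the point of the surrounding discussion is that the sums, and the dictionary itself, grow exponentially with the number of variables.

from itertools import product

# Assume joint[(h, b, l, f, c)] holds P(h, b, l, f, c) for all 32 value combinations.
def cond_prob_b1(joint):
    """P(B = b1 | H = h1, F = f1, C = c1) by brute-force summation (Equality 1.5)."""
    numerator = sum(joint[("h1", "b1", l, "f1", "c1")] for l in ("l1", "l2"))
    denominator = sum(joint[("h1", b, l, "f1", "c1")]
                      for b, l in product(("b1", "b2"), ("l1", "l2")))
    return numerator / denominator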
Bayesian networks address the problems of 1) representing the joint proba-
bility distribution of a large number of random variables; and 2) doing Bayesian
inference with these variables. Before introducing them in Section 1.3.3, we
need to discuss the Markov condition.

1.3.2 The Markov Condition


First let’s review some graph theory. Recall that a directed graph is a pair
(V, E), where V is a finite, nonempty set whose elements are called nodes (or
vertices), and E is a set of ordered pairs of distinct elements of V. Elements of
E are called edges (or arcs), and if (X, Y ) ∈ E, we say there is an edge from
X to Y and that X and Y are each incident to the edge. If there is an edge
from X to Y or from Y to X, we say X and Y are adjacent. Suppose we have
a set of nodes [X1 , X2 , . . . Xk ], where k ≥ 2, such that (Xi−1 , Xi ) ∈ E for 2 ≤ i ≤ k.
We call the set of edges connecting the k nodes a path from X1 to Xk . The
nodes X2 , . . . Xk−1 are called interior nodes on path [X1 , X2 , . . . Xk ]. The
subpath of path [X1 , X2 , . . . Xk ] from Xi to Xj is the path [Xi , Xi+1 , . . . Xj ]
where 1 ≤ i < j ≤ k. A directed cycle is a path from a node to itself. A
simple path is a path containing no subpaths which are directed cycles. A
directed graph G is called a directed acyclic graph (DAG) if it contains no
directed cycles. Given a DAG G = (V, E) and nodes X and Y in V, Y is called
a parent of X if there is an edge from Y to X, Y is called a descendent of
X and X is called an ancestor of Y if there is a path from X to Y , and Y is
called a nondescendent of X if Y is not a descendent of X. Note that in this
text X is not considered a descendent of X because we require k ≥ 2 in the
definition of a path. Some texts say there is an empty path from X to X.
We can now state the following definition:

Definition 1.9 Suppose we have a joint probability distribution P of the ran-


dom variables in some set V and a DAG G = (V, E). We say that (G, P ) satisfies
the Markov condition if for each variable X ∈ V, {X} is conditionally in-
dependent of the set of all its nondescendents given the set of all its parents.
Using the notation established in Section 1.1.4, this means if we denote the sets
of parents and nondescendents of X by PAX and NDX respectively, then

IP ({X}, NDX |PAX ).

When (G, P ) satisfies the Markov condition, we say G and P satisfy the
Markov condition with each other.
If X is a root, then its parent set PAX is empty. So in this case the Markov
condition means {X} is independent of NDX . That is, IP ({X}, NDX ). It is
not hard to show that IP ({X}, NDX |PAX ) implies IP ({X}, B|PAX ) for any
B ⊆ NDX . It is left as an exercise to do this. Notice that PAX ⊆ NDX . So
we could define the Markov condition by saying that X must be conditionally

[Figure 1.3: Four DAGs over the nodes V , C, and S. The probability distribution in
Example 1.25 satisfies the Markov condition only for the DAGs in (a), (b), and (c).]

independent of NDX − PAX given PAX . However, it is standard to define it as


above. When discussing the Markov condition relative to a particular distri-
bution and DAG (as in the following examples), we just show the conditional
independence of X and NDX − PAX .

Example 1.25 Let Ω be the set of objects in Figure 1.2, and let P assign a
probability of 1/13 to each object. Let random variables V , S, and C be as
defined as in Example 1.19. That is, they are defined as follows:

Variable Value Outcomes Mapped to this Value


V v1 All objects containing a ‘1’
v2 All objects containing a ‘2’
S s1 All square objects
s2 All round objects
C c1 All black objects
c2 All white objects

[Figure 1.4: A DAG illustrating the Markov condition. Its nodes are H, B, L, F , and C,
and its edges are H → B, H → L, B → F , L → F , and L → C (the parent sets are those
listed in Example 1.26).]

Then, as shown in Example 1.19, IP ({V }, {S}|{C}). Therefore, (G, P ) satisfies


the Markov condition if G is the DAG in Figure 1.3 (a), (b), or (c). However,
(G, P ) does not satisfy the Markov condition if G is the DAG in Figure 1.3 (d)
because IP ({V }, {S}) is not the case.

Example 1.26 Consider the DAG G in Figure 1.4. If (G, P ) satisfied the
Markov condition for some probability distribution P , we would have the follow-
ing conditional independencies:

Node PA Conditional Independency


C {L} IP ({C}, {H, B, F }|{L})
B {H} IP ({B}, {L, C}|{H})
F {B, L} IP ({F }, {H, C}|{B, L})
L {H} IP ({L}, {B}|{H})
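
These independencies can be read off mechanically from the DAG. The following Python sketch is our own illustration (the edge list encodes the DAG of Figure 1.4 as described in Example 1.26, and the helper names are assumptions); for each node it computes the parents and nondescendents and prints the independency IP ({X}, NDX − PAX |PAX ) asserted by the Markov condition.

# Edges of the DAG in Figure 1.4: H -> L, H -> B, L -> C, L -> F, B -> F.
nodes = ["H", "B", "L", "F", "C"]
edges = [("H", "L"), ("H", "B"), ("L", "C"), ("L", "F"), ("B", "F")]

def parents(x):
    return {u for (u, v) in edges if v == x}

def descendents(x):
    found, frontier = set(), [x]
    while frontier:
        u = frontier.pop()
        for (a, b) in edges:
            if a == u and b not in found:
                found.add(b)
                frontier.append(b)
    return found

for x in nodes:
    pa = parents(x)
    nd = set(nodes) - descendents(x) - {x}      # nondescendents of x
    other = nd - pa                             # the set shown in the table
    if other:
        print("IP({%s}, {%s} | {%s})" % (x, ", ".join(sorted(other)), ", ".join(sorted(pa))))
# Prints, for example, IP({C}, {B, F, H} | {L}) and IP({L}, {B} | {H}).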

Recall from Section 1.3.1 that the number of terms in a joint probability
distribution is exponential in terms of the number of variables. So, in the
case of a large instance, we could not fully describe the joint distribution by
determining each of its values directly. Herein lies one of the powers of the
Markov condition. Theorem 1.4, which follows shortly, shows if (G, P ) satisfies
the Markov condition, then P equals the product of its conditional probability
distributions of all nodes given values of their parents in G, whenever these
conditional distributions exist. After proving this theorem, we discuss how this
means we often need ascertain far fewer values than if we had to determine all
values in the joint distribution directly. Before proving it, we illustrate what it
means for a joint distribution to equal the product of its conditional distributions
of all nodes given values of their parents in a DAG G. This would be the case

for a joint probability distribution P of the variables in the DAG in Figure 1.4
if, for all values of f , c, b, l, and h,

P (f, c, b, l, h) = P (f |b, l)P (c|l)P (b|h)P (l|h)P (h), (1.6)

whenever the conditional probabilities on the right exist. Notice that if one of
them does not exist for some combination of the values of the variables, then
P (b, l) = 0 or P (l) = 0 or P (h) = 0, which implies P (f, c, b, l, h) = 0 for that
combination of values. However, there are cases in which P (f, c, b, l, h) = 0 and
the conditional probabilities still exist. For example, this would be the case if
all the conditional probabilities on the right existed and P (f|b, l) = 0 for some
combination of values of f , b, and l. So Equality 1.6 must hold for all nonzero
values of the joint probability distribution plus some zero values.
We now give the theorem.

Theorem 1.4 If (G, P ) satisfies the Markov condition, then P is equal to the
product of its conditional distributions of all nodes given values of their parents,
whenever these conditional distributions exist.
Proof. We prove the case where P is discrete. Order the nodes so that if Y is
a descendent of Z, then Y follows Z in the ordering. Such an ordering is called
an ancestral ordering. Examples of such an ordering for the DAG in Figure
1.4 are [H, L, B, C, F ] and [H, B, L, F, C]. Let X1 , X2 , . . . Xn be the resultant
ordering. For a given set of values of x1 , x2 , . . . xn , let pai be the subset of
these values containing the values of Xi ’s parents. We need show that whenever
P (pai ) ≠ 0 for 1 ≤ i ≤ n,

P (xn , xn−1 , . . . x1 ) = P (xn |pan )P (xn−1 |pan−1 ) · · · P (x1 |pa1 ).

We show this using induction on the number of variables in the network. As-
sume, for some combination of values of the xi ’s, that P (pai ) ≠ 0 for 1 ≤ i ≤ n.
induction base: Since PA1 is empty,

P (x1 ) = P (x1 |pa1 ).

induction hypothesis: Suppose for this combination of values of the xi ’s that

P (xi , xi−1 , . . . x1 ) = P (xi |pai )P (xi−1 |pai−1 ) · · · P (x1 |pa1 ).

induction step: We need show for this combination of values of the xi ’s that

P (xi+1 , xi , . . . x1 ) = P (xi+1 |pai+1 )P (xi |pai ) · · · P (x1 |pa1 ). (1.7)

There are two cases:

Case 1: For this combination of values

P (xi , xi−1 , . . . x1 ) = 0. (1.8)



Clearly, Equality 1.8 implies

P (xi+1 , xi , . . . x1 ) = 0.

Furthermore, due to Equality 1.8 and the induction hypothesis, there is some k,
where 1 ≤ k ≤ i, such that P (xk |pak ) = 0. So Equality 1.7 holds.
Case 2: For this combination of values

P (xi , xi−1 , . . . x1 ) ≠ 0.

In this case,

P (xi+1 , xi , . . . x1 ) = P (xi+1 |xi , . . . x1 )P (xi , . . . x1 )


= P (xi+1 |pai+1 )P (xi , . . . x1 )
= P (xi+1 |pai+1 )P (xi |pai ) · · · P (x1 |pa1 ).

The first equality is due to the rule for conditional probability, the second is due
to the Markov condition and the fact that X1 , . . . Xi are all nondescendents of
Xi+1 , and the last is due to the induction hypothesis.

Example 1.27 Recall that the joint probability distribution in Example 1.25
satisfies the Markov condition with the DAG in Figure 1.3 (a). Therefore, owing
to Theorem 1.4,
P (v, s, c) = P (v|c)P (s|c)P (c), (1.9)
and we need only determine the conditional distributions on the right in Equality
1.9 to uniquely determine the values in the joint distribution. We illustrate that
this is the case for v1, s1, and c1:
P (v1, s1, c1) = P (One ∩ Square ∩ Black) = 2/13

P (v1|c1)P (s1|c1)P (c1) = P (One|Black) × P (Square|Black) × P (Black)
                         = 1/3 × 2/3 × 9/13 = 2/13.
Figure 1.5 shows the DAG along with the conditional distributions.
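
The check carried out above for v1, s1, and c1 can be repeated for all eight combinations of values. The following Python sketch is our own illustration (the dictionary layout is an assumption; the conditional distributions are those shown in Figure 1.5); it forms the product P (v|c)P (s|c)P (c) for every combination and confirms that the results sum to 1, with P (v1, s1, c1) = 2/13.

from fractions import Fraction as F
from itertools import product

P_C = {"c1": F(9, 13), "c2": F(4, 13)}
P_V_given_C = {("v1", "c1"): F(1, 3), ("v2", "c1"): F(2, 3),
               ("v1", "c2"): F(1, 2), ("v2", "c2"): F(1, 2)}
P_S_given_C = {("s1", "c1"): F(2, 3), ("s2", "c1"): F(1, 3),
               ("s1", "c2"): F(1, 2), ("s2", "c2"): F(1, 2)}

# Joint values given by the product of the conditional distributions (Theorem 1.4).
joint = {(v, s, c): P_V_given_C[(v, c)] * P_S_given_C[(s, c)] * P_C[c]
         for v, s, c in product(("v1", "v2"), ("s1", "s2"), ("c1", "c2"))}

assert joint[("v1", "s1", "c1")] == F(2, 13)
assert sum(joint.values()) == 1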

The joint probability distribution in Example 1.25 also satisfies the Markov
condition with the DAGs in Figures 1.3 (b) and (c). Therefore, the probability
distribution in that example equals the product of the conditional distributions
for each of them. You should verify this directly.
If the DAG in Figure 1.3 (d) and some probability distribution P satisfied
the Markov condition, Theorem 1.4 would imply

P (v, s, c) = P (c|v, s)P (v)P (s).

Such a distribution is discussed in Exercise 1.20.



[Figure 1.5: The DAG of Figure 1.3 (a), with edges C → V and C → S, along with the
conditional distributions

   P(c1) = 9/13      P(c2) = 4/13
   P(v1|c1) = 1/3    P(v2|c1) = 2/3    P(v1|c2) = 1/2    P(v2|c2) = 1/2
   P(s1|c1) = 2/3    P(s2|c1) = 1/3    P(s1|c2) = 1/2    P(s2|c2) = 1/2

The probability distribution discussed in Example 1.27 is equal to the product of these
conditional distributions.]

Theorem 1.4 often enables us to reduce the problem of determining a huge


number of probability values to that of determining relatively few. The num-
ber of values in the joint distribution is exponential in terms of the number of
variables. However, each of these values is uniquely determined by the condi-
tional distributions (due to the theorem), and, if each node in the DAG does
not have too many children, there are not many values in these distributions.
For example, if each variable has two possible values and each node has at most
one parent, we would need to ascertain less than 2n probability values to de-
termine the conditional distributions when the DAG contains n nodes. On the
other hand, we would need to ascertain 2^n − 1 values to determine the joint
probability distribution directly. In general, if each variable has two possible
values and each node has at most k parents, we need to ascertain less than 2^k n
values to determine the conditional distributions. So if k is not large, we have
a manageable number of values.
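
As a quick illustration of this reduction (our own arithmetic, using the bounds just stated): for n = 100 binary variables with at most k = 3 parents per node, the conditional distributions require fewer than 2^k · n values, whereas specifying the joint distribution directly requires 2^n − 1 values.

n, k = 100, 3
print((2 ** k) * n)      # 800 values suffice for the conditional distributions
print(2 ** n - 1)        # 1267650600228229401496703205375 values for the joint directly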
Something may seem amiss to you. Namely, in Example 1.25, we started
with an underlying sample space and probability function, specified some ran-
dom variables, and showed that if P is the probability distribution of these
variables and G is the DAG in Figure 1.3 (a), then (P, G) satisfies the Markov
condition. We can therefore apply Theorem 1.4 to conclude we need only de-
termine the conditional distributions of the variables for that DAG to find any
value in the joint distribution. We illustrated this in Example 1.27. How-
ever, as discussed in Section 1.2, in application we do not ordinarily specify
an underlying sample space and probability function from which we can com-
pute conditional distributions. Rather we identify random variables and values
in conditional distributions directly. For example, in an application involv-
ing the diagnosis of lung cancer, we identify variables like SmokingHistory,
LungCancer, and ChestXray, and probabilities such as P (SmokingHistory =

yes), P (LungCancer = present|SmokingHistory = yes), and P (ChestXray =


positive| LungCancer = present). How do we know the product of these con-
ditional distributions is a joint distribution at all, much less one satisfying the
Markov condition with some DAG? Theorem 1.4 tells us only that if we start
with a joint distribution satisfying the Markov condition with some DAG, the
values in that joint distribution will be given by the product of the condi-
tional distributions. However, we must work in reverse. We must start with
the conditional distributions and then be able to conclude the product of these
distributions is a joint distribution satisfying the Markov condition with some
DAG. The theorem that follows enables us to do just that.

Theorem 1.5 Let a DAG G be given in which each node is a random variable,
and let a discrete conditional probability distribution of each node given values of
its parents in G be specified. Then the product of these conditional distributions
yields a joint probability distribution P of the variables, and (G, P ) satisfies the
Markov condition.
Proof. Order the nodes according to an ancestral ordering. Let X1 , X2 , . . . Xn
be the resultant ordering. Next define

P (x1 , x2 , . . . xn ) = P (xn |pan )P (xn−1 |pan−1 ) · · · P (x2 |pa2 )P (x1 |pa1 ),

where PAi is the set of parents of Xi in G and P (xi |pai ) is the specified
conditional probability distribution. First we show this does indeed yield a joint
probability distribution. Clearly, 0 ≤ P (x1 , x2 , . . .xn ) ≤ 1 for all values of the
variables. Therefore, to show we have a joint distribution, Definition 1.8 and
Theorem 1.3 imply we need only show that the sum of P (x1 , x2 , . . . xn ), as the
variables range through all their possible values, is equal to one. To that end,
∑_{x1} ∑_{x2} · · · ∑_{xn−1} ∑_{xn} P (x1 , x2 , . . . xn )

  = ∑_{x1} ∑_{x2} · · · ∑_{xn−1} ∑_{xn} P (xn |pan )P (xn−1 |pan−1 ) · · · P (x2 |pa2 )P (x1 |pa1 )

  = ∑_{x1} [ ∑_{x2} [ · · · [ ∑_{xn−1} [ ∑_{xn} P (xn |pan ) ] P (xn−1 |pan−1 ) ] · · · ] P (x2 |pa2 ) ] P (x1 |pa1 )

  = ∑_{x1} [ ∑_{x2} [ · · · [ ∑_{xn−1} [1] P (xn−1 |pan−1 ) ] · · · ] P (x2 |pa2 ) ] P (x1 |pa1 )

  = ∑_{x1} [ ∑_{x2} [ · · · 1 · · · ] P (x2 |pa2 ) ] P (x1 |pa1 )

  = ∑_{x1} [1] P (x1 |pa1 ) = 1.

It is left as an exercise to show that the specified conditional distributions are


the conditional distributions they notationally represent in the joint distribution.

Finally, we show the Markov condition is satisfied. To do this, we need show


for 1 ≤ k ≤ n that whenever P (pak ) ≠ 0, if P (ndk |pak ) ≠ 0 and P (xk |pak ) ≠ 0
then P (xk |ndk , pak ) = P (xk |pak ), where NDk is the set of nondescendents of
Xk in G. Since PAk ⊆ NDk , we need only show P (xk |ndk ) = P (xk |pak ).
First for a given k, order the nodes so that all and only nondescendents of
Xk precede Xk in the ordering. Note that this ordering depends on k, whereas
the ordering in the first part of the proof does not. Clearly then

NDk = {X1 , X2 , . . . Xk−1 }.

Let
Dk = {Xk+1 , Xk+2 , . . . Xn }.
In what follows, ∑_{dk} means the sum as the variables in dk go through all
their possible values. Furthermore, notation such as x̂k means the variable has
a particular value; notation such as n̂dk means all variables in the set have
particular values; and notation such as pan means some variables in the set
may not have particular values. We have that
P (x̂k |n̂dk ) = P (x̂k , n̂dk ) / P (n̂dk )

  = [ ∑_{dk} P (x̂1 , x̂2 , . . . x̂k , xk+1 , . . . xn ) ] / [ ∑_{dk ∪{xk }} P (x̂1 , x̂2 , . . . x̂k−1 , xk , . . . xn ) ]

  = [ ∑_{dk} P (xn |pan ) · · · P (xk+1 |pak+1 )P (x̂k |p̂ak ) · · · P (x̂1 |p̂a1 ) ] / [ ∑_{dk ∪{xk }} P (xn |pan ) · · · P (xk |pak )P (x̂k−1 |p̂ak−1 ) · · · P (x̂1 |p̂a1 ) ]

  = [ P (x̂k |p̂ak ) · · · P (x̂1 |p̂a1 ) ∑_{dk} P (xn |pan ) · · · P (xk+1 |pak+1 ) ] / [ P (x̂k−1 |p̂ak−1 ) · · · P (x̂1 |p̂a1 ) ∑_{dk ∪{xk }} P (xn |pan ) · · · P (xk |pak ) ]

  = P (x̂k |p̂ak ) [1] / [1] = P (x̂k |p̂ak ),

where the common factors P (x̂k−1 |p̂ak−1 ) · · · P (x̂1 |p̂a1 ) have been canceled.
In the second to last step, the sums are each equal to one for the following reason.
Each is a sum of a product of conditional probability distributions specified for
a DAG. In the case of the numerator, that DAG is the subgraph, of our original
DAG G, consisting of the variables in Dk , and in the case of the denominator,
it is the subgraph consisting of the variables in Dk ∪{Xk }. Therefore, the fact
that each sum equals one follows from the first part of this proof.

Notice that the theorem requires that specified conditional distributions be


discrete. Often in the case of continuous distributions it still holds. For example,
it holds for the Gaussian distributions introduced in Section 4.1.3. However,
in general, it does not hold for all continuous conditional distributions. See
[Dawid and Studeny, 1999] for an example in which no joint distribution having
the specified distributions as conditionals even exists.

[Figure 1.6: A DAG containing random variables, along with specified conditional
distributions. Its edges are X → Y and Y → Z, and the specified distributions are

   P(x1) = .3    P(y1|x1) = .6    P(z1|y1) = .2
   P(x2) = .7    P(y2|x1) = .4    P(z2|y1) = .8
                 P(y1|x2) = 0     P(z1|y2) = .5
                 P(y2|x2) = 1     P(z2|y2) = .5]

Example 1.28 Suppose we specify the DAG G shown in Figure 1.6, along with
the conditional distributions shown in that figure. According to Theorem 1.5,

P (x, y, z) = P (z|y)P (y|x)P (x)

satisfies the Markov condition with G.
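
Theorem 1.5 can be checked numerically for this small network. The following Python sketch is our own illustration (the dictionary layout is an assumption; the conditional distributions are those of Figure 1.6); it forms the product P (z|y)P (y|x)P (x), verifies that the values sum to 1, and verifies the Markov condition for Z, namely that P (z|x, y) = P (z|y) whenever P (x, y) ≠ 0.

from fractions import Fraction as F
from itertools import product

P_X = {"x1": F("0.3"), "x2": F("0.7")}
P_Y_given_X = {("y1", "x1"): F("0.6"), ("y2", "x1"): F("0.4"),
               ("y1", "x2"): F("0"),   ("y2", "x2"): F("1")}
P_Z_given_Y = {("z1", "y1"): F("0.2"), ("z2", "y1"): F("0.8"),
               ("z1", "y2"): F("0.5"), ("z2", "y2"): F("0.5")}

# P(x, y, z) = P(z|y) P(y|x) P(x), as in Example 1.28.
joint = {(x, y, z): P_Z_given_Y[(z, y)] * P_Y_given_X[(y, x)] * P_X[x]
         for x, y, z in product(("x1", "x2"), ("y1", "y2"), ("z1", "z2"))}

assert sum(joint.values()) == 1

# Markov condition: Z is independent of its nondescendent X given its parent Y.
for x, y, z in product(("x1", "x2"), ("y1", "y2"), ("z1", "z2")):
    p_xy = sum(joint[(x, y, zz)] for zz in ("z1", "z2"))
    if p_xy != 0:
        assert joint[(x, y, z)] / p_xy == P_Z_given_Y[(z, y)]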

Note that the proof of Theorem 1.5 does not require that values in the
specified conditional distributions be nonzero. The next example shows what
can happen when we specify some zero values.

Example 1.29 Consider first the DAG and specified conditional distributions
in Figure 1.6. Because we have specified a zero conditional probability, namely
P (y1|x2), there are events in the joint distribution with zero probability. For
example,

P (x2, y1, z1) = P (z1|y1)P (y1|x2)P (x2) = (.2)(0)(.7) = 0.

However, there is no event with zero probability that is a conditioning event in


one of the specified conditional distributions. That is, P (x1), P (x2), P (y1), and
P (y2) are all nonzero. So the specified conditional distributions all exist.
Consider next the DAG and specified conditional distributions in Figure 1.7.
We have

P (x1, y1) = P (x1, y1|w1)P (w1) + P (x1, y1|w2)P (w2)


= P (x1|w1)P (y1|w1)P (w1) + P (x1|w2)P (y1|w2)P (w2)
= (0)(.8)(.1) + (.6)(0)(.9) = 0.

The event x1, y1 is a conditioning event in one of the specified distributions,


namely P (zi|x1, y1), but it has zero probability, which means we can’t condition

[Figure 1.7: A DAG in which W is a parent of X and Y , and X and Y are the parents
of Z, along with the specified conditional distributions

   P(w1) = .1           P(w2) = .9
   P(x1|w1) = 0         P(x2|w1) = 1         P(x1|w2) = .6        P(x2|w2) = .4
   P(y1|w1) = .8        P(y2|w1) = .2        P(y1|w2) = 0         P(y2|w2) = 1
   P(z1|x1,y1) = .3     P(z2|x1,y1) = .7     P(z1|x1,y2) = .4     P(z2|x1,y2) = .6
   P(z1|x2,y1) = .1     P(z2|x2,y1) = .9     P(z1|x2,y2) = .5     P(z2|x2,y2) = .5

The event x1, y1 has 0 probability.]

on it. This poses no problem; it simply means we have specified some meaning-
less values, namely P (zi|x1, y1). The Markov condition is still satisfied because
P (z|w, x, y) = P (z|x, y) whenever P (x, y) ≠ 0 (See the definition of conditional
independence for sets of random variables in Section 1.1.4.).
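
The zero-probability conditioning event can also be computed directly. A brief Python sketch (our own illustration, using the distributions of Figure 1.7) confirms that P (x1, y1) = 0, so the specified values P (zi|x1, y1) are never actually used when the joint distribution is formed.

from fractions import Fraction as F

P_W = {"w1": F("0.1"), "w2": F("0.9")}
P_X_given_W = {("x1", "w1"): F(0), ("x1", "w2"): F("0.6")}
P_Y_given_W = {("y1", "w1"): F("0.8"), ("y1", "w2"): F(0)}

# P(x1, y1) = sum over w of P(x1 | w) P(y1 | w) P(w), since X and Y share only parent W.
p_x1_y1 = sum(P_X_given_W[("x1", w)] * P_Y_given_W[("y1", w)] * P_W[w]
              for w in ("w1", "w2"))
assert p_x1_y1 == 0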

1.3.3 Bayesian Networks


Let P be a joint probability distribution of the random variables in some set
V, and G = (V, E) be a DAG. We call (G, P ) a Bayesian network if (G, P )
satisfies the Markov condition. Owing to Theorem 1.4, P is the product of its
conditional distributions in G, and this is the way P is always represented in
a Bayesian network. Furthermore, owing to Theorem 1.5, if we specify a DAG
G and any discrete conditional distributions (and many continuous ones), we
obtain a Bayesian network. This is the way Bayesian networks are constructed
in practice. Figures 1.5, 1.6, and 1.7 all show Bayesian networks.

Example 1.30 Figure 1.8 shows a Bayesian network containing the probability
distribution discussed in Example 1.23.

Example 1.31 Recall the objects in Figure 1.2 and the resultant joint probability dis-
tribution P discussed in Example 1.25. Example 1.27 developed a Bayesian
network (namely the one in Figure 1.5) containing that distribution. Figure 1.9
shows another Bayesian network whose conditional distributions are obtained
Other documents randomly have
different content
GEORG FRIDERIC
HANDEL
(Born at Halle, February 23, 1685; died at London, April 14, 1759)

“Mr. Georg Frideric Handel,” Mr. Runciman once wrote, “is by far the
most superb personage one meets, in the history of music. He
alone, of all the musicians, lived his life straight through in the grand
[29]
manner.” When Handel wrote “pomposo” on a page, he wrote not
idly. What magnificent simplicity in outlines!... For melodic lines of
such chaste and noble beauty, such Olympian authority, no one has
approached Handel. “Within that circle none durst walk but he.” His
nearest rival is the Chevalier Gluck.

And this giant of a man could express a tenderness known only to


him and Mozart, for Schubert, with all his melodic wealth and
sensitiveness, could fall at times into sentimentalism, and
Schumann’s intimate confessions were sometimes whispered. Handel
in his tenderness was always manly. No one has approached him in
his sublimely solemn moments! Few composers, if there is anyone,
have been able to produce such pathetic or sublime effects by
simple means, by a few chords even. He was one of the greatest
melodists. His fugal pages seldom seem labored; they are
distinguished by amazing vitality and spontaneity. In his slow
movements, his instrumental airs, there is a peculiar dignity, a
peculiar serenity, and a direct appeal that we find in no other
composer.

Would that we could hear more of Handel’s music! At present he is


known in this country as the composer of The Messiah, the 151
variations entitled The Harmonious Blacksmith, and the monstrous
perversion of a simple operatic air dignified, forsooth, by the title
“Handel’s Largo.”
TWELVE CONCERTI GROSSI, FOR STRING
ORCHESTRA

No. 1, in G major
No. 2, in F major
No. 3, in E minor
No. 4, in A minor
No. 5, in D major
No. 6, in G minor
No. 7, in B flat major
No. 8, in C minor
No. 9, in F major
No. 10, in D minor
No. 11, in A major
No. 12, in B minor

Handel apparently took a peculiar pride in his Concerti Grossi. He


published them himself, and by subscription. They would probably
be more popular today if all conductors realized the fact that music
in Handel’s time was performed with varied and free inflections; that
his players undoubtedly employed many means of expression. As
German organists of forty years ago insisted that Bach’s preludes,
fugues, toccatas, should be played with full organ and rigidity of
tempo, although those who heard Bach play admired his skill in
registration, many conductors find in all of the allegros of Handel’s
concertos only a thunderous speech and allow little change in
tempo. In the performance of this old music, old but fresh, the two
essential qualities demanded by Handel’s music, suppleness of pace
and fluidity of expression, named by Volbach, are usually
disregarded. Unless there be elasticity in performance, hearers are
not to be blamed if they find the music formal, monotonous, dull.

The twelve concertos were composed within three weeks.


Kretzschmar has described them as impressionistic pictures,
probably without strict reference to the modern use of the word
[30]
“impressionistic.” They are not of equal worth. Romain Rolland
finds the seventh and three last mediocre. In the tenth he 152
discovers French influences and declares that the last allegro
might be an air for a music box. Yet the music at its best is
aristocratic and noble.

Handel’s twelve grand concertos for strings were composed between


September 29 and October 30, 1739. The London Daily Post of
October 29, 1739, said: “This day are published proposals for
printing by subscription, with His Majesty’s royal license and
protection, Twelve Grand Concertos, in Seven Parts, for four violins,
a tenor, a violoncello, with a thorough-bass for the harpsichord.
Composed by Mr. Handel. Price to subscribers, two guineas. Ready
to be delivered by April next. Subscriptions are taken by the author,
at his house in Brook Street, Hanover Square, and by Walsh.” In an
advertisement on November 22 the publisher added, “Two of the
above concertos will be performed this evening at the Theatre Royal,
Lincoln’s Inn.” The concertos were published on April 21, 1740. In an
advertisement a few days afterwards Walsh said, “These concertos
were performed at the Theatre Royal in Lincoln’s Inn Fields, and now
are played in most public places with the greatest applause.” Victor
Schoelcher made this comment in his Life of Handel: “This was the
case with all the works of Handel. They were so frequently
performed at contemporaneous concerts and benefits that they
seem, during his lifetime, to have quite become public property.
Moreover, he did nothing which the other theaters did not attempt to
imitate. In the little theater of the Haymarket, evening
entertainments were given in exact imitation of his ‘several
concertos for different instruments, with a variety of chosen airs of
the best master, and the famous Salve Regina of Hasse.’ The
handbills issued by the nobles at the King’s Theatre make mention
[31]
also of ‘several concertos for different instruments.’”

The year 1739, in which these concertos were composed, was the
year of the first performance of Handel’s Saul (January 16) and
Israel in Egypt (April 4)—both oratorios were composed in 1738—
also of the music to Dryden’s Ode for St. Cecilia’s Day (November
22).

Romain Rolland, discussing the form concerto grosso, which consists


essentially of a dialogue between a group of soloists, the concertino
(trio of two solo violins and solo bass with cembalo) and the 153
chorus of instruments, concerto grosso, believes that Handel
at Rome in 1708 was struck by Corelli’s works in this field, for
several of his concertos of Opus 3 are dated 1710, 1716, 1722.
Geminiani introduced the concerto into England—three volumes
appeared in 1732, 1735, 1748—and he was a friend of Handel.

It is stated that the word “concerto,” as applied to a piece for a solo


instrument with accompaniment, first appeared in a treatise by
Scipio Bargaglia (Venice, 1587); that Giuseppe Torelli, who died in
1708, was the first to suggest a larger number of instruments in a
concerto, and to give the name concerto grosso to this species of
composition. But Michelletti, seventeen years before, had published
his Sinfonie e concerti a quatro, and in 1698 his Concerti musicali,
while the word “concerto” occurs frequently in the musical
terminology of the seventeenth century. It was Torelli who,
determining the form of the grand solo for violin, opened the way to
Arcangelo Corelli, the father of modern violinists, whether
composers or virtuosos.

Romain Rolland insisted that the instrumental music of Handel has
the nature of a constant improvisation, music to be served piping
hot to an audience, and should preserve this character in
performance. “When you have studied with minute care each detail,
obtained from your orchestra an irreproachable precision, tonal
purity, and finish, you will have done nothing unless you have made
the face of the improvising genius rise from the work.”

FRANZ JOSEF HAYDN
(Born at Rohrau, Lower Austria, March 31, 1732; died at Vienna,
May 31, 1809)

Haydn has been sadly misunderstood by present followers of
tradition who have spoken of him as a man of the old school, while
Mozart was a forerunner of Beethoven. Thus they erred. Mozart
summed up the school of his day and wrote imperishable music.
There has been only one Mozart, and there is no probability of
another being born for generations to come; but Haydn was often
nearer in spirit to the young Beethoven. It is customary to speak
lightly of Haydn as an honest Austrian who wrote light-hearted
allegros, also minuets by which one is not reminded of a court with
noble dames smiling graciously on gallant cavaliers, but sees
peasants thumping the ground with heavy feet and uttering joyful
cries.

It is said carelessly that Haydn was a simple fellow who wrote at
ease many symphonies and quartets that, to quote Berlioz, recall
“the innocent joys of the fireside and the pot-au-feu.” But Haydn was
shrewd and observing—read his diary, kept in London—and if he was
plagued with a shrewish wife he found favor with other women.
Dear Mrs. Schroeter of London received letters from him breathing
love, not merely complimentary affection. And it is said of Haydn that
he was only sportive in his music, having a fondness for the
bassoon. But Haydn could express tenderness, regret, sorrow in his
music.

LONDON SYMPHONIES
SYMPHONY NO. 104, IN D MAJOR (B. & H. NO. 2)

I. Adagio; allegro
II. Andante
III. Menuetto; trio
IV. Allegro spiritoso

Haydn’s symphony is ever fresh, spontaneous, yet contrapuntally
worked in a masterly manner. What a skillful employment of little
themes in themselves of slight significance save for their Blakelike
innocence and gayety! Yet in the introduction there is a deeper note,
for, contrary to current and easy belief, Haydn’s music is not all beer,
skittles, and dancing. There are even gloomy pages in some of his
quartets; tragic pages in his Seven Last Words, and the prelude to
The Creation, depicting chaos, is singularly contemporaneous.

Haydn composed twelve symphonies in England for Salomon. His
name began to be mentioned in England in 1765. Symphonies by
him were played in concerts given by J. C. Bach, Abel, and others in
the ’seventies. Lord Abingdon tried in 1783 to persuade Haydn to
take the direction of the Professional Concerts which had just been
founded. Gallini asked him his terms for an opera. Salomon, violinist,
conductor, manager, sent a music publisher, one Bland—an
auspicious name—to coax him to London, but Haydn was loath to
leave Prince Esterhazy. Prince Nicolaus died in 1790, and his
successor, Prince Anton, who did not care for music, dismissed the
orchestra at Esterház and kept only a brass band; but he added 400
gulden to the annual pension of 1,000 gulden bequeathed to Haydn
by Prince Nicolaus. Haydn then made Vienna his home. And one day,
when he was at work in his house, the “Hamberger” house in which
Beethoven also once lived, a man appeared, and said: “I am
Salomon from London, and come to fetch you with me. We will
agree on the job tomorrow.” Haydn was intensely
amused by the use of the word “job.” The contract for one season
was as follows: Haydn should receive three hundred pounds for an
opera written for the manager Gallini, £300 for six symphonies and
£200 for the copyright, £200 for twenty new compositions to be
produced in as many concerts under Haydn’s direction, and £200 as
guarantee for a benefit concert. Salomon deposited 5,000 gulden
with the bankers, Fries & Company, as a pledge of good faith. Haydn
had 500 gulden ready for traveling expenses, and he borrowed 450
more from his prince. Haydn agreed to conduct the symphonies at
the piano.

Salomon about 1786 began to give concerts as a manager, in
addition to fiddling at concerts of others. He had established a series
of subscription concerts at the Hanover Square Rooms, London. He
thought of Haydn as a great drawing card. The violinist W. Cramer,
associated with the Professional Concerts, had also approached
Haydn, who would not leave his prince. The news of Prince
Esterhazy’s death reached Salomon, who then happened to be at
Bonn. He therefore hastened to Vienna.

The first of the Salomon-Haydn concerts was given March 11, 1791,
at the Hanover Square Rooms. Haydn, as was the custom, “presided
at the harpsichord”; Salomon stood as leader of the orchestra. The
symphony was in D major, No. 2, of the London list of twelve. The
adagio was repeated, an unusual occurrence, but the critics
preferred the first movement.
The orchestra was thus composed: twelve to sixteen violins, four
violas, three violoncellos, four double basses, flute, oboe, bassoon,
horns, trumpets, drums—in all about forty players.

Haydn and Salomon left Vienna on December 15, 1790, and arrived
at Calais by way of Munich and Bonn. They crossed the English
Channel on New Year’s Day, 1791. From Dover they traveled to
London by stage. The journey from Vienna took them seventeen
days. Haydn was received with great honor.

Haydn left London towards the end of June, 1792. Salomon invited
him again to write six new symphonies. Haydn arrived in London,
February 4, 1794, and did not leave England until August 15, 1795.
The orchestra at the opera concerts in the grand new concert hall of
the King’s Theatre was made up of sixty players. Haydn’s
engagement was again a profitable one. He made by concerts,
lessons, symphonies, etc., £1,200. He was honored in many ways by
the King, the Queen, and the nobility. He was
twenty-six times at Carlton House, where the Prince of Wales had a
concert room; and, after he had waited long for his pay, he sent a
bill from Vienna for 100 guineas, which Parliament promptly settled.

LONDON SYMPHONIES
SYMPHONY NO. 94, IN G MAJOR, “SURPRISE” (B. & H. NO. 6)

I. Adagio cantabile; vivace assai
II. Andante
III. Menuetto
IV. Allegro di molto

This symphony, known as the “Surprise,” and in Germany as the
symphony “with the drumstroke,” is the third of the twelve Salomon
symphonies as arranged in the order of their appearance in the
catalogue of the Philharmonic Society (London).

Composed in 1791, this symphony was performed for the first time
on March 23, 1792, at the sixth Salomon concert in London. It
pleased immediately and greatly. The Oracle characterized the
second movement as one of Haydn’s happiest inventions, and
likened the “surprise”—which is occasioned by the sudden orchestral
crash in the andante—to a shepherdess, lulled by the sound of a
distant waterfall, awakened suddenly from sleep and frightened by
the unexpected discharge of a musket.

Griesinger in his Life of Haydn (1810) contradicts the story that
Haydn introduced these crashes to arouse the Englishwomen from
sleep. Haydn also contradicted it; he said it was his intention only to
surprise the audience by something new. “The first allegro of my
symphony was received with countless ‘Bravos,’ but enthusiasm rose
to its highest pitch after the andante with the drumstroke. ‘Ancora!
ancora!’ was cried out on all sides, and Pleyel himself complimented
me on my idea.” On the other hand, Gyrowetz, in his Autobiography,
page 59 (1848), said that he visited Haydn just after he had
composed the andante, and Haydn was so pleased with it
that he played it to him on the piano, and sure of his
success, said with a roguish laugh: “The women will cry out here!”
C. F. Pohl added a footnote[32] when he quoted this account of
Gyrowetz, and called attention to Haydn’s humorous borrowing of a
musical thought of Martini to embellish his setting of music to the
commandment, “Thou shalt not steal,” when he had occasion to put
music to the Ten Commandments. The Surprise symphony was long
known in London as “the favorite grand overture.”

PARIS SYMPHONIES
SYMPHONY NO. 88, IN G MAJOR (B. & H. NO. 13)

I. Adagio; allegro
II. Largo
III. Menuetto; trio
IV. Finale; allegro con spirito

The Parisian orchestra, which Haydn undoubtedly had in mind, was a
large one—forty violins, twelve violoncellos, eight double basses—so
that the composer could be sure of strong contrasts in performance
by the string section. Fortunate composer—whose symphonies one
can, sitting back, enjoy without inquiring into psychological intention
or noting attempts at realism in musical seascapes and landscapes—
music not inspired by book or picture—just music; now pompous,
now merry, and in more serious moments, never too sad, but with a
constant feeling for tonal grace and beauty.

Haydn wrote a set of six symphonies for a society in Paris known as
the Concert de la loge olympique. They were ordered in 1784, when
Haydn was living at Esterház. Composed in the course of the years
1784-89, they are in C, G minor, E flat, B flat, D, A. No. 1, in C, has
been entitled the “Bear”; No. 2, in G minor, has been entitled
the “Hen”; and No. 4, in B flat, is known as the “Queen of
France.” This symphony is the first of a second set, of which five
were composed in 1787, 1788, 1790. If the sixth was written, it
cannot now be identified. This one in G major was written in 1787,
and is numbered 88 in the full and chronological listing of
Mandyczewski (given in Grove’s Dictionary).

I. The first movement opens with a short, slow introduction, adagio,
G major, 3-4, which consists for the most part of strong staccato
chords which alternate with softer passages. The main body of the
movement, allegro, G major, begins with the first theme, a dainty
one, announced piano by the strings without double basses and
repeated forte by the full orchestra with a new counter figure in the
bass. A subsidiary theme is but little more than a melodic variation
of the first. So, too, the short conclusion theme—in oboes and
bassoon, then in the strings—is only a variation of the first. The free
fantasia is long for the period and is contrapuntally elaborate. There
is a short coda on the first theme.

II. Largo, D major, 3-4. A serious melody is sung by oboe and
violoncellos to an accompaniment of violas, double basses, bassoon,
and horn. The theme is repeated with a richer accompaniment;
while the first violins have a counter figure. After a transitional
passage the theme is repeated by a fuller orchestra, with the melody
in first violins and flute, then in the oboe and violoncello. The
development is carried along on the same lines. There is a very
short coda.

III. The Menuetto, allegretto, G major, 3-4, with trio, is in the regular
minuet form in its simplest manner.

IV. The finale, allegro con spirito, G major, 2-4, is a rondo on the
theme of a peasant country dance, and it is fully developed. Haydn
in his earlier symphonies adopted for the finale the form of his first
movement. Later he preferred the rondo form, with its couplets and
refrains, or repetitions of a short and frank chief theme. “In some
finales of his last symphonies,” says Brenet,[33] “he gave freer reins
to his fancy, and modified with greater independence the form of his
first allegros; but his fancy, always prudent and moderate, is more
like the clear, precise arguments of a great orator than the headlong
inspiration of a poet. Moderation is one of the characteristics of
Haydn’s genius; moderation in the dimensions, in the sonority, in the
melodic shape; the liveliness of his melodic thought never
seems extravagant, its melancholy never induces sadness.”

The usual orchestration of Haydn’s symphonies (including those
listed above) consisted of one (or two) flutes, two oboes, two
bassoons, two horns, two trumpets, kettledrums, and strings. In his
last years (from 1791) he followed Mozart’s lead in introducing two
clarinets. The clarinets accordingly appear in the London symphony
in D major, described in this chapter.—EDITOR.

PAUL HINDEMITH
(Born at Hanau, on November 16, 1895)
“KONZERTMUSIK” FOR STRING AND BRASS INSTRUMENTS

There was a time in Germany when Hindemith was regarded as the
white-haired boy; the hope for the glorious future; greater even than
Schönberg. In England, they look on Hindemith coolly—an able and
fair-minded critic there has remarked: “The more one hears of the
later Hindemith, the more exasperating his work becomes. From
time to time some little theme is shown at first in sympathetic
fashion, then submitted to the most mechanical processes known to
music. Any pleasant jingle seems to mesmerize the composer, who
repeats it much as Bruckner repeats his themes—Hindemith abuses
the liberty shown to a modern.”

But Hindemith is not always mesmerized by a pleasant jingle.
Witness his oratorio, performed with great success. The title is
forbidding, The Unending, but the performance takes only two
hours. The Concert Music, composed for the fiftieth anniversary of
the Boston Symphony Orchestra, is more than interesting. It cannot
be called “noble,” not even “grand,” but it holds the attention by its
strength in structure, its spirit, festal without blatancy. For once
there is no too evident desire to stun the hearer. It is as if the
composer had written for his own pleasure. It is virile music with
relieving passages—few in number—that have genuine and simple
beauty of thought and expression; exciting at times by the rushing
rhythm.

Hindemith, at the age of eleven, played the viola in the theater and
in the moving-picture house; when he was thirteen, he was a viola
virtuoso, and he now plays in public his own concertos for that
instrument. When he was twenty, he was first concert master of the
Frankfort opera house. His teachers in composition were Arnold
Mendelssohn and Bernhard Sekles at the Hoch Conservatory in
Frankfort. He is the viola player in the Amar Quartet (Licco Amar,
Walter Casper, Paul Hindemith, and Maurits Frank—in 1926 his
brother Rudolf was the violoncellist).

Apropos of a performance of one of his works, in Berlin, the late
Adolf Weissmann wrote in a letter to the Christian Science Monitor:
“Promising indeed among the young German composers is Paul
Hindemith. More than promising he is not yet. For the viola player
Paul Hindemith, travelling with the Amar Quartet through half
Europe, has seldom time enough to work carefully. The greater part
of his compositions were created in the railway car. Is it, therefore,
to be wondered at that their principal virtue lies in their rhythm? The
rhythm of the rolling car is, apparently, blended with the rhythm
springing from within. It is always threatening to outrun all the other
values of what he writes. For that these values exist cannot be
denied.”

A foreign correspondent of the London Daily Telegraph, having heard
one of his compositions, wrote: “It was all rather an exhilarating
nightmare, as if Hindemith had been attempting to prove the
theorem of Pythagoras in terms of parallelograms, which is amusing,
but utterly absurd.”

It has been said by A. Machabey that Hindemith has been influenced
in turn by Wagner, Brahms—“an influence still felt”; Richard Strauss;
Max Reger, who attracted him by his ingenuity and freedom from
elementary technic; Stravinsky, who made himself felt after the war;
and finally by the theatrical surroundings in which he lives. “He is
opposed to post-romanticism. Not being able to escape from
romanticism in his youth, today he seems to be completely stripped
of it. Freed from the despotism of a text, from the preëstablished
plan of programme music, from obedience to the caprices
and emphasis of sentiment, music in itself suffices.... The
reaction against romanticism is doubled by a democratic spirit which
was general in Germany after the war.” Therefore he has had many
supporters, who welcomed, “besides this new spirit, an unexpected
technic, unusual polyphony and instrumentation, in which one found
a profound synthesis of primordial rhythms, tonalities enriched and
extended by Schönberg and Hauer, economical and rational
groupings of jazz.” Then his compositions are so varied: chamber
music for the ultra-fastidious; melodies for amateurs; dramatic works
for opera-goers; orchestral pieces for frequenters of concerts; he has
written for débutantes and children; for the cinema, marionettes,
mechanical pianos, brass bands. Work has followed work with an
amazing rapidity.

ARTHUR HONEGGER
(Born at Havre, France, on March 10, 1892)
“PACIFIC 231,” ORCHESTRAL MOVEMENT

Some say that Honegger had no business to summon a locomotive
engine for inspiration. No doubt this music of Honegger’s is “clever,”
but cleverness in music quickly palls. Louis Antoine Jullien years ago
in this country excited wild enthusiasm by his Firemen’s Quadrille, in
which a conflagration, the bells, the rush of the firemen, the
squirting of water, and the shout of the foreman, “Wash her, Thirteen!” were
graphically portrayed.

But there is majestic poetry in great machines, even in railway
engines. One of Turner’s most striking pictures is the one depicting a
hare running madly across a viaduct with a pursuing locomotive in
rain and mist. What was the most poetic thing of the Philadelphia
exposition of 1876? The superb Corliss engine, epic in strength and
grandeur. Walt Whitman, Kipling, and others have found inspiration
in a locomotive; why reproach a composer for attempting to express
“the visual impression and the physical sensation” of it? One may
like or dislike Pacific 231, but it is something more than a musical
joke; it was not merely devised for sensational effect.

When Pacific 231 was first performed in Paris at Koussevitzky’s
concerts, May 8 and 15, 1924, Honegger made this commentary:

“I have always had a passionate love for locomotives. To me they—
and I love them passionately as others are passionate in their love
for horses or women—are like living creatures.

“What I wanted to express in the Pacific is not the noise of an
engine, but the visual impression and the physical sensation of it.
These I strove to express by means of a musical composition. Its
point of departure is an objective contemplation: quiet respiration of
an engine in state of immobility; effort for moving; progressive
increase of speed, in order to pass from the ‘lyric’ to the pathetic
state of an engine of three hundred tons driven in the night at a
speed of one hundred and twenty per hour.

“As a subject I have taken an engine of the ‘Pacific’ type, known as
‘231,’ an engine for heavy trains of high speed.”

Other locomotive engines are classified as “Atlantic,” “Mogul.” The
number 231 here refers to the arrangement of the “Pacific’s” wheels:
2-3-1.

“On a sort of rhythmic pedal sustained by the violins is built the
impressive image of an intelligent monster, a joyous giant.”

Pacific 231 is scored for piccolo, two flutes, two oboes, English horn,
two clarinets, bass clarinet, two bassoons, double bassoon, four
horns, three trumpets, three trombones, bass tuba, snare drum,
bass drum, cymbals, tam-tam, strings.

The locomotive engine has been the theme of strange tales by
Dickens, Marcel Schwob, Kipling, and of Zola’s novel, La Bête
humaine. It is the hero of Abel Gance’s film, La Roue, for which it is
said Honegger adapted music, and of the American film, The Iron
Horse.
