
NORTH-HOLLAND SERIES IN

APPLIED MATHEMATICS
AND MECHANICS
EDITORS:

H. A. LAUWERIER
Institute of Applied Mathematics
University of Amsterdam

W. T. KOITER
Laboratory of Applied Mechanics
Technical University, Delft

VOLUME 10

NORTH-HOLLAND PUBLISHING COMPANY - AMSTERDAM • LONDON


PROBABILITY THEORY

ALFRÉD RÉNYI

Member of the Hungarian Academy of Sciences

Professor of Mathematics at the
Eötvös Loránd University, Budapest
Director of the Mathematical Institute of the
Hungarian Academy of Sciences, Budapest

1970

NORTH-HOLLAND PUBLISHING COMPANY-AMSTERDAM • LONDON

AMERICAN ELSEVIER PUBLISHING COMPANY, INC. —NEW YORK


All Rights Reserved. No part of this publication may be reproduced, stored in a retrieval
system, or transmitted, in any form or by any means, electronic, mechanical, photocopying,
recording or otherwise, without the prior permission of the Copyright owner.

This book is the enlarged version of

WAHRSCHEINLICHKEITSRECHNUNG

VEB Deutscher Verlag der Wissenschaften


Berlin 1962

VALÓSZÍNŰSÉGSZÁMÍTÁS

Tankönyvkiadó, Budapest 1966

and

CALCUL DES PROBABILITIES


Dunod, Paris 1966

English translation by
DR. LASZLO VEKERDI

© AKADÉMIAI KIADÓ, BUDAPEST 1970

JOINT EDITION PUBLISHED BY

NORTH-HOLLAND PUBLISHING COMPANY • AMSTERDAM


AND
AKADÉMIAI KIADÓ
PUBLISHING HOUSE OF THE HUNGARIAN ACADEMY OF SCIENCES

SOLE DISTRIBUTORS FOR THE U.S.A. AND CANADA:

AMERICAN ELSEVIER PUBLISHING COMPANY, INC.


52 VANDERBILT AVENUE
NEW YORK, N.Y. 10017

Library of Congress Catalog Card Number 77-97208


ISBN 7204 2360 0

PRINTED IN HUNGARY
PREFACE

One of the latest works of Alfréd Rényi is presented to the reader in this
volume. Before his sudden death on the 1st of February, 1970, he corrected
the first proofs of the book, but he no longer had time for the final proofreading*
or for writing the preface he had planned.
This preface is, therefore, a brief memorial to a great mathematician,
mentioning a few features of Alfréd Rényi's professional career.
Professor Rényi lectured on probability theory at various universities
in an uninterrupted series of years, from 1948 until his untimely
death. His academic career started at the University of Debrecen and was
continued at the University of Budapest, where he was professor at the
Chair of Probability Theory. In the meantime he was an invited lecturer, for
shorter or longer periods, at several scientific centres of the world. Thus he
was visiting professor at Stanford University, Michigan State University,
the University of Erlangen, and the University of North Carolina.
Besides his teaching activities, Professor Rényi was director of the Mathematical
Institute of the Hungarian Academy of Sciences for one and a half
decades. Under his direction the Institute developed into an important
research centre of mathematics.
He participated in the editorial work of a number of journals. He was
the editor of Studia Scientiarum Mathematicarum Hungarica and a
member of the Editorial Boards of Acta Mathematica, Annales Sci. Math.,
Publicationes Math., Matematikai Lapok, Zeitschrift für Wahrscheinlichkeitstheorie,
Journal of Applied Probability, Journal of Combinatorial
Analysis, and Information and Control.
The careful reader will certainly note how the long teaching experience
and keen interest in research are combined in the present book. The
material of Professor Rényi's courses on probability theory was first published
in the form of lecture notes. It appeared as a book in Hungarian in
1954, and in a completely revised German translation in 1962. The latter
book was the basis of a new Hungarian edition in 1965 and of the French
translation published in 1966.

* This was done by Mr. P. Bártfai, Mrs. A. Földes and Mrs. L. Rejtő.


In the new Hungarian edition the author omitted some theoretical chapters
of the German text, inserting new ones dealing with recent results and
modern practical methods. The present book contains the complete texts of
the Hungarian and German versions and is supplemented with some additional
new material. The presentation of a number of well-chosen problems and
exercises will certainly be regarded as a valuable feature of the book; some
of them, following the traditions of the Hungarian Mathematical Competitions,
have been selected from the material of original publications.
In his lectures and books, Alfréd Rényi always strove to arouse interest
in recent results of research, besides presenting the fundamental textbook
material of probability theory. Accordingly, he often wrote about problems
on which he was engaged at the time. In the present book the reader will also
find many particulars which do not occur in other textbooks dealing with
the same field. These problems have been selected mostly from among
research topics pursued by the author and his school; they are presented
with the aim of bringing the spirit of living, rapidly developing present-day
mathematics within the reach of the beginner.
Pál Révész
CONTENTS

CH. I. ALGEBRAS OF EVENTS

§ 1. Fundamental relations 9
§ 2. Some further operations and relations 13
§ 3. Axiomatical development of the algebra of events 16
§ 4. On the structure of finite algebras of events 18
§ 5. Representation of algebras of events by algebras of sets 21
§ 6. Exercises 25

CH. II. PROBABILITY

§ 1. Aim and scope of the theory of probability 29


§ 2. The notion of probability 30
§ 3. Probability algebras 33
§ 4. Finite probability algebras 40
§ 5. Probabilities and combinatorics 41
§ 6. Kolmogorov probability spaces 46
§ 7. The extension of rings of sets, algebras of sets and measures
§ 8. Conditional probabilities 54
§ 9. The independence of events 57
§ 10. “Geometric” probabilities 62
§ 11. Conditional probability spaces 69
§ 12. Exercises 74

CH. III. DISCRETE RANDOM VARIABLES

§ 1. Complete systems of events and probability distributions 84


§ 2. The theorem of total probability and Bayes’ theorem 84
§ 3. Classical probability distributions 87
§ 4. The concept of a random variable 94
§ 5. The independence of random variables 99
§ 6. Convolutions of discrete random variables 100
§ 7. Expectation of a discrete random variable 102
§ 8. Some theorems on expectations 105
§ 9. The variance 110
§ 10. Some theorems concerning the variance 115
§ 11. The correlation coefficient 116
§ 12. The Poisson distribution 122
§ 13. Some applications of the Poisson distribution 125
§ 14. The algebra of probability distributions 131

§15. Generating functions 135


§16. Approximation of the binomial distribution by the normal distribution 149
§ 17. Bernoulli’s law of large numbers 157
§ 18. Exercises 159

CH. IV. GENERAL THEORY OF RANDOM VARIABLES

§ 1. The general concept of a random variable 172


§ 2. Distribution functions and density functions 173
§ 3. Probability distribution in several dimensions 177
§ 4. Conditional distributions and conditional density functions 181
§ 5. Independent random variables 182
§ 6. The uniform distribution 184
§ 7. The normal distribution 186
§ 8. Distribution of a function of a random variable 193
§ 9. The convolution of distributions 195
§ 10. Distribution of a function of several random variables 203
§ 11. The general notion of expectation 208
§ 12. Expectation vectors of higher dimensional probability distributions 217
§ 13. The median and the quantiles 217
§ 14. The general notions of standard deviation and variance 219
§ 15. On some other measures of fluctuation 222
§ 16. Variance in the higher dimensional case 225
§ 17. Exercises 232

CH. V. MORE ABOUT RANDOM VARIABLES

§ 1. Random variables on conditional probability spaces 245


§ 2. Generalization of the notion of conditional probability on Kolmogorov
probability spaces 255
§ 3. Generalization of the notion of conditional probability on conditional
probability spaces 263
§ 4. Generalization of the notion of conditional mathematical expectation in
Kolmogorov probability spaces 270
§ 5. Generalization of Bayes’ theorem 273
§ 6. The correlation ratio 275
§ 7. On some other measures of the dependence of two random variables 279
§ 8. The fundamental theorem of Kolmogorov 286
§ 9. Exercises 290

CH. VI. CHARACTERISTIC FUNCTIONS

§ 1. Random variables with complex values 301


§ 2. Characteristic functions and their basic properties 302
§ 3. Characteristic functions of some important distributions 310
§ 4. Some fundamental theorems on characteristic functions 312
§ 5. Characteristic properties of the normal distribution 323
§ 6. Characteristic functions of multidimensional distributions 340
§ 7. Infinitely divisible distributions 347

§ 8. Stable distributions 349


§ 9. Characteristic functions of conditional probability distributions 353
§ 10. Exercises 367

CH. VII. LAWS OF LARGE NUMBERS

§ 1. Chebyshev’s and related inequalities 373


§ 2. Stochastic convergence 374
§ 3. Generalization of Bernoulli’s law of large numbers 377
§ 4. Bernstein’s improvement of Chebyshev’s inequality 384
§ 5. The Borel-Cantelli lemma 389
§ 6. Kolmogorov’s inequality 392
§ 7. The strong law of large numbers 394
§ 8. The fundamental theorem of mathematical statistics 400
§ 9. The law of the iterated logarithm 402
§ 10. Sequences of mixing sets 406
§ 11. Stable sequences of events 409
§ 12. Sequences of exchangeable events 412
§ 13. The zero-one law 418
§ 14. Kolmogorov’s three-series theorem 420
§ 15. Laws of large numbers on conditional probability spaces 424
§ 16. Exercises 428

CH. VIII. THE LIMIT THEOREMS OF PROBABILITY THEORY

§ 1. The central limit theorems 440


§ 2. The local form of the central limit theorem 449
§ 3. The domain of attraction of the normal distribution 453
§ 4. Convergence to the Poisson distribution 458
§ 5. The central limit theorem for samples from a finite population 460
§ 6. Generalization of the central limit theorem through the application of
mixing theorems 466
§ 7. The central limit theorem for sums of a random number of random
variables 471
§ 8. Limit distributions for Markov chains 475
§ 9. Limit distributions for “order statistics” 486
§ 10. Limit theorems for empirical distribution functions 492
§ 11. Limit distributions concerning random walk problems 500
§ 12. Proof of the limit theorems by the operator method 515
§ 13. Exercises 528

CH. IX. APPENDIX. INTRODUCTION TO INFORMATION THEORY

§ 1. Hartley’s formula 540


§ 2. Shannon’s formula 546
§ 3. Conditional and relative information 554
§ 4. The gain of information 560
§ 5. The statistical meaning of information 564
§ 6. Further measures of information 569

§ 7. Statistical interpretation of the information of order α 583


§ 8. The definition of information for general distributions 586
§ 9. Information-theoretical proofs of limit theorems 597
§ 10. Extension of information theory to conditional probability spaces 603
§11. Exercises 605

TABLES 617
REMARKS AND BIBLIOGRAPHICAL NOTES 638
REFERENCES 645
AUTHOR AND SUBJECT INDEX 661
CHAPTER I

ALGEBRAS OF EVENTS

§ 1. Fundamental relations

Probability theory deals with events occurring in connection with random
mass-phenomena. As it is an abstract mathematical theory, the concept of
events is to be dealt with abstractly too; i.e. relations between events are to
be characterized axiomatically. For this reason, we consider first of all in
this Chapter the "algebras of events". Indeed, relations between events have
a primarily logical character: one may assign to every event a proposition
stating its occurrence. Thus logical relations between propositions correspond
to the relations between events. The algebraic structure of the set of
events turns out to be a Boolean algebra. Algebras of events as a basis of
probability theory were first considered by V. I. Glivenko [3] (cf. also
A. N. Kolmogorov [9]).

As stated above, events are to be characterized as abstract concepts. We
shall define an algebra of events as a set of events connected with one and
the same "experiment", taken in the widest sense of this word. There belongs
to every experiment a set of possible outcomes; for every event of the algebra
corresponding to the experiment one must be able to decide for each possible
outcome whether the event occurred or not.
Let the events A, B, C, ... be elements of the same algebra of events.
Two events both either occurring or non-occurring at the same time for
every outcome of the experiment are said to be identical. The fact that the
events A and B are identical is denoted by A = B.

The non-occurrence of an event A is itself an event, denoted by $\bar{A}$ and
called the event complementary to A. From this definition it follows that

$\bar{\bar{A}} = A$.     (1)

In the realm of logic this corresponds to the proposition that a statement


doubly negated coincides with the statement itself.
If A and B are two events of the same algebra of events, we may ask
whether they both occurred. Let our experiment be, for instance, firing
at a target. By a vertical and a horizontal line we subdivide the target into
four equal parts. Let event A be a hit in the upper half of the target and event
B a hit in the right half of it. In this case the statement "A and B both occurred"
means that the hit lies in the upper right quadrant of the target
(Fig. 1).

Fig. 1. Target diagrams for the events A, B, and AB.

Event C, occurring if and only if both events A and B occur, is said to be
the product of the events A and B; we write C = AB. Thus we have defined an
operation, namely the multiplication of events. Let us now see what the
properties of this operation are. First, since AB clearly does not depend on
the order of A and B, we have the commutative law

AB = BA. (2)
Also obviously,
AA = A, (3)

i.e. every event A is idempotent with respect to multiplication. The definition


of the product of events may be extended to more than two factors. A(BC)
occurs, by definition, if and only if the events A and BC occur; that is, if the
events A, B, and C all occur. Evidently, (AB)C has the same meaning. Thus
we have the associative law for multiplication:

A(BC) = (AB)C. (4)

Instead of A(BC) we can therefore write simply ABC. Clearly, the event
AB can occur only if A and B do not exclude each other. If A and B are
mutually exclusive, AB is an impossible event. It is useful to consider the
impossible event as an event too. It will be denoted by O. The fact that A
and B are mutually exclusive is thus expressed by AB = O. Since an event
and the complementary event obviously exclude each other, we have

$A\bar{A} = O$.     (5)

If A and B are two events of an algebra of events, one may ask whether
at least one of the events A and B occurred. Let A denote the event that the
hit lies in the upper half of the target and B the event that it lies in the right
half; the statement that at least one of the events A and B occurred then
means that the hit does not lie in the lower left quadrant of the target (Fig. 2).
The event occurring exactly when at least one of the events A and B occurs
is said to be the sum of A and B and is denoted by A + B. It is easy to see
that

A + B = B + A     (6)

(commutative law of addition) and also that

A + (B + C) = (A + B) + C (7)

(associative law of addition). The definition of the sum is readily extended to


the case of more than two events.
The event A + B thus occurs precisely if A or B occurs; the word "or",
however, does not mean in this connection that A and B exclude one another.
Thus, for instance, in our repeatedly considered example the meaning
of A + B is the statement that the hit lies either in the upper half of the target
(this is the event A) or in the lower right quadrant (the event $\bar{A}B$).
Therefore we have the relation

$A + B = A + \bar{A}B$,     (8)

where the two terms on the right hand side are now mutually exclusive.
By applying relation (8), every sum of events can be transformed in such
a way that the terms of the sum become pairwise mutually exclusive.
Clearly the formula
A + A = A (9)

is valid. Further we see that the event $A + \bar{A}$ certainly occurs; thus, by
introducing the notation I for the "sure event", we have

$A + \bar{A} = I$.     (10)

We agree further that

$\bar{I} = O$,   $\bar{O} = I$,     (11)

i.e. that the event complementary to the sure event is the impossible event
O and conversely.
Evidently, the following relations are also valid:

AO = O,     (12)
A + O = A,     (13)
AI = A,     (14)
A + I = I.     (15)

In order to be able to carry out unrestrictedly all operations in the algebra


of events, we need some further basic relations. First of all, does the distributive
law hold for the addition and multiplication in the algebra of events?
Now A(B + C) occurs, by definition, exactly if A occurs and B or C occurs.
This, however, means precisely that either A and B occur or A and C occur,
i.e. that AB + AC occurs. Therefore we have

A(B + C) = AB + AC. (16)

From the distributive law follows the so-called “law of inclusion”

A + AB = A; (17)

since from (14), (16) and (15) we have

A + AB = AI + AB = A(I + B) = AI = A.

Clearly, rule (17) can be verified directly as well; the direct verification is,
however, clumsy for some complicated relations, while by applying the formal
rules of operation one can readily obtain a formal proof. This is the reason
why the algebra of events is useful; it is therefore advisable to acquire a
certain practice in such formal proofs.
The distributive laws can be extended (just like in ordinary algebra) to
more than two terms. In the algebra of events there exists, however, still
another distributive law:

A + BC = (A + B) (A + C). (18)

The validity of (18) is readily seen: A + BC occurs exactly if A occurs or
B and C occur; if A occurs, both factors of the product on the right hand
side occur, and the same is true if B and C occur, but in no other case.

This consideration being somewhat more difficult than the preceding ones,
it is of interest to show how (18) is implied by the already deduced rules of
the algebra of events. Indeed, because of (2), (3), (16) and (17) we have

(A + B)(A + C) = A + AB + AC + BC = A + BC,

which is what we had to show.


Next we prove some further important relations:

$\overline{AB} = \bar{A} + \bar{B}$,     (19)

$\overline{A + B} = \bar{A}\,\bar{B}$.     (20)

The event $\overline{AB}$ occurs exactly if AB does not occur, hence if the events
A and B do not both occur; $\bar{A} + \bar{B}$ occurs exactly if A or B (or both) do
not occur. These two propositions evidently state the same thing; thus (19)
is valid. Formula (20) can be proved in the same way.
As to the rules of operation valid for the addition and multiplication
of events, we see that both have the same properties (commutativity, associativity,
idempotency of every element) and that the relations between the
two kinds of rules of operation are symmetrical. Formulas (16) and (18)
are obtained from each other by interchanging everywhere the signs of
multiplication and addition. Such formulas are called dual to one another.
Thus for instance the relations

A + AB = A   and   A(A + B) = A

are dual to one another. Clearly, there exist relations which are, because of
their symmetry, self-dual; e.g. the relation

(A + B)(A + C)(B + C) = AB + AC + BC.

For the sake of brevity we sometimes write $\prod_{k=1}^{n} A_k$ instead of $A_1 A_2 \ldots A_n$
and $\sum_{k=1}^{n} A_k$ instead of $A_1 + A_2 + \ldots + A_n$.
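Since events behave exactly like subsets of the set of possible outcomes, the identities of this paragraph can be checked mechanically by a computer. The following Python sketch is purely illustrative (the sample space and the particular events are arbitrary choices, not taken from the text); it verifies the distributive laws (16) and (18), the law of inclusion (17), and formulas (19) and (20).

    import itertools

    H = set(range(8))                     # a small sample space of 8 outcomes

    def complement(X):
        return H - X                      # the complementary event

    some_events = [set(), {0}, {1, 2}, {0, 2, 4, 6}, {1, 3, 5, 7}, H]
    for A, B, C in itertools.product(some_events, repeat=3):
        assert A & (B | C) == (A & B) | (A & C)                       # (16)
        assert A | (B & C) == (A | B) & (A | C)                       # (18)
        assert A | (A & B) == A                                       # (17), law of inclusion
        assert complement(A & B) == complement(A) | complement(B)     # (19)
        assert complement(A | B) == complement(A) & complement(B)     # (20)
    print("identities (16)-(20) hold for all sample events")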

§ 2. Some further operations and relations

Subtraction is defined in the algebra of events by the formula

$B - A = B\bar{A}$.     (1)

With respect to subtraction the following rules hold:

A(B - C) = AB - AC,   AB - C = (A - C)(B - C);     (2)



they are the two distributive laws of subtraction. Using the subtraction,
the complementary event may be written in the form

$\bar{A} = I - A$.     (3)

Fig. 3. Target diagrams for the events B, A - B, and B - A.

The subtraction does not satisfy all the rules of operation known from
ordinary algebra. Thus for instance (A - B) + B is in general not equal
to A; further, A + (B - C) is not always identical to (A + B) - C. Hence,
if the sign of subtraction also figures in relations between events, the
brackets are not to be omitted without consideration. There are, however,
cases when this omission is allowed, e.g.

A - (B + C) = (A - B) - C.     (4)

The event A - B occurs exactly if A does and B does not occur; in the
same way, B - A occurs if B does but A does not occur. The meaning of the
expression (A - B) + (B - A) is therefore not O, but the event which
consists of the occurrence of one and only one of the events A and B. It is
reasonable to introduce a new symbol for this event. We put

$(A - B) + (B - A) = A \Delta B$.     (5)

The operation denoted by $\Delta$ is called the symmetric difference of the events
A and B (cf. Figs 3 and 4). It fulfils the following rules of operation, readily
derived from the already known rules:

$A \Delta A = O$      $A \Delta B = B \Delta A$      $A(B \Delta C) = AB \Delta AC$
$A \Delta O = A$      $A \Delta B = (A + B) - AB$      $B - A = AB \Delta B$
$A \Delta I = \bar{A}$      $A + B = (A \Delta B) \Delta AB$      $B - A = (A \Delta B)B$     (6)
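These rules can likewise be checked mechanically with events represented as sets. The sketch below (illustrative; sample space and events are arbitrary) runs through the identities collected in (6).

    import itertools

    H = set(range(6))

    def comp(X):
        return H - X

    def sym(X, Y):
        return X ^ Y                      # symmetric difference of two sets

    events = [set(), {0}, {1, 2}, {0, 2, 4}, H]
    for A, B, C in itertools.product(events, repeat=3):
        assert sym(A, A) == set()                       # A delta A = O
        assert sym(A, set()) == A                       # A delta O = A
        assert sym(A, H) == comp(A)                     # A delta I = complement of A
        assert sym(A, B) == sym(B, A)                   # commutativity
        assert A & sym(B, C) == sym(A & B, A & C)       # A(B delta C) = AB delta AC
        assert sym(A, B) == (A | B) - (A & B)           # A delta B = (A + B) - AB
        assert B - A == sym(A & B, B)                   # B - A = AB delta B
        assert A | B == sym(sym(A, B), A & B)           # A + B = (A delta B) delta AB
    print("rules (6) verified")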

Finally, we mention still another relation between events. If the occurrence
of an event A always entails the occurrence of the event B, then we say
that the event A implies the event B. We denote this fact by the symbol $\subseteq$.
Therefore $A \subseteq B$ means that the occurrence of the event B always follows
from the occurrence of the event A.

The following relations hold:

1. $O \subseteq A$.
2. $A \subseteq I$.
3. $A \subseteq A$.
4. $A \subseteq B$ and $B \subseteq C$ imply $A \subseteq C$.
5. $A \subseteq B$ and $B \subseteq A$ imply $A = B$.
6. $A \subseteq A + B$.
7. $AB \subseteq A$.
8. $A \subseteq B$ implies $A = AB$.
9. $A \subseteq C$ and $B \subseteq C$ imply $A + B \subseteq C$.
10. $C \subseteq A$ and $C \subseteq B$ imply $C \subseteq AB$.
11. $A \subseteq B$ implies $B = A + B\bar{A}$.
12. $A \subseteq B$ implies $\bar{B} \subseteq \bar{A}$.
13. $A \subseteq B$ implies $AC \subseteq BC$.
14. $A \subseteq B$ implies $A + C \subseteq B + C$.
15. $AB = O$ and $C \subseteq A$ imply $BC = O$.

It is easy to show that the meaning of A = AB, as well as that of $B = A + B\bar{A}$,
is the same as that of $A \subseteq B$; these relations could equally well have served for the
definition of the relation $\subseteq$. Indeed, if the relation $B = A + B\bar{A}$ is valid,
then the occurrence of A implies the occurrence of B. If we have
A = AB, then the occurrence of A again implies the occurrence of B, since

$B = BI = B(A + \bar{A}) = BA + B\bar{A} = A + B\bar{A}$.

From this it follows that for the validity of the relation $A \subseteq B$ the validity
of one of the relations A = AB and $B = A + B\bar{A}$ is necessary and sufficient.
The latter relation can be stated in the following form: for the validity of
$A \subseteq B$ a necessary and sufficient condition is the existence of a C such that
AC = O and B = A + C; indeed, from this it follows directly that $C = B\bar{A}$.

We introduce the following important concept. A system $A_1, A_2, \ldots, A_n$
is called a complete system of events if the relations

$A_k \neq O$  $(k = 1, 2, \ldots, n)$,   $A_j A_k = O$ for $j \neq k$   and   $A_1 + A_2 + \ldots + A_n = I$

are valid. For instance $\{A, \bar{A}\}$ is a complete system of events, provided that
$A \neq O$ and $A \neq I$.

§ 3. Axiomatical development of the algebra of events

In the preceding paragraphs we introduced certain operations for events
and discussed the rules for these operations. Now we have to make a further
abstraction. A set of arbitrary elements A, B, C, ... is said to be a Boolean
algebra if the following conditions are fulfilled: given any two elements
A and B of the set, there exists exactly one element of the set called the product of
A and B and denoted by AB, and exactly one element called the sum
of A and B and denoted by A + B;¹ further, there corresponds to every element
A exactly one element $\bar{A}$ of the set. Let there exist two special elements
of the set, namely O and I. Let the elements of the Boolean algebra
fulfil the relations obtained in the preceding paragraph; the following axioms
are therefore assumed to hold:

AA = A     (1.1)
AB = BA     (1.2)
A(BC) = (AB)C     (1.3)
A + A = A     (2.1)
A + B = B + A     (2.2)
A + (B + C) = (A + B) + C     (2.3)
A(B + C) = AB + AC     (3.1)
A + BC = (A + B)(A + C)     (3.2)
$A\bar{A} = O$     (4.1)
$A + \bar{A} = I$     (4.2)
AI = A     (5.1)
A + O = A     (5.2)
AO = O     (5.3)
A + I = I     (5.4)

¹ The notations $A \cap B$ and $A \cup B$ are often used instead of AB and A + B,
respectively.

It is to be noted that these axioms are not all mutually independent; thus
for instance (3.2) can be deduced from the others. It is, however, not our
aim to examine here which axioms could be omitted from the system.
The totality of the events connected with an experiment forms a Boolean algebra
if we understand by the product AB of two events A, B the joint occurrence
of both events and by the sum A + B of two events the occurrence of at
least one of the two events; further, if we denote by $\bar{A}$ the event complementary
to A and by O and I the impossible and the sure events, respectively.
Indeed, the above 14 axioms are fulfilled in this case. More generally, every
subset of the set of events connected with an experiment is a Boolean algebra if it
contains the sure event and, further, for every event A its complementary event
$\bar{A}$ and for every A and B the events AB and A + B.
Clearly, one can find other Boolean algebras as well. Thus, for instance,
the totality of the subsets of a set H is also a Boolean algebra. We define the
sum of two sets as the union of the two sets and their product as the intersection
of the two sets. Let I mean the set H itself and O the empty set;
further, let $\bar{A}$ be the set complementary to A with respect to H, and thus B - A
the set complementary to A with respect to B. A direct verification of each
axiom shows that this system is indeed a Boolean algebra.

There exists a close connection between Boolean algebras of events and
algebras of sets. In our example of the target this connection is clearly visible.
This analogy between a Boolean algebra of sets and an algebra of
events has an important role in the calculus of probability.
In order to obtain a Boolean algebra, it is not necessary to consider all
subsets of a set. A collection T of subsets of a set H is said to be an algebra
of sets if the addition can always be carried out in it, if H itself belongs to
T, and if for a set A its complementary set $\bar{A} = H - A$ belongs to T as well;
i.e. if the following conditions are satisfied:¹

1. $H \in T$.
2. $A \in T$, $B \in T$ imply $A + B \in T$.
3. $A \in T$ implies $\bar{A} \in T$.

The collection of all subsets of a set H is said to be a complete algebra
of sets. A complete algebra of sets is always a Boolean algebra. Indeed, it
is easy to see that the validity of $AB \in T$ follows from $A \in T$ and $B \in T$ by
the conditions 1, 2 and 3, since $AB = \overline{\bar{A} + \bar{B}}$. The above 14 axioms are
evidently fulfilled.

¹ The notation $a \in M$ means here and in the following that a belongs to the set
M; $a \notin M$ means that a does not belong to the set M.
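The three conditions defining an algebra of sets are easy to test for a concrete family of subsets. The sketch below is illustrative only (the four-element H and the candidate family T are arbitrary choices); it checks conditions 1-3 and then confirms that closure under products follows, as noted above, from $AB = \overline{\bar{A} + \bar{B}}$.

    from itertools import product

    H = frozenset({1, 2, 3, 4})
    T = {frozenset(), frozenset({1, 2}), frozenset({3, 4}), H}   # candidate family

    def is_algebra_of_sets(T, H):
        if H not in T:                                   # condition 1
            return False
        for A, B in product(T, repeat=2):
            if (A | B) not in T:                         # condition 2: closed under sums
                return False
        return all((H - A) in T for A in T)              # condition 3: closed under complements

    print(is_algebra_of_sets(T, H))                      # True for this family
    # closure under products is then automatic
    assert all((A & B) in T for A, B in product(T, repeat=2))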

§ 4. On the structure of finite algebras of events

An event A is said to be a compound event if it can be represented as the
sum of two events which are both different from A:

$A = B + C$,   $B \neq A$,   $C \neq A$.

(The condition $B \neq A$, $C \neq A$ is necessary to exclude the trivial representations
A = A + O and A = A + A, which are valid for every A.) Events
which do not permit any such representation are said to be elementary events.
Compound events may be obtained in a number of ways, but elementary
events can occur in only one manner. In order to illustrate this by an example,
let A denote the event of throwing 10 in a game with two dice. This is a
compound event; indeed, the number 10 can be obtained by throwing 5 with
both dice as well as by getting 6 on one of the dice and 4 on the other.
The latter event is again a compound event, since we can have 6 on the first
die and 4 on the second, and conversely. If A means the result 12 with two
dice, then A is an elementary event, since it can be realized only by casting
6 with each die.
If A is an elementary event, then from $B \subseteq A$ follows either B = O or
B = A. Since $A \subseteq A$ always holds, we denote the fact that $B \subseteq A$ with
$B \neq A$ by the symbol $B \subset A$. Clearly, from $B \subset A$ follows $B \subseteq A$, but the
converse does not hold. By using this notation the definition of the elementary
event may be formulated as follows: the event $A \neq O$ is an elementary
event if and only if there exists no B ($B \neq O$) such that $B \subset A$.¹ Indeed, if
$B \subset A$ is valid for some $B \neq O$, then from relation (11) of § 2 follows
$A = B + A\bar{B}$, where B and $A\bar{B}$ are distinct from A; namely, $B \neq A$ follows
from the assumption $B \subset A$, while from $A\bar{B} = A$ it would follow, because of
$B \subset A$ and thus B = AB, that $B = AB = (A\bar{B})B = AO = O$,
which contradicts our assumption.

We give here a further characterisation of elementary events: $A \neq O$
is an elementary event if and only if for an arbitrary event B either AB = O or AB = A.
Otherwise there would namely exist a decomposition

$A = AB + A\bar{B}$

of A, where $AB \neq A$ and $AB \neq O$; the converse is readily proved too.


Next we prove the following

Theorem 1. In an algebra consisting of a finite number of events every


event can be represented as a sum of elementary events. This representation is
unique except for the order of the terms.

1 The impossible event O is not considered to be an elementary event.



In order to prove the theorem, we need two lemmas:

Lemma 1. The product of two distinct elementary events is O.

For every $A_1$ and $A_2$ we obviously have $A_1 A_2 \subseteq A_1$. In particular, if $A_1$
and $A_2$ are elementary events, we have either $A_1 A_2 = O$ or $A_1 A_2 = A_1$.
But the latter is impossible, since it would imply $A_1 \subset A_2$, which cannot
hold because of $A_1 \neq O$ and $A_1 \neq A_2$.

Lemma 2. For every compound event B of an algebra of events with a finite
number of events, there exists an elementary event A such that $A \subset B$ holds.

Since B is a compound event, there exists an $A_1 \neq O$ such that $A_1 \subset B$.
If $A_1$ is elementary, our statement is proved; if, however, it is not elementary,
then there exists an $A_2 \subset A_1$. If $A_2$ is elementary, we have found the elementary
event required; if not, there exists again a new decomposition, etc.
The procedure must end after a finite number of steps because of the finiteness
of the number of events. Therefore $A_n$ must be elementary for a certain
number n.
Proof of Theorem 1. Let B be a compound event. According to Lemma 2
there exists then an elementary event $A_1$ such that $A_1 \subset B$, i.e. $B = A_1 + B_1$.
If $B_1$ is elementary, the first statement of our theorem is proved; if $B_1$ is
not elementary, we obtain, by applying Lemma 2 repeatedly, a representation
$B_1 = A_2 + B_2$, where $A_2$ is elementary; if $B_2$ is compound, the procedure
is to be continued. Thus we obtain a representation of B as a sum of elementary
events:

$B = A_1 + A_2 + \ldots + A_r$,     (1)

since the number of events is finite. It is evident from this proof that all
the $A_i$'s are distinct. If not already known, this could easily be shown because
of the rule A + A = A and the commutativity of the addition. (The
deduction used above to prove the representability of B as a sum of elementary
events is nothing else than the so-called "descente infinie" known
from number theory.) It still remains to prove the uniqueness of representation
(1). If there existed two essentially different representations

$B = A_1 + A_2 + \ldots + A_r = A'_1 + A'_2 + \ldots + A'_s$     (2)

such that for instance $A_1 \neq A'_j$ $(j = 1, 2, \ldots, s)$, then multiplication of both
sides of (2) by $A_1$ would yield, by Lemma 1, $A_1 = O$, i.e. a contradiction.
Herewith Theorem 1 is completely proved. The representation (1) is called
the canonical representation of the event B.

Theorem 2. The number of events of a finite algebra of events is necessarily
a power of 2.

Proof. Let n denote the number of distinct elementary events of a finite
algebra of events. Every event of the algebra can be expressed as a sum of a
certain number of elementary events; this number r can be one of the numbers
0, 1, ..., n, if the impossible event is included. If r is fixed, the number
of the events which can be expressed as a sum of r distinct elementary events
is equal to the number of possible selections of exactly r from the n elementary
events $A_1, A_2, \ldots, A_n$; that is, equal to $\binom{n}{r}$. Thus the number of elements
of the whole algebra of events is $\sum_{r=0}^{n} \binom{n}{r}$, which is equal to $2^n$.

It follows from Theorem 1 that the sure event I can be represented as the
sum of all elementary events

$I = A_1 + A_2 + \ldots + A_n$.

Thus always one and only one of the elementary events $A_1, A_2, \ldots, A_n$ occurs.
The elementary events form a complete system of events.
Consider now as an example the algebra of events which consists of the
possible outcomes of a game with two dice. Clearly, the number of elementary
events is 36; let us denote them by $A_{ij}$ $(i, j = 1, 2, \ldots, 6)$, where $A_{ij}$
means that the result for the first die is i and that for the second is j. According
to Theorem 2 the number of events of this algebra of events is
$2^{36} = 68\,719\,476\,736$. It would thus be impossible to discuss all cases.

We choose therefore another example, namely the tossing of a coin twice.
The possible elementary events are: 1. first head, second head as well (let
$A_{11}$ denote this case); 2. first head, next tail, denoted by $A_{12}$; 3. first tail,
next head, denoted by $A_{21}$; 4. first tail, second also tail, notation: $A_{22}$. The
number of all possible events is $2^4 = 16$. These are: I, O, the four elementary
events, further $A_{11} + A_{12}$, $A_{11} + A_{21}$, $A_{11} + A_{22}$, $A_{12} + A_{21}$, $A_{12} + A_{22}$,
$A_{21} + A_{22}$, and besides these the four events $\bar{A}_{11}$, $\bar{A}_{12}$, $\bar{A}_{21}$, $\bar{A}_{22}$ complementary
to the four elementary events.
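For such a small algebra the complete list of events, together with the canonical representation of each, can be generated directly. The sketch below is illustrative only; it enumerates the $2^4 = 16$ events of the coin-tossed-twice algebra as sets of elementary events.

    from itertools import combinations

    atoms = ["A11", "A12", "A21", "A22"]      # the four elementary events

    # every event is the sum of a subset of the elementary events (Theorem 1);
    # the empty sum is the impossible event O, the full sum is the sure event I
    events = [frozenset(c) for r in range(len(atoms) + 1)
              for c in combinations(atoms, r)]

    print(len(events))                         # 16, in accordance with Theorem 2
    for e in sorted(events, key=lambda s: (len(s), sorted(s))):
        print(" + ".join(sorted(e)) if e else "O")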


By the canonical representation we have thus obtained a complete description
of this finite algebra of events. Now the rules of operation also acquire
a new meaning. Theorem 1 namely points to a connection which will
lead us to Kolmogorov's theory of probability. A compound event, which
is the sum of elementary events, can be characterized uniquely by the set
of terms of this sum. In this way one can assign to every event a set, namely
the set of the elementary events whose sum is the canonical representation
of the event. Let A' denote the collection of elementary events which form
the event A and similarly let B' denote the collection of the elementary
events from which the event B is composed. One can show that the collection
of elementary events from which the event AB is composed is the intersection
A'B' of A' and B'; further, that the collection of elementary events
from which the event A + B is composed is equal to the union A' + B'
of the sets A' and B'. In this assignment of events to sets, the elementary
events themselves correspond to the sets having only one element. Obviously
the empty set corresponds to the impossible event. To the sure event corresponds
the set of all possible elementary events (with respect to the same
experiment); this set will be denoted by H and will be called the sample
space. Further, it is easy to show that to the complementary event $\bar{A}$ corresponds
the complementary set of A' with respect to H.
In the following paragraph we shall show that to every algebra of events
there corresponds an algebra of subsets of a set H such that to the event A + B
corresponds the union of the sets belonging to A and B, to the
product AB the intersection of the sets belonging to A and B, and finally to the
complementary event $\bar{A}$ the complementary set with respect to H of the
set belonging to A. In other words, one can find for every algebra of events
an algebra of sets which is isomorphic to it.

The proof of this theorem, due to Stone [1], is not at all simple; it will
be given in the next paragraph; on first reading it can be omitted,
since Stone's theorem will not be used in what follows. We give the proof
only in order to show that the basic assumption of Kolmogorov's theory,
i.e. that events can always be represented by sets, does not restrict the generality
in any way.
In the case of a finite algebra of events this fact was already established
by means of Theorem 1. Here we even have a uniquely determined event
corresponding to every subset of the sample space.
The theory of Boolean algebras is a particular case of the theory of more
general structures called lattices (cf. e.g. G. Birkhoff [1]).

§ 5. Representation of algebras of events by algebras of sets

In this paragraph we prove a theorem due to Stone, mentioned in § 4.¹

Theorem. There can be associated with every algebra of events an algebra
of sets isomorphic to it.

Proof. Let an algebra of events be given and let A, B, C, ... denote its elements.
Consider a subset $\alpha$ of this algebra having the following properties:

1. The event O is not contained in $\alpha$.

2. From $A \in \alpha$ and $B \in \alpha$ follows $AB \in \alpha$.

¹ The proof given here is due to O. Frink [1].

3. Among the sets satisfying conditions 1 and 2, $\alpha$ is maximal in the following
sense: there exists no set $\beta$ satisfying conditions 1 and 2 and containing
$\alpha$ as a proper subset.

The sets $\alpha$ fulfilling these three conditions are briefly called crowds of
events.¹
It is easy to show the following: if $\alpha$ is a crowd of events, we have

a) $I \in \alpha$.
b) $A \in \alpha$ implies $\bar{A} \notin \alpha$, and conversely.
c) If $A + B \in \alpha$, then A or B belongs to $\alpha$.

Now we prove the property of the crowds of events stated in

Lemma 1. $AB \in \alpha$ implies $A \in \alpha$ (and clearly, because of the symmetry
of the formula, it implies $B \in \alpha$ as well).

Proof. Suppose that $AB \in \alpha$, $A \notin \alpha$. Let $\beta$ be the union of $\alpha$ and all events
AC, where C runs through all elements of $\alpha$; $\alpha$ is a proper subset of $\beta$, since
we have $I \in \alpha$ and thus, by assumption, $A = AI \in \beta$ and $A \notin \alpha$.

The set $\beta$ fulfils the conditions 1 and 2. First we prove that $\beta$ satisfies
condition 1, and hence that $O \notin \beta$. Indeed, because of $AB \in \alpha$, we have
$AB \neq O$, hence $A \neq O$; further, since $(AC)B = (AB)C$ and $(AB)C \in \alpha$,
we have for $C \in \alpha$ the relation $AC \neq O$. From this our first statement follows.
It remains to show that $D \in \beta$ and $E \in \beta$ imply the relation $DE \in \beta$.
Now we have either $D \in \alpha$, $E \in \alpha$, or $D \in \alpha$, $E \notin \alpha$ (or conversely), or else
$D \notin \alpha$, $E \notin \alpha$. If $D \in \alpha$, $E \in \alpha$, then, by our assumption, we have $DE \in \alpha$
and thus certainly $DE \in \beta$. If $D \in \alpha$, $E \notin \alpha$, then there exists a $C \in \alpha$ such
that $E = AC$, hence $DE = A(CD)$; since $CD \in \alpha$, we have $DE \in \beta$. In the
case $D \notin \alpha$, $E \notin \alpha$, there exist two elements $C_1$ and $C_2$ of $\alpha$ such that $D = AC_1$
and $E = AC_2$. Then $DE = A(C_1 C_2)$, and since $C_1 C_2 \in \alpha$, we have $DE \in \beta$.
But this contradicts the assumption that $\alpha$ is maximal. Lemma 1 is therefore
proved.
Next we prove

Lemma 2. Every event A ($A \neq O$) of an algebra of events belongs to at
least one crowd of events $\alpha$.

¹ In lattice theory such systems are called ultrafilters; ultrafilters are commonly
characterized as sets complementary to prime ideals. A nonempty subset $\beta$ of a Boolean
algebra is called a prime ideal if the following conditions are fulfilled: 1. $A \in \beta$
and $B \in \beta$ imply $A + B \in \beta$. 2. $A \in \beta$ and an arbitrary B imply $AB \in \beta$. 3. If $AB \in \beta$, then
$A \in \beta$ or $B \in \beta$ (or both). Cf. e.g. G. Aumann [1].

This lemma is the consequence of a general set-theoretical theorem of
Hausdorff; we have, however, to define some notions in order to state it.
A set is said to be partially ordered if an ordering relation, denoted by
<, is defined for certain pairs of its elements; if a < b, we say that the element
a precedes the element b.

The relation < is required to fulfil the following conditions:

1. For no element does a < a hold.

2. a < b and b < c imply a < c.

The relation < is therefore irreflexive and transitive. A subset of a
partially ordered set is called a chain if for any two of its elements a and b
either a < b or b < a holds. An ordered subset of a partially ordered set is thus said to
be a chain. A chain is said to be a maximal chain if there does not
exist any element of the partially ordered set such that by adjoining it to the chain
the subset so obtained would still remain a chain.
We are now ready to state the above-mentioned lemma.

Lemma 3 (Hausdorff). In a partially ordered set every chain
is a subset of a maximal chain.¹

Let us return to the proof of Lemma 2. Consider the set of those systems
of events $\beta$ in the algebra of events which fulfil conditions 1 and 2
of the crowds of events. If $\beta < \gamma$ means that $\beta$ is a proper subset of $\gamma$, this set is
a partially ordered set. If $A \neq O$, the set $\beta = (A)$ consisting only of the
element A evidently fulfils conditions 1 and 2. According to Lemma 3 there
exists a maximal chain containing $\beta = (A)$ as a subset. Let $\alpha$ denote the
union of the subsets $\gamma$ belonging to this chain. Clearly, $\alpha$ is a crowd of events,
since it is the union of sets $\beta$ fulfilling the rules 1 and 2 defining the crowds
of events. Indeed, no element of the chain contains the event O and thus
$\alpha$ does not contain O either. Further, if $B_1$ and $B_2$ belong to $\alpha$, they belong to
a subset $\beta_1$ and a subset $\beta_2$ of $\alpha$, respectively. Since either $\beta_1 < \beta_2$ or the contrary
must hold, $B_1$ and $B_2$ both belong to $\beta_1$ or to $\beta_2$, and the same holds for $B_1 B_2$.
Thus $B_1 B_2$ belongs to $\alpha$ as well. Further, we see that $\alpha$ cannot be extended.
This is a consequence of the requirement that the chain be a maximal chain.
Lemma 2 is thus proved.
Now we can construct for every algebra of events a field of sets isomorphic
to it. Let $\mathcal{R}$ be the set of all crowds of events $\alpha$ of the algebra of events.
We assign to every event A the subset $\mathcal{R}_A$ of $\mathcal{R}$ consisting of all crowds
of events $\alpha$ containing the event A. The set $\mathcal{R}_A$ will be called the representative
of the event A. As $A \neq O$, $\mathcal{R}_A$ is, by Lemma 2, a nonempty set. We associate
with the event O the empty set. The system consisting of the empty set
and of all the sets $\mathcal{R}_A$ shall be denoted by $\mathcal{S}$.

¹ As to the proof of Lemma 3 cf. e.g. F. Hausdorff [1] or O. Frink [2].

We prove the following relations:

$\mathcal{R}_A \mathcal{R}_B = \mathcal{R}_{AB}$,     (1)

$\overline{\mathcal{R}_A} = \mathcal{R}_{\bar{A}}$,     (2)

$\mathcal{R}_A + \mathcal{R}_B = \mathcal{R}_{A+B}$.     (3)

Further, it will be proved that the correspondence $A \to \mathcal{R}_A$ is one-to-one;
thus $A \neq B$ implies $\mathcal{R}_A \neq \mathcal{R}_B$. Hence the algebra of sets $\mathcal{S}$ is isomorphic to
the algebra of events. (We understand of course by $\mathcal{R}_A \mathcal{R}_B$ the intersection
of $\mathcal{R}_A$ and $\mathcal{R}_B$ and by $\mathcal{R}_A + \mathcal{R}_B$ the union of both sets; $\overline{\mathcal{R}_A}$ denotes
the complementary set with respect to $\mathcal{R}$, while AB, A + B, $\bar{A}$ denote the corresponding
operations in the algebra of events.)
The relation (1) can be proved as follows. A crowd of events $\alpha$ belonging
to both $\mathcal{R}_A$ and $\mathcal{R}_B$ contains both A and B and thus AB as well. Conversely,
if AB belongs to $\alpha$, then by Lemma 1 $A \in \alpha$ and $B \in \alpha$, and thus $\alpha$
belongs to $\mathcal{R}_A$ and to $\mathcal{R}_B$, hence also to $\mathcal{R}_A \mathcal{R}_B$.
Proof of (2). If the crowd of events $\alpha$ does not belong to the set $\mathcal{R}_A$,
it must contain an event B such that AB = O; otherwise namely $AB \neq O$
for every element B of $\alpha$, and $\alpha$ could be extended by adjoining A and every
product of the form AB ($B \in \alpha$). This leads, just as in the proof of Lemma 1,
to a result which contradicts the maximality of $\alpha$.
From AB = O it follows that $B = AB + \bar{A}B = \bar{A}B$. Hence $\bar{A}$ belongs,
according to Lemma 1, to $\alpha$. Conversely, if $\bar{A}$ belongs to $\alpha$, A cannot belong
to it, since $A\bar{A} = O$. From this it follows that $\overline{\mathcal{R}_A}$ consists exactly of the
crowds of events containing the event $\bar{A}$, hence $\overline{\mathcal{R}_A} = \mathcal{R}_{\bar{A}}$.
Relation (3) is a direct consequence of relations (1) and (2). Indeed,
we have

$\mathcal{R}_{A+B} = \mathcal{R}_{\overline{\bar{A}\bar{B}}} = \overline{\mathcal{R}_{\bar{A}\bar{B}}} = \overline{\mathcal{R}_{\bar{A}}\mathcal{R}_{\bar{B}}} = \overline{\overline{\mathcal{R}_A}\,\overline{\mathcal{R}_B}} = \mathcal{R}_A + \mathcal{R}_B$.

Thus we have proved that $\mathcal{S}$ is an algebra of sets. In order to show that
$\mathcal{S}$ is isomorphic to the algebra of events, it still remains to prove that the
correspondence $A \to \mathcal{R}_A$ is one-to-one. If $A \neq B$, we have $A \Delta B \neq O$.
Hence at least one of the relations $\bar{A}B \neq O$ and $A\bar{B} \neq O$ is valid as well.
Suppose that $\bar{A}B \neq O$. Because of (1), $\mathcal{R}_{\bar{A}B} = \mathcal{R}_{\bar{A}}\mathcal{R}_B$. Hence every crowd
of events belonging to $\mathcal{R}_{\bar{A}B}$ belongs to $\mathcal{R}_{\bar{A}}$ and also to $\mathcal{R}_B$; hence it belongs
to $\mathcal{R}_B$ and does not belong to $\mathcal{R}_A$. Thus we have proved the existence of crowds of
events which belong to $\mathcal{R}_B$ but do not belong to $\mathcal{R}_A$. Hence $\mathcal{R}_B$ and $\mathcal{R}_A$
cannot coincide. Herewith the theorem is proved. In this proof we have
used, as regards the algebra of events, only its property of being a Boolean algebra; therefore
we may formulate the theorem just proved in a more general manner:
there exists for every Boolean algebra an algebra of sets isomorphic to it.
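For a small finite algebra the crowds of events and the representatives $\mathcal{R}_A$ can be found by brute force, directly from conditions 1-3, and relations (1)-(3) can then be checked numerically. The sketch below is illustrative only; it uses the algebra of all subsets of a three-element set, so the events, the family search and the names are arbitrary choices made here, not part of the text.

    from itertools import combinations, product

    H = frozenset({0, 1, 2})
    events = [frozenset(s) for r in range(4) for s in combinations(H, r)]   # all 8 events

    def candidate(fam):
        # conditions 1 and 2: O is not in fam, and fam is closed under products
        return frozenset() not in fam and all((a & b) in fam for a, b in product(fam, repeat=2))

    families = [frozenset(f) for r in range(len(events) + 1)
                for f in combinations(events, r) if candidate(frozenset(f))]
    # condition 3: crowds of events are the maximal candidates
    crowds = [f for f in families if not any(f < g for g in families)]
    print(len(crowds))          # 3: one crowd per elementary event

    R = {A: frozenset(c for c in crowds if A in c) for A in events}   # representatives
    for A, B in product(events, repeat=2):
        assert R[A] & R[B] == R[A & B]                                # relation (1)
        assert R[A] | R[B] == R[A | B]                                # relation (3)
    assert all(frozenset(crowds) - R[A] == R[H - A] for A in events)  # relation (2)
    assert len(set(R.values())) == len(events)    # the correspondence A -> R_A is one-to-one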

§ 6. Exercises

1. Prove

a) $\overline{AB + CD} = (\bar{A} + \bar{B})(\bar{C} + \bar{D})$,
b) $(A + B)(\bar{A} + B) + (A + \bar{B})(\bar{A} + \bar{B}) = I$,
c) $(A + B)(\bar{A} + B)(A + \bar{B})(\bar{A} + \bar{B}) = O$,
d) $(A + B)(A + C)(B + C) = AB + AC + BC$,
e) $A - BC = (A - B) + (A - C)$,
f) $A - (B + C) = (A - B) - C$,
g) $(A - B) + C = [(A + C) - B] + BC$,
h) $(A - B) - (C - D) = [A - (B + C)] + (AD - B)$,
i) $A - \{A - [B - (B - C)]\} = ABC$,
j) $ABC + ABD + ACD + BCD = (A + B)(A + C)(A + D)(B + C)(B + D)(C + D)$,
k) $A + B + C = (A - B) + (B - C) + (C - A) + ABC$,
l) $A \Delta (B \Delta C) = (A \Delta B) \Delta C$,
m) $(A + B) \Delta (\bar{A} + \bar{B}) = A \Delta \bar{B}$,
n) $A\bar{B} \Delta B\bar{A} = A \Delta B$.
o) Prove the relations enumerated in § 2 (6) for the symmetric difference.
p) The relation (A + B) - B = A does not hold in general. Under what conditions
is it valid?
q) Prove that $A \Delta B = C \Delta D$ implies $A \Delta C = B \Delta D$.

Hint. If $A \Delta B = C \Delta D$ and A, B, C, D are subsets of the same set, then every
point of this set belongs to an even number (0, 2, or 4) of the sets A, B, C, D, and
$A \Delta C = B \Delta D$ means the same.

r) Prove that the elements of an arbitrary algebra of events form an Abelian
group with respect to the symmetric difference as the group operation.

2. The elements of a Boolean algebra form a ring with respect to the operations
of symmetric difference and multiplication. The zero element is O, the unit element I.

3. In a finite algebra of events containing n elementary events one can give several
complete systems of events. Complete systems of events differing only in the order
of the terms are to be considered as identical. Let $T_n$ denote the number of the different
complete systems of events.

a) Prove that $T_1 = 1$, $T_2 = 2$, $T_3 = 5$, $T_4 = 15$, $T_5 = 52$, $T_6 = 203$.

b) Prove the recursion formula

$T_{n+1} = \sum_{k=0}^{n} \binom{n}{k} T_k$   (with $T_0 = 1$)

and show that $T_{10} = 115\,975$.



c) Prove

$T_n = \frac{1}{e} \sum_{k=1}^{\infty} \frac{k^n}{k!}$.

4. Let $\gamma_n$ denote the number of complete systems of events consisting of three
elements in a finite algebra of events consisting of n elementary events. Show that

$\gamma_n = \frac{3^{n-1} + 1}{2} - 2^{n-1}$.

5. Prove the relation

($T_n$ means the same as in Exercise 3).

6. Let $Q_n$ denote the number of complete systems of events in an algebra with n
elementary events, such that every event is the sum of an odd number of different
elementary events. Prove that

$Q_1 = 1$,  $Q_2 = 1$,  $Q_3 = 2$,  $Q_4 = 5$,  $Q_5 = 12$,  $Q_6 = 37$,

further that

$1 + \sum_{n=1}^{\infty} Q_n \frac{x^n}{n!} = e^{\sinh x}$.
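The numbers $T_n$ and $Q_n$ of Exercises 3-6 are easy to tabulate by computer, since a complete system of events is simply a partition of the n elementary events into nonempty blocks. The sketch below is illustrative only; it counts partitions directly and reproduces the values quoted above, including $T_{10} = 115\,975$.

    from itertools import combinations

    def partitions(elements):
        # generate all partitions of a list into nonempty blocks;
        # the first element chooses its block-mates, the rest recurse
        if not elements:
            yield []
            return
        first, rest = elements[0], elements[1:]
        for k in range(len(rest) + 1):
            for mates in combinations(rest, k):
                remaining = [e for e in rest if e not in mates]
                for p in partitions(remaining):
                    yield [(first,) + mates] + p

    T = [sum(1 for _ in partitions(list(range(n)))) for n in range(1, 11)]
    print(T)     # [1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975]

    Q = [sum(1 for p in partitions(list(range(n))) if all(len(b) % 2 == 1 for b in p))
         for n in range(1, 7)]
    print(Q)     # [1, 1, 2, 5, 12, 37]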
7. We can construct from the events A, B, and C by repeated addition and multiplication
eighteen, in general different, events, namely A, B, C, AB, AC, BC, A + B,
B + C, C + A, A + BC, B + AC, C + AB, AB + AC, AB + BC, AC + BC, ABC,
A + B + C, AB + AC + BC. (The phrase "in general different" means here that
no two of these events are identical for all possible choices of the events A, B, C.)
Prove that from 4 events one can construct 166, from 5 events 7579 and from 6 events
7 828 352 events in this way. (No general formula is known for the number of events
which can be formed from n events.)
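The counts 18 and 166 can be checked by brute force: represent each of the n generating events by the Boolean function it induces on the $2^n$ possible occurrence patterns and close the resulting set of bitmasks under sums and products. The sketch below is illustrative (the search is feasible for n = 3 and n = 4; the larger counts quoted in the exercise need a cleverer method).

    def closure_count(n):
        # number of distinct events obtainable from n events by + and .
        patterns = 2 ** n
        # event i as a bitmask over all 2**n occurrence patterns of the generators
        gens = [sum(1 << p for p in range(patterns) if p & (1 << i)) for i in range(n)]
        events = set(gens)
        changed = True
        while changed:
            changed = False
            current = list(events)
            for x in current:
                for y in current:
                    for z in (x | y, x & y):        # sum and product of events
                        if z not in events:
                            events.add(z)
                            changed = True
        return len(events)

    print(closure_count(3))   # 18
    print(closure_count(4))   # 166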

8. The divisors of an arbitrary square-free¹ number N form a Boolean algebra
if the operations are defined as follows: we understand by the "sum" of two divisors
of N their least common multiple, by their "product" their greatest common divisor;
d being a divisor of N, we understand by $\bar{d}$ the number $\frac{N}{d}$; the number 1 serves as
O and the number N as I.
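A quick computational check of this construction (illustrative, using the arbitrary choice N = 30 = 2·3·5) can run through all pairs and triples of divisors and test the axioms of § 3.

    from math import gcd
    from itertools import product

    N = 30                                          # square-free: 2 * 3 * 5
    divisors = [d for d in range(1, N + 1) if N % d == 0]

    def s(a, b): return a * b // gcd(a, b)          # "sum"     = least common multiple
    def p(a, b): return gcd(a, b)                   # "product" = greatest common divisor
    def c(a):    return N // a                      # "complement"

    for a, b, x in product(divisors, repeat=3):
        assert s(a, b) in divisors and p(a, b) in divisors
        assert p(a, s(b, x)) == s(p(a, b), p(a, x))     # distributive law (3.1)
        assert s(a, p(b, x)) == p(s(a, b), s(a, x))     # distributive law (3.2)
    assert all(p(a, c(a)) == 1 and s(a, c(a)) == N for a in divisors)   # (4.1), (4.2)
    print("the divisors of", N, "form a Boolean algebra")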

9. Verify that for the example of Exercise 8, our Theorem 1 is the same as the well-
known theorem on the unique representability of (square-free) integers as a product
of prime numbers.

¹ A number is said to be square-free if it is not divisible by any square number
except by 1. The square-freeness of N is required only to ensure the existence of the
complementary element.

10. The numbers 0, 1, ..., $2^n - 1$ form a Boolean algebra if the rules of operation
are defined as follows: represent these numbers in the binary system. We understand
by the "product" of two numbers the number obtained by multiplying the corresponding
digits of both numbers place for place, and by their "sum" the number obtained
by adding the digits place for place and replacing everywhere the digit 2 obtained
in the course of addition by 1.
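In terms of machine integers these operations are simply the bitwise AND and OR, and the complement flips all n binary digits; a brief illustrative check (with the arbitrary choice n = 4):

    from itertools import product

    n = 4
    I, O = 2 ** n - 1, 0                      # the sure and the impossible "events"
    numbers = range(2 ** n)

    for a, b, c in product(numbers, repeat=3):
        assert a & (b | c) == (a & b) | (a & c)      # distributivity (3.1)
        assert a | (b & c) == (a | b) & (a | c)      # distributivity (3.2)
    # complement of a: flip all n digits
    assert all((a & (I ^ a)) == O and (a | (I ^ a)) == I for a in numbers)
    print("axioms verified for 0 ..", I)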

11. Let A, B, C denote electric relays or networks of relays. Any two of these
may be connected in series or in parallel. Two such networks which are either both
closed (allowing current to pass) or both open (not allowing current to pass) are considered
as equivalent. Let A + B denote that A and B are coupled in parallel, AB
that they are coupled in series. Let $\bar{A}$ denote a network which is always closed if A is open
and conversely. Let O denote a network allowing no current to pass and I a network
always closed. Prove that all axioms of Boolean algebras are fulfilled.¹

Hint. The relation (A + B)C = AC + BC has, for instance, the meaning that it comes
to the same thing to connect first A and B in parallel and couple the network so obtained
with C in series, or to couple first A and C in series, then B and C in series, and then
the two systems so obtained in parallel. Both systems are equivalent to each other in
the sense that they either both allow the current to pass or both do not. A similar
consideration holds for the other distributive law. Both distributive laws are illustrated
in Fig. 5.

Fig. 5. Relay networks illustrating (A + B)C = AC + BC.

12. A domain of the plane is said to be convex if for any two of its points it contains
the segment connecting these points as well. We understand by the "sum"
of two convex domains the least convex domain containing both, and by their "product"
their intersection which, evidently, is convex as well. Let further I denote the entire
plane and O the empty set. The addition and multiplication fulfil axioms (1.1)-(2.3);
the distributive laws are, however, not valid and the complement $\bar{A}$ is not defined.

13. Let us understand by a linear form a point, a line or a plane of the 3-dimensional
affine space; further, let the empty set and the entire 3-dimensional space be called
linear forms too. We define as the sum of a finite number of linear forms the least linear
form containing their set theoretical union; let their product be their (set theoretical)
intersection, which is evidently a linear form too. Prove the same propositions as in
Exercise 12.

¹ This example shows how Boolean algebra can be applied in the theory of networks
and why it is of great importance in communication theory and in the construction
of computers (cf. e.g. M. A. Gavrilov [1]).

14. Let $A_1, A_2, \ldots, A_n$ be arbitrary events. Form all products of these events
containing k distinct factors, and let $S_k$ be the sum of all such products. Let $P_k$ be the
product of all events representable as a sum of k distinct terms of $A_1, A_2, \ldots, A_n$.
Prove the relation

$S_k = P_{n-k+1}$   $(k = 1, 2, \ldots, n)$

in a formal way by applying the rules of operation of Boolean algebras, and verify
it directly too (cf. the generalization of Exercise 1.d).

Hint. $S_k$ has the meaning that among the events $A_1, A_2, \ldots, A_n$ there are at least
k which occur, and the meaning of $P_{n-k+1}$ is that among these same events there are
no $n - k + 1$ which all fail to occur; these two statements are equivalent.

15. Let H be a set and T a set of certain subsets of H. T is an algebra of sets if
and only if the following conditions are fulfilled:

a) The set H belongs to T.

b) Whenever A and B belong to T, A - B belongs to T as well.

16. Show that condition b) of the preceding exercise cannot be replaced by b')
"whenever A and B belong to T, $A \Delta B$ belongs to T as well".

17. If the conditions

α) $A \in T$ implies $\bar{A} \in T$,
β) $A \in T$ and $B \in T$ imply $AB \in T$

are postulated instead of conditions a) and b) of Exercise 15 and T is nonempty,
then T is an algebra of sets.

18. Condition β) of the preceding exercise may be replaced by

β') "$A \in T$ and $B \in T$ imply $A + B \in T$."

19. Prove that in Exercise 18 the proposition cannot be replaced by

β'') "$A \in T$, $B \in T$, and AB = O imply $A + B \in T$."

Hint. Let H be a finite set of the elements a, b, c, d and let T consist of the following
subsets of H: {a, b}; {c, d}; {a, c}; {b, d}; O; H.

20. We call a nonempty system $\mathcal{R}$ of subsets of a set H that contains, with two sets A and
B, also A + B and A - B, a ring of sets. A ring of sets $\mathcal{R}$ is thus an algebra of sets
if and only if H belongs to $\mathcal{R}$. Prove that a nonempty system of sets containing with
A and B also AB and A - B is not necessarily a ring of sets. Show that the condition
"with two sets A and B, A + B and AB belong as well to S" is not sufficient for S
to be a ring of sets either.
CHAPTER II

PROBABILITY

§ 1. Aim and scope of the theory of probability

There exist in nature several phenomena fitting into a deterministic scheme:
given a complex K of circumstances, a certain event A necessarily occurs.
On the other hand, there are a number of phenomena in the sciences as well
as in our everyday life which cannot be described by such schemes. It is
characteristic of such phenomena that, given the complex K of circumstances,
the event A may or may not occur. Such events are called random
events and such schemes are said to be stochastic schemes. Consider for instance
a radioactive atom during a certain time interval: the atom may or
may not decay during the time of observation. The instant of the decay depends
on processes in the nucleus which are, however, neither known to us
nor observable.
We can see from this example the need for studying stochastic schemes.
Very often it is entirely impossible (at least at the present state of affairs)
to consider all relevant circumstances. But in many practical problems this
is not at all necessary. An event A may be a random event with respect to
one complex of circumstances and at the same time it may be completely determined
with respect to another, more comprehensive complex of circumstances.
The randomness or determinedness of an event is a matter of fact:
it depends only on whether the given complex of circumstances K does or
does not determine the course of the phenomenon (that is, the occurrence or
the non-occurrence of the event A). But the choice of the complex of circumstances
K depends on us, and we have a certain freedom to choose it
within the limits of possibility.
Regarding random mass-phenomena, we can sketch their outlines in
spite of their random character. Consider for instance radioactive disintegration.
Each radioactive substance continues to decay according to a well-determined
"rate"; we can predict what percentage of the substance will
disintegrate during a given time interval. The disintegration follows an
exponential law (cf. Ch. III) characterized by the half-life period. (The half-life
period is the time interval during which half of the radioactive substance
disintegrates. In the case of radium this is some 1600 years.) The exponential
law is a typical "probability law". This law is confirmed by the observations
with the same accuracy as most of the "deterministic" laws of nature.
Radioactive disintegration is thus a mass phenomenon described, as to its
regularity, by the theory of probability.
As seen in the above example, phenomena described by a stochastic
scheme are also subject to natural laws. But in these cases the complex of
the considered circumstances does not determine the exact course of the
events; it determines a probability law, giving a bird's-eye view of the outcome.
Probability theory aims at the study of random mass-phenomena; this
explains its great practical importance. Indeed, we encounter random mass-phenomena
in nearly all fields of science, industry, and everyday life.
Almost every "deterministic" scheme of the sciences turns out to be stochastic
on closer examination. The laws of Boyle, Mariotte, and Gay-Lussac,
for instance, are usually considered to be deterministic laws. But the
pressure of a gas is caused by the impacts of the molecules of the gas on
the walls of the container. The mean pressure of the gas is determined by
the number and the velocity of the molecules hitting the wall of the container
per unit time. In fact, the pressure of the gas shows small fluctuations,
which may, however, be neglected in the case of larger gas-masses. As another
example consider the chemical reaction of two substances A and B in a
watery solution. As is well known, the velocity of the reaction is at every
instant proportional to the product of the concentrations of A and B. This
law is commonly considered as a causal one, but in reality the situation is
as follows. The atoms (respectively the ions) of the two substances move
freely in the solution. The average number of "encounters" of an ion
of substance A with an ion of substance B is proportional to the product of
their concentrations; hence this law turns out to be essentially a stochastic
one too.

The development of modern science often makes it necessary to examine
small fluctuations in phenomena dealt with earlier only in their outlines
and considered at that level as causal. In the following, we shall find several
occasions to illustrate these questions of principle with concrete examples.

§ 2. The notion of probability

Consider an experiment where the circumstances regarded do not unique¬


ly determine the outcome of the experiment, but leave more possibilities.
Let A be one of these. If we repeat the experiment under invariant condi¬
tions, A will occur in some of these experiments, while in the other
experiments it will not occur. If among n experiments the event A occurred exactly k
times, then k is called the frequency and k/n the relative frequency of the event
A in the given sequence of experiments. Generally, the relative frequency
of a random event is not constant in different sequences of experiments.
Consider as an example screws produced by an automatic machine. Let A
denote the event that the screw does not fit the requirements, i.e. that it

is defective. The frequency of the defective items will be in general different


in every series, for instance in the first series 3 in 100 screws, in the second 5,
etc. Under constant circumstances of production the percentage of the de¬
fective items fluctuates about a certain value. After a change in the circum¬
stances of production this value may be different, but about the new value
there will again be fluctuations.
The mentioned fluctuations of the relative frequency may be observed in
the simple experiment of coin tossing by observing for instance the relative
frequency of the heads. If H denotes head and T tail, we may obtain in a
sequence of 25 tossings the following outcome:
HTTHHTTTHHTHHHTTTHTTHHHTT.
Figure 6 represents the fluctuations of the relative frequency.
Anyone may perform such experiments in a few minutes. The order of
heads and tails will every time be different but the general picture will re¬
main essentially similar to the above: the relative frequency will always

fluctuate about the value 1/2.

Figure 7 represents the outcome of an experiment of 400 tossings. As


early as in the eighteenth century large sequences of tossings were observed.
Buffon for instance performed an experiment of 4040 tossings. The outcome
was 2048 times head, hence the relative frequency was 0.5069. In the
beginning of our century K. Pearson obtained from 24 000 tossings the value
0.5005 for the relative frequency.
There are thus random events showing a certain stability of the relative
frequency, i.e. the latter fluctuates about a well-determined value and the

more trials are performed, the smaller are, generally, the fluctuations. The
number, about which the relative frequency of an event fluctuates, is called
the probability of the event in question. Thus, for instance, the probability

of “heads” (supposed the coin is regular) is equal to 1/2.
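This behaviour of the relative frequency is easy to reproduce numerically. The following short simulation (an illustrative sketch added here, not part of the original text; it uses Python's standard random module) tosses a fair coin repeatedly and prints the relative frequency of heads after sample sizes comparable to those quoted above.

    import random

    random.seed(1)                          # fixed seed, so the run is reproducible
    n, heads = 24000, 0
    checkpoints = (25, 400, 4040, 24000)    # sample sizes mentioned in the text
    for i in range(1, n + 1):
        heads += random.randint(0, 1)       # 1 counts as "head", 0 as "tail"
        if i in checkpoints:
            print(i, heads / i)             # relative frequency after i tosses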

Consider now another example. In throws of regular die of homogeneous


substance the relative frequency of any one of the faces 1, 2, 3, 4, 5, 6 fluc¬

tuates about 1/6, i.e. the probability of each number is equal to 1/6; if,
however, the die is deformed, e.g. by curtailing one of its faces, these
probabilities will be different.
A further example is the following: there is a well-determined probability
that a certain atom of radioactive substance disintegrates during a given
time interval t. That is, in repeated observation of atoms during a time inter¬
val t we find that the number of atoms decaying during this time interval
fluctuates about a well-determined value. This value is according to the
observations 1 − e^{−λt}, where λ is a positive constant depending on the
radioactive substance. (E.g. in the case of radium, if the time is measured in seconds,
λ = 1.38 · 10^{−11}.) Later on we shall give a theoretical foundation of this
law, i.e. we shall deduce it from simple assumptions.
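The numerical value of λ quoted for radium can be checked from the half-life: after the half-life T exactly half of the substance remains, so 1 − e^{−λT} = 1/2 and λ = ln 2 / T. A minimal computation (illustrative only, not from the original text):

    import math

    T = 1600 * 365.25 * 24 * 3600   # half-life of radium, about 1600 years, in seconds
    lam = math.log(2) / T           # decay constant, from 1 - exp(-lam * T) = 1/2
    print(lam)                      # roughly 1.37e-11 per second, close to the quoted value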

Comparing our examples of coin tossing and radioactive disintegration


we see that in the first a coin is tossed many times successively, while in
the second a great number of simultaneous events are observed. This differ¬
ence is, however, unessential. Indeed, instead of tossing one coin succes¬
sively, we could toss a great number of similar coins simultaneously. The only
essential thing from the point of view of probability theory is the (at least
conceptually) unrestricted repeatability of the observation under the same
circumstances.
To sum up: If the relative frequency of a random event fluctuates about
a well-determined number, the latter is called the probability of the event
in question. The probability of an event may change with the circumstances
of the experiment. The only way to decide whether the relative frequency
has or does not have the property of stability and to determine the value
about which the statistical fluctuations occur, is empirical observation.
Thus the probability is considered to be a value independent of the
observer. It gives the approximate value of the relative frequency of an event
in a long sequence of experiments.
As the probability of a random event is an objective value which does not
depend on the observer, one can approximately calculate the long run
behaviour of such events, and the calculations can be empirically verified.
In everyday life we often make subjective judgements concerning the prob¬
ability of a random event. The mathematical theory of probability, how¬
ever, does not deal with these subjective judgements but with objective proba¬
bilities. These objective probabilities are to be “measured” just like physical
quantities. Probability theory and mathematical statistics have their own
methods to “measure” probabilities. These methods are mostly indirect
and at the end are always based upon the observation of relative frequencies.
But at a conceptual level we must sharply distinguish the probability, which
is a fixed number, from the relative frequency depending on chance.

§ 3. Probability algebras

The theory developed in this book is due to A. N. Kolmogorov. It is the


basis of modern probability theory.
In this theory one starts from the assumption that to all possible (or at
least all considered) events in an experiment (or with other words: to all
elements of an algebra of events) a numerical value is assigned: the proba¬
bility of the event in question. If we perform an experiment n times and find
k occurrences of the event A, then we have, because of the relation 0 ≤ k ≤ n,
in any case 0 ≤ k/n ≤ 1; that is, the relative frequency of an event is always

a number lying between zero and one. Evidently, the probability of an event
must therefore lie between zero and one as well. It is further clear that the
relative frequency of the sure event is equal to one and that of the impossible
event is equal to zero. Hence also the probability of the “sure” event must
be equal to one and that of the “impossible” event equal to zero. If A and B
are two possible outcomes of the same experiment which mutually exclude
each other and if in n performances of the same experiment the event A
occurred k_A times and the event B k_B times, then clearly the event A + B
occurred k_A + k_B times. Hence, denoting by f_A, f_B and f_{A+B} the relative
frequencies of A, B, and A + B, respectively, we have:

f_{A+B} = f_A + f_B.

In other words, the relative frequency of the sum of two mutually exclusive
events is always equal to the sum of the relative frequencies of these events.
Hence also the probability of the sum of two mutually exclusive events must
be equal to the sum of the probabilities of the events. We therefore take the
following axioms:

α) To each element A of an algebra of events a non-negative real number
P(A) is assigned, called the probability of the event A.
β) The probability of the sure event is equal to one, that is P(I) = 1.
γ) If AB = O, then P(A + B) = P(A) + P(B).

An algebra of events in which to every element A a number P(A) is
assigned satisfying Axioms α), β), γ) will be called a probability algebra.
Let us first see some consequences of the axioms:

Theorem 1. If B ⊆ A, then P(B) ≤ P(A).

Proof. From B ⊆ A it follows that A = B + C, where C = AB̄ and hence
BC = O. Thus by Axiom γ) we have

P(A) = P(B) + P(C).

Since because of Axiom α) P(C) ≥ 0, Theorem 1 follows immediately. It


can also be deduced directly from the relation between probability and re¬
lative frequency. Indeed, if the occurrence of event B implies the occurrence
of event A (i.e. if B c A), then in any sequence of experiments the event A
occurs at least as often as the event B.
Since A ⊆ I for every event A, it follows from Theorem 1 that

P(A) ≤ 1 for every A.



Another important consequence of the axioms is the possibility to find


the probability of the contrary event Ā from the probability of the event A.
Indeed, we have A + Ā = I and AĀ = O, and thus by Axiom γ)
P(A) + P(Ā) = P(I) = 1. Herewith we proved

Theorem 2. For any event A the relation

P(A) + P(Ā) = 1
holds.

Because of O = Ī, it follows from Theorem 2

P(O) = 1 − P(I) = 1 − 1 = 0;

i.e. the probability of the impossible event is equal to 0.


Theorem 2 can be deduced directly from the conceptual definition of
probability. Indeed, if in a sequence consisting of n experiments the event
A occurs k times, then Ā occurs exactly (n − k) times. Hence for the relative
frequencies f_A and f_Ā we have

f_A + f_Ā = 1.
Axiom y) states that the probability of the sum of two mutually exclusive
events is equal to the sum of the probabilities of the two events. This leads
immediately to

Theorem 3. If the events A_1, A_2, . . ., A_n are pairwise exclusive, i.e. if
A_i A_j = O for i ≠ j, then

P(A_1 + A_2 + . . . + A_n) = P(A_1) + P(A_2) + . . . + P(A_n).

The proof proceeds by mathematical induction. Our theorem is valid for


n = 2. Suppose that it is proved for n − 1. Since clearly (A_1 + A_2) A_k = O
(k = 3, 4, . . ., n), we have because of the induction assumption

P(A_1 + . . . + A_n) = P(A_1 + A_2) + P(A_3) + . . . + P(A_n) =
= P(A_1) + P(A_2) + . . . + P(A_n).

From Theorem 3 follows, as a generalization of Theorem 2,

Theorem 4. If the events A_1, A_2, . . ., A_n form a complete system (cf. Ch. I,
§ 2), then

P(A_1) + P(A_2) + . . . + P(A_n) = 1.

Indeed, by assumption

A_1 + A_2 + . . . + A_n = I and A_i A_j = O

for i ≠ j, further P(I) = 1; hence Theorem 4 follows immediately from


Theorem 3.
As a further important consequence of the axioms we show how to cal¬
culate the probability of the sum of two events A and B, if we drop the re¬
striction that A and B should exclude each other.

Theorem 5. Let A and B be arbitrary events. We have then

P(A + B) = P(A) + P(B) - P(AB).

Proof. A + B can be represented as a sum of two mutually exclusive


events. We have namely
A + B = A + ĀB.
Hence by Axiom γ)

P(A + B) = P(A) + P(ĀB).     (1)

On the other hand, we have B = AB + ĀB and AB · ĀB = O. From these
it follows

P(B) = P(AB) + P(ĀB).     (2)

Subtraction of Equation (2) from (1) leads to our statement. In particular,
if we have AB = O, our theorem reduces to the assertion of Axiom γ).
We shall need the following simple theorems:

Theorem 6. If A ⊆ B, then

P(B - A) = P(B) - P(A).

Proof. We have, by assumption, B = A + (B − A) and A(B − A) = O;
hence Theorem 6 follows from Axiom γ).
Clearly the relation P(B - A) = P(B) - P(A) does not hold in general.
It is, however, easy to obtain

Theorem 7. P(B - A) = P(B) - P(AB).

Proof. We have (B - A) + AB = B and (B - A)AB = O; from this


our theorem follows because of Axiom γ).
Furthermore we have

Theorem 8. P(A Δ B) = P(A) + P(B) − 2P(AB).

Proof. Because of

A Δ B = (A − B) + (B − A)

and
(A − B) (B − A) = O
we find
P(A Δ B) = P(A − B) + P(B − A),

and Theorem 8 follows from Theorem 7.


We proved Theorem 3 by repeated application of Axiom γ). In the same
manner, we can obtain by repeated application of Theorem 5 a formula
for the probability of the sum of an arbitrary number of events. In particular,
we have

P(A + B + C) = P(A) + P(B) + P(C) − P(AB) − P(BC) −
− P(CA) + P(ABC).

More generally we have

Theorem 9. Let A_1, A_2, . . ., A_n be arbitrary events of a probability algebra.
Then we have

P(A_1 + A_2 + . . . + A_n) = \sum_{k=1}^{n} (−1)^{k−1} S_k^{(n)},

where

S_k^{(n)} = \sum_{1 ≤ i_1 < i_2 < . . . < i_k ≤ n} P(A_{i_1} A_{i_2} . . . A_{i_k}).

In the summation (i_1, i_2, . . ., i_k) runs through all combinations, k at a
time, of the numbers 1, 2, . . ., n, repetitions not allowed.
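Theorem 9 is easy to verify numerically on a small classical probability algebra by brute-force enumeration. The sketch below (illustrative, not part of the original text) takes three events over the 36 equally probable outcomes of two dice and compares P(A_1 + A_2 + A_3) with the alternating sum of the S_k^{(n)}.

    from itertools import combinations, product

    outcomes = list(product(range(1, 7), repeat=2))      # 36 equally probable pairs
    P = lambda S: len(S) / len(outcomes)
    A = [{w for w in outcomes if w[0] % 2 == 0},          # first die even
         {w for w in outcomes if w[1] == 6},              # second die shows a six
         {w for w in outcomes if w[0] + w[1] >= 10}]      # sum at least ten

    lhs = P(A[0] | A[1] | A[2])                           # P(A_1 + A_2 + A_3)
    rhs = 0.0
    for k in range(1, len(A) + 1):
        S_k = sum(P(set.intersection(*c)) for c in combinations(A, k))
        rhs += (-1) ** (k - 1) * S_k
    print(lhs, rhs)                                       # the two numbers agree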
Theorem 9 is a particular case of the following more general theorem:

Theorem 10 (Ch. Jordan). Let W_r^{(n)} denote the probability of the occurrence
of exactly r among the events A_1, A_2, . . ., A_n. Then we have

W_r^{(n)} = \sum_{k=0}^{n-r} (−1)^k \binom{r+k}{k} S_{r+k}^{(n)}     (r = 0, 1, . . ., n),     (3)

where S_0^{(n)} = 1 and

S_l^{(n)} = \sum P(A_{i_1} A_{i_2} . . . A_{i_l})     for l = 1, 2, . . .,     (4)

and the summation is to be extended over all combinations of the numbers
1, 2, . . ., n, l at a time, repetitions not allowed.

We shall prove Theorem 10 by a general principle which can be used for


the proof of many similar identities.1

1 Cf. A. Renyi [23], [35].



We need some preparatory definitions and remarks. Let A_1, A_2, . . ., A_n
be arbitrary events. There exists a “least algebra of events” containing
the events A_1, A_2, . . ., A_n. Clearly, this algebra consists of 2^{2^n} elements and is
generated by the 2^n elementary events ω = A_{i_1} A_{i_2} . . . A_{i_k} Ā_{j_1} Ā_{j_2} . . . Ā_{j_{n−k}},
where (i_1, i_2, . . ., i_k) is an arbitrary combination of the numbers 1, 2, . . ., n,
k at a time (k = 0, 1, . . ., n), and j_1, j_2, . . ., j_{n−k} are those of the numbers
1, 2, . . ., n which do not figure among i_1, i_2, . . ., i_k.¹
We shall call every event belonging to this algebra “an event expressible in terms of
A_1, A_2, . . ., A_n”. Every such event B is namely a sum of certain elementary
events ω. Thus B can be expressed by application of the basic operations of
the algebra of events to A_1, A_2, . . ., A_n. In other words: every such event B
is a function of the events A_1, . . ., A_n. The functional relation between an
event B and the events A_1, A_2, . . ., A_n representing B does not depend
on the specific choice of the events A_1, A_2, . . ., A_n. In particular it does not
depend on the probabilities of the events A_1, A_2, . . ., A_n. The event
A_{i_1} A_{i_2} . . . A_{i_r}, or the event E_r that exactly r of the events A_1, A_2, . . ., A_n
occur, are for instance events expressible in terms of A_1, A_2, . . ., A_n.
The following theorem contains the principle mentioned above:

Theorem 11. Let A_1, A_2, . . ., A_n be n arbitrary events, let c_1, c_2, . . ., c_m
be real numbers and B_1, B_2, . . ., B_m a sequence of events expressible in terms
of A_1, A_2, . . ., A_n. The inequality

\sum_{k=1}^{m} c_k P(B_k) ≥ 0     (5)

holds for an arbitrary probability algebra obtained by assigning probabilities
to the events expressible in terms of A_1, A_2, . . ., A_n if and only if it holds
for those probability algebras in which the sequence of numbers P(A_k)
(k = 1, 2, . . ., n) consists only of zeros and ones.
Proof. Since every B_k is the sum of certain elementary events ω, (5) is
equivalent to the inequality

\sum_{\omega} \lambda_\omega P(\omega) ≥ 0,     (6)

where the real numbers λ_ω depend only on the numbers c_k and on the functional
dependence of the events B_k on the events A_j, but do not depend on
the numerical values P(A_j). The summation is over all the 2^n elementary
events ω. Indeed we have

λ_ω = \sum_{\omega \subseteq B_k} c_k,

where the summation is over such values of k for which ω figures in the
representation of B_k. Since the nonnegative numbers P(ω) are submitted
to the only condition \sum_{\omega} P(ω) = 1, (6) holds, in general, if and only if all
numbers λ_ω are nonnegative. But when the sequence of numbers P(A_j)
(j = 1, 2, . . ., n) consists of nothing but zeros and ones, one and only one
of the elementary events ω has probability 1 and all the others have probability
0. Thus the proposition that λ_ω ≥ 0 for all ω is equivalent to the proposition
that (6) is valid whenever the sequence of numbers P(A_k) consists
of zeros and ones only. Theorem 11 is now proved.

¹ The events A_k (k = 1, . . ., n) are here considered as variables (indefinite events);
thus there are no relations assumed between the A_k's.
From Theorem 11 follows immediately

Theorem 12. If A_1, A_2, . . ., A_n are arbitrary events and B_1, B_2, . . ., B_m
are certain events expressible in terms of A_1, A_2, . . ., A_n, then the relation

\sum_{k=1}^{m} c_k P(B_k) = 0     (7)

holds in every probability algebra, if and only if it holds in all cases when the
sequence of numbers P(A_k) consists of zeros and ones only.

Proof. Apply Theorem 11 in turn to the inequalities \sum_{k=1}^{m} c_k P(B_k) ≥ 0
and \sum_{k=1}^{m} (−c_k) P(B_k) ≥ 0. Theorem 12 follows immediately.
Now we can prove Theorem 10 (and thus also Theorem 9). If l of
the numbers P(A_k) are equal to 1 and the remaining n − l are equal to
0 (l = 0, 1, 2, . . ., n), then (3) is reduced to the identity

\sum_{k=0}^{n-r} (−1)^k \binom{r+k}{k} \binom{l}{r+k} = 1 if l = r, and = 0 if l ≠ r.     (8)

For l < r all terms of the left hand side of (8) are equal to 0, for l = r only
the term k = 0 is distinct from zero, namely 1, and for l > r the sum can be
transformed as follows:

\sum_{k=0}^{n-r} (−1)^k \binom{r+k}{k} \binom{l}{r+k} = \binom{l}{r} \sum_{k=0}^{l-r} (−1)^k \binom{l-r}{k} = \binom{l}{r} (1 − 1)^{l−r} = 0.

Let us mention as another application of Theorem 11 the inequalities of
M. Fréchet and A. J. Gumbel (cf. M. Fréchet [2]).

Theorem 13 (M. Fréchet).

\frac{S_{r+1}^{(n)}}{\binom{n}{r+1}} ≤ \frac{S_r^{(n)}}{\binom{n}{r}}     (r = 0, 1, . . ., n − 1),     (9)

where the same notations are used as in Theorem 9.

Theorem 14 (A. J. Gumbel).

\frac{S_{r+1}^{(n)}}{\binom{n-1}{r}} ≤ \frac{S_r^{(n)}}{\binom{n-1}{r-1}}     (r = 1, 2, . . ., n − 1),     (10)

where the same notations are used as in Theorem 9.

Proof of Theorems 13 and 14. Apply Theorem 11. If l of the numbers
P(A_k) are equal to 1, and the others are equal to 0, then (9) will be reduced
to the trivial inequality

\frac{\binom{l}{r+1}}{\binom{n}{r+1}} ≤ \frac{\binom{l}{r}}{\binom{n}{r}},

and Theorem 14 to the likewise trivial inequality

\frac{\binom{l}{r+1}}{\binom{n-1}{r}} ≤ \frac{\binom{l}{r}}{\binom{n-1}{r-1}}.

§ 4. Finite probability algebras

If the set of events of a probability algebra is finite, we have shown in


Chapter I the possibility of representing these events by the class of all subsets
of a finite set Ω. Let Ω consist of N elements, denoted by ω_1, ω_2, . . ., ω_N.
The probability P(A) of any event A is then uniquely determined by the values
of P for the sets which consist of exactly one element. Let {ω_i} be the set
consisting of ω_i only and let further be P({ω_i}) = p_i (i = 1, 2, . . ., N).
Then we have for each event A

P(A) = \sum_{\omega_i \in A} p_i.

Since P(Ω) = 1, the (nonnegative) numbers p_i must obey the condition

\sum_{i=1}^{N} p_i = 1.

An important particular case is obtained if all the numbers p_i are equal
to each other, that is, if they are equal to 1/N. These special probability
algebras are called classical probability algebras, since the classical calculus of
probability exclusively dealt with these algebras.
At the early stages of development of probability theory one wished to
reduce the solution of any kind of problems to this case. But this often turned
out to be either too artificial or unnecessarily involved. Since, however, in the
games of chance (tossing of a coin, games of dice, roulette, card-games,
etc.) the probabilities can be determined in this manner indeed, and since
many problems of science and technology may be reduced to the study of
classical probability algebras, it is worthwhile to deal with them separately.
In the case of classical probability algebras we have

P(A) = K/N,

where K denotes the number of the elements of A. Thus we arrive at the


“classical definition” of probability: The probability of an event is equal to
the quotient of the number of the favorable cases and of the total number
of all possible cases, provided these cases are all equally probable.
Today the “classical definition” of probability is not considered as a
definition any more, but only as a way to calculate probabilities, applicable
in case of classical probability algebras, i.e. finite probability algebras whose
elementary events have, for certain reasons (e.g. symmetry-properties), the
same probability.
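As a simple illustration of this way of calculating probabilities (a sketch added here, not part of the original text), the probability of throwing a total of 7 with two regular dice is obtained by counting the favorable among the 36 equally probable cases.

    from itertools import product

    cases = list(product(range(1, 7), repeat=2))     # all 36 equally probable cases
    favorable = [c for c in cases if sum(c) == 7]    # the favorable cases
    print(len(favorable) / len(cases))               # 6/36 = 1/6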

§ 5. Probabilities and combinatorics

In classical probability algebras the probabilities are determined by com¬


binatorial methods. In what follows we shall give some examples.

Example 1a. A person having N keys in his pocket wishes to open his
apartment. He takes one key after the other from his pocket at random and
tries to open the door. What is the probability that he finds the right key
at the k-th trial? Suppose that the N! possible sequences of the keys all have
the same probability. In this case the answer is very simple indeed: N elements
have (N − 1)! permutations with a fixed element occupying the
k-th place. The probability in question is therefore (N − 1)!/N! = 1/N; that
is, the probability of finding the right key at the first, second, . . ., N-th trial,
respectively, is always 1/N. If the keys are on the same key-ring and if the

same key may be tried more than once, the answer is different and will be
dealt with later (cf. Ch. III, § 3, 7.).
Example 1b. An urn contains M red and N − M white balls. Balls are
drawn from the urn one after the other without replacement. What is the
probability of obtaining the first red ball at the k-th drawing? In order to
answer the question we have to determine the total number of all permu¬
tations of M red and N − M white balls having for their first k − 1 balls white
ones and for the k-th a red ball. The first k − 1 balls may be chosen in \binom{N-M}{k-1}
(different) ways from the N − M white balls; furthermore, these can be arranged
in (k − 1)! different orders. The red ball on the k-th place can be chosen
in M different ways and the remaining places can be filled up in (N − k)!
different ways. Hence the probability in question — provided that all the
N! permutations are equally probable — is given by

P_k = \binom{N-M}{k-1} (k − 1)! M (N − k)! / N!.

Obviously, the special case M = 1 is equivalent to Example 1a.
In order to make the calculations easier, P_k may be written in the
following form:

P_k = \frac{M}{N-k+1} \prod_{j=1}^{k-1} \left(1 − \frac{M}{N-j+1}\right).

If N and M are large in comparison to k, and M/N is denoted by p (0 < p < 1),
then we have approximately P_k ≈ (1 − p)^{k−1} p = π_k. Indeed, if N
and M tend to infinity while p = M/N remains constant, then P_k tends to the
expression π_k.
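The approximation P_k ≈ (1 − p)^{k−1} p can be checked numerically. The following sketch (illustrative only, not from the original; the parameter values are arbitrary) evaluates the product form of P_k and compares it with π_k.

    def P_exact(N, M, k):
        # white balls at the first k-1 drawings, a red one at the k-th
        p = 1.0
        for j in range(1, k):
            p *= 1 - M / (N - j + 1)
        return p * M / (N - k + 1)

    def P_approx(N, M, k):
        p = M / N
        return (1 - p) ** (k - 1) * p           # the geometric approximation pi_k

    for k in (1, 2, 5, 10):
        print(k, P_exact(1000, 200, k), P_approx(1000, 200, k))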

Example 2. Consider now the following problem. An urn contains N


balls, of which M are red and N — M white, 1 < M < N. From the urn
n balls are drawn. What is the probability of obtaining k red and n — k
white balls?
A new formulation of the same example discloses the practical importance
of the problem. In the serial production of machine parts a series of N items
contains M rejects. What is the probability, by taking a random sample
of n elements, that this sample will contain k rejects?

Solution. A sample of n elements may be chosen from N elements in \binom{N}{n}
different ways. Suppose that every such combination is equally probable.
Then the probability of every combination is \binom{N}{n}^{-1}. Therefore we have only
to count how many combinations contain just k of the rejects. There can
be chosen k elements from M in \binom{M}{k} different ways and n − k elements
from N − M in \binom{N-M}{n-k} different ways. Therefore the probability in question
is:

\binom{M}{k} \binom{N-M}{n-k} / \binom{N}{n}.
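The probability just obtained is the hypergeometric distribution; a direct transcription into code (an illustrative sketch, not from the original text) reads:

    from math import comb

    def hypergeometric(N, M, n, k):
        # probability of k rejects in a sample of n drawn without replacement
        # from a lot of N items containing M rejects
        return comb(M, k) * comb(N - M, n - k) / comb(N, n)

    # for example: a lot of 100 items with 5 rejects, a sample of 10, exactly 1 reject
    print(hypergeometric(100, 5, 10, 1))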

Example 3. The Maxwell-Boltzmann, Bose-Einstein, and Fermi-


Dirac statistics.
We start from the following simple combinatorial problem: How many
different ways are there in which n objects can be placed into N cells? Every
object can be placed into N different cells, hence there are N possibilities
for each object and the number of the possibilities of the different arrange¬
ments is N^n. If we assume that the probability of every arrangement is the
same, i.e. N^{−n}, then we obtain the probability of a certain cell being occupied
by exactly k objects, if we count the possibilities in question. The k objects
occupying the cell can be chosen in \binom{n}{k} different ways, and the number of
the possibilities to arrange the remaining n − k objects into the other N − 1 cells is
(N − 1)^{n−k}, so the number of possibilities in question will be \binom{n}{k} (N − 1)^{n−k}.
The probability in question is

W_k = \binom{n}{k} \frac{1}{N^k} \left(1 − \frac{1}{N}\right)^{n-k}.
Such “problems of arrangements” are of paramount importance in sta¬
tistical mechanics. There it is usual to examine the arrangements of certain
kinds of particles (molecules, photons, electrons, etc.) in the “phase space”.
The meaning of this is the following: If every particle can be characterized
by K data, then there corresponds to the state of the particle in question a
point of the phase space, having for coordinates the data characterizing
the particle. In subdividing the phase space into K-dimensional parallelepi-
peds (cells) the physical system can be described approximately by giving
the number of particles in each cell. The assumption that all arrangements
have equal probabilities leads to the so-called Maxwell-Boltzmann sta¬
tistics. This can be applied for instance in statistical mechanics to systems
of the molecules of a gas. But in the case of photons, electrons, and other
elementary particles we must proceed in a different way. For systems of
photons for instance, the following model was found to be valid: in distrib¬
uting n objects into N cells two arrangements containing in each of the
cells the same number of objects are not to be considered as distinct. That
is, the objects are to be considered as not distinguishable and thus only
arrangements having different numbers of objects in the cells can be distin¬
guished from each other. This assumption leads to the so called Bose-Ein-
stein statistics.
Next we calculate the number of possible arrangements under Bose-Ein-
stein statistics. This problem is the same as the following question: In how
many ways can n shillings be distributed among N persons? (Of course,
only the number of shillings obtained by each person is of interest, the individuality
of the coins being irrelevant.) This number is equal to the number of combinations of
N things, n at a time, repetitions allowed, i.e. to \binom{N+n-1}{n}. Another
solution of our problem is the following: Let to the n objects correspond n colli-
near points. Let these n points be subdivided by N— 1 separating lines. Every
configuration thus obtained corresponds to one possible arrangement.
Every two consecutive lines signify one cell and the number of the points
lying between two consecutive lines represents the number of objects in the
corresponding cell. If there are no points between two consecutive lines,
the corresponding cell is empty. Figure 8 gives a possible arrangement of
eight objects into six cells; here the first cell contains one, the second two
objects, the third cell is empty, the fourth contains three objects, the fifth
is empty, and in the sixth are two objects.

The number of possibilities in question is thus obtained by dividing the


number of permutations of the n points and N − 1 lines by the number ob-
tained by permuting only the points among themselves and the lines among
themselves; it is equal to \binom{N+n-1}{n}. Hence under the Bose-Einstein hypo-
thesis the probability of each arrangement is \binom{N+n-1}{n}^{-1}.

Fig. 8

Now we shall calculate under Bose-Einstein statistics the probability


of exactly k particles being in a given cell. For this it suffices to determine
the number of possibilities having in the cell in question exactly k particles.
This, however, is equal to the number of possibilities of putting n − k
particles into the remaining N − 1 cells, hence to \binom{N+n-k-2}{n-k}; thus the
probability is

\binom{N+n-k-2}{n-k} / \binom{N+n-1}{n}.

We have dealt with Bose-Einstein statistics not only because of their


important physical applications. We also wished to make clear that the
determination of equally probable cases is not a question of pure logic;
experience is involved in it too. This example shows further that a hypo¬
thesis relative to equally probable cases cannot always be checked directly.
Often we must rely on experimental verification of the consequences of the
hypothesis in question.
The Bose-Einstein statistics have no general validity. It does not apply
for instance in the case of electrons, where the so-called Fermi-Dirac sta¬
tistics are appropriate. They are obtained by joining to the principle of
indistinguishability of the Bose-Einstein statistics the requirement that every
cell may be occupied by at most one particle (Pauli’s principle). Hence the
number of distinct arrangements is equal to the number of the possibilities
of choosing n from N elements, i.e. to \binom{N}{n}. The probability of a cell being

occupied by one particle (more than one is out of question) is

\binom{N-1}{n-1} / \binom{N}{n} = n/N.

The Fermi-Dirac statistics gives good agreement with experiments in


the case of electrons, protons, and neutrons.
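The three occupancy probabilities derived in this example can be compared numerically. The sketch below (illustrative, not part of the original text) evaluates, for a given cell, the probability of finding exactly k particles under each of the three statistics; the chosen values of N and n are arbitrary.

    from math import comb

    def maxwell_boltzmann(N, n, k):
        # distinguishable particles; all N**n arrangements equally probable
        return comb(n, k) * (N - 1) ** (n - k) / N ** n

    def bose_einstein(N, n, k):
        # indistinguishable particles; all C(N+n-1, n) arrangements equally probable
        return comb(N + n - k - 2, n - k) / comb(N + n - 1, n)

    def fermi_dirac(N, n, k):
        # at most one particle per cell (Pauli's principle); k is 0 or 1
        return comb(N - 1, n - k) / comb(N, n)

    N, n = 10, 4
    for k in range(3):
        fd = fermi_dirac(N, n, k) if k <= 1 else 0.0
        print(k, maxwell_boltzmann(N, n, k), bose_einstein(N, n, k), fd)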

§ 6. Kolmogorov probability spaces

The theory discussed up to now can only deal with the most elementary
problems of probabilities; those involving an infinite number of possible
events are not covered by it. To deal with them we need Kolmogorov’s
theory, which will now be discussed.
In Kolmogorov’s probability theory we assume that there is given an
algebra of sets, isomorphic to the algebra of events dealt with. This assump¬
tion, as we have seen, does not restrict the generality. We assume further
that this algebra of sets contains not only the sum of any two sets belonging
to it but also the sum of denumerably many sets belonging to the algebra
of sets. Algebras of sets with this property are called σ-algebras or Borel
algebras.
In Kolmogorov’s theory we therefore assume the following axioms:

I. Let there be given a nonempty set Ω. The elements of Ω are said to be
elementary events and are denoted by ω.

II. Let there be specified an algebra of sets 𝒜 consisting of subsets of Ω; the sets A of 𝒜
are called events.

III. 𝒜 is a σ-algebra, that is¹

A_k ∈ 𝒜 (k = 1, 2, . . .) ⇒ \sum_{k=1}^{\infty} A_k ∈ 𝒜.
From the Axioms I-III it follows immediately that if A_k ∈ 𝒜 (k = 1, 2, . . .),
then also \prod_{k=1}^{\infty} A_k ∈ 𝒜.

The following axioms prescribe the properties of probabilities:

IV. To each element A of 𝒜 is assigned a nonnegative real number P(A),


called the probability of the event A.

¹ Here and in what follows the sign ⇒ stands for the (logical) implication.

V. P(Ω) = 1.

VI. If A_1, A_2, . . ., A_n, . . . is a finite or a denumerably infinite sequence
of pairwise disjoint sets belonging to 𝒜, then

P(A_1 + A_2 + . . . + A_n + . . .) = P(A_1) + P(A_2) + . . . + P(A_n) + . . . .

Requirement VI is called the σ-additivity (or complete additivity) of the
set function P(A).
A σ-algebra 𝒜 of subsets of a set Ω on which a set function P(A) is defined
such that Axioms I-VI are fulfilled will be called a probability space in the
sense of Kolmogorov and will be denoted by [Ω, 𝒜, P].
Theorems proved in the previous paragraph hold clearly for Kolmogorov
probability spaces too, as the Axioms α), β), and γ) correspond to Axioms
IV, V, and VI respectively. Axiom VI, however, requires more than
Axiom γ) of probability algebras, since it assumes the additivity of P(A)
not only for finitely many, but also for denumerably many pairwise disjoint
sets belonging to the σ-algebra 𝒜.
Every finite probability algebra is a Kolmogorov probability space, since
an additive set function on a finite algebra of sets is trivially σ-additive.
The empty set is denoted by O. Obviously, we have always P(O) = 0
(cf. the note to Theorem 2 of § 3).
Apart from finite probability algebras the simplest probability fields
are those in which the space Ω consists of denumerably many elements.
Let Ω be a denumerable set, with elements ω_1, ω_2, . . ., ω_n, . . .; let 𝒜 consist
of all subsets of Ω. Let the set containing the only element ω_n be denoted
by {ω_n}; let further be P({ω_n}) = p_n (n = 1, 2, . . .). In order that [Ω, 𝒜, P]
be a Kolmogorov probability space the conditions p_n ≥ 0 and \sum_{n=1}^{\infty} p_n = 1
must, according to Axioms IV-VI, be satisfied. Further if A is an arbitrary
subset of Ω, then by Axiom VI we have

P(A) = \sum_{\omega_k \in A} p_k.

Conversely, if the above conditions are fulfilled, [Ω, 𝒜, P] is in fact a
Kolmogorov probability space. Thus we have also proved the consistency
of Kolmogorov's axioms. The σ-additivity of P is readily seen, since in any
convergent series of nonnegative terms the terms can be rearranged and
bracketed in whatever order.

§ 7. The extension of rings of sets, algebras of sets and measures

In this paragraph we shall discuss some results of set theory and measure
theory, used in probability theory. We shall not aim at completeness. We

assume that the reader is familiar with the fundamentals of measure theory
and of the theory of functions of a real variable. Accordingly, proofs are
merely sketched or even omitted, especially if dealing with often-used con¬
siderations of the theory of functions of a real variable.1
We have seen already in Chapter I that every algebra of events is isomor¬
phic to an algebra of sets. It is always assumed in Kolmogorov’s theory that
the sets assigned to the elements of the algebra of events form a cr-algebra.
Hence the algebra of sets constructed to the algebra of events must be ex¬
tended into a ^-algebra — if it is not already a cr-algebra itself. This exten¬
sion is always possible, even in the case of a ring of sets.
A system ℛ of subsets of a set Ω is called a ring of sets if

A ∈ ℛ and B ∈ ℛ ⇒ A + B ∈ ℛ and A − B ∈ ℛ.

A ring of sets ℛ is said to be a Borel ring of sets or σ-ring, if

A_n ∈ ℛ (n = 1, 2, . . .) ⇒ \sum_{n=1}^{\infty} A_n ∈ ℛ.

The ring of sets ℛ is an algebra of sets iff² the set Ω belongs to ℛ. In fact,
an algebra of sets 𝒜 can be characterized as a system of subsets of a set
Ω having the following properties:

I. A ∈ 𝒜 and B ∈ 𝒜 ⇒ A − B ∈ 𝒜;
II. A ∈ 𝒜 and B ∈ 𝒜 ⇒ A + B ∈ 𝒜;
III. Ω ∈ 𝒜.

This is obvious, as I and III imply that whenever A belongs to 𝒜 so does
Ā = Ω − A, and thus conditions 1, 2, and 3 of § 3, Chapter I are fulfilled. Con-
versely, it follows from these conditions that whenever A ∈ 𝒜 and B ∈ 𝒜
hold, so do Ā ∈ 𝒜 and A − B = \overline{Ā + B} ∈ 𝒜 as well. Hence the conditions
I, II, III are equivalent to the conditions 1, 2, 3 of Chapter I. We have now
the following theorem:

Theorem 1. Let Ω be any set and ℛ a ring consisting of subsets of Ω. There
exists then a uniquely determined σ-ring (or Borel ring) ℬ(ℛ) with the follow-
ing properties: ℬ(ℛ) contains ℛ, and an arbitrary σ-ring ℛ′ containing ℛ
contains ℬ(ℛ) as well. In other words, ℬ(ℛ) is the least σ-ring containing ℛ.

Proof. Obviously, there exists a σ-ring containing ℛ. Such is, for in-
stance, the collection of all subsets of Ω. Let now ℬ(ℛ) be the intersection

1 More particularly see e.g. P. R. Halmos [1 ] or V. I. Smirnov [2].


2 “iff” stands here and in what follows as usual for “if and only if”.

of all σ-rings containing ℛ. Evidently, all statements of Theorem 1 are
fulfilled by ℬ(ℛ).
The assumption in Kolmogorov’s theory that the sets assigned to the
events form a σ-algebra is, according to our theorem, no essential restriction
of the generality, since every algebra of sets may be extended by a suitable
extension into a σ-algebra.
Let us now consider another example. Let R denote the set of real numbers.
Let ℛ be the collection of those subsets of the real axis which can be
represented as sums of a finite number of disjoint half-open intervals (closed
to the left and open to the right). Obviously, ℛ is a ring of sets, though it is
not a σ-ring. By Theorem 1 there exists a least σ-ring ℬ(ℛ) containing ℛ.
It is easy to see that ℬ(ℛ) is in our case a σ-algebra as well. The subsets
of R belonging to ℬ(ℛ) are called Borel sets.
Let us now assume that a nonnegative and completely additive set func-
tion P(A) is defined on the sets of an algebra of sets (which is not a σ-algebra),
further that P(Ω) = 1. We have seen that every algebra of sets may be
extended into a σ-algebra. But is it possible (and if so, how) to extend the
definition of P to the elements of the extended σ-algebra while preserving the
nonnegativity and the σ-additivity?
A nonnegative, completely additive set function defined over a ring of
sets is called a measure. Our question therefore, in a more general formula-
tion, is whether a measure defined over a ring ℛ can be extended to the
least σ-ring ℬ(ℛ) containing ℛ.
For our purposes it suffices to consider σ-finite measures. A measure
μ defined over a ring ℛ is said to be σ-finite, if for every set A ∈ ℛ there
exists a sequence A_n ∈ ℛ (n = 1, 2, . . .) such that every set A_n has a finite
measure μ(A_n) < +∞ and A ⊆ \sum_{n=1}^{\infty} A_n. The following theorem can now be
asserted:

Theorem 2. If μ(A) is a σ-finite measure defined over a ring of sets ℛ, there
exists a uniquely determined σ-finite measure μ̄(A) defined over the extended
ring ℬ(ℛ) such that for every A ∈ ℛ one has μ(A) = μ̄(A).

Proof. In order to construct μ̄ let us first define a set function μ* in the
following manner. Let μ*(A) for any subset A of Ω be the lower bound
of all sums \sum_{n=1}^{\infty} μ(A_n), where the A_n belong to ℛ and their union contains A;
that is

μ*(A) = inf \left\{ \sum_{n=1}^{\infty} μ(A_n) : A ⊆ \sum_{n=1}^{\infty} A_n \right\}.

It is easy to verify that μ*(A) has the following properties:

a) μ*(A) ≥ 0;
b) if A ∈ ℛ, then μ*(A) = μ(A);
c) if A ⊆ \sum_{n=1}^{\infty} A_n, then μ*(A) ≤ \sum_{n=1}^{\infty} μ*(A_n).

μ*(A) is called the outer measure of the set A. A set A is said to be mea-
surable, if every subset B of Ω satisfies the following relation:

μ*(B) = μ*(AB) + μ*(ĀB),

where Ā is the set complementary to A with respect to Ω.

Let ℛ* be the collection of all sets measurable in this sense. μ̄(A) is defined
on ℛ* by the equality μ̄(A) = μ*(A). It is not difficult to prove that

1. ℛ* is a σ-algebra containing ℛ (and thus ℬ(ℛ) as well);

2. μ̄(A) is a measure on ℛ*, i.e. μ̄(A) is completely additive.

From these statements it follows that μ̄(A) satisfies the requirements of
Theorem 2.
It can be shown that this extension of μ to ℬ(ℛ) is unique. The measure
μ̄ so obtained has, as it is readily seen from the construction, the following
further property: If A ∈ ℛ* and μ̄(A) = 0, then from B ⊆ A it follows that B ∈ ℛ*
and thus μ̄(B) = 0. Measures having this property are called complete
measures. Thus every measure derived from an outer measure in the above
described way is complete.
Let us now consider the following example: Let R be the real axis
and let F(x) be a nondecreasing function continuous from the left satis-
fying the relations

lim_{x → −∞} F(x) = 0,     lim_{x → +∞} F(x) = 1.

Let the ring ℛ be the collection of all sets consisting of a finite number of
intervals closed to the left and open to the right. μ(A) will be defined as
follows: If A consists of the half-open disjoint intervals [a_k, b_k), a_1 < b_1 <
< a_2 < b_2 < . . . < a_r < b_r, let then be

μ(A) = \sum_{k=1}^{r} (F(b_k) − F(a_k)).     (1)

It is easy to see that μ(A) is a measure on ℛ; that is, μ(A) ≥ 0, and if A_n ∈ ℛ
(n = 1, 2, . . .) and the A_n are pairwise disjoint, while \sum_{n=1}^{\infty} A_n = A ∈ ℛ, then

μ(A) = \sum_{n=1}^{\infty} μ(A_n).

To prove this we first prove another important general theorem:

Theorem 3. A nonnegative additive set function μ(A) defined on a ring of
sets ℛ is a measure on ℛ iff for every sequence of sets B_n ∈ ℛ such that B_{n+1} ⊆
⊆ B_n, μ(B_n) < +∞ (n = 1, 2, . . .) and \prod_{n=1}^{\infty} B_n = O (i.e. for every decreasing
sequence of sets B_n having the empty set as their intersection) the relation

lim_{n → ∞} μ(B_n) = 0     (2)

holds.

The proof is simple. Indeed, if (2) is fulfilled and

A_n ∈ ℛ,     \sum_{n=1}^{\infty} A_n ∈ ℛ,

while A_n A_m = O for n ≠ m, then we have for every n

μ( \sum_{k=1}^{\infty} A_k ) = \sum_{k=1}^{n-1} μ(A_k) + μ( \sum_{k=n}^{\infty} A_k );

thus from the fact that the sets B_n = \sum_{k=n}^{\infty} A_k satisfy the conditions
B_n ∈ ℛ, B_{n+1} ⊆ B_n, \prod_{n=1}^{\infty} B_n = O it follows that lim_{n → ∞} μ(B_n) = 0; thus

μ( \sum_{k=1}^{\infty} A_k ) = \sum_{k=1}^{\infty} μ(A_k),

i.e. μ is completely additive. Conversely, if μ is completely additive, then
whenever B_n ∈ ℛ, B_{n+1} ⊆ B_n, \prod_{n=1}^{\infty} B_n = O hold, one has B_1 = \sum_{n=1}^{\infty} (B_n − B_{n+1}),
where B_n − B_{n+1} ∈ ℛ and (B_n − B_{n+1})(B_m − B_{m+1}) = O (n ≠ m).
Therefore we have

lim_{n → ∞} μ(B_n) = lim_{n → ∞} \sum_{k=n}^{\infty} μ(B_k − B_{k+1}) = 0.

Now we shall prove that the set function μ defined by (1) satisfies the con-
ditions of Theorem 3. Let therefore be B_n ∈ ℛ, B_{n+1} ⊆ B_n (n = 1, 2, . . .)
and \prod_{n=1}^{\infty} B_n = O. Then 0 ≤ μ(B_{n+1}) ≤ μ(B_n) (n = 1, 2, . . .), thus
lim_{n → ∞} μ(B_n) = c does exist and μ(B_n) ≥ c (n = 1, 2, . . .). We shall show that the as-
sumption c > 0 leads to a contradiction. In order to prove this we construct
from the set B_1, which consists of the intervals [a_{1i}, b_{1i}), another set B'_1 ⊆ B_1
consisting of a finite number of closed intervals obtained from every one
of the intervals [a_{1i}, b_{1i}) by removing a small open subinterval (b'_{1i}, b_{1i})
(a_{1i} < b'_{1i} < b_{1i}) such that the points b'_{1i} be points of continuity of F(x)
and the value of the sum \sum_i (F(b_{1i}) − F(b'_{1i})) be at most c/4. This procedure is
repeated with the set B'_1 B_2 (which may contain closed intervals as well be-
sides the half-open intervals) such that the sum of the increments of the func-
tion F(x) belonging to the removed intervals be at most c/8. In this way we
obtain a set B'_2 which consists of a finite number of closed intervals. By
continuing the procedure, we obtain a sequence of sets B'_n having the follow-
ing properties: a) B'_n consists of finitely many closed intervals; b) B'_{n+1} ⊆ B'_n;
c) \prod_{n=1}^{\infty} B'_n = O; d) the sum \sum_k (F(b'_{nk}) − F(a_{nk})) of the increments of the func-
tion F(x), taken over the intervals [a_{nk}, b'_{nk}] forming B'_n, is at least equal to c/2.
These properties are, however, contradictory, since from a), b), c) follows
the existence of a number N such that, for every n > N, B'_n is the empty set
(indeed, from B'_n = B'_1 B'_2 . . . B'_n ≠ O for every n, the sets B'_n being closed,
the relation \prod_{n=1}^{\infty} B'_n ≠ O would follow). But this contradicts d), and thus
we have proved our statement that the set function μ defined by (1) is a measure
on ℛ.
According to Theorem 2 the definition of the measure μ can be extended
to all Borel subsets of the real axis. Thus we obtained on these sets a measure
μ such that for A = [a, b) the relation μ(A) = F(b) − F(a) is valid.
Especially, if

F(x) = 0 for x < 0,     F(x) = x for 0 ≤ x ≤ 1,     F(x) = 1 for 1 < x,

then the above procedure assigns to every subinterval [a, b) of the interval
[0, 1] the value b − a.
We have seen that μ̄ is a complete measure determined on a σ-ring ℛ*,
which contains the σ-ring ℬ(ℛ). If F(x) has the special form mentioned
above, this measure is just the ordinary Lebesgue measure defined on the
interval [0, 1]. Any measure μ constructed by means of a function F(x)
satisfying the above conditions is called a Lebesgue-Stieltjes measure de-
fined on the real axis.
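The construction of μ on finite unions of half-open intervals is easy to mirror in code. The following sketch (illustrative only, not from the original text) computes μ(A) = Σ (F(b_k) − F(a_k)) for disjoint intervals [a_k, b_k), using the particular F given above, which yields ordinary Lebesgue measure on [0, 1].

    def F(x):
        # the special distribution function of the text: Lebesgue measure on [0, 1]
        if x < 0:
            return 0.0
        return min(x, 1.0)

    def mu(intervals):
        # measure of a finite union of disjoint half-open intervals [a, b)
        return sum(F(b) - F(a) for a, b in intervals)

    print(mu([(0.1, 0.3), (0.5, 0.6)]))     # 0.2 + 0.1 = 0.3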

The same construction can be applied in cases of more than one dimension.
Let F(x_1, x_2, . . ., x_n) be a function of the real variables x_1, x_2, . . ., x_n
having the following properties:
1. F(x_1, x_2, . . ., x_n) is in any one of its variables a non-decreasing func-
tion continuous from the left.
2. lim_{x_k → −∞} F(x_1, x_2, . . ., x_n) = 0 (k = 1, 2, . . ., n), and lim F(x_1, x_2, . . ., x_n) =
= 1, if every x_k (k = 1, 2, . . ., n) tends to +∞.

In order to formulate the third condition let us introduce the following
notation: Let Δ_h^{(k)} (k = 1, 2, . . ., n) denote the operation of taking the dif-
ference with respect to the variable x_k and with difference h; i.e. for a
function G(x_1, x_2, . . ., x_n) we put

Δ_h^{(k)} G(x_1, . . ., x_k, . . ., x_n) = G(x_1, . . ., x_k + h, . . ., x_n) − G(x_1, x_2, . . ., x_n).

Now we can formulate the third condition:

3. For any numbers h_k > 0 and any real values x_k (k = 1, 2, . . ., n) the
relation
Δ_{h_1}^{(1)} Δ_{h_2}^{(2)} . . . Δ_{h_n}^{(n)} F(x_1, x_2, . . ., x_n) ≥ 0
should hold.
Let I be an n-dimensional interval consisting of the points (x_1, x_2, . . ., x_n)
of the n-dimensional space satisfying the inequalities a_k ≤ x_k < b_k. Let
h_k = b_k − a_k and

μ(I) = Δ_{h_1}^{(1)} Δ_{h_2}^{(2)} . . . Δ_{h_n}^{(n)} F(a_1, a_2, . . ., a_n).     (3)

Let ℛ be the set of all subsets A of the n-dimensional space which can
be represented as the union of finitely many pairwise disjoint intervals
I_1, I_2, . . ., I_r. For A = \sum_{k=1}^{r} I_k we put

μ(A) = \sum_{k=1}^{r} μ(I_k).     (4)

It is readily seen by the same consideration as in the one-dimensional case
that the set function μ defined by (3) and (4) is a measure on the ring of sets
ℛ; thus μ can be extended to the σ-algebra formed by all Borel sets
of the n-dimensional space.
Especially, if

F(x_1, x_2, . . ., x_n) = 0 for min x_k < 0,  and  F(x_1, x_2, . . ., x_n) = \prod_{k=1}^{n} min(x_k, 1) for x_k ≥ 0 (k = 1, 2, . . ., n),

then the extension of the set function μ(A) defined above leads to the ordi-
nary n-dimensional Lebesgue measure defined on the n-dimensional cube
0 ≤ x_k < 1 (k = 1, 2, . . ., n).

§ 8. Conditional probabilities

In the preceding paragraphs we introduced probabilities by means of


the relative frequencies. Accordingly, in order to introduce the notion of
conditional probability we shall examine first conditional relative frequen¬
cies. If an event B occurs exactly n times in N trials and if among these n
trials the event A occurs k times together with the event B, then the quotient
k/n is called the conditional relative frequency of the event A with respect to
the condition B. The conditional relative frequency of an event A with re-
spect to the condition B in a sequence of trials is therefore equal to the simple
relative frequency of the event A in a subsequence of the sequence of trials
in question; this subsequence contains only those trials of the original
sequence in which the event B occurred. If f_B denotes the relative frequency
of B in the whole sequence of trials, then f_B = n/N; defining similarly f_{AB}, we
have f_{AB} = k/N. Finally, if f_{A|B} denotes the conditional relative frequency
of A with respect to the condition B, then, by definition, f_{A|B} = k/n. Thus

f_{A|B} = f_{AB} / f_B.

Since f_{AB} fluctuates around P(AB) and f_B around P(B), the conditional
relative frequency f_{A|B} will fluctuate for P(B) > 0 around P(AB)/P(B). This
number shall be called the conditional probability of the event A with
respect to the condition B; it is assumed that P(B) > 0. The notation for
the conditional probability is P(A|B); thus we put

P(A|B) = P(AB) / P(B).     (1)

By means of formula (1) the conditional probability of any event A of a


probability algebra with respect to any condition B can be calculated, pro-

vided that P(B) > 0. If P(B) = 0, formula (1) has no sense; the conditional
probability P(A | B) is thus defined only for P(B) > 0.1 Formula (1) may be
expressed in words by saying that the conditional probability of an event A
with respect to the condition B is nothing else than the ratio of the probability
of the joint occurrence of A and B and the probability of B.
Equality (1) is (in contrast to the standpoint of many older text-
books) neither a theorem nor an axiom; it is the definition of conditional
probability.2 But this definition is not arbitrary; it is a logical consequence
of the concept of probability as the number about which the value of the
relative frequency fluctuates.
In the older literature of probability theory as well as in some vulgariza¬
tions of modern physics one finds often the misleading formulation that the
probability of an event A changes because of the observation of the occur¬
rence of an event B. It is, however, obvious that P(A | B) and P(A) do not
differ because the occurrence of the event B was observed, but because of
the adjunction of the occurrence of event B to the originally given complex
of conditions.
Let us now state some examples.

Example 1. In the task of pebble-screening one may ask, what part of the
pebbles is small enough to pass through a sieve SA, i.e. what is the probability
of a pebble chosen at random to pass through the sieve SA. Let this event
be denoted by A. Assume now that the pebble was already sieved through
another sieve SB, and the pebbles which did not pass through the sieve
SB were separated. What is the probability that a pebble chosen at random
from those sieved through the sieve SB will pass through the sieve SA as
well ? Let B denote the event that a pebble passes through SB, the probability
of this event let be denoted by P(B). Let further AB denote the event that
a pebble passes through both SB and SA, and P(AB) the corresponding
probability. Then the probability that a pebble chosen at random from
those which passed SB will pass SA as well is, according to the above,

P(A | B) = P(AB) / P(B).

Example 2. Two dice are thrown, a red one and a white one. What is the
probability of obtaining two sixes, provided that the white die showed a six?

1 We give a more general definition in Chapter IV.


2 In view of certain applications it is advisable to generalize the system of axioms
of probability theory in such a way that the notion of conditional probability is the
fundamental notion (cf. § 11 of this Chapter").

This conditional probability is by definition

(1/36) / (1/6) = 1/6.
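The same value is obtained by direct enumeration of the 36 equally probable outcomes of the two dice (an illustrative sketch, not part of the original text):

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))        # (white, red) results
    B = [w for w in outcomes if w[0] == 6]                 # the white die shows a six
    AB = [w for w in outcomes if w[0] == 6 and w[1] == 6]  # both dice show a six
    print((len(AB) / 36) / (len(B) / 36))                  # P(A|B) = (1/36)/(1/6) = 1/6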
From a sheer mathematical point of view the conditional probability
P(A | B) may be considered as a new probability measure. Indeed, let Ω be
an arbitrary set, 𝒜 a σ-algebra of subsets of Ω, and P a probability
measure (i.e. a nonnegative, completely additive set function satisfying
P(Ω) = 1). Further let B be a fixed element of 𝒜 such that P(B) > 0. Then
P(A | B) is a probability measure on 𝒜 as well, i.e. P(A | B) satisfies Axioms
IV, V, and VI. Indeed, by introducing the notation P*(A) = P(A | B) we
have P*(A) ≥ 0 for A ∈ 𝒜, P*(Ω) = P(B)/P(B) = 1; further if A_n ∈ 𝒜
and A_n A_m = O for n ≠ m, then

P*( \sum_{n=1}^{\infty} A_n ) = P( B \sum_{n=1}^{\infty} A_n ) / P(B) = \sum_{n=1}^{\infty} P(A_n B) / P(B) = \sum_{n=1}^{\infty} P*(A_n).

Hence [Ω, 𝒜, P(A | B)] is again a Kolmogorov probability space. Thus


all theorems, proved for ordinary probabilities, remain valid, if the proba¬
bility of every event is replaced by the conditional probability of the same
event relative to some fixed event B (of positive probability).
If A and B are two events of positive probability one can consider besides
the conditional probability of A relative to B also the conditional probability
of B relative to A.
From the definition it follows readily that

P(B | A) = P(A | B) P(B) / P(A),     (2)

hence P(B | A) can be expressed by means of P(A | B), P(A), and P(B). One
can write (2) in the following form, equivalent to it:

P(B | A) / P(B) = P(A | B) / P(A).     (3)

Formula (1) can be generalized as follows: If A_1, A_2, . . ., A_n are arbitrary
events such that P(A_1 A_2 . . . A_{n−1}) > 0, we have

P(A_1 A_2 . . . A_n) = P(A_1) P(A_2 | A_1) P(A_3 | A_1 A_2) . . . P(A_n | A_1 A_2 . . . A_{n−1}).     (4)

This formula is immediately verified by expressing the conditional proba¬


bilities on the right hand side of (4) by means of (1).

§ 9. The independence of events

Let A and B be two events of a probability algebra; assume that P(A) > 0
and P(B) > 0. In the preceding paragraph the conditional probability
P(A | B) was defined. Generally it is different from P(A). If, however, it is
not, i.e. if
P(A | B) = P(A)     (1)

then we say that A is independent of B. If A is independent of B, then B is


independent of A as well; indeed, by Formulas (2) and (3) of the preceding
paragraph
P(B | A) = P(B).     (1′)

It is therefore permissible to say that A and B are independent of each


other. From Formula (1) of § 8 follows readily a definition of independence
of two events that is symmetrical in A and B. Indeed, because of the inde¬
pendence just defined we have

P(AB) = P(A) P(B). (2)

A and B being independent, (2) is valid; conversely, if (2) holds and P(A),
P(B) are both positive, then (1) and (1′) hold as well, thus A and B are inde-
pendent. Hence (2) is the necessary and sufficient condition of indepen-
dence, and thus it may serve as a definition as well. Old textbooks of probability
theory used to call relation (2) the product rule of probabilities. However,
according to the interpretation followed in this book (2) is not a theorem
but the definition of independence. (Since we take Formula (2) as the defi¬
nition of independence, any event A with P(A) = 0 or P(A) — 1 is inde¬
pendent of every event B.)
If A and B are independent, A and B̄ are independent as well. Namely
from (2) it follows that

P(AB̄) = P(A) − P(AB) = P(A) − P(A)P(B) = P(A)P(B̄).

Therefore the independence of A and B implies the independence of A
and B̄ and, similarly, that of Ā and B, further of Ā and B̄.
The independence of two complete systems of events is defined in the
following manner: The complete systems of events (A_1, A_2, . . ., A_m) and
(B_1, B_2, . . ., B_n) are said to be independent, if the relations

P(A_j B_k) = P(A_j) P(B_k)     (j = 1, 2, . . ., m; k = 1, 2, . . ., n)     (3)



are valid for them. It is easy to see that from the m·n conditions figuring in
(3) every one containing A_m or B_n can be omitted. If the remaining mn −
− (m + n − 1) = (m − 1)(n − 1) conditions are fulfilled, the omitted
ones are necessarily fulfilled too, as is seen from the relations

\sum_{k=1}^{n} P(A_j B_k) = P(A_j)     (j = 1, 2, . . ., m)     (4)

and

\sum_{j=1}^{m} P(A_j B_k) = P(B_k)     (k = 1, 2, . . ., n).     (5)

Indeed, if (3) is fulfilled for a given j and k = 1, 2, . . ., n − 1, then from
(4) it follows that (3) is fulfilled for k = n as well. Similarly, whenever (3)
holds for some k and for j = 1, 2, . . ., m − 1, it holds for j = m too.
If m = n = 2, we get again the result proved earlier that three of the
four conditions

P(AB) = P(A)P(B),
P(AB̄) = P(A)P(B̄),
P(ĀB) = P(Ā)P(B),
P(ĀB̄) = P(Ā)P(B̄)

are superfluous, since the validity of one implies necessarily the validity of
the remaining three. Thus the independence of the events A and B is equi¬
valent to the independence of the complete systems of events (A, Ā) and
(B, B̄). This follows also from the relation

P(AB̄) − P(A)P(B̄) = − (P(AB) − P(A)P(B))     (6)

valid for any two events A and B.


Example 1. For two tosses of a coin, four different outcomes are possible:
head-head, tail-head, head-tail, and tail-tail. Suppose that these possibilities

are all equally probable; then each will have the probability 1/4. We obtain
the same result by the concept of independence, assuming that head and
tail are equally probable at both tosses, both having probability 1/2, and that
the two tosses are independent of each other. Thus the probability of
each possibility is (1/2) · (1/2) = 1/4.

Let us now extend the concept of independence to more than two events.
If A, B, and C are pairwise independent (i.e. A and B, A and C, B and C
are independent) events of the same probability algebra, the non-existence
of any dependence between the events A, B, and C does not follow. This
may be seen from the following example.
Let us throw two dice; let A denote the event of obtaining an even number
with the first die, B the event of throwing an odd number with the second,
finally C the event of throwing either both even or both odd numbers. Then

P(A) = P(B) = P(C) = 1/2,

further

P(AB) = P(AC) = P(BC) = 1/4.

The events A, B, and C are therefore pairwise independent. Nevertheless,

P(ABC) = 0,
thus
P((AB)C) ≠ P(AB) · P(C),

i.e. AB is not independent of C.
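The computations of this example are easily verified by enumeration. The sketch below (illustrative, not part of the original text) checks that the product rule holds for every pair of the events A, B, C but fails for the triple.

    from itertools import product

    outcomes = list(product(range(1, 7), repeat=2))
    P = lambda S: len(S) / len(outcomes)
    A = {w for w in outcomes if w[0] % 2 == 0}              # first die even
    B = {w for w in outcomes if w[1] % 2 == 1}              # second die odd
    C = {w for w in outcomes if (w[0] + w[1]) % 2 == 0}     # both even or both odd

    print(P(A & B), P(A) * P(B))                  # 1/4 and 1/4: pairwise independent
    print(P(A & C), P(A) * P(C))                  # 1/4 and 1/4
    print(P(B & C), P(B) * P(C))                  # 1/4 and 1/4
    print(P(A & B & C), P(A) * P(B) * P(C))       # 0 and 1/8: not completely independent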


We shall say that A, B, and C are completely independent, if they are pair¬
wise independent and each of them is independent of the product of the
remaining two. Thus A, B and C are completely independent, if the relations

P(AB) = P(A)P(B),
P(AC) = P(A)P(C),
P(BC) = P(B)P(C),
P(ABC) = P(A)P(B)P(C)

are valid. The first three of these relations express the pairwise indepen¬
dence of A, B, and C, the fourth the fact that each of the events is inde¬
pendent of the product of the remaining two. Indeed, from the first three
conditions we have:

P(AB)P(C) = P(AC)P(B) = P(BC)P(A) = P(A)P(B)P(C).

The (complete) independence of more than three events may be defined


in a similar manner. The events A_1, A_2, . . ., A_n are said to be completely
independent, if for any k = 2, 3, . . ., n the relation

P(A_{i_1} A_{i_2} . . . A_{i_k}) = P(A_{i_1}) P(A_{i_2}) . . . P(A_{i_k})     (7)



is valid for any combination (i_1, i_2, . . ., i_k) from the numbers 1, 2, . . ., n.

Since from n objects one can choose k objects in \binom{n}{k} ways, (7) consists of

2^n − n − 1 conditions. In what follows, by saying for more than two events
that they are independent we shall mean that they are completely indepen¬
dent in the sense just defined. If only pairwise independence is meant this
will be stated explicitly. The independence of more than two complete sys¬
tems of events can be defined in a similar manner.
Combinatorial methods for the calculation of probabilities have already
been mentioned. They rested upon the assumption of the equiprobability
of certain events. By means of the concept of independence, however, this
assumption may often be reduced to more simple assumptions. Besides the
simplification of the assumptions, this reduction has the advantage that
the checking of the practical validity of our assumptions sometimes becomes easier.

Example 2. Sampling without replacement. An urn contains n different


objects, numbered somehow from 1 to n. We draw one after the other k
items without replacement. What is the probability that we obtain a given
combination of k elements? Clearly the number of possible combinations is $\binom{n}{k}$. It was supposed that all combinations are equally probable; the probability looked for is thus $\binom{n}{k}^{-1}$.
This result may also be obtained from the following simpler assumption:
at every drawing the conditional probability of drawing any object still in the
urn is the same. Here the probability that a given combination occurs in a
given order is $\frac{1}{n} \cdot \frac{1}{n-1} \cdots \frac{1}{n-k+1}$. Namely, at the first drawing there are in the urn n objects, the probability of choosing any one is $\frac{1}{n}$; at the second drawing the conditional probability of choosing any one of the n − 1 objects which are still in the urn is $\frac{1}{n-1}$, etc. Since the elements of the combination in question may be chosen from the urn in k! different orders, the obtained result must be multiplied by k! and thus we get that the probability of drawing a combination of k arbitrary elements is

$$\frac{k!}{n(n-1)\cdots(n-k+1)} = \binom{n}{k}^{-1}.$$
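For the reader who wishes to experiment, the following Python sketch (an added illustration; the values of n, k, the fixed combination and the number of trials are arbitrary) simulates drawings without replacement and compares the relative frequency of one fixed combination with $\binom{n}{k}^{-1}$.

```python
# Added illustration: sampling without replacement, frequency of one fixed combination.
import random
from math import comb

n, k, trials = 10, 3, 200_000
target = frozenset({1, 2, 3})                 # one particular combination of k elements
hits = sum(frozenset(random.sample(range(1, n + 1), k)) == target
           for _ in range(trials))

print(hits / trials, 1 / comb(n, k))          # the two numbers should be close
```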

Example 3. Sampling with replacement. An urn contains N balls, namely M red and N − M white. Let $p = \frac{M}{N}$. What is the probability that we obtain in n drawings k times a red ball, if the chosen ball is always replaced into the urn and the balls are again well mixed?

In every drawing the probability of choosing a certain ball is equal to $\frac{1}{N}$ and the outcomes of the individual drawings are independent. Hence the probability asked for is

$$W_k = \binom{n}{k} p^k (1 - p)^{n-k}. \tag{8}$$

Indeed, let $A_i$ denote the event of choosing a red ball at the i-th drawing (i = 1, 2, ..., n). These events are, because of the replacement, independent of each other. The probability that at the $i_1$-th, $i_2$-th, ..., $i_k$-th drawings a red and at all the other ($j_1$-th, $j_2$-th, ..., $j_{n-k}$-th) drawings a white ball will be chosen is nothing else than the probability of the event $A_{i_1} A_{i_2} \cdots A_{i_k} \bar{A}_{j_1} \bar{A}_{j_2} \cdots \bar{A}_{j_{n-k}}$.

As the events $A_{i_1}, A_{i_2}, \ldots, A_{i_k}, \bar{A}_{j_1}, \bar{A}_{j_2}, \ldots, \bar{A}_{j_{n-k}}$ are completely independent and $P(A_i) = p$, $P(\bar{A}_j) = 1 - p$, we get

$$P(A_{i_1} \cdots A_{i_k} \bar{A}_{j_1} \cdots \bar{A}_{j_{n-k}}) = p^k (1 - p)^{n-k}.$$

Since the order is irrelevant and only the number of red balls drawn is of interest, the value so obtained must still be multiplied by the number of the possible orderings, i.e. by $\binom{n}{k}$. Thus we obtain (8).

This result can immediately be generalized for experiments with more


than two possible outcomes. Let the possible outcomes in every experiment be $A^{(1)}, A^{(2)}, \ldots, A^{(r)}$; let their probabilities be denoted by $P(A^{(h)}) = p_h$ (h = 1, 2, ..., r). Of course we have $\sum_{h=1}^{r} p_h = 1$. Assume that in repeated performance of the experiment the outcomes of the individual experiments are independent of each other. Then the probability that in n repetitions of the experiment event $A^{(1)}$ occurs $k_1$ times, event $A^{(2)}$ $k_2$ times, ..., event $A^{(r)}$ $k_r$ times, is

$$W_{k_1 k_2 \ldots k_r} = \frac{n!}{k_1!\, k_2! \cdots k_r!}\, p_1^{k_1} p_2^{k_2} \cdots p_r^{k_r}, \tag{9}$$

where $\sum_{h=1}^{r} k_h = n$. For r = 2 Formula (9) reduces to (8).
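Formula (9) is easily evaluated numerically; the sketch below (an added illustration with arbitrarily chosen parameters) computes it and checks that for r = 2 it coincides with the binomial probability (8).

```python
# Added illustration: the multinomial probability (9) and its r = 2 special case (8).
from math import factorial, comb, prod

def multinomial(ks, ps):
    n = sum(ks)
    coeff = factorial(n) // prod(factorial(k) for k in ks)
    return coeff * prod(p ** k for k, p in zip(ks, ps))

n, k, p = 10, 4, 0.3                            # arbitrary parameters
print(multinomial([k, n - k], [p, 1 - p]))       # formula (9) with r = 2
print(comb(n, k) * p**k * (1 - p)**(n - k))      # formula (8); same value
```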

§ 10. “Geometric” probabilities

Let $\Omega$ be a measurable subset of the n-dimensional Euclidean space with positive, finite Lebesgue measure. Let further $\mathcal{A}$ be the set of all measurable subsets of $\Omega$ and $\mu(A)$ the n-dimensional Lebesgue measure of the measurable set A. Let P(A) be defined by

$$P(A) = \frac{\mu(A)}{\mu(\Omega)}. \tag{1}$$

It is easy to see from the results of § 7 that $[\Omega, \mathcal{A}, P]$ is a Kolmogorov probability space. In this probability space probabilities may be obtained by geometric determination of measures. Probabilities were thus calculated already in the Eighteenth Century.1
Some simple examples will be presented here.
Example 1. In shooting at a square target we assume that every shot hits
the target (i.e. we consider only shots with this property). Let the probability
that the bullet hits a given part of the target be proportional to the area of
the part in question. What is the probability that the hit lies in the part A ?
Clearly we only have to determine the factor of proportionality. If £2 de¬
notes the entire target, the probability belonging to it must be equal to 1.
Hence

$$P(A) = \frac{\mu(A)}{\mu(\Omega)},$$
where $\mu(\Omega)$ denotes the area of the entire target and $\mu(A)$ that of A. Thus for instance the probability of hitting the left lower quadrant of the target is equal to 1/4.

As seen from this example, not every subset of the sample space can be
considered as an event. Indeed, one cannot assign an event to every subset
of the target, since the “area”, as it is well known, cannot be defined for
every subset such that it is completely additive and that the areas of con¬
gruent figures are equal.
In general, the distribution of probability is said to be uniform, if the
probability that an object situated at random lies in a subset can be obtained
according to the definition (1) from a geometric measure p invariant under
displacement (e.g. volume, area, length of arc, etc.).
Example 2. A man forgot to wind up his watch and thus it stopped. What
is the probability that the minute hand stopped between 3 and 6 ? Suppose

1 Of course instead of Lebesgue measure the notion of the area (and volume) of
elementary geometry was applied.

the probability that the minute hand stops on a given arc of the circum¬
ference of the face of the watch is proportional to the length of the arc in
question. Then the probability asked for will be equal to the quotient of
the length of the arc in question, and the whole circumference of the face;
i.e. in our case to 1/4.
In the above two examples the determination of the probabilities was
reduced to the determination of the area or of the length of the arc
in certain geometric configurations. Though this method is intuitively
very convincing it is nevertheless a very special method. Before applying
it to further examples, let us see its relation to the already described combi¬
natorial method. This relation is most evident in Example 2. If we neglect
the fractions of the minutes and are looking for the probability that the
minute hand stops between the zeroth and the first, the first and second,. . .,
the A:-th and k + 1-th minute {k = 0, 1,. . .,59), then we have a sample
space consisting of 60 elementary events; the probability of every event is
1
the same, viz. In the case of the example of the target let us assume,
60“'
for sake of simplicity, that the sides of the square target are 1 m long. Let
us subdivide the target into $n^2$ congruent little squares with sides parallel to the sides of the target. The probability that a hit lies in a set which can be obtained as the union of a certain number of the little squares is obtained by dividing the number of the little squares in question by $n^2$. Thus we see that geometric probabilities can be approximately determined by a combinatorial method. We must not, however, restrict ourselves to some fixed n in the subdivision, for then we could not obtain the probability of a hit lying in a domain limited by a general curve. If the mentioned subdivision is performed for every n however large, then the probability of measurable sets, or to be more precise, of every domain having an area in the sense of Jordan, can be calculated by means of limits. For this calculation we have to consider the quotient $\frac{k_n}{n^2}$, where $k_n$ means the number of small squares lying in the domain if the large square is subdivided into $n^2$ congruent small squares, and we have to determine the limit of $\frac{k_n}{n^2}$ for $n \to \infty$.
Probabilities obtained in a combinatorial way (without passing to the
limit) are always rational numbers; geometric probabilities, however, may
assume any value between 0 and 1. Thus for instance the probability that the hit lies in the circle inscribed into the square target is equal to $\frac{\pi}{4}$.

The reduction of the calculation of geometric probabilities to combina¬


torial considerations is nowadays of historical interest only. Namely it was
thought for a long time that the classical definition of probability suffices to
establish the calculus of probability. From the point of view of the modern

Fig. 9

theory, however, such a reduction is unnecessary (except in the case when a


simplification of the calculations is brought about by it).
In what follows we continue to consider some further examples. Let us
first deal with the so-called Bertrand paradox.
Example 3. Take a circle and choose at random a chord of it. What is
the probability that this chord will be longer than the side of the regular
triangle inscribed into the circle? The difficulty lies in the fact that it is not
clear, what is meant by the expression that we choose a chord “at random”.
Each of the following interpretations seems to be more or less natural.
Interpretation 1. Since the length of a chord is uniquely determined by the
position of its midpoint, we can accomplish the random choice of the chord
by choosing at random a point in the interior of the circle and construct
the chord whose midpoint is the chosen point. The probability that the
point lies in a domain will be assumed to be proportional to the area of
the domain. Clearly the chord will be longer than the side of the inscribed
regular triangle, if the midpoint of the chord lies inside the circle drawn
about the centre of the original circle with half of its radius (cf. Fig. 9); hence the answer is

$$\frac{\left(\frac{r}{2}\right)^2 \pi}{r^2 \pi} = \frac{1}{4}.$$

Interpretation 2. The length of the chord is uniquely determined by the


distance of its midpoint from the centre of the circle. In view of the sym-

metry of the circle we may assume that the midpoint of the chord lies on a
fixed radius of the circle and choose the midpoint of the chord so that the
probability that it lies in a given segment of this fixed radius is assumed to
be proportional to the length of this segment. The chord will be longer

than the side of the inscribed regular triangle, if its midpoint has a distance
less than $\frac{r}{2}$ from the centre of the circle; the answer is thus $\frac{1}{2}$ (cf. Fig. 10).
Interpretation 3. Because of the symmetry of the circle one of the end¬
points of the chord may be fixed, for instance in the point P0; the other end¬
point can be chosen on the circle at random. Let the probability that this
other endpoint P lies on an arbitrary arc of the circle be proportional to the
length of this arc. The regular triangle inscribed into the circle having for
one of its vertices the fixed point P0 divides the circumference into three
equal parts. A chord drawn from the point P0 will be longer than the side
of the triangle, if its other endpoint lies on that one-third part of the circum¬
ference which is opposite to point P0. Since the length of this latter is one
third of the circumference, the answer is, according to this interpretation, $\frac{1}{3}$.

From a well-known theorem of the elementary geometry concerning the


central and peripheral angles it follows that the third interpretation is equi¬
valent to the statement that the probability distribution of the intersection
point of the chord and the semicircle of centre P0 is uniform on this semi¬
circle (Fig. 11).
Obviously, all interpretations discussed above can be realized in physical
experiments. The example seemed once a paradox, because one did not
pay attention to the fact that the three interpretations correspond to dif¬
ferent experimental conditions concerning the random choice of the chord

and these of course lead to different probability measures, defined on the


same algebra of events. The obtained measure in the set of straight lines is,
however, invariant with respect to motions of the plane only in the second
interpretation,1 in the other two interpretations congruent sets of lines do
not necessarily have equal measure.

Fig. 11
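The three interpretations correspond to three different sampling procedures, which can be imitated on a computer. The following Monte Carlo sketch (an added illustration; the unit radius and the number of trials are arbitrary assumptions) reproduces the three answers 1/4, 1/2 and 1/3.

```python
# Added illustration: the three interpretations of Bertrand's problem by simulation.
import random, math

TRIALS = 100_000
SIDE = math.sqrt(3.0)          # side of the regular triangle inscribed in a unit circle

def chord_length_1():          # interpretation 1: midpoint uniform in the disk
    while True:
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        d2 = x * x + y * y
        if d2 <= 1:
            return 2 * math.sqrt(1 - d2)

def chord_length_2():          # interpretation 2: midpoint uniform on a fixed radius
    d = random.uniform(0, 1)
    return 2 * math.sqrt(1 - d * d)

def chord_length_3():          # interpretation 3: second endpoint uniform on the circle
    phi = random.uniform(0, 2 * math.pi)
    return 2 * math.sin(phi / 2)

for f, exact in [(chord_length_1, 1/4), (chord_length_2, 1/2), (chord_length_3, 1/3)]:
    freq = sum(f() > SIDE for _ in range(TRIALS)) / TRIALS
    print(f.__name__, round(freq, 3), "expected about", exact)
```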

Example 4. Decompose a unit segment into three subsegments by two


points chosen at random. What is the probability that a triangle can be
constructed from the three segments ? Clearly we have to examine the proba¬
bility that any one of the three segments is less than the sum of the remain¬
ing two. Compared with the above, the example contains something new,
as here we have to choose two points at random. However, the problem
can be reduced readily to one similar to those dealt with above. Indeed, let the segment in question be the interval (0, 1) and the abscissas of the two
points chosen at random be x and y. To these two points there corresponds
a point of the plane with abscissa x and ordinate y. Thus there corresponds
to any decomposition of the unit interval (0, 1) into three segments one
point of the unit square of the plane and conversely. Now let the random
choice of the two points on the interval (0, 1) be performed in such a manner
that “the probability that the point representing the decomposition in ques¬
tion lies in a domain A of the unit square” be equal to the area of that do¬
main (not only proportional, since the area of the unit square is 1). In this

1 Cf. W. Blaschke [1 ].

case we only need to compute the area of the domain determined by the
inequalities (Fig. 12)

$$0 < x < \frac{1}{2} < y < 1 \quad \text{and} \quad y - x < \frac{1}{2}$$
or
$$0 < y < \frac{1}{2} < x < 1 \quad \text{and} \quad x - y < \frac{1}{2}.$$

Fig. 12
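The area of the two triangular domains of Fig. 12 is 1/4, and this value can also be approached by simulation; the sketch below (an added illustration, with an arbitrary number of trials) chooses the two points at random and counts how often the three pieces satisfy the triangle inequality.

```python
# Added illustration: the broken-stick problem, triangle frequency close to 1/4.
import random

TRIALS = 200_000
count = 0
for _ in range(TRIALS):
    x, y = random.random(), random.random()
    a, b = min(x, y), max(x, y)
    p1, p2, p3 = a, b - a, 1 - b              # the three pieces
    if p1 < 0.5 and p2 < 0.5 and p3 < 0.5:    # each piece shorter than the sum of the others
        count += 1
print(count / TRIALS)                          # close to 0.25
```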

The method just applied is often used, for instance in statistical physics.
Here, to every state of the physical system a point of the “phase space”
may be assigned, having for its coordinates the characterizing data of the
state in question. Accordingly, the phase space has as many dimensions,
as the state of the system has data to characterize it (the so-called degrees
of freedom of the system). In our example we assigned a point of the phase
space to a decomposition of the (0,1) interval by two points; the degree of
freedom of the “system” is here equal to 2. The analogy can be made still
more obvious by assigning to the decomposition of the (0,1) interval a
physical system: two mass points moving in the interval (0,1).
Clearly the phase space may be chosen in many ways; by solving problems
of probability in this way, however, one must not forget to verify in every
given case separately the assumption that the probabilities belonging to
the subdomains of the phase space are proportional to the area (volume).
Finally we shall discuss here a classical example, Buffon's needle problem
(1777).

Example 5. A plane is partitioned with equidistant parallel lines of dis¬


tance d into parallel strips of equal width. A needle of length l is thrown at random upon the plane. What is the probability that the needle intersects a line? For the sake of brevity suppose l < d; in this case the needle can inter-
sect no more than one of the parallel lines. The problem may then be solved
as follows. The position of the needle on the plane may be characterized by
three data: by the two coordinates of its midpoint and by the angle between
the needle and the direction of the lines. Let the coordinate of the needle’s
midpoint perpendicular to the direction of the lines be denoted by x, that
in the direction of the lines by y. Whether the needle does or does not inter¬
sect a line depends only on the coordinate x and the angle. So the coordinate
y may be disregarded. Or, what comes to the same thing, we may draw a
perpendicular to the parallels and assume that the midpoint of the needle
lies always on it. If we take the origin of the coordinate system on one of
the parallel lines, it may be assumed that the midpoint of the needle lies in
the first parallel strip, since the strips are all of equal width d. Let $\varphi$
denote the angle between the needle and the positive x-axis. The position
of the needle is then characterized in the rectangular coordinate system $(x, \varphi)$ (in the “phase space”) by one point and it can be assumed that this point lies in the rectangle $(0 < x < d,\ 0 < \varphi < \pi)$. The probability that the point $(x, \varphi)$ lies in an arbitrary domain of this rectangle is assumed to
be proportional to the area of this domain. Loosely speaking, we assume
that “all positions of the midpoint of the needle are equally probable and
all directions of the needle are equally probable”.
Fixing now the value of $\varphi$, the needle intersects the line x = 0 if $0 < x < \frac{l}{2}\sin\varphi$, and the line x = d if $d - \frac{l}{2}\sin\varphi < x < d$. Thus the needle intersects the line x = 0, if and only if the point $(x, \varphi)$ characterizing the position of the needle lies to the left of the sine curve drawn over the line x = 0 with an amplitude $\frac{l}{2}$, and the line x = d, if the characterizing point lies to the right of the sine curve drawn over the line x = d with the same amplitude (Fig. 13).
Since the area under a half-wave of a sine curve of amplitude $\frac{l}{2}$ is equal to l, the area of the domain formed by the points which correspond to intersection will be 2l. The area of the whole rectangle being $\pi d$, the sought probability is $\frac{2l}{\pi d}$. Thus in many repetitions of Buffon's experiment one will find intersection in approximately a fraction $\frac{2l}{\pi d}$ of the experiments. It was tried more than once to determine approximately the value of $\pi$ by this

method. Since, however, the assumptions (especially the equiprobability of


the directions) are quite difficult to realize, even many thousands of experi¬
ments give the value of n only to a few digits. In principle, however, nothing
prevents us to carry out an experiment in which our assumptions about
position and direction of the needle are very nearly satisfied and thus to de¬

termine the value of n with any prescribed precision. Of course this would
have no practical importance, since there are more straightforward and
reliable methods to compute the value of n. Still the question is of great
interest, since it shows that certain mathematical problems can be solved
approximately by performing experiments of a probabilistic nature. Now¬
adays difficult differential equations and other problems of numerical anal¬
ysis are treated in this manner (this is the so-called Monte Carlo method).
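Buffon's experiment itself is easily imitated in this spirit; the following Monte Carlo sketch (an added illustration; the values d = 2, l = 1 and the number of throws are arbitrary assumptions) estimates the intersection probability $\frac{2l}{\pi d}$ and, by inverting the formula, the value of $\pi$.

```python
# Added illustration: Monte Carlo version of Buffon's needle experiment.
import random, math

d, l, trials = 2.0, 1.0, 1_000_000
crossings = 0
for _ in range(trials):
    x = random.uniform(0, d)                  # coordinate of the midpoint across the strip
    phi = random.uniform(0, math.pi)          # angle between needle and the lines
    half_proj = (l / 2) * math.sin(phi)
    if x < half_proj or x > d - half_proj:    # needle reaches the line x = 0 or x = d
        crossings += 1

freq = crossings / trials
print(freq, 2 * l / (math.pi * d))            # the two values should be close
print("pi estimated as", 2 * l / (freq * d))  # inverting the formula
```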
Questions dealt with in this paragraph are closely connected to integral
geometry.

§11. Conditional probability spaces

The axiomatic foundation of probability theory given in 1933 by A. N.


Kolmogorov was of paramount importance and since then it furnishes the
very basis of this branch of mathematics. There are, however, problems ei¬
ther entirely outside of its range or leading in this theory to serious compli¬
cations. In physics (e.g. in quantum mechanics) and in some parts of proba¬
bility theory (especially in the theory of Markov chains and stochastic
processes) as well as in applications to number theory and integral geometry,
one often has to deal with so-called unbounded distributions, i.e. unbounded
measures. The use of unbounded measures, however, cannot be justified in
the theory of Kolmogorov. For instance one cannot speak (in the sense of

the preceding paragraph) about a uniform probability distribution in the


whole Euclidean space. Similarly, it is nonsense to speak about the random
choice of an integer such that every integer has the same probability to be
chosen. At the first glance it might seem that this difficulty cannot be over¬
come, since the value of probability can never exceed 1. In spite of this,
one can obtain by means of these unbounded, that is “nonsensical”, distri¬
butions conditional probabilities which are in agreement with experience.
Thus the necessity arose to generalize the theory of probability in a way
justifying the use of such distributions.1
One can indeed give an axiomatic theory of probability which matches
the above-mentioned requirements.2
This theory contains the theory of Kolmogorov as a special case. The
fundamental concept of the theory is that of conditional probability; it
contains cases where ordinary probabilities are not defined at all.
We start from the following definitions and axioms:
Let there be given a set $\Omega$ (called the space of elementary events) and let $\mathcal{A}$ denote a $\sigma$-algebra of subsets of $\Omega$. The elements A, B, ... etc. of $\mathcal{A}$ are called events. The set $\Omega - A$ will be denoted by $\bar{A}$. Let further $\mathcal{B}$ be a nonempty system of sets such that $\mathcal{B} \subseteq \mathcal{A}$. We assume that a set function $P(A \mid B)$ of two set variables is defined for $A \in \mathcal{A}$ and $B \in \mathcal{B}$. $P(A \mid B)$ will be called the conditional probability of the event A with respect to the condition B.
We postulate the following axioms:

a) $P(A \mid B) \ge 0$ and $P(B \mid B) = 1$ ($A \in \mathcal{A}$, $B \in \mathcal{B}$).

b) For any fixed $B \in \mathcal{B}$, $P(A \mid B)$, as a function of A, is a measure on $\mathcal{A}$; i.e. if $A_n \in \mathcal{A}$ and $A_n A_m = \emptyset$ for $n \ne m$, we have
$$P\left(\sum_{n=1}^{\infty} A_n \;\Big|\; B\right) = \sum_{n=1}^{\infty} P(A_n \mid B).$$

c) If $A \in \mathcal{A}$, $B \subseteq C$, $B \in \mathcal{B}$, $C \in \mathcal{B}$ and $P(B \mid C) > 0$, then
$$P(A \mid B) = \frac{P(AB \mid C)}{P(B \mid C)}.$$

If the Axioms a), b), and c) are satisfied, we shall call the system $[\Omega, \mathcal{A}, \mathcal{B}, P(A \mid B)]$ a conditional probability space.

1 The history of mathematics shows that on several occasions procedures, successful


e.g. in physical applications but inexact in a mathematical sense of the word, were
made exact later on by an extension of the mathematical notions involved.
2 Rényi, A. [14], [15], [18]. The idea of such a theory is due to Kolmogorov himself;
he, however, did not publish anything about it.

If $P^*(A)$ is a measure defined on $\mathcal{A}$ and $P^*(\Omega) = 1$ (that is, if $[\Omega, \mathcal{A}, P^*]$ is a Kolmogorov probability field), further if $\mathcal{B}^*$ denotes the collection of all sets $B \in \mathcal{A}$ such that $P^*(B) > 0$, then, as it is easy to see, the system $[\Omega, \mathcal{A}, \mathcal{B}^*, P^*(A \mid B)]$ is a conditional probability space, provided $P^*(A \mid B)$ is defined by
$$P^*(A \mid B) = \frac{P^*(AB)}{P^*(B)}.$$
$[\Omega, \mathcal{A}, \mathcal{B}^*, P^*(A \mid B)]$ will be called the conditional probability space generated by the Kolmogorov probability space $[\Omega, \mathcal{A}, P^*]$.
We shall prove some simple theorems which follow directly from our
axioms.

Theorem 1. For $A \in \mathcal{A}$ and $B \in \mathcal{B}$ we have
$$P(A \mid B) = P(AB \mid B).$$

Proof. The statement follows from Axioms a) and c) by substitution of


C = B.

Theorem 2. For $A \in \mathcal{A}$ and $B \in \mathcal{B}$ we have
$$P(A \mid B) \le 1.$$

Proof. According to Theorem 1 and Axiom b) we have $P(A \mid B) \le P(B \mid B)$. Our statement follows by Axiom a).

Theorem 3. For $B \in \mathcal{B}$ we have $P(\emptyset \mid B) = 0$.

Proof. The statement is evident because of Axiom b).

Theorem 4. If $A \in \mathcal{A}$, $B \in \mathcal{B}$ and $AB = \emptyset$, then $P(A \mid B) = 0$.

Proof. The statement follows from Theorems 1 and 3.

Theorem 5. For $B \in \mathcal{B}$ we have $P(\Omega \mid B) = 1$.

Proof. The statement is obvious because of Axiom a) and Theorem 1.

Theorem 6. If for fixed $C \in \mathcal{B}$ we put $P_C^*(A) = P(A \mid C)$, the system $[\Omega, \mathcal{A}, P_C^*]$ is a Kolmogorov probability space. If B is an element of $\mathcal{A}$ such

that $BC \in \mathcal{B}$ and $P_C^*(B) > 0$, further if $P_C^*(A \mid B)$ is, as usual, defined by
$$P_C^*(A \mid B) = \frac{P_C^*(AB)}{P_C^*(B)},$$
we have
$$P_C^*(A \mid B) = P(A \mid BC).$$

Proof. The first statement of the theorem is evident since $P_C^*$ is a measure on $\mathcal{A}$ and $P_C^*(\Omega) = 1$. The second statement follows from Axiom c); indeed we have by Theorem 1
$$P_C^*(A \mid B) = \frac{P_C^*(AB)}{P_C^*(B)} = \frac{P(AB \mid C)}{P(B \mid C)} = \frac{P(ABC \mid C)}{P(BC \mid C)} = P(A \mid BC).$$

Theorem 7. Suppose $\Omega \in \mathcal{B}$ and put $P^*(A) = P(A \mid \Omega)$. Then $[\Omega, \mathcal{A}, P^*]$ is a Kolmogorov probability space. Further, if $B \in \mathcal{B}$ and $P^*(B) > 0$, we have
$$P(A \mid B) = \frac{P^*(AB)}{P^*(B)}.$$

Remark. $\mathcal{B}$ may contain sets B such that $P^*(B) = 0$. On the other hand, sets B for which $P^*(B) > 0$ may not belong to $\mathcal{B}$. Hence $[\Omega, \mathcal{A}, \mathcal{B}, P(A \mid B)]$ is not necessarily identical to the conditional probability space generated by the Kolmogorov probability space $[\Omega, \mathcal{A}, P(A \mid \Omega)]$, not even in the case $\Omega \in \mathcal{B}$.

Proof. Theorem 7 is a special case of Theorem 6.

From the theorems proved above one readily sees how the generalized
theory of probability can be deduced from our axioms.
Let us mention here some further examples.
Example 1. Let $\Omega$ be the n-dimensional Euclidean space; let the points of $\Omega$ be denoted by $\omega = (\omega_1, \omega_2, \ldots, \omega_n)$. Let $\mathcal{A}$ denote the class of all measurable subsets of $\Omega$, let further $f(\omega)$ be a nonnegative, measurable function defined on $\Omega$ and $\mathcal{B}$ the set of all measurable sets B such that $\int_B f(\omega)\,d\omega$ is finite and positive. Put
$$P(A \mid B) = \frac{\int_{AB} f(\omega)\,d\omega}{\int_B f(\omega)\,d\omega}.$$
$[\Omega, \mathcal{A}, \mathcal{B}, P(A \mid B)]$ is then a conditional probability space. If $\int_\Omega f(\omega)\,d\omega < +\infty$, a conditional probability space generated by a Kolmogorov probability space is obtained; if, however, $\int_\Omega f(\omega)\,d\omega = +\infty$, this is not the case. Especially when $f(\omega) \equiv 1$, we obtain the uniform probability distribution in the whole n-dimensional space. In this case
$$P(A \mid B) = \frac{\mu_n(AB)}{\mu_n(B)},$$
where $\mu_n(C)$ denotes the n-dimensional Lebesgue measure of the set C.


Example 2. Let $\Omega$ be the set of the natural numbers, $\mathcal{A}$ the class of all subsets of $\Omega$, further $p_n$ (n = 1, 2, ...) a sequence of arbitrary nonnegative numbers not all equal to 0; let $\mathcal{B}$ denote the set of those subsets B of $\Omega$ for which $\sum_{n \in B} p_n$ is positive and finite. Let $\sum_{n \in A} p_n$ be denoted by r(A) for $A \in \mathcal{A}$ and put
$$P(A \mid B) = \frac{r(AB)}{r(B)}.$$
Clearly $[\Omega, \mathcal{A}, \mathcal{B}, P(A \mid B)]$ is a conditional probability space. It is generated by a Kolmogorov probability space if and only if the series $\sum_{n=1}^{\infty} p_n$ is convergent.

Especially when $p_n = 1$ (n = 1, 2, ...),
$$P(A \mid B) = \frac{\sum_{n \in AB} 1}{\sum_{n \in B} 1}$$
is equal to the ratio of the number of elements of the set AB and the set B.1
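As a small added illustration of this special case (not part of the text): although no ordinary probability can be attached to the “uniform choice of a natural number”, conditional probabilities such as “divisible by 3, given divisible by 2 and at most N” are perfectly meaningful ratios of counts, as the following Python sketch shows.

```python
# Added illustration: conditional probabilities as ratios of counts (p_n = 1).
def cond_prob(A, B, N):
    """P(A | B) restricted to {1, ..., N}, computed as |AB| / |B|."""
    Bn = [m for m in range(1, N + 1) if B(m)]
    ABn = [m for m in Bn if A(m)]
    return len(ABn) / len(Bn)

divisible_by_3 = lambda m: m % 3 == 0
divisible_by_2 = lambda m: m % 2 == 0
for N in (100, 10_000, 1_000_000):
    print(N, cond_prob(divisible_by_3, divisible_by_2, N))   # tends to 1/3 as N grows
```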
Evidently the question arises how conditional probabilities are connected
with relative frequencies, i.e. whether the generalized theory does have a
frequency-interpretation too.
The answer is affirmative and even very simple. The conditional proba¬
bility P(A | B) can be interpreted in the generalized theory (as well as in the
theory of Kolmogorov) as the number about which the relative frequency of
A with respect to the condition B fluctuates. Thus the generalized theory
has the same relation to the empirical world as Kolmogorov’s theory.

1 In both cases, $P(A \mid B)$ could have been represented as the ratio $\frac{\mu(AB)}{\mu(B)}$, where $\mu$ is an unbounded measure. (With respect to the conditions for the existence of such measures cf. A. Császár [1] and A. Rényi [18].)

§ 12. Exercises

1. Let $p_1, p_2, p_{12}$ be given real numbers. Prove that the validity of the four inequalities below is necessary and sufficient for the existence of two events A and B such that $P(A) = p_1$, $P(B) = p_2$, $P(AB) = p_{12}$:

$$1 - p_1 - p_2 + p_{12} \ge 0, \tag{1}$$
$$p_1 - p_{12} \ge 0, \tag{2}$$
$$p_2 - p_{12} \ge 0, \tag{3}$$
$$p_{12} \ge 0. \tag{4}$$

Hint. On the left hand sides of the inequalities (1)–(4) we have the probabilities of $\bar{A}\bar{B}$, $A\bar{B}$, $\bar{A}B$, and AB. Of course they must be nonnegative, thus the conditions are necessary. Their sufficiency can be shown as follows: from (1)–(4) it is clear that
$$0 \le p_{12} \le p_1 \le p_1 + p_2 - p_{12} \le 1$$
and similarly
$$0 \le p_{12} \le p_2 \le p_1 + p_2 - p_{12} \le 1.$$
The numbers $p_1, p_2, p_{12}$ are therefore nonnegative and do not exceed 1.

Consider the interval I = (0, 1) and suppose that a random point P is uniformly distributed in this interval; i.e. let the probability that P lies in a subinterval of I be equal to the length of this subinterval. Let A denote the event that the point lies in the interval $0 < x < p_1$, and B that it lies in the interval $p_1 - p_{12} < x < p_1 + p_2 - p_{12}$. Then we have $P(A) = p_1$, $P(B) = p_2$, $P(AB) = p_{12}$.

2. Generalize the assertion of Exercise 1 to n events (ji = 3, 4, . . .).

3. Examine how the conditions of Exercise 2 can be simplified if we assume that $p_{i_1 i_2 \ldots i_k} = P(A_{i_1} A_{i_2} \cdots A_{i_k})$ ($1 \le i_1 < i_2 < \ldots < i_k \le n$) depends only on k (k = 1, 2, ..., n).

4. How can the conditions of Exercise 2 be simplified if we assume that for every k = 2, 3, ..., n
$$p_{i_1 i_2 \ldots i_k} = p_{i_1} p_{i_2} \cdots p_{i_k}?$$

5. Let $A_1, A_2, \ldots, A_n$ be any n events and suppose that the probabilities $P(A_{i_1} A_{i_2} \cdots A_{i_k})$ ($1 \le k \le n$, $1 \le i_1 < i_2 < \ldots < i_k \le n$) are known. Find the probability that at least k of the n events $A_1, A_2, \ldots, A_n$ will occur.

6. Prove the inequality
$$P(A \mathbin{\Delta} C) \le P(A \mathbin{\Delta} B) + P(B \mathbin{\Delta} C).$$

Remark. If we define the “distance” d(A, B) of the events A and B as the probability $P(A \mathbin{\Delta} B)$, then we have the “triangle inequality” $d(A, C) \le d(A, B) + d(B, C)$.

7. If the distance of A and B is defined as
$$d^*(A, B) = \begin{cases} \dfrac{P(A \mathbin{\Delta} B)}{P(A + B)} & \text{for } P(A + B) > 0, \\ 0 & \text{otherwise,} \end{cases}$$
then the triangle inequality is again valid.



8. What is the probability that in n throws of a die the sum of the obtained numbers is equal to k?

Hint. Determine the coefficient of $x^k$ in the expansion of the generating function $(x + x^2 + x^3 + x^4 + x^5 + x^6)^n$.
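One possible way to carry out this hint numerically (a sketch added here, not the book's solution) is to expand the generating function step by step and read off the coefficient of $x^k$:

```python
# Added illustration: coefficient extraction from (x + x^2 + ... + x^6)^n.
from fractions import Fraction

def prob_sum_of_dice(n, k):
    coeffs = [1]                                  # the polynomial 1
    for _ in range(n):                            # multiply by x + x^2 + ... + x^6
        new = [0] * (len(coeffs) + 6)
        for i, c in enumerate(coeffs):
            for face in range(1, 7):
                new[i + face] += c
        coeffs = new
    favourable = coeffs[k] if k < len(coeffs) else 0
    return Fraction(favourable, 6 ** n)

print(prob_sum_of_dice(3, 11), prob_sum_of_dice(3, 12))   # 27/216 and 25/216
```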

9. What is the probability that the sum of the numbers thrown is larger than 10
in a throw with three dice ?

Remark. This was the condition of gain in the “passe-dix” game which was current
in the Seventeenth Century.

10. What is more probable: to get at least one six with four dice or at least one
double six in 24 throws of two dice? (Chevalier de Mere’s problem.)
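For orientation, the two probabilities in question are $1 - (5/6)^4$ and $1 - (35/36)^{24}$; the following added snippet simply evaluates them.

```python
# Added illustration: de Mere's problem, the two probabilities compared.
p_one_six_in_4  = 1 - (5 / 6) ** 4         # at least one six in four throws of a die
p_double_six_24 = 1 - (35 / 36) ** 24      # at least one double six in 24 throws of two dice
print(p_one_six_in_4, p_double_six_24)     # about 0.518 versus 0.491
```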

11. In a party of n married couples everybody dances. Every gentleman dances


with every one of the ladies with the same probability. What is the probability that
nobody dances with his own wife? Find the limit of this probability for $n \to \infty$.

12. An urn contains n white and m red balls, $n \ge m$; balls are drawn from the urn
at random without replacement. What is the probability that at some instant the
numbers of white and red balls drawn are equal?

13. There is a queue of 100 men before the box-office of an exhibition. One ticket
costs 1 shilling. 60 of the men in the queue have only 1 shilling coins, 40 only 2 shilling
coins. The cash contains no money at the start. What is the probability that tickets
can be sold without any trouble (i.e. that never comes a man, having only 2 shilling
coins, before the cash desk at a moment when the latter contains no 1 shilling coin) ?

14. A particle moves along the x-axis with unit velocity. If it reaches a point with
integer abscissa it has one of two equiprobable possibilities: either it continues to
proceed or it turns back. Suppose that at the moment t = 0 the particle was in the point x = 0. Find the probability that at time t the particle will have a distance x from the origin (t is a positive integer, x an arbitrary integer).
15. Let the conditions of Exercise 14 be completed by the following: at the point with abscissa $x_0$ (a positive integer) there is an absorbing wall; if the particle arrives at the point of abscissa $x_0$ it will be absorbed and does not continue its movement. Answer the question of the preceding exercise for $x \le x_0$.
16. A box contains M red and N white balls which are drawn one after the other
without replacement. Let $P_k$ denote the probability that the first red ball will be drawn at the k-th drawing. Since there are N white balls, clearly $k \le N + 1$ and thus $P_1 + P_2 + \ldots + P_{N+1} = 1$. By substituting the explicit expression of $P_k$ we obtain an identity. How can this identity be proved directly, without using probability theory?
17. Let us place eight rooks at random on a chessboard. What is the probability
that no rook can take another ?
Hint. One has to count the number of ways in which 8 rooks can be placed on a
chessboard so that in every row and in every column there is exactly one rook.
18. Put
$$P_k(M, N) = \frac{\binom{M}{k}\binom{N-M}{n-k}}{\binom{N}{n}}$$
and
$$W_k = \binom{n}{k} p^k (1-p)^{n-k} \qquad (k = 0, 1, \ldots, n),$$
where $p = \frac{M}{N}$. Prove that if M and N tend to infinity so that $\frac{M}{N} = p$ remains constant, then $P_k(M, N)$ tends to $W_k$.

19. Put
$$Q_k(M, N) = \frac{\binom{N-M}{k-1}(k-1)!\,M\,(N-k)!}{N!}$$
and
$$V_k = p(1-p)^{k-1} \qquad (k = 1, 2, \ldots),$$
where $p = \frac{M}{N}$. Show that if M and N tend to infinity so that $\frac{M}{N} = p$ remains constant, then $Q_k(M, N)$ tends to $V_k$ (cf. § 5, Example 1b). Estimate the error $|Q_k(M, N) - V_k|$.

20. How many raisins are to be put into 20 ozs of dough in order that the proba¬
bility is at least 0.99 that a cake of 1 oz contains at least one raisin?

21. The amount of water in a container may be determined as follows: a certain


amount a of soluble stain is solved in 1 gallon of water taken from the container
and the stained water is replaced. After perfect mixing 1 gallon water is taken again
and its stain content determined. If the latter number is x and the mixing is supposed

to be perfect, the vessel contains $\frac{a}{x}$ gallons of water. Similarly, the number of the fishes in a pond may be determined as follows: 100 fishes are caught, marked (e.g. by rings) and replaced into the pond. After an interval of some days 100 fishes are caught again and the marked ones are counted. If their number is x > 0, then the pond contains about $\frac{100 \cdot 100}{x}$ fishes. If the pond contains 10 000 fishes, what is the probability that among the fishes of the second catch the number of marked fishes is 0, 1, 2, or 3?

22. A stick is broken at a random point and the longest part is again broken at
random. What is the probability that a triangle can be formed from the three pieces
so obtained? (Observe that the conditions of the breaking differ from those of Example
4 of § 10.)

23. Consider an undamped mathematical pendulum. Let the angle of the maximal
elongation be 2°. What is the probability that at a randomly chosen instant the
elongation will be greater than 1°?

24. Let Buffon’s problem be modified by throwing upon the plane a disk instead
of a needle. What is the probability that the disk will not cover any of the lines?

25. In a five storey building the first floor is 8 metres above the ground floor, while
each subsequent storey is 6 meters high. Suppose that the elevator stops somewhere
because of a short circuit. Let the height of the door of the elevator be 1.8 meter.

Compute the probability that at the time of the stopping only the wall of the elevator
shaft can be seen from the elevator.

26. What conditions must the numbers p, q, r, s satisfy in order that there exist events A and B such that
$$P(A \mid B) = p, \quad P(A \mid \bar{B}) = q, \quad P(B \mid A) = r, \quad P(B \mid \bar{A}) = s?$$

27. A box contains 1000 screws. These are tested at random so that the probability

of a screw being tested is equal to . Suppose that 2 per cent of the screws are

defective; what is the probability that from the tested screws exactly two are defec¬
tive?

28. If A and B are independent events and $A \subseteq B$, prove that either P(A) = 0 or P(B) = 1.
29. Show by an example that it is possible that the event A is independent of both
BC and B + C, while B and C are also independent but A is not independent either
of B or of C .

30. Prove that if A is independent of BC and of B + C, B of AC, and C of AB,


further if P{A), P{B), and P(C) are positive, then A, B, and C are completely indepen¬
dent.
31. We perform n independent experiments; suppose that the probability of the event A in the j-th experiment is $p_j$ (j = 1, 2, ..., n). Let $\pi_{n,k}$ denote the probability that in the n experiments the event A occurs just k times. Prove that one has always $\pi_{n,k}^2 \ge \pi_{n,k-1}\,\pi_{n,k+1}$, regardless of the values of the probabilities $p_1, p_2, \ldots, p_n$.

Hint. Use the relation
$$\pi_{n+1,k} = \pi_{n,k-1}\, p_{n+1} + \pi_{n,k}\,(1 - p_{n+1})$$
and proceed by mathematical induction.

32. Let $A_1, A_2, \ldots, A_n$ be any distinct events. Let $P(A_k) = p_k$ (k = 1, 2, ..., n), further let $U_r$ denote the probability that exactly r from the events $A_k$ (k = 1, 2, ..., n) occur. Put
$$S_k = \sum_{1 \le i_1 < i_2 < \ldots < i_k \le n} P(A_{i_1} A_{i_2} \cdots A_{i_k}) \qquad (k = 1, 2, \ldots, n).$$
Then by Theorem 10 of § 3 we have the following relation:
$$U_r = S_r - \binom{r+1}{1} S_{r+1} + \binom{r+2}{2} S_{r+2} - \ldots + (-1)^{n-r} \binom{n}{n-r} S_n.$$
How will this expression be simplified if we assume that the events $A_1, A_2, \ldots, A_n$ are completely independent and equiprobable?

33. In the notations of the previous exercise prove that
$$S_r = U_r + \binom{r+1}{1} U_{r+1} + \ldots + \binom{n}{n-r} U_n.$$

34. Which simplification of the relation in the preceding exercise is possible if we


assume that Au A2, . .., A„ are completely independent and equiprobable ?

35. Let $A_1, A_2, \ldots, A_n$ be arbitrary events and B an event which is a function of the events $A_1, A_2, \ldots, A_n$. Prove that there exist numbers $C_0$ and $C_{i_1 i_2 \ldots i_r}$ ($1 \le r \le n$; $1 \le i_1 < i_2 < \ldots < i_r \le n$) independent of the choice of the events $A_k$ such that
$$P(B) = C_0 + \sum_{r=1}^{n} \sum_{1 \le i_1 < i_2 < \ldots < i_r \le n} C_{i_1 i_2 \ldots i_r} P(A_{i_1} A_{i_2} \cdots A_{i_r}).$$

Hint. See the proof of Theorem 11 of § 3.

36. Prove Theorem 10 of § 3 using the results of Exercise 35, and by determining the coefficients $C_{i_1 i_2 \ldots i_r}$ in the particular case when the events $A_k$ are independent. According to the statement of Exercise 35 the formula with these coefficients will be valid in the general case too.

37. As an application of Theorem 10 of § 3 solve the following problem: suppose that n persons throw their visit-cards into a hat and then everybody draws one visit-card from the hat. The probability that exactly r persons (r = 0, 1, ..., n) draw their own cards is:
$$W_r(n) = \frac{1}{r!}\left(1 - \frac{1}{1!} + \frac{1}{2!} - \frac{1}{3!} + \ldots + \frac{(-1)^{n-r}}{(n-r)!}\right);$$
e.g. for $n \to \infty$ we have $W_r(n) \to \dfrac{1}{e \cdot r!}$.

38. The events $A_1, A_2, \ldots, A_n$ are said to be exchangeable1, if the value of
$$P(A_{i_1} A_{i_2} \cdots A_{i_r}) \qquad (1 \le r \le n;\ 1 \le i_1 < i_2 < \ldots < i_r \le n)$$
depends only on r and does not depend on the choice of the different indices $i_1, i_2, \ldots, i_r$ (r = 1, 2, ..., n). Thus if $A_1, A_2, \ldots, A_n$ are independent and equiprobable, they are also exchangeable. Show that from the exchangeability of the events $A_1, A_2, \ldots, A_n$ their independence does not follow.

39. a) Let an urn contain M red and N − M white balls; n balls are drawn without replacement, $n \le \min(M, N - M)$. Let $A_k$ denote the event that the k-th drawing yields a red ball. Prove that the events $A_1, A_2, \ldots, A_n$ are exchangeable.

b) Prove that the events $A_1, A_2, \ldots, A_n$ defined in Exercise a) are even then exchangeable, if every replacement of a ball drawn from the urn is accompanied by throwing R balls of the same colour into the urn.

40. Each of N urns contains red and white balls. Let the number of the red balls in the r-th urn be $a_r$, that of the white balls $b_r$, and let $v_r$ be the probability of drawing a red ball from the r-th urn; that is, we put $v_r = \frac{a_r}{a_r + b_r}$. Perform the following experiment. Choose first one of the urns; suppose that the probability of choosing the r-th urn is $p_r > 0$ (r = 1, 2, ..., N). Draw from the chosen urn n balls with
1 Also called “symmetrically dependent” or “equivalent” events.



replacement. Let $A_k$ denote the event that the k-th drawing yields a red ball. Prove now the following statements:

a) The events $A_1, A_2, \ldots, A_n$ are exchangeable.

b) The events $A_k$ are, generally, not even pairwise independent.

c) Let $W_k$ denote the probability that from the n drawings exactly k yield red balls. Compute the value of $W_k$.

d) Let $\pi_k$ denote the probability that the first red ball was drawn at the k-th drawing; compute the value of $\pi_k$.

41. Let $A_k$ denote the event that given the conditions of Exercise 37 the k-th person draws his own visiting card. Prove that the events $A_k$ are exchangeable.

42. Let N balls be distributed among n urns such that each ball can fall with the same probability into any one of the urns. Compute

a) the probability $P_0(n, N)$ that at least one ball falls into every urn;

b) the probability $P_k(n, N)$ that exactly k (k = 1, 2, ..., n − 1) of the urns remain empty.

43. Let $A_k$ denote in the preceding exercise the event that the k-th urn does not remain empty; show that the events $A_k$ are exchangeable and
$$V_k = P(A_{i_1} A_{i_2} \cdots A_{i_k}) = \sum_{j=0}^{k} \binom{k}{j} (-1)^j \left(1 - \frac{j}{n}\right)^N.$$
Show that if $N = \lambda n$ and $n \to \infty$, then $\lim V_k = (1 - e^{-\lambda})^k$.

(Remark. $V_n$ is equal to the probability $P_0(n, N)$ occurring in Exercise 42.)

44. Banach was a passionate smoker and used to put one box of matches in both pockets in order to be never without matches. Every time he needed a match, he chose at random either the box in his right or that in his left pocket, each with the same probability 1/2. One day he put into his pockets two full boxes, both containing n matches. Let $P_k$ denote the probability that on first finding one of the boxes to be empty, the other box contained k matches. Calculate the value of $P_k$ and find the value k which maximizes this probability.

45. An urn contains M red and N − M white balls, $\frac{M}{N} = p$. Let $P_r$ denote the probability that in a sequence of drawings with replacement the r-th drawing of a red ball is preceded by an even number of drawings of white balls. Prove that we have $P_r > \frac{1}{2}$ for every value of p and r.

46. Let an urn contain M red and N − M white balls. Draw all balls from the urn in turn without replacement and note the serial numbers of the red drawings. Let these serial numbers be $k_1, k_2, \ldots, k_M$, and put $X = k_1 + k_2 + \ldots + k_M$. Let $P_n(M, N)$ denote the probability that X = n ($A \le n \le B$), where
$$A = \frac{M(M+1)}{2} \quad \text{and} \quad B = A + M(N - M).$$
Put
$$F(M, N, x) = \sum_{n=A}^{B} P_n(M, N)\, x^n.$$
Determine the polynomial F(M, N, x) and thence the probabilities $P_n(M, N)$.1 Prove that
$$P_{B-n}(M, N) = P_{A+n}(M, N).$$

47. Prove by means of probability theory that if $\varphi(n)$ denotes the number of the positive integers less than n and relatively prime to n (n = 1, 2, ...), then2
$$\varphi(n) = n \prod_{p \mid n} \left(1 - \frac{1}{p}\right),$$
where the product is to be taken over all distinct prime factors p of n.

Hint. Choose at random one of the numbers 1, 2, ..., n such that each of these numbers is equally probable. Let $A_p$ denote the event that the number chosen can be divided by the prime number p. Show that if $p_1, p_2, \ldots$ are the distinct prime factors of the number n, then the events $A_{p_1}, A_{p_2}, \ldots$ are independent. The probability that the chosen number is relatively prime to n is, by definition, $\frac{\varphi(n)}{n}$. On the other hand we have $P(A_p) = \frac{1}{p}$, hence, because of the independence of the events $A_p$,
$$\frac{\varphi(n)}{n} = P\left(\prod_{p \mid n} \bar{A}_p\right) = \prod_{p \mid n} P(\bar{A}_p) = \prod_{p \mid n} \left(1 - \frac{1}{p}\right).$$

48. a) Let $\Omega$ be a countably infinite set, let its elements be $\omega_1, \omega_2, \ldots, \omega_n, \ldots$. Let $\mathcal{A}$ consist of all subsets of $\Omega$ and let the probability measure P be defined in the following manner: $P(\{\omega_n\}) = p_n$ where $p_n \ge p_{n+1} > 0$ (n = 1, 2, ...) and $\sum_{n=1}^{\infty} p_n = 1$. Prove that the set of those numbers x for which an $A \in \mathcal{A}$ can be found such that P(A) = x, is a perfect set.

b) Prove that, given the conditions of Exercise 48 a), the range of the set function P(A) is identical to the interval [0, 1], if and only if
$$p_n \le \sum_{k=n+1}^{\infty} p_k \qquad (n = 1, 2, \ldots).$$
k = n+1

c) Given the conditions of Exercise 48 a), prove that to every r-tuple of numbers $x_1, x_2, \ldots, x_r$ with
$$x_i > 0 \quad (i = 1, 2, \ldots, r), \qquad x_1 + x_2 + \ldots + x_r = 1,$$
a complete system of events
$$A_1, A_2, \ldots, A_r \quad \text{with} \quad P(A_i) = x_i \quad (i = 1, 2, \ldots, r)$$

1 This exercise is the basis of an important statistical method, called Wilcoxon's test.
2 $\varphi(n)$ is called Euler's function.

can be found if and only if
$$p_n \le \frac{1}{r} \sum_{k=n}^{\infty} p_k \qquad (n = 1, 2, \ldots).$$

Hint. a) A number x is said to be representable if there exists an event $A \in \mathcal{A}$ such that P(A) = x, i.e. if x can be represented in the form $x = \sum_{\omega_n \in A} p_n$. If $x_n$ (n = 1, 2, ...) is representable and $\lim_{n \to \infty} x_n = x \ne 0$, then it is readily seen that x is representable too. Indeed, we can select from the sequence $x_n$ an infinite subsequence $x_{n_k}$ (k = 1, 2, ...) such that in the representation of each $x_{n_k}$ the greatest member is $p_{i_1}$. Take now from this sequence an infinite subsequence having in its representation for second greatest member $p_{i_2}$. By progressing in this manner we obtain a sequence $p_{i_s}$ (s = 1, 2, ...) and it is easy to verify that $\sum_{s} p_{i_s} = x$. The range of the function P(A) is thus a closed set. Furthermore, if x is a number which can be represented as a sum of a finite number of the $p_i$-s, e.g. $x = \sum_{j=1}^{N} p_{i_j}$, then
$$x = \lim_{n \to \infty} \left( \sum_{j=1}^{N} p_{i_j} + p_n \right).$$
If $x = \sum_{j=1}^{\infty} p_{i_j}$, then
$$x = \lim_{n \to \infty} \sum_{j=1}^{n} p_{i_j}.$$
Thus the range of the function P(A) is a perfect set.


c) It is easy to see that the condition is necessary. Its sufficiency can be shown

in the following manner: suppose we have xx > x2 > . . . ^ xr. Then xx > —, and

on the other hand px<— hence px can be used for the representation of xx. Let now
r

be x\ = max (xx — pu x2) then x[ > — (1 — px). Sincep2 < — (1 —Pi), Pi can therefore

be used for the representation of x[, that is for one of the x;-s. Proceeding in this way
we can see that every p„ can be used for the representation of an x;. Since

fJPn=fJ Xj = 1,
n=1 /=1

we have obtained a decomposition of the series ^ p„ into r disjoint subseries such


n= 1
that the sum of the y'-th subseries is equal to x;. If A/ consists of the elements con for
which pn occurs in the representation of x„ then the sets Aj have the required
properties.
b) is the special case r = 2 of the statement c).

49. The Kolmogorov probability space $[\Omega, \mathcal{A}, P]$ is said to be non-atomic, if there exists for every event A of positive probability an event $B \subset A$ such that $0 < P(B) < P(A)$. Prove that in the case of a non-atomic probability space $[\Omega, \mathcal{A}, P]$ the range of the function P(A) is the whole interval [0, 1].

Hint. Prove first that for any $\varepsilon > 0$, $\Omega$ can be decomposed into a finite number of disjoint subsets $A_j$ ($A_j \in \mathcal{A}$; j = 1, 2, ..., m) such that $P(A_j) < \varepsilon$. This can be seen as follows. If $A \in \mathcal{A}$, P(A) > 0, then A contains a subset $B \subset A$ such that $0 < P(B) < \varepsilon$. Indeed, if $P(A) < \varepsilon$, we can choose B = A. If $P(A) \ge \varepsilon$, then (since P is non-atomic) a $B \subset A$ can be found such that $B \in \mathcal{A}$ and $0 < P(B) < P(A)$; here either P(B) or P(A − B) is not greater than $\frac{P(A)}{2}$. If $\frac{P(A)}{2} < \varepsilon$, we have completed the proof; if $\frac{P(A)}{2} \ge \varepsilon$, the procedure is continued. Since for large enough r we have $\frac{P(A)}{2^r} < \varepsilon$, there can be found in a finite number of steps a set B such that $B \subset A$, $B \in \mathcal{A}$ and $0 < P(B) < \varepsilon$.

Let us put now
$$\mu_\varepsilon(A) = \sup_{B \subseteq A,\ P(B) \le \varepsilon} P(B) \qquad \text{for } A \in \mathcal{A}.$$
According to what was said above, $\mu_\varepsilon(A) > 0$ for P(A) > 0. Choose a set $A_1 \in \mathcal{A}$ for which $0 < P(A_1) < \varepsilon$, further a set $A_2 \subset \bar{A}_1$ for which
$$\varepsilon > P(A_2) \ge \frac{1}{2}\, \mu_\varepsilon(\bar{A}_1)$$
and then a set $A_3 \subset \overline{A_1 + A_2}$ for which $\varepsilon > P(A_3) \ge \frac{1}{2}\, \mu_\varepsilon(\overline{A_1 + A_2})$; generally, if the sets $A_1, A_2, \ldots, A_n$ are already chosen, we choose a set $A_{n+1}$ such that the conditions
$$A_{n+1} \subset \overline{A_1 + A_2 + \ldots + A_n}$$
and
$$\varepsilon > P(A_{n+1}) \ge \frac{1}{2}\, \mu_\varepsilon(\overline{A_1 + A_2 + \ldots + A_n})$$
are satisfied. Then $A_1, A_2, \ldots, A_n, \ldots$ are disjoint sets, hence $\sum_{n=1}^{\infty} P(A_n) \le 1$ and thus $\lim_{n \to \infty} P(A_n) = 0$ and at the same time
$$\lim_{n \to \infty} \mu_\varepsilon(\overline{A_1 + A_2 + \ldots + A_n}) = 0.$$

Since $\mu_\varepsilon(A)$ is a monotonic set function, we get, introducing the notation $\overline{\sum_{n=1}^{\infty} A_n} = B$, that $\mu_\varepsilon(B) = 0$. But then it follows that P(B) = 0 and thus, introducing the notation $A_1' = A_1 + B$, we obtain that
$$A_1' + \sum_{n=2}^{\infty} A_n = \Omega, \qquad 0 < P(A_n) < \varepsilon \quad (n = 2, 3, \ldots),$$
and $0 < P(A_1') < \varepsilon$.



Choose now N so large that $\sum_{n \ge N} P(A_n) < \varepsilon$. Then the sets $A_1', A_2, \ldots, A_{N-1}$ and $A_N' = \sum_{n=N}^{\infty} A_n$ possess the required properties. Now we can construct for an arbitrary number x (0 < x < 1) an $A \in \mathcal{A}$ such that P(A) = x in the following manner: $\Omega$ is decomposed first into a number $N_1$ of disjoint subsets $A_{1j}$ such that
$$0 < P(A_{1j}) < \varepsilon_1 \qquad (j = 1, 2, \ldots, N_1).$$
Let $x_{1,r} = P\left(\sum_{j=1}^{r} A_{1j}\right)$. Then x lies in one of the intervals $[x_{1,r}, x_{1,r+1})$, $r = 1, 2, \ldots, N_1 - 1$; let it be e.g. the interval $[x_{1,r_1}, x_{1,r_1+1})$. If $x = x_{1,r_1}$, we have finished the construction. If $x_{1,r_1} < x < x_{1,r_1+1}$, we decompose $A_{1,r_1+1}$ into subsets $A_{2j}$ ($j = 1, 2, \ldots, N_2$) such that
$$0 < P(A_{2j}) < \varepsilon_2.$$
Let
$$x_{2,r} = P\left(\sum_{j=1}^{r_1} A_{1j} + \sum_{j=1}^{r} A_{2j}\right) \qquad (r = 1, 2, \ldots, N_2).$$
Then x lies in one of the intervals $[x_{2,r}, x_{2,r+1})$; e.g. $x \in [x_{2,r_2}, x_{2,r_2+1})$. By continuing this procedure (with bounds $\varepsilon_1 > \varepsilon_2 > \ldots$ tending to 0) we obtain a set
$$A = \sum_{j=1}^{r_1} A_{1j} + \sum_{j=1}^{r_2} A_{2j} + \ldots + \sum_{j=1}^{r_s} A_{sj} + \ldots$$
for which P(A) = x.

50. Prove for an arbitrary probability space that the range of P(A) is a closed set.

Hint. A set $A \in \mathcal{A}$ will be called an atom (with respect to P), if P(A) > 0 and if $B \in \mathcal{A}$, $B \subset A$ imply either P(B) = 0 or P(B) = P(A). Two atoms A and A′ are, a set of zero measure excepted, either identical or disjoint. From this it follows that there can always be found either a finite or a countably infinite number of disjoint atoms $A_n$ (n = 1, 2, ...) such that the set $\Omega - \sum_n A_n$ contains no further atoms. Put
$$\sum_{n=1}^{\infty} A_n = B, \qquad \mu_1(A) = P(AB), \qquad \mu_2(A) = P(A\bar{B}).$$
Then $P(A) = \mu_1(A) + \mu_2(A)$. Here $\mu_1(A)$ can be considered as a measure on the class of all subsets of the set $\Omega'$ having for its elements the sets $A_n$, and $\mu_2(A)$ is non-atomic. Hence the statement of Exercise 50 is reduced to the Exercises 48 a) and 49.
CHAPTER III

DISCRETE RANDOM VARIABLES

§ 1. Complete systems of events and probability distributions

We have defined in § 2 of Chapter I the concept of a “complete system of


events” with respect to finite probability algebras; this concept will now be
extended to arbitrary Kolmogorov probability spaces. A finite or denumer-
ably infinite system of events $\{A_n\}$ ($A_n \in \mathcal{A}$; n = 1, 2, ...) is said to be complete (in the wider sense), if $A_i A_j = \emptyset$ for $i \ne j$ and if the occurrence of one of the events $A_n$ (n = 1, 2, ...) is “almost sure” (i.e. it has the probability 1):

$$P\left(\sum_n A_n\right) = \sum_n P(A_n) = 1. \tag{1}$$

Thus we do not require that $\sum_n A_n = \Omega$; only that $P(\overline{\Omega'}) = 0$ should hold, where
$$\Omega' = \sum_n A_n \subseteq \Omega.$$

The sequence of probabilities of a complete system of events will be called


a probability distribution (or briefly distribution). From a purely mathemati¬
cal point of view every sequence of nonnegative numbers $p_1, p_2, \ldots$ for which
$$\sum_n p_n = 1 \tag{2}$$

can be considered as a probability distribution.


The expression “probability distribution” hints at the interpretation that
the probability 1 of the sure event is “distributed” among the events An
(n = 1,2,...). There is a close analogy between probability distributions
and mass-distributions in mechanical systems, since every sequence of non¬
negative numbers $p_1, p_2, \ldots$ fulfilling (2) may be considered as a distribu-
tion of the unit mass among a finite or denumerably infinite number of
points. Later on we shall often return to this analogy.

§ 2. The theorem of total probability and Bayes’ theorem

Let $B_1, B_2, \ldots, B_n, \ldots$ be a complete system of events and let $P(B_i) > 0$ (i = 1, 2, ...). Then an arbitrary event $A \in \mathcal{A}$ can be decomposed accord-

ing to the formula
$$A = \sum_{n=1}^{\infty} A B_n.$$
Since $B_i B_j = \emptyset$ holds for $i \ne j$, we obtain
$$P(A) = \sum_n P(A B_n). \tag{1}$$

According to the definition of conditional probabilities we have
$$P(A B_n) = P(A \mid B_n)\, P(B_n).$$
When substituted into (1) this gives
$$P(A) = \sum_n P(A \mid B_n)\, P(B_n). \tag{2}$$
This relation is called the theorem of total probability. Since $\sum_n P(B_n) = 1$, according to (2) the probability P(A) is the weighted mean of the conditional probabilities $P(A \mid B_n)$ taken with the weights $P(B_n)$. From this it follows immediately that
$$\inf_n P(A \mid B_n) \le P(A) \le \sup_n P(A \mid B_n). \tag{3}$$

The theorem of total probability is closely connected to the following


simple theorem of mechanics. The center of the gravity of a body can be
obtained by decomposing the body into arbitrarily many parts and consider¬
ing the mass of each part as concentrated in its center of gravity and then
forming the center of gravity of the resulting point-system. Equation (2) is
further analogous to the following chemical relation: Different solutions
of the same salt are placed into N vessels, the total volume of the solutions
being 1. Let P(B„) denote the volume of the n-th vessel and P(A \ Bn) the
concentration of the solution of the n-th vessel. If we mix the contents of
the vessels and denote by P(A) the concentration of the resulting solution,
Equation (2) will hold for this case too.
Example. Let an urn contain M red and N − M white balls. Draw balls from the urn without replacement. Let $A_k$ denote the event that we obtain a red ball at the k-th drawing. Clearly, $P(A_1) = \frac{M}{N}$. We shall show that $P(A_k) = \frac{M}{N}$ (k = 2, 3, ..., N). According to the theorem of total probability
$$P(A_2) = P(A_2 \mid A_1) P(A_1) + P(A_2 \mid \bar{A}_1) P(\bar{A}_1),$$
hence
$$P(A_2) = \frac{M-1}{N-1} \cdot \frac{M}{N} + \frac{M}{N-1} \cdot \frac{N-M}{N} = \frac{M}{N}.$$
Similarly we obtain that $P(A_k) = \frac{M}{N}$, if k = 3, 4, ..., N (cf. Exercise 39a, § 12, Ch. II).
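The result can also be checked empirically; the simulation sketch below (an added illustration; M, N and the number of trials are arbitrary choices) estimates the probability of a red ball at each of the N drawings without replacement.

```python
# Added illustration: in drawings without replacement the chance of "red at the
# k-th drawing" stays near M/N for every k.
import random

M, N, trials = 4, 10, 100_000
balls = ['red'] * M + ['white'] * (N - M)
counts = [0] * N                               # red counts for drawings 1, ..., N
for _ in range(trials):
    random.shuffle(balls)
    for k, ball in enumerate(balls):
        if ball == 'red':
            counts[k] += 1
print([round(c / trials, 3) for c in counts])  # each entry close to M/N = 0.4
```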


Let A and B be any two elements of an algebra of events with P(A) > 0 and P(B) > 0. From the values of P(A), P(B) and $P(A \mid B)$ one can obtain $P(B \mid A)$ as well; indeed (cf. (2), § 8, Ch. II)
$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}. \tag{4}$$
If $\{B_n\}$ is a complete system of events and if in (4) $B_k$ is substituted for B and Expression (2) for P(A), we have
$$P(B_k \mid A) = \frac{P(A \mid B_k)\, P(B_k)}{\sum_n P(A \mid B_n)\, P(B_n)}. \tag{5}$$

This is Bayes’ theorem.


There is hardly any other theorem of probability theory so much debated
as this.
Bayes’ theorem is well-proven, its validity cannot be doubted; only its
practical applications are controversial. An often-used name of this theorem
is for instance “theorem of the probability of causes”. This name originates
in the use of Bayes’ theorem to infer the probabilities of the hypotheses
(causes) Bk (k — 1, 2, . . .) from the occurrence of an event A; i.e. if one
wishes to examine how much the occurrence of an event A supports or re¬
futes certain hypotheses. If the so-called a priori probabilities P(Bk) are known,
then Bayes’ theorem can be applied and the a posteriori probabilities
P(Bk | A) can be computed. However, the probabilities P(Bk) are often un¬
known. Then it is usual to give them arbitrary values, which is really a
questionable procedure.
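A small numerical illustration of formulae (2) and (5), added here with invented a priori probabilities and invented conditional probabilities merely to show the mechanics:

```python
# Added illustration: two hypotheses B_1, B_2 and an observed event A.
from fractions import Fraction

prior = [Fraction(1, 2), Fraction(1, 2)]            # P(B_1), P(B_2), assumed known
likelihood = [Fraction(3, 4), Fraction(1, 4)]       # P(A | B_1), P(A | B_2), invented values

total = sum(p * l for p, l in zip(prior, likelihood))           # theorem of total probability (2)
posterior = [p * l / total for p, l in zip(prior, likelihood)]  # Bayes' theorem (5)
print(total, posterior)                             # P(A) = 1/2; posteriors 3/4 and 1/4
```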
The name “theorem of the probability of causes” can lead to misunder¬
standings, hence we must discuss it a little further. Indeed, from (4) it follows
that
$$\frac{P(A \mid B)}{P(A)} = \frac{P(B \mid A)}{P(B)}.$$
Thus if the occurrence of the event A increases (e.g. doubles) the probability
of B, then the occurrence of the event B increases (doubles) the probability
of the event A as well; hence it is entirely impossible to infer the direction
of the causal relation from the value of the conditional probability only.

We mention finally the following chemical analogy of Bayes’ theorem:


Let N vessels contain solutions of different concentrations of the same salt. Let the total volume of the solutions be 1. Let $P(B_k)$ denote the volume of the solution in the k-th vessel and $P(A \mid B_k)$ the concentration of the salt in it; then formula (5) gives what part of the total mass of salt is in the k-th vessel.

§ 3. Classical probability distributions

1. In the preceding Chapter we have already discussed independent repeti¬


tions of a simple alternative. Repeat n times an experiment with two possible
outcomes A and $\bar{A}$, such that the repetitions are independent. If $B_k$ (k = 0, 1, ..., n) denotes the event that A occurred at exactly k experiments, then the events $B_k$ (k = 0, 1, ..., n) form a complete system of events and the corresponding probabilities are
$$W_k = P(B_k) = \binom{n}{k} p^k q^{n-k} \qquad (k = 0, 1, \ldots, n), \tag{1}$$
where p = P(A) and $q = 1 - p = P(\bar{A})$. The sequence of numbers $W_k$ is called the binomial distribution of order n and parameter p. The name hints at the fact that the numbers $W_k$ are the terms of the expansion of $(p + q)^n$ according to the binomial theorem.

2. A natural extension of the binomial distribution is the polynomial distribution;1 it is obtained by independent repetitions of an experiment having several possible outcomes. Let the possible outcomes of the experiment be $A_1, A_2, \ldots, A_r$ and let $P(A_j) = p_j$ (j = 1, 2, ..., r); these probabilities fulfil the condition $p_1 + p_2 + \ldots + p_r = 1$. Let $B_{k_1, k_2, \ldots, k_r}$ denote the event that in n independent repetitions of the experiment the event $A_1$ occurs $k_1$ times, the event $A_2$ occurs $k_2$ times, ..., the event $A_r$ occurs $k_r$ times, where $k_1 + k_2 + \ldots + k_r = n$. Then we have
$$P(B_{k_1, k_2, \ldots, k_r}) = \frac{n!}{k_1!\, k_2! \cdots k_r!}\, p_1^{k_1} p_2^{k_2} \cdots p_r^{k_r}. \tag{2}$$
The name “polynomial distribution” comes from the fact that the terms $P(B_{k_1, k_2, \ldots, k_r})$ can be obtained by expanding $(p_1 + p_2 + \ldots + p_r)^n$ according to the polynomial theorem. If r = 3, we call the distribution the trinomial distribution.

1 Called also “multinomial” distribution.



3. Let an experiment have only two possible outcomes, A and $\bar{A}$. Perform n independent repetitions of this experiment, but let the probability of A (and thus of $\bar{A}$) change from experiment to experiment. Let $B_k$ denote the event that A occurred exactly k times (k = 0, 1, ..., n); then
$$P(B_k) = \sum p_{i_1} p_{i_2} \cdots p_{i_k} (1 - p_{j_1})(1 - p_{j_2}) \cdots (1 - p_{j_{n-k}}), \tag{3}$$
where $p_i$ is the probability that A occurred at the i-th experiment. The summation is to be taken over all combinations $(i_1, i_2, \ldots, i_k)$ of the k-th order of the elements (1, 2, ..., n) and $j_1, j_2, \ldots, j_{n-k}$ denote those numbers of the sequence 1, 2, ..., n which do not occur among $i_1, i_2, \ldots, i_k$. The numbers $P(B_k)$ form a probability distribution. If for instance all probabilities $p_i$ are equal to each other, we obtain as a particular case the binomial distribution (1).

The distribution (3) occurs for instance in the following practical problem: In a factory there are n machines which do not work all the time. They are switched on and switched off independently from each other. Let $p_i$ denote the probability that the i-th machine is working at a given moment, and let $P(B_k)$ be the probability that at this instant exactly k machines are working; then $P(B_k)$ is given by Formula (3). The fact that $\sum_{k=0}^{n} P(B_k) = 1$ can be seen directly in the following manner: a simple calculation gives that
$$\sum_{k=0}^{n} P(B_k)\, x^k = \prod_{i=1}^{n} (1 - p_i + p_i x);$$
by substituting x = 1 we obtain $\sum_{k=0}^{n} P(B_k) = 1$.


k=0
4. The following problem was discussed in the preceding Chapter. An
urn contains M red and N — M white balls (M < N). Draw n times one
ball from the urn without replacement (n < N). What is the probability
that there are k red balls among the n balls drawn? Denote this event by
Q. Then the events Ck [max (0, n — (N — M)) < k < min (n, M)] form
a complete system of events. The corresponding probabilities are, as we have
already shown:

(M\ fA — M\

P(Ck) =
UJ n —k ]
(k = 0,1,..., n). (4)

This distribution is called the hypergeometric distribution.


HI, § 3] CLASSICAL PROBABILITY DISTRIBUTIONS 89

5. This distribution can be generalized in the following manner: Suppose


that the urn contains balls of r different colours, namely exactly Nt balls of
r

the /-th colour (i = 1, 2, . . r). Let N = Z Nt be the total number of balls


i=i

and let Cki, ks, . . ., kr denote the event that among n balls drawn without
replacement the first colour occurs kx times, the second k2 times,. . ., the
r-th colour kr times (kx + k2 + . . . + kr = n). By a simple combinatorial
consideration we obtain that

Nx (N, Nr\

P(Cki, k,. j A:r)


M ki kr J (5)

Distribution (5) is called the polyhypergeometric distribution. It is, for ex¬


ample, applied in statistical quality control, when the commodities are
classified into several categories. (Such categories are for instance: a) fault¬
less; b) faulty but still serviceable; c) completely faulty.)
The events

Cku k"..., kT (0 < kt < min («, N{); Z k‘ =n)

form a complete system of events, thus

.o = i-
This can be seen directly, if we compare the coefficient of xn on both sides
of the identity

n (i+*)*'=(! +
2=1
x)n.

6. Let an urn contain M red and N — M white balls. Let Ak (k = 0, 1,. . .,


N — M) denote the event that at consecutive drawings without replacement
we obtained the first red ball at the (k + l)-st drawing. As was proved in
§ 5 of the preceding Chapter, we have

M
^o) = -F>

k—l
M M
P(Ak) =
N-k
n N-j
(k= 1,2,... .N — M). (6)

Since the events Ak (k = 0, \ , ..., N - M) form a complete system of


events, we have the relation

144222
90 DISCRETE RANDOM VARIABLES [HI, § 3

_M N~M M *~x
N + kti N-k ,U

This identity also has a direct proof, but it is not quite simple. It happens
often that certain identities for which a mathematical proof may be rather
elaborate, are readily obtained by means of probability calculus.

7. Let the preceding exercise be modified in the following manner: Let


an urn again contain M red and N - M white balls, but the drawn balls
should now always be replaced. Let Ak denote the event that we obtain the
first red ball at the (7c + l)-st drawing. The most marked difference between
this problem and that of the drawing without replacement dealt with above
is that the number k was there bounded (k < N — M). Now, however, k
can be arbitrarily large; in principle, it is even possible that we draw always
a white ball. Hence it can be questioned whether the events Ak(k — 0, 1, . . .)
do form a complete system of events. Clearly, the events Ak mutually ex¬
clude each other, the only thing we have to examine is whether it is sure that
one of the events does occur. By introducing the notation
00

ki' = YAn
71 = 0

we have Q' ^ 12.


We shall prove, however, that the possibility to draw always a white ball
in an infinite repetition of drawings has probability 0, thus in practice it
does not count at all, i.e., the system of events {Ak} is in a wider sense
complete.
M
First of all we compute the probabilities P(Ak). Put— — p, 1 — p — q.

The probability that we obtain at the first k drawings white balls and at the
(k + l)-st drawing a red one is

P(Ak) — pqk (k = 0,1,...). (7)


From this
00 00

X P(Ak)=p£ qk = = 1.

Hence the probability of Q' is 1 and thus P(Q — Q') — 0. Though it is, in
principle, possible that Q — Q' occurs; this possibility can be neglected in
practice. Hence the system {Ak } of events is, in a wider sense of the word,
complete.
The distribution pqk (k = 0, 1,.. .) is often called the geometric distri¬
bution, since the sum of the members pqk is a geometric series. We shall see
HI, § 3] CLASSICAL PROBABILITY DISTRIBUTIONS 91

later on that this distribution belongs to a larger class of distributions,


namely to the class of negative binomial distributions.
Examples 1-6 show finite probability algebras; the construction of these
probability algebras causes scarcely any difficulty at all. As Example 7
deals with an infinite probability algebra, this deserves to be examined some¬
what more thoroughly. This example deals with the infinite repetition of a
simple alternative, the elementary events are thus infinite sequences consist¬
ing of the events A and A. It is readily seen that the set of the sequences of
this type has cardinal number of the continuum. Indeed, we associate to
every sequence
-®1> ^2> • • • 5 En, ...,

where the meaning of E„ can be only A or A, the number x having the binary
expansion x = 0, where

( 1 if En = A,
[ 0 if En = A.

Thus the set of the elementary events of the sequence of experiments is


mapped onto the interval [0, 1]. This mapping is one-to-one, with exception
k
of the binary rational numbers x — ^j{k and / are nonnegative integers).

Now we construct the probability space corresponding to this system. First


of all let denote the set of events obtained by prescribing the outcome
of a finite number of experiments assuming nothing about the remaining
experiments. Let ix < i2 < ... < ik denote the indices of the experiments
where the occurrence of A is prescribed and j\ < j2 < ... < jt similarly for
the occurrence of A. Let C denote the event so defined, then we have

P(C) =pkql.

From this we can compute P(C) for every C £ Clearly [^0,P] is a proba¬
bility algebra, but is not a a-algebra. But if we consider the least a-al¬
gebra containing and extend the set function P{C) defined over
(readily seen to be a measure on to then we obtain the Kolmogorov
probability space sought for (cf. Ch. II, § 6). In order to prove that P(C)
is a measure on let us consider the above mapping of the sample space
onto the interval [0, 1]; let the interval [0, 1] be denoted by Q*. There cor¬
responds to the algebra of sets the class of the subsets of Q* consist¬
ing of a finite number of pairwise disjoint intervals with binary rational end¬
points. Just like in Chapter II, § 7 there can be given a function F(x) so that
the probability belonging to the interval [a, b) = 1 be equal to F(b) — E(a)m
92 DISCRETE RANDOM VARIABLES [III, § 3

m+ 1
Indeed, if the interval [a, b) is of the form (m being odd) and
2n

mil 1
- — -;— -:— ”f” • • • “f" -:— (4 < 4 < . . . < 4 = n),
2n 21' 2h 2lk

then we put
F(b) - F(a) = / q"~k.
From this F(x) can be determined at every binary rational point x. Thus
for instance

F(0) = 0, F(l)=l,

F
fi \ , „ f3
= + q2p-.
[«j = q + pq, f
[tH- f 8
5 \
F f7 = q+pq+p2q, ;tc.
T) = q + qp' F J
In general, if

X = Y •—r— (4 < 4 <••• < 4)


ti 2"=
then
r

m= E
k=1
It is easy to see that F(x) is an increasing continuous function and F(0) =
= 0, .F(l) = 1. Hence the result of Chapter II, § 7 can be applied here. The
extension of is in this case the collection of all Borel-measurable

subsets of £3*. Especially, if p = q = ~, then F(x) = x and the measure

P* is the Lebesgue-measure. The fact that in an infinite sequence of experi¬


ments the probability of “obtaining except for a finite number of experi¬
ments always the same event” is zero, corresponds to the fact that in a
binary expansion of almost every number both digits occur infinitely often.
This is a special case of the well-known theorem of Borel, which will be dis¬
cussed later on. The above construction of the probability space [£3, P]
is a special case of a general theorem of Kolmogorov (the so-called funda¬
mental theorem of Kolmogorov) which will be proved later.

8. The negative binomial distribution can be obtained as a generalization


of the preceding problem. Consider an experiment having two possible out¬
comes A and A and let the probability of the event A be p, that of A be
HI, § 3] CLASSICAL PROBABILITY DISTRIBUTIONS 93

q — 1 — p. Let A^ denote the event that during independent repetitions


of the experiment the event A occurred for the r-th time (r > 1) at the
(r + &)-th experiment. We obtain by a simple combinatorial consideration
that
'k + r — lj
p(4r)) = Prqk (k = 0,1,...). (8)
. r-1 J
The events A$ (k = 0, 1,. . .) form a complete system of events in a wider
sense. Since the events Af? {k = 0, 1,. ..) are pairwise disjoint, it is enough
00

to show that £ P{A(J?) = 1. This follows from (8) in the following manner:
k=0
00 00
k + r- 1'
I p q (-#
. r-1 , k=0
The distribution (8) will be called the negative binomial distribution of r-th
order, since the probabilities in question can be obtained as terms of the
binomial series (for a negative exponent) of the expression p\ 1 — q)~r.
Since the events A$ (k = 0, 1,. . .) form a complete system of events, the
probability, that the number of occurrences of an event in infinite repeti¬
tions of an alternative remains bounded, has the value zero.
Indeed, if C„ denotes the event that in the infinite sequence of experi¬
ments A occurred exactly n times; then, as proved above, P(CJ = 0. If
therefore C denotes the event that A occurs in the infinite sequence of events
00
only a finite number of times, then C = £ C„, and
«=o

P(C) = t P(Cn) = 0.
n=0

Thus the event A occurs infinitely many times with the probability l.1
9. Consider the following problem: let an urn contain M red and N - M
white balls. Draw a ball at random, replace the drawn ball and at the same
time place into the urn R extra balls with the same colour as the one drawn.2
Then we draw again a ball, and so on. What is the probability of the event
that in n drawings we obtain exactly k times a red ball? Let this event be
denoted by Ak. Of course we assume that at every drawing each ball of the
1 Later on we shall prove more: let kn denote the number of occurrences of A in
the first n experiments, then not only lim kn — + 00 with probability 1, but more
n-*-co

k
precisely lim —— = p with probability 1.
n—*-co tl
2 R can be negative as well. In case of negative R we remove from the urn R balls
of the same colour as the one drawn.
94 DISCRETE RANDOM VARIABLES [III, § 4

urn is selected with the same probability. We compute first the probability
that we obtain at each of the first k drawings a red ball, and white balls at
the remaining n — k drawings. Clearly this probability is

n cM+jR> n
j=o h=a
(.N-M + hR)
n —1 (9)
n
l-0
(n+ir)

From this it follows easily that

;m("+;s) n cn
ln\k~1

\K)J=0_h = 0_
n-k-l
-m+hk >
P(Ak) = (10)
ff (n+«).
1=0

Distribution (10) is called the Polya distribution.


M
If R = 0 and — = p, then we obtain from (10) the binomial distribution

as a particular case. If R = —1, we get as a particular case from (10) the


hypergeometric distribution.
We can also compute the probability that we obtain at the (k + l)-th
drawing the first red ball. Obviously, this probability is

M kz} l N — M + jR
N+kR /J N+jR ,
(11)

In the cases of R = 0 and R = — 1 we have the particular cases already


dealt with.

§ 4. The concept of a random variable

So far we have only considered whether a random event does or does not
occur. Qualitative statements like this are often insufficient and quantitative
investigations are necessary. In other words, for the description of random
mass phenomena one needs numerical data. These numerical data are not
constant, they show random fluctuations. Thus for instance the result of
a throw in dicing is such a random number. Another example is the number
of calls arriving at a telephone exchange during a given time-interval, or
the number of disintegrating atoms of a radioactive substance during a
given time-interval.
In order to characterize a random quantity we have to know its possible
values and the probabilities of these values. Such random quantities are
HI, § 4] THE CONCEPT OF A RANDOM VARIABLE 95

called random variables. In the present Chapter we shall discuss only random
variables having a countable set of values; these are called discrete random
variables. The random variables figuring in the above examples were all
of the discrete type. The life time of a radioactive atom is for instance also
a random variable but it is not a discrete one. General (not discrete) random
variables will be dealt with in the following Chapters. In what follows ran¬
dom variables will be denoted by the letters of the Greek alphabet.
Let A be an arbitrary event. Let the random variable be defined in
the following way:
I
I if A occurs,
0 otherwise (i.e. if A occurs).

Obviously, the value of depends on chance, further we have

P(Za = 1)=P(A),
and similarly
P(&A = 0) = P(A) = 1 - P(A).

A random variable £A associated in this way to the event A is called the


indicator of A. Conversely, we can assign to every random variable b, assum¬
ing only two values, a and b, the event A that £ = a (where the event A
means that £ = b).
Starting from this trivial remark, we can make a further step forward.
If Xj, x2,. . . are the different possible values of the random variable £
(i.e. the set of the possible values is finite or denumerable), a complete sys¬
tem of events {A„} can be associated to it. Indeed, let An denote the event
oo co

that b, — xn, then clearly AnAm = 0, 'An^m and £ vfn — Q, hence £P(AJ =
n=1 n=1
= 1. Conversely, there can be assigned (in several ways) to every complete
system of events {An} a random variable £ such that in case of the occurrence
of A„ the value of £ should depend on the index n only. £ can for instance
be defined in the following manner:

b =n if An occurs (n = 1,2,...).

The value n may be replaced by /(«), where f(n) is any function defined for
the positive integers, for which f(n) A f(m) if n ^ m. Thus we can see that
a complete system of events can be assigned to every discrete random va¬
riable in a unique manner, while there can be assigned infinitely many dif¬
ferent random variables to a complete system of events.
We shall deal in this Chapter with random variables assuming only real
values. It must be said that probability theory deals also with random vari¬
ables whose range does not consist of real numbers but for instance of
96 DISCRETE RANDOM VARIABLES [HI, § 4

^-dimensional vectors. There are also random variables whose values are
not vectors of a finite dimension but infinite sequences of numbers or func¬
tions, etc. Later on, we shall also examine such cases.
Now let us see, how the notion of a random variable is dealt with in the
general theory of probability.
In Chapter II we were made familiar with Kolmogorov’s foundation of
probability theory. We started from a set Q, the set of elementary events,
and an-algebra consisting of subsets of Q. Here consists of all events
coming into our considerations. Further there was given a nonnegative
er-additive set function P defined on such that P(Q) = 1. The value
P(A) of this function for the set A defines the probability of the event A. Nat¬
urally, we understand by a random variable a quantity depending on which
one of the elementary events in question occurs. A random variable is there¬
fore a function £ — £(co) assigning to every eleme'nt co of the set Q (i.e., to
every elementary event) a numerical value.
What kind of restrictions are to be prescribed for such a function? If we
have a probability field where every subset of Q corresponds to an event,
no restriction is necessary at all. But if this is not the case, then the definition
of a random variable calls for certain restrictions.
Since we consider in this Chapter discrete random variables only, we
confine ourselves (for the present) to the following definition:
Let [£?, P] be a Kolmogorov probability space. A function £ = E(m)
defined on Q with a countable set of values is said to be a discrete random
variable, if the set, for which £(co) takes on a fixed value x belongs to *^6 for
every choice of this fixed value x.
Let xx, x2,. • . denote the different possible values of the random variable
£ = £(<«) and An the set of the elementary events co £ Q for which £(co) = x„,
then An must belong to the algebra of sets<^f for every n. Only in this case
the probability
= X.) = P(An)
is defined.
A complete system of events associated with a discrete random variable
thus consists of those subsets of the space of events for which the random
variable takes on the same value. Especially, if £A = <^(co) is the indicator
of the event A, then <^(co) is a random variable having the value 1 or 0
according as co does or does not belong to the set A.
The sequence of probabilities of a complete system of events is said to be
a probability distribution. Now that we have introduced the concept of
random variable this probability distribution can be considered as the set
of all probabilities corresponding to the different values taken on by a ran¬
dom variable. If for instance an experiment having the possible outcomes
A and A is independently repeated n times, then the number £ of the experi-
Ill, § 4] THE CONCEPT OF A RANDOM VARIABLE 97

ments showing the occurrence of the event A is a random variable with the
binomial distribution, i.e.

P(Z = k) = pk qn~k (k = 0, 1,...,«)

where p - P(A) and q = 1 — p.


Let £ be a random variable and g(x) an arbitrary real valued function of
the real variable x. Then q = g{£) is also a random variable. Let further
£x, £2, • • £r be random variables and let g{xx, x2, ■.., xr) be an arbitrary
function of the r real variables xl5 x2,. . .,x;, then q = #(£x, £2,.. £r) is a
random variable as well.
The distribution is the most important concept for the characterization
of a random variable, it does not, however, characterize a random variable
completely. If for instance we know the distributions of the random vari¬
ables £ and q, it is not, generally, possible to determine from this alone
the distribution of the random variable £ = g(£, q)- In order to do this we
have to know the “joint distribution of the random variables £, q, that is
the probabilities P(£ = xn, q — y,„). But if £ and q are given as functions
of the elementary event co £ Q, the joint distribution of £ and q is herewith
also given and the distribution of any function £ = g(£, q) as well.
Let the possible values of the random variable £ be xx, x2,. . . and let A
be an arbitrary event of positive probability. We define the conditional
distribution of the random variable £ with respect to the condition A by
the sequence of numbers

P(£ = xn\ A) (n= 1,2,...).

We introduce further the notion of the distribution function of a random


variable. If £ is a random variable, then the function F(x) defined by
P(x) = P(£<x)
for every real x is said to be the distribution function1 of £. Here P(£ < x)
stands for the probability of the event that the value of £ is less than x; this
event can be represented as the set Ax of the elements co £ Q for which
£(cu) < x. If the discrete random variable £ takes on the values x„ with the
probabilities pn = P(£ = x;,) (n = 1,2,...) then we have, clearly,
P(x)= £ pk,
Xk<X

where the sum is to be extended over all values of k such that xk < x.

1 Called also cumulative distribution function.


The definition F(x) = P(£ < x) is also customary. This induces only minor
modifications in its properties, e.g. this function is continuous from the right, while
< x) is continuous from the left.
98 DISCRETE RANDOM VARIABLES [III, § 4

If, for instance, the distribution of £ is a binomial distribution of order n


and parameter p, then
pk qn-k
F(*)=P(£<x) I (1)

Sometimes, an integral form of this distribution function is used. For this


purpose we define the incomplete (and the complete) beta integral of Euler.//

a > 0, p > 0, 0 < x < 1,


put

B(a,P, x) = jV-^l - ff~x dt. (2)


o

B(a, p, x) is called Euler's incomplete beta integral of order (a, /?). It is well
known that

B(pc,P) = B(ocJ1,1) =
mm (3)
n* + p) 5
where

F(a) = | x* 1 e x dx (a > 0)

is the so-called gamma function. B(a, P) is called Euler's complete beta inte¬
gral of order (pc, p). It is readily verified through integration by parts that

B(r + 1 ,n- r,p)


(4)
B(r + 1, n — r)
hence
B(r + 1 ,n - r,p)
F(x) = P(Z<x)= 1 -
B(r + 1, n - r)
if
r <x<r + 1 (r = 0, 1,.. .,n - 1). (5)
The distribution function F(x) of an arbitrary (not necessarily discrete)
random variable £(a>) exists, iff the set Ax defined above belongs for every
real x to ^. This will be always assumed in the following Chapters
during the study of general random variables. In case of discrete random
variables, however, this follows from the assumption that for every possible
value n the set An of elements w £ Q such that <*(co) = xn, belongs to
The distribution function F(x) is always nondecreasing; further lim F(x) =
= 0 and lim F(x) = 1.
X— + 00
HI, § 5] THE INDEPENCE OF RANDOM VARIABLES 99

§ 5. The independence of random variables

It is obvious to call two random variables independent if the complete


systems of events belonging to them are independent. This definition corre¬
sponds to the natural requirement that two random variables should be
considered as independent, if the fact that one of them takes on a definite
value has no influence on the random fluctuations of the other. Let co denote
any element of the space of events Q; let £ = £(co) be the first, rj = rj((o)
the other random variable, let further An be the set of all co-s for which
£(co) = xn (n = 1,2,...) and Bn of those for which t](co) = ym{m = 1, 2,...).
£ and rj are said to be independent, if
P(AnBm) = P(An)P(Bm) (1)
for every n and for every m; that is, if the complete systems of events {An}
and {Bm} are independent. Or in a different notation: £ and t] are called
independent, if
P(Z = xn,rj = ym) = P(f - xn) P(rj =ym) (1')

for every n and m. Hence in case of two independent random variables the
joint distribution of £ and r\ is, according to (1'), determined by the distri¬
butions of £ and t].
This definition can be generalized to the case of several random variables.
The discrete random variables £l5 £2, • • are said to be (completely)
independent, if for every system of values xki, xki, . . ., xkr the relation

P(Zi = xki, £2 = xki, ...,£, = xkr) = [] p^j = **,) (2)


J=i
holds. It is easy to see that any s < r out of r independent random variables
are also independent. The independence of the random variables £l5 £2,. .
(s < r) can be verified by summing Formula (2) over all possible values
of • • •> xkf.
The converse of the statement is not true; from the pairwise independence
of £i, £2, £3 their complete independence does not follow. Let the random
variables £l9 £2, £3 be the indicators of the events Ax, A2, A3, then relation
(2) expresses just the complete independence of the events Alt A2, A3. Since
— as we know already — the complete independence of three events does
not follow from their pairwise independence, the same holds for random
variables, too.
A constant is, clearly, independent of any random variable. Indeed, if
t] = c (c = constant) and £ is an arbitrary random variable, then
P(£ = xk,y = c)= P<£ = xk) = P(£ = xk) P(j] = c),
since the set defined by rj = c is the entire space Q.
100 DISCRETE RANDOM VARIABLES [III, § 6

Next we prove the following theorem:

Theorem 1. Let £1; £2, • •£r be independent discrete random variables


and gfx), gfx),. . ., gfx) arbitrary functions of a real variable x. Then the
random variables r/1 — gff),. . ., rj, = gr(fr) are independent as well.

Proof. The proof will be given in detail only for r — 2, for r > 2 the
procedure is essentially the same.
Let {xjk} be the sequence of the possible values of the random variable
(J — 1, 2) and {A k) the complete system of events belonging to the ran¬
dom variable £ ; Ajk is thus the set of those elementary events co £ Q for
which f(a>) = xjk.
If yjt is one of the possible values of the random variable rjj = gff),
then the set B), defined by gfff) — yji can obviously be obtained as the
union of finitely or denumerably many sets Ajk; Bn is equal to the union of
the sets Ajk whose indices satisfy the equation gfxjk) = yjt.
Since the complete systems of events {Alk} and {A2k) are independent,
the sum of an arbitrary subsequence of the sets Alk is independent of the
sum of an arbitrary subsequence of the sets A2k. From this our assertion
follows.
We give a reformulation of the above theorem which we shall need later
on. Let £(co) be a discrete random variable with possible values jq, x2, . . .,
xn, . . . and let A„ denote the set of those elementary events co for which
£(co) = xn. Let further be the least cr-algebra containing the sets An.
is called the o-algebra generated by £. Clearly, consists of the sets
obtained as the union of finitely or denumberably many of the sets An.
Obviously ^ Jf £r are independent random variables,
. . ., are the o-algebras generated by f, f and Bj
is an arbitrary element of ^ (j = 1, 2, . . ., r), then the events B1} B2, . . ,,Br
are independent.

§ 6. Convolutions of discrete random variables

Let b, and t] be two random variables with possible values xn and ym


{n, m — 1,2,.. .), respectively. Let the distributions of t; and q be

P(Z = *n) = Pn and P(rj = ym) = qm (n,m= 1,2,...).

If g(x> y) is any real valued function of two real variables, then - as men¬
tioned above — ( = g(rj) is a random variable.
Ill, § 6] CONVOLUTIONS OF DISCRETE RANDOM VARIABLES 101

Let us determine the distribution of the random variable £. For every


real number z
/>((=*)= £ P({ = i, = tim). (1)
g(xn,ym)=z

The sum is here extended over those pairs (n, m) for which g{pcn, ym) = z.
If such pairs do not exist, the sum on the right hand side of (1) is zero.
In order to compute P(£ = z) we have to know therefore, in general,
the joint distribution of £ and rj. If £ and r\ are independent, then
= xH,ij = ym) = P(£ = x„) P(rj = yj and thus

P(C = z) = Y PnVm• O')


g(xn,ym)=z

Let us consider now the important special case when £ and rj are independent
and g{x, j) = x + y, hence £ = £ + q. Then

P(Z = z)= Y Pn<lm• (1")


Xn + ym — z

If £ and tj assume only integer values and pn = 1P(£ = n), qm = P(q = m)


(n, m = 0, + 1, + 2,. ..), then

P(C = k) = Y Mk-J (P = o, ± 1, ± 2,...). (2)


j=-<x

If £ and r\ have only nonnegative integer values, then

P(C — k) — Y PjQk-j- (3)


j=o

The distribution of £ = £ + q is called the convolution of the distributions


£ and q. In what follows we shall compute the convolution of some discrete
distributions.
Let £ and rj be independent random variables having binomial distribu¬
tions of order nx and n2, respectively, with the same parameter^:

K
P(^ = k) = pkqn l~ (k — 0,1,..n±),
A
plqn2-1
P(rj = l) (l — 0,1,..., ri<i),

where q = 1 — p. If £ = £ + ?7, then

n2 pkqtii+rti-k
pa=k) = (4)
k-j)
102 DISCRETE RANDOM VARIABLES [HI, § 7

By the well-known identity

«1 + Mo
z
7=0
»1

,j
[ «2

U-./J k .

it follows from (4) that

'n pk qn-k

o'
P(C = k) =

II
(5)
k.

where n = nx + n2.
Hence the random variable £ has a binomial distribution too. This result
can also be obtained without any computation as follows: Consider an ex¬
periment with the possible outcomes A and A; let P(A) — p. In the above
example £, resp. rj is equal to the number of occurrences of A in the course
of nx resp. «2 independent repetitions of the experiment. The assertion that
£ and r\ are independent means that we have two independent sequences of
events. Perform a total number of n = nl + n2 independent experiments,
then £ = £ + 17 means the number of the occurrences of A in this sequence
of experiments; hence £ is a random variable having a binomial distribution
of order n and parameter p; that is, Formula (5) is valid.
We encounter a practical application of this result when estimating the
percentage of defective items. Consider a sampling with replacement from
the population investigated. According to the above this can be done also
by subdividing the whole population into two parts having the same per¬
centage of defective items and selecting from one part a sample of n1 ele¬
ments and from the other one a sample of n2 elements. This estimating pro¬
cedure is equivalent to that which consists of the choice of a sample of
n = + «2 elements from the whole population.
It is to be noted here that the distribution of the sum of two independent
random variables with hypergeometric distributions does not have a hyper¬
geometric distribution. Hence the former assertion is not valid if the sam¬
pling is done without replacement. The difference is, however, negligible in
practice, if the number of elements of the population is large with respect
to that of the sample.

§ 7. Expectation of a discrete random variable

The random fluctuations of a random variable are described by its distri¬


bution function. In the practice, however, it is often necessary to characterize
a distribution by a small number of data. The most important and simplest
HI, § 7] EXPECTATION OF A DISCRETE RANDOM VARIABLE 103

one of such data is the expectation defined below (first for discrete distribu¬
tions only).
Let the possible values of the random variable £ be xlt x2,. . . with corre¬
sponding probabilities pn = P(f = x„) (n = 1,2,.. .). Perform TV independent
observations of £; if N is a large number, then, according to the meaning of
probability, at approximately Npx occasions we shall have g — xx, at approx¬
imately Np2 occasions £ = x2, and so on. Taking the arithmetic mean of the
^-values obtained at the N observations, we obtain approximately the value

Npx ■ xx + Np2 • x2 + ...


N
Y.Pkx*;

this is the value about which the arithmetic mean of the observed values of
£ fluctuates. Hence we define the expectation E(f) of the discrete random
variable by the formula

(i)
k
Obviously, E(f) is the weighted arithmetic mean of the values xk with weights
pk} In order that the definition should be meaningful we have to assume
the absolute convergence of the series figuring on the right side of (1). Other¬
wise, namely, a rearrangement of the xk values would give different values
for the expectation.
If £ can take on infinitely many values, then E(f) does not always exist.
E.g. if

P(( = X) = E (4=1,2,...),

then the series £ pkxk is divergent. Clearly, the expectation of discrete and
k
bounded random variables always exists.
Sometimes, instead of “expectation”, the expressions “mean value” or
“average” are used. But they may lead to confusion with the average of the
observed values. In order to discriminate the observed mean from the
number about which the observed mean fluctuates we always call the latter
“expectation”.
Obviously, the expectation E(£) depends only on the distribution of
hence if £■, and £2 are two discrete random variables having the same dis¬
tribution, then £■(£•[) = £(£2). Therefore E(£) can also be called the expecta¬
tion of the distribution of f The fluctuation about E{£) of the averages

1 Hence E(0 lies always between the lower and the upper limit of the possible
values of £.
104 DISCRETE RANDOM VARIABLES [III, § 7

formed from the observed values of t, is described more precisely by the laws
of large numbers, which we shall discuss later on. Here we mention only
that the average of the observed values of £ and the expectation E(f) are
essentially in the same relationship as the relative frequency and the proba¬
bility of an event. This will be readily seen if we consider the indicator
of an event A having the probability p\ indeed, E(E, A) = p • 1 + (1 — p)- 0 =
= p and the average of the observed values of is equal to the relative
frequency of the event A.
Next we compute the expectations of some important distributions.

1. The expectation of the binomial distribution. The random variable t


has a binomial distribution if it assumes the values k = 0, 1, . . ., n with
probabilities

P(i = k)= LA"-*

where q — 1 — p and 0 < p < 1. Hence according to (1)

prq{n-\)-r = ^

Example. The number of atoms disintegrating during a time interval t out


of N atoms of a radioactive substance has a binomial distribution; indeed
the probability that in the given time interval exactly k atoms will disinte¬

grate is equal to pkqN k, where N means the number of atoms present

at the beginning of the time interval and p = 1 — e~Xt (.1 is the disintegra¬
tion constant). Hence the expected value of the atoms disintegrating during
the time interval t is given by iV(l - e~Xt); thus the expected number of
nondisintegrated atoms is Ne~Xt. This exponential law of radioactivity does
not state — as it is sometimes erroneously suggested — that the number of
the nondisintegrated atoms is an exponentially decreasing function of the
time; on the contrary, it only states that the number of nondisintegrated
atoms has an expectation which is an exponentially decreasing function of
the time.

1. The expectation of the negative binomial distribution. The random


variable £ has a negative binomial distribution if its possible values are
r + k (k = 0, 1, . . .) and if it takes on these values with probabilities

'k + r - I
P(£ = r + k) = prqk (k = o,i,...),
rII, § 8] SOME THEOREMS ON EXPECTATIONS 105

where 0<p<l, q=l — p. Because of (1) we have


00
k + r- 1)
£({) = E (' + *)
k=0 r - 1
PrT = — Z
P k=0
ft+Vv--
<• I p
Example. In shooting at a target, suppose that every shot hits the target
with the probability p and the outcomes of the shots are independent of
each other. How many shots are necessary to hit the target r times ?
The mathematical wording of the problem is as follows: Let the experi¬
ments of a sequence be independent of each other. Let the experiment have
only two outcomes: A (the shot hits the target) and A (it does not). Let £
denote the serial number of the experiment at which A occurred for the r-th
time. As noted in § 3 of this Chapter, the probability that the event A occurs
in the (k + r)-th experiment for the r-th time is

hence £ has a negative binomial distribution of order r. Thus in the average


r
we need to fire — shots in order to get r hits.
P
3. The expectation of the hypergeometric distribution. The random vari¬
able £ has a hypergeometric distribution if it takes on the values k = 0,1,
. . ., n with probabilities
M \ IN-M\
k)\n-k)
=v= IN\
n)
We obtain from (1) by a simple calculation that

M
E(0 = n
~N'

Example. The hypergeometric distribution occurs for instance in sampling


M . ...
without replacement. Let p — denote the fraction of defective items in

the lot examined. We want to estimate p from a sample of size n. The number
of defective items has the same expectation np as in sampling with
replacement.

§ 8. Some theorems on expectations

We shall now prove some basic theorems about expectations.


106 DISCRETE RANDOM VARIABLES [III, § 8

Theorem 1. If E(f) and E(rj) exist, then + rj) exists too and

E{f + ri) = E(0 + E(rf).

The statement of this theorem is plausible because of the intuitive meaning


of expectation. Indeed, if the observed values of £ are £l5 £2, ■ ■ -An and
1 "

those of rj are r\2,. . tjn, then — y 4 fluctuates about the number £■(£)
n k=i
1 ” 1 n

and —V t]k about the number E(rj), hence-—Y + r]f) fluctuates about
n n k=1
the number E(£) + E(rj); in consequence E(f + rj) = E(f) + E(rj). Let us now
give the proof of the theorem. Let the possible values of £ be Xj (J = 1,2,.. .)
and those of rj yk (k = 1,2,.. .), let further Ajk denote the event that £ = x}
and r\ = yk. Clearly, the Ajk (j,k = 1,2,.. .) form a complete system of
events. Further
yP(Ajk) = P(r, = yk)
j
and
■£P(AJk)=P(( = Xj).
k

On the other hand, the possible values of £ + t] are the numbers z repre¬
sentable as Xj + yk. It may happen that a number z can be represented
in more than one way in the form z = Xj + yk; in this case

zP{f + r] = z) = z y P(Ajk) = y (xj + yk) P(Ajk).


x/+yk=z xi+yk=z

Since the sum of two absolutely convergent series is itself absolutely conver¬
gent, we obtain that

E(£ + n) = y y (Xj + yk) P(Ajk) = E(0 + E(>7)


j k

and this is what we wished to prove.


The next theorem follows by mathematical induction from Theorem 1.

Theorem 2. If E(£t) (i = 1,2,...,«) exist, then E(£,1 + exists,


too, and

E(Zx + £2 + . •. + („) = E(Z 1) + E(60 + ... + E(Q.

It is easy to prove the following theorem:

Theorem 3. Let cl5 c2,. . ., cn be constants and £■,, £2,. . -An random vari-
HI, § 8] SOME THEOREMS ON EXPECTATIONS 107

ables the expectation of which exists, then

E(t Ckik) = tckE(ik).


k=\ k=X

In other words, E is a linear operator.


It is further easy to show the following properties of the expectation:
If ^ > 0, then E(f) > 0. If | £ | < | rj | and E(r]) exists, then E(f) exists as
well.
Consider now some examples. We have proved already that a random
variable with a binomial distribution of order n has the expectation E(£) =
= np. This can be deduced immediately from Theorem 2; indeed ^ can be
n

written in the form £ = £ <jj), where £} is the indicator of the event A at the
7=1
y'-th experiment. Since E(^) = p, it follows from Theorem 2 thatEf^) = np.
Similarly a random variable having a negative binomial distribution of
order r can be considered as the sum of r independent random variables
each having a negative binomial distribution of the first order, with the
same parameter p. Thus it follows from Theorem 2 that the negative bi-
r
nomial distribution of order r has the expectation —, as proved already.
P
Similarly, a random variable with a hypergeometric distribution can be
represented as the sum of n indicator variables whose expectation is p (cf.
the example after the theorem of complete probability). These indicator
variables are not independent, but this does not affect the validity of
Theorem 2.

Theorem 4. If r\ = £ — E(f), then E(rfi = 0.

Proof. According to the additivity

£(„) = £({) - £(£({)).


Since the expectation of a constant is obviously the constant itself, we have
E(E(f)) = E(f), and our statement follows.

Theorem 5. If t; and rj are discrete random variables such that the expecta¬
tions E(f2) and E(tf) exist, then E{fyj) exists as well and

i £({,) r < (i)


Note. Essentially, the inequality (1) is Schwarz’s inequality known from
analysis.

Proof. Consider the random variable

Ca = a -
108 DISCRETE RANDOM VARIABLES [HI, § 8

where X is a real parameter. Since 0 < Cx — 2£2 + 2X2tf, E(Cx) exists.


Because of Theorem 3 we have

£(Q = E(e) - 2XE^rj) + X2E(r,2). (2)

Since > 0 we have E(£f) > 0 for every real X, therefore the polynomial
(2) in X of degree 2 is nonnegative. But as it is well known this is only pos¬
sible if (1) holds, which is what we wished to prove.
Let ^ be a discrete random variable and A an event having positive proba¬
bility. The conditional expectation of £ with respect to the condition A is
defined by the formula

E(Z\A) = '£tP(S = xk\A)xk, (3)


k

provided that the series on the right side is absolutely convergent (which is
always fulfilled if E{f) exists), where x„ (n = 1,2,...) denote the possible
values of £. E(£ \ A) is therefore the expectation of the conditional distri¬
bution of £ with respect to the condition A. If the events A„ (n = 1,2,...)
form a complete system of events, then in view of the theorem of total proba¬
bility

£(4) = £P({ = Xk)xk = £ £/>({ = Xk | A)P(A) ** = EP(A)£« I A„).


k k n n

Thus we proved the following theorem:

Theorem 6. If An (n = 1,2,.. .) is a complete system of events and £ is


a discrete random variable, then

£({) = SP0J£«M„). (4)


n

provided that E(£) exists.

Particularly, if is the indicator of the event B, then E(ff) — P{B),


E{£,b | A) — P{B | A) and we obtain the theorem of total probability as a
special case of Theorem 6. Hence Theorem 6 is used to be called the theorem
of total expectation.
Theorem 6 may also be interpreted in the following manner: The condi¬
tional expectation Eif \ A„) can be considered as a random variable which
takes on the value E(£, \ A„), if the event An (n = 1, 2, . . .) occurs. Accord¬
ing to this interpretation the right side of (4) is the expectation of the discrete
random variable E{£ | An). Let rj be a random variable whose value depends
on the event which actually occurs of the events An: e.g. put r\ = n if An
occurs (« = 1,2,...). Since now £(£ | rj) can be written instead of E{<£ \ A„).
Ill, § 8] SOME THEOREMS ON EXPECTATIONS 109

we have, according to the statement of Theorem 6,

E(m |,)) = £({). (5)

This relation will be used later on.


Example. Formula (5) can also be used to compute the expectation of the
sum of a random number of random variables. Let £l5 £2, • • • be indepen¬
dent random variables and let v be a random variable independent of
in = 1,2,...) and taking on the values 1,2,... with probabilities
qx, q2, . . . . Consider the random variable

C — £i + £2 + • • • +

which is the sum of a random number of random variables. It follows from


(5) that
m E(m= 1 v)).

If En is the expectation of then, in view of Theorem 2 and of the inde¬


pendence of the random variables and v, we obtain that

E(t;\v = n) = E1 + E2+. ..+£„,

hence, according to (5),


GO

E(D = TlVn(E1 + E2 + ...+EJ,


n=1

or, after rearrangement of the terms (which is admissible if the series

Yj Qn ( I Ex | + | E21 + ... + | En | )
n=1

converges), we have
00 00

£(0 = E^(2 ^)-


«=1 /c = n

In the special case where the expectations of the random variables ^


are equal, i.e. En = is, then

£(0=£I nqn=E'E(v). (6)


n=l

Theorem 7. If £ and q are independent discrete random variables and if


E(f) and E(q) exist, then E(j;rf) exists as well and

E^rj)=E(Om- (7)
110 DISCRETE RANDOM VARIABLES [HI, § 9

Proof. Let Ajk denote the event £ = xj} r] — yk (j, k = 1,2,...). Clearly,
the possible values of are the numbers which can be represented in the
form z = Xjyk. Further zP^rj = z) = z £ P(Ajk) = £ XjykP(Ajk), hence
xjyk=z x/yk=z

Em = YJY,xjykp<<Ajk). (8)
j k

Because of the independence of £ and t] we have P(Ajk) = P(£ — Xj) x


x P(r\ = yk). Thus we obtain from (8) that

mn) = (Z Xj Pit = Xj)) (X ykP(ti = yk)) - m E(rj).


j k

Since a series obtained as a sum of term-by-term products of two absolute


convergent series is itself absolute convergent, Theorem 7 is herewith proved.

§ 9. The variance

The expectation of a random variable is the value about which the random
variable fluctuates; but it does not give any information about the magni¬
tude of this fluctuation. If we compute the expectation of the difference
between a random variable and its expectation we obtain, as we have already
seen, always zero. This is so because the positive and negative deviations
from the expectation cancel each other. Thus it seems natural to consider
the quantity
d(0 = E(\Z~E(01) (1)
as a measure of the fluctuations. Since, however, this expression is difficult
to handle, it is the positive square root of the expectation of the random
variable (£ - E(£))2 which is most frequently used as a measure of the
magnitude of fluctuation. This quantity, called the standard deviation of £,
is thus defined by the expression

m =+dm -£©)2) (2)


(provided that this value is finite) and D%£) is called the variance of £.3
The choice of Z)(£) for measuring the fluctuations is advantageous from a
mathematical point of view, as it makes computations easier. The real im¬
portance of the concept of variance is shown, however, by some basic theo¬
rems of probability theory discussed in the following Chapters, e.g. the
central limit theorem.

1 The letter D hints at the Latin word dispersio.


HI, § 9] THE VARIANCE 111

From the fact that E is a linear operator follows immediately

Theorem 1. If D(f) exists, then

D\0 = £(?)-[E(Of-

This is the formula by which the standard deviation is most readily com¬
puted. If the discrete random variable b, assumes the values xn (n = 1,2,.. .)
with probabilities pn = P{t, = x„), then

n
(*„ - £(0)2 (3)
and, according to Theorem 1,

£2(0 = Z/i Pn A - (Z Pn
n
XnY• (4)

We obtain by a similar simple argument the somewhat more general

Theorem 2. For any real number A one has

D\() = E((( - Af) - [£({) - A]2.

From this we obtain immediately the following theorem:

Theorem 3. For any real number A

E(ff — A)2) > D2 (£).

The equality holds if and only if A = E(f).

Theorems 2 and 3 are similar (from a formal point of view even equal)
to the well-known Steiner theorem in mechanics which states that the mo¬
ment of inertia of a linear mass-distribution about an axis perpendicular to
this line is equal to the sum of the moment of inertia about the axis through
the center of gravity and the square of the distance of the axis from the
center of gravity, provided that the total mass is unity; consequently, the
moment of inertia has its minimal value if the axis passes through the center
of gravity.
Theorem 3 exhibits an important relation between the expectation and
the variance.
Theorem 2 is mostly used if the values of £ lie near to a simple number A
but the expectation has not exactly this value. For computational reasons
it is then more convenient to calculate the value of Elf — A)2.
112 DISCRETE RANDOM VARIABLES [HI, § 9

Obviously, the standard deviation/)^) is always nonnegative. If £>(£) =0,


then £ is equal to a constant with the probability 1. Indeed, because of
(£ - £(0)2 ^ 0 the equality D(£) = 0 can only hold if P(£ = E(E)) = 1,
hence £ is a constant with probability 1.

Theorem 4. For any random variable £

d(0 < D(Z).

Proof. According to Theorem 5 of § 8

d*(0=E2(\Z-E(0\
Equality can occur in other cases besides the trivial case when £ is with
probability 1 a constant, thus e.g. if f takes on the values +1 and -1 with

Theorem 5. If r\ — ab, + b {a and b are constant), then

D{rf) = | a | • £>(£)•

Proof. Since E(t]) = aE{f) + b, we obtain that

D\rj) = E(a%t; - £(£))2) = a2D2 (f).

Especially, we obtain that the standard deviation does not change if we add
a constant to the random variable £, or multiply it by -1.
It is seen from (3) that the variance of a random variable depends on its
distribution only. Hence we can speak about the variance of a distribution.
We shall now compute the variances of certain discrete distributions and
for sake of comparison we determine the values of d(0 as well.

1. The variance of the binomial distribution. Let the distribution of the


random variable ^ be a binomial distribution of order n:

In §7 we have seen that the expectation of f is E(g) = np; similarly, we


obtain here that

The value of d(f), for sake of simplicity, will only be determined for a bi-
HI, § 9] THE VARIANCE 113

nomial distribution of an even order and parameter .If n — 2N and

P = ~, then E(fi) = N and thus

(2 N
N
1 ™ (2N)
U r ~"1 =
)2 N

By using Stirling’s formula we obtain that for N -> + oo

(2N\
N
N
22N

Here « is the sign of asymptotic equality. If aN and bN (N = 1,2,...) are


two sequences of numbers (bN =£ 0), we say that the two sequences are
asymptotically equal (aN « bN) if

lim -^=1.
V- 00 bN

Since in the case of n — 2N, p — — according to (5) we have

N
m= 7T
2 ’

it follows that

d(0 m-
7r

Thus the quotient ——- tends ter N oo to the limit . We shall see
4 m v *
later on that this holds for a whole class of distributions.

2. The variance of a negative binomial distribution of the first order. Let


the distribution of the random variable £ be a negative binomial distribu¬
tion of the first order, i.e.

P(£ = k+ 1) = pqk (k = 0,1,...),


114 DISCRETE RANDOM VARIABLES [III, § 9

where 0<p<l,q=l — p. We have seen in § 7 that E(E) = —. Thus

00 1
D\l;)=pY1(k+\fqk--j,
k=0 P
and therefore

D2(0

If p is small, then D{£) is approximately equal to £(£) = — .

3. The variance of the hypergeometric distribution. Let £ have a hyper¬


geometric distribution, i.e.

m IN — M

P(f = k)
UJ n —k
(k = 0, 1,...,«)•

1
As in the two preceding examples we obtain

n — 1
i -
N— 1

. , , . M
Let us introduce the notations —r = p, q = 1 — p, then

n- n
p>(0 = Jnpq 1 -
N— 1

hence the standard deviation of the hypergeometric distribution is some¬


what less than that of the corresponding binomial distribution; the differ¬
ence is, however, small if n is small with respect to N (cf. Example 18, § 12,
Ch. II). Random fluctuations are for drawing from an urn without replace¬
ment less than for drawing with replacement. The quotient of the two stan-
M
dard deviations tends to 1 for N -> + oo, if the value of -= p remains
N
fixed and n increases more slowly than N.
to, § 10] SOME THEOREMS CONCERNING THE VARIANCE 115

§ 10. Some theorems concerning the variance

In the present paragraph we shall prove several theorems concerning the


variance, which will be used often later on.

Theorem 1. If £2, are pairwise independent, then

k =1 fc=1

Proof. Let E(^k) — Ek, then

DHY, «*) = E & «*) + 2 z £(«, - £,) ((„ - Ek)).


k=l k=l j<k

From the pairwise independence of £k-s and from Theorem 7 of § 8 follows


that
E((tj-Ej)(Zk-Ek))= 0 if j±k.

Thus we have proved our theorem.


It follows immediately, as a generalization of Theorem 1,

Theorem 2. If £2, are pairwise independent and cx, c2,. . ., c„


are real constants, then

k=l k=1

Because of later applications, the following particular form of Theorem 1


deserves to be mentioned: If £x, £2,..., are pairwise independent random
variables having the same distribution and standard deviation D, their sum
C„ = & + £2 + • • • + f» clearly has

m„) = Dfn .
Let E denote the expectation of the distribution of £k, then

E{U = nE.

Hence the ratio

_ D
E(U Ejn
116 DISCRETE RANDOM VARIABLES [III, § 11

tends to zero for n-> oo, provided that E is distinct from zero. Consequences
of this are dealt with in Chapter VII. If £ is a positive random variable, the

quotient
m) is called the coefficient of variation of f
m
As an interesting consequence of Theorem 2 we mention that if ^ and rj
are independent, then

D2(Z + r,) = D2(Z-r1).

Theorems 1 and 2 can be used to compute the variance of distributions.

1. The variance of the binomial distribution. If £l5 £2are indepen¬


dent random variables assuming the value 1 with probability p and the
value 0 with probability q = 1 — p, then their sum

= £l + £2 + • • • +

is a random variable having a binomial distribution of order n. Since


D2(fk) = P<T it follows from Theorem 1 that

D2 (C„) = npq.

Thus by applying Theorem 1 we can avoid the calculation used in § 9 for


the determination of the variance of the binomial distribution.

2. The variance of the negative binomial distribution. In the former para¬


graph the variance of the negative binomial distribution of the first order
was determined. If the independent random variables £2,. . ., £r have
a negative binomial distribution of the first order, i.e. if

P(fj = k+l)=pqk (* = 0,1,...; j = 1,2,... r)

then, as we know already, D%) = ~ . Applying Theorem 1 it follows for

the negative binomially distributed random variable ^ + . . . + £r °f


order r that

rq
D\Q =

§ 11. The correlation coefficient

The correlation coefficient gives some information about the dependence


of two random variables. If £ and rj are any two nonconstant discrete ran-
III, § 11] THE CORRELATION COEFFICIENT 117

dom variables, the value i?(£, rj) defined by the formula

n* , _ m - m] [n - md (i)
{ mm
is said to be the correlation coe fficient of £ and tj. (If £ or r) is constant, we
put i?(£, rj) = 0.)
From this definition follows immediately that R(rj, £) = i?(£, rj). If the
possible values of £ and rj are xm (m = 1,2,...) and yn (n = 1,2,.. .),
and rmn = P(f = xm, r\ = y„), then

n) = mm ?? m(y* ”m)-
If £ is any nonconstant random variable, the random variable

r .t-m
(2)
m
satisfies
E{0 = 0 and D(£) - 1.

The operation (2) which applied to £ gives is called the standardization


of the random variable £. It follows immediately from the definition of the
correlation coefficient that
R^,rj)=E^'rl'). (3)

Now we shall prove some theorems about the correlation coefficient.

Theorem 1. We have
E^rj)-E(QE(ri)
m n) = mm
(4)

Proof. It follows from the linearity of the operator E that

E([£ - E(0] [n - E(rj)]) = Etfrj) - E(£)E(rj).

Theorem 2. The value of i?(£, rj) lies always between — 1 and + 1.

Proof. According to Theorem 5 of § 8

ie([z - Em [n - Em) i < mm-


Theorem 2 cannot be further sharpened, since

R(Z,Z) = + 1
118 DISCRETE RANDOM VARIABLES [III, § U

and
R(€,-Z) = -1.

Theorem 3. If £ and rj are independent, then

R(Z,rj) = 0.

Proof. If £ and q are independent, then, according to Theorem 7 of § 8

E(£ri) = Ett)E(ri).

Hence Theorem 3 follows from Theorem 1.


Remark. The converse of Theorem 3 does not hold. The independence
of £ and tj does not follow, in general, from R(£, y) = 0. If R(£, rj) = 0,
then the random variables £ and q are called uncorrelated. While uncorre¬
lated random variables are not necessarily independent, nevertheless this is
true for certain special cases (cf. e.g. Theorem 4).

Theorem 4. If and £B are indicators of the events A and B with positive


probabilities, the condition R(£A, £s) = 0 is equivalent to the independence
of and i;B.

Proof. Since E(U = P(A), E(£B) = P(B) and E(£JB) = P{AB), it


follows from the condition R(fA, %B) — 0 that

P(AB)=P(A)P(B),

which is equivalent to the independence of A and B.


The following is an example of two uncorrelated but not independent
random variables. Let

Pif = 1, q = 1) = P{f = - 1, q = 1) = Ptf = 1, q = - 1) =

= p(i = - i,if—1) = -£-,

P(( = 0, r, = 1) =P(f = 0, t, = - 1) = P(( = 1,, = 0) =

= i>({ = -l,, = 0) = -izE.

where 0 < p < 1. Then E(f) — E(q) — E{fq) = 0, hence £ and q are uncor¬

related. Since, however, P(£ = 0, q = 0) = 0/ P(£ = 0) P(q = 0) = ^

they are not independent.


Ill, § 11] THE CORRELATION COEFFICIENT 119

In what follows we shall study, what kind of consequences can be deduced


from the knowledge of the value of the correlation coefficient. First we
prove the following simple theorem:

Theorem 5. | R(£rj) | = 1 holds, if and only if

*1 = ac, + h (5)

with probability 1, where a and b are real constants and a ^ 0; in this case
R(£, rj) = +1 or — 1 according as a > 0 or a < 0.

Proof. Let £(£) = m. If the relation (5) holds between £ and rj, we have

E(a(f — rrif) 1
i.iJm =sgna-
Suppose, for instance, R(£, rj) = +1. (The case R(f, rj) = — 1 can be dealt
with in the same manner.) Put

r= z-m , _ h - E(rj)
m 5 " dm ’
then by (3)
E(,Z' n) = i,
hence
£(«' - iV) =2-2 = 0.

From this it follows that

P^ = n)= 1,
that is
z -m
t] = E{rj) + D(rj)
D(0
with the probability 1.
Thus, unless a linear relation of the form (5) holds between f and rj, the
absolute value of their correlation coefficient is less than 1.

1 sgn x means the sign (signum) of x; it is defined by

1 if x > 0

sgn x = 0 if x = 0
1 if x < 0.
120 DISCRETE RANDOM VARIABLES [HI, § II

In the following we shall say that there is a positive correlation between


£ and rj, if Rtf, 17) > 0 and a negative correlation, if R(f, 17) < 0.
The following most instructive theorem is due to L. V. Kantorovich:

Theorem 6. Let £ and 17 be discrete random variables assuming only a


finite number of values. Let the possible different values of £ be xt (i = 1, 2,
. .m) and those of r\ (j = 1,2,..n). //£* and r\k are for h= 1,2,...,
m — 1 uhJ k = 1,21 uncorrelated, i.e.

E(£hrik) = E<£h)E(rik) (h = 1,..m — 1; k = 1,...,« — 1), (6)

then £ and rj are independent.

Proof. Let

P(ξ = x_i) = p_i,   P(η = y_j) = q_j,   P(ξ = x_i, η = y_j) = r_{ij};

then Equation (6) goes over into the equivalent form

Σ_{i=1}^{m} Σ_{j=1}^{n} r_{ij} x_i^h y_j^k = (Σ_{i=1}^{m} p_i x_i^h)(Σ_{j=1}^{n} q_j y_j^k);

the latter clearly holds also if h = 0, k ≤ n − 1 and if k = 0, h ≤ m − 1.

By introducing the notation δ_{ij} = r_{ij} − p_i q_j we obtain for these new unknowns the following system of equations:

Σ_{i=1}^{m} Σ_{j=1}^{n} δ_{ij} x_i^h y_j^k = 0   (h = 0, 1, ..., m − 1; k = 0, 1, ..., n − 1).  (7)

Introducing the notation

d_{ik} = Σ_{j=1}^{n} δ_{ij} y_j^k,  (8)

we have for the unknowns d_{ik} (i = 1, 2, ..., m) the system of linear equations

Σ_{i=1}^{m} d_{ik} x_i^h = 0   (h = 0, 1, ..., m − 1).  (9)

The determinant of this system is the so-called Vandermonde determinant. It is well known that its value is different from 0. Thus the system (9) of equations has no solution distinct from 0, i.e.

d_{ik} = 0   (i = 1, 2, ..., m).

Since the above consideration holds for every k = 0, 1, ..., n − 1, we obtain

Σ_{j=1}^{n} δ_{ij} y_j^k = 0   (k = 0, 1, ..., n − 1).  (10)

The determinant of Equations (10) is again a Vandermonde determinant, thus

δ_{ij} = 0   (j = 1, 2, ..., n).

The same can be shown for every i = 1, 2, ..., m. From this follows

r_{ij} = p_i q_j,

thus ξ and η are independent.


Remark. The random variables ξ and η must fulfil (m − 1)(n − 1) conditions in this theorem; as was seen in Chapter II, § 9, the same number of conditions is necessary to ensure the independence of two complete systems of events consisting of m and n events.

Finally, we give an example in which the correlation coefficients are effectively computed. Let the r-dimensional distribution of the random variables ξ_1, ξ_2, ..., ξ_r be a polynomial distribution

P(ξ_1 = k_1, ξ_2 = k_2, ..., ξ_r = k_r) = (n! / (k_1! k_2! ... k_r!)) p_1^{k_1} p_2^{k_2} ... p_r^{k_r},

where 0 ≤ k_i ≤ n (i = 1, 2, ..., r) and k_1 + k_2 + ... + k_r = n; furthermore 0 < p_i < 1 and Σ_{i=1}^{r} p_i = 1. We compute the correlation coefficient R(ξ_i, ξ_j). It follows from a simple calculation that

E(ξ_i ξ_j) = n(n − 1) p_i p_j.

It is easy to see that every component of the polynomial distribution has a binomial distribution and thus

E(ξ_k) = n p_k   and   D(ξ_k) = √(n p_k (1 − p_k)),

i.e.

R(ξ_i, ξ_j) = −√( p_i p_j / ((1 − p_i)(1 − p_j)) )   (i ≠ j);

thus ξ_i and ξ_j are always negatively correlated.
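This formula can be checked by simulation. The following sketch is an added illustration (it assumes the NumPy library; n = 20 and the probabilities are arbitrary choices): it estimates R(ξ_i, ξ_j) from a large number of simulated samples of the polynomial distribution and compares the estimate with the formula above.

import numpy as np

rng = np.random.default_rng(0)
n, probs = 20, [0.2, 0.3, 0.5]                    # illustrative choices
samples = rng.multinomial(n, probs, size=200_000)  # rows: (k_1, k_2, k_3)

i, j = 0, 1
empirical = np.corrcoef(samples[:, i], samples[:, j])[0, 1]
p_i, p_j = probs[i], probs[j]
theoretical = -np.sqrt(p_i * p_j / ((1 - p_i) * (1 - p_j)))
print("empirical R  :", round(empirical, 4))
print("theoretical R:", round(theoretical, 4))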

§ 12. The Poisson distribution

Under certain conditions, the binomial distribution can be approximated


by the so-called Poisson distribution. The Poisson distribution, dealt with
in this and in the following paragraph, is one of the most important distri¬
butions in probability theory. Let us first consider a practical example.
The following problem occurs in the production of glass-bottles. In the
melted glass, used for the production of the bottles, there remain little solid
bodies briefly called “stones”. If a stone gets into the mass of a bottle, the
latter becomes defective. The stones are situated at random in the melted
glass. But under constant circumstances of production, a given mass of
glass contains in the average the same amount of stones. Suppose for in¬
stance that 100 kg of fluid glass contains an average number x of stones,
let further the weight of a bottle be 1 kg. What per cent of the produced
bottles will be defective because of containing stones? At the first glance we could think that, as the mass of 100 bottles contains in the average x stones, approximately x per cent of the bottles will be defective. This consideration is, however, wrong, since it does not take into account that more than one of the stones can get into the mass of one bottle and thus the number of the defective items will usually be less.
The problem in question can be solved by means of probability theory.
Let us first reduce the problem to a simplified model, nevertheless fulfilling
the practical requirements. In practical applications of mathematics we
generally work with such models. Whether such a model gives a true picture
of the real situation depends on the adequate choice of the model.
We construct the following model for our problem. Suppose that every
stone gets with the same probability into the mass of any of the bottles inde¬
pendently of what happens to the other stones. Thus the problem is reduced
to an urn-problem: n balls are dropped at random into N urns, what is the
probability that a randomly chosen urn contains exactly k balls? Since
there are N equally probable possibilities for every one of the balls, the proba¬
bility that an urn should contain just k balls is, according to the formula
of the binomial distribution,

W_k = \binom{n}{k} (1/N)^k (1 − 1/N)^{n−k}.  (1)

We ask for the percentage of defective items, if the production of N bottles requires M times 100 kg of liquid glass. In this case N = 100M and n = xM. Since we are interested in the percentage of defective items in a long period of production, we may assume that M is very large. Let x/100 = λ; then a simple calculation gives that

W_k = (λ^k / k!) (1 − λ/n)^{n−k} Π_{j=1}^{k−1} (1 − j/n).  (2)

It is known that

lim_{n→∞} (1 − λ/n)^n = e^{−λ},  (3)

hence from (2)

lim_{n→∞} W_k = (λ^k / k!) e^{−λ}   (k = 0, 1, ...).  (4)

Let

w_k = (λ^k e^{−λ}) / k!   (k = 0, 1, ...).  (5)

From the power series of e^x we have

Σ_{k=0}^{∞} w_k = e^{−λ} Σ_{k=0}^{∞} λ^k / k! = 1.  (6)

Thus the probabilities defined by (5) are the terms of a probability distribution, called the Poisson distribution with parameter λ; the meaning of λ in the above example is the average number of balls in one urn. It can be shown by direct calculation that λ is the expectation of the Poisson distribution (5). Namely from the relation

P(ξ = k) = (λ^k e^{−λ}) / k!   (k = 0, 1, ...)

we have

E(ξ) = Σ_{k=1}^{∞} k (λ^k / k!) e^{−λ} = λ e^{−λ} Σ_{k=1}^{∞} λ^{k−1} / (k − 1)! = λ e^{−λ} e^{λ} = λ.  (7)

Thus the expectation of the Poisson distribution (5) is λ; hence the distribution (5) can be called the Poisson distribution with expectation λ. The variance of the Poisson distribution can easily be calculated:

E(ξ²) = Σ_{k=0}^{∞} k² (λ^k / k!) e^{−λ} = Σ_{k=2}^{∞} k(k − 1)(λ^k / k!) e^{−λ} + λ = λ² + λ,

hence

D²(ξ) = λ² + λ − λ² = λ;

that is, the standard deviation of the Poisson distribution (5) is D(ξ) = √λ. Thus the variance of a Poisson distribution is equal to the expectation.

In the passage to the limit in (4) no use was made of the property that the probability for a ball to enter in a certain urn is 1/N with a natural number N.
Therefore our result can also be stated in the following form: the k-th term

W_k = \binom{n}{k} p^k q^{n−k}  (8)

of the binomial distribution tends to the k-th term of the Poisson distribution, i.e. to the limit

p_k = (λ^k / k!) e^{−λ},  (9)

if n → ∞ and p → 0 in such a way that np = λ, where λ > 0 is a constant number. (Clearly, the condition np = λ can be substituted by the condition np → λ.)
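The speed of this convergence is easy to inspect numerically. The following sketch is an added illustration using only the standard library (n = 100 and p = 0.03, hence λ = np = 3, are arbitrary choices): it prints the binomial terms (8) next to the Poisson terms (9).

from math import comb, exp, factorial

n, p = 100, 0.03
lam = n * p
for k in range(8):
    binom_term = comb(n, k) * p**k * (1 - p)**(n - k)
    poisson_term = lam**k * exp(-lam) / factorial(k)
    print(k, round(binom_term, 5), round(poisson_term, 5))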
The distribution function of the Poisson distribution can be expressed in integral form by means of Euler's incomplete gamma function. Let

Γ(z, x) = ∫_0^x t^{z−1} e^{−t} dt  (10)

for x > 0, z > 0, denote the incomplete gamma function of Euler and

Γ(z) = Γ(z, +∞) = ∫_0^∞ t^{z−1} e^{−t} dt  (11)

the complete gamma function of Euler. Partial integration yields the formula

Σ_{k=0}^{r} (x^k e^{−x}) / k! = 1 − (1/r!) ∫_0^x t^r e^{−t} dt = 1 − Γ(r + 1, x) / Γ(r + 1).  (12)

Let us now return to our practical problem. Because of the relation be¬
tween relative frequency and probability, the ratio of defective bottles and
produced bottles is approximately equal to the probability of a bottle being
defective, provided the number of manufactured bottles is sufficiently large.
This probability, however, is 1 − W_0, hence approximately 1 − e^{−λ}. Since λ = x/100, the percentage of defective items is 100 (1 − exp(−x/100)). If x is very small, this is in fact nearly equal to x; in the case of large x, however, it is not. In the extreme case, when x = 100, the fraction of defective bottles is not 100 per cent, as it would follow from the consideration mentioned at the beginning of this paragraph, but only 100 (1 − e^{−1}) ≈ 63.21%. Of course such a large fraction of defective items will not occur. If for instance x = 30, the fraction of defective items is 100 (1 − e^{−0.3}) ≈ 25.92% instead of 30%.
Clearly, if the number of stones is large, it is more economical to produce
small bottles, provided of course that there is no way for clearing the liquid
glass. Using 0.25 kg glass per bottle instead of 1 kg, the fraction of defective
items decreases for x = 30 from 25.92% to 7.22%. As is seen from this
example, probability theory can give useful hints for practical problems
of production.
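The figures of this example can be reproduced with a few lines of code. The following sketch is an added illustration (the data are those of the example above): it evaluates 100 (1 − e^{−λ}) for x = 30 stones per 100 kg and for bottle masses of 1 kg and 0.25 kg.

from math import exp

x = 30                                   # stones per 100 kg of glass
for bottle_kg in (1.0, 0.25):
    lam = x * bottle_kg / 100            # expected number of stones per bottle
    defective_percent = 100 * (1 - exp(-lam))
    print(bottle_kg, "kg bottle:", round(defective_percent, 2), "% defective")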

§ 13. Some applications of the Poisson distribution

In the previous paragraph the Poisson distribution was introduced as an


approximation to the binomial distribution. Now we shall show that the
Poisson distribution represents the exact solution of a problem of probabil¬
ity theory. This problem is of fundamental importance in physics, chemistry,
biology, astronomy, and other fields.
First let us deal with the example of radioactive decay. The atoms of a radioactive element are randomly disintegrating. As experience shows, the probability for an atom (non-disintegrated until a certain moment) to disintegrate during the next time interval of length t depends only on the length t of this time interval. Let this probability be F(t) and put G(t) = 1 − F(t). As to the function G(t) we know only that it is monotone decreasing and G(0) = 1. Let A_s denote the event that a certain atom does not disintegrate during the time interval (0, s); then clearly P(A_{s+t} | A_s) = G(t). It follows from the definition of conditional probability that

P(A_{s+t}) = P(A_{s+t} | A_s) P(A_s),  (1)
hence
G(s + t) = G(s) G(t). (2)

Thus we obtained a functional equation for G(t). If we assume further that G(t) is differentiable at the point t = 0, G(t) may be obtained in the following simple manner: substitute Δt for s in (2); then from (2) it follows that

(G(t + Δt) − G(t)) / Δt = G(t) (G(Δt) − 1) / Δt.  (3)

Let Δt tend to 0. Because of G(0) = 1, we get

G′(t) = G′(0) G(t).  (4)


In the deduction of this equation the existence of the derivative of G(t) was supposed only at the point t = 0; namely, if the limit on the right side of (3) exists for Δt → 0, then the same holds for the left side as well. G′(0) is necessarily negative. It follows namely from the monotone decreasing property of G(t) that G′(0) ≤ 0. If we had G′(0) = 0, it would follow from (4) that G(t) = 1, which means that no radioactive disintegration could occur. Thus putting G′(0) = −λ we have λ > 0. The solution of (4) with G(0) = 1 is

G(t) = e^{−λt}.  (5)

The same result can be obtained without the assumption of the existence of G′(0); the assumption that G(t) is monotone decreasing suffices. In fact, we have from (2)

G(2t) = G²(t),   G(3t) = G³(t),

or, generally, for every positive integer n

G(nt) = G^n(t).  (6)

Let nt = s; then

G(s/n) = [G(s)]^{1/n}.  (7)

From (6) and (7) we obtain

G((m/n) t) = [G(t)]^{m/n},

hence for every positive rational number r

G(r) = [G(1)]^r.  (8)

Since G(1) < 1, G(1) can be written in the form G(1) = e^{−λ}. Thus we obtain from (8) that for every rational t

G(t) = e^{−λt}.  (9)

However, because of the monotonicity of G(t), (9) holds for every t. Therefore

F(t) = 1 − G(t) = 1 − e^{−λt}.  (10)

Let us now examine the physical meaning of the constant λ. By expanding the function F(Δt) = 1 − e^{−λΔt} in powers of Δt we obtain the equality

F(Δt) = λΔt + O((Δt)²).  (11)

The left side of (11) is the probability that an atom, which did not disintegrate until the moment t, will disintegrate before the moment t + Δt. λ has thus the following physical meaning: the probability that an atom disintegrates during the time interval between t and t + Δt is (up to higher powers of Δt) equal to λΔt. The constant λ is called the constant of disintegration; it characterizes the radioactive element in question and may serve for its identification. It is attractive to give another interpretation of the number λ, which enables us to measure it. The time during which approximately half of the mass of the radioactive substance disintegrates is said to be the half-life period. More exactly, this is the time interval such that during it each of the atoms of the substance has probability 1/2 of disintegrating. Consider a given mass of a radioactive element of disintegration constant λ. Since every atom disintegrates during the half-life period T with the probability 1/2, we have F(T) = 1/2. However, G(T) = 1 − F(T) = e^{−λT}, thus

e^{−λT} = 1/2,   i.e.   λ = (ln 2) / T.  (12)
The disintegration constant is therefore inversely proportional to the half-life period. The obtained result may be expressed as follows: the life time of any atom of a radioactive element is a random variable ξ such that its distribution function F(t) = P(ξ < t) has the form

F(t) = 1 − e^{−λt}   (t > 0),

where λ is a positive constant, the disintegration constant of the element in question. (For t < 0 clearly F(t) = 0, since the life time cannot be negative.) More concisely: the life time of a radioactive atom is an exponentially distributed random variable. Hence the custom to speak about the exponential law of radioactive disintegration.

Suppose that at time t = 0 there are N atoms. How many non-disintegrated atoms shall there be at time t > 0? The probability of disintegration during this time is for every atom 1 − e^{−λt}. So in view of the relation between relative frequency and probability, the number of disintegrations will be approximately N(1 − e^{−λt}). Hence approximately Ne^{−λt} atoms remain non-disintegrated.

Let P_k(t) be the probability that during the time interval (0, t) exactly k atoms disintegrate. Suppose that the disintegration of each atom is an event independent of the disintegration of the others; then we have

P_k(t) = \binom{N}{k} (1 − e^{−λt})^k e^{−(N−k)λt}.  (13)

The number of disintegrations thus obeys the binomial law. If λt is small and k not too large, P_k(t) may be approximated by a Poisson distribution; the probability P_k(t) is approximately

[N(1 − e^{−λt})]^k exp[−N(1 − e^{−λt})] / k!.  (14)

As a further step we can replace, for small λt values, (1 − e^{−λt}) simply by λt. Thus P_k(t) is near to

P_k*(t) = (Nλt)^k e^{−Nλt} / k!.  (15)

The half-life period of radium is 1580 years. Taking a year for unit we obtain λ = 0.000439. If t is less than a minute, λt is of the order 10^{−9}. For 1 g uranium mineral, containing approximately 10^{15} radium atoms, the relative errors committed in replacing P_k(t) by P_k*(t) are of the order 10^{−3}.

If we restrict ourselves to the case where t is small with respect to the half-life period, we can choose the model so that the Poisson distribution represents the exact distribution of the number of radioactive disintegrations.
Consider a certain mass of radioactive substance and assume:

1. If t_1 < t_2 < t_3 and A_k(t_1, t_2) denotes the event that "during the time interval (t_1, t_2) k disintegrations occur", then the events A_k(t_1, t_2) and A_l(t_2, t_3) are independent for all nonnegative integer values of k and l.

2. The events A_k(t_1, t_2), k = 0, 1, ..., form a complete system. If k is given, P[A_k(t_1, t_2)] depends only on the difference t_2 − t_1. In other words, the process of radioactive disintegration is homogeneous with respect to time. Let W_k(t) denote the probability of k disintegrations during a time interval of length t (t_2 − t_1 = t).

3. If t is small enough, the probability that during a time interval t there occurs more than one disintegration is negligibly small compared to the probability that there occurs exactly one. That is

lim_{t→0} (1 − W_0(t) − W_1(t)) / W_1(t) = 0,  (16)

or equivalently

lim_{t→0} (1 − W_0(t)) / W_1(t) = 1.  (17)

In words: the probability that there occurs at least one disintegration is, in the limit, equal to the probability that there occurs exactly one.

Clearly, W_0(0) = 1 and W_k(0) = 0 for k ≥ 1. Further, W_0(t) is a monotone decreasing function of t. From this and from conditions 1 and 2 it follows that

W_0(t + s) = W_0(t) W_0(s);

hence we have

W_0(t) = e^{−μt},   where μ > 0.  (18)

In order to determine the functions W_k(t) we show first that

lim_{Δt→0} W_k(Δt) / Δt = 0   if k = 2, 3, ....  (19)

Obviously, this is a consequence of (16) and of the relation

Σ_{k=2}^{∞} W_k(Δt) = 1 − W_0(Δt) − W_1(Δt).  (20)

Since W_k(0) = 0 for k ≥ 1, (19) can be written in the form

W_k′(0) = 0   (k = 2, 3, ...).  (21)

It is to be noted here that the existence of W_k′(0) was not assumed, but proved.

The event that k disintegrations occur during the time interval (0, t + Δt) can happen in three ways:

a) k − 1 disintegrations occur between 0 and t and one between t and t + Δt;

b) k disintegrations occur between 0 and t and 0 between t and t + Δt;

c) at most k − 2 disintegrations occur between 0 and t and at least 2 between t and t + Δt.

Thus, because of conditions 1 and 2, we get

W_k(t + Δt) = W_k(t) W_0(Δt) + W_{k−1}(t) W_1(Δt) + R,  (22)

where R = o(Δt), according to condition 3 and relation (19). In view of (17) and (18) we obtain from (22) for Δt → 0

W_k′(t) = μ (W_{k−1}(t) − W_k(t))   (k = 1, 2, ...).  (23)

Thus we obtained for the W_k(t) a readily solvable system of differential equations. Put

V_k(t) = W_k(t) e^{μt},  (24)

then, from (23),

V_k′(t) = μ V_{k−1}(t)   (k = 1, 2, ...).  (25)

From W_0(t) = e^{−μt} follows V_0(t) = 1 and we obtain

V_1(t) = μt,

V_2(t) = (μt)² / 2,

and, in general,

V_k(t) = (μt)^k / k!.

Hence

W_k(t) = (μt)^k e^{−μt} / k!   (k = 0, 1, ...).

Thus we have proved that the number of disintegrations during a time


interval t, given conditions 1-3, has a Poisson distribution with expecta¬
tion proportional to t.
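The exact Poisson law just derived can also be checked by simulation. The following sketch is an illustrative addition (not part of the original text; it assumes Python's standard random module, and μ = 2, t = 1.5 are arbitrary values): it generates disintegration epochs with independent exponential waiting times of parameter μ and compares the empirical frequencies of k disintegrations in (0, t) with (μt)^k e^{−μt} / k!.

import random
from math import exp, factorial

random.seed(1)
mu, t, trials = 2.0, 1.5, 100_000      # illustrative parameters

counts = [0] * 20
for _ in range(trials):
    time, k = 0.0, 0
    while True:
        time += random.expovariate(mu)  # exponential waiting time with parameter mu
        if time > t:
            break
        k += 1
    counts[min(k, 19)] += 1

for k in range(7):
    empirical = counts[k] / trials
    poisson = (mu * t) ** k * exp(-mu * t) / factorial(k)
    print(k, round(empirical, 4), round(poisson, 4))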
The Poisson distribution can also be used in studying the number of telephone calls during a given time interval. Let A_k(t_1, t_2) be the event: "between the moments t_1 and t_2 a telephone exchange receives exactly k calls"; the
assumptions introduced for the radioactive disintegration are here approxi¬
mately valid (at least during the “rush hours”). The number of the calls has
thus a Poisson distribution. The situation is analogous for the number of
electrons emitted by the glowing cathode of an electron tube during a time
interval t; also for the number of shooting stars observed during a time inter¬
val t as well as for other phenomena exhibiting random fluctuations.
As an application of the Poisson distribution in astronomy let us consider now the mean density λ of the stars in some region of the Milky Way. This density can be considered to be constant. We understand by this that in a volume V there are in the average Vλ stars.

In the same manner as in the case of radioactive disintegration (reformulating of course conditions 1-3 adequately), it can be shown that the probability that a region of volume V of the space contains exactly k stars is equal to

(λV)^k e^{−λV} / k!   (k = 0, 1, ...).  (26)

The distribution of the stars thus follows the same law as the radioactive
disintegration; the only difference is that here the volume plays the role

of time. The same reasoning holds for particular kinds of stars as well, e.g.
for double stars. In the same manner the distribution of red and white cells
in the blood can be determined. Let Ak denote the event that there are
exactly k cells to be seen in the visual field of the microscope; then we have

P(A_k) = (λT)^k e^{−λT} / k!   (k = 0, 1, ...),  (27)

where T is the area of the visual field and λ is the average number of cells per unit area.

§ 14. The algebra of probability distributions

In the present paragraph we shall summarize systematically the relations


between probability distributions, which we encountered in the previous
paragraphs. In particular, we shall deal with relations which permit to con¬
struct other distributions from a given one. We shall consider probability
distributions belonging to discrete random variables £ taking on positive
values only; such distributions will be denoted by {p0,Pi, • • -,Pk> ■ • •}
where pk = P(£ = k) (k = 0, 1,. . .). For the sake of brevity the notation
SA = {p0,Pi,. . -,pk,. ■ •} will be used as well.
A fundamental operation is the mixing of probability distributions. Let {α_n} (n = 0, 1, ...) be nonnegative numbers with sum equal to 1 and let 𝒫_n = {p_{nk}} be for each value of n (n = 0, 1, ...) a probability distribution. Let us form the expression

π_k = Σ_{n=0}^{∞} α_n p_{nk}.  (1)

Obviously, the numbers π_k (k = 0, 1, ...) form again a probability distribution; indeed π_k ≥ 0 and

Σ_{k=0}^{∞} π_k = Σ_{n=0}^{∞} α_n Σ_{k=0}^{∞} p_{nk} = Σ_{n=0}^{∞} α_n = 1.  (2)

Let the probability distribution Π = {π_k} be defined by

Π = Σ_{n=0}^{∞} α_n 𝒫_n;

Π will be called the mixture of the probability distributions 𝒫_n taken with the weights α_n.
weights a„.

For instance, the mixture of the binomial distributions

ℬ_n(p) = { \binom{n}{k} p^k q^{n−k} }

taken with the weights α_n = λ^n e^{−λ} / n! is a Poisson distribution. In fact

Σ_{n=k}^{∞} (λ^n e^{−λ} / n!) \binom{n}{k} p^k q^{n−k} = (λp)^k e^{−λp} / k!.  (3)

Another example is the mixture of the hypergeometric distributions

ℋ_n(M, N) = { \binom{M}{k} \binom{N−M}{n−k} / \binom{N}{n} }

with weights α_n = \binom{N}{n} p^n q^{N−n}. This leads to the binomial distribution ℬ_M(p), as is seen from the relation

Σ_{n=k}^{N−(M−k)} [ \binom{M}{k} \binom{N−M}{n−k} / \binom{N}{n} ] \binom{N}{n} p^n q^{N−n} = \binom{M}{k} p^k q^{M−k}.  (4)

Geometrically, mixtures of distributions can be represented in the following way: two distributions 𝒫_1 = {p_{1k}} and 𝒫_2 = {p_{2k}} can be considered as two points of an infinite dimensional space having the coordinates p_{1k} and p_{2k}, respectively. The mixture

α𝒫_1 + β𝒫_2 = {αp_{1k} + βp_{2k}}   (0 < α < 1, β = 1 − α)

subdivides the "segment" 𝒫_1𝒫_2 in proportion α : β. All probability distributions 𝒫 = {p_n} are on the "hyperplane" of this space with equation Σ_{n=0}^{∞} p_n = 1, namely in that part of this hyperplane for which p_n ≥ 0. These points constitute thus a "simplex" S. Since

α𝒫_1 + β𝒫_2   (0 < α < 1, β = 1 − α)

is a probability distribution as well, it follows that S contains with two points the segment joining them. S is thus convex.

Another often-used operation is the convolution of probability distributions. The convolution of the distributions 𝒫 = {p_k} and 𝒬 = {q_k} is the distribution ℛ = {r_k}, where

r_k = Σ_{j=0}^{k} p_j q_{k−j}.  (5)

As it was seen in § 6, ℛ is the distribution of the sum ξ + η of two independent random variables ξ and η having the distributions 𝒫 and 𝒬 respectively. Even without the knowledge of this result, it is readily shown that ℛ is a probability distribution. In fact r_k ≥ 0 and

Σ_{k=0}^{∞} r_k = Σ_{j=0}^{∞} p_j Σ_{h=0}^{∞} q_h = 1.  (6)

The convolution of 𝒫 and 𝒬 is denoted by 𝒫𝒬. Since

Σ_{j=0}^{k} p_j q_{k−j} = Σ_{j=0}^{k} q_j p_{k−j},

we have

𝒫𝒬 = 𝒬𝒫.  (7)

The convolution is thus a commutative operation. It is associative as well:

𝒫_1(𝒫_2 𝒫_3) = (𝒫_1 𝒫_2) 𝒫_3 = 𝒫_1 𝒫_2 𝒫_3.  (8)

In fact, if 𝒫_j = {p_{jk}} (j = 1, 2, 3), the k-th term of the distribution 𝒫_1(𝒫_2 𝒫_3) as well as of the distribution (𝒫_1 𝒫_2) 𝒫_3 is equal to

Σ_{i+j+h=k} p_{1i} p_{2j} p_{3h}.

In this manner multiple convolutions and convolution-powers of a distribution may be defined. By the n-th convolution-power of a distribution 𝒫 we understand the n-fold convolution of the distribution 𝒫 with itself, in symbols 𝒫^n. Thus, for instance, if p_0 = q = 1 − p and p_1 = p, we obtain as the n-th convolution-power of the binomial distribution ℬ_1(p) = {q, p} of order 1 the binomial distribution

ℬ_n(p) = (ℬ_1(p))^n.  (9)

In fact

Σ_j \binom{m}{j} p^j q^{m−j} \binom{n}{k−j} p^{k−j} q^{n−k+j} = \binom{m+n}{k} p^k q^{m+n−k},

hence

ℬ_m(p) ℬ_n(p) = ℬ_{m+n}(p).  (10)

Relation (9) can be obtained from (10) by mathematical induction. Similarly, it can be shown that for the negative binomial distribution 𝒱_r(p) we have 𝒱_r(p) = [𝒱_1(p)]^r, where 𝒱_r(p) = {p_k^{(r)}} with p_k^{(r)} = 0 for k < r and p_k^{(r)} = \binom{k−1}{r−1} p^r q^{k−r} for k ≥ r.

It can be shown finally that the convolution of two Poisson distributions is again a Poisson distribution. If 𝒫(λ) = {λ^k e^{−λ} / k!}, then

𝒫(λ) 𝒫(μ) = 𝒫(λ + μ),  (11)

since

Σ_{j=0}^{k} (λ^j e^{−λ} / j!)(μ^{k−j} e^{−μ} / (k − j)!) = (λ + μ)^k e^{−(λ+μ)} / k!,

i.e. the distribution obtained as the convolution "product" of two Poisson distributions has for its parameter the sum of the parameters of the two "factors".
Let us now introduce the degenerate distribution ℰ_0. It is defined by

ℰ_0 = {1, 0, 0, ..., 0, ...}.

Obviously, for any distribution 𝒫 one has

𝒫 ℰ_0 = 𝒫.  (12)

Thus the distribution ℰ_0 plays the role of the unit element with respect to the convolution operation.¹ The distributions ℰ_n, defined by p_n = 1, p_m = 0 for m ≠ n, are also degenerate distributions. It is easy to show that

ℰ_r ℰ_s = ℰ_{r+s} = ℰ_s ℰ_r.  (13)

It is readily seen that the operations mixture and convolution commute:

(Σ_{n=0}^{∞} α_n 𝒫_n) 𝒬 = Σ_{n=0}^{∞} α_n (𝒫_n 𝒬).  (14)

By means of the operations mixture and convolution, functions of probability distributions can be defined in the following manner: let g(z) = Σ_{n=0}^{∞} W_n z^n be a power series with nonnegative coefficients such that g(1) = Σ_{n=0}^{∞} W_n = 1. If 𝒫 is an arbitrary probability distribution, let g(𝒫) be defined by

g(𝒫) = Σ_{n=0}^{∞} W_n 𝒫^n   (𝒫^0 = ℰ_0).  (15)

If for instance ℰ_1 is the degenerate distribution defined above and if g(z) = (pz + q)^n (0 < p < 1), then, because of (13), we have

ℬ_n(p) = (pℰ_1 + q)^n.  (16)

Similarly, if g(z) = e^{λ(z−1)} (λ > 0), then

𝒫(λ) = exp[λ(ℰ_1 − 1)],  (17)

where 𝒫(λ) is the Poisson distribution of parameter λ. In fact

exp[λ(ℰ_1 − 1)] = e^{−λ} Σ_{k=0}^{∞} (λ^k / k!) ℰ_1^k = e^{−λ} Σ_{k=0}^{∞} (λ^k / k!) ℰ_k = {λ^k e^{−λ} / k!}.

¹ The probability distributions form a commutative semi-group with respect to convolution, with unit element ℰ_0.
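On sequences truncated to finitely many terms these operations are easy to carry out numerically. The following sketch is an added illustration (the parameters and the truncation length are arbitrary choices): it convolves two Poisson distributions according to formula (5) and checks relation (11).

from math import exp, factorial

def poisson(lam, length):
    return [lam**k * exp(-lam) / factorial(k) for k in range(length)]

def convolve(p, q):
    # r_k = sum over j <= k of p_j * q_{k-j}, as in formula (5)
    return [sum(p[j] * q[k - j] for j in range(k + 1)) for k in range(len(p))]

lam, mu, length = 1.3, 2.1, 15
r = convolve(poisson(lam, length), poisson(mu, length))
target = poisson(lam + mu, length)
print(max(abs(a - b) for a, b in zip(r, target)))  # agrees up to rounding error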

§ 15. Generating functions

In the present paragraph, we shall again deal with random variables


taking on nonnegative integer values only. Let ξ be such a random variable and put P(ξ = k) = p_k (k = 0, 1, ...). The generating function¹ G_ξ(z) of the random variable ξ is defined by the power series

G_ξ(z) = Σ_{k=0}^{∞} p_k z^k,  (1)

where z is a complex variable. The power series (1) is certainly convergent for |z| ≤ 1, since

Σ_{k=0}^{∞} p_k = 1  (2)

and represents an analytic function which is regular in the open unit disk.
The introduction of the generating function makes it possible to treat some
problems of probability theory by the methods of the theory of functions
of a complex variable.

1 Called sometimes probability generating function.



Since the generating function is uniquely determined by the distribution


of a random variable, we may speak about the generating function of a
probability distribution on the set of the nonnegative integers.
It follows immediately from the definition of the generating function that the distribution of a random variable is uniquely determined by its generating function; in fact

p_0 = G_ξ(0),   p_k = G_ξ^{(k)}(0) / k!   (k = 1, 2, ...),  (3)

where G_ξ^{(k)}(z) is the k-th derivative of G_ξ(z). The series (1) may converge in a circle larger than the unit circle, or even in the entire plane.

Examples

1. Generating function of the binomial distribution. Let ξ be a random variable having a binomial distribution of order n; then

G_ξ(z) = (1 + p(z − 1))^n = (pz + q)^n.

2. Generating function of the Poisson distribution. Let ξ be a random variable having a Poisson distribution with expectation λ; then

G_ξ(z) = e^{λ(z−1)}.

(Compare these with the corresponding Formulas (16) and (17) of the preceding paragraph.)

3. Generating function of the negative binomial distribution. Let ξ be a random variable of negative binomial distribution with expectation r/p; then we have

G_ξ(z) = ( pz / (1 − (1 − p)z) )^r.

From the generating function of a distribution one can obviously get all
characteristics (expectation, variance, etc.) of the distribution. We shall
now show that these quantities can all be expressed indeed by the deriva¬
tives of the generating function at the point z = 1. Since the generating
function is, in general, defined only for | z | < 1, we understand by the
“derivative at the point z = 1” always the left side derivative (provided it
exists).
If the derivatives G_ξ^{(r)}(z) of G_ξ(z) exist at z = 1, we have the following relations:

G_ξ′(1) = Σ_{k=1}^{∞} k p_k,

G_ξ″(1) = Σ_{k=2}^{∞} k(k − 1) p_k,

and, in general,

G_ξ^{(r)}(1) = Σ_{k=r}^{∞} k(k − 1) ⋯ (k − r + 1) p_k   (r = 1, 2, ...),  (4)

where the series on the right is convergent. Conversely, it is easy to show that if the series in (4) converges, the derivative G_ξ^{(r)}(1) exists and Formula (4) is valid. The number

M_s = E(ξ^s) = Σ_{k=1}^{∞} k^s p_k   (s = 1, 2, ...)  (5)

is called the moment of order s of ξ (hence M_1 is the expectation). Thus we have

G_ξ(1) = M_0 = 1,

G_ξ′(1) = M_1,

G_ξ″(1) = M_2 − M_1;

and, in general,

G_ξ^{(r)}(1) = Σ_{j=1}^{r} S_r^{(j)} M_j   (r = 1, 2, ...),  (6)

where the S_r^{(j)} are Stirling numbers of the first kind, defined by the relation

x(x − 1) ⋯ (x − r + 1) = Σ_{j=1}^{r} S_r^{(j)} x^j.

Equations (6), if solved with respect to M_j, give

M_1 = G_ξ′(1),

M_2 = G_ξ″(1) + G_ξ′(1),

and, in general,

M_s = Σ_{j=1}^{s} σ_s^{(j)} G_ξ^{(j)}(1),  (7)

where the σ_s^{(j)} are Stirling numbers of the second kind, defined by

x^s = Σ_{j=1}^{s} σ_s^{(j)} x(x − 1) ⋯ (x − j + 1).

Equations (7) allow the calculation of the central moments of ξ, i.e. the moments of ξ − E(ξ):

m_s = E([ξ − E(ξ)]^s)   (s = 2, 3, ...).  (8)

In fact

m_s = Σ_{j=0}^{s} \binom{s}{j} (−M_1)^{s−j} M_j.  (9)

For s = 2 we obtain the often used formula

D²(ξ) = m_2 = M_2 − M_1² = G_ξ″(1) + G_ξ′(1) − [G_ξ′(1)]².  (10)

A convenient procedure to calculate moments (central or not) of higher orders by means of the generating function is the following: substitute z = e^w into G_ξ(z) and expand the function G_ξ(e^w) in powers of w:

H_ξ(w) = G_ξ(e^w) = Σ_{k=0}^{∞} p_k e^{kw},  (11)

or

H_ξ(w) = Σ_{s=0}^{∞} M_s w^s / s!.  (12)

The function H_ξ(w) is called the moment generating function of the random variable ξ. In order to calculate central moments we put

I_ξ(w) = e^{−wM_1} H_ξ(w).  (13)

A simple computation furnishes

I_ξ(w) = Σ_{s=0}^{∞} m_s w^s / s!   (m_0 = 1, m_1 = 0).  (14)

I_ξ(w) is called the central moment generating function of ξ. H_ξ(w) and I_ξ(w) exist only if G_ξ(z) is regular at z = 1. The necessary and sufficient condition for this is the existence of all moments of ξ and the finiteness of the expression

lim sup_{s→∞} (M_s / s!)^{1/s}.
This condition is always fulfilled for bounded random variables and also
in case of certain unbounded distributions, e.g. the Poisson distribution and
the negative binomial distribution.
If H_ξ(w) exists, then H_ξ(0) = 1, since G_ξ(1) = 1. But then there can be found a circle |w| < r in which I_ξ(w) ≠ 0, hence ln I_ξ(w) is regular. Put K_ξ(w) = ln I_ξ(w). Since K_ξ(0) = 0 and K_ξ′(0) = I_ξ′(0) = 0, we have for |w| < r

K_ξ(w) = Σ_{l=2}^{∞} k_l w^l / l!.  (15)

The coefficients k_l = k_l(ξ) (l = 2, 3, ...) are called cumulants or semi-invariants of the random variable ξ. If η = ξ + C (C being a fixed positive integer), then k_l(η) and k_l(ξ) are identical (since G_η(z) = z^C G_ξ(z), thus I_η(w) = I_ξ(w)); hence the name semi-invariant. The meaning of the name "cumulants" will be explained later.

Between the first cumulants and the first central moments we have the following simple relations:

k_2 = m_2 = D²(ξ),

k_3 = m_3,  (16)

k_4 = m_4 − 3m_2².

These can be established by differentiating the equality K_ξ(w) = ln I_ξ(w). The function K_ξ(w) is called the cumulant generating function of ξ.

Example. The cumulants of the Poisson distribution. Let ξ be a random variable having a Poisson distribution with expectation λ. We have

G_ξ(z) = e^{λ(z−1)},

I_ξ(w) = e^{λ(e^w − 1 − w)},

hence

K_ξ(w) = λ(e^w − 1 − w) = λ Σ_{l=2}^{∞} w^l / l!.  (17)

In consequence, all cumulants k_l(ξ) are equal to λ. In particular, not only the variance of ξ, but also its third central moment is equal to λ. This can also be seen by direct calculation.
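This can indeed be checked numerically. The following sketch is an illustrative addition (not from the original text; λ = 2.5 and the truncation length are arbitrary choices): it computes the second and third central moments of a truncated Poisson distribution and compares both with λ.

from math import exp, factorial

lam, length = 2.5, 60     # truncation long enough for a negligible tail
p = [lam**k * exp(-lam) / factorial(k) for k in range(length)]

mean = sum(k * pk for k, pk in enumerate(p))
m2 = sum((k - mean) ** 2 * pk for k, pk in enumerate(p))
m3 = sum((k - mean) ** 3 * pk for k, pk in enumerate(p))
print(mean, m2, m3)       # all three are (very nearly) equal to lambda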
In what follows we shall prove some properties of generating functions,
properties which make the application of these functions a very fruitful
device in probability theory.

Theorem 1. If ξ and η are two independent random variables, we have

G_{ξ+η}(z) = G_ξ(z) G_η(z)  (18)

and, consequently,

K_{ξ+η}(w) = K_ξ(w) + K_η(w).  (19)

Relation (19) states that in adding independent random variables, their cumulant generating functions as well as their cumulants themselves are

added (or "cumulated"), since (19) implies

k_l(ξ + η) = k_l(ξ) + k_l(η)   (l = 2, 3, ...).  (20)

Remark. For l = 2 relation (20) is already well known to us: the variance of the sum of independent random variables is equal to the sum of the variances. For l = 3, relation (20) shows that this holds for the third central moments, too.

Proof. Equality (18) is proved by direct calculation; (19) follows immediately from (18) and from E(ξ + η) = E(ξ) + E(η).

Theorem 2. If the distribution of the random variable η is the mixture, with weights α_n (α_n ≥ 0, Σ_{n=0}^{∞} α_n = 1), of the distributions of the random variables ξ_n (n = 0, 1, ...), then

G_η(z) = Σ_{n=0}^{∞} α_n G_{ξ_n}(z).  (21)

Proof. The probability that the quantity η is equal to the random variable ξ_n is, by assumption, equal to α_n. Thus, if q_k = P(η = k) and p_{nk} = P(ξ_n = k), we have

q_k = Σ_{n=0}^{∞} α_n p_{nk}.  (22)

Consequently

G_η(z) = Σ_{k=0}^{∞} q_k z^k = Σ_{n=0}^{∞} α_n Σ_{k=0}^{∞} p_{nk} z^k,  (23)

where the order of the summations may be interchanged because of the absolute convergence of the double series. Relation (21) is herewith proved.

Theorem 3. Assume that the random variables ξ_1, ξ_2, ... are independent and have the same distribution; let G(z) be their common generating function. Let further ν be a random variable taking on positive integer values only, which is independent of the ξ_n-s. The generating function of the sum

η = ξ_1 + ξ_2 + ... + ξ_ν  (24)

of a random number of random variables is equal to G_ν[G(z)].

Proof. Theorem 3 is a consequence of Theorems 1 and 2. In fact, the distribution of η is the mixture of the distributions 𝒫^n (n = 1, 2, ...) with weights α_n = P(ν = n), where 𝒫 stands for the distribution of ξ_1. According to Theorem 1 the generating function of 𝒫^n is [G(z)]^n, hence by Theorem 2

G_η(z) = Σ_{n=1}^{∞} α_n [G(z)]^n.

But, by definition, G_ν(z) = Σ_{n=1}^{∞} α_n z^n. Hence

G_η(z) = G_ν(G(z)),  (25)

which finishes the proof of our theorem.
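Theorem 3 can be illustrated numerically. In the sketch below, which is an added illustration and not part of the original text, ν has a Poisson distribution with expectation N and each ξ_i is an indicator with P(ξ_i = 1) = p (both arbitrary choices); then G_ν(G(z)) = e^{N(pz + q − 1)} = e^{Np(z−1)}, so the random sum η should again have a Poisson distribution, with expectation Np, and the simulation confirms this.

import random
from math import exp, factorial

random.seed(2)
N, p, trials = 4.0, 0.3, 100_000

def poisson_sample(lam):
    # inversion by sequential search, adequate for small lam
    u, k, prob, cum = random.random(), 0, exp(-lam), exp(-lam)
    while u > cum:
        k += 1
        prob *= lam / k
        cum += prob
    return k

counts = {}
for _ in range(trials):
    nu = poisson_sample(N)
    eta = sum(1 for _ in range(nu) if random.random() < p)  # sum of nu indicators
    counts[eta] = counts.get(eta, 0) + 1

for k in range(6):
    empirical = counts.get(k, 0) / trials
    predicted = (N * p) ** k * exp(-N * p) / factorial(k)
    print(k, round(empirical, 4), round(predicted, 4))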


The generating function of the joint distribution of several nonnegative
integer valued random variables can be defined analogously. Let for in¬
stance £ and q be two random variables assuming nonnegative integer val¬
ues only (independence is here not supposed); the joint distribution of the
random variables £ and q is defined by the probabilities

rh k = P(Z = h,ri = k) (h,k = 0,1,...). (26)

The generating function of the joint distribution of the random variables


£ and q is defined by the series
GO 00

G(x, y) = X Z rhk (27)


h-0 k=0

where x and y are complex variables satisfying the conditions | x | < 1,


| y | < 1. Obviously, G(x, 1) and <7(1, y) are the respective generating func¬
tions of f and q. The probabilities rhk are uniquely determined by G{x, y),
namely
■>h + k
1 G(x, y)
thk~ h\k\ dxhdyk xy=°0' {
If £ and i/ are independent, rhk = phqk with ph — P(£ = h) and qk =
= P{q = k). From this it follows that G{x, y) = G4(x) Gn(y). Conversely,
from the latter relation follows the independence of £ and q.
Example. The binomial distribution. Let the possible outcomes A, B, C
of an experiment mutually exclude each other and have the respective prob¬
abilities p, q, r (p + q + r = 1). Let us perform n independent trials.
Let £ denote the number of trials leading to the outcome A, q the number ot
those leading to the outcome B. The random variables £ and q are not inde¬
pendent. The joint distribution of the random variables £ and q can be given
by the probabilities
ph qk rn-h-k
rhk = P(£ = h, q = k) = (29)
h\k\ (n — h — k)\

The generating function G_n(x, y) is given by

G_n(x, y) = (px + qy + r)^n.  (30)

If the number n of the trials is a random variable having a Poisson distribution with expectation N, ξ and η become independent from each other. In fact, according to Theorem 2 (which can immediately be generalized to the case of two dimensions) we obtain that the generating function of the mixture of the trinomial distributions (30) with weights N^n e^{−N} / n!, n = 0, 1, ..., is

G(x, y) = Σ_{n=0}^{∞} (N^n e^{−N} / n!) G_n(x, y) = e^{N(px + qy + r − 1)},  (31)

and since p + q + r = 1, we have

G(x, y) = e^{Np(x−1)} e^{Nq(y−1)};  (32)

therefore ξ and η are independent random variables with Poisson distribution and with expectations Np and Nq, respectively.

Conversely, ξ and η are only independent if the number of trials has a Poisson distribution. In fact, if α_n = P(ν = n), further if ξ and η are independent, then

G(x, y) = G(x, 1) G(1, y) = Σ_{n=0}^{∞} α_n G_n(x, y).  (33)

Let A(z) be the generating function of ν; then according to (30) and (33) we have

G(x, y) = A(p(x − 1) + q(y − 1) + 1).

Hence from (33)

A(p(x − 1) + q(y − 1) + 1) = A(p(x − 1) + 1) A(q(y − 1) + 1).  (34)

If we put g(z) = A(z + 1), g(z) satisfies the functional equation

g(a + b) = g(a) g(b).  (35)

But from this it follows, because of the regularity of g(z), that g(z) = e^{Nz}. Hence A(z) = e^{N(z−1)}; that is, ν has a Poisson distribution.
Now we shall prove the following theorem:

Theorem 4. If the sequence of distributions of the random variables ξ_1, ξ_2, ... (assuming nonnegative integer values only) converges to a probability distribution, i.e. if

lim_{n→∞} p_{nk} = p_k   (k = 0, 1, ...)  (36)

and

Σ_{k=0}^{∞} p_k = 1  (37)

are valid for

p_{nk} = P(ξ_n = k),  (38)

then the generating functions of the ξ_n converge, in the closed unit circle, to the generating function of the distribution {p_k}. Hence we have

lim_{n→∞} G_n(z) = G(z)   for |z| ≤ 1,  (39)

where

G_n(z) = Σ_{k=0}^{∞} p_{nk} z^k  (40)

and

G(z) = Σ_{k=0}^{∞} p_k z^k.  (41)

Conversely, if the sequence G_n(z) tends to a limit G(z) for every z with |z| ≤ 1, then (36) and (37) are valid, i.e. G(z) is the generating function of a distribution {p_k} and the distributions {p_{nk}} converge to this distribution {p_k}.

Remark. If (36) does hold while (37) does not, then (39) is valid only in the interior of the unit circle. This can be seen from the following example. Let ξ_n = n, hence

p_{nk} = 1 for k = n,   p_{nk} = 0 otherwise;

consequently,

lim_{n→∞} p_{nk} = 0   (k = 0, 1, ...),

but

lim_{n→∞} G_n(z) = lim_{n→∞} z^n = 0 for |z| < 1,   and = 1 for z = 1,

while for z = e^{iϑ} with 0 < ϑ < 2π there exists no limit.

It can be seen from the same example that if we assume (39) to hold for |z| < 1 only, G(z) will not necessarily be a generating function.

Proof of Theorem 4. First we show that (39) follows from (36) and (37). Let ε > 0 be an arbitrary number; choose N such that

Σ_{k=N}^{∞} p_k < ε/4,  (42)

where p_k has the sense given in (37); this will be always possible because of (37). Choose next a number n so large that

|p_{nk} − p_k| < ε/(4N)   (k = 0, 1, ..., N − 1)  (43)

holds, which is possible because of (36). Since Σ_{k=0}^{∞} p_{nk} = 1, it follows from (42) and (43) that for n large enough

Σ_{k=N}^{∞} p_{nk} < ε/2.  (44)

In fact

Σ_{k=N}^{∞} p_{nk} = 1 − Σ_{k=0}^{N−1} p_{nk} ≤ 1 − Σ_{k=0}^{N−1} p_k + ε/4 = Σ_{k=N}^{∞} p_k + ε/4 < ε/2.

It follows from relations (42), (43) and (44) that for |z| ≤ 1 and for sufficiently large n

|G(z) − G_n(z)| ≤ Σ_{k=0}^{N−1} |p_k − p_{nk}| + Σ_{k=N}^{∞} p_k + Σ_{k=N}^{∞} p_{nk} < ε,

which was to be proved.


Now we shall prove that (39) implies (36) and (37). From the assumption

lim_{n→∞} G_n(z) = G(z)   for |z| ≤ 1

and from

|G_n(z)| ≤ G_n(1) = 1   for |z| ≤ 1, n = 1, 2, ...,

it follows according to the known theorem of Vitali that G(z) is regular for |z| < 1 and that G_n(z) converges uniformly to G(z) in every circle |z| ≤ r < 1. Putting

G(z) = Σ_{k=0}^{∞} p_k z^k

and denoting by C_r the circle |z| = r < 1, we obtain that

lim_{n→∞} p_{nk} = lim_{n→∞} (1/(2πi)) ∫_{C_r} G_n(z) / z^{k+1} dz = (1/(2πi)) ∫_{C_r} G(z) / z^{k+1} dz = p_k.

From this (36) follows. Since G(1) = lim_{n→∞} G_n(1) = 1, we get (37).

Example. By means of Theorem 4 another proof can be given of the fact that the binomial distribution converges to the Poisson distribution. Let G_n(z) be the generating function of the binomial distribution ℬ_n(λ/n); then

G_n(z) = (1 + (λ/n)(z − 1))^n.

Clearly

lim_{n→∞} G_n(z) = e^{λ(z−1)},

and since e^{λ(z−1)} is the generating function of the Poisson distribution 𝒫(λ), our statement follows from the second part of Theorem 4.

It can be proved in the same manner that the negative binomial distribution converges to the Poisson distribution 𝒫(λ) for r → ∞, if (1 − p)r = λ is kept constant. In other words, if

P(ξ_r = k) = \binom{r + k − 1}{k} p^r q^k   (k = 0, 1, ...),

where p = 1 − λ/r and q = 1 − p = λ/r, then the distribution of ξ_r converges to the Poisson distribution 𝒫(λ). Since the generating function G_r(z) of the distribution of ξ_r is given by

G_r(z) = ( (1 − λ/r) / (1 − λz/r) )^r

and

lim_{r→∞} ( (1 − λ/r) / (1 − λz/r) )^r = e^{λ(z−1)},

our statement follows from Theorem 4.


The reader may have noticed that the present and the preceding paragraph deal substantially with the same problems. The only difference is that instead of the algebraic point of view the analytical viewpoint is favored here. Obviously, it means the same to say that the distribution 𝒫 can be exhibited in the form 𝒫 = G(ℰ_1), where G(z) is a power series with nonnegative coefficients such that G(1) = 1 and ℰ_1 denotes the distribution {0, 1, 0, ..., 0, ...}, or to say that the distribution 𝒫 has the generating function G(z). In dealing with algebraic relations between distributions, the first point of view is entirely sufficient and the analytic point of view is superfluous. If, however, theorems of convergence are considered, the analytic point of view is preferable.
As an example of the application of generating functions, let us consider
now a problem taken from the theory of chain reactions. Consider the chain
reaction occurring in an electron multiplier. This instrument consists of
so-called “screens”. If an electron hits a screen, secondary electrons are
generated, whose number is a random variable. These electrons hit a second
screen, making free new electrons from it, whose number is again a random
variable, etc. Suppose that the distribution of the secondary electrons pro¬
duced by one primary electron is the same for each screen. Calculate the probability that exactly k electrons are produced from the n-th screen. Let ξ_{nr} (r = 1, 2, ...) be the number of secondary electrons produced from the n-th screen by the r-th electron; assume that ξ_{n1}, ξ_{n2}, ... are independent random variables with the same distribution which take on nonnegative integer values only. Let p_k denote the probability p_k = P(ξ_{nr} = k) (k = 0, 1, ...). Let further η_n denote the number of electrons issued from the n-th screen. We have then

η_n = ξ_{n1} + ξ_{n2} + ... + ξ_{n η_{n−1}};  (45)

in fact, the number of electrons emerging from the n-th screen is the sum of the electrons liberated by those emerging from the (n − 1)-th screen. Thus the random variable η_n is exhibited as the sum of independent random variables, the number of terms of the sum being equal to the random variable η_{n−1}. Put

G(z) = Σ_{k=0}^{∞} p_k z^k  (46)

and let G_n(z) be the generating function of η_n. We have G_1(z) = G(z) and it follows from Theorem 3 that

G_n(z) = G_{n−1}(G(z))   (n = 2, 3, ...),  (47)

hence

G_2(z) = G(G(z)),   G_3(z) = G(G(G(z))),   etc.

The generating function G_n(z) is thus the n-th iterate of G(z). Sometimes it is convenient to employ the recursive formula

G_n(z) = G(G_{n−1}(z))   (n = 2, 3, ...).  (48)



In general, we have

G_{n+m}(z) = G_m(G_n(z)).  (49)

Let us compute from the generating function G_n(z) the expectation M_n of η_n. Put

M = Σ_{k=1}^{∞} k p_k = G′(1).  (50)

It is here to be mentioned that an electron multiplication, in the true sense of the word, takes place only if M > 1; in fact, only then can it be expected to observe an increase of the number of electrons (cf. the calculations below). In order to calculate M_n differentiate (47) and put z = 1; then we have

M_n = G_n′(1) = G_{n−1}′(1) G′(1) = M_{n−1} M.  (51)

Consequently

M_n = M^n   (n = 1, 2, ...).  (52)

The expectation of the number of electrons emitted from the n-th screen is thus the n-th power of the expectation of the number of electrons emitted from the first screen. For M > 1 this expectation increases beyond every bound for n → ∞; for M < 1 it tends to 0. In the latter case the process stops sooner or later. Let us see now what is the probability of this. Let P_{nk} be the probability that k electrons are emitted from the n-th screen; in particular, we have

P_{n,0} = G_n(0).  (53)

It can be supposed that G(0) = p_0 is positive, since if G(0) = 0, obviously P_{n,0} = 0 for n = 1, 2, ....

The sequence P_{n,0} (n = 1, 2, ...) is monotone increasing. This can be seen immediately: in fact, if no electron is emitted from the n-th screen, the same will hold for the (n + 1)-st screen too; the converse, however, is not true. According to (53) we have

P_{n+1,0} = G_n(G(0)) ≥ G_n(0) = P_{n,0};  (54)

the sequence P_{n,0} is thus monotone increasing. Since P_{n,0} ≤ 1 for every n, the limit

lim_{n→∞} P_{n,0} = P  (55)

exists. It follows from (53) that

P_{n,0} = G(P_{n−1,0});  (56)



P is therefore a root of the equation

P = G(P).  (57)

Since G(1) = 1, 1 is also a root of this equation. We shall show that for M < 1 there exist no other real roots. In this case therefore the probability that no electrons are emitted from the n-th screen tends to 1 if n → ∞. To prove this draw the curve y = G(x). Since G(x) is a power series with nonnegative coefficients, the same holds for all its derivatives; G(x) is therefore monotone increasing in the interval 0 ≤ x ≤ 1 and is also convex. The equation P = G(P) means that P is the abscissa of an intersection of the curve y = G(x) and the line y = x. Since G(0) > 0, G(x) − x is positive for x = 0. Now if G′(1) = M > 1, G(x) − x is, because of G(1) = 1, negative in an appropriate left hand side neighbourhood of the point x = 1 (see Fig. 14). As G(x) is continuous, there exists a value P (0 < P < 1) satisfying (57). Because of the convexity of G(x) there can exist no further points of intersection.

It can be proved in the same manner that for M < 1 Equation (57) has no real roots other than P = 1. (There can of course exist complex roots of (57).)

It is yet to be shown that for M > 1 the sequence P_{n,0} (n = 1, 2, ...) converges to the smaller of the two roots of Equation (57). This can be seen immediately from Fig. 15 by relation (47), which gives in case of M > 1 for

every z (0 < z < 1) the relation

lim_{n→∞} G_n(z) = P,  (58)

hence

lim_{n→∞} P_{n,k} = 0   (k = 1, 2, ...).  (59)

Fig. 15

Thus the probability that from the n-th screen there are exactly k ≥ 1 electrons issued tends to 0 for n → ∞ for each fixed value of k. From

lim_{n→∞} P_{n,0} = P < 1   and   Σ_{k=0}^{∞} P_{n,k} = 1

it follows that for large enough n the number of the emitted electrons (provided that the process did not stop) will be arbitrarily large with a (conditional) probability near to 1. This is in accordance with experience.
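The limit P can be obtained numerically by simply iterating relation (56). The following sketch is an illustrative addition (the offspring distribution p_0 = 0.3, p_1 = 0.3, p_2 = 0.4 is an arbitrary example with M = 1.1 > 1): starting from P_{1,0} = G(0) it iterates P_{n,0} = G(P_{n−1,0}).

# offspring distribution {p_0, p_1, p_2} of one electron; here M = G'(1) = 1.1
p = [0.3, 0.3, 0.4]

def G(z):
    return sum(pk * z**k for k, pk in enumerate(p))

P = G(0.0)                    # P_{1,0}
for n in range(2, 60):
    P = G(P)                  # P_{n,0} = G(P_{n-1,0})
print("limit of P_{n,0}:", round(P, 6))
# For comparison: the smaller root of z = G(z) is 0.75 for this distribution.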

§ 16. Approximation of the binomial distribution by the normal distribution

In probability theory Stirling's formula is often employed in the following


form:

n! = √(2πn) (n/e)^n exp(θ_n / (12n))   (0 < θ_n < 1).  (1)
This can be proved by means of Euler’s summation formula and Wallis’
formula. We employ Euler’s summation formula in the following form:

Let f(x) be a continuously differentiable function in the closed interval [a, b]; let further

ρ(x) = x − [x] − 1/2,  (2)

where [x] denotes as usually the integral part of x, i.e. [x] = k for k ≤ x < k + 1 (k = 0, 1, ...). Then we have

Σ_{a<k≤b} f(k) = ∫_a^b f(x) dx − [ρ(b) f(b) − ρ(a) f(a)] + ∫_a^b ρ(x) f′(x) dx.  (3)

Remark. If a = A − 1/2 and b = B + 1/2, with A and B integers, we have ρ(a) = ρ(b) = 0 and instead of (3) we may simply write¹

Σ_{k=A}^{B} f(k) = ∫_{A−1/2}^{B+1/2} f(x) dx + ∫_{A−1/2}^{B+1/2} ρ(x) f′(x) dx.  (4)

In the present paragraph an approximation will be given for the terms

W_k = \binom{n}{k} p^k q^{n−k}   (k = 0, 1, ..., n;  0 < p < 1;  q = 1 − p)  (5)

of the binomial distribution. Put z = k − np, hence

k = np + z   and   n − k = nq − z.  (6)

Evaluating asymptotically the binomial coefficient figuring in (5) by Stirling's formula, a simple calculation gives

W_k = √( n / (2π(np + z)(nq − z)) ) (1 − z/(np + z))^{np+z} (1 + z/(nq − z))^{nq−z} e^{δ}  (7)

with

δ = θ_n/(12n) − θ_k/(12k) − θ_{n−k}/(12(n − k)),  (8)

where θ_n is defined by (1). We assume that the quantity

x = z / √(npq)  (9)

¹ The proof of this formula can be found e.g. in K. Knopp [1].

remains bounded:
I | ^A (A = constant). (10)
For the different factors on the right hand side of (7) we obtain

n 1 x(q-p)
1 - Ml (11)
2n(np + z) (<nq - z) yjlnnpq 2\fnpq 1«,
and
np+z V nq—z
1 - 1 +
np + z nq

(q-P)*3 1
=e *ar1+
2 +o (12)
bjnpq n

According to assumption (10) we have1

(13)
3=0 (!)•

and the constant figuring in the residual term O


in Equations (11), (12)
n,
and (13) depends on A only; thus we obtain from the relations (7), (11),
(12), and (13) the following theorem:

Theorem 1. If 0 < p < l, q = l - p, and

n pk qti k
W, = (k = 0,1,...,«), (14)
k
further if
k — np
x = <A, (15)
yjnpq
then
X2
' 2
(x3 - 3x) {q - p)
1 + (16)
sJlTinpq 6Jnpq +0|T

1 Here, as well as in what follows, the notation aN = O (bN) is employed. If aN and


bN(N = 1,2,...) are sequences of numbers such that bN ^ 0 and there exists a constant
C> 0 for which | aN | <,C \ bN\, this fact will be denoted by aN — 0(bN)- (Read:“ow is of

order not exceeding that of bN”.) If, however, lim = 0, this will be denoted by
/V—>-co Qpj
aN = o (bN). (Read: “aN is of smaller order than bN”.)
DISCRETE RANDOM VARIABLES [III, § 16
152

1
where the constant intervening in O depends on A only.
n
In practice, usually the following weaker form

(k — 7ip)2
exp
2npq ■1 11
1 + o (17)
W, =
yjlnnpq - ■y/n 1-
of Theorem 1 suffices.

Thus the probabilities \binom{n}{k} p^k q^{n−k} are approximated by the values of the function

f(x) = (1 / (√(2π) σ)) exp[−(x − m)² / (2σ²)]  (18)

at the point x = k, where the constants m and σ have the values m = np and σ = √(npq). This function is represented graphically by a bell-shaped curve (Fig. 16) called Gauss' curve (or Laplace curve, or "normal" curve). Function (18) plays a central role in probability theory.

Fig. 16

Theorem 1 can be "verified" experimentally for p = 1/2 by means of Galton's desk.

Galton's desk (Fig. 17) is a triangular inclined plane provided with nails arranged regularly in n horizontal lines, the k-th line containing k nails. A ball, launched from the vertex of the triangle, will be diverted at every
line either to the left or to the right, with the same probability 1/2. Under the last line of nails there follows a line of n + 1 boxes in which the balls are accumulated. In order to fall into the k-th box (numbered from the left, k = 0, 1, ..., n) a ball has to be diverted k times to the right and n − k times to the left. If the directions taken at each of the lines are independent, the probability of this event will be \binom{n}{k} 2^{−n}. By letting a large enough number of balls roll down Galton's desk, their distribution in the boxes exhibits quite neatly a curve similar to the Laplace-Gauss curve. Theorem 1 states that the limit relation

lim_{n→∞} \binom{n}{k} p^k q^{n−k} / [ (1/√(2πnpq)) exp(−(k − np)²/(2npq)) ] = 1  (19)

holds, if with n also k tends to infinity so that

|k − np| / √(npq)

remains bounded; with these conditions the convergence is even uniform.

Formula (19) is the so-called de Moivre-Laplace theorem.
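Relation (19) is easy to verify numerically. The following sketch is an added illustration (n = 400 and p = 0.3 are arbitrary choices): it compares the exact binomial probabilities near k = np with the values of the approximating function (18).

from math import comb, exp, pi, sqrt

n, p = 400, 0.3
q = 1 - p
m, sigma = n * p, sqrt(n * p * q)

for k in range(int(m) - 3, int(m) + 4):
    exact = comb(n, k) * p**k * q**(n - k)
    approx = exp(-(k - m) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)
    print(k, round(exact, 6), round(approx, 6))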

This result can be expressed in a more concise and practical, though


weaker form. Let Wk be the probability that during n repetitions of an alter¬
native the event A occurs exactly k times. If n is very large, it is more rea¬
sonable to ask for the probability that the number k of occurrences of A
lies between two given limits, than for the probability that it assumes a
fixed value.
Our problem may conveniently be phrased as follows: what is the proba¬
bility that the inequality

np + a√(npq) ≤ k ≤ np + b√(npq)  (20)

should hold, where a and b (a < b) are two given real numbers. It follows from Formula (16) that this probability is

W^{(n)}(a, b) = Σ_{a ≤ x_k ≤ b} W_k = (1/√(2πnpq)) Σ_{a ≤ x_k ≤ b} e^{−x_k²/2} [1 + ((q − p)/(6√(npq))) (x_k³ − 3x_k)] + O(1/n),  (21)

where x_k = (k − np)/√(npq) was substituted. It will be seen that there exists a limit

lim_{n→∞} W^{(n)}(a, b) = W(a, b)  (22)

which can be calculated and the residual term estimated.


Choose the numbers a and b such that

A = np + a√(npq) + 1/2   and   B = np + b√(npq) − 1/2

are integers; this can always be done without changing the value of W^{(n)}(a, b). It follows then from (4) that

W^{(n)}(a, b) = (1/√(2π)) ∫_a^b e^{−x²/2} [1 + ((q − p)(x³ − 3x))/(6√(npq))] dx + O(1/n).  (23)

Since ∫ e^{−x²/2} (x³ − 3x) dx can be given explicitly, we have

Theorem 2. If

np + a√(npq) + 1/2 = A   and   np + b√(npq) − 1/2 = B

are integers (A ≤ B), then

Σ_{A ≤ k ≤ B} \binom{n}{k} p^k q^{n−k} = (1/√(2π)) ∫_a^b e^{−x²/2} dx + R,  (24a)

where

R = ((q − p)/(6√(2πnpq))) [(1 − b²) e^{−b²/2} − (1 − a²) e^{−a²/2}] + O(1/n).  (24b)

From (24a) and (24b) follows the limit relation

lim_{n→∞} Σ_{np+α√(npq) ≤ k ≤ np+β√(npq)} \binom{n}{k} p^k q^{n−k} = (1/√(2π)) ∫_α^β e^{−x²/2} dx  (25)

for each given pair (α, β) of real numbers (α < β); it suffices in fact to replace α by a_n, β by b_n, where a_n is the least number such that

A = np + a_n √(npq) + 1/2 ≥ np + α√(npq)

(A an integer) and b_n is the largest number such that (B an integer)

B = np + b_n √(npq) − 1/2 ≤ np + β√(npq).

Obviously,

a_n − α = O(1/√n)   and   b_n − β = O(1/√n),

hence

lim_{n→∞} ∫_{a_n}^{b_n} e^{−x²/2} dx = ∫_α^β e^{−x²/2} dx.

Thus the right hand side of (25) gives an approximate value for the probability that the number of occurrences of an event A (having the probability P(A) = p) in an experiment consisting of n independent trials lies between the limits np + α√(npq) and np + β√(npq). To use this result we must have the values of the integral

(1/√(2π)) ∫_α^β e^{−x²/2} dx

for every pair (α, β). The integral ∫ e^{−x²/2} dx cannot be expressed by elementary functions; however, the function

Φ(y) = (1/√(2π)) ∫_{−∞}^{y} e^{−x²/2} dx  (26)

is tabulated with great precision and a table of its values is given at the end of this volume (cf. Table 6). The curve y = Φ(x) is shown in Fig. 18.
end of this volume (cf. Table 6). The curve y = $(x) is shown in Fi§- 18-

It is easy to see that


+ CO

i r -xZ
<f>( + oo) = e ~ dx= 1. (27)
V 271J

In fact, when introducing polar coordinates we find

+ 00+00

1 r r |
$2(+oo) = -^— e 2 dx dy —
= Jj re 2 dr = 1.
— 00 —00 0

From Theorem 2 follows immediately

Theorem 3. For every real y

lim_{n→∞} Σ_{(k−np)/√(npq) ≤ y} \binom{n}{k} p^k q^{n−k} = Φ(y).  (28)

In fact, it follows from (25) that for every sufficiently large N

lim inf_{n→∞} Σ_{(k−np)/√(npq) ≤ y} \binom{n}{k} p^k q^{n−k} ≥ Φ(y) − Φ(−N)  (29)

and

lim sup_{n→∞} Σ_{(k−np)/√(npq) ≤ y} \binom{n}{k} p^k q^{n−k} ≤ 1 − Φ(N) + Φ(y).  (30)

Because of (27), (28) follows.


The function <P(x) is one of the most important distribution functions.
A random variable £ having for its distribution function &(x) or more
x—m
generally <P (with a > 0 and m an arbitrary real number) is said

to be normally distributed or a Gaussian random variable. (P(x) itself is called


the standard normal or Gaussian distribution function, or distribution func¬
tion of Laplace-Gauss.
Theorem 2 may be expressed also by saying that for large n the binomial
distribution is approximated by the normal distribution.
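Nowadays Φ can also be computed from the error function, Φ(y) = (1 + erf(y/√2))/2. The following sketch is an added illustration (the parameters are arbitrary choices): it compares the exact probability that k lies between np + α√(npq) and np + β√(npq) with the approximation Φ(β) − Φ(α) given by (25).

from math import comb, erf, sqrt

def Phi(y):
    return 0.5 * (1 + erf(y / sqrt(2)))

n, p, alpha, beta = 500, 0.4, -1.0, 2.0
q = 1 - p
lo = n * p + alpha * sqrt(n * p * q)
hi = n * p + beta * sqrt(n * p * q)

exact = sum(comb(n, k) * p**k * q**(n - k)
            for k in range(n + 1) if lo <= k <= hi)
print("exact binomial sum     :", round(exact, 4))
print("Phi(beta) - Phi(alpha) :", round(Phi(beta) - Phi(alpha), 4))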

§ 17. Bernoulli’s law of large numbers

The results obtained in the preceding paragraph allow us to prove a very


important theorem, Bernoulli’s “law of large numbers”.

Theorem 1. Let the event A be one of the possible outcomes of an experi¬


ment, with probability

P(A) = p (0 <p < 1)

and let fA(n) be the relative frequency of the event A in a sequence of n inde¬
pendent repetitions of the experiment. Given two arbitrarily small positive
numbers ε and δ, there exists a number N depending on ε and δ only such that for n > N

P(|f_A(n) − P(A)| < ε) > 1 − δ.  (1)

Proof. We have

P(|f_A(n) − P(A)| < ε) = Σ_{|k−np| < nε} \binom{n}{k} p^k q^{n−k}.

Choose a number Y such that

Φ(Y) − Φ(−Y) > 1 − δ/2.  (2)

For n > Y²pq/ε² = N_1, we have Y√(npq) < nε, and it follows that

P(|f_A(n) − P(A)| < ε) ≥ Σ_{|k−np| ≤ Y√(npq)} \binom{n}{k} p^k q^{n−k}.  (3)

According to Formula (25) of § 16, N_2 can be chosen such that for n > N_2

Σ_{|k−np| ≤ Y√(npq)} \binom{n}{k} p^k q^{n−k} > Φ(Y) − Φ(−Y) − δ/2;  (4)

from (2), (3) and (4) it follows that (1) is verified for n > N = max(N_1, N_2).
Bernoulli’s law of large numbers can also be proved directly, without the
use of the de Moivre-Laplace theorem.
Formula (1) is equivalent to

Σ_{|k−np| < εn} \binom{n}{k} p^k q^{n−k} > 1 − δ  (5)

for every n > N. The identity

Σ_{k=0}^{n} (k − np)² \binom{n}{k} p^k q^{n−k} = npq  (6)

(given as relation (5) in § 9) states that the variance of the binomial distribution is equal to npq. Thus we have

npq ≥ Σ_{|k−np| ≥ εn} \binom{n}{k} p^k q^{n−k} (k − np)² ≥ ε²n² Σ_{|k−np| ≥ εn} \binom{n}{k} p^k q^{n−k}

and, consequently,

Σ_{|k−np| < εn} \binom{n}{k} p^k q^{n−k} = 1 − Σ_{|k−np| ≥ εn} \binom{n}{k} p^k q^{n−k} ≥ 1 − pq/(ε²n).

Thus for n > pq/(ε²δ) = N relation (1) is verified. Since pq = p(1 − p) ≤ 1/4, it suffices to take for N the value N = 1/(4ε²δ). We shall see in Chapter VI that one can take for N a much smaller value as well.

The method of proof employed above is often used in probability theory. Later on (in Ch. VII) it will be formulated in a more general form as the inequality of Tchebychev.
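The crude bound N = 1/(4ε²δ) is easy to evaluate, and the statement of Theorem 1 can also be checked by simulation. The following sketch is an added illustration (ε = δ = 0.05 and p = 0.3 are arbitrary choices): it prints the bound and estimates P(|f_A(n) − p| < ε) for two values of n.

import random

random.seed(3)
eps, delta, p = 0.05, 0.05, 0.3
print("bound of the proof: n >", 1 / (4 * eps**2 * delta))

trials = 2000
for n in (200, 2000):
    good = sum(1 for _ in range(trials)
               if abs(sum(1 for _ in range(n) if random.random() < p) / n - p) < eps)
    print("n =", n, " estimated P(|f_A(n) - p| < eps) =", good / trials)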
Finally, some remarks should be added here concerning Bernoulli’s law
of large numbers.

In introducing the concept of probability, a number called probability


was assigned to events whose relative frequency possessed a certain stability
in the course of a long sequence of experiments. This stability of relative
frequency is now proved mathematically. It is quite remarkable that the
theory leads to a precise formulation of this stability; it is undoubtedly a
proof of its power.
At the same time it can be understood, why the stability of relative fre¬
quency could not be defined precisely at the introduction of the concept of
probability. Indeed, in its formulation occurs the concept of probability:
the law of large numbers states just that after a long sequence of experiments
a large deviation between relative frequency and probability becomes very
improbable.
It may seem that there lurks some vicious circle here: probability was in¬
deed defined by means of the stability of relative frequency, and yet in the
definition of the stability of relative frequency the concept of probability
is hidden. In reality there is no logical fault. The “definition” of the proba¬
bility stating that the probability is the numerical value around which the
relative frequency is fluctuating at random is not a mathematical definition:
it is an intuitive description of the realistic background of the concept of
probability. Bernoulli’s law of large numbers, on the other hand, is a theo¬
rem deduced from the mathematical concept of probability; there is thus
no vicious circle.
The theorem dealt with above is a particular case of more general theo¬
rems which will be discussed in Chapter VI. Similarly, the approximation
of the binomial distribution by the normal distribution is a particular case of
the general limit theorems, to be dealt with in Chapter VII of the present book.

§ 18. Exercises

1. Suppose a calculator is so good that he does not make more than three errors
in the average in doing 1000 additions. Suppose he checks his additions by testing
the addition modulo 9 and corrects the errors thus discovered. There can, however,
still remain undetected errors: in fact, it may occur that the erroneous result differs
from the exact sum by a multiple of 9. How many errors remain in the average among
his additions?

Hint. It can be assumed that, if the sum is erroneous, the error lies with an equal

probabilityin any of the residue classes 0, 1, 2, 3, 4, 5, 6, 7, 8 mod 9. Let A denote

the event “the sum is erroneous”, B the event “the error could be detected by testing
the sum modulo 9”. The probability sought is the conditional probability P(A i B);

according to Bayes’ rule it has the value


2992
160 DISCRETE RANDOM VARIABLES [III, § 18

2. A missing letter is to be found with the probability P in one of the eight drawers
of a secretary. Suppose that seven drawers were already tried in vain. What is the
probability to find the letter in the last drawer?

3. Determine the maximal term of the binomial, multinomial, hypergeometric


and polyhypergeometric distributions.

4. Consider the terms


N, N, Nr
ku *2 kr
.kr

of the polyhypergeometric distribution, where


kx + k2 + . .. + kr = n , Nx + N2 + . . . + Nr = N.
Prove that if the numbers Nt (j — 1, 2,.. ., r) tend to infinity so that

N,
lim ~ = p, > 0 (j = 1, 2,. . ., r) ,
N->- oo N

we have for fixed values of ku k. • • • > kr


i, ,t2>
n\
hm Pkl’k.. = k,l k2\ . . . kr\
Thus under the above conditions the terms of the polyhypergeometric distribution
converge to the corresponding terms of the multinomial distribution.
5. If
lim npj = l, > 0 O' = 1. 2, .... r — 1),
n —»- -f cd

the multinomial distribution


n\
kjl k2l ... kr\
tends to an (r — l)-dimensional Poisson distribution. That is, for fixed kltk2, . - j kT—\

(kr = n — kl — k2 — . . . — kr-x) we have for n —> + °°

kk'... C-11 -Ui+...+Ar__


lim pI'p*' ■pk/ =
-h-co k2\ ... k,\ kxl ... kr~il

6. Deduce Formula (4) of § 4 from Formula (12) of § 12, using the convergence of
the binomial distribution to the Poisson distribution.
Xk e~x
7. Determine the maximal term of the Poisson distribution-(k — 0, 1, . . . ;
k\
k > 0).
8. If A is constant and N = n In n + A n, there exists a limit of the probabilities
Pk (n, N) (cf. Ch. II, § 12, Exercise 42.b) for « —> oo: we have for any fixed real value
of A and any fixed nonnegative integer k

(e x)k exp (— e *)
lim Pk («, n In n + A«) = (*=0,1,...).
n—+- oo kl
HI, § 18] EXERCISES 161

Thus, if we distribute IV — n In n + A n balls into n urns, the number of the urns


which remain empty will be, for large n, approximately distributed according to a
Poisson distribution with expectation e~x.

o T, M _ R
~ P> ~T7 ~ r> and n ~> 00 so that
N N

lim np — A > 0 and lim nr — p > 0


+ n —► oo
then
k-1 ---
n—k

n (AT+jR) []
r (N — M + jR)
i=0 /=0

lim
n —*■ co
Ti(N+jR)
1=0

k — 1
P
1 + p- 1 + P ,

Thus under the above conditions, the Polya distribution tends to a negative binomial
Xk
distribution. p — 0, the above limit becomes ; the limit distribution is then a
k\
Poisson distribution.

10. A roll of a certain fabric contains in the average five faults per 100 yards .
The cloth will be cut into pieces of 3 yards. How many faultless pieces does one expect
to find?

Hint. It can be supposed that the number of faults has a Poisson distribution.
The probability of finding k faults in an x yards long piece is therefore equal to

(* = 0, 1,. ..).

11. In a forest there are on the average ten trees per 100 m2. For sake of simplicity
suppose that all trees have a circular section with a diameter of 20 cm. A gun is fired
in a direction in which the edge of the forest is 100 m away. What is the probability
that the shot will hit a trunk?

Hint. It can be assumed that the trees have a Poisson distribution; the probability
that on a surface area T m2 there are k trees is equal to

- - (4 = 0,1,...).

Each tree can be considered and represented by its centre.

12. In a summer evening there can be observed on the average one shooting star
in every ten minutes. What is the probability to observe two during a quarter of an
hour?
DISCRETE RANDOM VARIABLES [III, § 18
162

13. At a certain post office 1017 letters without address were posted during one
year. Estimate the number of days on which more than two letters without address
were posted.
14. Let &x(p) = {p, 1 — P) be a binomial distribution of order 1; let g(z) =
1 _ „
=-. Determine the distribution g[^x(j>)]-
1 — az
15. Let A (p) be the same as in the preceding exercise. Show that

exp [1(^0) — 1)1 = ^(Xp)

is the Poisson distribution with expectation Xp.

16. Let p be the probability of an event A. Perform n independent trials and denote
by/the relative frequency of A in this sequence of trials. With the aid of the approx¬
imation of the binomial distribution by the normal distribution answer the following
questions:
a) If p — 0.4 and n — 1500, what is the probability for / to lie between 0.40
and 0.44?
b) If p — 0.375, how many independent trials have to be performed in order
that the probability of \f— p I < 0.01 is at least 0.995?
7
c) Let p = —, n — 1200. How should £ be chosen in order that the probability

of 1 / — p 1 < e be at least 0.985 ?


d) Suppose n = 14 400. For which values of p will the probability of 1/—p 1 <
< 0.01 be at least 0.99 ?

17. Put
X

Prove that the expansion

<£(*) = i-2 + s/2n


4 — = — +
1 1-3
+

1-3-5
+
-)
holds.

18. Prove that for x > 1,

<P(X) =3 1 — where 0 < 6 < 1.


y/2”x U+-r

Hint. Use the identity

oo
Ill, § 18] EXERCISES 163

19. Suppose P(A) = —. Perform N independent experiments; let /(«) be the

number of the occurrences of A among the first n experiments (n = 1,2,,7V);


put /(0) = 0. Verify the formula
[]W—M)l
Af 1
Pn (M) = P( max (2/(«) - n) < M)= 1 - —Y \ M + 2k
i=£n<;v
k=o
M + 2k ' 4*

for M = 1, 2, . . . .

20. Applying the result of Exercise 19 prove that

lim iV (* y V) = 20(x) - 1 (x > 0).


+oo

21. The function of two variables W(x,y) = *_j fulfils the partial differential

equation of heat-conduction
dxp j dsy

dJ 2 dx2 ’

the function

"<*»> = z(l)4-
1
k<
. _ n+x * ^

fulfils the difference equation

AnU=-A\U

where
AnU = U (x, n + 1) - U(x, n),

A\U = E/(jc + 1, n) - 2£/(x, «) + C/(jc - 1, «).

22. Prove the asymptotic relation

1 (k — «P)2
pk qn~k exp
V2nnpq [- 2 npq

under the following conditions: p is a constant, n and k tend to infinity so that

0k-npY n
lim ---= 0 .
n -f- oo ^

23. Prove the asymptotic relation

M N — M\
n — k (k - np)2
exp
y 2n npq 2 npq

where

N-* oo, M — pN, 0 < p < 1, <7=1— p, n-+ n = o(^/ N)


DISCRETE RANDOM VARIABLES [III, § 18
164

and 2
| k — np | = o(n3).
Thus also the hypergeometric distribution can be approximated by the normal
distribution.

24. Establish the following power series expansion:

0(x) =

1 (- i)V*+1
+ 4- _|-1---b . .
=T+ 112-3 214-5 318- 7 ^ k\2k(2k + 1)
V 2n
How many terms are to be taken to calculate 0(2) with an accuracy of four decimal
digits?

25. If x is positive, the difference


1 2
1 - 0(x) - X
V271
1 1 1-3 1-3-5 (- 1)* 1 • 3 . . .(2k - 1)
X O I K + • • • + 4A+1

is positive or negative according as k is odd or even. The value of the function


1 — 0(x) is thus always contained between the &-th and (k + l)-st partial sums of the
divergent series
1 1 1 1-3 1-3-5
-3~ +
yj 2n X X6 ^-5---
x xb *

How many terms are there to be taken to calculate 0(A) with an accuracy of 10-8?

26. Show that

1 + |1 - e 2
I + n -
< 0(x) < - for x>0.

27. (Method of Laplace.) Let f(x) be a complex valued function continuous in a


neighbourhood of x — a such that

f(a) = 1, \f(x)\ < 1 for x 7^ a, f"(a) = —b< 0.

Let further {gn(t)} be a sequence of complex-valued functions such that


t
lim g„\a + - = A
«-►<» V yj nb) )

be uniformly fulfilled for t in every finite interval. Show that for every x > 0, y > 0
we have
a+
\'nb

l~ nb r A r — u2
lim
n -*■ co J 2ti J
V
'Mrwt-jsrjI » 1)
e 2 du.

(nb
Ill, § 18] EXERCISES 165

28. Show that for every real value of x

Xk e~X
lim Z
A-*cx> k<X.+xl X
tt = ^w-

i/i'nf. Use the result of the preceding Exercise.

29. Show with the aid of the result of Exercise 27 and the relation

P* q" ' — (n — k) tk (1 - dt

(0< p < 1; q — 1 — p) directly the validity of the limit relation

Km Y
Y, ( ? / cf
cf k =
k = ®(x).
®(x).
r—> cd
t—s-cd kk —
— nn \
\ K .

30. Prove the following strong form of Stirling’s formula

31. In a factory at any instant t each of n machines is at work with probability p


and is under repair with probability 1 — p. The machines are operated independently
of each other. What is the probability that at a given instant at least m machines are
working? Calculate an approximate value of this probability for n = 200; p = 0.95;
m -- 180.

32. Prove the well-known approximation-theorem of Weierstrass in the following


manner: Show with the aid of the inequality

deduced in course of the proof of Bernoulli’s law of large numbers, that for any
function f(x), continuous in the closed interval [0, 1 ], the so-called Bernstein poly¬
nomials of fix)

converge uniformly to f(x) in the interval [0, 1] if n—>

33. Find the limit

34. The following problem occurs in statistical mechanics: Let Eu E2.En be


the possible values of the energy of a particle belonging to a system of N particles.
If a particle has the energy Ek, it is said to be in the /c-th state. The state of the whole
system can be characterized by giving the “occupation numbers” Nx, N2, . . ., Nn',
166 DISCRETE RANDOM VARIABLES [III, § 18

Nk being the number of particles having the energy Ek(k — 1,2,, «). Let Wk be
the probability of a particle being in the state k, W(NU N2,.. ., N„) the probability
that the system is in the state characterized by the occupation numbersNu N2,..., N„.
By assuming that the states of the particles are independent from each other, we have,
obviously
N\
W(NX ,N,,...,Nn) = w p... w* (1)
NX\N2\... N„\
with
N = Nx + N, + ... + IV, . (2)
The probabilities of the possible states have therefore a multinomial distribution.
If, however, the total energy E of the system is given, not all these states can actually
occur: besides condition (2) the following one must be fulfilled, too:

YJNkEk=E. ■ (3)
k=1
According to the definition of the conditional probability, probabilities (1) fulfilling (3)
are simply multiplied by a constant factor. Find the values of Nu N,,..., N„ fulfilling
(2) and (3) for which the expression (1) takes on its maximal value.

Hint. Consider the numbers Nk(k = 1, 2,..., n) as continuous variables and replace
in (1) the factorials Nkl by r(Nk + 1), then apply the well-known identity

g(x)dx
In .T (jV -f 1) In N - N+ In V2^ — /
= (w+4 .1 x + N ’

where q(x) = x — [a]-— . By differentiation1 we obtain

r’{N+1) i r e(x)dx
r (N + 1) n + 2N + J (x + N)2 '

By Lagrange’s method of multipliers, the conditional extremum of (1) under the


conditions (2) and (3) can be found. Thus it follows that

Wk e-pE*
Nk&N n

£ W,e~PEt
/=i

where the constant /3 must be chosen so that (3) is satisfied. This is Boltzmann’s energy
distribution.

35. Let 0 < p < ~, q = 1 — p, n an integer such that np = m is also an integer.

Let A be an event with probability p. Show that during the course of n independent

1 Often this formula is deduced by differentiating Stirling’s formula; this of course


is inadmissible. Our procedure is correct, since we differentiate an identity and not
an asymptotic relation.
Ill, § 18] EXERCISES 167

repetitions of an experiment, having possible outcomes A and A, the probability that


A occurs less than m times is greater than the probability that A occurs more than
m times (Simmons' Theorem).

Hint. The inequality to be proven is

Y ,k
k=0 K
pkq"~k > £
k=m+ 1
pk qn~k.

By putting (1)
n _m-r -rt-m+r
Br = (r = 0,1,..., m),
m — r

n p>n + r qtt-m-r
cr = m + r
(r = 0, 1,..., n — m),

(1) may be written in the form

tB'< EC- (H
r=0 r=0

Put —- = D„ then we have


Cr
A-+1 (p - q) (r- + r - npq)_
i =
d7 (n — m — r) (n — m + r + 1 )p2

thus ■ f+1 — 1 is positive for small values of r. It decreases as r increases and is

negative for r > s, where s is the least integer for which sC? + 1) > npq. As D0 = 1,

/) = = npq q > 1 there exists an integer k > 1 such that —- > 1 for
Cx npq + P
D

r — o, 1.k — 1, —- <1 for k < r < n — m. For this value of k we get

k— I
1 )Br> £ (*-r l)Cr (2)
!(*-»•
r=0 r=0
and

£ (r-Jfc+ l)fl, > - X (r~k+ Dcr (3)

From the identity

£ (*: - I" 1 a--* = o


k=0 V k )
it follows
m n-m
(4)
£ rBr= Y rCr ■
r=0 r=0

From (2), (3) and (4) it follows


m n—m

(k - 1) Y r=0
Br > (k - 1) Y C” ' =0

which was to be proved. This proof is due to E. Feldheim.


168 DISCRETE RANDOM VARIABLES [III, § 18

36. Prove the following asymptotic relation for the terms of the multinomial
distribution. For

« _► + oo, £ k, =■ n, | k, - npj | = 0(yj ri)


1=1
we have
n\
P1P2 ...pyx,
ky. k2\. .. kr\

J__ f (k, - nPi)2


exp
(yjInn)r 1 JpiPi ■ ..pr 2« ii Pi

Hint. Use Stirling’s formula.

37. An urn contains N cards labelled from 1 to N. Perform n drawings without


replacement. Let ^ denote the least number drawn. Find the distribution of the random
variable £.

38. Let £ and rj be nonnegative integer valued random variables such that if the
value of rj is fixed £ has a Poisson distribution and conversely. Show that

AV v'k
R,k = PG = J,V = k)=C U, k= 0, 1,...),
j\ k\

where X, fi, v are positive constants, and

1 X' (/ vlk
C /= 0 k=0 j\ k\

For the independence of { and rj it is necessary and sufficient that v — 1 should hold.
The distribution Rik is therefore a generalization of the Poisson distribution for two
dimensions. (Distribution of N. G. Obreskov.)

39. Let £ and rj be two independent random variables both having a Poisson
distribution with expectation X. Determine the distribution of £ — r\.

40. Each of two urns contains N — 1 white balls and one red ball. Draw from
both urns n balls (n < N) without replacement. Put now all 2 N balls into one and
the same urn and draw 2n balls without replacement. In which one of the two cases
is it more probable to obtain at least one red ball?

41. Let X be the disintegration constant of a radioactive material. Let the proba¬
bility of observing the disintegration of any one of the atoms be denoted by c (c is
proportional to the solide angle under which the counter is seen from the point from
where the radiation starts). Let N denote the number of the atoms at the time t — 0,
£, the number of disintegrations observed in the time interval (0, /)• Prove by applying
the theorem of total probability that £, has a binomial distribution.

Hint. The probability that exactly n atoms disintegrate during the interval (0, t) is

_ e-foy e-XHN-n).
(1

the probability that among them k disintegrations are observed, is


(*)<*<! -<*-*•
Ill, § 18] EXERCISES 169

The theorem of total probability gives

P(Zt = k) = £ ^)(1 - e-*r e-*™-* (" j ck (1 - c)n~k =

= (*) (c(l-«“*))* (l-c(1 -*-*))"-* .

Note that because of c (1 — e~h) < 1 — e~cXt somewhat fewer disintegrations are
observed than when the value of the disintegration constant would be Xc and all
disintegrations would be visible. But this difference is only important for large values
of t.

42. Let £ls £2, .. ., be independent random variables with the same negative
binomial distribution of order 1:

P^k = «)=d- p)p" 1 (n = 1, 2,..k = 1,2, ... ,r; 0 < p < 1).

Let v be a random variable, independent from £, , with a Poisson distribution


and expectation X. Determine the distribution of

£ — £i + £2 + • • • + ev+i.

Hint. By using the notations of § 14, the distribution of ^ + £2 + . . . + £*+1


can be written as
1 - p ^+1
<^+1
1 -P$i
that of C is given therefore by

=v ( Isle C‘_ im.


k=0 11 - pSl 1-pSt

, (1 - p) X
where x =-. It is known that
XZ co
l—e
Z L*X)*,
1 — z k=0
where the
ex dk
Lk(x) = (.xk e~x)
xk dxk

are the Laguerre-polynomials.1 It follows

(1 ~P)
P(C = «) = (! — p)e~* Ln_x f- P) A] p”-' (n= 1,2,...).

43. Calculate the expectation of the number of marked fishes at the second capture
(cf. Ch. II, § 12, Exercise 21), if there are 10 000 fishes in the lake and if at the first
capture 100 fishes are marked.

44. Calculate the expectation of the number of matches in one of the boxes in
Banach’s pockets at the moment when he found the other box empty for the first
time (cf. Ch. II, § 12, Exercise 14).

'Cf. e.g. G. Polya and G. Szego [1].


DISCRETE RANDOM VARIABLES [HI, § 18
170

45. Calculate the expectation of the sum defined in Chapter II, § 12, Exercise 46:

X = ky + k2 + . . . + kM.
Hint. Let &k be the distribution of a random variable which assumes the values

0, 1, 2,..., k — 1 with the probabilities — :

Show that the distribution of X - M(M + l)/2 can be written in the form

^N-M+ 1

1 ^2 • • • ^ M

M(N+ 1)
From this it follows that E(X) =
2

46. Suppose that a player gambles according to the following strategy at a play
of coin tossing: he bets always on “tail”; if “head” occurs, he doubles his stake in
the next tossing. He plays until tail occurs for the first time. What is the expectation
of his gain ?

Hint. If the tail occurs at the n-th toss for the first time (the probability of this

event is ), the gain of the player, if his bet at the first toss was 1 shilling, will be

1 shilling, since
n— 1
2” — £ 2k = 1.
k= 0

The expectation of the gain is thus

n=1 L

It seems that with this strategy the player could ensure for himself a gain. This,
however, would be true only if he would dispose over an infinite sum of money. His
fortune being limited, it is easy to show by a simple calculation that the expectation
of his gain is 0 even if he doubles the stake always when a head appears.

47. Calculate the expectation of the Polya distribution.

48. The chevalier de Mere asked Pascal the following. Two gamblers play a game
where the chances are equal. They deposit at the beginning of the game the same
amount of money. They agree that he who is the first to have won N games gets the
whole deposit. They are, however, obliged to interrupt the game at a moment when
the one player gained N — n times and the other N — m times (l<n<iV;l<m<
< N). How is the deposited money to be distributed? Calculate this proportion for
n = 2 and m = 3.

Hint. The distribution of the deposited money is said to be “fair” if the money
is distributed in the proportion p„ : pm, p„ denoting the probability that the first
gambler would win and pm the probability that the second. Thus each gambler receives
Ill, § 18] EXERCISES 171

a sum equal to this expectation. The problem is thus to calculate the probability that
the first (or the second) wins, under the condition that he already won N — n
(i. e. N — m) games.
49. In playing bridge, 52 cards are distributed among four players. The values
ot the cards distributed are measured by the number of “tricks” in the following
manner: If a player has the ace and the king of the same suit, this amounts to 2

ace to 1; ace alone to 1; king alone to — trick. What is the expectation of the total
z
number of tricks in the hand of a player ?
Hint. Obviously, the expectation of the number of tricks is the same for all players
and in each of the suits. Hence the expectation of the total number of tricks for a
player in all four suits is equal to the expectation of sum of tricks for the four players
in one suit. Thus it suffices to consider one suit only, e.g. spades. The expectation
of the tricks in the hand of one player is equal to the sum of the expectations of all
tricks present in the spades. However, this sum is equal to 2, except in the case when
the ace, the king, and the queen of spades are in the hands of different players; in
3
this case the sum of tricks is — . Hence the expectation looked for is 1.801.

M
50. a) There are M red and N — M white balls in an urn. We put -= p. Draw
N
n balls without replacement from the urn and let the random variables £,k (k =
= 1,2, ..., n) be defined as follows:

1 if at the &-th drawing a red ball is drawn,


0 otherwise.

Calculate R(c,-, £*) (1 <j<k<n).


b) If
n ~ + ^2 + • • • + £1
prove that
CHAPTER IV

GENERAL THEORY OF RANDOM VARIABLES

§ 1. The general concept of a random variable

We have already introduced in Chapter III the general concept of a random


variable. Let [Q, P] be a Kolmogorov probability space. We understand
by a random variable a function of a real variable ^ = £(o>), defined for
each co £ Q, such that every level-set of £ belongs to The level-sets of
f = £(cd) are the sets Ax defined by £(co) < x, where x is an arbitrary real
number. The function F(x) = P(AX) = P(f < x) is called (see Ch. Ill) the
distribution function of the random variable f

Theorem 1. If ^ is a random variable and g(x) a Borel-measurable1 function


of the real variable x, then rj — g(f) is also a random variable.

Proof. Let £~\A) be the set of those elementary events co £ 12 for which
£(co) £ A. We have clearly

(la)
n n

e-\A — B) = Z-\A) - Z-\B). (lb)

Let Ix denote the interval (— oo,x) and Iab the half-open interval [a, b).
By assumption, £~\IX) = Ax £ Hence,according to (lb), £-1(/a>6) £
for every pair of real numbeis (a, b), a < b. Since is a u-algebra, it follows
from (la) and (lb) that ^~\A) £ for every Borel-set A of the real line.
Theorem 1 follows immediately.

Theorem 2. If the distribution function F{x) of a random variable £ is


given for every real x, then P[£,-\A)\ is uniquely determined for every Borel
subset A of the set of real numbers.

1 A function #(x) is said to be Borel-measurable if the level-set defined by g(x) < c


is a Borel-set for every real c. In particular, every continuous function is Borel-
measurable.
IV, § 2] DISTRIBUTION- AND DENSITY FUNCTIONS 173

Proof. This theorem follows immediately from Theorem 2 of Chapter


II, §7.

§ 2. Distribution functions and density functions

Let F(x) = P(£ < x) be the distribution function of the random variable
If the random variables £ and ij are almost surely equal (i.e. if P{f A rj) =
— 0), then their distribution functions are obviously identical. In what
f ollows we shall establish some properties of distribution functions.

1. A distribution function F(x) is a nondecreasing function.


According to the definition of level-sets, we have Ax c Ay for x < y\
hpnrp

F(x) = P(A,)<P(A,) = F(y).

A distribution function F(x) is not necessarily continuous. It follows how¬


ever from the monotonicity of F(x) that at any discontinuity point F{x) pos¬
sesses both a left-hand side and a right hand side limit.

2. For any distribution function F(x) we have

lim F(x — h) = F(x).


A- + 0

Hence a distribution function is continuous from the left at every discon¬


tinuity point. In fact, F(x) — F(x — h) is the probability that x — h < £ < x,
i.e. F(x) - F(x - h) = P(Bh), where Bh = AxAx_h. Obviously, the sample
space does not contain any element which belongs to B,,for every h > 0.
If an element co of Q belongs to Bh, we have always £(co) < x. Choose now
a small enough h' > 0 such that h! < x — f(eo), then co will not belong
any more to Bh. Let {h„} (n = 1,2,...) be an arbitrary monotonic sequence
of positive numbers tending to zero. To prove the lelt-continuity of F(x)
it is sufficient to prove that

lim P(5J = 0-
«— + 00

But this is a particular case of Theorem 3, Chapter II, § 7.

3. For every distribution function

lim F(x) = 0 and lim F(x) = 1


JC--00 X- + 00

and thus we may write

F(— co) = 0 and F(+oo) = 1.


174 GENERAL THEORY OF RANDOM VARIABLES [IV, § 2

In fact, if {.*„} (n — 1,2,...) is any sequence of real numbers such that


xn < xn+1 and lim xn = + oo, then AXn is a subset of AXn+i hence the sets
n — oo

A*,„ - An (1=1,2,...) and An are disjoint and

F(x.) = F(AJ + "£p(Aw, - A„).


k=l
Since
00 00
F(AJ + £ P(A„„ - AxJ =P( £ K) = pm = 1,
k=1 k=l

it follows that
lim F(xn) = 1.
tl~* + 00

Similarly it can be shown that

lim F(x) = 0.
JC— — 00

All these results may be combined in the following theorem:

Theorem 1. The distribution function of an arbitrary random variable is a


nondecreasing left-continuous function, which has for — oo and for + oo the
limits 0 and 1, respectively.
fib.

The converse of this theorem is also true: Every function F(x) having
these properties can be considered as a distribution function. The proof
runs as follows: Let a = (7(y) be the inverse function of y — F(x). (The de¬
finition of G(y) is unique, if the following conventions are adopted: if F(x)
has a jump at x0, i.e. if F(x0) = a and F(x0 + 0) = b > a, we put G{y) = x0
for a < y < b; if F(x) is constant and equal to y0 in the interval c < x < d
but F(x) < jo for x < c, we put G(yn) = c.) If Q is the interval (0, 1), the
system of all Borel-measurable subsets of Q and P^jis for A £ the
Lebesgue measure of A, then the function rj{y) = G(y) defined for all y
is a random variable on the probability space [£>, P] and the distribu¬
tion function of rj{y) is

Pip <x)= P(G(y) < x) = P(y < F(X) ) = F(x).

Hence g is a random variable with distribution function F(x).


If £ is a bounded random variable, i.e. if there exist two constants c and
C such that for every element co of the sample space Q the inequality c <
< £(co) < C holds, then clearly F(x) = 0 for x < c and F(x) = 1 for x > C.
If the random variable is almost surely” constant, i.e. if there exists a
set A such thatP(/l) = 0 and ^(co) = c for every co £ A, then we obtain
IV, § 2] DISTRIBUTION- AND DENSITY FUNCTIONS 175

for the distribution function of £

0 for x < c
1 otherwise.

The distribution function of a constant is said to be a degenerate distri¬


bution function. Eveiy nondegenerate distribution function has at least
two points of increase, i.e. at least two points where F(x + h) — F{x) > 0
for every h > 0. If F(x) is the distribution function of a random variable b,
which assumes only a finite number of values, x is a point of increase of
F(x) if and only if £ takes on the value x with positive probability. The set
of jumps of a monotonic function, and thus in particular of a distribution
function, is necessarily finite or denumerably infinite. In fact, if the jumps
of the distribution function are projected upon the y-axis, a system of dis¬
joint intervals is obtained because of the monotonicity of the function.
Our statement can be deduced from the fact that every interval contains a
rational number and the set of all rational numbers is denumerable.
Distribution functions which are not only continuous but absolutely con¬
tinuous, deserve particular attention. A distribution function is said to be
absolutely continuous if for any given positive number e there exists a 5 > 0
such that for every system of disjoint intervals

(ak, bk) (k = 1,2,.. ., n\ ak < bk)

the inequality
n

E ~ak)<s

implies
n
I I F(bk) — F(ak) | < s.
k=l

Every absolutely continuous function, as is known, is almost everywhere


differentiable and is equal to the indefinite integral of its derivative. This is a
necessary and sufficient condition for the absolute continuity of a function.
If the distribution function F(x) is absolutely continuous we put /(x) =
= F\x). If F(x) is at a point non-differentiable, /(x) is not defined there, but
it can be defined arbitrarily. But such points are known to form a set of
measure zero. The function f(x) is called the density function of the probabi¬
lity distribution given by F(x). If F{x) is the distribution function of the
random variable J\x) is called also the density function of f
176 GENERAL THEORY OF RANDOM VARIABLES [IV, § 2

Example. We have already seen (Ch. Ill, § 13) that the function de¬
fined by
1 — e~Xx for > 0
F(x) =
0 otherwise,

with X > 0, is the distribution function of the life-time of a radioactive


atom. This function is absolutely continuous and we have

f Xe~Xx for x > 0


fix) = F\x) =
| 0 for y < 0.

(/(0) is not defined, as F'(0) does not exist.)


Thus the density function of the life-time of radioactive atoms exists and
is equal to Xe~Xx for x > 0.
Let £ be any random variable; let Ay be the event £ < y andyla 6the event
a < £ < b; then we have Aa>b — AbAa. Now, if F(x) is absolutely continuous
and F'(x) — fix), we have

P(Aa,b) = Fib) - Fia) = j fix) dx. (1)


a

From this follows immediately


+ CO

j fix) dx = Fi+ 00) - Fi- 00) = 1. (2)


— 00

Conversely, every nonnegative measurable function /(x) fulfilling (2) can be


considered as a probability density. Indeed, the function

Fix) =
— 00
J fit) dt (3)

obviously has every property of a distribution function and F\x) = fix)


holds almost everywhere.
Consider now the case when b — a = da is small and Ffa) exists. Since,
by definition,
Fia + A a) — Fia)
lim =/(«),
Aa~* 0 Aa

it follows from this that

P(a < f < a + Aa) = /(a) A a + o(da), (4)

where, as usually, o(Aa) represents a quantity which, divided by Aa, tends


to zero for A a -> 0.
IV, § 3] MULTINOMIAL DISTRIBUTIONS 177

Example. In Chapter III, § 16 we encountered the standard normal distri¬


bution having the distribution function

— CO

The density function of the standard normal distribution is therefore

1
/(*) =
v/2?r

We need now the following well-known theorem: Every nondecreasing


monotonic function may be represented as a sum of three nondecreasing
monotonic functions, the first of which is a step function, the second is
absolutely continuous and the third is a continuous “singular” function
(i.e. a continuous nondecreasing function whose derivative is almost every¬
where equal to zero). From this it follows readily that every distribution
function can be written in the form

F(x) = p1F1 (x) + p2F2 (x) + p3F3 (x),

where pL, p2, p3 are nonnegative numbers having sum 1 and Ft{x) (/ = 1,
2, 3) are the three distribution functions such that F^x) is the distribution
function of a discrete random variable, F2(x) is an absolutely continuous
distribution function and F3(x) is a singular distribution function. This
decomposition is evidently unique.

§ 3. Probability distributions in several dimensions

By a random vector of n dimensions we understand an ^-dimensional


vector £ = (£l5 ...,£„) whose components £,- are random variables on the
same probability space. The distribution function of the random vector £
is defined as the function of n variables

F(x1} x2,. . ., x„) = P(£i < xlf £2 < *2, • • •> £„ < *«)• (1)

The probability figuring on the right side of (1) is always defined; in fact,
let A^} denote the level set of all co £Q such that £*(co) < x {k = 1 2,. n), , . .,
then Aand

k=l
GENERAL THEORY OF RANDOM VARIABLES [IV, § 3
178

The right hand side of Equation (1) is exactly

p( kri= l <d-
If a problem of probability theory involves n random variables, these can
always be considered as components of an 77-dimensional random vector.
In general, the function defined by (1) will be called the joint distribution
function of the random variables £lf £2, • • •> £«•
For example the value F(xlf = P(£ < x1? 77 < Ji) of the distribution
function of a 2-dimensional random vector represents the probability that
the endpoint of a random vector £ = (£, 77) beginning in (0, 0) lies in the
quadrant of the (x, y) plane defined by x < xl5 y < y-y.
Let us consider now some general properties of multidimensional distri¬
bution functions.

1. F(xj,..., x„) is a nondecreasing function of every one of its variables.

2. F(xy,. . ., x„) is left-continuous in every variable.

3. F(xx, . . ., x„) = 0, if at least one of its variables is equal to — co.

4. F(x1;. .., x„) — 1 if all variables are equal to + co.

Besides these trivial properties, every 77-dimensional distribution has


another characteristic property which for n — 1 follows from the above
property 1. For n > 1, however, it does not follow from it. The probability
P(ak < £k < bk; k = 1,2,...) may be written in the form

P(ak <£k<bk; k = 1,2,..., 77) = £ (- 1)*=^ F(clt c2,..., c„) (2)

where ck = ekak + (1 — ek)bk. The numbers ek assume independently of


each other the values 0 and 1. The sum on the right hand side of (2) has thus
2" terms. Thus, for instance, for n = 2 we obtain

Pifl 1 ^ £1 < bx, a<l < < b2) = F(bu b2) - F(alt b2) - F(by, a.j) +

+ F(alt a2).

Formula (2) is a direct consequence of Theorem 9, Chapter II, § 3. In


fact, let Ak be the event £k < ak, Bk the event £,k < bk (k = 1,2,.. ., 77).
If we put

A = t Ak,
k=1 *=1
IV, § 3] MULTINOMIAL DISTRIBUTIONS 179

we find that

P(ak <£k<bk; k = 1,2,..., n) = P(AB) = P(B) - P( £ AkB).


k= 1

If we now put Ck = ^*.7? (A: = 1,2,.. w) we obtain (2) by applying the


above-mentioned theorem to the events Ck.
As the left hand side of (2) is a probability, it is certainly nonnegative.
Thus we have established the sought property:

5. We have
n
v ^

Z (“ !)*=1 * f(¥i + (1 - £i) blt..., snan + (1 - e„) bn) > 0,

where e1; e2, ...,£„ assume the values 0 ««<:/ 1 independently of each other
and ak < bk (k = 1,2,..., «) ore arbitrary real numbers.
Property 5 does not follow from properties 1-4. If for instance n = 2
and
F(x Xi) = f 1 if + ^2 > 0,
1’ 2 [O otherwise,

properties 1-4 are fulfilled but property 5 is not, since for instance

F(2,2) - F(— 1, 2) - F(2, -1) + f(-l, -1) = -1 < 0.

By introducing the notation

AP F(xi> • • •, xn) = F(pCi,..., **_!, + h, xk+1,..., xn) - F(x1,..xn)

we may write condition 5 in the following form:

5'. We have

for hk> 0 for any real numbers xk (k = 1,2,..., n). Here the “product”
of the (commutative) operations means that they are to be performed
one after the other. It is easy to prove that if condition 5' holds for hx — /?2 =
= ... = hn = h > 0 it is valid in general.
Conversely, it can be shown that every function F(xx, x2,. . ., xn) fulfill¬
ing conditions 1-5 may be considered as a distribution function. This
follows from § 7 of Chapter II.
If the distribution function of the random vector £ = (£lt £2,..., £„) is
F(x1} x2,..., xn) and B is a Borel-set of the n-dimensional space, then

P(UB) = $...$d F{Xl,...,xn),


J B J
GENERAL THEORY OF RANDOM VARIABLES [IV, § 3
180

where on the right-hand side figures a Stieltjes-integral; in other words


P(C 6 B) is equal to the value on the set B of the measure defined by the
function F in the n-dimensional space.
If the distribution function of an n-dimensional random vector is abso¬
lutely continuous, then the density function

8nF(x1,x2,...,xn)
f(x1,x2,...,xn) - dXidx2 "dXn (3)

exists almost everywhere, and we have always

f(xi, x2, ...,x„)>0

because of property 5. This follows from

A<PA<P...A<pF(x x,x2,...,xn)
• • •? *^2) lim j
h-+ 0 n

Further we have
*1 xn
F(x1,...,xn)= j ... j f{h,..., t„) dh... dtn ; (4)
— 00 — 00

hence in particular
+ 00 +oo
f ... J f(xlf..., x„) dx1... dxn = 1. (5)
— oo —'oo

Further
r1 rn
P(ak < £k < bk; k = 1,2,..., n) = j ... J /(xx,..., xn) dx1... dxn, (6)

or, more generally, if B is a Borel subset of the n-dimensional space, then

P(C 6 B) = J... j Axi, ■ • *n) dxi... (7)

In other words: the probability that the endpoint of the random vector (
lies in a Borel-set B of the n-dimensional space is equal to the integral on B
of /(xl5. . ., xn).

Theorem 1. If cp(x\, • ■ •> x„) is a Bor el-measurable function of n variables


and if £j, ...,£„ are random variables, then t] — . . ., £„) is also a
random variable.
IV, § 4] CONDITIONAL DISTRIBUTIONS 181

Proof. Let £_1(B) be the set of those points co £ Q for which ((co) £ B,
where ( is an n-dimensional random vector and B is a Borel-set of the «-di-
mensional space; clearly (-1(B) From this Theorem 1 follows in the
same manner as Theorem 1 of § 1 was proved.
Let us remark that to every 3-dimensional probability distribution there
can be assigned a 3-dimensional distribution of the unit mass such that any
domain D contains the mass P(D). If f(x, y, z) is the density function of
the probability distribution in question, this same function will represent
the density of the corresponding mass distribution.

§ 4. Conditional distributions and conditional density functions

Let ( be any random variable, B an event of positive probability. Of course


B is assumed to belong to the probability algebra on which ( is defined.
The conditional distribution function of ( with respect to the condition B is
defined as the function

F{x | B) = P(( < x | B) = P(AX | B),

where Ax has the same meaning as in § 1. If the conditional distribution


function thus defined is absolutely continuous, its derivative f{x | B) =
= F'(x | B) will be called the conditional density function of £ with respect to
B. Evidently, if P(B) = 1, the ordinary distribution function and densitv
function are obtained.
Take for instance the conditional distribution function of the life-time
of a radioactive atom with respect to the condition that it did not disinte¬
grate until the moment t0. As is already known, this is equal to the ordinary
distribution function if t is replaced by t — t0. Let B0 be the event: the atom
did not disintegrate until the moment t0. Then we have

I
f 1 — e~x<-‘ fj) for t > t0,
F(t | B0) = Q otherwise

and
ne-W-^ for t > t0,
At\B0) =
lo for t < t0-

If {Bn} (n = 1, 2,. . .) is a complete system of events with P(B„) > 0,


we have
FW = yP(S„)F(x |5„) (1)
n

and
f(x) = YjP(Bn)f(x\Bn). (2)
182 GENERAL THEORY OF RANDOM VARIABLES [IV, § 5

For the generalization of the concept of the conditional distribution func¬


tion and conditional density function see Chapter V, § 2.

§ 5. Independent random variables

Two random variables £ and rj are said to be (stochastically) independent,


if for every real v and y

P(Z < x, t] < y) = P{£ < x) Pin < y), (1)

i.e., if the two-dimensional distribution function of (£, rj) is equal to the prod¬
uct of the distribution functions of £ and r]. From (1) is readily deduced
that

P(a < £ < b, c < rj < d) = P(a < £ < b) P(c < rj < d) (2)

and, more generally, for any two Borel-sets A and B (cf. Theorem 2 below):

P(Z £A,n£B) = P(f 6 A) Pin 6 B). (2')

For discrete random variables, this definition of independence coincides


with that given in Chapter III, § 5.
The independence of several random variables may be defined in a simi¬
lar manner. The random variables £2,,.£„ are said to be independent,
if for every system of real numbers xlf x2, the relation

p(tl < xl> • • •> Zn < Xn) = 17


k=1
P(Zk < Xk) (3)

holds. If the random variables f2, are independent, any k (k < n)


chosen arbitrarily from them are independent as well. To see this, it suffices
to substitute Xj = + oo, where the /-s are the indices of the random variables
which do not figure among the chosen k.
The converse of this relation does not hold. For example the fact that rj,
C are pairwise independent does not imply their independence. We have
already seen this in the preceding Chapter.
If £i> £2, • • •> Zn are discrete random variables, the above definition of
independence is equivalent to the definition given in the preceding Chapter.
We shall prove now some simple theorems about independent random
variables.

Theorem 1. A constant is independent of every random variable.

)
IV, § 5] INDEPENDENT RANDOM VARIABLES 183

Proof. If rj = c {c — constant), we have

P(f < a) for c < y,


P{£ < x,rj < y) =
0 otherwise,
hence (1) is valid.

Theorem 2. Let £2, be independent random variables and let


gk{x) (k = 1,2,.. ., n) be Bor el-measurable functions; then the random
variables t]k = gk{£k) are independent.

Proof. If Bx,. . ., Bn are Borel subsets of the real axis, it follows from (3)
that

e JJ„ ...,{, € Bn) = n P((t € Bk). (4)


k=l

In fact, if Bx, . . ., Bn are unions of finitely many intervals, (4) follows from
(3). Let now B2, Bz,. . ., Bn be fixed and let B± alone be considered as vari¬
able: thus both sides of (4) represent a measure. The theorem about the
unique extension of a measure (Ch. TI, § 7, Theorem 2) can be applied here
and it follows that (4) is true for any Borel-set Bx. Let now be B1 an arbi¬
trary, fixed Borel-set and let B3,. . ., Bn be fixed sets, each of them being the
union of finitely many intervals. By repeating the preceding reasoning it
can be seen that (4) remains valid, if B2 too is an arbitrary Borel-set. By
progressing in this manner (4) can be proved. Theorem 2 follows immediately
from (4).
In particular it follows from Theorem 2 that the random variables

hk — ak£k + bk (k = 1,2,..., n)

where ak and bk are arbitrary constants, are independent if are


independent.
Furthermore, it follows from Theorem 2 that for independent random
variables £l9 Formula (3) remains valid if for one or several values
of A: on both sides one of the expressions < xk, > xk, or > xk will be written
instead of < xk.

Theorem 3. If £x, £2,. . ., are independent random variables with den¬


sity functions ffx)J2{x),. . ., f„(x), respectively, then the joint distribution
of the random variables £x, is absolutely continuous with density func¬
tion

/<*,,..*„) = n /*(**)• (5)


k=1

Conversely, (5) implies the independence of the random variables <^x,. . ., <?„•
184 GENERAL THEORY OF RANDOM VARIABLES [IV, § 6

Proof. (5) follows from (3) because of Formula (3) of § 3. Conversely, (3)
is obtained by integrating (5).

Theorem 4. Let £ls £2, be independent random variables and let


h(xx,.. ., xk) be a Borel-measurable function of k variables (k < n). Then
the random variables

are independent.

The proof is similar to that of Theorem 2.


The independence of two random vectors, £ = (£l5. . £,,) and rj =
= (i/l5 .., rjm) can be defined as follows: £ and rj are said to be indepen¬
dent, if the equality

Tl, . . b,n <L xn, rji < Ti» • • Vm ^ Tm)

= ^(4 < *i, • • ., 4 < x„) PO/i < jl5 . . ., r\m < ym) (6)

is identically fulfilled in the variables xv- and yk.

§ 6. The uniform distribution

The random variable £ is said to be uniformly distributed on the interval


{a, b) (« < b) if its density function is

0 for x < a and for b < x,

m= 1 (1)
for a < x <b.
b—a

At the points x — a and x = b /(x) can be defined arbitrarily1. The corre¬


sponding distribution function is

0 for x< a.

F(x) = —- for a< x < b, (2)


b —a
1 for x > b.

The uniform distribution of a random vector can be defined in a similar


manner. An n-dimensional random vector £ = (4,. . gn) is said to be
uniformly distributed on a nonempty open set G of the n-dimensional space

1 Sometimes/(x) is also called “i ectangular” density function.


IV, § 6] THE UNIFORM DISTRIBUTION 185

with finite ^-dimensional Lebesgue-measure, if the density function of the


random vector is given by

—for (x1? ...,x„)£G


/(*!» • • - *„) Mn(y) (3)
0 otherwise,

where //„(G) is the “volume” (the ^-dimensional Lebesgue-measure) of G.


We already encountered uniformly distributed random variables in Chapter
II, § 10 in connection with the geometric probabilities. The geometrical de¬
termination of probabilities is nothing else than the reduction of the prob¬
lem to certain uniformly distributed random variables. In fact when one
deals with geometric probabilities, it is always assumed that the probabi¬
lity of a point to lie in an interval of the real axis (in a domain of
the plane, space, or more generally, of the n-dimensional space) is propor¬
tional to the length of the interval (to the area or volume of the domain
in question). But this means that the random variable considered (or the
random vector) is uniformly distributed. If £ is uniformly distributed on the
interval (a, b), then, according to (1) and Formula (1) of § 2, the probability
that £ lies in a subinterval (c, d) (a < c < d < b) of {a, b) is given by

d— c
J fix) dx = (4)
C
b —a

thus it is indeed proportional to the length of the interval (c, d).


A similar statement holds also in the multidimensional case. If { is a
random vector uniformly distributed on an n-dimensional domain G, the
probability that the endpoint of £ lies in a domain Gx which is a subset of G
is equal to

j ... f /(*!,..., xn) dxl... dxn (5)


Gr bn{G)

The case when G is a parallelepiped with its edges parallel to the axes
deserves particular consideration: we have

-—— for ak < xk < bk (k = 1,2-- n),


/(*],• • •>*«) bn(G)
0 otherwise.
where

Hn(G) = fl ih ~ ak)-
k=l
GENERAL THEORY OF RANDOM VARIABLES [IV, § 7
186

Hence
/(*!,.. x„) = fl fk(xk), (6)
k=1
where fk(pck) (k = 1,...,«) is the density function of a random variable uni¬
formly distributed on the interval (ak, &*); consequently £„ are inde¬
pendent. Conversely: if fk is uniformly distributed on (ak,bk) and if the
4 are independent, the vector C = (£i, • ••,£„) is uniformly distributed in
the parallelepiped ak < xk < bk (k = 1,2,..., n).
For an infinite interval (or for a domain of infinite volume) the uniform
distribution can be defined by means of the theory of conditional probability
spaces. We shall return to this in Chapter V.

§ 7. The normal distribution

We already encountered the normal distribution. It was introduced as


the limit-distribution of the binomial distribution. It has a paramount role
in probability theory. Many random variables dealt with in practice have
a nearly normal distribution. Often, the normal distribution is called the
“law of errors”, since random errors in the result of measurements are often
normally distributed. In Chapter VIII it will be proved that the distribution
of the sum of a large number of independent random variables has approxi¬
mately a normal distribution under quite general conditions.
First of all let two general notions be defined: that of the similarity of
distributions and that of a family of distributions. Two distribution functions
F1(A') and F2(a) are said to be similar, if there exist two numbers a ^ 0 and
m such that if Fx(x) is the distribution function of a random variable
then F2(x) is the distribution function of r] — at; + m. As the inequality
x — m
+ ra < x is for cr > 0 equivalent to £ < - and for a < 0 to
a
X — ffl
£ > -, we have either

x—m
F2(x) = Ft (la)

(for a > 0), or

F2(x)=1-F1 (lb)

(for a < 0).


If Fx(x) is absolutely continuous, F2(x) is absolutely continuous as well.
In this case, we obtain for the density functions f(x) — F-(x) (/ = 1, 2)

h (*) = (2)
IV, § 7] THE NORMAL DISTRIBUTION 187

Clearly the relation of similarity is symmetric, reflexive, and transitive.


Thus it permits the classification of the distributions into types called fam¬
ilies. Every family of distributions is a set depending on two parameters
(m and cr). All uniform distributions (on the line) are thus similar to the dis¬
tribution uniform on the interval (0, 1). In fact, the uniform distribution on
{a, b) has the density function

l Ax - a
b —a b —a

where /(x) is the density function of the uniform distribution on (0, 1),
that is
[ 1 for 0 < x < 1
/(*) =
[ 0 for x < 0 and 1 < x.

One can also define families of multidimensional distributions. The dis¬


tribution functions Fx{xx,. . ., x„) and Ffxx,. . ., x„) are said to be similar
if there exists a linear transformation
n

b,k @ko T ^ akl £i {k 1,2,.. .,n) (3)


<=i

with a non-zero determinant D = | aik | such that the random vector £ =


— (£x,. . ., £„) with the distribution function Fx(xu . . ., x„) is transformed
by (3) into the vector £' = (£[, . . ., <Q with distribution function
F2(x i,. . ., x„).
If the functions Fx(x j,. ..,
x„) and F2(xx,. x„) are absolutely contin¬ . .,
uous and have the density functions fx(xx,. . ., x„) and f2(xx,. . ., x„), then
by a well-known property of linear transformations it follows that

fzi.x. • • •» Xn) | | fl (.xli • • •> Xn)) (4)

where
n

x'k— ak0 + Yj aki Xi (k = 1, • • -, A).


1=1

For n = 1, (4) reduces to (2).


Let us now return to the normal distribution. We shall call every distri¬
bution normal which is similar to that obtained as the limit of the binomial
_—
2
6
distribution, i.e. to the distribution with density function —Thus the
s/2tc
GENERAL THEORY OF RANDOM VARIABLES [IV, § 7
188

density function of a normal distribution has the form

x—m 1 ' (x — m)2


— i- CAP (5a)
a x/27T(J 2<r

where
1
<p(x) = (5b)
J2n

(We have taken a > 0; this restriction to positive values of a is permissible


since y(x) is an even function.) In other words: a normal distribution func¬
tion has the form

F(x) = <P (6a)

where
X

e 2dt. (6b)

— 00

If the distribution function of a random variable £ is given by (6a), we shall


call £ for the sake of brevity N(m, cr) distributed. Let us now consider the
multidimensional normal distributions. For the sake of simplicity, let us
first restrict ourselves to the case of two dimensions.
If £ and i/ are independent normally distributed random variables with
density functions

mx , 1 ix - m2
<P and -cp -—-
<7i °2 l °2

then the density function of the random vector £ = (£, tj) is equal to the
product of the density functions of £ and rj; i.e. to

1 1 [(x-Wi)2 i (y - m2f 1]
h(x, y) exp O ‘ 1
(7a)
2nal u2

A random vector having a density function of the form (7a) or one similar
to it is said to be normally distributed (or Gaussian). Since all distributions
having density functions of type (7a) are similar to each other, the two-
dimensional normal distributions form a family. The density function (7a)
(with mx — m2 = 0) is represented on Fig. 19.
IV, § 7] THE NORMAL DISTRIBUTION 189

A simple calculation shows that the most general form of the two-dimen¬
sional normal density function is given by

J'AC-B2
exp - y (A (x - mx)2 + 2B(x- wj) (y - m2) + C(y - m2)2) ,(7b)
2n

where A and C are positive, B is a real number such that B2 < AC, m1 and
m2 are arbitrary real numbers. If B ^ 0, £ and t] are not independent. In
fact, in this case the density function cannot be decomposed into two factors,
one depending only on x and the other only on y.
We introduce now the concept of the projection of a probability distribu¬
tion. Let £ = be an n-dimensional random vector. The projec¬
tion of the distribution of £ upon the line g having the cosines of direction
n
9k (k = 1) 2,..., w, ^ git 1)>
k=1

is defined as the distribution of the real random variable

Cg — Yj 9k^k-
k=1

If the distribution of £ is known, all its projections are known as well.


In particular, the distribution of £,c (k = 1,. . ., n) is thus the projection of
the distribution of £ upon the x^-axis. Let F(x1,. . ., x„) be the distribution
function and /(xx,. . x„) the density function of £, Fk{x) and fk(x) those of
£*.. We have

Fk(xk) = F(+ oo,. . ., + co, xk, + oo,. . ., + co) (8)


190 GENERAL THEORY OF RANDOM VARIABLES [IV, § 7

and, similarly (for almost every xk)

+ 00 +00

/*(**)= j... § f(x1,...,xn)dx1...dxk_1dxk+1...dxn. (9)


— 00 — 00

To understand the notion of a “projection” it is useful to consider the


analogous notion for a mass-distribution. For instance let a distribution
of the unit mass over the plane be determined by the density function
h(x, y) and let us “project” it upon the x-axis, in the sense that we assign
to the interval (a, b) the total mass contained in the strip a < x < b, — go <
< y < + oo. This mass is equal to
b +oo
| j h(x, y) dydx.
a —oo

Consider now the projection of an arbitrary two-dimensional normal


distribution (7b) upon the x-axis (and upon the jy-axis). For the density
function /(x) (and g(y)) of these projections, a simple calculation gives,
as we have
+ C0
(x — m)2
exp dx — 1,
n/2 no 2o2

the results

1 x — m1 y-m2
/(*) =— <P and g(y) = — cp — , (10)
(Jo <?2

where

C
0i =
AC-B2
and cr2
AC - If '
(11)

Thus the projections upon the axes of a two-dimensional normal distri¬


bution with density function (7b) are one-dimensional normal distributions.1
The projection on an arbitrary line may be calculated in the same manner
and the result is always a one-dimensional normal distribution. Suppose
that the components £lt £2,..., of an n-dimensional random vector £
are independent and is normally distributed with the density function
1 ( x
— 99 — (a = 1,...,«). The density function of the random vector £ is
ak \ok)

' The projections of a distribution in n-space on the coordinate axes are also called
its marginal distributions.
IV, § 7] THE NORMAL DISTRIBUTION 191

n
1 1
/(*!,..xn) = „ „ exp
2 z (12)
{In)2 n o-fe
fc=i

If the density function of a random vector has the form (12), it is said to
be normally distributed or Gaussian. Every distribution similar to this is
said to be an n-dimensional normal {or Gaussian) distribution. In order to
obtain the general form of the density of an H-dimensional normal distri¬
bution put

0 = k=i
Z cJk 4 + Wj, 03)

where {cjk) is an orthogonal matrix, i.e.


n
j 1 for i = j,
Z1 ^tk Cjk
k= [ 0 otherwise,
(14)

and where the my- {j = 1,. . ., n) are real numbers.1


Consider now the random variables £' as coordinates of a vector
Determine the density function g{x'x,. . ., x'n) of By (13) and (14) we
have

£*= Z CjM-mj) (15)


7=1

and in consequence of (4) we obtain (as in the two-dimensional case),

Is 1 i "
2'
dfali • • •> Xn) n n exp -vEtZ^ (xj - mj-) ’ (16)
a k=l °k l7=1
{2n)2 n ak
k=1

or, by putting
Cik Cjk
*,/= i 2 9
*=1 °k

9{X 1, ■ • •> Xn) =

(2 7i)2 n ^
A: = l

(17)
x exp V Z Z bu (*/ ~ m<) (*/ ~ mJ>
* 1=1 7=1

1 We can restrict ourselves here to orthogonal transformations, since the most


general nondegenerate linear transformation may be decomposed into an orthogonal
transformation and a transformation of the form x'k = i-kxk (k —
192 GENERAL THEORY OF RANDOM VARIABLES [IV, § 7

where (by) is a symmetrical matrix such that the quadratic form

i
1 = 1 y=l

is positive definite. It is known that a positive definite quadratic form can


be transformed into a sum of squares. Thus if the density function of £'
has the form (17), there exists an orthogonal transformation with matrix
C = (cy) such that the ^-dimensional density function of the random vari¬
ables

t°jk«;-«/)
j=i
n
has the form (12). Note that the factor 1 / J^[ tx* is equal to the positive square
*=1
root of the determinant | by |. The matrix B = (by) can be written as CSC*,
where C* is the transpose of C and S is the diagonal matrix

f o 0

l
o 0
S= <^2

0 0 ... 4r

For the determinants, since | C | = J C* j = +1, we have

i«i=isi “ ni-
Ar = l °k
Consequently, the density function (17) can be written as

j n n

g(xx,..., x„) = -7T


Z
Z Z
1=1 7=1
bU (Xi ~ mi) (XJ ~ mi) » (18)

where the quadratic form IbyZjZj is positive definite, mu . . ., mn are arbi¬


trary real numbers, and \B \ is the determinant of the matrix B — (by). Every
density function of the form (18) is the density function of an n-dimensional
normal distribution; a suitable orthogonal transformation leads from (18)
to a density function of the form (12). Since evidently all distributions with
a density function of the form (12) are similar to each other, the /7-dimen-
IV, § 8] FUNCTION OF A RANDOM VARIABLE 193

sional normal distributions form a family. It has some interest to study the
case of an m-dimensional vector ( = (£1; . . 0,. . 0) where m > n,
and where the n-dimensional vector (£ls. . £„) has a density function of
the form (12). By applying the orthogonal transformation
n

?j = Z cJk Zk + Wj
k= 1
(7=1,2,..., m), (19)

we obtain an m-dimensional vector which, however, is not really m-dimen¬


sional; indeed (19) implies
m

Zk = z
j= 1
cik (£• - mj) for k -1,2,.. .,n (20)

0 = Z cjk
/=1
- mi) for k = n + 1,. .., m. (21)

Formula (21) expresses that the point (£[,. . ., £') lies in an ^-dimensional
subspace of the m-dimensional space. A distribution of this kind is said to
be a degenerate m-dimensional normal distribution.

§ 8. Distribution of a function of a random variable

Let E, be a random variable with known distribution and let y — i[/(x)


be a Borel-measurable function. It is then easy to determine the distribution
of the random variable 17 = 1Let iA_1(F) denote the set of real numbers
x for which 1j/(x) belongs to the Borel-set E\ let further Iy denote the inter¬
val (— 00, y) and let F(x) be the distribution function of £ and G(y) thrt of 17.
It follows that

G(y)=P(n<y) = P(m-\Q)= f dF(x).

Let us first consider some particular cases. If ^ is a discrete random vari¬


able, the calculation of the distribution of rj is almost trivial. In fact, let xk
(k = 1,2,...) denote the possible values of £, then

P(?l =y)= Z
Hxk)=y
P{£ = *k),
where the summation extends over those values of k for which \l/(xk) = y.
Let us now consider the case of an absolutely continuous distribution
function. Let /(x) be the density function of Assume \p(x) to be monotonic
and differentiable and suppose i//(x) 9^ 0 for every x. If g(y) is the density
GENERAL THEORY OF RANDOM VARIABLES [IV, § 9
194

function of tj = tKO> one easily finds that

/O^OO) for inf \jj(x) < y < sup i/<x),


g(,y) = I n^~\y))\ (1)
0 otherwise,

where x = “ lO) is the inverse function of y = tA(x).

If for instance ^ is a normally distributed random variable and rj = e*,


we have by (1) putting M =

fin—]2 1
l \ M)_
exp for >- > 0,
y/2nay 2fer2
g(y) = (2)
0 for y < 0.

A random variable having the density function (2) is said to be lognormal.


The lognormal distribution is of great importance in the theory of crushing
of materials. The distribution of the grains of a granular material (stone,
metal or crystal powder, etc.), in particular of a product produced by a
breaking-process, is lognormal under rather general conditions. This den¬
sity function is represented by the curve seen on Fig. 20.
Take now another example. Let the random vector ( be uniformly dis¬
tributed on the circumference of the unit circle; what is the density of the
distribution of the projection £ of £ on the x-axis? We obtain from (1)

1
for — 1 < y < + 1,
g{y) = n^l-y2 (3)
o otherwise.
IV, § 9] THE CONVOLUTION OF DISTRIBUTIONS 195

§ 9. The convolution of distributions

Let two independent random variables £ and rj be given having the distri¬
bution functions F(x) and G(y) respectively. Consider the sum £ = £ + rj;
let H(z) be its distribution function. We have clearly

H{z) =
x+y<z
fj dF(x) dG(y) = (1)

= T F(z-y)dG(y)= +f G(z-x)dF(x).
— OO — 00

The distribution function H(z) is called the convolution of the distribution


functions F(x)and G(y). The convolution operation is denoted by H — F * G
Clearly it is commutative and associative; in fact, if £ls £2, £3are independent
random variables, we have

£1 + £2 = £2 + fi and (£x + £2) + £3 = £1 + (£2 + £3).

From this follows for the distribution functions that

Fx * F2 = F2 * F1 and (F1 * F2) * F3 = Fl * (F2 * F3).

Suppose that £ and rj are independent random variables having absolutely


continuous distribution functions; let f(x) and g{y) be their density functions.
It will be shown that H = F * G is also absolutely continuous and that
the density function of £ = £ + rj is
+ 00 'h°o

h(z) = J f(x) g(z - x) dx = j f(z - y)g(y) dy, (2)


— 00 — 00

(the equality of the integrals in (2) can be shown e.g. by a transformation


of the variable).
Formula (2) can be proved as follows: (1) is equivalent to

H(z) = J J f(x - y) dx dG(y) =

— 00 — CO

, +00 (3)
= J j f(x - y) dG(y) dx.
— 00 — 00

By differentiating (3) we obtain

«*)= f’/Cz-rtrfGO'). (4)


— 00
GENERAL THEORY OF RANDOM VARIABLES [IV, § 9
196

From (4) follows immediately (2). Further it can be seen that the distribu¬
tion of C = £ + n is absolutely continuous, provided that one of £ and r\
has such a distribution, regardless of the other distribution.
The function h{x) defined by (2) is called the convolution of the density
functions/(x) and g(x) and is denoted by h =/* g. It is easy to show that
h{x) is a density function; as a matter of fact (2) implies h{x) > 0 and
+ 00 + 00 + 00 + “? +00

f h(pc) dx — J j f(x-y)g(y)dydx= j f(x) dx j g(y)dy = \.


— 00 —00 —00 “"00 —OO

In what follows, we shall give some examples of the convolution of ab¬


solutely continuous distributions (the convolution of discrete distributions
was already dealt with in Chapter III, § 6).

1. Convolution of uniform distributions.


Suppose that

-- for a < x <b,


b—a (5)
/(*) =
0 otherwise
and

—- for c < x < d,


9(x) = d— c (6)
0 otherwise.

Assume d — c > b — a. The convolution of the density functions of two


independent random variables b, and r\ with the respective density functions
(5) and (6) is equal to

0 for x<a+ c or b + d < x,

x — (a + c)
for a + c < x < b + c,
(b — a){d — c)
h{x) - 1 (7)
for b + c < x < a + d,
d— c
(b T c?) — x
for a + d < x < b + d.
(ib — a){d — c)

The graph of the function y = h(x) is an isosceles trapezoid with its base
on the x-axis (Fig. 21 represents the case a = — 1, b = 0, c = — 1 ,d— +1).
Note that h{x) is everywhere continuous, though/(x) and #(x) have jumps.
(The convolution in general smoothes out discontinuities.)
IV, § 9] THE CONVOLUTION OF DISTRIBUTIONS 197

In particular, if £ and r\ have the same uniform distribution, the graph


h(x) is an isosceles triangle; this is the so-called Simpson distribution.
By repeated application of (2) one can determine the density function of
the sum of several independent random variables with absolutely continuous

distribution. Thus for instance the density function of the sum of three
independent random variables uniformly distributed on (— 1, +1) is given by

0 for |xj > 3,

(3 —|*|)2
for l<|x|<3,
h(x) — 16 (8)
3 - x2
— for 0 <lx| < 1.

The function h{x) (cf. Fig. 22) is not only continuous but also everywhere
differentiable. The curve has already a bell-shaped form as the Gaussian
curve; by adding more and more independent random variables with uni¬
form distribution on ( - 1, +1), this similarity becomes still closer: we have
here a particular case of the central limit theorem to be dealt with later. The
density function of the sum of n mutually independent random variables
with uniform distribution on (—1, +1) is

1 m
Z (-D‘ {n + x — 2k)n 1 for \x\ <n,
(9)
/»(*) = 2" («-!)! *=o

0 otherwise
GENERAL THEORY OF RANDOM VARIABLES [IV, § 9
198

as it is readily proved by mathematical induction. The graph of the function


fn(x) consists of arcs of functions of degree n - 1; it is (n - 2)-times
differentiable, i.e. the first n - 2 derivatives of these functions are equal
at the endpoints of these arcs.
This distribution was first studied by N. I. Lobatchewski. He wanted to
use it to evaluate the error of astronomical measurements, in order to decide
whether the Euclidean or the non-Euclidean geometry is valid in the Uni¬
verse.

2. The convolution of normal distributions.


Let £ and rj be two independent random variables with density functions

mx 1 x — m?
f(x) = <P and g(x) = — cp
°2 <7o

it follows from (2) by an easy calculation thai putting h = f * g one has

1 x — (mx + mf)
h(x) =
/ 2 \ 2
(10)
+ a2 V 0"l + °2

The sum of £ and rj is thus also a normally distributed random variable;


i
the parameters of the distribution are m = mx + m2 and o = (of + ui)2.
It follows that the sum of any number of independent and normally distrib¬
uted random variables is again a normally distributed random variable.

3. Pearson’s y2-distribution}
The distribution of the sum of the squares of n independent random vari¬
ables £i,.. with the same normal distribution, plays an important role
in mathematical statistics. We shall determine the density function of this
sum for any n. Let cp(x) be the density function of the random variables
£k {k = 1,2,..., n). Let the sum of the squares of the be denoted by

2 W k2
In = X
fc = 1
ft- (11)
Let h„{x) be the density function of The statement

e -
for x > 0,
K(x) = 22 r (12)

o for x<0

1 This distribution was already used by Helmert, before Pearson.


IV, § 9] THE CONVOLUTION OF DISTRIBUTIONS 199

can be proven by mathematical induction. For we have by Formula (1)


of §8

for x > 0.
hi(x) = s/2nx (13)
0 for x < 0,

which shows that (12) is valid for n = 1. Suppose that (12) is valid for a
certain value of n. Given (2) and the induction assumption we have
!
1

K+i(*) = (14)
0

As for Euler’s beta function B(ci, b) the formula


i
B(a,b)= f /“^(l-tf-'dt (a>0, b > 0) (15)
6
is valid, we have1
mm (16)
B(a, b)
r(a + b)

Since F = yj n, from (14) follows

n+1
-1 -

hn+i(x) = for x > 0.


n+ l
n + 1
2 2~ r

Thus (12) holds with n+l instead of n; thus it holds for every n.
From (12) we obtain that the density function gn(x) of i„ =
= +... + £ is
^_I 2

gn(x) = 2—n—e—y for x>0. (17)

2¥Ht)
The distribution with density function (12) is called Pearson's distribution
with n degrees of freedom. The distribution with density function (17) is
called the %- distribution with n degrees of freedom.
1 For the proof of this formula cf. e.g. F. Losch and F. Schoblik [1] or V. 1. Smir¬
nov [J ].
200 GENERAL THEORY OF RANDOM VARIABLES [IV, § 9

For n = 3 Equation (17) gives the density function of Maxwell's velocity


distribution, which is of great importance in the kinetic theory of gases.
Consider a gas contained in a vessel. The velocity of a molecule has for
its components in the directions x, y, and z the random variables £, rj, and £,

respectively. It is shown in the kinetic theory of gases that these three ran¬
dom variables are independent, normally distributed, and have the same
density function:

The physical meaning of £, t] and £ having identical distributions is that


the pressure of the gas has the same value in every direction; m — 0 means
that the gas does not move as a whole, only its molecules move at random.
We wish to determine the density function of the absolute value of the
velocity

V = yj + rf + C2 • (18)

£ rj C
Clearly, —, —, — have the density function tp(x); hence, by (17), the density

function of — is g3(x). According to Formula (1) of § 8 the density v(x) of v

'x ' 3 1 f i = —1 r
is —g3
a w
. SinceF
[2 =—r
2 UJ y/n, we have

2 2

‘><t’
»(*) = — —x e {x > 0). (19)
O" n
IV, § 9] THE CONVOLUTION OF DISTRIBUTIONS 201

(The curve representing y = v(x) is drawn on Fig. 23 for a = 1.) Note that
a has the physical meaning

where T is the absolute temperature, M the mass of the molecules, and k is


Boltzmann’s constant.
1 - *
Let further be noted that h2(x) = — e 2 : the ^-distribution with 2

degrees of freedom is an exponential distribution.

4. Convolution of exponential distributions.


The exponential distribution was introduced in the previous section in
connection with radioactive disintegration; but it occurs also in many
other problems of physics and technology. In what follows, we give an
example from the textile industry; namely the problem of the tearing of the
yarn on the loom. At a given moment, the yarn is or is not torn, according
as the section of the yarn, submitted at this moment to a certain stress, does
or does not yield to the latter. Evidently this does not depend on the time
duiing which the loom worked uninterruptedly. Let £ be the random vari¬
able representing this time-interval, i.e. the time between the start of the
work and the first rupture of the yarn; let F(x) denote the distribution func¬
tion and f(x) the density function of £; for F(x) one obtains, as in the case
of the radioactive disintegration, the functional equation

x x yi -r 0j

= 1 - m, (21)
1 - F(s)
from which it foHows that

F{t) = 1 - e~u for t > 0 (22)


and
1

(*>0) (23)
II

(where A is a positive constant). Hence the random variable £ has an


exponential distribution.
Consider now the functioning of the loom during a sufficiently long time
interval. Let £„ denote the time interval until the 77-th rupture of the yarn.
For the sake of simplicity assume the time wasted between the rupture and
the tieing of the yarn to be so small that it can be neglected. Then we have

Cn = £ 1 + £2 + • • • T ,
where £ls. . ., £„ are independent and every one of them has distribution (22).
Let Fn(t) be the distribution function and f,(t) the density function of £„.
GENERAL THEORY OF RANDOM VARIABLES [IV, § 9
202

It can be shown by induction that


)n tn-1
/"(/) =- for t> 0; n = 1,2,— (24)
(n — 1)!

By (23), Formula (24) holds for n = 1. Assume its validity for a certain value
of n. Since {„+i = {„ + £„+i and further C„ is independent of fB+1, For¬
mula (2) can be applied here. Thus we obtain

L («)/i (* ~u)du =-—-


0

and (24) is hereby proved. It follows

Fn (0 = Fn (Xt), (t> 0) (25)


where

r" ^ = (n - l)i J W"_1 du ~ 0)


n
(26)

is the incomplete F-function.


The distribution with the distribution function (25) is called the T-distri-

bution of order n and parameter X. For X = —, fn(t) is equal to the function

h2n(t) defined by (12); thus the ^-distribution with 2n degrees of freedom is


the same as the /"-distribution of order n.
This result permits us to calculate the probability that the yarn is torn
exactly n times during a time interval (0, T). Let vr be the number of break¬
ings of the yarn in the time interval (0, T); clearly vT can only assume non¬
negative integer values. The event vT — n means that £„ < T, but £«+i > T.
Let An denote the event < T; then, because of An+1 cz A„, we have

P(vr = n) = Fn(T)- Fn+1 (T). (27)

Substituting here for Fn{T) and Fn+1(T) and integrating by parts, we find

(XT)n e~XT
P(vT = n) = (28)
n\

Thus the random variable vT has a Poisson distribution of parameter XT.


Here we encountered an important further property of the Poisson distri¬
bution: if a sequence of events has the property that time intervals between
consecutive events do not depend on each other and have distribution function
1 — e~Xt (t > 0), then the number of events occurring in a fixed interval
(0, T) has a Poisson distribution with parameter XT.
XV, § 10] DISTRIBUTION OF A FUNCTION OF RANDOM VARIABLES 203

The above reasoning can be applied to a large number of technical pro¬


blems (e.g. the breaking of machine parts).
One can also determine the distribution of a sum of independent random
variables having exponential distributions with different parameters. Let
• • •> £« be independent random variables with exponential distributions,
let Xk e~Xkt be the density function of £k for t > 0 where the numbers A1}. . .,
A„ are all different. It can be shown by induction that the density function
9„(0 Vn = + ■ ■ ■ + is given by

n e~Xk<
9n (0 = (~ 1 )"_1 X1X2...X„^ TT--— for t > 0. (29)
k—1 11
i^k

Formula (29) has the following physical application: Let Ax be a radioactive


substance with the disintegration constant Ax. The disintegration of an Ax
atom means its transformation into some other kind of atom A2; suppose
that the A2 atoms are radioactive as well and have the disintegration con¬
stant X2. Similarly, let Ak {k — 3, 4,. . ., n) be the result of the disintegration
of an Ak_1 atom, the disintegration constant of Ak being Xn for k < n.
Assume that the substance A„+1 is not radioactive. Denoting by g the
time necessary for the transformation of an atom Ax into an atom An+1,
rjn clearly has the density function (29). For instance if A1 is uranium, An+1
is lead and t]n is the time necessary for an uranium atom to change into lead.

§ 10. Distribution of a function of several random variables

Let be arbitrary random variables with the joint (n-dimensional)


distribution function F(x1,. . xn) and let g(xl5. . ., x„) be a Borel-measur-
able function. Evidently, the distribution function of rj = g{£l5. .^„) is

P(n<y)= J... J dF(xx, ...,xn).


9(x ..x„)<y

Let us consider some important particular cases. Let £ and tj be inde¬


pendent random variables with absolutely continuous distribution functions;

let us consider the random variables (i = £t] and £2 — —. Let the density
n
functions of £ and g be f(x) and g{y); we have

P(£n <?)= j J f(x) g(y) dx dy (la)


xy'<z
and

f(x) g(y) dx dy . (lb)


GENERAL THEORY OF RANDOM VARIABLES [IV, § 10
204

By differentiating we obtain the corresponding density functions p{z) and


q(z) of Ci and C2'
+ 00
dy
P 0) =J g(y)f (2)
\y\’

q(?)= J \y\ff(y)f(?y)dy. (3)


— 00

Let us give some examples.


1. Student's distribution.
We shall determine the distribution of the random variable

£0
c= (4)
\J £>1 + • • • + %n
where £0, Ci,.. C„ are independent random variables having the same
normal distribution with density function
X*
1 2
<P(x) = e
J2n
Let qn(z) be the density function of £. We know already the density func¬
tion of the denominator of (4) (cf. Formula (17) of § 9), hence we obtain
from (3)
n + 1
r
qn 00 = n±1
(5)
2\ 2
x/71 r (i +z0

The distribution with density function (5) is called Student's distribution


with n degrees of freedom. It plays an important role in mathematical sta¬
tistics, since “Student’s t-test” is based on it. The particular case n = 1
gives the Cauchy distribution, with the density function

1
<h(?) = (6)
7T (1 + Z2)

2. Distribution of the ratio of two independent random variables having


'/^-distributions.
In mathematical statistics one is often interested in the density function
h(z) of the ratio of two independent random variables C and rj having x2-
distributions with n and m degrees of freedom, respectively.
It follows easily from Formula (12) of § 9 and from (3) of the present
IV, § 10] DISTRIBUTION OF A FUNCTION OF RANDOM VARIABLES 205

section that
n +m \
r
2 J
h (z) = n+m
for z > 0. (7)
n) tw
r (1+*) 2
2)r W
3. The beta distribution.
If C is the ratio considered in the previous example, let t denote the ran-
c
dom variable t = and k(x) the density function of t. By (1) of § 8
we obtain
1 +c
rln + m\
——i —i
2 i x2 (1 - x) 2 for 0 <x< 1 . (8)
Kx) =
m\
r I—I r
UJ 2j

The distribution function A^(x) = j k{t) dt is thus


o
x
'n + m

*(x)=-
i
t2 -1 (1-02
-l

dt=B m (x) for 0<x<l, (9)


n 2* 2
r r m\
2j 2J
where
*
Ba b(X) =
a’bK )
r(a+b>}.. r
T(a) T(b) J
(i - tf-1 dt
V
(i o)'
V
o
is, up to a numerical factor, Euler’s incomplete beta integral. The distribu¬
tion B(a, b) (a > 0, b > 0) having (10) for its distribution function is called
the beta-distribution of order (a, b).

4. Order statistics.
In nonparametric statistics the following problem is of importance: Let
£2,. .., be independent random variables with the same continuous
distribution: let F(x) be the distribution function of £k. Arrange the values
of £1}. . ., £„ in increasing order,1 and denote by £* the /c-th of these ordered
values: hence, in particular
£f = min 4, £*= max (11)
1 <,k<,n l<,k<,n

I* is called the Ar-th order statistics of the sample (£x,. . ., £„).

1 The probability that equal values occur is 0.


206 GENERAL THEORY OF RANDOM VARIABLES [IV, § 10

Determine now the distribution function jF^x) of (k = 1,2,. .n);


clearly < x means that among the values taken on by £lsthere
are at least k which are less than x. The probability that r given variables
among the 4 are less than x and the other n — r greater than or equal to x,

is given by [T(x)]r [1 - i7(x)]"-r; since the first r can be chosen in dif-

ferent ways, we have

Fk(x)= i ("l[F(x)r[l-JF(x)r-. (12)


r=k rJ

This expression can be simplified by taking into account the identity


p
n
n\
i /(I P)n~r
(k — 1)! (n — k)l
x*_1(l -x)n~kdx, (13)
0
which gives
Fk(x) = Bk>n+1 _* (F(x)) , (14)

where Bkn+1_k (x) is the incomplete beta function of order (k, n + 1 — k)


(cf. (10)). In the case when F(x) = x for 0 < x < 1, i.e. if the <F-s are uni¬
formly distributed on the interval (0, 1), 4* has a beta distribution of order
(k, n + 1 — k), and, in particular, for 0 < x < 1,

Fx(x) = P (min 4 < x) = 1 - (1 - x)" (15)


l<k<,n
and
Fn(x) = P (max 4 < x) = x". (16)
i <,*<,"

If 4>. . ., 4 are independent and have the same continuous, monotone,


and strictly increasing distribution function F(x), then the random variables
r]k = F(4) (k = 1,. . ., n) are independent and uniformly distributed on
the interval (0, 1). In fact, if x — F~x{y) is the inverse function of y = F(x),
we have
P(rjk<x) = P (4 < F~x (x)) = F(F~X (x)) = x (17)

for 0 < x < 1.


If now nl is the k-th among the random variables r\x, ranked ac¬
cording to increasing order, it is clear that r/* = F(^*k) and we have

P (B* < x) = Bk n+1_k (x) . (18)


5. Mixtures.
Let Fk(x) (k = 1,2,. . .) be arbitrary distribution functions and {pk}
IV, § 10] DISTRIBUTION OF A FUNCTION OF RANDOM VARIABLES 207

a discrete probability distribution. Then

m = il>kFk(.x) (19)
fc = l
is also a distribution function. It is called the mixture of the distribution
functions Fk(x) (k = 1,2,...) taken with the weights pk. This concept was
already defined in the foregoing Chapter for the particular case where the
functions Fk(x) are discrete distribution functions.
Consider the following example: a physical quantity is measured by two
different procedures,the errors of the measurements being in both cases nor-
l x 1 lx
mally distributed with density functions —<p — and —cp\— . JVX mea-
°i cr2 l a2j
surements were performed by the first, and N2 measurements by the second
method without registering, which of the results was furnished by the first
and which by the second of the methods (the measurements were mixed).
What will be the distribution function of the error of a measurement chosen
at random from these N = N-i + N2 measurements? If

1
*(*) = dt.
yj'2 TC

it follows from the theorem of total probability that this distribution func¬
tion F(x) is given by
N, X n2 X
F(x) = -±* + — <P
N ^2

i.e. F(x) is the mixture of the distribution functions of the errors of the two
N1 j N2
methods, taken with the weights — and —-.
N N
It is easy to extend the notion of the mixture to a nondenumerable set
of distribution functions. If F(t, x) for each value of the parameter Ms a
distribution function and for each fixed value of x F(t, x) is a measurable
function of t and if G(t) is an arbitrary distribution function, the Stieltjes
integral
+ 00
/7(x) = J F(t, x) dG(t) (20)
t- CO

defines a distribution function called the mixture of the distribution func¬


tions F(t, x) mixed with the distribution function G{t). If G(t) is a discrete
distribution function, (20) reduces to (19). It is easy to see that the function
H(x) defined by (20) is in fact a distribution function.
208 GENERAL THEORY OF RANDOM VARIABLES [IV, § 11

Let us consider an important application. One has often to determine


the distribution function of the sum

V = £1 + £2 + • • • + (21)
such that the number v of the terms is a random variable. Assume that the
are mutually independent and v is independent of the £k. Let Fk(x) de¬
note the distribution function of £*., Gr(x) the distribution function of £„ =
= £1 + ■ - • + £n and H(x) the distribution function of the random variable
t] defined by (21); let further be P(v = n) = pn (n - 1, 2,. . .). Then, by
the theorem of total probability,
00

H(x) = £ pnGn(x), (22)


n= 1

i.e. H(x) is a mixture of the distribution functions Gn(x).


In-
Example. If pn pkqn k (n = k, k + 1,. . .) and Fk(x) = 1 —
k- 1
— e Xx for x > 0, further if the random variables v, ^l5 £2,.. • are inde¬
pendent, then

Xn t*~1 e~u
GJLx) - dt,
in- 1)1
and, by (22)

(PX)kk tfk-1 „-plt


H{x) = dt, (23)
(k- 1)!

hence r\ has a F-distribution of order k and parameter pX.

§ 11. The general notion of expectation

We shall now extend the notion of expectation to an arbitrary random


variable £. In order to do this assume that a great number of independent
observations were made on the value of £. Arrange the observed values
into classes such that the /c-th class should contain the values between kh
(included) and (k + 1 )h (excluded) (h > 0; k 0, 1 = 2, ± ,± ..
.). According
to the law of large numbers the arithmetic mean of the observed values will
be near to
+ 00

X khP (kh < £ < (/c + 1) h) , (1)


fc=—00

provided of course that the series is convergent; the approximation will be


the closer the smaller is the value of h. Hence it is natural to define the expec-
IV, § 11] THE GENERAL NOTION OF EXPECTATION 209

tation of £ by

E(0 = lim £ khP(kh <Z<(k+\)h), (2)


h-+ 0 k= — oo

if this limit exists. If £ is a discrete random variable, this definition coincides


with that given in the preceding Chapter.
Obviously, if the limit (2) exists, it represents the Lebesgue integral of the
function £ = £(cti) with respect to the probability measure P, i.e.

£(f) = J «<»)<//>. (3)


n

(2) can be interpreted in a different manner too. Let £h — h , where

[x] denotes the entire part of the real number x; gh is a discrete random
variable and
+ O0 +00

X khPfkh < ^ < (k + l)F) — X khP(fh = kh) — E(£f)


k= — co k= — oo

is the expectation of £h. (2) can be written in the form

E(0 = lim E(fo ; (2')


h-~0

is the greatest multiple of h not exceeding £. For h = 10_r, £h is nothing


else than the value of £ rounded off to r decimal places.
In what follows, the knowledge of the Lebesgue integral will be taken for
granted; we shall give without proof the properties of E(f) which follow di¬
rectly from the properties of the Lebesgue integral. Theorems, which in the
general case can be proved in the same manner as in the case of discrete
distributions and which were proved for the latter in § 7 of Chapter III,
will be formulated here without proof. But the reader may profit from
carrying through these proofs for the general case too.
Evidently, the expectation E(g) depends only on the distribution function
of £; hence one may call E(£) the expectation of the distribution of £.
If £ is a random variable with distribution function F{x), then

+ 00

E(f) = J xdF{x). (4)


— oo

If £ is bounded with probability 1, then E(£) exists. If P(A < £ < B) — 1,


then A < E(if) < B; in particular, if > 0) = 1, we have E(f) > 0,
the equality being valid if and only if P(f — 0) = 1. If the distribution func-
GENERAL THEORY OF RANDOM VARIABLES [IV, § 11
210

tion of £ is absolutely continuous and if f(x) is the density function of


then

E(0 = f xf(x)dx. (5)

E.g. for the Cauchy distribution with the density function

1
/(*) 7t(l + X2)

the expectation does not exist, since in this case the integral (5) does not
converge.
Let us now consider some examples.

1. Expectation of the uniform distribution.


If ^ is a random variable uniformly distributed on the interval (a, b), it
follows from (5) that
a+b
~ 2~

which is also evident because of the symmetry of the uniform distribution.

2. Expectation of the normal distribution


If £ is a normally distributed random variable, its density function has
the form
(x — m) 2
/(*) = exp (m real, o > 0).
la2

By applying (5) we obtain easily

E(0 = m.

Thus we have found the probabilistic meaning of one of the parameters


of the normal distribution,1

3. Expectation of the gamma distribution.


If the random variable ^ has a T distribution of order k, its density func¬
tion is of the form
Xk xk 1 e Xx
/(*) = (x > 0),
(* - 1) 1

1 Later on we shall see that a is the standard deviation of


IV, § 11] THE GENERAL NOTION OF EXPECTATION 211

where A is a positive constant; from this it follows by (5) that

™-t-
In particular, the ordinary exponential distribution with the distribution

function 1 — e~Xx for .x > 0 has expectation —. Thus we found another


/I

probabilistic meaning of the parameter A of the exponential distribution.


The disintegration constant of a radioactive substance is thus the inverse
of the mean life-time of a radioactive atom. As we have seen, the relation
1 h
— = -—— holds between the constant and the half-period h; from this it
A In 2
follows that the mean life is equal to the product of the half-period and

- (i.e. it is 1.34 times the half-period).


In 2

4. Expectation of the yz- and %-distributions.


According to Formula (12) of § 9 the density function of yfn is

x2 e~2
K (*) = —-n

2Tr

A simple calculation gives

E(xl) = n.
Similarly, for the expectation of yn

n+ 1

E(X.) =sfl

By applying Stirling’s formula, we find for n -> co

hence
E(xl) * [E(Xn))2 if oo.

5. Expectation of the beta distribution.


Let £ be a random variable with a beta distribution; its density function is
212 GENERAL THEORY OF RANDOM VARIABLES [IV, § ll

r(a + b)
x*"1 (1 - xf-1 (0 < x < 1).
K,b(X) mm
From this, by (5),
a
m= a + b

6. Order statistics.
Let £1}.. be independent random variables each uniformly distrib¬
uted on the interval (0, 1). Let be the random variable which assumes the
k-th of the values ranked according to increasing magnitude;
by Formula (14) of § 10

Fk(x) = Bk>n+1_k(x).

Hence £(££) = the expectations of the £*-s subdivide the interval


n+ 1
(0, 1) into n + 1 equal intervals, as could also be guessed by a symmetry
argument.
We hinted already at the analogy between probability distributions and
distributions of masses. Consider now the distribution of the unit mass on
a line, such that between the abscissas a and b > a there should lie a mass
F{b) — F(a), where F(x) is a given distribution function. If x0 is the center
of gravity of this distribution, we know that
+ 00
x0 = j xdF(x) ,
— 00

hence x0 is equal to the expectation of the probability distribution which has


for its distribution function £(x).
Let ^ be an arbitrary random variable and A an event of positive proba¬
bility. We define the conditional expectation E(c, j A) of t; with respect to
the condition A as a limit

£(£ | A) = lim Yj khP (kh < £ < (k + \)h\A).


h-+0 k= — oo

pr d\
Since P(B | A) < - , the existence of £(£) implies the existence of the
P(A)
conditional expectation E(£, | A) for any event A such that P(A) > 0.
If F(x | A) is the conditional distribution function of ^ with respect to
the condition A, then

E(£ \A)= $ xdF{x j A) . (6)


IV, § 11] THE GENERAL NOTION OF EXPECTATION 213

Clearly, since
I {(oj) dP
= *\A)
— Af «“)<« = SI

where Q{B) = P(B | A), and Q{B) is a probability measure, all results valid
for ordinary expectations are also valid for conditional expectations.
We shall now give some often used theorems.

Theorem 1. The relation

E(£ ck(k) = t
k=1 fc=1

holds for any random variables £,k with finite expectation and for any con¬
stants ck. Thus the functional E is linear.

This theorem is a direct consequence of (3) and of the corresponding


properties of the integral.
Let <*; and r] be two normally distributed independent random variables
with density functions
1 1 I x — m2
— <P and — cp -
0i °2 02

The density function of the random variable £ + r\ is, as we have seen al¬

ready, — cp ——^ , where m = m1 + m2 and o =yfoof. It was proved


a
above that the parameter m figuring in the density function is the expecta¬
tion of the distribution. Hence the relation m — mx + m2 is a consequence
of Theorem 1.
Similarly, because of Theorem 1, the expectation of the gamma distribu¬

tion of order n is since the gamma distribution is the distribution of the

sum of n independent random variables with the same exponential distri¬

bution of parameter X (i.e. having the same expectation — ). The sum figur¬

ing in this example was one of independent random variables; one should,
however, realize that Theorem 1 holds for any random variables, without
any assumption about their independence.

Theorem 2. Let An (n = 1,2,...) (P(A„) > 0) be a complete system of


events and £ a random variable such that its expectation E(f) exists, then

E(0= Yj E{f | A„) P(A„). (7)


n=1
214 GENERAL THEORY OF RANDOM VARIABLES nv, § 11

This follows immediately from the theorem of total probability.


The statement of Theorem 2 may be expressed in the following manner:
Consider r\ — E(£, \ A„) as a random variable with its values depending on
which one of the events Ak (k = 1, 2,. ..) occurred; i.e. rj = E(f | A,,),
if event A,, occurred. Then the right hand side of (7) is just the expectation
of this discrete random variable rj, hence

£({)=£(£({ MO). (8)


Theorem 3. If £, and rj are independent random variables such that E{f)
and E(rj) exist, then the expectation of exists as well and

E(St,)=E(QE(a). . (9)

Proof. Assume first £ > 0. Let Ak be the event kh < 17 < (k + 1 )h;
evidently, the events Ak (k = 0, +1, ±2, . . .) form a complete system of
events. Hence, by Theorem 2,

+ 00

E((i)= E P(Ak)E((V\Ak). (10)


h— — 00

The conditional expectations E(£rj | Ak) exist, since 17 is bounded, under


condition Ak.
Since, however, £ and t] are independent, we have

E(t;)kh<E(Zri\Ak)<E(0(k+l)h. (11)

If we put this into (10), the series on the right side can be seen to converge,
thus E(£rj) exists; further (9) holds since the sums

£ khP(Ak) and 'f (k+\)hP(Ak)


k= —
00 k=-
00

tend to E(rj), if h -» 0. Thus (9) is proved for £ > 0. The restriction £ > 0
can be eliminated as follows: Put

t _ KI + £ E Id-f
(12)
- 2 ’ ^= ;

then ^ > 0, > 0 and £ = — £2. Since rj is independent of and £2,


we have

Etfrj) = E(£x n) - E(t;2 r,) = [E{fx) - £(£2)] E(r]) = E{f) E(rj) (13)

and herewith Theorem 3 is proved.


IV, § 11] THE GENERAL NOTION OF EXPECTATION 215

Theorem 4. If F(x) is the distribution function of £ and if E(f) exists, the


following limit relations are valid:

lim x(l - F(x)) — 0, (14)


X--*--fOO

lim xF(x) = 0. (15)


X-+ — CO

Proof. Since E(f) exists, the integral


+ 00
| | x | dF(x)
— 00

exists. Hence
+ 00

0 < lim x(l — F(x)) < lim ( ydF(y) = 0.


*-►+00 X--+00 \

The proof of (15) is similar.

Theorem 5. If E(f) exists, it can be expressed by ordinary integrals:

oo 0
E(0 = f (1 - F(y))dy - J F(y)dy. (16)
6 -00

Conversely, the existence of the integrals on the right-hand side of (16)


implies the existence of the expectation E{£).

Proof. An integration by parts gives

(ydF(y) = - x(l - F(x)) + | (1 - TOO) dy (17)


o o
and

j ydF(y) = xF(— x) - j F(y) dy. (18)


-X -X

If we add term by term Equations (17) and (18) and let x tend to infinity we
obtain, by (14) and (15), Formula (16).
Conversely, the existence of the integrals on the right-hand side of (16)
implies the existence of the expectation E(f). In fact, the convergence of the
integrals implies for x > 0

-
x( 1 - F(x)) < 2 f (1 - F(y)) dy and
;T
xF(- x) < 2 j F(y) dy,
x ~x
216 GENERAL THEORY OF RANDOM VARIABLES [IV, § 11

hence (14) and (15) are valid. Because of (17) and (18), the second part of
Theorem 5 follows.
Theorem 5 has the following graphical interpretation: Draw the curve
representing F(x) and the line y = 1. The expectation is equal to the differ-

Fig. 24

ence of the areas of the domains marked by + and — on Fig. 24. The
(evident) fact follows that a distribution symmetric with respect to x = a
has expectation a if this expectation exists. A distribution is said to be sym¬
metric with respect to a if

F(a — x) = 1 — F(a + x + 0).

Theorem 6. If H(x) is a continuous function, which is on every finite inter¬


val of bounded variation1 and £ is a random variable with the distribution func¬
tion F(x), then

E(H(0)=+$ H(x)dF(x) (19)


— 00

whenever E[H(x)] exists.

Proof. Since every function of bounded variation is the difference of two


monotone functions, it suffices to prove the theorem for monotone H{x).
Let x = H-'iy) be the inverse function of y = H(x). If H(x) is monotone
increasing, P[H(£) < x] = P[f < i7-J(x)], hence
+ 00
£(«({))= J (20)
— 00

Relation (19) results from (20) by a transformation x — 77_1(y) of the


variable of integration.

Examples. 1. The expectations E(cf), if they exist, are expressed by


+ 00

£(£")= J x"dF(x) (21)


— 00

1 Relation (19) holds for every Borel function H(x) provided that its expectation
E[H(x)] exists; cf. § 17, Exercise 47.
IV, § 13] THE MEDIAN AND THE QUANTILES 217

and are called moments of order n(n = 1, 2,. . .) of the random variable £.

2. cpi (t) = E(e“t) = £(cos /£) + /£(sin /£) = +f e'7* </£(*)


— 00

is the characteristic function of the random variable £.


Characteristic functions play an important role in the study of distribu¬
tion functions; Chapter VI will deal with them.
Theorems 4 and 5 of Chapter III, § 8 are also valid in the general case.
Their proof is almost the same as for discrete random variables.

§ 12. Expectation vectors of higher dimensional probability distributions

If the distribution of an ^-dimensional random vector

f - (Cl, • •O

is known, then so are the components £(£*.) (k = 1of its expectation.


They can be considered as the components of an ^-dimensional vector

£(0 = (£(«l),.. ;E(U),


called the expectation vector of the random vector £. In the three-dimen¬
sional case, the expectation vector specifies the center of gravity of the cor¬
responding mass-distribution.
Let us calculate for example the expectation vector of a normally distrib¬
uted «-dimensional random vector £ = (rj^ . . ., rj„), where the density
function of £ is given by Formula (18) of § 7. By the definition of the «-di-
mensional normal distribution, the components r\k can be exhibited in the
form
n

rik = mk + Z ckj tj,


7=1

where the £) are normally distributed independent random variables with


x
density function — <p and thus expectation £(£;) — 0; hence E(r\k) =mk.
o’7
Thus we have found the probabilistic meaning of the parameters mk figuring
in Formula (18) of §7.

§ 13. The median and the quantiles

The notion of the median is related to that of the expectation. Let £ be a


random variable with continuous distribution function F(x), strictly increas¬
ing for every x such that 0 < F(x) < 1. The median of £ is the (unique)
218 GENERAL THEORY OF RANDOM VARIABLES [IV, § 13

number a for which F(x) = ~. If the distribution is symmetric with respect


£
to a certain point, then the median always coincides with the expectation if
the latter exists. There are certain distributions for which the expectation
does not exist, but the median always exists. Consider for instance the

Cauchy distribution, with density function /(x) = — (1 + x2)-1. Here the


71

expectation does not exist, but the median does and is evidently equal to zero.
We introduce the somewhat more general notion of a quantile. The
q-quantile (0 < q < 1) denoted by Q(q), of a random variable £ for, more
precisely, of the corresponding distribution function F(x), continuous and
strictly increasing for 0 < F(x) < 1, by assumption) is defined as that value
1
of x for which F(x) = q. In this notation the median is equal to Q

1 3
In particular, Q is called the lower quartile, Q — | the upper quartile.
4
x m
For the normal distribution with distribution function <P where
X

1
<P(x) = dt,
V2 n
the lower and upper quartiles are

l 3
Q — m — 0.6745 er and Q = m + 0.6745 a.
4~

as follows from the tables of the normal distribution function. Since


F[Q(q)] = <1> the function x = Q{q) is the inverse of the distribution func¬
tion q = F(x).
Now we shall prove a simple but important inequality.

Theorem 1. {Markov-inequality). Let £ be a positive random variable


with finite expectation Then for every X > 1 we have

1
P(( > XE(()) S —. (1)
(The inequality also holds for 0 < X < 1, but in this case it is trivial,
since every probability is at most equal to 1.)

Proof. From 00

m — E( fi) = | xdF(x)
IV, § 14] STANDARD DEVIATION AND VARIANCE 219

follows
00 00

m > j xdF(x) > Xm \ dF(x) = Xm( 1 - F(Xm)),


Am Am

which proves (1).


If F(x) is continuous and strictly increasing and if x = Q(y) is the y-
quantile, i.e. the inverse function of y = F(x), then (1) can be written in the
form

< Xm.

In particular (for £ > 0), the upper quartile can never exceed the fourfold
of the expectation.

§ 14. The general notions of standard deviation and variance

As in the discrete case, the quantity

D(t) = + V£([{ - £({)]*) (1)

is used as a measure of the magnitude of fluctuations of the random variable


£ around its expectation. D\£) is called the variance, D(£) the standard
deviation of f D{Q is a nonnegative number which is zero if and only if
P(f = c) = 1 for some constant c. According to Theorem 6 of § 11, (1)
can be written in the form
-f GO + 00 -f- 00

D\0 = j (x-E(0)2dF(x)= j x2dF(x)- ($ xdF(x)f (2)


— 00 — 00 — 00

where F(x) is the distribution function of the random variable £. If this


distribution function is absolutely continuous and if we put F'(x) = f(x),
then we have

D\0 = J (x~E(0)2Ax) dx = [
— to — CO
X2f(x) dx-(
— 00
J xf(x) dx)2. (3)

Theorems 1-5 of § 9 and 1-2 of § 10 of Chapter III about the variance


are also valid in the general case. The proofs are essentially the same, since
they rest upon the corresponding theorems concerning the expectation.
We shall now calculate the variances of some particular distributions.

1. Uniform distribution.
If £ is a random variable uniformly distributed on (a, b), then by (3)

b—a
0(0 =
'
220 GENERAL THEORY OF RANDOM VARIABLES [IV, § 14

2. Normal distribution.
Let l be a random variable with density function

1 1 (x — m) 2\
— ——— exP
a yjlrc a 2cr2

x—m
We know that E(t;) = m. By a transformation of the variable = u

we obtain
+ 00
(x — w)2
D\0 = —7=— f {x-mf exp dx —
■Jlu a J 2(7

+ 00

U2 E 2 C?W.
/ 2 7T

From here follows by a simple calculation

Z>2 (0 = a2.

Thus we have found the probabilistic meaning of the parameter o of the


normal distribution.

3. Exponential distribution.
If the density function of the random variable £ is given by Xe~Xx for

x > 0, then we have seen that E(£) — ——. Hence


A

00

^«> = aJ(x-T)L-*=T.
o
and

The standard deviation of the exponential distribution is numerically equal


to its expectation.

4. Student's distribution.
Let ^ be a random variable having Student’s distribution with n degrees
of freedom; its density function is given by Formula (5) of § 10. Since /(x)
is an even function, E(£) = 0 for n > 2. [For n = 1 (i.e. in the case of the
IV, § 14] STANDARD DEVIATION AND VARIANCE 221

Cauchy distribution) the expectation does not exist.] By applying (3) we


obtain
+ 00
n + 1 <*
r
D\0 = n+i
dx.
sj n
r 14-

*
(1 +X2) 2

— 00

X
Take for new variable of integration y =-; then
1 +

D2(0 = —^r for n >3;


n—2

for n = 2 the variance is infinite.

5. Beta distribution.

If £ has a beta distribution B(a, b), then £(£) = —-— as we have seen.
a+b
From this, by (3)
i
r(a + b) M +1
D2 (0 = (1 — x)b 1 dx —
mm a+b

ab
(a + b)2 (a + b + 1)

6. Convolution of normal distributions.


Let ^ and rj be independent normally distributed random variables with
densities
1 ' x — mx
and
1
— (p -
,
[x — m2
°2 { a2 ,
The density function of ^ + r] is — cp with m = m1 + m2 and
a
i
g = (gi + cr|)2. The relation m — ml + m2 is valid since the expectation
of the sum of two random variables is equal to the sum of the expectations
i
of the terms. The relation g = (g2 + of)2 follows from Theorem 1 of
§ 10 in Chapter III, since we have seen that the parameter g represents the
standard deviation of the normal distribution.

7. Variance of the gamma distribution of order n.


Let £l5. . ., be independent random variables with the same density
222 GENERAL THEORY OF RANDOM VARIABLES [IV, § 15

function Xe~Xx for jc > 0; their sum = fi + • • • + 6. has a gamma dis"

tribution of order n. Since (cf. No. 3) D2(£k) =—^, Theorem 1 of § 10,

Chapter III implies D2(Cn) — Of course a direct proof is also possible.

8. Variance of Pearson’s f2-distribution.


The ^-distribution with n degrees of freedom was defined as the sum of
the squares of n independent random variables having the same normal
distribution. The variance of the square of a normally distributed random
variable with density function
1
q{x) =
Jin
is according to Theorem 6 of § 11 and Formula (3), equal to
+ 00

D\?) 2 dx - [E(J*)]2 = 2.

Consequently, according to Chapter III, § 10, Theorem 1, the standard


deviation of the ^-distribution with n degrees of freedom is y/ln.

§ 15. On some other measures of fluctuation

The difference of the quartiles characterizes to some extend the fluctua¬


3 (l
tions of a random variable. The quantity ~ -Q is called the

quartile deviation and is denoted by q(f). If £ is uniformly distributed on

(0, 1), then q(<£) = — = ^ 5 if £ is normally distributed, the tables

for the normal distribution give q{£) « 0.6745 cr. It is to be noted that in
some (chiefly older) books the density function of the normal distribution
is not given in the form

1
<*>(*) = dt,
y/2n J
— OO

but by

P{x) = <P(e J2 x) = -4= ( dt


s/n J
IV, § 15] SOME OTHER MEASURES OF FLUCTUATION 223

0.6745
where q « 0.477 ---yj 2. Anyone of these two forms can be taken as

the “standard form”; it is a question of convention which is chosen. 4>(x)


/ _ Jfl \
has the advantage that for a normal distribution of the form <P-

the expectation m and standard deviation a can be obtained immediate¬


ly, without calculation; if a normal distribution is brought to the form
'x — m
«P expectation m and quartile deviation q can be read off without

any further computation.


If the distribution function F(x) of the random variable f is continuous
and strictly increasing for 0 < F(x) < 1, then the value of £ lies with proba¬
1 3
bility in the interval Clearly, every interval
T-e
i
<m, q 5 + 0<<5< —
2
possesses the same property. If the distri¬

bution is symmetric with respect to the origin and if its density function is

monotone decreasing for x > 0, then \Q ' Ml is the smallest


uJ
interval possessing this property. In this case

1 3_ 3
= -Q T hence q(f) = Q
Q
l4J
Theorem 1. For every random variable £ symmetrically distributed about
the origin with a continuous distribution function F(x) that is strictly increas¬
ing for 0 < F(x) <1, the inequality

(1)
is valid.
Proof. Let F(x) be the distribution function of £. As E, is symmetric with
2 /£\
respect to the origin, D2(E) = F(£2). Put A = —2^y an<^ aPPty Markov

inequality (§ 13, Theorem 1) to the random variable £2. Then we obtain


( ' 3]'
P m J).
= P(?>q2(0)< (2)

On the other hand, because of the symmetry of the distribution,

1
Z\>Q (3)
4 ~2
GENERAL THEORY OF RANDOM VARIABLES [IV, § 15
224

1 D2 (0
From (2) and (3) it follows that — < yyy which proves 0).

The inequality (1) is sharp. This is shown by the following example: Let
the distribution of the random variable £ be the mixture, with weights

— — — of three normal distributions with the same standard deviation


4 ’ 2 4
a(> 0) and expectations -1,0, +1. Since s can be chosen arbitrarily small,
it follows from the example that figuring in (1) cannot be replaced by a
smaller number.
The quartile deviation q(0 is mostly used when the standard deviation
of £ is infinite, e.g. in the case of the Cauchy distribution.
The standard deviation of a random variable that is uniformly distributed

on the interval (m- a, m + a) is given by —'jU. If £ is an arbitrary random


JT
variable with E{f) = m and D(£) = cr, the interval

(m — o yj'i , m + cr ■J 3 ) (4)

may be characterized by the fact that a new variable uniformly distributed


on this interval has the same expectation and the same standard deviation
as £. The interval (4) is called the interval of concentration of £, the inverse
of its length is called concentration of ^ and is denoted by k(£).
Sometimes the absolute mean deviation

<«) = £( | {-£«)!)
is also used as a measure of fluctuations. By Theorem 6 of § 11

43 =T I *-£«) | </£(*) .
— CO

For the normal distribution

for the uniform distribution on an interval

40

and for the exponential distribution

40 = —
e

Of course Theorem 4 of § 9 is also valid in the general case.


IV, § 16] VARIANCE IN HIGHER DIMENSIONAL CASE 225

§16. Variance in the higher dimensional case

Let £ = (£l5. . £„) be an ^-dimensional random vector with distribution


function F(xx,. . xn). Clearly the fluctuation of £ cannot be characterized
by a single number. The standard deviations D(£k) furnish certain infor¬
mation. But this is insufficient, since these standard deviations are only
concerned with the fluctuations of the projections on the coordinate axes
and the choice of the axes is arbitrary. More information about the fluctua¬
tions of £ is obtained by considering its projections on all possible straight
lines. Put mk = £(£*) and let P0 be the point (m1?.. ., mn). Let g be a
line passing through P0 with direction cosines al9.. a„ (oq,. . . ., a„ are
n

real numbers for which £ af = 1). Put


k=1

(£* - mk) (1)


k=1

and calculate £>2(Q. (1) implies £(£„) = 0, hence

D\Q = E{Q.
If we put
Dtj = E((£i - m,) (£,- - mj% 2)
then
(3)
i=1 7=1

Let D denote the matrix of coefficients D,v:

£>u ... Din \


D = (4)

y Dni • • -Enn j

Because of (3), the determinant | D \ is always nonnegative. If | D | = 0,


we have a so-called degenerate distribution. In what follows, it will be always
assumed that | D \ > 0. From the coefficients Du the standard deviation
of the projection of £ on an arbitrary line can be calculated. Thus the fluc¬
tuation of £ can be characterized by the matrix (4), called the dispersion
matrix1 of £. The H-dimensional ellipsoid with equation

i i Dijxixj = c2
i=l i=l

1 Since Du is called the covariance of and £ /, the dispersion matrix is also


called the covariance matrix.
226 GENERAL THEORY OF RANDOM VARIABLES [IV, § 16

is called the dispersion ellipsoid of the distribution. It is easy to see that the
dispersion matrix is invariant under a shift of the coordinate system. Under
the rotation of the coordinate system D is transformed as a matrix of a
tensor. Let in fact C = (c,7) be an orthogonal matrix and

Zk = Z ckj (£/ -
7 =1
then E(fk) = 0 and

D’,J = £(«{;) = i
k=1
clk ±CjhDkt.
h=1

If the matrix {D'u) is denoted by D', we have

D' = CDC*,

where C* is the transpose of the orthogonal matrix C. Hence we may speak


about the dispersion tensor, which does not depend on the choice of the
coordinate system. Again, the similarity to the moments of inertia should
be noticed: in case of several dimensions the moment of inertia is also
characterized by a tensor and by ellipsoids of inertia.
We have now to deal with the notion of ellipsoid of concentration. For
the sake of simplicity let us restrict ourselves to the case of two dimensions.
Consider the ellipse

E(x, y) = Ax2 + 2Bxy + Cy2 = 4 (AC - B2 > 0) (5)

and suppose that the random vector ft — (rjl51/2) is uniformly distributed


inside this ellipse. The elements of the dispersion matrix of ft, i.e. the
numbers du = ,2
(i,j = 1 ) are defined by

where E denotes the interior of the ellipse with Equation (5) and F its area.
Calculation of the integrals in (6) gives

B
dn — , du — d21-
AC-B2 Y 5 d22 — (7)
AC-B AC-B2 '
IV, § 16] VARIANCE IN HIGHER DIMENSIONAL CASE 227

Let C = (£1, £2) be any random vector. Choose the numbers A, B, C such
that the dispersion matrix of a random vector uniformly distributed in
the ellipse (5) coincides with that of £. We put, therefore

A>2
A = B= — (8)
~A~’

where A = DnD22 — D\2. Hence the ellipse

Z)22 x2 - 2D12 xy + Du y2 = 4 A (9)

has the property that a random vector uniformly distributed in it possesses


the same dispersion matrix as the given random vector (. The ellipse (9) is
called the ellipse of concentration of the random vector £ = (£1, £2) and the
number

k(o = ~A (io)
471 yj A

i.e. the reciprocal of the area of the ellipse (9), is called the concentration
of £.
lA B
If A, B, C are chosen according to (8), the matrix is the inverse of
[BC

D11 D i2
,D2i D22

The case of higher dimensions turns out to be quite similar. The equation
of the ellipsoid of concentration is here

n n A ■ ■ x- x
(io
1=1 7=1 A

where A is the value of the determinant [ Di} \ and zd,7 the value of the co¬
factor of the element in the i-th row and y'-th column. The concentration,
that is to say the reciprocal of the volume of the ellipsoid (11), is equal to

K0 = (12)
(n + 2)2 n* JA

Of course, this holds only for A > 0. If A = 0, the point (£1?. . ., £„)
lies, with probability 1, on a hyperplane of at most n - 1 dimensions;
228 GENERAL THEORY OF RANDOM VARIABLES [IV, § 16

hence the distribution effectively is not ^-dimensional. Indeed, [ D | = 0


implies the existence of numbers jcl9 . . . ,xn which do not all vanish and
satisfy

YJDijxj = 0 (i= l,...,n).


j=i
But then

mi ^fe-"?/)]2)=o;
7=1

consequently the random vector (fl9..., £„) lies with probability 1 on the
hyperplane

7=1
Z *> «/-»*,) = 0.
Consider now in some detail the two-dimensional normal distribution.
Let

= exp j- ~{Ax2 + 2Bxy+Cy2)^ (13)

be the density function of the two-dimensional random vector £ = (£, t]).


The expectations of £ and rj are equal to zero and the elements of the dis¬
persion matrix are

B
£>n = , £>12 — - 5 £>22
AC — B‘ AC — B"

AC-B2’

It follows that

D,22 D 12 D
A = B= - C = ——
£> i D\’ \D\’

where | Z> | = Z>nr>22 - Z>?2. If we put

£>,
712

@2 — (14)
■J D1XD 22
°1 n/£>11 > \]£>22 5

we find

f(x,y) = X
27T(T1(T2>/1 - £2

1 2pxy j/2
x exp
. 2(1 -e2) * o“ (15)
°la2
XV, § 16] VARIANCE IN HIGHER DIMENSIONAL CASE 229

The number q is the correlation coe fficient R(f, rj) of the random variables
£ ar*d rj- We have already introduced this quantity for discrete distributions.
It is similarly defined in the general case and its properties are the same.
Thus

... mi) rm - mm
- *«)] [i -
^- mm mm
—““ ' (I6)

Theorems 1, 2, 3 and 5 of Chapter III, § 11 are valid and can be proved in


nearly the same way.

Theorem 1. If the random vector £ = (£, rj) is normally distributed in the


plane and if R(f rj) = 0, then £ and rj are independent.

Proof. If q = 0, we see from (15) that the density function is given by

A*, y) <p (17)

hence £ and y are independent. This theorem is easily generalized to any


number of dimensions.

Theorem 2. If the random vector £ — (rj1}. . ., rjn) is normally distributed


in the n-dimensional space and R(rjh ty) = 0 if i ^ j, then the random vari¬
ables rju .. .,ri„ are independent.

Proof. By § 7 we can write


n

hj = Z cjk tk + rnj (J = 1,2,..n). (18)


k=1
Now the random variables £k (k = 1,2,...,/?) are pairwise independent,
each of them is normally distributed, the expectation of is zero and
its standard deviation ck, hence C — (cJk) is an orthogonal matrix. Since
by assumption R{r\h rjf — 0 for i + j, we have

Ewl-0 for j&i. (19)


k=l
The meaning of this is, however, that the vector

(c(1 cin o*) (20)

is orthogonal to every one of the vectors (cjx,. . . , cjn), j # i. This means


that the vector (20) must be parallel to the vector (cn,. .., cin). Hence
230 GENERAL THEORY OF RANDOM VARIABLES [IV, § 16

there exists a constant A, ^ 0 such that

Cik = 2/ Cik 0^ 1 j 2, . . w). (21)


On the other hand, as the inverse of C is its transpose, we can write,
according to (18),
ft

Zk = E cjk (rij - m,)-


7=1

Thus we obtain for the density function of the random vector (r\x,... , rj„)
n 2

la
-
1

-1
exp Z • (22)
tfOh, • • •> y„) =
- 2A I
k=l ak
Cjk (yj - ™j)

(2k)2 n
k=1

But it follows from (21), that

1
V ^ IV r ,)2 v (23)
I -ZT IE cik(yj - = I-
k~1 °k 0“l 2;

and consequently
1 J_ y (>’f ~ Mif
yn) = - exp
2 h 2,.
(2k)2 f[ ak
k=1

which proves the independence of the random variables rj1,. . . ,tjn.


Remark. If, instead of assuming that the random vector (rjx,. . . , rj„)
has an n-dimensional normal distribution, the weaker condition is assumed
that the components rjx,... ,r]n are each normally distributed, then the
assertion of Theorem 2 is false. This can be seen from the following example:
Let the density function of the random vector (£, r[) be

1 ,-x-\
h(x,y) = J2i + x/2
2n

(It is readily verified that h(x, y) is in fact a density function.) The density
functions f(x) and g(x) of £ and rj are

f(x) = g(x) = —L_ e 2 ,


sJ2k

i.e. £ and rj are normally distributed with expectation 0 and standard


deviation 1. Since h(x, y) is an even function of both x and y, it follows
IV, § 16] VARIANCE IN HIGHER DIMENSIONAL CASE 231

that R(£, rj) = 0. The random variables £ and rj, however, are not indepen¬
dent, since evidently h(x, y) ^ f(x) g(y); thus £ and rj are each normally
distributed and are uncorrelated, but they still are dependent.
From Theorem 2 follows

Theorem 3. Let ,. . . , be mutually independent random variables


with the same normal distribution with E(^k) = 0, D(fk) = 1, (k = l,... ,n).
If ax , . . . , an and b1, . . . , bn are real numbers, not all equal to zero, then
the random variables
n n

Vi = Z % tk and rj2 = Z Mfc


k =1 k=1
n

are independent if and only if 'f akbk = 0.


A =1

Proof. Since
n

Z akbk
R(*h, ri2) = k=1
Z *2 Z *2
k=l k=1

the necessary condition that rj1 and i/2 should be uncorrelated is that

Z akbk = 0.
A=1

We shall now show that the random vector (r}x , rj2) is normally distributed.
There can be found an orthogonal matrix (cw) such that

ci k = ^ak, c2k — pbk (k= l,..., n).

The «-dimensional distribution of the random variables

>7y = Z/c = 1
CJk Zk 0' = 1,2,...,n)

is thus a normal distribution and the same holds for the two-dimensional
fj' fj'2
distribution of q1 = —and rj2 — —— ■ Since

ak
and
232 GENERAL THEORY OF RANDOM VARIABLES [IV, § 17

are the direction cosines of two directions,

£ ak bk = 0
fc=i

means that these two directions are orthogonal and r\x, rj2 are (up to a
numerical factor) the projections of the random vector (£x,. . ., £„) on
these directions. Our result may thus be formulated as follows: If £x ,. . . ,
are mutually independent random variables with the same normal distri¬
bution, then the projections of the random vector £ = (£x,. . . , £„) on
two lines dx, d2 are independent iff dx and d2 are orthogonal.

§ 17. Exercises

1. Let the distribution function F(x) of the random variable £ be continuous and
strictly increasing for — co < x < + oo . Determine the distribution function of the
following random variables:

a) Vi = F(0, b) % = In -^y , c) rj3 = ^(FCO)

where x = f'O) is the inverse function of the normal distribution function

t*
y = 0(x) 2 dt.
Jf
V-
2n

2. Draw the curve of the density function

1 (x — m)2
y =/(*) = exp -
2ji a 2a2

and determine its points of inflexion. Let A and B be their abscissas. Calculate the
probability that the value of a random variable with density function f(x) lies between
A and B.

3. Draw the curve of the density function

$$y = f(x) = \begin{cases} \dfrac{1}{\sqrt{2\pi}\,\sigma x}\, \exp\left(-\dfrac{(\ln x - m)^2}{2\sigma^2}\right) & \text{for } x > 0, \\[2mm] 0 & \text{for } x \le 0 \end{cases}$$

of the lognormal distribution and calculate its extrema and points of inflexion. Calcu¬
late the expectation and standard deviation of the lognormal distribution.

4. a) Show that if the random variable ξ has a lognormal distribution, the same
holds for η = cξᵃ (c > 0; a ≠ 0).
b) Suppose that the diameters of the particles of a certain kind of sand possess
a lognormal distribution; let /(x) be the density function of this distribution (cf. Exer¬
cise 3), with m — — 0.5, a = 0.3; x is measured in millimeters. The sand particles
are supposed to have spherical form. Find the total weight of the sand particles which
have diameters less than 0.5 mm, if the total weight of a certain amount of sand is
given.

5. Let the random variable η have a lognormal distribution with density function

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, \exp\left(-\frac{(\ln x - m)^2}{2\sigma^2}\right) \qquad \text{for } x > 0.$$

If the curve of y = f(x) is drawn on a paper where the horizontal axis has a logarithmic
subdivision, then (apart from a numerical factor) one obtains a normal curve. It does
not coincide with the density function of ln η, but is shifted to the left over a distance σ².

6. Let the random point (f, rj) have a normal distribution on the plane, with density
function
$$f(x, y) = \frac{1}{2\pi\sigma^2}\, \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right).$$

Find the density function of ζ = max (|ξ|, |η|).

7. a) Let the random point (ξ, η) have the same distribution as in Exercise 6. Show
that the angle θ between the vector ζ = (ξ, η) and the x-axis is uniformly distributed
on the interval (0, 2π).
b) Determine the density function of θ, if the point (ξ, η) has density

$$\frac{1}{2\pi\sigma_1\sigma_2}\, \exp\left[-\frac{1}{2}\left(\frac{x^2}{\sigma_1^2} + \frac{y^2}{\sigma_2^2}\right)\right].$$

8. Let the density function of the probability distribution of the life-time of the
tubes of a radio receiver with 6 tubes be λ²t e^{−λt} for t > 0, where λ = 0.25 if the
unit of time is a year. Find the probability that during 6 years none of the tubes
has to be replaced. (The life-times of the individual tubes are supposed to be independent
of each other.)

9. A distribution with density function y = f(x) satisfying the differential equation

$$\frac{y'}{y} = \frac{x - \alpha}{\beta + \gamma x + \delta x^2} \qquad (\alpha, \beta, \gamma, \delta \text{ are constants})$$

is called a Pearson distribution. Show that the following are Pearson distributions:

a) the normal distribution


b) the “exponential” distribution
c) the gamma-distribution
d) the beta-distribution
e) Student’s distribution
f) the χ² distribution
g) Cauchy’s distribution
h) the distribution with density function f(x) = for x > 0;


(m —2)!
(c > 0; m = 2, 3,...).

10. a) Let the point (ξ, η) be uniformly distributed in the interior of the unit circle.
We put

$$\varrho = \sqrt{\xi^2 + \eta^2}, \qquad \varphi = \arctan \frac{\eta}{\xi}.$$

Show that ϱ and φ are independent.


b) Let the point (£, rj, C) be uniformly distributed on the surface of the unit sphere.
Introduce spherical coordinates 6 and cp (geographical longitude and latitude) and
show that 6 and cp are independent.
c) Let the point (ξ, η, ζ) be uniformly distributed in the cylinder ξ² + η² < 1,
0 < ζ < 1. Show that φ = arctan (η/ξ), ϱ = ξ² + η², and ζ are independent.

d) Find the general theorem of which a), b) and c) are particular cases.

Hint. The independence of the new coordinates results, in the three cases, from
the fact that the functional determinant of the transformation can be decomposed
into factors each containing only one of the new variables.
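In case a), for instance, the transformation x = ϱ cos φ, y = ϱ sin φ has Jacobian ϱ, so the joint density of (ϱ, φ) is

$$\frac{1}{\pi}\,\varrho = (2\varrho)\cdot\frac{1}{2\pi} \qquad (0 < \varrho < 1,\; 0 \le \varphi < 2\pi),$$

a product of a function of ϱ alone and a function of φ alone, which is the factorization referred to in the hint.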

11. Let ξ and η be independent random variables with the same density function

$$f(x) = \tfrac{1}{2}\, e^{-|x|} \qquad (-\infty < x < +\infty).$$

Find the distribution of ζ = ξ + η.

12. Show that a two-dimensional normal distribution is uniquely determined by


its projection on three non-parallel lines.

13. Let ξ and η be independent random variables with the same density function

$$f(x) = \frac{2}{\pi\,(e^{x} + e^{-x})}.$$

Find the distribution of ζ = ξ + η.

14. Let ξ be a random variable with density function

$$f(x) = \frac{1}{2\sqrt{2\pi}}\left[\exp\left(-\frac{(x-m)^2}{2}\right) + \exp\left(-\frac{(x+m)^2}{2}\right)\right].$$

Find the values of m for which f(x) has two maxima.

15. Let ξ₁, ξ₂, …, ξₙ be independent random variables having a Cauchy distribution with density function

$$f(x) = \frac{1}{\pi(1 + x^2)}.$$

Find the density function of

$$\zeta = \frac{1}{n}\sum_{k=1}^{n} \xi_k.$$
16. Let the random variables ξ₁, ξ₂, …, ξₙ be independent and uniformly distributed on the interval (0, 1). Determine the density function of $\zeta = \sum_{k=1}^{n} \xi_k$.

17. Let the random variables ξ₁, ξ₂, …, ξₙ be independent and uniformly distributed on the interval (0, 1). Let ξₖ* = Rₖ(ξ₁, ξ₂, …, ξₙ), k = 1, 2, …, n, be the k-th
among the values ξ₁, …, ξₙ arranged in increasing order. (ξₖ* is called the k-th order
statistic of the sample (ξ₁, …, ξₙ).)
a) Find the distribution function of the random variable ξ*ₖ₊ₕ − ξ*ₖ (1 ≤ k < k +
+ h ≤ n) and show that it is independent of k.

b) Find the distribution function of the ratio $\dfrac{\xi_k^*}{\xi_{k+h}^*}$ (1 ≤ k < k + h ≤ n).

c) Show that $\dfrac{\xi_1^*}{\xi_2^*}, \dfrac{\xi_2^*}{\xi_3^*}, \ldots, \dfrac{\xi_{n-1}^*}{\xi_n^*}, \xi_n^*$ are independent and that their n-dimensional
density function is

$$f(x_1, x_2, \ldots, x_n) = n!\; x_2 x_3^2 \cdots x_n^{n-1} \qquad (0 < x_k < 1;\; k = 1, 2, \ldots, n).$$

d) Show that the random variables $\left(\dfrac{\xi_k^*}{\xi_{k+1}^*}\right)^k$ (k = 1, 2, …, n; with ξ*ₙ₊₁ = 1) are uniformly distributed in the
interval (0, 1).

18. The random variables ξ₁, …, ξₙ are called exchangeable, if their n-dimensional distribution function F(x₁, x₂, …, xₙ) is a symmetric function of its variables.
(Exchangeable random variables have thus the same distribution and consequently
the same expectation.)
a) Choose at random and independently, with a constant probability density, n
points in the interval (0, 1). Let their abscissas be ξ₁, …, ξₙ. The interval (0, 1)
is subdivided by these points into n + 1 subintervals of the respective lengths
η₁, η₂, …, ηₙ₊₁. Show that

$$E(\eta_k) = \frac{1}{n+1}.$$

Hint. The η₁, η₂, …, ηₙ₊₁ are exchangeable random variables and we have

$$\sum_{k=1}^{n+1} \eta_k = 1.$$

b) Calculate the standard deviation of the random variables ηₖ.

c) Let ξₖ* be the k-th order statistic of the sample (ξ₁, ξ₂, …, ξₙ) (see Exercise 17).
Show that

$$E(\xi_k^*) = \frac{k}{n+1}.$$

Hint. ξₖ* = η₁ + η₂ + ⋯ + ηₖ.


d) Which is larger:

$$D^2(\xi_k^*) = D^2\left(\sum_{i=1}^{k} \eta_i\right) \qquad \text{or} \qquad \sum_{i=1}^{k} D^2(\eta_i)\,?$$

e) Calculate the correlation coefficient

$$R(\xi_i^*, \xi_j^*), \qquad 1 \le i < j \le n.$$
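Part c) is easy to check by simulation; the following sketch (Python with numpy, with arbitrary n and sample size) compares the empirical means of the order statistics with k/(n + 1):

    import numpy as np

    rng = np.random.default_rng(2)
    n, N = 5, 200_000
    samples = np.sort(rng.random((N, n)), axis=1)   # each row: xi*_1 <= ... <= xi*_n
    print(samples.mean(axis=0))                     # empirical E(xi*_k)
    print([k / (n + 1) for k in range(1, n + 1)])   # 1/6, 2/6, ..., 5/6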

19. The mixture with equal weights of the distributions with distribution function
B_{k, n+1−k}(x) (k = 1, 2, …, n) is the uniform distribution in the interval (0, 1). How
could this be shown without calculation?

20. If the probability that a car runs at least x miles without a puncture is e^{−λx}
with λ = 0.0001, is it worth while to carry three spare tires on a trip of 12 000 miles?
21. Let the random variables ξ and η be independent, let ξ have an exponential
distribution with density function λe^{−λx} (x > 0), and let η be uniformly distributed
on (0, 2π). Put ζ₁ = √ξ · cos η, ζ₂ = √ξ · sin η. Show that ζ₁ and ζ₂ are independent and have the same density function $\sqrt{\dfrac{\lambda}{\pi}}\, e^{-\lambda x^2}$.
22. Let ξ₁, ξ₂, …, ξₙ be independent random variables, let the density function
of ξₖ (k = 1, 2, …, n) be

$$\lambda(k + h - 1)\, e^{-\lambda(k+h-1)x} \qquad \text{for } x > 0,$$

where λ > 0 and h is a real number. Find the distribution function of the sum
$\eta = \sum_{k=1}^{n} \xi_k$ and show that ζ = exp(−λη) has a beta distribution.

23. Let hₙ(x) be the density function of Student's distribution with n degrees of
freedom. Show that

$$\lim_{n \to +\infty} h_n(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}.$$

24. The substances A₁, A₂, …, Aₙ₊₁ form a radioactive chain, i.e. if an A₁ atom
disintegrates it is transformed into an A₂ atom, similarly the A₂ atoms into A₃ atoms,
and so on. The Aₙ₊₁ atoms are not radioactive. Suppose that at the instant t = 0
the number of A₁ atoms is N₁, the number of A₂ atoms is N₂, …, while there are Nₙ atoms
of Aₙ. Find the density function of the time interval needed for an atom chosen at
random to change into an Aₙ₊₁ atom.

25. Let A be the disintegration constant of a radioactive atom. Let there be N atoms
present at the time 0.
a) Calculate the standard deviation of the number of atoms disintegrated up to
the time t.
b) Calculate the expectation and the standard deviation of the half-period (i.e. of
N
the random time interval till the — -th disintegration, if N is even).

26. a) Let ηₖ (k = 1, 2, …) be the time required for the transformation of a radioactive atom A₁ into an Aₖ₊₁ atom, through the intermediary states A₂, …, Aₖ, i.e.
the duration of the process

$$A_1 \to A_2 \to \cdots \to A_{k+1}.$$

Let further λₖ be the disintegration constant of the Aₖ atoms, gₖ(t) the density function
of ηₖ and ξₖ(t) the number of Aₖ atoms which are present at the time t. It is assumed
that at the moment 0 there are only A₁ atoms present and their number is equal to N.
Find the distribution function of ηₖ and of ξₖ(t) (k = 1, 2, …).

Hint. Let Pₖ(t) be the probability that at the time t an atom is in the state Aₖ.
These probabilities can be calculated in the following way: The probability that an
atom Aₖ changes into an atom Aₖ₊₁ during a time interval (t, t + Δt) is, by definition
of gₖ(t), equal to gₖ(t)Δt + o(Δt). On the other hand, the probability of this event
is as well expressed by Pₖ(t)λₖΔt + o(Δt); the possibility that during the time interval
(t, t + Δt) an atom passes through several successive disintegrations can be neglected.
Hence we have

$$P_k(t) = \frac{g_k(t)}{\lambda_k}. \tag{1}$$

Since the disintegrations of the individual atoms are independent, we obtain

$$P(\xi_k(t) = r) = \binom{N}{r} P_k(t)^r \left(1 - P_k(t)\right)^{N-r} \qquad (r = 0, 1, \ldots, N). \tag{2}$$

The expectation and the standard deviation of the number of Aₖ atoms at the moment
t can now be calculated, since we know that

$$g_k(t) = (-1)^{k-1}\, \lambda_1 \lambda_2 \cdots \lambda_k \sum_{j=1}^{k} \frac{e^{-\lambda_j t}}{\prod_{i \ne j} (\lambda_j - \lambda_i)} \qquad (t > 0) \tag{3}$$

(cf. Ch. IV, § 9, (29)).


b) Put Mₖ(t) = E(ξₖ(t)). Show that the functions Mₖ(t) satisfy Bateman's system
of differential equations

$$M_k'(t) = \lambda_{k-1} M_{k-1}(t) - \lambda_k M_k(t) \qquad (M_0(t) \equiv 0;\; k = 1, 2, \ldots). \tag{4}$$

Hint. By (2) we have

$$M_k(t) = \frac{N g_k(t)}{\lambda_k}. \tag{5}$$

If we differentiate the identity

$$g_k(t) = \int_0^t \lambda_k\, e^{-\lambda_k (t-u)}\, g_{k-1}(u)\, du, \tag{6}$$

we obtain

$$g_k'(t) = \lambda_k \left(g_{k-1}(t) - g_k(t)\right). \tag{7}$$

Because of (5), (4) follows from (7).
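System (4) is easy to integrate numerically; the following sketch (Python, with arbitrarily chosen illustrative constants λ₁, λ₂, λ₃ and a simple Euler scheme) reproduces, for example, M₁(t) = N e^{−λ₁t}:

    import math

    lam = [0.0, 1.0, 0.5, 0.25]        # lam[k] for k = 1, 2, 3 (lam[0] is unused)
    N, T, dt = 1000.0, 10.0, 1.0e-4
    M = [0.0, N, 0.0, 0.0]             # at t = 0 only A_1 atoms are present
    for _ in range(int(T / dt)):
        dM = [0.0] * 4
        for k in range(1, 4):          # Bateman equations M_k' = lam_{k-1} M_{k-1} - lam_k M_k
            dM[k] = lam[k - 1] * M[k - 1] - lam[k] * M[k]
        for k in range(1, 4):
            M[k] += dt * dM[k]
    print(M[1], N * math.exp(-lam[1] * T))   # both approximately N e^{-lambda_1 T}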


Remark. If the number Mₖ(t) is very large, the fluctuation of the number ξₖ(t)
of the atoms Aₖ about Mₖ(t) is small with respect to Mₖ(t), since by (2)

$$D(\xi_k(t)) = \sqrt{M_k(t)\left(1 - \frac{M_k(t)}{N}\right)}.$$

Hence as a first approach Mₖ(t) may be considered as the number of Aₖ atoms
existing at the time t. However, one should not forget that this number is in reality
a random variable with expectation Mₖ(t).
c) Show that the graph of the function y = Mₖ(t) has for any λ₁, λ₂, …, λₖ only
one maximum. Show further that 0 = m₁ < m₂ < … < mₙ, where mₖ denotes the
abscissa of the maximum of the function Mₖ(t).
Remark. The atoms An+l not being radioactive, Mn+l(t) is evidently an increasing
function of time, hence mn+l = +oo ,
d) Show that t = 0 is a zero of order k — 1 of the function Mk(t).

27. Let ξ, η, ζ be the components of the velocity of a molecule of a gas in a
container. Let the random variables ξ, η, ζ be independent and uniformly distributed
on the interval (−A, +A). Calculate the density function f_A(x) of the energy of this
molecule. Determine further the limit

$$\lim_{A \to +\infty} A^3 f_A(x) = w(x).$$

Hint. Let the mass of the molecule be denoted by m and its energy by E, then

$$E = \frac{m}{2}\left(\xi^2 + \eta^2 + \zeta^2\right),$$

hence

$$P(E < t) = \frac{1}{8A^3} \iiint\limits_{\frac{m}{2}(x^2+y^2+z^2) < t} dx\, dy\, dz = \frac{\pi}{6A^3}\left(\frac{2t}{m}\right)^{\frac{3}{2}} \qquad \text{for } \sqrt{\frac{2t}{m}} \le A,$$

since the integral is equal to the volume of a sphere with radius $\sqrt{\dfrac{2t}{m}}$. Thus

$$f_A(t) = \frac{\pi}{2mA^3}\sqrt{\frac{2t}{m}} \qquad \text{for } \sqrt{\frac{2t}{m}} \le A,$$

hence

$$w(t) = c\sqrt{t} \qquad (c = \text{constant}).$$

28. In Exercise 34 of Chapter III, § 18 we studied the most probable energy
distribution of a gas consisting of N particles, when the total energy E of the gas was
given. The probability pₖ of the energy Eₖ was found to be given by

$$p_k = \frac{W_k\, e^{-\beta E_k}}{\sum_{j} W_j\, e^{-\beta E_j}}.$$

This result was obtained under the assumption that E can only take on the discrete
values Eₖ. Let now the energy be considered as a continuous random variable. For
the density function of the energy we obtain in a similar way the expression

$$p(t) = \frac{w(t)\, e^{-\beta t}}{\int_0^{+\infty} w(u)\, e^{-\beta u}\, du},$$

where β can be determined from

$$N \int_0^{+\infty} t\, w(t)\, e^{-\beta t}\, dt = E \int_0^{+\infty} w(t)\, e^{-\beta t}\, dt.$$
Let w{t) be chosen such that
where c is a positive constant. Calculate under these conditions, for
the limiting case c′ → +∞, the value of β, the function p(t), and the distribution
of the velocity of the molecule.

Hint. With the above notations we have for c′ → +∞

$$\frac{E}{N} = \frac{3}{2\beta}.$$

It is known from statistical mechanics that $\dfrac{E}{N} = \dfrac{3kT}{2}$, where k is Boltzmann's
constant and T the absolute temperature. So $\beta = \dfrac{1}{kT}$ and

$$p(t) = \frac{2\sqrt{t}\, \exp\left(-\dfrac{t}{kT}\right)}{\sqrt{\pi}\,(kT)^{\frac{3}{2}}}.$$

Let the velocity of a molecule be denoted by v and its kinetic energy by E_kin, then the
density function of v will be given by

$$f(v) = p(E_{kin})\, \frac{dE_{kin}}{dv} = \sqrt{\frac{2}{\pi}}\left(\frac{m}{kT}\right)^{\frac{3}{2}} v^2\, e^{-\frac{mv^2}{2kT}}.$$

This derivation of the Maxwell distribution coincides essentially with one usually
given in textbooks of statistical mechanics. (We return to this question in Chapter V,
§ 3.)
29. a) Calculate from the Maxwell distribution the mean velocity of the molecules
of a gas having the absolute temperature T and consisting of molecules of mass m.
b) Show that the average kinetic energy at the absolute temperature T of the
molecules of a gas is equal to (3/2)kT (k is Boltzmann's constant).

c) Compare the mean kinetic energy of a molecule with the kinetic energy of a
molecule moving with mean velocity. Which of the two is larger?

30. a) Consider a gas containing in 1 cm3 N molecules and calculate the mean
free path of a molecule.
Hint. The molecules are considered as spheres of radius r and are supposed to be
distributed in the space according to a Poisson distribution, i.e. the probability that
a volume V contains no molecules is expressed by e^{−NV}. The probability that the
volume ΔV contains just one molecule is given by NΔV + o(ΔV). The meaning
of the statement that a molecule covers a distance s without collision and then collides
on a segment of length Δs with another molecule is just the following: a cylinder
of radius 2r and height s does not contain the center of any of the molecules and
another cylinder of radius 2r and height Δs contains the center of at least one of the
molecules. Thus the probability in question is

$$4r^2\pi N\, e^{-4r^2\pi N s}\, \Delta s + o(\Delta s),$$

i.e. the distribution of the free path is an exponential distribution with density function

$$4\pi N r^2\, e^{-4\pi N r^2 s}.$$

Hence the length of the mean free path is $\dfrac{1}{4\pi N r^2}$.
b) Calculate the mean time interval between two consecutive collisions of a molecule.

Hint. Let the length of the free path be denoted by s and the velocity of the molecule
by v; then τ = s/v, where τ denotes the time interval studied. s and v can be assumed
to be independent, thus E(τ) = E(s)·E(1/v); the first of these two factors is known
from Exercise 30 a), the second can be computed from the Maxwell distribution.

31. Calculate the standard deviation of the velocity and kinetic energy of a gas
molecule, if the absolute temperature of the gas is T and the mass of its molecules m.

32. Let the endpoint of a three-dimensional random vector ζ possess a uniform
distribution on the surface of the unit sphere. Let θ be the angle between the vector
ζ and the positive x-axis. Show that the density function of θ is given by $\dfrac{\sin t}{2}$
(0 ≤ t ≤ π).

33. Choose at random a chord in the unit circle and determine the expectation
of its length under the three suppositions considered in the discussion of Bertrand’s
paradox (Ch. II, § 10).

34. Let ξ₁, …, ξₙ₊ₘ be independent normally distributed random variables with
density function $\dfrac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}$. Calculate the expectation and the standard deviation of

$$\zeta = \frac{\xi_1^2 + \cdots + \xi_n^2}{\xi_{n+1}^2 + \cdots + \xi_{n+m}^2}.$$

35. Let mₖ be the median of the gamma distribution of parameter λ and order k.
Show that

$$\lim_{k \to +\infty} \frac{m_k}{k} = \frac{1}{\lambda}.$$

36. Let the distribution of the random variable ξ be the mixture of the distributions
of the random variables ξ₁, …, ξₙ with weights pₖ (k = 1, 2, …, n). Show that

$$E(\xi) = \sum_{k=1}^{n} p_k E(\xi_k).$$

37. Under the same assumptions as in Exercise 36 put

$$M_k = E(\xi_k) \qquad (k = 1, 2, \ldots, n).$$

Show that

$$D^2(\xi) = \sum_{k=1}^{n} p_k D^2(\xi_k) + D^2(\mu),$$

where μ is a random variable assuming the values M₁, M₂, …, Mₙ with probabilities
p₁, p₂, …, pₙ. From this follows

$$D^2(\xi) \ge \sum_{k=1}^{n} p_k D^2(\xi_k);$$

equality holds iff M₁ = M₂ = … = Mₙ.

38. a) Let ξ be a normally distributed random variable with density function

$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\left(-\frac{(x-m)^2}{2\sigma^2}\right).$$

Deduce E(ξ) = m from the fact that the function y = f(x) satisfies the differential
equation σ²y′ = −(x − m)y.
b) Let the density function of the random variable ξ be given by

$$f(x) = \frac{\lambda^{m-1} x^{m-2}}{(m-2)!}\, e^{-\lambda x} \qquad (x > 0),$$

where m ≥ 3 is a positive integer and λ > 0. Calculate E(ξ) from the fact that the
function y = f(x) satisfies the differential equation

$$y' = \left(\frac{m-2}{x} - \lambda\right) y.$$

c) Apply the same method in general to Pearson's distributions (cf. Exercise 9).

39. Suppose that there are 9 barbers working in a hairdressing-saloon. One shaving
takes 10 minutes. Someone coming in sees that all barbers are working and 3 customers
are waiting for service. What waiting time can he expect till he is served?

Hint. Assume that the moments of the finishing of the individual shavings are
independent and are uniformly distributed on the time interval (0, 10')-

40. Let ξ₁, …, ξₙ be independent random variables having the same distribution.
Prove that

$$E\left(\frac{\xi_1 + \cdots + \xi_k}{\xi_1 + \cdots + \xi_n}\right) = \frac{k}{n} \qquad (1 \le k \le n).$$

41. Prove that if the standard deviation of the random variable ξ with the distribution function F(x) exists, then

$$\lim_{x \to +\infty} x^2\left(1 - F(x) + F(-x)\right) = 0$$

and

$$E(\xi^2) = 2\int_0^{+\infty} x\left(1 - F(x) + F(-x)\right) dx.$$
42. Calculate the dispersion matrix of a nondegenerate n-dimensional normal
distribution.

Hint. Let the n-dimensional density function of the random variables η₁, …, ηₙ be

$$f(y_1, \ldots, y_n) = \frac{\sqrt{|B|}}{(2\pi)^{\frac{n}{2}}}\, \exp\left(-\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} b_{ij}\, y_i y_j\right), \tag{1}$$

where |B| is the determinant of the matrix B = (b_{ij}). There can be given independent
normally distributed random variables ξₖ such that E(ξₖ) = 0 and

$$\eta_j = \sum_{k=1}^{n} c_{jk}\, \xi_k \qquad (j = 1, 2, \ldots, n), \tag{2}$$

where C = (c_{jk}) is an orthogonal matrix. Let σₖ = D(ξₖ) and let S be the diagonal
matrix having for its elements the numbers 1/σₖ². Then B, S, and C are connected by
the relation B = CSC*, where C* is the transpose of the matrix C. If we put D_{ij} =
E(η_i η_j), then we have by (2)

$$D_{ij} = \sum_{k=1}^{n} c_{ik} c_{jk}\, \sigma_k^2. \tag{3}$$

Hence the matrix D = (D_{ij}) can be written in the form D = CS⁻¹C*, where S⁻¹
denotes the inverse of the matrix S, and thus BD = CSC*CS⁻¹C* = E, where E is
the unit matrix of order n. Thus the dispersion matrix D of the normal distribution
is the inverse of the matrix B of the quadratic form figuring in the exponent of (1).
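The conclusion of the exercise can be checked numerically; the sketch below (Python with numpy; the orthogonal matrix and the standard deviations are arbitrary) generates the vector (η₁, …, ηₙ) as in the hint and compares its empirical dispersion matrix with B⁻¹:

    import numpy as np

    rng = np.random.default_rng(4)
    C, _ = np.linalg.qr(rng.standard_normal((3, 3)))   # an orthogonal matrix C
    sigma = np.array([0.5, 1.0, 2.0])                  # D(xi_k) = sigma_k
    B = C @ np.diag(1.0 / sigma**2) @ C.T              # matrix of the quadratic form in (1)
    xi = rng.standard_normal((200_000, 3)) * sigma     # independent normal components
    eta = xi @ C.T                                     # eta_j = sum_k c_{jk} xi_k
    print(np.cov(eta, rowvar=False))                   # empirical dispersion matrix
    print(np.linalg.inv(B))                            # approximately equal to the above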
43. a) Using the result of the preceding exercise, find a new proof for Theorem 2
of § 16.
b) Let ξ₁, …, ξₙ be independent normally distributed random variables with
E(ξₖ) = 0, D(ξₖ) = σ; show that if the matrix C = (c_{ik}) is orthogonal, then the random
variables

$$\eta_i = \sum_{k=1}^{n} c_{ik}\, \xi_k$$

are independent.
c) Determine the ellipsoid of concentration of the n-dimensional normal distribution and prove Formula (12) of § 16.
d) What is the geometric meaning of Exercise b)?

Hint. The components ξ₁, ξ₂, …, ξₙ of an n-dimensional normally distributed
random vector are independent iff the axes of the ellipsoid of concentration are
parallel to the coordinate axes. If the random variables ξ₁, ξ₂, …, ξₙ have the same
normal distribution, then the ellipsoid of concentration is an n-dimensional sphere;
thus the condition required is fulfilled for every choice of the coordinate system.
44. a) When considering errors of measurements the following rule is often used:
If the random variables ξ₁, …, ξₙ are independent, further if the first partial derivatives of the function g(x₁, …, xₙ) are continuous and if η = g(ξ₁, …, ξₙ), then

$$D^2(\eta) \approx \sum_{k=1}^{n} \left(\frac{\partial g}{\partial x_k}\right)^2 D^2(\xi_k),$$

where the partial derivatives are to be taken at the points xₖ = E(ξₖ) (k = 1, …, n).
Discuss the validity of this rule.
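As an illustration of the rule (and of its limits of validity), the following sketch (Python with numpy; the means and standard deviations are arbitrary) takes g(x, y) = xy with independent normal ξ and η; the rule gives E²(η)D²(ξ) + E²(ξ)D²(η), while the exact variance, by part b) below, contains the additional term D²(ξ)D²(η):

    import numpy as np

    rng = np.random.default_rng(5)
    N = 1_000_000
    xi  = rng.normal(10.0, 0.5, N)                     # E(xi) = 10,  D(xi) = 0.5
    eta = rng.normal(4.0, 0.2, N)                      # E(eta) = 4,  D(eta) = 0.2
    print(np.var(xi * eta))                            # empirical D^2(xi * eta)
    print((4.0 * 0.5)**2 + (10.0 * 0.2)**2)            # the approximate rule of a)
    print((4.0 * 0.5)**2 + (10.0 * 0.2)**2 + (0.5 * 0.2)**2)   # exact value, cf. b)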
b) Let ξ and η be independent random variables. Prove that

$$D^2(\xi\eta) = D^2(\xi)\, D^2(\eta) + E^2(\xi)\, D^2(\eta) + E^2(\eta)\, D^2(\xi).$$

45. Counters used in the study of cosmic rays, radioactivity and other physical
phenomena do not register all particles hitting the apparatus; in fact, the latter remains
in a passive state for some time interval h > 0 after a hit by a particle, and does not
register any particle arriving before the end of this time interval. The number of
particles counted is thus smaller than the number of the particles actually coming in.
The average number of particles registered during unit time is said to be the "virtual
density of events" and is denoted by P; the average number of the particles actually
arriving during the unit time is said to be the "actual density of events" and is denoted
by p. (Every arriving particle renders the apparatus insensitive for a time interval h,
regardless whether the particle was registered by the apparatus.) As to the arrival
of the particles, the usual assumption made in the study of radioactive radiations
is introduced, namely that the probability of the arrival of n particles during a time
interval t is given by $\dfrac{(pt)^n e^{-pt}}{n!}$ (n = 0, 1, …).
a) Determine the virtual density of events P.
b) Determine that value of the actual density of events which makes the virtual
density maximal.

Hint. The probability that a particle arrives during a time interval Δt and is registered
is equal to the probability that the particle arrived during the time interval considered
and no other particle did arrive during the preceding time interval of length h. This
probability is approximately p e^{−ph}Δt, hence P = p e^{−ph}. If the passive period h is
known and P was experimentally determined, the above transcendental equation is
obtained for p. By differentiating we find that P has its maximal value if p = 1/h;
then P = 1/(eh).
c) Calculate the distribution, expectation and standard deviation of two consecutive
registered particle-arrivals.

Hint. Suppose that an arrival was registered at the time t = 0 and let W(t) be the
probability that the following registered arrival takes place before time t. It is easy
to see that W(t) satisfies the following (retarded) difference-differential
equation

$$W'(t) = P\left(1 - W(t - h)\right) \qquad (t > h) \tag{1}$$

and fulfils the initial condition W(t) = 0 for 0 < t ≤ h. The solution of (1) is given by

$$W(t) = \sum_{k=1}^{n} \frac{(-1)^{k-1} P^k (t - kh)^k}{k!} \qquad \text{for } nh < t \le (n+1)h \quad (n = 1, 2, \ldots). \tag{2}$$

Integrate (1) from h to +∞; then

$$1 = P \int_0^{+\infty} \left(1 - W(t)\right) dt.$$

Hence the expectation M of the time spent between two consecutive registered arrivals
is given by 1/P. The standard deviation D of this time interval can be determined in

a similar manner and one obtains


$$D = \frac{\sqrt{1 - 2hP}}{P}. \tag{3}$$

Observe that for h = 0 we have D = 1/P = M. For h > 0 we have D/M < 1. Hence
the fact that the apparatus has a passive period diminishes the relative standard
deviation of the distribution.
d) If the radiation has a too high intensity, a “scaler” is commonly used in order
to make the observations more easy. This apparatus registers only every &-th particle.
(In practice A: is a power of 2.) Calculate the virtual density of events for this case too.

Hint. First calculate the probability that during the interval (t, t + Δt) there arrives
a "k-th particle", i.e. a particle having in the list of arriving particles a serial number
which is divisible by k. Clearly, the probability of this event is

$$\left(\sum_{n=1}^{\infty} \frac{p^{nk}\, t^{nk-1}\, e^{-pt}}{(nk-1)!}\right) \Delta t + o(\Delta t).$$

As the factor of Δt depends also on t, the process is not stationary. But this dependence
on t is very weak when t is large; in fact, it can be shown that

$$\lim_{t \to +\infty} \sum_{n=1}^{\infty} \frac{p^{nk}\, t^{nk-1}\, e^{-pt}}{(nk-1)!} = \frac{p}{k}. \tag{1}$$

Relation (1) follows from

$$\sum_{n=1}^{\infty} \frac{p^{nk}\, t^{nk-1}\, e^{-pt}}{(nk-1)!} = \frac{p}{k} \sum_{r=0}^{k-1} \omega_r\, e^{pt(\omega_r - 1)}, \tag{2}$$

where $\omega_r = \exp\left(\dfrac{2\pi i r}{k}\right)$ (r = 0, 1, …, k − 1), since the real parts of ω_r − 1 (r =
1, 2, …, k − 1) are negative and ω₀ − 1 = 0. Hence the probability that a particle
arriving between t and t + Δt is a "k-th particle" is given, for a sufficiently large t,
approximately by (p/k)Δt. Thus the registered density of events is p/k. Here we
neglected the passivity of the apparatus following the arriving of a particle.

46. Suppose that the expectation of the random variable £ exists and let a be a
real number. Prove that E(|£ — a\) takes on its minimum if a is the median of £.

47. Let ξ be a random variable with distribution function F(x). Show that

$$E(H(\xi)) = \int_{-\infty}^{+\infty} H(x)\, dF(x)$$

holds without restriction for every Borel-measurable function H(x), such that the
expectation E(H(ξ)) exists.
Hint. The value of E(H(ξ)) only depends on the distribution of H(ξ), hence
on the distribution of ξ, since for every Borel set B, P(H(ξ) ∈ B) = P(ξ ∈ H⁻¹(B)),
where H⁻¹(B) denotes the set of the real numbers x for which H(x) ∈ B. Hence
E(H(ξ)) does not depend on the fact on what probability space [Ω, 𝒜, P] the random
variable ξ is defined; thus let Ω be the real axis, 𝒜 the set of all Borel subsets of Ω
and P the Lebesgue–Stieltjes measure defined on 𝒜 by P(I_{ab}) = F(b) − F(a), where
I_{ab} is an arbitrary interval a ≤ x < b. Under these conditions ξ(x) = x (−∞ < x <
< +∞) has distribution function F(x), hence

$$E(H(\xi)) = \int_{\Omega} H(\xi)\, dP = \int_{-\infty}^{+\infty} H(x)\, dF(x).$$
CHAPTER V

MORE ABOUT RANDOM VARIABLES

§ 1. Random variables on conditional probability spaces

Let ℱ = [Ω, 𝒜, ℬ, P] be a conditional probability space (cf. Ch. II, § 11).
A real valued function ξ = ξ(ω) defined for ω ∈ Ω is said to be a random
variable on ℱ if the level sets A_x of ξ (A_x is the set of all ω ∈ Ω such that
ξ(ω) < x) belong to 𝒜 for every real x. A vector valued function ζ =
(ξ₁, …, ξ_r) on Ω is said to be a random vector on ℱ if all of its components ξ₁, …, ξ_r are random variables on ℱ. Since, by assumption, 𝒜
is a σ-algebra of subsets of Ω, it follows that for every Borel set B of
the r-dimensional Euclidean space the set ζ⁻¹(B) of all ω ∈ Ω for which
ζ(ω) ∈ B belongs to the σ-algebra 𝒜.
If C is any fixed element of ℬ, ℱ_C = [Ω, 𝒜, P(A | C)] is a Kolmogorov
probability space (cf. Theorem 6, § 11, Ch. II). Since every random
variable ξ on ℱ is an ordinary random variable on ℱ_C, the usual notions
can be applied to the random variables on ℱ_C. Thus there can be defined
for every random variable ξ on ℱ its conditional distribution function,
its conditional expectation, etc. with respect to the condition C ∈ ℬ. All
theorems proved for ordinary random variables are valid for the random
quantities defined on a conditional probability space ℱ with respect to any
given condition C. New problems, however, arise if we let C vary.
Let ξ be a (real) random variable on ℱ, I_{ab} the interval a ≤ x < b and
ξ⁻¹(I_{ab}) the set of all ω ∈ Ω with ξ(ω) ∈ I_{ab}. Clearly, for every I_{ab} the set
ξ⁻¹(I_{ab}) belongs to 𝒜, but it does not necessarily belong to ℬ. Let 𝒩
be the set of all intervals I_{ab} ⊂ I₀ with ξ⁻¹(I_{ab}) ∈ ℬ, where I₀ is a given
(possibly infinite) interval. The following two conditions are assumed to
hold for 𝒩:
Condition N₁. The set 𝒩 is not empty; for I₁ ∈ 𝒩 and I₂ ∈ 𝒩 there
exists an I₃ ∈ 𝒩 such that I₁ + I₂ ⊂ I₃.
Condition N₂. For I₁ ∈ 𝒩, I₂ ∈ 𝒩, and I₁ ⊂ I₂ we have

$$P(\xi^{-1}(I_1) \mid \xi^{-1}(I_2)) > 0.$$

Conditions N₁ and N₂ are evidently fulfilled if 𝒩 consists of a single
element only. Let J be the union of all intervals I ∈ 𝒩. The set J ⊂ I₀ is
an open or half open interval with endpoints α and β (α may be equal
to −∞ and β to +∞). Let c₀ be any point in the interior of J, i.e. α < c₀ < β.
Take a sequence of intervals I_{a_n b_n} ∈ 𝒩 (n = 1, 2, …) with

$$a_{n+1} \le a_n < c_0 < b_n \le b_{n+1}, \qquad \lim_{n \to +\infty} a_n = \alpha, \qquad \lim_{n \to +\infty} b_n = \beta.$$

There can always be found such a sequence when the condition Nx is


fulfilled. Put for x dla„b„

F.W = for c0 < x <&„

and
P(Z-XlXc)\Z-\lgnbn))
W= an<x< c0.

From Axioms# and C (Ch. II, § 11) of conditional probability spaces


follows then that for a„ < c < a < b < d <bn and Icd £ o#

{c < d and F„(d) - F„(c) > 0 follow from our assumptions). Furthermore,
for a_n < x < b_n and for N > n

$$F_N(x) = F_n(x).$$

Therefore the value of F_n(x) does not depend on n and we can omit the
index n by writing simply

$$F(x) = F_n(x) \qquad \text{for } a_n < x < b_n \quad (n = 1, 2, \ldots). \tag{1}$$

The function F(x) is defined everywhere on the interval (α, β), it is nondecreasing and left-continuous; for I_{cd} ∈ 𝒩 we have F(d) − F(c) > 0 and
for c ≤ a < b ≤ d the relation

$$P(\xi^{-1}(I_{ab}) \mid \xi^{-1}(I_{cd})) = \frac{F(b) - F(a)}{F(d) - F(c)} \tag{2}$$

is valid. Thus the following theorem can be stated:

Theorem 1. Let ξ be a random variable on a conditional probability space
ℱ = [Ω, 𝒜, ℬ, P] and let 𝒩 be the set of the half open intervals I_{ab}
contained in an interval I₀ such that ξ⁻¹(I_{ab}) ∈ ℬ. Let 𝒩 fulfil the conditions
N₁ and N₂. Let further J be the union of all I ∈ 𝒩; J is an interval contained
in I₀; let α and β denote the endpoints of J. Then there exists a nondecreasing,
left-continuous function F(x) defined on (α, β), such that for I_{cd} ∈ 𝒩 we have
F(d) − F(c) > 0 and for I_{ab} ⊂ I_{cd} the relation

$$P(\xi^{-1}(I_{ab}) \mid \xi^{-1}(I_{cd})) = \frac{F(b) - F(a)}{F(d) - F(c)} \tag{3}$$

holds.

A function F(x) having the above properties will be called a distribution
function of ξ on (α, β). Under the assumptions of Theorem 1, the random
variable ξ thus possesses a distribution function on (α, β). The distribution
function F(x) of ξ is evidently not uniquely determined, since for λ > 0
and for arbitrary μ the function G(x) = λF(x) + μ is also a distribution
function of ξ on (α, β). Conversely, if F(x) and G(x) are distribution functions
of ξ on (α, β) and if the conditions of Theorem 1 are fulfilled, then for any
two subintervals I_{cd} and I_{γδ} of (α, β) with I_{cd} ∈ 𝒩 and I_{γδ} ∈ 𝒩 there
can be found an interval I_{ef} ∈ 𝒩 such that I_{cd} ⊂ I_{ef} and I_{γδ} ⊂ I_{ef}. Thus
we have

$$\frac{F(d) - F(c)}{G(d) - G(c)} = \frac{F(f) - F(e)}{G(f) - G(e)} = \frac{F(\delta) - F(\gamma)}{G(\delta) - G(\gamma)}.$$

Hence

$$\lambda = \frac{F(d) - F(c)}{G(d) - G(c)} \qquad \text{for } I_{cd} \in \mathcal{N} \tag{4}$$

is a constant. And since for every I_{ab} ⊂ I_{ef}

$$\frac{F(b) - F(a)}{F(f) - F(e)} = \frac{G(b) - G(a)}{G(f) - G(e)}, \tag{5}$$

it follows that

$$F(b) - \lambda G(b) = F(a) - \lambda G(a) = \mu \tag{6}$$

is also a constant; thus

$$F(x) = \lambda G(x) + \mu, \tag{7}$$

where λ and μ are the constants defined by (4) and (6).


Thus the distribution function of £ on (a, /?) is uniquely defined up to
a linear transformation.
When the distribution function F(x) of ξ on (α, β) is absolutely continuous
on every closed subinterval of (α, β), then

$$f(x) = F'(x) \tag{8}$$
is called a density function of £. According to what was said above, the


density function of £ is uniquely determined up to a positive constant;
f{x) is nonnegative, measurable and integrable on every closed subinterval
[a, b] of the interval (a, /?).
Example 1. Let Ω be the set of all real numbers and 𝒜 the set of all
Borel subsets of Ω; let further g(x) be a function which is nonnegative,
measurable and integrable on every finite interval of the real number axis.
Let ℬ be the set of all intervals I_{ab} such that

$$0 < \int_a^b g(x)\, dx < +\infty;$$

assume that ℬ is not empty. Define conditional probabilities by

$$P(A \mid B) = \frac{\int_{AB} g(x)\, dx}{\int_B g(x)\, dx} \qquad \text{for } A \in \mathcal{A} \text{ and } B \in \mathcal{B}. \tag{9}$$

Put ξ(ω) = ω (−∞ < ω < +∞). Then ξ is a random variable on the
conditional probability space ℱ = [Ω, 𝒜, ℬ, P]. If I₀ = (−∞, +∞), 𝒩
is identical to ℬ and all conditions of Theorem 1 are fulfilled; hence ξ
has a distribution function F(x) and indeed

$$F(x) = \begin{cases} \lambda \displaystyle\int_0^x g(t)\, dt + \mu & \text{for } x \ge 0, \\[3mm] -\lambda \displaystyle\int_x^0 g(t)\, dt + \mu & \text{for } x < 0 \end{cases} \tag{10}$$

is a distribution function of ξ for any choice of the constants λ > 0 and
μ. Furthermore λg(x) is a density function of ξ for any λ > 0. In particular,
for g(x) ≡ 1

$$F(x) = \lambda x + \mu \qquad (-\infty < x < +\infty)$$

is a distribution function of £ and the density function of £ is an (arbitrary)


positive constant A; in this case we say that £ is uniformly distributed on
the whole real axis. It should be remarked that the distribution function
of a random variable on a conditional probability space may assume
negative values and is not necessarily bounded. It is easy to see that the
following theorem holds:

Theorem 2. Let F(x) be a distribution function and f(x) a density function
of a random variable ξ defined on a conditional probability space ℱ. Let
y = h(x) be a monotone function and x = h⁻¹(y) its inverse. Then

$$F(h^{-1}(y)) = G(y)$$

is a distribution function and, if y = h(x) is absolutely continuous,

$$g(y) = \frac{f(h^{-1}(y))}{|h'(h^{-1}(y))|}$$

is a density function of η = h(ξ).

Example 2. Suppose that ξ is uniformly distributed on the whole real
axis; then the same holds for η = aξ + b (for any constants a ≠ 0 and b).

Example 3. Let ξ be uniformly distributed on the whole real axis. Then
η = e^ξ has on (0, +∞) the density function

$$f(x) = \frac{1}{x} \qquad (x > 0).$$

The distribution of η is said to be logarithmically uniform on the half line
(0, +∞).

Example 4. Let ξ possess a logarithmically uniform distribution on the
half line x > 0. Then η = aξ^b has the same distribution as ξ, for any a > 0
and b ≠ 0.

Example 5. If £ has on the interval (0, + oo) the density function

fix) =

then rj — c£ (c > 0) has the same density function.

Theorem 3. Let ξ be a random variable defined on a conditional probability
space ℱ and let F(x) be a distribution function and f(x) a density function
of ξ on the interval (α, β). If I_{ab} ⊂ I_{αβ} and ξ⁻¹(I_{ab}) ∈ ℬ, then the conditional
expectation of ξ with respect to the condition a ≤ ξ < b is given by

$$E(\xi \mid a \le \xi < b) = \frac{\int_a^b x\, dF(x)}{F(b) - F(a)} = \frac{\int_a^b x f(x)\, dx}{\int_a^b f(x)\, dx}. \tag{11}$$

(Clearly, the value of E(ξ | a ≤ ξ < b) does not depend on the choice of
F(x) or f(x).)
Example 6. If ξ is uniformly distributed on the whole real axis, then

$$E(\xi \mid a \le \xi < b) = \frac{a + b}{2}$$

for all a < b.

Example 7. If ξ is logarithmically uniformly distributed on the positive
semi-axis, then

$$E(\xi \mid a \le \xi < b) = \frac{b - a}{\ln \frac{b}{a}} \qquad \text{for } 0 < a < b.$$

Example 8. If ξ is uniformly distributed on the whole real axis, |ξ| is
uniformly distributed on the positive semi-axis. The distribution function of
ξ² is thus √x and its density function is $\dfrac{1}{2\sqrt{x}}$ for x > 0. Hence for 0 < a < b

$$E(\xi^2 \mid a \le \xi < b) = E(\xi^2 \mid a^2 \le \xi^2 < b^2) = \frac{\int_{a^2}^{b^2} \sqrt{x}\, dx}{\int_{a^2}^{b^2} \frac{dx}{\sqrt{x}}} = \frac{a^2 + ab + b^2}{3},$$

and consequently

$$D^2(\xi \mid a \le \xi < b) = \frac{a^2 + ab + b^2}{3} - \left(\frac{a + b}{2}\right)^2 = \frac{(b - a)^2}{12},$$

in accordance with the fact that under the condition a ≤ ξ < b, ξ is
uniformly distributed on the interval (a, b) and the standard deviation of
such a distribution is $\dfrac{b - a}{2\sqrt{3}}$. (Cf. Ch. IV, § 14.)
Distribution functions and density functions of an r-dimensional random
vector on a conditional probability space can be defined in a similar way.
Let I be an "interval" of the r-dimensional space, i.e. the set of the points
x = (x₁, …, x_r) whose coordinates satisfy a_k ≤ x_k < b_k (k = 1, 2, …, r),
and let F(x₁, …, x_r) be a function of r variables. Like in Chapter IV,
§ 3, we introduce the notation

$$\Delta_I F = \Delta_{h_1} \Delta_{h_2} \cdots \Delta_{h_r} F,$$

where h_k = b_k − a_k (k = 1, 2, …, r). We have the following theorem:

Theorem 4. Let ζ be an r-dimensional (r = 2, 3, …) random vector on
a conditional probability space ℱ = [Ω, 𝒜, ℬ, P] and let ζ⁻¹(I) denote the
set of those ω ∈ Ω for which ζ(ω) ∈ I, where I is an interval of the r-dimensional space E_r. Let I₀ denote a fixed interval of E_r and 𝒩 the set of those
intervals I ⊂ I₀ for which ζ⁻¹(I) ∈ ℬ. Assume that the conditions N₁ and N₂
given above are fulfilled. Then if J is the union of all intervals I ∈ 𝒩 (J is
also an interval of E_r), there exists a function F on J such that Δ_I F ≥ 0 for
every I ⊂ J and for I₁ ∈ 𝒩 and I₁ ⊂ I₂ the relation

$$P(\zeta^{-1}(I_1) \mid \zeta^{-1}(I_2)) = \frac{\Delta_{I_1} F}{\Delta_{I_2} F} \tag{12}$$

is valid.

Proof. If 𝒩 consists of just one interval, the statement of the theorem
is trivial. Otherwise, let I₁ ∈ 𝒩, I₂ ∈ 𝒩 with I₁ ⊂ I₂, and let

$$(x_1^{(0)}, x_2^{(0)}, \ldots, x_r^{(0)}) \in I_1.$$

For x = (x₁, x₂, …, x_r) ∈ I₂ put

$$F(x_1, \ldots, x_r) = (-1)^k\, \frac{P(\zeta^{-1}(I_x) \mid \zeta^{-1}(I_2))}{P(\zeta^{-1}(I_1) \mid \zeta^{-1}(I_2))}, \tag{13}$$

where I_x is the interval a_i ≤ t_i < b_i (i = 1, …, r) with

$$a_i = \min(x_i^{(0)}, x_i), \qquad b_i = \max(x_i^{(0)}, x_i),$$

and k is the number of the values of i for which x_i < x_i^{(0)}.
Like in the proof of Theorem 1, we see that F(x₁, …, x_r) does not
depend on the choice of I₂. Clearly, F is nondecreasing with respect to
each of its variables, Δ_I F ≥ 0, and (12) is true. Theorem 4 is thus proved.
Every function F fulfilling (12) is said to be a distribution function of ζ
on J. The distribution function is not uniquely determined; if F is a distribution function of ζ and μ is any nondecreasing function of r − 1 of the
variables x₁, …, x_r, then for every λ > 0

$$G(x_1, \ldots, x_r) = \lambda F(x_1, \ldots, x_r) + \mu \tag{14}$$

is also a distribution function of ζ.
If F is absolutely continuous on every I ∈ 𝒩 we call

$$f(x_1, \ldots, x_r) = \frac{\partial^r F}{\partial x_1 \cdots \partial x_r} \tag{15}$$

the density function of ζ on J. It is determined up to a positive constant
factor.
Let ξ₁, …, ξ_r be random variables on the conditional probability
space ℱ = [Ω, 𝒜, ℬ, P(A | B)] and put ζ = (ξ₁, …, ξ_r). We shall say
that the random variables ξ₁, …, ξ_r are mutually independent if

$$\Delta_I F = \prod_{i=1}^{r} \left(F_i(b_i) - F_i(a_i)\right), \tag{16}$$

where F is a distribution function of ζ, I is any interval I = (a_k ≤ x_k < b_k;
k = 1, …, r) with I ⊂ J, and the F_i are nondecreasing functions. If F
is absolutely continuous and the random variables ξ₁, …, ξ_r are independent, the density function f of ζ is

$$f = \prod_{i=1}^{r} f_i(x_i), \tag{17}$$

where the nonnegative function f_i(x) is equal to F_i'(x). Conversely, from
(17) follows (16) and thus the independence of the random variables ξ₁, …, ξ_r.
Example 9. Let Ω = E_r be the r-dimensional Euclidean space; let g(x),
where x = (x₁, x₂, …, x_r), be a function which is nonnegative, measurable
and integrable on every finite interval I of E_r; let 𝒜 be the set of the Borel
subsets of E_r, let ℬ be the set of all nonempty B ∈ 𝒜 for which

$$0 < \int_B g(x)\, dx < +\infty,$$

and put, for A ∈ 𝒜, B ∈ ℬ,

$$P(A \mid B) = \frac{\int_{AB} g(x)\, dx}{\int_B g(x)\, dx}.$$

Put ζ(x) = x. Then ℱ = [Ω, 𝒜, ℬ, P] is a conditional probability space
and ζ is a random vector on ℱ. If I_x denotes the interval

$$\min(0, x_i) \le t_i < \max(0, x_i) \qquad (i = 1, 2, \ldots, r),$$

then the distribution function of ζ is given by

$$F(x_1, \ldots, x_r) = (-1)^k \int_{I_x} g(x)\, dx,$$

where k is the number of the values of i for which x_i < 0, and g(x) is the
density function of ζ.
In the case g(x) ≡ 1, ζ is uniformly distributed on the whole space E_r.
In this particular case we can put

$$F(x_1, \ldots, x_r) = x_1 x_2 \cdots x_r.$$

Let ζ be an r-dimensional random vector, I an interval and B ⊂ I a
Borel subset of E_r, furthermore let ζ⁻¹(I) ∈ ℬ. Let F be a distribution
function of ζ. Then we have

$$P(\zeta^{-1}(B) \mid \zeta^{-1}(I)) = \frac{\int \cdots \int_B dF}{\Delta_I F}.$$

If ζ⁻¹(B) belongs also to ℬ and if C ⊂ B is another Borel subset, it follows
that

$$P(\zeta^{-1}(C) \mid \zeta^{-1}(B)) = \frac{\int \cdots \int_C dF}{\int \cdots \int_B dF}.$$

Thus we have proved the following theorem:

Theorem 5. If F is a distribution function of the r-dimensional random
vector ζ on a conditional probability space ℱ = [Ω, 𝒜, ℬ, P] and if B and
C ⊂ B are Borel subsets and I is an interval of E_r, further if B ⊂ I,
ζ⁻¹(B) ∈ ℬ, ζ⁻¹(I) ∈ ℬ, then

$$P(\zeta^{-1}(C) \mid \zeta^{-1}(B)) = \frac{\int \cdots \int_C dF}{\int \cdots \int_B dF}.$$

From Theorem 5 we can easily deduce

Theorem 6. Let ξ and η be independent nonnegative random variables on
the conditional probability space ℱ. Let (ξ, η) have distribution function
F(x) G(y) (0 ≤ x < +∞; 0 ≤ y < +∞) and let lim_{x→+0} F(x) = F(0) be finite.
Then the sum ζ = ξ + η has distribution function

$$H(x) = \int_0^{x} \left(F(x - y) - F(0)\right) dG(y).$$

Remark. If we put F(y) = F(0) for y ≤ 0, we can also write

$$H(x) = \int_0^{+\infty} \left(F(x - y) - F(0)\right) dG(y).$$

If we assume further that F(0) = 0 (which does not restrict generality),
then we can simply write

$$H(x) = \int_0^{+\infty} F(x - y)\, dG(y).$$

A similar theorem holds for more than two nonnegative independent


random variables.
Proof of Theorem 6. By Theorem 5, if ζ⁻¹(I_{cd}) ∈ ℬ and I_{ab} ⊂ I_{cd},

$$P(a \le \xi + \eta < b \mid c \le \xi + \eta < d) = \frac{\displaystyle\iint_{a \le x+y < b} dF(x)\, dG(y)}{\displaystyle\iint_{c \le x+y < d} dF(x)\, dG(y)} = \frac{H(b) - H(a)}{H(d) - H(c)};$$

hence Theorem 6 follows.

If F is absolutely continuous,

$$h(x) = \int_0^{+\infty} f(x - y)\, dG(y)$$

is a density function of ζ = ξ + η. Finally, if G(y) is absolutely continuous
and g(y) = G'(y),

$$h(x) = \int_0^{+\infty} f(x - y)\, g(y)\, dy.$$

Example 10. Let the random vector (ξ₁, …, ξₙ) be uniformly distributed
on the n-dimensional space, and put

$$\chi_n^2 = \xi_1^2 + \cdots + \xi_n^2.$$

The random vector (ξ₁², …, ξₙ²) has density function

$$f(x_1, \ldots, x_n) = \frac{1}{\sqrt{x_1 \cdots x_n}} \qquad \text{for } x_k > 0 \quad (k = 1, 2, \ldots, n).$$

It follows by Theorem 6 that the density function of χ₂² is given by

$$h_2(x) = \int_0^x \frac{dy}{\sqrt{y(x - y)}}.$$

We obtain by induction

$$h_n(x) = x^{\frac{n}{2} - 1} \qquad \text{for } x > 0.$$

In particular, ξ₁² + ξ₂² is thus uniformly distributed on the positive semi-axis.
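The integral defining h₂ can be evaluated directly: with the substitution y = x sin²t one finds

$$h_2(x) = \int_0^x \frac{dy}{\sqrt{y(x - y)}} = \int_0^{\frac{\pi}{2}} 2\, dt = \pi,$$

a constant, which is the statement just made.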
§ 2. Generalization of the notion of conditional probability on Kolmogorov probability spaces

Let ξ be a discrete random variable defined on a probability space
[Ω, 𝒜, P], let x_k (k = 1, 2, …) be the values taken on by ξ with positive
probability. Let A_k denote the event ξ = x_k and B an arbitrary event.
Let the random variable η be defined by

$$\eta = P(B \mid A_k) \qquad \text{for } \xi = x_k \tag{1}$$

(η = P(B | A_k) for every ω ∈ Ω such that ξ(ω) = x_k). Instead of (1), the
notation η = P_ξ(B) will also be used.
Let U denote any Borel set of real numbers and ξ⁻¹(U) the set of all
ω ∈ Ω such that ξ(ω) ∈ U. Let further 𝒜_ξ be the family of the sets ξ⁻¹(U).
We have thus 𝒜_ξ ⊂ 𝒜. The family 𝒜_ξ is a σ-algebra, since

$$\xi^{-1}\Big(\sum_k U_k\Big) = \sum_k \xi^{-1}(U_k) \qquad \text{and} \qquad \xi^{-1}(U - V) = \xi^{-1}(U) - \xi^{-1}(V);$$

it is the minimal σ-algebra with respect to which ξ is measurable.
It is easy to see that

$$P(AB) = \int_A P_\xi(B)\, dP \qquad (A \in \mathcal{A}_\xi,\; B \in \mathcal{A}). \tag{2}$$

In fact, since A ∈ 𝒜_ξ, for every k such that A_kA ≠ O we have A_k ⊂ A,
hence

$$\int_A P_\xi(B)\, dP = \sum_{k=1}^{\infty} P(B \mid A_k)\, P(AA_k) = \sum_{k=1}^{\infty} P(ABA_k) = P(AB).$$

Obviously, P^B) can be interpreted as the conditional probability of the event


Bfor a given value oft, . Of course, the question arises whether this definition
may be extended to any random variable so that formula (2) should remain
valid. We shall show that this extension is possible. The difficulty of the
problem is seen from the fact that for instance a random variable with
absolutely continuous distribution function assumes each of its values
with probability zero and up to now we defined conditional probabilities
only for conditions having a positive probability.
Let £ be an arbitrary random variable. Let us fix a B £ with P{B) >0
and consider the measures P{A) and P(AB) on the c-algebra Clearly,
0 <P(AB) < P(A). Hence P(A) = 0 implies P(AB) = 0: the measure
P(AB) is thus absolutely continuous with respect to P(A).
In what follows, we shall need the Radon-Nikodym theorem:
Let 𝒜 be a σ-algebra of the subsets of a set Ω, let μ(A) be a measure
and ν(A) a σ-additive real set function on 𝒜. The measure μ is assumed
to be σ-finite, i.e. Ω is assumed to be decomposable into denumerably many
subsets Ω_k with Ω_k ∈ 𝒜, μ(Ω_k) < +∞. Let further ν(A) be absolutely
continuous with respect to μ(A), i.e. let μ(A) = 0 imply ν(B) = 0 for every
B ∈ 𝒜, B ⊂ A. Under these conditions there exists a function f(ω), measurable
with respect to the σ-algebra 𝒜, such that for every A ∈ 𝒜 the relation

$$\nu(A) = \int_A f(\omega)\, d\mu \tag{3}$$

holds. If ν is nonnegative (i.e. if it is a measure), then f(ω) ≥ 0. The function
f(ω) is determined in an essentially unique manner in the sense that whenever
g(ω) is another function fulfilling the conditions of the theorem, then f(ω) =
g(ω) holds almost everywhere (with respect to μ). That is to say, if D
denotes the set of the points ω at which f(ω) ≠ g(ω), then μ(D) = 0.

The function f(ω) figuring in (3) is denoted by $\dfrac{d\nu}{d\mu}$ and is called the
(Radon–Nikodym) derivative of the set function ν with respect to μ.
Consider now on the σ-algebra 𝒜_ξ the measures μ(A) = P(A) and
ν(A) = P(AB) (A ∈ 𝒜_ξ; B ∈ 𝒜 is fixed) and apply the Radon–Nikodym
theorem. Since P(Ω) = 1, P(A) is not only σ-finite but even finite. Further
we have seen that P(AB) is absolutely continuous with respect to P(A).
Hence there exists a function f(ω) which is measurable with respect to 𝒜_ξ
such that

$$P(AB) = \int_A f(\omega)\, dP. \tag{4}$$

According to the Radon–Nikodym theorem, f(ω) is determined up to a
set of measure zero; f(ω) is a nonnegative random variable, measurable
with respect to 𝒜_ξ. Obviously, we have almost everywhere f(ω) ≤ 1. Indeed,
(4) implies

$$P(A\bar{B}) = \int_A \left(1 - f(\omega)\right) dP. \tag{5}$$

Thus if we would have the inequality 1 − f(ω) < 0 on a set C with positive
measure, this would imply P(C\bar{B}) < 0, which is impossible.
We shall call the random variable f(ω) the conditional probability of the
event B for a given value of ξ and shall denote it by P_ξ(B). Thus we can
write

$$P(AB) = \int_A P_\xi(B)\, dP, \tag{6}$$

which is a generalization of (2). If we want to emphasize that P_ξ(B) depends
on ω, we shall write P_ξ(B; ω) instead of P_ξ(B).
In particular, if we put A = Ω in (6) we find

$$P(B) = \int_\Omega P_\xi(B)\, dP = E(P_\xi(B)), \tag{7}$$


i.e. the expectation of the random variable P_ξ(B) is equal to P(B). If we
put now B = Ω in (6), we have

$$P(A) = \int_A P_\xi(\Omega)\, dP \qquad \text{for every } A \in \mathcal{A}_\xi.$$

On the other hand, however,

$$P(A) = \int_A 1\, dP,$$

hence, with probability 1 we have

$$P_\xi(\Omega) = 1. \tag{8}$$

One can prove in a similar manner that with probability 1

$$P_\xi(B_1) \le P_\xi(B_2) \qquad \text{for } B_1 \subset B_2.$$

In particular, when ξ is a discrete random variable, P_ξ(B) coincides
almost everywhere with the random variable defined in (1) at the beginning
of this section, since (2) determines the value of P_ξ(B) for almost every ω.
It is seen from (1) that the value of P_ξ(B), for a discrete random variable
ξ, only depends on the value of ξ. But this holds for the general case as
well and is expressed by the fact that P_ξ(B) is measurable with respect
to 𝒜_ξ, which may be rephrased by saying that P_ξ(B) = h(ξ), where y = h(x)
is a Borel-measurable function. This can be seen from the following,
somewhat modified, definition of P_ξ(B).
Apply the Radon–Nikodym theorem to the measures defined on the
σ-algebra of the Borel subsets U of the real numbers by P(ξ⁻¹(U)) = μ(U)
and P(Bξ⁻¹(U)) = ν(U). The Radon–Nikodym theorem states the existence
of a Borel-measurable function g(x), defined for the real numbers x, such
that

$$P(B\,\xi^{-1}(U)) = \int_U g(x)\, dF(x) \tag{9}$$

holds for every Borel set U of the real axis. Here F(x) denotes the
distribution function of ξ.
Obviously, the relation g(ξ(ω)) = f(ω) holds for almost every ω ∈ Ω.
If the random variable P(B | ξ = x) is defined by the function g(x) of
formula (9), then by definition it only depends on x; further, for almost
every ω ∈ Ω, P(B | ξ = x) = P_ξ(B; ω), where x = ξ(ω).
If A is a fixed set, A ∈ 𝒜, P(A) > 0, then P(B | A), considered as a
function of the set B ∈ 𝒜, is a probability measure. We shall now discuss
how far this remains valid for P_ξ(B). Suppose B_k ∈ 𝒜 (k = 1, 2, …),
B_jB_k = O for j ≠ k, and $\sum_{k=1}^{\infty} B_k = B$. Consider an arbitrary random variable
ξ and define the random variables P_ξ(B_k) (k = 1, 2, …) and P_ξ(B) as above.
Then, for A ∈ 𝒜_ξ,

$$P(AB_k) = \int_A P_\xi(B_k)\, dP \tag{10}$$

and

$$P(AB) = \int_A P_\xi(B)\, dP.$$

But from (10) and from

$$\sum_{k=1}^{\infty} P(AB_k) = P(AB)$$

it follows that

$$P(AB) = \int_A \left(\sum_{k=1}^{\infty} P_\xi(B_k)\right) dP,$$

hence $\sum_{k=1}^{\infty} P_\xi(B_k)$ fulfils relation (6) which defines P_ξ(B). Thus with probability 1

$$P_\xi(B) = \sum_{k=1}^{\infty} P_\xi(B_k). \tag{11}$$

The elements co for which the relation (11) does not hold form thus a
set C of measure zero, i.e. P(C) = 0. Since PfB) is determined only almost
everywhere, one cannot expect to prove more than this. The exceptional
set C depends on the sets Bk and the union of the exceptional sets corre¬
sponding to the individual sequences is not necessarily a set of measure
zero since the set of all sequences {Bk} is nondenumerable if has infinitely
many elements. Thus we cannot state that for a fixed £, P^B) as a function
of B is a measure; in general this is not true.
In practice, however, this fact causes scarcely any difficulty at all. In most
cases, the conditional probability PfB) = P(B | £ = x) is studied simul¬
taneously for nondenumerably infinitely many B only when the con¬
ditional distribution of a random variable 17 is to be determined with respect
to the condition £ = X; i.e. if the probabilities

1 < y 1£ = x)
are to be considered for every real value of y . If these conditional proba¬
bilities can be defined in such a manner that P(r\ < y | £ = x) is a distri¬
bution function with probability 1, then this function is said to be the
conditional distribution function of r/ with respect to the condition ^ = x
and is denoted by F(y | x):

F(y | x) = P(rj < y | £ = x).


If F(y | x) is an absolutely continuous function of y and if

$$F(y \mid x) = \int_{-\infty}^{y} f(t \mid x)\, dt$$

is valid, then f(y \ x) is said to be the conditional density function of rj with


respect to the condition £ = x.
Since conditional probabilities are determined only with probability 1,
it can always be achieved that for almost every x the random variable
P(η < y | ξ = x) as a function of y should be a distribution function.
The proof of this statement will just be sketched.
The conditional probabilities P(η < y | ξ = x) are first defined, by means
of the Radon–Nikodym theorem, for rational numbers y only. Then
there exists a set V with P(ξ⁻¹(V)) = 0 such that for x ∉ V the function
P(η < y | ξ = x), as a function of y (y rational), is nondecreasing, left-continuous, and fulfils the conditions

$$\lim_{y \to -\infty} P(\eta < y \mid \xi = x) = 0 \qquad \text{and} \qquad \lim_{y \to +\infty} P(\eta < y \mid \xi = x) = 1.$$

Extend now the definition of P(η < y | ξ = x) to irrational values of y
in the following manner:

$$P(\eta < y \mid \xi = x) = \sup_{y' < y,\; y' \text{ rational}} P(\eta < y' \mid \xi = x).$$

Then P(η < y | ξ = x) as a function of y is a distribution function and we
have

$$P(\eta < y,\; \xi \in U) = \int_U P(\eta < y \mid \xi = x)\, dF(x).$$

In fact, this relation is valid for every rational y and hence for every
real y as well. Herewith our statement is proved.
Thus we have defined the conditional probabilities P(B \ A) even for
P(A) = 0; but let it be emphasized that in the latter case the conditional
probability P(B \ A) is only defined, if A can be considered as a level set
of a random variable £, i.e., if there exists an x such that A is the set of
the elements co of Q for which £(co) — x. Then P(B | A) is defined by
P(B | A) = P(B [ £ = x). However, a set of probability zero can be obtained
as a level set of different random variables, thus e.g. A may be defined
by any one of the conditions ^ = xx and £2 = x2. Thus it is possible that

P(B | ^ = xx) # P(B | ^ = x2),

though the conditions = x1 and = x2 define the same set A. A condi¬


tional probability with respect to a condition of probability zero is therefore
defined only if this condition is an element of a decomposition of the
sample space into pairwise disjoint subsets and is considered as an element
of this decomposition. The corresponding conditional probability P(B | A)
depends thus on the decomposition in which A was imbedded.
With the Radon–Nikodym theorem we proved so far the existence of
the conditional probability P_ξ(B) only. Let us now see how P_ξ(B) =
P(B | ξ = x) can be effectively determined. In order to do this let us
remark that relation (9), in the case of P(B) > 0, may be brought to the
form

$$P(B)\left(F_B(b) - F_B(a)\right) = \int_a^b P(B \mid \xi = x)\, dF(x), \tag{12}$$

where F_B(x) denotes the distribution function of ξ with respect to the
condition B and where we have chosen for U the interval [a, b]. It follows
by a well-known theorem of Lebesgue that (if F(x) is the distribution
function of ξ)

$$P(B \mid \xi = x) = P(B) \lim_{h \to 0} \frac{F_B(x + h) - F_B(x)}{F(x + h) - F(x)} \tag{13}$$

for almost every x (i.e. for every x ∉ C, with P(ξ⁻¹(C)) = 0).
In particular, if F(x) and F_B(x) are absolutely continuous and if
F'(x) = f(x), F_B'(x) = f_B(x), then for almost every x

$$P(B \mid \xi = x) = P(B)\, \frac{f_B(x)}{f(x)} \tag{14}$$

whenever f(x) > 0.
Examples.

1. Let (ξ, η) be a random vector with absolutely continuous distribution
and with density function h(x, y). Let

$$f(x) = \int_{-\infty}^{+\infty} h(x, y)\, dy$$

be the density function of ξ. Let ξ⁻¹(U) and η⁻¹(V) denote the events
ξ ∈ U and η ∈ V respectively, where U and V are Borel sets on the real
axis. Assume that the function f(x) is positive for x ∈ U. Then

$$P(\eta \in V,\; \xi \in U) = \iint\limits_{x \in U,\, y \in V} h(x, y)\, dx\, dy = \int_{x \in U} \left(\int_{y \in V} \frac{h(x, y)}{f(x)}\, dy\right) f(x)\, dx,$$
hence

$$P(\eta \in V \mid \xi = x) = \int_{y \in V} \frac{h(x, y)}{f(x)}\, dy;$$

thus the conditional density function g(y | x) of η with respect to the condition ξ = x is given, for the x values which fulfil f(x) > 0, by

$$g(y \mid x) = \frac{h(x, y)}{f(x)}. \tag{15}$$

g(y | x) is not defined for those x values for which f(x) = 0.
Similarly, if g(y) is the density function of η and f(x | y) is the conditional
density function of ξ with respect to a given value y of η (i.e. with respect
to the condition η = y), we find for g(y) > 0 that

$$f(x \mid y) = \frac{h(x, y)}{g(y)}. \tag{16}$$
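For instance, if (ξ, η) has the two-dimensional normal density

$$h(x, y) = \frac{1}{2\pi\sqrt{1 - r^2}}\, \exp\left(-\frac{x^2 - 2rxy + y^2}{2(1 - r^2)}\right) \qquad (|r| < 1),$$

then $f(x) = \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$ and (15) gives

$$g(y \mid x) = \frac{1}{\sqrt{2\pi(1 - r^2)}}\, \exp\left(-\frac{(y - rx)^2}{2(1 - r^2)}\right),$$

i.e. under the condition ξ = x the random variable η is normally distributed with expectation rx and variance 1 − r².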
2. Let ξ and η be independent random variables and F(x) the distribution
function of ξ. Then

$$P(\eta \in V,\; \xi \in U) = P(\eta \in V)\, P(\xi \in U) = \int_{x \in U} P(\eta \in V)\, dF(x),$$

where U and V are arbitrary Borel sets. Hence

$$P(\eta \in V \mid \xi = x) = P(\eta \in V). \tag{17}$$

Consequently, if the random variables ξ and η are independent, then
the conditional distribution function of η with respect to the condition
ξ = x is identical with the ordinary (unconditional) distribution function
of η. Conversely, if (17) is valid for every Borel set V and for every x ∈ U
with P(ξ ∈ U) = 1, then ξ and η are independent.

3. Let (ξ, η) be a normally distributed random vector with the density
function

$$\frac{1}{2\pi}\, \exp\left(-\frac{x^2 + y^2}{2}\right).$$

Let ϱ and ϑ (0 ≤ ϑ < 2π) be the polar coordinates of the point (ξ, η).
Find the conditional distribution of ϑ with respect to the condition
ϱ = r > 0. We have

$$P(0 \le \vartheta < \varphi,\; \xi^2 + \eta^2 \le R^2) = \iint\limits_{\substack{0 \le \vartheta < \varphi \\ x^2 + y^2 \le R^2}} \frac{1}{2\pi}\, \exp\left(-\frac{x^2 + y^2}{2}\right) dx\, dy,$$

or, by introducing polar coordinates,

$$P(0 \le \vartheta < \varphi,\; \xi^2 + \eta^2 \le R^2) = \frac{\varphi}{2\pi} \int_0^R r\, e^{-\frac{r^2}{2}}\, dr.$$

Since the density function of ϱ is given by $r e^{-\frac{r^2}{2}}$, we obtain

$$P(0 \le \vartheta < \varphi \mid \varrho = r) = \frac{\varphi}{2\pi} \qquad (0 \le \varphi \le 2\pi).$$

Hence ϑ is uniformly distributed in the interval [0, 2π) under the condition
ϱ = r for every r > 0 and thus ϑ and ϱ are independent.

4. Let ξ and η be independent random variables. We shall determine
the conditional distribution of H(ξ, η) with respect to the condition ξ = x;
here H(x, y) is assumed to be a Borel-measurable function.
If U is an arbitrary Borel-measurable set and if F(x) and G(y) are the
distribution functions of ξ and η, then

$$P(H(\xi, \eta) < z,\; \xi \in U) = \iint\limits_{H(x,y) < z,\; x \in U} dF(x)\, dG(y) = \int_{x \in U} \left(\;\int\limits_{H(x,y) < z} dG(y)\right) dF(x);$$

the conditional distribution function in question is thus the distribution
function of H(x, η).

5. Let U be a Borel set and B = ξ⁻¹(U). Then, with probability 1,

$$P_\xi(B) = \begin{cases} 1 & \text{for } \omega \in B, \\ 0 & \text{otherwise.} \end{cases}$$

In fact, if A ∈ 𝒜_ξ,

$$P(AB) = \int_{AB} dP = \int_A \chi_B(\omega)\, dP,$$

where

$$\chi_B(\omega) = \begin{cases} 1 & \text{for } \omega \in B, \\ 0 & \text{otherwise.} \end{cases}$$

Since χ_B is measurable with respect to 𝒜_ξ, we have, with probability 1,

$$P_\xi(B) = \chi_B(\omega).$$

6. (Particular case of 5.) Let Ω be the interval [0, 1], let 𝒜 be the set
of Borel subsets of Ω and P the Lebesgue measure. Put

$$\xi(\omega) = \omega \qquad (0 \le \omega \le 1).$$

Then, for B ∈ 𝒜 (with probability 1),

$$P_\xi(B) = \begin{cases} 1 & \text{for } \omega \in B, \\ 0 & \text{otherwise.} \end{cases}$$

7. Let Ω be the unit square of the plane (x, y), 𝒜 the class of the Borel
subsets of Ω and P the two-dimensional Lebesgue measure. Put ξ(x, y) = x.
Since, for every B ∈ 𝒜 and for any Borel set U of the real axis (according
to the theorem of Fubini),

$$P(B\,\xi^{-1}(U)) = \int_U \left(\;\int\limits_{(x,y) \in B} dy\right) dx,$$

we find

$$P(B \mid \xi = x_0) = \int\limits_{(x_0, y) \in B} dy = \lambda(B_{x_0}),$$

where B_{x₀} represents the intersection of B by the line x = x₀ and λ the
one-dimensional Lebesgue measure. In this case P(B | ξ = x₀) is thus, as
a function of B, a measure on the σ-algebra 𝒜, for every x₀.

§ 3. Generalization of the notion of conditional probability on conditional probability spaces

Let ℱ = [Ω, 𝒜, ℬ, P(A | B)] be a conditional probability space and ξ
a random variable on ℱ. Let B ∈ 𝒜 and C ∈ ℬ be given sets, with
P(B | C) > 0, and let 𝒜_ξ be the least σ-algebra with respect to which ξ
is measurable. Consider the measures μ_C(A) = P(A | C) and ν_C(A) =
P(AB | C) on 𝒜_ξ; ν_C(A) is absolutely continuous with respect to μ_C(A);
there exists thus, by the Radon–Nikodym theorem, a function f(ω) = P_ξ(B | C),
measurable with respect to 𝒜_ξ, such that

$$P(AB \mid C) = \int_A P_\xi(B \mid C)\, d\mu_C \qquad \text{for } A \in \mathcal{A}_\xi. \tag{1}$$

The random variable P_ξ(B | C) will be called the conditional probability of
the event B with respect to the condition C and for a given value of ξ; this
of course depends on ξ, but also on C; but the dependence is quite obvious
in the most important particular cases.
If C is fixed, P_ξ(B | C) can be considered as the conditional probability
of the event B on the ordinary Kolmogorov probability space ℱ_C =
[Ω, 𝒜, P(A | C)] with respect to the condition that ξ assumes a given
value. The random variable P_ξ(B | C) has thus, for fixed C, all the properties
proved in § 2 for P_ξ(B).
Let us point out the following circumstance. If A(x) is the set of all
ω ∈ Ω for which ξ(ω) = x, it may happen that the sets CA(x) belong
to the family ℬ for some values of x or even for every one of its values,
and thus P(B | CA(x)) is defined. But a priori it is not at all certain that
P(B | CA(x)) coincides with P_ξ(B | C), i.e. that

$$P(AB \mid C) = \int_A P(B \mid CA(\xi))\, d\mu_C \qquad \text{for } A \in \mathcal{A}_\xi.$$

This regularity property does not follow from the axioms and if necessary
it must be postulated as an additional axiom.
Consider now the following important particular case:
Let Q be an arbitrary set and a a-algebra of subsets of Q. Let further
P be a o--finite measure on and let be the family of sets

B £ with 0 < p.(B) < + oo .


We define
p(AB)
P(A | B) =
m" ’
when A £ and B £
Let ^ be a random variable on the conditional probability algebra

F = [Q, P(A | fi)]

and let ^ be the least cr-algebra with respect to which £ is measurable.


Let B(B £ be fixed; since the measure v(A) = p(AB) is absolutely
continuous on^f with respect to p(A), there exists a function/(co, B) which
is measurable with respect to and has the property

H(AB) = f/(co, B)dp for every A £ (2)

P(AC)
If C ( J and fiCC, it follows from (2) by putting pc(A)

that
KQ
p(AB)
P(AB | C) = fj(a>,B)dMc.
M(C) A
(3)

The function /(co, B) obviously does not depend on C. Hence introducing


the notation P^B) = /(co, B) we have

P{AB\C)=\P£B)dpc (4)
A

for A ( 5 ( uf and B C C £ AS.


v, § 3] GENERALIZATION OF CONDITIONAL PROBABILITY 265

Clearly PfB) is with respect to pi almost everywhere uniquely defined,


further almost everywhere with respect to pi holds

0 < PfB) < 1. (5)

With exception of a set of pi-measure zero we also have


oo 00

ps(B) = E pt(Bk) if YBk = B and bj Bk = 0 for j ¥> k.


k=1 A=1

If rp is another random variable on y, it can be shown as in § 2 that


the values of Pfrp~^{V)) can be chosen such that for every co (J D, with
px(D) = 0, Pi(jp-1(V)') is a measure on the Borel subsets of the real number
axis. This measure will be called the conditional distribution of rp for given
If
Pi (rp ~1 (7)) = f g(y I x) dy for f (o>) = x, (6)
v

then g(y \x) will be called the conditional density function of ip with respect
to the condition £ = x.
Let £ and rp be random variables defined on y with the two-dimensional
density function h{x, y); assume that the integral
+ CO

/(x)=j h(pc, y) dy (7)

exists for every jc. Then f(x) is a density function of In fact, if U and V
are two intervals, U c: V, we have

j f(x) dx
P(Z£U\Z£V) = (8)
$f(x)dx

if z-w
In this case the conditional density function of g with respect to the
condition £ = * is equal, for f(x) > 0, to

Kx- y)
g(y I x) = (9)
fix)

In fact, for [/Cf and £-1(F) € & we have

PjKx,y)dxdy i(ig(ylx)dy)Ax)dx
y(W u w (10)
P(rp£W,UU\UV) =
f f{x)dx f f(x)dx
266 MORE ABOUT RANDOM VARIABLES [V, § 3

Finally, let the relation

+ OO

^g(y\x)dy = j~=\ for f(x)> 0 (11)


— CO

be mentioned, expressing the faci that the conditional distribution of //


for given £ is an ordinary distribution.
Let us consider now some examples.

1. Let the point (£, rf) be uniformly distributed in the domain of ihe
plane defined by | x2 — y21 < 1. The density function h(x,y) of (<f. i/) is

1 for \x2 - y2\ < 1,


h(x, y) =
0 otherwise.

The density function f(x) of £ is

+ 00
/(*)=! h(x,y)dy.

hence

2(y/x2 + 1 - y/x^ - 1) for IJCI > 1,


/(*) =
2 JX2 + 1 otherwise.

Similarly, if g(y) is the density function of t], we have

2{-Jf + 1 - yjr - 1) for \y\>],


g(y) =
2^/y2 + 1 otherwise.
It follows that
l
for I ^ I > 1 ,y/f- 1 <\x\<Jy2+ L
2(sjy2 + l — V? - l)
/O' I y) = 1
for | y\ < 1,0 < !x| < Jy2 + 1,
2s/? + 1
0 otherwise.

Hence £ is. for >7 — y (| y [ < 1), uniformly distributed on the interval
(“ s/y2 4 1,+ Jy2 + 1).

2- Lc» c and ij be two independent random variables with absolutely


continuous distributions. The density function of their joint distribution
v, § 3] GENERALIZATION OF CONDITIONAL PROBABILITY 267

is thus
h(x,y)=f(x)g(y),

where g(y) is an ordinary density function. Hence the conditional density


function g(y | x) of 17 with respect to the condition £ = x is

9(y I x) = g(y).
The conditional density function g(y | x) does not depend on the value
of x.
3. Let (^l5 £2,.. ., £„) be a random vector uniformly distributed in the
whole ^-dimensional space and let rin = + £, 1 + . .. + <j^. Determine
the conditional density function of £k with respect to the condition rj„ =
"-i
= y(y > 0). We know already that the density function of rjn is y2
for y > 0 (§ 1, Example 10). It follows that the two-dimensional density
function of and r]k is

0 otherwise.
For the conditional density function of £k with respect to the condition
vjn = y we find thus
n-3

fn(x\y) =
~s~
for | x | < Jy, (12)

where the constant C„ will be determined by

J fn (x I y) dx = 1. (13)
-fy
From (12) and (13) it follows

(14)

and finally, we obtain thus

n |

L (x I y) =
l f tJ for s/y <x <+Jy. (15)
j*y r n— 1
2
268 MORE ABOUT RANDOM VARIABLES [V, § 3

From fl5) follows

lim f„(x\no2) =—_- e 2a* (—oo < x < + oo), (16)


n-*- + oo ^/27l C

hence every £k (/c fixed) has in the limit a normal (conditional) distribution,
if the condition imposed is rjn — no2 and n tends to infinity.

4. We deduce now the Maxwell distribution from the preceding example.1


Let c,k, rjk, Ck (k = 1,2,,n) be the components of the velocities of n
atoms of a certain amount of gas. We assume that the (a priori) distribution
of the point (£x, rji, .. ., £„, rj,;, £„) is uniform on the whole 3/?-dimen-
sional phase space. Consider the conditional distribution of the velocity
components with respect to the condition that the total kinetic energy
of the gas be constant. This kinetic energy is given by

^ k=1
tfl + ll + Cl),
where m represents the mass of a particle of the gas. The conditional
density function of the distribution studied is, by the above example.

3n
r
2E m Jt2 7WV 3”-3
An r 1
m 2nE 3n — 1 2E
2

2E
for IaI < (17)
m

3
By taking into account that E = — kTn (k Boltzmann's constant, T the

absolute temperature of the gas) we find for the conditional density function
h„{x | T) of the velocity components £k, rjk, Ck at constant temperature T

3n |
i
r 3n-3
2 ) x2 m
hn{x\T) = 1 2
1 3nkTn ^ 3n — 1 3 kTn
(18)
yj m 2

for

1 Cf. A. Retiyi [19].


V, § 3] GENERALIZATION OF CONDITIONAL PROBABILITY 269

hence
1 2kT
lim hn (x | T) — (19)
n-*- + co 2nkT
m

Thus the distribution of each component of the velocity tends to a normal


[jcT
distribution with the expectation 0 and standard deviation /-.
V m
Since for large n under the condition E = constant, £k, r\k, and Ck tend
to be independent, it follows already that the distiibution of the random
variables

vk = \j^k + vik + C| j

i.e. of the velocities of the particles, tends to the Maxwell distribution.


But it is profitable to perform exactly the above calculations, i.e. to calculate
the conditional distribution of vk for every finite n. It is quite natural to
call this distribution the Maxwell distribution of order n; for this distribution
tends to the ordinary Maxwell distribution if n -» + oo .
The calculations are entirely similar to those in the preceding example.
Put

Vk^s/tk + vl + Ck'*
if we put

bSn = Z
k=l
& + ’ll + Cl
and if hn(v, y) is the two-dimensional density function of vk and rj3n, we
have
3/1 — 5 _

hn y) =v*(y -v2) 2 for 0 < v < Jy. (20)

Thus if Vn(v | y) denotes the conditional density function of vk with respect


to the condition ri3n = y we obtain

3« —5
v2
yn ip\y) = Dn- 1 -
2
for 0 < v < Jy, (21)
y
y
the constant D„ being determined by

fy
V„ (v \y)dv = 1. (22)
270 MORE ABOUT RANDOM VARIABLES [V, § 4

Hence
(3 n

D„ = (23)
" -J* r f3" ~ 3
If W„(v | T) denotes the conditional density function of the velocity of the
particles at a given absolute temperature T, we have

3nkT 4v2 171


,3

2
f 2J 3n |

(v\ T) = Vn V x
m Jn 3 nkT.) [3n - 3
’ r

mv,2 \Sn — 5
3nkT
x 1 for 0 < v< (24)
3nkT) m

The distribution with density function (24) is called the Maxwell distribution
of order n. As we have already seen, it tends for n -> oo to the ordinary
Maxwell distribution, i.e.

v*m
lim Wn (u | T) = 2k T
v2, e (0 < V < + 00). (25)

§ 4. Generalization of the notion of conditional mathematical expectation in


Kolmogorov probability spaces

In § 2 we have defined the conditional probability of an event with


lespect to the condition that a random variable assumes a given value.
Similarly, we can define the conditional expectation of a random variable
yj with respect to the condition that the random variable £ assumes a given
value.
Let £ be a random variable and the least cr-algebra with respect to
which £ is measurable; let t] be any other random variable with finite
expectation. If £ is a discrete random variable assuming the values
xk (k = 1,2,.. .) with positive probabilities and if Ak is the event £ = xk,
then let E(r,j | £) denote the random variable such that E(rj \ {) = E(rj \ Ak)
for t; = xk (i.e. for every co £ Q with <^(co) = xk); we have thus, for A £
r +00

\E(t1\OdP = YJE(rj\Ak)P(AAk),
A A =1
v, § 4] GENERALIZATION OF CONDITIONAL EXPECTATION 271

hence
\E(r,\OdP = Sr,dP, (1)
A A

provided that A £ ^ (this means in case of a discrete random variable


£ that A = £ Air O'l < j2 < . . .) is valid).
r

In the general case, we want to define the random variable E(yj \ Z) so


that it is measurable with respect to and the relation (1) is valid. Put

v(A)=\t]dP (AZisfJ;
A

because of the known properties of the integral, v(^4) is a-additive on ^


and absolutely continuous with respect to P(A). Hence, by the Radon-
dv
Nikodym theorem, there exists a function f(co) = which is measurable

with respect to ^ and fulfils the relation

v(A) = f/(o>) dP
A

whenever A £ Therefore if E(rj \ Z) is defined by E(g | Z) = f(co), (1) is


satisfied. It follows from the definition that for A = Q

E(E(rj 1 0) = E(ri). (2)

In particular, if rj = rjB , where t]B is the indicator of the set B, i.e.

f 1 for co ^ R,
TbO^O | q
then
v(A)= f rjBdP = P(AB),
A
and
E{f!B \0 = ps (B).
The conditional probability P^B) of B for a given value of Z may thus
also be considered as a conditional expectation.
Of course one may ask whether E(r\ | Z) is with probability 1 equal to
the expectation of the conditional distribution of r\ for a given value of Z
(i.e. to the expectation of the distribution P^rj-^V)). The response is
affirmative, provided that P^{r]~\V)) is with probability 1 a probability
distribution. This can always be achieved, as we have already seen. In this
case
E(rj \0 = f ijdPt* (3)
h
272 MORE ABOUT RANDOM VARIABLES [V, § 4

with probability 1. In order to prove this it suffices to show that for every
A £ the relation
^\r,dPt)dP = \VdP (4)
A h A

holds. Obviously, this relation is fulfilled for // = t]B, where rjB is the
indicator of the set B\ indeed in this case

f Vb dP/c = P$ (B), j >1b dP = P(AB),


h 'a

and (4) will be reduced to the relation

§Ps(E)dP = P(AB) -
A

defining P^JS). Hence (4) holds when r] takes on a denumerable set of


values. From this, because of the known properties of the Lebesgue integral,
it can be shown that (4) is generally valid.
If £ and rj are independent, it follows from (3) that we have with proba¬
bility 1
E(r,\9 = E(r,). (5)

Furthermore the following theorem can be stated for arbitrary random


variables £ and rj: If f(x) is a Borel-measurable function such that E(f(g)r\)
exists, then we have, with probability 1,

E(f (O n I {) =/({) E(i I {)■ (6)


To prove this it suffices to show that

J f(0E(n I 0 dP = j/(0 gdP for A £ (7)


^ A

It follows from (3) that

I M)E(y \OdP = J M) (f r\dPdP = \E(f(Orj \S)dP= f /(£) ndP.


A A h A A

Relation (6) furnishes a new proof of the fact that, for independent £ and
*1’ = E{^)- E{rj) (cf. Ch. IV). In fact, it follows from (2) and (6) that

E(Sn) = E(Eito I 0) = E(ZE(ti I 0). (8)

Thus if £ and g are independent, E(r\ \ £) — E(r[) with probability 1 and


from this follows the desired result.
Consider now another important property of the conditional expectation
Let £ and t, be two random variables and g(x) a Borel-measurable function.
v. § 5] GENERALIZATION OF BAYES’ THEOREM 273

We have then with probability 1

E(E(n If) \n(0) = E(<I I »«))• (9)

In order to see this it suffices to prove the relation

jE(E(fj\0\ff(0)dP=jridP (10)

for every A £ By applying twice the definition of conditional proba¬


bilities and by taking into account that g(?) cz we obtain

f E{E(n | () | g(())dP =\E(n\()dP=[ ndP,


A A A

which proves (10) and hence (9) too.


The ordinary expectation is known to be a linear functional. How far
does this hold for the conditional expectation ? If cx and c2 are two con¬
stants, we have with probability 1,

E(cx m + c2r,2\0 = cxE(,h | 0 + c2E(rj21 (). (11)

Indeed we have by (1) for every A

f (cx E(Vl \0 + c2 E{t]2 | 0) dP = cx$ E(th \0dP+ c2 f E(rj2 \ 0 dP =


A A A

= f (Cl + C2 y]2)dP.
A

Nevertheless we cannot state that E(r] | £) is a linear functional with proba¬


bility 1, since (11) holds with probability 1 only and the exceptional sets
corresponding to every pair (tju r/2) may together even cover the whole
space Q.

§ 5. Generalization of Bayes’ theorem

Let t, and rj be two random variables with absolutely continuous distri¬


butions and a two-dimensional density function h(x, j). Put further

+ 00
f(x) = j h(x, y) dy, (1)
— 00

+ oo
g(y) = f k(x, y) dx, (2)
274 MORE ABOUT RANDOM VARIABLES [V, § 5

Kx\y) =
Kx>y) for g(y) > 0, otherwise arbitrary, (3)
9{y)

g(y | x) = for f(x) > 0, otherwise arbitrary. (4)


Ax)
Clearly we have
+ 00

Ax) = j f(x\y)g(y)dy (5)

and
+ oo
9(y) = J g(y I x)f(x)dx. (6)
— oo

It follows from (3) and (4) that

g(ylx)=m>m
Ax)
ror /(,)>„, (7)

hence by (5)

9(y I x) Ax\y)g(y)
(8)
I f(x I t)g(t)dt

Formula (8) may be considered as a generalization of Bayes’ theorem for


the case of absolutely continuous distributions. With this formula one can
express the conditional density function of rj for a given value of £ by means
of the conditional density function of £ for a given value of t] and the
unconditional density function of rj. It follows from (8) that

P(a < rj < b | £ = x) = j'g(y | x) dy =-•


a
b
I f(x\y)g(y)dy
(9)
7 /(* I')«(<) *
— 00
or

J7(*l y)dG(y)
P(a<ri<b\Z = x)= —^-, (10)
J f(x\t)dG(t)
— oo

where G{y) is the ordinary distribution function of rj.


v, § 6] THE CORRELATION RATIO 275

Formula (10) is therefore valid even if rj does not have an absolutely


continuous distribution.
Relation (8) is in certain cases also valid for £ and r\ defined on a con¬
ditional probability space. This holds if the two-dimensional density function
(in the sense explained in § 1) exists. For the functions f(x) and g(y) defined
in (1) and in (2) — provided that they exist, i.e. the integrals (1) and (2)
are finite — we have, in general,
+ 00 +00

f f(x)dx = j g(y) dy= + go.


— 00 — oo

Let it be mentioned that h(x, y),f(x), g(y) are only defined up to a constant
factor. If f(pc | y) and g(y \ x) are computed by (3) and (4) or (8), this factor
disappears. The obtained density functions f(x | >>) and g(y | x) are already
so normed that their integral from — oo to +00 has the value 1.

§ 6. The correlation ratio

Let £ and rj be two random variables on a Kolmogorov probability


space; suppose that E(rj) and D\r\) exist, let further E(t/ | f) denote the
conditional probability of rj for a given value of We know that

E(E(r,\0)=m- (1)
For the variance of E(rj | f) we have

D\ (rj) = D2 (E(r, \ 0) = E{E2 (r, \ 0) ~ E2 (rj). (2)

Theorem 1. If E{rj) and D2(rj) exist, we have

D2 00 ^ D\ (ri) + E([E(r, \ 0 - rj]2). (3)

Proof. We have
rj - E{rj) = [17 - E(rj [ 0] + [E(rj | 0 - E(g)],
therefore

D2 (rj) = E{[E(rj \ 0 - rj]2) + D\ (rj) + 2E((rj - E(rj | 01 [E(r, \ 0 - E(rj)]). (4)

By (2) and (6) of § 4

E([tj - E(rj | 0] \E(n I 0 ~ E(tj)]) = E([rj - E(rj \ 0] E(rj | 0) -

= E(E([rj - E(rj \ 0] E(r, \ 0 \ Q) = E(E(r, \ 0 E(r, - E(r, \ 0 \ 0) = 0.

Thus (4) implies (3).


276 MORE ABOUT RANDOM VARIABLES [V, § 6

Remarks.

1. It was implicitly shown in proving (3) that the random variables


rj — E(r] | £) and E(rj \ £) are uncorrelated.

2. The assertion of Theorem 1 may be written in the form

D* 00 = d1 (E(n 10) + e(d* (n 1 {)) (5)


where £>2(r/ | £)isthe conditional variance of tj fora given value of defined by

D2 (-11 {) = £([,-£(■, | «]2R).

According to Formula (2) of § 4 we have thus

£(Z)2 (r, 10) = £(£([1/ - E(n | Of 10) = £([»/ - £fo1 Of)-


Assuming D(ff) > 0, put
1?
II

(6)
AC

Then by Theorem 1
0 <Kffrj) < 1. (7)

Kt(rj) will be called correlation ratio of tj with respect to it is defined


only if D(rf) > 0. This notion was introduced by K. Pearson in a somewhat
less general form, and in full generality by A. N. Kolmogorov. It gives
a certain information about the mutual dependence of £ and rj. This is
shown by the following two theorems.

Theorem 2. If £ and rj are independent, Kfrj) = 0. The converse, however,


does not hold: Kfff) = 0 does not imply the independence of £ and rj, though
it implies the vanishing of the correlation coefficient R(g, rj).

Theorem 3. The relation Kfrj) = 1 is valid iff rj = g(£), where g(ff) is


a Borel-measurable function.

Proof of Theorem 2. If f and rj are independent, Dfrj) = 0, hence


Kfrj) = 0. If Kfrj) = 0, E(r\ | f) is equal to E(rj) with probability 1, therefore
by relation (8) of § 4

£(£»,) 10)=£(0 £00,


thus /?(£, rj) = 0. The following example shows that Kfrj) = 0 does not
imply the independence of £ and rj: Let the point (f, rj) be uniformly
v, § 6] THE CORRELATION RATIO 277

distributed in the circle x2 + y2 < 1; let g(y \ x) be the conditional density


function of r] with respect to the condition £ = x. We have

9(y\x) =-. 1 - for \y\<s/\-x2; -l<x< + l,


2 •>/1 — x2
hence E(rj \ £) = 0 and, consequently, Kfifi) = 0 though l and r\ are
evidently not independent.

Proof of Theorem 3. If Kffi) — 1, (3) shows that

E(ln- £(>|lf)]2) = 0,

hence, with the probability 1,

r, = E(rj\0; (8)
1/ is thus measurable with respect to | and therefore it can be written in
the form rj = g(£). Conversely, if r\ — gif), then *1 is measurable with
respect to thus (8) is valid with probability 1, therefore it follows
that Kfifi) = 1.
Unlike the correlation coefficient, the correlation ratio is not symmet¬
rical. To characterize the dependence between ^ and rj both quantities
Kfit]) and Kff) can be used, provided that the variances of both £ and rj
exist and are positive. The conditional expectation E(rj | if) can be charac¬
terized by the following property:

Theorem 4. If £ and i; are any two random variables and D2(fi) is finite
and if g(x) is a Borel-measurable function, then the expression

takes on its minimum for g(f) = E(tj \ if).

Proof. By Formula (2) of § 4

E(ln - <?C)]2) = E(E(in - g(i)f I «))■ (9)

It follows by a basic property of the expectation (see Theorem 2 of § 9,


Ch. Ill) that

E([« - 0({)f I {) = K-l ' UK))2 dPe S K"! “ E(l I 0)* dpf <1 °)
It follows from (9) and (10) that

E([n- 9«)]2)s£([>i-£(-iU)J2). (ii)


278 MORE ABOUT RANDOM VARIABLES [V, § 6

q.e.d. Equality in (11) can occur, if and only if the relation

g(0=E(r,\{)

is valid with probability 1.

Remark. The curve y — E(tj \ £ — x) is called the regression curve of rj


with respect to f

In particular, it follows from Theorem 4 that for any two real numbers
a and b

E([>1 - K + b)f) > E([r, - E(rj [ £))2> (12)

The left-hand side is minimal for

DM
U = h = E{r]) ~ ■ 03)
The line
D(n)
y-m = R(t, n) [x - E(0] (14)

is called the regression curve of rj with respect to f If a and b are given by


(13), we have

E([ri-(ai; + b)f) = D\rj)[\ - R\Z,rj)}. (15)

On the other hand, because of (3) and (6),

£([//-£(>|K)]2) = .D2('l)[l -K\m- (16)

From (12), (15) and (16) it follows that

R2(£,rj)< Kl(r]). (17)

This permits to restate the proposition of Theorem 2: If Kfrj) = 0 then

V) = 0. Inequality (17) may be sharpened as follows:

Theorem 5. If £ is an arbitrary random variable and >i a random variable


with finite expectation and variance, then

Kt(ri)= SUP R2(g(Z),ri), (18)


v, § 7] DEPENDENCE OF TWO RANDOM VARIABLES 279

where y = g(x) runs through the set of all Borel-measurable functions for
which the expectation and variance of g{£) exist. The relation

Kl(>i) = R2(g(0,ri) (19)


holds, iff
g(0 = aE(r,\0 + b, (20)
where a A 0 and b are constants.

Proof. One can assume without restriction of generality that E(rj) =


= E(g(f)) = 0 and /)('?) = D(g(f)) = 1. By (2) and (6) of § 4 and by the
Schwarz inequality,

R-(g(0,i) = E2 (m(0) = e2 (£(ne(01 {)) = e- 0© E(, | f)) <


<E(E\n\t)) = K^n).
The condition for equality is here easily verified.
Theorem 5 permits to give a new definition of the correlation ratio
Kftj) and even a new definition of the conditional expectation E(t] | if).
Certainly, Formula (19) defines E(rj | £) = g{f) up to a linear transformation
only. But it is easy to obtain a unique definition. In effect, E{rj | f) may be
characterized as the function gff) fulfilling the relation

R2(t],ffo(0)= sup R2(h, g(0) = R\ (h) (21)


9
and the relations
£WO) = E(nV
D\9°(0)=D2(ri)KfW, (22)
E(tg„(0)> 0.

§ 7. On some other measures of the dependence of two random variables

Another measure of the dependence of two random variables is given


by the contingency. This notion was introduced for discrete distributions
by K. Pearson (mean square contingency).1
Let l and r\ be discrete random variables assuming the values xk (k =
= 1, 2,. . .) and yj(j = 1, 2,.. .), and only these with positive proba¬
bilities. Let Ak and Bj denote the events f = xk and rf = y} respectively.
The contingency <p(f rj) is defined by
1
[P(A^)-n^)^)]2 2
<P(5> h) = (1)
j k R(dk) P(Bj)

1 For the general case see A. Renyi [28].


280 MORE ABOUT RANDOM VARIABLES [V, § 7

or with an obvious transformation,

<r2 (£» n) = E X
'tP(Ak)P(Bj)
(2)

It is clear that <p(£, i/) is zero iff £ and tj are independent. If the number
of the values xk is n and that of the y,-s is m with m > n, then

(p2^,r])<n-\, (3)

as because of P(AkBj) < P(Bj) it follows from (2) that

1 <«>

It can be seen from (4) that in (3) the sign of equality holds iff for every
/c and for every j either P(Ak Bj) = P(Bj) or P(AkBj) = 0. Since, however

tp(_AkBj) = P(Bj) 0),


k=1

this cannot occur unless for one kj the relation P(AklBj) = P(Z?y) and for
the other k ± kj the relation P{AkBj) - 0 is valid. But then £ = xkj for
rj = yj and consequently £ = f{r\).
Conversely, if f = /(>?), rj) = n - 1. If both £ and 17 assume
infinitely many values, the series on the right of (1) may be divergent;
in this case <p(£, rf) = +00.
Before defining the contingency for arbitrary random variables, the
notion of regular dependence will be introduced. Let £ and 1/ be any two
random variables. If C is an arbitrary two-dimensional Borel set, we put

P(C) =P(({,„) 6 C). (5)

Let and B be Borel sets on the x-axis and the y-axis respectively; put

P,(A) = Ptf-'CA)) (6a)


and
P2(B)=P(rj-'(B)). (6b)

Let A x B denote the set of the points of the (x, y)-plane for which x £ A
and 7(5. Define the measure Q{C) for the two-dimensional Borel sets
of the form C = A x B by

Q(A x B)=P1(A)P2(B). (7)


V, § 7] DEPENDENCE OF TWO RANDOM VARIABLES 281

This measure can be extended to all two-dimensional Borel sets of the


plane in a unique manner, since the values of its extension are uniquely
determined by the values on the “parallelograms” A x B.
If P is absolutely continuous with respect to Q, the dependence between
E and y\ is said to be regular. This is evidently the case if E and g are
independent, since then P = Q. It is easy to see that the dependence between
two discrete random variables is always regular.
If the dependence between E and 17 is regular, there exists according
to the Radon-Nikodym theorem, a Borel-measurable function k{x, y) =
dP
= —-— such that for every two-dimensional Borel set C the relation

P(C) — f k(x, y) dQ (8)


c

holds. If F(x) and G(y) are the distribution functions of E, and 17, respectively,
and if A and B are any two Borel subsets of the real axis, then the function
k{x, y) satisfies the relation

P(UA,r,£B)= \ f k(x, y) dF(x) dG{y). (9)


xiA y\B

In particular, if E and g are discrete random variables,

P{AkBj)
for x = xk, y = yp
P(.Ak) P(Bj) (10)
Kx, y)
0 otherwise.

If the joint distribution of E and r\ is absolutely continuous with the density


function h(x, y) and if f(x) and g(y) are the density functions of £ and 17
respectively, then we evidently have

h(x, y)
k(x, y) (11)
.f(x)g(y) ’

We can now define the contingency for arbitrary regularly dependent


random variables E and rj by
+ 00+00 r

y(«.»l) = (.f j' [k(x,y)~ lfdF(x)dC(y)y (12)


— 00 — 00

or equivalently by
+ oo +00
<p2 (E, n) = f f k2 (x, y) dF(x) dG(y) - 1. (13a)
— oo—oo
282 MORE ABOUT RANDOM VARIABLES [V, § 7

In particular, if £ and rj are discrete random variables, relation (1) is


obtained from (12), because of (10). If the joint distribution of £ and rj
is absolutely continuous, we obtain, because of (11),
+ 00+00

h\x,y)
dxdy — 1. (13b)
f(x)g(y)

Obviously cp{f, rj) = 0 holds iff P = Q, i.e. if £ and t] are independent.


Now we prove a theorem establishing a relation between the correlation
coefficient and the contingency.

Theorem 1. Let £ and rj be regularly dependent random variables, u(x)


and v(y) Borel-measurable functions such that the variances D\u{£)) and
D\v{r\)) exist and are positive. Then we have

R2 («(0> v(p)) < (p2 ( f rj). (14)

Proof. We may assume without restricting the generality that


R(u(0) — E{y(r\j) — 0 and D(u(f)) = D(v(r/)) = 1. Then by the definition
of the correlation coefficient,

+ 00+00

i?(w(£), v(r\j) = J J u(x) v(y) k(x, y) dF{x) dG(y). (15)


— 00—00

But by assumption
+ 00+00

j j u(x) v(y)dF(x) dG(y) = 0. (16)


— oo —oo

From (15) and (16) follows

+ 00+00

i?(«(0, v(rj)) = J
— oo
j u{x) v(y) [/c(x, j) - l] dF(x) dG{y).
—00
(17)

By applying the Schwarz inequality and (12), we obtain

which proves the theorem.


The quantity
iA(£, h) = sup | R(u(£), v(ri)) |, (18)
u,v

where u(x) and v{y) assume all Borel-measurable functions for which
expectations and variances of u{f) and v(rj) exist, can also be considered
V, § 7] DEPENDENCE OF TWO RANDOM VARIABLES 283

as a measure of the dependence between £ and rj. This quantity is called


maximal correlation of £ and rj and was introduced first, for discrete random
variables, by Hirschfeld, for absolutely continuous distributions by Gebelein.
Its most simple properties are contained in the following theorem:

Theorem 2. If iJ/(£, tj) is the maximal correlation of c, and rj, defined


by (18), we have always

a) =
b) 0 < tj)< 1;
c) tf y — a(.x) and y = b(x) are strictly monotonic functions, then

<A0(0, m) = v);
d) ij/(£, rj) = 0 iff c and tj are independent;
e) if there exists between £ and tj a relation of the form U(fi) = V(rj),
where U(x) and V(y) are Bor el-measurable functions with D(U(fi)) > 0,
then iK£, rj) = 1;
f) we have

I h) I ^ min (K< (rj), Kn (£)) < max (K\ (rj), Kn (0) < <K£, f) ^ rj).

Proof. Properties a), b), and c) are direct consequences of the definition.
If £ and rj are independent, clearly i//(£, rj) = 0. Conversely, if

<A(£> f) = 0, then R(u(if), v(rj)) = 0

for every u and v. If we choose

1 for x < a, 1 for x < b,


ua (x) = vb (x) =
0 for x > a. 0 for x > b,

it follows from R(ujf), vb(rj)) = 0 that

P(£ < a, rj < b) = P(£ < a) P(rj < b).

As a and b are arbitrary this means that £ and rj are independent, hence
d) is proved. If U(f) = V(rj) with D(U(f)) > 0 we know that
R(U(0, V(rj)) = 1,
hence i//(£, rj) = 1. Property f) can be deduced by comparing the defini¬
tion of maximal correlation to Theorem 1 of this § and to Theorem 5
of § 6.
A further notion which we want to study here is that of the modulus
of (pairwise) dependence of a sequence of random variables. Let
£l5 . be a (finite or infinite) sequence of arbitrary random
variables. We define the modulus of dependence of the sequence {£„} as
284 MORE ABOUT RANDOM VARIABLES [V, § 7

the smallest positive real number A satisfying for all sequences {xn} with
Z *^ < +oo the inequality

11
n
Z
m
(J xn xm) I ^ A Z xl
n
(19)

i.e. the least upper bound of the quadratic form

Z Z W*
n m
xn Xm

under condition Z** = 1- (If it is unbounded, the modulus of dependence


is infinite.)
In particular, if the sequence {£„} contains only two elements, its modulus
of dependence will be 1 + i//(£i, £2)- If the sequence {£„} is finite, A is
finite as well; if the sequence contains infinitely many elements, A is not
necessarily finite. If the elements of {£„} are pairwise independent, A — 1,
otherwise A > 1.
The following theorem furnishes an inequality between the correlation
ratio and the modulus of dependence of a sequence of random variables.

Theorem 3. Let {£„} be a sequence of random variables with a finite


modulus of dependence A. If q is a random variable with a finite variance,
we have
Y,KlM<A. (20)
n

For the proof we need a lemma which is a generalization of the Bessel


inequality, well known from the theory of orthogonal series, to the case
of quasiorthogonal functions.

Lemma. Let {C„} be a finite or infinite sequence of random variables such


that exists (n — 1,2,...). Suppose that the quadratic form

n m

is bounded and that we have

IEn m
I* PM; n
(2i>
then for every random variable rj for which Eiyf) exists we have

Y.E2(h O&BEtf). (22)


n
v, § 7] DEPENDENCE OF TWO RANDOM VARIABLES 285

Note. If E(t„ c,„) = 0 for n * m and if E(0 = 1, i.e. if the sequence


{C„} is orthonormal, (21) is valid with B = 1 and (22) reduces to Bessel’s
inequality

X£2070<£0r). (23)

Proof of the Lemma. Put

= m,)- (24)
Obviously,
1
n - > 0. (25)

By carrying out the calculations, we find

E(<f) - ~ I cl + -L y X a, am£(C„ U > 0. (26)

Because of (21) we have

1
(27)

hence it follows from (26) that

~ Z a\ S £(,=), (28)

which is, by (24), equivalent to (22).

Proof of Theorem 3. Let fn{pc) be a Borel-measurable function such


that E(fn(U) = 0 and = 1. Put

£„=/„(£„)• (29)

Then by definition of the maximal correlation

I £(C„ UI = I s(f, ({„), fm(U) I £ «{„ {J. (30)

Hence according to (19)

I Z Z £(C. U *„
n m
| < ££n m
«{„, UI x.\ • I x„, I < A Z
n
(31)
Thus the lemma can be applied to the sequence {£„} with B = A, provided
286 MORE ABOUT RANDOM VARIABLES fV, § 8

that E{tf) exists. Then

££2(C.[>i -£(i;)]) <AD\Vy (32)


But
£(?„ [r, - £(,)]) = £>(»,) «(/„({„), r,)- (33)

Let fn(x) be chosen such that

E(n \U~E(r,)
(34)
JnKn) Din{r()

where D^irj) = B(E(rj | £„)) is the standard deviation of E(rj \ £„). Then
according to Theorem 5 of § 6

*2 (/„&,)>-/) = *!, to- (35)

Hence by (32) and (33)

D\t,)Y.KUtj)<AD\,i). (36)
n

After division by D\r\) we obtain (20) and thus Theorem 3 is proved.


Theorem 3 is a probabilistic generalization of the large sieve of Yu. V.
Linnik, which has important applications in number theory.1

§ 8. The fundamental theorem of Kolmogorov

In what follows, we shall often prove theorems concerning an infinite


sequence of random variables. The conditions of these theorems involve
the simultaneous distributions of a finite number of the random variables
considered. We shall not prove for each particular theorem the existence
on some probability space of a sequence of random variables fulfilling the
assumptions of the theorem; the solution of this existence problem is
furnished by a general theorem due to Kolmogorov. Kolmogorov proved
this fundamental theorem for an arbitrary (not necessarily denumerable)
set of random variables; we restrict ourselves to the case of denumerable sets.

Theorem 1. {Kolmogorov's fundamental theorem). For any integer n let


Fn (vq, x2, . . . , xn) be an n-dimensional distribution function, fulfilling the

1 With the help of this generalization Renyi succeeded to prove that every positive
integer n can be written in the form n — p + P, where p is a prime and P is the product
of at most K prime factors; K denotes here a universal constant. Cf. A. Renyi [2],
v, § 8] FUNDAMENTAL THEOREM OF KOLMOGOROV 287

following conditions of compatibility:

Fn + • • •> +00 ,...,+ 00 ) — /„( Xy, X2, . . ., X„)

(n, m = 1,2,...). (1)

Then there exists a Kolmogorov probability space on which the random


variables (n = 1,2,.. .) can be so defined that for every n the n-dimensional
distribution function of the random variables f2, ...,£„ is equal to
Fn{xi, x2, . . . , x„).

Proof. Let Q be the set of all infinite sequences

OJ = (UJ|, COo, ..., con,...)

of real numbers. /7„ the function, defined on £2, projecting Q upon the
subspace Qtl on the first n coordinates ofco; i.e., for co = (col5 co2,... ,co„,...)
we put
nn CO = («1, (02,. . ., (On). (2)

For A c Q. let H„ A denote the set of all elements of C2„ which can be
brought to the form y = TItl(o with co £ A.
Let now d c .Q„ be any subset of Qn. We shall call the set of elements
0) — (&>!, <u2, . . . , co„ , . . .) such that IJnco = (col5. . . , co„) £ A an n-dimen¬
sional cylinder with base A; we shall denote this set by n~\A).
If A is Borel-measurable, the corresponding cylinder set is said to be
a Bor el cylinder set. Let be the set of all Borel cylinder sets; is an
algebra of sets. To see this let us remark that an n-dimensional cylinder
set is at the same time an (n + w)-dimensional cylinder set as well. In fact

n;\A) = n~lm (/7„+„ n-\A)). (3)

Hence if A is an n-dimensional Borel set, I7„+m (IJ^fA)) is an (n + m)-


dimensional Borel set. Thus for a finite number of cylinders it can always
be assumed that their bases have the same number of dimensions, e.g.
N. If A — n„\A'), B = nff^B'), where A' and B' are Borel sets of the
iV-dimensional space, then

A + B = TIf\A' + B'),

a — b = n^\Ar - By,

A + B and A — B are thus Borel cylinders again. Finally, since Q —


— ny(QN), the set Q itself is a Borel cylinder as well.
288 MORE ABOUT RANDOM VARIABLES [V, § 8

Let be the least o--algebra of subsets of Q which contains all Borel


cylinders of Q. The probability measure P on is defined in the following
manner: First we define P on the Boolean algebra and then we extend
the definition as was done in Chapter II.
Let A be a Borel cylinder of Q and N an integer with A = IJ^\AN),
An being a Borel set of QN. FN(pcx,.. ., xN) generates on QN a probability
measure which we denote by PN. We put P{A) = PN(AN).
The definition is unique, since from

Hn^(An) — nN\M(AN+M)

follows, because of (1),

Pn+m(An+m) = P/vC^nP

Consequently, the definition of P(A) does not depend on the base figuring
in the construction of A.
Clearly, the set function P(A) is nonnegative; it is easy to show that it
is (finitely) additive. If A £ B £ AB = 0, then, because of

A = n~N\AN), B = n~N\BN),

we have ANBN — 0. Hence

P(A + B) —Pn{An + Bn) = Pn(An) + Pn(Bn) = P(A) + P(B).

(We made use of the fact that the value of P(A) does not depend on
the dimension of the chosen base of A.) It is further clear that P{Q) =
— Pn{Qn) — 1. It remains to prove that P{A) is not only additive but
also or-additive on By Theorem 3, § 7 of Chapter II it suffices to show
that P has the following property:
CO

Property K. If An £ An+1(Z An (n = 1,2,...) and Y\ An = 0, then


lim P(An) = 0.
/?—► + OO

We shall show by an indirect proof that property K is fulfilled. The


inequality P(AJ > P(A/i+1), n = 1,2,... is obviously true. Hence
OO

lim P(An) = p exists. Assume p > 0. We show that then D = \\ A„


n^ + co n=i

cannot be empty.
It can be assumed without restriction of generality that An is an n-dimen-
sional cylinder; in fact if dn denotes the exact (minimal) number of dimen¬
sions of A,., we have dn < dn+1. Further lim dn — + oo can be assumed,
n-*- + co
since in the case of d„ <, d our assertion would follow from Pa(A) being
v, § 8] FUNDAMENTAL THEOREM OF KOLMOGOROV 289

a (r-additive measure. If dn -a +oo, we can replace the sequence {Aft}


by another sequence {A'n}, where A'n is an iz-dimensional cylinder and
{A'n} contains the sequence {A,,}.
Put A„ — where Bn is an n-dimensional Borel set. Since
P(A„) — P„(B„) > p > 0, we can find in Qr. a compact set Zn with Z„ C Bt.
such that

A(Z„) - -~r (« = 1,2,...).

Put Cn = n~\Zn). Cn is also a Borel cylinder, and

P(C,) = P„(Zn) > P(A„) - - JL- .

Let now Z)„ = CXC2. . . Cn. We have


n

P(.A, - D.) = P(A. (C, + ... + C„)) < 2 PM, - C*) <
k=l

/c = 1 2 ’

hence

P(D„) = P(A„) - P(A, - B„) £ > 0.

Thus the set Dn cannot be empty for any value of n. Choose now in Dn
a point ai(n) = (o/^, coty ,, ca("}). Then a sequence {«,} can be given
with
lim co(kni) — cok (k = 1,2,...)
y-^ + oo

(G. Cantor’s “diagonal method”). Since all Z„ are closed, for every n

(«1, • • • A-*n) £
OO

hence ai = (coj, co2,..., con,. . .) belongs to D„ and oi £ ]! Ad therefore


«=1
0° 00

Z)„ is not empty. Similarly, A„ cannot be empty either, and thus our
77 = 1 77=1
assumption leads to a contradiction. Hence we must have p = lim P(^„) = 0
n ->- + 00
and P is cr-additive on it follows that the extension of P is cr-additive
on the a-algebra We have proved that [£?, P] is a Kolmogorov
probability space Put therefore

£fc(co) = ojk (k= 1,2,...) for co = (oil5 ...,cok,...). (4)


290 MORE ABOUT RANDOM VARIABLES [V, § 9

Then £k = £k(co) is a random variable on P], since if U is a Borel


set on the real axis, U) is a ^-dimensional Borel cylinder which belongs
thus to and, consequently, to <^€*. On the other hand, obviously

P(& 1 < *i, £2 <X2,..., £„ < xn) = F(xlt x2,...,x„); (5)


the n-dimensional distribution function of the random variables g1} g2,
is thus identical with the function Fn (xl5 x2, . . , x„). Herewith Theorem 1
is proved.
Example. Let {G„(x)}, n= 1,2,... be any sequence of distribution
functions. There can be constructed a probability space and on it a sequence
of random variables ^„{n = 1,2,...) in such a manner that the are
mutually independent and the distribution function of £„ is Gn(x). To see
this it suffices to note that the functions
n

Fn(* i, •••,*„)=n Gk(xk)


k=1

fulfil all conditions of Theorem 1.

§ 9. Exercises

1. Let there be given in the plane a circle Cx of radius R with its center in the
origin, and a circle C2 concentrical with Cx having a radius r < R . Let us draw a
line d at random which intersects Cu so that if the equation of d is written in the form

x cos cp + y sin cp = q,

cp and q are independent random variables, <p being uniformly distributed in (0, n)
and q in (- R, +R) . Let £ denote the length of the chord of d inside C2. Determine
the distribution function, expectation, and standard deviation of £ .

Hint. Let first £ be fixed. Then

for x < 0,

1 x‘
P(£ <x\(p) = r2-— for 0 < a < 2r,
~R

l for 2r < x.

At the point x = 0 the distribution function has thus a jump of the value 1 —— .

This expression being independent of cp, the conditional density function of £


under the condition £ > 0 is

for a < 0 and x > 2r,

Ax) - for 0 < a < 2r.


v, § 9] EXERCISES 291

This leads to

r2n
£(0 =
2Ji
and n2

2. Let d be a line chosen at random as in Exercise 1. Let B be a convex domain


in the circle Cx. Let £ denote the length of the chord of d inside B. Calculate the
expectation of £.

Hint. We have E(£) = E(E| 9?)), where 99 has the same meaning as in Exercise 1.
E(£ | (p) is equal to the integral along the chords of the domain B lying in a given
direction, divided by 2R; for fixation of 99 means restriction to the chords which

form an angle 99
71
+ — with the x-axis. Hence E(£ \ <p) 1*1 |2?| being the area
2R

of B. We see that E(0 = 'll . It is not necessary to require the convexity of B neither
2R
that it be simply connected.

3. Let there be given in the plane a curve L consisting of a finite number of convex
arcs and contained in a circle C of radius R. Choose at random (in the sense explained
in Exercise 1) a line d intersecting C. What is the expectation of the number of the
points of intersection of this line with L?

Hint. Consider first the particular case when L is a segment of length / of a straight
line. In this case the number v of points of intersection is 0 or 1. If 95 is the angle
between the normal to d and the segment L, the expectation under the condition of
/ cos 99
fixed 99 is E(v | 99) = -———. This leads to

71
2

From this it follows for polygons, and by a limit procedure for all piecewise convex
I £ I-
(or concave) curves L, that E(v) =-, where | L | is the length of the curve L.
tzR

4. Calculate £(£") for n — 2, 3, .. . the conditions of Exercise 1

Hint. We have
7V
2
2n rn+1 (
E(£n) = ——— I sinn+1 # d&.
0
Note. Exercises 1 to 4 present well-known results of integral geometry1 from a
probabilistic point of view.

5. Establish the law PV= RT for ideal gases on the basis of the kinetic theory
of gases. V denotes here the molar volume, P the pressure, T the absolute temperature
of the gas, further R = Nk where N is Avogadro’s number and k Boltzmann’s constant.

1 Cf. W. Blaschke [1],


292 MORE ABOUT RANDOM VARIABLES [V, § 9

Hint. The pressure of the gas is equal to the expectation of the quantity of motion
imparted by the molecules of the gas during unit time to a unit surface of the vessel
wall. We assume that the shocks are perfectly elastic. If a molecule of mass m and
velocity v strikes the wall in a direction which forms an angle ft with the normal
vector of the wall, then the quantity of motion imparted by the molecule will be
2 mv cos ft. In order to strike a unit surface K of the wall during a time interval
(t, t + 1), the molecule of velocity v moving in a direction which makes an angle
with the normal vector to the wall has to be included at the time t in an oblique
cylinder of (unit) base^f and height v cos ft. Under the assumption that the molecules
are uniformly distributed in the recipient, the probability of the shock in question
v cos ft-
is ——— , where W denotes the volume of the vessel. Hence the expectation of

the quantity of motion imparted to the wall by the considered molecule will be
2 v~m cos2 ft 4 e cos2 ft
-—-= -—-, where e is the kinetic energy of the molecule. The
W W
4e cos2 ft
quantity-—— is a random variable. Hence we have to calculate its expectation.
W
(Here the relation | ??)) = £(c) is to be applied.) If the velocity components
are supposed to be independent and to have normal distributions with the density
1 f x2 \ jkT
function-= exp — ——- where a — . — , then ft and e are independent and
a^J In \ 2a I V m
the distribution of the direction of the velocity vector is uniform. Hence

4e
E cos-1 4r E(e) £(cos2 ft).
~W

3 1
We know already (Ch. IV, § 17, Exercise 29b) that£(e) = — kT. Since £(cos2 ft) =
eT’
we find
kT
E cos2 ft
W

for the expectation of the “pressure” exerted upon the wall by one molecule. Since
there are N molecules in a gram molecule of gas, we find for n gram molecules, because
of the additivity of the expectation, the value

p_ nNkT NkT RT
~ w ~ ~ ~V ’

w
where V —- is the molar volume and R — Nk the ideal gas constant.

6. Let £t, £2, be independent random variables uniformly distributed in the


interval (0, 1). Let them be arranged into an increasing sequence and let the Ar-th
element of this sequence be denoted by £*.
a) Show that the conditional density function of £*, £*, . . with respect to
the condition ££+1 = c is given by

for 0 < xx < x2 < .. . < xk < c,


f &C*1> *2> • ■ ■> Xk | — c) —

0 otherwise.
V, § 9] EXERCISES 293

b) Show that under the condition £* + 1 = c the random vectors and


(^* + 2,. .£*) are independent.

7. Let £x, £2,. . ., . . . be independent random variables. Consider the sums


C„ = £t + £2 + . . . + C„ . Show that under the condition C„ = x the random variables
Ck and C; are independent for k < n < l.

8. Let the random vector (£, rj) have the normal density function

A(*. y) = V /1C - J32


exp j (Ax2 + 2Bxy + C/)
2tt

Prove the following relations:

V) =-7= .
^ AC

E(y | S) = - -|r I,

£■(11 y) = — -4-1?,

l*l_
*„(!) = (»?) = I m y) I
y/AC~'

9. If the random vector (£, 57) has a nondegenerate normal distribution, show that

71-:. 7) = . ^ — and 7(7 7) = | r |,


V1
where r = 77) is the correlation coefficient, g?(£, ??) the contingency, and ip(£, y)

the maximal correlation of the random variables £ and rj.

10. If the functions a(x) and b(x) are strictly monotone, then

<p(a(0, b(rjj) = (pit, rj) .

11. If (<J, 77) is uniformly distributed in a circle, </'(£>’?) =

12. If £ and r] are the indicators of the events A and B, i.e. if

| 1 for to £ A,
£(«) =
[ 0 otherwise,

] for a> 6 B,
y( co)
0 otherwise,
then
V(£, y) = <P2(£,»?) = ATK*?) = K'jiS) = R2(l rj) =

[P(AB) - P(A) P(B)]2


~ P(A) [1 - P(A)] P(B) [1 ->(£)['’

provided that 0 < P(A) < 1 and 0 < P(B) < 1.


294 MORE ABOUT RANDOM VARIABLES [V, § 9

13. Prove the following variant of Bayes’ theorem: Let £ be a random variable
with an absolutely continuous distribution with the density function f(x) and let r\
be a discrete random variable. Let yk (k — 1,2,...) denote the possible values of
rj and pk(x) the conditional probability P(i) = yk \ £ = x). Let f k(x) be the conditional
density function of £ given rj = yk. We have

fk(x) = -.
J Pk(Of(t) dt
— 00

Hint. By definition

J Pk(x)f(x) dx = P(| 6 A, rj = yk),


A

hence

J Pk{t)f{t)di
P(g < x,rj = yk) — CO
P(£ < x\rj = yk) =
nv = yk)
J Pk(t)f{t) dt

14. Suppose that the probability of an event A is a random variable with density
function p{x) (p(x) — 0 for x < 0 and 1 < x). Perform n independent experiments
for which the value P(A) = l; is constant and denote by rjn the number of the experi¬
ments in which A occurred. Let pnk(x) be the conditional (a posteriori) density function
of £ with respect to the condition rjn = k {k = 0, 1, 2,... n); according to the preceding
exercise
xk(\ — x)n kp(x)
Pnk(x) =

J tk{ 1 - t)n-kp{t)dt
0

a) Show that if t, has a beta distribution of order (r, s), then ^ has under condition
V„ — k a beta distribution of order (k + r, n — k + s).
b) Ifp(x) is continuous and positive on (0, 1) and if /is a constant (0 </< 1),
then
■V2
/(i -/) /(I -/) 1
lim ■ Pn. [fit] /+ y e 2
n —► -f co
v'27r
15. Let C be a random variable and let £t, ^2» • • •, be random variables which
are for every fixed value of C independent and have a normal distribution with
expectation t and standard deviation a {o > 0 is a constant). Let p{x) be the density
function of £ . Study the conditional density function pn(x |y) of C under the con¬
dition

+ • • • +£„

and show that if p(x) is positive and continuous, we have, for fixed x and y,

Pn \y + -= y
lim V" 2(72 .
n —► -j- oa
Vn
V, § 9] EXERCISES 295

16. Let £ be a random variable with an exponential distribution. For every given
value of £, let £t, £2 , ..£„ be independent normally distributed random variables
with expectation £ and standard deviation a > 0 . Determine the conditional distri¬
bution of £ with respect to the condition

fi + £i + * • . +
17. Let /i be a random variable having the density function p(t). Let for every given
value of g the random variables £ls. . ., c„ be independent, normally distributed,
with expectation g and standard deviation a > 0. Show that £lf. .. are exchangeable
(cf. Ch. IV, § 17, Exercise 18).

18. Let , . .., be independent random variables having the same distribution
and finite variance. Put t]n= £t + £2 + • • • + 5« (« = 1,2,..., TV). Calculate the
correlation ratio Knn (?;N) (n < TV).

19. Let fj,. .., £n be independent random variables with the same distribution.
Put rj„ = £, + . . . + ln (n = 1,2,..., TV). Calculate the contingency

<p(y„, Vrn) (n <m< TV).


20. Let the random variables £1, £2,..£„ be independent and uniformly distributed
in the interval (0, 1). Let £* denote the &-th order statistic of the sample £1}. ..,
(See Exercise 17 of Ch. IV.) Compute (£,*) and <?(£*, <^*) for k < l < n.

21. Suppose that the probability ^ of an event A is a random variable on a con-

ditional probability space. Let g{t) = ---(0 < t < 1) be its density function.
*(1 - 0
Let p be constant during the course of n independent experiments and let r\„ denote
the number of those experiments in which the event A occurred. Calculate the
a posteriori density function and the conditional expectation of the random variable
p with respect to the condition tj„ = k (0 < k < n).

Hint. According to Bayes’ theorem the a posteriori density function of p with


respect to the condition rjn — k is
pk~l(\ - p)n-k~x
9k(j>) -
- t)n-k~'dt
0

the “a posteriori distribution” of p is thus a beta distribution of order (k, n — k)


k
and the conditional expectation is — .
n

22. Let £ be a random variable with Poisson distribution and expectation A, where
A is a random variable with a logarithmically uniform distribution on (0, + 00).
Calculate the a posteriori density function and the conditional expectation of A with
respect to the condition £ = n > 1.

Hint. Bayes’ theorem gives for the conditional density function of A with respect
to the condition £ = n:
Xn-le~l
9n0) =
(« - 1)!

the a posteriori distribution of A is thus a gamma distribution of order n.


296 MORE ABOUT RANDOM VARIABLES [V, § 9

23. Let the random variable £ have a normal distribution N(ju, a), where [i is a
random variable uniformly distributed on the whole real axis. Determine the a posteriori
density function and the expectation of /li with respect to the condition £ — a.

Hint. According to Bayes’ theorem

1 0* - a)2
9(j* I £ = a) = exp
2(72

The a posteriori distribution of fi is thus a normal distribution with expectation a


and standard deviation a.

24. Let £[, £2) ..., £n be independent random variables having the same normal

distribution N(m, o). Put £„ = ^ ‘"2 . Determine the conditional


n
distribution of (£1; £2, • • £„) for a given value of £„ .

Hint. By assumption, the n-dimensional density function of the £k(/c = 1,2, .... n) is

ycvj,..., xn~) exp X (Xk -


2a2 tx
(Jl7l)n
Put
1 "
X = —Y/ Xk.
n k=l

We have

X (** - m? = V (x* - xf + n(x - mf,


k=l k=l
hence
1 n{x — m)2 1
f(Xl, ■ ■ ; X„) = / " exo — X
/ 2n CXP 2 02 (o^2n)n 1 Jn

X exp E
2ff2 k= 1
The density function of C„ is

1 / n n(x — m)2
9{x) — exp
2a2

hence the conditional density function of the random vector (£t, . . ., £„) for £n = x is

1
7-=-exp
(cr j2ji)n~l j n 2^ <** “ *>’

This function does not depend on in; a property which is expressed by saying that
t„ is a sufficient statistic for the parameter m.

25. a) Let there be given n independent random variables £u with the same
normal distribution N(0, a). Put

£=~ y, ^ n k=1
and t = x & - o2-
k=1

Show that £ and r are independent.


V, § 91 EXERCISES 297

Hint. Let (cik) (;, k — 1,2, , n) be an orthogonal matrix with clk =

{k = 1, 2, n). Put
n

Vi= Y cik ^ O'= 1.2,


A:= 1

Then — / n C and

E
/=
1
= E^ A= 1
+Ek—l
02.

hence

t= E 1=2
rf-

We know (cf. Ch. IV, § 17, Exercise 43) that rju...,rj„ are independent normally
distributed random variables with expectation 0 and standard deviation a; hence

£ = and r = Y rjf
V" ;=2
are independent; r has a /^distribution with (« -- 1) degrees of freedom.
b) Let ..., £„ be independent random variables with the same normal distribution.
Let the expectation [jl and the standard deviation o of the £k be independent random
variables on a conditional probability space, p. being uniformly distributed on the
whole real axis and a logarithmically uniformly distributed in the interval (0, + Co).
Put

C =
£l + £t + ...+JJL and T = £ (4 _ 0£.
k= l

Determine the a posteriori distribution of p and o" under condition C — x and t — z.

Show that given these conditions a and -— are independent.


a

Hint. The density function of the vector (p, o2) with respect to the condition
£ = x, r = z is, according to Bayes’ theorem and the result of Exercise 25 a)

n— 1 z n(x - ,m)2]
z 2 exp exp
n ' 2o2 2 a2
2n
2 2 on+'r

X • LL
thus o and - are independent.
a

26. Let there be given a sequence of pairwise independent events A„ (n = 1,2,...)


with P(A„) > a > 0 (n = 1,2,...) and an arbitrary random variable rj of finite
variance. Show that
lim E(r] | A„) = E(rj). (1)
n-*- + co

If r) is the indicator of the event B (0 < P(B) < l), it follows from (1) that

lim P(B | A„) = P(B). (2)


n -*■ -f oo
298 MORE ABOUT RANDOM VARIABLES [V, § 9

Hint. Let be the indicator of the event A. We have

P(An) _ [E(V 1 An) - E(r,)f


KUn) =
1 - P(A„) D\rj)

hence, by Theorem 3 of § 7,

00 P(A )
I 1 1^1 A„) - E(V)r < DHv).
n=l 1 - P(A„)
Thus

lim
+CO
,
1 r\A„) Wv I An) - E(v)]2 = 0,
which proves (1).
Remark. Cf. Ch. VII, § 10, Theorem 1.

27. Let a sequence of pairwise independent events A„(n — 1, 2, . . .) be given and


assume

E
n= 1
P(A„) =

Let

B = lim sup An = fj ( £ Ak)


co n= 1 k=n

denote the event that infinitely many of the events A„ occur simultaneously. Show
that P(B) = 1.

Hint. Let C be any event with 0 < P(C) < 1. Like in Exercise 26, it follows from
Theorem 3, § 7 that

E T-P(An) [P(C 1 An) ~ P{C)Y ~ P(C) [1 “ P{C)l


co
Since P(A„) diverges, clearly

lim inf [P(C | An) - P(C)]2 = 0. (1)

Apply (1) to C — Ck — ^ A„. Obviously, P(Ck) > 0. It follows from (1), in view of
n=k

P(Ck | A„) = 1 for n~> k, that P(Ck) = 1; hence P(Ck) — 0. Since B = ]~| Ck, we
*=i
have B = ^ Ck and hence P(B) = 0, which is equivalent to P(B) = 1.
k=\

Remark. The assertion of Exercise 27 is a sharper form of the Borel-Cantelli


lemma (cf. Ch. VII, § 5).

28. Let £ and r] be arbitrary random variables, f(x) and ^(x) Borel-measurable
functions such that

E(f(0) = E(g(rj)) = 0, D(fiO) = D(g(V)) = 1


v, § 9] EXERCISES 299

and

R(m, g(r,j) = E(f(0 *0?)) = Hi, V),

or to put it otherwise, suppose that R(u{0» t>0?)) assumes its maximal value for u = f
and v = g. Then the following equations hold with probability 1, where A = \p (£, rj)\

E(Ki) | v) = te(rj) 0)
and

E(g(v) 1 i) = A/(0, (2)

hence also

E(E(f( 0 I 0 u) = A2/(0 (3)


and

£(£Q?(0 101^)= A^(^). (4)

Hint. We have

Hi, V) = E(m)g(v)) = E(Em)g(rj) | 0) = E(f{t)E(,g(rm) ,

hence according to Schwarz’ inequality

c, t?) < e(e2gkoio) - e>2.

On the other hand, if /*(£) = E(g(^p\^ £ujgjs E^f*(£)^ = 0 and Z>(/*(0) — 1> then

*(/*«), *fo)) = E(f*(Og(v)) < Hi, v) ■


But as

E(nsM,» = = D,

we conclude that D- < Hd, rj). Hence Z)2 = ip2(Z, rf). Since in Schwarz’ inequality
equality holds only in the case of proportionality, we must have E^g(rj) | £) = A/TO
which proves (2). But

E(f(OE(g(V) 10) = Hi, V) •

On the other hand, by (2)

E(mE(g(rj) | 0) = A£ (/2(0) = A,
hence A = i/<£, i?). Equation (1) is proved in a similar way.
29. With the notations of Exercise 28 we have

£(/(0 I gTO) =
and

1/(0) = A/(0.
Hence the regression curves of £* = /(0 with respect to ??* = g(rj) as well as that
of rj* with respect to i* are straight lines (or, as it is expressed, the regression of i*
and r]* is linear).
300 MORE about random variables [V, § 9

Hint. The proof is similar to that of Exercise 28.

30. Let L\ be the set of all random variablesM) such that f(x) is a Borel-measurable
function with £(/(£)),— 0 and £(/2(c)) is finite. If we put

(fi(a MO) = E(fiiOMO),


L\ is a Hilbert space. Further we define AM), for /(£) £ L\, by

AM) = E(E(M)\r]M).

Show that Af(f) = /,(£) belongs also to L\ and the linear transformation AM) of
the space L\ is positive and symmetric, i.e. it fulfils the relations

(AMXM)) > 0 and (AM), g(0) = (M), Ag(O).


CHAPTER VI

CHARACTERISTIC FUNCTIONS

§ 1. Random variables with complex values

Characteristic functions are useful analytic tools of probability theory,


especially for proving limit theorems. This Chapter presents the definition
and the properties of characteristic functions; the following two chapters
will deal with the limit theorems themselves.
The characteristic function of a random variable £ is defined as the
expectation of the complex valued random variable e1^. Thus we have
to study complex landom variables first; we shall see how theorems on
real random variables can be extended to complex random variables.
If £ and 17 are real random variables, we say that the quantity £ = £ + “7
is a complex random variable. The distribution of £ can be characterized
by the joint distribution of £ and rj.
We define the expectation of £ = £ + Irj by

E(Q=\CdP, (1)
h
which implies
E(0 = E(0 + iE(r1). (2)

The random variables Ci = + “7i and £2 = £2 + “72 are said to be


independent if the two-dimensional random vectors (£x, tfo) and (£2, V2)
are independent. The independence of several complex random variables
is defined in a similar way.
If Ci, C2, • • • , C„ are independent complex random variables and if
the expectations £(£*) (k = 1,2,...,«) exist, one can see at once that

e( fi q=n=1 %)•
k=1 k
w
If A(x) = a(x) + ib(x) is a complex valued Borel function of the real
variable x and £ is a real random variable, further if the expectation of
£ = A(£) exists, then the latter can be calculated by

E(0= J A(x)dF(x), (4)


— 00
302 CHARACTERISTIC FUNCTIONS [VI, § 2

where F(x) is the distribution function of In fact, according to Exercise


47, § 17 of Chapter IV,
+ CO +00

£(0 = j a(x)dF(x) + i J b(x) dF(x).


— CO — CO

It is easy to prove that for every random variable with complex values

|£(C)|<E(|C|). (5)

§ 2. Characteristic functions and their basic properties

We define the characteristic function of a random variable £ as the


expectation of e'^1; thus it is a function of the real variable t. It is denoted
by <pft). Thus by definition

<Pt(t) = E(eit'). (1)

According to Formula (4) of § 1

9^(0 = J eixt dF(x) (2)


— 00

where F{x) is the distribution function of £; hence <p((t) is the Fourier-


Stieltjes transform of F{x). If the distribution of £ is discrete and f assumes
the values xk (k = 1,2,...) with probabilities pk(k = 1,2,.. .), then
can be written in the form
00

= PkeitXk- (3)
k=i v y

If the distribution function of ^ is absolutely continuous with the density


function f{x) = F\x), we have

^(0=J eUxf(x)dx. (4)


— 00

Hence 9oft) is the Fourier transform off(x).


Thus we see that the characteristic function of an arbitrary random
variable depends only on its distribution; characteristic functions of random
variables with the same distribution are identical. The function defined
by (2) can thus be called the characteristic function of F(x) (as well as the
characteristic function of a random variable with distribution function
F(x)).
VI, § 2] BASIC PROPERTIES 303

First of all, let it be noted that every distribution function has a charac¬
teristic function since the Stieltjes integral (2) exists always, in view of
1 eUl | = 1.
If £ assumes positive integer values only, with

=k) = pk (k = 0,1,...),
we see that

9»«(0 = Z Pk eik< = ^feu),


k= 0
where

G’«(z) = Z Pkzk
Ar = 0

is the generating function of discussed already in Chapter III, § 15.


In this case the characteristic function is therefore equal to the generating
function on the boundary of the unit circle. In the general case when £
may take on other than positive integral values, the generating function
is not defined; the characteristic function, however, exists for every random
variable.
We shall now prove some elementary theorems concerning the charac¬
teristic functions of probability distributions.

Theorem 1. We have always \ cpft) \ < 1; equality holds for t — 0.

Proof. Since | e‘i! ] = 1 and because of Formula (5) of § 1 we have

I 9>s(01 S £( I ew |) = 1.

Further <^(0) = E(e°) = 1.

Theorem 2. The function cpft) is uniformly continuous on the whole real


axis — oo < t < + co .

Proof. Let e > 0 be given. Choose a X > 0 such that

If we denote by Ax the event | £ | > A, we evidently have

cpft) = E(e^‘ | Ak) P(AX) + E(e‘V \ Ax) P(AX). (5)

Since ] E(e^‘ \ AJ | < 1, we conclude that

Wt (0 - E(e‘S‘ I it) p(ix) I S P^x) < f ■ (6)


304 CHARACTERISTIC FUNCTIONS [VI, § 2

Consequently,

1 Vsih) - (7)

From

| eib - eu' | = | i j elz dz | < b — a for a <b (8)


follows

1 _ eitt 11 < JL for 11, | < A and \t2 — tx | < =-• <5,
3
hence

E{ |e^h -- c,{tl \Ax)<-y for | h - f | < 5. (9)

From (7) and (9) we conclude that

I <Pn(h) ~ <Pi({i) I < £ for 112 ~ h | < 5.

where 5 > 0 depends only on e. This proves Theorem 2.

Theorem 3. If a and b are constants and if rj = at, + b, then

9>„(0 = eibt
Proof.

(pft) = E(ei{ai+h)t) = eibt E(eiial).

Theorem 4. If tlf t2 ,... , tn are arbitrary real numbers and Zj,z2,... ,zn
arbitrary complex numbers, further if cpft) is the characteristic function of
a random variable c, and if z — x — iy is the conjugate of the complex
number z = x + iy, then we have

n /7

I 'L <Ps(th-tk)zhzk>0. (10)

Remark. Functions satisfying (10) are said to be positive definite. A remark¬


able theorem of Bochner says that every positive definite function cp{t)
for which cp{0) - 1 is the characteristic function of a probability distri¬
bution. We shall not give the proof of this theorem.

Proof of Theorem 4. We have

t t <P&h ~ tk) zh~zk=E(\f eu* zk |2).


fc=X k=1
VI, § 2] BASIC PROPERTIES 305

Theorem 5. For every real t, cp^—t) = (pff). In particular, if the


distribution function of £ is symmetric with respect to the origin, <pft) is
a real even function of t.

Proof.Let C be a random variable with complex values; then E(Q = E(f).


This leads to

<pf-t)=E(e-iS') = E(eiil).

If the distribution function of £ is symmetric, i.e. if £ and — E, have the


same distribution, their characteristic functions are identical; we have thus

9>«(0 = <P-i(0 = <P$ (-0 = <P{ (0-

Consequently, cpft) is real and, since cpft) = <pf—t), (pft) is an even


function.

Theorem 6. If £1# £2,. • • , En are mutually independent random variables,


the characteristic function of their sum is equal to the product of the charac¬
teristic functions of the individual terms:
n

Vsi+c,+...+«■ (0 = k11
=1
v&CO-

Proof. This follows from Formula (3) of § 1.

Remarks:
1. Theorem 6 expresses a property of the characteristic functions which
exhibits their successful applicability to probability theory. Indeed, the
distribution of a sum of independent random variables is the convolution
of the distribution functions of the individual terms; the calculation of this
convolution is in most cases rather complicated. On the contrary, Theorem
6 allows a very simple calculation of the characteristic function of a sum
of independent random variables from the characteristic functions of its
terms, as it is just their product. Further, as we shall see in § 4, from the
properties of the characteristic function the properties of the corresponding
distribution function can be deduced.

2. The converse of Theorem 6 does not hold. From

<Pb+dO = <Pdt)<PdO
the independence of ^ and £2 does not follow. Let for instance be
gL = = £, where £ has a Cauchy distribution: (pft) = e~ul
306 CHARACTERISTIC FUNCTIONS [VI, § 2

(cf. Example 4 of § 3). According to Theorem 3 we have thus

Vfc+sXO = 9^(0 =6-21,1 = 9{l(0 9>e.(0»

though £ is obviously not independent of itself.

Theorem 7. If the first n moments E(£k) — Mk (k = 1,2,... ,ri) exist


for the random variable f then the characteristic function cpft) is n times
differentiable and

9>?’(0 ) = ikMk (*=1,2,.(11)

Proof. Let /’(a) be the distribution function of £. If

I' I A I c7F(a)
— 00

exists, the integral


+ oo
J Aew' dF(x)
— 00

converges uniformly in t. Hence

+ 00
y'fij) =J — 00
lAe'*' dF(x);
in particular
<Pt(P ) = iMv (12)
By iterating the operation we obtain

<pf\0 = ik J°° xk eixt dF{ a) (k = 1,2,..., n); (13)


— 00

from here (11) follows by putting t = 0.

Theorem 8. Let the distribution function of the random variable £ be


absolutely continuous. If the density function fix) of £ is k times differentiable
{k = 0, 1,. . .) and if
+ 00

0=1 l/wMI dx

exists for j = 1,2, ... ,k, we have1

lim 11 \k I cpft) | = 0. (14)


m-+°o

1 It suffices to assume the finiteness of Ck; this implies the finiteness of Cu ..Ck_.
Cf. S. Bochner and K. Chandrasekharan [1], p. 29.
VI, § 2] BASIC PROPERTIES 307

Proof. If we perform k times an integration by parts on

<Pt(0 = j f(x)eixtdx (15)


— oo

and consider that by our assumption lim f°\x) = 0 for j = 1 2, , . . ., k— 1,


|jc;-»oo

we obtain
+ 00

<Ps (0 = j f(k) (x) eixt dx. (16)


(tr - CXD

From (16) it follows that

1 n (0 ,< c‘ (17)
' M* '
Since by assumption ]fik\x)\ is integrable on (-oo, +oo), (14) follows
from (16) by Riemann’s lemma concerning the Fourier integral.1
Inequality (17) is obviously of interest for the study of the behaviour
of yfit) for large values of 111.
Remark. According to Theorem 7, the “smoothness” (differentiability)
of (pfit) is determined by the behaviour off(x) for \x\ + oo ; by Theorem 8
the “smoothness” of fix) determines the behaviour of 9off) for |t| -> 00 .
The two theorems are therefore in a certain sense dual.

Theorem 9. If the first n moments of f Mk = E(fi) (k — 1,2,...,«)


exist, we have (with M0 = 1), for t 0 that

= f -ff- + o(f). (18)


k-0 K‘
Proof. This follows immediately from Theorem 7.

Theorem 10. If all the moments Mk = E(£k) {k = 1, 2,...,«) of the


random variable £ exist and if

1
lim sup (19)
«-►+00
h

is finite, then the domain of definition of <pfj) can be extended to complex


t-values. We have, for \ 11 < R,

« Mn (tty
<p4(0 = I —— (20)
«=o n-

1 Cf. G. H. Hardy and W. W. Rogosinski [1), p. 23.


308 CHARACTERISTIC FUNCTIONS [VI, § 2

q>ft) is even a holomorphic function in the whole band \ v\ < R of the complex
plane t = u + iv.

Proof. If the assumptions of the theorem are fulfilled, 9oft) is, because
of (11), arbitrarily often differentiable at the point t = 0 and we have
<p(lf(0) = inMn. From this (20) follows immediately.
Because of (13) for every real t0 and every n

\cpf\t0)\<M2n. (21)

Hence, according to Schwarz’s inequality,

+ 00

| qf-+» (to) |<J \x |2n+1 dF(x) < jM2n M2n+2 < Min +2M2* + 2. . (22)
— 00

We obtain from (19), (21) and (22) for every real t0

1 y(w) (to) 1
lim sup <- (23)
«-► + 00
n\ R

Hence cpft) is regular in every circle \t — t0 \ < R which leads already


to our assertion.

Remark. It follows from Theorem 10 that the function <pft) is uniquely


determined by the sequence Mn (n — 1,2,...), whenever (19) holds.
In fact, by (20) <pft) is determined by the sequence {Mn} in the circle
|*| < R, hence the value of cpft) for every real t can be determined by
analytic continuation. We shall see in § 4 that a distribution function is
uniquely determined by its characteristic function. Hence if (19) is fulfilled,
the distribution of £ is uniquely determined by the sequence of moments

Mn — E(bf) (n — 1,2,...).

The question, whether the moments M„ = E(£n) do or do not determine


uniquely the distribution function F(x) of £, is called the Stieltjes moment
problem. In general, F(x) is not uniquely determined by the sequence of
the moments.

Definition. The random variable £ has a lattice distribution with span


d, if it takes on only values of the form dk + r (k = 0, + 1, ± 2,. ..),
where d > 0 and r are real constants.
VI, « 2] BASIC PROPERTIES 309

Theorem 11. If E, has a lattice distribution with span d, then

= 1 for n = 0, + 1, + 2,... ;

if E, does not have a lattice distribution, we have |<p£t) \ < 1 for every t ± 0.

Proof. If all values of E are of the form dk + r and if P(f = dk + r) =pk


(,k = 0, +1, +2,. . .) we have, for any integer n,

4- oo
2nn '
Vs = E Pk =i•

Conversely, if for a^Owe have cpftf) = eix with real a, we conclude

+ oo +00
j eKux-*)dF(^ = 1 = j dF(x),

hence
+ 00
| [1 - cos (t0x — a)] dF(x) = 0.

2kn a
Since 1 - cos (toX - a) is positive except for x = —-h — (A = U,
to 'o
±1,. . .) (for which values it is equal to 0), all jumps of F(x) must therefore
2k a
belong to the arithmetic progression dk + r with d = —— and r — —.
to ‘o

Theorem 12. If the distribution of £ is the mixture of the distributions of


the random variables with weights pk (k = 1,2,.. .), then

<p&) = 'LPkVtk(f)-
k

Proof. Let F(x) be the distribution function of £, Fk(x) that of £k. We


know that
F(x) ~ Ek Pk Fk (.*)■
From this Theorem 12 follows immediately.
Remark. The characteristic function may be considered as an operator
which assigns to the distribution function F(x) the function (p(t). Then
Theorem 12 expresses the fact that this operator is linear.
310 CHARACTERISTIC FUNCTIONS [VI, § 3

§ 3. Characteristic functions of some important distributions

We determine now explicitly the characteristic functions of some distii-


butions.

Example 1. The characteristic function of the normal distribution.


Let ^ be a normally distributed random variable with E{£) = 0,
D(0 = 1. Then
+ 00

dz.
— oo L

where L is the horizontal line z = x - it (— oo < x < + oo) of the


Z2

complex plane; e 2 is an entire function, its integral is thus zero along


any closed curve and in particular along the quadrangle Rx with the vertices
— x — it, x — it, x, —x. The relation

■x _ zy x* |r|
j j e 2 dz \ <e 2 j e 2 du (1)
x—it o
implies
x _ _zy
lim | f e 2 dz | = 0. (2)
|x|-»oo x-it
Hence
+ 00

1
(3)
v'/2tz
— OO

and consequently

<Pi(t) = e~‘2 (4)

If the random variable £ is N(m, a) the random variable f — ———


a
is N (0, 1) and £ = + m. From rp^t) = e~ 2 and from Theorem 3
in § 2 follows

(0 = e “ . (5)

Example 2. The characteristic function of the exponential distribution.


Let C be a random variable with an exponential distribution and of
VI, § 3] SOME IMPORTANT CHARACTERISTIC FUNCTIONS 311

expectation — . The density function is thus Xe~Xx for x > 0 and


A

1
9^(0 = X j e x(X u) dx = (6)
■-4

From this it follows immediately by Theorem 6 of § 2 that a random


k
variable c,k having a T-distribution of order k and expectation — has
A
characteristic function
1
n*(0 =

Example 3. The characteristic function of the uniform distribution.


Let £ be a random variable uniformly distributed on the interval
( — A, +^4); then

sin At
At

Example 4. The characteristic function of the Cauchy distribution.


Let £ be a random variable with a Cauchy distribution. Then

+ 00
0ixt

9>«(0 dx — e (7)
= —
71 1 + x"

(The integral can be evaluated by the method of residues.)


It should be noted that (pft) is not differentiable at the point t = 0.
According to Theorem 7 of § 2 this is linked with the fact that b, does not
have an expectation.

Example 5. The characteristic function of Pearson’s fl-distribution.


If xl is a random variable having a ^-distribution with n degrees of
freedom, we can write

xi=i
k
& =1

where £2,. . ., are normally distributed independent random


312 CHARACTERISTIC FUNCTIONS [VI, § 4

variables for which E(^k) — 0 and D(%k) — 1. It is easy to see that

X2
2
(1-2/0 . 1
<p&Q = e dx —
J\ - 2it
o

(One has to take that branch of the square root which is equal to 1 for
t = 0.) Hence Theorem 6 of § 2 leads to

<PX'n(0 = -1-IT- (8)


(1 -2 it)2

Example 6. The characteristic function of the binomial distribution.


Let £ be a random variable having a binomial distiibution of order n
with parameter p; according to § 15 of Chapter III

n(t) = Gfei') = [l+p(eit-l)r.

§ 4. Some fundamental theorems on characteristic functions

In this paragraph properties of characteristic functions will be discussed


which are essential for the proof of limit distribution theorems of proba¬
bility theory.

Theorem la. If cp(t) is the characteristic function of the distribution


function F(x) and if a and b are continuity points of F(x) (a < b), then
-f CO
-it a ~—itb Mb
F{b) - F(a) (p(t) - cp(-t) dt. (1)
-±s lit lit

Theorem lb. Every distribution function is uniquely determined by its


characteristic function.
Theorem lb follows immediately from Theorem la; in fact if y(t) is
known, (1) gives the increment of F(x) on every interval the endpoints
of which are points of continuity of F(x). The set of discontinuity points
of F(x) being denumerable, a may tend to — go through a sequence con¬
sisting only of continuity points of F(x); hence (1) gives the value of F(b)
at every point of continuity b. As F(x) is by definition leftcontinuous, the
values of F(x) at a point of discontinuity can be obtained by letting b tend
from the left to such a point.
VI, § 4] SOME FUNDAMENTAL THEOREMS 313

Since the unicity Theorem lb follows from the inversion Formula (1),
it suffices to prove the latter. Before beginning the proof we have to make
first some remarks. It was pointed out in § 2 that cp(-t) = cp(t). Thus if
Re{z} denotes the real part of the complex number z, (1) can be rewritten
in the form

1 f f e~ita - e~ilb 1
m - m = 2^-) Re mo —Tt— dt. (2)
— 00

e-‘‘a_ e~itb

The real parts of q>(t) and of-are even functions, while their
it
imaginary parts are odd functions. Therefore the same holds for

e-na _ e-‘‘b
m = m —-—
u
(3)

as well. Consequently, W( — t) = 'Pit). If Tm {z} denotes the imaginary


part of z, we have
+T

j Im {!F(0} dt = 0 for every T> 0, (4)


-T

hence by (2)
+T
e~ita - e~i,b
F(b) — F(a) = lim <?(0 dt. (5)
r-*oo 2n it
-T

In many textbooks the inversion formula is given in the form (5).


Nevertheless, while the improper integral (1) always exists, the same cannot
be stated regarding the integral
+ CO

1
7(0 dt. (6)
In it
— 00

But if this integral exists, its value is by Formula (5) equal to F(b) - F(a).
For the proof of Formula (2), we need two simple lemmas.

Lemma 1. Put
314 CHARACTERISTIC FUNCTIONS [VI, § 4

For every real a and for every positive T we have

I S(cc, T)\<2. (8)


Furthermore
+ 1 for a > 0,
sin cat
lim S(ce, T) = — dt = 0 for a = 0, (9)
T- + co
n
-1 for a < 0;

and the convergence is i 0, where S is an arbitrarily


small positive number.

Proof. If we put

sin u
S(x) = -du,
7t u

we have
S(a, T) = S(xT). (10)
Put

sin u
-du,
-tJ u

then we have

sin u
cn=(-iy n I mz + u
du (n — 0,1,2,...); (11)

the numbers cn have alternating signs, their absolute value decreases,


hence the series ]T cn is convergent. From

«-i
2 r sin u
S(X) — Yj ck~I- du for nn<x<(n+\)n (12)
k=0 71 u

it follows that for even values of n


n-l
Z
I - s(x) — Z ck f°r nn < x <(n + 1) n, (13)
/c = 0 * =0

and for odd values of n


VI, § 4] SOME FUNDAMENTAL THEOREMS 315

Hence in every case

0 < .S(x) < c0< 2 for x > 0. (15)

Since S(—x) = — S(x), we have for every real x

I S(x) | < 2. (16)

Thus (8) is proved. (9) follows from the well-known formula

„, . 2 (' sin u ,
5(oo) = — -du — l. (17)
n J u

The uniform convergence follows from (10).

Lemma 2. Put
+T
1 j' sin t(z — a) — sin t(z — b)
D(T, dt (18)
-T
and
+ OO

1 f sin t(z — a) — sin t{z — b)


D(z, a, b) - D{+ co, z, a, b) = —— f dt. (19)
In J

For every real z, a, b and for every positive T

| D(T, z, a,b) | < 2; (20)


further if a < b, then
1 for a < z < b,
1
lim D(T, z, a, b) — D(z, a,b) — — for z = a or z — b, (21)
T—co

0 for z < a or b < z.

The convergence is uniform for \ z — a \ > 3, \ z — b\> 5 (5 > 0 arbitrary).

Proof. Since

D(T, z, a, b) = — [5(z - a,T)~ S(z - b, T)],

Lemma 2 is an immediate consequence of Lemma 1.


CHARACTERISTIC FUNCTIONS [VI, § 4
316

Now we turn to the proof of Theorem la. We have


j- on
„-ita
1
f
— e„-itb I

Re cp(t) dt =
2n it

+ 00 +00

sin t(z — a) — sin t(z — b)


dF(z) dt. (22)
t
— 00 — CO

On the other hand, since a and b are points of continuity of F(x), we have
by Lemma 2
+ 00
F(b) - F(a) = f D(z, a, 6) JF(z). (23)
— 00

In order to prove (2) it suffices thus to prove that the order of integration
may be reversed in the right hand side of Formula (22). The difficulty is
that the integral (19) representing D(z, a, b) is not absolutely convergent.
But by Lemma 2 we know that D(T, z, a, b) — D(z, a, b) tends unifoimly
to zero on the whole real axi", except for the intervals a — 5 < z < a + 5
and b-5<z<b + 5, where 5 is a small positive number. Furthermore
on these intervals |D(T, z, a, b) \ < 2. Since a and b are continuity points
of F(x), we have
+ 00 +00

lim | D(T, z, a, b) dF(z) = j D(z, a, b) dF(z) = F(b) — F(a). (24)


T-^co — oo — oo

On the other hand


+ CO

I D(T, z, a, b) dF{z) =
— 00

+T +co
sin t{z — a) — sin t(z — b)
dF{z)\ dt. (25)
t
T -oo

Here the order of the integrations can evidently be interchanged, because


of the absolute integrability of the integrand in the domain — oo < z < + oo,
| t | < T. If we let T tend to + (X, then (25) leads to Theorem la because of
(22), (23) and (24). If a and b are points of discontinuity of F(x) we find,
by a slight modification of the proof,
F(b + 0) + F(b) F(a + 0) + F(a)
2 2
+ 00

(26)
— CO
VI, § 4] SOME FUNDAMENTAL THEOREMS 317

Of course the density function /(x) = F'(x) of an absolutely continuous


F(x) may also be expressed in terms of 99(f). We restrict ourselves to the
case where the integral
+ 00

J
— 00
I <7(0 I dt (27)

exists. Then
F(jc + h) - Fix - h)
fix) = lim
A-0 2/i
+ CO

sin th
lim I —~t— [7(0 + qp(—0 eitx (28)
A-0 47T J

Since (27) exists and because of

sin th
[99(f) e itx + 9f—t) eltx] ^ 2 | 99(f) |,
th

the limit and the integration can be interchanged according to the theorem
of Lebesgue, hence
+ CO

fix) = J <7(0 e~itx dt. (29)


— 00

It is easy to show that the integral figuring on the right hand side of (29)
is a uniformly continuous and bounded function of x. This leads to

Theorem 2. If 99(f) is the characteristic function of the random variable


£ and if the integral (27) exists, then £ has a uniformly continuous and bounded
density function given by
+ 00
/W=T j T(0 e-«*dt. (30)
— CO

We shall prove now

Theorem 3. The distribution functions Fn(x) (n = 1,2,...) tend to a


distribution function F(x) at every point of continuity of F(x), iff the charac¬
teristic functions cpn{t) of F„(x) tend for n ->■ 00 to a function (p(t) continuous
for t = 0. In this case 99(f) is the characteristic function of F(x) and the
functions 9on{t) converge uniformly to 9(f) on every finite interval.
318 CHARACTERISTIC FUNCTIONS [VI, § 4

Proof. We show first that the condition is necessary, i.e. we have to


show that if
lim Fn(x) = F(x) (31)

at every point of continuity of the distribution function F(x), then

lim cpn(t) = cp(t) (32)


n-*- oo
holds, where
+ 00
<p(t) = I eitxdF{x), (33)

and the convergence in (32) is uniform in every finite t-interval. Let


e > 0 be given. Choose a number A > 0 such that +A and —A are conti¬
nuity points of F(x) and

F(- A)< F(+A)> 1 - —,


8
in this case

e’xt dF(x) < (34)


\x\ > A

For n > n-L, where nx depends only on e, the inequalities

Fn(-A)<~, Fn(+A)> l-~

hold, hence

e,xt dFn(x) <— for n>nx. (35)


4
x\ >X

Consequently, for n > nx,

+A
f*

elxt d[Fn (x) - F(x)} + (36)

Integrating by parts we obtain for j 11 < T

+A
elxt d[Fn (x) - F(x)} | <
—x

< | Fn (A) - F(A) | + | Fn (- A) - F(- A) | + T f \ Fn(x) - F(x) | dx. (37)


VI, § 4] SOME FUNDAMENTAL THEOREMS 319

Now |Fn(x) — F(x) | < 2 and according to the theorem of Lebesgue limit
and integration can be interchanged, hence the right hand side of (37),
and by (36) cp„(t) — <p{t) too, tend for n -a oo uniformly to zero if [t| < T.
Thus we proved that the condition of Theorem 3 is necessary.
We show now that it is sufficient as well, i.e. that from (32), with cp(t)
continuous for t = 0, follows (31). According to a well-known theorem
of Helly every sequence {Fn(x)} possesses a subsequence (F^x)} that
converges to a monotone nondecreasing function F(x) at all continuity
points of the latter.
We show first that this function F(x) is necessarily a distribution function.
It suffices to show that F(+oo) = 1, F(— oo) = 0, and that F(x) is left-
continuous. This latter condition can always be realized by a suitable
modification of F(x) at its points of discontinuity. Since F(x) is a limit
of distribution functions, we have always 0 < F(x) < 1. Hence it suffices
to prove that F(+ oo) — F(— oo) = 1. First we prove the following formula:
+
1 f 1 — cos xt , „ ,
[F„O0 - F„ (-+)] dy - — - - <P„(0 dt if x > 0.
~2 (38)
.) n J t

In fact
+ 00

, N 1 f 1 — cos xt ,N ,
d„(x) = — - - <P«(0 dt = -2
n J t

+ 00 +00

1 — cos xt
cos yt dt dFn(y) (39)
— 00 —CO

(the order of integrations can be interchanged because of the integrability


| _ cos xt
of -?-and because of \yn(t) | < 1). It is known that

+ co
1 — cos xt
dt = | x |. (40)
r
— CO

From (40) it follows that for x > 0


0 for y < — x,
+ OP

1 C 1 — cos xt x + y for — x < y < 0,


cos vt dt =
n x — y for 0 < y ^ x, (41)
— CO

0 for x < y.
320 CHARACTERISTIC FUNCTIONS [VI, § 4

Hence by (39)

hi*) = I (x - | y |) dFn(y). (42)


—X

An integration by parts in (42) leads to (38).


Since Fn(y) — Fn(—y) is a nondecreasing function of y, we obtain from
(38)
+ CO

i r \_cos xt
Fn(x) - Fn (-x) > — -3- Ut) dt, (43)
TC J Xt
— CO

or
+ 00

i 1 — cos u
FJL*) ~ U-x) > ~ du. (44)
ft u

Suppose that x and —x are both continuity points of F(x) and that n runs
through the sequence {nk }. Then from the theorem of Lebesgue concerning
the interchangeability of the limit and the integral it follows that
+ 00
1 f 1 — cos u
F(x) - F(-x) > (45)
— 00

(p(t) is continuous for t = 0 and because of ^(0) = 1 we have ^(0) = 1


as well; hence we obtain, by applying Lebesgue’s theorem again and by
taking (40) into account

jF(+oo)- F(-co)> 1. (46)

Consequently, /’(Too) = 1 and +(—00) = 0; F(x) is therefore a distri¬


bution function.
It remains still to prove that 1) 9o(t) is the characteristic function of
F(x) and 2) that the whole sequence {^(a:)} converges to F(x); for the
latter, according to the theorem of Helly, it suffices to show that the
sequence {Tn(x)} possesses no subsequence converging to a function other
than F(x). Both of these statements follow immediately from Theorem lb
and from the already proved first part of the present theorem.
Hence from the sequence of distribution functions (F„(x)) (n = 1, 2,. . .)
there cannot be selected any subsequence which does not converge to
F(x), this means that lim Fn(x) = F(x). The uniformity of the convergence
n-*~ 00
in the relation
lim <pn(t) = <p(t) for 111 < T
VI, § 4] SOME FUNDAMENTAL THEOREMS 321

(T > 0 fixed arbitrarily) follows from the already proved necessity of the
condition of Theorem 3. Herewith our theorem is completely proved.
Let us add some remarks.

1. We have seen that if the sequence of distribution functions {F,,(x)}


in = 1,2,...) converges to a distribution function Fix), then the sequence
<pfl(t) of characteristic functions of the distribution functions Fn(x) converges
for every t, when n -*■ oo, to the characteristic function (pit) of F(x). If the
condition that F(x) be a distribution function is omitted, the sequence of
the characteristic functions cpn{t) does not necessarily converge; let e.g. be

F„(x) =
II for x > n,
0 for x < n.

For every finite x, lim Fn(x) = 0, nevertheless <pn(t) = emt does not tend
rt— oo

to a limit (except for t = 2kn (k = 0, +1, ±2,. . .)).

2. We have proved that if the characteristic functions <p„(t) of the functions


F„(x) converge to a function (pit) continuous at t = 0, then the functions
Fn{x) converge to a distribution function F(x) with characteristic function
(p(t). If we omit the condition that cp(t) is continuous at the origin, our
proposition is no longer valid. Thus for instance let F„(x) be the
distribution function of the uniform distribution on the interval ( — n, +n),
that is

f-(x) = ~T + 17 for lxl-n;

, , sm«t , , , .
then (pit) =-, and thus the limit
nt

lim (pn (t) = (fit)


n-*- co

exists for every real t and is given by

1 for t = 0,
(f(t) =
0 otherwise,

thus (pit) is not continuous for t = 0. The sequence Fn(x) converges when

77 oo for every x to —. F{x) is therefore identically equal to the

constant — , thus it is not a distribution function.


2
322 CHARACTERISTIC FUNCTIONS IVI, § 4

We show finally that the characteristic functions of two different distri¬


butions may coincide on a finite interval.
Consider the random variable £ which assumes the values + (2k + 1)
(k — 0, 1,. . .) with probabilities

- a + 1) - — ilk + D) - (* = o,i,...).

We know that
* 1 n2
(47)
to (2n+\f “ 8 ’

hence
+ 00
x P({ = 2« + 1) = 1,
n= — co
and we find

8 * cos2«+l )t 2 11
^(0 = -r I for 111 < n, (48)
71 Mto (2« +t 1)
, \2
7t

further 9o^t) is periodic with period In.


Let now f/ be a random variable assuming the values 0, ± (4k + 2)
(* = 0, 1, . . .) with the probabilities

P(V = 0)=~ and P(r, = ± (4 k + 2)) = ^ ^+ Jf (* = 0,1,...).

Clearly the condition

P(*1 = 0) + +f P(r, = 4« + 2) = 1
/7= — 00

is fulfilled because of (47) and we obtain

2 11 n
*,(0-i- + 4r£ c~<4* + 2>‘ - 1 - for \t\<~. (49)
7T A: = 0 (2k + 1) n

According to (48) and (49) we have thus

Vi (0 — (0 for u | . (50)

The function 9^(0 is periodic with period n. Let the real axis be partitioned
into subintervals

2k- 1 2k 4- 1
n < t < —• n (k = 0, + 1, + 2,...),
VI, § 5] ON THE NORMAL DISTRIBUTION 323

then we see that the functions cpft) and cpft) are identical on intervals
with an even index k and are of the opposite sign on intervals with an
odd index k.

§ 5. Characteristic properties of the normal distribution

Let £ and be independent random variables with the same normal


distribution; it is easy to see that £ + rj and £ — rj are independent. It is
quite remarkable that this property is characteristic for the normal distri¬
bution. In fact, Bernstein has proved the following

Theorem 1. If E, and >] are independent random variables with the same
distribution and finite variance, further if £ + rj and £ — rj are independent,
then £ and q are normally distributed.

Proof. We may assume without restricting generality that E(f) — E(rj) = 0


and £>(£) = D(rj) = 1. If cp{t) is the characteristic function of the common
distribution of £ and the characteristic function of £ + r] is cp\t) and
that of t; — r] is <p(t) • <p( — t). Since £ + r\ and — rj are independent,
the characteristic function of their sum is equal to the product of their
characteristic functions. The characteristic function of (.£ + r\) + — rj) —
= 2£ is, by Theorem 3 of § 2, equal to y(2t). Hence

9^(20 = <p3 (0 <p(— t). (1)

Now cpft) can never be zero. In fact, if for a value t0 we would have

<p(t0) = 0, then by (1) we would have cp 0 and thus also cp = 0

(n — 1,2,...). As cp(t) is continuous, we would have 99(0) = 0; this is


impossible as 99(C)) = 1 (cf. Theorem 1 of § 2). Put

<K0 = In cp{t) (2)


then, by (1)
ij/(2t) = 3 ij/(t) + (3)
Put
m = ho - H-o. (4)

If in (3) t is replaced by — / and the equality so obtained is subtracted


from (3) we find
<5(20 - 25 0
( - (5)
324 CHARACTERISTIC FUNCTIONS [VI, § 5

By assumption cp(t) is twice differentiable and <p'(0) = 0, <p"(0) = — 1


(cf. § 2, Theorem 7). Since (pit) A 0, ip(t) and <5(0 are twice differentiable
as well and we have <5(0) = 0, <5'(0) = 0.
We obtain from (5)

(n = 1,2,...). (6)

The right side of (6) tends for n -a oo to <5'(0), i.e. to zero. Hence

<5(/) = 0. (7)

It follows that \p(t) = ip( — t) and by (3) that

<K2o=mr (8)
This leads to
t
~2T ) (9)
1

t2 t 2

T"“J

The right hand side of (9) tends to —since i/i(0) = ip'(0) — 0, ip"(0) —

— — 1. Hence
o
\[f(t) =-—- and (f{t) - e 2

£ and rj are thus normally distributed random variables.


Similarly, one can prove

Theorem 2. Let £ and rj be independent random variables having the same


P _1_ yt
distribution with zero expectation and finite variance. If -- _ has the
J2
same distribution as £ and t], then £ and rj are normally distributed.

Proof. Assume D(f) = D(r\) = 1. If cp(t) denotes the characteristic


function of £ and rj, we have

<?( 0)=1, (p\0) — 0, cp\ 0) = -l,


VI, § 5] ON THE NORMAL DISTRIBUTION 325

£+ Y]
By assumption, the characteristic function of —-=— is also equal to
v 2
cp(t). By Theorems 3 and 6 of § 2, however, the characteristic function of
£ + n . 2
— i=— is <r , hence
x/2 In/2

<P = t(0- (10)


V

From this follows, as in the proof of Theorem 1, that y(t) ^ 0 for every t.
If we put again In (p(t) = )K0> then 'KO is twice differentiable,

^(0) = i/d(0) = 0 and i/d'(0) = - 1.


From (10)
t
<K0 = 2il> (ii)
V2
hence for every positive n

W-4]
m \ 22 I
(12)
t \2

hence i//(0 = - and cp(t) = exp 1 - — j which proves our theorem.

Theorem 2 can be rephrased, by using the notion of families of distri¬


butions, as follows:

Theorem 2'. Let F(x) be a distribution function such that

j” xdF(x) = 0, +f x2 dF(x) = 1.

x—m
If the family of distributions IF , a > 0, is closed with respect

to the operation of convolution, i.e. if for any real numbers mx, m2 and for
any positive numbers oq, cr2 there can be found constants m and o (m real,
a positive) such that

x — m1 x — m2 x — m
=F (13)
326 CHARACTERISTIC FUNCTIONS [VI, § 5

then
X

(14)

— 00

is thus the family of the normal distributions.

Proof. Obviously, m — mx + m2, o = ■f o\ + o\. If we put

+,00
<p(t) = | e'xt dF(x),
— oo

then

<f(? i 0 (p(<>2 t) = + cr! /). • (15)

For oy = o2 = —7= (15) reduces to (10); hence Theorem 2' follows from
V2
Theorem 2.
Theorem 2' explains to some extent the fact that errors of measurements
are usually normally distributed. In effect, the condition, that the sum of
two independent errors of measurement belongs to the same family of
distributions as the two errors themselves, cannot be fulfilled, in the case
of a finite variance, by other than normal distributions. The condition that
F(x) should have finite variance is necessary for the validity of Theorem 2'.
Thus, for instance, for the distribution function

1
arc tan x
71

of the Cauchy distribution we have the relation

lX~Wl\*F ' x — m2 F (x ~ (mt + m2) j


I CT2 [ <*1 + ^2 J' (16)

(16) follows easily by taking into account that the characteristic function
of F(x) is equal to

If the family of distrubtions |F —-— j is closed under the operation

of convolution, F(x) is said to be stable. According to Theorem 2', the


normal distribution is the only stable distribution having finite variance.
VI, § 5] ON THE NORMAL DISTRIBUTION 327

as pointed out above. There exist, however, other stable distributions, e.g.
the Cauchy distribution. Stable distributions will be dealt with in § 8.
We deal now with some further remarkable properties of normal distri¬
butions. If £ and r] are independent normally distributed random variables,
their sum £ + r] is, as we know already, normally distributed too. We shall
now prove that the converse of this statement is also true: this result is
due to H. Cramer.

Theorem 3. If c; and i] are independent random variables and if £ + r/


is normally distributed, then <; and i] are normally distributed themselves.

Proof. We may suppose E(£, + rj) — 0, and + i]) — 1; the charac¬

teristic function of £ + t] is then exp Let (pft) and <pn(t) be the

characteristic functions of £ and rj, respectively. We have thus

9>«(i)%{t) = e 2 . (17)

If F(x) and G(x) denote the distribution functions of t; and rj, respectively,
we have
-f OO +oo

n (0 = J elXt dF(x\ % (0 - J eixt dG(x). (18)


— 00 —00

We show now that the definition of cpft) and (pn(t) can be extended to
all complex values of t, so that (pft) and (p.ft) are entire functions of the
complex variable t. Let us first suppose t = iv (v real) and let A and B
be any two positive numbers. We have

Y e~vx dF(x) • J e~vy dG{y) < J° Y e-^x+y) dF(x) dG(y) =


—A —B —ao—oo

= <?«+!, (iv) = e2. (19)


Since e~ vx > 0, the following integrals exist:
+ 00
<ps(iv)= J e~vxdF(x) (20a)
— 00

and
+ 00
<pn (iv) = j e~vydG(y). (20b)
— 00

If now t — u + iv, we have


+ 00 + 00
(ft, (t)| = | f eitx dF(x) | < j e vx dF{x) = cpi (iv). (21)
328 CHARACTERISTIC FUNCTIONS [VI, § 5

The definition of cp^t) and cpn(t) can thus be extended to every complex t.
It is easy to see that cp^(t) and cp^t) are holomorphic on the whole complex
plane, hence they are entire functions of t.
Because of (17), (p^t) ^ 0, cpn(t) # 0 for every t. Hence lnq9^(/) and
In 99,(0 are entire functions too, where that branch of the logarithmic
function is to be taken, for which In 1 = 0. If a > 0 and b > 0 are such

that F(a) — F{ — a) > and G{b) — G{ — b) > , then

+ 00

j
-a\v\

<Pz (/ v) = e xv dF(x) > (22)

and
+ 00
M
9\ O' ») = e~xv dG(x) > (23)

Hence, for t = u + i v,

v*_
.2
r+%1 14? +b\t\
I Vt 0) I ^ Vi 0 v) = < 2e <2e (24)
V« O' V)
Similarly, we obtain
HI’
+ a|r|
Vn 0) I— (25)

If the real part of z is denoted by Re(z), we have by (24)

1
Re(ln <p{ (0) | = In < 1121 + max (a, b) \ 11 + In 2. (26)
I Vi (01
We have 9^(0) = 9\(0) = 1; furthermore we may suppose without
restricting generality, <^(0) - <^(0) = 0. Indeed if we would have <^(0) = a,
and consequently, <pn(0i) — — a, we could always consider instead of £
and rj the random variables £ — a and t] + a whose characteristic functions
satisfy the above conditions. From this we conclude that the functions
In g>{ (0 In <p (/)
and are everywhere holomorphic, furthermore

Re(ln <p4 (0) Re0n Vn (0)


2 and
VI, § 5] ON THE NORMAL DISTRIBUTION 329

are, because of (25) and (26), bounded on the whole t-plane. According
to a well-known theorem of H. A. Schwarz the relation
2ji

/(z) = to (/(0)) + A.
J Mrt"*))**^*
0
f Reie 4- z
(27)

holds for | z | < i? for every function /(z) holomorphic on | z | < R. It follows

from (27) that -—^ ^ and ^—^',a^ ^1 are bounded on the whole plane;

they are thus, according to Liouville’s theorem, constant. Hence 9^(t) =


= exp (ct2) and ytt(t) = exp (dt2). Because of

n(-o = nco» n(-o = n(o» in(01 ^in(0i^ 1.


there follows
~

n (0 = exp (2 8)
1
<N
1

and
r t2 ]
n (0 = exP (29)
2o\

and herewith Theorem 3 is proved.


If _/](/),.. . ,/r(0 are characteristic functions and ax,. ., ocr are positive
rational numbers, further if

h (/*(<))" = *■ ‘2 , (30)
k=1
then it follows from Cramer’s theorem that the functionsfk(t) (k = 1,2,,.., r)
are characteristic functions of normal distributions. In fact, if N denotes
the common denominator of the numbers ax,. .., a,, we have
r Nt‘

n(/*«=*-*',
k=l
(jo
where Nak (k = 1, 2,..., r) are integers. Hence
_ M2
/*(')«» (<) = «' ", (* = 1,•■■,'■), (32)
where we have put

fcw-wwr-'noswr1. j*k
gk(t) is also a characteristic function, hence by Cramer’s theorem fk(t)
330 CHARACTERISTIC FUNCTIONS [VI, § 5

is the characteristic function ®f a normal distribution. If not all ak are


rational, Cramei’s theorem does not guarantee the validity of the propo¬
sition; however, Yu. Y. Linnik and A. A. Singer have proved that it holds
in this more general case too; i.e. they proved the following

Theorem 4. If the functions fk(t) (k — 1,,r) are characteristic functions


and if we have in some interval | /1 < <5 (<5 > 0) identically

11 CAM)** = *"“"+ (33)


fc = 1
where m is a real number and a, al5 a2,. . . , a, are positive numbers, then
the functions fk(t) (k = 1,2,,r) are characteristic functions of normal
distributions.

Proof. The following proof is due to Yu. V. Linnik and A. A. Singer [1].
It consists of five steps.

Step 1. Put

(k = 1, 2,..., r).

Clearly gkit) is a characteristic function too; in fact if £ and g are independent


random variables possessing the same distribution with characteristic
£_ yj
function fk (t), then gk(t) is the characteristic function of-—. Thus the
identity o\/2

(9k(t))*k = e~ 2 (34)
k=1
holds if 111 < 5. Furthermore^ (t) is a real and even function. If we prove
from (34) th&t gk(t) is the characteristic function of a normal distribution,
then the theorem of Cramer implies the same conclusion for fk{t) . It
lollows from (34) that^t) ^ 0 for ] t| <5, hence we may take the log¬
arithm of the two sides of Equation (34):
r
1
Z afc In (35)
k=l 9k (0
Let Gk (.x) be the distribution function corresponding to the characteristic
function gft). It follows from the assumptions concerning gk{t) that Gk(x)
is symmetric with respect to the origin; hence we have for any a > 0
+ 00 + fl
9k (0 = J COS tx ■ dGk (x) < 1- j (1 - cos tx) dGk (x).
— OO —a
VI, § 5] ON THE NORMAL DISTRIBUTION 331

n
Since for | / j < —— the relation
2a

| (1 — cos tx) dGk(x) < 1


a

l
holds and since for 0 < x < 1 we have jc < In , it follows from
1 — X
Tt
(35) for a > —— that
25
+a
r C t2 n
E (1 - cos tx) dGk (x) < — for \t\< (36)
k=l J £ 2a
—a

If we divide both sides of (36) by t2 and let t tend to zero, we obtain


r -{-a

_ j
E
k=1
.f
-a
dGk (X) < 1. (37)

71
Since (37) holds for every a > , the integrals
25

f x2 dGk (x) (k r)
— 00

exist; by Theorem 7 of § 2 gk(t) is thus twice differentiable and, gk(t) being


an even function,
9k (0) = 0 (k — 1,..r). (38)

From (38) and (35) we conclude that

-i«kA
k=1
(0) = Z % T-2 daK W = I •
k=l —co
(39)

Step 2. We show now that gk(t) possesses derivatives of every order.


For this we need the classical formula of Faa di Bruno concerning the
successive derivatives of composite functions.1 This formula states: if

z = H(y), y = h(t)
and if we put
1 dv h(t)
h0(t) = h(t), hv(t) = (v= 1,2....),
v! df
then we have for every integer p
dpz ^ d’H(y)
(40)
dtp dyl h ^2 • • • ■ • 4 '

1 Cf. e.g. E. Lukacs [2].


332 CHARACTERISTIC FUNCTIONS [VI, § 5

where the summation is extended over all nonnegative integers 4> . . . , 4


and 4,. . . , 4 satisfying the conditions
S S

E 4 =z and
j=1
E 44
7=1
= p>

where / assumes the values l — 1, 2,,p.


In particular, if H(y) = In y and p — 2q, we obtain

d2qInh(t) ^(-ly-1 (2<y)! (/ — ^j-j h(li) (0


(41)
dt2q 4! 4! • • • 4'! 7 =1 4! /z(0
where the summation is to be extended over / = 1, 2,. . . , 2q and over
4 , lj such that

E 4 4 E 4 4 ~ ^<7*
7=1
~
7=1

Now we show by induction that the integrals

| x2q dGk(x) (q — 1,2,...; k=l,2,...,r)


— OO

exist. Suppose that this holds for a given integer q; from this it follows
that gk(t) (& = l,... ,r) is exactly 2q times differentiable and that

(0) = 0 (l=l,...,q; k — 1,..., r).


According to (35) we have

[ill
d2q
U J = (2?)! E E (-!)'(/-l)!Efl
h 2«
9k (04
•1
!J (42)
k=l 1=1 7=1

where the summation is as in (41).


Put t = 0 in (42) and subtract the relation thus obtained from (42).
Separate the terms with the indices / = 1, 4 = 1, 4 = 2q and consider
that the left side of (42) is either 1 or 0 for q = 1 or q > 2, respectively.
Thus we obtain

E
k=1 a tin

2q
= (2?)! E
k=1
E (-l)'-'(' - 1)! [SbW - S«(0)],
/=2
(43)
with
"

■5«(o=En—(0:,,!
7=1 4!
VI, § 5] ON THE NORMAL DISTRIBUTION 333

the summation being extended over the i), l, such that

7=1
= Z
7=1
hh = 2<l'

We show now that the right hand side of (43) has the order of magnitude
0{t2) when t -* 0. In fact if v < 2q is an odd number, then by the induction
hypothesis g^\t) — 0(\t\). Hence it suffices to consider terms for which
all the lj are even. If v is even, v < 2q — 2, then we have

- rt’Ho) = 0(t2),
9 kif)

from which our statement follows. But

- 1 - 0(t2)
9k (0

is also valid; hence it follows from (43) that

Z 0Lk[g^(t)-g<?»m = O(t2). (44)


k=1

In consequence, the expression


+ 00
1 - COS tX 3
X a* --j-xq dGk (x)
k=l tr

is bounded. If we let t tend to zero, we see that the integrals

+f x2q+2dGk(x) ik = l,...,r)

exist and this means that gk(t) (k = 1,,r) is at least (2q + 2) times
differentiable. As we know already that the integrals

f x2dGk(x) (&=1,...,/•)
— 00

exist, the proof is finished; gk(t) (k = 1,. . . , r) is thus infinitely often


differentiable.

Step 3. We shall show now that the gk(t) are holomorphic in a circle
11! < R (R > 0). In order to show this we have to evaluate the order of
334 CHARACTERISTIC FUNCTIONS [VI, § 5

magnitude of the derivatives g(kQ) (0). We can restrict ourselves to the


case when ak > 1 (k = 1,. . . , r); otherwise there exists an integer N0 such
that N0ock >1 (k — 1,,r). Since (34) leads to the equality
N„ak t2
n
*=i
9k
Jn,n 7

(34) is satisfied by the functions

&t (0 = 9k

for a* = N0ak. Without restriction of generality we may thus assume


> 1 (k = 1,,r). Now raise the two sides of (34) to the power 2q,
differentiate 2q times and put t = 0. By introducing the notation yk(t) =
= [gkm*kq we obtain thus

2q\ d2q
I 7?°(0)...yP(0) (45)
lx + ... + //• = 2q ~dt^

The quantities y{k\t) can be evaluated by means of the formula of Faa


pi Bruno:
/
y{k{t) = I 2q*k (2qcck - 1) ... (2qak - v + 1 )[gk (t)fqak~v x
V=1

/!
It
*i!
n
7=1 V
(46)

where in the inner sum the summation is to be taken over the i}-s and
Ifs such that Yj ij = v, Y hb = ^ Because of

^2,-x>(0) = 0, sgn#)(0) = (-iy


and
2q*k (2qcck - 1) ... (2q<xk - v + 1) > 0 for v < 2q,

it follows that all nonzero terms on the left hand side of (45) have the
sign of (-1)9. The right hand side is

- dtu |(_o = (2?)' (2?)! H2, (0), (47)

where H2q{x) is the Hermite polynomial of order 2q:

1
H2q 00 =
(48)
(2qf.
VI, § 5] ON THE NORMAL DISTRIBUTION 335

Thus
(-i)g
*M0) =
q\ 2q

Since on the left hand side of Equation (45) there occur the terms
2qcckg{^\0) too, the relation
(2qy(2q)l
^(0)<
Hi 2q
must hold, wherefrom

f (Q) | (49)
lim sup
q~*~ oo l (2q)\ I

thus gk(t) is holomorphic in the circle 11 \ <—= (k = 1,.. . , r).


s/e

Step 4. We show that the functions gk(t) are also entire functions. Put
hk{t) = gk i Jt. Since gk{t) is an even function, hk(t) is holomorphic in

the circle 11 \ < — . Suppose that not all gk(t) are entire functions; then
e
the same holds for the functions hk{t). Let hko(t) be the function hk(t) which
has the smallest radius of convergence, which radius we denote by R.
Take 0 < r < R; put k(t) — hk(r + t). Then

r+t

n
k=1
arm e 2 (50)

and
r

n ’. 3T4V)
^k(t)Yk_c
J
(51)

Since k{t) (k = 1,. . . , r) too can be represented by a power series with


positive coefficients, we obtain, by raising (51) to the n-th power and
differentiating o-timcs,

ns*, jrg(o) <


hence
attw )■;-<_< (52)
lim sup
n\

3Pk (t) is thus holomorphic in the circle 11 \ < —, i.e. hko(t) is holomorphic
336 CHARACTERISTIC FUNCTIONS [VI, } 5

in the circle 1t — r | < —. From this it follows for an r sufficiently near

to R that hko(t) is regular at the point t = R. This, however, contradicts


the known theorem according to which the sum of a power series with
positive coefficients having a radius of convergence equal to R, is singular
at the point + R.1

Step 5. The proof of Theorem 4 can now be finished like that of Cramer’s
theorem. If we choose the numbers ak > 0 (k = 1, 2, .... r) such that

Ok

J dGk(x) > ~ (k =■- 1,..r) ,


-ak

then
1112
^\9k{t)\<~-+ C\t\,
Z(Xk

where C is a positive constant. Because of (34) and step 4, the function


In gk(t) is an entire function, hence by Liouville’s theorem In gk{t) is a
polynomial of at most the second degree, which leads to the statement
of Theorem 4.
Theorem 4 ^enables us to generalize Theorem 1. In fact, as Darmois
and Skitovitch2 have shown:

Theorem 5. If ql, , . . . , f are independent random variables and


tfi, ... , ar, by, . . . ,br are real numbers different from 0, further if the
random variables
r r
= Z at£k and t]2 = £ bk£k
^=1 k=i

are independent, then the random variables f ,, . . . , f are normally


distributed.

Proof. Linnik has shown that the proof of this theorem can be reduced
to that of Theorem 4 as follows: Take ax = a2 = ... = a, = 1, which
does not restrict the generality, rjy and rj2 are by assumption 'independent
hence we have
E(ei(«>h + v,h_)) __ E(eiun^ E(em*y

1 Cf. e.g. E. C. Titchmarsh [1], p. 214.


2 G. Darmois [1], V. R. Skitovitch [1],
VI, g 5 ON THE NORMAL DISTRIBUTION 337

If (fk{t) is the characteristic function of we have by (53) because of the


independence of the £k.

11 + M = fl <pk(bkv). (54)
k=1 k=1
In a neighbourhood of the origin the cpk(t) are not all zero. Put therefore
M) = In (pk{t), then

Z 'I'kiu + bkv) = £ 1J/k(u) = £ ih(bkv). (55a)


fc=i fe=i k=1

Replace in this equality u by — u, v by — v and add the equality so obtained


to the former one; it follows, by putting i//*(t) = ipk(t) + \f/k(-t), that

Z W(M + bkv) = Z H(u) + Z Vk{bkv). (55 b)


k=1 fc = l k=1

We prove now that ij/*(t) = -ckt2. This means that (pk(t)(pk(-t) is the
characteristic function of a normal distribution and the proof of Theorem
5 is finished by Cramer’s theorem.
Multiply the two sides of (55b) by x — u and integrate from 0 to x, thus

X“
Z ( (x - u) <A*0 + bkv) du = ~2
Z 'l/t(bkv) | +
k=1 J k=l

+ O' - it) £ II4(u) du.


k=l

Then, by variable-transformation and integration by parts, we get


bkv x+bkv t

J O - u) >A*0 + bkv) du = — x j iA*(t) dx + f (J iJ/k (x)dx) dt.


o n bkv o

If we put

B{v) = Z <A* (bkv), (56)


k=1
we obtain
r x + bkv t bkv

Z J (j «A*0) dx) dt - x j \jj*k(x)dx =


k=1 bkv 0 0
(57)
x
= ~x~ B(v) + (X - u) Z Vk(M)du.
k=l
338 CHARACTERISTIC FUNCTIONS [VI, § 5

The left hand side of (57) is obviously a dilferentiable function of v. Hence

x+buv

\j/k(i)dx = —2?' (v) + (58)

where

A(v) = £ bk\jj*(bkv).
k=1

Replacing in (58) x by — x and adding the equation so obtained to (58),


we get
r x+bkV —x + bkV
.f = (59)
fe = l

Clearly, this equation can be differentiated with respect to v; doing this


and putting o = Owe find, because of \j/k ( — x) = i/i*(x) and i/i*(0) = 0

E b\Vk(x) = ~ B”(0). (60)


k=1

From this follows

n =e

where
<Pk(x) = = <Pk(*)

Since \ | ^ 1, the relation B"{0) < 0 must hold. Equality cannot hold
here, since then the £k would be constants with probability 1. Hence we
can put 5"(0) = — o2 < 0 and we have thus

n fa? (*))**=* (a2 > 0) (61)


fc=i

in a neighbourhood of the origin. By Theorem 4 the functions <p*(x), and


consequently the functions yk{x), are characteristic functions of normal
distributions. Theorem 5 is herewith proved.
Finally, we shall prove one further characteristic property of the normal
distribution: the following theorem of E. Lukacs:

Theorem 6. If £a,.. . , are independent random variables having


the same distribution of finite expectation and variance, then this distribution
VI, § 5] ON THE NORMAL DISTRIBUTION 339

is a normal one iff


n

«= E & and *1 = Yj 5*
k=1 k=1
are independent.

Remark. R. A. Fisher proved the independence of £ and r\ for normally


distributed £k. The converse theorem was proved by E. Lukacs in the
case where the t,k have a finite variance. It was already proved before by
R. C. Geary under the stronger condition that all moments of the £k exist.
Later on it was proved by J. Kawata and H. Sakamoto1 as well as by
A. A. Singer2 that even the existence of the variance is unnecessary; however,
we shall not deal with this more general case.

Proof of Theorem 6. The condition is necessary. The £k are normally


distributed; we may assume E(£f) = 0, D(fk) = 1. If (cy) is an orthogonal

matrix of n rows and n columns with cy — —— (j — 1, 2,we know


Vn
(cf. Ch. IV, § 17, Exercise 43.b) that the random variables

«,* = kZ= 1 (J =1,2,..., n)

are mutually independent, too.


We have thus £ — ^Jn and
n e2 n n

— =Z^2- tf2 = I
k=i n jt1 j=2
which shows the independence of £ and ?/.
2. The condition is sufficient. We may assume E(^k) — 0. By the assump¬
tion of the theorem
E{em+vn)) = E(eiui) E(eiv”). (62)
If we differentiate on both sides of (62) with respect to v (which is allowed
because of Theorem 7 of § 2), and substitute v = 0 afterwards, we obtain
E(neiui) = E{eiui) E{r,) = (cp(u))n E(r\), (63)
where
<p(u) = E(eiuik)
is the characteristic function of the random variables £k. From
n-1
1
rj = 1 - Z «!-— Z= 1 j<k
k=l
Z £A (64)

1 J. Kawata and H. Sakamoto [1].


2 A. A. Singer [1],
340 CHARACTERISTIC FUNCTIONS [VI. § 6

follows E{rj) — (n — 1) a2, by putting a2 = i$(£f). Since

E(t;keiu^)=-icp'(u) (65)
and
ml em‘) = - ■?», (66)
(63) can be written in the form

-O-i) <p"0) 0(«))"_1 + ~ (2 (<p'(«))2 0(w)) h —2

= 0~ l)ffa(9»(“))" (67)

If we divide by (« — 1) 00))", we find

?"0) f^O)'2 = — ff (68)


9?(m) <K«)

The left hand side of (68) is the second derivative of In cp(u). If we integrate
twice and consider that 0(0) = iE(gk) = 0, we find

G 2U2
\nrp(u) — — (69)

which proves the theorem of Lukacs.

§ 6. Characteristic functions of multidimensional distributions

Let £ = (0, ^2,, £„) be an n-dimensional random vector with


distribution function F(xlf x2,. . . , x„).
For any two vectors t = (tlt t2,. . . , t„) and x = (xlt x2,. . . , x„) we
put
n

(x,t) = YJ xktk. (1)


/c = l

For sake of brevity we write F(x) = F(xx, x2,. . . , x„). We define the
characteristic function of £ by
+ 00 +00

<Pi(0 = E(ei(i’n) = j ... j ei(jc,9 dF(x). (2)


— 00 —00

The characteristic function of an ^-dimensional distribution function is


thus a function of n real variables. As is readily seen, it has the following
properties:
VI, § 6] MULTIDIMENSIONAL CHARACTERISTIC FUNCTIONS 341

1. For 0 — (0, 0, . . . , 0) we have

<P«(0)=1.

2. For every t
1^(01 ^ 1-
3. qft) is a uniformly continuous function of t.

4. If

h 0?1> • • *J Vn)’ tfj Z CjkCk F bj (j 1, , ?2)


k= 1
and

u = (w1?..w„), uk =Z Cjkli {k=\,,.., n),


7=1
then
9,(0 = 9«(«)-

5. = iE(fk) when E(ff) exists.


dtk t=o

d2 <p&)
6. = - E(£j£k) when Egfa) exists.
dtjdtk t=o

7. If the density function f(x) of £ exists, then


+ 00 +oo

9«(0 = J • • • I eKt’x)f(x) dx,


-co — oo

where dx is an abbreviation for dxxdx2 ■ ■ ■ dx„.


Example 1. The characteristic function of the multinomial distribution.
If

Pj>0 (J= 1,2,£ Pj= 1


7=1
and
Nl wjki
P(Z 1 = /cv..., = k„) = Pi 9 • • •? Pn 5
kf.... kn\

where £ kj = N {kj > 0 being an integer), we obtain


7=1

<Pt(t) = (ipkei,k)N-
/c = l

Example 2. The characteristic function of the n-dimensional normal


distribution.
342 CHARACTERISTIC FUNCTIONS [VI, § 6

If the vector rj = is by definition normally distributed,


there exist normally distributed independent random variables £1,. . ., £„
with E(£,k) — 0, D2(£k) — ok and an orthogonal matrix (cjk) such that
ft

% = Z cJk£k + mi
k=1
(J = 1,2,..., n). (3)

Then by property 4 of the characteristic function

<Pn(0 = eKm,t) ?»«(«)» (4)


where
n

Uk =Z cjkt/, U = (Ml, .. M„), m - (mlt ... ,m„). (5)


7=1

It suffices therefore to determine the characteristic function of £. Since,


by assumption, the random variables are independent, we have

<pt(u) = = fl E(eiUkik) = exp - 4- Z akuk (6)


k= l Z A: = l

(4), (5) and (6) lead to


jj « n

<p„(t) = exp Km, t) - — Z Z bhjhh (7)


Z A=1y=l
where

bhj — Z °kchkcjk-
k=1
(8)

It is easy to see that the quadratic form

Z Z
h=li=l
bhjth{j

is positive definite and the matrix B = (bhj) is the inverse of the matrix
A = (ahj), the elements of which are the coefficients in the expression of
the density function of ;/. The matrix B is thus the dispersion matrix of 17.
In fact, a simple calculation shows that the matrix B can be written in the
form B — CSC-1, where S is the diagonal matrix with elements erf.
On the other hand, we have proved (cf. Ch. IV, § 17, Exercise 42) that the
density function g{y) of t] with y = (ylt . . . , y„) is given by

J n n

9{y) = — 'A} exp -y Z Z


L h=\7=1
ahj (y’h - mh) (y - mj) (9)
(2k)2'
VI, § 6] MULTIDIMENSIONAL CHARACTERISTIC FUNCTIONS 343

where the matrix A = (ahj) can be written as A = CS XC \ Hence we


have obtained

Theorem 1. If q is an n-dimensional normally distributed random variable


with density function (9) such that the quadratic form

n n

IZ avzkZj
h=1y=l

is positive definite and its determinant is denoted by \A\, then the characteristic
function of q is given by (7) where m = (mx , , m„) and where (bhj) — B —
— A~x is the dispersion matrix of q.

There exists an inversion formula for n-dimensional characteristic func¬


tions too. It is given by

Theorem 2. If (pft) is the characteristic function of the n-dimensional


vector we have
+T +T
i r r n p-itk.ak _ p-‘<kbk

1 n -—t-
k=1itu
<pwdt> (io>
-T -T

whenever the distribution functions of all £k are continuous at the points


ak and bk(k = 1; here I is the n-dimensional interval ak < xk < bk
(k = 1n).
Like in the one-dimensional case it follows from this theorem that an
n-dimensional distribution function is uniquely determined by its charac¬
teristic function. The proof is analogous to that of the uniqueness theorem
for n = 1; we leave it to the reader.
By means of the uniqueness theorem we prove

Theorem 3. The distribution of the vector c; = (f,. . . , C«) Is uniquely


determined by the distribution functions of the projections of £ upon all lines
passing through the origin.

Proof. Let da be an arbitrary line passing through the origin having


the direction-cosines oq,. . . , a„; put a = (a1? . . . , a„)- The projection of
t on dx is thus

Hence the characteristic function of the random variable is

= E(eKa^‘) = <pfut), (12)


344 CHARACTERISTIC FUNCTIONS [VI, § 6

where at is the vector with components akt (k = 1, . . . , n). Since every


system tx,. . . , tn of real numbers can be written in the form tk = tak
where t is real and

Z «f = 1,
A:=l

the theorem follows from the uniqueness theorem.

Theorem 4. If £ = (&,..., £n) and rj = (i/1;. .., rjn) are n-dimensional


independent random vectors and if ( = £ + rj, we have

<Pt(0 = <PS(0 <PnCO-

PROOF. Let be any line passing through the origin with direction
cosines ak (k = l,... ,n), and put a = (al5. . . , a„). If £a, rja, are the
projections of >/, £ upon da, we have C« = + ?7a- Hence because of
the independence of £ and rj:

<Pdt) = <Pd 0M0- (13)


It follows from (12) that

<pfta) = cpfta)(pn(ta), (14)

and the proof can be finished as that of Theorem 3.

Theorem 5. If . . . , c„ are independent, £ = (f,. . . , £w)} t =


— • • • > Oj (where tx,. . . ,tn are real numbers), then
n

(0 = n <Ptk (4). (15)

Conversely, (15) implies the independence of £x,... , fr

Proof. It follows from the assumption that the random variables eitkik
are also independent, hence

h<)=£(n/"*“)=n £(<■"*“)=ri vm-


k-l k=l k=\
(i6)

If on the other hand * - (Xl,. .., x„) and if F(x) is the distribution
function of Fk(x) the distribution function of £k(k = 1,... ,ri), further if

C(*) ^ FI Mxk)>
VI, § 6] MULTIDIMENSIONAL CHARACTERISTIC FUNCTIONS 345

then by (15) it follows that the characteristic functions of F(x) and G(x)
are identical. Hence, because of the uniqueness, F(x) = G(x) and the
random variables are independent.
As an application of the preceding theorem we prove now the following
theorem due to M. Kac (cf. Ch. Ill, § 11, Theorem 6).

Theorem 6. If £ and rj are two bounded random variables fulfilling for


any integers k and l the relation

E^krf) = E{fk)E(rll), (17)

then £ and tj are independent.

Proof. Our assumption implies the absolute convergence of the series

” E^Xit.f E(rt)(it2y
Wi) = L ' and <pn(t2) = £
k=0 k\ 1=0 /!

for every complex value of t1 and t.z. If we put £ = (<£, q) and t = (tx, t.,),
it follows from (17) that
00 00
E(erjl)(ih)k (it-))1
<p&) = E Z 2>-
k=01=0 kin
Hence the theorem is proved.
Theorem 3 of § 4 may also be extended to the case of higher dimensions:

Theorem 7. Let t;N (N — 1,2,...) be a sequence of n-dimensional random


vectors; let FN{x) (x = (xl5 . . . , x„)) denote the distribution function of
£n and cpN{t) (t = (tx, . . . , tfi) its characteristic function. If at every point
of continuity x of F(x) the relation

lim Fn(x) — F(x) (18)


N-» + cc

is valid and if F(x) is a distribution function as well, then

lim <pN(t) = (p(t), (19)


N-*ao

where (f {t) is the characteristic function of F(x). The convergence is uniform


on every n-dimensional interval.
Conversely, if (19) holds for every system of values (.q, ...,?„) — t and
if p{t) is continuous for t = 0, then (18) holds as well, the function F(x)
figuring in it is a distribution function and cp is its characteristic function;
346 CHARACTERISTIC FUNCTIONS [VI, § 6

furthermore, for any A > 0 the convergence in (19) is uniform for \ t\< A,
where \ t | = sjt\ + ■ ■ - + t„ .

The proof of this theorem is omitted here, as it is essentially the same


as that of Theorem 3 of § 4.
As an application of Theorem 7 we show that the multinomial distri¬
bution tends to the normal distribution. Suppose that the random vector
= (£ni> ■■■Am) (N = 1,2,...) has a multinomial distribution

N\
P(£ni — kl, •• •■> £Nn — k„) — ■ r>kl
Pi Dk" :
Jrn (20)
kf

where klt. . ., kn are integers and

Ykj = N, Pi>0, Y Pj=\


7=1 7=1

Let denote the random vector (gN1,. . . , rjNn) with

£nj ~ NPj
hNj — (21)
s/NPj

We obtain for the characteristic function of rjN


n N
Itj
9\N (0 = exP [- ijHI tjy/pj)] L Pj^V
j=l 7=1 jNp )J
hence

In (pnN(t) tjJPj) +
7=1

» I
( Ui
+ Ain 1 + I Pi - 1 (22)
7=1 lexp «
By substituting the expansions of ex and ln(l + x) we have

lim In <pnN(t) = tjJpj )2]. (23)


A'—oo 7'=1 7' = 1

The limit distribution has the characteristic function

99(t) = exp (24)


7=1

This is the characteristic function of a degenerate normal distribution;


VI, § 7] INFINITELY DIVISIBLE DISTRIBUTIONS 347

in effect the random variables r\Nk are connected by the linear relation
n _

X
k=i
>iNk \/Pk = Q;

the limit distribution too is thus concentrated on the hyperplane defined


by the equation
n _

X Xks/Pk =
k=1
o.

§ 7. Infinitely divisible distributions

In this section we shall deal with certain types of distributions which


can be described most conveniently by means of their characteristic
functions.
The probability distribution of a random variable £ is said to be infinitely
divisible if for every n = 2, 3,. . . there exists a probability distribution
the n-fold convolution of which is equal to the distribution of £. Or to
put it otherwise: the distribution of £ is infinitely divisible if, for every n,
^ can be represented in the form t, = + • • • + where
are mutually independent random variables with the same distribution.1
Let (f(t) be the characteristic function of £. Obviously, to say that the
distribution of ^ is infinitely divisible means the same as to say that for
i
every integer n the function |+(f)]" is again a characteristic function.
The infinitely divisible distributions can be characterized by the following
property:

Theorem 1. The function g(t) is the characteristic function of an infinitely


divisible distribution iff In<f (t) is of the form
+ 00
O_2 t12 iut 1 + w2
In cp{t) = iyt — + - 1 dG(u), (1)
1 + ul

where y and a > 0 are real constants and G(u) is a nondecreasing bounded
function. (Formula of Levy and Khinchin.)
If the distribution has a finite variance, (1) may be written in the form
+ 00
dK(u)
In 9ft) = imt + cr2 (e'“f - 1 - iut)- (2)

1 This second definition is not completely exact, since it is not certain that the
£,k-s can in fact be realized on the probability space in question. However, if the words:
“for a suitable choice of the probability space” are added, then the second formulation
becomes correct and equivalent to the first one.
348 CHARACTERISTIC FUNCTIONS [VI, § 7

where m is a real number, a > 0, and K(u) is a distribution function. (Formula


of Kolmogorov.)
In particular, if
0 for u < 0,
m= 1 for u > 0,
then

In <f (t) = imt —


2

hence in this case the distribution in question is a normal one. It can be


seen from (1) as well as from (2) that if y(t) is the characteristic function
i

of an infinitely divisible distribution, not only [cp(t)]n but also [y(t)]a


is a characteristic function for every or > 0; furthermore, (1) shows that
(pit) differs from zero for real t.
If we put in (2) m — Xh, a2 = Xh2 with X > 0, and

*(,o = |0
[ 1
u~h■

for u > h,
then
(p{t) = exp {X(eilh — 1)}.

The distribution is thus in this case a (generalized) Poisson distribution


in the following sense: A random variable so distributed takes on the
values kh (k = 0, 1, 2,. . .) with probabilities:

P(( = kh) = Iff (k = 0,1,...).

It follows immediately from the definition that the convolution of


infinitely divisible distributions is itself an infinitely divisible distribution.
Thus a distribution which is the convolution of a normal distribution and
of a finite number of generalized Poisson distributions (of the above type)
is infinitely divisible.
It follows from (1) that every infinitely divisible distribution can be
obtained as the limit of convolutions of a normal distribution and of a
finite number of generalized Poisson distributions. It can be shown that the
limit of a convergent sequence of infinitely divisible distributions is itself an
infinitely divisible distribution.
The proof of Theorem 1, however, will not be dealt with here.1

1 Cf. e.g. P. Levy [4] or B. V. Gnedenko and A. N. Kolmogorov [1],


VI, § 8] STABLE DISTRIBUTIONS 349

§ 8. Stable distributions

A distribution function i^x) is said to be stable if for any two given


real numbers ml5 m2 and for any two positive numbers a<r2, there exist
a real number m and a positive number a such that

\ x — mx' x — m2 x—m
* F —F (1)
1 °2

Here the sign * denotes the operation of convolution. As we have seen


the normal distribution is a stable distribution, in fact the function

satisfies (1) with


m — m1 + m2 and o (2)
Theorem 2' of § 5 implies that the only stable distribution of finite
variance is the normal distribution. There exist, however, stable distributions
of infinite variance; thus for instance the Cauchy distribution with the
distribution function
1 1
F(x) =-1-arc tan x
2 7T

fulfils (1) with m = m1 + m2 and a = ax + a2.


We state now without proof the following theorem:

Theorem 1. A distribution with the characteristic function (p(t) is stable


iff rp(t) can be written in the form

In (p{t) = iyt — c\t |a 1 + ijS — co(t, a) (3)

where the constants a, /?, c fulfil the inequalities

- 1 < p< 1, 0 < a < 2, c > 0, (4)

y is a real number and


not
tan for ct / 1,
a>(t, a) = (5)
In U | for a = 1.
350 CHARACTERISTIC FUNCTIONS [VI, § 8

The number a is called the characteristic exponent of the stable distribu¬


tion defined by (3). For the normal distribution a = 2, c > 0; in the case
of a = 1, p = 0 we obtain the Cauchy distribution.
It can be proved without any difficulty that for a stable distribution
with characteristic exponent a < 2, the moments of order 5 < a exist but
those of order 8 > a do not exist.
It follows from Theorem 1 that every stable distribution is infinitely
i
divisible. In fact, if cp(t) fulfils (3), the same holds for [(p(t)]n with the
y c
same a and/? and with — or — instead of y or c, respectively. This can

be seen directly as well, since if the distribution with the characteristic


function cp{t) is stable, we have

= (pfiJ) eiynt
i
with qn> 0; hence [^(t)]" (n = 1, 2, . . .) is again a characteristic
function.
For a detailed study of stable distributions we refer to the books cited
in the footnote of the preceding paragraph. Levy calls only those
nondegenerate distribution functions F(x) stable, for which to any two
positive numbers cx and c2 there exists a positive number c such that
F(cxx) * F(c2x) = F(cx). Distributions, which we called stable above,
are called quasi-stable by Levy. It can be shown that a distribution with
characteristic function cp(t) is stable in the sense of P. Levy, iff In <p(t)
may be written in the form (3) with y = 0 for a # 1 and /? == 0 for a = 1.
Thus the following result is valid:

Theorem 2. If a distribution with the characteristic function <p(t) is stable


in the sense of P. Levy, In cp(t) can be written in the form

In <p(t) = - jc0 + icx j^jj 11 |a with 0 < a < 2, (6a)

where c0 is positive.

It can be shown further that the inequality

[fij < tan


7ia

Co
(6b)

holds; however, this will not be proved here.

Proof of Theorem 2. If <p(t) is the characteristic function of a distri¬


bution which is stable in the sense of Levy, there exists for every pair
VI, § 8] STABLE DISTRIBUTIONS 351

cx > 0, Co > 0 a number c > 0 with

cp(c11) <p(c21) = y(ct). (7)

In particular, if cx = c2 = 1, we have

r (0 = 9>(?0- (8)
It may be supposed q # 1, since # = 1 would imply, because of <p(0) = 1
and the continuity of (p{t), the relation (p(t) = 1 and y(t) would then be
the characteristic function of the constant zero. Furthermore (8) implies
<p{t) 9^ 0 for every real t: since if q>(t0) = 0 could hold, (8) would lead to

h
(n = 1,2,...) (9a)
q
and
<p(qn t0) = 0 (n = 1,2,...) (9b)

which is impossible both for q > 1 (9a) and for q < 1 (9b) in view of
9?(0) = 1 and the fact that cp{t) is continuous at t = 0.
Thus i//(t) = In (f {t) is continuous too, and i/^(0) = 0. Let cl5 c2,. .., cn
be any n positive numbers; according to (7) there exists a c > 0 with

£ Wckt) = 'i'(ct)-
fc=i

Hence, for cx c2 = . • . = c„ = 1 there exists a c(n) with

n\J/(t) = <A(c(n) t). (10)


Thus
c(n)
ni^(t) = ip(c(n) t) = nuj/
c(m) J

dn)
If we put c -, we obtain
c(w)

— = 1A (11)
m

Consequently, there corresponds to every rational r a number r(r) such that

= (12)
We show now that c(r) is uniquely determined and the relation

c(rs) = c(r) £’(5) (13)

holds for any two positive rational numbers r and s.


352 CHARACTERISTIC FUNCTIONS [VI, § 8

From ip(at) = tj/(bt) and a < b follows ij/(t) = \jj


((a Mt\, and because
T
• \ /' T \ / T \ y ~y

since the distribution was supposed to be nondegenerate. Hence necessarily


a = b and c{r) is thus unique. Since further

*#(0 = ^{c(rs) t) = IJ/(c(r) c(s) t)

for every t, (13) holds for any two positive rational numbers r, s. Let now
9 be a rational number q > 1 and t a real number such that # 0.
Then
'l'([c(q)]nt)=qn\l>(t). (14)

We have necessarily c{q) > 1, since c(q) < 1 would imply

max 11l/(u) \>qn\ ip(t) |

for every n, which contradicts the continuity of i/i(t) on every finite interval.
c{r) f r )
SmCe~c(yr “ C tI ’ r > S imPlies CW > C(J)- The function c(r) is thus
increasing. For every irrational X > 0 we define c(A) by

c(A) = lim c(r),

where r tends to A through rational values. Because of the continuity of


^(0 it follows from (12) that the equalities

A«K0 = >A(c(A) t) (15)


and
c(Ap) = c(A) c(}l) (16)

are valid for every positive value of A and /t and that c(A), as a function
of the real variable A > 0 is increasing. Put

g(x) =.In c(ex) (- oo < x < + oo), (17)


then it follows that g(x) is increasing and

g(x +y) = g{x) + g(y) (18)


is valid. Hence (cf. Ch. Ill, § 13)

g(x) = ~ for a> 0 (19)


a
VI, § 9] CONDITIONAL PROBABILITY DISTRIBUTIONS 353

and
jp
c(x) — x “ , for x > 0. (20)
This leads to
XaiP(t) = iKXt) (21)
for every real t and positive A. Thus for t > 0 we have

xjj{Xt) = Xa iA(0 = taiKA), (22)


and therefore for A = 1:

= \t <A(1), for t > 0. (23)

Put \J/(\) = — (c0 + ic]l). Since <p(—t) = cp{t), and thus also i//(—t) = \p(t),
we obtain for every real t

H0 = Cl 1 f |“, : (24)

Because of \<p(t) \ < 1, c0 > 0 holds. It remains to show that 0 < a < 2.
But a > 2 would imply 9/(0) = 0 and hence D%E) = 0, and thus £ would
be a constant. Herewith (6a) is proved.

§ 9. Characteristic functions of conditional probability distributions

In the present paragraph we need a generalization of the concept of


'"function” due to L. Schwartz. In order to construct the theory of genera¬
lized functions (called distributions by L. Schwartz) we follow the way
suggested by J. Mikusinski and worked out by G. Temple and M. J. Light-
hill.1
Let C denote the set of infinitely often differentiable complex-valued
functions, for which

/(fc) (x) = O for X | —> + 00 (1)

for any nonnegative integers k and N. If f(x) £ C and if a, b are real, and
A is a complex number, then A/(x) £ C, f(ax + b) £ C, and f(k)(x) £ C
(k = 1,2,...); furthermore f(x) £ C, g(x) £ C imply f(x) + g(x) £ C. Let
K denote the set of all infinitely often differentiable functions g(x) such
that
g(k\x) = 0(\ x |^*) for | x | 00 and k = 1,2,...,

1 Cf. M. J. Lighthill [1].


354 CHARACTERISTIC FUNCTIONS [VI, § 9

where the numbers Nk are integers. It is easy to see that f(x) £ C and
g(pc) £ K imply f(x) • g{x) £ C.
A sequence of functions {/„(X)} (fn(pc) £C; n = 1, 2,. . .) is said to be
regular, if for every //(*) £ C the limit

lim | fn{x)h(x)dx (2)


/!-*• + 00 — 00

exists. Two regular sequences {/„(+)} and {gn{x)} are said to be equivalent
if for every h{x) £ C

+ 00 +00

lim J fn(x)Kx)dx= lim [ gn{x)h{x)dx. (3)


«-*- + oo —oo n -*■ + oo —oo '

This equivalence relation defines a partition of the set of regular sequences


of functions into classes. A generalized function is an equivalence class of
regular sequences of functions.
In the present paragraph generalized functions will be denoted by capitals
(F{x), G(x),.. .) and the ordinary functions by lower case letters (f(x),
g(x),. . .). If the regular sequence of functions {/,(*)} defines a generalized
function F(x) we express this fact by a sign F(x) ~ {fn{x)}. If F(pc) ~
~ {/„(*)} and if h(x) £ C, we define the “integral”

f
— 00
F(x)h(x)dx

by
+ 00 + 00
J F(x)h(x)dx = lim j f, (x) h(x) dx, (4)
— 0° «-<- + oo —‘oo

where on the right hand side the limit exists by assumption a d remains
the same when {f„(x)} is replaced by another sequence equi' dent to it.
If F(x) ~ {/„(*)} is regular, the sequence {Affax + 6)} s evidently
regular for any two real numbers a, b and for any complex number A
We put therefore AF(ax + b) ~ {Af„(ax + £)}. If F(x) ~ fn(x)} and
G(x) ~ {gfx)}, the sequence {fn(x) + gn{x)} is again regul vr; wo put
F{x) + G(x) ~ {/„(*) + #„(*)}• Finally if {/„(*)} is regular and < g{x) 6 K,
the sequence {/„(*) g(x)} is regular. For F(x) ~ {/„(x)} we put
F(x)g(x) ~ {/„(*) • g(x)}.
{/«(*)} is regular, {/„'(*)} is also regular because if h(x) £ C we have

+~° +00

J fn(x)h(x)dx^~ J fn(x)h'(x)dx (5)


VI, § 9] CONDITIONAL PROBABILITY DISTRIBUTIONS 355

from which the existence of the limit

+ °0 +00

lim j /„' (x) h(x) dx = — j F(x) h' (x) dx


n-+ oo — oo — oo

follows. The derivative of a generalized function +(x) ~ {/„(*)} is defined by

f' w ~ iax)}.
Thus we have
+ 00 + 00
j F' (x) h(x) dx — — [ F(x) h'{x) dx. (6)

A generalized function is thus infinitely often differentiable. It is easy


to prove the following rules of calculation:

A(F(x) + G(x)) = XF(x) + XG{x),

(cF(x))' = cF'(x),

(F(x) + G(x))'=F'(x) + G'(x),

(F(ax))' = aF'(ax).
Example. Let us put
n nXl
"2
fn (x) =
2n
If h(x) 6 C, it is clear that
+ 00
lim j fn (x) h(x) dx = h{0).
n-+ oo —oo

The generalized function {/„(x)}, which we denote by 5(x), is called


Dirac’s delta function. For every h(x) £ C we ,tave 1 hus
+ 00
[ <5(x) h(x) dx = h(( ).
— 00

We prove now some theorems.

Theorem 1. If /(x) £ C and if


+ GO

y(t) = j /(x) eitx dx (— co < / < + co), (7)

9ft) belongs then to C as well.


356 CHARACTERISTIC FUNCTIONS [VI, § 9

Proof. f(x) = 0(|jc| k) for every integer k, hence the integral (7) exists
for every real number t; further

+ oo

J 1/00 I -\x\kdx

exists too. The function <p(t) is thus infinitely often differentiable and

<p{k) (0 - r (k= 1,2,...). (8)


— 00

If we integrate (7) N times by parts we obtain

+ 06

i'N
<K0 = | fN)(x)ei,xdx (N= 1,2,...). (9)

The integral on the right hand side exists, since f{x) £ C, hence y(t) =
= 0(|t|_iV) for [t| -> + co and for every integer iV. By (8) the same holds for
(p{k\t) since (ix)kf(x) £ C, hence Theorem 1 is proved.
The function rp(t) is the Fourier transform of f(x).

Theorem 2. If f{x) £ C and if

<p{t) = J fix) eitx dx,


— 00

we have the following inversion formula:

+ 00

/w=Zr J 9?(t) c ltx dt. (10)

+ 00
Proof. The preceding theorem guarantees the existence of j’ | and
— 00

(10) follows from Theorem 2 of § 4.

Theorem 3 (Parseval’s theorem). If f(x) £ C, g(x) £ C, further if

<P(0= J f(x)eiudx,
— 00

00
7(0= j g(x)eitxdx,
VI, § 9] CONDITIONAL PROBABILITY DISTRIBUTIONS 357

then we have
\ + 00 + CX"

j /(*)d(x) dx = j (p(-t) 7(0 J/. (ID


— 00 —00

Proof. By Theorem 2
+ 00

#(*) = j 7(0 e~Ux dt.


— 00

Hence
+ 00 + 00 + 00

j /(*) g(x) dx = j" j f(x) y(x) e~itx dt dx =


— 00 —00 —00

+ 00
1
27z
J
— 00
7iO<P(~Odt

since the order of integration can be interchanged.

Theorem 4. If {/„(*)} F(x) is a regular sequence of functions and if


<pn(t) is the Fourier transform of fn(x), then [<p„(0} ~ 0{t) is again a regular
sequence of functions; the generalized function (l>(t') remains invariant when
{/„(*)} is replaced by an equivalent sequence. For this relation between the
two generalized functions F(x) and <P(t) we can write

+ 00
F(t) = \ F(x)ell3Cdx. (12)
—*00

If h{x) is a function of the class C and %(t) is its Fourier transform, we


have

J
+ 00 +00

j <P(t)x(t)dt = 2n F(x)h(-x)dx. (13)


— oo ~ co

We say that the generalized function <P(t) is the Fourier transform of


F(x).

Proof. Let y(t) £ C and


+ CO

h(x) = J yft) e~itx dt.


— OD
358 CHARACTERISTIC FUNCTIONS [VI, § 9

Then, according to Theorem 3

+ 00 +00

<Pn (0/(0 dt = 27: j fn (X) K~X) dx‘


-00 -00

Thus if F(x) ~ {f,(x)} we have

lim J <pn(t)x(t)dt = 2n ( F(x)h(-x)dx, (14)


n-*- oo—oo —oo

hence (f>„(t)} is indeed regular.


It can be seen from (14) that if {/„(*)} is replaced by an equivalent
sequence, {^(t)} will be replaced by an equivalent sequence as well.
(13) follows immediately from (14).
Let D denote the class of the ordinary measurable functions /(x) such
that |/(x)| < A( 1 + 1^1^), — co < x < +oo for any integer N and A > 0.
From /(x) £ D, h(x) £ C follows immediately the existence of the integral
+ 00

j~ f(x) h(x) dx.


— 00

Theorem 5. If f(x) £ D, there exists a generalized function F(x) such


that for every h(x) £ C

f f{x) h(x) dx = f F(x) h(x) dx. (15)


-00 -oo

Proof. Put
+ 0°
n f iiZi -n(x-yy
2^J Me" e 2 d>'-
— 00

A simple calculation shows that fn(x) £ C; furthermore from the fact


that a Lebesgue integrable function is almost everywhere +qual to the
derivative of its indefinite integral we find that lim f jx) = fix) for almost

every x. According to the convergence theorem of Lebesgue,

+S° +oo
hm j fn(x)h(x)dx= f fix) h(x) dx,
n-*oo — oo —oo

q.e.d.
VI, § 9] CONDITIONAL PROBABILITY DISTRIBUTIONS 359

Let f{x) CD and let Fix) ~ {/„(*)} be the generalized function corre¬
sponding to it (in a unique manner) according to Theorem 5. Write
fix) ~ Fix).
Let now £ be a random variable on a conditional probability space with
density function fix') and assume that fix) £ D. The characteristic function
<Pft) of £ is defined by

${(f)= jV(x) eitx dx


— 00
(16)

with Fix) ~ fix) in the sense of (12); +*(?) is thus a generalized function. If

j fix) dx = 1
— 00

i.e. if fix) is an ordinary density function, and if we put as usually


+ 00
<Pt(t)= J J\x)eixtdx,
— 00

then <pfit) ~ Qft)-, the definition of Fft) is thus consistent with the
definition of ordinary characteristic functions 9oft). In fact, in this case
(pft) is continuous and j yft) | < 1, hence cpft) £ D. It suffices thus to
prove that for every lit) 6 C the relation
+ C0 +00

f <Piit)x(t)dt= j' $t(t)x(t).dt (17)


— 00 -00

holds, where the integral on the left is an ordinary integral. If


+ 00

Kx) = | lit) e~itx dt,


— 00

we obtain, by proceeding as in the pro^f of Theorem 3,


+ 00 +00

j (p$it)xit)dt = 2% j hi—x)fix) dx. (18)


— 00 “0^

Furthermore, by Theorem 1,

f $i(t)x(t)dt = 2n j Fix)hi-x)dx. (19)


-00 - 00

By the definition of Fix), the right hand sides of (18) and (19) coincide;
we get thus (17).
360 CHARACTERISTIC FUNCTIONS [VI, § 9

If /(x) is the density function of a conditional distribution, /(x) and


consequently <Pft) are only determined up to a constant factor.
Example 1. If £ is uniformly distributed on (—oo. + oo) then /(x) = 1.
If h{pc) £ C, we can write
+ 00 +00 +00 X2

J f(x) h(x)dx = j h(x)dx = lim \ h(x)e 2" dx; (20)

hence for the generalized function F(x) corresponding to f(x), the relation
x2
F(x) ~ {exp is valid. Since
2n
+ 00 _ X2 _ nt2
| e 2" eltx dx = JInn e (21)
we obtain
+ CO

n nt2
<p{(0= F(x) e,xt dx ~ In 2tt5(0. (22)
In-

The characteristic function of £ is thus the Dirac delta function.


Example 2. Suppose /*(x) = <?,7+ Find the Fourier transform of fk(x)-
If h(x) fC we have
+ 00 +00 .£ _ .X2

j fk (x) Kx) dx = lim J /?(x) e * 2" </x.

hence /fc(x) . And since

+ 00 • ;kx--*F _ n(fc + Oa
j e ’ 2“ eix,dx = Jinn e 2
— CO

we find for the Fouiier transform $+0 of fk(i)

\ I n - )
e 2 | = 2n5(k + /).

introduce now the concept of convergence of generalized functions.


Let Fk[x) (£ = 1,2,...) be a sequence of generalized functions. We say
that Fk(x) tends to the generalized function F(x) (in signs Fk(x) -*■ F(x)),
if for every h{x) £ C

+r°
lim j Fk (x) h(x) dx = F(x)h(x)dx. (23)
k-t-oo -oo _oo v '
VI, § 9] CONDITIONAL PROBABILITY DISTRIBUTIONS 361

Theorem 6. Let [Fk{x)} (k = 1,2,...) be a sequence of generalized


functions and put

$k (0 = J ^ (a) e'7* dx (Jc — 1,2,...). (24)


— 00

Then Fk(x) -*• F(x) i ff <Pk(t) -» 4ft). Thus

+ 00
#(0= J F(x)eitxdx.
— 00

The proof follows immediately from Theorem 3.


Theorem 6 permits the use of characteristic functions for the establish¬
ment of the convergence of distribution functions also in case of conditional
distributions. In this case the characteristic functions are usually not
ordinary, but generalized functions. As an example we prove a limit distri¬
bution theorem.

Theorem 7. Let independent random variables with


the same distribution. Assumefurther that their distribution function is absolutely
continuous with finite variance and bounded density function fix). Suppose
E{£f) = 0. Put

Cn = £l + £2 + • • • + (25)

Then the distribution of £„ tends to the conditional distribution uniform on


the whole real axis, that is for any four real numbers c < a < b < d we have

lim P (a < < b \ c < £„ < d) = —— . (26)


n-*- m & C

Proof. Let fn(x) denote the density function of C« and a the standard
deviation of the fix) < M implies fn(x) £ D. We show that for every
Kx) 6 C
+ 00 + 00

lim er
«-*- 00
J fn(x)h(x)dx =
1
2n
J* h(x)dx. (27)
-00

Relation (27) proves the theorem. Indeed if it holds for every h(x) £ C,
let then be h{a, b, e, x) a function of the class C such that

0 < h{a, b, e,x)< 1 (28)


362 CHARACTERISTIC FUNCTIONS [VI, § 9

and
0 for x < a — s,

h(a, b, s, x) = 1 for a < x <b, (29)

0 for b + s < x.1


We have then
+ 00
\ fn(x)h(a + s,b — £,e,x)dx \ fn(x)dx
< <
+ 00 — d

-00J (X> Kc» d, z,x) dx


fn J /„ (x) dx c

+ 00

J (x) h(a, b, £, x)dx


fn
< (30)
+ CO

J fn (X) KC + e, d — £, £, X) dx

Since

f fn(x)dx
P(a<Cn<b\c<{n<d)= JL- (31)
J fn (X) dx

(27) and (30) enable us to write


+ oo

j* h(a + £,b — £,£, a) dx f h(a, b, £, a) dx


— 00

+ 00 <1<L< + 00 (32)
J h(c, d, £, x)dx I h(c + £, d — £, £, x)dx

1 These conditions are e.g. fulfilled by the function

I, I \ i I x ® “f- ^ i t (x b
h(a, b,e,x) = k\-| — k

where we put

for x < 0 ,
X
1
J exp dt
0 -1(1 -0 .
*(*) = for 0 < x < 1,
J exp dt
-o

for x > 1.
VI, § 9] CONDITIONAL PROBABILITY DISTRIBUTIONS 363

where
/ = lim inf P(a < < b | c < C„ < d)

and
L = lim sup P(a<Cn<b\c<Cn< d).

When e 0, the first and the last member of the threefold inequality (32)
, b — a
tend to ———, hence (27) implies (26).

Let now be F„(x) ~f,(x) and let <ph(t) be the Fourier transform of.
o Jlizn fn(x). By Theorem 6 it suffices to prove that $n(t) -> <5(0, where
<P„(t) is the generalized function corresponding to cpn(t) (n = 1,2,...)
and <5(t) is Dirac’s delta.
Put
+ °°
<P(t) = j f(x)el,xdx.

We see that cp„(t) = a Jinn (pn(t). We have to show that for every x(0 6 C
one has
+ OO

The proof can be carried out by means of the method of Laplace (cf. Ch.
Ill, § 18, Exercise 27).
By Theorem 11 of § 2 we have | cp(t) | < 1 for t ^ 0; furthermore by
Theorem 8 of § 2
lim cp(t) = 0.
|/|- + CO

Hence there can be assigned to every s>0ag = q(e) with 0 < q(e) < 1
such that | 99(0 | < q{s) for 111 > e. But then we have

a
J >B
95" (0 x(0 dt <oJn[q{E)]n
J
— OO
\x(t)\dt. (34)

On the other hand, for every 111 < e

Incp (0 = — (1 + i/(0) with lim i/(0 = 0.


2
364 CHARACTERISTIC FUNCTIONS [VI, § 9

By introducing a new variable u = to *Jn, we obtain

+ EG In
r
U
2 r u | u
1 f du.
l +1/ X
VS Jexp ~ T . Cyjn j,
—ea/n

Since %(t) is continuous for t — 0 and is bounded, it follows from Lebesgue’s


theorem that

Hm ° J
—£
^^ dt = <'35')
(35) and (34) lead to (33).
Let us remark that the assumptions of Theorem 7 can be considerably
weakened.
The product of two generalized functions is generally not defined. The
way which would seem quite natural to follow leads astray: the regularity
of {/„(*)} and {#„(*)} does not, in general, imply the regularity of
{f„{x)gn(x)}. Just take as an example

fn (*) = 9n O) =

Here for every h(x) £ C with h(0) ^0 we have


+ 00

+ 00.
— 00

Consequently, <52(x) has no sense at all.


So far we did not define the characteristic function of a random variable
defined on a conditional probability space unless the random variable
had a density function. Now we have to deal with the general case. Let
^(x) be the distribution function of £. Suppose that there exists a genera¬
lized function such that

+ 00
F(x) h(x) dx — j* h(x) ddF(x) (36)
S
— 00 ‘—00
VI, § 9] CONDITIONAL PROBABILITY DISTRIBUTIONS 365

for every h{x) £ C, where on the right hand side figures an ordinary Stieltjes
integral. Then the Fourier transform of F(x), will be considered as
the characteristic function of the random variable £.
Example. Suppose that £ is uniformly distributed on the set of the integers,
i.e. the distribution function of £ is given by [x] ([x] represents the integer
part of x, i.e. the largest integer smaller than or equal to x). In this case

+f h(x)dy(x) = +f h(k); (37)

there exists a generalized function fulfilling (36), namely


4- co

F(x) — ~ fy- (38)


fc = —00

it is easy to show that


+ O0
<P(t) = 2n Y ^(t - 2kn). (39)
fc= —00

If we apply (13) to any function h(x) 6 C and to an F(x) defined by


(38), we find
+ 00 +00

x
k = — oo
m=k= — x oo
(4°)
where x(0 is the Fourier transform of h(x). (40) is Poisson's well-known
summation formula. In particular, if h{x) = exp (—x"/!2), then

Jn
x(0 = exp
X

and it follows from (40) that

+ 00 /XT + =° _
£ e-k'» = \rY Y e . (41)
k = -cc ^ k = - co

This is a formula known from the theory of 0-functions. We shall need


it later on.
Now follows a theorem similar to Theorem 7.1

Theorem 8. Let &, f2,. . be independent integer valued random vari¬


ables having the same distribution. Suppose that their expectation is zero

1 It was proved by K. L. Chung and P. Erdos [1 ] under weaker conditions.


366 CHARACTERISTIC FUNCTIONS [VI, § 9

and their variance finite. Suppose further that the greatest common divisor
of the values assumed by - £2 with positive probabilities is equal to 1. Put
— £i + £2 + • • • + £n (n — 1, 2,. . .), then for any two integers k and l

P(Cn = k)
lim = + 1. (42)
«-*- 00 P(Cn = l)

Hence when n -> 00, the distribution of '(n tends to the uniform distribution
on the set of integers.

Proof. Let D(fk) — 0. If we show that

lim ajlnnP(Cn = k) = 1 (k =1,2,...), (43)


n-*- 00

the theorem is proved. Let

+ 00
<p{t)= £ P^x = k)eikt (44)
k= — co

be the characteristic function of the random variables We have

+n

P(Cn -k) = j cpn (t) e~ikt dt. (45)


—n

Since by assumption 99(0) = 1, 9/(0) = 0, /'(0) = -a2 and | <p (t) | < 1
for 0 < | t | < n, the method of Laplace (cf. Ch. Ill, § 18, Exercise 27)
leads immediately to the result.
This result can be rewritten in the following manner:

/-+ °° +00

lim <jyf 2nn j cpn (t) h(t) dt = 2n V h(2kn) (46)


k = — 00

for every h(x) £ C; hence o J2nn cp\t) tends for n 00 to the generalized
function (39). Thus if Fn(x) is a generalized function such that

Y __+00
_J Fn(x)h(x)dx = oj2nn £ P(£„ = k) h(k),
00 k = — 00

P„(x) tends for n -> 00 to the generalized function (38).


VI, § 10] EXERCISES 367

§ 10. Exercises

1. Prove the following


Theorem. If £, is a discrete-valued random variable taking on the values xk (x, ^ xk
for j ^ k) with probabilities pk{k — 1,2,...) and if <pt (t) is the characteristic function
of £, we have
T

Pn = Um j (Ps (/) e-^'dt (n = 1, 2,. . .).


-T
Hint. The series

<Pt (0 = £ PK e‘Xn<
n= 1
is absolutely and uniformly convergent, hence it can be integrated term by term.
Furthermore, since for every nonzero real number x
T

■ lim ( eixt dt — 0,
T —► 03 J
-T

the theorem follows immediately.


2. Let £ be an integer-valued random variable and let <p4 (t) be its characteristic
function. Prove that
71

P(£ — k) — -T- (t) e~m dt (k — 0, + 1, + 2,. ..).


— 71

3. Prove the theorem of Moivre and Laplace by means of the result of the preceding
exercise.
Hint. By Exercise 2

” j Pk qn~k = J (Peu + q)n e m dt.

For \k — np\ = 0(ji3 ) the method of Laplace leads after some calculations to

(k - np)
i.(< (pe" + qf e~iM dt =
Jlnnpq
exp
2 npq ]+0!v)-
4. Prove the following characteristic property of the normal distribution: Let
F(x) be an absolutely continuous distribution function, let F\x) = f(x) and
+ 00
J x2f(x) dx = 1.
If we put
+ °>

H(f(xj) = - J /(x) In f(x)dx,


— 00
368 CHARACTERISTIC FUNCTIONS [VI, § 1C

we have
< In ^2ne,
_ 1
where equality holds only for /(a) = (2ji) 2 exp . Hence //(/(a)) assumes

its largest value in the case of the normal distribution. (In information theory the
number H(f(x)) is called the entropy of the distribution with the density function/(a);
cf. Appendix.)

5. If £"(() exists, then we know that rp^t) is differentiable at t = 0 and tp'e(0) = iE(Q.
Show that the differentiability of cp?(t) does not necessarily imply the existence of £(£).

Hint. Put

Pit = «) = P(i=-n)= —i— (« > 2),


n2 In n
with
1 -l
«2 In n
We find
00
cos nt sin
Vi (0 = 2c £ Vi (0 = - 2c £
n=2 n2 In n n=2 n In h

The trigonometric series <p,f(0 is uniformly convergent1 and 9?|(0) = 0. Nevertheless.


£(0 does not exist.

6. Let £ be a random variable and AT. = £(|£|a), a > 0. Suppose that M is finite.
Show that if 0 < < a. ,

<(MJ\

bq
Hint. For positive a and b p > 1, q = —?— we have2 ab < — + —.
p — 1 Q
Apply this inequality with

a -
A ’ 6“ * - 7•

7. Study the limit distribution of the multinomial distribution

N\
Pklk*-kr = ~k[\ k2l .. . krl ^ • • • •P&r (*/ ^ 0, X k, = N

when A-a oo, with £ Pn, = 1 and lim NpNj = \ (J = 1, 2, . .., r - 1).
i— 1 TV —► co

Hint. The (r - l)-dimensional characteristic function of the multinomial distri-


bution is

(i + IX (^-i))",
/■=i

1 Cf. e.g. A. Zygmund [1], p. 108.


2 Cf. e.g. G. H. Hardy, J. E. Littlewood and G. Polya [1], p. Ill,
VI, § 10] EXERCISES 369

hence from lim Np^j = k; follows


N —*• oo

lim
N-+a>
(1 + X PnM'’ ~ '))'V = II exP
7=1 7=1
~ W

8. a) Let ^ be a random variable of zero expectation and put

<KS) =7 — 00
1*1 dF(x).

If <p(t) is the characteristic function of £, we have

= 1 j" 1 -Re(y(Q)
d(5) dt. (1)

(Here as usual Re(z) denotes the real part of z.)

Hint. From
co

1 — Re(<p(/)) — J (1 — cos xt)dF{x)


— 00

follows
CO CO

L f 1 - Re (vO) ± _ 11 C ( C 1 — cos xt 3
if -J (J— CO —CO
——dt)
dt dF(x).

Hence the relation (1) follows according to Formula (40) of § 4.


X'
Remark. It can be shown that the necessary and sufficient condition for the existence
of d(£) is the existence of the integral

1 - Re (cp(t))
J dt.

b) If we add to the assumption a) the other one that the variance of £ exists, we
have further

= _± ( Re (y'(Q)
Re(y7(o; d!
n J t

Hint. This can be obtained from a) using integration by parts.

9. Let £t, f2) be independent random variables having the same distribution
which is symmetric with respect to the origin and has variance 1. Consider the sums
t„ = £i + £2 +...+ £„• Show that

d(U
D(Cn)
370 CHARACTERISTIC FUNCTIONS [VI, § 10

Hint. If cp{t) is the characteristic function of the random variables £„, we have

95C„//«(0 = <P
V"
Since cp{t) is real for every real t, we obtain, by taking into account Exercise 8,

d(C„)
d(C„) Jn
Jn f
f n_ ,, , <P'
cp («)
iu
du.
DiZn) *1 .) U

From this we obtain the required result by the method of Laplace.

10. If <?(/) is the Fourier transform of the generalized function F(x), the Fourier
transform of F{ax + b) is
itb
1 - ™ _(t\
—e 0 — .
a I 1 a)
11. With the same notations, the Fourier transform of F'(x) is — it0(t).

12. a) With the preceding notations, the Fourier transform of x"F(x) is (—i)"0<n\t).
b) If the conditional density function of i is x2n (n = 1,2,...), the (generalized)
characteristic function of £ is 2ti(—1)” <5t2n>(r), where 8(t) denotes Dirac’s delta
(cf. § 9, p. 355).

13. a) Let £lf £», be independent random variables having the same normal
distribution, E(^k) = m, l = — +... + {„), and

^= ./ ~ 02 + (^ - 02 + • ■ • + (4 - 02
n - 1
Show that

T = a/” ^ ~

has Student's distribution with n — 1 degrees of freedom.


b) Let be independent random variables having the same normal
distribution. Put

& = -& + ■■■ + U, & = ~((H+1 + ... + «„+„), .


n+m

I'
fc= *
- ^,i>)2 + X («* - Ar=/i-fl
S =
n + /7i — 2
Show that

^ - <f<2> / ~nm
t=
s \ n + m

has Student s distribution with n + m — 2 degrees of freedom.


VI, § 10] EXERCISES 371

14. Prove that the following property characterizes the normal distribution: If
f(x) ( —00 < x < + °o) is a continuously differentiable positive density function
such that for any three real numbers x, y, z the function

f{x — 0/0 - 0/0 - 0

x y -f* z
has its maximum at t = -- , then

fix) = exp
yjlna 2a2

Hint. By assumption, fix) is positive. If we put gix) = ^^ and s = x + y + z


fix)
we have
gix — s) + giy — s) + #(z — s) = 0. (2)
x + y
For x = y = z, we obtain g(0) = 0. Take now any two x and y and put z = -s;

the relation (2) and #(0) = 0 lead to

9 + g = o.

Hence gix) is an odd function. If we put u — x — s, v = y — s and thus z — s —


= — iu + v), we can write Formula (2) in the form

9iu) + 9io) = giu + v). (3)

Since by assumption, giu) is continuous, we know that (3) implies

, , fix)
gix) - -ttt = Cx (C = constant).
fix)
Hence by integrating.
Cxi
fix) = A exp

As fix) is a density function, we have C ■ -- and A — — — , with a > 0.


yj 2n a

Remark. The result is valid under weaker conditions too.1

15. If and are independent random variables having the same nondegenerate
distribution with finite variance and if the random variable a£,x + b£2 (0 < a < b < 1;
a2 + b2 = 1) has again the same distribution, then this distribution is normal with
expectation zero.

16. The distribution having the density function

1 1
2x
fix) = X ix > 0)
V 2n

1 Cf. A. Csaszar [2].


372 CHARACTERISTIC FUNCTIONS [VI, § 10

is stable and corresponds to the parameters a — —, ^ — — 1. y — 0, c = l.1

17. If (p{t) is a characteristic function, then

il/it) = —
= 4.1| (p(u) du

is a characteristic function as well.

18. Show that the gamma distributions are infinitely divisible.

19. Let £(s) denote Riemann’s zeta-function


“ i
C(j) = — with s — a + it, a > 1.
n= 1 n

Show that the function


C(cr + it)
(<7>1)
is the characteristic function of an infinitely divisible distribution.

20. We know (Ch. IV, § 10) that the quotient of two independent N (0, 1) random
variables has a Cauchy distribution. Show that this property is not characteristic for
the normal distribution. If £ and rj are independent, have the same distribution of

zero expectation and if — has a Cauchy distribution, then it does not follow that
V
£ and rj are normally distributed.

Hint. Take for density function of £ and ??

1
fix) = v/2 i- '< x < +°°).
n 1 + a4

We obtain for the density function of


V

\y\dy
dix)
n2 ) (1 + y4)i 1 + *4y4) Ji(l + a2)
— 00

Remark. This example is due to Laha.2

1 fix) is the density function of £ 2; where E, is V(0, 1); this distribution is some¬
times called the “inverse normal distribution”.
2 Cf. R. G. Laha [1].
CHAPTER VII

LAWS OF LARGE NUMBERS

§ 1. Chebyshev’s and related inequalities

In the present paragraph we shall deal with an inequality due to Chebyshev


and with some similar inequalities, all needed in the proofs of the laws of
large numbers. First we prove the famous inequality of Chebyshev.1

Theorem 1. Let £ be any random variable with expectation M = E{fi) and


with a positive finite standard deviation D = D[fi). If X is a real number,
X > 1, we have

P(\£-M\>XD)<-^r. (1)

Remark. If 0 < X < 1, (1) remains valid, but becomes trivial.

Proof. If we apply Markov’s inequality (cf. Ch. IV, § 13, Theorem 1) to


the random variable rj — — M)2 with X2 instead of X, we obtain imme¬
diately (1).
If we apply Markov’s inequality to other positive functions of £, we obtain
other inequalities, related to Chebyshev’s inequality. Thus for instance if
we put r] = | £ — M |a (a > 0) and

Ma = E(\Z-M D (2)

(thus Ma is the a-th absolute central moment of £), then we get

P(\(-M\>W)<d[L-. (3)

Of course for a = 2 (3) reduces to (1).


In order to get an inequality as sharp as possible, we have to choose a

in such a manner that -— should be as small as possible.


(XD)a

1 Called also the Bienayme-Chebyshev inequality.


374 LAWS OF LARGE NUMBERS [VII, § 2

We can also apply Markov’s inequality to the random variable rj = eE^~m.


If we put
E(e* *-■»*)) = ^#(e), (4)
we obtain
P^-m > ^(g) ei) < e-tt (5)

where t is a positive number; the exponential function being monotone,


it follows for e > 0 that

t + In o# (e)
P k > M + < e~ (6)

In order to get the sharpest possible bound we have to choose e such that
t 4- In M(e) , ...
the expression-is minimal or at least nearly minimal.
£

In § 4 an improvement of Chebyshev’s inequality, which is due to Bern¬


stein, will be deduced from inequality (6).

§ 2. Stochastic convergence

We have mentioned already in Chapter III, § 17 the most elementary


case of the laws of large numbers, discovered already by Jacob Bernoulli.
In order to prove a more general theorem, we introduce first the concept
of stochastic convergence, due to Slutsky.
If £i> £25 •••»£«,••• is a sequence of random variables for which the re¬
lation
lim P( | £„ | > e) — 0 (1)
n-*~ 00

holds for every positive e however small, then the sequence gn (n = 1,2,.. .)
is said to converge stochastically (or in probability) to zero. If the random
variables £„ (n = 1.2,...) fulfil the relation

lim P{ | C„ - a | > £) = 0 (2)


n-+- 00

for any fixed e > 0 we shall say that the sequence Cn(n = 1,2,...) con¬
verges in probability (or stochastically) to the constant a and indicate this by

lim st („ = a (3)
or by
a. (4)
VII, § 2] STOCHASTIC CONVERGENCE 375

With this definition Bernoulli’s theorem may be formulated as follows:


In a series of independent experiments the relative frequency of the event A
tends stochastically to the probability P(A) of A when the number of experi¬
ments increases infinitely.
Bernoulli’s theorem is an immediate consequence of Chebyshev’s in¬
equality, established in the preceding paragraph.
k
In fact let £ = — be the ielative frequency of the event A in a series of
n
experiments. The random variable n£„ has a binomial distribution with
expectation np and standard deviation npq {q = 1 — p). Chebyshev’s
inequality leads to

P (5)

In particular, if we put k = s , (5) becomes

Pd
P(\tn-P\>s)< (6)
ns

if now n tends to infinity, the expression on the right of (6) tends to 0, which
proves the theorem.
The definition of stochastic convergence can also be given in the follow¬
ing form: the sequence £„ (n = 1, 2,. . .) converges stochastically to the
number p when to every pair s, 8 of positive numbers (however small) there
can be chosen a number N = N(s, 8) so that for every n > N

P( I C„ - PI > e) < <5- (7)

This condition can also be expressed in terms of the distribution function


F„(x) of £„. In fact (7) is equivalent to

0 for x < p,
lim F„ (x) = (8)
fl-*-oo
1 for x > p.

If Dp(x) denotes the (degenerate) distribution function of the constant p,


(8) is equivalent to

lim Fn (x) = Dp (x) for x # p. (9)

Conversely, it is easy to see that the stochastic convergence of £„ to the


constant p follows from (8) or (9).
376 LAWS OF LARGE NUMBERS [VII, § 2

The concept of stochastic convergence can be generalized still further.


We say that a sequence of random variables C„ (n = 1,2,.. .) tends in
probability (or stochastically) to the random variable £, if for every positive
£ one has
lim P(| C„ — Cl > s) = 0; (10)
n-*- oo

in this case we write


lim st (n = £ (11)
n-*- oo

or
(12)
It is easy to prove the following

Theorem 1. If -4 £, the distribution function Fn(x) of (n tends to the


distribution function F(x) of £ at every point of continuity of the latter.

Proof. If An is the event | £n - £ j < e, we have

P(C„ <x) = P(C„ <x\An) P(An) + P(C„ <x|4) P(i„), (13a)

hence
P(C„ < x) < P(C„ <x\An) + P(An). (13b)
But

f (C. < X M J < P(C < x -M H„) < —- p * *e) . (14)


*\An)

(13b) and (14) imply

lim P((„ < x) < P(£ < x + e) for every e > 0. (15)
n-*-co v 7

On the other hand, by (13a)

P(C„ < x) > P(zl„) P(C„ < x \An) > P(C < x - £) - P(i„). (13c)

From this follows

< x) > P(£ < x — e) for every e > 0. (16)


n-*- oo

Since e can be chosen arbitrarily small, (15) and (16) imply the statement
of Theorem 1.
VII, § 3] GENERALIZATION OF BERNOULLI’S LAW 377

§ 3. Generalization of Bernoulli’s law of large numbers

In the preceding paragraph Chebyshev’s inequality was applied to the


random variable n£„, where £„ denotes the relative frequency of the event A
(with P{A) — p) in a sequence of n independent experiments. Obviously

4+4+•••+4
5
n

where <jj* {k = 1, 2, . . .) is the indicator of the event A in the A>th experi¬


ment, that is
1 if the event A occurs at the k-th experiment,
4=
0 otherwise.

The E,k are by assumption identically distributed, independent random


variables, assuming the values 0 and 1 only. Their expectation is E(£k) = p.
Bernoulli’s theorem states that

lim st
4 + £2 + • • • + 4
n
= E(£k); (1)
n-*- co

i.e. that the empirical mean tends in probability to the common expectation
of the %k. It is easy to show that this property remains valid for arbitrary
independent identically distributed random variables with finite variance.

Theorem 1. Let (n = 1,2,...) be pairwise independent and identically


distributed random variables with finite expectation E(<?,,) = M and variance
Z>2(£„) = D2. Then
1 ” \
lim st — Ed =m.
n-+o0 n k=1 )
Proof. Put
1
c„ = — E 4.
» k=1
We have
D
E(U=M and D((n) =

By applying Chebyshev’s inequality we obtain

D2
jP(|C„-M|>e)< ,2 ’
ne

which proves the statement of Theorem 1.


378 LAWS OF LARGE NUMBERS [VII, § 3

The suppositions of Theorem 1 can be weakened. It is not necessary to


assume that £k are identically distributed, it suffices to assume the existence
of the limit

lim -
n—oo ft k=l

and the validity of

lim —- = 0,
n— oo ft

where

V k=i

Thus we obtain a form of the law of large numbers which is due to A. A.


Markov:

Theorem 2. Let c,k {k — 1, 2,. . .) be pairwise independent random vari¬


ables such that Mk — E(£f) and Dk = Diffi (k = 1, 2, . . .) are finite. Sup¬
pose further the validity of the following two conditions:
a) the limit
1 ”
lim — Yj Mk — M (2)
«-► oo ft k=I
exists and is finite;
b) for

Sn=jYDl
we have1

lim — = 0. (3)

Then for the random variable

1 "
k
n k=1
the relation
C„am
is valid.

1 This condition is certainly fulfilled e.g. if the random variables ^ (or at least
the numbers Dk) are uniformly bounded.
VII, § 3] GENERALIZATION OF BERNOULLI’S LAW 379

Proof. Similarly as in proving Theorem 1, we apply Chebyshev’s inequal¬


ity to

C*n=~ i (4 ~ Mk).
11 1

Taking into account that £(£*) = 0 and D(0 —-, we obtain the re¬
ft
lation
lim st C* -0.
n-*-co
Now
1 n
r*n = 'on
r - r - I Mk
n k=\
and by assumption
n

I Mk-M

if n is large enough. As | - M \ > e can hold only if

follows
lim st ((„ — M) = 0. (4)

The assumptions of the above theorem can still be weakened. Instead of the
pairwise independence of £k it suffices to assume that there does not exist
a strong positive correlation between most pairs. More precisely, the follow¬
ing theorem holds, due essentially to S. N. Bernstein:

Theorem 3. Let c,k (k = 1,2,...) be random variables with finite


expectations Mk = E(£k) and variances Dk = D‘2(fk) (k = 1, 2, . . .).
Let Rjj denote the correlation coe Jficient of f and 6. Assume further the va¬
lidity of the following three conditions:
a) There exists the limit

lim — £ Mk = M;
n—oo ft k= l

b) For S2n =Y,D\ we have S2 < Kn, where K is a constant independent


k=1
from n;
c) < R(\ i — j |), where R(k) is a nonnegative function of k such that
A'(0) = 1 and

lim — Yj R(k) — 0-
oc n k=1
380 LAWS OF LARGE NUMBERS [VII, § 3

1 "
Then £„ = — Y £k converges in probability to M:
n k=i

(5)

Proof. If we consider the proof of Theorem 2 we see that it suffices to


prove the relation
lim £>((„) = 0; (6)
n-*- oo

if this is done, the remaining part of the proof can be repeated word by word.
We prove therefore (6). We have

c2(c„) =-V£Z A0jJL=s-4


W i=l7=1 n
EE AA-sm-m.
i=\)=l

hence

D\C„) £-4.^+2 £ R(k) E AA+0- '7^


v' /
n k=1 i=l

By Cauchy’s inequality we have

DtDi+k<Sl-
j=i

if we put this into (7) we obtain

C2 9 c2 1
Z)2(C„) < -i- + —^ (8)
n n „
n k=1

Hence by condition b)

K
D\U <-+2K
n

Relation (6) follows and because of condition c) Theorem 3 is herewith


proved.
We return now to the case of pairwise independent random variables
with the same distribution. Let M denote the (finite) expectation of the
ik- It was shown by Khintchine that in this case the mean value of

n k=i

tends in probability to M when n increases, even if D(£k) does not exist.


VII, § 3] GENERALIZATION OF BERNOULLI’S LAW 381

Thus we have
Theorem 4. Let 4 be pairwise independent and identically distributed
random variables and suppose that the expectation

mk) = m (9)
exists. Then for
i "
c„ = — 14
n i

one has
(10)
Proof. Without restricting generality we may assume M = 0. Put

z* — 4 for Kfc I ^ k>


(11)
0 for | 4 | > k.

If F(x) is the distribution function of the random variables 4, we have

+k


n
i mt)=fi I *dF(x). n fc=i '
(12)
—k

Since by assumption
+ co

lim | xdF{x) = J xdF{x) = 0,


k-*-co —k

we can write
(13)
lim — Yj E(£t) = 0-
n-+~oo ^ /c=l

On the other hand

D\ek)<E{0= t x2dF(x),
—k
hence
+n +

JL £ T)2(^) < -i-jx2dF(x)< -j= j |x|</F(x) + J Ul.r/F(x) (14)

-n -jn \x\>Jn

and consequently

lim -4 £ ^(G) = 0-
(15)
„~co » fc-1
382 LAWS OF LARGE NUMBERS [VII, § 3

If we put

«* + £ «).
n *=i fc=,-+i
Theorem 2 implies
lim st £*r = 0.

On the other hand we have

If r is sufficiently large, we have for any d > 0 the inequality

#Q<
Hence for any £ > 0

C„I > <0=P(IC„ I > «,C*, # O + />(IC„I > «,c = £„) (17)
and thus
lim sup P( | Cn | > £) < <5. (18)

But since d > 0 is arbitrary, it follows

lim
n-*- oo
p( I C„ I > e) = 0,
which was to be proved.

Remark. When the random variables t;k are not only pairwise but com¬
pletely independent, Theorem 4 can be proved by the method of character¬
istic functions. We have seen in § 2 that the stochastic convergence of C„ to 0
is equivalent to the convergence of the distribution function Fn(x) of C to
the degenerate distribution function D0{x) of the constant 0 (i.e. to Tfor
-r > 0 and t0 0 for x < 0). Because of Theorem 3 (Ch. VI), § 4, it suffices
thus to show that the characteristic function cpn(t) of C„ tends, for every t,
to the characteristic function of the constant 0, i.e. to 1. If <p(t) is the charac¬
teristic function of the random variables then by assumption y'(0) = 0.

<p(t) - 1
£(t) = (19)
then
lim e(r) = 0. (20)
1-0
VII, § 3] GENERALIZATION OF BERNOULLI’S LAW 383

Since for the characteristic function <pn(t) of we have opn(t) = <P

(20) implies, for every real value of t.

lim cpn(t) = lim


n-*- oo w—► oo
1
/
+ -£
» n
t
|
)
= 1, (21)

which was to be proved.


It was shown by Kolmogorov that in Theorem 4 the assumption of the
existence of the expectation of can be replaced by the weaker postulate of
the existence of the limit

lim j xdF(x)
/!-*- oo —n

and of the relation

lim x[F(— *) + (!- T(x))] = 0


;c-*-oo

and that these conditions are not only sufficient but necessary as well for

1
t.
"

n k=1

to converge in probability to a constant as n —>co. As regards the proof


of this theorem of Kolmogorov, cf. § 14, Exercise 24.
We now give an example to which the law of large numbers does not
apply If the random variables t,k are completely independent and if all
1
have the same Cauchy distribution (with the density function ~ 7 7757 ),
7C(1 + X2)

then
n

Cn =
k=\

also has the same Cauchy distribution.


As a matter of fact the characteristic function of the variables is equal

to fT1'!, thus that of C„ is equal to (e n )" = e_|/|. Evidently, lim st C„ = P


does not hold in this case.
The fact that for independent random variables £* with a common Cauchy
distribution the random variable
384 LAWS OF LARGE NUMBERS [VII, § 4

has the same Cauchy distribution as £k, can be interpreted as follows: When
we take a sample from a population with a Cauchy distribution with den¬

sity function —T— --~r- , we do not obtain more information concern-


7i(l +(* - mi) )
ing the number m from the mean of a sample, however large, than from a
single observation.

§ 4. Bernstein’s improvement of Chebyshev’s inequality

In the preceding paragraph we applied Chebyshev’s inequality to the sum


of a large number of independent random variables in order to prove the
law of large numbers. It was shown by Bernstein that in this case Cheby¬
shev’s inequality can be considerably improved. Put

^f(e) = E{ec^~M)),

where a > 0, and M is the expectation of £. We proved in § 1 the inequality

t + In (e)
P £>M + < e -t it > 0). (1)

t + In (e)
It was already observed that the choice of a minimizing
a
makes the inequality (1) as sharp as possible.
We prove now the following

Lemma. Let glf


£2, be completely independent bounded random
variables with E(£k) = 0, D(£k) = Dk and suppose \ \k\ < K {k = 1, 2,..., n).
Put further

£ = £i + &> + ••• + £„ D(0 = D, and (e) = E(eEi),

where e is an arbitrary positive number. We have then

e2 D2 eK
In (b) < 1 + 0eK

Proof. Since
(2)
n

-#(£) = f] E(e^k),
k=l
VII, § 4] BERNSTEIN’S IMPROVEMENT OF CHEBYSHEV’S INEQUALITY 385

we evaluate first E{e^k). Since

co n zn
or-ik 8 Qk
= 1
n=0 n\

and the are bounded, the series is uniformly convergent and the expecta¬
tion of eE'k can be calculated term by term from the power series. Thus we
obtain
2 r)2 oo p()in
£(«*'*) = 1 + ip. + £ (3)
2 „T3 n\

As | t;k | < K implies the inequality

E(£l)<DlKn~\
we obtain
n-2

n =3 n\
As
1 1 1
<
n\ 6 (« — 3)!

for n > 3, substitution into (3) gives

1 00 (eK) n —2 1
eJT
E(eEik) < 1 + e2 Dl < 1 +s2Z>2
T + „?3 »! T + _6

which leads to
1 eKesK
£(«*) < [I l+e2D2
fc = l T+ 6
Because of 1 + * < ex, we obtain

r e2 D2 eKeeK
(e) = E(eEi) < exp 1 +

which is equivalent to (2).


Since by assumption M = 0, it follows from (1) that

/ £2 D2 eKe E K ~l
t+ 1+
< e -t (4)
386 LAWS OF LARGE NUMBERS [VII, § 4

Put

s/21
£ - (5)
D
Then (4) leads to

(6)

Substitution of X = J2t gives

XK _X?_
XK ^
£ > XD 1+ < e 2
* (7)
Jd
XK
Thus if X is large and - small, we obtain a much sharper inequality than

that of Chebyshev’s.
If we apply the obtained result to — we find that

XK
XD
Z\>XD 1+
~6D
< 2e (8)

In order to transform (8) into a more convenient form, we restrict our-


XK
selves to the case < 1. We have then

XK
e~b < e < 3.

From this, putting \i = X j1 + , we obtain from (8) because of X <

UK
< H < X 1 +
2D,

P( K | > fiD) < 2 exp (9)


HK 2
2 1+
2D

We may free ourselves from the condition E(£k) = 0 by applying (9) to


the random variable

«-*(€) = I [&-£(&)]•
fc=i
VII, § 4] BERNSTEIN’S IMPROVEMENT OF CHEBYSHEVS INEQUALITY 387

Thus we have proved the following theorem:

Theorem 1. If £1; £2, • • •, are completely independent random variables


such that E(gk) = Mk, D(ff) = Dk exist and \ £k — Mk \ < K (k = 1, 2,
. . n), then for £ = + £2 + ... + we have

■ P2
M \ > pD) < 2 exp 2 (10)
2 [i+f|
- 2D
-
In this formula
n "
M = Y, Mk and j YDt,
k=1 k=1

D
while p is a positive number such that p <
1c
Let us apply now this result to the case where the £k have a common
distribution. Let Mx be the expectation of the £k and D\ their variance. Then
n
the expectation of the sum £ = £ is equal to nMx and its variance to D'\n.
k= 1
£>, /-
It follows from (10) for p < yjn that
K

P{ | i — nMx | > pDx N/n) < 2 exp


pK (ID
2 1+
2 Dx yjn)
If the £k(k — 1, 2,...,«) are the indicators of an event A in a sequence of n
experiments, we get from (11)

Theorem 2. Let A be one of the possible results of an experiment, suppose


p — P(A) > 0 and put q = 1 — p. Let the random variable £„ denote the
relative frequency of A in an experiment consisting of n independent trials.
Then for 0 < e < pq we have
ne~
P{ I Ci, - PI ^ e) ^ 2 exp 2 (12)
2 pq 1 +
2pq) -
I n
Proof. (12) follows from (11) when we put p = e . Chebyshev's
Pd
inequality [cf. § 2, Formula (6)] leads only to

PW„-p\>i)<f^.
388 LAWS OF LARGE NUMBERS [VII, § 4

1 1
Thus e.g. for p — q £ Chebyshev’s inequality guarantees
20 ’

the validity of
1 1
P >- <- (13)
20 100

only for n > 10 000, while by using (12) we find that (13) holds already for
n > 1283.

If we take e = —-, we find, applying Chebyshev’s inequality, that

1
P > <
50 loo
is valid for n > 62 500; while applying (12) we see that it is valid already
for n >7164.
In these examples s > 0 and 5 > 0 were given and we wanted to estimate
the least number n0 = n0(s, 5) such that for n > n0

P(|£„-p|>e)<<5

holds. This question is answered in general by the following theorem which


follows from Theorem 2:

Theorem 3. We perform n independent experiments with possible outcomes


A and A. Suppose p = P(A) > 0. Let £„ be the relative frequency of the event
A in the sequence of experiments, and suppose <5 > 0, 0 < e < p(l — p) and

9 in ~
-gfi2-=«o (e, <5).
Then
P(\£n-P\>s)<S. (14)

Proof. It follows from Theorem 2 that (14) is fulfilled for

\ 2
2 pq 1 + In —
2 pq d
n>

| £
Since 2pq < —— and 1 -\- — ~7~ ^*or 0 < e < pq, 04) is always
2 2pq

fulfilled for n > —In ~


8e2 <5
VII, § 5] THE BOREL —CANTELU LEMMA 389

Theorem 2 ;an be rephrased in the following manner: If we put x =

= e / n , we have for 0 < x < yjnpq


V pq

pq < 2 exp (15)


Cn ~PI ^ X x
V n
1+
2 Jnpq

the sharpness of the inequality can be appreciated by comparing it to the


Moivre-Laplace theorem (Ch. Ill, § 16) concerning the convergence of
the binomial distribution to the normal distribution.
Finally, let us make some general remarks about the laws of large num¬
bers. In connection with many random mass phenomena, there are quanti¬
ties which, though depending on chance, are practically constant. As an
example, consider the pressure of a gas enclosed into a vessel. This pressure
is the result of the impacts of molecules on the wall of the vessel. The number
of the impacts as well as the velocity of the molecules depends on chance;
nevertheless the resulting pressure is practically constant, provided that the
number of the molecules is very large. Such phenomena can be explained
as instances of the law of large numbers.
According to the law of large numbers, by calculating the probability of
an event or the expectation of a random variable, one can obtain informa¬
tion about the relative frequency, or of the arithmetic mean, of the results
if a large number of independent experiments (observations) were per¬
formed. This is the reason why the law of large numbers is basic for so
many practical applications of probability theory.

§ 5. The Borel-Cantelli lemma

We now prove the following useful and simple lemma:

Lemma A. If {A„} (n = 1, 2,. . .) is any infinite sequence of events such


that

I P(T„)< +co, (1)


n=l

then with probability 1 at most finitely many events An occur simultaneously.


Or to put it otherwise, if we put
00 00

(2)
390 LAWS OF LARGE NUMBERS rvn, § 5

then
P(AJ = 0. (3)

Remark. The right side of (2) is denoted in set theory by

lim sup An.


n-*- + go
Proof. We have
00 oo

P(4)<?(^,)<^ P(Ak) (4)


k=n k-n

for every n. Because of (1), the right hand side of (4) tends to zero as n -* + oo,
hence we have (3).
If the An are completely independent, (1) is not only sufficient, but also
necessary in order that with probability 1 at most finitely many of the An
00

should occur. If £ P(A„) = + oo, thenP(Af) is not only positive but equal
n=1
to 1. Thus we have

Lemma B. If A,, A2,. . ., A„,... are completely independent events with


00

X P(A„) = + oo, then


n= 1

P(AJ = 1. (5)
Proof. Evidently
oo oo

Am ~Yj 11 Ak , (6)
n=1 k=n

it suffices thus to show that

P(f[Ak) = 0 for «=1,2,...; (7)


k=n

(7) and (6) imply P(AJ = 0, hence (5). But for N > n
00 _ N N N

F{RAn *k)=n (i - < exp [- j (8)


oo

The series £ P(Ak) being divergent, the right hand side of (8) tends to zero

as N 00 • Lemma B is thus proved. Lemmas A and B together are called


the Borel-Cantelli lemma.
The hypothesis of Lemma B concerning the complete independence of
the An-s can be replaced by the weaker condition that Al9 A2, A
be pairwise independent. Even somewhat more is true, namely
VII, § 5] THE BOREL-CANTELLI LEMMA 391

Lemma C. If Ai, A2,. .An, . . . are arbitrary events, fulfilling the condi¬
tions

f;P04„)=+oo (9)
n=1

and

ti p(Ak A,)
lim inf - =1, (10)
(£p(Ak))2
k=l

then (5) holds; thus there occur with probability 1 infinitely many of the
events A„.

Proof. The proof is based on Chebyshev’s inequality. Let a„(n = 1,


2, . . .) denote the indicator of the event An, i.e. put

I I if A„ occurs,
0 otherwise.

Then E(ocn) = P(A„) and by Chebyshev’s inequality we have

D~ ( afc)

p(\i«k-ir(.A)\>£ p( ~— (ID
‘-1 j=i ‘-1 t'tiw)'
i
Now E(otkcc,) = P(AkAi), hence

d! (i <a>=t i p(AkAi) - (i pwf■


(12)
k=1 k = ll=l k=l

Thus it follows from (10) that

lim inf P E ak ~ Z P(Ak)


k=l k=1
> t i
1 k=i
= 0. (13)

If we put
n
tl
1
dn = P Z P(Ak) (14)
I «fe<
,k = l 2 k=1
we have
lim inf d„ = 0. (15)
«-*■ GO
392 LAWS OF LARGE NUMBERS [VII, § 6

It follows from this that one can choose an infinite subsequence of posi¬
tive integers nx < n2 < . . . < ns < ..., such that
00

X dnj < + o° ; (16)


7=1

hence by Lemma A we have with probability 1

ni j n>
X ^ TX P(Ak),
k=1 z k=1
00

except for a finite number of values of j. Thus by (9) the series £ ak is di-
*=i
vergent with probability 1, which proves our statement.
The just proved lemmas will serve us well in proofs dealing with improve¬
ments of the law of large numbers.

§ 6. Kolmogorov’s inequality

In order to establish another group of laws of large numbers we need an


improvement of Chebyshev’s inequality due to A. N. Kolmogorov.

Theorem 1. Let the random variables rjl5 rj2, ...,rjn be completely inde¬
pendent, put further E(rjk) — Mk, D(r}f) = Dk. If e is an arbitrary positive
number, we have

k X Dl
P( max | X (nj - Mj) | > e) < __ . (1)
1 <,k<,n j=1 £“

Proof. Put r,* = rjk - Mk, ^ ^ r,f (k = 1,2,.. ., n). Let further Ak

denote the event that Ck is the first among the random variables A, Ca
Cn which is not less than e, i.e.

I Cl | < e,..., I C/c-11 < £ and | £k \ > s.


The events Ak (k = 1,2,.. n) exclude each other but obviously do not
torm a complete system of events. Let A0 denote the event | (k \ < e for
k=\, 2 . ., n, then the events A0, Alt . . ., An form already a complete
system of events. The probability figuring in (1) can be written in the form

/’(max | X /?*[>£) = f P(Ak). (2)


j=1 k=i
VII, § 6-i KOLMOGORoV’S INEQUALITY 393

On the other hand we have

t Dt =(O = t pWk) E(i\ \At). (3)


k =1 k=0

The right hand side of (3) becomes smaller if the term k — 0 is omitted from
the summation. Hence
rt n
Y.Ol>Y. P(Ak)E<ll\At). (4)
k=1 k=1

Let us now consider the conditional expectation E(£% | Ak). From £„ = (,k +
n

+ Z *1j it follows that


j=k+l

e=g+ z *?*2+2 z <;*>£ +2 s


y=/r+l j=k+l k<i<j<,n
* *
it nj- (5)

In the definition of the event AK only occur the random variables?/*,..t]k-


If <xk denotes the indicator of Ak, then it follows that ak does not depend
on rj* (J > k). Hence we have for k < j < n

E(Ck n* <*k) E(tfi)E(Ck*d


mri*\Ak) = (6)
P( Ak) P{Ak).
Similarly, we obtain that

Efnt nj I Ak) = o (7)

because of the independence of rjf, rj*, ak for / > i > k.


(5), (6) and (7) lead to

E(g\Ak) = E(&\Ak)+ f E{n**\Ak)>E(&\AK). (8)


J=k+1

Now, by the definition of the events Ak, we have | C& | > s whenever Ak
occurs; hence E{C,k \ Ak) > e2 and thus by (8)

E(ejAk)>s‘.

Because of (4) it follows that

t Dl > e2 £ P(Ak),
k=1 fc = l

which proves (1).


394 LAWS OF LARGE NUMBERS [VII, § 7

§ 7. The strong law of large numbers

A sequence qn of random variables is said to converge almost surely


(with probability 1) to 0 if

P( lim tjn — 0) = 1.
n-+ao

Almost sure convergence is a stronger condition than convergence in proba¬


bility. Indeed we have the following

Lemma 1. The condition

P( lim »/„ = 0) = 1 (1)


n-> co

is equivalent to the condition

lim st (sup | qm | ) = 0. (2)


7i-*co m>n

Consequently, (1) implies

lim st qn = 0. (3)
n-+ oo

Proof. We show first that (2) follows from (1). Let e > 0; let A„(e) denote
the event sup \ qm\> s and C the event lim rjn = 0; put further Bn(e) —
tn^>n n~*- co
00

= CAn(e). Then B„+1(e) c B„(s) and the set Bn(e) is obviously empty.
11 = 1

It follows from this (cf. Ch. II, § 7, Theorem 3) that

lim P(Bn (e)) = 0.


n-*- oo

Now because of P(C) = 1 we have

P(BM) = P(A„(e)).

Thus (2) holds and, because of (3),

11/„ | < sup | qm |


m^>n

holds as well. Conversely, assume that (2) holds. Let D(e) denote the event
lim sup | rjn \ > e (e > 0, arbitrary).
VII, § 7] THE STRONG LAW OF LARGE NUMBERS 395

Since obviously D(e) c A„(e) for n = 1,2,..., (2) implies P(£>(e)) = 0.

AsCc Yj D (t) 5 we get P{C) — 0, hence (1). Herewith the lemma is

proved.

Remark. Statement (1) concerning the almost sure convergence of r]k


to 0 can be rephrased: the probability of the set of the elementary events co
for which lim rjn(co) — 0 does not hold is equal to zero. In measure theory,
n-*- oo

this is expressed by saying that the variables converge almost every¬


where to 0. Convergence in probability is called in measure theory conver¬
gence in measure.
In order to emphasize the particular character of the strong law of large
numbers we restrict ourselves first to a simple case.

Theorem 1. Let i;1, £2,. . ., . .. be (completely) independent and iden¬


tically distributed random variables with finite expectation = M and
variance Z)'2(£„) = D~. Put

Then
/’(lim — M) = 1. (4)

Remark. According to Lemma 1, (4) is a stronger statement than the law


of large numbers: lim st £„ = M. Therefore Theorem 1 is called the strong
n—oo

law of large numbers.


In order to see better the meaning of (4), consider the case where the
are indicators of an event A in a sequence of experiments. C„ is then the
relative frequency of A. Put p = P(A). In this case the only thing the law
of large numbers says is lim st £„ = p, while the strong law of large numbers
n-+- oo
tells us that lim £„ = p with probability 1. That is, according to Lemma 1,
n-*- go
all relations
I Cn + k ~ P \ < e (k =1,2,...)

are simultaneously fulfilled with a probability > 1 — <5 for e > 0 and 5 > 0
however small, if the index n is larger than a number n0 depending on e
and <5.
Proof of Theorem 1. We consider

An — sup | — Mj and Au>b = max | C„ — M |.


n>N a<n<b
396 LAWS OF LARGE NUMBERS [VII, § 7

If the inequality AN > e is fulfilled for an N such that 2s < N < 2S+1, then
A2i 2i+ i > 6 is fulfilled for at least one l > s. Hence
CO

P(An > e) < Y P(A 2*,2/+i ^ e) for 2s <N< 2S+1. (5)

On the other hand we have

P(A2,2i+i > e) < P{ max k\ £k - M\> e • 2'). (6)


l^/c<2' + l
If we apply to the random variables £k (k = 1,2,.. ., 2/+1 — 1) Kolmogo
rov’s inequality (proved in the preceding section), we obtain

2/+1 D2 2D2
P( max k | Ck ~ M \ > e2‘) < 2- -- (7)
1^A:<2'+1 £ *•* £2-2'

If we substitute this into (6) and consider (5), we have

x 2D2 ” 1 4D2
^ - e) - ^2 Ys ~ ^272? • (8)
l =S

N
If N -> oo, it follows from (6) and from 2s > — that
2
lim P(d„ > e) = 0.
N-+CO

Hence by Lemma 1 our theorem is proved.


The following two generalizations of the strong law of large numbers are
due to A. N. Kolmogorov.

Theorem 2. Let £1} £2, be a sequence of (completely) mc/e-


pendent random variables for which E(£k) = Mk and D(£k) = Dk exist.
CO 2)2

Assume further that the series £ —converges. If we put


k=1 k

C„ = 4- i (& - AQ,
n k=l
then
P(lim C„ = 0)= 1. (9)
n-»co

Theorem 3. Let £i, <^2> • •b,n,. . . be (completely) independent identically


distributed random variables. The random variable

1 "
= —If*
n fe=I
VII, § 7] THE STRONG LAW OF LARGE NUMBERS 397

converges with probability 1 to a constant C iff the expectation M = E(£k)


exists; in this case C = M and, consequently,

P( lim C„ — M) = 1.
n-+ oo

The hypothesis of the existence of the variance in Theorem 1 is therefore


superfluous.

Proof of Theorem 2. Put, as in the proof of Theorem 1,

An= sup |C* | and ABtb= max |C*|*


k>N a<,k<b

It follows from AN > e, 2s < N < 2S+1 that d2;)2;+i ^ £ for at least one
l > s; hence
oo
P(AN>e)<YJP(M 2m>£). (10)
l=S

By application of Kolmogorov’s inequality we obtain

1
P(d2,j2;+i > e) <P( max k\£k\>e • 2l)<
22/e2
E
k=1
Dl
l<,k<2,+ i
Hence by (10),
1 oo i 21+1-1

o i=s £ k=1

By interchanging the order of summation we find


2s+l—l 16
1 Dl
P(An > e)<
3 • 4s'1 e2
e
E
^
k—1
+' E
3s2 kj%+i k?
(ID

Now it can be shown that the right hand side of inequality (11) tends to
• £ Dl
zero as n increases (hence as N increases too) provided that the series 2^
k = l /v
is convergent. To show this we need the following lemma due to L. Kro-
necker.
00
Lemma 2. If the series E ak is convergent and if qnis an increasing sequence
k=\
of positive numbers, tending to + co for n-> Co, then

1
lim -- E akHk = 0. (12)
«-► oo din k=1
398 LAWS OF LARGE NUMBERS [VII, § 7

Proof. Put rn = £ ak and choose a number «0 = n0 (e) (a is an arbitrary

small positive number) large enough in order that n > n0 should imply j rn j <

— . It is easy to see that (with q0 = 0)

1 " 1 "
— I ak <lk = — E rk (qk ~ qk-1) ” rn +i- (13)
1in k=l Qn k=1

If we put max [ rk | = A, we have for n > n0

1 ^ j Aq 2e
I akqk\<—^ + —
<ln k=l qn 3

Choose now nx > n0 such that —— < -—. Then for n >
qni 3

1 "
— E < 8,
cln k = 1

which proves Lemma 2.


00 Dl
This lemma and the convergence of the series £ —~ imply immediately
k=i

E^
lim -*=1. = 0;

hence the right hand side of (11) tends to 0 as iV -> oo. This and Lemma 1
lead to Theorem 2.

Proof of Theorem 3. We show first that the existence of M = E(qk)


suffices to imply P(lim C„ = M) = 1. For the sake of simplicity assume
n-*- oo
M = 0. Let the random variables £* (k = 1, 2,. . .) be defined by

e* = f I^ k,
[ 0 otherwise

and the random variables £** by

£**
Sfc
_ p _ Cat
— Qic
p* •
VII, § 7] THE STRONG LAW OF LARGE NUMBERS 399

Put
1 n 1 n 1 «
v 1 v-i e r** _ V k**
9n 2-i 2-i „ 2u •
n k=i n k=i n k■ = '1
Let F(x) denote the distribution function of the Since we have assumed
M = 0, we have

lim — X £(£*) = lira — X £(£**) = 0.


n— oo ^ /c = l n-*oo ^ fc = l

The expectation of the random variables £/c exists by assumption, hence


+ °0 I

the integral j' x | JF(x) is convergent. Since C« = C* + C**> it suffices


— OO

evidently to prove that

P( lim £* = 0) = 1, (14a)

P(limCr =0)= 1. (14b)

Theorem 2 applies to the random variables ££, since it is easy to show that
00 D\Zt)
Z>(£*) exists and that X < + * oo. Indeed
k=1 k2

Dxek)<E(.et*)=f
—k
*?dF(x).

Hence, because of

1 1 1
k 2 “ Jr* Kki -1) r
we have
-7 + 1

X < X f x2dF(x) + f x2dF(x) iX*


k=1 k 7;=i
=1 J J k=j
7-1 -7

+ °o
<2 f | x | dF(x).
— 00
Hence (14a) must hold. Now consider the random variables £**. We have

F(4*V0)= f dF(x), (15)


1*1>k
hence
+ 00
X P{$** 0)< j |x|</F(x). (16)
k=l
400 LAWS OF LARGE NUMBERS [VII, § 8

Lemma A of § 5 permits to state that with probability 1 £** = 0 holds,


for all but a finite number of values of k. This implies (14b) which proves
the first half of Theorem 3.
Now we show that the condition is necessary. Assume the validity of
P(lim C« = C) = 1, where C is a constant. Then with probability 1
n-*- oo

n — 1
lim lim Zn~ = C - C = 0.
n-*- oo n n-+oo n

Thus in
> 1 holds with probability 1 only for a finite number of values

of n. Since the are independent, it follows from Lemma B of § 5 that the

series £ P in in\ > 1


> 1 is convergent. Now since P - J dF(x)
n= 1
\x\ >77
the series
77+1

Z J dF(x) = Y, n j dF(x) + dF(x)


77 = 1
\x\>n

is convergent. But we have


+ 00 n +1
00 / r r i
x | dF(x) <1+^/7 JT(x) + r/T(x)!
J «=1 J J )
— 00 n -(« + l)

+ 00

hence j | x \ dF(x) exists and M = E(£k) exists, too. Hence by the first part

of the theorem C = M and thus our theorem is completely proved.

§ 8. The fundamental theorem of mathematical statistics

In the present paragraph we prove a theorem due to Glivenko, which is


of fundamental importance in mathematical statistics.

Theorem 1. Let the random variables &, £a,. . be the elements of a


sample drawn from a population, i.e. let f, £2,. . ., ^ be identically distrib¬
uted independent random variables with common distribution function F(x).
Let Ln(x) denote the empirical distribution function of the sample i e let
NFn(x) be the number of the indices k for which i;k < x. Put further

An= sup | FN(x) - F(x) |.


— 00 <x< + 00
VII, § 8] GLIVEHKO’S THEOREM 401

Then
P(lim An= 0) = 1. (1)
Af-oo

Remark. Glivenko’s theorem states that the empirical distribution func¬


tion Fn(x) of a sample of N elements converges, with probability 1, as TV -► oo
uniformly in x (- oo < x < -foo)to the distribution function F(x) of the
population from which the sample was drawn.

Proof. If xM k (M a positive integer; k = 1, 2,. . ., M) is the least number


k
x fulfilling F(x) < — < F(x + 0), we have

1
An < max (A$, A$) + (2)
M ’
where
= max ! Fn (xM>fc) - F(xM<k) \,
1 <.k^M

Atf — max | Fn (xM k + 0) — F(xMk + 0) |.


1 <,k^M

By the strong law of large numbers we have for every fixed x

P( lim Fn (x) = F(x)) = 1


N-+00

and
P( lim Fn(x + 0) = F(x + 0)) = 1.
N-*- co

Thus it follows from (2) that

lim sup An> —y = 0 (3)

for any natural number M, which proves the theorem.


This theorem of prime importance states that a large enough sample gives
an almost exact information concerning the distribution of the population.
Glivenko’s theorem can also be rephrased in the following manner: If e
and 5 are two given positive numbers however small, then there exists an
N0 such that
P( sup An < e) > 1 — S. (4)

This particular form shows clearly that the strong law of large numbers
and Glivenko’s theorem have a definite meaning even for the practical
402 LAWS OF LARGE NUMBERS [VII, § 9

case when only finitely many observations are made. In fact, always when
a large sample is studied, this theorem is implicitly used; hence it has the
right to be called the fundamental theorem of mathematical statistics.
On the other hand it must be noticed that Glivenko’s theorem does not
give any information how N0 figuring in (4) depends on £ and 5. This ques¬
tion will be answered by a theorem of Kolmogorov dealt with later on
(cf. Ch. VII, § 10).

§ 9. The law of the iterated logarithm

The strong law of large numbers can be still further improved. To get
acquainted with the methods needed for this, we give here first a new proof
of Theorem 1 of § 7 concerning bounded random variables. The proof rests
upon the Borel-Cantelli Lemma A.
Let £l5 £2,. . ., ... be independent and identically distributed bounded
random variables, and suppose \£n\ < K. Suppose further E(£n) = 0,
1 "
and put D = D(£n) and £„ = — Y £&• Then, according to Theorem 1 of
n k=1
§ 4, we have the inequality

P(\C„\>e)<2q" (1)
D2 a2
for 0 < e < ——■-, where q — exp 2 . From (1) follows
K f eK
2D2 1 +
. ID2 /
the convergence of the series

(2)
n— 1

and because of Lemma A of § 5, we find that the inequality | C„ | < e is


fulfilled with probability 1 for every sufficiently large n. This implies the
strong law of large numbers. Thus we obtained another proof of this law.
Notice, however, that a supplementary hypothesis was needed for this proof:
the random variables were supposed to be bounded. This hypothesis
allows to prove a far more precise theorem, called the law of the iterated
logarithm:

Theorem 1. Let Ci, • • •> • • • be uniformly bounded independent


random variables with common expectation M = and standard devia¬
tion D = £>(£„). Put
_ £1 + £2 + — nM
D (3)
VII, § 9] THE LAW OF THE ITERATED LOGARITHM 403

Then

(4)

In order to prove this we first have to prove a lemma similar to Kolmogo¬


rov’s inequality of § 6.

Lemma 1. Let fl5 be independent random variables with


common expectation E(£k) = M and variance D2(£k) = D2. Put

Ck = £i + £2 + • ■ • + £k — kM.
Then

(5)

Proof. Let Ak (k — 1,2,...,«) denote the event that the inequalities

Ci<x, £2<x,..., Ck-1<*>

are fulfilled. Let j?*. denote the event — Cl- > — 2 and yl the event
£n > x — 2 ,/nD. If both ^4^ and Bk occur, A occurs as well. The events
Ak (k = 1,2,...,/?) mutually exclude each other, thus the same holds for
the events AkBk; the events Ak and Bk are evidently independent since Bk
depends only on the random variables £fc+1,. . ., b,n and Ak depends only
on h,. . ., £k. Since A1B1 + . . . + AnBn £ A, the independence of Ak
and Bk implies

n n n
Y.P(Ak)P(Bk)=Y.p(AtBt) = P(YiAkBk)<,P(A). (6)
k—\ k=1 k=1

On the other hand:

\-P{Bk)<P(\t;n-Zk\>2D Jn), (7)

hence, by Chebyshev’s inequality,

(8)

Thus

(9)
404 LAWS OF LARGE NUMBERS [Vir, § 9

(6) and (9) lead to

X t P(A) = ~P( max U > x)<P(A) - P(C„ > x-lD^n),


* k=l * l<.k<,»

which was to be proved.


Now we proceed to the proof of the law of the iterated logarithm. We
show first that with probability 1 only finitely many of the events r]n >
> (1 + e) yf2n\n\n.n occur if e > 0 is aibitrarily small. Let y be yj\ + e
and let Nk be the least positive integer larger than yk. Let Ak(e) denote the
event
max r]n>{\ + E)s/2Nk\n\nNk. (10)
N‘k<.n<Nk+i

In order to show that with probability 1 at most finitely many of the events
Ak(e) occur, it suffices, in view of the Borel-Cantelli lemma, to show that
00

the series £ P(Ak (e)) is convergent for every e > 0. Now


k=l

P(M£)) ^ P( max r\n > (1 + e) JiNk lnln NX (11)

hence, because of Lemma 1,

P(Ak(s)) < ~P(rjNk+i > (1 + e)j2Nk lnlnNk - 2^N~k+,). (12)


If we put

k=kk = V +sf
V Nk +1
we conclude from (12) by applying Theorem 1 of § 4 that the following
relation holds if k is large enough:
9
kk
P(Ak 00) ^ y exp (13)
2 fl + ]
JJ
Since lim kk
k-+co 72 In InNk and lim = 0
there exists a number k0 depending only on e such that for k > k^we have

kl
^ (1 + e) In In Nk > (1 + £)ln (k In y).
HkK
2 1+
VII, § 9] THE LAW OF THE ITERATED LOGARITHM 405

Hence for k > k0

P(Ak(e)) < 4“ = n-S—,1+,-• (14)


3 3U-ln(l+e)] *>+■
00

The series ^ T(TA.(e)) is thus convergent for every positive £ and according
k=1
to the Borel-Cantelli lemma,

hn (15a)
lim sup_ < 1 = 1.
«-<» yj2n In In n

It can be shown in a similar manner that

lim inf— . > - 1 = 1. (15b)


«+oo yjln In In n

This proves Theorem 1.


It should be noticed that (4) cannot be improved; in fact, it can be shown
that

_ Jn_ (16a)
P lim sup
/!-*■ 00 yf 2n In In n

and

P (lim inf , ^ „= = -1=1. (16b)


' a> yj2n In Inn I

In particular, if the ik are indicators of an event A with probability P(A) =


= p in a sequence of independent experiments, the conditions of Theorem 1
are fulfilled. Thus we have

Theorem 2. If („ represents the relative frequency of an event A in a se¬


quence of independent experiments and if P(A) = p (0 < p < 1, q = 1 ~ P)>
then

( Cn~P I (17)
P lim sup < 1 = 1.
n-*- co 2pq In In n
n
406 LAWS OF LARGE NUMBERS [VII, § 10

As we have seen, there even holds the more precise relation

/ Ch~P
lim sup
2pq In In n
n
/ <s
\
P lim inf- . <sn ~ P = -1 (18)
j n-»oo / 2/R/ln In n
\ V /

For the proof of (18) cf. § 16, Exercise 23.

§ 10. Sequences of mixing sets

Let [Q, u€,P] be a Kolmogorov probability space and An (n =


= 1,2,. . .) a sequence of sets such that for every the"relation

lim P(An B) = dP(B)


(1)
holds, where dis a number, not depending on B, such that 0 < d < 1. Then
the sequence {A,,} is said to be mixing; d is called the density of the sequence
{ A„}.

Theorem 1. If A0 - Q, A„ 6 ^ (n = 1,2,...), 0 < P(A„) < 1 and

lim P(An | Ak) = d (k = 0,1,...), (2)


n-*- oo v '

then {A„} is mixing.

Remark. Evidently condition (2) is also necessary, as it is a particular


case of (1) for B = Ak (k = 0, 1,...).

Proof We use elements of the theory of Hilbert spaces. Let be the set
of all random variables £ for which E{?) exists. Put (& q) = Etfq) and 11 11 =

7 ^2 - ^ is then a Hilbert space. Let ct„ denote the indicator of the event

1 for co£An
an = V-n ifO) -
0 for co£A„ (» = 0, 1,...).

If/J is the indicator of B and if cc„ - d = y„ we can write (1) in the form

hm (fi, y„) = 0,
(3)
VII, § 10] SEQUENCES OF MIXING SETS 407

while (2) is equivalent to

lim (yk, yn) = 0 (& = 0,1,...). (4)


n-*- oo

We show that (4) implies (3) for every p £ 3tf (hence not merely for the p
which are indicators of sets).
Let denote the set of those elements of ^ which are linear combinations
of the yn or limits of such elements, in the sense of strong convergence,
that is 8n -* 8 means that lim || 8n - 8 || = 0. In other words, 3t*x is the least

subspace of 3tf containing the elements y„ (n = 0,1 . . .). Obviously, (4)


implies (3) when P is a finite linear combination of the y„, and also when
n

B P3tf,. In fact, in the latter case there exists for every a >0ay = £ ckyk
k=i
with I] ft — y || < a. Because of Schwarz’ inequality and of

II yH II = £((«„ - df) = P(An) (1 - df + (1 - P(Anj) d> < 1

we have

I (fi,yn) - (y»yn)\ = \(P~ y>y„)l<ll/?-ylHlyJI^-

By (4) lim (y, y„) = 0, thus lim sup | (P, y„) \ < a, since for a > 0 there
n-*-cc «-*-oo _
can be chosen any positive number however small. (3) is theretore proved
for every p 0 3f x.
Let now 3T2 be the set of elements 8 of 3? such that (8, y„) = 0 for
n = 0,1, ... . 3t2 is then the subspace of 3? orthogonal to 3f For
P 0 3ft? 2 (3) is trivial. Now according to a well-known theorem of the theory
of Hilbert spaces1 P £ 3f? can be written in the form P = Pi + Pi, where
j5x £ 3ft? x and p2 £ 3ft? 2. Furthermore,

(P, y„) = 0?i, y„) + (fi* yJ = (Pi>y n)

hence (3) holds for every P £3?. Theorem 1 is thus proved. As an applica¬
tion we prove now a theorem which shows new aspects of the laws ol large
numbers.

Theorem 2. Let fl5 f2,. . . • . be independent random variables whose

arithmetic mean £„ = • ^ + " ' +~ tends in Probability to a random


n
variable £. Then £ is equal with probability 1 to a constant.

1 Cf. e.g. B. Sz.-Nagy [1] or F. Riesz and B. Sz.-Nagy [1].


408 LAWS OF LARGE NUMBERS [VII, § 10

Proof. Choose two numbers a and b such that P(a < £ < b) > 0 and
a, b are two points of continuity of the distribution function of £. Then
P(a — C« < b) > 0 for n > n0. Let A0 = Q and let Ak denote the event
a — C«0 +k+i < b {k — 1,2,...). For k > 1 we have

P(An | Ak) =

(n0 + k+ 1) C„„+fc+i + Z„a + k+2 + ... + 40+„+i


= P \a< -< b\Akj <
no F n + 1

(n0 + k + 1 )b = «„ + £+ 2 + • • • + £n„ + n + l (Po + k + 1) a


<P a-— < <b~
Wq + /7 + 1 n0 + n+\ Hq F w+ 1

< P \a — e < ^n0 + k + 2 F • • • T ^n0 + n + i


<b+s
F ft + 1

for any e > 0 whenever n is large enough. Similarly, for sufficiently large n,

^n0 + k + 2 + • • • + £„t n 4-1


P(A„ | Ak) > P a + e < +
< b — s
Uq + n + 1

^n0-h Ar + 2 "F • • • “F ^fi


Now by assumption £„ Jk £, hence also P-* £. Now for
e > 0 there can be chosen any positive number however small thus
Theorem 1 of § 2 leads to

lim P(An | Ak) = P(ci < £ < b\


n-*- co

since a and b are points of continuity of the distribution function of £. Thus


by Theorem 1 the sequence of events {An} is a mixing sequence with the
density d = P(a < £ < b).
Now let B be the event a < £ < b. We have

lim P(A„ | B) = d = P(B).


n-*- oo

Variables ^ tend in Probability to £, also when taken on the


probability space [Q, , p(p
| 5)]. Thus lim P(A,, | B) = P(B | B) 1 =
by Theorem 1 of §2. Consequently, P(B) ="1 “since P(a < £ < b) > 0,
• f ~ But tbis means that £ is a constant with probability 1;
th * lf m dlSt”budon Unction of £ would increase at more than one point,
1 and hi
1 i°Unf a Pair ® < b
Wkh °<P(^C<b)< l such that
a and b are points of continuity of the distribution function of £ Theorem 2
eithertend It arithmetlC means independent random variables
either tend in probability to a constant or do not converge at all.
VII, § 11] STABLE SEQUENCES OF EVENTS 409

§ 11. Stable sequences of events

In the present section we deal with a generalization of the notion of mixing


sequences of events, introduced in the preceding section. Let [Q, P]
be a Kolmogorov probability space; a sequence {A,,} of events (A„ ;
n = 1,2,...) such that for any event B £ there exists the limit

lim P(AnB) = Q(B) (1)


JI-+- CO

will be called a stable sequence of events} We shall prove first that the set
function Q(B) on the right hand side of (1) is always a measure, i.e. we prove

Theorem 1. If {An} is a stable sequence of events, the set function Q(B)


defined by (1) is a measure which is moreover absolutely continuous with
respect to the probability measure P.

Proof. Obviously, Q(B) is a nonnegative and additive set function and


Q(B) < P(B), hence if P(B) = 0, then 0(B) — 0. From this the assertion
of our theorem follows directly, by Theorem 3 of Chapter II, § 7.
According to the Radon-Nikodym theorem the derivative

dQ
— tx(co) (2)
~dP

exists, furthermore, for every event B £ one has

0(B) — | adP■ (3)


B

It follows directly from the inequality Q(B) < P(B) that

0 < a(co) < 1 (4)


with probability 1. The random variable a = a(a>) is called the (local) den¬
sity of the stable sequence of events {A„}.
If a is constant almost everywhere, a = d(0 < d < 1), then clearly the
stable sequence of events {A„} is mixing and has density d. On the other
hand, if a is not constant, the stable sequence of events {An} cannot be a
mixing sequence. Hence the notion of the stable sequence of events is a
generalization of that of a mixing sequence of events.
Let us now consider an example of a stable but not mixing sequence of
events. Let [Q, P] be a Kolmogorov probability space, let £>x £
0 < P(Qfi < 1 and Q2 = Qx. Consider further the probability spaces
[£>, Pf[ and [Q, P2], where for every A PX(A) = P(A | Of
and P2(A) = P(A | £22).

1 Cf. A. Renyi [36],


410 LAWS OF LARGE NUMBERS [VII, § 11

Let A'n be a mixing sequence of events in the probability space [Q, Px]
with density dx and A"n a mixing sequence of events in the probability space
[f3, ^€,P2] with density d2(0 < dx < d2 < 1); put An = A'nQ1 + A"Q2-
Then clearly we have for every event B £ ^

P{An B) = P(fl0 Px (A; B) + P(Q2) P2(A; B),


hence
lim P(An B) — Q(B),
72-> 00

where
g(5) = d1P(BC21) + d2 P{BQ2).

Let the random variable a = a(a>) be defined in the following manner:

if to £ Qx,
if co £ i22,
then
2(5) = f ocdP.
i
Thus the sequence of events {A,.} is stable but not mixing, since its density
is not constant but assumes two distinct values with positive probabilities.
Clearly, there can be constructed in a similar manner stable sequences of
events with densities having an arbitrary prescribed discrete distribution.
Now we shall prove the generalization of Theorem 1 of § 10 concerning
stable sequences of events.

Theorem 2. If An £ (n — 1,2,.. .), Ax = Q and if the limits

lim P(An Ak) - Qk (k = 1,2,...)


72-> 00

exist, then the sequence of the events {An} is a stable sequence of events.

Proof. The proof of this theorem corresponds nearly step-by-step to


that of Theorem 1 in § 10, hence it will be only sketched. Let denote the
Hilbert space of all random variables with finite standard deviation on the
probability space [Q, 5]; scalar product and norm are (as usual) defined

by (£> fi) = E(fq) and || £ || = (£, £)2, respectively. Let a„ be the indicator
of the event An. Let 3?\ be the subspace of the Hilbert space ^spanned by the
elements oq, a2,.. ., oq, . . .; thus Af*x consists of the finite linear combinations
(with real coefficients) of the elements of the sequence {a*} and of the
(strong) limits of these elements. It is easy to see that if £ £ jrx, then the limit

Jim (£, «„) = L(0 (5)


VII, § II] STABLE SEQUENCES OF EVENTS 411

exists; in fact if £ = cqoq + . . . + ckctk, then

lim (£, a„) = c1Q1 + ... + ck Ok,


n~* oo

while if £ is the limit of linear combinations of the a.k, the limit (5) exists
again, since

| ({, «„) - ({', otj | = | ({ - a.) | S || ( - t' II-

Now in the same way as was done in Section 10 we decompose £ £


as £ — + £2, where t and (£2, ak) = 0 (k = 1,2,...); hence the
limit (5) exists for any £ £ and our theorem is proved.
Clearly the functionalL(f) has the following properties: If £ £ q £
furthermore a and b are real constants, then

L{a^ + br\) = aL(^) + bL{fi)


and
mOmiKII,
To put it otherwise, L{£) is a bounded linear functional and thus, according
to the well-known theorem of F. Riesz (cf. F. Riesz, and B. Sz.-Nagy, [1]),
there exists an a such that

L(0 = (L a),
i.e. the sequence a„ converges to a, in the sense of weak convergence in the
Hilbert space. (A sequence of elements a„ of a Hilbert space is said to con¬
verge weakly to a (a £ if for any element £ £

lim (0 a„) = (5, a).


/!—► CO

This fact is denoted by a„ -»■ a.)


The preceding discussion contains the proof of the following

Theorem 3. Let a„ denote the indicator of the event An and the Hilbert
space formed by the random variables with finite second moments defined on
the probability space [Q,ts£,P]. The sequence of events {A„} belonging to
the probability space [Q, P] is stable, iff a„ converges weakly in 3? to an
element a £ . If the sequence of events {An} is stable and if a„ -*> a, then a
is the density of the sequence of events {A„}.
A stable sequence of events {A,,} is mixing, iff there exists a number
d(0 < d < 1) such that for every event A

Q(A) = dP(A). (6)


412 LAWS OF LARGE NUMBERS [VII, § 12

It is readily seen from Theorem 1 of § 10 that for A — Q and A = Ak


(k = 1, 2,. . .) it suffices to assume the validity of (6); from this it follows that
{An} is a mixing sequence of events.
Finally, we prove another theorem showing the great generality of the
notion of stable sequences of events.

Theorem 4. From any sequence of events one can select a stable subse¬
quence.

Proof. Theorem 4 is a direct consequence of a well-known theorem of


the Hilbert space theory (cf. F. Riesz, and B. Sz.-Nagy, [1]) stating that
from any sequence of elements with bounded norm of the Hilbert space
a weakly convergent subsequence can be selected.
As an application of these results we shall discuss in the following section
sequences of exchangeable events.

§ 12. Sequences of exchangeable events

The notion of a sequence of exchangeable events was already encountered


(cf. Ch. II, § 12, Exercises 38-43). Here we shall deal only with infinite se¬
quences of exchangeable events. First, let us repeat the definition.
A sequence {An} (n = 1, 2, . . .) of events is said to be exchangeable if
the probability of the joint occurrence of k distinct events chosen arbi¬
trarily from this sequence depends only on k for every positive value of k,
but does not depend on which k events were chosen. Thus there can be given
a sequence of numbers pk such that

(1)
whenever < n2 < . . . < nk.
First we prove the following theorem:

Theorem 1. A sequence of exchangeable events is always stable and is


mixing, iff the events are independent and have the same probability.

hence
lim P(A„) = Pl,
VII, § 12] SEQUENCES OF EXCHANGEABLE EVENTS 413

further by assumption

P(A„ Ak) = p2 if n > k,


hence
lim P(An Ak)=p2 (k = 1,2,.. .)•
At-*- CO

Therefore by Theorem 2 of § 11 {An} is a stable sequence of events. Now


according to § 11 a stable sequence of events {A„} is a mixing sequence,
iff a is a constant, a = d. In this case px = d and p2 = d2, furthermore

Pz = ^2 ^i)> if n > 3,
hence
Pz = lim P(A„ A2 Afi = dP(A2 Ak) = d3,
n-*-oo

and, similarly, it can be seen that for every k > 1 we have

Pk = dk,
i.e.
P(Ani Ant... AJ = P(AJ P(AJ ... P(AJ,

whenever 1 < ^ < n2 < . . . < nk. But this means that the events An are
independent.
Now we prove a theorem due to B. de Finetti.

Theorem 2. If {A„} is a sequence of exchangeable events and the numbers


pk are defined by (1), there can be given on the interval [0, 1 ] a distribution
function F(x) (fulfilling F(0) = 0 and T(1 + 0) = 1) such that

Pk = ]xkdF{x) (k= 1,2,...). (2)


o

Proof. By Theorem 1 the sequence {An} is stable. Let a denote the den¬
sity of this sequence. Then

P\ = p{Ak) = lim P(An) = f adP,


n-+- oo Q

furthermore p2 = P{An Ak) if n > k and thus

p2 = lim P(An Ak) = JadP = j a cckdP (k — 1,2,...),


»-<- oo Ak a

hence by Theorem 3 of § 11

p2 = lim j aa/c dP = J a2 dP.


k-*oo C2 n
414 LAWS OF LARGE NUMBERS [VII, § 12

Similarly,
p3 = P(An Ak A,) if n> k > l,
thus
Ps = lim P(An Ak A,) = j cexk a, dP,
n-<- oo O

hence — by taking the limit first for k -> oo, then for / -» oo, — we obtain
that
P'i = \ot3dP
h

and, in general, for every positive integral value of k we get

Pk = f *k dP. (3)
n

Let now F(x) denote the distribution function of the random variable a.
Thus (cf. Theorem 6 of Ch. IV, § 11)

pk = jjxkdF(x) (lc = 1,2,...),


o

which was to be proved. The proof gives, however, somewhat more than
stated by Theorem 2; in fact we have proved the more general

Theorem 3. Let {A,,} be an arbitrary exchangeable sequence of events',


let the numbers pk be defined by (1). Let oc„ denote the indicator of the event
A„ and a = a(co) the density of the sequence {An}. Then, for any choice of
the positive integers l < kx < k2 <...< kr and for any integer s > 0 we
have
ak2..:<xkrasdP =pr+s. (4)
Si

Remark. Theorem 2 is contained as a particular case in Theorem 3.


In fact, if r = 0, relation (4) reduces to (3). On the other hand, if s = 0,
then (4) reduces to (1); i.e. to the definition of the sequence of numberspr.
Let {An} be an exchangeable sequence of events. Let us compute the
probability of the joint occurrence of k distinct events selected from this
sequence and of the simultaneous nonoccurrence of / events distinct from
the former and from each_ other; i.e. let us compute the probability
tAni.. . A„lc Ami Ami. . . Am(), where nx < n2 < ... < nk, mx < m2<
< .. . < mt and nt ^ mj(i = 1,2,. .., k; j = 1, 2,. . ., J). It can be seen
by an easy calculation that
' Z '
... A„kAmjAm2... Ami) — £ pk+j(—iy (5)
7= 0 J,
VII, § 12] SEQUENCES OF EXCHANGEABLE EVENTS 415

Equation (5) is valid for the case k = 0 too, if we understand by p0


the value 1.
The expression on the right hand side of (5) can be written in the form

i w-iy
' /'
= (-l )'A'pk+l, (6)
j=0 ,J ,

where A denotes the difference operator, defined by Axk = xk — xk_v


Since the probability on the left hand side of (5) is nonnegative, we have
obtained that for the sequence of numbers pk the inequalities

{-\)lAlpk+l>0 (7)

hold. Sequences of numbers having property (7) are called absolutely mono¬
tonic sequences. Hence an absolutely monotonic sequence is nonincreasing,
its first differences form a nondecreasing sequence (i.e. the sequence is
convex), its second differences form a nonincreasing sequence, etc. Note
that inequality (7) can be obtained from the representation of the sequence
of numbers pk in (2) or (3), since

(-1 )'Alpk+l= T xfe(l -x)ldF(x)= f ak (1 -a)'dP. (8)


o n

It was shown by F. Hausdorff that every absolutely monotone sequence


of numbers pk (k = 0, 1, . . .) for which p0 = 1 can be represented in the
form (2), where F(x) is a distribution function on the interval (0, 1). This
theorem can be deduced from the above Theorem 2. It suffices to show that
if Pk {k = 0, 1,. . .) is an arbitrary absolutely monotone sequence of num¬
bers for which p0 = 1, there can be constructed a sequence of exchangeable
events {A„} on a suitable probability space, fulfilling (1). This construction
is readily performed, e.g. by the fundamental theorem of Kolmogorov
(Ch. IV, § 22). In fact, if a„ denotes the indicator of the event An, then

P(A n Ank Ami.. . Ami) — P(ccltl — 1 j am, 0, mi = 0).


i *

Hence we can see from (5) that given the sequence pk, the joint distribution
function of a finite number of random variables chosen arbitrarily from
oq, a2,. . ., a„,. . . is given as well; the conditions of compatibility are,
obviously, fulfilled and thus the existence of th e(exchangeable) sequence of
events with the required properties is ensured by the fundamental theorem
of Kolmogorov.
416 LAWS OF LARGE NUMBERS [VII, § 12

Now we shall prove the following


Theorem 4. Let {An} be a sequence of exchangeable events with density
a and let oc„ denote the indicator of the event An. Then we have with probability 1

«i + «2 + • • • + a„
lim = a. (9)
n-*- oo n
Proof. Let

E (a/t~a)
a-, + a2 + + fc=i
tn =
72 77

We calculate the expectation of f* According to (4), if klf k2, k3, k± are


distinct positive integers, we have

£(K - «)(<**. - «)(«*, - °0K - a)) = 0. (1 0)

Similarly, it can be seen that if kx, k2, k3 are distinct, then

- «)2 («*, - «)K - «)) = 0, (11)


furthermore, if kx =£ k2, then

£(K- °03 K - «)) = o. (12)


On the other hand, if kx # k2, then

-^((aL-x a) (aArs af) — Pi — 2y>3 + /74 = A (13)


and
— d) ) = /7X — 4/72 + 6/73 — 3/74 = 5, (14)
hence
+ 372(77 - 1)A
mt) = =0 (15)

oo
Thus the series E -£((«) is convergent, hence (by the Beppo Levi theorem)
n—1
00

the series E C is convergent with probability 1, i.e. lim £„ = 0 with prob-


, ... ”=* _ n-»oo
ability 1, which implies the statement of Theorem 4. Our result may be re¬
phrased as follows: if {A„}isa sequence of exchangeable events and v
denotes the number among the events Alt A2,. . ., An which occur, then

the limit lun ~ exists with probability 1 and is equal to the density of the

sequence of events {An }.


Clearly, Theorem 4 is a generalization of the strong law of large numbers
concerning the relative frequency. Notice that the limit of the arithmetic
VII, § 12] SEQUENCES OF EXCHANGEABLE EVENTS 417

means of the random variables ock is, in general, not constant — contrary
to the case of independent random variables.
Let us now consider as an example Polya’s urn model (cf. Ch. Ill, § 3,
Point 9). Let there be in an urn M red and N — M white balls. Balls are
drawn at random, the drawn ball is replaced and simultaneously there are
added into the urn R > 1 balls having the same colour as the one drawn.
If An denotes the event that the ball drawn at the «-th drawing is a red one,
according to Formula (10) of § 3 in Chapter III the sequence of events
{An} is exchangeable and
k-1 M + 1R
Pk=
/=o N+1R
n (k= 1,2,...). (16)

It is readily seen that in this case


i
Pic = J xk dF(x), (17)

where F(x) is the distribution function of the beta-distribution with param-


M N-M
eters a = — and b =-——. That is (cf. Ch. IY, § 10, (10)) we have
R R

IN
r M N-M
R
F(x) = tR (1 ~t) R dt. (18)
M' ' N — M'
F r
R
Thus by Theorem 4 in case of Polya’s urn model the relative frequency of
the drawings yielding a red ball among the first n drawings converges with
probability 1 to a random variable having a beta-distribution of order
M N M prom tkjs jt follows that the distribution of this relative fre-
R ’ R ,
quency converges to the mentioned beta-distribution. In fact, if a sequence
t]n of the random variables converges with probability 1 to a random vari¬
able ri, then (cf. § 7) also r\n tends in probability to rj and thus (cf. Theorem 1
of § 2) the distribution of rjn tends to that of r\. Hence we have

Theorem 5. Let in Polya’s urn scheme vn denote the number of red balls
drawn in the course of the first n drawings, then
N
r R PL _i N~M _i

lim P < X tR (1-0 R dt,


M 1N-M
n-*~ oo
r r
R R
418 LAWS OF LARGE NUMBERS [VII, § 13

i.e. the limit distribution of the relative frequency of the red balls drawn is
M N-M
a beta-distribution of order
~R* R
In particular, if M — R = 1 and N = 2, the relative frequency of red
balls will be in the limit uniformly distributed on the interval (0, 1).
Furthermore, it is easy to see that Formula (10) of Chapter III, § 3 is a
special case of the present Formula (5).
As is seen from this example, the general theory of stable sequences of
events permits a deeper insight into some particular problems already dis¬
cussed.

§ 13. The zero-one law

In § 5 we proved the following statement: If the events An (n = 1,2,...)


are independent, the probability that there occur infinitely many of them is
OO

either 0 or 1 according to the series £ P(Ar) being convergent or diver-


n=1
gent. Thus the probability in question cannot be equal to any other value than
0 or 1.
This phenomenon is explained by the following general theorem, called-
the “zero-one law".

Theorem 1. Let Ax, A2,. . ., An,... be a sequence of independent events


and-f{n) the least o-algebra containing all sets An+1, A„+2,. . .; if C is an event
which belongs for every n to the o-algebra Mn), then either P(C) = 0 or
P{C) = 1.

Proof. We need a lemma from set theory. For its formulation the follow¬
ing definition has to be introduced: If o# is a system of subsets of Q having
two properties:

1) Bn £ and Bn+1 £ B„ (n - 1,2,.. .) imply lim Bn = ]~J Bn


»-«> «=i

2) Bn £ and Bn £ Bn+1 (n = 1,2,...) imply lim Bn = £ Bn ,


^ . . n-+ oo ,2 =l
then o-# is said to be a monotone class.

Lemma. If a monotone class of sets contains an algebra of sets it contains


also the least o-algebra o(*j€) containing .

Proof. It suffices to show that if (e^f) is the least monotone class con¬
taining it is identical with o(t^€). Let A be any subset of 12 and ^A the
VII, § 13] THE ZERO-ONE LAW 419

family of the sets B such that A — B, B — A, and A + B all belong to


Clearly, ^A is a monotone class, since for every (increasing or
decreasing) monotone sequence {Bn} with Bn A we have

lim Bn — A = lim (B„ — A),


n-*- oo n-*- oo

A — lim Bn — lim (A — Bn),


n-*- oo n-+ oo

A + lim B„ = lim (A + Bn).


n-*~ 00 /!-► CO

By assumption, erf is an algebra of sets, hence A d implies ^ £ o#A.


Since, by definition, (<^f) is the least monotone class containing we
have<Lx#(t^r) £ for ^4 £<^€\ Thus if C ) and ,4 £ then
C £ ^A. Hence A £ c and for every C £ G-^Q, ^ cl therefore
£o#c. Consequently is an algebra of sets, since for any
other D with D £ G-7^ we have D £ c, hence C + D £ («^)
and C - D £ But a monotone algebra of sets is obviously a er-al-
gebra, thus <r(oO £ o#V£). And since a(y€) is a monotone class con¬
taining and is the least class of this kind, we have necessarily
o(y€) = ^{y€), which proves the lemma.
Now we shall prove the theorem. Let «$„ denote the least algebra of sets
containing Ax, A2,.. A„. Let<^f„ be the collection of the sets which are
independent of all elements of A8n. n is a monotone class. If ^("} is the
least algebra containing An+1, An+2,. . ., then ^€(n) £ <-/#„. Hence, because
00

of the lemma, a(^(n)) = <$(n) £ Thus if C([[ ^.thenCisindepen-


n=1

dent of every A with A £<$„(« = 1, 2,. . .). If ^f(C) is the collection of


oo

all sets independent of C, we have (C), hence Iff £ (C), too.


n=l
00

By applying once more the lemma we find a ( ff ^«) =

It follows that C 6 ~*r(C), hence P(CC) = P(C) P(C) or P(C) - P2(C). But
this is impossible, unless either P(C) = 0 or F(C) = 1. Thus Theorem 1 is
proved. Finally we mention a generalization of the above theorem.

Theorem 2. Let £l5 f2,. . ... be an infinite sequence of independent


random variables and <$(n) the least a-algebrawith respect to which the random
00

variables £n+1, f+2,. . . are measurable. If C £ f] ^ either P{C) = 0


n=1

or P(C) - 1.
The proof will only be sketched, because it is similar to that of Theorem 1.
Let ^be the collection of sets independent of C. As in the previous proof,
420 LAWS OF LARGE NUMBERS [VII, § 14

we show for the least a-algebra L$n, relative to which the random variables
£i> £2> • • in are measurable, that (n = 1,2,...). Hence, accord¬
ing to the lemma, *$(1) £ therefore C £ ^ and, consequently, P2(C) =
= .P(C). Notice that Theorem 2 of § 10 can also be deduced from this.

§ 14. Kolmogorov’s three-series theorem

Theorem 1. Let j/1? rj2,. .tjn,... be independent random variables and


let 2 be an arbitrary positive number. Let the random variables r\* be defined by

* f rin for \nA< K


" | 2 otherwise.
00

The series )T converges with probability 1 iff the following three series con-
n= 1
verge:

I Pidin # r,*). (1)


n=1

I £(*£),
n=l
(2)

I ^2(^)- (3)
n=l

Remark. It is easy to see from the zero-one law that r\t, converges
• • n=\
either with probability one or with probability zero.

We show first that the conditions (1), (2), (3) are sufficient. From
Proof.
(1) and the Borel-Cantelli lemma it follows with probability 1 that rjn = tj*
co oo

for sufficiently large values of n; hence the series £ rjn and £ q* are, with

probability 1, simultaneously convergent or divergent. Thus it suffices to


00
show that X rj* converges with probability 1. Because of (2), it suffices to
n— 1 oo

prove this for the series £ <5„, where 5n = rj* - E(n*). We know that the
71=1

random variables bn are completely independent, further that E(d„) = 0

and £ Z)2(5«) < + co. Hence for an e > 0, however small, Kolmogorov’s
inequality gives
771 . N
P( max \pk\>t)& y, D\e,k).
n<,m<.N k=n £ k=n
(4)
VII, § 14] KOLMOGOROV’S THREE-SERIES THEOREM 421

and, if we let tend N to infinity for every fixed n


m i co

P( sup | X <5fc I > e) < -3- E D\dk). (5)


n<,m k—n ** k=n

Choose now from the sequence of all positive integers a subsequence rij
OO

(«! < n2 < • • •)> such that the series E dj converges, where
j=1
00

dj = Y.
k=nj

Then the series


00 m
E p( SUP I E 4 I > <0 (6)
j=1 nj<,m k=n;

converges as well; by applying the Borel-Cantelli lemma we obtain that the


relation
m

I I
k = nj

holds with probability 1 for sufficiently large j and m > nJr If n is an integer
between rij and m, we have with probability 1
m n-1 m

IE4I^I
k=n
E 41 + 1 kE
k=ni =m
41 — 2s. (7)

If we replace s successively by~^-^ = 1’ 2, . . .), we obtain (the union

of denumerably many sets of zero measure being a set of zero measure too)
with probability 1 the relation
m 1

I E 41 — -77- f°r m> n, (8)


tn M

whenever n is greater than a bound N(M) depending on M. But this means


00

that E 4 converges with probability 1. Herewith the first part of Theorem 1


k=1
is proved.
We show now that conditions (1), (2), (3) are necessary as well for the
00

almost sure convergence of the series E V«• If this series converges with
n=l
probability 1, r]n 0 as n -> a. with probability 1. Hence we must have
with probability 1 | rjn \ < 1, i.e. f/„ = rj*, except for finitely many values
422 LAWS OF LARGE NUMBERS [VII, § 14

of n. Thus according to the Borel-Cantelli lemma series (1) converges.


00
Hence it suffices to consider the series £ q*, the sum of which, a random
n=1
variable itself, will be denoted by q*. In what follows we shall need a lemma
due to J. L. Doob.

Lemma. If £ is a bounded random variable | £ | < M with variance

D~ = D\f), its characteristic function cpft) fulfils for | 11 < —the in¬

equality
DH2
3
<PfO | < e (9)

Proof. Put = £ - E(& We have E({*) = 0, 1£* | < 2M, and


<P$*(0 I — I 9^(0 I- We can write

I E(Cn) ! < (2M)n~*E(Z*2) ={2M)n-2D‘


and

T (t) 1 , Dh2 $ E(f*")(ity


1 + " n=3
A -n\-

For | t I < we obtain


4M

2.2
D~t < £ (2Mmra ^ p2t2
9>e* (0-1 +
n=3

i.e.
2# 2 2 ,2 r>2,:2
Z>T DU
I 9V(0 I ^ 9?^*(0 — 1 + + 1- < 1 - < e 3

which proves the lemma.


00

Now we return to the series q*. By assumption | q* | < a, hence if


n=1

is the characteristic function of q* and D2 its variance, we have,


according to the lemma

Di< In for t\< (10)


(01 4A
N
Since ^ tn converges to q* with probability 1 as N -> oo, the distribu-
«=i
VII § 14] KOLMOGOROV’S THREE-SERIES THEOREM 423

N
tion function of Z 17* converges to the distribution function of if at
n— 1
every point of continuity ot the latter. Hence
OO

n w)=<ko>
n=1

where ij/(j) denotes the characteristic function of rj*; furthermore there

exists an e > Owith \]/{t) + 0 for | t1 < e. Thus if | t \ < min (e, ~rr), then
N 4/1
- I ln I * ft) | tends to — ln | 1p(t) |. Because of (10) we have
n=1

Z Dl< ln 11j/(t) (ID


n=1

Y D„ is thus convergent and according to the first part of the theorem


n= 1
the series Y converges with probability 1. Since by hypothesis the same
n=l

holds for Y 'in’ the series


n= 1

fJE(n*) = Y 0i
n= 1 n= 1

converges with probability 1 too. Theorem 1 is thus proved.


It is easy to see from this proof that the following result is also valid:

Theorem 2. If rjd (n = 1,2,.. .) are independent random variables and if


E(jin), D2(t]n) exist, further if the series

Z E(dln) (12)
n=1

and

Z D% (^) (13)
n— \

converge, then the series Z hn converges with probability 1.


n—1

The assumptions of Theorem 2 are not necessary for the convergence of

Y r]„', even the existence of £(17,,) and is not necessary. Theorem 2


n=1
can be obtained as the limiting case for X —> + 00 of Theorem 1.
It should be noticed that the hypotheses of Theorems 1 and 2 do not
00
guarantee the absolute convergence of the series Z In- Thus for instance
n=1
424 LAWS OF LARGE NUMBERS [VII, § 15

1 1
if P Vn = ± - — ’ EQln) = 0, D (f]n) = —, , then the conditions of
n

Theorem 2 are fulfilled but the series E | r]n | is divergent. By Theorem 2,


n=1
OO

however, the series E rjn converges with probability 1 for any rearrange-
n=1
ments of its terms; the sum and the set of its points of divergence depend
of course on the rearrangement in question.
00

For the almost certain convergence of E | ?/„ | it is sufficient, according


77 =1
00

to a well-known theorem of Beppo Levi, that £ E(\ rjn |) < + oo. On the
11 = 1
00

other hand, it is sufficient that the series (1) and £ E(\ rj*|) converge, where
«=i
)]* is defined as in Theorem 1. This condition is necessary as well, since if
Theorem 1 is applied to the sequence \r]*n\, it can be seen that the con-
00 CO

vergence of E | rfn | implies that of £ EQ rj* j).


«=1 n=l
Theorem 1 is stronger than the law of large numbers. Thus for instance
Theorem 2 of § 7 can be deduced from the present Theorem 2 as follows:
Let • • •, £«, • • • be independent random variables with £'(£„) = 0,
00 Z)2 l
D(U = Dn, and assume J] —~ < + oo. If we put ^ = — , the hypo-
;;=i n n

theses of Theorem 2 are fulfilled; hence the series ]T ijn converges with prob-
77 = 1
ability 1. According to Kronecker’s lemma (Lemma 2 of § 7) with qn = n it
follows that with probability 1

E krlk
1 n
lim
n-*- cg
k=1-
yt
lim — Y 4 = o,
Ti—oo ^ Ar=l

which proves Theorem 2 of § 7.

§ 15. Laws of large numbers on conditional probability spaces

The relation of conditional probability to conditional relative frequency


is the same as the relation of ordinary probability to ordinary relative fre¬
quency. This is reflected in the following
VII, § 15] CONDITIONAL PROBABILITY SPACES 425

Theorem 1. Let P(A \ B)] be a conditional probability


space, C an event, C £ A8(n = \, 2, ...) a sequence of random variables
on -T which are independent with respect to C. Let further V be a Borel set
on the real axis and Bn the set of the elements co £ C such that £„(co) 6 V.
Suppose Bn (n — 1, 2,. . .). Then clearly Bn G C (n = 1,2,...). Let
further the conditional variances

D\UB.) = Dl (n = 1,2,...,) (1)

exist and assume that the conditional expectations

E(fn\ Bn) ~ M (n= 1,2,...) (2)

do not depend on n. Put

Pn = P(Bn\C) (3)

and assume that pn > 0 (n = 1,2,.. .). Suppose further that

X Pn = +00 (4)
n =1

and
00 p D2
e < +»■ (5)
“-1 ( Eft
k-1
)2

Define
S,(V)= £ it and JV„(0 = £ 1. (6)
1 <J<<,n 1 <.k<,n

S„(V) is thus the sum of the b,k (1 < k < n) whose values belong to V and
Nn{V) is their number. Then

Sn(V)
P lim =M
72-*-CO Nn{V)

Proof. Let us put

f 1 for £k(co) 6 V i.e. for w £ Bk,


£k £/c 60 I o otherwise
and
ek(Zk-M)
Sk = k

YPj
426 LAWS OF LARGE NUMBERS [VII, § 15

Consider the series Y dk. The 8k are independent under condition C;


k=1
further E{8k \\C) = 0 and

D\8k | C) =
(YpjY
7=1

thus by hypothesis (5) the series Z Z)2^ | C) is convergent.


/c=i
00

Kolmogorov’s three-series theorem shows that the series Z 5k con-


k=1
verges with probability 1 under condition C. If we apply Kronecker’s
n

lemma with gn= Y Pk> we obtain


k=X

Y Ek (f* - M)
lim _ c = 1. (7)

Z
k=1
Pk

Put now

£k ~ Pk
Vk= —k-
I/*
7=1
00

By repeating the preceding reasoning for the series Y *1k we find that

E(nk | C) = 0 and

D\>h IC) = Pk (1 ~ Pk) .


(Eft
j= 1
)2
00

The series Z D\rjk \ C) converges by the lemma of Abel and Dini;1 it


k=1

follows (as in the proof of (7)) that

f n
\
Y £k
p lim k=l
-1 c = 1. (8)
n-+ oo

V YPk
k-1 /

1Cf. K. Knopp [1],


VII, § 15] CONDITIONAL PROBABILITY SPACES 427

(7) and (8) lead to

[ " \
E ekU
p lim fc=1 — M c (9)
n-+ co
l E £k /
x fc=i

Since E ek^k — Sn(V) and E ek = iV„(F) Theorem 1 is herewith proved.


fc=i *=i

The quotient can be considered as the empirical conditional aver-


(^0
age of the random variables indeed it is the arithmetic mean of the
£k (k = 1,2,...) whose values belong to V.
The conditional strong law of large numbers can therefore be stated as
follows: If (4) and (5) are valid, the arithmetic mean of those values of £x,
£2,. . ., 1ln which belong to V converges with probability 1 to the common
conditional expectation of the £k under condition V.

Remarks.
1. If Dk is bounded, e.g. independent of k, the condition (5) is a conse¬
quence of (4) by the Abel-Dini theorem; hence in this case it suffices that
(4) is fulfilled.

2. If instead of (2) we suppose that Mn = E(gn | Bn) fulfils

00 E Pk I Mk - M |
y
^
*=i_< n
+00 (io)
n=1 v—\

E Pk
k=1

and if we replace (5) by condition

£ pn(Dj + M*(l - P„))


L n (50
’=1 (E Pkf

Theorem 1 remains valid; the proof is similar.

3. If the whole real axis is taken for V, then clearly Bn = C and pn= 1;
. . “ D2n < + 00.
hence (4) is trivially fulfilled and (5) reduces to the condition y —
n=1 72
428 LAWS OF LARGE NUMBERS [VII, § 16

We obtain thus, as a particular case of Theorem 1, Theorem 2 of § 7 for the


ordinary probability space [Q, ts£,P(A | C)]. Notice that it is possible
to state and to prove Theorem 1 without reference to conditional probability
spaces; however, in the form given above it shows how the strong law of
large numbers can be extended to conditional probability spaces.

4. Consider now the following special case: Let V be the set which con¬
sists only of the two elements 0 and 1, let further be P(£ = 1 | Bn) = p and
P{£ = 0 | Bn) = 1 — p = q. This situation can be described as follows: &
represents an infinite sequence of independent experiments, A and B are the
possible outcomes of the individual experiments. If at the k-th experiment
both A and B occur, we have t,k — 1; if A and B occur, we have £k = 0;
finally if B does not occur at the &-th experiment, takes on a value
distinct from 0 or 1. Then

M = E(UBn)=P(l;n=l\Bn)=p.

The number p can be considered as the conditional probability of A with


respect to B; we write therefore p = P(A \ B). Furthermore, Dn = ^fpq

and (4) implies (5). The quotient fn{A \ B) = -*?" ^ is thus the conditional

relative frequency of A with respect to B in the course of the n first experi¬


ments.
Theorem 1 states that/„(^ | B) tends to P(A \ B) with conditional proba¬
bility 1 for a given condition C. In this interpretation C is some condition
concerning the infinite sequence of experiments. pn = P(Bn | C) is the con¬
ditional probability of B at the n-th {n = 1,2,...) experiment.

§ 16. Exercises

1. Prove Theorem 2 of § 3 by the method of characteristic functions in the case


where the £* are not merely pairwise but completely independent.

2. Prove the generalization of Theorem 2 in § 3: Let the random variables


£i> £2 > • • • satisfy the following conditions:
a) the expectations Mn = £(£„) exist and

1 ”

lim - Yj Mk= M;
n-fco n k=l

b) the variances D2n = D2(£„) exist and

lim
n~*- co ^ X
*=1
D‘‘ =°
VII, § 16] EXERCISES 429

c) the correlation coefficients Rtl = R(£h £,) fulfil the inequalities


CO 00 00

Z Z rh x‘ x> ^ c i'Z= l x‘
i=l /=1
cb

for every system of real values x, such that Z xf converges; C is a positive constant.
(=i
Given these conditions.

lim st — Y £k = M.
o « *= i

3. Prove the following theorem: If f(x,y) is a uniformly continuous function of


wo variables, if lim st i and lim st rjn = rj, then
n —► co n —*■ co

lim st M„, rjn) = f($, rj).

If £ — a and rj — b are constants, continuity of/(x, y) at the point (a, b) is sufficient.

4. Let £„ (« = 1, 2, ...) be independent random variables with P(£, — + n) — —,

1 "
hence E(£n) — 0 and £>(?„) = .Therefore the condition lim — Z — 0

of Theorem 2 in § 3 is not fulfilled. Put C„ = ^ and show that

C„ does not tend in probability to zero.

Hint. Let cpn(t) be the characteristic function of £„, then

n f /~k _ t3
<pn{t) = J~[ cos —^— and lim (r) = e 4 ;
A: = 1 ^ n —► oo

does not tend to 1.

Remark. The distribution of £„ converges to a normal distribution.

5. Let $!, f2,. . ., be pairwise independent random variables with

Pttn = ±n6)= y-

Show that the law of large numbers holds for the sequence if 0 < 5 < — •

6. Let the events Au A2, . . ., An be the possible results of an experiment. Let there
be performed N such independent experiments. The probability that the event Ak
occurs exactly vk(N) times (k = 1,2and in a given order, is equal to

n Pltm>
k=l

where pk = P(Ak). Since nN depends on the sequence vk(N) (k = 1, 2.n) and


vk(N) are random variables, nN is a random variable as well. Obviously
430 LAWS OF LARGE NUMBERS [VII, § 16

The quantity H(d) = — ^ P^HzPk is called the entropy of the complete system of
k=l
events d — (Au A2,. . ., A„) (cf. Appendix). Prove the limit relation

lim st — log, = H(d).


N —*• CO A JlN

Hint. According to the law of large numbers

vk(N)
lim st ——— = pk for k = 1,2,..., n.
N-+ 03 A

7. Let an urn contain a0 white and b0 red balls. If we draw from the urn a white
ball, we put it back and besides we add to the urn ax white balls and bx red balls.
If we draw a red ball we put it back and add to the urn a2 white and b2 red balls where
a\ + bx = a2 + b2, a2 > 0. The same procedure is repeated after all subsequent
drawings. Let C„ denote the number of white balls drawn in the first n drawings.
Prove the relation

lim st — =
bi +

Hint. It is easy to show that lim E ; further lim D | —- ] =0; hence


n~* a. \ n I bl + a2
our statement follows from Chebyshev’s inequality,

8 a) Let r]n{n — 1, 2,...) be bounded random variables, |7?„| < C. The necessary
and sufficient condition that rj„ should converge in probability to zero is the fulfilment
of the relation
lim E( \Vn\ ) = 0. (1)
n —► oo

Hint. Applying Markov’s inequality to the random variable |^B|, we obtain

P( \r)„\ >£)<
e

hence condition (1) is sufficient for rjn 0.


Suppose now that lim st rjn = 0. Let A„(S) be the event \rjn\ > 8, with an arbitrary
n—► co
<5 > 0. We have then

E{\Vn\) = E{\rjn\ | A„(8)) P(A„(S)) + E(\Vn\ \aJ6)) P(IJS)) < CP(A„(8)) + 8 .

By assumption lim P(A„(8)) = 0. Hence lim sup E(\Vn\) < 8. Since 6 can be arbitrarily
11 1 n-*CO
small, the necessity of (1) is proved.
b) Suppose that lim st £n = c and that/(a) is a Borel-measurable bounded function
n —*■ co
which is continuous at the point c. Then lim £(/(£„)) = /(c).
/!-> 03

Hint. Evidently, lim st </(£„) -/(c)) = 0. Since f(x) is bounded, it follows because
n —co
of Exercise 8.a) that

lim E( |/(C„) -/(c) | ) = 0,


VII, § 16] EXERCISES 431

hence
lim £(/0 = /(c).

9. Let /(x) and ,g(x) be continuous functions on the closed interval [0, 1 ] which
fulfil the relation 0 < f(x) < Cg(x), where C is a positive constant. Then

dx
*. if...+ + dXidXt_I™.
J -J Axi) + Axt) + ■ • • + 9(xn) f-f g(x) dx
g(X)

Hint. Choose in the unit-cube of the n-dimensional space a point P at random


(with uniform probability distribution); let ?X) £2> • ■•»£« denote its coordinates;
lk(k= 1,2,...,«) are thus independent and uniformly distributed on [0, 1]. Put

Mi) + AO + • • • + AO
Vn =

and
.Ml) + fffe) + • • • + Mn)
n

We have thus
1 1
lim st r\n = J /(x) dx, lim st £„ = J g{x) dx,

1
and, since J g(x) dx > 0, we have by the result of Exercise 3,

J/(x) dx
Vn 6
lim st
n —cd (3/7
j gix) dx

Since further 0 < — < C, we get from the result of Exercise 8. b)


t.
j /(x) dx
Vn
lim E r ~ i ’
' J g(x) dx

q.e.d.

10. Prove that the limit relation


k) (nh)k
lim 1/ X + — c-** -Vf- = A* + *)
ft —► oo k=l
« k\
holds for every h > 0 and x > 0 if/(x) is a bounded continuous function on (0, + ~)
(Theorem of E. Hille).
Hint. Let O . . . be independent random variables having a common Poisson

distribution with E(0 = h. Put £n = ^ + ^ ~• Then lim st = h' Since


n n-+ co
432 LAWS OF LARGE NUMBERS [VII. § 16

fix) is by assumption continuous and bounded, it follows according to Exercise 8.b)


that
lim E(f(x + £„)) = f(x + h) ,
rt-+ oo

which was to be proved.

11. Let g(s) denote the Laplace transform of a function fix) which is bounded and
continuous in the interval [0, +°o):
oo
g(s) = J e~sxf(x) dx.
o
Prove the Post Widder inversion formula

(- l)"-i„y«-1> _
f(x) = lim for a > 0.
x"(n - 1)!

Hint. Let • ••» 4> • • • be independent random variables having the same
t
exponential distribution with expectation a, i.e. < /) = 1 — e~ x for t > 0 If we
1 \
put 4 — — ^ {*> then lim st £„ = a, hence (see Exercise 8.b))
n k=l «-* co

lim E(/(Q)=/(a).
n —► co

Now we have (cf. Formula (24) of Ch. IV, § 9)

nt
n"tn 1 exp nn (— l)"-'^-»|J.
x
mo) = m in - 1)! a"
dt
A"(n - 1)!

which leads to our statement.

12. Let r, be uniformly distributed on [0, 1], Let £„(/•) denote the number of occur¬
rences of the digit r ir= 0, 1, ..., 9) among the first n digits of the decimal expansion
of show that

Ur) 1
a) P lim = 1 Cr = 0, 1,..., 9),
\n —oo lo

( n
4(0
b) lim sijp f—
To
2/i In In n 3 /
V
Htnt. Let the random variable 4(r) be equal either to 1 or to 0 according to the
«-th digit in the decimal expansion of y being equal to r or distinct from it; then

4(0 in = 1,2,...) are independent and have the same distribution P(£ ir) = 11 = -i_
9 10’
p(Ur) = 0) = — (r= 0,1,..9). a) is obtained from the strong law of large numbers,

b) from the law of the iterated logarithm.


VII, § 16] EXERCISES 433

13. Let • • •, be independent random variables with

i*(£*=±D = y (k=

P ut rjk = Cl + + • • • + £k and = max rjk.


l^k<.n
a) If q > 0, then

1 1 2 n\ „2n-2k
In - 1
E{q^) = ^r 1 +
k
+ n - 1
k=0 \

b) Show by means of the result of a) that

q +
E(q^) < 11 + j
c) Using b) show that

c„
P |lim sup < 1 = 1.
yj In In In n

d) Show that
( k - i \

k - 1
n—1

E(U = I 2k — for n = 2, 3, 4, . ..
k= 1

and conclude from it that E(C„)

e) Show finally that


,__ X

/If e -42 du for x > 0,


V nJ
lim P(C„ < xjn ) =

0 otherwise.

Hint. Let p„%k = P(Cn = k) (k = —1, 0, 1,We have the following recursive
formulas:

Pn+i.k — yfeji-i + Pn.k +1) f°r k — 2, 3,..n + 1,

Pn + l.l =y (Pn ,0 + Pn.2 + P/f.-l)*

Pn + 1. —I — 2 1 Pn.o)

Pn +1.0 — 2 P"’l‘

a) follows by induction.
434 LAWS OF LARGE NUMBERS [VII, § 16

14. Let . be independent random variables with £"(£,,) = 0,


D(£„) = Dn and put f„ = ^ + £2 + ... + then for any e > 0 and n= 1,2,...

P sup > e < 1 y^., y (2)


_ k= 1 L
k=n-\-1 K
L2

This inequality is due to J. Hajek.1

Hint. Put 17= £ Clf-A--' . Then


^0 U2 (*+D2
I ■ , ^ Z)|
= II k= 1
E -£
/c=«+1 K
(3)

Let Ak(Je> n) denote the event that the inequalities

1C,
< e {m = n, n + 1,. . k — 1) and > £
m

hold and put A = Ak; then A, An, A„+1, ... is a complete system of events. Hence
k=n

E{n) = E(rj | A) P{A) + £ | P(/h).


k=n
Clearly

E(.V \Ak)=fj E(?m | Ak) (-L - ] >


l »i2 (w + l)2)

l l
> £ £(£ | Ak)
m=k /n2 (m + l)2
If m > k, it can be shown as in the proof of Kolmogorov’s inequality, that

E{Fm\Ak)>E{tt\Ak)>k*e\
Hence
E(V | AJ > e2,
and

%)>«2 Z E(Ak) = e^pfsup 4 >£ (4)


k=n \k^n

(3) and (4) imply (2).

15. Deduce Theorem 2 of § 7 from Exercise 14.

16. Prove the following generalization of Exercise 14. If f2, . . . are completely
independent and if £(£*) = 0, D\Zk) = D\ , further if 0 < Bk < B., < ... is a sequence
03 jy2
of positive numbers such that £ + oo, we have, for e > 0,
fc=i Ek
1 f 1 Dl
asS7§r £2 l E« + .E H
^ t=n+l Bk

1 For this proof see J. Hajek and A. Renyi [1],


VII, § 16] EXERCISES 435

Hint. The proof of Exercise 14 can be repeated almost word by word,

17. Deduce from the inequality in Exercise 16 the following theorem: Let rju r]2,. . .,
r)„,. . . be completely independent random variables with expectations E(rjk) = Mk > 0
and with finite variances D2(rjk) = D\. Suppose that

«) Y, Mk= +°°,
k=1
Dl
(5) the series is convergent. Then with probability 1
k= I
(I M<)‘
/= 1
E Vk
k= 1
lim = 1.
E
k=1
Hint. This is a generalization of Theorem 2 of § 7; in fact, if % = £* — Mk + 1,
oo £)2

then E(j]k) = 1, Z)(%) = Dk\ thus if £ -77 < +°o then with probability 1
k=i *

1
lim E % =1
and thus

lim - £ (4- M*) = 0.

18. Prove Theorem 1 of § 15 by means of Exercise 17.


Hint. Let r\k — £k(Zk — M + 1)- Then

E(Vk I Q = P(Bk \C) = Pk> D(r)k | 0 = Dk.

Thus if conditions (4) and (5) of Theorem 1 of § 15 are fulfilled, conditions a) and
/3) of Exercise 17 are fulfilled as well and it follows with conditional probability 1
with respect to the condition C that

Yek{Zk-M+ 1)
lim —-^-- = 1. (5)
CO
E Pk
k=1
If we apply the theorem in Exercise 17 to the sequence r\k — ek we have again, with
conditional probability 1 under condition C

E £*
k= 1
lim -V-- L (6)

k=1
±Pk
From (5) and (6) it follows that
/
E £^k
lim -^-4-= M = 1.
E £*
fe = l
436 LAWS OF LARGE NUMBERS [VII, § 16

19. If the random variables £ls f2> . are identically distributed and if
the fourth moment of q. exists, then for the validity of the strong law of large numbers
it is sufficient that the £* are four-by-four independent (instead of being completely
independent); thus, if any four of the random variables are independent and if

£(£*) = 0, Z)2(|„) = D-, E(Z£) = Mt, further if we put £„ = — £ £*, then we have

P( lim C„ = 0) = 1.
n—► co

Hint. If we apply Markov’s inequality to the random variables £)}, we obtain

PC I £„ | > e) < ^* + Mn - 1} Di .
e4 e4 «4

Hence the series £ P(| | > £) is convergent and we can use Lemma A of § 5. (The
n= I
idea of this proof is due to F. P. Cantelli.)

20. If £t, £2,.. ., are identically distributed random variables with finite variance,
it suffices for the validity of the strong law of large numbers the still weaker criterion
that the £k are pairwise uncorrelated (instead of completely independent).

Hint. Let £({*) = 0, Z>(£*) = D and

4 = ^1 4-
12 k= 1
According to Chebyshev s inequality, P(|C„a | > £) <—-—— ; hence the series
n2
to-
e2
CO

Z F(|Cn°- I ^ e) is convergent. By the Borel-Cantelli lemma


n= 1
I C„2 | < £ (7)

with probability 1 for a large enough n. On the other hand


( N (»+1)2-1 ( N >
P\ max E 4 > en2 < I P\ > en2
* = M2+1 Z 4
l«2<V<(n+l)2
/ V=«2+l V A:=/i2+l

Hence by Chebyshev’s inequality.

(n+i),a-l (AT_„2)£)2 4£)2


max
^/i2<N<0i+l)2
I
&=«2+ 1
4 >£«2 < £
V = »2+l
<
e22 n
yj 2

Applying again the Borel-Cantelli lemma, we find that with probability 1,

max
»2+1£V<0i + 1)2
I
fc = «2+l
4 < en2 (8)
for a sufficiently large n. (7) and (8) lead with probability 1 for n2 < N < (n + l)2
to the inequality |£„| < 2f for a large enough n, which proves our statement.

21. If fu • • • are pairwise uncorrelated, if £(£*) = 0, D2tfk) = D\ and if the


■ v ■
series ^ -7^7 is convergent, then the strong law of large numbers is valid
t,i k%
fc=
EXERCISES 437
VII, § 16]

Hint. Use the method of Exercise 20.

Remark. By a different method1 it can be proved that even the convergence of the
® ]n2£
series D\ ——- is sufficient.
k— 1 K

22. Let us develop the positive number x lying between 0 and 1 into Cantor’s series

<7l q-l ■ ■ ■ Qn

belonging to the sequence qn (q„ > 2, qn integer), where the “digits” e„ (x) may take on
the values 0,1,..., Q„ - 1 (n = 1,2,...). If V is a random variable uniformly
distributed on the interval (0, 1), let £„(£) denote the number of digits e,(rj) equal to k
(j = 1,2,...,«). Assume that the sequence q„ fulfils the conditions lim qn = + 00 and
00 1 n~*‘00
V — = + oo. Show that
n=\ (In
Cn (k) for k = 0, 1,....
P lim n
n —*■ co l

Hint. Let
1 for e„ (rj) = k,
0 otherwise.
1 1
and D2(^nlc) - — the convergence of
Hence, for q„ > k, E(Z„k) —
<In Qn

follows from the Abel-Dini theorem. Thus we can apply the result of Exercise 17-
The statement of the present exercise can also be obtained as a particular case o
Theorem 1 of § 15.
23 Let ?? be the frequency of the event A in a sequence of n independent experi¬
ments, while P(A) = p(0 < p < 1; q = 1 - p). Prove the complete form of the law
of the iterated logarithm, i.e.

Vn - nP (9)
jP I lim sup- = 1=1
n—<x> 2npq In In n
and
Plton inf —-======== = — 1 j = 1.
(10)
I n—a. sj2npq\n\nn J

Hint. The proof of the Moivre-Laplace theorem shows that


►-3
p \ Vn .—>X\ =1 -®(x)+0\~
Jnpq n

1 Cf. J. L. Doob [2], p. 158, Theorem 5.2.


438 LAWS OF LARGE NUMBERS [VII, § 1 6

for x = ln In «); hence

(ln In ri) 2
O
f h£** „a H=1 - -g) )+ Jn
Since we have (cf. Ch. Ill, § 18, Exercise 18):
K‘‘
1 - &{x) > - - 2

_/ 2nx
1 +
it follows that
Vn — nP
>l-e > for n > n.,
AJlnpq ln In n ln n

where c > 0 is a constant. Let nk — 2* and let Ak denote the event

Vn* ~ nkp
"/— > 1 — e.
y/2nkpq ln Inn*
00

Thus the series P(Ak) is divergent. It is easy to show that the sequence Ak fulfils the

condition of Lemma C of § 5. Hence there occur with probability 1 infinitely many


events Ak. Since e > 0 is arbitrarily small, (9) is obtained. (10) can be proved in a
similar way.

24. Let <Sj, c2, . . ., . . . be pairwise independent random variables with common

distribution function F(x). Put = ji_+ ^2 + • • • + _ Show ^ Qrder


72
lim st Cn — 0 should hold, the following two conditions are sufficient:

+n
a) lim J xdF(x) — 0,
w —► 00 —n

b) lim xF(x) - lim x(l - F{x)) = 0,


*->-+co

(Theorem of Kolmogorov).
Hint. Let

k for | {* | < n,
^nk
otherwise
and

1
„ Yj ^nk-
4=1
Then

pttn 5* £*) <, P( 14 I > n) — «[>(- ri) + 1 - F(/i)]

and, because of b),


lim P(£„ ,6 f*) = 0.
VII, § 16] EXERCISES 439

Thus it suffices to show that lim st £* = 0; as because of


rt 00

p( i c„ i > e) < pc i :* i > e) + p(c„ *:?>


it follows from this that lim st £„ = 0- On the other hand, by a)

lim P(C*) = lim J xdF(x) = 0.


n —► co — n

Furthermore
-Uc-1)
+/z k

D2 (f *) < — ( x2 JF(jc) < — V it2 f rfP(x) + — E /c' f


n J n ) n k=l j
-n k-i -«■

Putting ak = 1 - F(k), bk = P(- fc) we may write

D2 (C*) < — V k2 [(flfc_x - a*) + (6*-i - bk)] ^


n k= 1

<— (a* + 6*)(2*+ D-


n fc=0
Because of b)
lim (2fc + 1) (u* + ^/c)

hence
lim D2(£*) = 0 ,

and thus lim st £* = 0.


n —► oo

Remark. In case of completely independent £„ the proof can be somewhat simplified


by employing characteristic functions; in fact

+ co it )"xdF(x) + 8n'

E(e“'n) — |l + j* (exP~-lj^F(*)j = l1 +
—n
n

where
+n
(' t2
8n I < n[F(-n) + 1 - *(»)] + — j x2 dF(x)

Hence by a) and b) follows


lim E(el,Cn) = 1
n —► oo

for every real t.


CHAPTER VIII

THE LIMIT THEOREMS OF PROBABILITY THEORY

§ 1. The central limit theorems

On the basis of the theorems on characteristic functions established in


Chapter VI, we may now pass to the proofs of theorems concerning limit
distributions. Most important among these are the so- called “central limit
theorems”, which express the fact that the distribution of the sum of a
large number of independent random variables approaches, under very
general conditions, the normal distribution. These theorems disclose the
reason why in applications distributions close to normal distributions are
so often encountered.
A typical example is the case of errors of measurements; the total error
is usually composed of many small errors. The central limit theorems justify
the assumption that the errors of measurement are normally distributed;
that is, why the normal distribution is sometimes called the law of errors.
The simplest case of the central limit theorem, namely the Moivre-
Laplace theorem, was already dealt with in Chapter III. First we give here a
somewhat different formulation of this theorem.
Let there be performed n independent experiments having for their pos¬
sible outcomes either the occurrence of an event A or its non-occurrence A.
Put p — P(A), q = 1 — p = P(A). Let the value of the random variable
be either 1 or 0, according as the event A occurs or does not occur at the
/c-th (k = 1,2,...,«) experiment. Put further

— £ 1 + £2 + • • • + £„•

We know already that E{Q = np and £>(Q = The linear trans¬


formation which transforms the random variable £ into the random variable
£* _ ^ ~ -£(£) ,
D(f) having expectation zero and standard deviation one, is

called standardization. Let C„* = C"-_2 be the standardized variable cor-


,. . \/nP9
responding to Since £„ is evidently a binomial variable P(7. = JA =
I n k n_k v * '
~ \ k) P qn ' Theref°re the Moivre-Laphce theorem (Ch. Ill, § 16.
VIII, § 1] THE CENTRAL LIMIT THEOREMS 441

Theorem 3) can be formulated as follows:

lim P(C: <*) = <?>(*), (1)


n-+ + oo

where
X

$(x) = J2ii j * 2 du (2)


— 00

is the normal distribution function. In other words, the distribution of the


standardized relative frequency of the event A during n independent experi¬
ments tends to the normal distribution as n -> + oo.
The random variables £k are of a very restricted nature: they assume only
the values 0 and 1. The central limit theorem in its most simple form, which
is an immediate generalization of the Moivre-Laplace theorem, can be
stated as follows:

Theorem 1. Let d1? £2, be independent identically distributed


random variables for which M = E(£n) and D = D(fn) > 0 exist; put

- £(C„)
C„ - z Zk and Cn =
k=1 D(0

If F„(x) is the distribution function of (*, we have

lim Fn (x) = <P(x) uniformly for — oo < x < + co. (3)

Proof. Let cp{t) be the characteristic function of rjk = ^k — M and


ij/n(t)the characteristic function of (*. Since Elf,) = nM and D(C«) = F)J n
it follows from Theorems 3 and 6 of Chapter VI, § 2 that

<A«(0 = <P (4)


DyJ n

From Theorem 9 of Chapter VI, § 2 it follows, because of E{r\k) = 0, that

<P = 1 - + o (5)
Ps/n, 2n

By applying the well-known formula

lim 1 + — = ex if lim x„ = x, (6)


n-» + oo n rt— + 00
442 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 1

we conclude that for all values of t


_
lim iJ/n(t) = e 2 (— co <f< + oo) (7)
+ 00
which, because of Theorem 3 of Chapter VI, § 4, proves (1). It is easy to see
(in view of the continuity of <P(x) for all x) that the convergence is necessa¬
rily uniform in x. Evidently Theorem 1 contains the Moivre-Laplace theo¬
rem as a particular case.
The statement of Theorem 1 remains valid under much more general
conditions. This was shown first by Chebyshev and Markov by an entirely
different method, namely by the method of moments (see § 13, Exercise 27).
The method of characteristic functions was first employed by Liapunov.
He proved by essentially this method that the central limit theorem can
be applied under much more general conditions than those of Chebyshev
and Markov.1 The result of Liapunov can be stated as follows:

Theorem 2. Let £2,. . ... be independent random variables, the


first three moments

M(£k) = M, D2 (4) = T>! > 0, M(\Zk-Mk\s) = Hsk

of which exist {k = 1,2,...). Put

= \fD\ + + ... D2 , (8)

K„ = + (9)

r _ v i ) r* Cn — ^(Cp) (im
2-1 £n — N • (10)
k=1

Let Fn{x) denote the distribution function of (*. If Liapunov's condition

lim ^- = 0 (11)
Az-*- + co ^n
is fulfilled, then

lim Fn(x) = <P(x) (~co<x<+oo\ (12)


n-+ -f oo

Remark. The condition (11) is evidently fulfilled when all £,k have the same
distribution. In effect, in this case Dk = D, Hk = H, Sn = D^fn, Kn =

Later on it was proved by Markov that Liapunov’s theorem can also be proved
by the method of moments.
VIII, § 1] THE CENTRAL LIMIT THEOREMS 443

H\fn, hence

H
lim K" lim = = 0.
c
n-+- + oo ^n D n^ + oo %Jn

It is again fulfilled when the random variables — Mk are uniformly bound¬


ed and lim Sn — + oo, In fact, from | — Mk \ S C follows

Hi < CD\,
hence
K,
!L <

and since Sn -> -1- oo, condition (11) is satisfied.


Liapunov proved the central limit theorem starting from still more general
hypotheses. As a matter of fact, it suffices to assume the existence of the
moments of order /? (for some arbitrary /l > 2) instead of the third moment;
in this case instead of (11) one has to suppose

lim = 0. (13)
+ O0

where

Kn(P)= t EUZk-Mk\*)f. (14)


k=1

Lindeberg proved the central limit theorem under still more general con¬
ditions. His condition is, in a certain sense, necessary as well. It is formulated
in the following theorem due to Lindeberg:

Theorem 3. Let f1} £2, be independent random variables for


which the expectations Mk — E(£k) and the standard deviations Dk = D(ff)
exist (k — 1,2,...). Put

Sn=iD\F (15)
k=1

and let Fk(x) be the distribution function of lk - Mk. If for every positive
s the so-called Lindeberg condition

lim t | x2dFk(x) = 0 (16)


n-*- + co ^n k—1 J
\x\>e Sn
444 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 1

is fufillled, then for

Z «* - Mk)
r* _ *=i (17)
S n o

we have■
lim P(Cn <x) = 0(x). (18)
n-*- + oo

Remark. From Liapunov’s condition (11) one can deduce (16); indeed
we have
+ CO
AZ F*
1 ” 1 1
5isf t J ^VFfc(x)<—-g- LX
—1 ./
I* I3 <#*(*) =
£
(19)
n k~l J jaz At = 1 J
IJC! > — 00

Similarly, (16) can be deduced from (13), too. Hence it suffices to prove
Theorem 3 (Lindeberg’s theorem); then Liapunov’s theorem (Theorem 2)
will also be proved.

Proof of Theorem 3. If cpk(t) is the characteristic function of t]k = £k —


— Mk, then
+ 00
t
<Pk = I e Sn dFk(x). (20)

We need the following elementary lemma:

Lemma. For a real u and k — 1, 2, . . ., we have

k-1
o«y
e - £ V- (21)
7=0 J k\ '

Proof of the Lemma. In fact,

| ew — 1 |2 = 2(1 — cos u) = 2\ sin vdv < 2 | vdv = w2, (22)


o o

hence (21) holds for k — 1; if (21) holds for any k, it follows from

eiu _ £ c=
dv (23)
7=0 J 7=0 J'-
VIII, § 1] THE CENTRAL LIMIT THEOREMS 445

that (21) holds for k + 1 too; hence by induction (21) holds for every k.
Thus the lemma is proved.
We have therefore

~ itx x2;2
e Sn = 1 -— + 0X where | 91 \ < 1 (24)
2Sl
and
itx
itx x212 x3t3
eSn = 1 + + 02 where I 0, I < 1. (25)
Sn 2 52 6 Si

Now let e > 0 be given. The integral (20) can be separated into two parts:

UJ _ J1
+ sSrt
itx
t
<Pk

— ESn
e Sn dFk (x) +
I
|*|>eS7i
eSn dFk(x). (26)

Consider first the first integral on the right side of (26). Because of (25)
we have

J J J J
CaS/i ^ &S n

e^dFk(x) = dFk(x)+ xdFk(x)-^ xhlFk{x) + B$\ (27)


— eSn — sSn ~~sSn ~&Sn

with
eSn
11
tfP | < —o \x\3dFk(x)^-^-Dl (28)
* “ 653 ,
—eSn

Apply now Formula (24) to the second integral in (26); we obtain

i
ItX
Sn dFk (x) = J dFk (x) + J xdFk (x) + Ff\ (29)
|x| >e<Sji \x\>eSn 1*1 >eSn

with

I < x2 dFk (x). (30)


2 5?
|*| >eSn

If we add (27) and (29), we obtain by (28), (30) and by taking into account
that E(rjk) = 0,
t2D\
<Pk = l + R(P, (31)
S„n 25?
446 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 1

with
1113 pD2 t2 r
J x2dFk(x)• (32)
|jc| > eS'rt

We show now that (16) implies

max Dk
lim^^-= 0. (33)
«->- 00
In fact
n

Dl = J x2dFk(x)+ J x2dFk(x) < + £ J x2dFk(x),


|jc|^6S'« |jc| > e^/, ’ k = 1 |xi>£,!>n

where
max D\ n
!S5fr-^2 + ^r I I *VFt(x). (34)
*=1 J
\x\ > sSfl

It follows, because of (16), that

max Dk
1 <k<,n
lim sup < £. (35)
n-*- + oo

Since e > 0 may be chosen arbitrarily small, (33) is proved.


Choose now »0(e) such that for n > n0(s)
n

Z I x2 dFk (x) < a (36)


k=1 J
\x\ >eS„
and

max Dk
1 <,k^n
< a. (37)

This can be done because of (16) and (33). Let further be a < ,thus

<, 1
2 2 Si

for n > «0(e) and 1 < k < n. Because of the identity


VIII, § 1] THE CENTRAL LIMIT THEOREMS 447

(where an empty product is to be replaced by 1), it follows from (32) and


(36) that
n n
t2 Dt \
n
*:=i 2Sl J < £ \Rf|<£
fc=1
(39)

The inequality — 1 + x\ < x2 for a: and the identity (38)

imply, by considering (37),


n t*D%
t2Dl " t4Di t4e2
n
A: = l
1 - -
2Sl
ne
k=1
25" ^ I
k=i
<
4S4
(40)

Hence from (39) and (40) it follows for n > rt0(s) that

t t*
[\t
n
*=i
<Pk —e < £ +r + (41)

Since e > 0 can be chosen arbitrarily small, (41) implies


n _ t*

lim [] (pk = lim E(e“«) = e 2 . (42)


n-*- + oo k = l /2-»- -f 00

Because of Theorem 3 of Chapter VI, § 4, our theorem is thus proved. The


convergence in (42) is even uniform on every finite interval | /1 < T.
If the are identically distributed and possess a finite standard deviation
D, we have

Fk (x) = F(x) (k =1,2■), J x2dF(x) = D2, and Sn = D^n .


— 00

Thus

lim V £ I x2dFk(x) = -L lim f XtiF(x) = 0, (43)


n—+ oo ^n A: = l J U »- + «> J
|jc| >eSn \x\>eD!n

Theorem 3 therefore contains Theorem 1 as a particular case. Notice that


+ 00
Theorem 1 does not follow from Theorem 2, since it is possible that j x~dF{x)
+ 00 +i° _0°
exists but f \x\3dF(x) = + oo and even that J |*|* dF(x) = + oo for every
-'oo -®

p > 2.
Let us add that Lindeberg first proved his theorem by a different method,
viz. by a direct study of the convolution of the distributions (see § 12).
Lindeberg’s condition (16) is, as was shown by W. Feller, necessary as
well, in the following sense: If • • • are independent random
variables with finite expectation and finite standard deviation, if Fk(x) is the
448 THE LIMIT THEOREMS OF PROBABILITY THEORY rvni, § i

distribution function of 4 - E(£k) and if 4 = 4 + 4 + • • . + 4,

A„=E(US, = D(0 and ~ A" , then


4

lim P(£* < x) = <2>(x) (44)


n— + oo
and
lim P( max | 4 - P(4) | > sS„) = 0 (45)
n-» + co l<,k<,n

hold iff (16) is fulfilled for every s > 0.


If (45) is satisfied, the variables 4 - P(4) are said to be “negligible”
(or “infinitesimal”). Condition (45) follows from (16) by the inequality

P( max 14 - E{4) I > eSn) < £ P( | 4 - E(4) | > eS„) <


1 <.k<,n k=\ '

— £2 £2 J" •x;2 dEk (pc).


\x\ >eSn

Lindeberg’s condition implies thus that the variables (4 - E(£k))/Sn are,


in a certain sense, “uniformly small” with great probability. We do not
prove here that (16) is a necessary condition. Neither do we deal with
further generalizations of the central limit theorem.1
The results of this section can be generalized in the following way: in¬
stead of a sequence 4 (k = 1,2,...) consider a matrix (4*), (k — 1,2,. . .,kn;
n = 1,2,...) of random variables such that the variables 4 /, \
are independent for every n and put " ’ ’ ’’ " n
kn

c. = E k=1
U-

By the same method which served to prove Theorem 3 we can prove the
following, somewhat more general, theorem:

Theorem 4.Let 4k(k - 1,2,..., kn) be for every n (n = 1,2,...) inde¬


pendent random variables with finite variance. Put Mnk = E(4A), Z) =
- D(U) and let Fnk(x) denote the distribution function of £nk — Mnk^We

assume that £ D\k = 1. If the Lindeberg condition


k=1

lim V J x2dFnk(x) = 0
n-* + co k = l \x\>e

Vo! ? the books °f B- V- Gnedenko and A. N. Kolmogorov [1] and W Feller f71
ol. 2, containing the detailed discussion of many further results in this domain. ’
VIII, § 2] THE LOCAL FORM OF CENTRAL LIMIT THEOREM 449

is satisfied for any e > 0, then


n

lim P( X (f- Mnk) <x) = <Z>(x).

Theorem 4 evidently contains Theorem 3 as a particular case; it suffices to

put Znk = 4r- (fc = 1. 2,. . n).

The proof of Theorem 4 is very similar to that of Theorem 3; hence we


leave it to the reader.
The central limit theorem can be completed by the evaluation of the
remainder, viz. by giving an asymptotic expansion for the distribution func¬
tion of £*, the first term of which is given by <P(x) while further terms pro¬

gress according to powers of —=.


Jn
§ 2. The local form of the central limit theorem

In the preceding section we have seen that the distribution function F„(x)
of the standardized sum C* of n independent random variables £2,. . .,
. . . converges, under certain conditions, to the distribution function of
the normal distribution as n -* oo. It is therefore natural to ask under which
conditions the density function of (* (if it exists) tends to the density func¬
tion of the normal distribution. For this the conditions must certainly be
stronger, since it is known that F„(x) -> <P(x) does not- necessarily imply
F'n(x) -*<P'(x). We prove first in this respect a theorem due to B. V. Gnedenko:

Theorem 1. Let £x, £2, be independent, identically distributed


random variables which have a bounded density function f(x); assume further
+ 00 00

that = [ xf(x)dx = 0 and that the integral D2 — j x2f(x)dx exists;


— 00 — oo

then the density function fn(x) of

£l + £2 +••• + £/!
r* = (1)
Djn
tends to the density function of the normal distribution; hence we have
1
1 o
lim /„ (x) = (2)
Jin
The convergence is uniform in x.1
1 Fig. 25. The figure represents the case when the random variables are uniformly
distributed on (- + yj 3)-
450 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 2

Proof. The supposition D = 1 does not restrict the generality. Let cp(t)
be the characteristic function of and (pn{t) that of £*. We know that

<Pn (0 <P (3)

fi(x)

Fig. 25

We first prove the following

Lemma. If the density function g(x) is bounded, g{x) < K, and the charac
teristic function
+ 00
HO = j g(x) e,tx dx (4)
- 00

+ 00
is nonnegative, then the integral j \p(t)dt exists.
— 00

Proof. (4) implies for v > 0


+v + 00
is \ j ^ f , N sin vx ,
'KOdt = 2 I g(x) ——— dx, (5)
J
—V

hence, for T > 0


2T
2/ +o
+P +00

J jJ
0 -v
HO dtj dv = 2 J g(pc) -
— cos 2Tx
dx. (6)

Since tp(t) > 0, we have on the other hand


2T +o +2T +T
f(J H0dt)dv= j H0(2T-\t\)dt<T$ HOdt, (7)
0 -v -2 T -T
hence
+T + 00

-T
J iHO dt <
J 9(-x)1~
— cos 2Tx
dx. (8)
VIII, § 2] THE LOCAL FORM OF CENTRAL LIMIT THEOREM 451

Because of g(x) < K and


+ 00
(' 1 — cos 2 Tx
dx —2nT
J
— 00

(cf. Ch. VI, § 4, Formula (40)) we get


+T
\ \j/(t)dt -^4nK. (9)
-T

Since T in (9) can be chosen arbitrarily large and by assumption il/(t) is


nonnegative, the lemma is herewith proved.
Now if the density function of one of two independent random variables
is bounded, the density function of their sum is bounded as well (cf. Ch. IV,
§ 9, Formula (4)). Thus, the density function /(x) being bounded, that of
is bounded if !;* is independent of ^ and has the same distribution,
and the characteristic function of ^ is equal to |<p(i)|2- Thus by our
lemma |<p(t)|2 is integrable. From this it follows by applying Theorem 2 of
Chapter VI, § 4 that for n > 2
+ 00

(10)
— 00

On the other hand, we have


4-00

(11)
— 00

Furthermore, for every T > 0, because of the uniform convergence of


_ t2
(pn{t) to e 2 on every finite interval, we have
+T +T
_C
1
lim e 2 e~,xt dt (12)
W-*- + 00
2n
2t -T

uniformly in x.
We show now that the integral

In(T)= f (<Pn(t)~e~2)e-ixldt (13)


\t\>T

can be made arbitrarily small, uniformly in x, by choosing T and n suffi¬


ciently large. Because of (12), the theorem will then be proved. In order to
show that (13) can be made smaller than any positive number by an appro-
452 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 2

priate choice of n and T, notice first that


+ 00 + 00
n

\I,(T)\<2 j dt -f- 2 (14)


T T

The second integral does not depend on n and becomes arbitrarily small by
choosing T sufficiently large. It suffices thus to study the first integral. In
order to evaluate it, we separate it into two parts. For w^Owe have
i/2
<f(u) = 1 - — + o(u2).

For an e > 0 sufficiently small and \ u\ < e we have thus


2 ii*

l<K«)l < 1 -~<e 17


4
and it follows that
e/n oo

l t
V 1-7=
J l sjn
T T

which tends to zero as T —» + co, independently of n. It remains to show


that the integral
+ 00
n

dt (16)
J
n

tends to zero as n -*■ + oo. First we choose q = q(e) with 0 < q < 1, so
that | cp(t) | < q when | 11 > £ > 0.
In fact, according to Theorem 8 of Chapter VI, § 2,

lim | cp{t) | = 0.
t-*~ + CO

Since the £/t do not possess a lattice distribution, ! w(t) i ^ 1 for every
t # 0; therefore if we put

sup | 95(/) | = q
\‘\>e

we have 0 < q < 1. Then, however,

+ 00 +°° +oo
r t \ n — r _ +-
J 99 [“^J dt = J I 9>(«) \ndu < y/n qn~2 j | cp(u) |2 du. (17)
v . L
VIII, § 3] DOMAIN OF ATTRACTION OF NORMAL DISTRIBUTION 453

-r '-a- / --

Since we have already shown that j | cp(u) |2 du is finite and lim sJnqn~2'—
-oo n-+ + co

= 0, the integral (16) tends also to zero as n -> +oo. All these restric¬
tions are valid uniformly inx. (2) holds thus uniformly for — oo < x < + oo.
Theorem 1 is herewith proved.
When f(x) is not bounded but for any given k fk(x) is, (2) remains still
valid. This can be shown by a slight modification of the above proof. The
condition that fk(x) be bounded for a value of k (and, consequently, also
for every n > k) is evidently necessary for the uniform convergence of

/»(*) to

§ 3. The domain of attraction of the normal distribution

If (k = 1,2,...) are independent, identically distributed variables


and if the standard deviation D = D(ff) exists, then, according to
Theorem 1 of § 1, the random variables

c„ = E4
1 k=

satisfy the limit relation

lim P < X = <P(x) (—00 < X < + oo), (1)


«-<- + oo

where An = E(£„) and S„(Q = D yjn (n — 1,2,...). Now we have to con¬


sider, whether the existence of D(£k) is necessary for the validity of (1) with
suitably chosen sequences {A„} and {S„}. In the present section we show
that the existence of the standard deviation D(£k) can be replaced by the
weaker assumption (2).
We define the domain of attraction of the normal distribution as the set
of distribution functions F(x), possessing the following property:
If £l5 ... are independent random vaiiables with the common
distribution function F(x), then(l) is fulfilled for suitably chosen sequences
of numbers {A„} and {S„}.
In the present section we shall determine the domain of attraction of
the normal distribution and we shall prove a theorem due to P. Levy,
W. Feller and A. J. Khintchine:

Theorem 1. Let £a, £2, • • •> ••• be independent, identically distributed


random variables with a common distribution function F{x). If for F(.\) the
454 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 3

limit relation

lim + =0 (2)
•^+0° J xflF(x)
-y
holds, then (1) is valid for every suitably chosen sequence of numbers (A„}
and {£„}.

Notes
1. Condition (2) is not only sufficient but also necessary for the validity
of (1). But this will not be proved here.
2. If the standard deviation of the random variables £k exists, i.e. if
+ OO

j* x2dF(x) is finite, (2) is evidently true; this follows immediately from the
— 00

inequality
r[^(-T)+0 -^00)] < J x2dF(x).
\*\>y
Thus we can see that Theorem 1 of the present section comprises Theorem 1
of § 1.

3. If the standard deviation D and the expectation M of £k exist and if


M = 0, then in (1) An — 0 and Sn = D Jn. Conversely, if (1) holds with
A„ = 0 and S„ = Djn, the standard deviation of £k exists and is equal to D.
In fact, in this case Theorem 3 of Chapter VI, § 4 permits to state the follow¬
ing: If 9ft) is the characteristic function of the random variables tu, we
have
l t 1" -I*
lim qi -— = e 2
- + oo l D^Jn,
hence

In <P I-7= — In 95(0)


D^/'n ,
lim 12
/!-*- + 00

+ CO _(_ 00

and from this follows that D2 = J x2dF(x). Thus if j x2dF(x) does not

exist, the sequence of numbers {S„} for which (1) holds”cannot have the
order of magnitude ^fn. (Clearly, by the proof of Theorem 1, Sn tends to
infinity faster than jn.)
VIII, § 3] DOMAIN OF ATTRACTION OF NORMAL DISTRIBUTION 455

4. As an example of a distribution for which the standard deviation does


not exist though (2) holds, we mention the distribution with density function

-to- for | x | > 1,


fix) = c|
0 otherwise.
Now we give the

Proof of Theorem 1.

Lemma. If we have

< a < 1 for y > y0 > 0, (3)


| x2dF{x)
+ 00
then J | x | dF(x) exists.
— 00

If Y > y > y0 > 0, we may write:

y y
j xdF{x) = >’(1 — F(yf) — T(1 — F(Y)) + j (1 — F(x)) dx

and
-y
-y
f xdF(x) = YF(- Y) - yF{— y) - f F(x)dx,
J -Y
hence

j | x | dF{x) < >’( 1 — F(y) + T(— y)) + j (1 — F(x) + F( x))dx, (4)


y<.\x\<? y

and by (3)
Y +x

j
y<,(x\<,Y
| x | dF(x) <y{ 1 - F(y) + F(- y)) + a J j t2dF(t)
dx.

thus

j IXIdF{x) < y(l - F(y) + F(- y)) + a j \t\dF(t) +


y<\t\<,Y

+y

_|- t2 dF(t).
y
-y
456 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 3

If we subtract now from both sides a f


y<,V\^r
| 11 dF(t) and divide by 1 - a,
we get
4- y
.<**
J | x | dF(x) <
2a
(1 -oc)y
t2dF(t). (5)
y<.\x\<,Y -y

Since the right hand side of (5) does not depend on Y, we conclude that
+c
J | x |dF(x) exists; the lemma is herewith proved.
— CO

In what follows, we assume for sake of simplicity that F(x) is continuous


and symmetrical with respect to the origin, so that F(-x) = 1 - F(x).
In the general case the proof runs similarly, but the calculations are some¬
what more complicated. Furthermore, we may assume F(x) < 1 for every
x < + oo.
+ op
The existence of M = j xdF(x) follows from the lemma proved above.
— 00

Because of the symmetry of F(x), we have


«-»
J xdF(x) = 0. (6)
— 00

(In the general case we consider the random variable K = - MI We


put k

(7)
j' x2dF(x)
o
Then by assumption lim 5(y) = o. Put further

<5 (y)
A(y) =
(1 -F(y))- (8)
It follows from

A(y) = y >
l
(9)
0 - F(y)) | dF{x) (1 - F(y)) F(y) - ~
that
lim A(y) = +oo. (10)
y-* + oo

By assumption F(x) is continuous, hence A(y) is also continuous for y >


~ > ‘ Let C" be for n - /7o the least positive number > y0 such that

^(C„) = n2.
(11)
VIII, § 3] DOMAIN OF ATTRACTION OF NORMAL DISTRIBUTION 457

Evidently, C„ -> + oo and

n{ 1 - F(C„)) = . (12)
Put now
+ Cn
S~ = n | x2dF(x), (13)
-Cn

and let cp(t) be the characteristic function of tk. cpJt) the characteristic func¬
tion of (JSn. We have
itCn
<Pn(0 = E(e Sn ) = <P 1 r (14)
sj.
However, we have
+ Cn
itx

<P = 1+ (e Sn — 1) dF(x) + J (e Sn — 1) dF(x) (15)


-cn \x\ >Cn
and
itx
(eSn — 1) clF(x) < 4(1 - F{Cn)) = 4\//<l(C^ . (16)

\x\>Cn

By the lemma of § 1
+ Cn
itx
t*
(eSn - 1 )dF(x) = - — + Rn (17)
-c„
holds with
+ Cn
r l3l*l3 , c, 1113 y^c„) (18)
I < dF(x)
' 653 6n 2n
-Cn

Relations (14) through (18) lead to

2n+—n-)

with a 9n which remains in absolute value below a bound not depending


on n. Since lim ^/<S(C„) = 0, we get
n-<- + oo

lim (pn{t) — e % (19)


H-- + 00

which implies the statement of Theorem 1.


458 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 4

As regards the question whether other distributions than the normal also
have a domain of attraction, the following example shows that this is pos¬
sible. Let £1; £2,. . . . . be completely independent random variables
possessing a common stable distribution of order a(0 < a < 2) and charac-

teristic function then the characteristic function of Z €k is


n k=1
exactly
Thus any stable distribution has a domain of attraction which contains
at least the distribution itself. The domain of attraction of a stable distribu¬
tion with 0 < a < 2 is very narrow compared with that of the normal dis¬
tribution; it contains only distributions very similar to the stable distribution
considered. As regards the determination of the domain of attraction in the
case of a stable distribution with 0 < a < 2, we refer to the book of
Gnedenko and Kolmogorov [1].

§ 4. Convergence to the Poisson distribution

We have already proved in Chapter III, § 12 that the binomial distribution


of order n and parameter p tends, as n —*■ + oo, to the Poisson distribution
with the expectation 2, if p tends to zero in such a way that np —>■ X. This is
a particular case of the following, more general theorem:

Theorem 1. Let £nl, £„2,. . ., £nkn be independent random variables assuming


nonnegative integral values only. Put

P<£nk = r)=pnk(r) (r = 0,1,..k = 2,..kn ;n = 1,2,...) (1)

and
OO

Rnk = Z Pnk(r).
r=2
(2)

If the following conditions

kn
I'm Y, Pnk( 1) = A, (A)
n-*- + oo k=l

lim max (1 — pnk (0)) = 0, (B)


«-*- + oo 1 <Jk<kn

kn

lim Z Rnk = 0, (C)


n-* + oo k=1
VIII, § 4] CONVERGENCE TO THE POISSON DISTRIBUTION 459

are satisfied, then the distribution function of

In — £«1 + £n2 + • • • + £nk„ (3)


tends, as n —> + oo, to the distribution function of the Poisson distribution
with expectation X.

Proof. Let gnk{z) denote the generating function of the random variable
Lk-
CO
Gnkifi) = X Pnk(r) Zr (|Z|<1). (4)
r=0
Clearly

\9nk(z)-P„k(0)-Pnk(l)z\<Rnk fOT \z\<\. (5)


Since
^(0)-(l -Pnki\)) = -Rnk, (6)

we can write

| gnk(z) - 1 -pnk(l)(z- 1)| < 2Rnk. (7)

The identity (38) of § 1 implies, since | gnk{z) \ < 1 and | 1 + p„k(l)(z - 1) | <
< 1,
kn
ri 9nk{f) - n 0 + Pnfif) (Z 1)) < 2 X Rnk- (8)
k=1 fc=l

max pnk( 1) < max (1 - pnk(0)) < — ,


l<k<k„ l<k<kn

which is because of (B) fulfilled for n > n0, then identity (38) of § 1 leads to

<
ll 0 +Pnk<fi)(z - 1)) - ri eXP - !))
k=1 k-t

< max (1 - pnk(0)) | Z - 1 |2 ri Pnki\)- (9)


\<k<kn k=\

It follows now by our assumptions from (8) and (9) that

lim [] 9nk{z) = eA(z 1}. (10)


4- m /r = 1

Since fj g„k(z) is the generating function of rjn and e*z 1} that of the Poisson
k=1
460 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 5

distribution {Xke~x/k\}, our theorem is proved in view of Theorem 4 of


Chapter III, § 15.
When kn = n and the random variables £,nk take on only the values 0
X X
and 1 with the probabilities P(£nk = 0) = 1-and P(£nk = 1) = — .
n n
it

then conditions (A), (B), (C) are evidently fulfilled. In this case rj„ = Y, £nk
k=1
has a binomial distribution:

X
PiVn =j) =
n
"i-j-r
n

Thus we have as a particular case of Theorem 1: ■

n X V X \ n~j Xke~x
lim 1 -
n— + oo \j n k\

Theorem 1 is therefore a generalization of the convergence of the binomial


X
distribution of order n and parameter p = — to the Poisson distribution
n
of parameter X, already dealt with in Chapter III.

§ 5. The central limit theorem for samples from a finite population

The statement of the central limit theorem is valid for certain sequences
of weakly dependent random variables. In the present and in the following
two sections we prove some results in this direction. These results have
practical importance, too, since in the applications the independence is
often only approximately true. The following theorem1 refers to samples
taken from a finite population, a situation very often encountered in prac¬
tice.

Theorem: 1. Let aN1, aN2, . . ., aN>N be any real numbers (N — 1,2,...),

*=1
aN,k (1)
and
N
Mn\2
I
k=\
aN,k ~
AJ• (2)

• Hajek [2], where it is shown that


Theorem 1 is essentially best possible.
VIII, § 51 CENTRAL LIMIT THEOREM FOR FINITE POPULATION 461

Let further n = n(N) < N be a positive integer-valued function of N and put

n
Dfi/.n — D N Jv
1 -
N
(3)

From the numbers aN1, aN 2,. . aN N we randomly choose n numbers in


N\
such a way that all combinations have the same probability. Let the
n '
random variable Ca,« denote the sum of the so chosen aN k and put

Put further

dN,n (e)

If the condition
lim dN'„(e) = 0 (6)
N— +oo

is satisfied for any e > 0, then we have for — oo < a < + oo

lim P
Ca,« < X = <P(x). (7)
N-*- -f 00 D N,n

Proof. Condition(6) implies that n —>• + go asiV —► + co. Indeed it follows


from (6) that there exists for every e > 0 a number N0 such that for N > N0

the inequality dNfie) < y holds; but then we have

1 M,N D N,n ^ 2
< < Ne2 <m .
z aN,k n2
T n2
UN
aN.k-
Mn i , „
-N - <,eDN,n
N L)N

Hence, for N > N0 we have n = n(N) > and since e > 0 can be

chosen arbitrarily small, we get

lim n(N) = +oo. (8)


N— + oo

We may assume
Mn = 0. (9)
462 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 5

In fact, if (9) is not fulfilled, consider instead of the numbers aN k the numbers

aN,k = aN,k--; for these (9) is clearly fulfilled and if Theorem 1 holds

for a'Nk, it remains valid for aNk too.


Furthermore, we can assume

N
i; (io)

the random variables (Nn and ~CNN-n have indeed the same distribution
N
and if n > — , we may take instead of n the number N - n.

We compute now the characteristic function <pN>n(t) of

= j-ny I exp [it(aNJi + aNJt + ... + aNJn)\, (11)

where the summation is to be extended over all combinations of order n of


the numbers 1, 2,. . ., N. In order to prove Theorem 1, it suffices to establish
the relation

lim <Pn,u (12)


N-*- -f oo
We put

(13)
and

A"(l - X)N~n. (14)

By using the fact that

1 for k = 0,
(15)
0 for k = ± 1, + 2,...
— 71

we obtain easily the relation

+ 7E

1 r N
<PN,n(0= ~2nBNn(X) j O [(1 + (16)
VIII, § 5] CENTRAL LIMIT THEOREM FOR FINITE POPULATION 463

Indeed if we calculate the value of the expression behind the sign of integra¬
tion, by taking in the product (N - m) times the first and m times the sec¬
ond term, we obtain a term multiplied by the factor e'(rn~ri)<f; such a term
vanishes therefore when the integration is carried out provided that m # n.
If N —► + oo and JV — w -> + oo, which is certainly fulfilled in our case because
of (8) and (10), it follows from Stirling’s formula1 that

1
(17)

Hence if we introduce a new variable of integration «A - cp JnX{\ - A)


t
and if we replace t by ——, we obtain
DN,n

N
1
<PN,n
D N,n ^/in
— J
n QkW* 0
ft-1
(18)

where we put
•A taNk
QtM, t) = (l-X) exp — iX + +
Jnx{\-x) &N,n

>A taN,k
(19)
+ X exp /(I - X) +
JNX{\ - X) DN,n

l
According to the lemma of § 1 we have

v2 X(l — X)
(1 _ X) e~Uv + XeKl~X)v = 1 - + Rx (20a)

with
X(l-X)\v\
R,\< (20b)

and
(1 - X) e~iXv + XeK1-X)0 =l+R2 (21a)

with
X(\-X)v2
\R*\< - —» (21b)

1 If A is fixed, (17) follows directly from the Moivre-Laplace theorem. In cur case
X depends on N, hence the latter theorem cannot be applied. But Stirling’s fprmula
leads easily to (17).
464 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 5

On the other hand, an easy calculation shows that

- COS V) <

< 1 — 2(1 — A) (1 — cos v). (22


\
Now let e > 0 be given and suppose | ip | < 2e JNX( 1 -• 2). If k
A: is an index

such that -———- <e, then (20) implies


DN.n

2(1 - 2) f 1p 2
taN,k
e'ki'l', 0= 1 - + (1 + 9^) (23)
2 l ViV2(l - 2) DN,n

k and for |«AI <


< 2ev/ 7V2(1 - X),
2(1 - 2) <A ta[/v,fcj2
'ftfc(iM)- i I ^ + (24)
,JNX{ 1-2) AvJ
W,n) '

From this follows for | \j, | < 2e x/iV2(l - 2) the relation


N
>2 + t2
FI Qk (*P, 0 - exp (1 + 92 e) 0 + In), (25)
k= 1

where | 021 < Q and

I 9n I ^ C2 *N,n —|+c22(1-2)/N (26)


with

‘N ~ 1, (27)
\aN,k\> «^5-

Q and C2 being positive constants.


We have already
t2
X —-d
e2 2(1—2) N’n (28)

it follows from (26), (27) and (28), because of (16), that

lim rjN = 0.
N-^ + oo
(29)

For | ip | > 2e ^2(1 - 2) the following estimates can be used: For


i aN,k I IFI /Av> ^ £ the trivial inequality | Qk(ip, t) | < 1; for
vm, § 5] CENTRAL LIMIT THEOREM FOR FINITE POPULATION 465

aN k | | 11/DN>n < e the inequality

I 01 — 1 — A(1 — 2)(1 — cos e), (30)

which can be derived from (22).


Thus we obtain
N n-in
/—r I 2(1 — cos e)
k=1
n e*GM) difr < 2n-J NX 1-t— - (31)

2e< •, - <*
VNX(l-X)

Since
N-1n
( 2(1 — cos e) «(1 — COS £)
J'NX 1 < yjn exp + XlN\ ,

the right hand side of (31) tends to zero as N -» + oo because of (28) and
because
/— (( tt(l-cosa)
t/%( 1 r»r\ c «A

lim y/n exp — ----


W-* + oo l ^

From (18), (25), and (31) we obtain, since e can be taken arbitrarily small
T 00
ta
2
lim cpKn =e e 2 dif/ = e (32)
N-+ + 00
D N,n! V2

which concludes the proof of our theorem.


A case of particular importance occurs when M of the numbers aNk are
equal to 1 and N - M are equal to zero. Then

M' ‘N-M
m n —m
(33)
P(&N,n = m) =
N\
n

ie Cm has a hypergeometric distribution. Furthermore, MN — M, DN —


' ' ___ M(N - M) n(N - n)
= JM(N - M)/N. Condition (6) is satisfied for --►

—> + oo as A —» + oc; in effect, for every e > 0, dN n{£) = 0 as soon as N


(depending on e) is sufficiently large.

If m < — an(j n < — which can be assumed without restriction of gen-


2 2
466 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 6

erality, the above condition is equivalent to

nM
+ 00. (34)
If
.... M .
When — is constant or remains above a positive bound, this means that n

nM N
must tend to + oo with N. From-> + oo it follows, because of M < —
N 2
, N ,
and n < —, that N —► + oo, « -> + co, and M —► + oo. Theorem 1 con¬

tains thus as a particular case

Theorem 2. If N, M andn are positive integers, 1 < M < — , l < n < —

further if we put p and k=f~, then


N N

(M N-M
k n—k
lim X = *(x). (35)
NpX~ + oo k<np + xlf np(l-p)(l — X)

A particular case of this theorem, when p = — = constant, was derived

by S. N. Bernstein.
M
Note further that if p — is constant and n increases more slowly than

h2
N (it, for instance-> 0), (35) can be derived from the Moivre-Laplace

theorem by approximating the terms of the hypergeometric distribution by


those of the binomial distribution (see Chapter II, § 12, Exercise 18). How¬
ever, the general case cannot be treated in this way.
Theorem 2 can also be proved directly by merely considering the asymp¬
totic behaviour of the terms of the hypergeometric distribution, but this
procedure leads to tiresome calculations.

§ 6. Generalization of the central limit theorem through the application of


mixing theorems

A sequence ijlf rj2, . of random variables possessing a limit


distribution, i.e. such that
VIII, §6] APPLICATION OF MIXING THEOREMS 467

lim P(qn < x) = F(x)


n-* + oo

holds at every continuity point x of the distribution function F(x), is said


to be mixing, if for any event B with positive probability the relation

lim P(ri„ <x\B) = F(x) (1)


n->- + oo

holds at every continuity point x of F(x). We prove now

Theorem 1. Let £lt £a, be independent random variables and


put

c, = kt=l t*
Assume that there exist two sequences {C„} and {S',,} with Sn -*■ + oo and a
distribution function F(x) such that at every continuity point of F(x) the dis¬
tribution function of
in-Cn
hn =-c-

tends to F(x):
lim P(rj„ < x) = F(x).
«-*- + 00

Then the sequence of the random variables t)n is mixing.

Proof. We shall use the following lemma due to H. Cramdr1

Lemma 1. Let 6n and sn (n = 1,2,...) be two sequences of random va¬


riables; assume that the sequence 8n has a limit distribution with distribution
function F(x), that is, at every continuity point of F(x) we have

lim P(8,t < x) = F(x). (2)


n->- + co

Assume further that lim st e„ = 0. Then


«-<-+ 00

lim P(dn + e„ < x) = F(x) (3)


n-*- + oo

at every continuity point x of F{x).

Proof. We have for an arbitrary S > 0

P(dn + e„ < x) = P(\en | > S) P(dn + en<x\\en\>5) +

+ P(On + en < x\\en\ < 5). (4)

1 Cf. H. Cramer [3].


468 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 6

By assumption
lim />( | e„ | > 5) = 0.
n-*- + oo

From 9n + en < x and | sn [ < 8 follows 0n < x + 5; from 9n < x — S


and | e„ j < <5 follows 9n + en < x. Hence we can conclude from (4)

F(x - 8) -P(\en\>8) < P(9n + £„ < x| |£„ | < 8) < F(x + 8), (5)

F(x - 8) < Jim P(9n + £„ < x) < lim P(9n + en < x) < F(x + ,5). (6)
n~* + oo n~* + co

Since x is by assumption a continuity point of F(x) and 8 > 0 may be


taken arbitrarily small, (3) is proved.
Let now x be a continuity point of F(x) and suppose F(x) > 0. Then by
assumption we can find an n0 such that P{r\n < x) > 0 for n > n0. Put
Ao = Q and denote by Ak the event ?/„o+* < x(k = 1,2,...). ThenP(AJ >
> 0 and, by assumption,

lim P(An | A0) = lim P(A„) = T(x) > 0.


n-*~ + oo n-*- -p oo

By Theorem 1 of Chapter VII, § 10 it suffices to prove the relations

lim P(An | Ak) = F(x) (k = 1,2,...). (7)


n-*- + oo

Apply Lemma 1 with 9n = rjn and en = - =_u


. This can be done, since
s.
the hypotheses of the lemma are fulfilled. We find

£n0 + k
lim P
«-*- + 00
*ln ~ < X = F(X).
(8)
Since
£nu + k Cn C, n0 + k n
tin ~

does not depend on r/no+k, we have

lim pL -%*_<* M = F(x).


(9)

If we apply Lemma 1 again, to the random variables


9„n = yi _ £"» + fc „ feC,0 +
~ 'In A > £„ = —--
VIII, § 6] APPLICATION OF MIXING THEOREMS 469

on the probability space [Q, P(A \ A f)], we get already (7):

lim P(Y]n <x\Ak) = lim P(A„ | Ak) = F(x).


n-++ oo w-»- + oo

The conditions of Theorem 1 of Chapter VII, § 10 are thus satisfied and for
every B with P{B) > 0 the relation

lim P(r]n < x | B) = F(pc) (10)


n-*- + oo

holds. The theorem is therefore proved for every x such that F(x) > 0.
If x is a continuity point of F(x) such that F(x) = 0, we have

lim P{r\n < x) = 0


W—*- + 00

and, if P(B) > 0,

lim P{r]n <x\B)< lim — = 0.

Theorem 1 is herewith completely proved.


Theorem 1 can also be formulated as follows: If the random variables
lk are independent, if Sn -► + oo asn-^ + oo, further if the random vari¬
ables

k=l

nn=

possess a limit distribution, then rj„ is in the limit independent of ary ran¬
dom variable 9 in the following sense: For every y such that P{9 < > 0
the relation

lim P(r]n < x, 0 < y) = lim P(rjn < x)P{9 < y) (11)
n-*- + oo n-*~ + oo

holds at every continuity point of the limit distribution of rjn.


The following is an interesting corollary of Theorem 1:

Theorem 2. Suppose that the random variables £i, • • •, £n, • • • are inde-
n

pendent and put C„ = E U- V there exist tw0 sequences Cn and Sn


k=1
(n = 1,2,...) with lim Sn = + 00 fulfilling the relation
+ QO

lim P(r\n <x) = F(x),


«-*- + 00
470 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 6

where qn and F(x) is a nondegenerate distribution function, then

t]n cannot converge in probability to a random variable rim.

Proof. Assume that there exists such a variable with 0.


If we apply the lemma with 9n — and e„ = - rjn, we find

PQlao < x) = F(x). (12)

On the other hand. Theorem 1 allows to state the following: If * is a conti¬


nuity point of F(x) with 0 < F(x) < 1 (F(x) being nondegenerate; such a
point always exists) and if B denotes the event < x, we have

lim P(qn <x\B) = F{x) = P'(B). (13)


7!^+ 00

If we apply the lemma to the random variables 6„ — rj„ and sn = qm — q„


on the probability space [£?, isF, P(A | B)], we get

P(q00<x\B)=P(B\B) = P(B),

i.e. F(x) — P(B) = 1, which contradicts our assumption that 0 < F(x) < 1.
Hence Theorem 2 is proved.
Naturally, it follows from Theorem 2 that, under the conditions of the
theorem, the limit of rjn cannot exist almost everywhere. Still more is true:
the probability of the existence of the limit lim qn is equal to zero.
71—► + 00

The set C of the elements to £ Q for which lim qn{w) = qm(od) exists is
n-*- + oo
obviously measurable. Suppose we have P(C) > 0, then qn would con¬
verge on the probability space [12, P(A | C)], with probability 1
and therefore also in probability, which contradicts Theorem 2.
We now prove a lemma.

Lemma 2. Let [£>, ^F, P] be a probability space


and Q(A) (A £ -aF) a second
probability measure on the o-algebra cF, absolutely continuous with respect
to P. If the sequence of sets A„^^F is mixing on [Q, <aF, P] with the density d,
then
lim Q(An) = d. (14)
71 “► -f- 00

Proof. According to the Radon-Nikodym theorem there exists a mea¬


surable function x{od) such that for every d

Q(A) = J X(a>)dP.
A
VIII, § 7] SUMS OF A RANDOM NUMBER OF RANDOM VARIABLES 471

If %(a>) is a step function, (14) is clearly fulfilled. According to the definition


of the Lebesgue integral there can always be found a step function xfs0)
such that
J I Xi<o) - Xi (g>) I dP < £.
Si

Hence (14) is always fulfilled.1


Lemma 2 allows to rephrase Theorem 1 in the following stronger form:

Theorem 3. If <^l5 £2, ore independent random variables on the


probability space [Q, P] and if the hypotheses of Theorem 1 are satisfied,
then for every probability measure Q absolutely continuous with respect to P
the relation
lim 0(rjn < x) = F(x) (15)
71—► + CO

holds at every continuity point x of F(x).


Theorem 3 allows to extend limit theorems to sequences of weakly depen¬
dent random variables. Assume indeed that the random variables £k are
not independent with respect to the probability measure Q, but let there
exist a second probability measure P such that Q is absolutely continuous
with respect to P while £k are independent with respect to P. If one of the
theorems about the limit distributions can be applied to [A, P],
Theorem 3 guarantees its applicability to [£?, Q\ as well.2

§ 7. The central limit theorem for sums of a random number of random


variables

In the present section h, f2,. denote independent, identically


distributed random variables with zero expectation and unit variance.
Hence by Theorem 1 of § 1 for the random variables
n

C„ = ^=- (i)
Vn
the relation
lim P(Cn < x) = <P(x) (2)
n— + oo

1 The sequence of events {A„} is also mixing with respect to [Q, cA, Q] since
Q*(A) = Q(A | B) is also absolutely continuous with respect to P. Hence by Lemma 2
lim Q(A„ | B) = d.
rt-* + oo
2 Cf. P. Revesz [1].
472 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 7

is valid. Let now v„ (n — 1, 2,. . .) be a sequence of random variables which


assume only positive integer values and which are supposed to obey the
relation
v„ + oo (3)

i.e. to fulfill for any N > 0

lim P(yH > N) = 1.


n-*- + oo
We want to find conditions under which the random variables are in
the limit normally distributed. It is easy to prove

Theorem 1. If the random variables v„ (n = 1,2,...) are independent of


the random variables f, and if (1), (2) and (3) are fulfilled,
then
lim P(CV„ < a) = (P(x). (4)
rt-*- + o0

Proof. Put

Cnk=P(vn = k) (n, k — 1,2,. ..). (5)

The matrix (Cnk) possesses the following properties:

cnk> o (n,k= 1,2,...), (6a)


iiM8

(«= 1,2,...), (6b)


II
?T*

lim Cnk =0 (k = 1, 2,...). (6c)


«-*- + 00

(6c) is a consequence of (3); (6a) and (6b) express the fact that
cni, cni, • ■ ; Cnk,. . . is a probability distribution. The three conditions (6)
can be expressed bv saying that (C„*) is a permanent Toeplitz matrix. A
theorem1 known from the theory of series permits to conclude that, if

lim Sn — S
n-+ -f oo
then

)im ktcnksk
=
w—*- + oo
= s.
1
Now
00

p(Cv„ < X) - £ P(fk <x,vn = k). (7)


&=i v '

]Cf. K. Knopp [1].


VIII, § 7] SUMS OF A RANDOM NUMBER OF RANDOM VARIABLES 473
t.

Since vn does not depend on £k, we get

00

^(Cv„<*) = I CnkP(£k<x). (8)


/c = l

From (2) and the above-mentioned theorem from the theory of series we
obtain (4) and Theorem 1 is proved.
The situation is somewhat more complicated if we do not suppose that
v„ is independent of the variables £k. In this case a stronger condition than
(3) must be imposed upon v„. As an example we prove now a theorem
which is a particular case of Anscombe’s theorem.1 The reasoning is inspired
by W. Doeblin.

Theorem 2. If (2) is fulfilled and if

where c is a positive constant, then (4) is valid.

Proof. Put
n

nYi Zk and Xn = [nc]. (10)


k=1
Then, because of (9),

(H)
K
Furthermore
I j hv„ hxn
Cv„
(12)

Now we need a simple lemma.

Lemma. If the sequence of random variables On has a limit distribution


and if yn A 1, then the sequence Onyn has the same limit distribution as the
sequence 9n.

Proof. As 6ny„ = On + 9n(yn - 1), it follows that for every N > 0


474 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 7

Let <5 > 0 be arbitrary. Choose N and nx so that for n>nx the inequality
P(| 9n | > N) < 8 should hold. Choose n2 > n2 such that for n > n2 the

inequality P In < 5 should be valid. So P(| (yn — 1) | > s) <

< 2<5 for n > n2. Consequently, 6n (y„ - 1)L0 and the present lemma
follows from Lemma 1 of § 6.
According to these two lemmas it suffices for the proof of Theorem 2 to
show that
9Vfl P Q
(13)

Let e > 0 and 5 > 0 be arbitrary; choose nx such that

P( | v„ — Xn | > <5e2 A„) < d for n > nv (14)


Clearly

*/y„ ~_nxn nk -
>£ = Y.P > £, vn = k\ (15)
JK k=l

and because of (14) we obtain the inequality

< <5 + P nk - VXn


> E max > E (16)
A \k~~ A

Now Kolmogorov’s inequality (Chapter VII, §6, Theorem 1) implies

max
<h - ’h,
> E < 2(5. (17)
| k — A/j| < £^<5A/i
A
Inequalities (16) and (17) prove (13).
Finally, as an application of Theorem 1 of § 6 we prove a theorem in
which v„ fulfills a condition of other type than (9).

Theorem 3. Let a. be a positive discrete random variable; suppose that


(2) holds and assume further that

= [«a], (18)
where [x] denotes the integral part of x. Under these conditions (4) is valid.

Proof. Let ak (k — 1, 2,. . .) be the values taken on by a with positive


probability and Ak the event a — ak\ we have

-P((v„ < *) — nak] < X I ^k)^(^k)- (19)


k=1
VIII, § 8] LIMIT DISTRIBUTIONS FOR MARKOV CHAINS 475

When P(Ak) is positive, then because of ak > 0, (2) and Theorem 1 of § 6


we get the relation

lim P(C[nak] <x\Ak) = <P(x). (20)


n-+ -f oc

Hence for every fixed m


m m

lim Z f ({,„,) < * Mi)P(dk) = <Hx) £ P(Ak). (21)


n— + oo A:=l fc=l

As there can be found for every e > 0 an m such that

f P(Ak)<e, (22)
k=m +1

(4) follows from (21) and (22); thus Theorem 3 is proved.


We can deduce from Theorem 3 the following more general theorem:

Theorem 4. Let a. be a positive discrete random variable -, suppose (2) and

A a. (23)
n
Then (4) is valid.

The proof1 rests upon Theorem 3 and uses the same method as the proof
of Theorem 2.

§ 8, Limit distributions for Markov chains

In the present section we shall deal with an important class of sequences


of dependent random variables: Markov chains. A sequence of random va¬
riables Cn (n = 0, 1,. . •) is called a Markov chain if the following conditions
are fulfilled; for every n (n = 0, 1,. . .) and for every system of real numbers
;c0, ., xn+1 one has with probability 1

P(Jsn +1 ^ +1 I Co = *0, Cl = Ab • • • J Ch = X/i) P(.Cn +1 <- -*n + l I Cn ■*«)• (0

(The conditional probabilities figuring in (1) are defined according to § 2


of Chapter V.) In the present section we shall deal only with Markov chains
{(„} such that the values of („ belong to a denumerable set ; without

1 Cf. A. Renyi [31]; later J. Mogyorodi [1], further J. R. Blum, L. Hanson and
J. Rosenblatt [1], have proved that in Theorem 4 the restriction that a should have
a discrete distribution can be omitted.
476 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 8

any essential restriction of the generality it can be assumed that is the


set of the nonnegative integers. In this particular case (1) can be written in
the following form: If n, k,j0,j\, . . .,j„ are any nonnegative integers, then

P(Cn +1 = k I Co —joi Cl — ,/l> • • •> Cn —jn) = P(Cn + l ~ k | Cn = jn)-

Markov chains are usually interpreted as follows: Let S be a physical


system which can be in the states A0, Ax, . . ., Ak, . . .. Let the state of the
system change at random in time; consider the states of the system at the
time instants t = 0, 1,. . . and put £„ — k if at time n the system is in the
state k.
The hypothesis that the random changes of state of a system form a
Markov chain can then be expressed as follows: The past of the system can
influence its future only through its present state.
If we multiply both sides of (2a) by P(£0 = fl,. . ., £„ = j„) and add the
equations obtained for all values of j0,j\, . . ., yV-i (1 < r < n) and further
divide by P(£r = jr, . . — jn), then we obtain

P(U i = k\Cr =jr, ...,C„ =jn) = P(C„+1 = k\(n =jn). (2 b)

Similarly, we can show that for arbitrary integers 0 < n± < n2 < . . . <
< ns < n.

P(£n+1 = k\Cni =J\,..cns =js) = P(Cn+1 = k \tns =js). (2c)

The conditional probabilities P(Cn+m = k \ C„ = j) are called the m-step


transition probabilities, as they give the probability that the system passes from
the state Aj to the state Ak during the time interval (n,n + m) (m = 1,2,...),
i.e. in m steps. These quantities depend in general on the time n. If not, then
the Markov chain is said to be (time-) homogeneous. In this Section we
deal only with homogeneous Markov chains.
Thus by assumption the probability P(Cn+m = k \ (n = j) will be inde¬
pendent of n and we may put

PjP = P(Cn+m = k\C„ =j) (j, k = 0, 1, • • •)• (3)

It is reasonable to consider the numbers p(j, k = 0, 1,. . .) as the elements


of a matrix

nm = (/#>)• (4)
Instead of pty we write simply pjk and instead of n1 simply 77.
VIII, § 8] LIMIT DISTRIBUTIONS FOR MARKOV CHAINS 477

Clearly for every positive integer m and for j > 0 the relation

£ /$?> = i (5)
k=0

holds. In fact, the terms of the sum are the probabilities belonging to a com¬
plete system of events. Hence the matrix /7m, which has nonnegative terms
only, has the property that the sum of terms in each row is equal to 1. Such
matrices with nonnegative elements are called stochastic matrices. The matrix
J7m can be computed from /7 as follows. According to the theorem of com¬
plete probability (cf. Chapter III, § 2, Formula (2)) we have for 1 < r < m
00

P(jZn + m = k | C„ =j) = I P(f,, + * = * I Cn + r = l in = j)P(in + r = 1 \ in = j), (6)


/= 0

and it follows from (2c) that

=E
1=0
co
Thus we have

nm = nrnm_r (m = 2,3,...; l,2,...,m- 1). (8)

Consequently, IJm = 17 nm_1 = I72/7m_2 etc., hence

nm =nm (m = 2,3,...). (9)

The matrix of m-step transition probabilities is thus the ra-th power of the
matrix of one-step transition probabilities.
So far we have only considered transition probabilities, i.e. conditional
probabilities. In order to determine from these the probability distribution
of we must know the state of the system at the instant t = 0 or at least
the probabilities of the initial state of the system, i.e. the probability distri¬
bution P((o = k) {k = 0, 1,. ..). With the notation P(C„ = k) = Pn{k)
(n = 0, 1,. . .) one can thus write

Pn(J<)= EPoCfl/4? (10a>


y=o

or, more generally,

Prik) = f PrU)P%-r) (r = 0,1,1). (10b)


7 =0

If Co is constant, e.g. equal to j0, then P0(j0) = 1 and P0(j) = 0 for j ^ y0.
478 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 8

In this case

Pn(J<)=P(j:l (11)

As an example of a Markov chain consider a machine in a factory, which


is switched on and off as time proceeds. At any instant there are two possi¬
bilities: either the machine works (state Ax) or it stands idle (state A0). Let
pjk denote the probability that at the instant n + 1 the machine is in the
state Ak provided that at the instant n it was in the state Aj (J, k = 0, 1),
further put p01 = A,pw = p:(0 < A < 1 , 0 < ju < 1). In this case the matrix
of transition probabilities is

A-A A |
n=
n 1-Aij’

A simple calculation gives the n-step transition probabilities. Further we


derive from these

A
W) - (12a)
A + /i
and

P
p«(°)=TVv+ c1 -k - t*r p o(o) A + p.
(12b)

where P0(l) and P0(0) are the probabilities that at time 0 the machine works
and does not work, respectively. Since 0<A<l,0</i<l,we have always
| 1 - A - ^ | < 1; hence (12a) and (12b) lead to

A
lim P„(l) = lim Pn(0) = (13)
A + p A + fi ’

hence the distribution of (n tends to a limit distribution as n -> + oo.


Notice that the limit values (13) do not depend on the distribution of fn
/j,
If = TTu ’ and hence P»<°> = At • we haveP„(l) = A— and

= ^ f°r every n and not only in the limit.1

If in a Markov chain the distribution of the random variables £ tends to


a limit distribution (thus if lim Pn{k) = P(k)) which does not depend
ft-*- _l m

initiaf itJ = ll™ have ^0) = A and Pn( 0) = ^ without any assumption on the
initial state, m this case the C„-s are independent from each other!
VIII, § 8 J LIMIT DISTRIBUTIONS FOR MARKOV CHAINS 479

on the initial distribution P0(j), then the Markov chain is called ergodic.
An initial distribution such that £„ has the same distribution for every value
of n, is called a stationary distribution. If the Markov chain is ergodic and
there exists a stationary distribution, the latter is evidently the limit distri¬
bution of It is easy to show that there exists a stationary distribution,
iff the system of equations

*k = Z PjkX; (k = 0,1,...) (14)


7=o
00

admits a solution x0, xx,... with xk > 0 (k = 0, 1,., .) and £ xk = 1;


k=1
in this case the numbers xk constitute a stationary distribution. For the
example considered above Equations (14) can be written as

x0 = (1 - X) x0 + gxx,

xx = Xx0 + (1 — g) xx. (15)

The single solution such that x0 + xx = 1 is

X g
*i = -y~~— » *o = -y~— • (16)
A + fl A + fl

In this example there exists a stationary distribution and the Markov chain
is ergodic.
The following theorem, due essentially to A. A. Markov, shows that this
holds under rather general conditions.

Theorem 1. Let a system possess a finite number of possible states A0,


Ax, ,.., An. Assume that the changes of state of the system form a homo¬
geneous Markov chain and denote by p(ff the probability that the system passes
from state Ax to state Ak in m steps. Assume further that there exist integers
s > 0 and k0> 0 such that for j = 0, 1,. . N,

Pt> o, (17)
i.e. that the matrix IJS has at least one column in which all elements are posi¬
tive. In this case the chain is ergodic; the limits

lim p{$ = Pk (j, k = 0,1,..., N) (18)


n-*- + oo

exist and do not depend on j. The sequence of numbers P0,. . .,PN is the unique
480 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 8

nonnegative solution of the system of equations

P,= 1 PjPjk (k = 0,l,...,N), (19)


j=o

which satisfies
N

1Pj= I- (20)
j= 0

The limit distribution {Pf is thus a stationary distribution of the chain.

Proof. By assumption

d = min p% > 0. (21)


0<,j<.N
Put
m(kn) = min pjk\ = max pff (lc = 0,1,..., N). (22)

Clearly

Pjk+1) = I PjlPlk, (23)


1=0
hence, for 0<j^N,

m(£) = mP Y Pji ^ Z Pj,PW = P%+\ (24)


1=0 1=0
and
mffi < m{k+l). (25)
Similarly, for 0 < N,

1 p, & l pmpIP = p%*'\ (26)


1=0 1—0
hence
M(f > M[n+1\ (27)
Furthermore, by (22),

0 < rrfjfi < < 1, (28)

which implies the existence of the limits

mk= lim m(kn) and Mk = lim M<kn> (29)


n— + oo n-r + ao
VIII, § 8] LIMIT DISTRIBUTIONS FOR MARKOV CFIAINS 481

with
mk < Mk. (30)

If we can prove that mk = Mk for A: = 0, 1,. .N, then (18) will be proved.
Now for a suitable lx the equality

Ar?+'>=7>i,r> = £ (31)
j=O

a certain /0;

min+s) = piT) = £ ptOpt").


(32)
7=0
Hence

mp* - s)=f ip^-Pui)pf- (33)


7=0

Let H be the set of all j (0 < j < N) for which p\s) - p$ ^ 0 and let H
be the complementary set of H, i.e. the set of those j (0 < j < N), for which
Pu ~ P(i?J < d holds. Put
(34)
A = Y(Ph) - Pu) and b = Yl(Pij-Pu)-
j£H HH

Then A > 0 and

a+b=£ pft-i =1-1=0,


j=0 j=0

hence B = —A and it follows from (33) that

Min+S) - m(kn+s) < - m(k]) A. (35)

Two cases are now possible: either k0 £ H or k0 £ H. In the first case we


have

and, since A — —B,


A < 1 - d. (36)

In the second case we have

A < 1 -p\ 1< 1 -d,

hence (36) is valid in both cases. It follows thus from (35) and (36) that

M(f+S) - m[n+s) < (1 - d) (Mj?> - mf). (37)


482 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 8

Furthermore, because of (21)

- mf < 1 -d. (38)

By induction we conclude from (37) and (38) that

M{ks) ~ ™ks) < (1 - d)n (39)

and, because of d > 0, by passing to the limit n + oo and taking into


account (29) and (30), we get mk = Mk. This proves (18) with Pk = mk =
= Mk.
The passing to the limit in (7) and (5) shows that {Pk} fulfils Equations
(19) and (20). It remains merely to prove that the Pk (k = 0, 1, . ..,N)
are uniquely determined by (19) and (20). This can be shown as follows:
If Qo> ■ • >Qn were a distribution distinct from P0,. . ., PN and Satis-
JV
fying (19), thus if £ Qk = 1 and
k=0

Qk = Y QiPik (k = 0,l,...,N) (40)


/=0 v '

held, then after multiplication of (40) by pkt and summation over k we


would obtain

Qt = Z Qpf-
/=0

By repetition of these operations we find for every integer n

Qt = 1 Z0 QipP-
=

Because of lim pff = pt and Q, = 1 there follows Q, — P which


«^+0O /=0

was to be proven.
The numbers Pk fulfill the equations

= 1l0 P,P&
=
(41)

for eveiy n 1,2,...; Equation (19) is a particular case of (41)


If the distribution of C0 is known, P(C0 = /) = P0(/), then from (10a) one
can derive the relation

lim P„(/c) = Pk (* = 0,1, ..., N). (42)


VIII, § 8] LIMIT DISTRIBUTIONS FOR MARKOV CHAINS 483

Finally, let us mention the following particular case: assume that for the
matrix of the transition probabilities (pJk) the sum of all columns is equal
to 1:

E Pjk = 1 for k = 0, N.
j=o

The matrix 17 = (pjk) as well as its transpose II* — (pkj) are stochastic
matrices; such a matrix II is called a doubly stochastic matrix. In this case

(40) is fulfilled for Q — ——— (k = 0, 1,.. N); the solution of (19) being
N+ 1
1
unique, there follows Pk — . Thus for a doubly stochastic matrix II
N+ 1
fulfilling the conditions of Theorem 1 the relation

1
lim /#> holds for j, k = 0,1,..., N.
«-»-F oo N+ 1

It follows from (42) that the probabilities of the N + 1 states are in the limit
equal to each other, regardless of the initial distribution.
A particular class of the Markov chains is that of the so-called additive
Markov chains. If £0, g1}. . ... are independent random variables and
if we put („ = £0 + £1 + • • • + the random variables £„ (n = 0, 1,...)
form a Markov chain, since

J}(jsn +1 % | Co = • • •> C/j = %n) P(.^n +1 ^ ^-n)

= P(C„+1 < x\Cn = xn). (43)

If £k are identically distributed, the chain is homogeneous. In this case


the problem of finding the limit distribution of the chain {£„} can be reduced
to the study of sums of independent random variables, already dealt with.
If take on only integer values, if their expectation is zero and if the
greatest common divisor of the values assumed by ^ - £2 with positive
probabilities is equal to 1, then for every pair k, l of integers the relation
(cf. Chapter VI, § 9, Theorem 8)

P(Cn = k)
lim = 1 (44)
n-> + aO P{L = 0
holds, hence (n has in limit a uniform conditional distribution on the set of
all integers. (This may happen also for nonadditive chains.) Further, if the
expectation of £k is zero and their variance is equal to 1, then from
484 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 8

Theorem 1 of § 1 there follows the relation

Cn
lim P < X = *(X).
n-+ + oo \Tr‘

For homogeneous Markov chains with a finite number of states the suit¬
ably standardized sum

Vn — Co + Cl + • • • + £„

is under general conditions in the limit normally distributed. This will be


proved here only for the simplest case of a chain with two states.

Theorem 2. Let the random variables {(,} form a homogeneous Markov


chain with two states. Put £n — 0 or £„ = 1, according to whether the system
/1 _ X x \
is in state A0 or in state Ax at the instant n. Let with 0 < X <
P 1 -MJ
< 1, 0 < g < 1 be the matrix of transition probabilities. Put

I Cfc. (45)
k= 0
Then

\
Vn
2+ g
lim P < x = *(*). (46)
72-*-00 nXg{2 — X - g)
(A + gf

Remark. If X + g = 1, then the variables are independent and gn has


a binomial distribution of order n and parameter X, further in this case

Xg( 2 — X — /<)
= 2(1 - X).
a + /o3~
Hence (46) reduces to

2/7
lim P < x = 4>(x).
72->-{-00 .>>2(1-2)
Theorem 2 is thus a generalization of the Moivre-Laplace theorem.

Proof. If z„ denotes the instant when the system returns for the n-th
time to the state Ax, we have 0<t1<t2<...<t < • r __ j.
VIII, § 8] LIMIT DISTRIBUTIONS FOR MARKOV CHAINS 485

= 0 for k < or xn < k < t,,+1 (n = 1,2,.. .)• Put <5t = t,, bn — xn —
— t„_i. It is easy to see that <5„ are independent and (<5! excepted) have the
same distribution. For by the definition of Markov chains, xn — x„_x is
independent of the random variables <5l9 52, ■ .., <S„_! which depend only
on the states of the system at instants t < t„_x. The fact that the random
variables 8n (n - 2, 3, . . .) are identically distributed follows from the ho¬
mogeneity of the Markov chain. Clearly for every n> 2

P(5„=l)=l-n
and
P(5„ = k) = pX( 1 - X)k~2 for k > 2.

Hence, for n > 2


X /L
E{dn) =
X
and
ju(2 - X - n)
D\i5„)
X2

If £0= l, sx has the same distribution as the other 5„ (n > 2); if Co = we


have P{by — 1) = X, P(8i — k) — (1 X)k xX for k > 2 hence E(5
1 0 1 - X

By Theorem 1 of § 1 and the lemma of § 6 follows

( X+
x*~k^r
lim P < x *(x). (47)
k-* + oo Jkn{2 -X-fi)
X

Now obviously P(r]n < k) — P{xk > ri)\ in fact rj„ < k means that up
to the moment n the system was less than k times in the state Ay, thus its
k-th entrance into the state Ax occurs after the moment n, hence xk > n and
conversely. If we put

nXfi( 2 — X — ju)
—(X + Ixf~ ’

a simple calculation gives that

k{X + /i) x^/k^(2 — X - ix)


n= + 0(1),
X
486 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 9

hence

f X \ l X + \x
V n~ *k - ■—« —k
X + /i X 1
P < x =P >- x + O (48)
nXn(2 - X — /x) yjkpi2 - X - /.l) A
v v a+nf

Thus (47) leads to

(
In ~ n
X + fx
lim P — < x = 1 - 0(—x) = d>(x). (49)
+00 nXn{2 — X - n)
/
(* + A*)3
Theorem 2 is thus proved.

§ 9. Limit distributions for ‘ order statistics”

The theory of order statistics” (i.e. the theory of observations arranged


according to their magnitude) is becoming more and more important in
mathematical statistics.
In the present section we shall show how the theory of order statistics
can be reduced to the study of certain Markov chains.1 We start from the
following particular case: Let C1; f* ..f„ denote the outcomes of n inde¬
pendent observations of a quantity having an exponential distribution;
■ ' ':: are thus indePendent random variables with the same exponen¬
tial distribution:

F(x) -P(Ck < x)= \ - e Xx for x > 0(X > 0).

70^^ f°U°Wing pr°perty of the exponential distribution: for any x and

F(£ >-x + tIC > y) = P(C > x). (!)

Property (1) is characteristic of the exponential distribution. In fact it is


equivalent to ’ s
C/(x + y) = G(x) G(y)

if T(x) is the distribution function of £ and G(x) = 1 - F(x) • we know al


ready (cf. Chapter III, § 13) that the only nonincreasing Solutions of (%

1 For the method see A. Renyi [9], [10],


VIII, § 9] LIMIT DISTRIBUTIONS FOR “ORDER STATISTICS” 487

the trivial solutions G(x) = 0 and G(x) = 1 excepted, are the functions of
the form G(x) = exp ( — Ax) with A > 0.
The meaning of (1) becomes particularly clear if we interpret £ as the dura¬
tion of an event which takes a certain lapse of time to occur. In this case (1)
expresses the fact that the future duration of an event which is still in course
at a moment y does not depend on the time passed already since the begin¬
ning of this event.
Arrange the random variables Ci, £2, ...,£„ in increasing order and let

Ct = Rk{M
be the A-th of the ranked variables Cj- Then1

Ci < CJ < • • • < CJ-


It is easy to determine the distribution of the £*.
If the Cj are interpreted as durations of independent events beginning at
the same time, then £* is the duration of that event which is the k-th that
ends. We compute now the distribution of the differences £*+1 ~ £*• Clearly

P{Ct+i- C* > •* IC* = t) = p(C*+i > * + y\C* = y)- (3)


From the n — k events still in course at the moment y none must cease
before the moment x + y; by (1) the probability of this is

[P(£ > x)]n~k = e^n~k)Xx.

The conditional distribution function of £*+1 - Ct with respect to the


condition £* = y is thus

P(Cjf+1 - ct < XI Ct = y) = 1 - (4)


The function thus obtained does not depend on y hence it is equal to the
unconditional distribution function of (*+1 — (*. This difference has
1
thus itself an exponential distribution and its expectation is ^ ^

(k = 1, 2,...,«- 1). C* also has an exponential distribution, with expectation

. If we put Co — 0 and
nX

&k+i — (n — k) (C*+i ~ C*) {k — 0,1,..., n — 1), (5)

then <5/c+1 {k = 0, 1,.. ., n — 1) have all the same exponential distribution

1 F(x) being continuous, the probability that two of the £, are equal is zero; this
possibility can thus be omitted.
488 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 9

with expectation —. It is easy to see that the 5k are independent. In fact


A

the conditional probability

n«+i- c* < *ict=>■„..ct - ci-i=a) (6)


does not depend on y2,. . ., yk; since the conditions are equivalent to the
relations (* = yx + ... + yj (j = 1,2,..., k), they give thus for j — 1,2,
. . k the moment when the y-th of the events starting at t — 0 ends. This
means that at the moment t = yx + y2 + .. . + yk there are exactly n — k
events in course. The probability that between t and t + x at least one event
comes to an end is 1 — exp[ — (n — k)Xx\. This does not depend on the va¬
riables yx,.. yk; hence the random variables <5l5 S2,. . dn are indepen¬
dent. Thus C* may be written in the form

Ct=— + + ... + (7)


n n — 1 n k + 1

where <5,- (J = 1,2,..., k) are independent and possess the same distribu¬
tion. Equation (7) shows that the variables (* form an additive Markov
chain. By means of (7) the distribution of £* can be determined explicitly.
Let the preceding result now be applied to the theory of order statistics.
Let Ci> C2, •••>£« t>e independent random variables with the same contin¬
uous distribution function F(x). As above, put = Rk(£lt f2,. . .,£J,
hence £*<£*< •••<£* are the variables ^arranged in increasing order.
The theory of order statistics deals with the study of £*; f * is called the
/c-th order statistic. This study can be reduced to the case when £k are expo¬
nentially distributed and by (7) we then have to consider sums of indepen¬
dent random variables only. In order to show this, put

1
Ck= In (k = 1,2,...,n) (8)
and
Ct = 7?, (Ci,..C„) (9)
1
Since In ~f\xj 1S nonincreasing, we have

1
Ct = ^ (k = 1,2,...,«) (10)
and as are independent, the same is valid for (k.
Consider now the distribution of £k. Let y = F~\x) (0 < * < 1) be the
inverse function of x - F(y) (- oo < y < + oo). Then the relation

P(£k <x) = P(F(£k) > e-*) = p(£k > F-i (e-*))


VIII, § 9] LIMIT DISTRIBUTIONS FOR “ORDER STATISTICS’ 489

is valid, i.e.

P(Ck < x) = 1 - F^F-1 (e~x)) = \-e~x (11)

for 0 < x < +oo. Hence the random variables (k are exponentially distri¬
buted with expectation 1. Thus can be written in the form

Si <5«+l-k

cn
£* = F_1 (e-<«-n-*) = F~ 1 exp
n n— 1 k JJ . (12)

where <5t, d2,. . .,<5„ are independent random variables with expectation 1.
Our result implies the theorem of van Dantzig and Malmquist stating that
P(E* )
the ratios —v ----- (/c = 0, 1,. . ., ri) are independent of each other
F(£t)
(cf. Chapter IV, § 17, Exercise 17). Indeed we have according to (12)

exp (%l1 (k = 1,2,.. .,n), (13)


m) l * )

(We have to put here -F(£*+1) = 1.)


The random variables £f,. . .,£* form a Markov chain, since because of
(12) for xx < x2 < ... < xk < x the relation

p* = Xk)=P 'S„+l-k < k.,ln m


nt?k+1 < X = Xx •5 Ck
F(xk)
$n+l-j~

F(xJ+l) (14)
= j In l<j<k-l,£t=xk
f(xj)

is valid. Because of the independence of dj we get

P(&+ x<x\£* = xx,..., ££ = **) =

F(x)
= P\Sn+i-k<k In a=xk = P(Z*k+x<x \ek=xk). (15)
F(*k)

Let us notice that as

P(F(£k) < x)=P(Zk < F~x(x)) = F(F~'(x)) = x (16)

holds for 0 < x < 1, the random variables F(f*) are uniformly distributed
in the interval (0, 1). The random variables F{Ck) are thus the ordered ele¬
ments of a sample selected from a population uniformly distributed on
(0, 1).
490 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 9

Starting from this point of view, many results on order statistics can be
derived quite easily. As an example, consider the following problem: What
is the limit distribution of the random variables S* when both n and k tend
k
to infinity in such a way that— tends to a limit q(0 < q < 1)? In

particular we consider the case k = [nq] + 1; £*9]+1 is called the sample


quantile of order q.
We prove a theorem which implies in particular that the sample quantile
of order q is in the limit normally distributed, provided that the distribution
function F(x) fulfills certain conditions.

Theorem. Let £x, £>, •••,£„ be independent, identically distributed random


variables, with an absolutely continuous distribution function F(x). Suppose
that the density function f(x) = F'(x) is continuous and positive on the interval
[a, b). IfO < F(a) < q < F(b) < 1, and if k(n) is a sequence of integers such
that
k(n)
lim y/n
n-+ + oo n -9 = 0, (17)

further if denotes the k-th order statistic of the sample A A £


then CHn) is in the limit normally distributed, viz.

'Zm-Q x
lim P = *(x). (18)
n-*+oo D y/n.
where
Q = F~\q) (19)
and

D AQ)^ (20)
Proof. We consider first the limit distribution of

C»jj

n+l-k(n) = In (21)
F(£*
k{n ))
By (12)
n+l-A:(n)

n+l-k(n) = I
n+l-j
(22)
7= 1

where 5j are independent and exponentially distributed with density func¬


tion e x (x > 0). Hence E{dj) = D(Sj) = 1 and

E{\Sj- 11^) — J |x 1|ze~xdx< 3. (23)


o
VIII, § 9] LIMIT DISTRIBUTIONS FOR “ORDER STATISTICS” 491

It follows from (17) and from the known formula

N 1 I \
£ In N+C + O — ,
k=i k liVj

where C is Euler’s constant,1 that

M, = = In — + 0 (-1=1 . (24)
# LV*n)
Since

= J_1_
k=N1 k2 Nx N2 w)
we get
f 1
S2n = D2 (C*+1_fcW) = -5—^ + O (25)
nI ’
according to (23) and from

N• 1 1
1 +oM '
£NlJ? = JNl ~ In: TV?

it follows that

n+l-fc(n) J) _ 1 S'! 1
I ^ = O (26)
j= 1
n + 1 -y n,2 ’

IKn] 3 1
= 0 (27)
u, Jn)

Thus Liapunov’s form of the central limit theorem (Theorem 4 of § 1)


can be applied to the sums (22). Taking into account the lemmas of
Sections 6 and 7 we get

Cn+l—k(n)
X
lim P = *(x). (28)
n-* + oo

1 Cf. K. Knopp [1].


492 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 10

Now because of (21)

C« + l-k(n) — In q < x
1 -q Jn

l-q
= P tin) >F 1 k exp —X (29)
nq

The mean value theorem allows us to write

1 -q
f7-1 \q exp —x
nq

/I -q }
exp x . - 1
, \ nq
= F~\q) + (30)
JiQdn)
where lim 0n = 1; further
n-+ + oo

1
q exp —x x +0
nq I ) n n
Now (29), the continuity of/(x), and the lemmas of § 6 and § 7 imply (18),
hence the theorem is proved.
The theorem states that the empirical sample quantile of order q of a
sample of n elements is for sufficiently large n nearly normally distributed

with expectation Q = F~\q) and standard deviation —Z-^1 ~ ^ .


. AQ) V n
This fact can be used in practical applications (e.g. in quality control). In case
of a symmetric distribution, expectation and median coincide; thus in this
case the sample median can be used as an approximation of the expectation.

§ 10. Limit theorems for empirical distribution functions

In the preceding section we have seen how to determine the quantiles of


a distribution function F{x) by means of an ordered sample of independent
random variables with distribution function F{x). The empirical distribution
function of such a sample may help us to get information also about the
whole course of the distribution function F{x). Glivenko’s fundamental
VIII, § 10] EMPIRICAL DISTRIBUTION FUNCTIONS 493

theorem of mathematical statistics (Chapter VII, § 8, Theorem 1) states that


the difference between the empirical and theoretical distribution functions
tends uniformly to zero with probability 1 as the sample size tends to in¬
finity. Glivenko’s theorem, however, says nothing about the “rapidity” of
the convergence. But this information is supplied by the theorems of Smir¬
nov and Kolmogorov, which we shall state now without proofs.1
Let £l5 £2,. . ., be independent, identically distributed random variables
with the continuous distribution function F(x). As in the preceding section,
let £* denote the Ar-th order statistic. Put

0 for x< ,

k
F„(x) = < for F* < x <£*+1 (k = 1, 2,..., n — 1), (1)
n

1 for F*
*3n < x;

Fn(x) is the empirical distribution function of the sample gx,. .

Theorem 1 (Smirnov).
1 — e 2y2 for y > 0,
lim P{Jn sup (Fn (x) - F(x)) < y) -
n-* + oo — co <x< -f co
0 otherwise.

Theorem 2 (Kolmogorov).
K(y) for y > o,
lim P( Jn sup | F„ (x) - F(x) \ < y) =
/2-^ + QO — CO < A < -f- CO
0 otherwise.

where
K(y) — +f (-\fe~2k2yt. (2)
k= — oo

Notice that in these two theorems the limit distributions do not depend
on F(x). It suffices that F(x) is continuous, this guarantees the validity of
these and all further theorems in this section. The values of the function
K{y) figuring in Kolmogorov’s theorem are given in Table 8 at the end of
this book.
The theorems of Smirnov and Kolmogorov may serve to test the hypothe¬
sis that a sample of size n was drawn from a population with a given con¬
tinuous distribution function F(x).
The theorems of Kolmogorov and Smirnov refer to the maximal deviation
between Fn{x) and F(x). Often it is more convenient to consider the maximum

1 For the proof of Theorem 1 cf. § 13, Exercise 23.


494 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII. § J 0

F fix) — Fix)
for F(x) > a > 0 of the relative deviation —--. The follow-
F(x)
ing theorems are concerned with this relative deviation.1

Theorem 3. We have

Ffix) - F{x)
lim P
n-+ + oo H Xa^xK+vo
SUP ~^(x)-
x \A) <y

Vifr
x1
2
dx for y > 0,

0 otherwise,

where xa is defined by F(xa) = a, 0 < a < 1.

Theorem 4. We have

Fn (x) - F(x) 1 L for y > 0,


lim P n sup
«-+■ +00 Xa<3X< + oo Fix) <y)=\
0 otherwise,
where
n
4 oo exp - ilk + l)2
8 z2
L(z) = ~ Y (-1)* -- (3)
n o Ik + 1
and xa is defined by F{xa) = a, 0 < a < 1.

The values of the function L(z) defined by (3) are tabulated in Table 9.
We may be interested in the maximum of the relative deviation over an
interval (xa, xb), where and xb are defined by F{xa) = a and F{xb) = b
(0 < a < b < 1). This problem is solved by

Theorem 5. If 0 < a < b < 1, F(xa) = a, F{xb) = b, then the relation

r Fn(x) - F(x)
lim P fin sup -— <y =
II-*--f CO V xa<,x<Xb F(x)

0-»|/S
1 bf2

is valid.
n 1 -b
exp
2(1-6)/ J du dt

1 Cf. A. Renyi [9],


VIII, § 10] EMPIRICAL DISTRIBUTION FUNCTIONS 495

First of all, we have to note a surprising corollary of Theorem 5. It


follows from the theorem of Smirnov that

lim P{ sup (F„(x) — F(x)) < 0) = 0,


W-*- + 00 — 00 <*<-f-00

i.e. the probability, that the empirical distribution function remains every¬
where under the theoretical distribution function, tends to zero. According
to Theorem 3 the same holds if we restrict ourselves to values of x superior
to xa (a > 0). However, if we consider an interval [xa, xb] with 0 < a <
< b < 1, then by Theorem 5,

lim P( sup (F„(x) - F(x)) < 0) =


n-» + oo xa<.x<.Xb

0
-d/~
1' b-a

bt2
4vKj + 00
exp
2(1 ~b) J chi dt, (4)

i.e. the probability of the difference F„(x) - F(x) being in an interval


[xa, xb] (0 < a < b < 1) everywhere negative remains positive even in
the limit. Obviously, this result is of practical importance.
One can simplify the right hand side of (4) by a probabilistic consideration,
without calculations. In fact, the right hand side of (4) is equal to the
probability that a point (x, y) normally distributed on the plane lies in an
angular domain 0 <x< + oo, 0 < y < x, where x and y are independent
a( 1 — b)
and have the respective standard deviations and 1. Now this
b —a
probability is equal to

1 la(\-b) _ 1 a{\ — b)
arc tan arc sm (5)
271 b —a 2n b{ 1 - a)

As a matter of fact for a normal distribution symmetrical in x and y

the probability of the random point lying in an angle 90 is ; an affine

transformation leads from this to the general case. Thus we have

Theorem 6. If Q < a < b < \, F(xu) = a, F(xb) = b, then

1 I a(l — b)
lira P ( sup (F„(x) - m) < 0) = — arc sin ^ (6)
n—+ oo xa-£x<,Xb

If we take a = 0 or b = 1, we see that the right hand side of (6) is equal


to zero.
496 THE "LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 10

Another fundamental problem of mathematical statistics consists in the


decision whether two samples may originate from the same population
or not. This question occurs often in problems of medicine, biology,
economics and many other domains. Essentially, the question is whether
the deviation between the results of two experiments is or is not significant.
The following theorems can serve to decide, provided that the distribution
function of the basic population is continuous; the knowledge of the
population distribution is, however, not necessary.
Let <^l5 £2,. . . , t]u tj2,. . . ,r/n be independent random variables.
Let and rjj have continuous distribution functions F(x) and G(x) respec¬
tively, which are not necessarily known.
The problem consists of testing the hypothesis F(x) = G(x) by comparing
the empirical distributions Fn(x) and Gn(x).
The following two theorems were proved by Smirnov:

Theorem 7. If F(x) = G(x), then

1 — e 2yt for y > 0,


lim P sup (Ffx) - Gn(xj) < y
H—+ 00 -oo <x< +00 0 otherwise.

Theorem 8. If Fix) = G(x), then

\K(y) for >’> 0,


lim P sup | Fn(x) - Gn(x) | < y
f 00 — 00 < X < -f CO 10 otherwise
where K(y) is defined by (2).

The theorems of Smirnov can be derived by passage to the limit from the
following theorems due to Gnedenko and Koroljuk, which give the exact
distributions of the quantities sup (F„(x) - G„0)) and sup | F,£x) - Gn(x) |
for finite values of n.

Theorem 9. If {x} denotes the least integer ^ *, if c = {z Jin] and if


F(x) = G(x), then

n
sup (F„(x)-G„(x))<z =
— 00 <X< -f 00

0 for z< 0,

' 2/7

n — e
l- for 0 < z <
In
. n .
1 otherwise.
VIII, § 10] EMPIRICAL DISTRIBUTION FUNCTIONS 497

Theorem 10. Under the same conditions as in Theorem 9, one has

sup | Ffx) - Gn(x) | < z =


— 00 < X < + 00 )

0 for z < -L= ,


\j 2n
1 +[t] 2n
for < z <
I (-Dfe
(;)
lin\ n — kc

1 otherwise.

The values of

1 ' In \
(-\)k
2n n — kc)
. n .

are tabulated in Table 7, for n < 30; for n > 30 Theorem 8 can already
be applied.
First we prove Theorems 9 and 10; Theorems 7 and 8 can then be derived
by passing to the limit. Collect the random variables . . . , rjx,. . . , rjn
into one sequence and arrange these 2n numbers in increasing order; let
denote the A>th number in this ordered sequence. One can suppose
that C* < C* < . . . < C*2„. Put

1 if £* is one of the
9k = — 1 otherwise.

Thus in the sequence 0X, 0,,. . ., d2n, n numbers are equal to 1 and n
numbers are equal to —1. Put Sk = 0X + 02 + • • • + 9k. We prove first

Lemma. The relations


max Sk
sup (F„(x) - G„(x)) = ^k-*n ~
— 00<X<+<X>
and
max | S'* l
l<,k<2n
sup I Fn{x) - G„(x)
— 00 <X< +00
n

are valid.
498 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 10

The number n [F„(x) - (?„(*)] is the difference between the number of


the inferior to x (j = 1,...,ri) and the number of the t]/ inferior to
x(/ = 1,.. ., n). If x runs through the real numbers, the quantity
n [F„(x) — Gn(x)] changes only if x passes through one of the values
(k — 1,2,... ,2n); in this case it changes by 9k. Hence

sup n(Fn(pc) - Gn(x)) = max n(Fn(C* + °) ~ G„(C* + 0)) = max Sk


— oo<;t<+oo 1<^A:<C2« l<,k<i2n

and, similarly,
sup n I Fn(x) - Gn(x)| = max n \ F„(C* + 0) - Gn{Q + °) I = max \Sk\-
— co<x< + co l<tk<,2n l<Cik<Cl2n

The lemma is herewith proved.

Proof of Theorem 9. Clearly, the number of the possible sequences


01; 92,. .., 02n is equal to the number of the possible arrangements
of 2n elements among which n are equal to 1 and n to — 1; thus this number
12n
is . Since ,. . . , . . . , r\, are independent and identically
ln
distributed, all arrangements are equiprobable and each has probability
1
. In order to determine the probability for max Sk< z ^J2n we must
2n

find the number of the sequences 0ls. . . , 92n fulfilling this condition
2n
and then divide this number by . We arrived thus at a combinatorial
n
problem. Its solution will be facilitated by the following geometrical
representation: Assign to every sequence 9j_,... ,9„ a broken line in the
(x, y) plane starting from the point (0, 0) with the points (Sk, k) (k —
— \ ,2,..., 2n) as vertices. (Here (a, b) denotes the point with coordinates
x = a, y = b.) There corresponds thus to every sequence 9lt. . . , 92n a
“path” in the plane; all paths start from (0, 0) and end at (0, 2n); all are
composed of segments forming with the x-axis an angle either of +45°
or of —45°. We have to determine the number of those paths which do
not intersect the line x = z ^2n. Let this number be denoted by U+ (z).
If a path intersects the line x = z ■s/2n, it is clear that it reaches the line
x = {zj2n} = c, too.
Thus we have to count those paths which lie everywhere below the
line x = c. First we count the paths which intersect the line x — c.
If a path intersects the line x = c, we uniquely assign to it a path which
is identical with the original one up to the first intersection with the line
x = c and from this point on is the reflection of the original path with
VIII, § 10] EMPIRICAL DISTRIBUTION FUNCTIONS 499

respect to the line x = c. The new path ends at the point (2c, 2ri). By this
procedure, we assign to every path going from (0, 0) to (0, 2n) and inter¬
secting the line x = c in a one-to-one manner a path which goes from
(0, 0) to (2c, 2n) and is composed of segments which again form an angle
of +45° with the x-axis. The number of paths having one or more points
in common with the line x = c is thus equal to the total number of the
, . 2n
paths going from (0, 0) to (2c, 2ri). This number is ^ . Hence

' 2/z '


u; (z) = P"l_
UJ n - c
Because of the lemma, Theorem 9 is herewith proved.

Proof of Theorem 10. We use a similar argument. The number of paths


going from (0, 0) to (0, 2/z) and having no point in common either with
x = z J2n or with x = —z Jin is equal to the number of paths going
from (0, 0) to (0, 2/z) and having no point in common with the lines x = ±c.
Let this number be denoted by Un(z).
Let N+ and AC denote the number of paths intersecting x = c and
x — — c, respectively. Let AC- (and AC+) denote the number of the paths
which after intersecting x = c (and x ——c) intersect also x= - c (and x=c),
respectively, etc. Let N0 denote the number of the paths which do not in¬
tersect either x = c or x = — c. There can be shown as in Chapter II (§ 3,
Theorem 9) that
N0 = N - N+ - AC + N+ _ + AC + - N+ _ + — AC + _ +- (7)

We know that N+ ; by reasons of symmetry we have AC = N+.

We calculate now N+ _ (which is equal to AC +). Let us take the reflection to


the line x = c of the section of the path which follows the first common
point of the path with the line x = c, then let us take another reflection
to the line x = 3c of the section of the new path which follows the first
common point of the path with the line x = 2c; we obtain thus a path which
goes from (0, 0) to (4c, 2/z). Conversely, there corresponds to every such path
exactly one of the original paths intersecting first the line x c, then the
line x = -c. Hence N+_ (and AC+ too) is equal to the number of
sequences which consist of n + 2c elements equal to 1 and ot n — 2c ele-

ments equal to — 1, viz. to Similarly, we obtain


n + 2c

2n ' 2/z '


N,
n + kc n — kc
500 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § II

where el9. . . , ek is a sequence of alternating signs + and — beginning


with + or with —. By (7) and by the lemma, Theorem 10 is thus proved.
In order to derive Theorem 7 from Theorem 9 it suffices to remark that
for c = {z -Jin) Stirling’s formula gives

2n
n + c
lim
«-»• + 00
2n)
n)

Theorem 8 can be derived from Theorem 10 in a similar way.

§11. Limit distributions concerning random walk problems

In this section we shall study limit theorems of another type than those
encountered so far. As we do not strive at the greatest possible generality
but rather wish to present the different types of limit distributions, we shall
restrict ourselves mainly to the simplest case, i.e. to the case of the one-
dimensional random walk (classical ruin problem). We shall find in the
study of this simple problem a lot of surprising laws which contribute to
a better understanding of the nature of chance. Theorems 1 and 2 are
concerned with the problem of random walk in zz-space.
Let the random variables g2,. . . , £n,... be independent and let
1
each of them assume the values +1 and -1 with probability .The
2
random variable

k=l
(1)
may be considered as a gambler’s gain in a game of coin tossing after n
tosses, provided that the stake is 1 unit of money. The value of which
is always an integer, can also be interpreted as the abscissa at the time
t — n of a point moving in a random manner on the real axis. This point
performs on the real axis a “random walk”, in the sense that it moves
during the time intervals (0, 1), (1, 2),.. . either one unit step to the right

or one unit step to the left, both with probability — . We shall deal with

the laws of this random walk.


Consider first a generalization of the problem to several dimensions.
Let Gr denote the set of the points of the r-dimensional Euclidean space
which have integer coordinates, i.e. the set of points of the “/•-dimensional
VIII, § 11] RANDOM WALK PROBLEMS 501

lattice”. Imagine a point which moves “at random” over this lattice. We
understand by a “random walk” the following: If the moving point can
be found at a time t = n at a certain lattice point, then the probability
that at the time t = n + 1 it can be found at one of the adjacent points

of the lattice is equal to —— for all adjacent points which have r — 1 coor¬

dinates equal to those of the preceding point and one coordinate differing
by ±1. If the position of the point at the time t — n is given by the vector
C(„r), then the random vectors £(nr) (n = 0, 1,.. .) form a homogeneous
additive Markov chain, namely

C!,r) = tir) + t
k=i

where the random vector If# represents the displacement of the point
during the time interval (fc — 1 ,k); by assumption, the random vectors
& are independent and identically distributed. For r = 1 we obtain the
one-dimensional random walk problem discussed above; in this case we
write simply £„ and £k instead of and
We prove now first a famous theorem of G. Polya.1

Theorem 1. The probability that a point performing a random walk over


the lattice Gr returns infinitely often to its initial position is equal to one
for r = 1 and r = 2 and is equal to zero for r > 3.
*

Proof. Without the restriction of generality we can assume that at the


time t — 0 the moving point is found at the origin of the coordinate system.
Let denote the probability that at the time t = n the moving point
is again at the origin. The moving point returns to the origin after performing
in the direction of each of the axes exactly as many steps to the “right”
as to the “left”. Hence P$+\ = 0 and

(2 n)\
Pti = ‘In
(2 r) «! + ... + «, =n(nl\.,.nr\y

2 n] n\
(2 r)
in
n
z
nl+...+nr=n nf-.. .nr\
(2)

In particular

1 Cf. G. Polya [2].


502 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 11

In}2

p(2) _ l n -
2n /|2« >

In
n\
p(3) _
2« — '2 n E k\ H(n — k — l)\
k+l<ji

By applying Stirling’s formula we obtain

1
(3a)
V nn
and
1
p(2) ~ (3b)
-r2n ~
7T«

We give now an estimation of P$ for r > 3. We know that

«!
E = r\
n1+...+nr = n

On the other hand, it is easy to see that among the polynomial coefficients
the largest are those in which the numbers nl5 n2,. . ., nr differ at most
by +1 from each other (cf. Chapter III, § 18, Exercise 3). Hence

2n

pW _ n
r2n ~ E <
(2r) 2 n nl+... + nr = n

'2 n (3c)
n n\
< max = o r_
(4 r)n r ' n^. ... n<\
Unj—n n* >
1

On the other hand, it can be proved that P$ can be represented by the


following integral:
[2n\
n
pW _ __
2n /-.-V-l x2„
(27T/-1 (2 rf

X J ... j [r + 2 Y, cos (fii ~ 0/) + 2 cos 0,-]" d61... dOr_v


— 71 —7t ■ -
i=l
VIII, § 11] RANDOM WALK PROBLEMS 503

Hence we derive the asymptotic relation

(4)

From (3a) and (3b) it follows that

X = + oo for r = 1 and r — 2,
n—1

and from (3c) that

X ?2n < + oo for r > 3.


n=1

In the latter case the Borel-Cantelli lemma permits to state that for
r j> 3 the moving point returns with probability 1 at most finitely many
times to its initial position.
For r = 1 and r = 2 we shall show that with probability 1 the
moving point will sooner or later (and therefore infinitely often) return
to its initial position. In order to prove this, consider the time interval
which passes until the first return of the moving point. Let Q(nr) denote
the probability that the point walking at random on the r-dimensional
lattice reaches its initial position for the time after n steps. Obviously,
n-1
p&>=ea+ i pnosi-a- (5)
k=i
Put
00
Gr(x) = X p2hk (6)
k=i
and
00
Hr(x) - X (7)
k=l
then from (5)
Gr(x) = Hr(x) + Gr(x)Hr(x), (8)
hence
Gr(x) (9a)
1 +G,(x)
and
Hr(x)
(9b)
W-l-H#)-
504 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 11

Clearly,
Gr(x)
ew= E Qtt = W) = lim r-^M , (10)
k=l jt-1-0 1 “r

where Q(r) denotes the probability that the moving point returns at least
+ 00
once to the origin. For r = 1 and r = 2 the series Z P$ is divergent, hence
k=l
g(r) = 1, while for r > 3

Z p®
k=1
Q(r) =
1 + I P&
k=l

hence 0 < 0(r) < 1. (E.g., for r = 3, g(3) a; 0.35.) Thus we have proved
Theorem 1; at the same time we have obtained

Theorem 2. For r > 3 a point performing random walk over the lattice
Gr has a probability less than 1 to return to its original position.

It can be shown in a similar manner that for r = 1 and r — 2 the moving


point passes infinitely many times through every point of the lattice with
probability 1, while this is not true for r > 3.
In what follows we shall deal with the case r = 1 only. First we give
the explicit form of the probability Q$. The generator function (6) is here

i 2k\
f 1\
UJ
4^ -**=
00
I
2
(- *)* = i;
/c = l \ k / x/l-
this and (9a) lead to
( 1 \

Hx{x) = T^TY = \-J\-x = £ (-l)*-1**.


1 + Gfx) v k=i\k J
Hence
(2k-2
k— 1 1
G® = h. p2fc-T ~ F• (11)
2^/n k 2

A simple calculation shows that

0$ = n>_2-P$. (12)
VIII, § 11] RANDOM WALK PROBLEMS 505

Let vx be the number of the steps in which the moving point first returns
to its initial position; hence vx is a random variable and P(v1 — 2k) = Q$.
It follows from the asymptotic behaviour of the sequence that the
expectation of vx is infinite. Let (p{t) be the characteristic function of vx:

<P(0 = 1 - V1 - elu,
hence
f/|" ,-
lim (p ^2 = exp (-7-2 it). (13)
n-* + co

But we have
+ 00

1
exp ixt-
1 2xi
exp(- <J- lit) = —= _3 dx, (14)
v/27I ./
v 0 X2

hence exp( — x/ — lit) is the characteristic function of the distribution with


the density function

e 2x
—-3~ for x > 0,
fix) = y/2n x2

0 otherwise.

Because of the identity


X
r _l
e 2u
_ 3
du = 2 1 - <P (15)
J v/27r u2
0

(where <P(x) is the distribution function of the normal distribution), we


obtain

Theorem 3. If vl5 v2,.. •, v„ ,.. . denotes the moments when the moving
point performing a random walk on the line returns to its initial position,
i.e. when (,,k = 0, then for x > 0 the relation

lim P < X = 2 1 - <P (16)


k—1-00
W x)

is valid.
THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § II
506

Proof. The random variables vx, v2 — vl5. .., vk — vfc_x,.. . are


evidently independent and identically distributed. Further
vk — Vj + (v2 — vx) + ... + (v* —

Hence (16) follows from (13), (14) and (15).


Remark. The distribution figuring in Theorem 3, with the characteristic
function exp(- Jlit), is a stable distribution corresponding to the

parameter a = ; hence the distribution belongs to the domain

of attraction of this stable distribution.


Theorem 3 can also be formulated in a different way. Let 9n denote
the number of the zeros in the sequence £i then
P(vk<n)=P(9n>k). (17)
From this follows

Theorem 4.
|2<P(y) — 1 for y > 0,
lim P < y (18)
00
|0 otherwise;
9„
hence has in limit the same distribution as the absolute value of a random

variable with the standard normal distribution.


Now we shall investigate the number of positive and negative terms
in the sequence Ci, • • •, C«- If Cj = 0 but C/-i > 0, then Cj is counted as
positive; if Cj = 0 but Cj-1 < then it is counted as negative. Let n„
denote the number of the positive terms (in the above-mentioned sense)
of the sequence £i, C2> • • • > Cn ■ We prove

Theorem 5. For every positive integer n the relation1

I2k\ 2 n — 2k |
k , n—k )
P(n2n = 2k) = 2/1
(k = 0,1,...,«) (19)
22

holds.
Remark. Clearly, n2n cannot be an odd number; in fact, n2 is either 0
or 2, according as Ci = 1 or Ci = — 1. Similarly, n2n - 7r2„_2 is either Oor 2
since C2n-2 is an even number: if Can-2 ^ 0, we have necessarily Can-2 ^ 2,
hence n2n — 7r2„_2 = 2; if Can-2 < 0> we have necessarily Can-2 ^ —2,
hence n2n — n2n_2 = 0; if Can-2 = 0, then n2n - n2n_2 is equal to 0 or
to 2.
(2k\
1 = 1 for k = 0 .
VIII, § 11] RANDOM WALK PROBLEMS 507

We need the following

Lemma 1. For every integer n > 1 the relation

1 " (2k 2 n — 2k
= 1 (20)
22n l k n — k

holds.
Remark. Relation (20) is a corollary of (19);
bilities P(n2n - 2k) for k = 0, 1, . . . , n, we c
n2„ is always even. But since we wish to use (20) for the proof of (19), we
have to prove (20) directly.

Proof. As we have seen, for I x I < 1

'2k)

= 1 XT. (21)
JT k=0

Let us take the square of both sides of (21); since on the left side we get
1 00
—-= £ xk, (20) is obtained by comparing the coefficients of xn on
1 ~x k=0
both sides.
Now we prove (19) by induction. Clearly, (19) is true for n — 1; in effect

1
P(n2 = 0) = P(u2 = 2) =

Suppose that (19) is valid for n < N and let denote the least index j
for which £/ = 0; vx is necessarily an even number. Furthermore
N
P(n2N = 2k) = Yp(nw = 2k, vx = 21) + P(n2N = 2k, vj > 2N).
i=i

But
P(n2N = 2k, vx = 21) =

— p(n2N = 2k, vx 21, Ci = + 1) + P(^2N — 2k, Vi = 21, Ci — - 1)>

on the one hand

p(n2N — 2k, vx = 21, Ci = + 1) = ~2~p(n2(N-i) 2(k — /))P(vi = 21),


THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 11
508

and on the other hand

P(n2N — 2k, vx = 21, Ci = - 1) = ~2 P(n2(N-i) - 2k)P(v1 - 21).

According to (12)
'21-2
i-K
P(Vl = 21) = Q® =
22/"2

which leads to the recursion formula

P(n2N = 2k) = P(n2N = 2k, v1 > 2N) +

f (2/— 21 (21) \
1 N /— 1 J l
+ 22/-2 \P(n2(N-i) — 2k) +P(ji2(N-r>— 2(k — /))]. (22)
z 1=1 \ 22/ /

The probability P(n2n = 2k, vl > 2 N) is evidently zero for 0 < k < N.
If k = 0 or k = N,
[2 N\
N
P(n2N = 2N, > 2N) = P (n2N = 0,v1> 2N) = ±2N+1 (23)

Now if the relation (19) is true for n = 1, 2, . . . ,N — 1, then it follows


by some simple calculations from (20) and (22) that it is also true for
n = N. Thus Theorem 5 is proved.
This theorem implies the so-called arc sine law:

Theorem 6.

nN
lim P < x = — arc sin ^Jx for 0 < x < 1. (24)
N-*- + oo N

Proof. According to (3a)

2k
k 1
(25)
22" sfi~k '

For 0 < x < y < 1 we obtain from (19)

1 [«•>’] 1 1
D I <<
(26)
P\X-~2n<y 71 fc=[/!A-]+l k fi k n
n n
VIII, § II] RANDOM WALK PROBLEMS 509

hence
dt
^ 712” ^
lim P
n-*-+ oo
x-^<y

— — (arc sin Jy — arc sin J x ). (27)


n
7^2n +1 *
Now since n2n < n2n+1 < n2n + 1, the limit distribution of -^—ry com‘

TTo
cides with that of —— which proves Theorem 6.
2n
This theorem can be proved in a more elegant way, which, however,
requires more powerful tools. This rests upon the following generalization
of Lemma 1:

Lemma 2. We have

1 " (2k ' 2 n — 2k


_ eit(2k-n) = pn (C0S ^ (28)
22 n y
£—i
k n—k
k=0

where Pn(x) denotes the n-th Legendre polynomial:

1 dn , 2 nn

Proof. We see that the left side of (28) is the coefficient of xn in the
power series expansion of

_1_ 1
^(1 - e“ x)(l -e-“x) Jl~2x cos t + x2

On the other hand we know that1

1 = y Pn (cos t) xn (29)
yj 1 — 2x COS t + X2 n=0

where P„(z) is the «-th Legendre polynomial. By comparing coefficients


we obtain (28).
Theorem 6 can be derived from (28) as follows: We have
it
n2n 2 p
E exp it =e n
2n

1 Cf. G. Polya and G. Szego [1], Vol. II, p. 291.


510 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 11

By the Laplace formula1

P„ (x) — — f (a: + i cos cp yf\ — x2)" dcp


o
we obtain

it cos 99]
lim e* Pn cos dtp =
n-*- + °o l 2 J

i f1 e'V(1+u)/2 1 \ eitxdx

* J
-1 0

which imphes the statement of Theorem 6 in view of Theorem 3 in Chapter


VI, § 4.
Lemma 2 permits an easy calculation of the moments of tz2„ . For reasons
of symmetry E(n2n) = n. The standard deviation can be derived from (28):

D\n2n)^P'n{ 1) =

hence
'n(n + 1)
D(.K2n) = (30)

Remark. Theorem 6 expresses an interesting paradoxical fact. The


2
derivative of the distribution function F(x) = — arc sin Jx, i.e.
n v 5 * •

F\x) = - ‘
ti^Jx{\ - a:)

is namely symmetrical with respect to the point * = ~ and has a minimum

at this point. Consequently, this value is the least probable one for the

random variable ^ : the probability that the value of is in the neigh¬

bourhood of a point x (0 < x < 1) is the greater the farther the number

x is away from One would expect rather the contrary: indeedit would

1 Cf. G. Polya and G. Szego [1], Vol. II, p. 291.


VIII, § 11] RANDOM WALK PROBLEMS 511

seem quite natural that the moving point would pass approximately half
of its time on the positive and the other half on the negative semiaxis.
However, Theorem 6 shows that this is not the case. Or, to put it in the
terms of coin tossing: One would consider as the most probable that both

players are leading during nearly of the whole time. But this is not so;

on the contrary, — is the least probable value for the fraction of time

during which one of the players is leading. However, a little reflexion


shows this to be quite natural; indeed £„ varies quite slowly, because of
£n+1 — £„ = + 1; if reaches for a certain n a large positive value,
Cn+k will remain for a long time positive and a similar reasoning holds

for large negative values too.


Theorem 6 is due to P. Levy.1
The theorem can be generalized. It was proved by Erdos and Kac2 that
if £ls (^2, • • • are independent random variables with E(£k) = 0, D(£k) = 1

which satisfy the condition of Lindeberg, further if we put („ = Z


k=i
and if nN denotes the number of positive terms in the sequence (i, C2? • • • > Cn
then nN fulfils (24). Sparre-Andersen3 proved that if £2, • • • > £n are
independent random variables with the same symmetrical and continuous
distribution, then
2k) 2 n — 2k
n — k
P(Ttn = k) (31)
yin

In this case (24) is valid even if the variance does not exist.
We now determine the exact distribution of 9n (the number of the zeros
in the sequence (i, (2,..., C„) for even values of n. We prove first

Theorem 7. For every positive integer n

2k 2n — k
P(d2n = k) = (/< = 0, 1,..., 2n) (32)
n
holds.

Proof. If vx denotes the least positive integer for which = 0, then

P(Q0n = k) = £ P(d2n = k,v1 = 2r) + P(d2n = k,V!> 2n).


r=1

1 Cf. P. Levy [2].


2 Cf. P. Erdos and M. Kac [2],
3 Cf. E. Sparre-Andersen [1].
512 THE LIMIT THEOREMS OF PROBABILITY THEORY [VIII, § 11

From this we derive (32) by the method used in the proof of Theorem 5.
We obtain from (32) by Stirling’s formula

2n
E(6„) (33)
n

Theorem 6 may also be formulated as follows: Put

1 for C*>0 or for Ck = 0 and Ck-i = 1>


= sgn U =
— 1 otherwise.

Then for —1 < x < + 1,

£1 + + • • • + £« 1 +X
lim P < x\ = — arc sin (34)
n-* + co I 71

i i • £i T Go T . . . -f- £„
Consequently, the ratio -- does not tend to zero as
n
n -> +oo, though this would seem to be quite “plausible”. However the
following theorem due to Erdos and Hunt1 is valid:

Theorem 8.

y e* \
h k
lim —tt——
N
= 0 - 1. (35)
N— + oo
I — /
A:=1 k

Proof. Clearly $E(\varepsilon_k) = 0$, $E(\varepsilon_k^2) = 1$. We determine $E(\varepsilon_n \varepsilon_m)$ $(m > n)$:

$$E(\varepsilon_n \varepsilon_m) = \sum_{k} P(\zeta_n = k)\bigl(2P(\zeta_m > 0 \mid \zeta_n = k) - 1\bigr) \le 4 \sum_{k=1}^{\infty} P(\zeta_n = k)\, P(-k < \zeta_{m-n} \le 0).$$

The greatest term of the binomial distribution of order n and parameter
$\frac{1}{2}$ is the central term, asymptotically equal to $\sqrt{\dfrac{2}{\pi n}}$; from this we conclude that

$$P(-k < \zeta_{m-n} \le 0) \le \frac{C_1 k}{\sqrt{m-n}},$$

1 Cf. P. Erdos and G. A. Hunt [1].

hence

$$|E(\varepsilon_n \varepsilon_m)| \le C_2 \sqrt{\frac{n}{m-n}}; \tag{36a}$$

here and in what follows $C_1, C_2, \ldots$ are positive constants. If $m - n < n$,
we use instead of (36a) the trivial inequality

$$|E(\varepsilon_n \varepsilon_m)| \le 1. \tag{36b}$$

Thus we obtain

$$E\left(\Bigl(\sum_{n=1}^{N} \frac{\varepsilon_n}{n}\Bigr)^2\right) \le C_3 \ln N. \tag{37}$$

If we put

$$A_N = \frac{\displaystyle\sum_{n=1}^{N} \frac{\varepsilon_n}{n}}{\displaystyle\sum_{n=1}^{N} \frac{1}{n}}, \tag{38}$$

we find

$$E(A_N) = 0 \quad \text{and} \quad E(A_N^2) \le \frac{C_4}{\ln N}.$$

Hence, by applying Chebyshev's inequality,

$$P(|A_N| > \varepsilon) \le \frac{C_4}{\varepsilon^2 \ln N}. \tag{39}$$

From this we obtain that the series $\sum_{k=1}^{\infty} P\bigl(|A_{2^{k^3}}| > \varepsilon\bigr)$ converges for every
$\varepsilon > 0$. Thus by the Borel-Cantelli lemma the inequality $|A_{2^{k^3}}| < \varepsilon$ is
satisfied with probability 1 for all sufficiently large k. But for $2^{k^3} \le n \le 2^{(k+1)^3}$ we have

$$|A_n| \le |A_{2^{k^3}}| + \frac{C_5}{k}.$$

Hence the inequality $|A_n| < 2\varepsilon$ is fulfilled with probability 1 for all sufficiently
large n. Since $\varepsilon > 0$ can be chosen arbitrarily small, Theorem 8 is proved.
In conclusion we mention some theorems concerning the largest fluctuations
of the one-dimensional random walk.

Theorem 9.

$$\lim_{n \to +\infty} P\left(\max_{1 \le k \le n} \zeta_k < x\sqrt{n}\right) = \begin{cases} 2\Phi(x) - 1 & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{40}$$

Theorem 10.

$$\lim_{n \to +\infty} P\left(\max_{1 \le k \le n} |\zeta_k| < x\sqrt{n}\right) = \begin{cases} \dfrac{4}{\pi} \displaystyle\sum_{k=0}^{\infty} \frac{(-1)^k}{2k+1} \exp\left(-\frac{(2k+1)^2 \pi^2}{8x^2}\right) & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{41}$$

Theorem 9 can be derived from the following formula (cf. Chapter III,
§ 18, Exercise 19):1

$$P\left(\max_{1 \le k \le n} \zeta_k < m\right) = 1 - \frac{1}{2^m} \sum_{0 \le 2k \le n-m} \frac{m}{m+2k}\binom{m+2k}{k}\frac{1}{4^k}. \tag{42}$$

It was shown by Erdos and Kac2 that Theorems 9 and 10 can be amply
generalized. They can be extended to the sums of independent, identically
distributed random variables.3
It is interesting to compare Theorems 9 and 10 with the results of the
preceding section. Those results can be put in the following form:

Theorem 11.

$$\lim_{n \to +\infty} P\left(\max_{1 \le k \le 2n} \zeta_k < x\sqrt{2n} \,\Big|\, \zeta_{2n} = 0\right) = \begin{cases} 1 - e^{-2x^2} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{43}$$

Theorem 12.

$$\lim_{n \to +\infty} P\left(\max_{1 \le k \le 2n} |\zeta_k| < x\sqrt{2n} \,\Big|\, \zeta_{2n} = 0\right) = \begin{cases} \displaystyle\sum_{k=-\infty}^{+\infty} (-1)^k e^{-2k^2 x^2} & \text{for } x > 0, \\ 0 & \text{otherwise.} \end{cases} \tag{44}$$

Theorems 11 and 12 describe the properties of paths which after 2n
steps return to the origin. One expects that under this condition the path
does not deviate as far from its origin as in the general case. Indeed the
expectation of the distribution (43) is

$$\int_{0}^{+\infty} 4x^2 e^{-2x^2}\,dx = \frac{\sqrt{2\pi}}{4} = 0.627,$$

while that of the distribution (40) is 0.798.

1 For another proof of Theorem 9 see Chapter VII, § 16, Exercise 13.e.
2 Cf. P. Erdos and M. Kac [1].
3 For the extension to random variables which are not identically distributed
see A. Renyi [9].

§ 12. Proof of the limit theorems by the operator method

The most important limit theorems of probability calculus can also be


proved without the use of characteristic functions, by a direct method.
This method will be presented here. It is to be noted that this method, if
applicable, is simpler than the method of characteristic functions;
but it does not replace the latter. As a matter of fact, the method of charac¬
teristic functions has a far wider range of applications and there are many
limit theorems which can only be proved by means of characteristic
functions, or at least their proof by any other method becomes very
complicated indeed.
The method to be dealt with in the present section can be called the
operator method, since it uses certain functional operators. For the sake
of simplicity we shall introduce the method first by proving Liapunov’s
theorem; then we shall pass to the proof by this method of the more
general Lindeberg theorem. Finally, we shall prove a theorem about the
convergence to the Poisson distributions (§ 4, Theorem 1) and the theorem
of § 3.
We recall some definitions and notations. Let C3 be the set of all uniformly
continuous and bounded real-valued functions defined on the real number
axis which are three times differentiable while their first three derivatives
are also uniformly continuous and bounded on the whole number axis.
If $f = f(x) \in C_3$, put
$$\|f\| = \sup_{x} |f(x)|. \tag{1}$$

The number $\|f\|$ is called the norm of the function $f = f(x)$. Clearly, if
$f \in C_3$ and $g \in C_3$, then $f + g \in C_3$; further, if $f \in C_3$ and $a$ is a real number,
then $af \in C_3$. It is easy to see that if $f \in C_3$ and $g \in C_3$, then $\|f + g\| \le \|f\| + \|g\|$,
and if $f \in C_3$ and $a$ is a real number, then $\|af\| = |a| \cdot \|f\|$.
An operator $A$ which assigns to every function $f \in C_3$ an
element $g = g(x) = Af$ of $C_3$ is called a linear operator if it possesses the
following properties:

1) If $f \in C_3$ and $g \in C_3$, then $A(f + g) = Af + Ag$.

2) If $f \in C_3$ and $a$ is a real number, then $A(af) = a \cdot Af$.

3) There exists a number $K > 0$ such that for any function $f \in C_3$ the
inequality $\|Af\| \le K \cdot \|f\|$ holds.

If 3) is fulfilled for K = 1, the operator A is called a contraction operator.


If $A$ and $B$ are two operators, we define the operator $A + B$ by $(A + B)f = Af + Bf$.
The product of two operators is defined by the consecutive
application of the operators, i.e. by $(AB)f = A(Bf)$. The multiplication of
operators is associative, but it is usually not commutative. The addition
of operators is obviously both associative and commutative. For the
multiplication and addition of operators the distributive law is valid:

$$A(B + C) = AB + AC.$$

If $A$ is an operator and $\alpha$ a real number, we understand by $\alpha A$ the
operator defined by $(\alpha A)f = \alpha \cdot Af$. Clearly, if $\alpha$ and $\beta$ are real numbers
and $A$ and $B$ operators, then $\alpha(A + B) = \alpha A + \alpha B$, $\alpha(\beta A) = (\alpha\beta)A$ and
$(\alpha + \beta)A = \alpha A + \beta A$; further $A + (-A) = O$ and $0 \cdot A = O$, where $O$
is the zero operator which assigns to every function $f \in C_3$ the function
identically equal to 0. To the reader acquainted with the elements of
functional analysis all this is of course familiar.

Lemma 1. Let $F(x)$ be an arbitrary distribution function; then the linear
operator $A_F$ defined by

$$A_F f = \int_{-\infty}^{+\infty} f(x + y)\,dF(y) \tag{2}$$

is a contraction operator.

Proof. It is easy to see that if $f \in C_3$ then $A_F f \in C_3$. Clearly $A_F$ fulfils
conditions 1) and 2) in the definition of linear operators; further

$$\|A_F f\| \le \|f\| \int_{-\infty}^{+\infty} dF(y) = \|f\|,$$

hence $A_F$ is indeed a contraction operator.


The operator $A_F$ is called the operator associated with the distribution
function $F$.
If $A$ and $B$ are two operators such that for every $f \in C_3$, $ABf = BAf$,
then the operators $A$ and $B$ are said to be commutative.

Lemma 2. Let $F(x)$ and $G(x)$ be any two distribution functions. The operators
$A_F$ and $A_G$ associated with them are commutative and $A_F A_G = A_H$, where
$H = H(x)$ is the convolution of the distribution functions $F(x)$ and $G(x)$, i.e.

$$H(x) = \int_{-\infty}^{+\infty} F(x - y)\,dG(y).$$

Proof. Clearly

$$A_F A_G f = \int_{-\infty}^{+\infty}\left(\int_{-\infty}^{+\infty} f(x + y + z)\,dG(z)\right) dF(y) = \int_{-\infty}^{+\infty} f(x + u)\,dH(u).$$
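Lemmas 1 and 2 can be checked numerically. The following sketch (an illustrative addition, not part of the original text; the grid, the two normal distributions and all function names are arbitrary choices) discretizes the associated operator and verifies that composing the operators of two distributions agrees with the operator of their convolution, and that the operator is a contraction.

```python
import numpy as np

# Grid approximation of the associated operator A_F f(x) = integral of f(x + y) dF(y).
y = np.linspace(-15.0, 15.0, 3001)
dy = y[1] - y[0]
npdf = lambda t, m, s: np.exp(-((t - m) / s) ** 2 / 2) / (s * np.sqrt(2 * np.pi))

dF = npdf(y, 0.0, 1.0) * dy                 # F = N(0, 1)
dG = npdf(y, 0.5, 2.0) * dy                 # G = N(0.5, 2)
dH = npdf(y, 0.5, np.sqrt(5.0)) * dy        # H = F * G = N(0.5, sqrt(1 + 4))

def A(f, weights):
    """Return (A f)(x) for scalar x, approximating the integral by a Riemann sum."""
    return lambda x: float(np.dot(f(x + y), weights))

f = np.cos                                  # a bounded smooth test function from C_3
AGf = np.vectorize(A(f, dG))                # A_G f, usable on arrays of points

for x in (-1.0, 0.0, 2.0):
    print(x, A(AGf, dF)(x), A(f, dH)(x))    # (A_F A_G f)(x) = (A_H f)(x)  (Lemma 2)

# Contraction property of Lemma 1: sup |A_F f| cannot exceed sup |f| = 1.
print(all(abs(A(f, dF)(x)) <= 1.0 + 1e-9 for x in np.linspace(-5, 5, 101)))
```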

Lemma 3. Let $A$ be a contraction operator and $B$ an arbitrary operator;
then
$$\|ABf\| \le \|Bf\|.$$

Proof. The statement follows from the definition of the contraction
operator.

Lemma 4. If $U_1, U_2, \ldots, U_n$ and $V_1, V_2, \ldots, V_n$ are operators associated
with probability distributions and if $f \in C_3$, then

$$\|U_1 U_2 \cdots U_n f - V_1 V_2 \cdots V_n f\| \le \sum_{k=1}^{n} \|U_k f - V_k f\|. \tag{3}$$

Proof. Clearly, for arbitrary linear operators we have the identity

$$U_1 U_2 \cdots U_n - V_1 V_2 \cdots V_n = \sum_{k=1}^{n} U_1 U_2 \cdots U_{k-1}(U_k - V_k)V_{k+1} \cdots V_n. \tag{4}$$

By Lemmas 1, 2 and 3 we get immediately (3).


Now we can begin the proof of the central limit theorem under the
Liapunov conditions. Instead of Theorem 2 of § 1 we prove the following
somewhat more general theorem:

Theorem 1. Let $\xi_{n1}, \xi_{n2}, \ldots, \xi_{nn}$ be independent random variables with
finite variances $D_{nk}^2 = D^2(\xi_{nk})$ and third absolute central moments
$K_{nk}^3 = E\bigl(|\xi_{nk} - E(\xi_{nk})|^3\bigr)$. Put $\zeta_n = \sum_{k=1}^{n} \xi_{nk}$ and

$$S_n = D(\zeta_n) = \sqrt{\sum_{k=1}^{n} D_{nk}^2}, \qquad K_n = \Bigl(\sum_{k=1}^{n} K_{nk}^3\Bigr)^{\frac{1}{3}}.$$

Let $F_n(x)$ denote the distribution function of $\dfrac{\zeta_n - E(\zeta_n)}{S_n}$. If Liapunov's condition

$$\lim_{n \to \infty} \frac{K_n}{S_n} = 0 \tag{5}$$

is fulfilled, then

$$\lim_{n \to \infty} F_n(x) = \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{u^2}{2}}\,du. \tag{6}$$

Proof. Without restricting generality we may assume $E(\xi_{nk}) = 0$
$(k = 1, 2, \ldots, n)$. Let $U_{nk}$ denote the operator associated with the distribution
function $F_{nk}(x)$ of the random variable $\dfrac{\xi_{nk}}{S_n}$. Further let $V_{nk}$ denote
the associated operator of the normal distribution with expectation 0 and
standard deviation $\dfrac{D_{nk}}{S_n}$. Then $U_{n1} U_{n2} \cdots U_{nn}$ is nothing else than the
operator associated with the distribution function $F_n(x)$, and $V_{n1} V_{n2} \cdots V_{nn}$
is the operator $A_\Phi$ associated with the standard normal distribution function
$\Phi(x)$ of expectation 0 and standard deviation 1. If we apply Lemma 4, we
obtain that for any $f \in C_3$

$$\|A_{F_n} f - A_\Phi f\| \le \sum_{k=1}^{n} \|U_{nk} f - V_{nk} f\|. \tag{7}$$

Now

$$U_{nk} f = \int_{-\infty}^{+\infty} f(x + y)\,dF_{nk}(y) \quad \text{and} \quad V_{nk} f = \int_{-\infty}^{+\infty} f(x + y)\,d\Phi\left(\frac{y S_n}{D_{nk}}\right).$$

Since $f \in C_3$, $f(x + y)$ can be expanded into a finite Taylor series up to
three terms

$$f(x + y) = f(x) + y f'(x) + \frac{y^2}{2} f''(x) + \frac{y^3}{6} f'''(x + \theta y), \tag{8}$$

where $0 < \theta < 1$; of course $\theta$ depends on $x$ and $y$. Thus, taking into
account that

$$\int_{-\infty}^{+\infty} dF_{nk}(y) = 1, \quad \int_{-\infty}^{+\infty} y\,dF_{nk}(y) = 0, \quad \text{and} \quad \int_{-\infty}^{+\infty} y^2\,dF_{nk}(y) = \frac{D_{nk}^2}{S_n^2},$$

we obtain

$$U_{nk} f = f(x) + \frac{D_{nk}^2}{2 S_n^2} f''(x) + \frac{1}{6} \int_{-\infty}^{+\infty} y^3 f'''(x + \theta y)\,dF_{nk}(y) \tag{9}$$

and

$$V_{nk} f = f(x) + \frac{D_{nk}^2}{2 S_n^2} f''(x) + \frac{1}{6} \int_{-\infty}^{+\infty} y^3 f'''(x + \theta y)\,d\Phi\left(\frac{y S_n}{D_{nk}}\right). \tag{10}$$

Hence, if $\sup_x |f'''(x)| = M$, we get

$$\|U_{nk} f - V_{nk} f\| \le \frac{M}{6}\left(\int_{-\infty}^{+\infty} |y|^3\,dF_{nk}(y) + \int_{-\infty}^{+\infty} |y|^3\,d\Phi\left(\frac{y S_n}{D_{nk}}\right)\right). \tag{11}$$

Since

$$\int_{-\infty}^{+\infty} |y|^3\,dF_{nk}(y) = \left(\frac{K_{nk}}{S_n}\right)^3 \tag{12}$$

and

$$\int_{-\infty}^{+\infty} |y|^3\,d\Phi\left(\frac{y S_n}{D_{nk}}\right) = \left(\frac{D_{nk}}{S_n}\right)^3 \int_{-\infty}^{+\infty} |y|^3\,d\Phi(y) \le 2\left(\frac{D_{nk}}{S_n}\right)^3, \tag{13}$$

there follows

$$\|U_{nk} f - V_{nk} f\| \le \frac{M}{6 S_n^3}\bigl(K_{nk}^3 + 2 D_{nk}^3\bigr). \tag{14}$$

Because of the Holder inequality one has for every random variable $\xi$

$$\bigl(E(\xi^2)\bigr)^{\frac{1}{2}} \le \bigl(E(|\xi|^3)\bigr)^{\frac{1}{3}}, \tag{15}$$

hence

$$D_{nk} \le K_{nk}, \tag{16}$$

and thus

$$\sum_{k=1}^{n} D_{nk}^3 \le K_n^3. \tag{17}$$

(7) and (14) lead to

$$\|A_{F_n} f - A_\Phi f\| \le \frac{M}{2}\left(\frac{K_n}{S_n}\right)^3, \tag{18}$$

and thus by (5)

$$\lim_{n \to \infty} \|A_{F_n} f - A_\Phi f\| = 0. \tag{19}$$

Thus we proved that if $f \in C_3$, then for any value of x (and even uniformly
in x)

$$\lim_{n \to \infty} \int_{-\infty}^{+\infty} f(x + y)\,dF_n(y) = \int_{-\infty}^{+\infty} f(x + y)\,d\Phi(y). \tag{20}$$

From this it follows that (6) holds for every x. Indeed, if $\varepsilon > 0$ is arbitrary,
let $f_\varepsilon(x)$ be a function belonging to $C_3$ with the following properties:
$f_\varepsilon(x) = 1$ if $x \le 0$, $f_\varepsilon(x) = 0$ if $x \ge \varepsilon$, and $f_\varepsilon(x)$ is decreasing if x lies between
0 and $\varepsilon$. Such a function can be given readily; e.g. the following function
has all the required properties:

$$f_\varepsilon(x) = \begin{cases} 1 & \text{for } x \le 0, \\[4pt] \left(1 - \dfrac{x}{\varepsilon}\right)^4\left(1 + \dfrac{4x}{\varepsilon} + \dfrac{10x^2}{\varepsilon^2} + \dfrac{20x^3}{\varepsilon^3}\right) & \text{for } 0 < x < \varepsilon, \\[4pt] 0 & \text{for } \varepsilon \le x. \end{cases} \tag{21}$$
Then

$$\Phi(x + \varepsilon) \ge \int_{-\infty}^{+\infty} f_\varepsilon(y - x)\,d\Phi(y) \ge \Phi(x), \tag{22}$$

and

$$F_n(x + \varepsilon) \ge \int_{-\infty}^{+\infty} f_\varepsilon(y - x)\,dF_n(y) \ge F_n(x). \tag{23}$$

Hence

$$\limsup_{n \to \infty} F_n(x) \le \Phi(x + \varepsilon), \tag{24}$$

and

$$\liminf_{n \to \infty} F_n(x + \varepsilon) \ge \Phi(x). \tag{25}$$

If we apply (25) to $x - \varepsilon$ instead of x, we obtain

$$\liminf_{n \to \infty} F_n(x) \ge \Phi(x - \varepsilon), \tag{26}$$

i.e. we obtain from (24) and (26) that

$$\Phi(x - \varepsilon) \le \liminf_{n \to \infty} F_n(x) \le \limsup_{n \to \infty} F_n(x) \le \Phi(x + \varepsilon). \tag{27}$$

Since (27) is valid for every positive $\varepsilon$, it follows that (6) is fulfilled for
every x. Theorem 1 is herewith proved.
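The statement of Theorem 1 is easy to check by simulation. The sketch below (an added illustration, not part of Rényi's proof; the particular triangular array and the sample size are arbitrary choices) builds row sums of independent, non-identically distributed variables for which Liapunov's ratio K_n/S_n is small, and compares the empirical distribution function of the normalized sum with Φ.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(2)
n, reps = 400, 20000

# Row of a triangular array: xi_{nk} uniform on (-a_k, a_k) with a_k = k**0.25,
# so the variances grow slowly and Liapunov's condition holds.
a = np.arange(1, n + 1) ** 0.25
S_n = np.sqrt(np.sum(a ** 2 / 3.0))            # D^2 of U(-a, a) is a^2 / 3
K_n = (np.sum(a ** 3 / 4.0)) ** (1.0 / 3.0)    # E|U(-a, a)|^3 = a^3 / 4
print("Liapunov ratio K_n / S_n =", K_n / S_n)  # small for large n

samples = rng.uniform(-a, a, size=(reps, n)).sum(axis=1) / S_n
Phi = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))
for x in (-1.5, 0.0, 1.0):
    print(x, np.mean(samples < x), Phi(x))      # empirical F_n(x) is close to Phi(x)
```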
Now we pass to the proof of the Lindeberg theorem by the operator
method. We prove the theorem in its most general form, i.e. we present
the proof of Theorem 4 of § 1.

Proof of Theorem 4, § 1 by the operator method. We may assume
without restriction of generality that $M_{nk} = E(\xi_{nk}) = 0$ $(k = 1, 2, \ldots, n)$.
Put $\zeta_n = \sum_{k=1}^{n} \xi_{nk}$ and let $F_n(x)$ denote the distribution function of $\zeta_n$.

We prove that for every $f \in C_3$

$$\lim_{n \to \infty} A_{F_n} f = A_\Phi f. \tag{28}$$

As we have seen in the proof of Theorem 1, it then follows that for every
real x
$$\lim_{n \to \infty} F_n(x) = \Phi(x).$$

Let $U_{nk}$ denote the operator associated with the distribution function
$F_{nk}(x)$ of the random variable $\xi_{nk}$ and $V_{nk}$ the operator associated with the
normal distribution with expectation 0 and standard deviation $D_{nk}$. Then
according to our assumptions

$$A_{F_n} = U_{n1} U_{n2} \cdots U_{nn}, \quad \text{and} \quad A_\Phi = V_{n1} V_{n2} \cdots V_{nn}. \tag{29}$$

Further by Lemma 4 for every $f \in C_3$

$$\|A_{F_n} f - A_\Phi f\| \le \sum_{k=1}^{n} \|U_{nk} f - V_{nk} f\|. \tag{30}$$

Now if we expand $f(x + y)$ into a finite Taylor series up to the second
and third term respectively, we get

$$f(x + y) = f(x) + y f'(x) + \frac{y^2}{2} f''(x + \theta_1 y), \tag{31}$$

and

$$f(x + y) = f(x) + y f'(x) + \frac{y^2}{2} f''(x) + \frac{y^3}{6} f'''(x + \theta_2 y), \tag{32}$$

where $0 < \theta_1 < 1$ and $0 < \theta_2 < 1$; of course $\theta_1$ and $\theta_2$ depend on x
and y. Let $\varepsilon > 0$. Clearly

$$U_{nk} f = \int_{|y| \le \varepsilon} f(x + y)\,dF_{nk}(y) + \int_{|y| > \varepsilon} f(x + y)\,dF_{nk}(y). \tag{33}$$

Use in the first integral on the right hand side of (33) the equality (32)
and in the second integral (31). We obtain

$$U_{nk} f = f(x) + \frac{D_{nk}^2}{2} f''(x) + \frac{1}{6} \int_{|y| \le \varepsilon} y^3 f'''(x + \theta_2 y)\,dF_{nk}(y) + \frac{1}{2} \int_{|y| > \varepsilon} y^2 \bigl(f''(x + \theta_1 y) - f''(x)\bigr)\,dF_{nk}(y). \tag{34}$$

Put
$$\sup_x |f''(x)| = M_1 \quad \text{and} \quad \sup_x |f'''(x)| = M_2;$$
then

$$\left| U_{nk} f - f(x) - \frac{1}{2} D_{nk}^2 f''(x) \right| \le \frac{1}{6}\,\varepsilon M_2 D_{nk}^2 + M_1 \int_{|y| > \varepsilon} y^2\,dF_{nk}(y). \tag{35}$$

On the other hand

$$V_{nk} f = f(x) + \frac{1}{2} D_{nk}^2 f''(x) + \frac{1}{6} \int_{-\infty}^{+\infty} y^3 f'''(x + \theta_2 y)\,d\Phi\left(\frac{y}{D_{nk}}\right), \tag{36}$$

hence

$$\left| V_{nk} f - f(x) - \frac{1}{2} D_{nk}^2 f''(x) \right| \le \frac{1}{6} M_2 D_{nk}^3 \int_{-\infty}^{+\infty} |y|^3\,d\Phi(y) \le \frac{M_2 D_{nk}^3}{3}. \tag{37}$$

(35) and (37) lead to

$$\sum_{k=1}^{n} \|U_{nk} f - V_{nk} f\| \le \frac{\varepsilon M_2}{6} + M_1 \sum_{k=1}^{n} \int_{|y| > \varepsilon} y^2\,dF_{nk}(y) + \frac{M_2}{3} \sum_{k=1}^{n} D_{nk}^3. \tag{38}$$

Since by our assumption the Lindeberg condition is fulfilled, we have

$$\lim_{n \to \infty} \sum_{k=1}^{n} \int_{|y| > \varepsilon} y^2\,dF_{nk}(y) = 0; \tag{39}$$

furthermore

$$D_{nk}^2 = \int_{|x| \le \varepsilon} x^2\,dF_{nk}(x) + \int_{|x| > \varepsilon} x^2\,dF_{nk}(x) \le \varepsilon^2 + \int_{|x| > \varepsilon} x^2\,dF_{nk}(x),$$

therefore

$$\sum_{k=1}^{n} D_{nk}^3 \le \max_{1 \le k \le n} D_{nk}, \qquad \max_{1 \le k \le n} D_{nk}^2 \le \varepsilon^2 + \sum_{k=1}^{n} \int_{|x| > \varepsilon} x^2\,dF_{nk}(x),$$

and thus for every positive $\varepsilon$

$$\limsup_{n \to \infty} \sum_{k=1}^{n} D_{nk}^3 \le \varepsilon,$$

that is

$$\lim_{n \to \infty} \sum_{k=1}^{n} D_{nk}^3 = 0; \tag{40}$$

hence (30) and (38) lead to (28). Theorem 4 of § 1 is herewith proved.


As a further illustration of the operator method we prove now a theorem
concerning the convergence to the Poisson distribution.

Theorem 2. Let $\xi_{nk}$ $(k = 1, 2, \ldots, n)$ be independent random variables
which assume only the values 0 and 1; put further

$$P(\xi_{nk} = 1) = p_{nk}. \tag{41}$$

Put

$$\lambda_n = \sum_{k=1}^{n} p_{nk} \tag{42}$$

and suppose that

$$\lim_{n \to \infty} \lambda_n = \lambda \tag{43}$$

and

$$\lim_{n \to \infty} \max_{1 \le k \le n} p_{nk} = 0. \tag{44}$$

Then the distribution of

$$\zeta_n = \xi_{n1} + \xi_{n2} + \cdots + \xi_{nn} \tag{45}$$

converges to the Poisson distribution with expectation $\lambda$.

Remark. Theorem 2 is a particular case of Theorem 1 of § 4; the latter


can be proved in a similar manner. Merely for simplicity’s sake we restrict
ourselves to the proof of Theorem 2.

Proof. Let K denote the set of all real-valued bounded functions $f(x)$
$(x = 0, 1, 2, \ldots)$ defined on the nonnegative integers. Put $\|f\| = \sup_x |f(x)|$.
Let there be associated with every probability distribution $\mathcal{P} = \{p_0, p_1, \ldots, p_n, \ldots\}$
an operator defined by

$$A_{\mathcal{P}} f = \sum_{r=0}^{\infty} f(x + r)\,p_r \tag{46}$$

for every $f \in K$. Clearly, $A_{\mathcal{P}}$ maps the set K into itself, $A_{\mathcal{P}}$ is a linear
contraction operator; further if $\mathcal{P}$ and $\mathcal{Q}$ are any two distributions defined on
the nonnegative integers, then $A_{\mathcal{P}} A_{\mathcal{Q}} = A_{\mathcal{R}}$, where $\mathcal{R} = \mathcal{P} * \mathcal{Q}$, i.e. $\mathcal{R}$
is the convolution of the distributions $\mathcal{P}$ and $\mathcal{Q}$; that is, if $\mathcal{P} = \{p_n\}$ and
$\mathcal{Q} = \{q_n\}$, then $\mathcal{R} = \{r_n\}$, where

$$r_n = \sum_{k=0}^{n} p_k q_{n-k}.$$

Let $U_{nk}$ denote the operator associated with the distribution $\mathcal{P}_{nk}$ of the
random variable $\xi_{nk}$ and $V_{nk}$ the operator associated with the Poisson
distribution with parameter $p_{nk}$. Then $U_{n1} U_{n2} \cdots U_{nn}$ is nothing else than
the operator $A_{\mathcal{P}_n}$ associated with the distribution $\mathcal{P}_n$ of the random variable
$\zeta_n$, while $V_{n1} V_{n2} \cdots V_{nn}$ is the operator $A_{\mathcal{Q}_{\lambda_n}}$ associated with the Poisson
distribution with parameter $\lambda_n$ (taking into account that if $\mathcal{Q}_\lambda$ is the
Poisson distribution with parameter $\lambda$, then $\mathcal{Q}_\lambda * \mathcal{Q}_\mu = \mathcal{Q}_{\lambda+\mu}$).

In order to prove Theorem 2 it suffices to show that for every element
f of K the relation

$$\lim_{n \to \infty} \|A_{\mathcal{P}_n} f - A_{\mathcal{Q}_{\lambda_n}} f\| = 0 \tag{47}$$

holds. In fact, if (47) holds for every $f \in K$, choose for f the function for
which $f(0) = 1$ and $f(x) = 0$ for $x \ge 1$; then it follows from (47) that
for every r (and even uniformly in r)

$$\lim_{n \to \infty} \left( P(\zeta_n = r) - \frac{\lambda_n^r e^{-\lambda_n}}{r!} \right) = 0, \tag{48}$$

and since by our assumption $\lambda_n \to \lambda$, it follows from (48) that

$$\lim_{n \to \infty} P(\zeta_n = r) = \frac{\lambda^r e^{-\lambda}}{r!} \qquad (r = 0, 1, \ldots). \tag{49}$$

Now Lemma 4 is valid in this case too, and applying it we obtain

$$\|A_{\mathcal{P}_n} f - A_{\mathcal{Q}_{\lambda_n}} f\| \le \sum_{k=1}^{n} \|U_{nk} f - V_{nk} f\|. \tag{50}$$

On the other hand

$$U_{nk} f - V_{nk} f = f(x)\bigl((1 - p_{nk}) - e^{-p_{nk}}\bigr) + f(x + 1)\bigl(p_{nk} - p_{nk} e^{-p_{nk}}\bigr) - \sum_{r=2}^{\infty} f(x + r)\,\frac{p_{nk}^r e^{-p_{nk}}}{r!}, \tag{51}$$

and thus

$$\|U_{nk} f - V_{nk} f\| \le \|f\| \Bigl( e^{-p_{nk}} - (1 - p_{nk}) + p_{nk}\bigl(1 - e^{-p_{nk}}\bigr) + \bigl[1 - e^{-p_{nk}}(1 + p_{nk})\bigr] \Bigr). \tag{52}$$

Since

$$1 - x \le e^{-x} \le 1 - x + x^2 \qquad \text{for } x < 1, \tag{53}$$

there follows

$$\|U_{nk} f - V_{nk} f\| \le 3\,\|f\|\,p_{nk}^2. \tag{54}$$

Thus (50) and (54) lead to

$$\|A_{\mathcal{P}_n} f - A_{\mathcal{Q}_{\lambda_n}} f\| \le 3\,\|f\| \sum_{k=1}^{n} p_{nk}^2. \tag{55}$$

Because of

$$\sum_{k=1}^{n} p_{nk}^2 \le \lambda_n \cdot \max_{1 \le k \le n} p_{nk}, \tag{56}$$

by (44) it follows that

$$\lim_{n \to \infty} \|A_{\mathcal{P}_n} f - A_{\mathcal{Q}_{\lambda_n}} f\| = 0. \tag{57}$$

As we have seen, the assertion of Theorem 2 follows.
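For a concrete feel of the rate in (55), the following sketch (added as an illustration, not part of the original proof; the probabilities p_nk chosen are arbitrary) computes the exact distribution of ζ_n for a row of unequal Bernoulli variables and compares it with the Poisson distribution of parameter λ_n; the total variation distance is indeed of the order of Σ p_nk².

```python
import numpy as np
from math import exp, factorial

p = np.array([0.05, 0.02, 0.08, 0.01, 0.04, 0.03, 0.06, 0.02])  # p_{nk}
lam = p.sum()                                                    # lambda_n

# Exact distribution of zeta_n = sum of independent Bernoulli(p_k), by convolution.
dist = np.array([1.0])
for pk in p:
    dist = np.convolve(dist, [1.0 - pk, pk])

poisson = np.array([exp(-lam) * lam ** r / factorial(r) for r in range(len(dist))])
poisson[-1] += 1.0 - poisson.sum()      # lump the (tiny) Poisson tail into the last cell

tv = 0.5 * np.abs(dist - poisson).sum()
print("total variation distance:", tv)
print("sum of p_k squared      :", (p ** 2).sum())   # same order of magnitude
```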


Finally, we give a proof by the operator method of the theorem proved
in § 3. Just like there, we assume that the distribution in question is continuous
and symmetric with respect to the point 0, i.e. we prove the following

Theorem 3. Let $\xi_1, \xi_2, \ldots$ be independent identically distributed
random variables; let their common distribution function be denoted by F(x).
Suppose that F(x) is continuous and the distribution is symmetric with respect
to the point 0, i.e. $F(-x) = 1 - F(x)$ $(x > 0)$. Assume further that

$$\lim_{y \to +\infty} \frac{y^2\bigl(1 - F(y)\bigr)}{\displaystyle\int_{-y}^{y} x^2\,dF(x)} = 0. \tag{58}$$

Put $\zeta_n = \xi_1 + \xi_2 + \cdots + \xi_n$. Then there exists a sequence of numbers $S_n$
such that for every x

$$\lim_{n \to \infty} P\left(\frac{\zeta_n}{S_n} < x\right) = \Phi(x). \tag{59}$$

Proof. Put

$$\delta(y) = \frac{y^2\bigl(1 - F(y)\bigr)}{\displaystyle\int_{-y}^{y} x^2\,dF(x)}; \tag{60}$$

then by assumption

$$\lim_{y \to +\infty} \delta(y) = 0. \tag{61}$$

Put further

$$\lambda(y) = \frac{\delta(y)}{\bigl(1 - F(y)\bigr)^2} = \frac{y^2}{\bigl(1 - F(y)\bigr)\displaystyle\int_{-y}^{y} x^2\,dF(x)}; \tag{62}$$

then, as was shown in § 3,

$$\lim_{y \to +\infty} \lambda(y) = +\infty. \tag{63}$$

By our assumption $\lambda(y)$ is continuous for $y > y_0$. Let $C_n$ denote the least
positive number for which

$$\lambda(C_n) = n^2; \tag{64}$$

then $C_n \to \infty$; furthermore

$$n\bigl(1 - F(C_n)\bigr) = \sqrt{\delta(C_n)}. \tag{65}$$

yCn
Put 2 Cl
x2dF(x) = (66)
— Cn
then
c, (67)
S. V2
Now let £/„fc be the operator associated with the distribution of the random

variable — and Vnk the operator associated with the normal distribution

with expectation 0 and standard deviation —= (k = 1,2,... ,n). Then



UnlUn2. . . Unn is the operator associated with the distribution function
F„{x) of the random variable while Vnl V„2.. . V„„ is the operator
associated with the standard normal distribution function #(x) (having
expectation 0 and standard deviation 1). Thus by Lemma 4 for every
/ ^ C3 we have

II AfJ- A0f\\ < t II Unkf- Vnkf\\ = n || UnJ- VnXf\\.


k=l

Hence it suffices to prove that


lim n || U„if— VnXf\\ = 0. (68)
n-+ oo

Now
Cn

U„if= j f{x + y)dF(Sny)+ j f(x + y)dF(Sny). (69)


,c» c„
Sn
\y\> Sn

If sup |/(x) | = A, then

f(x + y)dF(Sny) < A( 1 - f(C„)) =


V6(C») (70)
n
Cn
\y\> Sn

On the other hand, if in the integral on the right hand side of (69) f(x + y)
is expanded into a Taylor series up to the third term and if it is taken into
account that by our assumption the distribution with the distribution
function F(y) is symmetric with respect to the point 0, then we have

j Ax + y)dF(S,y)=f(x)(l-2(\-F(C,))) + i^y- + R,, (71)


Cn
Sn

where
+ Cn

Rn yzf" (x + 9y)dF(y), (72)

— Cn

and 0 < 6 < 1.


If sup | f"\x) | = B. then
+ Cn

\Rn\<JL J |y|3dF(y)<
BC„ = B
3 nS„ 3n
*/W7)
J2
(73)

-Cn

On the other hand


1
VHlf=f(x) + Q^ + 0 (74)
2n
' n2
thus (69)-(74) lead to
B
n || UnJ- VnJ\\ < 3A j5(Cn) + jj= *JWn) + 0 (75)

hence because of <5(C„) -> 0 the validity of (68) follows. Herewith Theorem 3
is proved.
Finally we make some remarks concerning the relation between operator
and characteristic function methods.
The convergence of a sequence $F_n$ of distribution functions to a distribution
function F is proved by the operator method by showing that for
every $f \in C_3$ one has $A_{F_n} f \to A_F f$. This implies that the characteristic
function $\varphi_n$ of the distribution function $F_n$ tends to the characteristic
function $\varphi$ of the distribution function F; in fact, if $f(x) = e^{ixt}$, then
$f \in C_3$ and $A_{F_n} f = e^{itx} \int_{-\infty}^{+\infty} e^{ity}\,dF_n(y)$, hence $A_{F_n} f = e^{itx}\varphi_n(t)$ and,
similarly, $A_F f = e^{itx}\varphi(t)$.
Hence, from the fact that for every $f \in C_3$

$$A_{F_n} f \to A_F f \tag{76}$$

it follows that for every real t, $\varphi_n(t) \to \varphi(t)$.
Therefore the operator method proves slightly more than the characteristic
function method. In effect, we prove for every $f \in C_3$ the validity of (76)
and even that (76) is fulfilled uniformly in x. This makes the proof of the
relation $F_n(x) \to F(x)$ simpler, because while the implication of the relation
$F_n(x) \to F(x)$ by the relation $\varphi_n(t) \to \varphi(t)$ is a comparatively deep theorem
(the so-called continuity theorem of characteristic functions, cf. Theorem 3
of Chapter VI, § 4), it is quite easy to see that (76) implies $F_n(x) \to F(x)$ (for

every x which is a continuity point of F(x)). On the other hand, the method
by which we proved (76) in each of the above discussed cases, can be applied
for distributions of sums of independent random variables only, while the
method of characteristic functions can be applied in other cases too (cf. e.g.
§ 5 or Exercise 26 of § 13).

§ 13. Exercises

1. Prove Theorem 2' of Chapter VI, § 5 by means of the central limit theorem
(Chapter VIII, § 1, Theorem 1).

Hint. If F(x) is a distribution function with expectation 0 and variance 1 such that

$$F * F(x) = F\!\left(\frac{x}{\sqrt{2}}\right),$$

then F(x) is equal to the n-fold convolution of $F(x\sqrt{n})$. This converges to the normal
distribution as $n \to +\infty$.

2. Let $\xi_1, \xi_2, \ldots, \xi_n, \ldots$ be independent random variables and suppose

$$P(\xi_n = a_n) = P(\xi_n = -a_n) = \frac{1}{2} \qquad (n = 1, 2, \ldots).$$

Under what conditions on the positive numbers $a_n$ does Liapunov's condition of the
central limit theorem hold for the random variables $\xi_n$?

Hint. Put

$$S_n = \sqrt{\sum_{k=1}^{n} a_k^2}, \qquad K_n = \Bigl(\sum_{k=1}^{n} a_k^3\Bigr)^{\frac{1}{3}}, \qquad m_n = \max_{1 \le k \le n} a_k.$$

It follows that

$$\frac{m_n}{S_n} \le \frac{K_n}{S_n} \le \left(\frac{m_n}{S_n}\right)^{\frac{1}{3}},$$

and Liapunov's condition $\lim_{n \to +\infty} \dfrac{K_n}{S_n} = 0$ is fulfilled iff $\lim_{n \to +\infty} \dfrac{m_n}{S_n} = 0$.

3.a) Let be a random variable having a Poisson distribution with expectation X.


Show by the method of characteristic functions that the distribution function of the
£ 0
random variable tends to the normal distribution function as X—» +

(cf. also Ch. Ill, § 18, Exercise 28).


b) Let £„ be a random variable having a gamma distribution of order n with

£K„) = y • Show that the distribution function of ^J~n - 1 tends to the normal
distribution function with expectation 0 and standard deviation 1.
c) Let 4 be a random variable having a beta distribution of order {np, nq). Show

that the distribution function of the random variable , /— (p + q)2 \L--—


v pq \ p + q
tends to the normal distribution function with expectation 0 and standard deviation
1 as n —* co.

4. Let $\varepsilon_n(x)$ denote the n-th digit in the decimal expansion of x (0 < x < 1); the
values of $\varepsilon_n(x)$ are thus the numbers 0, 1, \ldots, 9. Put $S_n(x) = \sum_{k=1}^{n} \varepsilon_k(x)$. If $E_n(y)$ is the
set of the numbers x for which

$$\frac{2 S_n(x) - 9n}{\sqrt{33 n}} < y,$$

and if $|E_n(y)|$ denotes the Lebesgue measure of $E_n(y)$, show that

$$\lim_{n \to +\infty} |E_n(y)| = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-\frac{u^2}{2}}\,du.$$

Hint. We choose a point $\eta$ at random in (0, 1); i.e. $\eta$ is a random variable uniformly
distributed in the interval (0, 1). The random variables $\xi_n = \varepsilon_n(\eta)$ are then independent
and identically distributed; the central limit theorem can be applied. We have:

$$E(\xi_n) = \frac{9}{2}, \qquad D(\xi_n) = \frac{\sqrt{33}}{2}.$$
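A quick numerical check of Exercise 4 (an added illustration, not part of the exercise; the sample size is arbitrary) can be made by drawing uniform random numbers, summing their first n decimal digits and comparing the normalized sums with the standard normal distribution function.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(3)
n, reps = 200, 50000

# The first n decimal digits of a uniformly distributed point of (0, 1)
# are independent and uniform on {0, ..., 9}, so we may draw them directly.
digits = rng.integers(0, 10, size=(reps, n))
S = digits.sum(axis=1)
Z = (2 * S - 9 * n) / np.sqrt(33 * n)       # the normalization of Exercise 4

Phi = lambda y: 0.5 * (1 + erf(y / sqrt(2)))
for y in (-1.0, 0.0, 1.5):
    print(y, np.mean(Z < y), Phi(y))        # empirical measure of E_n(y) vs Phi(y)
```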

5. Let qu q2,. .., q„,... be a sequence of integers > 2. It is easy to show that
every number x (0 < x < 1) (a denumerable set of numbers expected) can be repre¬
sented in one and only one way in the following form:

v £n(x)
* = Y --,
»=i di q-i ■ ■ ■ qn

where en(x) may take on the values 0, 1, . . . , qn — 1. As in Exercise 4 put Sn(x) =


n

— £ £k(x). Now if E„(y) denotes the set of numbers x (0 < x < 1) such that
*=i

Sn(x) - (dk - 1)

n
<y
X (ql - i)
k=l

and \En(y)\ is the Lebesgue measure of E„(y), then we have


y
lim | En (y) | = —( e 2 du.
n—+co \l^n '
— 00

provided that the condition


max qk
lim ——= 0
+“ a2

is fulfilled.

Hint. Choose at random a point $\eta \in (0, 1)$ and put $\zeta_n = \varepsilon_n(\eta)$. It is easy to see
that the random variables $\zeta_n$ are independent. Furthermore

$$E(\zeta_n) = \frac{q_n - 1}{2}, \qquad D^2(\zeta_n) = \frac{q_n^2 - 1}{12}, \qquad E\bigl(|\zeta_n - E(\zeta_n)|^3\bigr) \le C_1 q_n^3,$$

where $C_1$ is a positive constant. Liapunov's condition is thus satisfied.

6. Let .... £„ ,... be independent random variables, with the same normal
distribution. Put

I (f* - «„) /-
*=1
--- , and rn = — J n .
n *=1 n — 1

Show that
x
l c -u%
lim P(t„ < x) = —— e 2 du.
n —► -j* co Jln J
Hint. The distribution of t„ is Student’s distribution with n — 1 degrees of freedom
(cf. Ch. IV, § 10). Its density function is

n
1 2
S„-i(x) =
y/(n - 1) n

and we find that

lim S„-y (a) = —— e 2 .


«-► + =>

Another proof can be obtained by noticing that the distribution function of £„ n


tends to the normal distribution function as n —> + °o and lim st an — 1; the result
n—► -f*°°

follows then from the lemma of § 7.

7. Prove that any subsequence of a Markov chain is also a Markov chain.

8. (Ehrenfest’s model of heat conduction.) Consider two urns and N balls labelled
from 1 to N. The balls are arbitrarily distributed between the two urns; assume that
the first contains M and the second N — M balls. We put into a box N cards labelled
also from 1 to N. We draw a card from the box and put the ball bearing the same
number from the urn in which it is contained into the other urn. After this the card
i s replaced into the box and the operation is repeated. Let Cn denote the number
of the balls in the first urn after the «-th step (i.e. after drawing n cards) (n =
= 1,2,...; Co = M). The states of the system consisting of the two urns form a
Markov chain. The transition probabilities are

$$p_{k,k+1} = 1 - \frac{k}{N} \quad (k = 0, 1, \ldots, N - 1), \qquad p_{k,k-1} = \frac{k}{N} \quad (k = 1, 2, \ldots, N), \tag{1}$$

$$p_{k,j} = 0 \quad \text{for } |k - j| \ne 1.$$

Show that

$$E(\zeta_n) - \frac{N}{2} = \left(1 - \frac{2}{N}\right)^n \left(M - \frac{N}{2}\right).$$

(This example contains the statistical justification of Newton's law of cooling.)
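A short simulation of the Ehrenfest urns (an added sketch, not part of the exercise; N, M and the number of runs are arbitrary choices) shows the exponential relaxation of the expected number of balls in the first urn towards N/2, which is the content of the formula above.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, steps, runs = 50, 45, 200, 4000

counts = np.full(runs, M)                    # balls in the first urn, zeta_0 = M
avg = []
for n in range(1, steps + 1):
    # A card (ball label) is drawn uniformly; the ball sits in urn 1
    # with probability counts / N and is then moved to the other urn.
    move_out = rng.random(runs) < counts / N
    counts = counts - 1 * move_out + 1 * (~move_out)
    avg.append(counts.mean())

for n in (1, 10, 50, 200):
    predicted = N / 2 + (1 - 2 / N) ** n * (M - N / 2)
    print(n, avg[n - 1], predicted)          # simulated mean vs the formula
```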

9. Let Gabon’s desk be modified in the following manner (cf. Fig. 26): From the
N-th row on the number of pegs is alternatingly equal to the number in the (N — l)-th
row and in the N-th row. On the whole desk there are N + n rows of pegs. Determine
the distribution of the balls in the containers when the number n of balls is large.

10. The random variables f0, ft.f« • • • form a homogeneous Markov chain;
all Cn take on values in (0, 1); let the conditional distribution of £„+, under the
condition = y be absolutely continuous for every value of y (0 < y < 1); let
p(x, j) be the corresponding conditional density function. We assume that for
0 < x < 1 and 0 < y < 1 the function p(x, ^) is always positive and that for every

x (0<x < 1) 5 p(x,y) dy = 1 holds, further that p(x, y) is continuous. Let p„(x,y)

be the conditional density function of C„ under the condition C0 = y- Show that


the relation
lim />„(*, y) = 1

is valid uniformly in x and y.



11. Let a moving point perform a random walk on a plane regular triangular
lattice. If the moving point is at the moment t = n at an arbitrary lattice-point, it
may pass at the moment t = n + 1 with the same probability to any of the 6 neigh¬
bouring lattice points. Show that the moving point will return with probability 1 to
its initial position, but that the expectation of the time passing until this return is
infinite.

The following Exercises 12 through 18 all deal with homogeneous Markov chains
with a finite number of states, fulfilling the conditions of Theorem 1, § 8. The notations
are the same. The states are denoted by A0, Au ..., AN. The random variable £„ is
equal to k if the system is in the state Ak at the time n (k = 0, 1,..., N). We put
Ptto = J) = P0(j), Pit = P(An + m =k\Zm = j), p% = plk and P„(k) = P(fn = k). We
assume that min plk = d > 0. According to Theorem 1 of § 8 the limits lim p)k = Pk
n —*■ CO
N
exist and are independent of j. Furthermore £ Pk = 1.
k=0
12. Let

if the system is in state Ak at the time t = n,


vP otherwise.

We put VP - rfjk\ Show that


/= i
Hk)
lim st —— = Pt (* = 0,1,..., AO
W—*- +00

i.e. the system passes approximately a fraction Pk of the whole time in the state Ak.
Hint. We have

E(vP) = P(V = k) = P(k),

D(vP)= ^Pn(k) (1 - Pn (*))


and

E(VP vPr) ~ E(VP)E(VPr) = Pn(*) (/»g - Pn + r(k)).


Furthermore, Formula (39) of § 8 implies |p$ — Pk | ^ (1 _ df. Hence

P(Vn\ Vn+r) < C(1 - df,

where ^O is the correlation coefficient of VP and Vp„ and C is positive.


Thus the result follows from Theorem 3 of Chapter VII, § 3.

13. Let + ...+ £„ . Show that

£ N
lim st — = Y kPk.
n-*-+co n *=1

Hint. C„ can be expressed as a function of the variables V„k\ namely £„ = £ kVP-


The result follows from that of the preceding exercise. *=l

14. Assume that at t = 0 the system is in the state Ak . It returns to it for the first
time after a certain number of steps. Let this random number be denoted by .
Show that P(ym > «) < (1 — d)n.

Hint. We have P(v°° > 1) = 1 — pkk\ hence the inequality is true for n = 1.
Suppose for a proof by induction that the inequality is true for n . Then

P(ym > n + 1) = Y,
h^lc
Z
i±k
P(yik) > n>£n= J, £«+1 = h),

hence

P(vm > n + 1) = £
i 9tfc
P(y™ >n,'Qn= j) £
h+k
pm <

< (1 - d) P(vm >«)<(!- d)n+1.

15. Show that:


a) the expectation and the variance of the random variable v(k> defined in the
previous exercise exist.

b) E(v= Z- •
k

Hint, a) follows from Exercise 14. Let further Vk(z) denote the generating function
of v(k); we have

1
Vk(z) = 1
Uk{zV
where
CO

Uk(z) = +Z
The relations

Uk(z) = -pZ- + 1 + Z - Pk)A iPkl -Pk\<(i- df


1 — z «= 1

lead to

lim Uk(z) (1 — z) = lim Uk (z)(l — z)2 - Pk,


z-~ 1 *=i

which implies b).

16. Let the numbers p,® (r = 0, 1,. . .) denote the values of n for which rf® = 1
(pm < pW < . . .); nT is defined here as in Exercise 12. Show that the standardized
distribution of pt** tends to the normal distribution as r -> + 00 •

Hint. The random variables - p(klx are independent and identically


distributed. According to the preceding exercise, the expectation and the variance
of v(k) exist and are equal to those of vw. Hence the central limit theorem can be
applied.

17. Show that the distribution of the random variables C(„k) introduced in Exercise
12 tends, after standardization, to the normal distribution as n —► +oo. (Generaliza¬
tion of Theorem 2 in § 8.)

Hint. It is easy to see that P(fjjw < r) = P(/Ak) > n) ; for if the system passes less
than r times through the state Ak during the first n steps, then it will return to it for
the r-th time after the moment t = n, and conversely. Thus we are back to Exercise
16 and find

where we have put Dk = Z>(i(W).

18. Put

Pi** = P(.in = k | £„+i —j) (n = 1,2,k = 0, 1,...,N).

Show that the limits lim P(lk"> = P/* exist and form a stochastic matrix.
Hint. We have

Hence

lim Pfrn) = —kPkl — p*

and

But the Pk satisfy the system of equations


N

Z Pkpk, = P, U= o, 1,...,A0,
k=0

hence we find that £ Pfk = 1 ; the transition probabilities Pt define thus again a
k=0
Markov chain.

Remark. A Markov chain is said to be reversible, if P* = Pkj for /, k = 0, 1,.. ., N.


It is necessary and sufficient for this that the matrix of transition probabilities is
doubly stochastic.

In Exercises 19 through 23 are independent, identically distributed random


variables with the common continuous distribution function Fix). i*k denotes the k-th
order statistic (k = 1, 2,. .., n).

19. Suppose F(0) = 0 and F’{0) = X > 0. Show that


Xx

0

Hint. We have
n—r

and, consequently, Ax
co -Ax
{Xx)r 1
lim P(n
n-+ +oo
< x) = £

f-k
m L r l (* - 1 )'•
tk~l e 1 dt.

20. Suppose F(x) = x (0 < x < l). Show that »{** and n (1 - are
independent in the limit as n -» + 00 and have gamma distributions of order k and j,
respectively: s.
lim P(jt <C x, n(l ^n,n+i—D < y)
n-r+m

r u*-'e-udu r t/-1 e-'t/y


“ ) (* - 1)! J
X

21. Suppose F(x) = ^


x — m
, where d>(x) = -4/2ji=fJ e 2 du . Show that the
V — 00

density function of

In In n + In 4jt
Z*k -m + a <2\nn-a
2x/2 In n

yj 2 In n

tends to ■ , 1 ^ exp (-kx - e~*) as n-+ + 00.


(k — 1)!
22 Let fix') = F’(x) be continuous and positive on the interval [a, b]; suppose
further tha{ V<F- (J,) = C, < C, = C’M < «■ Show .ha. .he two-dunenstona
distribution of y»({.V. - Ct>. and V - Mtendsw a two^hnens^nal
normal distribution as n—r + °°, if IL,(n) n.n l and I ,(n) da
23. Let F.(x) be the empirical distribution function of the sample ({„ ti.O-
Show that
P„(e) = P(sup (F(x) - Fn(x)) < e) =
X

(n\ ( k\n~k ( , k'k~'

->-• s UH-») r-.


Hint. We may assume that the variables ?* are uniformly distributed in the interval
(0, 1) . If m = [n(l - £)], the inequality sup (F(x) - Fn(x)) < e is equivale

It is easy to prove that

Pn (e) = n\ J .. . J dxt ... dx„,


Te

where TB is the domain defined by the inequalities

0< xx < x2 ^ ... <■ x„ < 1, x, < -J- e for j = 1, 2 ..m + 1.

The final result can be obtained by induction.

Remark. We can derive from this result the theorem of Smirnov (§ 10, Theorem 1),

24. (Wilcoxon's test for the comparison of two samples.) Let $\xi_1, \ldots, \xi_m$ and
$\eta_1, \ldots, \eta_n$ be independent, identically distributed random variables with the common
continuous distribution function F(x). Let the numbers $\xi_k$ and $\eta_l$ be united into a
single sequence, let them be arranged in increasing order and investigate the "places"
occupied by $\xi_1, \ldots, \xi_m$. Let $\nu_1, \nu_2, \ldots, \nu_m$ denote the ranks of the elements
$\xi_1, \ldots, \xi_m$ in this sequence. Put

$$W = \nu_1 + \nu_2 + \cdots + \nu_m - \frac{m(m + 1)}{2}.$$

a) Show that W is equal to the number of pairs $(\xi_k, \eta_l)$ such that $\xi_k > \eta_l$.

b) Show that $E(W) = \dfrac{mn}{2}$.

c) Let $G_{nm}(z)$ be the generating function of W:

$$G_{nm}(z) = \sum_{k} P(W = k)\,z^k.$$

Show that

$$G_{nm}(z) = \frac{C_{n+m}(z)}{C_n(z)\,C_m(z)},$$

where we have put

$$C_n(z) = \prod_{j=1}^{n} \frac{1 - z^j}{j(1 - z)}.$$

d) Show that

$$D(W) = \sqrt{\frac{n m (n + m + 1)}{12}}.$$

e) Derive from c) that the distribution of $W^* = \dfrac{W - E(W)}{D(W)}$ tends to the normal
distribution as $n \to +\infty$, $m \to +\infty$, if $\dfrac{n}{m}$ tends to a constant. (Cf. Ch. II, § 12,
Exercise 46 and Ch. III, § 18, Exercise 45.)
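The identity in part a) makes W easy to compute directly. The sketch below (an added illustration; sample sizes and the sampling distribution used are arbitrary choices) computes W both from the ranks and as the number of pairs with ξ_k > η_l, and checks the moments of parts b) and d) by simulation.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, reps = 8, 12, 20000

W = np.empty(reps)
for i in range(reps):
    xi, eta = rng.normal(size=m), rng.normal(size=n)         # common continuous F
    pooled = np.concatenate([xi, eta])
    ranks = np.argsort(np.argsort(pooled)) + 1                # ranks in the pooled sample
    W_rank = ranks[:m].sum() - m * (m + 1) / 2                # definition of W
    W_pairs = np.sum(xi[:, None] > eta[None, :])              # part a): pairs with xi_k > eta_l
    assert W_rank == W_pairs
    W[i] = W_rank

print("mean   :", W.mean(), "expected", m * n / 2)                          # part b)
print("st.dev :", W.std(), "expected", np.sqrt(m * n * (m + n + 1) / 12))   # part d)
```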

conthnicuis ^distribution llmction^Fty) fofaluTnd^W Varinb‘eS- ,wMh ,he same


.he number of tripiets ftA*> *?,£

number of those triplets (/,/, k) for which rj, > £k, rj, > £k and i < j. We put

2 2
Show that if F(x) = G{x), then E(L) = — and if F(x) # G(x), then E(L) > — .

26. Let there be performed N independent experiments. Let the possible outcomes
of every experiment be the events $A_1, \ldots, A_r$. Let $p_k = P(A_k)$ $(k = 1, 2, \ldots, r)$;
let $\nu_k$ denote the number of occurrences of the event $A_k$, where $\sum_{k=1}^{r} \nu_k = N$. If

$$\chi_N^2 = \sum_{k=1}^{r} \frac{(\nu_k - N p_k)^2}{N p_k},$$

then the distribution of $\chi_N^2$ tends as $N \to +\infty$ to the $\chi^2$-distribution with $r - 1$ degrees
of freedom:

$$\lim_{N \to +\infty} P(\chi_N^2 < x) = \frac{1}{2^{\frac{r-1}{2}}\,\Gamma\!\left(\frac{r-1}{2}\right)} \int_{0}^{x} t^{\frac{r-3}{2}}\,e^{-\frac{t}{2}}\,dt.$$

Hint. The r-dimensional distribution of the random variables vu . . . , vr is a multi¬


nomial distribution; hence the characteristic function of the joint distribution of the
. , , vk — Npk
variables -- tends to
JNPk

exp
L (i <t-cz
k=1 <•>.)’)
(Cf. Ch. VI, § 6, (23)).

27. If $\zeta_1, \zeta_2, \ldots, \zeta_n, \ldots$ are random variables such that the k-th order moment
of $\zeta_n$ tends as $n \to +\infty$ to the k-th order moment of the standard normal distribution,
i.e. if

$$\lim_{n \to +\infty} E(\zeta_n^k) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} t^k e^{-\frac{t^2}{2}}\,dt = \begin{cases} 1 \cdot 3 \cdots (k - 1) & \text{for } k \text{ even}, \\ 0 & \text{for } k \text{ odd}, \end{cases}$$

then for every real x

$$\lim_{n \to +\infty} P(\zeta_n < x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-\frac{t^2}{2}}\,dt.$$

Hint. Apply inequality (21) of § 1 to u = t CN and k = 21. We find

& _ I1 < I *£n !2'


i=0 7- (2/)!

From this we get


21—t (jt\l

~ Y E(^}
i=o J•

If we let N tend to + °° , we obtain


t
( '2
l 2 j
lim sup
ft=>- + co j- — 2'/! 5

hence, by letting / tend to +°°,

lim E{eu^) = e 2 ,
N-+ + co

which is equivalent to the desired result.

28. Let . .., be random variables which assume only the values 0 and 1
and let rjnk be the sum of all products of k distinct elements of the sequence
> ^nn •
rlnk = Z ^nii ^ni2 • • • ^nik-

xk
Show that if E(ri„k) tends to — (k = 1, 2, .. .) as n-> + oo, then the distribution

of the sum

Z Zm = Vnl
1=1

tends to the Poisson distribution with parameter A .

29. Let flf £2be independent random variables assuming the values

+ 1 and —1 with probability-^-. Put C„ = + £2 + • • • + and denote by nn

the number of the positive terms in the sequence C„ C2.• (If C* = 0, Ck is


to be considered as positive if Ck-i — +1.) Show that

P(nin — f2„ — 0) =
(2;)
^ 1J22,1 ^°r k — 0, 1, . . /I

and

2n 21 1
P(7l2n — 2k, C2n 2/) = 4r
- Z
I<,l<,n- n l (n-l+ 1)/

for k = 0, 1, . . . , n and j — 1, %

30. By using the results of Exercise 29 show the following: If y„ is any sequence
yn
of integers such that y„ and n are of the same parity and lim —— = y (y is here
n-f + a> yj n

any real number), then

7tn
lim
M -*■ +00
P\-H-<X £n yn
4 f(t! y)dt.

with
+ 00

2e
r e
_ 1/2
2 du
At | y) =
V 2n 1 - J'2 4
y
IT

for 0</<l,j>0 and /(f | j) = /(I — I | — y) for y < 0 .

7,
Remark. For y = 0 the conditional limit distribution of — with respect to the
n
condition Cnln —■► 0 is thus uniform on (0, 1). If we notice that n is, in the limit,
normally distributed, it follows that

In \
lim P — < x | £„ > 0

from these results, from P(C„ > 0) = — and from

I ~r ^ 1 1 — t= i

V 1 - / + V t + /(1 - t)

the arc sin law can be easily derived.

CHAPTER IX

APPENDIX
INTRODUCTION TO INFORMATION THEORY

§ 1. Hartley’s formula

Information theory deals with mathematical problems arising in connection


with the storage, transformation, and transmission of information.
In our everyday life we receive continuously various types of information
(e.g. a telephone number); the informations received are stored (e.g. noted
into a note-book), transmitted (told to somebody), etc. In order to use
the informations it is often necessary to transform them in various fashions.
Thus for instance in telegraphy the letters of the text are replaced by special
signs; in television the continuous parts of the image are transformed into
successive signals transmitted by electromagnetic waves. In order to treat
such problems of communication mathematically, we need first of all
a quantitative measure of information.
It is not at all obvious that the amount of information contained in a
message can be defined and even measured. If we wish to introduce such
a measure, we must abstract from form and content of the message. We
have to work like the telegraph office, where only the number of words
is counted in order to calculate the price of the telegram.
It is reasonable to measure the amount of information contained in a
message by the number of signs necessary to express its content in the
most concise possible form. Any system of signs can be used; the infor¬
mations to be measured must be transformed into the system chosen.
Thus, for instance, letters can be replaced by digits, the binary number
system can be taken instead of the decimal, and so on. If we add to the
26 letters of the English alphabet the full stop, the comma, the semicolon,
the question-mark, the note of exclamation and the space between the
words, the 32 signs so obtained can be assigned to the numbers expressible
by means of 5 digits in the binary system. (The numbers expressible by
1, 2, 3, or 4 digits are to be completed by zeros to five digits; thus 0 = 00000,
1 = 00001.) In this manner, every telegram can be expressed as a sequence
of zeros and ones; the number of necessary signs is five times the number
of the letters of the text. Every message, every information may thus be
encoded into a sequence of zeros and ones.
It seems reasonable to measure the amount of information of a message

by the number of signs necessary to express it by zeros and ones. On this


basis, messages of different forms and contents become comparable as to
the amount of information contained in them.
Since a digit can assume one of the values 0 and 1, the information
specifying which of these two possibilities occurred can be taken as the
unit of information. Thus the answer to a question which can only be
answered by “yes” or “no” contains just one unit of information, the
meaning of the particular question being irrelevant. The unit of information
is called “bit”, which is an abbreviation for “binary digit”.
When one receives some information it happens often that only a part
of it is really new. Thus for instance, if the telephone numbers in a certain
city have all 6 digits, we can be sure in advance that every inhabitant
will have a telephone number which is a number having 6 digits. Every
information may thus be considered as a distinctive sign of an element
of a set. If we know in advance that some object belongs to a certain set
E, to give full information on the thing means to specify which of the
elements of the set E is the one in question. The amount of information
received depends, evidently, on the number of the elements of E. If E
contains exactly $N = 2^n$ elements, these can be labelled by binary numbers
having n digits; any element will be uniquely characterized by a sequence
of length n consisting of zeros and ones, hence by n units of information.
From $N = 2^n$ follows $n = \log_2 N$; this gave Hartley the idea to define
by $\log_2 N$ the information necessary for the characterization of an element
of a set having N elements, even if N is not a power of 2.
At the first glance it would seem that if $2^n < N < 2^{n+1}$, then $\log_2 N$
units of information do not suffice for the characterization of the elements
of E, as somewhat more is necessary for this purpose, namely $n + 1$ units
of information. This, however, is not the case. If we consider a sequence
of symbols the terms of which are elements of E and if we replace each
term of this sequence by a sequence of zeros and ones, we need really
$n + 1$ binary digits. However, if we take from the elements of E a sequence
of k elements (some of which may be equal), there are $N^k$ such sequences
and in order to characterize any one of these we need $n_k$ zero-or-one signs,
where

$$2^{n_k - 1} \le N^k < 2^{n_k}.$$

In order to transcribe a symbol of our "alphabet" (an element of E) we
need therefore on the average $\dfrac{n_k}{k}$ binary digits, where

$$k \log_2 N < n_k \le k \log_2 N + 1.$$



It follows

$$\lim_{k \to \infty} \frac{n_k}{k} = \log_2 N.$$

Thus for every e > 0 we can find a number k such that if we take the
elements of E by ordered groups of k, then the identification of one element
requires on the average less than log2 N + e binary digits.
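The block-coding argument above is easy to verify numerically. In the following sketch (an added illustration; the alphabet size and block lengths are arbitrary choices) an alphabet of N = 27 symbols is encoded in blocks of k symbols, and the number of binary digits used per symbol approaches log2 N from above as k grows.

```python
from math import floor, log2

N = 27                                   # size of the set E (not a power of 2)
print("log2 N =", log2(N))               # about 4.755

for k in (1, 2, 5, 10, 50, 200):
    n_k = floor(k * log2(N)) + 1         # digits needed: 2**(n_k - 1) <= N**k < 2**n_k
    print(k, n_k / k)                    # binary digits per symbol, tending to log2 N
```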
The formula

$$I(E_N) = \log_2 N, \tag{1}$$

in which $I(E_N)$ represents the information necessary to characterize the
elements of a set $E_N$ of N elements, is called Hartley's formula.
Formula (1) is a mathematical definition of the amount of information
and thus needs no proof at all. Nevertheless, in order to show that this
definition is not arbitrary, we postulate some properties which the function
$I(E_N)$ should reasonably possess and show that the postulates in question
are fulfilled only by the function $\log_2 N$. These postulates are:

A. $I(E_{NM}) = I(E_N) + I(E_M)$ for $N, M = 1, 2, \ldots$;

B. $I(E_N) \le I(E_{N+1})$;

C. $I(E_2) = 1$.

Postulate C is the definition of the unit; it is not more and not less
arbitrary than the choice of the unit of some physical quantity. The meaning
of Postulate B is evident: the larger a set, the more information is gained
by . the characterization of its elements. Postulate A may be justified as
follows.
A set $E_{NM}$ of NM elements may be decomposed into N subsets, each of
M elements; let these be denoted by $E_M^{(1)}, \ldots, E_M^{(N)}$. In order to characterize
an element of $E_{NM}$ we can proceed in two steps. First we specify the subset
to which the element in question belongs. Let this subset be denoted by
$E_M^{(j)}$. We need for this specification an information $I(E_N)$, since there are
N subsets. Next we identify the element in $E_M^{(j)}$. The amount of information
needed for this purpose is equal to $I(E_M)$ since the subset contains M
elements. Now these two informations completely characterize an element
of $E_{NM}$; Postulate A expresses thus that the information is an additive
quantity.

Theorem 1. The Postulates A, B, C are fulfilled only by the function
$I(E_N) = \log_2 N$.

Proof. Let P be an integer larger than 2. Define for every integer r the
integer $s(r)$ by

$$2^{s(r)} \le P^r < 2^{s(r)+1}. \tag{2}$$

Taking the logarithms of base 2 on both sides, we get

$$\frac{s(r)}{r} \le \log_2 P < \frac{s(r) + 1}{r}. \tag{3}$$

Hence

$$\lim_{r \to \infty} \frac{s(r)}{r} = \log_2 P. \tag{4}$$

Put $f(n) = I(E_n)$. It follows from B that for $n \le m$

$$f(n) \le f(m). \tag{5}$$

(2) and (5) lead to

$$f\bigl(2^{s(r)}\bigr) \le f(P^r) \le f\bigl(2^{s(r)+1}\bigr). \tag{6}$$

According to A we can write

$$f(n^k) = k f(n) \tag{7}$$

and, by C, $f(2) = 1$; hence it follows from (6) that

$$s(r) \le r f(P) \le s(r) + 1, \tag{8}$$

thus

$$\lim_{r \to \infty} \frac{s(r)}{r} = f(P). \tag{9}$$

From (4) and (9) we conclude that $f(P) = \log_2 P$ for $P > 2$. Since $f(2) = 1$,
$f(1) = 0$, the theorem is herewith proved.
Postulate B can be replaced by the following one:

B*. $\lim_{N \to \infty} \bigl(I(E_{N+1}) - I(E_N)\bigr) = 0$;

and A can be replaced by a weaker postulate, too:

A*. If N and M are relatively prime numbers, then

$$I(E_{NM}) = I(E_N) + I(E_M).$$

P. Erdos1 proved the following

Theorem 2. $I(E_N) = \log_2 N$ is the only function which satisfies the postulates
A*, B*, and C.

Proof. Let P > 1 be any power of a prime number and f(n) = I(En)
a function satisfying A*, B*, C. Put

RE) log2 n
g(n) =f(n) - (10)
log2 P

Clearly, g(n) fulfills A*. Furthermore we have

g{n + 1) - g(n) =f(n + 1) -/(«) +


m log2
n
log 2P n+ 1

If we put

f, = 9(P + 1) - g(n), (ID


then B* implies

lim e„ - 0. (12)
/z—► CO

Hence g(n) fulfills B*. Now it is easy to see that

g(P) = o. (13)

Define for every integer n an integer n' by


i_i

i_i

i
a

R Q*,

for
■P)
)

(14)
i-1

i_i

n
^ 3

- 1 for
[ P ■p)
where (a, b) denotes the greatest common divisor of the integers a and b.

Cf. P. Erdos [2] and the article of D. K. Fadeev, The notion of entropy in a
finite probabilistic pattern (Arbeiten zur Informationstheorie, Vol. I). Fadeev found
this theorem independently from Erdos. The proof given here (cf. A. Renyi [29], [30],
[37]) is considerably simpler than that of the above two authors.

Clearly

(15)

and
n = Pn' + /,

where (n',P) =1 and 0 < / < 2P. According to (13), g(Pri) = g{n'),
hence we can write

9(n) = 9(n') + g(n) - g(Pri) = g(ri) + £ ek, (16a)


k=P„■

where sk is defined by (11).


Repeat the decomposition (16a) with n instead of n, then with n" instead
of ri, etc. If we put

«(0) = n, «°'+1) = (««)' (j = 0,1,...),

we obtain at the A:-th step

g(n) = g(n<k>) + £ £ sh. (16b)


j=1 h=P„V>
But by (15)

log2 n
hence we obtain rfk) — 0 after at most + 1 steps, hence for every n
log iP
bn
9(n) = £ ehp (17)
i=i

where h1 < h2 < . . . < h bn and

, log,n
bn < IP + 1 •
log2P
Thus, according to (12),

lim ,*"> = 0, (18)


n-*-oo l°g2«
and by (10)

lim /<"> _ (19)


log2n log2P

Let c denote the limit of the left hand side of (19). We conclude that for
every P > 1 which is a power of a prime number

f(P) = c log2 P. (20)

If the integer n > 1 has a decomposition n = P1P2... Pr, where P,


are powers of primes, then we conclude from the additivity of f(ri) that

/(«) = E f(Pi) = cjl log2 Pi = c log2 n. (21)


i=i «=i

Because of Postulate C, the value of c must be equal to 1. Furthermore,


according to A*, /(1) = 0. Theorem 2 is herewith proved.
This theorem will be used in the following section.

§ 2. Shannon’s formula

Let $E_1, E_2, \ldots, E_n$ be pairwise disjoint finite sets and put

$$E = E_1 + E_2 + \cdots + E_n.$$

Let $N_k$ be the number of elements of the set $E_k$; E has therefore $N = \sum_{k=1}^{n} N_k$
elements. We put $p_k = \dfrac{N_k}{N}$ $(k = 1, 2, \ldots, n)$. If we know about an
element of E that it belongs to a set $E_k$, for the complete determination of
this element we need some further information, the amount of which is
equal to $\log_2 N_k$. Thus in order to characterize an element which is known
to belong to a subset $E_k$ of E we need on the average the amount of information

$$I_2 = \sum_{k=1}^{n} \frac{N_k}{N} \log_2 N_k = \sum_{k=1}^{n} p_k \log_2 N p_k. \tag{1}$$

The information necessary for the complete characterization of an element
of E can therefore be decomposed into two parts. The first part, $I_1$, determines
the set $E_k$ containing the element in question; the second part, $I_2$, given
by Formula (1), identifies the element in $E_k$. If the information is additive
also in this sense, then the relation

$$\log_2 N = I_1 + I_2 = I_1 + \sum_{k=1}^{n} p_k \log_2 N p_k \tag{2}$$

must hold. Since $\sum_{k=1}^{n} p_k = 1$, it follows from (2) that in order to know
to which one of the subsets $E_k$ an element of E belongs, we need an amount
of information equal to

$$I_1 = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}. \tag{3}$$

Formula (3) was first established by Shannon and in what follows we
shall call it Shannon's formula. Simultaneously with and independently of
Shannon the same formula was also found by N. Wiener.

In particular, if $p_1 = p_2 = \ldots = p_n = \dfrac{1}{n}$, Shannon's formula reduces
to Hartley's formula (cf. Formula (1) of § 1). Analysing the above heuristic
considerations it is clear that we implicitly used three assumptions, namely
considerations it is clear that we implicitly used three assumptions, namely

1. The selection of the considered element from the set E depends on


chance; actually, we are dealing with the observed value of a random
variable.

2. All elements of E are equiprobable; the probability that an element
of E belongs to $E_k$ is therefore $p_k = \dfrac{N_k}{N}$.
3. The amounts of information associated with the different possibilities
must be “weighted” by the corresponding probabilities; essentially, we
consider thus the expectation of the information.
Thus, instead of restricting ourselves to the particular case of the random
selection of an element from a set, we are led to the more general question:
how much information is yielded by the outcome of a random experiment?
We shall see that Shannon’s Formula (3) remains valid in this more general
case too (hence not only in the case of rational values of pk).
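As a small numerical companion to Formula (3) (an added sketch, not part of the text; the distributions used are arbitrary examples), the amount of information in bits can be computed directly; the uniform distribution on n elements gives back Hartley's value log2 n.

```python
from math import log2

def shannon_information(p):
    """I(P) = sum of p_k * log2(1/p_k), with the convention 0 * log2(1/0) = 0."""
    return sum(pk * log2(1.0 / pk) for pk in p if pk > 0)

print(shannon_information([0.5, 0.5]))                   # 1 bit (Postulate III)
print(shannon_information([1.0 / 8] * 8))                # log2 8 = 3 bits (Hartley's formula)
print(shannon_information([0.5, 0.25, 0.125, 0.125]))    # 1.75 bits
print(shannon_information([1.0, 0.0]))                   # 0 bits: the sure event is uninformative
```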
The general problem can be put in the following form: Let $A_1, A_2, \ldots, A_n$
be the possible outcomes of a random experiment; put $p_k = P(A_k)$
$(k = 1, 2, \ldots, n)$. We wish to know how much information is furnished
by a single performance of the experiment. It seems reasonable to start
from the following postulates:

I. The information obtained depends only on the probability distribution
$\mathcal{P} = (p_1, p_2, \ldots, p_n)$; consequently, it will be denoted by $I(\mathcal{P})$ or
$I(p_1, p_2, \ldots, p_n)$. We suppose further that $I(p_1, p_2, \ldots, p_n)$ is a symmetric
function of its variables $p_1, p_2, \ldots, p_n$.

II. $I(p, 1 - p)$ is a continuous function of $p$ $(0 \le p \le 1)$.

III. $I\left(\dfrac{1}{2}, \dfrac{1}{2}\right) = 1$.

Furthermore, we require:

IV. The following relation holds:

$$I(p_1, p_2, \ldots, p_n) = I(p_1 + p_2, p_3, \ldots, p_n) + (p_1 + p_2)\, I\!\left(\frac{p_1}{p_1 + p_2}, \frac{p_2}{p_1 + p_2}\right). \tag{4}$$

Condition (4) can be worded as follows: Suppose that an outcome A
of an experiment with probability $\alpha = P(A)$ can occur in two ways, A'
and A'', which mutually exclude each other. Suppose that the probability
of A' is $\alpha'$ and that of A'' is $\alpha''$ $(\alpha' + \alpha'' = \alpha)$. Then if we are told in which
of the two forms A actually occurred, the amount of information thus
obtained is equal to the information associated with the distribution
$\left(\dfrac{\alpha'}{\alpha}, \dfrac{\alpha''}{\alpha}\right)$ taken with the weight $\alpha$, i.e. to $\alpha\, I\!\left(\dfrac{\alpha'}{\alpha}, \dfrac{\alpha''}{\alpha}\right)$. Postulate IV
requires at the same time the additivity of information as well as the
"weighting" of the different informations by the corresponding probabilities.
bilities.
We shall show that Postulates I-IV are fulfilled only by the function
defined by Shannon’s Formula (3). The above set of Postulates I-IV is
due to D. K. Fadeev; it is a simplified form of a system of postulates given
by A. I. Khinchin. In § 6 we shall characterize Shannon’s information
by different postulates which lead also to alternative measures of infor¬
mation.
In the present section we prove

Theorem 1. If to every discrete, finite probability distribution there corresponds
a number $I(\mathcal{P}) = I(p_1, p_2, \ldots, p_n)$ so that the above Postulates I-IV
are satisfied, then1

$$I(\mathcal{P}) = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}. \tag{5}$$

Proof. The proof consists of six steps.

a) We show first that

$$I(1) = 0, \tag{6}$$

1 In view of $\lim_{x \to 0} x \log_2 \frac{1}{x} = 0$ we put $0 \log_2 \frac{1}{0} = 0$.

i.e. that the occurrence of the sure event does not give us any information.
In fact, if n = 2, px = 1, p2 = 0, it follows from IV that

7(1, 0)- 7(1) + 7(1,0);

thus (6) holds. Similarly, we have

I(Pl,Pz. ■ ■ ;Pn> 0) = 7(/?i, p2, . . ,,pn).

b) If we put

we can deduce from (4) by induction on m the somewhat more general


relation

I(.Pi> • • •> Pnv> Pm + 1’ • • -’Pm+n) =

Ih
' Pm +1’ • • Pm + n) T Sm I (4')
$m

According to (4) the formula holds for m = 2. Suppose that it is already


proved up to m — 1. Then

I(Pl+P2,P3, ■ • • 3 Pm + n) =

P1 + P2.
— Pm + 1’ • • •> Pm + n) T $m ^ (7a)

furthermore, because of (4),

■ ■ • 5 Pm + n) = I(Pl+P*,P3, ■ • ;Pm+n) +

Pi P2
+ {Pl + P2) I (7b)
,7*1 + P2 P1+P2
and
P1+P2 Pi P2
smI + (Pi + P2) I
Jm Pi T P2 P1+P2)

Pi P,
= smI (7c)
tn m

(4') follows immediately from (7a), (7b) and (7c).



c) We prove now a still more general relation:

I(Pll> • • •■>P\m,i • • •■’PrAi • • •> Pnmn)

" ‘ Pj\ Pjm,


= I(Sl, • • •> Sn) + Y SJ I (4")
7 =1 Si
where we have put
m/
sj = Y P» 0 = 1,2,.. (8)
i=i
By assumption.
n n nij
Ysj =y=i/=l
j=i
YY Pn = L

Formula (4") may be considered as a theorem about the information


associated with a mixture of distributions. In effect, if denotes the
Pi i nmi
distribution 5 • * *? , the left hand side of (4") is the information
5/
associated with the mixture of the distributions <50 with weights Sj. According
to (4") this information is equal to the sum of the average of the informations
I{SPj) with weights Sj and the information associated with the mixing distri¬
bution S = ...sn):

(4"0
y=i y'=l

(4") can be obtained immediately by a repeated application of (4'), taking


into account the assumption that I(plf. . .,pn) is a symmetric function
of/?i,
1 1 1
d) Let^„ be the distribution and put f{n) = Iff f).
n n n
From (4") we deduce the functional equation

f{n, m) =/(») +/(m). (9)

In fact, if in (4") all nij are equal to m and all pJt are equal to ——, the
mn
left hand side is equal to f(nm) and the right hand side to f(n) + /(m),
hence we get (9).
e) If we apply (4') to the case when all probabilities are equal and if we
unite them all except the first one, we obtain

[1 1 1) 1
/(«) = / 5 1 ~ - + 1 -— /(» - !)• (10)
n n ) n

Now we show that

lim [/(«) -f(n - 1)] = 0. (ID


Put

/(») -/(« -\)-dn and I — ,1 - = S„ (n = 2, 3,...).


n n

It follows from our assumptions that

lim 5„ — 0. (12)

Indeed the assumed continuity of I(p, 1 — p) implies

lim <5„ = 7(0, 1) = 7(1),


n-*- oo

and according to (6) 7(1) = 0. On the other hand

f{n — 1) = d2 + d3 + ... + dn_1;

(10) is therefore equivalent to

d2 + d3 + ... + d„i
dn = d„ + (13)
n

Multiplying both sides by n and adding the equalities obtained for


n = 2, 3,. . .,7V, we get
N N

E (ndn + d2 + ... + d„_i) = E nd„, (14)


n=2 n=2

and by a simple transformation


N

^
N
J
E
n=2
r dk= N
(15)
N + 1A k=2
/r = 9.

E
«=2
n

Because of (12) the right hand side of (15) tends to zero for jV -> oo. Hence
we have
2 N
lim-— V dk = 0. (16)
n +oo y +1 k=2

From (12) and (16) it follows because of (13)

lim dN = 0, (17)
N->- + oo

hence we obtain (ll).1


We have seen that f(n) fulfills conditions A*, B*, and C of the preceding
section; hence by Theorem 2 of § 1

f(n) = I^n) = \og2n. (18)

f) We can now finish rapidly the proof of our theorem.

Consider the function I(p, 1 — p). Let first p be rational, p — — with


b
integers a and b (a < b). If we apply (4") with

n = 2, m1 = a, m2 = b - a, Pll = p12 = ... = p2rtli = ~ ,

we find
a a a , a
logo b — I —— , 1 - + — log2 a + log2 (b-a). (19)
T b

Since by assumption I(p, 1 — p) is continuous, we have for any p between


0 and 1

J(P> l~P) = P loga — + (1 - P) log2 —-— , (20)


P 1 -P

hence (5) is proved for n — 2. We show now by induction that (5) holds in
the general case too. Suppose that (5) is valid for a certain integer n\ let
SP = (p1?. . .,pn+1) be any distribution having n + 1 terms. We conclude
from (4) and (20) that
n-1 i
7Ol, • • ;Pn +1) = £ Pk log-2 - +
k=\ Pk
1 f ! Pn Pn+1 |]
+ (.Pn+Pn + l) log..
- [Pn+Pn + 1 [Pn+Pt + 1 ’Pn+Pn +1).

We use here a well-known theorem of the theory of divergent series (Mercer’s


theorem) which says: If sn is a sequence, fulfilling

lim + (1 - a) ] = j
ft —► oo l n J

(0 < a < 1), then we have also lim = s. We need only the particular case
n-+oo

a = — (cf. G. H. Hardy [1], Ch. V).



hence because of (20),

n+1 J

I(Pl, • • • , Pn + l) = Z Pk loS2-» (21)


k=1 Pk

and thus the theorem is proved for every integer n.

Remark. It is easy to see that Postulate IV implies the additivity of information.
Suppose that the experiments A and B are independent of each
other; let $A_j$ $(j = 1, 2, \ldots, m)$ be the possible outcomes of A, and $B_k$
$(k = 1, 2, \ldots, n)$ those of B. Let $p_j = P(A_j)$, $q_k = P(B_k)$ denote the corresponding
probabilities and put $\mathcal{P} = (p_1, \ldots, p_m)$, $\mathcal{Q} = (q_1, \ldots, q_n)$. To perform
simultaneously A and B means the same as to perform an experiment
AB having the events $A_j B_k$ as possible outcomes with the corresponding
probabilities $p_j q_k$. The distribution $\{p_j q_k\} = \mathcal{P} * \mathcal{Q}$ is called the direct
product of the distributions $\mathcal{P}$ and $\mathcal{Q}$. If we apply (4'') to $p_1 q_1, \ldots, p_1 q_n$,
$p_2 q_1, \ldots, p_m q_n$, we find that

$$I(\mathcal{P} * \mathcal{Q}) = I(\mathcal{P}) + I(\mathcal{Q}). \tag{22}$$

However, (4) does not follow from (22). This is most easily demonstrated
by the quantity

$$I_2(p_1, \ldots, p_n) = -\log_2\bigl(p_1^2 + \cdots + p_n^2\bigr), \tag{23}$$

which fulfils Postulates I-III and Formula (22), without fulfilling (4). (If it
fulfilled (4), it would be equal, by the just-proved theorem, to $\sum_{k=1}^{n} p_k \log_2 \dfrac{1}{p_k}$,
which is not the case.) We shall see in § 6 that the quantity (23) too can be
considered as a measure of the information associated with the distribution
$\mathcal{P} = (p_1, \ldots, p_n)$. In fact, we shall define a class of information measures
depending on a parameter $\alpha$ which contains both Shannon's information
(for $\alpha = 1$) and the quantity (23) (for $\alpha = 2$).
We add some further remarks.

1. In connection with the notion of information we also have to mention


the concept of uncertainty. If we receive some information, the previously
existing uncertainty will be diminished. The meaning of information is
precisely this diminishing of uncertainty.
The uncertainty with respect to an outcome of an experiment may be
considered as numerically equal to the information furnished by the occur¬
rence of this outcome; thus uncertainty can also be measured. We could have
started equally well from the notion of uncertainty; to speak about infor-
554 INTRODUCTION TO INFORMATION THEORY [IX, § 3

mation or about uncertainty means essentially the same thing: in the first
case we consider an experiment which has been performed, in the second
case an experiment not yet performed. The two terminologies will be used
alternatively in order to obtain the simplest possible formulation of our re¬
sults.
2. The quantity (5) is frequently called the entropy of the distribution
£6 = (plf..pn). Indeed, there is a strong connection between the notion of
entropy in thermodynamics and the notion of information (or uncertainty).
L. Boltzmann was the first to emphasize the probabilistic meaning of the
thermodynamical entropy and thus he may be considered as a pioneer of
information theory. It would even be proper to call Formula (5) the Boltz-
mann-Shannon formula. Boltzmann proved that the entropy of a physical
system can be considered as a measure of the disorder in the system. In case
of a physical system having many degrees of freedom (e.g. a perfect gas)
the number measuring the disorder of the system measures also the uncer¬
tainty concerning the states of the individual particles.
3. In order to avoid possible misunderstandings it should be emphasized
that when we speak about information, what we have in mind is not the
subjective “information” possessed by a particular observer. The terminol¬
ogy is really somewhat misleading as it seems to support that the informa¬
tion depends somehow on the observer. In reality the information contained
in an observation is a quantity independent of the fact whether it does or
does not reach the perception of an observer (be it a man or some registering
device or a computer). The notion of uncertainty should also be interpreted
in an objective sense; what we have in mind is not the subjective “uncertain¬
ty” existing in the mind of the observer concerning the outcomes of an experi¬
ment; it is an uncertainty due to the fact that really several possibilities are
to be taken into account. The measure of uncertainty does not depend on
anything else than these possible events and in this sense it is entirely objec¬
tive. The above mentioned relation between information and thermodynam¬
ical entropy is noteworthy in this respect too.

§ 3. Conditional and relative information

We associated with every discrete finite probability distribution 66 =


— (px,.. .,pn) the information 1(66). If £ is a random variable assuming
the distinct values xx, x2,. . ., xn with probabilities px,p2,. ■ -,pn> we may
say that 1(66) is the information contained in the value of £ and we may write
1(0 instead of 1(66). It must, however, be remembered that 1(0 does not
depend on the values xl5 x2,..., xn of £. 1(0 remains invariant, when we
replace xx, x2,. . ., xn by any other system of mutually different numbers
IX, § 3] CONDITIONAL AND RELATIVE INFORMATION 555

x'i, x2,. . x'n. The observation of the random variable assuming the values
x{, x'2,. . ., x'n with probabilities pl} p2,. . .,pn contains the same amount
of information as the observation of Consequently, if h(x) is a function
such that h(x) / h(x') for x ^ x', we have I(h(£)) = /(£). However, with¬
out the condition h(x) A h(x) for x A x' we can state only that I{h{£)) <
< /(£). This follows from the evident inequality

(P + 4) l°g2—\—<p\og2— + gloga— (1)


p+q p q

for p > 0, q > 0, p + q < 1.


We shall often need Jensen’s inequality: If g(x) is a convex function on
an interval (a, b), if xl5 x2,. .xn are arbitrary real numbers a < xk < b
n

and if wl5 vv2,.. \vn are positive numbers with £ wk = 1, then we have
k=1

n n
g( E wk xk) g{xk). (2)
k=1 k=l

Inequality (2) can readily be proved by a geometrical reasoning. Consider


in the plane (x, y) the points (xk, g(xk)), k— 1,2,Suppose that
masses wk are situated in these points; the center of gravity of the so formed
system will evidently lie in the smallest convex polygon containing the men¬
tioned points. Since all points lie on the convex curve y = #(x), the center
of gravity lies above this curve. Let x and y denote its coordinates, then
n n
g{x) < y. As clearly x = E wk*k and y = E wkd(xk.), we get (2). It can
k=1 k=1
be seen immediately that if <?(x) is not linear on any subinterval, then
in (2) the equality sign can occur only if x1 = x2 = ,... = xB.
If #(x) is concave, we have instead of (2) the inequality

E wk xk) > E wk q(xk), (2r)


k=l k=1

since now —g{x) is convex.


From Jensen’s inequality (2) we obtain

I(pl,p2,...,p„)^log2n. (3)

It suffices for this to apply (2) to the convex function y = xlog2 x (x > 0);

with xk = pk, wk == — (k = 1, 2,. . n) we get (3). The equality sign holds


n
556 INTRODUCTION TO INFORMATION THEORY [IX, § 3

for px = p2 — . . . = pn = — only. That is, if there are n possibilities for the

outcome of an experiment, the uncertainty will be maximal when all possi¬


bilities are equjprobable.
Formula (3) can be generalized as follows: Let Sfi = (p1,p2, • • .,p„) be a
probability distribution and W = (wjk) a stochastic matrix with n rows and
n columns; the elements of W are thus nonnegative and the sum of the terms
of each row is equal to 1. Put

qk = Y,PjwJk (&= 1,2,...,«) •


j=i

Then

k=1
Z ?* = =1
Z j
pj Z
k=1
wjk = Z
j=1
pj=1’

hence Q = (qx, q2,..qn) is a probability distribution and we find

I{&) < 7(Q). (4)


In fact, by putting

g(x) = x log2 *, Xj = pp Wj = wjk (J =1,2,..., ri)

we can derive from (2) the inequality


n
(Ik log2 qk < Z WjkPj log2 Pj. (5)
7=1

If in (5) we sum over k, we obtain (4). Inequality (4) expresses that the un¬
certainty for a distribution is larger if the terms of the distribution are closer
to each other.
We introduce now the notion of conditional information. Let £ and q be
two random variables having finite discrete distributions. Let jcl5 x2,. . ., xm
be the distinct values taken on by £ with positive probabilities, yx, y2,. . ., yn
those by q. We write:

P^=Xj)=Pj U =1,2,..., rri); = (px,p2,..., pj. (6a)

p(n = yk) = Qk (k= 1,2, Q = (glt q2,..., q„). (6b)

j = 1,2,..rri _
P(£ = Xj, q=yk) = rjk u_i ^ ; *2 = (/-in • • •> rmny (7)

p(<» = i V = F/c) = Pifc ; = (Pi\k,. . .,pm\k) (k= 1,2,...,«). (8a)

= yk I £ = *,) = qk\j ; 0/ = (qiu,. ■ ; qnlj) (j =1,2,..., m). (8b)


IX, § 3] CONDITIONAL AND RELATIVE INFORMATION 557

According to the definition of conditional probability we have •

0& Pi<lk\) QkPj\k 5 (9)


further
n

X
k= 1
rik = Pi (j = 1, 2, ..., 777) (10a)

and
m
X rjk = dk (k = 1,2,..., n). (10b)
7=1

We define now the conditional information /(£ | 77) contained in £ with


respect to the condition that 77 assumes a given value; this will be the expec¬
tation of the information associated with the distribution 6^k:
n n m n

7(C I n) = z qk K&k) = X X rjk log2 —-. (11)


fc = l fc=l7=l fjk

On the other hand, if /((£, 77)) denotes the information associated with the
two-dimensional distribution of ^ and 77:
m n I
/(«. 1)) = - I I log, —, (12)
7=1 fc=l rjk
then we have

/(({, >1)) = /(>)) + /({ I >!)• (13)


Formula (13) follows from (9), (10b), (11) and (12):
m n
I n) = /((£, v)) + X X rJk 1og2 qk = 7((£> »7)) - /(>/).
7=1 k=1

It follows from the definition that /(£ | rj) = /(£) when £ and 77 are inde¬
pendent, hence (13) reduces in this case to the relation obtained in the pre¬
ceding section:
/(«, i,)) = 1(0 + /(,). (14)

We may consider (13) as a generalization of the theorem on the additivity


of the information: the information contained in the pair of values (£, 77)
is the sum of the information contained in the value of 77 and of the condi¬
tional information contained in the value of £ when we know that 77 takes
on a certain value.
Now we show that in general the relation

/«{, 0) s ko + m (15)
558 INTRODUCTION TO INFORMATION THEORY [IX, § 3

holds, where the sign of equality occurs only it £ and rj are independent.
According to (13), relation (15) is equivalent to

/(£ I n) £ m (16)
which means that the “conditional” uncertainty of £ for a known value of
rj cannot exceed the “unconditional” uncertainty of £. By taking (11) into
account we can write
m n

K£\n) = - Z Z
j=1k=l
QkPj\k i^g2 Pj\k' (17)

If we apply Jensen’s inequality to the function x log2 x with xk = pJ[k, wk =


= qk (k = 1,2,..., n), we obtain, in view of (9),
n

Pj l°g2 Pj ^ Z
k= 1
9kPl\k ,Qg2 Pj\k‘ (18)

From (17) and (18) follows immediately (16), and hence (15) too. The sign
of equality in (18) can only hold if all pJlk (k = 1, 2,...,«) are equal, i.e.
when £ and r\ are independent. We conclude from (13) that

m - /(* i ro = m + m - /((<?, n)y, (19)


the right hand side being symmetric in £ and rj, we have

m - nt i n) = m - «>i i o- (20)
The left hand side of (19) may be interpreted as the decrease of uncertainty
due to the knowledge of 17, or as the information about £ which can be
gained from the value of 17. We call this the relative information given
by rj about £ and denote it by /(£, rj); we have thus

/(£, n) = m - I r,). (21a)

(We must not confuse /(£, 17) with the information /((<!;, 17)) associated with
the two-dimensional distribution of £ and 17.) According to (20)

/(£, 17) = /(17, 0; (21b)

hence the value of 17 gives the same amount of information about ^ as the
value of £ gives about 17.
/(£, r\) can also be defined by the symmetric expression

m n
Ki. n) = 1(0 + Hi) - /(«, „))=££ r,t log, -2-. (22)
j=lk=l Pj<lk
IX, § 3] CONDITIONAL AND RELATIVE INFORMATION 559

According to (16) we have

KL 0 ^ 0, (23)

where the equality sign holds only if £ and i/ are independent. Hence if ^
and rj are not independent, the value of rj gives always information about
On the other hand, from (21a) and (21b) follows

/(£, rj) < min (1(0, 1(0). (24)

Here too, it is easy to find the cases in which the equality sign holds. In fact,
if I(£, 0 = 1(0, then /(£ | rj) = 0, which can occur only if the value of ^ is
uniquely determined by the value of rj, i.e. if f = f(rj). Similarly, 1(0 0 =
= I(rj) can occur only if ij = g(0- The quantity 1(0 0 can be considered
as a measure of the stochastic dependence between the random variables
^ and rj.
The relation 1(0 0 = I(l, 0, expressing that rj contains (on the average)
just as much information about £ as £ about rj, seems to be at the first glance
surprising, but a deeper consideration shows it to be quite natural.
The following example is enlightening. Let rj be a random variable sym¬
metrically distributed with respect to the origin, with P(rj = 0) = 0, and
put £= if. There corresponds to every value of rj one and only one value
of 0 while conversely £, determines rj only up to its sign. In spite of this,
£ gives just as much information on rj as rj gives on £ (viz. 1(0) ); the differ¬
ence is that this information suffices for the complete characterization of
£ but does not determine rj completely (only the value of \rj\). In fact, 1(0 =
= 1(0 + 1 (if we know already the absolute value of rj, rj can still take on

the values ±\tj\ with probability —, hence one unit of uncertainty must

be added).
We prove now the inequality

< /({, >1), (25)

which is equivalent to

l(z\f(0) > /(£ Iff). (26)

If instead of rj we observe a function f(0 of rj, then we obtain from the value
off(0 at most as much information on £ as from the value of g; the uncer¬
tainty of £ given the value of f(if) is thus not less than its uncertainty given
the value of rj.
560 INTRODUCTION TO INFORMATION THEORY [IX, § 4

Proof of (26). If f(yk) A /(>’/) for k ^ l, we have equality in (25); if for


instance f(yk) = ftyi) # /(Tm) for m ^ k, l then to the terms
qkI{^k) + q11(^1) (cf- (11)) figuring on the right hand side of (26) there
corresponds a single term on the left hand side, viz. (qk + qi)
where is the conditional distribution of £ under the condition that rj
takes on one of the values yk or yt. Clearly
QkPjlk + <llPj\l
P(Z = Xj \n = yk °r rj = y,) =
Ik + qi

If we apply Jensen’s inequality to the convex function x log2 x, we obtain

(qk + qi) l(&k,i) > qk I(&k) + q, /(^).

The case when several values off(yk) are equal to each other can be dealt
with similarly. Thus we proved (26), hence (25) too.

§ 4. The gain of information

The same example which served to derive Shannon’s formula can be used
to get a heuristic idea of the notion of gain of information. Let E be a set
containing N elements and let Ex,. . ,,En be a partition of this set. If Nk
n
is the number of elements of Ek, we have N — ]>] Nk and we put pk =
k=1

= Let the elements of E be labelled from 1 to N, E = {e1} e2, . . ., eN}


and let the elements of Ek (k = 1,. . ., n) be labelled from 1 to Nk. An ele¬

ment of E chosen at random (all elements having the same probability

of being chosen) may be characterized in two distinct manners: a) by giving


its serial number in E which we denote by ^; b) by giving the set Ek to which
it belongs and its serial number in Ek. The index k of the relevant set Ek
is a random variable which we denote by tj. The index of the element in
question in the set En will be denoted by £. Then we have

m = m + 7(c 1 q) (i)
where, clearly

/(0 = log2 A, l(r\) = Y P/tlog2 —


k=1 Pk
and
n

i(C I n)=Y
k=1
Pk iog2 Nk.
IX, § 4] THE GAIN OF INFORMATION 561

Now let E' be a nonempty subset of E and let E'k(k = 1,2,...,«) denote
the intersection of Ek and E'. Let Nk be the number of elements of Ek,
n
N' the number of elements of E' and put qk . Then we have X N'k —
n k=1
— N', hence X 9k — 1- Suppose that we know about an element chosen
k=l
at random that it belongs to E'; what amount of information will be fur¬
nished hereby about rjl The original (a priori) distribution of q was
^(Pi> • • • 5 Pn) after the information telling us that the chosen element
belongs to E', rj has the (a posteriori) distribution Q = (qu q2,.. ., qn). At the
first sight one could think that the information gained is /(^) - 7(0. This,
however, cannot be true, since 7(«i^) - 7(Q) may be negative, while the gain
of information must always be positive. The quantity /(^) - I(Q) is the
decrease of uncertainty of //; we are, however, looking for the gain of infor¬
mation with respect to y\ resulting from the knowledge that e{ belongs to E'.
Let the quantity looked for be denoted by 7(Q || &)1’, it can be determined
by the following reasoning: The statement e4 £E' contains the information

l°g2 This information consists of two parts; first the information given

by the proposition e^ £ E' about the value of rj, next the information given
by this proposition about the value of £ if q is already known. The second
part is easy to calculate; in fact if q — k, the information obtained is equal
, Nk
to log2 —p- and since this information presents itself with probability qk,

the information about the value of £ is

£ 9k log2 -T77
&=i Nk
Hence

l°g2 ~7T = KQ. II &) + £ <lk log2 (2)


iV k=1
Since
n
NNk 9k
£ 9k = i and
k=1 N'Nu Pk
we find that

7(Q||^) = £ qk log2 — . (3)


k—X Pk
The quantity I(Q || d?5) depends only on the distributions d?3 and Q; it

1 We use a double bar||in /(Qll-S0) in order to avoid confusion with the con¬
ditional information /(£ | rj).
562 INTRODUCTION TO INFORMATION THEORY [IX, § 4

follows thus from Jensen’s inequality that we have always

/(Q||^>0. (4)

The equality sign occurs in (4) only if the distributions S3 and Q are identical.
/(Q || S3) is defined only if every pk is positive and if there exists a one-to-one
correspondence between the individual terms of the two distributions. The
quantity I(Q || 3s), defined by (3), will be called the gain of information
resulting from the replacement of the (a priori) distribution S3 by the (a pos¬
teriori) distribution Q.
The gain of information is one of the most important notions in informa¬
tion theory; it may even be considered as the fundamental one, from which
all others can be derived. In § 6 we shall build up information theory in this
fashion; the gain of information, as a basic concept, will be defined by pos¬
tulates.
The relative information introduced in the preceding section can be ex¬
pressed as follows by means of the gain of information. Let £ and rj be ran¬
dom variables assuming the distinct values xl5 x2,. . ., xm and yu y2; . .., y„
with positive probabilities pj = P{i; = X/) and qk = P(q — yk) respectively;
put {Pl> • • Q. ' ((7ij (?2> • • •! ^n)?

P(£ = Xj, r] = yk) = rjk, P(f =Xj\rj = yk) = pJ{k,

^k ~ (Pl\k> P2\ki • • •} Pm\k)w


Then we have

ri) = Y VkK&k II ^)- (5)


k=l
Indeed by (3)
m

j=i Pj
hence, because of qkPj\k — rik

n m n

Z Qk K&k 11 = Z Z fk iog2 • (6)


k=i j=i k=i Pjqk

From this (5) can be derived by Formula (22) of § 3. Formula (5) means
that the amount of information on £ which is contained in the value of i/
is equal to the expectation of the gain of information obtained by replacing
the distribution tS/3 of £ by the conditional distribution Sfik.
If & = (Pl,...,Pn) is any distribution having n terms and if %n —
IX, § 4] THE GAIN OF INFORMATION 563

’ 1 1
we have
n ’ n
n

K& 11 &n) = E Pk log2 npk = log2 n - 1(,9) = l(&n) - I(&). (7)


k=1

The gain of information obtained by replacing the uniform distribution by


the distribution Sfi is thus equal in this case to the decrease of uncertainty.
But in general the quantities I(Q || and I(&>) - 1(Q) are not equal.
Though in general I(&k || JP) # — I(S$k), Formula (5) still expresses
that the averages of these two quantities are equal. For according to the
first definition of relative information,

/(£, n) = m - /({| n) = X qk (i(&) -1(&k)),


k=1

hence, according to (5)

£ ?* (m-W) = £ n&t I! &)■ (8)


k=l *=1

But only the sums on the two sides of (8) are equal; the single terms have not
necessarily the same value.
The following symmetric expression is also often considered in informa¬
tion theory:

J(&, Q) = J(Q||SP) +1(& || Q). (9)

This expression was first studied by Jeffreys. A simple calculation shows


that

J(&, Q) = £ (pk - qk) log2 —. (10)


&=1 Qk

Let us remark that while certain terms of the sum (3) defining I(Q || d?3)
may be negative and we know only that the sum itself is nonnegative, on
the contrary, on the right hand side of (10) all terms are nonnegative.
The relative information can be expressed by means of the gain of infor¬
mation in still another way. If <32 is the distribution {rJk}, Sft * Q the dis¬
tribution {pjqk}, then it follows from Formula (22) of § 3 that

7(£,»7) = /(J§>||^*Q). (11)

The information concerning £ contained in the value of is thus equal to


the gain of information obtained by replacing the direct product of the dis¬
tributions of £ and rj by their actual joint distribution.
564 INTRODUCTION TO INFORMATION THEORY [IX, § 5

§ 5. The statistical meaning of information

Let the possible outcomes of an experiment^ be denoted by Ax, A2,. . .,


Ar; let their probabilities be P(Ak) = pk (k = 1,2,.. ., r). Let Sft denote
the distribution (pl,p2, ■ • -,Pr)l consider n independent repetitions of the
experiment . The probability of an outcome of this sequence of experiments
(when we take into account the order of the experiments) is given by
7r„ = Pi,pl‘,. . .,pvrr, where vk means the number of experiments leading
to the outcome Ak. Since the vk are random variables, Ttn is a random vari¬
able too. The expectation of vk being equal to npk, we have

E (— log2 — | = £ Pk log2 — = I(&). (1)


l n 7i nJ ftti Pk

The information I(SA) may be interpreted as the expectation of — Iog2 — .


n 7in
According to the law of large numbers

t- Vk
lim st — = pk9
n-*.m ^

hence

lim P
n
log2 --I{£P)
7t„
< S = 1 (2)
n-*- oo

for every e > 0 (see Ch. VII, § 14, Exercise 6).

If instead of the expectation of log2 — we consider the analogous


n 7in

quantity — log2 ——then we obtain


n E(7i„)
1
— l0g2 = l0g2
n E{n„)
(Z pI)
k=1
This quantity was already mentioned in § 2; it can also be considered as a
measure of information. We shall return to this question in § 6.
There is still another point of view showing that the definition of informa¬
tion is suitable. The unit of information (“bit”) was defined as the amount
of information contained in a symbol which can assume only the two values
0 and 1. Such a symbol will be called a 0— 1 -symbol. We shall now consider
whether the outcome of an experiment can actually be characterized on
the average by I{SA) 0-1-symbols. We show that this is really possible,
if certain highly improbable events are neglected.
IX, § 5] STATISTICAL MEANING OF INFORMATION 565

Theorem 1. Let be an experiment having possible outcomes Au A2,. .


Ar occurring with probabilities pk = P(Ak) > 0 (k = 1,2,.. ,,r). Put SA =
= (Pi, p2, ■ • -,Pr)- Then for any given e > 0 and 5 > 0 there exists an n0
depending only on 9ft, e, and 5 such that if there are performed n(n > n0)
independent experiments 9ft, then the outcome of this sequence of experiments
can with a probability greater than 1—5 be expressed uniquely byn(I(9ft) + e)
0—1 -symbols. If q > 0 is arbitrarily small, it is impossible to character¬
ize the outcome of the sequence of experiments by less than n{l{9ft) — e)
0— l-symbols with a probability greater than or equal to o whenever n > n'0,
where if depends only on 9\ e, and o.
Remark. This means that, if the experiment is sufficiently often repeat¬
ed, for the description of an outcome of the experiment one does not need,
on the average, more than I{99) + e 0—1-symbols; hence the statement that
the outcome of ^ft contains the amount of information I{9ft) has a quite
definite meaning.

Proof. Choose /ix large enough so that n > nx should imply

log2 —-1{9s) < >1-5. (3)


n 7t„

This is, in view of (2), always possible. This means that the sequences of
outcomes obtained by repeating n times the experiment ^ can be parti¬
tioned into two classes: the first consists of the sequences for which

— log2 —1-1{9S) < (4)


n n„

the second of the remaining ones. According to (3) the probability that a
sequence belongs to the second class is less than 5. Let C„ denote the number
of sequences of the first class, let q±, q2,. . ., qCn be their probabilities. By
(4) we have
-n[l{Cp) +
)< {j — 2,..., C„) (5)
or, by adding these inequalities,

-»{l(Cfi)+ *)
c„
C„ • 2 < I cIj- (6)
/=!
The sum on the right hand side cannot exceed 1, since it represents precisely
the probability that a sequence belongs to the first class. Therefore we have

n {!(.&)+1)
C <2 (7)
566 INTRODUCTION TO INFORMATION THEORY [IX, § 5

Now let us number the events of the first class from 1 to Cn and write

these numbers in the binary system. For this n /(J?5) + + 1 binary digits

are needed. There can be found an n2 such that for n> n2 the inequality

!(&) + + 1 < + e) (8)

holds. Put 770 = max (nx, n2); it is clear that n0 depends only on s, 5, and &
and satisfies the requirements of the theorem. It is easy to show that with
large probability - s) 0- 1-symbols are not sufficient to describe
the outcome of the sequence of experiments. To see this, subdivide again
the set of the sequences into two classes: let the first class contain the se¬
quences for which

/(^5) _ JL iog2 — < (9)


n jin 2 v

and the second the remaining ones. Choose an n3 such that for n > n3 the
probability of (9) exceeds 1-5; this is possible because of (2). Let
Dn denote the number of sequences in the first class and let rlt r2,. . ., rDn
be the corresponding probabilities. We have then

2-"('<«-U>0 0= 1,2,.z>„). (10)


Furthermore, by assumption

Z n>\-8. (ii)
7=1

If we select some outcomes and assign to them sequences of zeros and


ones of length not exceeding — e), the number of these sequences
will be less than 2n(/(^-£>, hence the total probability of the selected out¬
comes will be at most

2«(/(^>)-£)2~"(/(’^)_D + ^ = 2~”~E + <5

The total probability of the outcomes not considered is thus at least


ne
1-5—2 2 > 1 — 25, provided that n > rc4.
Suppose that n>n'0 = max (n3, n4) and that, contrary to the statement
in the second half of the theorem, it is possible to characterize the outcome
of the sequence of experiments by less than n(I(6- e) 0- 1-symbols with
IX, § 5] STATISTICAL MEANING OF INFORMATION 567

a probability > q > 0. If we choose <5 such that 25 < g, then this contradicts
what was just proved.
Theorem 1 is therefore completely proved. It can be sharpened in the
following manner:

Theorem 2. For every 5 > 0 there can be given an n0 such that for n > n0
the outcome of n independent experiments can be uniquely expressed, with
probability > 1 — 5, by at most nl{ffi) + K^Jh 0—1 -symbols; K is here a
positive constant which depends only on 5.
However, there corresponds to every o between 0 and 1 a constant K' and an
integer n'0 such that a unique characterization of the outcome of a sequence
of experiments becomes impossible (with a probability > g, for n > n'0) by
less than — Kf/n 0—1 -symbols.

Proof. It is easy to show1 that the distribution of the random variable

( 1 1 ^ Vk ~ "Pk
Jn — log2 — - I(&) = I —7=— i°g2 —
1
n n„ k=1 fn Pk

tends to the normal distribution as n -*■ oo. There exists thus a constants
which depends only on <5 such that we have for sufficiently large n

K
— log2 —-/(^) >\-5. (12)
n n„ < Jn
The continuation of the proof runs exactly as that of Theorem 1.
Theorem 1 can also be considered as a justification of Shannon’s defini¬
tion of information. The statement of Theorem 1 can be translated into the
language of communication theory as follows: Let a message source at
the moment t (t = 1,2,...) emit a random signal let x1; x2,. . ., xr
be the possible signals; let pk = P(ft = xk) denote the probabilities of the
individual signals. These probabilities are supposed to be independent of t.
Assume that the signals are independent of each other. Assume further that
for transmission the signals must be transformed (encoded) since the
“channel” (transmission network) can only transmit two signs.2 (This is the
case e.g. if the channel works with electric current and at every instant only
two cases are possible: the current is either on or off.) Let 0 denote one of

1 The proof is the same as that in Exercise 26 of Ch. VIII, § 12.


2 In information theory the word “channel” has a very general sense: it means
any process capable to transmit information.
568 INTRODUCTION TO INFORMATION THEORY [IX, § 5

the signs and 1 the other. The question is then, how many 0 or 1 symbols
are necessary for the transmission of the information contained in n signs
£i> £2, ■■ - An furnished by the source. According to Theorem 1 with proba¬
bility arbitrarily near to 1 less than n(I(^) + s) symbols are required, pro¬
vided that n is sufficiently large. This shows the importance of the quantity
for communication engineering.

Let us mention an important particular case. If px = p2 = Pr =


r
then I(6P) = logo r; therefore in order to encode a signal of such a source
into 0-1-symbols, on the average log2 r symbols are necessary. (Of course
this can be shown directly.) If for instance a number written in the decimal
system is transcribed into a binary system, the number of digits increases on
the average by the factor log2 10 = 3.3219 . . . . This is of importance for
computers, which work in the binary system.
If the source emits signals xk with probabilities pk (k = 1, 2,. . ., r) and

if the channel can transmit ^ different signs, approximately nIty'^ sjgns


logo ^
are necessary in order to transmit a message of n signs if the most econom¬
ic coding is applied.
It is to be noticed that optimal or nearly optimal codings are very compli¬
cated and are feasible only for long sequences of signals. Hence in practice
usually such codes are employed which to some extent take into account
the statistical nature of the source, but are more easy to handle than the
nearly optimal codings. In particular, the signals are coded one by one or
by small groups (as for instance in the encoding of letters into Morse sig¬
nals).
The message sources encountered in practice are generally much more
complicated than those described above. The individual signals are, in
general, not independent of each other. E.g. in every natural language the
letters have not only different probabilities, but the probability of a letter
depends also on the letters preceding it in the text. This can also be taken
into account, but we do not deal with these questions here.
The channels actually used in communication theory are also much more
complicated than those discussed above. In practice, it is of the great impor¬
tance to know how to transmit the information through a channel which,
with a certain probability, distorts the transmitted signal. Then one cannot
be sure that the received signal is identical to the emitted one. (E.g. in broad¬
casting the distortions caused by the transmission through the atmosphere
are perceived as noise.) Such channels are called noisy channels. Information
theory takes this into account, but our brief introduction does not permit
to go into these questions.
IX, § 6] FURTHER MEASURES OF INFORMATION 569

§ 6. Further measures of information

In the present section we give another characterization of the information;


this approach will show what other quantities can be considered as measures
of information besides that of Shannon’s.
Shannon’s information was defined in § 2 by postulates and by means of
Shannon s information we introduced the notion of the gain of information.
We shall now follow the inverse procedure: we define first the gain of in¬
formation by a set of postulates; from this then we shall derive a measure
of information.
We start from a generalization of the notion of a random variable. Let
[£2, P] be a Kolmogorov probability space. We define an incomplete
random variable as a function £ = £(a>) measurable with respect to the meas¬
ure on and defined on a subset Q1 of Q, where Ql £ and P{Q2) > 0.
The only difference between an ordinary random variable and an incomplete
random variable is thus that the latter is not necessarily defined for every
co £ Q. In this sense, ordinary random variables can be considered as par¬
ticular cases of incomplete random variables.
If £ is an incomplete random variable assuming values xk with probabili-
n

ties pk (pk > 0; k — 1, 2we have Y pk < 1 and not necessarily


k=1

ipk=i.
k=1
The discrete incomplete random variables £ and t] are said to be inde¬
pendent, if for any two sets A and B the events £ £ A and rj £ B are inde¬
pendent.
The distribution of an incomplete random variable will be called an in¬
complete probability distribution; in this sense the ordinary distributions can
be considered as a particular case of the latter. Thus if pk > 0 (k = 1,...,«)
n

and Yj Pk “ 1 s then {pk} is a finite discrete incomplete distribution.


k=1

The direct product of two incomplete distributions JP5 — {pj} (j = 1,...,«)


and Q = {qk} (k = 1,2,.. ., n) is defined as the incomplete distribution
{Pjdk} (J — 1, . . m; k = 1, . . ., n) and will be denoted by 6ft * Q.
To every incomplete distribution 6ft = (pt,. . .,pn) there can be assigned
an ordinary distribution 6ft' — (p[,by putting

yk — n

HPj
j=1
Let £ be an incomplete random variable taking on values xk with probabili-
570 INTRODUCTION TO INFORMATION THEORY [IX, § 6

ties pk (k — 1, 2,. . ., n)\ put s = £ pk. If 0 < 5 < 1, £ can be inter-


k=1
preted as a quantity depending on the outcome of an experiment, but not
defined for all outcomes of the experiment. For example £ is only defined
if the outcome is observable, which happens with probability s, where
0 < s < 1. In this case the corresponding distribution may be inter¬
preted as the conditional distribution of £ with respect to the condition that
the outcome of the experiment is observable. Therefore ffi' is said to be the
complete conditional distribution of the incomplete random variable £.
We shall now define the mean gain of information obtained if the (incom¬
plete) distribution = (px,.. .,pn) {pk > 0 for k — 1,. . n) of the incom¬
plete random variable ^ is replaced by the incomplete distribution Q =
— (t7i> • • • , q„)• Before stating the postulates we .make two remarks:

1. The gain of information denoted by I(Q || fA) is defined only if SA and


Q have the same number of terms and these are in a one-to-one correspon¬
dence defined by their indices.
2. We supposed pk> 0 for all values of k; however, some qk (but not
all) can be equal to 0.
The quantity I(Q || Sft) has to satisfy the following postulates:

Postulate I. If fA — £A1 * fA2 and Q = If * Q2, then

(i)
Remark. This means that if we put SAi — (pa, . . ., pin), = (qn, . . ., qin)
(/ = 1 or 2), then there corresponds to the element pXj p2k of 6A the ele¬
ment qv q2k of Q. Postulate I is a general formulation of the additivity of
information.

Postulate II If pk < qk Oc = 1, 2,. .., n), then we have I(Q II SA) > 0;
for pk > qk (k = 1, 2,. . ., n) we have I(Q || S3) < 0.
Remark. It follows from this that I{fA1| fA) = 0. For complete distribu¬
tions fA and Q Postulate II asserts nothing more than this, as then the in¬
equalities pk < qk (k = 1,2,..., n) occur only if pk — qk (k = 1,2,..., n),
n n

since £ pk = £ qk = 1. In the case of incomplete distributions,


k=1 k=1
however, this postulate leads to important conclusions.
Let be the distribution consisting of a single term {p} (0 < p < 1).
We require:

Postulate in. I(fox || &l) = 1.


ix, § 6] FURTHER MEASURES OF INFORMATION 571

This postulate fixes the unit of gain of information.


Before proceeding further, we determine the function

g(q,p) = I{%q\\$p) (0 <p< 1,0 <q < 1).

It follows from (1) that

9(<h 92, Pi, P-2) = g(qi. Pi) + g(q2,P2)- (2)

If we put qx = q.2 = 1, we find

1, P1P2) = g( 1, Pi) + g{\, p2). (3)

If we put qx = p2 = 1, Pi — p, q2 = q, we obtain

g(q,p) = 9(1, p) + g(q, l). (4)

Hence, according to Postulate II,

g(\,p) + g{p, i) = o. . (5)


We conclude from (4) and (5) that

g(.q,p) = g(l,p) - g{l, g). (6)

Now g(l,p) being, by Postulate II, a decreasing function of p, it follows


from (3) by a well known theorem that

g(l,p) = c log2 —
P

with c > 0. According to Postulate III c = 1, thus

I(%i\\%p) = 9{\,P)= log2 — , (7)


P
and by (6)

9II&P)= 9{q,p) = log2 — - log2 — = log2 — . (8)


p q p

If we observe the occurrence of an event having probability p, we get the

amount of information log2 ; if p is replaced by q the gain of

9
information is log.
572 INTRODUCTION TO INFORMATION THEORY [IX, § 6

The quantity log, -— can also be considered as measuring the uncertainty


P

of the occurrence of an event with probability p\ the quantity log2 — =


P

= log<>-log, — is the decrease of the uncertainty resulting from


P “ Q
the replacement ofp by q (note that this “decrease” can be negative as well;
indeed if q < p, the uncertainty increases).
We introduce now a new notion. If we replace an incomplete distribution
9s — (plf. . .,p„)byan incomplete distribution Q = (qx,. . ., qn), we obtain

with probability qk the information log2 —(& = 1,2, n). Put


Pk

, Qk
Qk— n
X Qj
7=1

The conditional probability that we obtain the information log2 —


Pk
under the condition that at least one observation occurs, is equal to
Qk {k = \,2, . . ., n). Put

F(Q, &,x)= X Qk; (9)


Qk
1062 irk<x

F(Q_, Sfi, x) will be called the conditional distribution function of the gain of
information.
Now we can formulate our further requirements:

Postulate IV. I(Q \ \ ,9^') depends only on the function F(Q, 9, x).

Because of this postulate we can also write /[T(x)] instead of I(Q || 9s),
where F(x) = F(Q, 9, x).

Notes. 1. If Q = %q, 9s = <op, we have

0 for x < log, -—-,


F(Q,9*,x) P

1 otherwise.

Postulate IV is thus fulfilled and (8) expresses that for a degenerate distri-
IX, § 6] FURTHER MEASURES OF INFORMATION 573

bution function

0 for x < c,
F(x) = Dl (x) =
(10)
1 otherwise,

(where c can be any real number) we have the relation 7[7>c(x)] = c.

2. Every distribution function F(x) of a finite discrete distribution can be


written in the form F(x) = F(Q, <9, x) where ^and Q are suitably chosen
incomplete distributions. Indeed let ak be the discontinuity points of F(x),
n

with jumps wk (k = 1, 2,. .n\ £ wk = 1), then we have to determine


k=l
numbers pk and qk (k = 1, 2,...,«) such that the relations

7* qk
ak = log2 — , wk =-- {k= 1,2,..n) (11)
Pk qi + <?2 + • • • + q„

hold. This is the case if we take qk = twk and pk = twk 2~ak. If we choose
the number t such that

\
0 < t < min (12)
\ I
A: = l
Wk-2~ak
7

then we obtain a system of solution satisfying all our hypotheses.


I[F] is thus a functional defined on the set ^of all distribution functions
of finite discrete distributions. The following postulates concern the prop¬
erties of this functional.

Postulate V. If F G F^G and G(x) > F(x) (- oo < x <


< + oo), then

I[G(x)] < I[F(x)].

Remark. This postulate contains Postulate II. In fact if pk < qk (k = 1,

2,..., 77) we have log2 — > 0, hence F(Q, x) < From this follows
Pk
by Postulate V that I(Q || ,9s) > 7[Z)0(x)] — 0; and if pk > qk, the inequal¬
ity is reversed. However, Postulate II is not superfluous, since in order to
state Postulate V we used relation (8) resting on Postulate II.

Postulate VI. Let Ft £ ■9r (i = 1, 2, 3) and I[F2\ — 7[7’3]. Then for every
574 INTRODUCTION TO INFORMATION THEORY [XX, § 6

t (0 < t < 1) we have

7[f7q + (1 - t)F2\ = I[tFi + (1 - o^sl;

furthermore, I[tF1 + (1 - t)F2] is a continuous function of t.1


Remark. Postulate VI may be called the postulate of quasi-linearity.
Now we can state

Theorem 1. If 7(Q || FF) satisfies Postulates I to VI, then:


— either there exists a real number a ^ 1 such that I(Q || P) = Ia(Q || SF),
defined by

( i £ qt \ (13a)
/a(QH^) = ^Tlog: „ct-l
. k=1 Pk
£ (ik
\ *-i
or fill || -FF) — fi(Q || F5) with
n

/i(Qll^ = £ Vk log2 — • (13b)


Pk
£ qk
k=1

(13b) is the limit of (13a) for a 1:

lim /a(Q||^) = /1(Q||^). (14)

Remark. If d^and Q are complete distributions, Ix(Q\\Ffi) is identical to


Shannon’s gain of information defined by Formula (3) of § 4.
The quantity 7a(Q || FF) will be called the measure of order a of the gain
of information; Ifll || FF) will be called Shannon's gain of information or
measure of order 1 of the gain of information.
In order to avoid confusions, Shannon’s gain of information will from
now on always be denoted by IX(Q || FF) instead of 7(Q || FF).

Proof.2 Instead of IfiQ || FF) we use also the notation fi(F(Q, FF, x.))
Then (13a) and (13b) are written as
+ 00

Ia(F) = ^ ^ log2 j 2la~1)xdF(x) for a ^ 1 (15)


— oo

1 The assumption of continuity is not indispensable to the proof of Theorem 1;


its only purpose is to simplify our proof.
2 The following proof is a combination of the proofs of two theorems from the
theory of functional equations. Cf. G. H. Hardy, J. Littlewood and G. Polya [1],
pp. 215 and 84.
IX, § 6] FURTHER MEASURES OF INFORMATION 575

and
+ 00
h(F) = j xdF(x). (16)
— 00

From these formulae we see that IfQ ||^) satisfies for every a Postulates
I through VI. It remains still to show that no other functional can satisfy
all these Postulates. A simple calculation shows that

F(Qi * Q2, * ^2, x) - J F(Q1} &lt x-y) dF(Q2, &2, y), (17)
— oo

which permits to rewrite Postulates I and III in the following form:

Postulate I'. If F £ y, G £ y and if we put

+ oo

F*G= j F(x-y)dG(y%
— 00

then we have

I[F*G] = I[F] + I[G]. (18)

Postulate III'. I{Dfx)} = 1.


We show now that Postulates I', III', IV, V and VI are satisfied only by
the functionals (15) and (16).
Let yA be the class of finite discrete distribution functions with F(—A) —
— 0, F(A) — 1. We deduce from Postulate VI by induction that from the
relations

F^y, F^y, 7[F,] = l[F'i], wt> 0 (/ =1,2,..., r)


r

and X Wj = 1 the relation


j=i

ni«’,Fl] = ni (i9)
i=l i=l

follows. We know already by Postulates I', III', and V that

I[Dc(x)\ = e (20)

holds for every real c, where Dfx) is the degenerate distribution function
of the constant c (see Formula (10)).
Let
'l'a (0 = /[(!- t) D_a (x) + tDA (a)]. (21)
576 INTRODUCTION TO INFORMATION THEORY [IX, § 6

is a strictly increasing continuous function of t; further ipA(0) — —A


and \l/A(1) = +A. Put t = <pA{u) for u = iJ/A(t). As cpA{u) is the inverse
function of ipA(t), it is continuous and strictly increasing in the interval
(—A, +A). From this we derive

I[DU (*)] = u = ipA (0 = /[(l - (pA (m)) D_A{x) + cpA (u) DA (*)]. (22)

Let F ^_yA be a distribution function which jumps wl5 w2,. . wn at the


points ax, a2,. . ., a„, we have

F=F(x) = fwtDn(x), (23)


k=1

and, according to (19) and (22),

1\F\ =I[t w*((1 ~9a(a*))D_a(x) + <pA(at)DA(*))], (24)


k=X
hence

I[F] = /[(l - £ wk cpA (ak)) D_a(x) + {fJwk(pA (akj) DA (a)] (25)


k=1 k=1

or, according to (22) by writing (pA\t) instead of t/^(f)

= («*))• (26)
k=1

Formula (26) expresses that 7[F] is the Kolmogorov-Nagumo quasilinear


mean of the numbers ak with weights wk (k = 1,2,..., n). We shall need
the following lemma concerning this mean:

Lemma. Let gx{x) and <pfx) be two continuous and strictly increasing func¬
tions in the interval [/, K\. Suppose that for arbitrarily chosen numbers
*i, x2,. . ., x„ in [J, K] and for positive numbers wx, w2,. . ., w„ with
n

Z wk = 1 we always have
k=1

n n

1 (Z
k =1
wk(Pi (Xk)) = 9?2_1 (
k
Z
=1
wk CP2 (xk)). (27)

This means that the relation

<p2(x) = <x<pfx) + 0 (28)

holds, where a > 0 and P are two constants. (Conversely, (28) implies (27)).
IX, § 6] FURTHER MEASURES OF INFORMATION 577

Proof. It suffices to prove (28) by supposing (27) to hold for n = 2,


wx = t, w2 = l - t, 0 < t c l. Put <p2(J) = J', <p2(K) = K'. If x, and x2
describe the interval [/, K], then yx = cp2(xx) and y2 = cp2(x2) describe the
interval [/', K'}. Hence if J'<yx< K', J' < y2< K' and if we put
9?i(992 ’(*)) = 9?3(a') we find

V&y-L + (1 - 0^2) = t<Pz(y\) + (1 - t)(p3(y2) . (29)

T’sO) is thus a linear function, which proves the first part of the lemma.
The converse is trivial.
We show now that there can be found a function cp(x), independent of A,
such that (26) remains valid if <pA is replaced by cp. It suffices to prove that

= <Pb(x)~<Pb(-A)
for 0 < A < B. (30)
<Pb(A)-cpb(-A)

This follows from


n n

Va1 C£ wk cpA (ak)) = <Pb\Yx wk <Pb (ak)) (31)

or 0 < A < B, | ak \ < A (k = 1, 2,. .n); Formula (31) itself follows


from yA c for A < B. From (31) and from the lemma we conclude that

<Pa(*)= v<Pb(x) + /?, (32)

and since <pA(-A) = 0, yA(A) = 1, we obtain (30).


Thus we proved the existence of a monotone continuous function <p(x)
which for every F £ y having jumps wk at the points ak {k = 1,. . ., n)
fulfills the relation
n +00

I[F] = (p~1(YJ wk <p(ak)) = 95"1 ( I <p(x) dF(x)). (33)


k=1 — 00

Now we investigate how cp(x) can be chosen such that it fulfils also Postu¬
late Put in F

F(x) = lDa (x) + (1 - t) Db (x), G(x) = Dy (x).

Then we have

cp-1 (t<p(a + y) + (1 - t) <p(b + y)) = (p-1 (jcp(a) + (1 - t) (p{b)) + y. (34)

Fix y and put cp*{t) = <p(t + y). From (34) follows

cp*~x {tcp* (a) + (1 - 0 cp* (b)) = (p-1 (t<p{a) + (1 - t) cp(b)) (35)


578 INTRODUCTION TO INFORMATION THEORY [IX, § 6

for all values of a and b and for 0 < t < 1. It follows from the lemma that

<p*(x) = (fix + y) = A(y) cp(x) + B(y) (A(y) > 0).

If <p(0) = 0, which can be supposed without restriction of generality, then


B(y) — (p{y), hence

(p{x + y) = A{y) cp{x) + (p{y). (36)

This relation being fulfilled for every y, we may interchange x and y, hence

A(y) (p{x) + rp(y) = A(x) <p(y) + <p(x), (37)


thus
A(x) - 1 _ A(y) - 1
<p(x) <p(y)

From this we obtain A(y) = k<p(y) + 1, where k is a constant. From (36)


and (38) it follows that

<p(x + T) = k<p{pc) cp{y) + (p{x) + (p{y). (39)

We distinguish two cases: k = 0 and k ^ 0. If k = 0,

(p{x + y) = (fix) + (p(y) (40)

and, since cp(x) is monotone, <p{x) = Bx, where B is a constant. If k # 0,


put h(x) = k(p{pc) + 1. We conclude from (39)

h(x + T) = h(x) h(y). (41)

and h{x) being monotone,

h(x) = 2(a_1)x (42)

with a ^ 1, hence
2(«-l)* _ i
3

(43)
II

According to the lemma, (p{x) can be replaced by 2(a-1)*. And, by taking


into account (33), we have thus either (15) or (16). The limit relation (14)
can be proved e.g. by the rule of l’Hospital. Theorem 1 is herewith proved.
IX, § 6] FURTHER MEASURES OF INFORMATION 579

If = %n is the uniform distribution of n terms and if Q is any incomplete


distribution, then for a ^ 1 the relation
/ " \

'Zd
k=1
4(Qll^„) = log2 n - log2 (44)
1 —a
I <7*
'*=1

holds. Thus if we put for any incomplete distribution & = (p1}. . ., pn)

f n \
LPk
k=l
h (^) = -j-
l —a
10g2 n (45a)
\Y,Pkj
\fc=l '
we find that
lM\\$n) = h(%n)-lM- (46)

(46) shows that the quantity may be considered as a measure of the


amount of information corresponding to the distribution S5 (or else as a
measure of the uncertainty of a random variable with the distribution JP).
We call the information of order a. It is easy to see that

1
Z ^log2
A(Qll^„) = log2 n- (47)
Z dk
k=1

For any incomplete distribution S3 = (pu . . .,pn) we put

1
Z PO °g2
k = 1_Pk
h(&)- (45b)
ipk
k=1

and we call this quantity Shannon’s information or information of order I.


If f/3 is an ordinary distribution, IfSP) is the entropy or Shannon’s informa¬
tion of the distribution^. In what follows, in order to avoid confusions,
we shall write always Iff?5) instead of 1(£P) used in the preceding sections.
For a complete distribution f/3 the definition of Iff?3) gives

1
-log2 Z Pk fora ^ 1. (45c)
1-a k=i
580 INTRODUCTION TO INFORMATION THEORY [IX, § 6

Clearly, for every distribution function, complete or incomplete,

lim Ia (&) = f {&>)


a-*-l

holds. We study now Ifffi) as a function of a.


n

Theorem 2. LetSfi — (pu . . ., pn) be an incomplete distribution, pk =


k=1
= ^ < 1. Then ff9:]) is a positive, decreasing function of a. One has If^) —
71
— log, — ; in particular, if Sfi is a complete distribution, then = log2 n.
s
Thus for a complete distribution

0 < Ia (S3) < log2 n (a > 0). (48)

Proof. We can write

( n 1 1 —a
\ T^T
/YJ Pk
Pk)
4(^5) = log, k = 1

Z
k=1
Pk

We know1 that the average

(Z wk4y (Xk>0,wk>0,fjwk=l)
k =1 1

s a monotone increasing function of /l. Hence Theorem 2 is proved.


Remark. If px < p2 < . . . < pn , we have

lim Ia (J^5) = log2 — and lim Ia (<i^) = log2 — .


gc-*- — co Pi a->- + oo Pn

Concerning the gain of information we obtain the following inequality:

Theorem 3. If Sfi = (p1?. . .,/>„) and Q = (qt,. . ., qn) are any incomplete
n n

distributions (ff pk = s < 1; £ qk = t < 1), then IfQ j| ,ST) is an in-


k=l k=1

creasing function of a. Since I0(Q | | SP) = log, — , for the complete distribu-
s
tions and Q there follows the inequality

4(Qll^) ^ 0 for a > 0. (49)

1 Cf. G. H. Hardy, J. E. Littlewood and G. Polya [1], Theorem 16.


IX, § 6] FURTHER MEASURES OF INFORMATION 581

Proof. We have

4(Qll^) = log2 (50)

, Z Qk y

from which Theorem 3 follows by the same theorem on mean values (cf.
foot-note) as above.
If a is negative or zero, the properties of and IfQ || differ essen¬
tially from those of Shannon's information. As can be seen from Theorem 3,
IfQ || J/3) is for complete distributions only then positive, when a is posi¬
tive. The following property is particularly undesirable: Let a < 0; modify
the complete distribution J?3 = (pl5 . . .,/?„) by letting tend to zero, then
/fSP) tends to infinity. On the other hand, 70(<&>) is always equal to log2 n
whenever & contains n positive terms. is thus very inadequate to meas¬
ure the information and we consider only IfidfP) with positive a as true meas¬
ures of information.
Let us now consider some distinctive features of Shannon’s information
among the informations of the family or of IfQ || d/3) among the
informations of the family IfQ || SP). One of these properties is given by

Theorem 4. If c and r\ are two random variables with the discrete finite
distributions Sfi and Q and if denotes the two-dimensional distribution of
the pair (£, rf), then

7a(J§’)<7a(^) + 7a(Q) (51)

holds for every SP and Q with the mentioned properties if and only if a = 1.

Proof. We know already that inequality (51) is valid for a — 1 (cf. § 3,


Formula (15)).
In the case of a A 1, (51) is not necessarily fulfilled. In fact let 0 and 1 be
the possible values of £ and rj; and suppose

7>(£ = 0, rj = 0) = pq + £,

P(£ = 0, rj = 1) = p(l - q) - £,

P(Z = 1, n = 0) = (1 -p)q ~ £,
P(f = 1, rj = 1) = (1 -p)(l - q) + £
582 INTRODUCTION TO INFORMATION THEORY [IX, § 6

with
1 1
0 <p < 1, 0 < q < 1, p # — , —

and

| e | < min (pq, (1 - p)q, (1 - q)p, (1 - p){ 1 - q)).

If (51) were true, the function

g(e) = {pq + e)a + (p(l - q) - e)“ + ((1 - p) q - e)“ + ((1 - p) (1 - q) + s)a

would have an extremum for e = 0. But this is not the case, since g'(0) ^ 0.
The quantity IfiQ || S75) is distinguished among the IfiQ, || &) e.g. by
the following property:

Theorem 5. If 6P = (plt..., pr), = (p[,. . ., p'n) and Q = (qlt. .., qn)


are discrete, finite, incomplete distributions fulfilling the relations

dk = \/Pk Pk (k=l,2,...,n), (52a)

i.e. the relations

ilog2 — +i ilog2 — 0 (k= 1,2,...,»), (52b)


Pk Pk

then the relation

4(Qli^5) + /a(Q||^') = 0 (53)

holds for every distribution fulfilling (52a) only if a = 1.


Remark. The distributions , and Q can only be all three complete
if they are identical. In fact, according to Cauchy’s inequality

( ke= 1 ?*)2=(£
k=1
ly s c i Pk) (f a).
*=1 k=1

Proof. For a ^ 1 we have

n
(v v-i Qk
si"

|iH

h _'a —1
1

4(6 II S’) + 4 (Q11S^') =-- log. .Ac =1 Pk


n (54)
a— 1
X qk
U=i
STATISTICAL INTERPRETATION 583
IX, § 7]

It is easy to see that the right hand side of (54) is not identically zero; e.g.
it is different from 0 if we put n = 2, q1 = q2, px + p2.

§ 7. Statistical interpretation of the information of order a

Let A1} A2,. . Ar be the possible and mutually exclusive outcomes of an


r

experiment with probabilities P(Ak) = pk, Z P& = ^ Putd?3 = (p\,. ■ •,Pr)•


k=1
Suppose that px < p2 < . . . < pr and perform n independent repetitions
of the experiment. Let vk be the number of experiments leading to the out¬
come Ak (k = 1,2,.. r). Put tt„ - p\1 pi'. . • p7• As in § 5, nH is thus the
probability of a sequence of n observations. Consider the function

Z Pk l0g2 4“
u x k =1 Pk
/(a) =-— (1)
Z Pk
Since
/ JL ,1 ( ' .. 1 \*\
/ k=1
Z Pfe log5 — Z P* °§2
fctl_P/c
—-
r (a) = - In 2
Pfc
(2)
v Z Pk Z Pk
k=1
k =1

•it follows from Cauchy’s inequality that /(a) is a strictly decreasing function;
further we have
1 1
/(1) = (^), lim /(a) = log2 — , lim /(a) = log2 — •
a^-oo Pi a^+0O Pr

For a > 1 we have thus /(a) < A (S5). If we put

p(ol) = 2"'(a) (3)

we have

log2 = /(a) - ' (4)


p(a)
Z
k=\
Pk

and
< ^(a) < pr for a > 1
584 INTRODUCTION TO INFORMATION THEORY [IX, § 7

Now let Bn(a) be the event n„ > p(a)n. Consider the conditional information
contained in the outcome of the sequence of experiments, under the condi¬
tion Bn(ot). Put for this

n\
C»= v (5)
nx\ n2\ ... nr\
n p"k > p(«f
*-« ,
Z nk = n
k= 1

Obviously, C„(a) is the number of outcomes fulfilling the condition -5,/a).


The information in question is thus at most equal to log2 C„(a). Further

s C„(x)p(a)"' (6)
and on the other hand

£«-') = (L Pi)". (7)


k=l

Hence, because of Markov’s inequality,

r
Pk )T
I (8)
,k=1 X«)J J

or, according to (6),


a\ n
Pi-
C„(«)< E (9)
U=i PiP) ) •
Put

Pk
?*(<*) = r
-7 • (10)
YpI
7=1

If Q denotes the distribution (q^a), . . ., <?,(«)), we get from (4) by a simple


calculation that

Pk
log2
&?i lP(«)j
(11)

hence, according to (9),

C„ (a) < 2"/,(e“) . (12)


IX, § 7 STATISTICAL INTERPRETATION 585

Furthermore, we have

h (Qa) = « log2 -j- - (a - !) (^) + 1 — a (13)

Choose a sufficiently large h for which

Pr+ >l > (Pi Pi ■ • ■Pr-l)r 1‘> (14)

this is possible because of px < p2 < • • • < Pr


r—1

Put nfa) = [nqfoi)] - h (j = 1, 2,. . r - 1) and nfa) = » - E »/“)•


7=1
Then

n^(a)>x«)". (15)
A:=i

When v* = n*(a) (ic = 1, 2,. . r), the event 5„(a) occurs; hence

n\
C„(a)> r (16)

n n^y-
k=1

But according to Stirling’s formula

n\
(17)

n
/c=l
Relations (12), (16) and (17) lead to

In n
iM-o < — log2 C„ (a) < f (Qa). (18)
n n

Therewith we proved

Theorem 1. Let Ax, A2,. . ., Ar be the possible outcomes of an experiment,


r

P(Ak) = pk, 0 < < p2 < . • • < P,, I Pk = 1 and & = (Pi,p2, ■ • •,/>,)•
k=l
Let the experiment be repeated n times such that the repetitions are inde¬
pendent of each other. Put further
586 INTRODUCTION TO INFORMATION THEORY [IX, § 8

with a > 1 and

- £ <7k (a) Iog2 —


/>(«) = 2 *=1 Pk- (19)
r
Let vk be the number of experiments with outcome Ak, let nn = Pkk and
k=1
let Bfa) be the event nn > p(oc)n. Now if Bn{a) occurs, the outcome of the
sequence of experiments may be characterized completely by a sequence of
0 — 1 -symbols of length

noc
nh (QJ = nla (&) + --Ix (Qa 11 &). (20)
I — a

If, however, g > 0 and e > 0 are arbitrarily small positive numbers and n
is large enough, then «(/i(Qa) — e) 0—1 -symbols are not sufficient with
probability > g.

Remarks. 1. IfQa) = may also be considered as an information


measure of the distribution it has the following properties:

a) 0 < < log2 r,

b) if = NA * Q, we have

/(“) (,Jg>) = /<“> {&) + /<“) (Q).

2. It follows from Jensen’s inequality that

/a(J^)>log2 -j— . (21)


p(oc)

§ 8. The definition of information for general distributions

If the random variable £ takes on denumerably many values xk with


probabilities pk = P{f — xf) (k = 1, 2,. . .), then we define the information
of order a contained in the value of f by the formulas

Ja (0 = -7-log2 (£ pk) for a ^ 1 (1)


1 ~ a k=1
and

h(0 = £ Pk log2 — (2)


k=1 Pk
IX, § 8] DEFINITION OF INFORMATION FOR GENERAL DISTRIBUTIONS 587

if the series on the right hand sides of (1) and (2) converge. The series (2)
does not always converge. For instance for

1
Pk = (k = 1,2,...)
ck log2 (/: + 1)

it is divergent; c is here a “normalizing” factor:

_ y *
° hi n log2 (n + 1)

However the series (1) converges always for a > 1. In case of discrete in¬
finite distributions the measure of order a of the amount of information is
thus always defined if a > 1.
Let i/ be a second random variable which takes on the same values as ^
but has a different probability distribution P(rj = xk) = qk {k = 1,2,.. .).
Let the gain of information of order a, obtained if the distribution
Q = (qx, <72,. . .) is replaced by = Cpl5 p2, ■ ■ •)> be defined by

1 / °° na

4(QII^) = --f lo§2 X -3FT


for a ^ 1, (3)
« — 1 u=l Pk
and by
CO /y

(4)
k=1 Pk

if the series on the right hand side of (3) or (4) converges (which is not always
the case). The series (3) converges according to Holder’s inequality always
for 0 < a < 1.
Let now t, be a random variable having continuous distribution. We want
now to extend the definition of the measure of order a of the amount of
information, i.e. /a(£), to this case. If we do this in a straightforward way we
obtain that this quantity is, in general, infinite. If for instance £ is uniformly
distributed on (0, 1), we know (cf. Ch. VII, § 14, Exercise 12) that the digits
of the binary expansion of £ are completely independent random variables

which take on the values 0 and 1 with probability Hence the exact

knowledge of the values of ^ furnishes an information 1 + 1 + 1 + . . .


which is infinite. Or, to put it more precisely, the amount of information
furnished would be infinite if the value of £ could be known exactly. Practi¬
cally, however, a continuous quantity can only be determined up to a finite
number of decimal (or binary) digits.
588 INTRODUCTION TO INFORMATION THEORY [IX, § 8

We see thus that if we want to define I ft) we encounter problems of di¬


vergence. It seems reasonable to approach a continuous distribution by a
discrete one and to investigate, how the information associated with the
discrete distribution increases as the deviation between the two distributions
is diminished. Instead of £, we can for instance consider

, _ m]
(5)

where [x] denotes the largest integer not exceeding jc. Suppose a > 0 and
let 4(£i) be finite (this is only a restriction for a < 1). It follows from
Jensen’s inequality that Ia(£N) is finite for every N and the inequality

4 (£a) ^ 4 (£i) + log2 N (6)


is valid. If 0 < a < 1 and if we put

f, k k+ 1 1
PN,k ~ P =P <£ <
(N ~ ~N. N N j
(k = 0, ±1, ±2, ..N = 1
then we have the inequality
+ 00 +00

£ Pn^hn1- £
k=— oo £=-oo

from which (6) follows; for a > 1 (6) can be proved in a similar manner.
When the distribution is continuous, the information 4(£a) tends to
infinity as N -> co; however, in many cases the limit

da (0 = lim
4 (h)
(7)
N-oo log2 N
exists. The quantity da{£) will be called the dimension of order a oft If not
only d = dff) exists but also the limit

lim (4 (4v) - d log2 N) = Ia4 (0, (8)


N-* oo

the quantity dad(f) will be called the d-dimensional information of order a


contained in the value of the random variable t
In the important case when the distribution of £ is absolutely continuous,
we have the following

Theorem 1. Let £ be a random variable having an absolutely continuous


distribution with density function f(x), which is supposed to be bounded} If

This supposition is superfluous (cf. A. Renyi [27], [34]); we make it merely in


order to simplify the proof.
XX, § 8] DEFINITION OF INFORMATION FOR GENERAL DISTRIBUTIONS 589

[A^]
we put £n = (N = 1, 2, . . .) and if we suppose that 7a(<4) is finite
N
(a > 0) then
IAM
lim (9)
N -*■ 00 log 2N
+ 00
i.e. the dimension of order a of l; is equal to 1; if the integral j fixfdx (a / 1)
— 00
exists, then
+ 00
lim (4 (4v) - log2 N) = 4>a (0 = —log2 (
N-+ 00 1 CL
f fix)* dx);
J
(10)

if
+ 00

J f(x) log2^y dx.

exists, then
+ 00

lim (4 (£*) - log2 N) = 4,1 (0 = f fix) log2 -rj— dx. (11)


N-» oo J J\X)

Proof. Consider first the case a = 1. Put pNk = P and


(n=~w

fN(x) = NpNk for ~<x< k+ *- (k = 0, ±1,...).

We have then
+ 00

4 (£n) ~ l°g> W = E P-v/t log2 —-= fN (*) log2 -y-rx


fw(x)
dx' W
If

F(x) = J /(«) (13)

is the distribution function of we have

'& + 1 ' k
-F
AT N
AW L (14)
1
~N
590 INTRODUCTION TO INFORMATION THEORY [IX, § 8

According to the well-known theorem of Lebesgue it follows that

lim/v(x) =f(x)
N-*~CC

for almost every x. Now we shall use Jensen’s inequality in the following
form: If g(x) is a concave function and if p(x) and h(x) are measurable func-
b
tions with p(x) > 0 and \ p{x)dx = 1, then we have
a

b b
j g(h(x)) p(x) dx < g{§ h(x)p(x)dx). (15)
a a

This inequality can be proved in the same way as the usual form of Jensen’s

inequality. If we apply (15) with g(x) = log2x, h(x) = ——- and


J\X)

s fix) k k+ 1
P(x) = — f°r -rr^x<-—
pNk N N
then we get
k+1
N

(16)
Ik fix) log2 J(x) dx < pNk log2 —-—
NpNk
AT

and, by summing over k


-foo
1
f(x) log2 dx < Ix (£n) - log2 A, (17)
I
i.e.
+ 00

( fix) log2 -J— dx < lim inf (A (£N) - log2N). (18)


J J\X) oo
— 00

We still have to prove the inequality

+ 00

lim sup (/i(6v) - log2 N) < f f(x) log2 dx. (19)


N* oo J f(x)
IX, § 8] DEFINITION OF INFORMATION FOR GENERAL DISTRIBUTIONS 591

If/M < K, we have also f^ix) < K; thus the functions fu(x) are uniformly
bounded. Hence, by the convergence theorem of Lebesgue,

+A +A

lim j* fN(x) log2 --|-y dx = J f{x) log2 dx (20)


N °° -A -A

for every A > 0.


According to Jensen’s inequality, we have
(l+l)N-l j i

E PNk log2 -TT— ^ Pu log2 — • (21)


k = lN MPNk P1/

+ 0C
r 1
Since we have assumed that If^) and J f(x) log2 - dx are finite, we

can find for every e > 0 an A > 0 such that

1
J
\x\>A
m
log2 fix)
dx < £ (22a)

and
1
E Pv log — < e-
2
(22b)
\1\>A Pll
(20), (22a) and (22b) show immediately that the theorem is true for a = 1.
Consider now the case a > 1. We get from Fatou’s lemma1 that
+ 00 + CO
lim inf j fN(x)adx> f f(x)a dx. (23)
N-oo J
— 00
J
—CO

On the other hand, according to Jensen’s inequality,


+ 00 + CO

J /v (xf dx < J fix)* dx. (24)


— 00 —CO

It follows from (23) and (24) that

lim f fN (x)* dx = J fix)* dx, (25)


N-*oo —oo — co

hence (10) is proved for a > 1. We have still to examine the case 0 < a < 1.

1 Cf. F. Riesz and B. Sz.-Nagy [1], p. 30.


592 INTRODUCTION TO INFORMATION THEORY [IX, § 8

According to Jensen’s inequality, we have now

+ 00 +00

[ fN (x)a dx > f f{xf dx. (26)


— 00 — oo

On the other hand, according to the convergence theorem of Lebesgue, we


have for every A > 0
+A +A +oo

lim J fN (x)a dx — f f(x)a dx < \ f(xf dx. (27)


N 00 —A —A —00

Jensen’s inequality gives


(1+1)N-1
£ N-'p-Nk<p’u. (28)
k = IN

Since we supposed 7a(£j) to be finite, we can find for every e > 0 an A > 0
such that
I Pli < £• (29)
\1\>A

From (27), (28) and (29) we conclude that (25) remains valid for 0 < a < 1.
Theorem 1 is thus completely proved.
The quantities
+ 00

4,i (0 = J f(x) log2 dx (30)

4,i (0 = -j j; g log2 J f(x)a dx (a > 0, a ^ 1) (31)


— 00

are called (one-dimensional) information of order 1 or order a, associated


with the random variable f 71;1(£) is called also the entropy of the random
variable (or of the density function f{x)). The properties of these quantities
differ in some respect from the properties of the corresponding quantities
for discrete distributions. Thus for instance 7la(£) and 7a>1(£) can be nega¬
tive. Another difference is that these quantities are not invariant with respect
to any one-to-one transformation of the variable. E.g. for c > 0 we have

4,! (c0 — 7a>1 (£) + log2 c. (32)


IX, § 8] DEFINITION OF INFORMATION FOR GENERAL DISTRIBUTIONS 59 3

These facts are explained by realizing that /M(£) is the limit of a difference
between two informations.
All what we have said can be extended to the case of r-dimensional ran¬
dom vectors (r = 2, 3,. . .) with an absolutely continuous distribution. Let
f(xlt. . xr) be the density function of the random vector (<!;(1),,.^(r)).

Put £$ = ^ n ^ (k= 1,2,..., r). If . . ., gf)) is finite,1 we have

lim = r. (33)
N-» oo log,^
The dimension of the (absolutely continuous) distribution of a random vector
of r components is thus equal to r; the notion of dimension in information
theory is thus in accordance with the notion of geometrical dimension.
Furthermore, for a > 0, a ^ 1 we have

Hm [/.(($>,..., fl?)) - r log2 N] = /a>r ..., £«)) (34a)


N-+ oo

with
+ 00 +00

hr ((£(1)> • • •, £(r))) = log2 J . . . J fix ,..., xr)adx1 ...dxr


1 (34b)
-oo — 00

and
lim [/, (({«>,..£«)) - r log, at] = ((£■>.{<'>)) (35a)
N -*■ oo

with
+ oo +oo
1
A.,(«a>.{«)) = J . . . J /(X„ . . AT,) log2 dx\... dxr, (35b)
fiXl, • • Xr)
— 00 — 00

provided of course that the integrals exist.


The quantities /ar((^(1)» • . ., £(r))) and /l r((^(1), . . ., £(f))) defined by
(34) and (35) are called r-dimensional measure of order a, and of order 1,
of the amount of information (or entropy) associated with the distribution
of the random vector (£(1), . . ., ^(r)).
Consider now briefly the notion of gain of information in the case of gen¬
eral distributions. Let J75 and Q be any two probability measures on the
measurable space [£?, Suppose that Q is absolutely continuous with

1 IJdf'N , . • ■ , £^)) denotes the entropy of order a of the distribution of the random
vector ($’,. .., 1$).
594 INTRODUCTION TO INFORMATION THEORY [IX, § 8

respect to Aft. Then for every set A £

Q(A) = j h(co) dAft, (36)


A

dQ^
where h(co) > 0 is the Radon-Nikodym derivative and
dAft

J h(co) dAft = 1.
n

The gain of information of order a (or of order 1) obtained if Aft is replaced by


Q is defined1 by the formulas

h(^l\\6ft) = ^ j_y l°g2 j h(a>y dAft (37a)


n
or
Ix(Q\\Aft)= J h(co) log2 h(co) dAft. (38a)
Si

Formulas (37a) and (38a) remain valid in the discrete case too. The (ordinary)
discrete distributions Aft = (px,. . .,pn) and Q = (qx,. . ., qn) may indeed
be considered as measures defined on an algebra of events of n elements
<ox, co2,. . con with Aft(a>k) = pk and Q(cok) = qk (k = 1,2,..., n). The con¬
dition that Q is absolutely continuous with respect to Aft is here automatically
fulfilled whenever pk > 0 (k = 1,2,...,«) and we have

K°a) = —
(k = 1,2,...,«).
Pk

The formulas

n na
7. (£2 || Iog2 V and IfQ || Aft) = £ qk log, —
l—i 1
a— 1 k=1 Pk

appear thus as particular cases of (37a) and (38a).


If Q is the set of real numbers, the set of the Borel-measurable subsets
of Q and if Aft and Q are absolutely continuous with respect to Lebesgue meas¬
ure, there exist two functions p(x) and q(x) such that

Aft(A) = J p(x) dx, Q(A) = | q(x) dx for A £


A A

1 One could deduce Formulas (37a) and (38a) from a certain number of postulates
as was done in the discrete case (§ 6). This will not be dealt with here.
IX, § 8] DEFINITION OF INFORMATION FOR GENERAL DISTRIBUTIONS 595

The measure Q is absolutely continuous with respect to JP3 if for every x


such that p(x) = 0 we have q{x) = 0. Then

<l(x)
h(x) = (39)
p(x) '

In this case we obtain for the gain of information from (37) and (38)
+ 00
q(x)a
dx for a/ 1 (37b)
I. (Qll^>) = —b-r log
a— 1 (1 Pixy-1
and
+ 00

7i (Q 1 &) = j q(x) logs dx. (38b)

The gain of information for absolutely continuous distributions can be


obtained from the gain of information for discrete distributions by a limit
process:

Theorem 2. Let Sfi and Q be two distributions, absolutely continuous with


respect to Lebesgue measure, Q absolutely continuous with respect to Sfi. Let
p(x) and q(x) be the respective density functions of Sfi and Q. We suppose that
p(x) and q{x) are bounded.1 Further if
k+X k+1
N N
pNk= J p(x)dx, qNk= f q(x)dx (k = 0, ±1,...; N= 1,2,...),
k
n n

and Qn denote the distributions {pNk} and {qNk}> tf a w positive and


Ia (Q11| S3^ is finite (which means a restriction only for a > 1), then we have

lim 7a (Qn 11 SfiN) = 7a (Q11ffi), (40)


N—co

where Ia(Q || tP3) is defined by (37b) for a ^ 1 and by (38b) for a = 1, pro¬
vided that Iffl || SP) exists.

Proof. This is similar to that of Theorem 1. We define pN(x) and qN(x)

by pM = NpNk for -^r<x< k+— (k = 0, ±1,. . .)and qN{x) = NqNk

for — < x < „ - (k = 0, ±1,. ..). Let further hN(x) = .


N N Qn W

1 This supposition is superfluous and serves only to simplify the proof.


596 INTRODUCTION TO INFORMATION THEORY [IX, § 8

Consider first the case 0 < a < 1. It is clear that pN(x) -> p(x) and
qn(x) -*■ q(x) almost everywhere; further
+ 00

I«(Qn 11 &N) = log2 J qN(xf pN(xf~* dx. (41)


— 00

According to Lebesgue's theorem we have for every A > 0


+A +A

lim f qN(x)apN(x)l^x dx — j q(x)ap(x)1~a dx. (42)


N-*-cc —A —A

Since
J qN(,x)apN{xf~a dx
\x\>A

can be made arbitrarily small for a sufficiently large A, uniformly in N,


Theorem 2 is proved for 0 < a < 1.
Now suppose a > 1. We have, according to Jensen’s inequality,

+ 00 + 00
f ?(*)“ f qN(x)a j
J Wr'dx*) J^dx’ (43)

and on the other hand by Fatou’s lemma

+ 00 + 00
4n(xY q(xY
lim inf dx. (44)
N -+- 00 J Pn{x) Pixy1
— 00

which settles the case a > 1.


Finally, let a = 1. We have

+ 00
qN{x) .
A(QnII^V) = J qN(x) log2 Z 7“ dx. (45)
pN{pc)

Since the function x log2 x is convex, Jensen’s inequality gives

+ 00 + 00

J q{x) log2 -~5_


p(x)
dx ^ f 9n(x) log2 ■ dx. (46)
) p4x)
IX, § 9] INFORMATION THEORY OF LIMIT THEOREMS 597

From x log2 x > —— we deduce

, \ , , log2 e , . . „
qN(x) log2 —— +-pN(x) > 0.
Pn\x) e

Hence, according to Fatou’s lemma,


+ 00 + 00

lim
im inf f qN(x) log2 dx > q(x) log2^~dx. (47)
A- oo J p(x)

(46) and (47) lead to


+ 00 +00

lim f qN(x) log2 dx = ( q(x) log2 ^ } dx (48)


N-~ oo J Pn(.x) p(x)

and Theorem 2 is herewith proved.

§ 9. Information-theoretical proofs of limit theorems

We have seen that for complete discrete distributions the relation


7a(Q || S3) > 0 holds, where the equality sign occurs only if S3 and Q are
identical. We shall now prove the following property: If {Q„} is a sequence
of discrete distributions such that lim Ia(Qn \ \ = 0, then the distri-
n-> oo
butions Qn converge to the distribution Thus we have the following

Theorem 1. If & = (plt • • -,pr) and Q„ = (qnl,. .., qnr) are probability
distributions and if

lim IfQ.n 113s) = 0 («>0), (1)

then we have also


lim q„k=Pk • (2)
n~*" oo

Proof. If (2) does not hold, there exists a subsequence nx < n2 < ... <
< ns < . . . of the integers with

lim q„sh = p'k and £ (Pk-PkY^Q- (3)


s- oo k=1
598 INTRODUCTION TO INFORMATION THEORY [IX, § 9

Obviously, £ pk = 1; further if we put SA' = (X,.. p'r), it follows


k=1
from (3) that
lim /„(£>„, 11 ^) = 4(^' 11 ^). (4)
S-+- 00

According to (1), 11 «i^) = 0, but this is possible only if &>' — i.e.


if Pk — Pk f°r k — 1> 2, . . r, which contradicts (3). Thus Theorem 1 is
proved.
As an application of this theorem we shall now prove a theorem about
ergodicity of homogeneous Markov chains, which, essentially, is contained
in Theorem 1 of Chapter VIII, § 8. We give here a new proof of this result,
only to show how the methods of information theory may be used to prove
theorems on limit distributions.

Theorem 2. Let us consider a homogeneous Markov chain with a finite


number of states A0,. . ., AN; let the probability of transition from Aj to Ak in
n steps be denoted by p(f (n — 1,2,...). For pty we write simply pjk. If there
exists an integer s > 1 such that pty > 0 for j, k = 0, 1,. . ., N, then the
equations
N
I XjPjk = xk (k = 0,1,..., N) (5a)
7=0

have a system of solutions xk = pk (k = 0, 1, . . ., N) with

N
Pk > 0 (k = 0, 1,..., N) and V pk = 1, (6)
k=o
and we have

lim pf = pk (/, k = 0,1,..., N). (7)


n-> oo

Proof. The existence of a solution xk = pk > 0 {k = 0, 1,..., N) of the


system of equations (5a) can be proved directly, without probabilistic consid¬
erations, in the following manner:1 The determinant of the system (5a)

is zero, since X Pjk — 1 (j = 0, 1,. .N); thus the system has a non-
k=l
trivial solution (x0, Xl,..., xN). If (5a) is fulfilled, we have

I ** I ^ Z Pjk I Xj |. (5b)

1 We have here a particular case of the Perron-Frobenius theorem; cf F R Gant-


macher [1], Vol. 2, p. 46.
INFORMATION THEORY OF LIMIT THEOREMS 599
IX, § 9]

If we add the inequalities (5b) for k = 0, 1,. . N, we obtain


N N

E 1**1 s E l*il-
k =0 7=0

But this inequality is an equality; hence the same must hold for every in¬
equality (5b), i.e.

1**1 = Z Pik\xjl (5c)


7=0

Hence (5a) possesses a nontrivial nonnegative system of solutions, say


(p0,pi,. . .,pN). If we multiply (5a) by pkh add the equations obtained for
k = 0, 1,. . .,N and repeat the whole procedure n times, then we obtain

Z rf+1) = I PH®-
jZo k-0

We find then by induction that

E PjPjf = Pt for h =1,2. (5d)


7=0

Since (5d) is valid for h = s, it follows that no pk can be zero. Because of the
homogeneity of the equations, (5a) has thus a positive system of solutions
N

Po,Pi, ■■-,Pn with Z Pk = i-1 Put


fc = 0

iP5 = (Po, ■ ■ •,Pn) ar|d — (Pj\,\ ■ • •>Pjn)‘


We consider now the quantities Ia{&f || -9s) and prove the relation
1

lim /a(^f||^) = 0, (8)


11-+CO

then Theorem 2 follows because of Theorem 1. The value of a is immaterial


for the proof. We can e.g. assume a > 1. By assumption

Z PjPjk = Pk-
7=0

If we put TC;k = , we have


’ Pk

lnik=X. (10)
7=0

1 This solution is unique; this is a corollary of (7) and need not be proved here

separately.
600 INTRODUCTION TO INFORMATION THEORY [IX, § 9

Furthermore, by definition

Pit*1' = 1=0
I tfPib (11)
hence
■ N N n(n) \«
7a(^+1>||^) = -—log2 Eft w=o (12)
OC X k=o Pi )

Because of (10), Jensen’s inequality leads to


x N
£ Pf Pf a
Z —-
1=0 Pi
nik ^ Z
1=0
nik
Pi.)
(13)

Since £ plk =1, it follows that


k=0

N N

Z
=0
k
Pk^ik=PiY Pik=Pi-
k=0
(H)
If we multiply the inequality (13) by pk and then take the sum over k, we
obtain

h (^+1)l!^) ^ h II &) (« = 1,2,...). (15)

II 3s) {n = 1, 2,,..) is a monotone decreasing sequence of nonnega¬


tive numbers; it has thus a limit

lim Ia 11S5) = y. (16)


/l-*-00

It remains to show that y = 0. Choose a subsequence nx< ... < nt < ...
of the integers such that the limits

lim Pjf = qjk (k = 0,\,..,,N)


/-►CO

exist. Then £ qjt = 1 and by (11)


1=0

N
fni+s)
lim p)k‘ _
= Z
1=0
^jiPik = q'jk-
Obviously
IX, § 9] INFORMATION THEORY OF LIMIT THEOREMS 601

Let Qj and Q) denote the distributions (qJ0,. . ., qjN) and (q'J0,. . ., q']N),
respectively. If we put 71$ = PjP$/pk, Jensen’s inequality implies

N N

Ia(Qj\\&) = log2 Eft 1 *4' s/.(an^) (17)


a— 1 k=0 /=o Pi

by the same argument that led to (15). But, because of (16), the relations

IMN&) = Hm Ia(^rs)\\&) = y (18a)


f —► GO

and
/« (Qj \\'9S)= hm Ia 11 &) =y (18b)
t-*00

hold; hence there is equality in (17). Since (17) is derived from Jensen’s
inequality, it follows that equality can hold only if qJ{ = Xp, (/ = 0, 1,. . .,N).
N N

Since X! qji = £ Pi = 1, we must have A = 1; consequently, Qj =

But then IJQ || = Ia(& || &) = 0, hence by (18b) y = 0. Theorem 2


is herewith proved.
The idea to prove theorems on limit distributions by means of informa¬
tion theory is due to Yu. V. Linnik. He proved in this way the central limit
theorem under Lindeberg conditions by using Shannon’s entropy; he proved
the convergence of the distribution with the density function p„(x) to the
normal distribution by showing that

+ 00
Pn (*) (19)
lim
n-+co I P„(X) l0g2
<p(x)
dx = 0,

where <p(x) is the density function

<P(X) =
Jin

of the normal distribution. Linnik’s proof can be simplified if we use the


gain of information of order 2 instead of Shannon’s entropy; but even so
the proof’is too intricate to be reproduced here. However, we can briefly
indicate the principle of the method.
Let &, f2,. . ... be independent random variables having the same
absolutely continuous distribution with zero expectation and unit variance
602 INTRODUCTION TO INFORMATION THEORY [IX, § 9

_£2 • ~ £l + + •••+£/! rr-i,


and let pn(x) be the density function of -—-• 1 hen
Jn
+ 00

j x2pn(x)dx = 1.
— 00

Relation (19) can be written as


+ 00
lim | p„ (x) log2 —— dx = log2 J2ne. (20)
Pn \X)

It is easy to show that


+ 00

log2 Jlne = J (p(x) log2 dx, (21)


— 00

thus (19) is equivalent to


+ 00 + 00

lim pn {x) logo dx = cp(x) log2 dx. (22)


n-*- 00 J Pn (X) (p(x)

To say that the distribution with the density function pn(x) tends to the nor¬
mal distribution, means therefore that the entropy of this distribution tends
to the entropy of the normal distribution. But we can prove that for a den¬
sity function p(x) such that
+ CO
j x2p(x)dx= 1 (23)

the inequality
+ CO + 00

J p(x) log2 dx < q>{x) log2 dx (24)


p(x) <p(x)

holds, since because of (21) and (23), (24) is equivalent to the well-known
inequality
+ 00

p(x) log2 dx > 0. (25)


<P(x)

The statement of the central limit theorem may therefore be expressed as


follows: The entropy of the standardized sum of independent random vari¬
ables tends, as the number of the variables tends to infinity, to the maxi-
IX, § 10] EXTENSION OF INFORMATION THEORY 603

mum of the entropy of all random variables with unit variance. Thus the
central limit theorem of probability theory is closely connected with the
second law of thermodynamics.1

§ 10. Extension of information theory to conditional probability spaces

In the present section we consider particular conditional probability spaces


[&, ,£8,P(A |i?)] only: Let Q — {cox,a>2,. . . . .} be a denumerable
set and the class of all subsets of Q. Let further pk (k = 1, 2,. . .) be a
00

sequence of nonnegative numbers with £ pk — + oo. We put


k=1

P(A) = E
con£A
Pn for
Let 3S be the set of those subsets B of Q for which p(B) is finite and positive.
For A £ and B £ 3dl, P(A \ B) is defined by

K AB)
P(A | B) = (1)
m *

We shall indicate by some examples how the concepts of information


theory can be extended to conditional probability spaces.
If £ is a discrete random variable defined by £(co) = x„ for to £ An (n —
OO

= 1,2,...), where An £ £ An = Q, AnAm = O for n ^ m, then the


n=1
numbers P(An \ B) (n = 1,2,B £ £8) form the conditional distribution
of £ with respect to the condition B. Consider the entropy of this distribution,
i.e.

h (( IB) = f P(A, | B) log, —L=■ . (2)


n=1 *\An I B)

If Qn is the set {cox, co2,. . .,<x>N}, then QN £ P& for N > N0. We define the
entropy Ix(£) of £ (in other words the information contained in the value
of 0 by
h(0= Hm (3)

1 If the distributions considered concern the velocities of the molecules of a gas,


the condition
+“>
J x2pn(x)dx = 1
— 00

means that the total kinetic energy of the gas is constant.


604 INTRODUCTION TO INFORMATION THEORY [IX, § 10

if this limit exists and is finite. If it does not exist, the information in ques¬
tion will be characterized by the following two quantities:

lim inf I1(Z\Qn) = Ix(0


N-* oo
and
lim sup /j(£ | Qn) = hiO-
N-+- oo

Consider for instance the binary representation of a positive integer.


We show that each of the digits of this representation contains exactly one
bit of information, just like a digit of the binary expansion of a real number
lying in the interval (0, 1). Let £2 be the set of positive integers, con = n,
and pn= 1 (n = 1,2,...). Consider as above the conditional probability
space [£2, 33,P(A | B)]. Let sk(ri) denote the'k-th digit in the binary ex¬
pansion of n; we have

n = fek(n) 2‘, (4)


k=0

where ek(ri) is equal either to 0 or to 1. Obviously

1 if n is odd,
e0(n) =
0 if n is even.

If Qn = {1, 2,..., N}, we have

' N N
N—
, N N
7i(e0 (n) | Qh) = l°g2 r . . n + log2
N N N N
N—
T
and by (3)
A(«o («)) = 1- (5)

It follows in the same way that

A(e*(«))=l (k= 1,2,...). (6)

Take now an example for which the limit (3) does not exist. Consider again
the binary expansions of the positive integers; let [fl, «S@,P(A | B)] be
the same conditional probability space as in the previous example.
Let rj(n) be the largest exponent of 2 in the binary expansion of n; hence

i(«)
« = ££*(«)2* with £„(n)= 1.
i
IX, § 11] EXERCISES 605

If now 2r <N < 2r+1, that is if r — [log2 N], then

=j I ®n) = -^r for j < r - 1

and
N — 2r 4- 1
P(t](n) = r\QN) =-—-.

If N tends to infinity through values for which


2[i°g!v]
lim - = y
i
Af-00 N . z

then we have

lim h{t]{n) | &n) = 2y + y log2 — + (1 - y) log2 — = L(y). (7)


n-~o o y i -y

Thus the limit Z,(y) depends on y. L(y) is a concave function of y and we

have L — L(l) = 2. Furthermore, L(y) takes on its maximum for y = —

and we have

max L(y) = L log2 5.

Consequently, h{r](n)) = 2 and Ii(rj(n)) = log2 5. The information


I\(ji(rij) is not defined in this case; nevertheless, it can be stated that the
number of digits in the binary representation of an integer contains at least
2 bits and at most log2 5 bits of information.

§ 11. Exercises

l.1 a) How much information is contained in the licence number of a car, if this
consists of two letters and four decimal digits? (22.6)
b) How much information is needed to express the outcome of a game of “lotto”,
in which 5 numbers are drawn at random from the first 90 numbers? (25.4)
c) What amount of information is needed to describe the hand of a player in bridge
(each player having 13 cards from 52)? (39.2)
d) How much information is contained in a Hollerith punch-card which has 80
columns and in each column one perforation in one of 12 possible positions? (286.8)

1 The numbers in parentheses are the solutions.


606 INTRODUCTION TO INFORMATION THEORY [IX, § 11

e) How much information is contained in a table of values of a function consisting


of 50 pages with 40 lines per page and 25 decimal digits on each line (numbers for
identification of the lines not counted)? (166 095.5)
f) How much information is contained in a linear macromolecule consisting of
100 000 single molecules, if there can occur one of four different molecules at every
place? (200 000)
g) How much information is transmitted per second by a television broadcasting
station which emits 25 images per second each of which consists of 520 000 points,
each black or white? (13 000 000)

2. a) Let some integer n (1 < n < 2 000) be divided by 6, 10, 22, and 35 and let
the remainders be given, while we assume that the remainders are compatible. How
much information is thus given concerning the number nl

Hint. The information is equal to log., 2000 = 10.96 (i.e. we get full information
on n). In fact the remainders mentioned determine n modulo the least common
multiple of 6, 10, 22 and 35; which is equal to 2 310 > 2 000, hence n is uniquely
determined.
b) Let the number n be expressed in the system of base (—2), i.e. put

n= X bk{— 2)*,
*=o
where bk can take on the values 0 or 1 only. How much information on n is contained
in M
[^]
Hint. Put ?V = Y, 22j+1;then
/=o

r 2r+1—N— 1

na+jc(~2>*) = z *"•
k=0 n= — N

It follows that for fixed r the numbers — N, —N + 1, . . . , 2r+1 — N — 1 can all


r
uniquely be represented in the form Y bk( — 2)k with bk = 0 or 1; there are thus
fc=0
exactly 2r+1 numbers which can be expressed in this form and every “digit” bk contains
one bit of information with respect to n .
c) Let U(n) (and V(n)) denote the number of different (and of all) prime factors
of n. How much information with respect to n is contained in the value of the
difference V(n) — U(n) ?

Hint. As is known,1 for every positive integer k the asymptotic density of the
numbers n with V(n) — U{ri) — k exists. Let this density be denoted by dk, then

xv=n
k= 0
for | z | < 2,

where p runs through all primes. Let Nk(x) denote the number of integers n smaller
than x with V(n) — U(n) = k, then
Nk{x)
lim = 4
X-> +oo a:

1 Cf. A. Renyi [16],


IX, § 11] EXERCISES 607

(k = 0, 1,. . . .); d0 is the density of square-free numbers and is equal to -Thus


71“
«o j
the amount of information in question is equal to £ dk log2 - .
k—0 dk
3. a) Let the real number x (0 < x < 1) be represented in the form

V *«(*)
* = nL
=l H
-.

where q is a positive integer > 2, and e„(x) can take on the values 0, 1, . . ., q — 1
(n — 1,2,...). How much information with respect to x is contained in the value
of £„(x) ?
b) Expand x (0 < x < 1) into the Cantor series

x _y En(x)

n—\ dldi ■ ■ ■ dn

where qu q,,... , qn, . . . are positive integers > 2, and £„(x) can take on the values
0, 1, 1 . How much information with respect to x is contained in the value
of en(*)?
c) Expand x (0 < x < 1) into a regular continued fraction

x =
(x) +
a2 (x) +

where each a„(x) can be an arbitrary positive integer. How much information about
x is contained in the value of a„(x)?

Hint. Let m„(k) denote the measure of the set of those x for which o„(x) = k .
As is known1

lim mn{k) = log, — nk.


n —► co k(k + 2)
Hence
“ 1
lim Iy (a„ (x)) =YJnk log, -.
«oo k— 1 ^k

Let it be remarked that contrary to Exercises 3.a) and 3.b), the random variables an(x)
in this example are not independent; the total information contained in a sequence
of several digits an(x) is not equal to the sum of the informations contained in the
individual digits.

4. Let a differentiable function /(x) be defined in [0, A] and suppose /(0) = 0 and
|/'(x)| < B . Find an upper bound for the information necessary in order to determine
the value of /(x) at every point of [0, A] with an error not exceeding e > 0 .

ke { AB
Hint. Put xk k = 0,1, xr ab -j = A. Let the curve of
B L-f J +
/(x) be approximated by a polygonal line y — <p(x) which can have for its slope in

1 Cf. e.g. A. J. Khinchin [61.


608 INTRODUCTION TO INFORMATION THEORY [IX, § 11

each of the intervals (xk, xk + x) either +B or —B. If (p(x) is already defined for
0 < x < xk , then let the slope in (xfc+l) be so chosen that \f(xk+l) — g?Ot+1)| < e.
Obviously, this is always possible. Since f(x) — <p(x) is in every interval (xk, xA+l)
monotone, the inequality |/(x) — <p(x)| < e holds in the open intervals (xk, x*.+l)
(k = 0, 1,. . .) as well. Clearly, the number of possible functions q>(x) is equal

to 2H+1. in order to determine /(x) up to an error e there suffices therefore


AB
+ 1 bits of information.
e

5. We have n apparently identical coins. One of them is false and heavier than the
others. We possess a balance with two scales but without weights. How many
weighings are necessary to find the false coin?

Hint. The amount of information needed is equal to log2 n. Only weighings with
an equal number of coins in both scales are worth while to be performed. 3 cases
are possible: equilibrium, right scale heavier, and left scale heavier. One weighing
{logo n \ .
lo ~ 3 f weighings

({x} denotes the smallest integer greater than or equal to x). It is easy to see that
this number of weighings is sufficient. In fact, let k be defined by 3* 1 < n < 3*.

At the first weighing we put in each of the dishes j coins. We know then to which

of the three sets, each containing at most 3fc 1 coins, the false coin belongs. Proceeding
in this manner, the false coin will be found after at most k weighings.

6. The “Bar-Kochba” game is played as follows: Player A thinks of any object,


player B asks questions which can be answered by “yes” or “no”, and has to guess
the thing on which A thought from the answers. Naturally, A has to answer all
questions honestly,
a) The players agree that A thinks of some nonnegative integer < N. What is the
minimal number of questions permitting B to find out the considered integer? Give
an “optimal” sequence of questions.

Hint. Obviously, at least {log2 N} questions are needed, since each answer provides
at most one bit of information and we need log, N bits. An optimal system of questions
is to ask, whether in the binary representation of the number x the first, the second, ...,
digit is 0? The aim is arrived at by {log2 N} questions, since the binary representation
of an integer is unique.
b) Suppose N = 2s. How many optimal systems of questions do exist? That is:
how many systems of exactly j questions determine x whatever it may be?

Hint. The number of the possible sequences of answers to s questions is evidently 2s.
There corresponds thus to every integer x (x = 0, 1.2s — 1) in a one-to-one
manner a sequence of s yes-or-no answers. Every question can be put in the following
form: Does x belong to a subset A of the sequence 0, 1, . . ., 2s — 1? Thus to an
optimal sequence of s questions there correspond s subsets of the set M — {0, 1, ... ,
2s — 1}; let these be denoted by Au A2, . . . , As. According to what has been said,
Ay has to contain exactly 2i_1 elements. Let A always denote the set complementary
to A with respect to M. Then AyA2 and AXA2 have to contain both 2s-2 elements;
AyA2A3, AyA2A3, AyA2A3, and AyA2A$ have to contain 2'-3 elements, and so on.
IX, § II] EXERCISES 609

Conversely, if all sets AXA2. . . Ak~rAk contain exactly 2s~k elements (k = 1,2,..., s),
where A means either A or A, then the system of sets At, A2,. . ., As is optimal.
It follows from this that the number of optimal sequences of questions is

If we regard the systems of questions which differ only in the order of questions as
2s!
identical, then the number looked for is -.
si

Remark. In the Bar-Kochba game the questions are, in general, formulated while
taking into account the answers already obtained. (In the language of set theory:
if the first answers have shown that the object belongs to a subset A of the set M of
all possible objects, then the next question is whether it belongs to some subset B
of the set A.) It follows from what has been said that the questioner suffers no dis¬
advantage by being obliged to put his questions simultaneously.

7. Suppose that in the Bar-Kochba game type players agree that the objects
allowed to be thought of are the n elements of a given set M. Suppose that the
questions are asked at random, or in other words, all possible questions have the
same probability, independently of the answers already obtained.
a) What is the probability that the questioner finds out the object by k questions?
b) Find the limit of the probability obtained in a) as n and k both tend to + °° such
that
lim (k — log2 n) = c .
n—► co

Hints. We may suppose that the elements of the set M are the numbers 1, 2
Each possible question is equivalent to asking whether the number thought of does
belong to a certain subset of M. The number of possible questions is thus equal to
the number of subsets of M, i.e. to 2". (For sake of simplicity there are included the
two trivial questions corresponding to the whole set and the empty set.) Let
Ax, A2, . . . , Ak be the sets chosen at random by the questioner: i.e. he asks, whether
the number thought of does belong to these sets. By assumption, each of the sets

An is, with probability , equal to an arbitrary subset of M. Put

{1 if Ai contains the number /,


0 otherwise.

The random variables et(l) (J = 1, 2, . . ., k; l = 1, 2, ..., n) are independent of

each other and each takes on the values 0 and 1 with probability — . The questioner

finds the number x, when the sequence of numbers ex(x), f2(jc),.. ., e*(x) is different
from all sequences ex(y), f2(y), • • • . f*O0 with y ^ x. The sequences e,(/), e2(/),..., ek(l)

are, with probability and independently of each other, equal to any sequence

consisting of k digits 0 or 1; the problem is thus equivalent to the following urn-


problem:
n balls are thrown into 2k urns; each ball has the same probability to fall into any
of the urns. One of the balls is red, all the other balls are white. What is the probability
INTRODUCTION TO INFORMATION THEORY [IX, § II
610

f i r-1 _
that the red ball is alone in an urn? The answer is evidently 1 — — P„.k-

The answer to question b) is therefore

lim Pn,k = exp


k— 10g2«-*-C
n—► oo

Remarks

1. The number of questions needed (if n is sufficiently large) in


order to find the number with a probability > 0.99 by means of this random strategy
exceeds only by 7 the number of questions needed in the case the optimal strategy

is employed. In fact, exp > 0.99. This result is surprising, since one would

be inclined to guess that the random strategy is much less advantageous than the
optimal strategy.

2. When the questions are asked at random it may happen that the same question
occurs twice. But the corresponding probability is so small, if n is large, that it is
not worth while to exclude this possibility, though of course this would slightly increase
the chances of success.

8. Certain players play the Bar-Kochba game in the following manner: There are
r + 1 players, r players think of some object; the last player asks them questions.
The same question is addressed to every player, who answers by “yes or no ,
according to what is true concerning the object he had thought of.
a) Each of the players thinks of one of the numbers 1,2,. . . ,n(n> r), but each
of a different number. The questions are asked at random, as in the preceding
exercise. What is the probability that the questioner finds all numbers by k questions?
b) n = r and the players agree to think each of a different number of the sequence
1, 2hence it is a permutation of numbers which is to be found. What is the
probability that the questioner finds the permutation by k questions? Calculate ap¬
proximately this probability for k — 2 log2 n + c.

Hints, a) We are led to the following urn problem: we put n balls into 2k urns,

independently of each other, each ball having the same probability to get into

any one of the urns. Among the n balls there are r red balls, the others are white.
What is the probability that all the red balls get into different urns? This proba¬
bility is

For r — 1 we find as a particular case the result of the preceding exercise,


b) n = r, hence

thus

I'm P„,2loetn + c,n ~ exp


IX, § 11] EXERCISES 611

Remark. It is surprising that in this game to guess a permutation of the numbers


1,2.n approximately twice as many questions are necessary as to guess a single
one of these numbers. Of course, in the first case one gets to each question n answers.

9. Let/(«) (n = 1,2,.. .) be a completely additive number-theoretical function, i.e.

f(nm) = /(«) + f(m) (A)

for all pairs of integers n and m. Suppose further that the limits

lim f(n + d) - f(n) \ = 1(d) (B)

are finite for every integer d and


1(d)
lim = 0.
d—*- co log d

Then g(n) = c log n, where c is a constant.

Hint. Let P be an integer greater than 1; we put

f(P) log n
9(n) = /(«) -
log P

From g(P) = 0 and from (B) it follows that


max 1(d)
ff(n) | < 1 ■&d<P
lim
log n ~ log P

From this we conclude that

lim lim
f(n) An = 0.
P —► 00 fj —► 00 log n log P

This implies the existence of the limit

lim --= c.
/(«)
» log n

If we put now h(n) = /(«) — clogw, then /(«) is a completely additive function
for which

r h<ji) = 0.
lim --- o
(1)
n —* oo log«

But this implies h(n) = 0 since otherwise there would exist an integer r with h(r) ^ 0
and thus, because of the additivity, h(rk) = kh(r) for k — 1,2,... , which contra¬
dicts (1).

Remark. This problem is due to K. L. Chung but his proof differs from ours.
If instead of the complete additivity only simple additivity is required, i.e. that
f(nm) = /(«) + f(m), if («, m) — 1, then the condition (B) does not imply /(«) =
= c log n. (The last step of the proof cannot be carried out in this case.)

10. Let P = {pk} be any distribution with


00

kpk = A > 1.
612 INTRODUCTION TO INFORMATION THEORY [IX, § 11

Then the entropy

f Pk lo§2 — = h{&)
i Pk

takes on its maximal value if

11. Let and Q be two distributions, absolutely continuous with respect to Lebesgue
measure, with density functions p(x) and q(x) and further let be absolutely con¬
tinuous with respect to It follows from Theorem 2 of § 8 that the gain of information
is nonnegative in this case too, i.e. we have the inequalities
+ CD

( q(x) log2 dx >0


J P(x)
— 00

and
-foo

——- log2 ( _T dx > 0 for « > 0, # ^ 1.


« - 1 J />(*)“ 1
— CO

Prove these inequalities directly (without passing to the limit) by Jensen’s inequality
geneialized for functions, i.e. by inequality (15) of § 8.

12. a) Let £ be a positive random variable having an absolutely continuous distri¬


bution, with E(0 = X > 0. Show that the entropy (of order 1) of £ is maximal if
the distribution of £ is exponential.

Hint. Let f(x) be a density function in (0, +<») with

J xf(x) dx = A
o
and put
1 ( j
&(x) = y exp —

We have (cf. Exercise 11)

0 0 0

b) Let ( be a random variable distributed in the interval (a, b) with an absolutely


continuous distribution function. Show that the entropy (of order 1) of £ is maxima!
if £ is uniformly distributed in (o, b).

Hint. Let f{x) be a density function which vanishes outside (a, b) and put

-- for a < x < b,


b — a
d(x) =

0 otherwise.
IX, § 11] EXERCISES 613

We have then
b b j)

o < J m log, **) log, -L. dx - J/« log, -±. dx.


a a a

13. Let 4> £2> . . . , £„ be random variables and C = (cw) a nonsingular « by n


matrix. We put
n

Vk= Yj c<& (k = 1,2,..., «).


/=!

Show that we have for a > 0

4, (.Oh, Vt, • • .. Vn)) = 4,„ ((4.4, • •{„)) + log, 11| c III


where ||C|| denotes the determinant of C.

14. Let 4, £2,.... 4 be independent random variables with absolutely continuous


distribution. We have

4, (du ■ ■ ; €r)) = 4,1 (4) + 4,2 (4) + • • • + 4.r («,)•

///«/. This follows from Formulas (34b) and (35b) of § 8.

15. The relative information Id,rj) contained in the value of r] concerning £ (or
conversely) can be defined, when the pair (4 rj) has an absolutely continuous distri¬
bution, by
Id, v) = hxd) + 4,t0?) - 4,2 (&»?))•
Show that if 01 is the joint distribution of £ and ij and if * Q denotes the direct
product of the distributions and Q of £ and rj, then Formula (11) of § 4 remains
valid; i.e. Id,rj) is equal to the gain of information obtained by replacing * Q
by 01.

Hint. If h(x, y) is the density function of the pair (4 rj), f(x) and g(y) are the
density functions of £ and rj, then we have
-{-CO -{-00

/«,,) = ] jKx.y) lot. ^^dxdy.


— 00 — 00

It follows that
Id, V) = 4(« II & * Q),
because of Formula (38) of § 8.

16. In the following exercises we always use natural (Napier’s) logarithms In.
a) Calculate the entropy (of dimension 1 and of order 1) of the normal distribution;
i.e. show that

J
-f 00

w(x) In-dx — In a J 2ne ,


<f(x) N

where
(x — m)2 \
(p(x) = — - exp
In a ~’2cr2 J
614 INTRODUCTION TO INFORMATION THEORY [IX, § 11

b) Calculate the entropy (of dimension r and of order 1) of the r-dimensional normal
distribution.

Hint. Let the r-dimensional density function of the random variables £2, • • •> £r
be

where ||2?|| is the determinant of the positive definite quadratic form

Z Z
/=1 i=\
By a suitable orthogonal transformation
r

Vk = Z cm (£,• - mi)
/= i

we obtain for the density function of the random variables rju rj2, ... , ??,:

(2tt) 2 cqu,;... ar
1
with axa2 . . . ar = \\B\\ 2 . According to Exercise 13, the entropy is invariant under
such a transformation, since the absolute value of the determinant of an orthogonal
transformation is equal to 1. Hence, according to Exercise 14,
r r _1

h.r ((£i, ■ • •> Q) = h.r ((Vu ■ ■ •> Vr)) = Z (%) = In (2:r<?) 2 11 B 11 2.


k= 1
c) Let the joint distribution of the random variables £ and r\ be normal with density
function

f(x, y) = --——-exp | - y (Ax2 + 2Bxy + Cy2) I .

Calculate the (relative) information contained in the value rj concerning

Hint. We find the desired information by subtracting from the sum of the informa¬
tions contained in ? and in rj, the information contained in the distribution of the
pair (£, rj). Hence

2ne
/(£, n) — In + In In
J AC - B2

= In

If B — 0, i.e. if £ and rj are independent, we find of course that /(£, rj) = 0.


17. Let £ be a random variable with absolutely continuous distribution and density
function f(x). Let the standard deviation a of £, be finite and positive.
IX, § 11] EXERCISES 615

a) Show that the entropy (of dimension 1 and of order 1) of £, is maximal if £ is


normally distributed.
b) Show that the entropy lx,i(0 (of dimension 1 and of order a > 1) is maximal if

r{—^—
1 [a — 1 «+i x^ \ ^
2 a_1 1 - — 1_a for | x | <c,
f*(x) = 7 \r(( a —“ 1
otherwise,

3a— 1
where we have put c — a
a - 1

Hint. Put
-f 00

(x — m)2
m = M( C) , „■ = / (x — m)2f(x) dx, (f(x) exp -
2a2
V 2na
We have then
+ 00 + 00

0 < ( f(x) In —dx = ( <p(x) In —j— dx — f f(x) In —-- dx,


> <P(x) J <P(x) J f(x)

which implies a), b) can be proved in the same fashion. Let it be noticed that /a(x)
tends to

7 —exp
2jig (" +)
as a —> 1.

18. Let f(x) and f„(x) be density functions such that /„(*) = 0 (« = 1, 2,. .,) for
every value of x for which f(x) = 0; suppose further that all integrals

+ 00
r nw dx («= 1,2,...)
J /(
f(x)

exist and that

f2J,x)
*. f fix)
dx — 1.
n co J

Prove that under these conditions

lim sup | J f„(x) dx — J/(.*) dx [ = 0,


n —*■ co E E E

where E runs through all measurable subsets of the set of real numbers.
616 INTRODUCTION TO INFORMATION THEORY [IX, § 11

Hint. Applying Schwarz’ inequality, we get


+ 00

j j" fn (a) dx - | fix) dx < I'


J V/(A) V
<

1 E

+ CO

ifnix)-f{X)Y

* J Ax)
dx

and clearli
+ 00 + 00
(/„(*)-A*))2 dx =
f n (X)
dx — 1.
fix) Ax)
— 00 — oo
TABLES 617

TABLES

Table 1

Values of nl and of log n\ for n <; 50

n n\ log n! n n\ log n\

1 1 0.00000000 26 40329146-1019 26.60561903


2 2 0.30103000 27 10888869-1021 28.03698279
3 6 0.77815125 28 30488834-1022 29.48414082
4 24 1.38021124 29 88417620-1023 30.94653882
5 120 2.07918125 30 26525286-1025 32.42366007
6 720 2.85733250 31 82228387-1026 33.91502177
7 5040 3.70243054 32 26313084-1028 35.42017175
8 40320 4.60552052 33 86833176-1029 36.93868569
9 362880 5.55976303 34 29523280-1031 38.47016460
10 3628800 6.55976303 35 10333148-1033 40.01423265
11 39916800 7.60115572 36 37199333-1034 41.57053515
12 47900160 • 10 8.68033696 37 13763753-1036 43.13873687
13 62270208 • 102 9.79428032 38 52302262-1037 44.71852047
14 87178291 • 103 10.94040835 39 2039 7882-1039 46.30958508
15 13076744 • 105 12.11649961 40 81591528-1040 47.91164507
16 20922790 • 106 13.32061959 41 33452527-1042 49.52442892
17 35568743 • 107 14.55106852 42 14050061-1044 51.14767822
18 64023737 • 10* 15.80634102 43 60415263 -1045 52.78114667
19 12164510 • 10l° 17.08509462 44 26582716-1047 54.42459935
20 24329020 • 10u 18.38612462 45 11962222-1049 56.07781186
21 51090942 • 1012 19.70834391 46 55026222-1050 57.74056969
22 11240007 • 1014 21.05076659 47 25862324-1052 59.41266755
23 25852017 • 1015 22.41249443 48 12413916-1054 61.09390879
24 62044840 • 1016 23.79270567 49 60828186-1055 62.78410487
25 15511210 • 1018 25.19064568 50 30414093-1057 64.48307487
618 TABLES

Tabli: 2

Binomial coefficients for n <; 301

V
n \
0 l 2 3 4 5 6 7 8

2 1 2 1
3 1 3 3 1
4 1 4 6 4 1
5 1 5 10 10 5 1
6 1 6 15 20 15 6 1
7 1 7 21 35 35 21 7 1
8 1 8 28 56 70 56 28 8 1
9 1 9 36 84 126 126 84 36 9
10 1 10 45 120 210 252 210 120 45
11 1 11 55 165 330 462 462 330 165
12 1 12 66 220 495 792 924 792 495
13 1 13 78 286 715 1287 1716 1716 1287
14 1 14 91 364 1001 2002 3003 3432 3003
15 1 15 105 455 1365 3003 5005 6435 6435
16 1 16 120 560 1820 4368 8008 11440 12870
17 1 17 136 680 2380 6188 12376 19448 24310
18 1 18 153 816 3060 8568 18564 31824 43758
19 1 19 171 969 3876 11628 27132 50388 75582
20 1 20 190 1140 4845 15504 38760 77520 125970
21 1 21 210 1330 5985 20349 54264 116280 203490
22 1 22 231 1540 7315 26334 74613 170544 319770
23 1 23 253 1771 8855 33649 100947 245157 490314
24 1 24 276 2024 10626 42504 134596 346104 735471
25 1 25 300 2300 12650 53130 177100 480700 1081575
26 1 26 325 2600 14950 65780 230230 657800 1562275
27 1 27 351 2925 17550 80730 296010 888030 2220075
28 1 28 378 3276 20475 98280 376740 1184040 3108105
29 1 29 406 3654 23751 118755 475020 1560780 4292145
30 1 30 435 4060 27405 142506 593775 2035800 5852925
i

1 For n > 15 values are given for k < only; the further values can

be taken from the table by the relation


TABLES 619

Table 2
( continued)

9 10 11 12 13 14 15
V
2
3
4
5
6
7
8
1 9
10 1 10
55 11 1 11
220 66 12 1 12
715 286 78 13 1 13
2002 1001 364 91 14 1 14
5005 3003 1365 455 105 15 1 15
11440 8008 4368 1820 560 120 16 16
24310 19448 12376 6188 2380 680 136 17
48620 43758 31824 18564 8568 3060 816 18
92378 92378 75582 50388 27132 11628 3876 19
167960 184756 167960 125970 77520 38760 15504 20
293930 352716 352716 293930 203490 116280 54264 21
497420 646646 705432 646646 497420 319770 170544 22
817190 1144066 1352078 1352078 1144066 817190 490314 23
1307504 1961256 2496144 2704156 2496144 1961256 1307504 24
2042975 3268760 4457400 5200300 5200300 4457400 3268760 25
3124550 5311735 7726160 9657700 10400600 9657700 7726160 26
4686825 8436285 1 13037895 17383860 20058300 20058300 17383860 27
6906900 13123110 21474180 30421755 37442160 40116600 37442160 28
10015005 20030010 34597290 51895935 67863915 77558760 77558760 29
14307150 30045015 54627300 86493225 119759850 145422675 155117520 30
1
620 TABLES

The terms of the Poisson distribution


Table 3

X
0.1 0.2 0.3 0.4 0.5
k'''

0 0.90484 0.81873 0.74082 0.67032 0.60653


1 0.09048 0.16375 0.22225 0.26813 0.30327
2 0.00452 0.01637 0.03334 0.05362 0 07581
3 0.00015 0.00109 0.00333 0.00715 0.01263
4 0.00005 0.00025 0.00071 0.00158
5 0.00001 0.00005 0.00016
6 0.00001

X
0.6 0.7 0.8 0.9
k \

0 0.54881 0.49659 0.44933 0.40657


1 0.32929 0.34761 0.35946 0.36591
2 0.09878 0.12166 0.14379 0.16466
3 0.01976 0.02838 0.03834 0.04939
4 0.00296 0.00496 0.00766 0.01111
5 0.00035 0.00069 0.00123 0.00200
6 0.00003 0.00008 0.00016 0.00030
7 0.00001 0.00003

X
l 2 3 4 5

0 0.36788 0.13534 0.04978 0.01831 0.00673


1 0.36788 0.27067 0.14936 0.07326 0.03369
2 0.18394 0.27067 0.22404 0.14653 0.08422
3 0.06131 0.18045 0.22404 0.19537 0.14037
4 0.01532 0.09022 0.16803 0.19537 0.17547
5 0.00306 0.03609 0.10082 0.15629 0.17547
6 0.00051 0.01203 0.05040 0.10420 0.14622
7 0.00007 0.00343 0.02160 0.05954 0.10444
8 0.00085 0.00810 0.02977 0.06527
9 0.00019 0.00270 0.01323 0.03626
10 0.00003 0.00081 0.00529 0.01813
11 0.00022 0.00192 0.00824
12 0.00005 0.00064 0.00343
13 0.00001 0.00019 0.00132
14 0.00005 0.00047
15 0.00001 0.00015
16 0.00004
17 0.00001
TABLES 621

(continued) Table 3
/

6
/

7 8 9 10
/
•*

0 0.00247 0.00091 0.00033 0.00012 0.00004


1 0.01487 0.00638 0.00268 0.00111 0.00045
2 0.04461 0.02234 0.01073 0.00499 0.00227
3 0.08923 0.05212 0.02862 0.01499 0.00756
4 0.13385 0.09122 0.05725 0.03373 0.01891
5 0.16062 0.12772 0.09160 0.06072 0.03783
6 0.16062 0.14900 0.12214 0.09109 0.06305
7 0.13768 0.14900 0.13959 0.11712 0.09007
8 0.10326 0.13038 0.13959 0.13176 0.11260
9 0.06883 0.10140 0.12408 0.13176 0.12511
10 0.04130 0.07098 0.09926 0.11858 0.12511
11 0.02252 0.04517 0.07219 0.09702 0.11374
12 0.01126 0.02635 0.04812 0.07276 0.09478
13 0.00519 0.01418 0.02961 0.05037 0.07290
14 0.00222 0.00709 0.01692 0.03238 0.05207
15 0.00089 0.00331 0.00902 0.01943 0.03471
16 0.00033 0.00144 0.00451 0.01093 0.02169
17 0.00011 0.00059 0.00212 0.00578 0.01276
18 0.00003 0.00023 0.00094 0.00289 0.00709
19 0.00001 0.00008 0.00039 0.00137 0.00373
20 0.00003 0.00015 0.00061 0.00186
21 0.00006 0.00026 0.00088
22 0.00002 0.00010 0.00040
23 0.00004 0.00017
24 0.00001 0.00007
25 0.00002
26 0.00001
622 TABLES

Table 3 (continued)

v X
12 13 14 15
ll
k \
1
0 0.00001
0.00018 0.00007 0.00002 0.00001
1
0.00101 0.00044 0.00019 0.00008 0.00003
2
0.00370 0.00177 0 00082 0.00038 0.00017
3
4 0.01018 0.00530 0.00269 0.00133 0.00064
5 0 02241 0.01274 0.00699 0.00373 0.00193
6 0.04109 0.02548 0.01515 0.00869 0.00483
7 0.06457 0.04368 0.02814 0.01739 0.01037
8 0.08879 0.06552 0 04573 0.03043 0.01944
9 0.10853 0.08736 0.06605 0.04734 0.03240
10 0.11938 0.10484 0.08587 0.06628 0.04861
11 0.11938 0.11437 0.10148 0.08435 0.06628
12 0.10943 0.11437 0.10994 0.09841 0.08285
13 0.09259 0.10557 0.10994 0.10599 0.09560
14 0.07275 0.09048 0.10209 0.10599 0.10244
15 0.05335 0.07239 0.08847 0 09892 0.10244
16 0.03668 0.05429 0.07188 0 08655 0 09603
17 0.02373 0.03832 0 05497 0.07128 0.08473
18 0.01450 0 02555 0 03970 0 05544 0.07061
19 0 00839 0.01613 0.02716 0.04085 0.05574
20 0 00461 0 00968 001765 0 02859 0.04181
21 0.00241 0.00553 0 01093 0.01906 0.02986
22 0.00121 0 00301 0 00645 001213 0.02036
23 0.00057 0 00157 0 00365 0.00738 0.01328
24 0.00026 0 00078 0.00197 0 00430 0.00830
25 0 00011 0.00037 0 00102 0.00241 0.00498
26 0 00004 0.00017 0.00051 0.00129 0.00287
27 0.00002 0.00007 0.00024 0 00067 0 00159
28 0.00003 0.00011 0 00033 0.00085
29 0.00001 0.00005 0 00016 0.00044
30 0.00002 0.00007 0.00022
31 0,00003 0.00010
32 0.00001 0.00005
33 0.00002
34 0.00001
TABLES 623

Table 3

17 18 19 20
k

0
1
2
3 0 00003
4 0.00014 0.00006 0.00003 0.00001
5 0.00049 0.00024 0.00011 0.00005
6 0 00138 0.00071 0.00036 0.00018
7 0 00337 0.00185 0.00099 0.00052
8 0.00716 0.00416 0.00236 0.00130
9 0.01352 0.00832 0.00498 0.00290
10 0 02300 0.01498 0.00946 0.00581
11 0 03554 0 02452 0.01635 0.01057
12 0.05035 0 03678 0.02588 0.01762
13 0 06584 0.05092 0.03783 0.02711
14 0.07996 0.06548 0.05135 0.03874
15 0.09062 0.07857 0.06504 0.05165
16 0.09628 0.08839 0 07724 0.06456
17 0 09628 0.09359 0.08632 0.07595
18 0.09093 0.09359 0.09112 0.08439
19 0.08136 0.08867 0.09112 0 08883
20 0.06915 0.07980 0.08656 0 08883
21 0.05598 0.06840 0 07832 0.08460
22 0 04326 0 05596 0.06764 0.07691
23 0.03197 0.04380 0.05587 0.06688
24 0 02265 0.03285 0.04423 0.05573
25 0 01540 0.02365 003362 0.04458
26 0 01007 0 01637 0.02456 0 03429
27 0 00634 0 01091 0.01728 0.02540
28 0 00385 0 00701 0.01173 0.01814
29 0.00225 0.00435 0 00768 0.01251
30 0.00127 0 00261 0.00486 0.00834
31 0.00070 0.00151 0.00298 0.00538
32 0.00037 000085 0 00177 0.00336
33 0.00019 0.00046 0.00102 0.00203
34 0.00009 0 00024 0.00057 0.00119
35 0.00004 0.00012 0.00030 0.00068
36 0.00002 0.00006 0.00016 0.00938
37 0.00001 0.00003 0.00008 0.00020
38 0.00001 0.00004 0.00010
39 0.00002 0.00005
40 0.00002
41 0.00001
624 TABLES

Table 4

The incomplete gamma function

r*(A) =
1
(n - 1)!
"t“-'e-‘dt= £ Xke~x

k\
(«= 1,2,...)

n 2=0.001 2=0.002 2=0.003 2=0.004

i 0.000999 0.001988 0.002995 0.003992


2 000001 000002 000005 000008
1

n 2=0.005 2=0.006 2 = 0.007 2=0.008

i 0.004987 0.005982 0.006976 0.007968


2 000013 000018 000024 000032
O
O
n 2=0.009 2=0.02 2=0.03
II

i
i 0.008960 0.009950 0.019801 0.029555 .
2 000040 000050 000197 000441
3 000002 000004

n 2=0.04 2=0.05 2 = 0.06 2=0.07

i 0.039211 0.048771 0.058236 0.067606


2 000779 001209 001730 002339
3 000010 000020 000034 000054
4 000001
o
i-H
O

n 2=0.08 2=0.09
II

II

i 0.076884 0.086069 0.095163 0.104166


2 003034 003815 004679 005624
3 000080 000114 000155 000204
4 000002 000002 000003 000006

n 2=0.12 2=0.13 2=0.14 2=0.15

i 0.113080 0.121905 0.130642 0.139292


2 006649 007752 008932 010186
3 000263 000332 000412 000503
4 000008 000011 000014 000018
5 000001
TABLES 625

( continued) Table 4

n A=0 16 2=0.17 2 = 0.18 2=0.19

1 0.147856 0.156335 0.164730 0.173041


2 011513 012912 014381 015919
3 000606 000721 000850 000992
4 000024 000031 000038 000047
5 000001 000001 000001 000001

n 2=0.20 2=0.22 2=0.24 2=0.26

i 0.181269 0.197481 0.213372 0.228948


2 017523 020927 024581 028475
3 001149 001506 001927 002414
4 000057 000082 000113 000154
5 000002 000004 000006 000008
6 000001

n 2=0.28 2=0.30 2=0.40 2=0.50


1

1 0.244216 0.259182 0.329680 0.393469


2 032597 036936 061551 090204
3 002970 003600 007926 014388
4 000205 000366 000776 001752
5 000011 000015 000061 000173
6 000001 000001 000004 000014
7 000001 000001

n 2=1.0 2=1.5 2=2.0 2=2.5

1 0.63212 0.77687 0.86466 0.91792


2 26424 44218 59399 71270
3 08030 19115 32332 45619
4 01899 06564 14288 24242
5 00366 01858 05265 10882
6 00059 00446 01656 04202
7 00008 00093 00453 01419
8 00001 00017 00110 00425
9 00001 00002 00024 00114
10 00005 00028
11 00001 00006
12 00001
626 TABLES

Table 4 ( continued)

2=3.0 2= 3.5 2 = 4.0 2=4.5


n

0.96980 0.98168 0.98889


i 0.95021
80035 86411 90842 93890
2
57681 67915 76190 82642
3
46338 56653 65770
4 35277
18474 27456 37116 46789
5
08392 14239 21487 29708
6
03351 06529 11067 16895
7
01191 02674 05113 08659
8
00380 00987 02137 04026
9
00110 00331 00813 01709
10
00029 00102 00284 00667
11
00007 00029 00092 00240
12
00002 00008 00027 00081
13
00001 00008 00025
14
00002 00007
15
00001 00002
16
17 00001

2=6.0 2 = 6.5
o

6 = 5.5
II

1 0.99326 0.99591 0.99752 0.99850


2 95957 97345 98265 98872
3 87535 91162 93804 95696
4 73497 79830 84880 88816
5 55951 64248 71494 77633
6 38404 47108 55433 63096
7 23782 31396 39370 47347
8 13337 19051 25603 32724
9 06809 10564 15276 20843
10 03183 05378 08392 12262
11 ( 1369 02525 04262 06684
12 00545 01099 02009 03389
13 00202 00445 00883 01603
14 00070 00169 00363 00710
15 00023 00060 00140 00296
16 00007 00020 00051 00116
17 00002 00006 00017 00044
18 00001 00002 00006 00015
19 00001 00002 00005
20 00001 00001
TABLES 627

Table 4


o

od
n A—7.0 A=7.5

II

II
1 0.99909 0.99945 0.99966 0.99980
2 99271 99530 99698 99807
3 97036 97975 98625 99072
4 91823 94085 95762 96989
5 82701 86794 90037 92564
6 69929 75856 80876 85040
7 55029 62184 68663 74382
8 40129 47536 54704 61440
9 27091 33803 40745 47689
10 16950 22359 28338 34703
11 09852 13776 18412 23664
12 05335 07924 11192 15134
13 02700 04267 06380 09092
14 01281 02157 03418 05141
15 00572 01026 01726 02743
16 00241 00461 00823 01383
17 00096 00196 00372 00661
18 00036 00079 00160 00300
19 00013 00031 00065 00130
20 00005 00011 00025 00054
21 00004 00010 00020
22 00001 00003 00008
23 00001 00003
24 00001
628 TABLES

Table 4 (continued)

It A=9.0 A=9.5 A=10.0

1 0.999 88 0.99993 0.99996


2 99877 99921 99950
3 99377 99584 99724
4 97877 98514 98966
5 94504 95974 97075
6 88431 91147 93291
7 79322 83505 86986
8 67610 73134 77978
9 54435 60818 66719
10 41259 47817 54207
11 29401 35467 41696
12 19699 24801 30322
13 12423 16338 20844
14 07385 10186 13554
15 04147 05999 08346
16 02204 03347 04874
17 01111 01773 02704
18 00533 00893 01428
19 00243 00428 00719
20 00106 00196 00345
21 00044 00086 00159
22 00018 00036 00070
. 23 00006 00015 00030
24 00003 00006 00012
25 00002 00004
26 00001 00001

■*?
TABLES 629

Table 5

1
The function <j?(x) = --c 2

Jin
I
X <p(x) * V\x) X <f(x) * <Hx)

0.00 0.3989
0.01 0.3989 0.41 0.3668 0.81 0.2874 1.21 0.1919
0.02 0.3989 0.42 0.3653 0.82 0.2850 1.22 0.1895
0.03 0.3988 0.43 0.3637 0.83 0.2827 1.23 0.1872
0.04 0.3986 0.44 0.3621 0.84 0.2803 • 1.24 0.1849
0.05 I 0.3984 0.45 0.3605 0.85 0.2780 1.25 0.1826
0.06 0.3982 0.46 0.3589 0.86 0.2756 1.26 0.1804
0.07 0.3980 0.47 0.3572 0.87 0.2732 1.27 0.1781
0.08 0.3977 0.48 0.3555 0.88 0.2709 1.28 0.1758
0.09 0.3973 0.49 0.3538 0.89 0.2685 1.29 0.1736
0.10 0.3970 0.50 0.3521 0.90 0.2661 1.30 0.1714
0.11 0.3965 0.51 0.3503 0.91 0.2637 1.31 0.1691
0.12 0.3961 0.52 0.3485 0.92 0.2613 1.32 0.1669
0.13 0.3956 0.53 0.3467 0.93 0.2589 1.33 0.1647
0.14 0.3951 0.54 0.3448 0.94 0.2565 1.34 0.1626
0.15 0.3945 0.55 0.3429 0.95 0.2541 1.35 0.1604
0.16 0.3939 0.56 0.3410 0.96 0.2516 1.36 0.1582
0.17 0.3932 0.57 0.3391 0.97 0.2492 1.37 0.1561
0.18 0.3925 0.58 0.3372 0.98 0.2468 1.38 0.1539
0.19 0.3918 0.59 0.3352 0.99 0.2444 1.39 0.1518
0.20 0.3910 0.60 0.3332 1.00 0.2420 1.40 0.1497
0.21 0.3902 0.61 0.3312 1.01 0.2396 1.41 0.1476
0.22 0.3894 0.62 0.3292 1.02 0.2371 1.42 0.1456
0.23 0.3885 0.63 0.3271 1.03 0.2347 1.43 0.1435
0.24 0.3876 0.64 0.3251 1.04 0.2323 1.44 0.1415
0.25 0.3867 0.65 0.3230 1.05 0.2299 1.45 0.1394
0.26 0.3857 0.66 0.3209 1.06 0.2275 1.46 0.1374
0.27 0.3847 0.67 0.3187 1.07 0.2251 1.47 0.1354
0.28 0.3836 0.68 0.3166 1.08 0.2227 1.48 0.1334
0.29 0.3825 0.69 0.3144 1.09 0.2203 1.49 0.1315
0.30 0.3814 0.70 0.3123 1.10 0.2179 1.50 0.1295
0.31 0.3802 0.71 0.3101 1.11 0.2155 1.51 0.1276
0.32 0.3790 0.72 0.3079 1.12 0.2131 1.52 0.1257
0.33 0.3778 0.73 0.3056 1.13 0.2107 1.53 0.1238
0.34 0.3765 0.74 0.3034 1.14 0.2083 1.54 0.1219
0.35 0.3752 0.75 0.3011 1.15 0.2059 1.55 0.1200
0.36 0.3739 0.76 0.2989 1.16 0.2036 1.56 0.1182
0.37 0.3725 0.77 0.2966 1.17 0.2012 1.57 0.1163
0.38 0.3712 0.78 0.2943 1.18 0.1989 1.58 0.1145
0.39 0.3697 0.79 0.2920 1.19 0.1965 1.59 0.1127
0.40 0.3683 0.80 0.2897 1.20 0.1942 1.60 0.1109

Table 5 (continued)

x φ(x) x φ(x) x φ(x) x φ(x)

1.61 0.1092 2.01 0.0529 2.41 0.0219 2.81 0.0077
1.62 0.1074 2.02 0.0519 2.42 0.0213 2.82 0.0075
1.63 0.1057 2.03 0.0508 2.43 0.0208 2.83 0.0073
1.64 0.1040 2.04 0.0498 2.44 0.0203 2.84 0.0071
1.65 0.1023 2.05 0.0488 2.45 0.0198 2.85 0.0069
1.66 0.1006 2.06 0.0478 2.46 0.0194 2.86 0.0067
1.67 0.0989 2.07 0.0468 2.47 0.0189 2.87 0.0065
1.68 0.0973 2.08 0.0459 2.48 0.0184 2.88 0.0063
1.69 0.0957 2.09 0.0449 2.49 0.0180 2.89 0.0061
1.70 0.0940 2.10 0.0440 2.50 0.0175 2.90 0.0060
1.71 0.0925 2.11 0.0431 2.51 0.0171 2.91 0.0058
1.72 0.0909 2.12 0.0422 2.52 0.0167 2.92 0.0056
1.73 0.0893 2.13 0.0413 2.53 0.0163 2.93 0.0055
1.74 0.0878 2.14 0.0404 2.54 0.0158 2.94 0.0053
1.75 0.0863 2.15 0.0396 2.55 0.0154 2.95 0.0051
1.76 0.0848 2.16 0.0387 2.56 0.0151 2.96 0.0050
1.77 0.0833 2.17 0.0379 2.57 0.0147 2.97 0.0048
1.78 0.0818 2.18 0.0371 2.58 0.0143 2.98 0.0047
1.79 0.0804 2.19 0.0363 2.59 0.0139 2.99 0.0046
1.80 0.0790 2.20 0.0355 2.60 0.0136 3.00 0.0044
1.81 0.0775 2.21 0.0347 2.61 0.0132 3.10 0.0033
1.82 0.0761 2.22 0.0339 2.62 0.0129 3.20 0.0024
1.83 0.0748 2.23 0.0332 2.63 0.0126 3.30 0.0017
1.84 0.0734 2.24 0.0325 2.64 0.0122 3.40 0.0012
1.85 0.0721 2.25 0.0317 2.65 0.0119 3.50 0.0009
1.86 0.0707 2.26 0.0310 2.66 0.0116 3.60 0.0006
1.87 0.0694 2.27 0.0303 2.67 0.0113 3.70 0.0004
1.88 0.0681 2.28 0.0297 2.68 0.0110 3.80 0.0003
1.89 0.0669 2.29 0.0290 2.69 0.0107 3.90 0.0002
1.90 0.0656 2.30 0.0283 2.70 0.0104 4.00 0.0001
1.91 0.0644 2.31 0.0277 2.71 0.0101 4.10 0.0001
1.92 0.0632 2.32 0.0270 2.72 0.0099 4.20 0.0001
1.93 0.0620 2.33 0.0264 2.73 0.0096
1.94 0.0608 2.34 0.0258 2.74 0.0093
1.95 0.0596 2.35 0.0252 2.75 0.0091
1.96 0.0584 2.36 0.0246 2.76 0.0088
1.97 0.0573 2.37 0.0241 2.77 0.0086
1.98 0.0562 2.38 0.0235 2.78 0.0084
1.99 0.0551 2.39 0.0229 2.79 0.0081
2.00 0.0540 2.40 0.0224 2.80 0.0079
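Table 5 tabulates the standard normal density. As a minimal editorial sketch (not from the book; the name phi is ours), its entries can be checked as follows:

    from math import exp, pi, sqrt

    def phi(x):
        # standard normal density: phi(x) = exp(-x**2 / 2) / sqrt(2 * pi)
        return exp(-x * x / 2.0) / sqrt(2.0 * pi)

    # e.g. round(phi(0.00), 4) == 0.3989 and round(phi(1.00), 4) == 0.2420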

Table 6

The normal distribution function Φ(x) = (1/√(2π)) ∫_{-∞}^{x} e^{-u²/2} du

x Φ(x) x Φ(x) x Φ(x) x Φ(x)

0.00 0.5000
0.01 0.5040 0.41 0.6591 0.81 0.7910 1.21 0.8869
0.02 0.5080 0.42 0.6628 0.82 0.7939 1.22 0.8888
0.03 0.5120 0.43 0.6664 0.83 0.7967 1.23 0.8907
0.04 0.5160 0.44 0.6700 0.84 0.7995 1.24 0.8925
0.05 0.5199 0.45 0.6736 0.85 0.8023 1.25 0.8944
0.06 0.5239 0.46 0.6772 0.86 0.8051 1.26 0.8962
0.07 0.5279 0.47 0.6808 0.87 0.8078 1.27 0.8980
0.08 0.5319 0.48 0.6844 0.88 0.8106 1.28 0.8997
0.09 0.5359 0.49 0.6879 0.89 0.8133 1.29 0.9015
0.10 0.5398 0.50 0.6915 0.90 0.8159 1.30 0.9032
0.11 0.5438 0.51 0.6950 0.91 0.8186 1.31 0.9049
0.12 0.5478 0.52 0.6985 0.92 0.8212 1.32 0.9066
0.13 0.5517 0.53 0.7019 0.93 0.8238 1.33 0.9082
0.14 0.5557 0.54 0.7054 0.94 0.8264 1.34 0.9099
0.15 0.5596 0.55 0.7088 0.95 0.8289 1.35 0.9115
0.16 0.5636 0.56 0.7123 0.96 0.8315 1.36 0.9131
0.17 0.5675 0.57 0.7157 0.97 0.8340 1.37 0.9147
0.18 0.5714 0.58 0.7190 0.98 0.8365 1.38 0.9162
0.19 0.5753 0.59 0.7224 0.99 0.8389 1.39 0.9177
0.20 0.5793 0.60 0.7257 1.00 0.8413 1.40 0.9192
0.21 0.5832 0.61 0.7291 1.01 0.8438 1.41 0.9207
0.22 0.5871 0.62 0.7324 1.02 0.8461 1.42 0.9222
0.23 0.5910 0.63 0.7357 1.03 0.8485 1.43 0.9236
0.24 0.5948 0.64 0.7389 1.04 0.8508 1.44 0.9251
0.25 0.5987 0.65 0.7422 1.05 0.8531 1.45 0.9265
0.26 0.6026 0.66 0.7454 1.06 0.8554 1.46 0.9279
0.27 0.6064 0.67 0.7486 1.07 0.8577 1.47 0.9292
0.28 0.6103 0.68 0.7517 1.08 0.8599 1.48 0.9306
0.29 0.6141 0.69 0.7549 1.09 0.8621 1.49 0.9319
0.30 0.6179 0.70 0.7580 1.10 0.8643 1.50 0.9332
0.31 0.6217 0.71 0.7611 1.11 0.8665 1.51 0.9345
0.32 0.6255 0.72 0.7642 1.12 0.8686 1.52 0.9357
0.33 0.6293 0.73 0.7673 1.13 0.8708 1.53 0.9370
0.34 0.6331 0.74 0.7703 1.14 0.8729 1.54 0.9382
0.35 0.6368 0.75 0.7734 1.15 0.8749 1.55 0.9394
0.36 0.6406 0.76 0.7764 1.16 0.8770 1.56 0.9406
0.37 0.6443 0.77 0.7794 1.17 0.8790 1.57 0.9418
0.38 0.6480 0.78 0.7823 1.18 0.8810 1.58 0.9429
0.39 0.6517 0.79 0.7853 1.19 0.8830 1.59 0.9441
0.40 0.6554 0.80 0.7881 1.20 0.8849 1.60 0.9452

Table 6 (continued)

x Φ(x) x Φ(x) x Φ(x) x Φ(x)

1.61 0.9463 1.86 0.9686 2.22 0.9868 2.72 0.9967


1.62 0.9474 1.87 0.9693 2.24 0.9875 2.74 0.9969
1.63 0.9484 1.88 0.9699 2.26 0.9881 2.76 0.9971
1.64 0.9495 1.89 0.9706 2.28 0.9887 2.78 0.9973
1.65 0.9505 1.90 0.9713 2.30 0.9893 2.80 0.9974
1.66 0.9515 1.91 0.9719 2.32 0.9898 2.82 0.9976
1.67 0.9525 1.92 0.9726 2.34 0.9904 2.84 0.9977
1.68 0.9535 1.93 0.9732 2.36 0.9909 2.86 0.9979
1.69 0.9545 1.94 0.9738 2.38 0.9913 2.88 0.9980
1.70 0.9554 1.95 0.9744 2.40 0.9918 2.90 0.9981
1.71 0.9564 1.96 0.9750 2.42 0.9922 2.92 0.9982
1.72 0.9572 1.97 0.9756 2.44 0.9927 2.94 0.9984
1.73 0.9582 1.98 0.9761 2.46 0.9931 2.96 0.9985
1.74 0.9591 1.99 0.9767 2.48 0.9934 2.98 0.9986
1.75 0.9599 2.00 0.9772 2.50 0.9938 3.00 0.9986
1.76 0.9608 2.02 0.9783 2.52 0.9941 3.20 0.9993
1.77 0.9616 2.04 0.9793 2.54 0.9945 3.40 0.9996
1.78 0.9625 2.06 0.9803 2.56 0.9948 3.60 0.9998
1.79 0.9633 2.08 0.9812 2.58 0.9951 3.80 0.9999
1.80 0.9641 2.10 0.9821 2.60 0.9953
1.81 0.9649 2.12 0.9830 2.62 0.9956
1.82 0.9656 2.14 0.9838 2.64 0.9959
1.83 0.9664 2.16 0.9846 2.66 0.9961
1.84 0.9671 2.18 0.9854 2.68 0.9963
1.85 0.9678 2.20 0.9861 2.70 0.9965
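The entries of Table 6 can be checked through the error function; a minimal editorial sketch (not from the book; the name Phi is ours):

    from math import erf, sqrt

    def Phi(x):
        # standard normal distribution function: Phi(x) = (1 + erf(x / sqrt(2))) / 2
        return 0.5 * (1.0 + erf(x / sqrt(2.0)))

    # e.g. round(Phi(1.00), 4) == 0.8413 and round(Phi(1.96), 4) == 0.9750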
Table 7

The values of 100·P_n(c)

Table 8

The function K(z) = Σ_{k=-∞}^{+∞} (-1)^k e^{-2k²z²}

z K(z) z K(z) z K(z)

0.28 0.000001 0.71 0.305471 1.14 0.851394
0.29 0.000004 0.72 0.322265 1.15 0.858038
0.30 0.000009 0.73 0.339113 1.16 0.864442
0.31 0.000021 0.74 0.355981 1.17 0.870612
0.32 0.000046 0.75 0.372833 1.18 0.876548
0.33 0.000091 0.76 0.389640 1.19 0.882258
0.34 0.000171 0.77 0.406372 1.20 0.887750
0.35 0.000303 0.78 0.423002 1.21 0.893030
0.36 0.000511 0.79 0.439505 1.22 0.898104
0.37 0.000826 0.80 0.455857 1.23 0.902972
0.38 0.001285 0.81 0.472041 1.24 0.907648
0.39 0.001929 0.82 0.488030 1.25 0.912132
0.40 0.002808 0.83 0.503808 1.26 0.916432
0.41 0.003972 0.84 0.519366 1.27 0.920556
0.42 0.005476 0.85 0.534682 1.28 0.924505
0.43 0.007377 0.86 0.549744 1.29 0.928288
0.44 0.009730 0.87 0.564546 1.30 0.931908
0.45 0.012590 0.88 0.579070 1.31 0.935370
0.46 0.016005 0.89 0.593316 1.32 0.938682
0.47 0.020022 0.90 0.607270 1.33 0.941848
0.48 0.024683 0.91 0.620928 1.34 0.944872
0.49 0.030017 0.92 0.634286 1.35 0.947756
0.50 0.036055 0.93 0.647338 1.36 0.950512
0.51 0.042814 0.94 0.660082 1.37 0.953142
0.52 0.050306 0.95 0.672516 1.38 0.955650
0.53 0.058534 0.96 0.684636 1.39 0.958040
0.54 0.067497 0.97 0.696444 1.40 0.960318
0.55 0.077183 0.98 0.707940 1.41 0.962486
0.56 0.087577 0.99 0.719126 1.42 0.964552
0.57 0.098656 1.00 0.730000 1.43 0.966516
0.58 0.110395 1.01 0.740566 1.44 0.968382
0.59 0.122760 1.02 0.750826 1.45 0.970158
0.60 0.135718 1.03 0.760780 1.46 0.971846
0.61 0.149223 1.04 0.770434 1.47 0.973448
0.62 0.163225 1.05 0.779794 1.48 0.974970
0.63 0.177753 1.06 0.788860 1.49 0.976412
0.64 0.192677 1.07 0.797636 1.50 0.977782
0.65 0.207987 1.08 0.806128 1.51 0.979080
0.66 0.223637 1.09 0.814342 1.52 0.980310
0.67 0.239582 1.10 0.822282 1.53 0.981476
0.68 0.255780 1.11 0.829950 1.54 0.982578
0.69 0.272189 1.12 0.837356 1.55 0.983622
0.70 0.288765 1.13 0.844502 1.56 0.984610

Table 8 (continued)

z K{z) Z K(z) Z K(z)

1.57 0.985544 1.93 0.998837 2.29 0.999944


1.58 0.986426 1.94 0.998924 2.30 0.999949
1.59 0.987260 1.95 0.999004 2.31 0.999954
1.60 0.988048 1.96 0.999079 2.32 0.999958
1.61 0.988791 1.97 0.999149 2.33 0.999962
1.62 0.989492 1.98 0.999213 2.34 0.999965
1.63 0.990154 1.99 0.999273 2.35 0.999968
1.64 0.990777 2.00 0.999329 2.36 0.999970
1.65 0.991364 2.01 0.999380 2.37 0.999973
1.66 0.991917 2.02 0.999428 2.38 0.999976
1.67 0.992438 2.03 0.999474 2.39 0.999978
1.68 0.992928 2.04 0.999516 2.40 0.999980
1.69 0.993389 2.05 0.999552 2.41 0.999982
1.70 0.993828 2.06 0.999588 2.42 0.999984
1.71 0.994230 2.07 0.999620 2.43 0.999986
1.72 0.994612 2.08 0.999650 2.44 0.999987
1.73 0.994972 2.09 0.999680 2.45 0.999988
1.74 0.995309 2.10 0.999705 2.46 0.999989
1.75 0.995625 2.11 0.999723 2.47 0.999990
1.76 0.995922 2.12 0.999750 2.48 0.999991
1.77 0.996200 2.13 0.999770 2.49 0.999992
1.78 0.996460 2.14 0.999790 2.50 0.9999925
1.79 0.996704 2.15 0.999806 2.55 0.9999956
1.80 0.996932 2.16 0.999822 2.60 0.9999974
1.81 0.997146 2.17 0.999838 2.65 0.9999984
1.82 0.997346 2.18 0.999852 2.70 0.9999990
1.83 0.997533 2.19 0.999864 2.75 0.9999994
1.84 0.997707 2.20 0.999874 2.80 0.9999997
1.85 0.997870 2.21 0.999886 2.85 0.99999982
1.86 0.998023 2.22 0.999896 2.90 0.99999990
1.87 0.998145 2.23 0.999904 2.95 0.99999994
1.88 0.998297 2.24 0.999912 3.00 0.99999997
1.89 0.998421 2.25 0.999920
1.90 0.998536 2.26 0.999926
1.91 0.998644 2.27 0.999934
1.92 0.998744 2.28 0.999940
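The Kolmogorov limit distribution of Table 8 converges very quickly; a minimal editorial sketch (not from the book; the name K and the truncation parameter terms are ours) reproduces the tabulated values by truncating the series:

    from math import exp

    def K(z, terms=100):
        # K(z) = sum_{k=-inf}^{+inf} (-1)**k * exp(-2 * k**2 * z**2), for z > 0
        if z <= 0.0:
            return 0.0
        total = 1.0                                   # the k = 0 term
        for k in range(1, terms + 1):                 # the terms for k and -k are equal
            total += 2.0 * (-1) ** k * exp(-2.0 * k * k * z * z)
        return total

    # e.g. round(K(1.00), 6) == 0.730000 and round(K(0.50), 6) == 0.036055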

Table 9

The function L(z, a) = (4/π) Σ_{k=0}^{∞} ((-1)^k/(2k+1)) exp(-(2k+1)²π²(1-a)/(8az²))

z        a=0.01    a=0.02    a=0.03    a=0.04    a=0.05    a=0.06    a=0.07

0.1
0.5
1.0                                                        0.0000    0.0000
1.5                          0.0000    0.0000    0.0000    0.0002    0.0009
2.0                0.0000    0.0001    0.0008    0.0036    0.0101    0.0212
2.5                0.0001    0.0022    0.0112    0.0299    0.0578    0.0925
3.0      0.0000    0.0015    0.0151    0.0474    0.0941    0.1487    0.2061
3.5      0.0001    0.0092    0.0491    0.1136    0.1879    0.2628    0.3341
4.0      0.0006    0.0291    0.1052    0.2001    0.2942    0.3804    0.4571
4.5 0.0031 0.0643 0.1776 0.2951 0.4001 0.4901 0.5665
5.0 0.0096 0.1134 0.2582 0.3895 0.4985 0.5873 0.6598
5.5 0.0225 0.1726 0.3406 ; 0.4784 0.5863 0.6707 0.7374
6.0 0.0428 0.2375 0.4204 0.5591 0.6627 0.7409 0.8005
6.5 0.0707 0.3045 0.4952 0.6310 0.7282 0.7989 0.8509
7.0 0.1053 0.3708 0.5638 0.6940 0.7834 0.8461 0.8904
7.5 0.1452 0.4347 0.6258 0.7484 0.8294 0.8838 0.9207
8.0 0.1889 0.4959 0.6811 0.7951 0.8671 0.9135 0.9436
8.5 0 2348 0.5513 0.7301 0.8345 0.8977 0.9365 0.9606
9.0 0.2819 0.6031 0.7731 0.8676 0.9221 0.9540 0.9729
9.5 0.3290 0.6506 0.8104 0.8950 0.9414 0.9672 0.9817
10.0 0.3754 0.6938 0.8427 0.9175 0.9564 0.9770 0.9878
11.0 0.4640 0.7678 0.8939 0.9505 0.9768 0.9891 0.9949
12.0 0.5450 0.8270 0.9303 0.9714 0.9882 0.9951 0.9980
13.0 0.6174 0.8734 0.9555 0.9841 0.9943 0.9980 0.9993
14.0 0.6812 0.9090 0.9724 0.9915 0.9974 0.9992 0.9998
15.0 0.7367 0.9358 0.9833 0.9956 0.9988 0.9997 0.9999
16.0 0.7844 0.9555 0.9902 0.9978 0.9995 0.9999 1.0000
17.0 0.8249 0.9697 0.9944 0.9990 0.9998 1.0000
18.0 0.8591 0.9797 0.9969 0.9995 0.9999
19.0 0.8876 0.9867 0.9983 0.9998 1.0000
20.0 0.9112 0.9915 0.9991 0.9999
21.0 0.9304 0.9946 0.9996 1.0000
22.0 0.9459 0.9967 0.9998
23.0 0.9584 0.9980 0.9999
24.0 0.9683 0.9988 1.0000
25.0 0.9760 0.9993
30.0 0.9949 1.0000
35.0 0.9991
40.0 0.9999
43.0 1.0000
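The printed heading of Table 9 is damaged in this copy, so the exact form of the exponent given above is our inference from the tabulated values. Assuming that reconstruction, the entries can be reproduced with the following editorial sketch (the name renyi_L and the truncation parameter terms are ours):

    from math import exp, pi

    def renyi_L(z, a, terms=50):
        # assumed series: (4/pi) * sum_{k>=0} (-1)**k/(2k+1) * exp(-(2k+1)**2 * pi**2 * (1-a) / (8*a*z**2))
        c = pi * pi * (1.0 - a) / (8.0 * a * z * z)
        s = 0.0
        for k in range(terms):
            s += (-1) ** k / (2 * k + 1) * exp(-((2 * k + 1) ** 2) * c)
        return 4.0 / pi * s

    # spot checks against the table: renyi_L(10.0, 0.01) ~ 0.3754, renyi_L(4.5, 0.07) ~ 0.5665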
REMARKS AND BIBLIOGRAPHICAL NOTES

These notes wish to call attention to books and papers which may be useful to the
reader for further study of subjects dealt with in the present textbook, including
books and papers to which reference was made in the text. For topics which are
treated in detail in some current textbook, we mention only such books in which the
reader can find further references.
As regards topics not discussed in standard textbooks, the sources of the material
contained in this book are given in greater detail. These bibliographic notes often
contain some remarks on the historical development of the problems dealt with, but
to give a full account of the history of probability theory was of course impossible.
For the history of Probability Calculus up to Laplace see Todhunter [1].
Concerning less-known theorems or methods from other branches of mathematics,
we refer to some current textbook readily accessible to the reader.
The notes are restricted to the most important methodical problems. On several
occasions the method of exposition chosen in the present book is compared in the
notes to that in other textbooks.

Chapter I

Glivenko was the first to stress in his textbook (Glivenko [3]; cf. also Kolmogorov
[9]) the advantage of discussing the algebra of events as a Boolean algebra before
the introduction of the notion of probability. It seems to us that the understanding
of Kolmogorov’s axiomatic theory is hereby facilitated. On the general theory of
measure and integration over a Boolean algebra instead of over a field of sets see
Caratheodory [1], Recent results on probability as a measure on a Boolean algebra
are summarized in Kappos [1].
§§ 1-4. On Boolean algebras in general see Birkhoff [1], Glivenko [2]. We did
not give a system of independent axioms for Boolean algebras, since it seemed to
us of much more importance to present the rules of Boolean algebra in a way which
makes clear the duality of the two basic operations.
§ 5. See Stone [1]. We follow here Frink [1]; as to the Lemma see Hausdorff [1]
and Frink [2].
§ 6. The unsolved problem mentioned in Exercise 7 was first formulated by Dedekind
(cf. Birkhoff [1], p. 147). Concerning Exercise 11 see e.g. Gavrilov [1].

Chapter II

From Chapter II on, probability theory is developed on the basis of Kolmogorov’s


axiomatics, first published in Kolmogorov [5]. The idea to consider probability as
an additive set function has — like every important mathematical idea — many
forerunners, cf. e.g. Borel [1], Lomnicki [1], Levy [1], Steinhaus [1], Jordan [1], [2].
The merit of Kolmogorov was to formulate this idea for the first time consistently
and in its full generality and to show how from this idea probability theory can
be developed as a strictly axiomatic branch of modern mathematics. Herewith he
solved one of the famous problems of Hilbert. Nearly all modern textbooks of proba-
bility theory and mathematical statistics (cf. e.g. Blanc-Lapierre and Fortet [1],
Cramer [2], Doob [1], Feller [7], Fisz [1], Frechet [1], Gnedenko [3], Kac [3],
Levy [3], [4], Loeve [1], Neyman [1], [2], Onicescu, Mihoc and Ionescu-Tulcea [1],
Parzen [1], Richter [1], Schmetterer [1], van der Waerden [1]) and nearly the whole
recent literature of probability theory and mathematical statistics are based on Kol-
mogorov’s axiomatics. Earlier theories and discussions of the concept of probability
may be found in Laplace [1], [2], Bernstein [2], von Mises [1], [2], Wald [1].
§11 and § 12 present a generalized system of axioms which contains that of Kol¬
mogorov as a particular case (cf. Renyi [15]).
§ 3. Concerning Theorem 10 cf. Ch. Jordan [3]. A large number of general identities
and inequalities between probabilities of events can be found in Frechet [2]. With
respect to Theorems 11 and 12 see Renyi [23]. The method based on these theorems
is closely related to the method of indicator functions of Loeve (cf. Loeve [1]). About
the relation of the two methods to each other cf. Renyi [35].
§ 7. On Measure Theory see Halmos [1], Aumann [1]. The proof of the σ-additivity
of Lebesgue-Stieltjes measure (by means of Kolmogorov’s Theorem 3) differs from
those given usually in textbooks and is due to Catherine Renyi and A. Renyi.
§ 10. On integral geometry cf. Blaschke [1].
§ 11. Some authors (cf. e.g. Reichenbach [1], Popper [1], [2]) have long ago
emphasized that conditional probability may and should be chosen as the basic
concept of probability theory. But the starting point of these authors was essentially
philosophical. They did not try to give a corresponding generalization of Kolmogorov’s
axiomatic theory. The idea, that unbounded measures may serve as legitimate
(conditional) probability distributions, does not appear in these early works. On the
other hand, unbounded measures were long ago used in statistics as Bayesian a priori
distributions (cf. e.g. Jeffreys [1], Dumas [1], [2], Baticle [1], [2]), without an exact
mathematical foundation. In the paper [15] of Renyi, these two points of view are,
in a certain sense, united and connected to Kolmogorov’s axiomatic theory. Concerning
the theory of conditional probability algebras cf. further Renyi [14], [18], [19],
Csaszar [1], Kappos [1]. Somewhat later, but independently, Luce [1] constructed
a similar system of axioms for finite conditional probability algebras, by using an
entirely different reasoning (starting from an investigation of the psychological laws
of choice and preference).
§ 12. Concerning exchangeable events (Exercises 38-41) cf. de Finetti [1], Khin-
chin [2]. For Exercise 46, see Wilcoxon [1], Renyi [12]. On Exercise 47 and in general
on applications of probability theory to number theory see e.g. Kac [4], Renyi [25].


Chapter III

§ 2. Cf. Bayes [1].


§ 3. On the Polya-distribution see Eggenberger-Polya [1], for further generalization
see Onicescu-Mihoc [1].
§ 11. Concerning Theorem 6 see Kantorovich [1].
§ 13. Poisson [1], Bortkiewicz [1]. For a general theory of simple and composed
Poisson processes see Aczel-Janossy-Renyi [1], Renyi [5], [6], Aczel [2], Prekopa [1],
Marczewski [1], Florek, Marczewski and Ryll-Nardzewski [1], Saxer [1].
§ 14. The idea that in number theory the application of Dirichlet’s series can be
replaced by a formal algebraic calculus is — according to Hardy — due to Harald
Bohr (Hardy and Wright [1], p. 259). Instead of Dirichlet’s series, this idea is applied
here to power series and hereby to the convolutions of probability distributions.
On related problems see Renyi [4].
§ 16. On Euler’s summation formula see Knopp [1].
§ 17. Concerning Exercise 8, see Bernstein [4]; for a generalization Erdos and
Renyi [4]. For Exercise 32, see Bernstein [1] and also Arato-Renyi [1]. For Exercise
35, see Feldheim [2] and Renyi [20].

Chapter IV

§ 7. On projections of probability distributions see Cramer-Wold [1], Renyi [8].


§ 8. On the lognormal distribution see Kolmogorov [8], Renyi [32].
§ 9. Concerning Example 1, see Lobatchewski [1]. On the χ²-distribution, see
Helmert [1], K. Pearson [1]; on the beta-function, Losch-Schoblik [1].
§ 10. Example 1: Student [1], Example 2: Fisher [1].
§ 17. Concerning Exercises 17-18, cf. Malmquist [1], van Dantzig [1], Hajos-
Renyi [1]. Exercise 26: Bateman [1], Bharucha-Reid [1]. Exercise 45: see, for
instance, Veksler-Grochev-Isaev [1].

Chapter V

§ 1. The content of this section appeared first in the present book (German edition,
1962).
§ 2. We follow here Kolmogorov [5]. For the Radon-Nikodym theorem, see e.g.
Halmos [1].
§ 3. Concerning the new deduction of the Maxwell distribution given here, see
Renyi [19].
§ 4. We follow here Kolmogorov [5].
§ 6 and § 7. See Gebelein [1], Renyi [26], [28], Csaki-Fischer [1], [2]. On the
Lemma of Theorem 3 of § 7, see Boas [1]. On Theorem 3, see Renyi [26]. On the
applications of the large sieve of Linnik in number theory, see Linnik [1], Renyi [2],
Bateman-Chowla-Erdos [1].
§ 7. Exercises 1-4 treat problems of integral geometry from the point of view of
probability theory (see Blaschke [1]). Exercise 6: cf. Hajos-Renyi [1]; Exercises
28-30: Renyi [28].

Chapter VI

The method of characteristic functions, as mentioned by Levy [4], goes essentially


back to Cauchy. As to its application in probability theory, the merit is due to
Liapunov [1], Polya [1] and Levy [2]. Detailed expositions of the theory may be
found in Dugue [1], Esseen [1], Ky Fan [1], Linnik [3], Lukacs [4]. On Fourier
series and integrals, see Zygmund [1], [2].
§ 4. On the theorem of Helly, cf. Lukacs [4].
§ 5. Theorem 1: Bernstein [4]; Theorem 3: Cramer [1]; Theorem 4: Linnik-
Singer [1]. Concerning the theorem on the singularities of a power series with positive
coefficients, which was applied in point 4, see Titchmarsh [1]. On the theorem of
H. A. Schwarz, cf. Hurwitz-Courant [1]. For the formula of Faa di Bruno, cf. Jordan
[4], Lukacs [2]. Theorem 5: cf. Darmois [1] and Skitovitch [1]. Theorem 6: Lukacs
[1]; see also Geary [1], Kawata-Sakamoto [1], Singer [1] and Lukacs [3].
§§ 7-8. On the theory of infinitely divisible and stable distributions see Gnedenko-
Kolmogorov [1], Levy [4], Khinchin-Levy [1], Feldheim [1].
§ 7. Concerning the theory of distributions (theory of generalized functions) we
follow the book of Lighthill [1], which develops the theory established by J. Mikusinski.
The application of the theory of distributions to probability theory was published
for the first time in the German edition of the present book. For Theorem 7, see
Robbins [1]. On the method of Laplace, see de Bruijn [1]. On the theory of theta
functions, cf. Hurwitz and Courant [1]. Chung and Erdos proved Theorem 8 in a
stronger form.
§10. Exercise 4: see Shannon [1]; Exercise 6: Hardy-Littlewood-Polya [1];
Exercise 14 (for a more general theorem): Csaszar [2]; Exercise 20: Laha [1].

Chapter VII

§ 1. Chebyshev [1], Bienayme [1]. The first Chebyshev-type inequality is due to


Gauss [1].
§ 2. Bernoulli [1], Slutsky [1].
§ 3. Markov [1], Khinchin [4].
§ 4. Bernstein [4], Uspenski [1].
§ 5. Borel [1], Cantelli [1]. On Lemma C, see Erdos-Renyi [2].
§ 6. Kolmogorov [1], [5].
§ 7. Kolmogorov [2]. On Lemma 2: Knopp [1].
§ 8. Glivenko [1].
§ 9. Khinchin [1], Kolmogorov [1], Erdos [1], Feller [4], Renyi [1].
§ 10. Renyi [24], Renyi-Revesz [1].
§ 11. Renyi [38].
§ 12. Renyi-Revesz [2], Renyi [38].
§ 13. Kolmogorov [5].
§ 14. Kolmogorov [5], Khinchin-Kolmogorov [1], Steinhaus [2] and also Stein-
haus, Kac and Ryll-Nardzewski [1]; for the lemma, Doob [2].
§ 15. Renyi [15], [17]. For the theorem of Abel-Dini, cf. Knopp [1], p. 173.
§ 16. Exercise 10: cf. Hille [1]; Exercise 11: Widder [1]; Exercise 13: Renyi [1];
Exercise 14: Hajek-Renyi [1]; Exercise 21: Doob [2] (the theorem mentioned in
the remark is due to Menshov); Exercise 24: Kolmogorov [5].

Chapter VIII

§ 1. Chebyshev [1], Markov [1], Liapunov [1], Lindeberg [1], Polya [1], Feller [1],
Khinchin [3], Gnedenko-Kolmogorov [1], Kolmogorov [5], [11], Prekopa-Renyi-
Urbanik [1].
§ 2. Gnedenko [2].
§ 3. Levy [4], Feller [1], Khinchin [5], and also Gnedenko-Kolmogorov [1].
§ 5. Erdos-Renyi [1]; for the particular case p = const, cf. Bernstein [4].
§ 6. For the lemma, cf. Cramer [3]. Theorem 3 was first, under certain restrictions,
proved by a different method (cf. Renyi [3]). This result was generalized by Kolmo-
gorov [10]. For the simpler proof given here, cf. Renyi [24]. Theorem 3 may be
applied to prove limit theorems for dependent random variables; cf. Revesz [1].
On the central limit theorem for dependent random variables, see the fundamental
paper of Bernstein [3].
§ 7. Anscombe [1], Doeblin [2], Renyi [22], [31].
§ 8. On the theory and applications of Markov chains and Markov processes
see Markov [1], Kolmogorov [3], [7]; Doeblin [1], [2], Feller [2], [3], [6], [7],
Doob [2], Chung [1], Bartlett [1], Bharucha-Reid [1], Wiener [2], Chandrasekhar
[1], Einstein [1], Hostinsky [1], Levy [3], Renyi [11].
§ 9. Renyi [9], [10], van Dantzig [1], Malmquist [1]; further references are to
be found in Wilks [1], Wang [1].
§ 10. Kolmogorov [6], N. V. Smirnov [1], [2], Gnedenko [1], Gnedenko-Koroljuk
[1], Doob [1], Feller [5], Donsker [1].
§ 11. For Theorem 1: Polya [2]. See also Dvoretzky-Erdos [1]. On the arc sine
law (Theorem 6) cf. Levy [2], Erdos-Kac [2]; Sparre-Andersen [1], [2]; Chung-
Feller [1], Renyi [33]. On Lemma 2, Renyi [36]; for other generalizations, Spitzer
[1]. For Theorem 8, see Erdos-Hunt [1]; for Theorem 9, Erdos-Kac [1]; for a gen-
eralization of it, Renyi [9].
§ 12. Lindeberg [1], Krickeberg [1].
§ 13. Exercise 5: Renyi [17]; for a similar general system of independent functions,
see Steinhaus, Kac and Ryll-Nardzewski [1 ]— [10], Renyi [7]. Exercise 8: Kac [1];
Exercise 24: Wilcoxon [1] and Renyi [12]; Exercise 25: Lehmann [1] and Renyi
[12]; the equivalence of the problems considered in these two papers is proved in
E. Csaki [1]. Exercise 28: Erdos-Renyi [3]. The result of Exercise 30 is due to
Chung and Feller [1]; as regards the presentation given here, cf. Renyi [33].

Appendix

On the concepts of the entropy and information see Boltzmann [1], Hartley [1],
Shannon [1], Wiener [1], Shannon-Weaver [1], Woodward [1], Barnard [1], Jeffreys
[1], and the papers of Khinchin, Fadeev, Kolmogorov, Gelfand, Jaglom, etc., in
Arbeiten zur Informationstheorie I-III. On the role of the notion of information in sta-
tistics, see the works of Fisher [1]-[3], and of Kullback [1]. The notion of the dimension
of a probability distribution and that of the entropy of the corresponding dimension
were introduced in a paper of Balatoni and Renyi (Arbeiten zur Informationstheorie I)
and were further developed in Renyi [27], [30]. Measures of information differing
from the Shannon-measure were already considered earlier, e.g. by Bhattacharyya
[1] and Schützenberger [1]; the theory of entropy and information of order α is
developed in Renyi [34], [37].

Part of the material appeared for the first time in the German edition of this book.
This appendix covers merely the basic notions of information theory; their appli¬
cation to the transmission of information through a noisy channel, coding theory,
etc. are not dealt with here. Besides the already mentioned works of Shannon and
Khinchin, we also mention those of Feinstein [1], [2], McMillan [1] and Wol-
fowitz [1], [2], [3].
§ 1. Concerning the theorem of Erdos on additive number-theoretical functions,
which was rediscovered by Fadeev, see Erdos [2]; the simple proof given in the text
is due to Renyi [29].
§ 2. For the theorem of Mercer, see Knopp [1].
§ 6. On the mean value theorem, see de Finetti [2], Kolmogorov [4], Nagumo [1],
Aczel [1], Hardy-Littlewood-Polya [1] (where further references can be found;
this book contains also all other inequalities used in the Appendix, e.g. the inequalities
of Jensen and of Holder).
§ 9. The idea that quantities of information theory may be used for the proof of
limit theorems is due to Linnik [2].
On the theorem of Perron-Frobenius, see Gantmacher [1].
§ 11. For Exercise 2c, see Renyi [16] and Kac [2]; Exercise 3c: Khinchin [6]; for
the generalizations: Renyi [21]. Exercise 4: Kolmogorov-Tikhomirov (Arbeiten zur
Informationstheorie III). The content of Exercise 9 is due to Chung (unpublished
communication); the proof given here differs from that of Chung. Exercise 17b: cf.
Moriguti [1].

Tables

Further tables and graphic representations useful in probability calculus may be


found in Fisher-Yates [1], E. S. Pearson-H. O. Hartley [1], Molina [1], Koller [1].
REFERENCES

Aczel, J.
[1] On mean values, Bull. Amer. Math. Soc. 54, 393-400 (1948).
[2] On composed Poisson distributions, III, Acta Math. Acad. Sci. Hung. 3, 219-
224 (1952).
Aczel, J., L. Janossy and A. Renyi
[1] On composed Poisson distributions, I, Acta Math. Acad. Sci. Hung. 1, 209-224
(1950).
Alexandrov, P. S. (Александров, П. С.)
[1] Введение в общую теорию множеств и функций (Introduction to the theory of
sets and functions), OGIZ, Moscow-Leningrad 1948.
Anscombe, F. J.
[1] Large sample theory of sequential estimation, Proc. Cambridge Phil. Soc 48,
600 (1952).
Arato, M. and A. Renyi
[1] Probabilistic proof of a theorem on the approximation of continuous functions
by means of generalized Bernstein polynomials, Acta Math. Acad. Sci. Hung 8,
91-98 (1957).
ARBEITEN ZUR INFORMATIONSTHEORIE I—III (Teil von A. J. Chintschin,
D. K. Faddejew, A. N. Kolmogoroff, A. Renyi und J. Balatoni; Teil II von
I. M. Gelfand, A. M. Jaglom, A. N. Kolmogoroff, Chiang Tse-Pei, I. P.
Zaregradski; Teil III von A. N. Kolmogoroff und W. M. Tichomirow),
VEB Deutscher Verlag der Wissenschaften, Berlin 1957 bzw. 1960.
Aumann, G.
[1] Reelle Funktionen, Springer-Verlag, Berlin-Gottingen-Heidelberg 1954.

Barban, M. B. (Eap6aH, M. E.)


[1] 06 onHoft TeopeMe P. Bateman, S. Chowla, P. Erdos, Publications of the
Mathematical Institute of the Hungarian Academy of Sciences, 9 A, 429-435
(1964).
Barnard, G. A.
[1] The theory of information, J. Royal Stat. Soc. (B) 13, 46-69 (1951).
Bartlett, M. S.
[1] An introduction to stochastic processes with special reference to methods and
applications, Cambridge Univ. Press, Cambridge 1955.
Bateman, H.
[1] The solution of a system of differential equations occurring in the theory of
radioactive transformations, Proc. Cambridge Phil. Soc. 15, 423-427 (1910).
Bateman, P. T., S. Chowla and P. Erdos
[1] Remarks on the size of L( 1,%), Publ. Math. Debrecen 1, 165-182 (1950).
Baticle, E.
[1] Sur une loi de probability a priori parametres d’une loi laplacienne, C. R. Acad.
Sci. Paris 226, 55-57 (1948).

Baticle, E.
[2] Sur une loi de probability a priori pour Interpretation des resultats de tirages
dans une urne, C. R. Acad. Sci. Paris 228, 902-904 (1949).
Bauer, H.
[1] Wahrscheinlichkeitstheorie und Grundzüge der Maßtheorie, Sammlung Gö-
schen 1216/1216a, de Gruyter, Berlin 1964.

Bayes, Th.
[1] Essay towards solving a problem in the doctrine of chances, "Ostwald's Klas-
siker der Exakten Wissenschaften", Nr. 169, W. Engelmann, Leipzig 1908.
Bernoulli, J.
[1] Ars Coniectandi (1713) I-II, III-IV. "Ostwald's Klassiker der Exakten
Wissenschaften”, Nr. 108, W. Engelmann, Leipzig 1899.
Bernstein, S. N. (BepmuTeiiH, C. H.)
[1 ] Demonstration du theoreme de Weierstrass fondee sur la calcul des probabilites,
Soobshchs. Charkovskovo Mat. Obshch. (2) 13, 1-2 (1912).
[2] OnbiT aKcnoMaTHHecKoro odocHOBaHMH Teopmi BepoHTHOcreH (On a tentative
axiomatisation of probability theory), Charkovskovo Zap. Mat. ot-va 15,
209-274 (1917).
[3] Sur l’extension du theoreme limite du calcul des probabilites aux sommes de
quantites dependantes. Math. Ann. 97, 1-59 (1926).
[4] Teopun BepoflTHOCTen (Probability theory), 4. ed., Goztehizdat, Moscow 1946.
Bharucha-Reid, A. T.
[1 ] Elements of the theory of Markov processes and their applications, McGraw-
Hill, New York 1960.
Bhattacharyya, A.
[1] On some analogues of the amount of information and their use in statistical
estimation, Sankhya 8, 1-14 (1946).
Bienayme, M.
[1] Considerations a l’appui de la decouverte de Laplace sur la loi des probabilites
dans la methode des moindres carres, C. R. Acad. Sci. Paris 37, 309-324 (1853).
Birkhoff, G.
[1] Lattice theory, 3. ed., American Mathematical Society Colloquium Publications
25. AMS, Providence 1967.
Blanc-Lapierre, A., et R. Fortet
[1] Theorie des fonctions aleatoires, Masson et Cie., Paris 1953.
Blaschke, W.
[1] Vorlesungen iiber Integralgeometrie, 3. AufL, VEB Deutscher Verlag der
Wissenschaften, Berlin 1955.
Blum, J. R., D. L. Hanson and L. H. Koopmans
[1] On the strong law of large numbers for a class of stochastic processes, Zeit-
schrift fur Wahrscheinlichkeitstheorie 2, 1-11 (1963).
Boas, R. P. Jr.
[1] A general moment problem, Amer. J. Math. 63, 361-370 (1941).
Bochner, S. and S. Chandrasekharan
[1] Fourier transforms, Princeton Univ. Press, Princeton 1949.
Boltzmann, L.
[1] Vorlesungen iiber Gastheorie, Johann Ambrosius Barth, Leipzig 1896.
Borel, E.
[1] Sur les probabilites denombrables et leurs applications arithmetiques, Rend.
Circ. Mat. Palermo 26, 247-271 (1909).
[2] Elements de la theorie des probabilites, Hermann et Fils, Paris 1909.

VON Bortkiewicz, L.
[1] Das Gesetz der kleinen Zahlen, B. G. Teubner, Leipzig 1898.
de Bruijn, N. G.
[1] Asymptotic methods in analysis, North Holland Publ. Comp. Inc., Amsterdam
1958.

Cantelli, F. P.
[1] La tendenza ad un limite nel senzo del calcolo delle probabilita, Rend. Circ.
Mat. Palermo 16, 191-201 (1916).
Caratheodory, C.
[1 ] Entwurf einer Algebraisierung des Integralbegriffes, Sitzungsber. Math.-Natur-
wiss. Klasse Bayer, Akad. Wiss., Munchen 1938, S. 24-28.
Chandrasekhar, S.
[1 ] Stochastic problems in physics and astronomy, Rev. Mod. Phys. 15, 1-89 (1943).
Chebyshev, P. L. (He6bimeB, n. Jl.)
[1] Teopun BepoHTHOCTeM (Theory of probability), Akad. izd., Moscow 1936.
Chung, K. L.
[1] Markov chains with stationary transition probabilities, Springer-Verlag, Berlin-
Gottingen-Heidelberg 1960.
Chung, K. L., and P. Erdos
[1] Probability limit theorems assuming only the first moment, I, Mem. Amer.
Math. Soc. 6, 1-19 (1950).
Chung, K. L. and W. Feller
[1] On fluctuations in coin-tossing, Proc. Acad. Sci. USA 35, 605-608 (1949).
Cramer, H.
,
[1] Uber eine Eigenschaft der normalen Verteilungsfunktion, Math. Z. 41 405-414
(1936).
[2] Random variables and probability distributions, Cambridge Univ. Press,
Cambridge 1937.
[3] Mathematical methods of statistics, Princeton Univ. Press, Princeton 1946.
Cramer, H. and H. Wold
[1] Some theorems on distribution functions, J. London Math. Soc. 11 , 290-294
(1936).
CsAKI, E.
[1] On two modifications of the Wilcoxon-test, Publ. Math. Inst. Hung. Acad.
,
Sci. 4 313-319 (1959).
Csaki, P. and J. Fischer
[1] On bivariate stochastic connection, Publ. Math. Inst. Hung. Acad. Sci. 5,
311-323 (1960).
[2] Contributions to the problem of maximal correlation, Publ. Math. Inst. Hung.
Acad. Sci. 5, 325-337 (1960).
CsaszAr, A.
[1] Sur la structure des espaces de probabilite conditionelle, Acta Math. Acad. Sci.
Hung. 6, 337-361 (1955).
[2] Sur une caracterisation de la repartition normale de probability, Acta Math.
Acad. Sci. Hung. 7, 359-382 (1956).

VAN DANTZIG, D.
[1] Mathematische Statistiek, “Kadercursus Statistiek, 1947-1948”, Mathema-
tisch Centrum, Amsterdam 1948.

Darmois, G.
[1] Analyse generate des liaisons stochastiques, Revue Inst. Internat. Stat. 21, 2-8
(1953).
Doeblin, W.
[1] Sur les proprietes asymptotiques de mouvements regis par certains types de
chaines simples, Bull. Soc. Math. Roumaine Sci. 391, 57-115 (1937), 3911,
3-61 (1937).
[2] Elements d’une theorie generate des chaines simples constantes de Markov,
Ann. Sci. Ecole Norm. Sup. (3) 57, 61-111 (1940).
Donsker, M. D.
[1] Justification and extension of Doob's heuristic approach to the Kolmogorov-
Smirnov theorems, Ann. Math. Stat. 23, 277-281 (1952).
Doob, J. L.
[1] Heuristic approach to the Kolmogorov-Smirnov theorems, Ann. Math. Stat.
20, 393 (1949).
[2] Stochastic processes, Wiley-Chapman, New York-London 1953.
Dugue, D.
[1] Arithmetique de lois de probabilites, Mem. Sci. Math., No. 137, Gauthier-
Villars, Paris 1957.
Dumas, M.
[1] Sur les lois de probabilites divergentes et la formule de Fisher, Intermed. Rech.
Math. 9 (1947), Supplement 127-130.
[2] Interpretation de resultats de tirages exhaustifs, C. R. Acad. Sci. Paris 228,
904-906 (1949).
(See the note from E. Borel after Dumas’ article, too.)
Dvoretzky, A. and P. Erdos
[1] Some problems on random walk in space, Proc. 2nd Berkeley Symp. Math.
Stat. Prob. 1950, Univ. California Press, Berkeley-Los Angeles 1951, 353-367.

Eggenberger, F., und G. Polya


[1] Uber die Statistik verketteter Vorgange, Z. angew. Math. Mech. 3, 279-289
(1923).
Einstein, A.
[1] Zur Theorie der Brownschen Bewegung, Ann. Physik 19, 371-381 (1906).
Erdos, P.
[1] On the law of the iterated logarithm, Ann. Math. 43, 419-436 (1942).
[2] On the distribution function of additive functions Ann. Math. 47, 1-20(1946).
Erdos, P. and G. A. Hunt
[1 ] Changes of signs of sums of random variables, Pacific J. Math. 3,678-679 (1953).
Erdos, P. and M. Kac
[1] On certain limit theorems of the theory of probability, Bull. Amer. Math. Soc.
52, 292-302 (1946).
[2] On the number of positive sums of independent random variables, Bull. Amer.
Math. Soc. 53, 1011-1020 (1947).
Erdos, P. and A. Renyi
[1] On the central limit theorem for samples from a finite population, Publ. Math.
Inst. Hung. Acad. Sci. 4, 49-61 (1959).

[2] On Cantor’s series with convergent Y —, Ann. Univ. Sci. Budapest, Rolando
^ <fn
Eotvos nom., Sect. Math. 2, 93-109 (1959).

Erdos, P. and A. Renyi


[3] On the evolution of random graphs, Publ. Math. Inst. Hung. Acad. Sci. 5,17-61
(1960).
[4] On a classical problem of probability theory, Publ. Math. Inst. Hung. Acad.
Sci. 6, 215-220 (1961).
Esseen, C. G.
[1] Fourier analysis of distribution functions, A mathematical study of the Laplace-
Gaussian law, Acta Math. 77, 1-125 (1945).

Feinstein, A.
[1] A new basic theorem of information theory, Trans. Inst. Radio Eng., 2-22
(1954).
[2] Foundations of information theory, McGraw-Hill, New York 1958.
Feldheim, E.
[1] Étude de la stabilité des lois de probabilité, Dissertation, Univ. Paris, Paris 1937.
[2] Neuere Beweise und Verallgemeinerung der wahrscheinlichkeitstheoretischen
Satze von Simmons, Mat. Fiz. Lapok 45, 99-114 (1938).
Feller, W.
[1] Ober den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung, Math. Z.
40, 521-559 (1935); 42, 301-312 (1947).
[2] Zur Theorie der stochastischen Prozesse, Existenz- und Eindeutigkeitssatze,
Math. Ann. 113, 113-160 (1936).
[3] On the integro-differential equations of purely discontinuous Markov processes,
Trans. Amer. Math. Soc. 48, 488-515 (1940). Errata: ibidem 58, 474 (1945).
[4] The law of the iterated logarithm for identically distributed random variables,
Ann. Math. 47, 631-638 (1946).
[5] On the Kolmogorov-Smirnov limit theorems for empirical distributions, Ann.
Math. Stat. 19, 177-189 (1948).
[6] On the theory of stochastic processes, with particular reference to applications,
Proc. Berkeley Symp. Math. Stat. Prob. 1945, 1946, Univ. California Press,
Berkeley-Los Angeles 1949, 403-432.
[7] An introduction to probability theory and its applications, Vols 1-2, Wiley,
New York 1950-1966.
DE FlNETTI, B.
[1] Funzione caratteristica di un fenomeno aleatorio, Mem. R. Accad. Lincei (6)
4, 85-133 (1930).
[2] Sul concetto di media, Giorn. 1st. Ital. Att. 2, 369-396 (1931).
Fisher, R. A.
[1] Statistical methods for research workers, 10th edition, Oliver-Boyd Ltd.,
Edinburgh-London 1948.
[2] The design of experiments, Oliver-Boyd Ltd., London-Edinburgh 1949.
[3] Contributions to mathematical statistics, Wiley-Chapman, New York-London
1950.
Fisher, R. A. and F. Yates
[1] Statistical tables for biological, agricultural and medical research, Oliver-Boyd
Ltd., London-Edinburgh 1949.
Fisz, M.
[1] Probability theory and mathematical statistics, 3. ed. Wiley, New York 1963.
Florek, K., E. Marczewski and C. Ryll-Nardzewski
[1] Remarks on the Poisson stochastic process, I, Studia Math. 13,122-129(1953).

Frechet, M.
[1] Recherches theoriques modernes, Fascicule 3 du Tome 1 du Traite du calcul
des probabilities par E. Borel et divers auteurs, Gauthier-Villars, Paris 1937.
[2] Les probabilites associees a un systeme d’evenements compatibles et dependants,
I-II, Hermann et Cie., Paris 1940 and 1943.
Frink, O.
[1] Representations of Boolean algebras. Bull. Amer. Math. Soc. 47, 755—756
(1941).
[2] A proof of the maximal chain theorem, Amer. J. Math. 74, 676-678 (1952).

Gantmacher, F. R. (raHTMaxep, O. P.)


[1] Matrizenrechnung, I-II, VEB Deutscher Verlag der Wissenschaften, Berlin
1958 bzw. 1959 (Ubersetzung aus dem Russischen).
Gauss, C. F.
[1] Theoria combinationis observationum erroribus minimis obnoxiae, Gottingen
1821.
Gavrilov, M. A. (raBpnjioB, M. A.)
[1] TeopuH pejiefiHO-KOHTaKTHbix cxeM (Theory of relay-contact schemes),
Moscow-Leningrad 1950.
Geary, R. C.
[1] Distribution of Student’s ratio for nonnormal samples, J. Royal Stat. Soc.
Supplement 3, 178-184 (1936).
Gebelein, H.
[1] Das statistische Problem der Korrelation als Variations-und Eigenwertproblem
und sein Zusammenhang mit der Ausgleichungsrechnung, Z. angew. Math.
Mech. 21, 364-379 (1941).
GLIVENKO, V. I. (TJIHBeHKO, B. H.)
[1] Sulla determinazione empirica di una legge probabilita, Gior. 1st. Ital. Att. 4,
1-10 (1933).
[2] Theorie generale des structures, Act. Sci. Industr. Nr. 652, Hermann et Cie.,
Paris 1938.
[3] Kypc tophh b epoRTHCCTen (A course of probability theory), GONTI, Mos¬
cow-Leningrad 1939.
Gnedenko, B. V. (rHeneHKO, E. B.)
[1] Sur la distribution limite du terme maximum d’une serie aleatoire, Ann. Math.
44, 423-453 (1943).
[2] JloKajibHaa npenejibHaa TeopeMa ajih nnoTHocTen (A local limit theorem for
probability densities), Dokl. Akad. Nauk. SSSR 95, 5-7 (1954).
[3] The theory of probability (Transl. from the Russian), Chelsea, New York 1962.
Gnedenko, B. V. and A. N. Kolmogorov (rHenemco, E. B. h A. H. KojiMoropoB)
[1] Limit distributions for sums of independent random variables, Addison-Wesley.
Cambridge (Mass.) 1954.
Gnedenko, B. V. and V. S. Koroljuk (rHenemco, E. B. n B. C. KoponioK)
[1] O MaKCHMaJibHOM pacxoxcneHHH ncyx OMnupnHecKHX pacnpeneneHUH (On the
maximal divergence of two empirical distributions), Dokl. Akad. Nauk. SSSR
80, 525 (1951).

Hajek, J. and A. Renyi


[1 ] Generalization of an inequality of Kolmogorov, Acta Math. Acad. Sci. Hung. 6,
281-283 (1955).

Hajos, G. and A. Renyi


[1] Elementary proofs of some basic facts concerning order statistics, Acta Math..
,
Acad. Sci. Hung. 5 1-6 (1954).
Halmos, P. R.
[1] Measure Theory, van Nostrand, New York 1950.
Hardy, G. H.
[1] Divergent series, Clarendon Press, Oxford 1949.
Hardy, G. H., J. E. Littlewood and G. Polya
[1] Inequalities, 2nd edition, Cambridge Univ. Press, Cambridge 1952.
Hardy, G. H. and W. W. Rogosinski
[1] Fourier series, 3rd edition, Cambridge Univ. Press, Cambridge 1956.
Hardy, G. H. and E. M. Wright
[1] An introduction to the theory of numbers, 4th edition, Clarendon Press, Oxford
1960.
Harris, T. E.
[1] The theory of branching processes, Springer Verlag, Berlin-Heidelberg-New
York 1963.
Hartley, R. V.
[1] Transmission of information. Bell Syst. Techn. J. 7, 535-563 (1928).
Hausdorff, F.
[1] Grundziige der Mengenlehre, B. G. Teubner, Leipzig 1914.
Helmert, R.
[1] Uber die Wahrscheinlichkeit der Potenzsummen der Beobachtungsfehler und
iiber einige damit im Zusammenhang stehende Fragen, Z. Math. Phys. 21 ,
192-219 (1876).
Hille, E.
[1] Functional analysis and semi-groups, Amer. Math. Soc. Coll. Publ., Vol. 31,
New York 1948.
Hostinsky, B.
[1] Methodes generates du calcul des probability, Mem. Sci. Math. Nr. 52,.
Gauthier-Villars, Paris 1931.
Hurwitz, A. und R. Courant
[1] Funktionentheorie, Springer, Berlin 1929.

Jeffreys, H.
[1] Theory of probability, 2nd edition, Clarendon Press, Oxford 1948.
Jordan, Ch.
[1] On probability, Proc. Phys. Math. Soc. Japan 7, 96-109 (1925).
[2] Statistique mathematique, Gauthier-Villars, Paris 1927.
[3] Le theoreme de probability de Poincare, generalise au cas de plusieurs variables-
independantes, Acta Sci. Math. Szeged 7, 103-111 (1934).
[4] Calculus of finite differences, 2nd edition, Chelsea Publ. Comp., New York 1950.
[5] Fejezetek a klasszikus valoszinusegszamitasbol (Chapters from the classical
calculus of probabilities), Akademiai Kiado, Budapest 1956.

Kac, M.
[1 ] Random walk and the theory of Brownian motion, Amer. Math. Monthly 54 ,
369-391 (1947).

Kac, M.
[2] A remark on the proceeding paper by A. Renyi, Publ. Inst. Math. Beograd 8,
163-165 (1955).
[3] Probability and related topics in physical sciences. Lectures in applied mathe¬
matics, Vol. I, lntersci. Publ., London-New York 1959.
[4] Statistical independence in probability, analysis and number theory, Math.
Assoc. America 1959.
Kantorovitch, L. V. (KaHToposmi, JI. B.)
[1] Sur un probleme de M. Steinhaus, Fund. Math. 14, 266-270 (1929).
Kappos, D. A.
[1] Strukturtheorie der Wahrscheinlichkeitsfelder und -raume, Springer-Verlag,
Berlin-Gottingen-Heidelberg 1960.
Kawata, I. and H. Sakamoto
[1] On the characterization of the normal population by the independence of the
sample mean and the sample variance, J. Math. Soc. Japan 1, 111-115 (1949).
Khinchin, A. J. (Xhhrhh, A. R.)
,
[1] Uber dyadische Briiche, Math. Z. 18 109-118 (1923).
[2] Sur les classes d’evenements equivalents. Mat. Sbornik 39:3, 40-43 (1932).
[3] Asymptotische Gesetze der Wahrscheinlichkeitsrechnung, Springer, Berlin 1933.
,
[4] Korrelationstheorie der stationarer stochastischer Prozesse, Math. Ann. 109 604-
615 (1934).
[5] Sul dominio di attrazione della legge di Gauss, Giorn. 1st. Ital. Att. 6, 378-393
(1935).
[6] Kettenbriiche, B. G. Teubner, Leipzig 1956.
[7] O KJiaccax BKBHBajieHTHbix co6mthh (On classes of equivalent events), Dok-
,
ladi Akad. Nauk. SSSR 85 713-714 (1952).
Khinchin, A. J. und A. N. Kolmogorov (Xhhhhh, A. R. h A. H. Kojimotopob)
[1 ] Uber Konvergenz von Reihen, deren Glieder durch den Zufall bestimmt werden.
Mat. Sbornik 32, 668-677 (1925).
(Khinchin) Chintschin, A. J. et P. Levy
[1] Sur les lois stables, C. R. Acad. Sci. Paris 202, 374-376 (1936).
Knopp, K.
[1] Theorie und Anwendung der unendlichen Reihen, Springer, Berlin 1924.
Koller, S.
[1] Graphische Tafeln zur Beurteilung statistischer Zahlen, Steinkopff, Dresden-
Leipzig 1943.
Kolmogorov, A. N. (Kojimotopob, A. H.)
,
[1] Uber das Gesetz des iterierten Logarithmus, Math. Ann. 101 126-136 (1929).
,
[2] Sur la loi forte des grandes nombres, C. R. Acad. Sci. Paris 191 910-912(1930).
[3] Uber die analytischen Methoden in der Wahrscheinlichkeitsrechnung, Math.
,
Ann. 104 415-458 (1930).
,
[4] Sur la notion de la moyenne, Atti R. Accad. Naz. Lincei 12 388-391 (1930).
[5] Foundations of the theory of probability, Chelsea, New York 1956.
[6] Sulla determinazioneempiricadi una legge di distribuzione, Giorn. 1st. Ital. Att.
4, 83-91 (1933).
[7] Ilenbi MapKOBa c chcthbim mhoxccctbom bo3mojkhbix coctohhhh (Markov
chains with denumerably infinite possible states), Bull. Mosk. Univ. 1, 1 (1937).
[8] O jiorapH(})MHMecKH HopMajibHOM 3aKOHe pacnpeflejieHHH pa3MepoB nacTHii npn
UPodjieHHH (On the lognormal distribution of the sizes of particles in chop¬
ping), Dokl. Akad. Nauk. SSSR 31, 99-101 (1941).
[9] Algebres de Boole metriques completes, VI. Zjad Matematykow Polskich,
Warsaw 20-23. IX. 1948, Inst. Math. Univ. Krakow. 1950. 22-30.

Kolmogorov, A. N. (KonMoropoB, A. H.)


[10] Ein Satz iiber die Konvergenz der bedingten Erwartungswerte und deren Anwen-
dungen, I. Magyar Matematikai Kongresszus Kozlemenyei, Budapest 1950,
377-386.
[11] HeKOTopbie pa6o™ nocneaHbix JieT b o6jiacTH npe/jejibHHx TeopeM TeopHH
BepoHTHOCTeH (On some recent works concerning the limit theorems of prob¬
ability theory), Vestnik Univ. Moscow 8 (10), 29-38 (1953). See also “Ar-
beiten zur Informationstheorie III.”
Krickeberg, K.
[1] Wahrscheinlichkeitstheorie, Teubner, Stuttgart 1963.
Kullback, S.
[1] Information theory and statistics, Wiley, New York 1959.
Ky Fan

[1] Les fonctions definies-positives et les fonctions completement monotones, leurs


applications au calcul des probabilites et a la theorie des espaces distanciees,
,
Mem. Sci. Math. 114 Gauthier-Villars, Paris 1950.

Laha, R. G.
[1 ] An example of a non-normal distribution where the quotient follows the Cauchy
law, Proc. Nat. Acad. Sci. USA 44, 222-223 (1958).
Laplace, P. S.

[1] Theorie analytique des probabilites, 1795. Oeuvres Completes de Laplace, t. 7,


Gauthier-Villars, Paris 1886.
[2] Essai philosophique sur les probabilites, I-II, Gauthier-Villars, Paris 1921.
Lehmann, E. L.
[1 ] Consistency and unbiasedness of certain nonparametric tests, Ann. Math Stat
22, 165-180 (1951).
Levy, P.
[1] Calcul des Probabilites, Gauthier-Villars, Paris 1925.
[2] Sur certains processus stochastiques homogenes, Compositio Math. 7, 283-339 (1939).
[3] Processus stochastiques et mouvement brownien, Gauthier-Villars, Paris 1948.
[4] Theorie de l’addition des variables aleatoires, 2e ed. Gauthier-Villars, Paris 1954.
Lighthill, M. J.
[1] An introduction to Fourier analysis and generalised functions, Cambridge
Univ. Press, Cambridge 1959.
Lindeberg, J. W.
[1] Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrech-
nung, Math. Z. 15, 211-225 (1922).
Linnik, Yu. V. (JIhhhhk, K). B.)
[1] The large sieve, Dokl. Akad. Nauk. SSSR 30, 292-294 (1941).
[2] TeopeTMKO-HH(])opMaunoHHoe flOKaiareiibCTBO ueHTpajibHOH npe/rejibHOH Teo-
peMbi b ycjioBHBX JlHHuebepra (An information theoretic proof of the central
limit theorem on Lindeberg conditions), Teor. Verojatn. Prim. 4, 311-321
(1959).
[3] Pa3JioxceHHH BepoaTHocTHbix 33KOHOB (Decomposition of probability functions),
Izd. Univ. Leningrad 1960.
Linnik, Yu. V. and A. A. Singer (JIhhhhk, K). B. h A. A. 3nHrep)
[1] 06 ootom aHajiHTHHecKOM o6o6meHHH TeopeMbi KpaMepa (On an analytic
extension of Cramer’s theorem), Vestnik Leningr. Univ. 11, 51-56 (1955).
Ljapunov, A. M. (JlanyHOB, A. M.)
[1] M36pamibie Tpyflbi (Selected works), Akad. izd. Moscow 1948, pp. 179-250.

Lobachevsky, N. I. (JIo6aHeBCKnfi, H. M.)


[1] Sur la probability des rdsultats moyens, tires des observations repetees, J. reine
angew. Math. 24, 164-170 (1842).
Loeve, M.
[1] Probability theory, van Nostrand, New York 1955.
Lomnicki, A.
[1] Nouveaux fondements du calcul des probability. Fund. Math. 4, 34-41 (1923).
Losch, F. und F. Schoblik
[1] Die Fakultat, B. G. Teubner, Leipzig 1951.
Luce, R. D.
[1] Individual choice behaviour. A theoretical analysis, Wiley, New York 1959.
LukAcs, E.
[1] A characterization of the normal distribution, Ann. Math. Stat. 13, 91-93
(1942).
[2] Application of Faa di Bruno’s formula in mathematical statistics, Amer. Math.
Monthly 62, 340-348 (1955).
[3] Characterisation of populations by properties of suitable statistics, Proc. 3rd
Berkeley Symp. Math. Stat. Prob. 1954-1955, Vol. II, Univ. California Press,
Berkeley-Los Angeles 1956, 215-229.
[4] Characteristic functions, Griffin, London 1960.
LukAcs, E. and R. G. Laha
[1] Applications of characteristic functions, Griffin, London 1964.

Malmquist, S.
[1] On a property of order statistics from a rectangular distribution, Skand.
Aktuarietidskrift 33, 214-222 (1950).
Marczewski, E.
[1] Remarks on the Poisson stochastic process, II, Studia Math. 13, 130-136 (1953).
Markov, A. A. (MapKOB, A. A.)
[1] Wahrscheinlichkeitsrechnung, B. G. Teubner, Leipzig 1912.
McMillan, B.
[1] The basic theorems of information theory, Ann. Math. Stat. 24, 196-219 (1953).
Medgyessy, P.
[1] Decomposition of superpositions of distribution functions, Akad. Kiado,
Budapest 1961.
von Mises, R.
[1] Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und theore-
tischen Physik, Deuticke, Leipzig-Wien 1931.
[2] Wahrscheinlichkeit, Statistik und Wahrheit, Springer-Verlag, Berlin 1952.
Mogyorodi, J.
[1] On a consequence of a mixing theorem of A. Renyi, MTA Mat. Kut. Int. Kozl.,
9, 263-267 (1964).
Molina, F. C.
[1] Poisson’s exponential binomial limit, van Nostrand, New York 1942.
Moriguti, S.
[1 ] A lower bound for a probability moment of an absolutely continuous distribution
with finite variance, Ann. Math. Stat. 23, 286-289 (1952).

Nagumo, M.
[1] Über eine Klasse von Mittelwerten, Japan. J. Math. 7, 71-79 (1930).

Neveu, J.
[1] Mathematical foundations of the calculus of probability, Holden-Day Inc.,
San Francisco 1965.
Neyman, J.
[1] L’estimation statistique traite comme un probleme classique de probabilite.
Act. Sci. Industr., Nr. 739, Gauthier-Villars, Paris 1938.
[2] First course in probability and statistics, H. Holt et Co., New York 1950.

Onicescu, O. et G. Mihoc
[1] La dependance statistique. Chaines et families de chaines discontinues, Act.
Sci. Industr., Nr. 503, Gauthier-Villars, Paris 1937.
Onicescu, O., G. Mihoc si C. T. Ionescu-Tulcea
[1] Calculul probabilitatilor si applicatii, Bucuresti 1956.

Parzen, E.
[1] Modern probability theory and its applications, Wiley, New York 1960.
Pearson, E. S. and H. O. Hartley
[1] Biometrical tables for statisticians, Cambridge Univ. Press, Cambridge 1954.
Pearson, K.
[1] Early statistical papers, Cambridge Univ. Press, Cambridge 1948.
Poincare, H.
[1] Calcul des probabilites, Carre-Naud, Paris 1912.
Poisson, S. D.
[1] Recherches sur la probabilite des jugements, Bachelier, Paris 1837.
Polya, G.
[1] Uber den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung und das
Momentproblem, Math. Z. 8, 171-181 (1920).
[2] Uber eine Aufgabe der Wahrscheinlichkeitsrechnung betreffend die Irrfahrt
im Straßennetz, Math. Ann. 84, 149-160 (1921).
Polya, G. und G. Szego
[1] Aufgaben und Lehrsatze aus der Analysis, I—II, Springer, Berlin 1925.
Popper, K.
[1] Philosophy of science: A personal report, British Philosophy in the Mid-Century,
ed. by C. A. Mace, 1956, p. 191.
[2] The logic of scientific discovery, Hutchinson, London 1959.
Prekopa A.
[1] On composed Poisson distributions, IV, Acta Math. Acad. Sci. Hung. 3,
317-326 (1952).
[2] Valoszinusegelmelet muszaki alkalmazasokkal (Probability theory and its
applications in technology), Muszaki Konyvkiado, Budapest 1962.
Prekopa, A., A. Renyi and K. Urbanik
[1] O npenejibHOM pacpnenejieHHH nna cyMM neiaBHCHMbix cjiysaiinbix bcjihmhh
Ha 6nKOMnaKTHbix KOMMyTaTHBHbix TonojiorHHecKHx rpynnax (On limit dis¬
tribution of sums of independent random variables over bicompact commuta¬
tive topological groups), Acta Math. Acad. Sci. Hung. 7, 11-16 (1956).

Reichenbach, H.
[1] Wahrscheinlichkeitslehre, Sijthoff, Leiden 1935.

Renyi, A.
[1] Simple proof of a theorem of Borel and of the law of the iterated logarithm.
Mat. Tidsskrift B, 41-48 (1948).
[2] О представлении четных чисел в виде суммы простого и почти простого числа (On the representation of even numbers as sums of a prime and an almost prime number), Izvestia Akad. Nauk SSSR, Ser. Mat. 12, 57-78 (1948).
[3] К теории предельных теорем для сумм независимых случайных величин (On limit theorems of sums of independent random variables), Acta Math. Acad. Sci. Hung. 1, 99-108 (1950).
[4] On the algebra of distributions, Publ. Math. Debrecen 1, 135-149 (1950).
[5] On composed Poisson distributions, II, Acta Math. Acad. Sci. Hung. 2, 83-98
(1951).
[6] On some problems concerning Poisson processes, Publ. Math. Debrecen 2,
66-73 (1951).
[7] On a conjecture of H. Steinhaus, Ann. Soc. Polon. Math. 25, 279-287 (1952).
[8] On projections of probability distributions, Acta Math. Acad. Sci. Hung. 3,
131-142 (1952).
[9] On the theory of order statistics, Acta Math. Acad. Sci. Hung. 4, 191-232
(1953).
[10] Eine neue Methode in der Theorie der geordneten Stichproben, Bericht über die Mathematiker-Tagung Berlin 1953, VEB Deutscher Verlag der Wissenschaften, Berlin 1953, 203-213.
[11] Kemiai reakciok targyalasa a sztochasztikus folyamatok elmelete segitsegevel
(On describing chemical reactions by means of stochastic processes), A Magyar
Tudomanyos Akademia Alkalmazott Matematikai Intezetenek Kozlemenyei 2,
596-600 (1953) (In Hungarian).
[12] Ujabb kriteriumok ket minta osszehasonlitasara (Some new criteria for comparison of two samples), A Magyar Tudomanyos Akademia Alkalmazott Matematikai Intezetenek Kozlemenyei 2, 243-265 (1953) (In Hungarian).
[13] Valoszinusegszamitas (Probability theory), Tankonyvkiado, Budapest 1954
(In Hungarian).
[14] Axiomatischer Aufbau der Wahrscheinlichkeitsrechnung, Bericht iiber die
Tagung Wahrscheinlichkeitsrechnung und Mathematische Statistik, VEB
Deutscher Verlag der Wissenschaften, Berlin 1954, 7-15.
[15] On a new axiomatic theory of probability, Acta Math. Acad. Sci. Hung. 6,
285-335 (1955).
[16] On the density of sequences of integers, Publ. Inst. Math. Beograd 8, 157-162
(1955).
[17] A szamjegyek eloszlasa valos szamok Cantor-fele eloallitasaiban (The distribution of the digits in Cantor's representation of the real numbers), Mat. Lapok 7, 77-100 (1956) (In Hungarian).
[18] On conditional probability spaces generated by a dimensionally ordered set of
measures, Teor. Verojatn. prim. 1, 61-71 (1956).
[19] A new deduction of Maxwell’s law of velocity distribution, Isv. Mat. Inst. Sofia 2,
45-53 (1957).
[20] A remark on the theorem of Simmons, Acta Sci. Math. Szeged. 18, 21-22
(1957).
[21 ] Representations for real numbers and their ergodic properties, Acta Math.
Acad. Sci. Hung. 8, 477-493 (1957).
[22] On the asymptotic distribution of the sum of a random number of independent
random variables, Acta Math. Acad. Sci. Hung. 8, 193-199 (1957).
[23] Quelques remarques sur les probabilités des événements dépendants, J. Math. pures appl. 37, 393-398 (1958).
[24] On mixing sequences of sets, Acta Math. Acad. Sci. Hung. 9, 215-228 (1958).
[25] Probabilistic methods in number theory. Proceedings of the International
Congress of Mathematicians, Edinburgh 1958, 529-539.
[26] New version of the probabilistic generalization of the large sieve, Acta Math. Acad. Sci. Hung. 10, 217-226 (1959).
[27] On the dimension and entropy of probability distributions, Acta Math. Acad. Sci. Hung. 10, 193-215 (1959).
[28] On measures of dependence, Acta Math. Acad. Sci. Hung. 10, 441-451 (1959).
[29] On a theorem of P. Erdos and its applications in information theory, Mathematica Cluj 1 (24), 341-344 (1959).
[30] Dimension, entropy and information. Transactions of the IE Prague Conference
on Information theory, statistical decision functions, random processes, Praha
1960, 545-556.
[31] On the central limit theorem for the sum of a random number of independent random variables, Acta Math. Acad. Sci. Hung. 11, 97-102 (1960).
[32] Az apritas matematikai elmeleterol (On the mathematical theory of chopping),
Epitoanyag 1-8 (1960) (In Hungarian).
[33] Bolyongasi problemakra vonatkozo hatareloszlastetelek (Limit theorems in random walk problems), A Magyar Tudomanyos Akademia III (Matematikai es Fizikai) Osztalyanak Kozlemenyei 10, 149-170 (1960) (In Hungarian).
[34] Az informacioelmelet nehany alapveto kerdese (Some fundamental problems of information theory), A Magyar Tudomanyos Akademia III (Matematikai es Fizikai) Osztalyanak Kozlemenyei 10, 251-282 (1960) (In Hungarian).
[35] Egy altalanos modszer valoszinusegszamitasi tetelek bizonyitasara (A general
method for proving theorems in probability theory), A Magyar Tudomanyos
Akademia III (Matematikai es Fizikai) Osztalyanak Kozlemenyei 11, 79-105
(1961) (In Hungarian).
[36] Legendre polynomials and probability theory, Ann. Univ. Sci. Budapest, R.
Eotvos nom., Sect. Math. 3-4, 247-251 (1961).
[37] On measures of entropy and information, Proc. Fourth Berkeley Symposium
on Math. Stat. Prob. 1960, Vol. I, Univ. California Press, Berkeley-Los Angeles
1961, 547-561.
[38] On stable sequences of events, Sankhya A 25, 293-302 (1963).
[39] On certain representations of real numbers and on equivalent events, Acta Sci.
Math. Szeged 26, 63-74 (1965).
[40] Uj modszerek es eredmenyek a kombinatorikus analizisben (New methods and results in combinatorial analysis), A Magyar Tudomanyos Akademia III (Matematikai es Fizikai) Osztalyanak Kozlemenyei 16, 75-105, 159-177 (1966) (In Hungarian).
[41] Sur les espaces simples des probabilités conditionnelles, Ann. Inst. H. Poincaré B 1, 3-19 (1964).
[42] On the foundations of information theory, Review of the International Statistical Institute 33, 1-14 (1965).
Renyi, A. and P. Revesz
[1] On mixing sequences of random variables, Acta Math. Acad. Sci. Hung. 9,
389-393 (1958).
[2] A study of sequences of equivalent events as special stable sequences, Publicationes Mathematicae Debrecen 10, 319-325 (1963).

Renyi, A. and R. Sulanke
[1] Über die konvexe Hülle von n zufällig gewählten Punkten, I-II, Zeitschrift für Wahrscheinlichkeitstheorie 2, 75-84 (1963); 3, 138-147 (1964).
Revesz, P.
[1] A limit distribution theorem for sums of dependent random variables, Acta Math. Acad. Sci. Hung. 10, 125-131 (1959).
[2] The laws of large numbers, Akad. Kiado, Budapest 1967.
Richter, H.
[1] Wahrscheinlichkeitstheorie, Springer-Verlag, Berlin 1956.
Riesz, F. and B. Sz.-Nagy
[1] Functional analysis, Blackie, London-Glasgow 1956.
Robbins, H.
[1] On the equidistribution of sums of independent random variables, Proc. Amer. Math. Soc. 4, 786-799 (1953).
Rota, G. C.
[1] The number of partitions of a set, Amer. Math. Monthly, 71, 498-504 (1964).

Saxer, W.
[1] Versicherungsmathematik, II, Springer-Verlag, Berlin-Gottingen-Heidelberg
1958.
Schmetterer, L.
[1] Einfuhrung in die mathematische Statistik, Springer-Verlag, Wien 1956.
Schutzenberger, M. P.
[1] Contributions aux applications statistiques de la theorie de l’information, Inst.
Stat. Univ. Paris (A) 2575, 1-115 (1953).
Shannon, C. E.
[1 ] A mathematical theory of communication, Bell Syst. Techn. J. 27, 379-423,
623-653 (1948).
Shannon, C. E. and W. Weaver
[1] The mathematical theory of communication, Univ. Illinois Press, Urbana 1949.
Singer, A. A. (Зингер, А. А.)
[1] О независимых выборках из нормальной совокупности (On independent samples from a population), Uspehi Mat. Nauk 6, 172-175 (1951).
Skitovich, V. P. (Скитович, В. П.)
[1] Об одном свойстве нормального распределения (On a property of the normal distribution), Dokl. Akad. Nauk SSSR 89, 217-219 (1953).
Slutsky, E.
[1] Uber stochastische Asymptoten und Grenzwerte, Metron 5, 1-90 (1925).
Smirnov, N. V. (Смирнов, Н. В.)
[1] Über die Verteilung allgemeiner Glieder in der Variationsreihe, Metron 12, 59-81 (1935).
[2] Приближение законов распределения случайных величин по эмпирическим данным (Approximation of the laws of distribution of random variables by means of empirical data), Uspehi Mat. Nauk 10, 179-206 (1944).
Smirnov, V. I. (Смирнов, В. И.)
[1] Lehrgang der hoheren Mathematik, Teil III, 3. Aufl., VEB Deutscher Verlag
der Wissenschaften, Berlin 1961.
von Smoluchowski, M.
[1] Drei Vortrage liber Diffusion, Brownsche Molekularbewegung und Koagulation
von Kolloidteilchen, Phys. Z. 17, 557-571, 585-599 (1916).

Sparre-Andersen, E.
[1] On the number of positive sums of random variables, Skand. Aktuarietidskrift,
1949, 27-36.
[2] On the fluctuations of sums of random variables, I-II, Math. Scand. 1, 263-285 (1953); 2, 193-223 (1954).
Spitzer, F.
[1] A combinatorial lemma and its application to probability theory, Trans. Amer.
Math. Soc. 82, 323-339 (1956).
Steinhaus, H.
[1] Les probabilités dénombrables et leur rapport à la théorie de la mesure, Fund. Math. 286-310 (1923).
[2] Sur la probabilité de la convergence des séries, Studia Math. 2, 21-39 (1951).
Steinhaus, H., M. Kac et C. Ryll-Nardzewski
[1]-[10] Sur les fonctions indépendantes, I, Studia Mathematica 6, 46-58 (1936); II, ibidem 6, 59-66 (1936); III, ibidem 6, 89-97 (1936); IV, ibidem 7, 1-15 (1938); V, ibidem 7, 96-100 (1938); VI, ibidem 9, 121-132 (1940); VII, ibidem 10, 1-20 (1948); VIII, ibidem 11, 133-144 (1949); IX, ibidem 12, 102-107 (1951); X, ibidem 13, 1-17 (1953).
Stone, M. H.
[1] The theory of representation for Boolean algebras, Trans. Amer. Math. Soc. 4,
31-111 (1936).
Student
[1] _’s Collected papers, Edited by E. S. Pearson and J. Wishart, London 1942.
Szasz, G.
[1] Introduction to lattice theory (transl. from the Hungarian), Akad. Kiado,
Budapest 1963.
Szokefalvi-Nagy, B.
[1] Spektraldarstellung linearer Transformationen des Hilbertschen Raumes,
Springer, Berlin 1942.

Titchmarsh, E. C.
[1] Theory of functions, Clarendon Press, Oxford 1952.
Todhunter, I.
[1] History of the mathematical theory of probability, Macmillan, Cambridge-London 1865.

Uspenski, J. W. (Успенский, Ю. В.)
[1] Introduction to mathematical probability, McGraw-Hill, New York-London
1937.
Veksler, V., L. Groshev and B. Isaev (Векслер, В., Л. Грошев и Б. Исаев)
[1] Ионизационные методы исследования излучений (Ionisation methods in the study of radiations), Gostehizdat, Moscow 1949.

Waerden, van der, B. L.
[1] Mathematische Statistik, Springer-Verlag, Berlin-Gottingen-Heidelberg 1957.
Wald, A.
[1] Die Widerspruchsfreiheit des Kollektivbegriffes der Wahrscheinlichkeitsrechnung, Erg. Math. Koll. 8, Wien 1935-1936.
Wang Shou Yen
[1] On the limiting distribution of the ratio of two empirical distributions, Acta
Math. Sinica 5, 253 (1955).

Widder, D. V.
[1] The Laplace-transform, Princeton Univ. Press, Princeton 1946.
Wiener, N.
[1] Cybernetics or control and communication in the animal and the machine,
Act. Sci. Indust., Nr. 1053, Hermann et Cie, Paris 1948.
[2] Extrapolation, interpolation and smoothing of stationary time series, Wiley,
New York 1949.
Wilcoxon, F.
[1] Individual comparisons by ranking methods, Biometrics Bull. 1, 80-83 (1945).
Wilks, S. S.
[1] Order statistics. Bull. Amer. Math. Soc. 54, 6-50 (1948).
Wolfowitz, J.
[1] The coding of messages subject to chance errors, Illinois J. Math. 1, 591-606
(1957).
[2] Information theory for mathematicians, Ann. Math. Stat. 29, 351-356 (1958).
[3] Coding theorems of information theory, Springer-Verlag, Berlin-Gottingen-
Heidelberg 1961.
Woodward, P. M.
[1] Probability and information theory with applications to radar, Pergamon Press,
London 1953.

Zygmund, A.
[1] Trigonometrical series, Warsaw 1935; Dover-New York 1955.
[2] Trigonometric series, I—II, Cambridge Univ. Press, Cambridge 1959.
AUTHOR AND SUBJECT INDEX

absolutely continuous distribution func¬ Blum, J. R., 475, 646


tion, 175 Boas, R. P., 640, 646
absolutely monotone sequence, 415 Bochner, S., 304, 306, 646
Aczel, J., 640, 643, 645 Bohr, H., 640
Alexandrov, P. S., 645 Boltzmann, L., 43, 554, 642, 646
algebra, of events, 9 Boltzmann energy distribution, 166
— of probability distributions, 131 Boltzmann-Shannon formula, 554
— of sets, 17 Boolean algebra, 9
almost sure convergence, 394 Borel, E., 639, 641, 646
Anscombe, F. J., 473, 642, 645 Borel, algebra, 46
a posteriori probabilities, 86 — cylinder set, 287
a priori probabilities, 86 — measurable function, 172
Arato M., 640 ,645 — ring of sets, 48
arc sine law, 508 — sets, 49
atom, 83 Borel-Cantelli lemma, 298, 390
Aumann, G., 22, 639, 645 Bortkiewicz, L., von, 639, 647
Bose-Einstein statistics, 43
Balatoni, F., 642, 645 Bruijn, N. G., de, 641, 647
Barban, M. B., 645 Buffon, G. L. L., 31
Barnard, G. A., 642, 645 Buffon’s needle problem, 67
Bartlett, M. S., 642, 645
Bateman, H., 237, 640, 645 canonical representation of an event, 19
Bateman, P. T., 640, 645 Cantelli, P. F„ 436, 641, 647
Baticle, E., 639, 645 Caratheodory, C., 638, 647
Bauer, H., 645 Cauchy, A., 641
Bayes, Th., 640, 646 Cauchy distribution, 204
Bayes’ theorem, 86, 274, 294 central limit theorem, 440
Bernoulli, J., 165, 374, 641, 646 central moment generating function, 138
Bernoulli’s law of large numbers, 157 central moments, 137
Bernoulli theorem, 375 chain, 23
Bernstein, S. N., 323, 374, 379, 466, 639, —, maximal 23
640, 641, 642, 646 Chandrasekhar, S., 642, 647
Bernstein polynomials, 165 Chandrachekharan, S., 306, 646
Bernstein’s improvement of Chebyshev channel, 567
inequality, 384 —, noisy, 568
Bertrand’s paradoxon, 64 characteristic exponent, 350
beta distribution, 205 characteristic function, 217, 302, 365
beta integral, 98 Chebyshev, P. L., 442, 641, 642, 647
Bharucha-Reid, A. T., 640, 642, 646 Chebyshev inequality, 373
Bhattacharyya, A., 642, 646 χ² distribution, 198
Bienayme, M., 641, 646 χ distribution, 199
Bienayme-Chebyshev inequality, 373 Chowla, S. 640
binomial distribution, 87 Chung, K. L. 611, 641, 642, 643, 647
Birkhoff, G., 21, 646 coefficient of variation, 116
Blanc-Lapierre, A., 639, 646 compatibility, 287
Blaschke, W., 66, 639, 646 complementary event, 9

complete algebra of sets, 17 Dirac’s delta function, 355, 360


complete conditional distribution, 570 direct product of distributions, 553
completely additive function, 47 discrete random variable, 95
completely independent events, 59 dispersion, ellipsoid, 226
complete measure, 50 — matrix, 225
complete system of events, 16, 84 distribution function, 97, 172, 247, 251
compound event, 18 Doeblin, W., 473, 642, 648
concentration, 227 domain of attraction of normal distribu¬
conditional density function, 181, 259 tion, 453
conditional distribution, 97, 265 Donsker, M. D., 642, 648
conditional distribution function, 181, Doob, J. L., 422, 437, 639, 641, 642, 648
258 doubly stochastic matrix, 483
conditional expectation, 108, 212, 270 dual formulas, 13
conditional information, 557 Dugue, D., 641, 648
conditional probability, 54, 255, 263 Dumas, M., 639, 648
conditional probability space, 70 Dvoretzky, A., 642, 648
conditional variance, 276
contingency, 279 Eggenberger, F., 640, 648
contraction operator, 515 Ehrenfest’s urn model 531
convergence, almost everywhere, 395 Einstein, A., 43, 642, 648
— almost surely, 394 ellipse of concentration, 226
— in measure, 395 entropy, 430, 554
— in probability, 374 — of a distribution, 368
—, of generalized functions, 360 — of a random variable, 592
convolution, 101, 195 equivalent regular sequences, 354
— power, 133 Erdos, P., 365, 460, 511, 512, 514, 544,
correlation, coefficient, 117, 229 640, 641, 648
— ratio, 276 Esseen, C. G., 641, 649
Courant, R., 641, 651 Euler, 199
covariance matrix, 225 Euler’s beta integral, 98
Cramer, H., 327, 329, 467, 639, 640, 641, — function, 80
642, 647 — gamma function, 124
crowd of events, 22 — summation formula, 149
Csaki, E., 642, 647 event, 9
Csaki, P., 640, 647 —, elementary, 18
CsaszAr, A., 371, 639, 641, 647 —, impossible, 10
cumulant, 139 —, sure, 11
— generating function, 139 events, exchangeable, 78, 412
—, mutually exclusive, 10
Dantzig, D., van, 489, 640, 642, 647 —, product of, 10
Darmois, G., 336, 641, 648 —, subtraction of, 13
(/-dimensional information, 588 —, sum of, 11
Dedekind, R., 638 exchangeable random variables, 235
degenerate distribution, 134 expectation, 103, 209
-function, 175 — vector, 217
density function, 175, 248, 251 experiment, 9
density, of sequence of mixing sets, 406
— of stable sequence of events, 409 Faa di Bruno, 331, 641
dimension of order alpha of a random Fadeev, D. K. 544, 548, 642, 643, 645
variable, 588 family of distributions, 187
Dirac, G., 43 Feinstein, A., 643, 649

Feldheim, E., 167, 640, 641, 649 Hardy, G. H„ 307, 368, 552, 574, 580,
Feller, W., 447, 448, 453, 639, 641, 642 640, 641, 643, 651
Fermi-Dirac statistics, 43 Harris, T. E., 651
Finetti, B., de, 413, 639, 643, 649 Hartley, H. O., 643
Fischer, J., 640, 647 Hartley, R. V., 642, 643, 651
Fisher,R. A., 339, 642, 643, 649 Hartley’s formula, 542
Fisz, M., 639, 649 Hausdorff, F., 23, 415, 638, 651
Florek, K., 640, 649 Helly, E., 319, 641
Fortet, R., 639, 646 Helmert, R., 198, 640, 651
Fourier-Stieltjes transform, 302 Hille, E., 431, 641, 651
Fourier transform, 356, 357 Hirschfeld, A. O., 283
Frechet, M., 37, 639, 650 Hostinsky, B., 642, 651
frequency, 30 Hunt, G. A., 512, 642, 648
—, relative, 30 Hurwitz, A., 641, 651
Frink, O., 21, 638, 650 hypergeometric distribution, 88
Frobenius, 598, 643
fundamental theorem of mathematical incomplete probability distribution, 569
statistics, 400 incomplete random variable, 569
independent events, 57
gain, conditional distribution function of independent random variables, 99, 182
572 infinitely divisible distribution, 347
—, measure of, 574 infinitesimal random variable 448
—, of information, 562 information, 540, 554, 592
Gabon’s desk, 152 —, of order alpha, 579, 586
gamma distribution, 202 integral geometry, 69
Gantmacher, F. R., 598, 643, 650 Ionescu-Tulcea, C. T., 639, 658
Gauss, C. F., 641, 650 Isaev, B., 640, 659
Gauss curve, 152
Gaussian, density function, 191 Jaglom, A. M., 642, 645
Gaussian distribution function, 157, 187 Janossy, L., 640, 645
Gaussian random variable, 156 Jeffreys, H., 562, 639, 641, 642, 651
Gavrilov, M. A., 28, 638, 650 Jensen inequality, 555
Geary, R. C., 339, 641, 650 joint distribution function, 178
Gebelein, H., 283, 640, 650 Jordan, Ch., 37, 639, 651
Gelfand, A. N., 642, 646
Gelfand-distributions, 353 Kac, M., 345, 511, 514, 639, 641, 642,
generalized functions, 354 643, 648, 651, 659
generating function, 135 Kantorovich, L. V., 120, 640
geometric distribution, 90 Kappos, D. A., 638, 639
Glivenko, V. I., 9, 401,492, 638, 641, 650 Kawata, J., 339, 641
Gnedenko, B. V., 348, 448, 449, 458, Khinchin, A. , 347, 380, 453, 548, 607,
496, 639, 641, 642, 650 639, 641, 642, 643, 645
—, theorem of, 449 Knopp, K., 150, 426, 472 , 491, 640, 641,
Groshev, L., 640, 659 643
Gumbel, A. J., 37 Koller, S., 643
Kolmogorov, A. N 9, 33, 69, 276, 383,
HAjek, J., 434, 460, 641, 650 396, 402, 420, 438, 448, 458, 493,
Hajos, G., 640, 651 576, 638, 639, 640, 641, 642, 643
half line period, 127 645, 650, 652
Halmos, P. R., 48, 639, 651 Kolmogorov probability space, 97
FIanson, L., 475 Kolmogorov’s formula, 348

Kolgomorov’s fundamental theorem, 286 Malmquist, S., 489, 640, 642, 654
— inequality, 392 Marczewski, E., 640, 649, 654
Koopmans, L. H., 646 marginal distribution, 190
Koroljuk, V. S., 496, 642, 650 Markov, A. A., 442, 642, 654
Krickeberg, K., 642, 653 —, theorem of, 479
Kronecker, L., 397 Markov chain, 475
Kullback, S., 642, 653 -, additive, 483
Ky Fan, 641, 653 -, ergodic, 479
-, homogeneous, 476
Laguerre polynomials, 169 — -—, reversible, 534
Laha, R. G., 372, 641, 653 Markov inequality, 218
Laplace curve, 152 maximal correlation, 283
—, method of, 164 Maxwell distribution, 200, 239
Laplace, P. S., 153, 639, 653 — —, of order n, 269
large sieve, 286 Maxwell-Boltzmann statistics, 43
lattice, 21 McMillan, B., 643, 654
— distribution, 308 measure, 49
law of errors 440 —, complete, 50
— of large numbers, due to, Bernstein, —, outer, 50
379 —, c-finite, 49
-Khinchin, 380 measurable set, 50
—- --Kolmogorov, 383 Medgyessy, P., 654
--Markov, 378 median, 217
— of the iterated logarithm 402 Mensov, D. E., 641
Lebesgue measure, 52 Mercer theorem, 552, 643
Lebesgue-Stieltjes measure, 52 Mihoc, G., 639, 640, 655
Legendre polynomials, 509 Mikusinski, J., 353, 641
Lehmann, E. L., 642, 653 Mises, R. von, 639, 654
level set, 172 mixing sequence of random variables, 467
Levy, P., 348, 350, 453, 511, 639, 641, mixture of distributions, 207, 131
642, 652, 653 modulus of dependence, 283
Levy-Khinchin formula, 347 Mogyorodi, J., 475, 654
Liapunov, A. M., 517, 641, 642 Moivre-Laplace theorem, 153
Liapunov’s condition, 442 Molina, F. C., 643, 654
Lighthill, M. J., 353, 641, 653 moment, 137, 217
Ltndeberg, J. W., 642, 653 — generating function, 138
Lindeberg’s, condition, 443, 447 monotone class, 418
— theorem 520 Monte Carlo method, 69
linear operator, 515 Moriguti, S., 654
Linnik, Yu. V., 286, 329, 336, 605, 640, mutually independent random variables,
641, 643, 653 252
Littlewood, J. E., 368, 574, 580, 643,651
Liapunov, A. M., 653, 653 Nagumo. M., 576, 643, 654
Lobachevski, N. I., 198, 640, 654 n-dimensional cylinder, 287
Loeve, M., 639, 654 negative binomial distribution, 92
logarithmically uniform distribution, 249 neglig’ble random variable, 448
lognormal distribution, 194 Neveu, J., 655
Lomnicki, A., 639, 654 Newton, I. 531
Losch, F., 199, 640, 654 Neyman, J. 639, 655
Luce, R. D., 639, 654 non-atomic probability space, 81
LukAcs, E., 331, 339, 641, 654 normal curve, 152

normal distribution, 156, 186


643, 645, 648, 650, 651, 655 656,
normally distributed random vector, 191 657, 658
Revesz, P„ 471 641 642. 657, 658
Obreskov, N. G., 168 Richter, H. 639, 658
Onicescu, O., 639, 640, 655 Riemann, B., 307
operator method, 515 Riesz. F„ 407, 411, 658
order statistics, 205, 235, 486 ring of sets, 48
Robbins, H 641 658
Rogosinski, W. W., 307, 651
Parsen, E„ 639. 643 655
Rosenblatt, M., 475
Parseval’s theorem, 356
Roia G. C., 658
partially ordered set. 23
Ryll-Nardzewski, C., 640, 642, 649, 659
Pearson E. S., 643 655
Pearson K.,32 198.276.279,640 642,655
Sakamoto, H., 339, 641, 652
Pearson distribution. 233
sample space, 21
Poisson. S D., 640, 655
Saxer, W„ 640, 658
Poisson distribution, 123. 202
Schmetterer, L., 639
Poisson’s summation formula, 365
Schoblik, F„ 199, 654
Polya, G., 169, 368, 509, 510, 574, 580,
Schutzenberger, M. P., 642
640, 641, 642, 643, 648, 651, 655
Schwartz, L., 353
Polya distribution, 94
Schwarz, H. A., 329, 641
polyhypergeometric distribution, 89
semiinvariant, 139
polynomial distribution, 87
sequences of mixing sets, 406
Popper, K., 639, 655
Shannon, C. E., 547, 567, 569, 642, 658
positive definit function, 304
Shannon’s formula, 547
Post-Widder inversion formula, 432
— gain of information, 574
Prekopa, A., 640, 642, 655
— information, 579
probability algebra, 34
similar distributions, 186
— distribution, 84
Simmons theorem, 167
— of an event, 32
Simson distribution, 197
projection of distribution, 189
Singer, A. A., 330, 339, 641, 653, 658
Skitovich, V. R., 336, 641, 658
^-quantile, 218, 490 Slutsky, E., 374, 641, 658
quartile, 28 Smirnov, N. V., 421, 642, 658
quasiorthogonal, 284 Smirnov, V. I., 48, 199, 658
quasi stable distribution, 350 Smoluchowski, M. von, 657, 658
Sparre-Andersen, E., 511, 514, 642, 658,
Radon-Nikodym theorem, 255 659
random event, 29 Spitzer, F., 642, 659
— variable, 95, 172, 245 stable distribution, 326, 349
— vector, 177 stable sequence of events, 409
— walk, 500 standard deviation, 110, 219
rectangular density function, 184 standardization, 440
regression curve, 278 standard normal distribution function
regular dependence, 281 157
regular sequence of functions, 354 stationary distribution, 479
Reichenbach, H., 639, 655 Steinhaus, H., 639, 641, 642, 659
relative information, 558 Stieltjes moment problem, 308
Renyi, A., 37, 70, 73, 268, 276, 279, 286, Stirling numbers, 137
409, 434, 460, 473, 475, 486, 494, Stirling’s formula, 149
514, 544, 588, 639, 640, 641, 642, stochastic convergence, 374

stochastic matrix, 477 uniform distribution, 184


stochastic schemes, 29 Urbanik, K., 642, 655
Stone, M. H., 21, 639, 659 Uspenski, J. W., 658, 659
strong law of large numbers, 395
Student, 640, 659 variance, 110, 219, 225
Student distribution, 204 Veksler, V., 639, 659
Sulanke, R., 658
Szasz, G., 658, 659 Waerden, B. L., van der, 639, 659
Szego, G., 169, 509, 510, 655 Wald, A., 640, 659
cr-additive set function, 47 Wallis formula, 149
<T-algebra, 46 Wang Shou Yen, 642, 659
Szokefalvi-Nagy, B., 407, 411,658,659 Weaver, W., 642, 658
Weierstrass, K., 165
Temple, G., 353 Widder, D. V., 641, 659
theorem, of Kolmogorov, 493 Wiener, N., 547, 642, 659
— of Polya, 501 Wilcoxon, F., 639, 642, 660
— of Smirnov, 496, 493 Wilcoxon’s test, 536
— of total expectation, 108 Wilks, S. S., 642, 660
— of total probability, 85 Wolfowitz, J., 643, 660
—, three-series, of Kolmogorov, 420 Woodward, P. M., 642, 660
Tikhomirov, W. M., 643, 645 Wright, E. M., 640, 651
Titchmarsh, E. C., 336, 641, 659
Todhunter, I., 638, 659 Yates, F., 643, 649
transition probabilities, 476
zero-one law, 418
ultrafilter, 22 Zygmund, A., 368, 641, 660
NORTH-HOLLAND SERIES IN

APPLIED MATHEMATICS AND MECHANICS


EDITORS: H. A. LAUWERIER AND W. T. KOITER

Volume 1: I. N. Vekua

New Methods for Solving Elliptic Equations

Volume 2: L. Berg

Introduction to the Operational Calculus

Volume 3: M. L. Rasulov
Methods of Contour Integration

Volume 4: N. Cristescu

Dynamic Plasticity

Volume 5: A. V. Bitsadze
Boundary Value Problems for Second
Order Elliptic Equations

Volume 6: G. Helmberg
Introduction to Spectral Theory
in Hilbert Space

Volume 7: Yu. N. Rabotnov


Creep Problems in Structural Members

Volume 8: J. W. Cohen
The Single Server Queue

Volume 9: S. Fenyo and Th. Frey


Modern Mathematical Methods in Engineering
