
A basic introduction to large deviations:

Theory, applications, simulations

Hugo Touchette
School of Mathematical Sciences, Queen Mary, University of London
London E1 4NS, United Kingdom
March 1, 2012
Abstract
The theory of large deviations deals with the probabilities of rare
events (or fluctuations) that are exponentially small as a function of
some parameter, e.g., the number of random components of a system,
the time over which a stochastic system is observed, the amplitude
of the noise perturbing a dynamical system or the temperature of
a chemical reaction. The theory has applications in many different
scientific fields, ranging from queuing theory to statistics and from
finance to engineering. It is also increasingly used in statistical physics
for studying both equilibrium and nonequilibrium systems. In this
context, deep analogies can be made between familiar concepts of
statistical physics, such as the entropy and the free energy, and concepts
of large deviation theory having more technical names, such as the rate
function and the scaled cumulant generating function.
The first part of these notes introduces the basic elements of large
deviation theory at a level appropriate for advanced undergraduate and
graduate students in physics, engineering, chemistry, and mathematics.
The focus there is on the simple but powerful ideas behind large
deviation theory, stated in non-technical terms, and on the application of
these ideas in simple stochastic processes, such as sums of independent
and identically distributed random variables and Markov processes.
Some physical applications of these processes are covered in exercises
contained at the end of each section.
In the second part, the problem of numerically evaluating large
deviation probabilities is treated at a basic level. The fundamental idea
of importance sampling is introduced there together with its sister idea,
the exponential change of measure. Other numerical methods based on
sample means and generating functions, with applications to Markov
processes, are also covered.

Published in R. Leidl and A. K. Hartmann (eds), Modern Computational Science 11:
Lecture Notes from the 3rd International Oldenburg Summer School, BIS-Verlag der Carl
von Ossietzky Universität Oldenburg, 2011.
arXiv:1106.4146v3 [cond-mat.stat-mech] 29 Feb 2012
Contents

1 Introduction
2 Basic elements of large deviation theory
   2.1 Examples of large deviations
   2.2 The large deviation principle
   2.3 The Gärtner-Ellis Theorem
   2.4 Varadhan's Theorem
   2.5 The contraction principle
   2.6 From small to large deviations
   2.7 Exercises
3 Applications of large deviation theory
   3.1 IID sample means
   3.2 Markov chains
   3.3 Continuous-time Markov processes
   3.4 Paths large deviations
   3.5 Physical applications
   3.6 Exercises
4 Numerical estimation of large deviation probabilities
   4.1 Direct sampling
   4.2 Importance sampling
   4.3 Exponential change of measure
   4.4 Applications to Markov chains
   4.5 Applications to continuous-time Markov processes
   4.6 Applications to stochastic differential equations
   4.7 Exercises
5 Other numerical methods for large deviations
   5.1 Sample mean method
   5.2 Empirical generating functions
   5.3 Other methods
   5.4 Exercises
A Metropolis-Hastings algorithm
1 Introduction
The goal of these lecture notes, as the title says, is to give a basic introduction
to the theory of large deviations at three levels: theory, applications and
simulations. The notes follow closely my recent review paper on large
deviations and their applications in statistical mechanics [48], but are, in a
way, both narrower and wider in scope than this review paper.
They are narrower, in the sense that the mathematical notations have
been cut down to a minimum in order for the theory to be understood by
advanced undergraduate and graduate students in science and engineering,
having a basic background in probability theory and stochastic processes
(see, e.g., [27]). The simplification of the mathematics amounts essentially
to two things: i) to focus on random variables taking values in ℝ or ℝᴰ,
and ii) to state all the results in terms of probability densities, and so
to assume that probability densities always exist, if only in a weak sense.
These simplifications are justified for most applications and are convenient
for conveying the essential ideas of the theory in a clear way, without the
hindrance of technical notations.
These notes also go beyond the review paper [48], in that they cover
subjects not contained in that paper, in particular the subject of numerical
estimation or simulation of large deviation probabilities. This is an important
subject that I intend to cover in more depth in the future.
Sections 4 and 5 of these notes are a first and somewhat brief attempt in
this direction. Far from covering all the methods that have been developed
to simulate large deviations and rare events, they concentrate on the central
idea of large deviation simulations, namely that of exponential change of
measure, and they elaborate from there on certain simulation techniques that
are easily applicable to sums of independent random variables, Markov chains,
stochastic differential equations and continuous-time Markov processes in
general.
Many of these applications are covered in the exercises contained at the
end of each section. The level of difficulty of these exercises is quite varied:
some are there to practice the material presented, while others go beyond
that material and may take hours, if not days or weeks, to solve completely.
For convenience, I have rated them according to Knuth's logarithmic scale.¹

In closing this introduction, let me emphasize again that these notes are
not meant to be complete in any way. For one thing, they lack the rigorous
notation needed for handling large deviations in a precise mathematical way,
and only give a hint of the vast subject that is large deviation and rare event
simulation. On the simulation side, they also skip over the subject of error
estimates, which would fill an entire section in itself. In spite of this, I hope
that the material covered will have the effect of enticing readers to learn more
about large deviations and of conveying a sense of the wide applicability,
depth and beauty of this subject, both at the theoretical and computational
levels.

¹ 00 = Immediate; 10 = Simple; 20 = Medium; 30 = Moderately hard; 40 = Term
project; 50 = Research problem. See the superscript attached to each exercise.
2 Basic elements of large deviation theory
2.1 Examples of large deviations
We start our study of large deviation theory by considering a sum of real
random variables (RV for short) having the form

    S_n = (1/n) Σ_{i=1}^n X_i.    (2.1)

Such a sum is often referred to in mathematics or statistics as a sample
mean. We are interested in computing the probability density function
p_{S_n}(s) (pdf for short) of S_n in the simple case where the n RVs are mutually
independent and identically distributed (IID for short).² This means
that the joint pdf of X_1, . . . , X_n factorizes as follows:

    p(X_1, . . . , X_n) = Π_{i=1}^n p(X_i),    (2.2)

with p(X_i) a fixed pdf for each of the X_i's.³

We compare next two cases for p(X_i):
• Gaussian pdf:

    p(x) = (1/√(2πσ²)) e^{−(x−μ)²/(2σ²)},    x ∈ ℝ,    (2.3)

where μ = E[X] is the mean of X and σ² = E[(X − μ)²] its variance.⁴
² The pdf of S_n will be denoted by p_{S_n}(s) or more simply by p(S_n) when no confusion
arises.
³ To avoid confusion, we should write p_{X_i}(x), but in order not to overload the notation,
we will simply write p(X_i). The context should make clear what RV we are referring to.
⁴ The symbol E[X] denotes the expectation or expected value of X, which is often
denoted by ⟨X⟩ in physics.
• Exponential pdf:

    p(x) = (1/μ) e^{−x/μ},    x ∈ [0, ∞),    (2.4)

with mean E[X] = μ > 0.
What is the form of p_{S_n}(s) for each pdf?

To find out, we write the pdf associated with the event S_n = s by summing
the pdf of all the values or realizations (x_1, . . . , x_n) ∈ ℝⁿ of X_1, . . . , X_n
such that S_n = s.⁵ In terms of Dirac's delta function δ(x), this is written as

    p_{S_n}(s) = ∫_ℝ dx_1 ⋯ ∫_ℝ dx_n δ(Σ_{i=1}^n x_i − ns) p(x_1, . . . , x_n).    (2.5)
From this expression, we can then obtain an explicit expression for p_{S_n}(s) by
using the method of generating functions (see Exercise 2.7.1) or by substituting
the Fourier representation of δ(x) above and by explicitly evaluating
the n + 1 resulting integrals. The result obtained for both the Gaussian and
the exponential densities has the general form

    p_{S_n}(s) ≈ e^{−nI(s)},    (2.6)

where

    I(s) = (s − μ)²/(2σ²),    s ∈ ℝ    (2.7)

for the Gaussian pdf, whereas

    I(s) = s/μ − 1 − ln(s/μ),    s ≥ 0    (2.8)

for the exponential pdf.
We will come to understand the meaning of the approximation sign (≈)
more clearly in the next subsection. For now we just take it as meaning that
the dominant behaviour of p(S_n) as a function of n is a decaying exponential
in n. Other terms in n that may appear in the exact expression of p(S_n) are
sub-exponential in n.

The picture of the behaviour of p(S_n) that we obtain from the result
of (2.6) is that p(S_n) decays to 0 exponentially fast with n for all values
S_n = s for which the function I(s), which controls the rate of decay of

⁵ Whenever possible, random variables will be denoted by uppercase letters, while their
values or realizations will be denoted by lowercase letters. This follows the convention used
in probability theory. Thus, we will write X = x to mean that the RV X takes the value x.
[Figure 2.1: (Left) pdf p_{S_n}(s) of the Gaussian sample mean S_n for μ = σ = 1
and n = 5, 25, 50, 100. (Right) I_n(s) = −(1/n) ln p_{S_n}(s) for different values
of n (n = 1, 2, 3, 10), demonstrating a rapid convergence towards the rate
function I(s) (dashed line).]
p(S_n), is positive. But notice that I(s) ≥ 0 for both the Gaussian and
exponential densities, and that I(s) = 0 in both cases only for s = μ = E[X_i].
Therefore, since the pdf of S_n is normalized, it must get more and more
concentrated around the value s = μ as n → ∞ (see Fig. 2.1), which means
that p_{S_n}(s) → δ(s − μ) in this limit. We say in this case that S_n converges
in probability or in density to its mean.
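The concentration and decay described above can be probed by brute-force simulation. The Python sketch below is my own illustration, not part of the notes: it estimates p_{S_n}(s) for the Gaussian sample mean with μ = σ = 1 by histogramming many realizations, then compares I_n(s) = −(1/n) ln p_{S_n}(s) with the parabola (2.7). The sample size, bin width, and function names are arbitrary choices of mine.

```python
import math
import random

# Direct sampling of the Gaussian sample mean S_n for mu = sigma = 1, as in
# Fig. 2.1. We estimate p_{S_n}(s) from a histogram of many realizations and
# compare the finite-n rate function I_n(s) = -(1/n) ln p_{S_n}(s) with the
# predicted rate function I(s) = (s - mu)^2 / (2 sigma^2) of Eq. (2.7).

random.seed(1)
mu, sigma, n, trials, width = 1.0, 1.0, 50, 50_000, 0.05

counts = {}
for _ in range(trials):
    s = sum(random.gauss(mu, sigma) for _ in range(n)) / n
    b = math.floor(s / width)           # histogram bin index
    counts[b] = counts.get(b, 0) + 1

def I_estimated(s):
    """Finite-n rate function estimated from the histogram."""
    p = counts.get(math.floor(s / width), 0) / (trials * width)
    return -math.log(p) / n

def I_exact(s):
    return (s - mu) ** 2 / (2 * sigma ** 2)

# Already at n = 50 the estimate tracks the parabola (2.7), up to
# sub-exponential corrections in n and histogram noise.
err_typical = abs(I_estimated(1.0) - I_exact(1.0))
err_nearby = abs(I_estimated(1.3) - I_exact(1.3))
```

Note that far in the tails the bins are empty: the probabilities there are exponentially small in n, which is precisely why Secs. 4 and 5 develop better sampling methods.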
As a variation of the Gaussian and exponential sample means, let us
now consider a sum of discrete random variables having a probability
distribution P(X_i) instead of continuous random variables with a pdf p(X_i).⁶
To be specific, consider the case of IID Bernoulli RVs X_1, . . . , X_n taking
values in the set {0, 1} with probability P(X_i = 0) = 1 − α and P(X_i = 1) = α.
What is the form of the probability distribution P(S_n) of S_n in this case?

Notice that we are speaking of a probability distribution because S_n is
now a discrete variable taking values in the set {0, 1/n, 2/n, . . . , (n − 1)/n, 1}.
In the previous Gaussian and exponential examples, S_n was a continuous
variable characterized by its pdf p(S_n).

With this in mind, we can obtain the exact expression of P(S_n) using
methods similar to those used to obtain p(S_n) (see Exercise 2.7.4). The
result is different from that found for the Gaussian and exponential densities,
but what is remarkable is that the distribution of the Bernoulli sample mean

⁶ Following the convention in probability theory, we use the lowercase p for continuous
probability densities and the uppercase P for discrete probability distributions and
probability assignments in general. Moreover, following the notation used before, we will
denote the distribution of a discrete S_n by P_{S_n}(s) or simply P(S_n).
also contains a dominant exponential term having the form

    P_{S_n}(s) ≈ e^{−nI(s)},    (2.9)

where I(s) is now given by

    I(s) = s ln(s/α) + (1 − s) ln((1 − s)/(1 − α)),    s ∈ [0, 1].    (2.10)
The behaviour of the exact expression of P(S_n) as n grows is shown in
Fig. 2.2 together with the plot of I(s) as given by Eq. (2.10). Notice how
P(S_n) concentrates around its mean μ = E[X_i] = α as a result of the fact
that I(s) ≥ 0 and that s = α is the only value of S_n for which I = 0. Notice
also how the support of P(S_n) becomes denser as n → ∞, and compare
this property with the fact that I(s) is a continuous function despite S_n being
discrete. Should I(s) not be defined for discrete values if S_n is a discrete
RV? We address this question next.
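The convergence of the finite-n rate function to (2.10) can be checked exactly here, since the Bernoulli sample mean has the closed-form binomial distribution of Exercise 2.7.4. The sketch below is my own illustration (α = 0.4 as in Fig. 2.2; the test point s = 0.8 is arbitrary):

```python
import math

# Finite-n rate function I_n(s) = -(1/n) ln P_{S_n}(s) of the Bernoulli
# sample mean, computed from the exact binomial distribution, compared with
# the limiting rate function I(s) of Eq. (2.10) for alpha = 0.4.

alpha = 0.4

def ln_P(s, n):
    """ln P(S_n = s) for s = j/n, via log-factorials to avoid underflow."""
    j = round(s * n)
    return (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(alpha) + (n - j) * math.log(1 - alpha))

def I_exact(s):
    return s * math.log(s / alpha) + (1 - s) * math.log((1 - s) / (1 - alpha))

def I_n(s, n):
    return -ln_P(s, n) / n

# The discrepancy decays to zero as n grows (essentially (ln n)/n terms).
errors = [abs(I_n(0.8, n) - I_exact(0.8)) for n in (10, 100, 1000)]
```
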
2.2 The large deviation principle

The general exponential form e^{−nI(s)} that we found for our three previous
sample means (Gaussian, exponential and Bernoulli) is the founding result or
property of large deviation theory, referred to as the large deviation principle.
The reason why a whole theory can be built on such a seemingly simple
result is that it arises in the context of many stochastic processes, not just
IID sample means, as we will come to see in the next sections, and as can be
seen from other contributions to this volume; see, e.g., Engel's.⁷

The rigorous, mathematical definition of the large deviation principle
involves many concepts of topology and measure theory that are too technical
to be presented here (see Sec. 3.1 and Appendix B of [48]). For simplicity,
we will say here that a random variable S_n or its pdf p(S_n) satisfies a large
deviation principle (LDP) if the following limit exists:

    lim_{n→∞} −(1/n) ln p_{S_n}(s) = I(s)    (2.11)

and gives rise to a function I(s), called the rate function, which is not
everywhere zero.

The relation between this definition and the approximation notation used
earlier should be clear: the fact that the behaviour of p_{S_n}(s) is dominated

⁷ The volume referred to is the volume of lecture notes produced for the summer school;
see http://www.mcs.uni-oldenburg.de/
[Figure 2.2: (Top left) Discrete probability distribution P_{S_n}(s) of the Bernoulli
sample mean for α = 0.4 and different values of n (n = 5, 25, 50, 100). (Top
right) Finite-n rate function I_n(s) = −(1/n) ln P_{S_n}(s). The rate function
I(s) is the dashed line. (Bottom left) Coarse-grained pdf p_{S_n}(s) for the
Bernoulli sample mean. (Bottom right) I_n(s) = −(1/n) ln p_{S_n}(s) as obtained
from the coarse-grained pdf.]
for large n by a decaying exponential means that the exact pdf of S_n can be
written as

    p_{S_n}(s) = e^{−nI(s)+o(n)}    (2.12)

where o(n) stands for any correction term that is sub-linear in n. Taking the
large deviation limit of (2.11) then yields

    lim_{n→∞} −(1/n) ln p_{S_n}(s) = I(s) − lim_{n→∞} o(n)/n = I(s),    (2.13)

since o(n)/n → 0. We see therefore that the large deviation limit of Eq. (2.11)
is the limit needed to retain the dominant exponential term in p(S_n) while
discarding any other sub-exponential terms. For this reason, large deviation
theory is often said to be concerned with estimates of probabilities on the
logarithmic scale.
This point is illustrated in Fig. 2.2. There we see that the function

    I_n(s) = −(1/n) ln p_{S_n}(s)    (2.14)

is not quite equal to the limiting rate function I(s), because of terms of order
o(n)/n = o(1); however, it does converge to I(s) as n gets larger. Plotting
p(S_n) on this scale therefore reveals the rate function in the limit of large
n. This convergence will be encountered repeatedly in the sections on the
numerical evaluation of large deviation probabilities.

It should be emphasized again that the definition of the LDP given above
is a simplification of the rigorous definition used in mathematics, due to
the mathematician Srinivas Varadhan.⁸ The real, mathematical definition
is expressed in terms of probability measures of certain sets rather than in
terms of probability densities, and involves upper and lower bounds on these
probabilities rather than a simple limit (see Sec. 3.1 and Appendix B of
[48]). The mathematical definition also applies a priori to any RVs, not just
continuous RVs with a pdf.
In these notes, we will simplify the mathematics by assuming that the
random variables or stochastic processes that we study have a pdf. In fact,
we will often assume that pdfs exist even for RVs that are not continuous
but look continuous at some scale.

To illustrate this point, consider again the Bernoulli sample mean. We
noticed that S_n in this case is not a continuous RV, and so does not have
a pdf. However, we also noticed that the values that S_n can take become
dense in the interval [0, 1] as n → ∞, which means that the support of the
discrete probability distribution P(S_n) becomes dense in this limit, as shown
in Fig. 2.2. From a practical point of view, it therefore makes sense to treat
S_n as a continuous RV for large n by interpolating P(S_n) to a continuous
function p(S_n) representing the probability density of S_n.
This pdf is obtained in general simply by considering the probability
P(S_n ∈ [s, s + Δs]) that S_n takes a value in a tiny interval surrounding
the value s, and by then dividing this probability by the size Δs of that
interval:

    p_{S_n}(s) = P(S_n ∈ [s, s + Δs]) / Δs.    (2.15)

The pdf obtained in this way is referred to as a smoothed, coarse-grained
or weak density.⁹ In the case of the Bernoulli sample mean, it is simply
given by

    p_{S_n}(s) = n P_{S_n}(s)    (2.16)

if the spacing Δs between the values of S_n is chosen to be 1/n.

This process of replacing a discrete variable by a continuous one in some
limit or at some scale is known in physics as the continuous or continuum
limit or as the macroscopic limit. In mathematics, this limit is expressed
via the notion of weak convergence (see [16]).

⁸ Recipient of the 2007 Abel Prize for his fundamental contributions to probability
theory and in particular for creating a unified theory of large deviations.
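The coarse-graining rule (2.16) can be checked directly for the Bernoulli case. The following sketch is my own (α = 0.4, n = 200, and the test point j = 160 are arbitrary choices): it verifies that p = nP is normalized when summed over the support with spacing 1/n, and that coarse-graining shifts the finite-n rate function by only (ln n)/n, which vanishes in the large deviation limit.

```python
import math

# Coarse-graining of the Bernoulli sample mean, Eq. (2.16): with spacing
# Delta_s = 1/n, the coarse-grained pdf is p_{S_n}(s) = n P_{S_n}(s).

alpha, n = 0.4, 200

def ln_P(j):
    """ln P(S_n = j/n), binomial distribution via log-factorials."""
    return (math.lgamma(n + 1) - math.lgamma(j + 1) - math.lgamma(n - j + 1)
            + j * math.log(alpha) + (n - j) * math.log(1 - alpha))

# Normalization: sum_j p(j/n) * Delta_s = sum_j n P(j/n) * (1/n) = 1.
integral = sum(n * math.exp(ln_P(j)) * (1 / n) for j in range(n + 1))

# The rate functions built from P and from p = n P differ by (ln n)/n -> 0,
# so both lead to the same rate function I(s) in the limit.
j = 160  # corresponds to s = 0.8
I_discrete = -ln_P(j) / n
I_coarse = -(ln_P(j) + math.log(n)) / n
```
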
2.3 The Gärtner-Ellis Theorem

Our goal in the next sections will be to study random variables and stochastic
processes that satisfy an LDP and to find analytical as well as numerical ways
to obtain the corresponding rate function. There are many ways whereby a
random variable, say S_n, can be shown to satisfy an LDP:

• Direct method: Find the expression of p(S_n) and show that it has
the form of the LDP;

• Indirect method: Calculate certain functions of S_n, such as generating
functions, whose properties can be used to infer that S_n satisfies
an LDP;

• Contraction method: Relate S_n to another random variable, say
A_n, which is known to satisfy an LDP and derive from this an LDP
for S_n.

We have used the first method when discussing the Gaussian, exponential
and Bernoulli sample means.

The main result of large deviation theory that we will use in these
notes to obtain LDPs along the indirect method is called the Gärtner-Ellis
Theorem (GE Theorem for short), and is based on the calculation of the
following function:

    λ(k) = lim_{n→∞} (1/n) ln E[e^{nkS_n}],    (2.17)

⁹ To be more precise, we should make clear that the coarse-grained pdf depends on the
spacing Δs by writing, say, p_{S_n,Δs}(s). However, to keep the notation to a minimum, we
will use the same lowercase p to denote a coarse-grained pdf and a true continuous pdf.
The context should make clear which of the two is used.
known as the scaled cumulant generating function¹⁰ (SCGF for short).
In this expression E[·] denotes the expected value, k is a real parameter, and
S_n is an arbitrary RV; it is not necessarily an IID sample mean or even a
sample mean.

The point of the GE Theorem is to be able to calculate λ(k) without
knowing p(S_n). We will see later that this is possible. Given λ(k), the GE
Theorem then says that, if λ(k) is differentiable,¹¹ then

• S_n satisfies an LDP, i.e.,

    lim_{n→∞} −(1/n) ln p_{S_n}(s) = I(s);    (2.18)

• The rate function I(s) is given by the Legendre-Fenchel transform
of λ(k):

    I(s) = sup_{k∈ℝ} {ks − λ(k)},    (2.19)

where sup stands for the supremum.¹²

We will come to calculate λ(k) and its Legendre-Fenchel transform for
specific RVs in the next section. It will also be seen in there that λ(k)
does not always exist (this typically happens when the pdf of a random
variable does not admit an LDP). Some exercises at the end of this section
illustrate many useful properties of λ(k) when this function exists and is
twice differentiable.
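The Legendre-Fenchel transform (2.19) is easy to evaluate numerically by taking the supremum over a grid of k values. As a sketch (mine, not the notes'), take the Gaussian sample mean with mean μ and variance σ², whose SCGF works out to λ(k) = μk + σ²k²/2 (a standard computation; it also follows from Eq. (3.2)); the transform should then recover the parabola (2.7).

```python
import math

# Numerical Legendre-Fenchel transform, Eq. (2.19), for the Gaussian SCGF
# lambda(k) = mu k + sigma^2 k^2 / 2. The supremum over k is taken on a
# finite grid; the result should match I(s) = (s - mu)^2 / (2 sigma^2).

mu, sigma = 1.0, 1.0

def scgf(k):
    return mu * k + 0.5 * sigma**2 * k**2

def legendre_fenchel(s, k_grid):
    """I(s) = sup_k {k s - lambda(k)}, approximated on a grid."""
    return max(k * s - scgf(k) for k in k_grid)

k_grid = [-10 + 0.001 * i for i in range(20001)]  # k in [-10, 10]
max_err = max(abs(legendre_fenchel(s, k_grid) - (s - mu)**2 / (2 * sigma**2))
              for s in (-1.0, 0.0, 0.5, 1.0, 2.0, 3.0))
```

Because λ here is differentiable and strictly convex, the grid supremum sits at the unique root of λ′(k) = s, in line with Exercise 2.7.8.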
2.4 Varadhan's Theorem

The rigorous proof of the GE Theorem is too technical to be presented here.
However, there is a way to justify this result by deriving in a heuristic way
another result known as Varadhan's Theorem.

The latter theorem is concerned with the evaluation of a functional¹³
expectation of the form

    W_n[f] = E[e^{nf(S_n)}] = ∫_ℝ p_{S_n}(s) e^{nf(s)} ds,    (2.20)

¹⁰ The function E[e^{kX}] for k real is known as the generating function of the RV X;
ln E[e^{kX}] is known as the log-generating function or cumulant generating function.
The word scaled comes from the extra factor 1/n.
¹¹ The statement of the GE Theorem given here is a simplified version of the full result,
which is essentially the result proved by Gärtner [22]. See Sec. 3.3 of [48] for a more
complete presentation of the GE Theorem and [18] for a rigorous account of it.
¹² In these notes, a sup can be taken to mean the same as a max.
¹³ A function of a function is called a functional.
where f is some function of S_n, which is taken to be a real RV for simplicity.
Assuming that S_n satisfies an LDP with rate function I(s), we can write

    W_n[f] ≈ ∫_ℝ e^{n[f(s)−I(s)]} ds,    (2.21)

with sub-exponential corrections in n. This integral has the form of a so-called
Laplace integral, which is known to be dominated for large n by
its largest integrand when it is unique. Assuming this is the case, we can
proceed to approximate the whole integral as

    W_n[f] ≈ e^{n sup_s [f(s)−I(s)]}.    (2.22)

Such an approximation is referred to as a Laplace approximation, the
Laplace principle or a saddle-point approximation (see Chap. 6 of
[3]), and is justified in the context of large deviation theory because the
corrections to this approximation are sub-exponential in n, as are those of
the LDP. By defining the following functional:

    λ[f] = lim_{n→∞} (1/n) ln W_n[f]    (2.23)

using a limit similar to the limit defining the LDP, we then obtain

    λ[f] = sup_{s∈ℝ} {f(s) − I(s)}.    (2.24)

The result above is what is referred to as Varadhan's Theorem [52, 48].
The contribution of Varadhan was to prove this result for a large class of RVs,
which includes not just IID sample means but also random vectors and even
random functions, and to rigorously handle all the heuristic approximations
used above. As such, Varadhan's Theorem can be considered as a rigorous
and general expression of the Laplace principle.

To connect Varadhan's Theorem with the GE Theorem, consider the
special case f(s) = ks with k ∈ ℝ.¹⁴ Then Eq. (2.24) becomes

    λ(k) = sup_{s∈ℝ} {ks − I(s)},    (2.25)

where λ(k) is the same function as the one defined in Eq. (2.17). Thus we see
that if S_n satisfies an LDP with rate function I(s), then the SCGF λ(k) of

¹⁴ Varadhan's Theorem holds in its original form for bounded functions f, but another
stronger version can be applied to unbounded functions and in particular to f(x) = kx;
see Theorem 1.3.4 of [16].
S_n is the Legendre-Fenchel transform of I(s). This in a sense is the inverse
of the GE Theorem, so an obvious question is, can we invert the equation
above to obtain I(s) in terms of λ(k)? The answer provided by the GE
Theorem is that a sufficient condition for I(s) to be the Legendre-Fenchel
transform of λ(k) is for the latter function to be differentiable.¹⁵

This is the closest that we can get to the GE Theorem without proving it.
It is important to note, to fully appreciate the importance of that theorem,
that it is actually more than just an inversion of Varadhan's Theorem, because
the existence of λ(k) implies the existence of an LDP for S_n. In our heuristic
inversion of Varadhan's Theorem we assumed that an LDP exists.
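The Laplace approximation at the heart of this argument can be checked numerically. The sketch below is my own illustration (the choices f(s) = ks with k = 0.5, the Gaussian rate function I(s) = (s − 1)²/2, the integration range, and the step size are all arbitrary): it evaluates the integral (2.21) on a grid and shows that (1/n) ln W_n[f] approaches sup_s{f(s) − I(s)} = k + k²/2 as n grows, with the sub-exponential corrections shrinking like (ln n)/n.

```python
import math

# Numerical check of the Laplace approximation (2.22): for f(s) = k s and
# I(s) = (s - 1)^2 / 2, the quantity (1/n) ln W_n[f], with W_n[f] computed
# from the integral (2.21), approaches sup_s {f(s) - I(s)} as n grows.

k = 0.5

def f(s):
    return k * s

def I(s):
    return (s - 1.0) ** 2 / 2.0

def lambda_n(n, ds=1e-3):
    """(1/n) ln of the Riemann sum of exp(n [f(s) - I(s)]) over s."""
    s_values = [-10 + ds * i for i in range(int(20 / ds) + 1)]
    peak = max(f(s) - I(s) for s in s_values)   # factor out the maximum
    total = sum(math.exp(n * (f(s) - I(s) - peak)) * ds for s in s_values)
    return peak + math.log(total) / n

exact_sup = k + k**2 / 2   # supremum of k s - (s-1)^2/2, attained at s = 1 + k
errors = [abs(lambda_n(n) - exact_sup) for n in (10, 100, 1000)]
```
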
2.5 The contraction principle

We mentioned before that LDPs can be derived by a contraction method.
The basis of this method is the following: let A_n be a random variable
known to have an LDP with rate function I_A(a), and consider another
random variable B_n which is a function of the first, B_n = f(A_n). We want
to know whether B_n satisfies an LDP and, if so, to find its rate function.

To find the answer, write the pdf of B_n in terms of the pdf of A_n:

    p_{B_n}(b) = ∫_{a : f(a) = b} p_{A_n}(a) da    (2.26)

and use the LDP for A_n to write

    p_{B_n}(b) ≈ ∫_{a : f(a) = b} e^{−nI_A(a)} da.    (2.27)

Then apply the Laplace principle to approximate the integral above by its
largest term, which corresponds to the minimum of I_A(a) for a such that
b = f(a). Therefore,

    p_{B_n}(b) ≈ exp(−n inf_{a : f(a) = b} I_A(a)).    (2.28)

This shows that p(B_n) also satisfies an LDP with rate function I_B(b) given
by

    I_B(b) = inf_{a : f(a) = b} I_A(a).    (2.29)

¹⁵ The differentiability condition, though sufficient, is not necessary: I(s) in some cases
can be the Legendre-Fenchel transform of λ(k) even when the latter is not differentiable;
see Sec. 4.4 of [48].
This formula is called the contraction principle because f can be
many-to-one, i.e., there might be many a's such that b = f(a), in which case
we are contracting information about the rate function of A_n down to B_n.
In physical terms, this formula is interpreted by saying that an improbable
fluctuation¹⁶ of B_n is brought about by the most probable of all improbable
fluctuations of A_n.

The contraction principle has many applications in statistical physics.
In particular, the maximum entropy and minimum free energy principles,
which are used to find equilibrium states in the microcanonical and canonical
ensembles, respectively, can be seen as deriving from the contraction principle
(see Sec. 5 of [48]).
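A toy example (my own, not from the notes) makes the "most probable of all improbable fluctuations" reading of Eq. (2.29) concrete. Suppose B_n = f(A_n) with the hypothetical choice f(a) = a², where A_n is assumed to satisfy an LDP with Gaussian rate function I_A(a) = (a − 1)²/2. For b > 0 the map is two-to-one, with preimages a = ±√b:

```python
import math

# Contraction principle, Eq. (2.29), for B_n = A_n^2 with
# I_A(a) = (a - 1)^2 / 2. I_B(b) is the infimum of I_A over the
# preimages of b, i.e., the least improbable fluctuation of A_n
# that produces the fluctuation b of B_n.

def I_A(a):
    return (a - 1.0) ** 2 / 2.0

def I_B(b):
    """I_B(b) = inf_{a : a^2 = b} I_A(a), for b >= 0."""
    root = math.sqrt(b)
    return min(I_A(root), I_A(-root))

# The infimum is attained at a = +sqrt(b), the preimage closest to the
# typical value a = 1; the branch a = -sqrt(b) is exponentially less likely.
values = [I_B(b) for b in (0.0, 1.0, 4.0)]
```
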
2.6 From small to large deviations

An LDP for a random variable, say S_n again, gives us a lot of information
about its pdf. First, we know that p(S_n) concentrates on certain points
corresponding to the zeros of the rate function I(s). These points correspond
to the most probable or typical values of S_n in the limit n → ∞ and can
be shown to be related mathematically to the Law of Large Numbers
(LLN for short). In fact, an LDP always implies some form of LLN (see
Sec. 3.5.7 of [48]).

Often it is not enough to know that S_n converges in probability to some
values; we may also want to determine the likelihood that S_n takes a value
away from but close to its typical value(s). Consider one such typical value s*
and assume that I(s) admits a Taylor series around s*:

    I(s) = I(s*) + I′(s*)(s − s*) + (I″(s*)/2)(s − s*)² + ⋯.    (2.30)

Since s* must correspond to a zero of I(s), the first two terms in this series
vanish, and so we are left with the prediction that the small deviations of
S_n around its typical value are Gaussian-distributed:

    p_{S_n}(s) ≈ e^{−nI″(s*)(s−s*)²/2}.    (2.31)

In this sense, large deviation theory contains the Central Limit Theorem
(see Sec. 3.5.8 of [48]). At the same time, large deviation theory
can be seen as an extension of the Central Limit Theorem because it gives
information not only about the small deviations of S_n, but also about its
large deviations far away from its typical value(s); hence the name of the
theory.¹⁷

¹⁶ In statistical physics, a deviation of a random variable away from its typical value is
referred to as a fluctuation.
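This distinction between small and large deviations is easy to quantify for the exponential sample mean. With μ = 1, Eq. (2.8) gives I(s) = s − 1 − ln s, so s* = 1 and I″(s*) = 1/μ² = 1, and the parabola of (2.31) is (s − 1)²/2, which reproduces the Central Limit Theorem (the variance of the exponential distribution is μ² = 1). The short sketch below (my illustration; the evaluation points are arbitrary) shows the parabola is excellent near s* but fails far from it:

```python
import math

# Rate function of the exponential sample mean, Eq. (2.8) with mu = 1,
# versus its small-deviation (quadratic) approximation from Eq. (2.31).

def I(s):
    return s - 1.0 - math.log(s)        # Eq. (2.8) with mu = 1

def I_quad(s):
    return (s - 1.0) ** 2 / 2.0         # quadratic approximation (2.31)

# Near s* = 1 the two agree closely; far from s* the parabola overestimates
# I(s) for s > 1 (the exponential's right tail is heavier than Gaussian) and
# underestimates it for s < 1, so only the full rate function describes the
# large deviations correctly.
near_gap = abs(I(1.05) - I_quad(1.05))
far_gap = abs(I(3.0) - I_quad(3.0))
```
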
2.7 Exercises

2.7.1. [10] (Generating functions) Let A_n be a sum of n IID RVs:

    A_n = Σ_{i=1}^n X_i.    (2.32)

Show that the generating function of A_n, defined as

    W_{A_n}(k) = E[e^{kA_n}],    (2.33)

satisfies the following factorization property:

    W_{A_n}(k) = Π_{i=1}^n W_{X_i}(k) = W_{X_i}(k)ⁿ,    (2.34)

W_{X_i}(k) being the generating function of X_i.

2.7.2. [12] (Gaussian sample mean) Find the expression of W_X(k) for X
distributed according to a Gaussian pdf with mean μ and variance σ². From
this result, find W_{S_n}(k) following the previous exercise and p(S_n) by inverse
Laplace transform.

2.7.3. [15] (Exponential sample mean) Repeat the previous exercise for the
exponential pdf shown in Eq. (2.4). Obtain from your result the approximation
shown in (2.8).

2.7.4. [10] (Bernoulli sample mean) Show that the probability distribution of
the Bernoulli sample mean is the binomial distribution:

    P_{S_n}(s) = n! / ((sn)! [(1 − s)n]!) α^{sn} (1 − α)^{(1−s)n}.    (2.35)

Use Stirling's approximation to put this result in the form of (2.9) with I(s)
given by Eq. (2.10).

2.7.5. [12] (Multinomial distribution) Repeat the previous exercise for IID
RVs taking values in the set {0, 1, . . . , q − 1} instead of {0, 1}. Use Stirling's
approximation to arrive at an exponential approximation similar to (2.9).
(See solution in [20].)

¹⁷ A large deviation is also called a large fluctuation or a rare and extreme event.
2.7.6. [10] (SCGF at the origin) Let λ(k) be the SCGF of an IID sample mean S_n. Prove the following properties of λ(k) at k = 0:

    λ(0) = 0
    λ'(0) = E[S_n] = E[X_i]
    λ''(0) = var(X_i)

Which of these properties remain valid if S_n is not an IID sample mean, but is some general RV?
2.7.7. [12] (Convexity of SCGFs) Show that λ(k) is a convex function of k.
2.7.8. [12] (Legendre transform) Show that, when λ(k) is everywhere differentiable and is strictly convex (i.e., has no affine or linear parts), the Legendre-Fenchel transform shown in Eq. (2.19) reduces to the Legendre transform:

    I(s) = k(s) s − λ(k(s)),    (2.36)

where k(s) is the unique root of λ'(k) = s. Explain why the latter equation has a unique root.
2.7.9. [20] (Convexity of rate functions) Prove that rate functions obtained from the GE Theorem are strictly convex.
2.7.10. [12] (Varadhan's Theorem) Verify Varadhan's Theorem for the Gaussian and exponential sample means by explicitly calculating the Legendre-Fenchel transform of I(s) obtained for these sample means. Compare your results with the expressions of λ(k) obtained from its definition.
3 Applications of large deviation theory
We study in this section examples of random variables and stochastic processes
with interesting large deviation properties. We start by revisiting the simplest
application of large deviation theory, namely, IID sample means, and then
move to Markov processes, which are often used for modeling physical and
man-made systems. More information about applications of large deviation
theory in statistical physics can be found in Secs. 5 and 6 of [48], as well as
in the contribution of Engel contained in this volume.
3.1 IID sample means
Consider again the sample mean

    S_n = \frac{1}{n} \sum_{i=1}^n X_i    (3.1)

involving n real IID variables X_1, ..., X_n with common pdf p(x). To determine whether S_n satisfies an LDP and obtain its rate function, we can follow the GE Theorem and calculate the SCGF λ(k) defined in Eq. (2.17). Because of the IID nature of S_n, λ(k) takes a simple form:

    λ(k) = ln E[e^{kX}].    (3.2)

In this expression, X is any of the X_i's (remember they are IID). Thus to obtain an LDP for S_n, we only have to calculate the log-generating function of X, check that it is differentiable and, if so, calculate its Legendre-Fenchel transform (or in this case, its Legendre transform; see Exercise 2.7.8). Exercise 3.6.1 considers different distributions p(x) for which this calculation can be carried out, including the Gaussian, exponential and Bernoulli distributions studied before.
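When no closed form is available, the same recipe can be carried out numerically. The sketch below (our illustration, not part of the notes) estimates λ(k) = ln E[e^{kX}] from samples of X and then takes the Legendre-Fenchel transform over a grid of k values; with standard Gaussian samples the result should approach the exact rate function I(s) = s²/2.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200_000)     # samples of X ~ N(0, 1)

def scgf(k):
    """Estimate lambda(k) = ln E[exp(kX)] by a sample average."""
    return np.log(np.mean(np.exp(k * x)))

# Legendre-Fenchel transform I(s) = sup_k [ks - lambda(k)] over a k-grid
ks = np.linspace(-3.0, 3.0, 601)
lams = np.array([scgf(k) for k in ks])

def rate(s):
    return np.max(ks * s - lams)

print(rate(1.0))    # close to the exact value I(1) = 1/2
```

Sample averages of e^{kX} degrade quickly for large |k| (the average is dominated by rare samples), which is one motivation for the importance-sampling methods of Sec. 4.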
As an extra exercise, let us attempt to find the rate function of a special sample mean defined by

    L_{n,j} = \frac{1}{n} \sum_{i=1}^n δ_{X_i, j}    (3.3)

for a sequence X_1, ..., X_n of n discrete IID RVs with finite state space X = {1, 2, ..., q}. This sample mean is called the empirical distribution of the X_i's as it counts the number of times the value or symbol j ∈ X appears in a given realization of X_1, ..., X_n. This number is normalized by the total number n of RVs, so what we have is the empirical frequency for the appearance of the symbol j in realizations of X_1, ..., X_n.
The values of L_{n,j} for all j ∈ X can be put into a vector L_n called the empirical vector. For the Bernoulli sample mean, for example, X = {0, 1} and so L_n is a two-dimensional vector L_n = (L_{n,0}, L_{n,1}) containing the empirical frequency of 0's and 1's appearing in a realization x_1, ..., x_n of X_1, ..., X_n:

    L_{n,0} = \frac{\# 0's in x_1, ..., x_n}{n},    L_{n,1} = \frac{\# 1's in x_1, ..., x_n}{n}.    (3.4)
To find the rate function associated with the random vector L_n, we apply the GE Theorem but adapt it to the case of random vectors by replacing k in λ(k) by a vector k having the same dimension as L_n. The calculation of λ(k) is left as an exercise (see Exercise 3.6.8). The result is

    λ(k) = ln \sum_{j ∈ X} P_j e^{k_j},    (3.5)

where P_j = P(X_i = j). It is easily checked that λ(k) is differentiable in the vector sense, so we can use the GE Theorem to conclude that L_n satisfies an LDP, which we write as

    p(L_n = μ) ≈ e^{−n I(μ)},    (3.6)

with a vector rate function I(μ) given by the Legendre-Fenchel transform of λ(k). The calculation of this transform is the subject of Exercise 3.6.8. The final result is

    I(μ) = \sum_{j ∈ X} μ_j ln \frac{μ_j}{P_j}.    (3.7)

This rate function is called the relative entropy or Kullback-Leibler divergence [8]. The full LDP for L_n is referred to as Sanov's Theorem (see [44] and Sec. 4.2 of [48]).
It can be checked that I(μ) is a convex function of μ and that it has a unique minimum and zero at μ = P, i.e., μ_j = P_j for all j ∈ X. As seen before, this property is an expression of the LLN, which says here that the empirical frequencies L_{n,j} converge to the probabilities P_j as n → ∞. The LDP goes beyond this result by describing the fluctuations of L_n around the typical value μ = P.
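Since the probability that L_n equals a given rational vector μ is an explicit multinomial expression, Sanov's Theorem can be checked numerically without any simulation. The sketch below (our illustration; the particular distributions are arbitrary choices) compares −(1/n) ln P(L_n = μ) with the relative entropy of Eq. (3.7):

```python
import math

def log_multinomial(n, counts, P):
    """Exact ln P(L_n = counts/n) for n IID draws with probabilities P."""
    log_p = math.lgamma(n + 1)
    for c, pj in zip(counts, P):
        log_p += c * math.log(pj) - math.lgamma(c + 1)
    return log_p

def kl(mu, P):
    """Relative entropy (3.7): sum_j mu_j ln(mu_j / P_j), with 0 ln 0 = 0."""
    return sum(m * math.log(m / pj) for m, pj in zip(mu, P) if m > 0)

P  = [0.2, 0.3, 0.5]
mu = [0.5, 0.3, 0.2]          # an atypical empirical vector
for n in [10, 100, 1000, 10000]:
    counts = [round(m * n) for m in mu]
    print(n, -log_multinomial(n, counts, P) / n)   # approaches kl(mu, P) ≈ 0.275
```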
3.2 Markov chains
Instead of assuming that the sample mean S_n arises from a sum of IID RVs X_1, ..., X_n, consider the case where the X_i's form a Markov chain. This means that the joint pdf p(X_1, ..., X_n) has the form

    p(X_1, ..., X_n) = p(X_1) \prod_{i=1}^{n−1} π(X_{i+1} | X_i),    (3.8)

where p(X_1) is some initial pdf for X_1 and π(X_{i+1} | X_i) is the transition probability density that X_{i+1} follows X_i in the Markov sequence X_1 → X_2 → ··· → X_n (see [27] for background information on Markov chains).
The GE Theorem can still be applied in this case, but the expression of λ(k) is more complicated. Skipping the details of the calculation (see Sec. 4.3 of [48]), we arrive at the following result: provided that the Markov chain is homogeneous and ergodic (see [27] for a definition of ergodicity), the SCGF of S_n is given by

    λ(k) = ln ζ(Π_k),    (3.9)

where ζ(Π_k) is the dominant eigenvalue (i.e., the one with largest magnitude) of the matrix Π_k whose elements π_k(x, x') are given by π_k(x, x') = π(x' | x) e^{k x'}. We call the matrix Π_k the tilted matrix associated with S_n. If the Markov chain is finite, it can be proved furthermore that λ(k) is analytic and so differentiable. From the GE Theorem, we then conclude that S_n has an LDP with rate function

    I(s) = \sup_{k ∈ R} {k s − ln ζ(Π_k)}.    (3.10)

Some care must be taken if the Markov chain has an infinite number of states. In this case, λ(k) is not necessarily analytic and may not even exist (see, e.g., [29, 30, 38] for illustrations).
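Eqs. (3.9)–(3.10) translate directly into a short computation. The sketch below (our illustration) uses the symmetric two-state chain of Exercise 3.6.10 with flip probability α = 0.3 (an assumed value): build the tilted matrix, take the log of its dominant eigenvalue, and maximize ks − λ(k) on a grid.

```python
import numpy as np

alpha = 0.3   # flip probability of the symmetric two-state chain (assumed example)
# pi[x, y] = pi(y | x) for states x, y in {0, 1}
pi = np.array([[1 - alpha, alpha],
               [alpha, 1 - alpha]])

def scgf(k):
    """lambda(k) = ln zeta(Pi_k), with Pi_k(x, x') = pi(x'|x) exp(k x')."""
    tilted = pi * np.exp(k * np.array([0.0, 1.0]))   # scale column x' by e^{k x'}
    return np.log(np.max(np.abs(np.linalg.eigvals(tilted))))

ks = np.linspace(-5.0, 5.0, 1001)
lams = np.array([scgf(k) for k in ks])

def rate(s):
    """I(s) = sup_k [ks - lambda(k)], Eq. (3.10), restricted to the k-grid."""
    return np.max(ks * s - lams)

print(rate(0.5))   # s = 1/2 is the typical value of the symmetric chain, so I(1/2) = 0
```

The same three lines of linear algebra work for any finite tilted matrix, which is why Eq. (3.9) is so convenient in practice.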
An application of the above result for a simple Bernoulli Markov chain is considered in Exercise 3.6.10. In Exercises 3.6.12 and 3.6.13, the case of random variables having the form

    Q_n = \frac{1}{n} \sum_{i=1}^{n−1} q(x_i, x_{i+1})    (3.11)

is covered. This type of RV arises in statistical physics when studying currents in stochastic models of interacting particles (see [30]). The tilted matrix Π_k associated with Q_n has the form π_k(x, x') = π(x' | x) e^{k q(x, x')}.
3.3 Continuous-time Markov processes
The preceding subsection can be generalized to ergodic Markov processes evolving continuously in time by taking a continuous-time limit of discrete-time ergodic Markov chains.

To illustrate this limit, consider an ergodic continuous-time process described by the state X(t) for 0 ≤ t ≤ T. For an infinitesimal time-step Δt, time-discretize this process into a sequence of n + 1 states X_0, X_1, ..., X_n with n = T/Δt and X_i = X(i Δt), i = 0, 1, ..., n. The sequence X_0, X_1, ..., X_n is an ergodic Markov chain with infinitesimal transition matrix Π(Δt) given by

    Π(Δt) = π(x(t + Δt) | x(t)) = e^{G Δt} = I + G Δt + o(Δt),    (3.12)

where I is the identity matrix and G the generator of X(t). With this discretization, it is now possible to study LDPs for processes involving X(t) by studying these processes in discrete time at the level of the Markov chain X_0, X_1, ..., X_n, and transfer these LDPs into continuous time by taking the limits Δt → 0, n → ∞.
As a general application of this procedure, consider a so-called additive process defined by

    S_T = \frac{1}{T} \int_0^T X(t) dt.    (3.13)

The discrete-time version of this RV is the sample mean

    S_n = \frac{1}{n Δt} \sum_{i=0}^{n} X_i Δt = \frac{1}{n} \sum_{i=0}^{n} X_i.    (3.14)
From this association, we find that the SCGF of S_T, defined by the limit

    λ(k) = \lim_{T → ∞} \frac{1}{T} ln E[e^{T k S_T}],  k ∈ R,    (3.15)

is given by

    λ(k) = \lim_{Δt → 0} \lim_{n → ∞} \frac{1}{n Δt} ln E[e^{k Δt \sum_{i=0}^n X_i}]    (3.16)

at the level of the discrete-time Markov chain.
According to the previous subsection, the limit above reduces to ln ζ(Π_k(Δt)), where Π_k(Δt) is the matrix (or operator) corresponding to

    Π_k(Δt) = e^{k x' Δt} Π(Δt)
            = [1 + k x' Δt + o(Δt)] [I + G Δt + o(Δt)]
            = I + G_k Δt + o(Δt),    (3.17)

where

    G_k = G + k x' δ_{x, x'}.    (3.18)

For the continuous-time process X(t), we must therefore have

    λ(k) = ζ(G_k),    (3.19)

where ζ(G_k) is the dominant eigenvalue of the tilted generator G_k.[18]
Note that the reason why the logarithm does not appear in the expression of λ(k) for the continuous-time process[19] is that ζ is now the dominant eigenvalue of the generator of the infinitesimal transition matrix, which is itself the exponential of the generator.

[18] As for Markov chains, this result holds for continuous, ergodic processes with finite state space. For infinite-state or continuous-space processes, a similar result holds provided G_k has an isolated dominant eigenvalue.
[19] Compare with Eq. (3.9).
With the knowledge of λ(k), we can obtain an LDP for S_T using the GE Theorem: if the dominant eigenvalue is differentiable in k, then S_T satisfies an LDP in the long-time limit, T → ∞, which we write as p_{S_T}(s) ≈ e^{−T I(s)}, where I(s) is the Legendre-Fenchel transform of λ(k). Some applications of this result are presented in Exercises 3.6.11 and 3.6.13. Note that in the case of current-type processes having the form of Eq. (3.11), the tilted generator is not simply given by G_k = G + k x' δ_{x, x'}; see Exercise 3.6.13.
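To make Eq. (3.19) concrete, here is a sketch (ours, not from the notes) for the two-state residence time of Exercise 3.6.11: because that observable is a function of the state, the simple tilting rule applies and the tilted generator is G + k diag(1, 0). The jump rates a = b = 1 are assumed for illustration.

```python
import numpy as np

a, b = 1.0, 1.0   # jump rates 0 -> 1 and 1 -> 0 (assumed for illustration)
G = np.array([[-a, b],
              [a, -b]])         # generator; columns sum to zero

def scgf(k):
    """lambda(k) = zeta(G_k), Eq. (3.19), with G_k = G + k diag(delta_{x,0})."""
    Gk = G + k * np.diag([1.0, 0.0])
    return np.max(np.linalg.eigvals(Gk).real)

ks = np.linspace(-20.0, 20.0, 2001)
lams = np.array([scgf(k) for k in ks])

def rate(s):
    """Rate function of the residence time by Legendre-Fenchel transform."""
    return np.max(ks * s - lams)

print(rate(0.5))   # with equal rates the process spends half its time in state 0
```

Note that, unlike Eq. (3.9), no logarithm is taken: λ(k) is the eigenvalue of G_k itself.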
3.4 Paths large deviations
We complete our tour of mathematical applications of large deviation theory by studying a different large deviation limit, namely, the low-noise limit of the following stochastic differential equation (SDE for short):

    ẋ(t) = f(x(t)) + \sqrt{ε} ξ(t),  x(0) = 0,    (3.20)

which involves a force f(x) and a Gaussian white noise ξ(t) with the properties E[ξ(t)] = 0 and E[ξ(t') ξ(t)] = δ(t' − t); see Chap. XVI of [49] for background information on SDEs.

We are interested in studying for this process the pdf of a given random path {x(t)}_{t=0}^T of duration T in the limit where the noise power ε vanishes. This abstract pdf can be defined heuristically using path integral methods (see Sec. 6.1 of [48]). In the following, we denote this pdf by the functional notation p[x] as a shorthand for p({x(t)}_{t=0}^T).
The idea behind seeing the low-noise limit as a large deviation limit is that, as ε → 0, the random path arising from the SDE above should converge in probability to the deterministic path x(t) solving the ordinary differential equation

    ẋ(t) = f(x(t)),  x(0) = 0.    (3.21)

This convergence is a LLN-type result, and so in the spirit of large deviation theory we are interested in quantifying the likelihood that a random path {x(t)}_{t=0}^T ventures away from the deterministic path in the limit ε → 0. The functional LDP that characterizes these path fluctuations has the form

    p[x] ≈ e^{−I[x]/ε},    (3.22)

where

    I[x] = \frac{1}{2} \int_0^T [ẋ(t) − f(x(t))]² dt.    (3.23)

See [21] and Sec. 6.1 of [48] for historical sources on this LDP.
The rate functional I[x] is called the action, Lagrangian or entropy of the path {x(t)}_{t=0}^T. The names action and Lagrangian come from an analogy with the action of quantum trajectories in the path integral approach of quantum mechanics (see Sec. 6.1 of [48]). There is also a close analogy between the low-noise limit of SDEs and the semi-classical or WKB approximation of quantum mechanics [48].

The path LDP above can be generalized to higher-dimensional SDEs as well as SDEs involving state-dependent noise and correlated noises (see Sec. 6.1 of [48]). In all cases, the minimum and zero of the rate functional is the trajectory of the deterministic system obtained in the zero-noise limit. This is verified for the 1D system considered above: I[x] ≥ 0 for all trajectories and I[x] = 0 for the unique trajectory solving Eq. (3.21).
Functional LDPs are the most refined LDPs that can be derived for SDEs as they characterize the probability of complete trajectories. Other coarser LDPs can be derived from these by contraction. For example, we might be interested to determine the pdf p(x, T) of the state x(T) reached after a time T. The contraction in this case is obvious: p(x, T) must have the large deviation form p(x, T) ≈ e^{−V(x,T)/ε} with

    V(x, T) = \inf_{x(t): x(0)=0, x(T)=x} I[x].    (3.24)

That is, the probability of reaching x(T) = x from x(0) = 0 is determined by the path connecting these two endpoints having the largest probability. We call this path the optimal path, maximum likelihood path or instanton. Using variational calculus techniques, often used in classical mechanics, it can be proved that this path satisfies an Euler-Lagrange-type equation as well as a Hamilton-type equation (see Sec. 6.1 of [48]). Applications of these equations are covered in Exercises 3.6.14–3.6.17.
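The variational problem (3.24) can also be attacked numerically by discretizing the action and running gradient descent on the interior points of the path. The sketch below is our illustration, with assumed ingredients: the linear force f(x) = −x (as in the Ornstein-Uhlenbeck process of Exercise 3.6.14), endpoints x(0) = 0 and x(T) = 1, and the action written with a 1/2 prefactor, the convention matching unit δ-correlated noise and the quasi-potential V(x) = 2U(x) of Exercise 3.6.15.

```python
import numpy as np

# Assumed example: f(x) = -x, endpoints x(0) = 0, x(T) = 1.
T, N = 4.0, 80
dt = T / N
path = np.linspace(0.0, 1.0, N + 1)     # initial guess: straight line

def action(x):
    """Discretized I[x] = (1/2) int (xdot - f(x))^2 dt with f(x) = -x."""
    r = np.diff(x) / dt + x[:-1]        # xdot - f(x) on each time step
    return 0.5 * dt * np.sum(r**2)

# Minimize the action over interior points by gradient descent.
lr = 0.02
for _ in range(5000):
    r = np.diff(path) / dt + path[:-1]
    grad = r[:-1] + (dt - 1.0) * r[1:]  # dAction/dx_i at interior points
    path[1:-1] -= lr * grad

print(action(path))   # close to V(1) = 1 up to O(dt) discretization error
```

For this force U(x) = x²/2, so the converged action should sit near the quasi-potential value V(1) = 1, and the optimal path climbs slowly against the force before reaching the endpoint.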
Quantities similar to the additive process S_T considered in the previous subsection can also be defined for SDEs (see Exercises 3.6.11 and 3.6.12). An interesting aspect of these quantities is that their LDPs can involve the noise power ε, if the low-noise limit is taken, as well as the integration time T, which arises because of the additive (in the sense of sample mean) nature of these quantities. In this case, the limit T → ∞ must be taken before the low-noise limit ε → 0. If ε → 0 is taken first, the system considered is no longer random, which means that there can be no LDP in time.
3.5 Physical applications
Some applications of the large deviation results seen so far are covered in the
contribution of Engel to this volume. The following list gives an indication
of these applications and some key references for learning more about them.
A more complete presentation of the applications of large deviation theory
in statistical physics can be found in Secs. 5 and 6 of [48].
Equilibrium systems: Equilibrium statistical mechanics, as embodied by the ensemble theory of Boltzmann and Gibbs, can be seen with hindsight as a large deviation theory of many-body systems at equilibrium. This becomes evident by realizing that the thermodynamic limit is a large deviation limit, that the entropy is the equivalent of a rate function and the free energy the equivalent of a SCGF. Moreover, the Legendre transform of thermodynamics connecting the entropy and free energy is nothing but the Legendre-Fenchel transform connecting the rate function and the SCGF in the GE Theorem and in Varadhan's Theorem. For a more complete explanation of these analogies, and historical sources on the development of large deviation theory in relation to equilibrium statistical mechanics, see the book by Ellis [19] and Sec. 3.7 of [48].
Chaotic systems and multifractals: The so-called thermodynamic formalism of dynamical systems, developed by Ruelle [40] and others, can also be re-interpreted with hindsight as an application of large deviation theory for the study of chaotic systems.[20] There are two quantities in this theory playing the role of the SCGF, namely, the topological pressure and the structure function. The Legendre transform appearing in this theory is also an analogue of the one encountered in large deviation theory. References to these analogies can be found in Secs. 7.1 and 7.2 of [48].
Nonequilibrium systems: Large deviation theory is becoming the standard formalism used in studies of nonequilibrium systems modelled by SDEs and Markov processes in general. In fact, large deviation theory is currently experiencing a sort of revival in physics and mathematics as a result of its growing application in nonequilibrium statistical mechanics. Many LDPs have been derived in this context: LDPs for the current fluctuations or the occupation of interacting particle models, such as the exclusion process, the zero-range process (see [28]) and their many variants [45], as well as LDPs for work-related and entropy-production-related quantities for nonequilibrium systems modelled with SDEs [6, 7]. Good entry points in this vast field of research are [13] and [30].

[20] There is a reference to this connection in a note of Ruelle's popular book, Chance and Chaos [39] (see Note 2 of Chap. 19).
Fluctuation relations: Several LDPs have come to be studied in recent years under the name of fluctuation relations. To illustrate these results, consider the additive process S_T considered earlier and assume that this process admits an LDP with rate function I(s). In many cases, it is interesting not only to know that S_T admits an LDP but to know how probable positive fluctuations of S_T are compared to negative fluctuations. For this purpose, it is common to study the ratio

    \frac{p_{S_T}(s)}{p_{S_T}(−s)},    (3.25)

which reduces to

    \frac{p_{S_T}(s)}{p_{S_T}(−s)} ≈ e^{T[I(−s) − I(s)]}    (3.26)

if we assume an LDP for S_T. In many cases, the difference I(−s) − I(s) is linear in s, and one then says that S_T satisfies a conventional fluctuation relation, whereas if it is nonlinear in s, then one says that S_T satisfies an extended fluctuation relation. Other types of fluctuation relations have come to be defined in addition to these; for more information, the reader is referred to Sec. 6.3 of [48]. A list of the many physical systems for which fluctuation relations have been derived or observed can also be found in this reference.
3.6 Exercises

3.6.1. [12] (IID sample means) Use the GE Theorem to find the rate function of the IID sample mean S_n for the following probability distribution and densities:

- Gaussian:

    p(x) = \frac{1}{\sqrt{2πσ²}} e^{−(x−μ)²/(2σ²)},  x ∈ R.    (3.27)

- Bernoulli: X_i ∈ {0, 1}, P(X_i = 0) = 1 − α, P(X_i = 1) = α.

- Exponential:

    p(x) = \frac{1}{μ} e^{−x/μ},  x ∈ [0, ∞), μ > 0.    (3.28)

- Uniform:

    p(x) = 1/a for x ∈ [0, a], and 0 otherwise.    (3.29)

- Cauchy:

    p(x) = \frac{σ}{π(x² + σ²)},  x ∈ R, σ > 0.    (3.30)
3.6.2. [15] (Nonconvex rate functions) Rate functions obtained from the GE Theorem are necessarily convex (strictly convex, in fact; see Exercise 2.7.9), but rate functions in general need not be convex. Consider, as an example, the process defined by

    S_n = Y + \frac{1}{n} \sum_{i=1}^n X_i,    (3.31)

where Y = ±1 with probability 1/2 and the X_i's are Gaussian IID RVs. Find the rate function for S_n assuming that Y is independent of the X_i's. Then find the corresponding SCGF λ(k). What is the relation between the Legendre-Fenchel transform of λ(k) and the rate function I(s)? How is the nonconvexity of I(s) related to the differentiability of λ(k)? (See Example 4.7 of [48] for the solution.)
3.6.3. [15] (Exponential mixture) Repeat the previous exercise by replacing the Bernoulli Y with Y = Z/n, where Z is an exponential RV with mean 1. (See Example 4.8 of [48] for the solution.)
3.6.4. [12] (Product process) Find the rate function of

    Z_n = \left( \prod_{i=1}^n X_i \right)^{1/n},    (3.32)

where X_i ∈ {1, 2} with p(X_i = 1) = α and p(X_i = 2) = 1 − α, 0 < α < 1.
3.6.5. [15] (Self-process) Consider a sequence of IID Gaussian RVs X_1, ..., X_n. Find the rate function of the so-called self-process defined by

    S_n = −\frac{1}{n} ln p(X_1, ..., X_n).    (3.33)

(See Sec. 4.5 of [48] for information about this process.) Repeat the problem for other choices of pdf for the RVs.
3.6.6. [20] (Iterated large deviations [35]) Consider the sample mean

    S_{m,n} = \frac{1}{m} \sum_{i=1}^m X_i^{(n)}    (3.34)

involving m IID copies of a random variable X^{(n)}. Show that if X^{(n)} satisfies an LDP of the form

    p_{X^{(n)}}(x) ≈ e^{−n I(x)},    (3.35)

then S_{m,n} satisfies an LDP of the form

    p_{S_{m,n}}(s) ≈ e^{−m n I(s)}.    (3.36)
3.6.7. [40] (Persistent random walk) Use the result of the previous exercise to find the rate function of

    S_n = \frac{1}{n} \sum_{i=1}^n X_i,    (3.37)

where the X_i's are independent Bernoulli RVs with non-identical distributions P(X_i = 0) = α_i and P(X_i = 1) = 1 − α_i, with 0 < α_i < 1.
3.6.8. [15] (Sanov's Theorem) Consider the empirical vector L_n with components defined in Eq. (3.3). Derive the expression found in Eq. (3.5) for

    λ(k) = \lim_{n → ∞} \frac{1}{n} ln E[e^{n k·L_n}],    (3.38)

where k·L_n denotes the scalar product of k and L_n. Then obtain the rate function found in (3.7) by calculating the Legendre transform of λ(k). Check explicitly that the rate function is convex and has a single minimum and zero.
3.6.9. [20] (Contraction principle) Repeat the first exercise of this section by using the contraction principle. That is, use the rate function of the empirical vector L_n to obtain the rate function of S_n. What is the mapping from L_n to S_n? Is this mapping many-to-one or one-to-one?
3.6.10. [17] (Bernoulli Markov chain) Consider the Bernoulli sample mean

    S_n = \frac{1}{n} \sum_{i=1}^n X_i,  X_i ∈ {0, 1}.    (3.39)

Find the expression of λ(k) and I(s) for this process assuming that the X_i's form a Markov process with symmetric transition matrix

    π(x' | x) = 1 − α if x' = x, and α if x' ≠ x,    (3.40)

with 0 ≤ α ≤ 1. (See Example 4.4 of [48] for the solution.)
3.6.11. [20] (Residence time) Consider a Markov process with state space X = {0, 1} and generator

    G = ( −α   β
           α  −β ).    (3.41)

Find for this process the rate function of the random variable

    L_T = \frac{1}{T} \int_0^T δ_{X(t), 0} dt,    (3.42)

which represents the fraction of time the state X(t) of the Markov chain spends in the state X = 0 over a period of time T.
3.6.12. [20] (Current fluctuations in discrete time) Consider the Markov chain of Exercise 3.6.10. Find the rate function of the mean current Q_n defined as

    Q_n = \frac{1}{n} \sum_{i=1}^n f(x_{i+1}, x_i),    (3.43)

    f(x', x) = 0 if x' = x, and 1 if x' ≠ x.    (3.44)

Q_n represents the mean number of jumps between the states 0 and 1: at each transition of the Markov chain, the current is incremented by 1 whenever a jump between the states 0 and 1 occurs.
3.6.13. [22] (Current fluctuations in continuous time) Repeat the previous exercise for the Markov process of Exercise 3.6.11 and the current Q_T defined as

    Q_T = \lim_{Δt → 0} \frac{1}{T} \sum_{i=0}^{n−1} f(x_{i+1}, x_i),    (3.45)

where x_i = x(i Δt) and f(x', x) is as above, i.e., f(x_{i+1}, x_i) = 1 − δ_{x_i, x_{i+1}}. Show for this process that the tilted generator is

    G_k = ( −α      β e^k
            α e^k  −β ).    (3.46)
3.6.14. [17] (Ornstein-Uhlenbeck process) Consider the following linear SDE:

    ẋ(t) = −γ x(t) + \sqrt{ε} ξ(t),    (3.47)

often referred to as the Langevin equation or Ornstein-Uhlenbeck process. Find for this SDE the solution of the optimal path connecting the initial point x(0) = 0 with the fluctuation x(T) = x. From the solution, find the pdf p(x, T) as well as the stationary pdf obtained in the limit T → ∞. Assume ε → 0 throughout, but then verify that the results obtained are valid for all ε > 0.
3.6.15. [20] (Conservative system) Show by large deviation techniques that the stationary pdf of the SDE

    ẋ(t) = −∇U(x(t)) + \sqrt{ε} ξ(t)    (3.48)

admits the LDP p(x) ≈ e^{−V(x)/ε} with rate function V(x) = 2U(x). The rate function V(x) is called the quasi-potential.
3.6.16. [25] (Transversal force) Show that the LDP found in the previous exercise also holds for the SDE

    ẋ(t) = −∇U(x(t)) + A(x) + \sqrt{ε} ξ(t)    (3.49)

if ∇U · A = 0. (See Sec. 4.3 of [21] for the solution.)
3.6.17. [25] (Noisy Van der Pol oscillator) The following set of SDEs describes a noisy version of the well-known Van der Pol equation:

    ẋ = v,
    v̇ = −x + v(α − x² − v²) + \sqrt{ε} ξ.    (3.50)

The parameter α ∈ R controls a bifurcation of the zero-noise system: for α ≤ 0, the system with ε = 0 has a single attracting point at the origin, whereas for α > 0, it has a stable limit cycle centered on the origin. Show that the stationary distribution of the noisy oscillator has the large deviation form p(x, v) ≈ e^{−U(x,v)/ε} as ε → 0 with rate function

    U(x, v) = −α (x² + v²) + \frac{1}{2} (x² + v²)².    (3.51)

Find a different set of SDEs that has the same rate function. (See [26] for the solution.)
3.6.18. [30] (Dragged Brownian particle [50]) The stochastic dynamics of a tiny glass bead immersed in water and pulled with laser tweezers at constant velocity can be modelled in the overdamped limit by the following reduced Langevin equation:

    ẋ = −(x(t) − νt) + \sqrt{ε} ξ(t),    (3.52)

where x(t) is the position of the glass bead, ν the pulling velocity, ξ(t) a Gaussian white noise modeling the influence of the surrounding fluid, and ε the noise power related to the temperature of the fluid. Show that the mean work done by the laser as it pulls the glass bead over a time T, which is defined as

    W_T = −\frac{ν}{T} \int_0^T (x(t) − νt) dt,    (3.53)

satisfies an LDP in the limit T → ∞. Derive this LDP by assuming first that ε → 0, and then derive the LDP without this assumption. Finally, determine whether W_T satisfies a conventional or extended fluctuation relation, and study the limit ν → 0. (See Example 6.9 of [48] for more references on this model.)
4 Numerical estimation of large deviation probabilities

The previous sections might give the false impression that rate functions can be calculated explicitly for many stochastic processes. In fact, exact and explicit expressions of rate functions can be found only in a few simple cases. For most stochastic processes of scientific interest (e.g., noisy nonlinear dynamical systems, chemical reactions, queues, etc.), we have to rely on analytical approximations and numerical methods to evaluate rate functions. The rest of these notes attempts to give an overview of some of these numerical methods and to illustrate them with the simple processes (IID sample means and Markov processes) treated before. The goal is to give a flavor of the general ideas behind large deviation simulations rather than a complete survey, so only a subset of the many methods that have come to be devised for numerically estimating rate functions are covered in what follows. Other methods not treated here, such as the transition path method discussed by Dellago in this volume, are mentioned at the end of the next section.
4.1 Direct sampling

The problem addressed in this section is to obtain a numerical estimate of the pdf p_{S_n}(s) for a real random variable S_n satisfying an LDP, and to extract from this an estimate of the rate function I(s).[21] To be general, we take S_n to be a function of n RVs X_1, ..., X_n, which at this point are not necessarily IID. To simplify the notation, we will use the shorthand ω = X_1, ..., X_n. Thus we write S_n as the function S_n(ω) and denote by p(ω) the joint pdf of the X_i's.[22]

Numerically, we cannot of course obtain p_{S_n}(s) or I(s) for all s ∈ R, but only for a finite number of values s, which we take for simplicity to be equally spaced with a small step Δs. Following our discussion of the LDP, we thus attempt to estimate the coarse-grained pdf

    p_{S_n}(s) = \frac{P(S_n ∈ [s, s + Δs])}{Δs} = \frac{P(S_n ∈ Δ_s)}{Δs},    (4.1)

where Δ_s denotes the small interval [s, s + Δs] anchored at the value s.[23]
To construct this estimate, we follow the statistical sampling or Monte Carlo method, which we break down into the following steps (see the contribution of Katzgraber in this volume for more details):

1. Generate a sample {ω^{(j)}}_{j=1}^L of L copies or realizations of the sequence ω from its pdf p(ω).

2. Obtain from this sample a sample {s^{(j)}}_{j=1}^L of values or realizations of S_n:

    s^{(j)} = S_n(ω^{(j)}),  j = 1, ..., L.    (4.2)

3. Estimate P(S_n ∈ Δ_s) by calculating the sample mean

    \hat{P}_L(Δ_s) = \frac{1}{L} \sum_{j=1}^L 1_{Δ_s}(s^{(j)}),    (4.3)

where 1_A(x) denotes the indicator function for the set A, which is equal to 1 if x ∈ A and 0 otherwise.
[21] The literature on large deviation simulation usually considers the estimation of P(S_n ∈ A) for some set A rather than the estimation of the whole pdf of S_n.
[22] For simplicity, we consider the X_i's to be real RVs. The case of discrete RVs follows with slight changes of notation.
[23] Though not explicitly noted, the coarse-grained pdf depends on Δs; see Footnote 9.
Figure 4.1: (Left) Naive sampling of the Gaussian IID sample mean (μ = σ = 1) for n = 10 and different sample sizes L. The dashed line is the exact rate function. As L grows, a larger range of I(s) is sampled. (Right) Naive sampling for a fixed sample size L = 10 000 and various values of n. As n increases, I_{n,L}(s) approaches the expected rate function but the sampling becomes inefficient as it becomes restricted to a narrow domain.
4. Turn the estimator \hat{P}_L(Δ_s) of the probability P(S_n ∈ Δ_s) into an estimator \hat{p}_L(s) of the probability density p_{S_n}(s):

    \hat{p}_L(s) = \frac{\hat{P}_L(Δ_s)}{Δs} = \frac{1}{L Δs} \sum_{j=1}^L 1_{Δ_s}(s^{(j)}).    (4.4)
The result of these steps is illustrated in Fig. 4.1 for the case of the IID Gaussian sample mean. Note that \hat{p}_L(s) above is nothing but an empirical vector for S_n (see Sec. 3.1) or a density histogram of the sample {s^{(j)}}_{j=1}^L. The reason for choosing \hat{p}_L(s) as our estimator of p_{S_n}(s) is that it is an unbiased estimator in the sense that

    E[\hat{p}_L(s)] = p_{S_n}(s)    (4.5)

for all L. Moreover, we know from the LLN that \hat{p}_L(s) converges in probability to its mean p_{S_n}(s) as L → ∞. Therefore, the larger our sample, the closer we should get to a valid estimation of p_{S_n}(s).
To extract a rate function I(s) from \hat{p}_L(s), we simply compute

    I_{n,L}(s) = −\frac{1}{n} ln \hat{p}_L(s)    (4.6)

and repeat the whole process for larger and larger integer values of n and L until I_{n,L}(s) converges to some desired level of accuracy.[24]
[24] Note that \hat{P}_L(Δ_s) and \hat{p}_L(s) differ only by the factor Δs. I_{n,L}(s) can therefore be computed from either estimator with a difference (ln Δs)/n that vanishes as n → ∞.
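Steps 1–4 above can be condensed into a short script. The sketch below is our illustration for the Gaussian IID sample mean of Fig. 4.1 (μ = σ = 1 assumed, as in the figure); the exact rate function for comparison is I(s) = (s − μ)²/(2σ²).

```python
import numpy as np

rng = np.random.default_rng(1)
n, L, ds = 10, 100_000, 0.05
mu, sigma = 1.0, 1.0

# Steps 1-2: L realizations of omega = (X_1, ..., X_n) and of S_n(omega)
s = rng.normal(mu, sigma, size=(L, n)).mean(axis=1)

# Steps 3-4: histogram estimate of p_{S_n}(s) on bins of width ds
edges = np.arange(-1.0, 3.0 + ds, ds)
counts, _ = np.histogram(s, bins=edges)
p_hat = counts / (L * ds)                       # Eq. (4.4)

centers = edges[:-1] + ds / 2
with np.errstate(divide="ignore"):
    I_hat = -np.log(p_hat) / n                  # Eq. (4.6); inf on empty bins

print(I_hat[np.argmin(np.abs(centers - 1.0))])  # near 0: s = 1 is the typical value
```

Around the typical value the estimate is accurate even for modest L, but bins a few standard deviations away stay empty, which is the inefficiency visible in Fig. 4.1 and quantified in the next subsection.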
4.2 Importance sampling

A basic rule of thumb in statistical sampling, suggested by the LLN, is that an event with probability P will appear in a sample of size L roughly LP times. Thus to get at least one instance of that event in the sample, we must have L > 1/P as an approximate lower bound for the size of our sample (see Exercise 4.7.5 for a more precise derivation of this estimate).

Applying this result to p_{S_n}(s), we see that, if this pdf satisfies an LDP of the form p_{S_n}(s) ≈ e^{−n I(s)}, then we need to have L > e^{n I(s)} to get at least one instance of the event S_n ∈ Δ_s in our sample. In other words, our sample must be exponentially large with n in order to see any large deviations (see Fig. 4.1 and Exercise 4.7.2).
This is a severe limitation of the sampling scheme outlined earlier, and we call it for this reason crude Monte Carlo or naive sampling. The way around this limitation is to use importance sampling (IS for short), which works basically as follows (see the contribution of Katzgraber for more details):

1. Instead of sampling the X_i's according to the joint pdf p(ω), sample them according to a new pdf q(ω);[25]

2. Calculate instead of \hat{p}_L(s) the estimator

    \hat{q}_L(s) = \frac{1}{L Δs} \sum_{j=1}^L 1_{Δ_s}(S_n(ω^{(j)})) R(ω^{(j)}),    (4.7)

where

    R(ω) = \frac{p(ω)}{q(ω)}    (4.8)

is called the likelihood ratio. Mathematically, R also corresponds to the Radon-Nikodym derivative of the measures associated with p and q.[26]
26
The new estimator q
L
(s) is also an unbiased estimator of p
Sn
(s) because
(see Exercise 4.7.3)
E
q
[ q
L
(s)] = E
p
[ p
L
(s)]. (4.9)
[25] The pdf q must have a support at least as large as that of p, i.e., q(ω) > 0 if p(ω) > 0, otherwise the ratio R is ill-defined. In this case, we say that q is relatively continuous with respect to p and write p ≪ q.
[26] The Radon-Nikodym derivative of two measures ν and μ such that ν ≪ μ is defined as dν/dμ. If these measures have densities p and q, respectively, then dν/dμ = p/q.
However, there is a reason for choosing \hat{q}_L(s) as our new estimator: we might be able to come up with a suitable choice for the pdf q such that \hat{q}_L(s) has a smaller variance than \hat{p}_L(s). This is in fact the goal of IS: select q(ω) so as to minimize the variance

    var_q(\hat{q}_L(s)) = E_q[(\hat{q}_L(s) − p_{S_n}(s))²].    (4.10)

If we can choose a q such that var_q(\hat{q}_L(s)) < var_p(\hat{p}_L(s)), then \hat{q}_L(s) will converge faster to p_{S_n}(s) than \hat{p}_L(s) as we increase L.
It can be proved (see Exercise 4.7.6) that in the class of all pdfs that are relatively continuous with respect to p there is a unique pdf q* that minimizes the variance above. This optimal IS pdf has the form

    q*(ω) = 1_{Δ_s}(S_n(ω)) \frac{p(ω)}{p_{S_n}(s)} = p(ω | S_n(ω) = s)    (4.11)

and has the desired property that var_{q*}(\hat{q}_L(s)) = 0. It does seem therefore that our sampling problem is solved, until one realizes that q* involves the unknown pdf that we want to estimate, namely, p_{S_n}(s). Consequently, q* cannot be used in practice as a sampling pdf. Other pdfs must be considered.

The next subsections present a more practical sampling pdf, which can be proved to be optimal in the large deviation limit n → ∞. We will not attempt to justify the form of this pdf nor will we prove its optimality. Rather, we will attempt to illustrate how it works in the simplest way possible by applying it to IID sample means and Markov processes. For a more complete and precise treatment of IS of large deviation probabilities, see [4] and Chap. VI of [2]. For background information on IS, see Katzgraber (this volume) and Chap. V of [2].
4.3 Exponential change of measure

An important class of IS pdfs used for estimating p_{S_n}(s) is given by

    p_k(ω) = e^{n k S_n(ω)} p(ω) / W_n(k),    (4.12)

where k ∈ ℝ and W_n(k) is a normalization factor given by

    W_n(k) = E_p[e^{n k S_n}] = ∫_{ℝⁿ} e^{n k S_n(ω)} p(ω) dω.    (4.13)

We call this class or family of pdfs parameterized by k the exponential family. In large deviation theory, p_k is also known as the tilted pdf associated with p or the exponential twisting of p.²⁷ The likelihood ratio associated with this change of pdf is

    R(ω) = e^{−n k S_n(ω)} W_n(k),    (4.14)

which, for the purpose of estimating I(s), can be approximated by

    R(ω) ≈ e^{−n[k S_n(ω) − λ(k)]}    (4.15)

given the large deviation approximation W_n(k) ≈ e^{n λ(k)}.
It is important to note that a single pdf of the exponential family with parameter k cannot be used to efficiently sample the whole of the function p_{S_n}(s), but only a particular point of that pdf. More precisely, if we want to sample p_{S_n}(s) at the value s, we must choose k such that the estimator q̂_L(s), which satisfies

    E_{p_k}[q̂_L(s)] = p_{S_n}(s),    (4.16)

has a small variance under p_k, which is equivalent to solving λ′(k) = s for k, where

    λ(k) = lim_{n→∞} (1/n) ln E_p[e^{n k S_n}]    (4.17)

is the SCGF of S_n defined with respect to p (see Exercise 4.7.7). Let us denote the solution of these equations by k(s). For many processes S_n, it can be proved that the tilted pdf p_{k(s)} corresponding to k(s) is asymptotically optimal in n, in the sense that the variance of q̂_L(s) under p_{k(s)} goes asymptotically to 0 as n → ∞. When this happens, the convergence of q̂_L(s) towards p_{S_n}(s) is fast for increasing L and requires a sub-exponential number of samples to achieve a given accuracy for Î_{n,L}(s). To obtain the full rate function, we then simply need to repeat the sampling for other values of s, and so other values of k, and scale the whole process for larger values of n.
In practice, it is often easier to reverse the roles of k and s in the sampling. That is, instead of fixing s and selecting k = k(s) for the numerical estimation of I(s), we can fix k to obtain the rate function at s(k) = λ′(k). This way, we can build a parametric representation of I(s) over a certain range of s values by covering or scanning enough values of k.
At this point, there should be a suspicion that we will not achieve much with this exponential change of measure method (ECM method for short), since the forms of p_k and q̂_L(s) presuppose the knowledge of W_n(k) and, in turn, λ(k). We know from the GE Theorem that λ(k) is in many cases sufficient to obtain I(s), so why take the trouble of sampling q̂_L(s) to get I(s)?²⁸

²⁷ In statistics and actuarial mathematics, p_k is known as the associated law or Esscher transform of p.
The answer to this question is that I(s) can be estimated from p_k without knowing W_n(k) using two insights:

• Estimate I(s) indirectly by sampling an estimator of λ(k) or of S_n, which does not involve W_n(k), instead of sampling the estimator q̂_L(s) of p_{S_n}(s);

• Sample according to the a priori pdf p(ω), or use the Metropolis algorithm, also known as the Markov chain Monte Carlo algorithm, to sample according to p_k(ω) without knowing W_n(k) (see Appendix A and the contribution of Katzgraber in this volume).
These points will be explained in the next section (see also Exercise 4.7.11). For the rest of this subsection, we will illustrate the ECM method for a simple Gaussian IID sample mean, leaving aside the issue of having to know W_n(k). For this process,

    p_k(ω) = p_k(x_1, …, x_n) = ∏_{i=1}^n p_k(x_i),    (4.18)

where

    p_k(x_i) = e^{k x_i} p(x_i) / W(k),    W(k) = E_p[e^{k X}],    (4.19)

with p(x_i) a Gaussian pdf with mean μ and variance σ². We know from Exercise 3.6.1 the expression of the generating function W(k) for this pdf:

    W(k) = e^{μ k + σ² k² / 2}.    (4.20)
The explicit expression of p_k(x_i) is therefore

    p_k(x_i) = e^{k(x_i − μ) − σ² k²/2} e^{−(x_i − μ)²/(2σ²)} / √(2πσ²) = e^{−(x_i − μ − σ² k)²/(2σ²)} / √(2πσ²).    (4.21)
The estimated rate function obtained by sampling the X_i's according to this Gaussian pdf with mean μ + σ² k and variance σ² is shown in Fig. 4.2 for various values of n and L. This figure should be compared with Fig. 4.1, which illustrates the naive sampling method for the same sample mean. It is obvious from these two figures that the ECM method is more efficient than naive sampling.

²⁸ To make matters worse, the sampling of q̂_L(s) does involve I(s) directly; see Exercise 4.7.10.
[Figure 4.2 here: two panels showing Î_{n,L}(s) versus s for n = 5 and n = 10, with L = 100 (left) and L = 1000 (right).]

Figure 4.2: IS for the Gaussian IID sample mean (μ = σ = 1) with the exponential change of measure.
As noted before, the rate function can be obtained in a parametric way by scanning k instead of scanning s and fixing k according to s. If we were to determine k(s) for a given s, we would find

    λ′(k) = μ + σ² k = s,    (4.22)

so that k(s) = (s − μ)/σ². Substituting this result in p_k(x_i) then yields

    p_{k(s)}(x_i) = e^{−(x_i − s)²/(2σ²)} / √(2πσ²).    (4.23)
Thus to efficiently sample p_{S_n}(s), we must sample the X_i's according to a Gaussian pdf with mean s instead of the a priori pdf with mean μ. The sampling is efficient in this case simply because S_n concentrates, in the sense of the LLN, at the value s instead of μ, which means that s has become the typical value of S_n under p_{k(s)}.

The idea of the ECM method is the same for processes other than IID sample means: the goal in general is to change the sampling pdf from p to p_k in such a way that an event that is rare under p becomes typical under p_k.²⁹ For more information on the ECM method and IS in general, see [4] and Chaps. V and VI of [2]. For applications of these methods, see Exercise 4.7.9.
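Concretely, the ECM for this Gaussian case can be implemented in a few lines of Python (an illustrative sketch of my own; n, L and the bin width Δs are arbitrary choices): the X_i are drawn from the tilted pdf N(μ + σ²k, σ²) with k = k(s), and the samples falling in a small bin around s are reweighted by the likelihood ratio R = e^{−nkS_n} W(k)ⁿ.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n, L, ds = 1.0, 1.0, 20, 20_000, 0.1

def I_hat(s):
    """ECM estimate of the rate function at s for the Gaussian sample mean."""
    k = (s - mu) / sigma**2                            # k(s) solving lambda'(k) = s
    x = rng.normal(mu + sigma**2 * k, sigma, (L, n))   # samples from the tilted pdf p_k
    Sn = x.mean(axis=1)
    # likelihood ratio R = p/p_k = exp(-n k S_n) W(k)^n, ln W(k) = mu k + sigma^2 k^2 / 2
    lnR = -n * k * Sn + n * (mu * k + 0.5 * sigma**2 * k**2)
    hits = np.abs(Sn - s) < ds / 2
    q_hat = np.sum(np.exp(lnR[hits])) / (L * ds)       # IS estimator of p_{S_n}(s)
    return -np.log(q_hat) / n

for s in [0.0, 1.0, 2.0]:
    print(f"s = {s:.1f}: I_hat = {I_hat(s):.4f}, exact = {(s - mu)**2 / 2:.4f}")
```

The small systematic offset from the exact I(s) = (s − μ)²/2 comes from the pdf prefactor and the finite bin, which contribute terms of order (ln n)/n to Î_{n,L}(s).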
4.4 Applications to Markov chains

The application of the ECM method to Markov chains follows the IID case closely: we generate L IID realizations x_1^{(j)}, …, x_n^{(j)}, j = 1, …, L, of the chain ω = X_1, …, X_n according to the tilted joint pdf p_k(ω), using either p_k directly or a Metropolis-type algorithm. The value k(s) that achieves the efficient sampling of S_n at the value s is determined in the same way as in the IID case by solving λ′(k) = s, where λ(k) is now given by Eq. (3.9). In this case, S_n = s becomes the typical event of S_n in the limit n → ∞. A simple application of this procedure is proposed in Exercise 4.7.14. For further reading on the sampling of Markov chains, see [4, 42, 43, 5, 41].

²⁹ Note that this change is not always possible: for certain processes, there is no k that makes certain events typical under p_k. For examples, see [1].
An important difference to notice between the IID and Markov cases is that the ECM does not preserve the factorization structure of p(ω) in the latter case. In other words, p_k(ω) does not describe a Markov chain, though, as we noticed, p_k(ω) is a product pdf when p(ω) is itself a product pdf. In the case of Markov chains, p_k(ω) does retain a product structure that looks like a Markov chain, and one may be tempted to define a tilted transition pdf as

    π_k(x′|x) = e^{k x′} π(x′|x) / W(k|x),    W(k|x) = ∫_ℝ e^{k x′} π(x′|x) dx′.    (4.24)

However, it is easy to see that the joint pdf of ω obtained from this transition matrix does not reproduce the tilted pdf p_k(ω).
4.5 Applications to continuous-time Markov processes

The generalization of the results of the previous subsection to continuous-time Markov processes follows from our discussion of the continuous-time limit of Markov chains (see Sec. 3.2). In this case, we generate, according to the tilted pdf of the process, a sample of time-discretized trajectories. The choice of Δt and n used in the discretization will of course influence the precision of the estimates of the pdf and rate function, in addition to the sample size L.

Similarly to Markov chains, the tilted pdf of a continuous-time Markov process does not in general describe a new tilted Markov process having a generator related to the tilted generator G_k defined earlier. In some cases, the tilting of a Markov process can be seen as generating a new Markov process, as will be discussed in the next subsection, but this is not true in general.
4.6 Applications to stochastic differential equations

Stochastic differential equations (SDEs) are covered by the continuous-time results of the previous subsection. However, for this type of stochastic process the ECM can be expressed more explicitly in terms of the path pdf p[x] introduced in Sec. 3.4. The tilted version of this pdf, which is the functional analogue of p_k(ω), is written as

    p_k[x] = e^{T k S_T[x]} p[x] / W_T(k),    (4.25)

where S_T[x] is some functional of the trajectory {x(t)}_{t=0}^T and

    W_T(k) = E_p[e^{T k S_T}].    (4.26)

The likelihood ratio associated with this change of pdf is

    R[x] = p[x] / p_k[x] = e^{−T k S_T[x]} W_T(k) ≈ e^{−T[k S_T[x] − λ(k)]}.    (4.27)
As a simple application of these expressions, let us consider the SDE

    ẋ(t) = ξ(t),    (4.28)

where ξ(t) is a Gaussian white noise, and the following additive process:

    D_T[x] = (1/T) ∫_0^T ẋ(t) dt,    (4.29)

which represents the average velocity or drift of the SDE. What is the form of p_k[x] in this case? Moreover, how can we use this tilted pdf to estimate the rate function I(d) of D_T in the long-time limit?
An answer to the first question is suggested by recalling the goal of the ECM. The LLN implies for the SDE above that D_T → 0 in probability as T → ∞, which means that D_T = 0 is the typical event of the natural dynamics of x(t). The tilted dynamics realized by p_{k(d)}[x] should change this typical event to the fluctuation D_T = d; that is, the typical event of D_T under p_{k(d)}[x] should be D_T = d rather than D_T = 0. One candidate SDE which leads to this behavior of D_T is

    ẋ(t) = d + ξ(t),    (4.30)

so the obvious guess is that p_{k(d)}[x] is the path pdf of this SDE.
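The guess can be checked numerically before proving it: simulating Eq. (4.30) with a simple Euler–Maruyama discretization (a sketch of my own; the values of d, T and the time step are arbitrary choices), the average velocity D_T indeed concentrates near d.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T, dt = 1.5, 200.0, 0.01           # target drift, horizon, Euler-Maruyama step
nsteps = int(T / dt)

# Euler-Maruyama for dx = d dt + dW (the boosted SDE), with x(0) = 0.
dW = rng.normal(0.0, np.sqrt(dt), nsteps)
x = np.cumsum(d * dt + dW)

D_T = x[-1] / T                       # time-averaged velocity D_T[x]
print(f"D_T = {D_T:.3f} (target d = {d})")
```

Since D_T = d + W(T)/T here, its fluctuations around d shrink like 1/√T, so longer horizons give sharper concentration.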
Let us show that this is a correct guess. The form of p_k[x] according to Eq. (4.25) is

    p_k[x] ≈ e^{−T I_k[x]},    I_k[x] = I[x] − k D_T[x] + λ(k),    (4.31)

where

    I[x] = (1/(2T)) ∫_0^T ẋ² dt    (4.32)

is the action of our original SDE and

    λ(k) = lim_{T→∞} (1/T) ln E[e^{T k D_T}]    (4.33)

is the SCGF of D_T. The expectation entering in the definition of λ(k) can be calculated explicitly using path integrals, with the result λ(k) = k²/2. Therefore,

    I_k[x] = I[x] − k D_T[x] + k²/2.    (4.34)
From this result, we find p_{k(d)}[x] by solving λ′(k(d)) = d, which in this case simply yields k(d) = d. Hence

    I_{k(d)}[x] = I[x] − d D_T[x] + d²/2 = (1/(2T)) ∫_0^T (ẋ − d)² dt,    (4.35)

which is exactly the action of the boosted SDE of Eq. (4.30).

Since we have λ(k), we can of course directly obtain the rate function I(d) of D_T by Legendre transform:

    I(d) = k(d) d − λ(k(d)) = d²/2.    (4.36)

This shows that the fluctuations of D_T are Gaussian, which is not surprising given that D_T is, up to a factor 1/T, a Brownian motion.³⁰
Note that the same result can be obtained, following Exercise 4.7.10, by calculating the likelihood ratio R[x] for a path {x_d(t)}_{t=0}^T such that D_T[x_d] = d, i.e.,

    R[x_d] = p[x_d] / p_{k(d)}[x_d] ≈ e^{−T I(d)}.    (4.37)

These calculations give a representative illustration of the ECM method and of the idea that the large deviations of an SDE can be obtained by replacing this SDE with a boosted or effective SDE whose typical events are large deviations of the former SDE. Exercises 4.7.16 and 4.7.17 show that effective SDEs need not correspond exactly to an ECM, but may in some cases reproduce such a change of measure in the large deviation limit T → ∞. The important point in all cases is to be able to calculate the likelihood ratio between a given SDE and a boosted version of it in order to obtain the desired rate function in the limit T → ∞.³¹

For the particular drift process D_T studied above, the likelihood ratio happens to be equivalent to a mathematical result known as Girsanov's formula, which is often used in financial mathematics; see Exercise 4.7.15.

³⁰ Brownian motion is defined as the integral of Gaussian white noise. By writing ẋ = ξ(t), we therefore define x(t) to be a Brownian motion.
4.7 Exercises
4.7.1. [5] (Variance of Bernoulli RVs) Let X be a Bernoulli RV with P(X = 1) = α and P(X = 0) = 1 − α, 0 ≤ α ≤ 1. Find the variance of X in terms of α.
4.7.2. [17] (Direct sampling of IID sample means) Generate on a computer a sample of size L = 100 of n IID Bernoulli RVs X_1, …, X_n with bias α = 0.5, and numerically estimate the rate function I(s) associated with the sample mean S_n of these RVs using the naive estimator p̂_L(s) and the finite-size rate function Î_{n,L}(s) defined in Eq. (4.6). Repeat for L = 10³, 10⁴, 10⁵, and 10⁶ and observe how Î_{n,L}(s) converges to the exact rate function given in Eq. (2.10). Repeat for exponential RVs.
4.7.3. [10] (Unbiased estimator) Prove that q̂_L(s) is an unbiased estimator of p_{S_n}(s), i.e., prove Eq. (4.9).
4.7.4. [15] (Estimator variance) The estimator p̂_L(s) is actually a sample mean of Bernoulli RVs. Based on this, calculate the variance of this estimator with respect to p(ω) using the result of Exercise 4.7.1 above. Then calculate the variance of the importance estimator q̂_L(s) with respect to q(ω).
4.7.5. [15] (Relative error of estimators) The real measure of the quality of the estimator p̂_L(s) is not so much its variance var(p̂_L(s)) as its relative error err(p̂_L(s)), defined by

    err(p̂_L(s)) = √var(p̂_L(s)) / p_{S_n}(s).    (4.38)

Use the previous exercise to find an approximation of err(p̂_L(s)) in terms of L and n. From this result, show that the sample size L needed to achieve a certain relative error grows exponentially with n. (Source: Sec. V.I of [2].)

³¹ The limit ε → 0 can also be considered.
4.7.6. [17] (Optimal importance pdf) Show that the density q* defined in (4.11) is optimal in the sense that

    var_q(q̂_L(s)) ≥ var_{q*}(q̂_L(s)) = 0    (4.39)

for all q relatively continuous with respect to p. (See Chap. V of [2] for help.)
4.7.7. [15] (Exponential change of measure) Show that the value of k solving Eq. (4.16) is given by λ′(k) = s, where λ(k) is the SCGF defined in Eq. (4.3).
4.7.8. [40] (LDP for estimators) We noticed before that p̂_L(s) is the empirical vector of the IID sample {S_n^{(j)}}_{j=1}^L. Based on this, we could study the LDP of p̂_L(s) rather than just its mean and variance, as usually done in sampling theory. What is the form of the LDP associated with p̂_L(s) with respect to p(ω)? What is its rate function? What are the minimum and zero of that rate function? Answer the same questions for q̂_L(s) with respect to q(ω). (Hint: Use Sanov's Theorem and the results of Exercise 3.6.6.)
4.7.9. [25] (Importance sampling of IID sample means) Implement the ECM method to numerically estimate the rate function associated with the IID sample means of Exercise 3.6.1. Use a simple IID sampling based on the explicit expression of p_k in each case (i.e., assume that W(k) is known). Study the convergence of the results as a function of n and L as in Fig. 4.2.
4.7.10. [15] (IS estimator) Show that q̂_L(s) for p_{k(s)} has the form

    q̂_L(s) ≈ (1/(L Δs)) Σ_{j=1}^L 1_s(s^{(j)}) e^{−n I(s^{(j)})}.    (4.40)

Why is this estimator trivial for estimating I(s)?
4.7.11. [30] (Metropolis sampling of IID sample means) Repeat Exercise 4.7.9, but instead of using an IID sampling method, use the Metropolis algorithm to sample the X_i's in S_n. (For information on the Metropolis algorithm, see Appendix A.)
4.7.12. [15] (Tilted distribution from contraction) Let X_1, …, X_n be a sequence of real IID RVs with common pdf p(x), and denote by S_n and L_n(x) the sample mean and empirical functional, respectively, of these RVs. Show that the value of L_n(x) solving the minimization problem associated with the contraction of the LDP of L_n(x) down to the LDP of S_n is given by

    μ_k(x) = e^{k x} p(x) / W(k),    W(k) = E_p[e^{k X}],    (4.41)

where k is such that λ′(k) = (ln W(k))′ = s. To be more precise, show that this pdf is the solution of the minimization problem

    inf_{μ : f(μ) = s} I(μ),    (4.42)

where

    I(μ) = ∫_ℝ dx μ(x) ln(μ(x)/p(x))    (4.43)

is the continuous analogue of the relative entropy defined in Eq. (3.7) and

    f(μ) = ∫_ℝ x μ(x) dx    (4.44)

is the contraction function. Assume that W(k) exists. Why is μ_k the same as the tilted pdf used in IS?
4.7.13. [50] (Equivalence of ensembles) Comment on the meaning of the following statements: (i) The optimal pdf q* is to the microcanonical ensemble what the tilted pdf p_k is to the canonical ensemble. (ii) Proving that p_k achieves a zero variance for the estimator q̂_L(s) in the limit n → ∞ is the same as proving that the canonical ensemble becomes equivalent to the microcanonical ensemble in the thermodynamic limit.
4.7.14. [25] (Tilted Bernoulli Markov chain) Find the expression of the tilted joint pdf p_k(ω) for the Bernoulli Markov chain of Exercise 3.6.10. Then use the Metropolis algorithm (see Appendix A) to generate a large-enough sample of realizations of ω in order to estimate p_{S_n}(s) and I(s). Study the convergence of the estimation of I(s) in terms of n and L towards the expression of I(s) found in Exercise 3.6.10. Note that for a starting configuration x_1, …, x_n and a target configuration x′_1, …, x′_n, the acceptance ratio is

    p_k(x′_1, …, x′_n) / p_k(x_1, …, x_n) = [e^{k x′_1} p(x′_1) ∏_{i=2}^n e^{k x′_i} π(x′_i|x′_{i−1})] / [e^{k x_1} p(x_1) ∏_{i=2}^n e^{k x_i} π(x_i|x_{i−1})],    (4.45)

where p(x_1) is some initial pdf for the first state of the Markov chain. What is the form of q̂_L(s) in this case?
4.7.15. [20] (Girsanov's formula) Denote by p[x] the path pdf associated with the SDE

    ẋ(t) = ξ(t),    (4.46)

where ξ(t) is Gaussian white noise with unit noise power. Moreover, let q[x] be the path pdf of the boosted SDE

    ẋ(t) = μ + ξ(t).    (4.47)

Show that

    R[x] = p[x] / q[x] = exp(−μ ∫_0^T ẋ dt + μ² T / 2) = exp(−μ x(T) + μ² T / 2).    (4.48)
4.7.16. [27] (Effective SDE) Consider the additive process

    S_T = (1/T) ∫_0^T x(t) dt,    (4.49)

with x(t) evolving according to the simple Langevin equation of Exercise 3.6.14. Show that the tilted path pdf p_{k(s)}[x] associated with this process is asymptotically equal to the path pdf of the following boosted SDE:

    ẋ(t) = −(x(t) − s) + ξ(t).    (4.50)

More precisely, show that the action I_{k(s)}[x] of p_{k(s)}[x] and the action J[x] of the boosted SDE differ by a boundary term which is of order O(1/T).
4.7.17. [22] (Non-exponential change of measure) Although the path pdf of the boosted SDE of Eq. (4.50) is not exactly the tilted path pdf p_k[x] obtained from the ECM, the likelihood ratio R[x] of these two pdfs does yield the rate function of S_T. Verify this by obtaining R[x] exactly and by using the result of Exercise 4.7.10.
5 Other numerical methods for large deviations

The use of the ECM method appears to be limited, as mentioned before, by the fact that the tilted pdf p_k(ω) involves W_n(k). We show in this section how to circumvent this problem by considering estimators of W_n(k) and λ(k) instead of estimators of the pdf p_{S_n}(s), and by sampling these new estimators either according to the tilted pdf p_k(ω), but with the Metropolis algorithm in order not to rely on W_n(k), or directly according to the a priori pdf p(ω). At the end of the section, we also briefly describe other important methods used in numerical simulations of large deviations.
5.1 Sample mean method

We noted in the previous subsection that the parameter k entering in the ECM had to be chosen by solving λ′(k) = s in order for s to be the typical event of S_n under p_k. The reason for this is that

    lim_{n→∞} E_{p_k}[S_n] = λ′(k).    (5.1)

By the LLN, we also have that S_n → λ′(k) in probability as n → ∞ if S_n is sampled according to p_k.
These results suggest using

    ŝ_{L,n}(k) = (1/L) Σ_{j=1}^L S_n(ω^{(j)})    (5.2)

as an estimator of λ′(k) by sampling ω according to p_k(ω), and then integrating this estimator with the boundary condition λ(0) = 0 to obtain an estimate of λ(k). In other words, we can take our estimator for λ(k) to be

    λ̂_{L,n}(k) = ∫_0^k ŝ_{L,n}(k′) dk′.    (5.3)

From this estimator, we then obtain a parametric estimation of I(s) from the GE Theorem by Legendre-transforming λ̂_{L,n}(k):

    Î_{L,n}(s) = k s − λ̂_{L,n}(k),    (5.4)

where s = ŝ_{L,n}(k) or, alternatively,

    Î_{L,n}(s) = k(s) s − λ̂_{L,n}(k(s)),    (5.5)

where k(s) is the root of ŝ_{L,n}(k) = s.³²
The implementation of these estimators with the Metropolis algorithm is done explicitly via the following steps:

1. For a given k ∈ ℝ, generate an IID sample {ω^{(j)}}_{j=1}^L of L configurations ω = x_1, …, x_n distributed according to p_k(ω) using the Metropolis algorithm, which only requires the computation of the ratio

    p_k(ω) / p_k(ω′) = e^{n k S_n(ω)} p(ω) / (e^{n k S_n(ω′)} p(ω′))    (5.6)

for any two configurations ω and ω′ (see Appendix A);

³² In practice, we have to check numerically (using finite-size analysis) that λ̂_{n,L}(k) does not develop non-differentiable points in k as n and L are increased to ∞. If non-differentiable points arise, then I(s) is recovered from the GE Theorem only in the range of λ̂′; see Sec. 4.4 of [48].
[Figure 5.1 about here]

Figure 5.1: (Top left) Empirical sample mean ŝ_L(k) for the Gaussian IID sample mean (μ = σ = 1), for L = 5, 50, and 100. The spacing of the k values is Δk = 0.5. (Top right) SCGF λ̂_L(k) obtained by integrating ŝ_L(k). (Bottom) Estimate of I(s) obtained from the Legendre transform of λ̂_L(k). The dashed line in each figure is the exact result. Note that the estimate of the rate function for L = 5 is multi-valued because the corresponding λ̂_L(k) is nonconvex.
2. Compute the estimator ŝ_{L,n}(k) of λ′(k) for the generated sample;

3. Repeat the two previous steps for other values of k to obtain a numerical approximation of the function λ′(k);

4. Numerically integrate the approximation of λ′(k) over the mesh of k values starting from k = 0. The result of the integration is the estimator λ̂_{L,n}(k) of λ(k);

5. Numerically compute the Legendre transform, Eq. (5.4), of the estimator of λ(k) to obtain the estimator Î_{L,n}(s) of I(s) at s = ŝ_{L,n}(k);

6. Repeat the previous steps for larger values of n and L until the results converge to some desired level of accuracy.
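For the Gaussian test case the steps above reduce to a few lines of Python (my own sketch; it bypasses the Metropolis sampling of step 1 because the tilted pdf is here an explicit Gaussian, and uses a trapezoidal rule for step 4):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma, n, L = 1.0, 1.0, 10, 1000
ks = np.arange(-2.0, 2.0 + 1e-9, 0.1)          # mesh of k values (Delta k = 0.1)

# Steps 1-3: estimate s(k) = lambda'(k) by the sample mean of S_n under p_k.
# For the Gaussian case p_k is N(mu + sigma^2 k, sigma^2), sampled directly.
s_hat = np.array([
    rng.normal(mu + sigma**2 * k, sigma, (L, n)).mean() for k in ks
])

# Step 4: integrate s_hat from k = 0 (trapezoidal rule), enforcing lambda(0) = 0.
lam_hat = np.concatenate(([0.0], np.cumsum(0.5 * (s_hat[1:] + s_hat[:-1]) * np.diff(ks))))
lam_hat -= lam_hat[np.argmin(np.abs(ks))]

# Step 5: Legendre transform gives I at s = s_hat(k).
I_hat = ks * s_hat - lam_hat

for k, s, I in zip(ks[::10], s_hat[::10], I_hat[::10]):
    print(f"k = {k:+.1f}: s = {s:+.3f}, I_hat = {I:.3f}, exact = {(s - mu)**2 / 2:.3f}")
```

Scanning k and plotting (ŝ_{L,n}(k), Î_{L,n}(s)) pairs then traces out the parametric estimate of the rate function, as in Fig. 5.1.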
The result of these steps is illustrated in Fig. 5.1 for our test case of the IID Gaussian sample mean, for which λ′(k) = μ + σ² k. Exercise 5.4.1 covers other sample means studied before (e.g., exponential, binary, etc.).

Note that for IID sample means, one does not need to generate L realizations of the whole sequence ω, but only L realizations of one summand X_i according to the tilted marginal pdf p_k(x) = e^{k x} p(x)/W(k). In this case, L takes the role of the large deviation parameter n. For Markov chains, realizations of the whole sequence ω must be generated one after the other, e.g., using the Metropolis algorithm.
5.2 Empirical generating functions

The last method that we cover is based on the estimation of the generating function

    W_n(k) = E[e^{n k S_n}] = ∫_{ℝⁿ} p(ω) e^{n k S_n(ω)} dω.    (5.7)

Consider first the case where S_n is an IID sample mean. Then λ(k) = ln W(k), where W(k) = E[e^{k X_i}], as we know from Sec. 3.1, and so the problem reduces to the estimation of the SCGF of the common pdf p(X) of the X_i's. An obvious estimator for this function is

    λ̂_L(k) = ln (1/L) Σ_{j=1}^L e^{k X^{(j)}},    (5.8)

where the X^{(j)}'s are IID samples drawn from p(X). From this estimator, which we call the empirical SCGF, we build a parametric estimator of the rate function I(s) of S_n using the Legendre transform of Eq. (5.4) with

    s(k) = λ̂′_L(k) = Σ_{j=1}^L X^{(j)} e^{k X^{(j)}} / Σ_{j=1}^L e^{k X^{(j)}}.    (5.9)

Fig. 5.2 shows the result of these estimators for the IID Gaussian sample mean. As can be seen, the convergence of Î_L(s) is fast in this case, but is limited to the central part of the rate function. This illustrates one limitation of the empirical SCGF method: for unbounded RVs, such as the Gaussian sample mean, W(k) is correctly recovered for large |k| only for large samples, i.e., large L, which implies that the tails of I(s) are also recovered only for large L. For bounded RVs, such as the Bernoulli sample mean, this limitation does not arise; see [14, 15] and Exercise 5.4.2.
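In Python, the estimators (5.8) and (5.9) take only a few lines (an illustrative sketch of my own for the Gaussian case; L and the scanned k values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, L = 1.0, 1.0, 10_000
X = rng.normal(mu, sigma, L)                   # IID samples from p(X)

def empirical_scgf(k):
    """Return lambda_hat_L(k) = ln mean(exp(kX)) and s(k), Eqs. (5.8)-(5.9)."""
    w = np.exp(k * X)
    lam = np.log(w.mean())
    s = np.sum(X * w) / np.sum(w)
    return lam, s

# Parametric estimate of I(s): I = k s(k) - lambda(k), scanned over k.
for k in [-1.0, 0.0, 1.0, 2.0]:
    lam, s = empirical_scgf(k)
    print(f"k = {k:+.1f}: s = {s:+.3f}, I_hat = {k * s - lam:.3f}, "
          f"exact = {(s - mu)**2 / 2:.3f}")
```

For larger |k| the exponential weights are dominated by the few largest samples, which is exactly the tail limitation for unbounded RVs discussed above.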
[Figure 5.2 about here]

Figure 5.2: (Top left) Empirical SCGF λ̂_L(k) for the Gaussian IID sample mean (μ = σ = 1), for L = 10, 100, 1000, and 10000. (Top right) s(k) = λ̂′_L(k). (Bottom) Estimate of I(s) obtained from λ̂_L(k). The dashed line in each figure is the exact result.

The application of the empirical SCGF method to sample means of ergodic Markov chains is relatively direct. In this case, although W_n(k) no
longer factorizes into W(k)ⁿ, it is possible to group the X_i's into blocks of b RVs as follows:

    Y_1 = X_1 + ⋯ + X_b,  Y_2 = X_{b+1} + ⋯ + X_{2b},  …,  Y_m = X_{n−b+1} + ⋯ + X_n,    (5.10)

where m = n/b, so as to rewrite the sample mean S_n as

    S_n = (1/n) Σ_{i=1}^n X_i = (1/(bm)) Σ_{i=1}^m Y_i.    (5.11)
If b is large enough, then the blocks Y_i can be treated as being independent. Moreover, if the Markov chain is ergodic, then the Y_i's should be identically distributed for i large enough. As a result, W_n(k) can be approximated for large n and large b (but b ≪ n) as

    W_n(k) = E[e^{k Σ_{i=1}^m Y_i}] ≈ E[e^{k Y_i}]^m,    (5.12)

so that

    λ_n(k) = (1/n) ln W_n(k) ≈ (m/n) ln E[e^{k Y_i}] = (1/b) ln E[e^{k Y_i}].    (5.13)
We are thus back to our original IID problem of estimating a generating function; the only difference is that we must perform the estimation at the level of the Y_i's instead of the X_i's. This means that our estimator for λ(k) is now

    λ̂_{L,n}(k) = (1/b) ln (1/L) Σ_{j=1}^L e^{k Y^{(j)}},    (5.14)

where Y^{(j)}, j = 1, …, L, are IID samples of the block RVs. Following the IID case, we then obtain an estimate of I(s) in the usual way using the Legendre transform of Eq. (5.4) or (5.5).

The application of these results to Markov chains and SDEs is covered in Exercises 5.4.2 and 5.4.3. For more information on the empirical SCGF method, including a more detailed analysis of its convergence, see [14, 15].
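The block strategy can be sketched as follows for a simple two-state (0/1) chain with a hypothetical switching probability (my own illustration; the values of b, m and k are arbitrary choices). For this small chain the exact SCGF is also available as the log of the dominant eigenvalue of the tilted transition matrix, which gives a direct check:

```python
import numpy as np

rng = np.random.default_rng(5)
flip, b, m = 0.3, 20, 5000            # switching probability, block size, number of blocks

# Generate one long trajectory of the two-state (0/1) Markov chain.
n = b * m
x = np.empty(n, dtype=int)
x[0] = 0
u = rng.random(n)
for i in range(1, n):
    x[i] = 1 - x[i - 1] if u[i] < flip else x[i - 1]

Y = x.reshape(m, b).sum(axis=1)       # block sums Y_1, ..., Y_m, Eq. (5.10)

def lam_hat(k):
    """Empirical SCGF (1/b) ln mean(exp(k Y)), Eq. (5.14)."""
    return np.log(np.mean(np.exp(k * Y))) / b

def lam_exact(k):
    """Log of the dominant eigenvalue of the tilted transition matrix."""
    Pi = np.array([[1 - flip, flip * np.exp(k)],
                   [flip, (1 - flip) * np.exp(k)]])
    return np.log(np.max(np.linalg.eigvals(Pi).real))

for k in [-0.2, 0.2]:
    print(f"k = {k:+.1f}: lam_hat = {lam_hat(k):.4f}, exact = {lam_exact(k):.4f}")
```

The residual discrepancy comes from the neglected correlations between blocks, a bias of order 1/b that disappears as b grows (at fixed b ≪ n).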
5.3 Other methods

We briefly describe below a number of other important methods used for estimating large deviation probabilities, and give for each of them some pointer references for interested readers. The fundamental ideas and concepts related to sampling and ECM that we have treated before also arise in these other methods.

• Splitting methods: Splitting or cloning methods are the most used sampling methods after Monte Carlo algorithms based on the ECM. The idea of splitting is to "grow" a sample or population {ω} of sequences or configurations ω in such a way that a given configuration ω appears in the population in proportion to its weight p_k(ω). This is done in some iterative fashion by "cloning" and "killing" some configurations. The result of this process is essentially an estimate for W_n(k), which is then used to obtain I(s) by Legendre transform. For mathematically-oriented introductions to splitting methods, see [33, 34, 9, 37]; for more physical introductions, see [23, 32, 46, 47].
• Optimal path method: We briefly mentioned in Sec. 3.4 that rate functions related to SDEs can often be derived by contraction of the action functional I[x]. The optimization problem that results from this contraction is often not solvable analytically, but there are several techniques stemming from optimization theory, classical mechanics, and dynamic programming that can be used to solve it. A typical optimization problem in this context is to find the optimal path that a noisy system follows with highest probability to reach a certain state considered to be a rare event or fluctuation. This optimal path is also called a transition path or a reactive trajectory, and can be found analytically only in certain special systems, such as linear SDEs and conservative systems; see Exercises 5.4.4–5.4.6. For nonlinear systems, the Lagrangian or Hamiltonian equations governing the dynamics of these paths must be simulated directly; see [26] and Sec. 6.1 of [48] for more details.
• Transition path method: The transition path method is essentially a Monte Carlo sampling method in the space of configurations ω (for discrete-time processes) or trajectories {x(t)}_{t=0}^T (for continuous-time processes and SDEs), which aims at finding, as the name suggests, transition paths or optimal paths. The contribution of Dellago in this volume covers this method in detail; see also [11, 12, 10, 36, 51]. A variant of the transition path method is the so-called string method [17], which evolves paths in conservative systems towards optimal paths connecting different local minima of the potential or landscape function of these systems.
• Eigenvalue method: We have noted that the SCGF of an ergodic continuous-time process is given by the dominant eigenvalue of its tilted generator. From this connection, one can attempt to numerically evaluate SCGFs using numerical approximation methods for linear functional operators, combined with numerical algorithms for finding eigenvalues. A recent application of this approach, based on the renormalization group, can be found in [24, 25].
5.4 Exercises
5.4.1. [25] (Sample mean method) Implement the steps of the sample mean method to numerically estimate the rate functions of all the IID sample means studied in these notes, including the sample mean of the Bernoulli Markov chain.
5.4.2. [25] (Empirical SCGF method) Repeat the previous exercise using the empirical SCGF method instead of the sample mean method. Explain why the former method converges quickly when S_n is bounded, e.g., in the Bernoulli case. Analyze in detail how the method converges when S_n is unbounded, e.g., for the exponential case. (Source: [14, 15].)
5.4.3. [30] (Sampling of SDEs) Generalize the sample mean and empirical SCGF methods to SDEs. Study the additive process of Exercise 4.7.16 as a test case.
5.4.4. [30] (Optimal paths for conservative systems) Consider the conservative system of Exercise 3.6.15 and assume, without loss of generality, that the global minimum of the potential U(x) is at x = 0. Show that the optimal path which brings the system from x(0) = 0 to some fluctuation x(T) = x ≠ 0 after a time T is the time-reversal of the solution of the noiseless dynamics ẋ = −∇U(x) under the terminal conditions x(0) = x and x(T) = 0. The latter path is often called the decay path.
5.4.5. [25] (Optimal path for additive processes) Consider the additive process S_T of Exercise 4.7.16 under the Langevin dynamics of Exercise 3.6.14. Show that the optimal path leading to the fluctuation S_T = s is the constant path

    x(t) = s,  t ∈ [0, T].    (5.15)

Explain why this result is the same as the asymptotic solution of the boosted SDE of Eq. (4.50).
5.4.6. [25] (Optimal path for the dragged Brownian particle) Apply the previous exercise to Exercise 3.6.18. Discuss the physical interpretation of the results.
Acknowledgments

I would like to thank Rosemary J. Harris for suggesting many improvements to these notes, as well as Alexander Hartmann and Reinhard Leidl for the invitation to participate in the 2011 Oldenburg Summer School.
A Metropolis–Hastings algorithm

The Metropolis–Hastings algorithm,³³ or Markov chain Monte Carlo algorithm, is a simple method used to generate a sequence x_1, …, x_n of variates whose empirical distribution L_n(x) converges asymptotically to a target pdf p(x) as n → ∞. The variates are generated sequentially in the manner of a Markov chain as follows:

1. Given the value x_i = x, choose another value x′, called the proposal or move, according to some (fixed) conditional pdf q(x′|x), called the proposal pdf;

2. Accept the move x′, i.e., set x_{i+1} = x′, with probability min{1, a}, where

    a = p(x′) q(x|x′) / (p(x) q(x′|x));    (A.1)

3. If the move is not accepted, set x_{i+1} = x.

³³ Or, more precisely, the Metropolis–Rosenbluth–Rosenbluth–Teller–Teller–Hastings algorithm.
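A minimal Python rendering of these three steps, using a symmetric Gaussian proposal so that a reduces to p(x′)/p(x) (an illustrative sketch of my own; the target pdf, proposal width and run length are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(6)

def metropolis(log_p, x0, n, step=1.0):
    """Generate n variates targeting the pdf proportional to exp(log_p)."""
    xs = np.empty(n)
    x = x0
    for i in range(n):
        x_new = x + rng.normal(0.0, step)      # symmetric proposal q(x'|x)
        # accept with probability min(1, a), a = p(x')/p(x)
        if np.log(rng.random()) < log_p(x_new) - log_p(x):
            x = x_new
        xs[i] = x
    return xs

# Example: sample the tilted Gaussian p_k(x) proportional to exp(k x) exp(-x^2/2),
# i.e. N(k, 1), without ever computing the normalization W(k).
k = 2.0
xs = metropolis(lambda x: k * x - 0.5 * x**2, x0=0.0, n=50_000)
print(f"sample mean = {xs.mean():.3f} (target {k}), sample var = {xs.var():.3f} (target 1)")
```

Only the unnormalized log-density enters the acceptance test, which is precisely why this algorithm can sample p_k without knowledge of W(k).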
This algorithm is very practical for sampling pdfs having an exponential
form, such as the Boltzmann-Gibbs distribution or the tilted pdf p_k, because
it requires only the computation of the acceptance ratio a for two values x
and x′, and so does not require the knowledge of the normalization constant
that usually enters in these pdfs. For the tilted pdf p_k(ω), for example,

a = p_k(ω′) q(ω|ω′) / [p_k(ω) q(ω′|ω)]
  = e^{nkS_n(ω′)} p(ω′) q(ω|ω′) / [e^{nkS_n(ω)} p(ω) q(ω′|ω)].  (A.2)
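For instance, when ω = (x_1, ..., x_n) is an IID sequence with marginal p and S_n is the sample mean, a move that resamples a single coordinate x_j from p itself makes all the p and q factors in Eq. (A.2) cancel, leaving a = e^{k(x_j′ − x_j)}. The following is a hedged Python sketch of this special case (standard Gaussian marginals are assumed for illustration, in which case each tilted marginal is known to be N(k, 1)):

```python
import math
import random

def sample_tilted(k, n, sweeps, seed=0):
    """Metropolis-Hastings sampling of the tilted pdf
    p_k(w) ∝ exp(n k S_n(w)) p(w) for an IID standard-Gaussian sequence
    w = (x_1, ..., x_n), with S_n the sample mean.  A move resamples one
    coordinate from p, so the p and q factors in (A.2) cancel and the
    acceptance ratio reduces to a = exp(k (x_j' - x_j))."""
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    trace = []
    for _ in range(sweeps):
        for j in range(n):
            x_prop = rng.gauss(0.0, 1.0)       # proposal drawn from p itself
            a = math.exp(k * (x_prop - w[j]))  # simplified ratio (A.2)
            if rng.random() < min(1.0, a):
                w[j] = x_prop
        trace.append(sum(w) / n)               # record S_n after each sweep
    return trace

# For Gaussian marginals the tilted marginal is N(k, 1), so S_n should
# fluctuate around k once the chain has relaxed.
trace = sample_tilted(k=1.0, n=50, sweeps=500)
s_mean = sum(trace[100:]) / len(trace[100:])
```

Discarding the first sweeps as burn-in, the recorded values of S_n concentrate near k, which is the typical value of the sample mean under the tilted pdf in this Gaussian example.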
The original Metropolis algorithm, commonly used in statistical physics,
is obtained by choosing q(x′|x) to be symmetric, i.e., q(x′|x) = q(x|x′) for
all x and x′, in which case a = p(x′)/p(x). If q(x′|x) does not depend on x, the
algorithm is referred to as the independent chain Metropolis-Hastings
algorithm.
For background information on Monte Carlo algorithms, applications,
and technical issues such as mixing, correlation and convergence, see [31]
and Chap. 13 of [2].
References
[1] S. Asmussen, P. Dupuis, R. Rubinstein, and H. Wang. Importance
sampling for rare events. In S. Gass and M. Fu, editors, Encyclopedia
of Operations Research and Management Sciences. Kluwer, 3rd edition,
2011.
[2] S. Asmussen and P. W. Glynn. Stochastic Simulation: Algorithms and
Analysis. Stochastic Modelling and Applied Probability. Springer, New
York, 2007.
[3] C. M. Bender and S. A. Orszag. Advanced Mathematical Methods for
Scientists and Engineers. McGraw-Hill, New York, 1978.
[4] J. A. Bucklew. Introduction to Rare Event Simulation. Springer, New
York, 2004.
[5] J. A. Bucklew, P. Ney, and J. S. Sadowsky. Monte Carlo simulation and
large deviations theory for uniformly recurrent Markov chains. J. Appl.
Prob., 27(1):44–59, 1990.
[6] R. Chetrite, G. Falkovich, and K. Gawedzki. Fluctuation relations
in simple examples of non-equilibrium steady states. J. Stat. Mech.,
2008(08):P08005, 2008.
[7] R. Chetrite and K. Gawedzki. Fluctuation relations for diffusion
processes. Comm. Math. Phys., 282(2):469–518, 2008.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John
Wiley, New York, 1991.
[9] T. Dean and P. Dupuis. Splitting for rare event simulation: A large
deviation approach to design and analysis. Stoch. Proc. Appl., 119(2):562–587,
2009.
[10] C. Dellago and P. Bolhuis. Transition path sampling and other advanced
simulation techniques for rare events. In C. Holm and K. Kremer, editors,
Advanced Computer Simulation Approaches for Soft Matter Sciences III,
volume 221 of Advances in Polymer Science, pages 167–233. Springer,
Berlin, 2009.
[11] C. Dellago, P. G. Bolhuis, and P. L. Geissler. Transition path sampling.
Adv. Chem. Phys., 123:1–78, 2003.
[12] C. Dellago, P. G. Bolhuis, and P. L. Geissler. Transition path sampling
methods. In M. Ferrario, G. Ciccotti, and K. Binder, editors, Computer
Simulations in Condensed Matter Systems: From Materials to Chemical
Biology Volume 1, volume 703 of Lecture Notes in Physics. Springer,
New York, 2006.
[13] B. Derrida. Non-equilibrium steady states: Fluctuations and large devia-
tions of the density and of the current. J. Stat. Mech., 2007(07):P07023,
2007.
[14] N. G. Duffield, J. T. Lewis, N. O'Connell, R. Russell, and F. Toomey.
Entropy of ATM traffic streams: A tool for estimating QoS parameters.
IEEE J. Select. Areas Comm., 13(6):981–990, 1995.
[15] K. Duffy and A. P. Metcalfe. The large deviations of estimating rate
functions. J. Appl. Prob., 42(1):267–274, 2005.
[16] P. Dupuis and R. S. Ellis. A Weak Convergence Approach to the Theory
of Large Deviations. Wiley Series in Probability and Statistics. John
Wiley, New York, 1997.
[17] W. E, W. Ren, and E. Vanden-Eijnden. String method for the study of
rare events. Phys. Rev. B, 66(5):052301, 2002.
[18] R. S. Ellis. Large deviations for a general class of random vectors. Ann.
Prob., 12(1):1–12, 1984.
[19] R. S. Ellis. Entropy, Large Deviations, and Statistical Mechanics.
Springer, New York, 1985.
[20] R. S. Ellis. The theory of large deviations: From Boltzmann's 1877
calculation to equilibrium macrostates in 2D turbulence. Physica D,
133:106–136, 1999.
[21] M. I. Freidlin and A. D. Wentzell. Random Perturbations of Dynamical
Systems, volume 260 of Grundlehren der Mathematischen Wissenschaften.
Springer-Verlag, New York, 1984.
[22] J. Gärtner. On large deviations from the invariant measure. Th. Prob.
Appl., 22:24–39, 1977.
[23] C. Giardinà, J. Kurchan, and L. Peliti. Direct evaluation of large-
deviation functions. Phys. Rev. Lett., 96(12):120603, 2006.
[24] M. Gorissen, J. Hooyberghs, and C. Vanderzande. Density-matrix
renormalization-group study of current and activity fluctuations near
nonequilibrium phase transitions. Phys. Rev. E, 79:020101, 2009.
[25] M. Gorissen and C. Vanderzande. Finite size scaling of current fluctuations
in the totally asymmetric exclusion process. J. Phys. A: Math.
Theor., 44(11):115005, 2011.
[26] R. Graham. Macroscopic potentials, bifurcations and noise in dissipative
systems. In F. Moss and P. V. E. McClintock, editors, Noise in
Nonlinear Dynamical Systems, volume 1, pages 225–278, Cambridge,
1989. Cambridge University Press.
[27] G. Grimmett and D. Stirzaker. Probability and Random Processes.
Oxford University Press, Oxford, 2001.
[28] R. J. Harris, A. Rákos, and G. M. Schütz. Current fluctuations
in the zero-range process with open boundaries. J. Stat. Mech.,
2005(08):P08003, 2005.
[29] R. J. Harris, A. Rákos, and G. M. Schütz. Breakdown of Gallavotti-
Cohen symmetry for stochastic dynamics. Europhys. Lett., 75:227–233,
2006.
[30] R. J. Harris and G. M. Schütz. Fluctuation theorems for stochastic
dynamics. J. Stat. Mech., 2007(07):P07020, 2007.
[31] W. Krauth. Statistical Mechanics: Algorithms and Computations. Ox-
ford Master Series in Statistical, Computational, and Theoretical Physics.
Oxford University Press, Oxford, 2006.
[32] V. Lecomte and J. Tailleur. A numerical approach to large deviations
in continuous time. J. Stat. Mech., 2007(03):P03004, 2007.
[33] P. L'Ecuyer, V. Demers, and B. Tuffin. Rare events, splitting, and quasi-
Monte Carlo. ACM Trans. Model. Comput. Simul., 17(2):Article 9, 2007.
[34] P. L'Ecuyer, F. Le Gland, P. Lezaud, and B. Tuffin. Splitting techniques.
In G. Rubino and B. Tuffin, editors, Rare Event Simulation Using Monte
Carlo Methods, pages 39–62. Wiley, New York, 2009.
[35] M. Löwe. Iterated large deviations. Stat. Prob. Lett., 26(3):219–223,
1996.
[36] P. Metzner, C. Schütte, and E. Vanden-Eijnden. Illustration of transition
path theory on a collection of simple examples. J. Chem. Phys.,
125(8):084110, 2006.
[37] J. Morio, R. Pastel, and F. Le Gland. An overview of importance
splitting for rare event simulation. Eur. J. Phys., 31(5), 2010.
[38] A. Rákos and R. J. Harris. On the range of validity of the fluctuation
theorem for stochastic Markovian dynamics. J. Stat. Mech.,
2008(05):P05005, 2008.
[39] D. Ruelle. Chance and Chaos. Penguin Books, London, 1993.
[40] D. Ruelle. Thermodynamic Formalism. Cambridge University Press,
Cambridge, 2nd edition, 2004.
[41] J. S. Sadowsky. On Monte Carlo estimation of large deviations probabilities.
Ann. Appl. Prob., 6(2):399–422, 1996.
[42] J. S. Sadowsky and J. A. Bucklew. Large deviations theory techniques
in Monte Carlo simulation. In E. A. MacNair, K. J. Musselman, and
P. Heidelberger, editors, Proceedings of the 1989 Winter Simulation
Conference, pages 505–513, New York, 1989. ACM.
[43] J. S. Sadowsky and J. A. Bucklew. On large deviations theory and
asymptotically efficient Monte Carlo estimation. IEEE Trans. Info. Th.,
36:579–588, 1990.
[44] I. N. Sanov. On the probability of large deviations of random variables.
In Select. Transl. Math. Statist. and Probability, Vol. 1, pages 213–244.
Inst. Math. Statist. and Amer. Math. Soc., Providence, R.I., 1961.
[45] H. Spohn. Large Scale Dynamics of Interacting Particles. Springer
Verlag, Heidelberg, 1991.
[46] J. Tailleur. Grandes déviations, physique statistique et systèmes dynamiques.
PhD thesis, Université Pierre et Marie Curie, Paris, 2007.
[47] J. Tailleur and V. Lecomte. Simulation of large deviation functions using
population dynamics. In J. Marro, P. L. Garrido, and P. I. Hurtado,
editors, Modeling and Simulation of New Materials: Proceedings of
Modeling and Simulation of New Materials, volume 1091, pages 212–219,
Melville, NY, 2009. AIP.
[48] H. Touchette. The large deviation approach to statistical mechanics.
Phys. Rep., 478(1–3):1–69, 2009.
[49] N. G. van Kampen. Stochastic Processes in Physics and Chemistry.
North-Holland, Amsterdam, 1992.
[50] R. van Zon and E. G. D. Cohen. Stationary and transient work-fluctuation
theorems for a dragged Brownian particle. Phys. Rev. E,
67:046102, 2003.
[51] E. Vanden-Eijnden. Transition path theory. In M. Ferrario, G. Ciccotti,
and K. Binder, editors, Computer Simulations in Condensed Matter
Systems: From Materials to Chemical Biology Volume 1, volume 703 of
Lecture Notes in Physics, pages 453–493. Springer, 2006.
[52] S. R. S. Varadhan. Asymptotic probabilities and differential equations.
Comm. Pure Appl. Math., 19:261–286, 1966.
Errata

I list below some errors contained in the published version of these notes and
in the versions (v1 and v2) previously posted on the arXiv.

The numbering of the equations in the published version is different from the
version posted on the arXiv. The corrections listed here refer to the arXiv
version.

I thank Bernhard Altaner and Andreas Engel for reporting errors. I
also thank Vivien Lecomte for a discussion that led me to spot the errors of
Sections 4.4 and 4.5.
Corrections to arxiv version 1
p. 15, in Eq. (2.35): The correct result is α^{sn}(1 − α)^{(1−s)n}. [Noted by
B. Altaner]
p. 16, in Ex. 2.7.8: "λ(k) has a continuous derivative" added.
p. 20, in Eq. (3.15): The limit T → 0 should be T → ∞. [Noted by B.
Altaner]
p. 25, in Ex. 3.6.2: Y ∈ {−1, +1} with P(Y = 1) = P(Y = −1) = 1/2
instead of Bernoulli.
p. 25, in Ex. 3.6.5: The RVs are now specified to be Gaussian RVs.
Corrections to arxiv version 2
p. 1, in the abstract: missing "s" added to "introduce".
p. 16, in Ex. 2.7.8: The conditions on λ(k) stated in the exercise are
again wrong: λ(k) must be differentiable and must not have affine
parts to ensure that λ′(k) = s has a unique solution.


p. 20, in Eq. (3.17): The tilted generator is missing an identity matrix;
it should read

G_k = G + kx δ_{x,x′}.

[Noted by A. Engel] Part of Section 3.3 and Eq. (3.17) was rewritten
to emphasize where this identity matrix term is coming from.
p. 27, in Eq. (3.40): The generator should have the off-diagonal elements
exchanged. Idem for the matrix shown in Eq. (3.45). [Noted by A.
Engel]
These notes represent probability vectors as column vectors. Therefore,
the columns, not the rows, of stochastic matrices must sum to one,
which means that the columns, not the rows, of stochastic generators
must sum to zero.
p. 27, in Ex. 3.6.13: It should be mentioned that f(x_i, x_{i+1}) is chosen
as f(x_i, x_{i+1}) = 1_{x_i ≠ x_{i+1}}. [Noted by A. Engel]
p. 36, in Section 4.4: Although the tilted pdf p_k(ω) of a Markov chain
has a product structure reminiscent of a Markov chain, it cannot be
described as a Markov chain, contrary to what is written. In particular,
the matrix shown in Eq. (4.24) is not a proper stochastic matrix with
columns summing to one. For that, we should define the matrix

π_k(x′|x) = e^{kx′} π(x′|x) / W(k|x),  W(k|x) = ∫ e^{kx′} π(x′|x) dx′.

However, the pdf of ω constructed from this transition matrix is
different from the tilted joint pdf p_k(ω). Section 4.4 was rewritten to
correct this error.

This section is not entirely wrong: the Metropolis sampling can still
be done at the level of p_k by generating IID sequences ω^{(j)}, as now
described.
p. 37, in Section 4.5: The error of Section 4.4 is repeated here at
the level of generators: the G_k shown in Eq. (4.25) is not a proper
stochastic generator with columns summing to zero. Section 4.5 was
completely rewritten to correct this error.
p. 42, Ex. 4.7.14: This exercise refers to the tilted joint pdf p_k(ω);
there is no tilted transition matrix here. The exercise was merged with
the next one to correct this error.