
Introduction

to
Numerical Probability for Finance
—–
2016-17
—–

Gilles Pagès
LPMA-Université Pierre et Marie Curie

E-mail: [email protected]

http://www.proba.jussieu.fr/pageperso/pages

first draft: April 2007


this draft: 01.09.2016
Contents

1 Simulation of random variables 9


1.1 Pseudo-random numbers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 The fundamental principle of simulation . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.3 The (inverse) distribution function method . . . . . . . . . . . . . . . . . . . . . . . 12
1.4 The acceptance-rejection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Simulation of Poisson distributions (and Poisson processes) . . . . . . . . . . . . . . 24
1.6 The Box-Müller method for normally distributed random vectors . . . . . . . . . . . 26
1.6.1 d-dimensional normally distributed random vectors . . . . . . . . . . . . . . . 26
1.6.2 Correlated d-dimensional Gaussian vectors, Gaussian processes . . . . . . . . 27

2 Monte Carlo method and applications to option pricing 31


2.1 The Monte Carlo method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.2 Vanilla option pricing in a Black-Scholes model: the premium . . . . . . . . . . . . . 36
2.3 Greeks (sensitivity to the option parameters): a first approach . . . . . . . . . . . . 39
2.3.1 Background on differentiation of function defined by an integral . . . . . . . . 39
2.3.2 Working on the scenarii space (Black-Scholes model) . . . . . . . . . . . . . . 41
2.3.3 Direct differentiation on the state space: the log-likelihood method . . . . . . 45
2.3.4 The tangent process method . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

3 Variance reduction 49
3.1 The Monte Carlo method revisited: static control variate . . . . . . . . . . . . . . . 49
3.1.1 Jensen inequality and variance reduction . . . . . . . . . . . . . . . . . . . . . 51
3.1.2 Negatively correlated variables, antithetic method . . . . . . . . . . . . . . . 57
3.2 Regression based control variate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.1 Optimal mean square control variate . . . . . . . . . . . . . . . . . . . . . . . 60
3.2.2 Implementation of the variance reduction: batch vs adaptive . . . . . . . . . 61
3.3 Application to option pricing: using parity equations to produce control variates . . 66
3.3.1 Complexity aspects in the general case . . . . . . . . . . . . . . . . . . . . . . 68
3.3.2 Examples of numerical simulations . . . . . . . . . . . . . . . . . . . . . . . . 68
3.3.3 Multidimensional case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.4 Pre-conditioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.5 Stratified sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.6 Importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
3.6.1 The abstract paradigm of important sampling . . . . . . . . . . . . . . . . . . 78


3.6.2 How to design and implement importance sampling . . . . . . . . . . . . . . 80


3.6.3 Parametric importance sampling . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6.4 Computing the Value-at-Risk by Monte Carlo simulation: first approach. . . 84

4 The Quasi-Monte Carlo method 87


4.1 Motivation and definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.2 Application to numerical integration: functions with finite variations . . . . . . . . . 92
4.3 Sequences with low discrepancy: definition(s) and examples . . . . . . . . . . . . . . 95
4.3.1 Back again to Monte Carlo method on [0, 1]d . . . . . . . . . . . . . . . . . . 95
4.3.2 Roth’s lower bounds for the discrepancy . . . . . . . . . . . . . . . . . . . . . 96
4.3.3 Examples of sequences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.3.4 The Hammersley procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.3.5 Pros and Cons of sequences with low discrepancy . . . . . . . . . . . . . . . . 102
4.4 Randomized QMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.5 QMC in unbounded dimension: the acceptance rejection method . . . . . . . . . . . 109
4.6 Quasi-Stochastic Approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5 Optimal Quantization methods (cubatures) I 113


5.1 Theoretical background on vector quantization . . . . . . . . . . . . . . . . . . . . . 113
5.2 Cubature formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.1 Lipschitz continuous functionals . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.2.2 Convex functionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2.3 Differentiable functionals with Lipschitz continuous differentials . . . . . . . . 119
5.2.4 Quantized approximation of E(F (X) | Y ) . . . . . . . . . . . . . . . . . . . . 120
5.3 How to get optimal quantization? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.3.1 Competitive Learning Vector Quantization algorithm . . . . . . . . . . . . . . 121
5.3.2 Randomized Lloyd’s I procedure . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.4 Numerical integration (II): Richardson-Romberg extrapolation . . . . . . . . . . . . 121
5.5 Hybrid quantization-Monte Carlo methods . . . . . . . . . . . . . . . . . . . . . . . . 123
5.5.1 Optimal quantization as a control variate . . . . . . . . . . . . . . . . . . . . 125
5.5.2 Universal stratified sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.5.3 An (optimal) quantization based universal stratification: a minimax approach 127

6 Stochastic approximation and applications to finance 131


6.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.2 Typical a.s. convergence results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
6.3 Applications to Finance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.3.1 Application to recursive variance reduction by importance sampling . . . . . 140
6.3.2 Application to implicit correlation search . . . . . . . . . . . . . . . . . . . . 148
6.3.3 Application to correlation search (II): reducing variance using higher order
Richardson-Romberg extrapolation . . . . . . . . . . . . . . . . . . . . . . . . 152
6.3.4 The paradigm of model calibration by simulation . . . . . . . . . . . . . . . . 154
6.3.5 Recursive computation of the V @R and the CV @R (I) . . . . . . . . . . . . 159
6.3.6 Numerical methods for Optimal Quantization . . . . . . . . . . . . . . . . . . 165
6.4 Further results on Stochastic Approximation . . . . . . . . . . . . . . . . . . . . . . 170

6.4.1 The ODE method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170


6.4.2 L2 -rate of convergence and application to convex optimization . . . . . . . . 178
6.4.3 Weak rate of convergence: CLT . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.4.4 The averaging principle for stochastic approximation . . . . . . . . . . . . . . 190
6.4.5 Traps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
6.4.6 (Back to) V @Rα and CV @Rα computation (II): weak rate . . . . . . . . . . 194
6.4.7 V @Rα and CV @Rα computation (III) . . . . . . . . . . . . . . . . . . . . . . 195
6.5 From Quasi-Monte Carlo to Quasi-Stochastic Approximation . . . . . . . . . . . . . 196
6.5.1 Further readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201

7 Discretization scheme(s) of a Brownian diffusion 203


7.1 Euler-Maruyama schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
7.1.1 Discrete time and stepwise constant Euler scheme(s) . . . . . . . . . . . . . . 204
7.1.2 Genuine (continuous) Euler scheme . . . . . . . . . . . . . . . . . . . . . . . . 205
7.2 Strong error rate and polynomial moments (I) . . . . . . . . . . . . . . . . . . . . . . 205
7.2.1 Main results and comments . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
7.2.2 Proofs in the quadratic Lipschitz case for homogeneous diffusions . . . . . . . 211
7.3 Non asymptotic deviation inequalities for the Euler scheme . . . . . . . . . . . . . . 216
7.4 Pricing path-dependent options (I) (Lookback, Asian, etc) . . . . . . . . . . . . . . . 224
7.5 Milstein scheme (looking for better strong rates. . . ) . . . . . . . . . . . . . . . . . . . 225
7.5.1 The 1-dimensional setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
7.5.2 Higher dimensional Milstein scheme . . . . . . . . . . . . . . . . . . . . . . . 228
7.6 Weak error for the Euler scheme (I) . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
7.6.1 Main results for E f (XT ): the Talay-Tubaro and Bally-Talay Theorems . . . 231
7.7 Standard and multistep Richardson-Romberg extrapolation . . . . . . . . . . . . . . 236
7.7.1 Richardson-Romberg extrapolation with consistent increments . . . . . . . . 236
7.7.2 Toward a multistep Richardson-Romberg extrapolation . . . . . . . . . . . . 238
7.8 Further proofs and results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.8.1 Some useful inequalities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.8.2 Polynomial moments (II) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
7.8.3 Lp -pathwise regularity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
7.8.4 Lp -converge rate (II): proof of Theorem 7.2 . . . . . . . . . . . . . . . . . . . 245
7.8.5 The stepwise constant Euler scheme . . . . . . . . . . . . . . . . . . . . . . . 247
7.8.6 Application to the a.s.-convergence of the Euler schemes and its rate. . . . . 248
7.8.7 Flow of an SDE, Lipschitz continuous regularity . . . . . . . . . . . . . . . . 250
7.8.8 Strong error rate for the Milstein scheme: proof of Theorem 7.5 . . . . . . . . 251
7.8.9 Weak error expansion for the Euler scheme by the PDE method . . . . . . . 255
7.8.10 Toward higher order expansion . . . . . . . . . . . . . . . . . . . . . . . . . . 260

8 The diffusion bridge method : application to the pricing of path-dependent


options (II) 263
8.1 Theoretical results about time discretization and path-dependent payoffs . . . . . . 263
8.2 From Brownian to diffusion bridge: how to simulate functionals of the genuine Euler
scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
8.2.1 The Brownian bridge method . . . . . . . . . . . . . . . . . . . . . . . . . . . 265

8.2.2 The diffusion bridge (bridge of the Euler scheme) . . . . . . . . . . . . . . . . 267


8.2.3 Application to Lookback like path dependent options . . . . . . . . . . . . . . 269
8.2.4 Application to regular barrier options: variance reduction by pre-conditioning 272
8.2.5 Weak errors and Richardson-Romberg extrapolation for path-dependent options . 273
8.2.6 The case of Asian options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

9 Back to sensitivity computation 281


9.1 Finite difference method(s) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
9.1.1 The constant step approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282
9.1.2 A recursive approach: finite difference with decreasing step . . . . . . . . . . 288
9.2 Pathwise differentiation method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
9.2.1 (Temporary) abstract point of view . . . . . . . . . . . . . . . . . . . . . . . . 291
9.2.2 Tangent process of a diffusion and application to sensitivity computation . . 292
9.3 Sensitivity computation for non smooth payoffs . . . . . . . . . . . . . . . . . . . . . 295
9.3.1 The log-likelihood approach (II) . . . . . . . . . . . . . . . . . . . . . . . . . 295
9.4 A flavour of stochastic variational calculus: from Bismut to Malliavin . . . . . . . . 297
9.4.1 Bismut’s formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
9.4.2 The Haussman-Clark-Occone formula: toward Malliavin calculus . . . . . . . 301
9.4.3 Toward practical implementation: the paradigm of localization . . . . . . . . 303
9.4.4 Numerical illustration: what localization is useful for (with V. Lemaire) . . . 304

10 Multi-asset American/Bermuda Options, swing options 309


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
10.2 Time discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 310
10.3 A generic discrete time Markov chain model . . . . . . . . . . . . . . . . . . . . . . . 314
10.3.1 Principle of the regression method . . . . . . . . . . . . . . . . . . . . . . . . 316
10.4 Quantization (tree) methods (II) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
10.4.1 Approximation by a quantization tree . . . . . . . . . . . . . . . . . . . . . . 320
10.4.2 Error bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 320
10.4.3 Background on optimal quantization . . . . . . . . . . . . . . . . . . . . . . 323
10.4.4 Implementation of a quantization tree method . . . . . . . . . . . . . . . . . 323
10.4.5 How to improve convergence performances? . . . . . . . . . . . . . . . . . . . 326
10.4.6 Numerical optimization of the grids: Gaussian and non-Gaussian vectors . . 327
10.4.7 The case of normal distribution N (0; Id ) on Rd , d ≥ 1 . . . . . . . . . . . . . 327

11 Miscellany 331
11.1 More on the normal distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
11.2 Characteristic function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 331
11.3 Numerical approximation of the distribution function . . . . . . . . . . . . . . . . . . 331
11.4 Table of the distribution function of the normal distribution . . . . . . . . . . . . . . 332
11.5 Uniform integrability as a domination property . . . . . . . . . . . . . . . . . . . . . 333
11.6 Interchanging. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 336
11.7 Measure Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
11.8 Weak convergence of probability measures on a Polish space . . . . . . . . . . . . . . 337
11.9 Martingale Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
Notations

 General notations
• ⌊x⌋ denotes the integral part of the real number x, i.e. the greatest integer not greater than x; {x} = x − ⌊x⌋ denotes its fractional part.

• The notation u ∈ Kd will denote the column vector u of the vector space Kd , K = R or C.
The row vector will be denoted u∗ or t u.
• (u|v) = Σ_{1≤i≤d} u^i v^i denotes the canonical inner product of the vectors u = (u^1, . . . , u^d) and v = (v^1, . . . , v^d) of R^d.

• M(d, q, K) will denote the vector space of matrices with d rows and q columns with K-valued
entries.

• The transpose of a matrix A will be denoted A∗ or At .

• Cb(S, Rd) := {f : (S, δ) → Rd, continuous and bounded}, where (S, δ) denotes a metric space.


• For a function f : (Rd, | · |) → (Rp, | · |), set

    [f]Lip = sup_{x≠y} |f(x) − f(y)| / |x − y|,

and f is Lipschitz continuous with coefficient [f]Lip if [f]Lip < +∞.

• An assertion P(x) depending on a generic element x of a measured space (E, E, µ) is true µ-almost everywhere (denoted µ-a.e.) if it is true outside a µ-negligible set of E.
 Probabilistic notations
• The distribution on (Rd , Bor(Rd )) of a random vector X : (Ω, A, P) → Rd is denoted PX .
• X =d Y stands for equality in distribution of the random vectors X and Y.
• −→L denotes the convergence in distribution of random vectors (i.e. the weak convergence of their distributions).
• L^p_{Rd}(Ω, A, µ) = {f : (Ω, A) → Rd s.t. ∫ |f|^p dµ < +∞}, which does not depend upon the selected norm on Rd.

Chapter 1

Simulation of random variables

1.1 Pseudo-random numbers


From a mathematical point of view, the definition of a sequence of (uniformly distributed) random
numbers (over the unit interval [0, 1]) should be:

“Definition.” A sequence xn , n ≥ 1, of [0, 1]-valued real numbers is a sequence of random numbers


if there exists a probability space (Ω, A, P), a sequence Un , n ≥ 1, of i.i.d. random variables with
uniform distribution U([0, 1]) and ω ∈ Ω such that xn = Un (ω) for every n ≥ 1.

But this naive and abstract definition is not satisfactory because the “scenario” ω ∈ Ω may not be a “good” one, i.e. not a “generic” one. Many probabilistic properties (like the law of large numbers, to quote the most basic one) are only satisfied P-a.s. Thus, if ω happens to lie in the negligible set on which one of them fails, the induced sequence will not be “admissible”. In any case, one usually cannot have access to an i.i.d. sequence of random variables (Un) with distribution U([0, 1]): any physical device would be too slow and not reliable enough. Works by logicians like Martin-Löf lead one to consider that a sequence (xn) which can be generated by an algorithm cannot be considered “random”. Thus the digits of π are not random in that sense. This is quite embarrassing since an essential requested feature of such sequences is that they can be generated almost instantly on a computer!
The approach coming from computer and algorithmic science is not really more tractable, since there a sequence of random numbers is defined as one for which the complexity of an algorithm generating its first n terms behaves like O(n). The rapidly growing need for good (pseudo-)random sequences, following the explosion of Monte Carlo simulation in many fields of Science and Technology (I mean not only neutronics) after World War II, led to the adoption of a more pragmatic, say heuristic, approach based on statistical tests: candidate sequences are submitted to statistical tests (uniform distribution, block non-correlation, rank tests, etc.).
For practical implementation, such sequences are finite, as is the accuracy of computers. One considers sequences (xn) of so-called pseudo-random numbers defined by

    xn = yn/N,   yn ∈ {0, . . . , N − 1}.

One classical procedure is to generate the integers yn by a congruential induction:

    yn+1 ≡ a yn + b  mod N
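Such a congruential recursion is straightforward to sketch in a few lines. The snippet below is an illustration only, not a recommended generator; the parameters are the homogeneous (b = 0) choice N = 2^31 − 1, a = 7^5 discussed later in this section, and the seed is arbitrary.

```python
def lcg(seed, a, b, N):
    """Congruential generator: y_{n+1} = (a*y_n + b) mod N, x_n = y_n / N."""
    y = seed
    while True:
        y = (a * y + b) % N
        yield y / N

# Illustrative parameters only: N = 2^31 - 1 (a Mersenne prime) and a = 7^5,
# the homogeneous (b = 0) choice mentioned below; the seed is arbitrary.
gen = lcg(seed=1, a=7**5, b=0, N=2**31 - 1)
sample = [next(gen) for _ in range(5)]
```

Each draw costs one multiplication and one reduction modulo N, which is why such generators are fast; whether the output is statistically acceptable is precisely the question discussed in the rest of the section.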


where gcd(a, N) = 1, so that ā (the class of a modulo N) is invertible for the multiplication (modulo N). Let (Z/NZ)∗ denote the set of such invertible classes (modulo N). The multiplication of classes (modulo N) is an internal law on (Z/NZ)∗, and ((Z/NZ)∗, ×) is a commutative group for this operation. By the very definition of the Euler function ϕ(N) as the number of integers a in {0, . . . , N − 1} such that gcd(a, N) = 1, the cardinality of (Z/NZ)∗ is equal to ϕ(N). Let us recall
that the Euler function is multiplicative and given by the following closed formula:

    ϕ(N) = N ∏_{p | N, p prime} (1 − 1/p).

Thus ϕ(p) = p − 1 for every prime p and ϕ(p^k) = p^k − p^{k−1} for primary numbers p^k, etc. In particular, if N = p is prime, this shows that (Z/pZ)∗ = (Z/pZ) \ {0̄}.
If b = 0 (the most common case), one speaks of homogeneous generator. We will focus on this type
of generators in what follows.
 Homogeneous congruential generators. When b = 0, the period of the sequence (yn) is given by the multiplicative order of a in ((Z/NZ)∗, ×), i.e.

    τa := min{k ≥ 1 | a^k ≡ 1 mod N} = min{k ≥ 1 | ā^k = 1̄}.

Moreover, we know by Lagrange’s Theorem that τa divides the cardinality ϕ(N ) of (Z/N Z)∗ .

For pseudo-random number simulation purposes, we search for pairs (N, a) such that the period τa of a in ((Z/NZ)∗, ×) is very large. This requires an in-depth study of the multiplicative groups ((Z/NZ)∗, ×), keeping in mind that N itself should be large to allow a to have a large period. This suggests focusing on prime or primary numbers N since, as seen above, their Euler function is itself large.
In fact, the structure of these groups has been elucidated for a long time and we sum up these results below.

Theorem 1.1 Let N = p^α, p prime, α ∈ N∗, be a primary number.

(a) If α = 1 (i.e. N = p prime), then ((Z/NZ)∗, ×) (whose cardinality is p − 1) is a cyclic group. This means that there exists a ∈ {1, . . . , p − 1} s.t. (Z/pZ)∗ = ⟨ā⟩. Hence the maximal period is τ = p − 1.
(b) If p = 2, α ≥ 3, then (Z/NZ)∗ (whose cardinality is 2^{α−1} = N/2) is not cyclic. The maximal period is then τ = 2^{α−2}, attained for a ≡ ±3 mod 8. (If α = 2 (N = 4), the group, of size 2, is trivially cyclic!)
(c) If p ≠ 2, then (Z/NZ)∗ (whose cardinality is p^{α−1}(p − 1)) is cyclic, hence τ = p^{α−1}(p − 1). It is generated by any element a whose class ã in (Z/pZ) spans the cyclic group ((Z/pZ)∗, ×).

What does this theorem say in connection with our pseudo-random number generation problem? First, a very good piece of news: when N is a prime number, the group ((Z/NZ)∗, ×) is cyclic, i.e. there exists a ∈ {1, . . . , N − 1} such that (Z/NZ)∗ = {ā^n, 1 ≤ n ≤ N − 1}. The bad news is that we do not know which a satisfy this property (not all do) and, even worse, we do not know how to find one. For instance, if p = 7, ϕ(7) = 7 − 1 = 6 and o(3) = o(5) = 6, but o(2) = o(4) = 3 (which divides 6) and o(6) = 2 (which again divides 6).
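The orders quoted for p = 7 can be checked by brute force; the helper below (its name is ours) is a naive illustration, not an efficient algorithm.

```python
def mult_order(a, N):
    """Multiplicative order of a modulo N (assumes gcd(a, N) = 1):
    the smallest k >= 1 with a^k = 1 mod N."""
    k, power = 1, a % N
    while power != 1:
        power = (power * a) % N
        k += 1
    return k

# p = 7: phi(7) = 6; 3 and 5 are generators, the other orders divide 6.
orders = {a: mult_order(a, 7) for a in range(2, 7)}
```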

The second piece of bad news is that a large period, though a necessary property of a sequence (yn)n, provides no guarantee or even clue that (xn)n is a good sequence of pseudo-random numbers! Thus, the (homogeneous) generator of the FORTRAN IMSL library does not fit into the formerly described setting: one sets N := 2^31 − 1 = 2 147 483 647 (a Mersenne prime number (see below) discovered by Leonhard Euler in 1772), a := 7^5 and b := 0, the point being that the period of 7^5 is not maximal.

Another approach to random number generation is based on shift registers and relies upon the theory of finite fields.

At this stage, a sequence must pass successfully through various statistical tests, keeping in mind that such a sequence is finite by construction and consequently cannot satisfy asymptotically such common properties as the Law of Large Numbers, the Central Limit Theorem or the Law of the Iterated Logarithm (see next chapter). Dedicated statistical toolboxes like DIEHARD (Marsaglia, 1998) have been devised to test and “certify” sequences of pseudo-random numbers.

The aim of this introductory section is just to give the reader the flavour of pseudo-random number generation; in no case do we recommend the specific use of any of the above generators, nor do we discuss the virtues of one generator over another.
For more recent developments on random number generators (shift registers, etc.), we refer e.g. to [42], [118], etc. Nevertheless, let us mention the Mersenne Twister generators. This family of generators was introduced in 1997 by Makoto Matsumoto and Takuji Nishimura in [111]. The first level of Mersenne Twister generators (denoted MT-p) are congruential generators whose period Np is a prime Mersenne number, i.e. an integer of the form Np = 2^p − 1 where p is itself prime. The most popular, and now worldwide implemented, is the MT-19937, owing to its unrivaled period 2^19937 − 1 ≈ 10^6000 (since 2^10 ≈ 10^3). It can simulate a uniform distribution in dimension 623 (i.e. on [0, 1]^623). A second “shuffling” device improves it further. For recent improvements and their implementations in C++, use the link

    www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html
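As a practical aside, MT19937 is the default generator of many scientific computing environments; in Python, for instance, the standard random module is built on it, so a reproducible uniform stream only requires fixing a seed (the seed value below is arbitrary).

```python
import random

# CPython's standard "random" module is an MT19937 generator: seeding it
# fixes the whole stream, which is convenient for reproducible experiments.
rng = random.Random(12345)
xs = [rng.random() for _ in range(10_000)]
mean = sum(xs) / len(xs)   # should be close to 1/2 for uniform draws on [0, 1)
```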

Recently, new developments in massively parallel computing have drawn attention back to pseudo-random number simulation, in particular GPU-based intensive computations, which use the graphics device of a computer as a computing unit that may run hundreds of computations in parallel. One can imagine that each pixel is a small (virtual) computing unit carrying out a small chain of elementary computations (a thread). What is really new is that access to such intensive parallel computing has become cheap, although it requires a specific programming language (like CUDA on Nvidia GPUs). Concerning its use for intensively parallel Monte Carlo simulation, some new questions arise, in particular the ability to generate in parallel many “independent” sequences of pseudo-random numbers, since the computing units of a GPU never “speak” to each other or to anybody else while running: each pixel is a separate (virtual) thread. The Wichmann-Hill pseudo-random number generator (which is in fact a family of 273 different generators) seems to be a good candidate for Monte Carlo simulation on GPUs. For more insight on this topic we refer to [151] and the references therein.

1.2 The fundamental principle of simulation


Theorem 1.2 (Fundamental Theorem of simulation) Let (E, dE ) be a Polish space (com-
plete and separable) and let X : (Ω, A, P) → (E, BordE (E)) be a random variable with distribution
PX . Then there exists a Borel function ϕ : ([0, 1], B([0, 1]), λ[0,1] ) → (E, BordE (E), PX ) such that

PX = λ[0,1] ◦ ϕ−1

where λ[0,1] ◦ ϕ−1 denotes the image of the Lebesgue measure λ[0,1] on the unit interval by ϕ.

We will admit this theorem. For a proof we refer to [27] (Theorem A.3.1, p. 38). It also appears
as a “brick” in the proof of the Skorokhod representation theorem for random variables having
values in a Polish space.
As a consequence this means that, if U denotes any random variable with uniform distribution on (0, 1) defined on a probability space (Ω, A, P), then

    X =d ϕ(U).

The interpretation is that any E-valued random variable can be simulated from a uniform distribution. . . provided the function ϕ is computable. If such is the case, the yield of the simulation is 1 since every (pseudo-)random number u ∈ [0, 1] produces a PX-distributed random number. Except in very special situations (see below), this result turns out to be of a purely theoretical nature and is of little help for practical simulation. However, the fundamental theorem of simulation is important from a theoretical point of view in Probability Theory since it is the fundamental step of the Skorokhod representation theorem.
In the three sections below we provide a short background on the most classical simulation meth-
ods (inversion of the distribution function, acceptance-rejection method, Box-Müller for Gaussian
vectors). This is of course far from being exhaustive. For an overview of the different aspects of
simulation of non uniform random variables or vectors, we refer to [42]. But in fact, a large part
of the results from Probability Theory can give rise to simulation methods.

1.3 The (inverse) distribution function method


Let µ be a probability distribution on (R, Bor(R)) with distribution function F defined for every
x ∈ R by
F (x) := µ((−∞, x]).
The function F is always non-decreasing, “càdlàg” (French acronym for “right continuous with left
limits”) and limx→+∞ F (x) = 1, limx→−∞ F (x) = 0.
One can always associate to F its canonical left inverse Fl−1 defined on the open unit interval
(0, 1) by
∀u ∈ (0, 1), Fl−1 (u) = inf{x | F (x) ≥ u}.
One easily checks that Fl−1 is non-decreasing, left-continuous and satisfies

∀ u ∈ (0, 1), Fl−1 (u) ≤ x ⇐⇒ u ≤ F (x).



Proposition 1.1 If U =d U((0, 1)), then X := Fl−1(U) =d µ.

Proof. Let x ∈ R. It follows from what precedes that

    {X ≤ x} = {Fl−1(U) ≤ x} = {U ≤ F(x)}

so that P(X ≤ x) = P(Fl−1(U) ≤ x) = P(U ≤ F(x)) = F(x). ♦

Remarks. • When F is increasing and continuous on the real line, then F has an inverse function, denoted F−1, defined on (0, 1) (increasing and continuous as well) such that F ◦ F−1 = Id(0,1) and F−1 ◦ F = IdR. Clearly F−1 = Fl−1 by the very definition of Fl−1. But the above proof can be made even more straightforward since {F−1(U) ≤ x} = {U ≤ F(x)} by simple left composition of F−1 with the increasing function F.
• If µ has a probability density f such that {f = 0} has an empty interior, then F(x) = ∫_{−∞}^{x} f(u) du is continuous and increasing.
• One can replace R by any interval [a, b] ⊂ R or by R̄ (with obvious conventions).
• One could also have considered the right continuous canonical inverse Fr−1 defined by

    ∀u ∈ (0, 1), Fr−1(u) = inf{x | F(x) > u}.

One shows that Fr−1 is non-decreasing, right continuous and that

    Fr−1(u) ≤ x =⇒ u ≤ F(x)   and   u < F(x) =⇒ Fr−1(u) ≤ x.

Hence Fr−1(U) =d X since

    F(x) = P(F(x) > U) ≤ P(Fr−1(U) ≤ x) ≤ P(F(x) ≥ U) = F(x),

so that P(Fr−1(U) ≤ x) = F(x) = P(X ≤ x).

When X takes finitely many values in R, we will see in Example 4 below that this simulation
method corresponds to the standard simulation method of such random variables.

 Exercise. (a) Show that, for every u ∈ (0, 1),

    F(Fl−1(u)−) ≤ u ≤ F ◦ Fl−1(u)

so that, if F is continuous (or equivalently µ has no atom: µ({x}) = 0 for every x ∈ R), then F ◦ Fl−1 = Id(0,1).
(b) Show that if F is continuous, then F(X) =d U([0, 1]).
(c) Show that if F is (strictly) increasing, then Fl−1 is continuous and Fl−1 ◦ F = IdR.

(d) One defines the survival function of µ by F̄(x) = 1 − F(x) = µ((x, +∞)), x ∈ R, and the canonical right inverse of F̄ by

    ∀u ∈ (0, 1), F̄r−1(u) = inf{x | F̄(x) ≤ u}.

Show that F̄r−1(u) = Fl−1(1 − u). Deduce that F̄r−1 is right continuous on (0, 1) and that F̄r−1(U) has distribution µ. Define F̄l−1 and show that F̄l−1(U) has distribution µ. Finally, establish for F̄r−1 properties similar to (a)-(b)-(c).

(Informal) Definition. The yield (often denoted r) of a simulation procedure is defined as the inverse of the number of pseudo-random numbers used to generate one PX-distributed random number.

One must keep in mind that the yield is attached to a simulation method, not to a probability
distribution (the fundamental theorem of simulation always provides a simulation method with
yield 1, except that it is usually not tractable).

Example. Typically, if X = ϕ(U1, . . . , Um) where ϕ is a Borel function defined on [0, 1]^m and U1, . . . , Um are independent and uniformly distributed over [0, 1], the yield of this ϕ-based procedure to simulate the distribution of X is r = 1/m. Consequently, the yield of the (inverse) distribution function method is always equal to r = 1.

Examples. 1. Simulation of an exponential distribution. Let X =ᵈ E(λ), λ > 0. Then

    ∀ x ∈ (0, ∞), F_X(x) = ∫_0^x λ e^{−λξ} dξ = 1 − e^{−λx}.

Consequently, for every u ∈ (0, 1), F_X^{-1}(u) = − log(1 − u)/λ. Now, using that 1 − U =ᵈ U if U =ᵈ U((0, 1)), this yields

    X =ᵈ − log(U)/λ =ᵈ E(λ).
2. Simulation of a Cauchy(c), c > 0, distribution. We know that P_X(dx) = c/(π(x² + c²)) dx, so that

    ∀x ∈ R, F_X(x) = ∫_{−∞}^x (c/π) du/(u² + c²) = (1/π) Arctan(x/c) + 1/2.

Hence F_X^{-1}(u) = c tan(π(u − 1/2)). It follows that

    X =ᵈ c tan(π(U − 1/2)) =ᵈ Cauchy(c).

3. Simulation of a Pareto(θ), θ > 0, distribution. We know that P_X(dx) = (θ/x^{1+θ}) 1_{{x≥1}} dx. The distribution function reads F_X(x) = 1 − x^{−θ}, x ≥ 1, so that, still using 1 − U =ᵈ U if U =ᵈ U((0, 1)),

    X =ᵈ U^{−1/θ} =ᵈ Pareto(θ).
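As an illustration of these three inverse-CDF recipes, here is a minimal Python sketch; the function and parameter names are ours, not from the text:

```python
import math
import random

def exponential(lam, u):
    # F^{-1}(u) = -log(1-u)/lam; since 1-U and U have the same law, use -log(u)/lam
    return -math.log(u) / lam

def cauchy(c, u):
    # F^{-1}(u) = c * tan(pi*(u - 1/2))
    return c * math.tan(math.pi * (u - 0.5))

def pareto(theta, u):
    # X = U^{-1/theta} since F(x) = 1 - x^{-theta}, x >= 1
    return u ** (-1.0 / theta)

rng = random.Random(0)
sample = [exponential(1.0, rng.random()) for _ in range(100_000)]
print(sum(sample) / len(sample))  # empirical mean, close to 1/lam = 1
```

Each call consumes exactly one uniform number, which is the yield r = 1 claimed above.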

4. Simulation of a distribution supported by a finite set E. Let E := {x_1, . . . , x_N} be a subset of R indexed so that i ↦ x_i is increasing. Let X : (Ω, A, P) → E be an E-valued random variable with distribution P(X = x_k) = p_k, 1 ≤ k ≤ N, where p_k ∈ [0, 1], p_1 + · · · + p_N = 1. Then one checks that its distribution function F_X reads

    ∀ x ∈ R, F_X(x) = p_1 + · · · + p_i if x_i ≤ x < x_{i+1}

with the convention x_0 = −∞ and x_{N+1} = +∞. As a consequence, its left continuous canonical inverse is given by

    ∀u ∈ (0, 1), F_{X,l}^{-1}(u) = Σ_{k=1}^{N} x_k 1_{{p_1+···+p_{k−1} < u ≤ p_1+···+p_k}}

so that

    X =ᵈ Σ_{k=1}^{N} x_k 1_{{p_1+···+p_{k−1} < U ≤ p_1+···+p_k}}.

The yield of the procedure is still r = 1. However, when implemented naively, its complexity – which corresponds to (at most) N comparisons per simulation – may be quite high. See [42] for some considerations (in the spirit of quicksort algorithms) which lead to a O(log N) complexity. Furthermore, this procedure presumes that one has access to the probability weights p_k with an arbitrary accuracy. This is not always the case, even in a priori simple situations, as emphasized in Example 6 below.
It should of course be noticed that the above simulation formula remains appropriate for a random variable taking values in any set E, not only in subsets of R!
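For instance, the cumulative-weight lookup can be implemented with a dichotomic search in O(log N), in the spirit of the remark above. A minimal sketch (names are ours), assuming the values are already sorted:

```python
import bisect
import itertools
import random

def make_discrete_sampler(values, probs):
    """Inverse-CDF sampler for P(X = values[k]) = probs[k]."""
    cum = list(itertools.accumulate(probs))   # p1, p1+p2, ..., p1+...+pN
    def sample(u):
        # smallest k with p1+...+pk >= u, i.e. p1+...+p_{k-1} < u <= p1+...+pk
        k = bisect.bisect_left(cum, u)
        return values[min(k, len(values) - 1)]   # guard against round-off in cum[-1]
    return sample

sampler = make_discrete_sampler([-1.0, 0.0, 2.5], [0.2, 0.5, 0.3])
rng = random.Random(42)
draws = [sampler(rng.random()) for _ in range(50_000)]
print(draws.count(0.0) / len(draws))  # empirical weight of the atom 0.0, close to 0.5
```

The cumulative weights are computed once, so each draw costs one uniform number plus a binary search.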

5. Simulation of a Bernoulli random variable B(p), p ∈ (0, 1). This is the simplest application of
the previous method since
d
X = 1{U ≤p} = B(p).

6. Simulation of a Binomial random variable B(n, p), p ∈ (0, 1), n ≥ 1. One relies on the very definition of the binomial distribution as the law of the sum of n independent B(p)-distributed random variables, i.e.

    X =ᵈ Σ_{k=1}^{n} 1_{{U_k ≤ p}} =ᵈ B(n, p)

where U_1, . . . , U_n are i.i.d. random variables, uniformly distributed over [0, 1].
Note that this procedure has a very bad yield, namely r = 1/n. Moreover, it needs n comparisons, like the standard method (without any shortcut).
Why not use the above standard method for random variables taking finitely many values? Because the cost of computing the probability weights p_k becomes much too high as n grows.
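A direct transcription of this sum of indicators (a sketch; names are ours):

```python
import random

def binomial(n, p, rng):
    # sum of n independent Bernoulli(p) indicators 1_{U_k <= p}; yield r = 1/n
    return sum(1 for _ in range(n) if rng.random() <= p)

rng = random.Random(1)
sample = [binomial(10, 0.3, rng) for _ in range(100_000)]
print(sum(sample) / len(sample))  # empirical mean, close to n*p = 3
```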
7. Simulation of geometric random variables G(p) and G*(p), p ∈ (0, 1). The geometric distribution is that of the first success instant when repeating independently the same Bernoulli experiment with parameter p. Conventionally, G(p) starts at time 0 whereas G*(p) starts at time 1.
To be precise, if (X_k)_{k≥0} denotes an i.i.d. sequence of random variables with distribution B(p), p ∈ (0, 1), then

    τ* := min{k ≥ 1 | X_k = 1} =ᵈ G*(p)

and
    τ := min{k ≥ 0 | X_k = 1} =ᵈ G(p).

Hence

    P(τ* = k) = p(1 − p)^{k−1}, k ∈ N* := {1, 2, . . . , n, . . .}

and

    P(τ = k) = p(1 − p)^k, k ∈ N := {0, 1, 2, . . . , n, . . .}

(so that both random variables are P-a.s. finite since Σ_{k≥0} P(τ = k) = Σ_{k≥1} P(τ* = k) = 1). Note that τ + 1 has the same G*(p)-distribution as τ*.
The (random) yields of the above two procedures are r* = 1/τ* and r = 1/(τ + 1) respectively. Their common mean (average yield) r̄ = r̄* is given by

    E(1/(τ + 1)) = E(1/τ*) = Σ_{k≥1} (1/k) p(1 − p)^{k−1}
                 = (p/(1 − p)) Σ_{k≥1} (1 − p)^k/k
                 = (p/(1 − p)) (− log(1 − x))|_{x=1−p}
                 = −(p/(1 − p)) log(p).
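The trial-counting procedure above can be sketched as follows. As an aside, we also include the classical single-uniform shortcut ⌈log U/log(1−p)⌉, which has the G*(p) law and yield 1; this shortcut is standard but is not discussed in the text:

```python
import math
import random

def geometric_trials(p, rng):
    # first success instant (starting at 1) of repeated Bernoulli(p) trials
    k = 1
    while rng.random() > p:     # {U <= p} codes a success
        k += 1
    return k

def geometric_closed_form(p, u):
    # single-uniform shortcut: ceil(log u / log(1-p)) has the G*(p) distribution
    return math.ceil(math.log(u) / math.log(1.0 - p))

rng = random.Random(7)
sample = [geometric_trials(0.3, rng) for _ in range(100_000)]
print(sum(sample) / len(sample))  # empirical mean, close to 1/p = 10/3
```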

 Exercises. 1. Let X : (Ω, A, P) → (R, Bor(R)) be a real-valued random variable with distribution function F and left continuous canonical inverse F_l^{-1}. Let I = [a, b], −∞ ≤ a < b ≤ +∞, be a nontrivial interval of R. Show that, if U =ᵈ U([0, 1]), then

    F_l^{-1}(F(a) + (F(b) − F(a))U) =ᵈ L(X | X ∈ I).

2. Negative binomial distributions. The negative binomial distribution with parameter (n, p) is the law µ_{n,p} of the instant of the nth success in an infinite sequence of independent Bernoulli trials, namely, with the above notations used for the geometric distributions, the distribution of

    τ^{(n)} = min{k ≥ 1 | card{1 ≤ ℓ ≤ k | X_ℓ = 1} = n}.

Show that

    µ_{n,p}(k) = 0, k ≤ n − 1,   µ_{n,p}(k) = C_{k−1}^{n−1} p^n (1 − p)^{k−n}, k ≥ n.

Compute the mean yield of its (natural and straightforward) simulation method.

1.4 The acceptance-rejection method

This method is due to Von Neumann (1951). It is contemporary with the development of computers and of the Monte Carlo method. The original motivation was to devise a simulation method for a probability distribution ν on a measurable space (E, E), absolutely continuous with respect to a σ-finite non-negative measure µ, with a density given, up to a constant, by f : (E, E) → R_+, when we know that f is dominated by a probability density g with respect to µ whose distribution g.µ can be simulated at "low cost". (Note that ν = (f/∫_E f dµ).µ.)
In most elementary applications (see below), E is either a Borel set of R^d equipped with its Borel σ-field, µ being the trace of the Lebesgue measure on E, or a subset E ⊂ Z^d equipped with the counting measure.
Let us be more precise. Let µ be a non-negative σ-finite measure on (E, E) and let f, g : (E, E) → R_+ be two Borel functions. Assume that f ∈ L¹_{R_+}(µ) with ∫_E f dµ > 0 and that g is a probability density with respect to µ satisfying furthermore g > 0 µ-a.s., and that there exists a real constant c > 0 such that

    f(x) ≤ c g(x)   µ(dx)-a.e.

Note that this implies c ≥ ∫_E f dµ. As mentioned above, the aim of this section is to show how to simulate some random numbers distributed according to the probability distribution

    ν = (f/∫_E f dµ).µ

using some g.µ-distributed (pseudo-)random numbers. In particular, to make the problem consistent, we will assume that ν ≠ g.µ, which in turn implies that

    c > ∫_E f dµ.

The underlying requirements on f, g and µ to undertake a practical implementation of the method described below are the following:

• the numerical value of the real constant c is known,

• we know how to simulate (at a reasonable cost) on a computer a sequence of i.i.d. random vectors (Y_k)_{k≥1} with distribution g.µ,

• we can compute on a computer the ratio (f/g)(x) at every x ∈ E (again at a reasonable cost).

As a first (not so) preliminary step, we will explore a natural connection (in distribution) between an E-valued random variable X with distribution ν and an E-valued random variable Y with distribution g.µ. We will see that the key idea is completely elementary. Let h : (E, E) → R be a test function (measurable and bounded or non-negative). On the one hand,

    E h(X) = (1/∫_E f dµ) ∫_E h(x) f(x) µ(dx)
           = (1/∫_E f dµ) ∫_E h(y) (f/g)(y) g(y) µ(dy)   since g > 0 µ-a.e.
           = E[h(Y) (f/g)(Y)].
We can also stay on the state space E and note, in a somewhat artificial way, that

    E h(X) = (c/∫_E f dµ) ∫_E h(y) (∫_0^1 1_{{u ≤ f(y)/(c g(y))}} du) g(y) µ(dy)
           = (c/∫_E f dµ) ∫_E ∫_0^1 h(y) 1_{{u ≤ f(y)/(c g(y))}} g(y) µ(dy) du
           = (c/∫_E f dµ) E[h(Y) 1_{{c U g(Y) ≤ f(Y)}}]

where U is uniformly distributed over [0, 1] and independent of Y.

By considering h ≡ 1, we derive from the above identity that

    c/∫_E f dµ = 1/P(c U ≤ (f/g)(Y))

so that finally

    E h(X) = E[h(Y) | {c U ≤ (f/g)(Y)}].

The proposition below takes advantage of this identity in distribution to propose a simulation procedure. In fact it is simply a reverse way to make (and interpret) the above computations.

Proposition 1.2 (Acceptance-rejection simulation method) Let (U_n, Y_n)_{n≥1} be a sequence of i.i.d. random variables with distribution U([0, 1]) ⊗ P_Y (independent marginals) defined on (Ω, A, P), where P_Y(dy) = g(y)µ(dy) is the distribution of Y on (E, E). Set

    τ := min{k ≥ 1 | c U_k g(Y_k) ≤ f(Y_k)}.

Then τ has a geometric distribution G*(p) with parameter p := P(c U_1 g(Y_1) ≤ f(Y_1)) = (∫_E f dµ)/c and

    X := Y_τ =ᵈ ν.

Remark. The (random) yield of the method is 1/τ. Hence its mean yield is given by

    E(1/τ) = −(p log p)/(1 − p) = (∫_E f dµ/(c − ∫_E f dµ)) log(c/∫_E f dµ).

Since lim_{p→1} −(p log p)/(1 − p) = 1, the closer the constant c is to ∫_E f dµ, the higher the yield of the simulation.

Proof. Step 1: Let (U, Y) be a couple of random variables with distribution U([0, 1]) ⊗ P_Y and let h : E → R be a bounded Borel test function. We have

    E[h(Y) 1_{{c U g(Y) ≤ f(Y)}}] = ∫_{E×[0,1]} h(y) 1_{{c u g(y) ≤ f(y)}} g(y) µ(dy) ⊗ du
        = ∫_E (∫_{[0,1]} 1_{{c u g(y) ≤ f(y)} ∩ {g(y) > 0}} du) h(y) g(y) µ(dy)
        = ∫_E h(y) (∫_0^1 1_{{u ≤ f(y)/(c g(y))} ∩ {g(y) > 0}} du) g(y) µ(dy)
        = ∫_{{g(y) > 0}} h(y) (f(y)/(c g(y))) g(y) µ(dy)
        = (1/c) ∫_E h(y) f(y) µ(dy)

where we used successively Fubini's Theorem, g(y) > 0 µ(dy)-a.e., Fubini's Theorem again and (f/(c g))(y) ≤ 1 µ(dy)-a.e. Note that we can apply Fubini's Theorem since the reference measure µ is σ-finite.
Putting h ≡ 1 yields

    c = (∫_E f dµ)/P(c U g(Y) ≤ f(Y)).

Then, elementary conditioning yields

    E[h(Y) | {c U g(Y) ≤ f(Y)}] = ∫_E h(y) (f(y)/∫_E f dµ) µ(dy) = ∫_E h(y) ν(dy)

i.e.

    L(Y | {c U g(Y) ≤ f(Y)}) = ν.

Step 2: Then (using that τ is P-a.s. finite)

    E(h(Y_τ)) = Σ_{n≥1} E[1_{{τ=n}} h(Y_n)]
              = Σ_{n≥1} P({c U_1 g(Y_1) > f(Y_1)})^{n−1} E[h(Y_1) 1_{{c U_1 g(Y_1) ≤ f(Y_1)}}]
              = Σ_{n≥1} (1 − p)^{n−1} E[h(Y_1) 1_{{c U_1 g(Y_1) ≤ f(Y_1)}}]
              = Σ_{n≥1} p(1 − p)^{n−1} E[h(Y_1) | {c U_1 g(Y_1) ≤ f(Y_1)}]
              = 1 × ∫ h(y) ν(dy)
              = ∫ h(y) ν(dy). ♦

Remark. An important point to be noticed is that we do not need to know the numerical value of ∫_E f dµ to implement the above acceptance-rejection procedure.

Corollary 1.1 Set by induction, for every n ≥ 1,

    τ_1 := min{k ≥ 1 | c U_k g(Y_k) ≤ f(Y_k)} and τ_{n+1} := min{k ≥ τ_n + 1 | c U_k g(Y_k) ≤ f(Y_k)}.

(a) The sequence (τ_n − τ_{n−1})_{n≥1} (with the convention τ_0 = 0) is i.i.d. with distribution G*(p) and the sequence

    X_n := Y_{τ_n}

is an i.i.d. P_X-distributed sequence of random variables.

(b) Furthermore, the random yield of the simulation of the first n P_X-distributed random variables Y_{τ_k}, k = 1, . . . , n, is

    ρ_n = n/τ_n → p  a.s. as n → +∞.

Proof. (a) Left as an exercise (see below).
(b) The fact that ρ_n = n/τ_n is obvious. The announced a.s. convergence follows from the Strong Law of Large Numbers since (τ_n − τ_{n−1})_{n≥1} is i.i.d. and E τ_1 = 1/p, which implies that τ_n/n → 1/p a.s. ♦

 Exercise. Prove item (a) of the above corollary.

Before proposing first applications, let us briefly present a more applied point of view, closer to what is really implemented in practice when performing a Monte Carlo simulation based on the acceptance-rejection method.

The user's viewpoint (practitioner's corner). The practical implementation of the acceptance-rejection method is rather simple. Let h : E → R be a P_X-integrable Borel function. How to compute E h(X) using Von Neumann's acceptance-rejection method? It amounts to simulating an n-sample (U_k, Y_k)_{1≤k≤n} on a computer and to computing

    (Σ_{k=1}^n 1_{{c U_k g(Y_k) ≤ f(Y_k)}} h(Y_k)) / (Σ_{k=1}^n 1_{{c U_k g(Y_k) ≤ f(Y_k)}}).

Note that, dividing both the numerator and the denominator by n and owing to the Strong Law of Large Numbers (see next chapter if necessary), this quantity a.s. converges as n goes to infinity toward

    (∫_0^1 ∫_E 1_{{c u g(y) ≤ f(y)}} h(y) g(y) µ(dy) du) / (∫_0^1 ∫_E 1_{{c u g(y) ≤ f(y)}} g(y) µ(dy) du)
        = (∫_E h(y) (f(y)/c) µ(dy)) / (∫_E (f(y)/c) µ(dy))
        = ∫_E h(y) (f(y)/∫_E f dµ) µ(dy)
        = ∫_E h(y) ν(dy).

This third way to present the same computations shows that, in terms of practical implementation, this method is in fact very elementary.
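As a toy illustration of this ratio estimator, the sketch below targets the unnormalized density f(x) = x(1 − x) on (0, 1), i.e. the Beta(2, 2) distribution, with g the uniform density and c = sup f/g = 1/4; all names and the choice of target are illustrative, not from the text:

```python
import random

def f(x):                 # unnormalized target density on (0, 1): Beta(2,2) up to a constant
    return x * (1.0 - x)

def g(x):                 # instrumental density: uniform on (0, 1)
    return 1.0

c = 0.25                  # sup f/g, attained at x = 1/2

def mc_accept_reject(h, n, rng):
    """Ratio estimator of E h(X), X ~ f/(integral of f), from an n-sample of (U, Y)."""
    num = den = 0.0
    for _ in range(n):
        u, y = rng.random(), rng.random()   # U ~ U([0,1]) and Y ~ g, independent
        if c * u * g(y) <= f(y):            # acceptance test
            num += h(y)
            den += 1.0
    return num / den

rng = random.Random(123)
print(mc_accept_reject(lambda x: x, 200_000, rng))  # mean of Beta(2,2), i.e. 1/2
```

Note that neither the normalizing constant ∫f dµ nor Γ-like special functions are ever evaluated, only f, g and c.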
Classical applications.  Uniform distributions on a bounded domain D.
Let D ⊂ a + [−M, M]^d with λ_d(D) > 0 (where a ∈ R^d, M > 0), let Y =ᵈ U(a + [−M, M]^d) and let τ_D := min{n | Y_n ∈ D}, where (Y_n)_{n≥1} is an i.i.d. sequence defined on a probability space (Ω, A, P) with the same distribution as Y. Then

    Y_{τ_D} =ᵈ U(D)

where U(D) denotes the uniform distribution over D.
This follows from the above Proposition 1.2 with E = a + [−M, M]^d, µ := λ_{d|a+[−M,M]^d} (Lebesgue measure on a + [−M, M]^d),

    g(u) := (2M)^{−d} 1_{a+[−M,M]^d}(u)

and

    f(x) = 1_D(x) ≤ c g(x) with c := (2M)^d,

so that (f/∫_E f dµ).µ = U(D).
As a matter of fact, with the notations of the above proposition,

    τ = min{k ≥ 1 | c U_k ≤ (f/g)(Y_k)}.

However, (f/g)(y) > 0 if and only if y ∈ D and, if so, (f/g)(y) = 1. Consequently τ = τ_D.
A standard application is to consider the unit ball of Rd , D := Bd (0; 1). When d = 2, this is
involved in the so-called polar method, see below, for the simulation of N (0; I2 ) random vectors.
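A sketch of this rejection scheme for the unit disk D = B₂(0; 1) (names are ours):

```python
import random

def uniform_on_disk(rng):
    # draw Y uniformly on [-1, 1]^2 until it falls into the unit disk
    while True:
        x = 2.0 * rng.random() - 1.0
        y = 2.0 * rng.random() - 1.0
        if x * x + y * y <= 1.0:
            return x, y

rng = random.Random(2024)
pts = [uniform_on_disk(rng) for _ in range(100_000)]
# the mean acceptance probability is p = lambda_2(D)/lambda_2([-1,1]^2) = pi/4
frac = sum(1 for (x, _) in pts if x > 0) / len(pts)
print(frac)  # by symmetry, close to 1/2
```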

 The γ(α)-distribution.
Let α > 0 and P_X(dx) = f_α(x) dx/Γ(α) where

    f_α(x) = x^{α−1} e^{−x} 1_{{x>0}}(x).

(Keep in mind that Γ(a) = ∫_0^{+∞} u^{a−1} e^{−u} du, a > 0.) Note that when α = 1 the gamma distribution is simply the exponential distribution. We will consider E = (0, +∞) and the reference σ-finite measure µ = λ_{1|(0,+∞)}.
– Case 0 < α < 1. We use the rejection method, based on the probability density

    g_α(x) = (αe/(α + e)) (x^{α−1} 1_{{0<x<1}} + e^{−x} 1_{{x≥1}}).

The fact that g_α is a probability density function follows from an elementary computation. First, one checks that f_α(x) ≤ c_α g_α(x) for every x ∈ R_+, where

    c_α = (α + e)/(αe).

Note that this choice of c_α is optimal since f_α(1) = c_α g_α(1). Then, one uses the inverse distribution function to simulate the random variable with distribution P_Y(dy) = g_α(y)λ(dy). Namely, if G_α denotes the distribution function of Y, one checks that, for every x ∈ R,

    G_α(x) = (e/(α + e)) x^α 1_{{0<x<1}} + (αe/(α + e)) (1/e + 1/α − e^{−x}) 1_{{x>1}}

so that, for every u ∈ (0, 1),

    G_α^{-1}(u) = ((α + e)/e · u)^{1/α} 1_{{u < e/(α+e)}} − log((α + e)/(αe) · (1 − u)) 1_{{u ≥ e/(α+e)}}.

Note that the computation of the Γ function is never required to implement this method.
– Case α ≥ 1. We rely on the following classical lemma about the γ distribution.

Lemma 1.1 Let X′ and X″ be two independent random variables with distributions γ(α′) and γ(α″) respectively. Then X = X′ + X″ has the distribution γ(α′ + α″).

Consequently, if α = n ∈ N, an induction based on the lemma shows that

    X =ᵈ ξ_1 + · · · + ξ_n

where the ξ_k are i.i.d. with exponential distribution, since γ(1) = E(1). Consequently, if U_1, . . . , U_n are i.i.d. uniformly distributed random variables,

    X =ᵈ − log(Π_{k=1}^{n} U_k).

To simulate a random variable with general distribution γ(α), one writes α = ⌊α⌋ + {α} where ⌊α⌋ := max{k ≤ α, k ∈ N} denotes the integral part of α and {α} ∈ [0, 1) its fractional part.
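The two cases can be combined into one sampler. The sketch below implements the rejection step for the fractional part exactly as described (inverse CDF of g_α, plus the acceptance ratio f_α/(c_α g_α), which simplifies to e^{−y} on (0, 1) and to y^{α−1} on [1, ∞)), and the − log Π U_k formula for the integer part; the function names are ours:

```python
import math
import random

def gamma_frac(alpha, rng):
    # gamma(alpha), 0 < alpha < 1, by rejection from g_alpha
    e = math.e
    while True:
        u, v = rng.random(), rng.random()
        if u < e / (alpha + e):
            y = ((alpha + e) / e * u) ** (1.0 / alpha)            # G_alpha^{-1}(u), y in (0, 1)
            ratio = math.exp(-y)                                  # f_alpha/(c_alpha g_alpha) on (0, 1)
        else:
            y = -math.log((alpha + e) / (alpha * e) * (1.0 - u))  # G_alpha^{-1}(u), y >= 1
            ratio = y ** (alpha - 1.0)                            # f_alpha/(c_alpha g_alpha) on [1, inf)
        if v <= ratio:
            return y

def gamma_sample(alpha, rng):
    # gamma(alpha) = gamma(floor(alpha)) + gamma({alpha}), by Lemma 1.1
    n, frac = int(alpha), alpha - int(alpha)
    x = -math.log(math.prod(rng.random() for _ in range(n))) if n else 0.0
    return x + (gamma_frac(frac, rng) if frac > 0.0 else 0.0)

rng = random.Random(5)
sample = [gamma_sample(2.5, rng) for _ in range(50_000)]
print(sum(sample) / len(sample))  # E gamma(2.5) = 2.5
```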

 Exercises. 1. Prove the above lemma.
2. Show that considering the normalized probability density function of the γ(α)-distribution (which involves the computation of Γ(α)) as the function f_α does not improve the yield of the simulation.
3. Let α = α_0 + n, α_0 = {α} ∈ [0, 1). Show that the yield of the simulation is given by r = 1/(n + τ_α) where τ_α has a G*(p_α) distribution, p_α being the acceptance probability of the rejection method used to simulate the γ(α_0)-distribution. Show that, with p = p_α,

    r̄ := E r = −(p/(1 − p)^{n+1}) (log p + Σ_{k=1}^{n} (1 − p)^k/k).

 Acceptance-rejection method for a bounded density with compact support.
Let f : R^d → R_+ be a bounded Borel function with compact support (hence integrable with respect to the Lebesgue measure). If f can be computed at a reasonable cost, one may simulate the distribution ν = (f/∫_{R^d} f dλ_d).λ_d by simply considering a uniform distribution on a hypercube dominating f. To be more precise, let a = (a_1, . . . , a_d), b = (b_1, . . . , b_d), c ∈ R^d be such that

    supp(f) = {f ≠ 0} ⊂ K = c + Π_{i=1}^{d} [a_i, b_i].

Let E = K, let µ = λ_{d|E} be the reference measure and let g = (1/λ_d(K)) 1_K be the density of the uniform distribution over K (this distribution is very easy to simulate, as emphasized in a former example). Then

    f(x) ≤ c g(x), x ∈ K, with c = ‖f‖_sup λ_d(K) = ‖f‖_sup Π_{i=1}^{d} (b_i − a_i).

Then, if (Y_n)_{n≥1} denotes an i.i.d. sequence defined on a probability space (Ω, A, P) with the uniform distribution over K, the stopping rule τ of Von Neumann's acceptance-rejection method reads

    τ = min{k | ‖f‖_sup U_k ≤ f(Y_k)}.

Equivalently, this can be rewritten in a more intuitive way as follows: let (V_n)_{n≥1} = ((V_n^1, V_n^2))_{n≥1} be an i.i.d. sequence of random vectors defined on a probability space (Ω, A, P) with uniform distribution over K × [0, ‖f‖_sup]. Then

    V_τ^1 =ᵈ ν where τ = min{k ≥ 1 | V_k^2 ≤ f(V_k^1)}.

 Simulation of a discrete distribution supported by the non-negative integers.
Let Y =ᵈ G*(p), p ∈ (0, 1), i.e. such that its distribution satisfies P_Y = g.m, where m is the counting measure on N* (m({k}) = 1, k ∈ N*) and g_k = p(1 − p)^{k−1}, k ≥ 1. Let f = (f_k)_{k≥1} be a function from N* to R_+ defined by f_k = a_k(1 − p)^{k−1} and satisfying κ := sup_n a_n < +∞ (so that Σ_n f_n < +∞). Then

    f_k ≤ c g_k with c = κ/p.

Consequently, if τ := min{k ≥ 1 | κ U_k ≤ a_{Y_k}}, then Y_τ =ᵈ ν where ν_k := a_k(1 − p)^{k−1}/Σ_n a_n(1 − p)^{n−1}, k ≥ 1.

There are many other applications of Von Neumann's acceptance-rejection method, e.g. in Physics, which take advantage of the fact that the density to be simulated is known only up to a constant. Several methods have been devised to speed it up, i.e. to increase its yield. Among them, let us cite the Ziggurat method, for which we refer to [112]. It was developed by Marsaglia and Tsang in the early 2000's.

1.5 Simulation of Poisson distributions (and Poisson processes)

The Poisson distribution with parameter λ > 0, denoted P(λ), is an integer valued probability measure analytically defined by

    ∀ k ∈ N, P(λ)({k}) = e^{−λ} λ^k/k!.

To simulate this distribution in an exact way, one relies on its close connection with the Poisson counting process. The (normalized) Poisson counting process is the counting process induced by the exponential random walk (with parameter 1). It is defined by

    ∀ t ≥ 0, N_t = Σ_{n≥1} 1_{{S_n ≤ t}} = min{n | S_{n+1} > t}

where S_n = X_1 + · · · + X_n, n ≥ 1, S_0 = 0, and (X_n)_{n≥1} is an i.i.d. sequence of random variables with distribution E(1) defined on a probability space (Ω, A, P).

Proposition 1.3 The process (N_t)_{t≥0} has càdlàg paths (càdlàg is the French acronym for right continuous with left limits, "continu à droite, limité à gauche") and independent, stationary increments. In particular, for every s, t ≥ 0, s ≤ t, N_t − N_s is independent of N_s and has the same distribution as N_{t−s}. Furthermore, for every t ≥ 0, N_t has a Poisson distribution with parameter t.

Proof. Let (X_k)_{k≥1} be a sequence of i.i.d. random variables with exponential distribution E(1). Set, for every n ≥ 1, S_n = X_1 + · · · + X_n. Let t_1, t_2 ∈ R_+, t_1 < t_2, and let k_1, k_2 ∈ N. Assume temporarily k_2 ≥ 1. Then

    P(N_{t_2} − N_{t_1} = k_2, N_{t_1} = k_1) = P(N_{t_1} = k_1, N_{t_2} = k_1 + k_2)
                                             = P(S_{k_1} ≤ t_1 < S_{k_1+1} ≤ S_{k_1+k_2} ≤ t_2 < S_{k_1+k_2+1}).

Now, if we set A = P(S_{k_1} ≤ t_1 < S_{k_1+1} ≤ S_{k_1+k_2} ≤ t_2 < S_{k_1+k_2+1}) for convenience, we get

    A = ∫_{R_+^{k_1+k_2+1}} e^{−(x_1+···+x_{k_1+k_2+1})} 1_{{x_1+···+x_{k_1} ≤ t_1 < x_1+···+x_{k_1+1}, x_1+···+x_{k_1+k_2} ≤ t_2 < x_1+···+x_{k_1+k_2+1}}} dx_1 · · · dx_{k_1+k_2+1}.

The usual change of variables u_1 = x_1 and x_i = u_i − u_{i−1}, i = 2, . . . , k_1 + k_2 + 1, yields

    A = ∫_{{0 ≤ u_1 ≤ ··· ≤ u_{k_1} ≤ t_1 ≤ u_{k_1+1} ≤ ··· ≤ u_{k_1+k_2} ≤ t_2 < u_{k_1+k_2+1}}} e^{−u_{k_1+k_2+1}} du_1 · · · du_{k_1+k_2+1}.

Integrating downward from u_{k_1+k_2+1} down to u_1, we get, owing to Fubini's Theorem,

    A = ∫_{{0 ≤ u_1 ≤ ··· ≤ u_{k_1} ≤ t_1 ≤ u_{k_1+1} ≤ ··· ≤ u_{k_1+k_2} ≤ t_2}} e^{−t_2} du_1 · · · du_{k_1+k_2}
      = (∫_{{0 ≤ u_1 ≤ ··· ≤ u_{k_1} ≤ t_1}} du_1 · · · du_{k_1}) ((t_2 − t_1)^{k_2}/k_2!) e^{−t_2}
      = (t_1^{k_1}/k_1!) ((t_2 − t_1)^{k_2}/k_2!) e^{−t_2}
      = e^{−t_1} (t_1^{k_1}/k_1!) × e^{−(t_2−t_1)} ((t_2 − t_1)^{k_2}/k_2!).

When k_2 = 0, one computes likewise

    P(S_{k_1} ≤ t_1 < t_2 < S_{k_1+1}) = (t_1^{k_1}/k_1!) e^{−t_2} = e^{−t_1} (t_1^{k_1}/k_1!) × e^{−(t_2−t_1)}.

Summing over k_2 ∈ N shows that, for every k_1 ∈ N,

    P(N_{t_1} = k_1) = e^{−t_1} t_1^{k_1}/k_1!

i.e. N_{t_1} =ᵈ P(t_1). Summing over k_1 ∈ N shows that, for every k_2 ∈ N,

    N_{t_2} − N_{t_1} =ᵈ N_{t_2−t_1} =ᵈ P(t_2 − t_1).

Finally, this yields, for every k_1, k_2 ∈ N,

    P(N_{t_2} − N_{t_1} = k_2, N_{t_1} = k_1) = P(N_{t_2} − N_{t_1} = k_2) × P(N_{t_1} = k_1)

i.e. the increments N_{t_2} − N_{t_1} and N_{t_1} are independent.
One shows likewise, with a few more technicalities, that if 0 < t_1 < t_2 < · · · < t_n, n ≥ 1, then the increments (N_{t_i} − N_{t_{i−1}})_{i=1,...,n} are independent and stationary in the sense that N_{t_i} − N_{t_{i−1}} =ᵈ P(t_i − t_{i−1}). ♦

Corollary 1.2 (Simulation of a Poisson distribution) Let (U_n)_{n≥1} be an i.i.d. sequence of uniformly distributed random variables on the unit interval. The process null at zero and defined for every t > 0 by

    N_t = min{k ≥ 0 | U_1 · · · U_{k+1} < e^{−t}} =ᵈ P(t)

is a normalized Poisson process.

Proof. It follows from Example 1 in the former Section 1.3 that the exponentially distributed i.i.d. sequence (X_k)_{k≥1} can be written in the form X_k = − log U_k, k ≥ 1. Using that the random walk (S_n)_{n≥1} is non-decreasing, it follows from the definition of a Poisson process that, for every t ≥ 0,

    N_t = min{k ≥ 0 such that S_{k+1} > t}
        = min{k ≥ 0 such that − log(U_1 · · · U_{k+1}) > t}
        = min{k ≥ 0 such that U_1 · · · U_{k+1} < e^{−t}}. ♦
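A direct transcription of this product-of-uniforms characterization (a sketch; names are ours):

```python
import math
import random

def poisson(t, rng):
    # N_t = min{k >= 0 : U_1 ... U_{k+1} < exp(-t)} has the P(t) distribution
    threshold = math.exp(-t)
    k, prod = 0, rng.random()
    while prod >= threshold:
        k += 1
        prod *= rng.random()
    return k

rng = random.Random(11)
sample = [poisson(3.0, rng) for _ in range(100_000)]
print(sum(sample) / len(sample))  # E N_3 = 3
```

Each call consumes N_t + 1 uniform numbers, so the average cost per draw is t + 1; the method is efficient only for moderate t.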



1.6 The Box-Müller method for normally distributed random vectors

1.6.1 d-dimensional normally distributed random vectors

One relies on the Box-Müller method, which is probably the most efficient method to simulate the bi-variate normal distribution N(0; I_2).

Proposition 1.4 Let R² and Θ : (Ω, A, P) → R be two independent random variables with distributions L(R²) = E(1/2) and L(Θ) = U([0, 2π]) respectively. Then

    X := (R cos Θ, R sin Θ) =ᵈ N(0; I_2)

where R := √(R²).

Proof. Let f be a bounded Borel function. Then

    ∫∫_{R²} f(x_1, x_2) exp(−(x_1² + x_2²)/2) (dx_1 dx_2)/(2π) = ∫∫ f(ρ cos θ, ρ sin θ) e^{−ρ²/2} 1_{R*_+}(ρ) 1_{(0,2π)}(θ) ρ (dρ dθ)/(2π)

using the standard change of variable x_1 = ρ cos θ, x_2 = ρ sin θ. We use the facts that (ρ, θ) ↦ (ρ cos θ, ρ sin θ) is a C¹-diffeomorphism from (0, ∞) × (0, 2π) onto R² \ (R_+ × {0}) and that λ_2(R_+ × {0}) = 0.
Setting now ρ² = r, one has:

    ∫∫_{R²} f(x_1, x_2) exp(−(x_1² + x_2²)/2) (dx_1 dx_2)/(2π) = ∫∫ f(√r cos θ, √r sin θ) (e^{−r/2}/2) 1_{R*_+}(r) 1_{(0,2π)}(θ) (dr dθ)/(2π)
        = E f(√(R²) cos Θ, √(R²) sin Θ)
        = E f(X). ♦

Corollary 1.3 (Box-Müller method) One can simulate the distribution N(0; I_2) from a couple (U_1, U_2) of independent random variables with distribution U([0, 1]) by setting

    X := (√(−2 log(U_1)) cos(2πU_2), √(−2 log(U_1)) sin(2πU_2)).

The yield of the simulation is r = 1/2 with respect to the N(0; 1) distribution . . . and r = 1 when the aim is to simulate an N(0; I_2)-distributed (pseudo-)random vector or, equivalently, two N(0; 1)-distributed (pseudo-)random numbers.

Proof. Simulate the E(1/2) distribution by −2 log(U_1), using the inverse distribution function with U_1 =ᵈ U([0, 1]), and note that if U_2 =ᵈ U([0, 1]), then 2πU_2 =ᵈ U([0, 2π]) (where U_2 is taken independent of U_1). ♦
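A minimal sketch of the Box-Müller transform; we replace U₁ by 1 − U₁, which has the same law, to rule out log 0:

```python
import math
import random

def box_muller(rng):
    # two independent N(0;1) numbers from two independent U([0,1]) numbers
    u1, u2 = 1.0 - rng.random(), rng.random()   # u1 in (0,1], avoids log(0)
    r = math.sqrt(-2.0 * math.log(u1))          # R = sqrt(R^2), R^2 ~ E(1/2)
    theta = 2.0 * math.pi * u2                  # Theta ~ U([0, 2*pi])
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(3)
zs = [z for _ in range(100_000) for z in box_muller(rng)]
mean = sum(zs) / len(zs)
var = sum(z * z for z in zs) / len(zs)
print(mean, var)  # close to 0 and 1
```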

Application to the simulation of the multivariate normal distribution. To simulate a d-dimensional N(0; I_d) vector, one may assume that d is even (up to simulating one extra component) and "concatenate" the above procedure: we consider a d-tuple (U_1, . . . , U_d) =ᵈ U([0, 1]^d) (so that U_1, . . . , U_d are i.i.d. with distribution U([0, 1])) and set

    (X_{2i−1}, X_{2i}) = (√(−2 log(U_{2i−1})) cos(2πU_{2i}), √(−2 log(U_{2i−1})) sin(2πU_{2i})), i = 1, . . . , d/2.   (1.1)

 Exercise (Marsaglia's polar method, see [112]). Let (V_1, V_2) =ᵈ U(B(0; 1)) where B(0; 1) denotes the canonical Euclidean unit ball in R². Set R² = V_1² + V_2² and

    X := (V_1 √(−2 log(R²)/R²), V_2 √(−2 log(R²)/R²)).

(a) Show that R² =ᵈ U([0, 1]), that (V_1/R, V_2/R) ∼ (cos Θ, sin Θ) with Θ =ᵈ U([0, 2π]), and that R² and (V_1/R, V_2/R) are independent. Deduce that X =ᵈ N(0; I_2).
(b) Let (U_1, U_2) =ᵈ U([−1, 1]²). Show that L((U_1, U_2) | U_1² + U_2² ≤ 1) = U(B(0; 1)). Derive a simulation method for N(0; I_2) combining the above identity and an appropriate acceptance-rejection algorithm. What is the yield of the resulting procedure?
(c) Compare the performances of this so-called Marsaglia polar method with those of the Box-Müller algorithm (i.e. the acceptance-rejection rule versus the computation of trigonometric functions). Conclude.

1.6.2 Correlated d-dimensional Gaussian vectors, Gaussian processes

Let X = (X¹, . . . , X^d) be a centered R^d-valued Gaussian vector with covariance matrix Σ = [cov(X^i, X^j)]_{1≤i,j≤d}. The matrix Σ is symmetric non-negative (but possibly non-invertible). It follows from the very definition of Gaussian vectors that any linear combination (u|X) = Σ_i u_i X^i has a (centered) Gaussian distribution with variance u*Σu, so that the characteristic function of X is given by

    Φ_X(u) := E e^{i(u|X)} = e^{−½ u*Σu}, u ∈ R^d.

As a well-known consequence, the covariance matrix Σ characterizes the distribution of X, which allows us to denote by N(0; Σ) the distribution of X.
The key to simulating such a random vector is the following general lemma (which has nothing to do with Gaussian vectors). It describes how the covariance is modified by a linear transform.

Lemma 1.2 Let Y be an R^d-valued square integrable random vector and let A ∈ M(q, d) be a q × d matrix. Then the covariance matrix C_{AY} of the random vector AY is given by

    C_{AY} = A C_Y A*

where A* stands for the transpose of A.

This result can be used in two different ways.
– Square root of Σ. It is classical background that there exists a unique non-negative symmetric matrix commuting with Σ, denoted √Σ, such that Σ = (√Σ)² = √Σ (√Σ)*. Consequently, owing to the above lemma,

    if Z =ᵈ N(0; I_d) then √Σ Z =ᵈ N(0; Σ).

One can compute √Σ by diagonalizing Σ in the orthogonal group: if Σ = P Diag(λ_1, . . . , λ_d) P* with P P* = I_d and λ_1, . . . , λ_d ≥ 0, then, by uniqueness of the square root of Σ as defined above, it is clear that √Σ = P Diag(√λ_1, . . . , √λ_d) P*.
– Cholesky decomposition of Σ. When the covariance matrix Σ is invertible (i.e. positive definite), it is much more efficient to rely on the Cholesky decomposition (see e.g. Numerical Recipes [119]) of Σ as

    Σ = T T*

where T is a lower triangular matrix (i.e. such that T_{ij} = 0 if i < j). Then, owing to Lemma 1.2,

    T Z =ᵈ N(0; Σ).

In fact, the Cholesky decomposition is the matrix formulation of the Gram-Schmidt orthonormalization procedure. In particular, there exists a unique such lower triangular matrix T with positive diagonal entries (T_{ii} > 0, i = 1, . . . , d), called the Cholesky matrix of Σ.

The Cholesky-based approach performs better since it roughly halves the complexity of this phase of the simulation.
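A self-contained sketch of the Cholesky route: a hand-rolled factorization (fine for small d; library routines would be used in practice) followed by the map Z ↦ TZ. The names and the example matrix are ours:

```python
import math
import random

def cholesky(sigma):
    # lower triangular T with sigma = T T*, for sigma symmetric positive definite
    d = len(sigma)
    t = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sigma[i][j] - sum(t[i][k] * t[j][k] for k in range(j))
            t[i][j] = math.sqrt(s) if i == j else s / t[j][j]
    return t

def gaussian_vector(t, rng):
    # T Z ~ N(0; sigma) when Z ~ N(0; I_d)
    d = len(t)
    z = [rng.gauss(0.0, 1.0) for _ in range(d)]
    return [sum(t[i][k] * z[k] for k in range(i + 1)) for i in range(d)]

sigma = [[1.0, 0.5], [0.5, 2.0]]
t = cholesky(sigma)
rng = random.Random(9)
xs = [gaussian_vector(t, rng) for _ in range(100_000)]
cov01 = sum(x[0] * x[1] for x in xs) / len(xs)
print(cov01)  # empirical covariance, close to sigma[0][1] = 0.5
```

The factorization is computed once and reused for every draw, which is where the efficiency gain over the square-root approach comes from.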
d d
 Exercises. 1. Let Z = (Z_1, Z_2) be a Gaussian vector such that Z_1 =ᵈ Z_2 =ᵈ N(0; 1) and Cov(Z_1, Z_2) = ρ ∈ [−1, 1].
(a) Compute for every u ∈ R² the Laplace transform L(u) = E e^{(u|Z)} of Z.
(b) Compute for every σ_1, σ_2 > 0 the correlation (²) ρ_{X_1,X_2} between the random variables X_1 = e^{σ_1 Z_1} and X_2 = e^{σ_2 Z_2}.
(c) Show that inf_{ρ∈[−1,1]} ρ_{X_1,X_2} ∈ (−1, 0) and that, when σ_i = σ > 0, inf_{ρ∈[−1,1]} ρ_{X_1,X_2} = −e^{−σ²}.
2. Let Σ be a positive definite matrix. Show the existence of a unique lower triangular matrix T and a diagonal matrix D, both with positive diagonal entries, such that Σ = T D T* and Σ_j T_{ij}² = 1 for every i = 1, . . . , d. [Hint: change the reference Euclidean norm to perform the Gram-Schmidt decomposition.] (³)

Application to the simulation of the standard Brownian motion at fixed times.
Let W = (W_t)_{t≥0} be a standard Brownian motion defined on a probability space (Ω, A, P) and let (t_1, . . . , t_n) be an increasing n-tuple of instants (0 ≤ t_1 < t_2 < · · · < t_{n−1} < t_n). One elementary definition of the standard Brownian motion is that it is a centered Gaussian process with covariance given by C^W(s, t) = E(W_s W_t) = s ∧ t (⁴). A first possibility is the simulation method relying on the Cholesky decomposition of the covariance matrix of the Gaussian vector (W_{t_1}, . . . , W_{t_n}), given by

    Σ^W_{(t_1,...,t_n)} = [t_i ∧ t_j]_{1≤i,j≤n}.

    (²) The correlation ρ_{X_1,X_2} between two square integrable, non a.s. constant random variables defined on the same probability space is defined by ρ_{X_1,X_2} = Cov(X_1, X_2)/(σ(X_1)σ(X_2)) = Cov(X_1, X_2)/√(Var(X_1)Var(X_2)).
    (³) This modified Cholesky decomposition is faster than the standard one (see e.g. [155]) since it avoids square root computations, even if, in practice, the cost of the decomposition itself remains negligible compared to that of a large-scale Monte Carlo simulation.
    (⁴) This definition does not include the fact that W has continuous paths; however, it can be derived, using the celebrated Kolmogorov criterion, that W has a modification with continuous paths (see e.g. [141]).
However, it seems more natural to use the independence and the stationarity of the increments, i.e. that

    (W_{t_1}, W_{t_2} − W_{t_1}, . . . , W_{t_n} − W_{t_{n−1}}) =ᵈ N(0; Diag(t_1, t_2 − t_1, . . . , t_n − t_{n−1}))

so that

    (W_{t_1}, W_{t_2} − W_{t_1}, . . . , W_{t_n} − W_{t_{n−1}})^t =ᵈ Diag(√t_1, √(t_2 − t_1), . . . , √(t_n − t_{n−1})) (Z_1, . . . , Z_n)^t

where (Z_1, . . . , Z_n) =ᵈ N(0; I_n). The simulation of (W_{t_1}, . . . , W_{t_n}) follows by summing up the increments.

Remark. To be more precise, one derives from the above result that

    (W_{t_1}, W_{t_2}, . . . , W_{t_n})^t = L (W_{t_1}, W_{t_2} − W_{t_1}, . . . , W_{t_n} − W_{t_{n−1}})^t where L = [1_{{i≥j}}]_{1≤i,j≤n}.

Hence, if we set T = L Diag(√t_1, √(t_2 − t_1), . . . , √(t_n − t_{n−1})), one checks on the one hand that T = [√(t_j − t_{j−1}) 1_{{i≥j}}]_{1≤i,j≤n} (with t_0 = 0) and on the other hand that

    (W_{t_1}, W_{t_2}, . . . , W_{t_n})^t =ᵈ T (Z_1, . . . , Z_n)^t.

We derive, owing to Lemma 1.2, that T T* = T I_n T* = Σ^W_{(t_1,...,t_n)}. The matrix T being lower triangular, it provides the Cholesky decomposition of the covariance matrix Σ^W_{(t_1,...,t_n)}.
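A sketch of the increment-based simulation (using Python's built-in Gaussian generator for the Z_i; names are ours):

```python
import math
import random

def brownian_at(times, rng):
    # (W_{t1}, ..., W_{tn}) by cumulating independent N(0; t_i - t_{i-1}) increments
    w, prev, path = 0.0, 0.0, []
    for t in times:
        w += math.sqrt(t - prev) * rng.gauss(0.0, 1.0)   # W_t - W_prev ~ N(0; t - prev)
        path.append(w)
        prev = t
    return path

rng = random.Random(17)
paths = [brownian_at([0.25, 0.5, 0.75, 1.0], rng) for _ in range(100_000)]
var_w1 = sum(p[-1] ** 2 for p in paths) / len(paths)
cov = sum(p[0] * p[-1] for p in paths) / len(paths)
print(var_w1, cov)  # Var W_1 = 1 and E[W_{0.25} W_1] = 0.25
```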

Practitioner's corner: Warning! In quantitative finance, especially when modeling the dynamics of several risky assets, say d, the correlation between the Brownian sources of randomness B = (B¹, . . . , B^d) attached to the log-returns is often misleading in terms of notations, since it is usually written as

    ∀ i ∈ {1, . . . , d}, B_t^i = Σ_{j=1}^{q} σ_{ij} W_t^j

where W = (W¹, . . . , W^q) is a standard q-dimensional Brownian motion (i.e. each component W^j, j = 1, . . . , q, is a standard Brownian motion and all these components are independent). The normalized covariance matrix of B (at time 1) is given by

    Cov(B_1^i, B_1^j) = Σ_{ℓ=1}^{q} σ_{iℓ} σ_{jℓ} = (σ_{i.}|σ_{j.}) = (σσ*)_{ij}

where σ_{i.} is the vector [σ_{ij}]_{1≤j≤q}, σ = [σ_{ij}]_{1≤i≤d, 1≤j≤q} and (.|.) denotes here the canonical inner product on R^q. So one should perform the Cholesky decomposition of the symmetric non-negative matrix Σ_{B_1} = σσ*. Also keep in mind that, if q < d, then Σ_{B_1} has rank at most q and cannot be invertible.

Application to the simulation of the Fractional Brownian motion at fixed times.


The fractional Brownian motion (fBm) W H = (WtH )t≥0 with Hurst coefficient H ∈ (0, 1) is
defined as a centered Gaussian process with a correlation function given for every s,t ∈ R+ by
H 1  2H 
C W (s, t) = t + s2H − |t − s|2H .
2
When H = 12 , W H is simply a standard Brownian motion. When H 6= 12 , W H has none of the usual
properties of the Brownian motion (except the stationarity of its increments and some self-similarity
properties) : it has dependent increments, it is not a martingale (or even a semi-martingale). It is
not either a Markov process.
A natural approach to simulate the fBm W^H at times t_1, . . . , t_n is to rely on the Cholesky decomposition of its auto-covariance matrix [C^{W^H}(t_i, t_j)]_{1≤i,j≤n}. On the one hand, this matrix is ill-conditioned, which induces instability in the computation of the Cholesky decomposition. This should be put in balance with the fact that such a computation can be made only once, offline, and then stored (at least for usual values of the t_i's like t_i = iT/n, i = 1, . . . , n).
Other methods have been introduced in which the auto-covariance matrix is embedded in a circulant matrix; they rely on a (fast) Fourier transform procedure. This method, originally introduced in [41], has been recently improved in [157], where a precise algorithm is described.
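A minimal sketch of the Cholesky approach in Python (using NumPy; the regular grid t_i = iT/n, the Hurst parameter and all names are ours for illustration — in practice the factor would be computed once, offline, and stored):

```python
import numpy as np

def simulate_fbm(n, T, H, rng):
    """One path (W^H_{t_1}, ..., W^H_{t_n}) at t_i = i*T/n via Cholesky."""
    t = T * np.arange(1, n + 1) / n
    s, u = np.meshgrid(t, t, indexing="ij")
    # auto-covariance C^{W^H}(s, t) = (t^{2H} + s^{2H} - |t - s|^{2H}) / 2
    cov = 0.5 * (s ** (2 * H) + u ** (2 * H) - np.abs(s - u) ** (2 * H))
    L = np.linalg.cholesky(cov)     # lower triangular T with T T* = cov
    Z = rng.standard_normal(n)      # i.i.d. N(0;1) innovations
    return L @ Z                    # W = T Z has the required covariance

rng = np.random.default_rng(0)
path = simulate_fbm(500, 1.0, 0.7, rng)
```

For H = 1/2 the covariance matrix reduces to [min(t_i, t_j)] and one recovers a standard Brownian motion sampled on the grid.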
Chapter 2

Monte Carlo method and applications to option pricing

2.1 The Monte Carlo method


The basic principle of the so-called Monte Carlo method is to implement on a computer the Strong Law of Large Numbers (SLLN): if (X_k)_{k≥1} denotes a sequence (defined on a probability space (Ω, A, P)) of independent copies of an integrable random variable X, then

P(dω)-a.s.   X̄_M(ω) := (X_1(ω) + · · · + X_M(ω))/M  −→  m_X := E X  as M → +∞.
The sequence (X_k)_{k≥1} is also called an i.i.d. sequence of random variables with distribution µ = P_X (that of X), or an (infinite) sample of the distribution µ. Two conditions are required to implement the above SLLN on a computer (which can be seen as the true definition of a Monte Carlo simulation of the distribution µ or of the random vector X).
 One can generate on the computer some as-perfect-as-possible (pseudo-)random numbers which can be seen as a “generic” sequence (U_k(ω))_{k≥1} of a sample (U_k)_{k≥1} of the uniform distribution on [0, 1]. Note that ((U_{(k−1)d+ℓ}(ω))_{1≤ℓ≤d})_{k≥1} is then generic for the sample ((U_{(k−1)d+ℓ})_{1≤ℓ≤d})_{k≥1} of the uniform distribution on [0, 1]^d for any d ≥ 1 (theoretically speaking) since U([0, 1]^d) = U([0, 1])^{⊗d}. This problem has already been briefly discussed in the introductory chapter.
 One can “represent” X either as
– X = ϕ(U ) where U is a uniformly distributed random vector on [0, 1]d where the Borel
function ϕ : u 7→ ϕ(u) can be computed at any u ∈ [0, 1]d
or
– X = ϕτ (U1 , . . . , Uτ ) where (Un )n≥1 is a sequence of i.i.d. uniformly distributed over
[0, 1], τ is a simulatable finite stopping rule (or stopping time) for the sequence (Uk )k≥1 , taking
values in N∗ . By “stopping rule” we mean that the event {τ = k} ∈ σ(U1 , . . . , Uk ) for every k ≥ 1
and by simulatable we mean that 1{τ =k} = ψk (U1 , . . . , Uk ) where ψk has an explicit form for every
k ≥ 1. We also assume that for every k ≥ 1, the function ϕk is a computable function (defined on
the set of finite [0, 1]k -valued sequences) as well.

31

This procedure is the core of the Monte Carlo simulation. We provided several examples of
such representations in the previous chapter. For further developments on that wide topic, we refer
to [42] and the references therein, but in some way an important part of the scientific activity in
Probability Theory is motivated by or can be applied to simulation.
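As a toy instance of the first representation, X = ϕ(U) with ϕ(u) = −ln(u) is Exp(1)-distributed (inverse distribution function method), so m_X = E X = 1 can be estimated from uniform numbers alone. A sketch in Python (sample size and names are ours):

```python
import numpy as np

rng = np.random.default_rng(42)

# representation X = phi(U) with phi(u) = -ln(u): X follows an Exp(1) law
phi = lambda u: -np.log(u)

M = 100_000
U = rng.random(M)           # a "generic" sample of U([0, 1])
x_bar_M = phi(U).mean()     # SLLN: converges a.s. to m_X = E X = 1
```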

Once these conditions are fulfilled, one can perform a Monte Carlo simulation. But this leads
to two important issues:
– what about the rate of convergence of the method?
and
– how can the resulting error be controlled?
The answers to these questions call upon fundamental results in Probability Theory and Statistics.
 Rate(s) of convergence. The (weak) rate of convergence in the SLLN is ruled by the Central Limit Theorem (CLT) which says that if X is square integrable (X ∈ L²(P)), then

√M ( X̄_M − m_X ) −→^L N(0; σ_X²) as M → +∞

where σ_X² = Var(X) := E((X − E X)²) = E(X²) − (E X)² is the variance (its square root σ_X is called standard deviation) (1). Also note that the mean quadratic rate of convergence (i.e. the rate in L²(P)) is exactly

‖X̄_M − m_X‖_2 = σ_X / √M.

If σX = 0, then X M = X = mX P-a.s.. So, from now on, we may assume without loss of
generality that σX > 0.
The Law of the Iterated Logarithm (LIL) provides an a.s. rate of convergence, namely

limsup_M √( M / (2 log(log M)) ) ( X̄_M − m_X ) = σ_X   and   liminf_M √( M / (2 log(log M)) ) ( X̄_M − m_X ) = −σ_X   a.s.

A proof of this (difficult) result can be found e.g. in [149]. All these rates stress the main drawback of the Monte Carlo method: it is a slow method, since dividing the error by 2 requires multiplying the size of the simulation by 4.
(1) The symbol −→^L stands for the convergence in distribution (or “in law”, whence the notation “L”): a sequence of random variables (Y_n)_{n≥1} converges in distribution toward Y_∞ if

∀ f ∈ Cb (R, R), Ef (Yn ) −→ Ef (Y∞ ) as n → +∞.

It can be defined equivalently as the weak convergence of the distributions PYn toward the distribution PY∞ . Con-
vergence in distribution is also characterized by the following property

∀ A ∈ B(R), P(Y∞ ∈ ∂A) = 0 =⇒ P(Yn ∈ A) −→ P(Y∞ ∈ A) as n → +∞.

The extension to Rd -valued random vectors is straightforward. See [24] for a presentation of weak convergence of
probability measures in a general framework.

 Data driven control of the error: confidence level and confidence interval. Assume that σX > 0
(otherwise the problem is empty). As concerns the control of the error, one relies on the CLT . It
is obvious that the CLT also reads
√M ( X̄_M − m_X ) / σ_X −→^L N(0; 1) as M → +∞.

Furthermore, the normal distribution having a density, it has no atom. Consequently, this convergence implies (in fact it is equivalent, see [24]) that for all real numbers a > b,

lim_{M→+∞} P( √M (X̄_M − m_X)/σ_X ∈ [b, a] ) = P( N(0; 1) ∈ [b, a] ) = Φ0(a) − Φ0(b)

where Φ0 denotes the distribution function of the standard normal distribution, i.e.

Φ0(x) = ∫_{−∞}^{x} e^{−ξ²/2} dξ / √(2π).

 Exercise. Show that, like the distribution function of any atomless symmetric random variable, the distribution function of the centered normal distribution on the real line satisfies
∀ x ∈ R, Φ0 (x) + Φ0 (−x) = 1.

Deduce that P(|N (0; 1)| ≤ a) = 2Φ0 (a) − 1 for every a > 0.

In turn, if X_1 ∈ L³(P), the convergence rate in the Central Limit Theorem is ruled by the Berry–Esseen Theorem (see [149]). It turns out to be again 1/√M (2), which is rather slow from a statistical viewpoint but is not a real problem within the usual range of Monte Carlo simulations (at least many thousands, usually one hundred thousand or one million paths). Consequently, one can assume that √M (X̄_M − m_X)/σ_X does have a standard normal distribution. This means in particular that one can design a probabilistic control of the error directly derived from statistical concepts: let α ∈ (0, 1) denote a confidence level (close to 1) and let q_α be the two-sided α-quantile defined as the unique solution to the equation

P(|N (0; 1)| ≤ qα ) = α i.e. 2Φ0 (qα ) − 1 = α.

Then, setting a = q_α and b = −q_α, one defines the theoretical random confidence interval at level α

J_M^α := [ X̄_M − q_α σ_X/√M , X̄_M + q_α σ_X/√M ],
(2) Namely, if σ_X > 0,

∀ M ≥ 1, ∀ x ∈ R,   | P( √M (X̄_M − m_X) ≤ x σ_X ) − Φ0(x) | ≤ ( C E|X_1 − E X_1|³ / σ_X³ ) · 1 / ( √M (1 + |x|³) ).

See [149].

which satisfies

P(m_X ∈ J_M^α) = P( √M |X̄_M − m_X| / σ_X ≤ q_α ) −→ P( |N(0; 1)| ≤ q_α ) = α as M → +∞.

However, at this stage this procedure remains purely theoretical since the confidence interval J_M involves the standard deviation σ_X of X, which is usually unknown.
Here comes the “trick” which made the tremendous success of the Monte Carlo method: the variance σ_X² of X can in turn be evaluated on-line by simply adding a companion Monte Carlo simulation, namely

V_M = (1/(M − 1)) Σ_{k=1}^{M} (X_k − X̄_M)² = (1/(M − 1)) Σ_{k=1}^{M} X_k² − (M/(M − 1)) X̄_M² −→ Var(X) = σ_X² as M → +∞.

The above convergence (3) follows from the SLLN applied to the i.i.d. sequence of integrable random variables (X_k²)_{k≥1} (and the convergence of X̄_M, which follows from the SLLN as well). It is an easy exercise to show that, moreover, E(V_M) = σ_X², i.e., with the terminology of Statistics, V_M is an unbiased estimator of σ_X². This remark is of little importance in practice due to the usual – large – range of Monte Carlo simulations. Note that the above a.s. convergence is again ruled by a CLT if X ∈ L⁴(P) (so that X² ∈ L²(P)).

 Exercises. 1. Show that V_M is unbiased, that is E V_M = σ_X².
2. Show that the sequence (X̄_M, V_M)_{M≥1} satisfies the following recursive equations:

X̄_M = X̄_{M−1} + (1/M)( X_M − X̄_{M−1} ) = ((M − 1)/M) X̄_{M−1} + X_M/M,   M ≥ 1

(with the convention X̄_0 = 0) and

V_M = V_{M−1} − (1/(M − 1)) ( V_{M−1} − (X_M − X̄_{M−1})(X_M − X̄_M) ) = ((M − 2)/(M − 1)) V_{M−1} + (X_M − X̄_{M−1})² / M,   M ≥ 2.
(3) When X “already” has an N(0; 1) distribution, then (M − 1)V_M has a χ²(M − 1)-distribution. The χ²(ν)-distribution, known as the χ²-distribution with ν ∈ N* degrees of freedom, is the distribution of the sum Z_1² + · · · + Z_ν² where Z_1, . . . , Z_ν are i.i.d. with N(0; 1) distribution. The loss of one degree of freedom for V_M comes from the fact that X_1, . . . , X_M and X̄_M satisfy a linear equation which induces a linear constraint. This result is known as Cochran’s Theorem.
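The recursive equations of Exercise 2 are exactly what one implements to update X̄_M and V_M on-line during a simulation, without storing the whole sample. A sketch in Python (the function name and calling convention are ours):

```python
def update(x_bar, v, x_new, M):
    """One step of the recursive equations for (X_bar_M, V_M).

    x_bar, v: values of X_bar_{M-1} and V_{M-1} (convention X_bar_0 = 0);
    x_new:    the M-th simulated value X_M.
    Returns the pair (X_bar_M, V_M)."""
    x_bar_new = x_bar + (x_new - x_bar) / M
    if M >= 2:
        # V_M = (M-2)/(M-1) * V_{M-1} + (X_M - X_bar_{M-1})^2 / M
        v = (M - 2) / (M - 1) * v + (x_new - x_bar) ** 2 / M
    return x_bar_new, v
```

Feeding a sample through `update` reproduces the batch empirical mean and the unbiased empirical variance V_M.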

As a consequence, one derives from Slutsky’s Theorem that

√M ( X̄_M − m_X ) / √(V_M) = √M ( X̄_M − m_X )/σ_X × σ_X/√(V_M) −→^L N(0; 1) as M → +∞.

Of course, within the usual range of Monte Carlo simulations, one can always consider that, for large M,

√M ( X̄_M − m_X ) / √(V_M) ≈ N(0; 1).

Note that when X is itself normally distributed, one shows that the empirical mean X̄_M and the empirical variance V_M are independent, so that the true distribution of the left-hand side of the above (approximate) equation is a Student distribution with M − 1 degrees of freedom (4), denoted T(M − 1).

Finally, one defines the confidence interval at level α of the Monte Carlo simulation by

I_M = [ X̄_M − q_α √(V_M/M) , X̄_M + q_α √(V_M/M) ]

which will still satisfy (for large M)

P(m_X ∈ I_M) ≈ P( |N(0; 1)| ≤ q_α ) = α.

For numerical implementations one often considers q_α = 2, which corresponds to the confidence level α = 95.44% ≈ 95%.
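The three ingredients — empirical mean, companion variance estimator V_M and the interval I_M with q_α ≈ 1.96 for α = 95% — fit in a few lines of Python. A sketch on a toy target E e^Z = e^{1/2}, Z distributed as N(0; 1) (names, seed and sample size are ours):

```python
import numpy as np

def monte_carlo(sampler, M, q_alpha=1.96):
    """Crude Monte Carlo estimate of E X with its confidence interval I_M."""
    X = sampler(M)
    x_bar = X.mean()                  # empirical mean X_bar_M
    v = X.var(ddof=1)                 # companion (unbiased) estimator V_M
    half = q_alpha * np.sqrt(v / M)   # half-width q_alpha * sqrt(V_M / M)
    return x_bar, (x_bar - half, x_bar + half)

rng = np.random.default_rng(7)
x_bar, (lo, hi) = monte_carlo(lambda M: np.exp(rng.standard_normal(M)), 100_000)
# true value: E exp(Z) = exp(1/2) ~ 1.6487
```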

The “academic” birth of the Monte Carlo method dates back to 1949 with the publication of the seminal paper by Metropolis and Ulam, “The Monte Carlo method”, in the Journal of the American Statistical Association (44, 335–341). In fact, the method had already been extensively used for several years as a secret project of the U.S. Defense Department.
One can also consider that, in fact, the Monte Carlo method goes back to the celebrated
Buffon’s needle experiment and should subsequently be credited to the French naturalist Georges
Louis Leclerc, Comte de Buffon (1707-1788).

As concerns Finance, and more precisely the pricing of derivatives, it seems difficult to trace the origin of the implementation of the Monte Carlo method for the pricing and hedging of options. In the academic literature, the first paper dealing in a systematic manner with a Monte Carlo approach seems to go back to Boyle in [28].
(4) The Student distribution with ν degrees of freedom, denoted T(ν), is the law of the random variable √ν Z_0 / √(Z_1² + · · · + Z_ν²) where Z_0, Z_1, . . . , Z_ν are i.i.d. with N(0; 1) distribution. As expected for the coherence of what precedes, one shows that T(ν) converges in distribution to the normal distribution N(0; 1) as ν → +∞.

2.2 Vanilla option pricing in a Black-Scholes model: the premium


For the sake of simplicity, one considers a 2-dimensional correlated Black-Scholes model (under its unique risk neutral probability), but a general d-dimensional model can be defined likewise.

dXt0 = rXt0 dt, X00 = 1,

dXt1 = Xt1 (rdt + σ1 dWt1 ), X01 = x10 , (2.1)

dXt2 = Xt2 (rdt + σ2 dWt2 ), X02 = x20 ,

with the usual notations (r interest rate, σ_i volatility of X^i). In particular, W = (W^1, W^2) denotes a correlated bi-dimensional Brownian motion such that d⟨W^1, W^2⟩_t = ρ dt (this means that W_t^2 = ρ W_t^1 + √(1 − ρ²) W̃_t^2, where (W^1, W̃^2) is a standard 2-dimensional Brownian motion). The filtration F of this market is the augmented filtration of W, i.e. F_t = F_t^W := σ(W_s, 0 ≤ s ≤ t, N_P) where N_P denotes the family of P-negligible sets of A (5). By “filtration of the market” we mean that F is the smallest filtration satisfying the usual conditions to which the process (X_t) is adapted.
Then, for every t ∈ [0, T],

X_t^0 = e^{rt},   X_t^i = x_0^i exp( (r − σ_i²/2) t + σ_i W_t^i ),   i = 1, 2.

A European vanilla option with maturity T > 0 is an option related to a European payoff

hT := h(XT )

which only depends on X at time T . In such a complete market the option premium at time 0 is
given by
V0 = e−rT E(h(XT ))

and more generally, at any time t ∈ [0, T ],

Vt = e−r(T −t) E(h(XT ) |Ft ).

The fact that W has independent stationary increments implies that X^1 and X^2 have independent stationary ratios, i.e.

X_T^i / X_t^i  (d)=  X_{T−t}^i / x_0^i   and is independent of F_t^W.

As a consequence, if
V0 := v(x0 , T ),

(5) One shows, owing to the Kolmogorov 0-1 law, that this filtration is right continuous, i.e. F_t = ∩_{s>t} F_s. A right continuous filtration which contains the P-negligible sets satisfies the so-called “usual conditions”.

then

V_t = e^{−r(T−t)} E( h(X_T) | F_t )
    = e^{−r(T−t)} E( h( ( X_t^i × X_T^i / X_t^i )_{i=1,2} ) | F_t )
    = e^{−r(T−t)} [ E h( ( x^i X_{T−t}^i / x_0^i )_{i=1,2} ) ]_{x^i = X_t^i}
    = v(X_t, T − t).

Examples. • Vanilla call with strike price K:

h(x1 , x2 ) = (x1 − K)+ .

There is a closed form for this option, which is but the celebrated Black-Scholes formula for options on stocks (without dividend):

Call_0^{BS} = C(x_0, K, r, σ_1, T) = x_0 Φ0(d_1) − e^{−rT} K Φ0(d_2)   (2.2)

with d_1 = [ log(x_0/K) + (r + σ_1²/2) T ] / (σ_1 √T),   d_2 = d_1 − σ_1 √T,

where Φ0 denotes the distribution function of the N(0; 1)-distribution.
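Formula (2.2) translates directly into code; a sketch in Python, with Φ0 written through the error function, Φ0(x) = (1 + erf(x/√2))/2 (parameter values below are ours):

```python
from math import log, sqrt, exp, erf

def norm_cdf(x):
    """Distribution function Phi_0 of the N(0;1) distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(x0, K, r, sigma, T):
    """Black-Scholes premium (2.2) of a vanilla call (no dividend)."""
    d1 = (log(x0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return x0 * norm_cdf(d1) - exp(-r * T) * K * norm_cdf(d2)

price = bs_call(100.0, 100.0, 0.1, 0.2, 1.0)  # at-the-money, r=10%, sigma=20%
```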

• Best-of call with strike price K:

h_T = ( max(X_T^1, X_T^2) − K )_+ .

A quasi-closed form is available involving the distribution function of the bi-variate (correlated)
normal distribution. Laziness may lead to price it by MC (P DE is also appropriate but needs more
care) as detailed below.

• Exchange Call Spread with strike price K:

h_T = ( (X_T^1 − X_T^2) − K )_+ .
For this payoff no closed form is available. One has the choice between a PDE approach (quite
appropriate in this 2-dimensional setting but requiring some specific developments) and a Monte
Carlo simulation.

We will illustrate below the regular Monte Carlo procedure on the example of a Best-of call.

 Pricing a Best-of Call by a Monte Carlo simulation. To implement a (crude) Monte


Carlo simulation one needs to write the payoff as a function of independent uniformly distributed
random variables, or equivalently as a function of independent random variables that are simple
38 CHAPTER 2. MONTE CARLO METHOD AND APPLICATIONS TO OPTION PRICING

functions of independent uniformly distributed random variables, namely a centered and normalized
Gaussian. In our case, it amounts to writing
d
e−rT hT = ϕ(Z 1 , Z 2 )
√ 1 √
   2   2  
1 σ1 2 σ2 1
p
2 2
 −rT
:= max x0 exp − T +σ1 T Z , x0 exp − T +σ2 T ρZ + 1 − ρ Z −Ke
2 2 +
d
where Z = (Z 1 , Z 2 ) = N (0; I2 ) (the dependence of ϕ in xi0 , etc, is dropped). Then, simulating a
M -sample (Zm )1≤m≤M of the N (0; I2 ) distribution using e.g. the Box-Müller yields the estimate
Best-of Call_0 = e^{−rT} E( ( max(X_T^1, X_T^2) − K )_+ ) = E( ϕ(Z^1, Z^2) ) ≈ ϕ̄_M := (1/M) Σ_{m=1}^{M} ϕ(Z_m).
One computes an estimate of the variance using the same sample:

V_M(ϕ) = (1/(M − 1)) Σ_{m=1}^{M} ϕ(Z_m)² − (M/(M − 1)) ϕ̄_M² ≈ Var(ϕ(Z))
since M is large enough. Then one designs a confidence interval for E ϕ(Z) at level α ∈ (0, 1) by setting

I_M^α = [ ϕ̄_M − q_α √(V_M(ϕ)/M) , ϕ̄_M + q_α √(V_M(ϕ)/M) ]

where q_α is defined by P(|N(0; 1)| ≤ q_α) = α (or equivalently 2Φ0(q_α) − 1 = α).
Numerical Application: We consider a European “Best-of-Call” option with the following parame-
ters
r = 0.1, σi = 0.2 = 20 %, ρ = 0, X0i = 100, T = 1, K = 100
(being aware that the assumption ρ = 0 is not realistic). The confidence level is set (as usual) at
α = 0.95.
The Monte Carlo simulation parameters are M = 2^m, m = 10, . . . , 20 (have in mind that 2^10 = 1024). The (typical) results of a numerical simulation are reported in the table below.

M                 ϕ̄_M       I_M
2^10 = 1024       18.8523   [17.7570 ; 19.9477]
2^11 = 2048       18.9115   [18.1251 ; 19.6979]
2^12 = 4096       18.9383   [18.3802 ; 19.4963]
2^13 = 8192       19.0137   [18.6169 ; 19.4105]
2^14 = 16384      19.0659   [18.7854 ; 19.3463]
2^15 = 32768      18.9980   [18.8002 ; 19.1959]
2^16 = 65536      19.0560   [18.9158 ; 19.1962]
2^17 = 131072     19.0840   [18.9849 ; 19.1831]
2^18 = 262144     19.0359   [18.9660 ; 19.1058]
2^19 = 524288     19.0765   [19.0270 ; 19.1261]
2^20 = 1048576    19.0793   [19.0442 ; 19.1143]


Figure 2.1: Black-Scholes Best of Call. The Monte Carlo estimator (×) and its confidence
interval (+) at level α = 95 %.

 Exercise. Proceed likewise with an Exchange Call spread option.

Remark. Once the script is written for one option, i.e. one payoff function, it is almost instan-
taneous to modify it to price another option based on a new payoff function: the Monte Carlo
method is very flexible, much more than a P DE approach.

Conclusion for practitioners. Performing a Monte Carlo simulation to compute mX = EX,


consists of three steps:
 Specification of a confidence level α ∈ (0, 1) (α ≈ 1).
 Simulation of an M -sample X1 , X2 , . . . , XM of i.i.d. random vectors having the same distri-
bution as X and (possibly recursive) computation of both its empirical mean X̄M and its empirical
variance V̄M .
 Computation of the resulting confidence interval at confidence level α which will be the result
of the Monte Carlo simulation.

We will see further on in Chapter 3 how to specify the size M of the simulation to comply with
an a priori accuracy level.

2.3 Greeks (sensitivity to the option parameters): a first approach


2.3.1 Background on differentiation of functions defined by an integral
The greeks or sensitivities denote the set of parameters obtained as derivatives of the premium of an option with respect to some of its parameters: the starting value, the volatility, etc. In elementary situations, one simply needs to apply some more or less standard theorems like the one reproduced below (see [30], Chapter 8, for a proof). A typical example of such an “elementary situation” is the case of a (possibly multi-dimensional) Black-Scholes model.

Theorem 2.1 (Interchanging differentiation and expectation) Let (Ω, A, P) be a probabil-


ity space, let I be a nontrivial interval of R. Let ϕ : I × Ω → R be a Bor(I) ⊗ A-measurable
function.
(a) Local version. Let x0 ∈ I. If the function ϕ satisfies:
(i) for every x ∈ I, the random variable ϕ(x, . ) ∈ L1R (Ω, A, P),
(ii) P(dω)-a.s., ∂ϕ/∂x (x_0, ω) exists,
(iii) there exists Y ∈ L¹_{R_+}(P) such that, for every x ∈ I,

P(dω)-a.s.   |ϕ(x, ω) − ϕ(x_0, ω)| ≤ Y(ω) |x − x_0|,

then the function Φ(x) := E ϕ(x, ·) = ∫ ϕ(x, ω) P(dω) is defined at every x ∈ I and differentiable at x_0, with derivative

Φ'(x_0) = E( ∂ϕ/∂x (x_0, ·) ).
(b) Global version. If ϕ satisfies (i) and
(ii)_glob P(dω)-a.s., ∂ϕ/∂x (x, ω) exists at every x ∈ I,
(iii)_glob there exists Y ∈ L¹_{R_+}(Ω, A, P) such that, for every x ∈ I,

P(dω)-a.s.   | ∂ϕ/∂x (x, ω) | ≤ Y(ω),

then the function Φ(x) := E ϕ(x, ·) is defined and differentiable at every x ∈ I, with derivative

Φ'(x) = E( ∂ϕ/∂x (x, ·) ).

Remarks. • The local version of the above theorem can be necessary to prove the differentiability
of a function defined by an expectation over the whole real line (see exercise below).
• All of what precedes remains true if one replaces the probability space (Ω, A, P) by any measurable space (E, E, µ) where µ is a non-negative measure (see again Chapter 8 in [30]). However, this extension is no longer true as stated when working under the uniform integrability assumption mentioned in the exercises below.
• Some variants of the result can be established to get a theorem for right or left differentiability
of functions Φ = Eω ϕ(x, ω) defined on the real line, (partially) differentiable functions defined on
Rd , for holomorphic functions on C, etc.
• There exists a local continuity result for functions defined as an expectation which is quite similar to Claim (a). The domination property by an integrable non-negative random variable Y is then requested on ϕ(x, ω) itself. A precise statement is provided in Chapter 11 (with the same notations). For a proof we still refer to [30], Chapter 8. This result is in particular useful to establish the differentiability of a multi-variable function by combining the existence and the continuity of its partial derivatives.

 Exercise. Let Z (d)= N(0; 1) be defined on a probability space (Ω, A, P), ϕ(x, ω) = (x − Z(ω))_+ and Φ(x) = E(x − Z)_+, x ∈ R.
(a) Show that Φ is differentiable on the real line by applying the local version of Theorem 2.1 and compute its derivative.
(b) Let I denote a non-trivial interval of R. Show that if ω ∈ {Z ∈ I} (i.e. Z(ω) ∈ I), the function x ↦ (x − Z(ω))_+ is not differentiable on the whole interval I.
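A numerical sanity check of part (a) can be sketched in Python (it anticipates the answer of (a), namely Φ'(x) = P(Z ≤ x) = Φ0(x), since ∂ϕ/∂x (x, ω) = 1_{Z(ω) < x} a.s.; sample size and names are ours):

```python
import numpy as np
from math import erf, sqrt

norm_cdf = lambda x: 0.5 * (1.0 + erf(x / sqrt(2.0)))  # Phi_0

rng = np.random.default_rng(3)
Z = rng.standard_normal(1_000_000)
Phi = lambda x: np.maximum(x - Z, 0.0).mean()   # MC proxy of E(x - Z)_+

x, h = 0.4, 1e-4
deriv_fd = (Phi(x + h) - Phi(x - h)) / (2 * h)  # symmetric finite difference
# expected: Phi'(0.4) = Phi_0(0.4) ~ 0.6554
```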

 Exercises (extension to uniform integrability). One can replace the domination property (iii) in Claim (a) (local version) of the above Theorem 2.1 by the less stringent uniform integrability assumption

(iii)_ui   ( (ϕ(x, ·) − ϕ(x_0, ·)) / (x − x_0) )_{x ∈ I\{x_0}}  is P-uniformly integrable on (Ω, A, P).

For the definition and some background on uniform integrability, see Chapter 11, Section 11.5.
1. Show that (iii) implies (iii)_ui.
2. Show that (i)-(ii)-(iii)ui implies the conclusion of Claim (a) (local version) in the above Theo-
rem 2.1.
3. State a “uniform integrable” counterpart of (iii)glob to extend Claim (b) (global version) of
Theorem 2.1.
4. Show that uniform integrability of the above family of random variables follows from its Lp -
boundedness for (any) p > 1.

2.3.2 Working on the scenarii space (Black-Scholes model)


To illustrate the different methods to compute the sensitivities, we will consider the 1-dimensional Black-Scholes model

dX_t^x = X_t^x ( r dt + σ dW_t ),   X_0^x = x > 0,

so that X_t^x = x exp( (r − σ²/2) t + σ W_t ). Then we consider, for every x ∈ (0, ∞),

Φ(x) := E ϕ(X_T^x),   (2.3)

where ϕ : (0, +∞) → R lies in L¹(P_{X_T^x}) for every x ∈ (0, ∞) and T > 0. This corresponds (when ϕ is non-negative) to vanilla payoffs with maturity T. However, we purposely skip the discounting factor in what follows to alleviate notations: one can always imagine it is included as a constant in the function ϕ below since we work at a fixed time T. The updating of the formulas is obvious.
First, we are interested in computing the first two derivatives Φ0 (x) and Φ”(x) of the function
Φ which correspond (up to the discounting factor) to the δ-hedge of the option and its γ parameter
respectively. The second parameter γ is involved in the so-called “tracking error”. But other
sensitivities are of interest to the practitioners like the vega i.e. the derivative of the (discounted)
function Φ with respect to the volatility parameter, the ρ (derivative with respect to the interest
rate r), etc. The aim is to derive some representation of these sensitivities as expectations in order
to compute them using a Monte Carlo simulation, in parallel with the premium computation. Of

course the Black-Scholes model is simply a toy model to illustrate such an approach since, for
this model, many alternative and more efficient methods can be implemented to carry out these
computations.
We will first work on the scenarii space (Ω, A, P), because this approach contains the “seed” of
methods that can be developed in much more general settings in which the SDE has no explicit
solution like it has in the Black-Scholes model. On the other hand, as soon as an explicit expression
is available for the density pT (x, y) of XTx , it is more efficient to use the next section 2.3.3.
Proposition 2.1 (a) If ϕ : (0, +∞) → R is differentiable and ϕ' has polynomial growth, then the function Φ defined by (2.3) is differentiable and

∀ x > 0,   Φ'(x) = E( ϕ'(X_T^x) X_T^x / x ).   (2.4)

(b) If ϕ : (0, +∞) → R is differentiable outside a countable set and is locally Lipschitz continuous with polynomial growth in the following sense:

∃ m ≥ 0, ∃ C ∈ (0, ∞), ∀ u, v ∈ R_+,   |ϕ(u) − ϕ(v)| ≤ C |u − v| (1 + |u|^m + |v|^m),

then Φ is differentiable everywhere on (0, ∞) and Φ' is given by (2.4).
(c) If ϕ : (0, +∞) → R is simply a Borel function with polynomial growth, then Φ is still differentiable and

∀ x > 0,   Φ'(x) = E( ϕ(X_T^x) W_T / (xσT) ).   (2.5)
Proof. (a) This straightforwardly follows from the explicit expression for X_T^x and the above differentiation Theorem 2.1 (global version) since, for every x ∈ (0, ∞),

∂/∂x ϕ(X_T^x) = ϕ'(X_T^x) ∂X_T^x/∂x = ϕ'(X_T^x) X_T^x / x.

Now |ϕ'(u)| ≤ C(1 + |u|^m) (m ∈ N, C ∈ (0, ∞)) so that, if 0 < x ≤ L < +∞,

| ∂/∂x ϕ(X_T^x) | ≤ C'_{r,σ,T} ( 1 + L^m exp((m + 1) σ W_T) ) ∈ L¹(P)

where C'_{r,σ,T} is another positive real constant. This yields the domination condition on the derivative.
(b) This claim follows from Theorem 2.1(a) (local version) and the fact that, for every T > 0,
P(XTx = ξ) = 0 for every ξ ≥ 0.
(c) Now, still under the assumptions of claim (a) (with µ := r − σ²/2),
Φ'(x) = ∫_R ϕ'( x exp(µT + σ√T u) ) exp(µT + σ√T u) e^{−u²/2} du/√(2π)
      = (1/(xσ√T)) ∫_R ∂/∂u [ ϕ( x exp(µT + σ√T u) ) ] e^{−u²/2} du/√(2π)
      = −(1/(xσ√T)) ∫_R ϕ( x exp(µT + σ√T u) ) ∂/∂u [ e^{−u²/2} ] du/√(2π)
      = (1/(xσ√T)) ∫_R ϕ( x exp(µT + σ√T u) ) u e^{−u²/2} du/√(2π)
      = (1/(xσT)) ∫_R ϕ( x exp(µT + σ√T u) ) √T u e^{−u²/2} du/√(2π)

where we used an integration by parts to obtain the third equality, taking advantage of the fact that, owing to the polynomial growth assumption on ϕ,

lim_{|u|→+∞} ϕ( x exp(µT + σ√T u) ) e^{−u²/2} = 0.

Finally, coming back to Ω,

Φ'(x) = (1/(xσT)) E( ϕ(X_T^x) W_T ).   (2.6)
When ϕ is not differentiable, let us first sketch the extension of the formula by a density argument. When ϕ is continuous and has compact support in R_+, one may assume without loss of generality that ϕ is defined on the whole real line as a continuous function with compact support. Then ϕ can be uniformly approximated by differentiable functions ϕ_n with compact support (use a convolution by mollifiers, see [30], Chapter 8). Then, with obvious notations, Φ'_n(x) := (1/(xσT)) E( ϕ_n(X_T^x) W_T ) converges uniformly on compact sets of (0, ∞) to the function f(x) := (1/(xσT)) E( ϕ(X_T^x) W_T ) given by the right-hand side of (2.6) since

| Φ'_n(x) − f(x) | ≤ ‖ϕ_n − ϕ‖_sup E|W_T| / (xσT).

Furthermore Φ_n(x) converges (uniformly) toward Φ(x) on (0, ∞). Consequently Φ is differentiable with derivative Φ' = f, which is precisely (2.6). ♦

Remark. We will see in the next section a much quicker way to establish claim (c). The above method of proof, based on an integration by parts, can be seen as a toy-introduction to a systematic way to produce random weights like W_T/(xσT) as a substitute for the differentiation of the function ϕ, especially when the differential does not exist. The most general extension of this approach, developed on the Wiener space (6) for functionals of the Brownian motion, is known as the Malliavin-Monte Carlo method.

 Exercise (Extension to Borel functions with polynomial growth). (a) Show that as soon as ϕ is a Borel function with polynomial growth, the function Φ defined by (2.3) is continuous. [Hint: use that the distribution of X_T^x has a probability density p_T(x, y) on the positive real line which depends continuously on x and apply the continuity theorem for functions defined by an integral, see Theorem 11.3(a) in the “Miscellany” Chapter 11.]
(b) Show that (2.6) holds true as soon as ϕ is a bounded Borel function. [Hint: apply the Functional Monotone Class Theorem (see Theorem 11.5 in the “Miscellany” Chapter 11) to an appropriate vector subspace of functions ϕ and use the Baire σ-field Theorem.]
(c) Extend the result to Borel functions ϕ with polynomial growth. [Hint: use that ϕ(X_T^x) ∈ L¹(P) and ϕ = lim_n ϕ_n with ϕ_n = (n ∧ ϕ) ∨ (−n).]
(d) Derive from what precedes a simple expression for Φ(x) when ϕ = 1_I is the indicator function of the interval I.

Comments: The extension to Borel functions ϕ always needs at some place an argument based on the regularizing effect of the diffusion induced by the Brownian motion. As a matter of fact, if X_t^x were the solution to a regular ODE, this extension to non-continuous functions ϕ would fail. We propose in the next section an approach (the log-likelihood method) directly based on this regularizing effect through the direct differentiation of the probability density p_T(x, y) of X_T^x.

(6) The Wiener space C(R_+, R^d) is endowed with its Borel σ-field for the topology of uniform convergence on compact sets, namely σ( ω ↦ ω(t), t ∈ R_+ ), and with the Wiener measure, i.e. the distribution of a standard d-dimensional Brownian motion (W^1, . . . , W^d).

 Exercise. Prove claim (b) in detail.

Note that the assumptions of claim (b) are satisfied by usual payoff functions like ϕ_Call(x) = (x − K)_+ or ϕ_Put(x) := (K − x)_+ (when X_T^x has a continuous distribution). In particular, this shows that

∂/∂x E( ϕ_Call(X_T^x) ) = E( 1_{X_T^x ≥ K} X_T^x / x ).

The computation of this quantity – which is part of that of the Black-Scholes formula – finally yields, as expected,

∂/∂x E( ϕ_Call(X_T^x) ) = e^{rT} Φ0(d_1)

(keep in mind that the discounting factor is missing).
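Both representations of the δ can be estimated on the same sample — formula (2.4) with ϕ'_Call = 1_{[K,+∞)} and the weighted formula (2.5) — and checked against the closed form e^{rT} Φ0(d_1) above (still undiscounted). A sketch in Python (sample size, seed and names are ours):

```python
import numpy as np
from math import log, sqrt, exp, erf

x0, K, r, sigma, T = 100.0, 100.0, 0.1, 0.2, 1.0
rng = np.random.default_rng(5)
M = 1_000_000

W = sqrt(T) * rng.standard_normal(M)                       # W_T
X = x0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * W)    # X_T^x

# pathwise estimator (2.4): E[ 1_{X_T >= K} X_T / x ]
delta_pathwise = ((X >= K) * X / x0).mean()
# weighted estimator (2.5): E[ (X_T - K)_+ W_T / (x sigma T) ]
delta_weight = (np.maximum(X - K, 0.0) * W / (x0 * sigma * T)).mean()

d1 = (log(x0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
delta_exact = exp(r * T) * 0.5 * (1.0 + erf(d1 / sqrt(2.0)))  # e^{rT} Phi_0(d1)
```

One typically observes that the weighted estimator has a larger variance than the pathwise one, which motivates the variance reduction exercises below.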

 Exercises. 0. A comparison. Try a direct differentiation of the Black-Scholes formula (2.2) and compare with a (formal) differentiation based on Theorem 2.1. You should find by both methods

∂Call_0^{BS}/∂x (x) = Φ0(d_1).
But the true question is: “how long did it take you to proceed?”

1. Application to the computation of the γ (i.e. Φ''(x)). Show that, if ϕ is differentiable with a derivative having polynomial growth,

Φ''(x) = (1/(x²σT)) E( ( ϕ'(X_T^x) X_T^x − ϕ(X_T^x) ) W_T )

and that, if ϕ is continuous with compact support,

Φ''(x) = (1/(x²σT)) E( ϕ(X_T^x) ( W_T²/(σT) − W_T − 1/σ ) ).

Extend this identity to the case where ϕ is simply Borel with polynomial growth. Note that a (somewhat simpler) formula also exists when the function ϕ is itself twice differentiable, but such an assumption is not realistic for financial applications.

2. Variance reduction for the δ (7). The above formulae are clearly not the unique representations of the δ as an expectation: using that E W_T = 0 and E X_T^x = x e^{rT}, one derives immediately that

Φ'(x) = ϕ'(x e^{rT}) e^{rT} + E( ( ϕ'(X_T^x) − ϕ'(x e^{rT}) ) X_T^x / x )

(7) In this exercise we slightly anticipate the next chapter, entirely devoted to variance reduction.

as soon as ϕ is differentiable at xerT . When ϕ is simply Borel


1
Φ0 (x) = E (ϕ(XTx ) − ϕ(xerT ))WT .

xσT
3. Variance reduction for the γ. Show that

Φ''(x) = (1/(x²σT)) E( ( ϕ'(X_T^x) X_T^x − ϕ(X_T^x) − x e^{rT} ϕ'(x e^{rT}) + ϕ(x e^{rT}) ) W_T ).

4. Testing the variance reduction if any. Although the former two exercises are entitled “variance
reduction” the above formulae do not guarantee a variance reduction at a fixed time T . It seems
intuitive that they do only when the maturity T is small. Do some numerical experiments to test
whether or not the above formulae induce some variance reduction.
When the maturity increases, test whether the regression method introduced in Chapter 3,
Section 3.2 works or not with these “control variates”.

5. Computation of the vega (8). Show likewise that E(ϕ(X_T^x)) is differentiable with respect to the volatility parameter σ under the same assumptions on ϕ, namely

∂/∂σ E( ϕ(X_T^x) ) = E( ϕ'(X_T^x) X_T^x (W_T − σT) )

if ϕ is differentiable with a derivative with polynomial growth. Derive without any further computations – but with the help of the previous exercises – that

∂/∂σ E( ϕ(X_T^x) ) = E( ϕ(X_T^x) ( W_T²/(σT) − W_T − 1/σ ) )

if ϕ is simply Borel with polynomial growth. [Hint: use the former exercises.]
This derivative is known (up to an appropriate discounting) as the vega of the option related
to the payoff ϕ(XTx ). Note that the γ and the vega of a Call satisfy

vega(x, K, r, σ, T ) = x2 σT γ(x, K, r, σ, T )

which is the key of the tracking error formula.

In fact the beginning of this section can be seen as an introduction to the so-called tangent
process method (see the section that ends this chapter and section 9.2.2 in Chapter 9).

2.3.3 Direct differentiation on the state space: the log-likelihood method


In fact, one can also achieve similar computations directly on the state space of a family of random variables (or vectors) X_T^x (indexed by its starting value x), provided this random variable (or vector) has an explicit probability density p_T(x, y) with respect to a reference measure µ(dy) on the real line (or R^d). In general µ = λ_d, the Lebesgue measure.
(8) Which is not a Greek letter. . .

In fact, one may imagine in full generality that X_T^x depends on a real parameter of interest θ: thus, X_T^x = X_T^x(θ) may be the solution at time T to a stochastic differential equation whose coefficients depend on θ. This parameter can be the starting value x itself or, more specifically in a Black-Scholes model, the volatility σ, the interest rate r, the maturity T or even the strike K of a Call or Put option written on the risky asset X^x.
Formally, one has

    Φ(θ) = E ϕ(X_T^x(θ)) = ∫_R ϕ(y) p_T(θ, x, y) µ(dy)

so that, formally,

    Φ'(θ) = ∫_R ϕ(y) (∂p_T/∂θ)(θ, x, y) µ(dy)
          = ∫_R ϕ(y) [ (∂p_T/∂θ)(θ, x, y) / p_T(θ, x, y) ] p_T(θ, x, y) µ(dy)
          = E[ ϕ(X_T^x) (∂ log p_T/∂θ)(θ, x, X_T^x) ].    (2.7)

Of course, the above computations need to be supported by appropriate assumptions (domination, etc.) to justify the interchange of integration and differentiation (see the exercises below). Using this approach, one retrieves the formulas obtained in Section 2.3.2 for the Black-Scholes model much more easily, especially when the function ϕ is only Borel. This is the aim of the exercises at the end of this section.

A multi-dimensional version of this result can be established in the same way. However, this straightforward and simple approach to "greek" computation remains limited beyond the Black-Scholes world by the fact that one needs access not only to the regularity of the probability density p_T(x, y, θ) of the asset at time T but also to its explicit expression in order to include it in a simulation process.
A solution in practice is to replace the true process X by an approximation, typically its Euler scheme with step T/n, (X̄_{t_k^n}^x)_{0≤k≤n}, where X̄_0^x = x and t_k^n := kT/n (see Chapter 7 for details). Then, under some natural ellipticity assumption (non-degeneracy of the volatility), this Euler scheme starting at x (with step T/n) does have a density p̄_{kT/n}(x, y) and, more importantly, its increments have conditional densities, which makes it possible to proceed using a Monte Carlo simulation.

An alternative is to "go back" to the "scenarii" space Ω. Then, some extensions of the first two approaches are possible: if the diffusion coefficients (when the risky asset prices follow a Brownian diffusion) and the payoff function are smooth enough, one relies on the tangent process (derivative of the process with respect to its starting value or, more generally, with respect to one of its structure parameters, see below).
When the payoff function has no regularity, a more sophisticated method is to introduce some Malliavin calculus methods, which correspond to a differentiation theory with respect to the generic Brownian paths. An integration by parts formula plays the role of the elementary integration by parts used in Section 2.3.2. This second topic is beyond the scope of the present course, although some aspects will be mentioned through the Bismut and Clark-Ocone formulas.

 Exercises: 0. Provide simple assumptions to justify the above formal computations in (2.7), at some point θ₀ or for all θ running over a non-empty open interval Θ of R (or domain of R^d if θ is vector valued). [Hint: use the remark right below Theorem 2.1.]
1. Compute the probability density p_T(σ, x, y) of X_T^{x,σ} in a Black-Scholes model (σ > 0 stands for the volatility parameter).
2. Re-establish all the sensitivity formulae established in the former Section 2.3.2 (including the exercises at the end of the section) using this approach.
3. Apply these formulae to the case ϕ(x) := e^{−rT}(x − K)_+ and retrieve the classical expressions for the greeks in a Black-Scholes model: the δ, the γ and the vega.

2.3.4 The tangent process method

In fact, when both the payoff function and the coefficients of the SDE are regular enough, one can differentiate directly the function/functional of the process with respect to a given parameter. The former Section 2.3.2 was a special case of this method for vanilla payoffs in a Black-Scholes model. We refer to Section 9.2.2 for more detailed developments.
Chapter 3

Variance reduction

3.1 The Monte Carlo method revisited: static control variate


Let X ∈ L²_R(Ω, A, P) be a random variable, assumed to be easy to simulate. One wishes to compute

    m_X = E X ∈ R

as the result of a Monte Carlo simulation.

 Confidence interval revisited from the simulation viewpoint. The parameter m is to be computed by a Monte Carlo simulation. Let X_k, k ≥ 1, be a sequence of i.i.d. copies of X. Then (SLLN)

    m_X = lim_{M→+∞} X̄_M  P-a.s.  with  X̄_M := (1/M) Σ_{k=1}^M X_k.

This convergence is ruled by the Central Limit Theorem (CLT)

    √M (X̄_M − m_X) →(L) N(0; Var(X)) as M → +∞.

Hence, for large enough M and q ∈ R,

    P( m ∈ [ X̄_M − q σ(X)/√M , X̄_M + q σ(X)/√M ] ) ≈ 2Φ₀(q) − 1

where σ(X) := √Var(X) and Φ₀(x) = ∫_{−∞}^x e^{−ξ²/2} dξ/√(2π).
In numerical probability, we adopt the following reverse point of view based on the target or prescribed accuracy ε > 0: to make X̄_M enter a confidence interval [m − ε, m + ε] with a confidence level α := 2Φ₀(q_α) − 1, one needs to process a Monte Carlo simulation of size

    M ≥ M^X(ε, α) = q_α² Var(X)/ε².    (3.1)

In practice, of course, the estimator V̄_M is computed on-line to estimate the variance Var(X) as presented in the previous chapter. This estimate need not be as sharp as the estimation of m, so it can be processed at the beginning of the simulation on a smaller sample size.


As a first conclusion, this shows that, a confidence level being fixed, the size of a Monte Carlo simulation grows linearly with the variance of X for a given accuracy and like the inverse square of the prescribed accuracy for a given variance.
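The sizing rule (3.1) can be sketched as follows: a small pilot sample provides a rough estimate of Var(X), from which the required simulation size is deduced. The pilot distribution below is an arbitrary illustrative choice.

```python
import numpy as np
from statistics import NormalDist

# Sketch of the sizing rule (3.1): estimate Var(X) on a small pilot sample,
# then deduce the simulation size M(eps, alpha) needed for accuracy eps at
# confidence level alpha. X = e^Z with Z ~ N(0,1) is an illustrative example.
rng = np.random.default_rng(2)

def required_size(sample, eps, alpha):
    q_alpha = NormalDist().inv_cdf((1 + alpha) / 2)   # alpha = 2*Phi_0(q) - 1
    return int(np.ceil(q_alpha**2 * sample.var(ddof=1) / eps**2))

pilot = np.exp(rng.standard_normal(10_000))
M = required_size(pilot, eps=0.01, alpha=0.95)
print(f"pilot variance ~ {pilot.var(ddof=1):.3f}, required M ~ {M}")
```

Note that halving ε multiplies M by four, in line with the conclusion above.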
 Variance reduction: (not so) naive approach. Assume now that we know two random variables X, X' ∈ L²_R(Ω, A, P) satisfying

    m = E X = E X' ∈ R,    Var(X), Var(X'), Var(X − X') > 0

(the last condition only says that X and X' are not a.s. equal).
Question: Which random vector (distribution. . . ) is more appropriate?
Several examples of such a situation have already been pointed out in the previous chapter: usually many formulae are available to compute a greek parameter, even more if one takes into account the (potential) control variates introduced in the exercises.
A natural answer is: if both X and X' can be simulated with an equivalent cost (complexity), then the one with the lowest variance is the best choice, i.e.

    X' if Var(X') < Var(X), X otherwise,

provided this fact is known a priori.

 Practitioner's corner. Usually, the problem appears as follows: there exists a random variable Ξ ∈ L²_R(Ω, A, P) such that

(i) E Ξ can be computed at a very low cost by a deterministic method (closed form, numerical analysis method),
(ii) the random variable X − Ξ can be simulated with the same cost (complexity) as X,
(iii) the variance Var(X − Ξ) < Var(X).

Then, the random variable

    X' = X − Ξ + E Ξ

can be simulated at the same cost as X,

    E X' = E X = m and Var(X') = Var(X − Ξ) < Var(X).

Definition 3.1 A random variable Ξ satisfying (i)-(ii)-(iii) is called a control variate for X.

 Exercise. Show that if the simulation processes of X and X − Ξ have complexity κ and κ' respectively, then (iii) becomes

    (iii)' κ' Var(X − Ξ) < κ Var(X).
Example (Toy-): In the previous chapter, we showed that, in the Black-Scholes model, if the payoff function ϕ is differentiable outside a countable set and locally Lipschitz continuous with polynomial growth at infinity, then the function Φ(x) = E ϕ(X_T^x) is differentiable on (0, +∞) and

    Φ'(x) = E[ ϕ'(X_T^x) X_T^x/x ] = E[ ϕ(X_T^x) W_T/(xσT) ].

So we have at hand two formulas for Φ'(x) that can be implemented in a Monte Carlo simulation. Which one should we choose to compute Φ'(x) (i.e. the δ-hedge up to an e^{−rT}-factor)? Since they have the same expectations, the two random variables should be discriminated through (the square of) their L²-norm, namely

    E[ (ϕ'(X_T^x) X_T^x/x)² ] and E[ (ϕ(X_T^x) W_T/(xσT))² ].

It is clear, owing to the LIL for the Brownian motion at 0, that if ϕ(x) ≠ 0,

    lim inf_{T→0} E[ (ϕ(X_T^x) W_T/(xσT))² ] = +∞

and that, if ϕ is bounded,

    E[ (ϕ(X_T^x) W_T/(xσT))² ] = O(1/T) → 0 as T → +∞.

On the other hand, for the first formula,

    lim inf_{T→0} E[ (ϕ'(X_T^x) X_T^x/x)² ] = (ϕ'(x))²

whereas

    lim inf_{T→+∞} E[ (ϕ'(X_T^x) X_T^x/x)² ] = +∞.

As a consequence, the first formula is more appropriate for short maturities whereas the second formula has less variance for long maturities. This appears somewhat as an exception: usually Malliavin-Monte Carlo weights tend to introduce variance, not only for short maturities. One way to partially overcome this drawback is to introduce some localization methods (see e.g. [26] for an in-depth analysis or Section 9.4.3 for an elementary introduction).

 A variant (pseudo-control variate). In option pricing, when the random variable X is a payoff, it is usually non-negative. In that case, any random variable Ξ satisfying (i)-(ii) and

    (iii)'' 0 ≤ Ξ ≤ X

can be considered as a good candidate to reduce the variance, especially if Ξ is close to X so that X − Ξ is "small".
However, note that it does not imply (iii). Here is a trivial counter-example: let X ≡ 1; then Var(X) = 0, whereas a uniformly distributed random variable Ξ on [1 − η, 1], 0 < η < 1, will (almost) satisfy (i)-(ii) but Var(X − Ξ) > 0. . . Consequently, this variant is only a heuristic method to reduce the variance, which often works, but with no a priori guarantee.

3.1.1 Jensen inequality and variance reduction


This section is in fact an illustration of the notion of pseudo-control variate described above.

Proposition 3.1 (Jensen Inequality) Let X : (Ω, A, P) → R be a random variable and let g : R → R be a convex function. Suppose X and g(X) are integrable. Then, for any sub-σ-field B of A,

    g(E(X | B)) ≤ E(g(X) | B) P-a.s.

In particular, considering B = {∅, Ω} yields the above inequality for the regular expectation, i.e.

    g(E X) ≤ E g(X).

Proof. The inequality is a straightforward consequence of the following classical characterization of a convex function:

    g is convex if and only if ∀ x ∈ R, g(x) = sup{ ax + b : a, b ∈ Q, ay + b ≤ g(y) ∀ y ∈ R }. ♦

Jensen's inequality is an efficient tool to design control variates when dealing with path-dependent or multi-asset options, as emphasized by the following examples:

Examples: 1. Basket or index option. We consider a payoff on a basket of d (positive) risky assets (this basket can be an index). For the sake of simplicity, we suppose it is a Call with strike K, i.e.

    h_T = ( Σ_{i=1}^d α_i X_T^{i,x_i} − K )_+

where (X^{1,x_1}, . . . , X^{d,x_d}) models the price of d traded risky assets on a market and the α_i are some positive (α_i > 0) weights satisfying Σ_{1≤i≤d} α_i = 1. Then the convexity of the exponential implies that

    0 ≤ e^{Σ_{1≤i≤d} α_i log(X_T^{i,x_i})} ≤ Σ_{i=1}^d α_i X_T^{i,x_i}

so that

    h_T ≥ k_T := ( e^{Σ_{1≤i≤d} α_i log(X_T^{i,x_i})} − K )_+ ≥ 0.

The motivation for this example is that in a (possibly correlated) d-dimensional Black-Scholes model (see below), Σ_{1≤i≤d} α_i log(X_T^{i,x_i}) still has a normal distribution so that the Call-like European option written on the payoff

    k_T := ( e^{Σ_{1≤i≤d} α_i log(X_T^{i,x_i})} − K )_+

has a closed form.


Let us be more specific on the model and the variance reduction procedure.
The correlated d-dimensional Black-Scholes model (under the risk-neutral probability measure
with r > 0 denoting the interest rate) can be defined by the following system of SDE’s which
governs the price of d risky assets denoted i = 1, . . . , d:
 
q
dXti,xi = Xti,xi r dt + σij dWtj  , t ∈ [0, T ], xi > 0, i = 1, . . . , d
X

j=1
3.1. THE MONTE CARLO METHOD REVISITED: STATIC CONTROL VARIATE 53

where W = (W 1 , . . . , W q ) is a standard q-dimensional Brownian motion and σ = [σij ]1≤i≤d,1≤j≤q


is a is given by d × q matrix with real entries. Its solution
 
 2
 q
σ
Xti,xi = xi exp  r − i. t + σij Wtj , t ∈ [0, T ], xi > 0, i = 1, . . . , d.
X
2
j=1

where
q
X
σi.2 = 2
σij , i = 1, . . . , d.
j=1

 Exercise. Show that if the matrix σσ* is positive definite (then q ≥ d), one may assume, without modifying the model, that X^{i,x_i} only depends on the first i components of a d-dimensional standard Brownian motion. [Hint: think of the Cholesky decomposition in Section 1.6.2.]

Now, let us describe the two phases of the variance reduction procedure:
– Phase I: Ξ = e^{−rT} k_T as a pseudo-control variate and computation of its expectation E Ξ.
The vanilla Call option has a closed form in a Black-Scholes model and elementary computations show that

    Σ_{1≤i≤d} α_i log(X_T^{i,x_i}/x_i) has distribution N( (r − ½ Σ_{1≤i≤d} α_i σ_{i·}²) T ; α*σσ*α T )

where α is the column vector with components α_i, i = 1, . . . , d.
Consequently, the premium at the origin e^{−rT} E k_T admits a closed form given by

    e^{−rT} E k_T = Call_BS( Π_{i=1}^d x_i^{α_i} e^{−½(Σ_{1≤i≤d} α_i σ_{i·}² − α*σσ*α) T}, K, r, √(α*σσ*α), T ).

– Phase II: Joint simulation of the couple (h_T, k_T).
We need to simulate M independent copies of the couple (h_T, k_T) or, to be more precise, of the quantity

    e^{−rT}(h_T − k_T) = e^{−rT} [ ( Σ_{i=1}^d α_i X_T^{i,x_i} − K )_+ − ( e^{Σ_{1≤i≤d} α_i log(X_T^{i,x_i})} − K )_+ ].

This task clearly amounts to simulating M independent copies W_T^{(m)} = (W_T^{1,(m)}, . . . , W_T^{q,(m)}), m = 1, . . . , M, of the q-dimensional standard Brownian motion W at time T, i.e. M independent copies Z^{(m)} = (Z_1^{(m)}, . . . , Z_q^{(m)}) of the N(0; I_q) distribution, in order to set W_T^{(m)} = √T Z^{(m)}, m = 1, . . . , M.
The resulting pointwise estimator of the premium is given, with obvious notations, by

    (e^{−rT}/M) Σ_{m=1}^M ( h_T^{(m)} − k_T^{(m)} ) + Call_BS( Π_{i=1}^d x_i^{α_i} e^{−½(Σ_{1≤i≤d} α_i σ_{i·}² − α*σσ*α) T}, K, r, √(α*σσ*α), T ).
Remark. The extension to more general payoffs of the form ϕ( Σ_{1≤i≤d} α_i X_T^{i,x_i} ) is straightforward provided ϕ is non-decreasing and a closed form exists for the vanilla option with payoff

    ϕ( e^{Σ_{1≤i≤d} α_i log(X_T^{i,x_i})} ).

 Exercise. Other ways to take advantage of the convexity of the exponential function can be explored: thus one can start from

    Σ_{1≤i≤d} α_i X_T^{i,x_i} = ( Σ_{1≤i≤d} α_i x_i ) Σ_{1≤i≤d} α̃_i X_T^{i,x_i}/x_i

where α̃_i = α_i x_i / Σ_k α_k x_k, i = 1, . . . , d. Compare on simulations the respective performances of these different approaches.

2. Asian options and the Kemna-Vorst control variate in a Black-Scholes model (see [80]).
Let

    h_T = ϕ( (1/T) ∫_0^T X_t^x dt )

be a generic Asian payoff where ϕ is a non-negative, non-decreasing function defined on R₊ and let

    X_t^x = x exp( (r − σ²/2) t + σW_t ), x > 0, t ∈ [0, T],

be a regular Black-Scholes dynamics with volatility σ > 0 and interest rate r. Then the (standard) Jensen inequality applied to the probability measure (1/T) 1_{[0,T]}(t) dt implies

    (1/T) ∫_0^T X_t^x dt ≥ x exp( (1/T) ∫_0^T ( (r − σ²/2) t + σW_t ) dt )
                        = x exp( (r − σ²/2) T/2 + (σ/T) ∫_0^T W_t dt ).

Now

    ∫_0^T W_t dt = T W_T − ∫_0^T s dW_s = ∫_0^T (T − s) dW_s

so that

    (1/T) ∫_0^T W_t dt has distribution N( 0 ; (1/T²) ∫_0^T s² ds ) = N(0; T/3).

This suggests rewriting the right-hand side of the above inequality in a "Black-Scholes asset" style, namely:

    (1/T) ∫_0^T X_t^x dt ≥ x e^{αT} exp( (r − (σ²/3)/2) T + σ (1/T) ∫_0^T W_t dt )

where a straightforward computation shows that

    α = −( r/2 + σ²/12 ).

This naturally leads to introduce the so-called Kemna-Vorst (pseudo-)control variate

    k_T^{KV} := ϕ( x e^{−(r/2 + σ²/12)T} exp( (r − (σ²/3)/2) T + σ (1/T) ∫_0^T W_t dt ) )

which is clearly of Black-Scholes type and moreover satisfies

    h_T ≥ k_T^{KV}.

– Phase I: The random variable k_T^{KV} is an admissible control variate as soon as the vanilla option related to the payoff ϕ(X_T^x) has a closed form. Indeed, if the vanilla option related to the payoff ϕ(X_T^x) has a closed form

    e^{−rT} E ϕ(X_T^x) = Premium_BS^ϕ(x, r, σ, T),

then one has

    e^{−rT} E k_T^{KV} = Premium_BS^ϕ( x e^{−(r/2 + σ²/12)T}, r, σ/√3, T ).
– Phase II: One has to simulate independent copies of h_T − k_T^{KV}, i.e. in practice independent copies of the couple (h_T, k_T^{KV}). This requires, theoretically speaking, to know how to simulate exactly paths of the standard Brownian motion (W_t)_{t∈[0,T]} and, moreover, to compute with an infinite accuracy integrals of the form (1/T) ∫_0^T f(t) dt.
In practice these two tasks are clearly impossible (one cannot even compute a real-valued function f(t) at every t ∈ [0, T] with a computer). In fact one relies on quadrature formulas to approximate the time integrals in both payoffs, which makes this simulation possible since only finitely many marginals of the Brownian motion, say W_{t_1}, . . . , W_{t_n}, are necessary, which is then quite realistic. Typically, one uses a mid-point quadrature formula

    (1/T) ∫_0^T f(t) dt ≈ (1/n) Σ_{k=1}^n f( (2k−1)T/(2n) )

or any other numerical integration method, having in mind nevertheless that the (continuous) functions f of interest are here given, for the first and the second payoff functions, by

    f(t) = ϕ( x exp( (r − σ²/2) t + σW_t(ω) ) ) and f(t) = W_t(ω)

respectively. Hence, their regularity is (almost) ½-Hölder (in fact α-Hölder, for every α < ½, locally, as for the payoff h_T). Finally, in practice it will amount to simulating independent copies of the n-tuple

    ( W_{T/(2n)}, . . . , W_{(2k−1)T/(2n)}, . . . , W_{(2n−1)T/(2n)} )

from which one can reconstruct a mid-point approximation of both integrals appearing in h_T and k_T.
In fact one can improve this first approach by taking advantage of the fact that W is a Gaussian process, as detailed in the exercise below.

Further developments to reduce the time discretization error are proposed in Section 8.2.6
(see [97] where an in-depth study of the Asian option pricing in a Black-Scholes model is carried
out).
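The whole Kemna-Vorst procedure can be sketched as follows for an Asian Call (ϕ(u) = (u − K)₊), with the mid-point rule for both time integrals; all parameter values are illustrative.

```python
import numpy as np
from math import log, sqrt, exp, erf

# Sketch of the Kemna-Vorst procedure for an Asian call in Black-Scholes:
# mid-point discretization of both time integrals, pseudo-control variate
# k_T^KV with its closed form added back.
rng = np.random.default_rng(5)

def call_bs(x, K, r, vol, T):
    d1 = (log(x / K) + (r + 0.5 * vol**2) * T) / (vol * sqrt(T))
    d2 = d1 - vol * sqrt(T)
    N = lambda u: 0.5 * (1.0 + erf(u / sqrt(2.0)))
    return x * N(d1) - K * exp(-r * T) * N(d2)

x, K, r, sigma, T, n, M = 100.0, 100.0, 0.05, 0.2, 1.0, 50, 100_000
t = (2 * np.arange(1, n + 1) - 1) * T / (2 * n)          # mid points (2k-1)T/(2n)
steps = np.diff(np.concatenate(([0.0], t)))              # first step T/(2n), then T/n
W = np.cumsum(rng.standard_normal((M, n)) * np.sqrt(steps), axis=1)

X = x * np.exp((r - 0.5 * sigma**2) * t + sigma * W)     # paths at the mid points
avg_X = X.mean(axis=1)                                   # mid-point rule for (1/T) int X dt
avg_W = W.mean(axis=1)                                   # mid-point rule for (1/T) int W dt

h = np.maximum(avg_X - K, 0.0)
kv = np.maximum(x * exp(-(r / 2 + sigma**2 / 12) * T)
                * np.exp((r - (sigma**2 / 3) / 2) * T + sigma * avg_W) - K, 0.0)

closed = call_bs(x * exp(-(r / 2 + sigma**2 / 12) * T), K, r, sigma / sqrt(3), T)
price = exp(-r * T) * (h - kv).mean() + closed
print(f"Asian call ~ {price:.3f}, variance ratio h / (h-kv): "
      f"{h.var() / (h - kv).var():.0f}")
```

With the mid-point rule, the discrete geometric mean coincides exactly with the discretized k_T^{KV}, so h ≥ k^{KV} still holds pathwise.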

 Exercises. 1. (a) Show that if f : [0, T] → R is continuous then

    lim_n (1/n) Σ_{k=1}^n f(kT/n) = (1/T) ∫_0^T f(t) dt.

Show that t ↦ x exp( (r − σ²/2) t + σW_t(ω) ) is α-Hölder for every α ∈ (0, ½) (with a random Hölder ratio, of course).
(b) Show that

    L( (1/T) ∫_0^T W_t dt | W_{kT/n} − W_{(k−1)T/n} = δw_k, 1 ≤ k ≤ n ) = N( Σ_{k=1}^n a_k δw_k ; T/(12n²) )

with a_k = (2(n−k)+1)/(2n), k = 1, . . . , n. [Hint: we know that for Gaussian vectors conditional expectation and affine regression coincide.]
(c) Propose a variance reduction method in which the pseudo-control variate e^{−rT} k_T will be simulated exactly.
2. Check that what precedes can be applied to payoffs of the form

    ϕ( (1/T) ∫_0^T X_t^x dt − X_T^x )

where ϕ is still a non-negative, non-decreasing function defined on R.
where ϕ is still a non-negative, non-decreasing function defined on R.
3. Best-of-Call option. We consider the Best-of-Call payoff given by

    h_T = ( max(X_T^1, X_T^2) − K )_+.

(a) Using the convexity inequality (that can still be seen as an application of Jensen's inequality)

    √(ab) ≤ max(a, b), a, b > 0,

show that

    k_T := ( √(X_T^1 X_T^2) − K )_+

is a natural (pseudo-)control variate for h_T.
(b) Show that, in a 2-dimensional Black-Scholes (possibly correlated) model (see the example in Section 2.2), the premium of the option with payoff k_T (known as a geometric mean option) has a closed form. Show that this closed form can be written as a Black-Scholes formula with appropriate parameters (and maturity T).
(c) Check on (at least one) simulation(s) that this procedure does reduce the variance (use the parameters of the model specified in Section 2.2).
(d) When σ₁ and σ₂ are not equal, improve the above variance reduction protocol by considering a parametrized family of (pseudo-)control variates, obtained from the more general inequality a^θ b^{1−θ} ≤ max(a, b), θ ∈ (0, 1).

 Exercise. Compute the premium of the European option with payoff k_T.



3.1.2 Negatively correlated variables, antithetic method


In this section we assume that X and X' have not only the same expectation m_X but also the same variance, i.e. Var(X) = Var(X'), and can be simulated with the same complexity κ = κ_X = κ_{X'}. In such a situation, choosing between X and X' may seem a priori a question of little interest. However, it is possible to take advantage of this situation to reduce the variance of a simulation when X and X' are negatively correlated.
Set

    χ = (X + X')/2

(this corresponds to Ξ = (X − X')/2 with our formalism). It is reasonable (when no further information on (X, X') is available) to assume that the simulation complexity of χ is twice that of X and X', i.e. κ_χ = 2κ. On the other hand

    Var(χ) = (1/4) Var(X + X')
           = (1/4) ( Var(X) + Var(X') + 2 Cov(X, X') )
           = ( Var(X) + Cov(X, X') ) / 2.

The sizes M^X(ε, α) and M^χ(ε, α) of the simulations using X and χ respectively to enter a given interval [m − ε, m + ε] with the same confidence level α are given, following (3.1), by

    M^X = (q_α/ε)² Var(X) and M^χ = (q_α/ε)² Var(χ).

Taking into account the complexity, like in the exercise that follows Definition 3.1, this means in terms of CPU computation time that one should rather use χ if and only if

    κ_χ M^χ < κ_X M^X ⟺ 2κ M^χ < κ M^X

i.e.

    2 Var(χ) < Var(X).

Given the above expression for Var(χ), this reads

    Cov(X, X') < 0.

To use this remark in practice, one usually relies on the following result.

Proposition 3.2 (co-monotony) (a) Let Z : (Ω, A, P) → R be a random variable and let ϕ, ψ : R → R be two monotone (hence Borel) functions with the same monotony. Assume that ϕ(Z), ψ(Z) ∈ L²_R(Ω, A, P). Then

    Cov(ϕ(Z), ψ(Z)) ≥ 0.

If, mutatis mutandis, ϕ and ψ have opposite monotony, then

    Cov(ϕ(Z), ψ(Z)) ≤ 0.

Furthermore, the inequality holds as an equality if and only if ϕ(Z) = E ϕ(Z) P-a.s. or ψ(Z) = E ψ(Z) P-a.s.
(b) Assume there exists a non-increasing (hence Borel) function T : R → R such that T(Z) has the same distribution as Z. Then X = ϕ(Z) and X' = ϕ(T(Z)) are identically distributed and satisfy

    Cov(X, X') ≤ 0.

In that case, the random variables X and X' are called antithetic.

Proof. (a) Inequality. Let Z, Z' be two independent random variables defined on the same probability space (Ω, A, P) with distribution P_Z (1). Then, using that ϕ and ψ are monotone with the same monotony, we have

    ( ϕ(Z) − ϕ(Z') ) ( ψ(Z) − ψ(Z') ) ≥ 0

so that the expectation of this product is non-negative (and finite since all random variables are square integrable). Consequently

    E[ϕ(Z)ψ(Z)] + E[ϕ(Z')ψ(Z')] − E[ϕ(Z)ψ(Z')] − E[ϕ(Z')ψ(Z)] ≥ 0.

Using that Z' has the same distribution as Z and that Z, Z' are independent, we get

    2 E[ϕ(Z)ψ(Z)] ≥ E[ϕ(Z)] E[ψ(Z')] + E[ϕ(Z')] E[ψ(Z)] = 2 E[ϕ(Z)] E[ψ(Z)],

that is

    Cov(ϕ(Z), ψ(Z)) = E[ϕ(Z)ψ(Z)] − E[ϕ(Z)] E[ψ(Z)] ≥ 0.

If the functions ϕ and ψ have opposite monotony, then

    ( ϕ(Z) − ϕ(Z') ) ( ψ(Z) − ψ(Z') ) ≤ 0

and one concludes as above up to sign changes.


Equality case. As for the equality case under the co-monotony assumption, we may assume without loss of generality that ϕ and ψ are non-decreasing. Moreover, we make the following convention: if a is not an atom of the distribution P_Z of Z, then set ϕ(a) = ϕ(a+), ψ(a) = ψ(a+); idem for b.
Now, if ϕ(Z) or ψ(Z) is P-a.s. constant (hence equal to its expectation) then equality clearly holds.
Conversely, it follows by reading backward the above proof that if equality holds, then E[ ( ϕ(Z) − ϕ(Z') ) ( ψ(Z) − ψ(Z') ) ] = 0 so that ( ϕ(Z) − ϕ(Z') ) ( ψ(Z) − ψ(Z') ) = 0 P-a.s. Now, let I be the (closed) convex hull of the support of the distribution µ = P_Z of Z on the real line. Assume e.g. that I = [a, b] ⊂ R, a, b ∈ R, a < b (other cases can be adapted easily from that one).
By construction a and b are in the support of µ so that for every ε_a, ε_b ∈ (0, b − a) both P(a ≤ Z < a + ε_a) and P(b − ε_b < Z ≤ b) are (strictly) positive. If a is an atom of P_Z, one may choose ε_a = 0; idem for b. Hence the event C_ε = {a ≤ Z < a + ε_a} ∩ {b − ε_b < Z' ≤ b} has a positive probability since Z and Z' are independent.
(1) This is always possible owing to Fubini's Theorem for product measures by considering the probability space (Ω², A⊗², P⊗²): extend Z by Z(ω, ω') = Z(ω) and define Z' by Z'(ω, ω') = Z(ω').
Now assume that ϕ(Z) is not P-a.s. constant. Then ϕ cannot be constant on I and ϕ(a) < ϕ(b) (with the above convention on atoms). Consequently, on C_ε (for ε_a, ε_b small enough), ϕ(Z) − ϕ(Z') ≠ 0 a.s., so that ψ(Z) − ψ(Z') = 0 P-a.s. on C_ε, which in turn implies that ψ(a + ε_a) = ψ(b − ε_b). Then, letting ε_a and ε_b go to 0, one derives that ψ(a) = ψ(b) (still having in mind the convention on atoms). Finally this shows that ψ(Z) is P-a.s. constant. ♦

(b) Set ψ = ϕ ◦ T so that ϕ and ψ have opposite monotony. Noting that X and X' do have the same distribution and applying claim (a) completes the proof. ♦

This leads to the well-known "antithetic random variables" method.

Antithetic random variables method. This terminology is shared by two classical situations in which the above approach applies:
– the symmetric random variable Z: Z has the same distribution as −Z (i.e. T(z) = −z);
– the [0, L]-valued random variable Z such that Z has the same distribution as L − Z (i.e. T(z) = L − z). This is satisfied by U ~ U([0, 1]) with L = 1.

Examples. 1. European option pricing in a Black-Scholes model. Let h_T = ϕ(X_T^x) with ϕ monotone (like for Calls, Puts, spreads, etc.). Then h_T = ϕ( x e^{(r−σ²/2)T + σ√T Z} ), Z = W_T/√T ~ N(0; 1). The function z ↦ ϕ( x e^{(r−σ²/2)T + σ√T z} ) is monotone as the composition of two monotone functions, and Z is symmetric.
2. Uniform distribution on the unit interval. If ϕ is monotone on [0, 1] and U ~ U([0, 1]) then

    Var( (ϕ(U) + ϕ(1 − U))/2 ) ≤ (1/2) Var(ϕ(U)).
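Example 1 can be sketched as follows for a Call payoff (illustrative parameters): pairing Z with −Z produces negatively correlated copies, and the criterion 2 Var(χ) < Var(X) derived above can be checked empirically.

```python
import numpy as np

# Sketch of the antithetic method on a Black-Scholes call: the payoff is a
# monotone function of the symmetric Gaussian Z, so the pair (Z, -Z) yields
# two negatively correlated copies with the same distribution.
rng = np.random.default_rng(6)
x, K, r, sigma, T, M = 100.0, 100.0, 0.05, 0.2, 1.0, 100_000

def payoff(z):
    XT = x * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
    return np.exp(-r * T) * np.maximum(XT - K, 0.0)

Z = rng.standard_normal(M)
X1, X2 = payoff(Z), payoff(-Z)                   # antithetic pair, same law
chi = 0.5 * (X1 + X2)

cov = np.cov(X1, X2)[0, 1]
print(f"Cov(X, X') = {cov:.3f}, 2 Var(chi) = {2 * chi.var():.3f} "
      f"vs Var(X) = {X1.var():.3f}")
```

Since the covariance is negative here, the antithetic estimator χ beats X even after accounting for its doubled complexity.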

The above one-dimensional Proposition 3.2 admits a multi-dimensional extension that reads as follows:

Theorem 3.1 Let d ∈ N* and let Φ, ϕ : R^d → R be two functions satisfying the following joint marginal monotony assumption: for every i ∈ {1, . . . , d} and every (z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_d) ∈ R^{d−1},

    z_i ↦ Φ(z_1, . . . , z_i, . . . , z_d) and z_i ↦ ϕ(z_1, . . . , z_i, . . . , z_d) have the same monotony,

which may depend on i and on (z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_d). Let Z_1, . . . , Z_d be independent real-valued random variables defined on a probability space (Ω, A, P) and let T_i : R → R, i = 1, . . . , d, be non-increasing functions such that T_i(Z_i) has the same distribution as Z_i. Then, if Φ(Z_1, . . . , Z_d), ϕ(Z_1, . . . , Z_d) ∈ L²(Ω, A, P),

    Cov( Φ(Z_1, . . . , Z_d), ϕ(T_1(Z_1), . . . , T_d(Z_d)) ) ≤ 0.

Remarks. • This result may be successfully applied to functions f( X̄_{T/n}, . . . , X̄_{kT/n}, . . . , X̄_T ) of the Euler scheme with step T/n of a one-dimensional Brownian diffusion with a non-decreasing drift and a deterministic strictly positive diffusion coefficient, provided f is "marginally monotonic", i.e. monotonic in each of its variables with the same monotony. (We refer to Chapter 7 for an introduction to the Euler scheme of a diffusion.) The idea is to rewrite the functional as a "marginally monotonic" function of the n (independent) Brownian increments, which play the role of the random variables Z_i. Furthermore, passing to the limit as the step size goes to zero yields some correlation results for a class of monotone continuous functionals defined on the canonical space C([0, T], R) of the diffusion itself (the monotony should be understood with respect to the naive partial order).
• For more insight about this kind of co-monotony properties and their consequences for the pricing of derivatives, we refer to [125].

 Exercises. 1. Prove the above theorem by induction on d. [Hint: use Fubini's Theorem.]
2. Show that if ϕ and ψ are non-negative Borel functions defined on R, monotone with opposite monotony, then

    E[ϕ(Z)ψ(Z)] ≤ E[ϕ(Z)] E[ψ(Z)]

so that, if ϕ(Z), ψ(Z) ∈ L¹(P), then ϕ(Z)ψ(Z) ∈ L¹(P).
3. Use Proposition 3.2(a) and Proposition 2.1(b) to derive directly (in the Black-Scholes model) from its representation as an expectation that the δ-hedge of a European option whose payoff function is convex is non-negative.

3.2 Regression based control variate


3.2.1 Optimal mean square control variate
We come back to the original situation of two square integrable random variables X and X', having the same expectation

    E X = E X' = m

with nonzero variances, i.e.

    Var(X), Var(X') > 0.

We assume again that X and X' are not identical, in the sense that P(X ≠ X') > 0, which turns out to be equivalent to

    Var(X − X') > 0.

We saw that if Var(X') ≪ Var(X), one will naturally choose X' to implement the Monte Carlo simulation, and we provided several classical examples in that direction. However, we will see that with a little more effort it is possible to improve this naive strategy.
This time we simply (and temporarily) set

    Ξ := X − X' with E Ξ = 0 and Var(Ξ) > 0.

The idea is simply to parametrize the impact of the control variate Ξ by a factor λ, i.e. we set for every λ ∈ R,

    X^λ = X − λ Ξ.

Then the strictly convex parabolic function Φ defined by

    Φ(λ) := Var(X^λ) = λ² Var(Ξ) − 2λ Cov(X, Ξ) + Var(X)

reaches its minimum value at λ_min with

    λ_min := Cov(X, Ξ)/Var(Ξ) = E(X Ξ)/E(Ξ²)
           = 1 + Cov(X', Ξ)/Var(Ξ) = 1 + E(X' Ξ)/E(Ξ²).

Consequently

    σ²_min := Var(X^{λ_min}) = Var(X) − (Cov(X, Ξ))²/Var(Ξ) = Var(X') − (Cov(X', Ξ))²/Var(Ξ)

so that

    σ²_min ≤ min( Var(X), Var(X') )

and σ²_min = Var(X) if and only if Cov(X, Ξ) = 0.
Remark. Note that Cov(X, Ξ) = 0 if and only if λ_min = 0, i.e. Var(X) = min_{λ∈R} Φ(λ).
If we denote by ρ_{X,Ξ} the correlation coefficient between X and Ξ, one gets

    σ²_min = Var(X)(1 − ρ²_{X,Ξ}) = Var(X')(1 − ρ²_{X',Ξ}).

A more symmetric expression for Var(X^{λ_min}) is

    σ²_min = Var(X) Var(X') (1 − ρ²_{X,X'}) / [ (√Var(X) − √Var(X'))² + 2 √(Var(X) Var(X')) (1 − ρ_{X,X'}) ]
           ≤ σ_X σ_{X'} (1 + ρ_{X,X'})/2

where σ_X and σ_{X'} denote the standard deviations of X and X' respectively.

3.2.2 Implementation of the variance reduction: batch vs adaptive


Let (X_k, X'_k)_{k≥1} be an i.i.d. sequence of random vectors with the same distribution as (X, X') and let λ ∈ R. Set for every k ≥ 1

    Ξ_k = X_k − X'_k,    X_k^λ = X_k − λ Ξ_k.

Now, set for every size M ≥ 1 of the simulation:

    V_M := (1/M) Σ_{k=1}^M Ξ_k²,    C_M := (1/M) Σ_{k=1}^M X_k Ξ_k

and

    λ_M := C_M / V_M    (3.2)

with the convention λ_0 = 0.

The "batch" approach.  Definition of the batch estimator. The Strong Law of Large Numbers implies that both

    V_M → Var(X − X') and C_M → Cov(X, X − X') P-a.s. as M → +∞,

so that

    λ_M → λ_min P-a.s. as M → +∞.

This suggests to introduce the batch estimator of m, defined for every size M ≥ 1 of the simulation by

    X̄_M^{λ_M} = (1/M) Σ_{k=1}^M X_k^{λ_M}.

One checks that, for every M ≥ 1,

    X̄_M^{λ_M} = (1/M) Σ_{k=1}^M X_k − λ_M (1/M) Σ_{k=1}^M Ξ_k = X̄_M − λ_M Ξ̄_M    (3.3)

with standard notations for empirical means.


 Convergence of the batch estimator. The asymptotic behaviour of the batch estimator is summed up in the proposition below.

Proposition 3.3 The batch estimator P-a.s. converges to m (consistency), i.e.

    X̄_M^{λ_M} = (1/M) Σ_{k=1}^M X_k^{λ_M} → E X = m a.s.,

and satisfies a CLT (asymptotic normality) with an optimal asymptotic variance σ²_min, i.e.

    √M ( X̄_M^{λ_M} − m ) →(L) N(0; σ²_min).
3.2. REGRESSION BASED CONTROL VARIATE 63

Remark. However, note that the batch estimator is a biased estimator of m since E(λM Ξ̄M) ≠ 0.

Proof. First, one checks from (3.3) that


$$\frac{1}{M}\sum_{k=1}^{M}X^{\lambda_M}_k \xrightarrow{a.s.} m - \lambda_{\min}\times 0 = m.$$

Now, it follows from the regular CLT that

$$\sqrt{M}\Big(\frac{1}{M}\sum_{k=1}^{M}X^{\lambda_{\min}}_k - m\Big) \xrightarrow{\mathcal L} \mathcal N\big(0;\,\sigma^2_{\min}\big)$$

since Var(X − λmin Ξ) = σ²min. On the other hand,

$$\sqrt{M}\Big(\frac{1}{M}\sum_{k=1}^{M}X^{\lambda_M}_k - \frac{1}{M}\sum_{k=1}^{M}X^{\lambda_{\min}}_k\Big) = -\big(\lambda_M-\lambda_{\min}\big)\times\frac{1}{\sqrt M}\sum_{k=1}^{M}\Xi_k \xrightarrow{\mathbb P} 0$$

owing to Slutsky's Lemma (²) since λM − λmin → 0 a.s. as M → +∞ and

$$\frac{1}{\sqrt M}\sum_{k=1}^{M}\Xi_k \xrightarrow{\mathcal L} \mathcal N\big(0;\,\mathbb E\,\Xi^2\big)$$
by the regular CLT applied to the centered square integrable i.i.d. random variables Ξk , k ≥ 1.
Combining these two convergence results yields the announced CLT . ♦

 Exercise. Let (X̄M)_{M≥1} and (Ξ̄M)_{M≥1} denote the empirical mean processes of the sequences (Xk)_{k≥1} and (Ξk)_{k≥1} respectively. Show that the quadruplet (X̄M, Ξ̄M, CM, VM) can be computed in a recursive way from the sequence (Xk, X′k)_{k≥1}. Derive a recursive way to compute the batch estimator.

 Practitioner’s corner. One may proceed as follows:


– Recursive implementation: Use the recursion satisfied by the sequence (X̄k , Ξ̄k , Ck , Vk )k≥1
to compute λM and the resulting batch estimator for each size M .
– True batch implementation: A first phase of the simulation, of size M′ with M′ ≪ M (say M′ ≈ 5% or 10% of the total budget M of the simulation), is devoted to a rough estimate λ_{M′} of λmin, based on the Monte Carlo estimator (3.2).
A second phase of the simulation computes the estimator of m defined by

$$\frac{1}{M-M'}\sum_{k=M'+1}^{M}X^{\lambda_{M'}}_k,$$

whose asymptotic variance – given the first phase of the simulation – is Var(X^λ)|_{λ=λ_{M′}}/(M − M′). This approach is not recursive at all. On the other hand, the resulting estimator satisfies a CLT with asymptotic variance Φ(λ_{M′}) = Var(X^λ)|_{λ=λ_{M′}}. In particular one most likely observes a significant – although not optimal – variance reduction. So, from this point of view, you can stop reading this section at this point. . .
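A minimal sketch of this two-phase implementation (Python; the pilot fraction and function name are ours):

```python
import numpy as np

def two_phase_estimator(x, x_prime, pilot_frac=0.1):
    """'True batch' implementation: the first M' = pilot_frac * M draws give a
    rough estimate lambda_{M'} of lambda_min via (3.2); the remaining M - M'
    draws feed a plain Monte Carlo estimator of m for X - lambda_{M'} Xi."""
    m_total = len(x)
    m0 = max(1, int(pilot_frac * m_total))   # pilot size M'
    xi = x - x_prime
    # Phase 1: rough estimate of lambda_min on the first M' draws
    lam = np.mean(x[:m0] * xi[:m0]) / np.mean(xi[:m0] ** 2)
    # Phase 2: average X_k - lambda_{M'} Xi_k over the last M - M' draws
    return np.mean(x[m0:] - lam * xi[m0:])
```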
2
If Yn → c in probability and Zn → Z∞ in distribution, then Yn Zn → c Z∞ in distribution. In particular, if c = 0 the last convergence holds in probability.

The adaptive unbiased approach. Another approach is to design an adaptive estimator of m


by considering at each step k the (predictable) estimator λk−1 of λmin . This adaptive estimator is
defined and analyzed below.

Theorem 3.2 Assume X, X′ ∈ L^{2+δ}(P) for some δ > 0. Let (Xk, X′k)_{k≥1} be an i.i.d. sequence with the same distribution as (X, X′). We set for every k ≥ 1

$$\widetilde X_k = X_k - \widetilde\lambda_{k-1}\,\Xi_k = \big(1-\widetilde\lambda_{k-1}\big)X_k + \widetilde\lambda_{k-1}\,X'_k \qquad\text{where}\qquad \widetilde\lambda_k = (-k)\vee(\lambda_k\wedge k)$$

and λk is defined by (3.2). Then the adaptive estimator of m defined by

$$\widetilde X^{\widetilde\lambda}_M = \frac{1}{M}\sum_{k=1}^{M}\widetilde X_k$$

is unbiased (E X̃^λ̃_M = m), convergent, i.e.

$$\widetilde X^{\widetilde\lambda}_M \xrightarrow{a.s.} m \quad\text{as } M\to+\infty,$$

and asymptotically normal with minimal variance, i.e.

$$\sqrt{M}\big(\widetilde X^{\widetilde\lambda}_M - m\big) \xrightarrow{\mathcal L} \mathcal N\big(0;\,\sigma^2_{\min}\big) \quad\text{as } M\to+\infty.$$
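The adaptive estimator of Theorem 3.2 admits a direct vectorized transcription (a Python sketch; names are ours):

```python
import numpy as np

def adaptive_estimator(x, x_prime):
    """Adaptive (unbiased) estimator of Theorem 3.2: the k-th increment
    X_k - tilde_lambda_{k-1} Xi_k uses the predictable, truncated coefficient
    tilde_lambda_{k-1} = (-(k-1)) v (lambda_{k-1} ^ (k-1)), with lambda_0 = 0."""
    xi = x - x_prime
    k = np.arange(1, len(x) + 1, dtype=float)
    c = np.cumsum(x * xi)                               # k * C_k
    v = np.cumsum(xi ** 2)                              # k * V_k
    lam = np.clip(c / np.maximum(v, 1e-300), -k, k)     # tilde_lambda_k
    lam_pred = np.concatenate(([0.0], lam[:-1]))        # tilde_lambda_{k-1}
    return np.mean(x - lam_pred * xi)
```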

The rest of this section can be omitted on a first reading, although the method of proof exposed below is quite standard when establishing the efficiency of an estimator by martingale methods.

Proof. Step 1 (a.s. convergence): Let F0 = {∅, Ω} and let Fk := σ(X1, X′1, . . . , Xk, X′k), k ≥ 1, be the filtration of the simulation.
First we show that (X̃k − m)_{k≥1} is a sequence of square integrable (Fk, P)-martingale increments. It is clear by construction that X̃k is Fk-measurable. Moreover,

$$\mathbb E\,\widetilde X_k^2 \le 2\big(\mathbb E\,X_k^2 + \mathbb E(\widetilde\lambda_{k-1}\Xi_k)^2\big) = 2\big(\mathbb E\,X_k^2 + \mathbb E\,\widetilde\lambda_{k-1}^2\;\mathbb E\,\Xi_k^2\big) < +\infty,$$

where we used that Ξk and λ̃_{k−1} are independent. Finally, for every k ≥ 1,

$$\mathbb E\big(\widetilde X_k\,|\,\mathcal F_{k-1}\big) = \mathbb E\big(X_k\,|\,\mathcal F_{k-1}\big) - \widetilde\lambda_{k-1}\,\mathbb E\big(\Xi_k\,|\,\mathcal F_{k-1}\big) = m.$$

This shows that the adaptive estimator is unbiased since E(X̃k) = m for every k ≥ 1. In fact, we can also compute the conditional variance increment process:

$$\mathbb E\big((\widetilde X_k - m)^2\,|\,\mathcal F_{k-1}\big) = \operatorname{Var}(X^\lambda)_{|\lambda=\widetilde\lambda_{k-1}} = \Phi(\widetilde\lambda_{k-1}).$$

Now, we set for every k ≥ 1,

$$N_k := \sum_{\ell=1}^{k}\frac{\widetilde X_\ell - m}{\ell}.$$

It follows from what precedes that the sequence (Nk)_{k≥1} is a square integrable ((Fk)k, P)-martingale, since (X̃k − m)_{k≥1} is a sequence of square integrable (Fk, P)-martingale increments. Its conditional variance increment process (also known as "bracket process") ⟨N⟩k, k ≥ 1, is given by

$$\langle N\rangle_k = \sum_{\ell=1}^{k}\frac{\mathbb E\big((\widetilde X_\ell-m)^2\,|\,\mathcal F_{\ell-1}\big)}{\ell^2} = \sum_{\ell=1}^{k}\frac{\Phi(\widetilde\lambda_{\ell-1})}{\ell^2}.$$

Now, the above series is a.s. converging since Φ(λ̃k) a.s. converges towards Φ(λmin) as k → +∞ (λ̃k a.s. converges toward λmin and Φ is continuous). Consequently,

$$\langle N\rangle_\infty = \text{a.s.}\lim_{M\to+\infty}\langle N\rangle_M < +\infty\quad\text{a.s.}$$

Hence, it follows from Proposition 11.4 in Chapter 11 that NM → N∞ P-a.s. as M → +∞, where N∞ is an a.s. finite random variable. In turn, the Kronecker Lemma (see below) implies

$$\frac{1}{M}\sum_{k=1}^{M}\big(\widetilde X_k - m\big) \xrightarrow{a.s.} 0\quad\text{as } M\to\infty,$$

i.e.

$$\widetilde X^{\widetilde\lambda}_M := \frac{1}{M}\sum_{k=1}^{M}\widetilde X_k \xrightarrow{a.s.} m\quad\text{as } M\to\infty.$$

We will need this classical lemma (see Chapter 11 (Miscellany), Section 11.9 for a proof).
Lemma 3.1 (Kronecker Lemma) Let (an)_{n≥1} be a sequence of real numbers and let (bn)_{n≥1} be a non-decreasing sequence of positive real numbers with lim_n b_n = +∞. Then

$$\Big(\sum_{n\ge1}\frac{a_n}{b_n}\ \text{converges in }\mathbb R\text{ as a series}\Big)\ \Longrightarrow\ \Big(\frac{1}{b_n}\sum_{k=1}^{n}a_k \longrightarrow 0\ \text{as}\ n\to+\infty\Big).$$

Step 2 (CLT, weak rate of convergence): It is a consequence of the Lindeberg Central Limit Theorem for (square integrable) martingale increments (see Theorem 11.7 in the Miscellany chapter, or Theorem 3.2 and its Corollary 3.1, p. 58, in [70], referred to as "Lindeberg's CLT" in what follows). In our case, the array of martingale increments is defined by

$$X_{M,k} := \frac{\widetilde X_k - m}{\sqrt M},\qquad 1\le k\le M.$$

There are basically two assumptions to be checked. First, the convergence of the conditional variance increment process toward σ²min:

$$\sum_{k=1}^{M}\mathbb E\big(X_{M,k}^2\,|\,\mathcal F_{k-1}\big) = \frac{1}{M}\sum_{k=1}^{M}\mathbb E\big((\widetilde X_k - m)^2\,|\,\mathcal F_{k-1}\big) = \frac{1}{M}\sum_{k=1}^{M}\Phi(\widetilde\lambda_{k-1}) \longrightarrow \sigma^2_{\min} := \min_{\lambda}\Phi(\lambda).$$

The second one is the so-called Lindeberg condition (see again the Miscellany chapter or [70], p. 58), which reads in our framework:

$$\forall\,\varepsilon>0,\qquad \sum_{\ell=1}^{M}\mathbb E\big(X_{M,\ell}^2\,\mathbf 1_{\{|X_{M,\ell}|>\varepsilon\}}\,\big|\,\mathcal F_{\ell-1}\big) \xrightarrow{\mathbb P} 0.$$

In turn, owing to the conditional Markov inequality and the definition of X_{M,ℓ}, this condition classically follows from the slightly stronger

$$\sup_{\ell\ge1}\,\mathbb E\big(|\widetilde X_\ell - m|^{2+\delta}\,\big|\,\mathcal F_{\ell-1}\big) < +\infty\quad \mathbb P\text{-a.s.}$$

since

$$\sum_{\ell=1}^{M}\mathbb E\big(X_{M,\ell}^2\,\mathbf 1_{\{|X_{M,\ell}|>\varepsilon\}}\,\big|\,\mathcal F_{\ell-1}\big) \le \frac{1}{\varepsilon^{\delta}M^{1+\frac{\delta}{2}}}\sum_{\ell=1}^{M}\mathbb E\big(|\widetilde X_\ell - m|^{2+\delta}\,\big|\,\mathcal F_{\ell-1}\big).$$

Now, using that (u+v)^{2+δ} ≤ 2^{1+δ}(u^{2+δ} + v^{2+δ}), u, v ≥ 0, and the fact that X, X′ ∈ L^{2+δ}(P), one gets

$$\mathbb E\big(|\widetilde X_\ell - m|^{2+\delta}\,\big|\,\mathcal F_{\ell-1}\big) \le 2^{1+\delta}\big(\mathbb E\,|X-m|^{2+\delta} + |\widetilde\lambda_{\ell-1}|^{2+\delta}\,\mathbb E\,|\Xi|^{2+\delta}\big).$$

Finally, the Lindeberg Central Limit Theorem implies

$$\sqrt{M}\Big(\frac{1}{M}\sum_{k=1}^{M}\widetilde X_k - m\Big) \xrightarrow{\mathcal L} \mathcal N\big(0;\,\sigma^2_{\min}\big).$$

This means that the expected variance reduction does occur if one implements the recursive approach
described above. ♦

3.3 Application to option pricing: using parity equations to produce control variates

The variance reduction by regression introduced in the former section still relies on the fact that κ_X ≈ κ_{X−λΞ} or, equivalently, that the additional complexity induced by the simulation of Ξ given that of X is negligible. This condition may look demanding, but we will see that in the framework of derivative pricing this requirement is always fulfilled as soon as the payoff of interest satisfies a so-called parity equation, i.e. the original payoff can be duplicated by a "synthetic" version.
Furthermore these parity equations are model free, so they can be applied for various specifications of the dynamics of the underlying asset.
In this paragraph, we denote by (St)_{t≥0} the risky asset (with S0 = s0 > 0) and set S⁰t = e^{rt} the riskless asset. We work under the risk-neutral probability, which means that (e^{−rt} St)_{t∈[0,T]} is a martingale on the scenarii space (Ω, A, P) (with respect to the augmented filtration of (St)_{t∈[0,T]}). This means that P is a risk-neutral probability (supposed to exist). Furthermore, to comply with the usual assumptions of AOA theory, we will assume that this risk-neutral probability is unique (complete market) to justify that we may price any derivative under this probability. However this has no real impact on what follows.

 Vanilla Call-Put parity (d = 1): We consider a Call and a Put with common maturity T
and strike K. We denote by

$$\mathrm{Call}_0 = e^{-rT}\,\mathbb E\,(S_T-K)_+ \qquad\text{and}\qquad \mathrm{Put}_0 = e^{-rT}\,\mathbb E\,(K-S_T)_+$$

the premia of this Call and this Put option respectively. Since

$$(S_T-K)_+ - (K-S_T)_+ = S_T - K$$

and (e^{−rt} St)_{t∈[0,T]} is a martingale, one derives the classical Call-Put parity equation:

$$\mathrm{Call}_0 - \mathrm{Put}_0 = s_0 - e^{-rT}K$$

so that Call₀ = E(X) = E(X′) with

$$X := e^{-rT}(S_T-K)_+ \qquad\text{and}\qquad X' := e^{-rT}(K-S_T)_+ + s_0 - e^{-rT}K.$$

As a result one sets

$$\Xi = X - X' = e^{-rT}S_T - s_0,$$

which turns out to be the terminal value of a martingale null at time 0 (this is in fact the generic situation of application of this parity method).
Note that the simulation of X involves that of S_T, so that the additional cost of the simulation of Ξ is definitely negligible.
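Combining this parity control variate with the regression coefficient λM of (3.2) yields the following Monte Carlo pricer (a Python sketch under Black-Scholes dynamics; names are ours):

```python
import numpy as np

def parity_call_price(s0, strike, r, sigma, t, m_sim, seed=0):
    """Black-Scholes Call priced by Monte Carlo with the parity control
    variate Xi = e^{-rT} S_T - s0 and the batch coefficient lambda_M of (3.2)."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(m_sim)
    s_t = s0 * np.exp((r - 0.5 * sigma ** 2) * t + sigma * np.sqrt(t) * z)
    disc = np.exp(-r * t)
    x = disc * np.maximum(s_t - strike, 0.0)   # X  = discounted Call payoff
    xi = disc * s_t - s0                       # Xi = martingale term, E Xi = 0
    lam = np.mean(x * xi) / np.mean(xi ** 2)   # lambda_M
    return np.mean(x - lam * xi)
```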

 Asian Call-Put parity: We consider an Asian Call and an Asian Put with common maturity T, strike K and averaging period [T0, T], 0 ≤ T0 < T:

$$\mathrm{Call}^{As}_0 = e^{-rT}\,\mathbb E\left(\frac{1}{T-T_0}\int_{T_0}^{T}S_t\,dt - K\right)_+ \quad\text{and}\quad \mathrm{Put}^{As}_0 = e^{-rT}\,\mathbb E\left(K - \frac{1}{T-T_0}\int_{T_0}^{T}S_t\,dt\right)_+.$$

Still using that S̃t = e^{−rt} St is a P-martingale and, this time, Fubini-Tonelli's Theorem, one derives

$$\mathrm{Call}^{As}_0 - \mathrm{Put}^{As}_0 = s_0\,\frac{1-e^{-r(T-T_0)}}{r(T-T_0)} - e^{-rT}K$$

so that Call^{As}₀ = E(X) = E(X′) with

$$X := e^{-rT}\left(\frac{1}{T-T_0}\int_{T_0}^{T}S_t\,dt - K\right)_+$$

$$X' := s_0\,\frac{1-e^{-r(T-T_0)}}{r(T-T_0)} - e^{-rT}K + e^{-rT}\left(K - \frac{1}{T-T_0}\int_{T_0}^{T}S_t\,dt\right)_+.$$

This leads to

$$\Xi = e^{-rT}\,\frac{1}{T-T_0}\int_{T_0}^{T}S_t\,dt - s_0\,\frac{1-e^{-r(T-T_0)}}{r(T-T_0)}.$$

Remark. In both cases, the parity equation directly follows from the P-martingale property of
Set = e−rt St .

3.3.1 Complexity aspects in the general case


In practical implementations, one often neglects the cost of the computation of λmin since only a rough estimate is computed: its computation is stopped after the first 5% or 10% of the simulation.
– However, one must be aware that the case of the existence of parity equations is quite specific since the random variable Ξ is a by-product of the simulation of X, so the complexity of the simulation process is not increased: thus, in the recursive approach, the updating of λM and of the empirical means is (almost) costless. Similar observations can be made to some extent on batch approaches. As a consequence, in that specific setting, the complexity of the adaptive linear regression procedure and that of the original one are (almost) the same!

– Warning! This is no longer true in general. . . In a general setting the complexity of the simulation of the pair (X, X′) is double that of X itself. Then the regression method is efficient iff

$$\sigma^2_{\min} < \frac{1}{2}\,\min\big(\operatorname{Var}(X),\,\operatorname{Var}(X')\big)$$

(provided one neglects the cost of the estimation of the coefficient λmin).
The exercise below shows the connection with antithetic variables which then appears as a
special case of regression methods.

 Exercise (Connection with the antithetic variable method). Let X, X′ ∈ L²(P) be such that E X = E X′ = m and Var(X) = Var(X′).
(a) Show that λmin = 1/2.
(b) Show that

$$X^{\lambda_{\min}} = \frac{X+X'}{2}\qquad\text{and}\qquad \operatorname{Var}\Big(\frac{X+X'}{2}\Big) = \frac{1}{2}\big(\operatorname{Var}(X) + \operatorname{Cov}(X,X')\big).$$

Characterize the couples (X, X′) for which the regression method does reduce the variance. Make the connection with the antithetic method.

3.3.2 Examples of numerical simulations


 Vanilla B-S Calls. The model parameters are specified as follows:

T = 1, x0 = 100, r = 5%, σ = 20%, K = 90, . . . , 120.

The simulation size is set at M = 10⁶.

Figure 3.1: Black-Scholes Calls: Error = Reference BS − (MC Premium), K = 90, . . . , 120. –o–o–o– Crude Call. –∗–∗–∗– Synthetic Parity Call. –×–×–×– Interpolated synthetic Call.

Figure 3.2: Black-Scholes Calls: K ↦ 1 − λmin(K), K = 90, . . . , 120, for the Interpolated synthetic Call.

 Asian Calls in a Heston model.

The dynamics of the risky asset is this time given by a stochastic volatility model, namely the Heston model, defined as follows. Let ϑ, k, a be such that ϑ²/(4ak) < 1 (so that vt remains a.s. positive, see [91]):

$$dS_t = S_t\big(r\,dt + \sqrt{v_t}\,dW^1_t\big),\qquad S_0 = s_0 = x_0 > 0\quad\text{(risky asset)},$$
$$dv_t = k(a - v_t)\,dt + \vartheta\,\sqrt{v_t}\,dW^2_t,\qquad v_0 > 0,\quad\text{with}\quad \langle W^1, W^2\rangle_t = \rho\,t,\ \rho\in[-1,1].$$

The payoff is that of an Asian Call with strike price K:

$$\mathrm{AsCall}^{Hest} = e^{-rT}\,\mathbb E\left(\frac{1}{T}\int_0^T S_s\,ds - K\right)_+.$$

Usually, no closed form is available for Asian payoffs, even in the Black-Scholes model, and so is the case in the Heston model. Note however that (quasi-)closed forms do exist for vanilla European options in this model (see [72]), which is the origin of its success.

Figure 3.3: Black-Scholes Calls: Standard Deviation (MC Premium), K = 90, . . . , 120. –o–o–o– Crude Call. –∗–∗–∗– Parity Synthetic Call. –×–×–×– Interpolated Synthetic Call.

The simulation has been carried out by replacing the above diffusion by an Euler scheme (see Chapter 7 for an introduction to Euler schemes). In fact the dynamics of the stochastic volatility process does not fulfill the standard Lipschitz continuity assumptions required to make the Euler scheme converge at least at its usual rate. In the present case it is even difficult to define this scheme because of the term √vt. Since our purpose here is to illustrate the efficiency of parity relations to reduce variance, we adopted a rather "basic" scheme, namely
$$\bar S_{\frac{kT}{n}} = \bar S_{\frac{(k-1)T}{n}}\Big(1 + \frac{rT}{n} + \sqrt{\big|\bar v_{\frac{(k-1)T}{n}}\big|}\,\sqrt{\tfrac{T}{n}}\,\big(\rho\,U^2_k + \sqrt{1-\rho^2}\,U^1_k\big)\Big),\qquad \bar S_0 = s_0 > 0,$$

$$\bar v_{\frac{kT}{n}} = \bar v_{\frac{(k-1)T}{n}} + k\big(a - \bar v_{\frac{(k-1)T}{n}}\big)\frac{T}{n} + \vartheta\,\sqrt{\tfrac{T}{n}}\,\sqrt{\big|\bar v_{\frac{(k-1)T}{n}}\big|}\,U^2_k,\qquad \bar v_0 = v_0 > 0,$$

where (Uk)_{k≥1} = ((U¹k, U²k))_{k≥1} is an i.i.d. sequence of N(0; I2)-distributed random vectors.
This scheme is consistent but its rate of convergence is not optimal. The simulation of the Heston model has given rise to an extensive literature, see e.g., among others, Diop, Alfonsi, Andersen, etc.
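This "basic" scheme can be implemented in a few lines (a Python sketch; the function name and array layout are ours):

```python
import numpy as np

def heston_euler_paths(s0, v0, r, kappa, a, vartheta, rho, t, n_steps, m_paths, seed=0):
    """Basic Euler scheme described above for the Heston model, with |v| inside
    the square roots to keep the scheme well defined (v may become negative).
    Returns the simulated asset paths, shape (m_paths, n_steps + 1)."""
    rng = np.random.default_rng(seed)
    h = t / n_steps
    s = np.full(m_paths, float(s0))
    v = np.full(m_paths, float(v0))
    paths = np.empty((m_paths, n_steps + 1))
    paths[:, 0] = s0
    for j in range(1, n_steps + 1):
        u1 = rng.standard_normal(m_paths)
        u2 = rng.standard_normal(m_paths)
        vol = np.sqrt(np.abs(v))                         # sqrt(|v_{(k-1)T/n}|)
        s = s * (1.0 + r * h
                 + vol * np.sqrt(h) * (rho * u2 + np.sqrt(1.0 - rho ** 2) * u1))
        v = v + kappa * (a - v) * h + vartheta * vol * np.sqrt(h) * u2
        paths[:, j] = s
    return paths
```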

– Parameters of the model:


s0 = 100, k = 2, a = 0.01, ρ = 0.5, v0 = 10%, ϑ = 20%.

– Parameters of the option portfolio:


T = 1, K = 90, · · · , 120 (31 strikes).

 Exercises. One considers a 1-dimensional Black-Scholes model with market parameters

r = 0, σ = 0.3, x0 = 100, T = 1.

1. One considers a vanilla Call with strike K = 80. The random variable Ξ is defined as above. Estimate λmin (one should find a value not too far from 0.825). Then, compute a confidence interval for
the Monte Carlo pricing of the Call, with and without the linear variance reduction, for the following simulation sizes: M = 5 000, 10 000, 100 000, 500 000.
2. Proceed as above but with K = 150 (true price 1.49). What do you observe? Provide an interpretation.

Figure 3.4: Heston Asian Calls: Standard Deviation (MC Premium), K = 90, . . . , 120. –o–o–o– Crude Call. –∗–∗–∗– Synthetic Parity Call. –×–×–×– Interpolated synthetic Call.

Figure 3.5: Heston Asian Calls: K ↦ 1 − λmin(K), K = 90, . . . , 120, for the Interpolated Synthetic Asian Call.

3.3.3 Multidimensional case


 Let X := (X¹, . . . , X^d) : (Ω, A, P) → ℝ^d and Ξ := (Ξ¹, . . . , Ξ^q) : (Ω, A, P) → ℝ^q be square integrable random vectors with

$$\mathbb E\,X = m\in\mathbb R^d,\qquad \mathbb E\,\Xi = 0\in\mathbb R^q.$$

Let D(X) := [Cov(X^i, X^j)]_{1≤i,j≤d} and D(Ξ) denote the covariance (dispersion) matrices of X and Ξ respectively. Assume D(X) and D(Ξ) > 0
as positive definite symmetric matrices.

Figure 3.6: Heston Asian Calls: M = 10⁶ (Reference: MC with M = 10⁸), K = 90, . . . , 120. –o–o–o– Crude Call. –∗–∗–∗– Parity Synthetic Call. –×–×–×– Interpolated Synthetic Call.


 Problem: Find a matrix Λ ∈ M(d, q) solution to the optimization problem

$$\operatorname{Var}(X - \Lambda\,\Xi) = \min\big\{\operatorname{Var}(X - L\,\Xi),\ L\in\mathcal M(d,q)\big\}$$

where Var(Y) is defined by Var(Y) := E|Y − E Y|² = E|Y|² − |E Y|² for any ℝ^d-valued random vector Y.
 Solution:

$$\Lambda = C(X,\Xi)\,D(\Xi)^{-1}$$

where C(X, Ξ) = [Cov(X^i, Ξ^j)]_{1≤i≤d, 1≤j≤q}.
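In practice Λ is estimated by its empirical counterpart computed from the simulated samples (a numpy sketch; the function name is ours):

```python
import numpy as np

def optimal_matrix_lambda(x, xi):
    """Empirical Lambda = C(X, Xi) D(Xi)^{-1} minimizing Var(X - Lambda Xi).

    x  : array of shape (M, d), samples of X
    xi : array of shape (M, q), samples of Xi (E Xi = 0)
    Returns the (d, q) matrix Lambda."""
    xc = x - x.mean(axis=0)
    xic = xi - xi.mean(axis=0)
    c = xc.T @ xic / len(x)          # empirical C(X, Xi), shape (d, q)
    d_xi = xic.T @ xic / len(x)      # empirical D(Xi),    shape (q, q)
    return c @ np.linalg.inv(d_xi)
```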
 Examples-Exercises: Let Xt = (Xt1 , . . . , Xtd ), t ∈ [0, T ], be the price process of d risky traded
assets (be careful about the notations that collide at this point: X is here for the traded assets and
the aim is to reduce the variance of the discounted payoff usually denoted with the letter h).

1. Options on various baskets:

$$h^i_T = \Big(\sum_{j=1}^{d}\theta^i_j\,X^j_T - K\Big)_+,\qquad i = 1,\dots,d.$$

Remark. This approach also produces an optimal asset selection (since it is essentially a PCA), which helps for hedging.

2. Portfolio of forward start options:

$$h^{i,j} = \big(X^j_{T_{i+1}} - X^j_{T_i}\big)_+,\qquad i = 1,\dots,d-1,$$

where Ti, i = 0, . . . , d, is an increasing sequence of maturities.



3.4 Pre-conditioning
The principle of the pre-conditioning method – also known as the Rao-Blackwell method – is based on the very definition of conditional expectation: let (Ω, A, P) be a probability space and let X : (Ω, A, P) → ℝ be a square integrable random variable. The practical constraint for implementation is the ability to simulate E(X | B) at a competitive cost. The examples below show some typical situations where so is the case.
For every sub-σ-field B ⊂ A,

$$\mathbb E\,X = \mathbb E\big(\mathbb E(X\,|\,\mathcal B)\big)$$

and

$$\operatorname{Var}\big(\mathbb E(X\,|\,\mathcal B)\big) = \mathbb E\big(\mathbb E(X\,|\,\mathcal B)^2\big) - (\mathbb E\,X)^2 \le \mathbb E(X^2) - (\mathbb E\,X)^2 = \operatorname{Var}(X)$$

since conditional expectation is a contraction in L²(P), being an orthogonal projection. Furthermore the above inequality is strict except if X is B-measurable, i.e. in any non-trivial case.
The archetypal situation is the following: assume

X = g(Z1 , Z2 ), g ∈ L2 (P(Z1 ,Z2 ) ),

where Z1 , Z2 are independent random vectors. Set B := σ(Z2 ). Then standard results on condi-
tional expectations show that

E X = E G(Z2 ) where G(z2 ) = E (g(Z1 , Z2 ) | Z2 = z2 ) = E g(Z1 , z2 )

is a version of the conditional expectation of g(Z1, Z2) given σ(Z2). At this stage, the pre-conditioning method can be implemented as soon as the following conditions are satisfied:
– a closed form is available for the function G, and
– the (distribution of) Z2 can be simulated with the same complexity as (the distribution of) X.
Examples. 1. Exchange spread options. Let $X^i_T = x_i\,e^{(r-\frac{\sigma_i^2}{2})T+\sigma_i W^i_T}$, xi, σi > 0, i = 1, 2, be two "Black-Scholes" assets at time T, related to two Brownian motions W^i_T, i = 1, 2, with correlation ρ ∈ [−1, 1]. One considers an exchange spread option with strike K, i.e. related to the payoff

$$h_T = (X^1_T - X^2_T - K)_+.$$

Then one can write

$$(W^1_T, W^2_T) = \sqrt T\,\Big(\sqrt{1-\rho^2}\,Z_1 + \rho\,Z_2,\ Z_2\Big)$$

where Z = (Z1, Z2) is an N(0; I2)-distributed random vector. Then

$$e^{-rT}\,\mathbb E(h_T\,|\,Z_2) = e^{-rT}\,\mathbb E\Big(x_1\,e^{(r-\frac{\sigma_1^2}{2})T+\sigma_1\sqrt T(\sqrt{1-\rho^2}Z_1+\rho z_2)} - x_2\,e^{(r-\frac{\sigma_2^2}{2})T+\sigma_2\sqrt T z_2} - K\Big)_{+\,\big|\,z_2=Z_2}$$

$$= \mathrm{Call}^{BS}\Big(x_1\,e^{-\frac{\rho^2\sigma_1^2 T}{2}+\sigma_1\rho\sqrt T Z_2},\ x_2\,e^{(r-\frac{\sigma_2^2}{2})T+\sigma_2\sqrt T Z_2}+K,\ r,\ \sigma_1\sqrt{1-\rho^2},\ T\Big).$$

Then one takes advantage of the closed form available for vanilla Call options in the Black-Scholes model to compute

$$\mathrm{Premium}^{BS}(x^1_0, x^2_0, K, \sigma_1, \sigma_2, r, T) = \mathbb E\big(\mathbb E(e^{-rT}h_T\,|\,Z_2)\big)$$

with a smaller variance than with the original payoff.
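A sketch of this pre-conditioned estimator (Python; the Black-Scholes closed form is recalled inline and all names are ours):

```python
import numpy as np
from math import erf, exp, log, sqrt

def bs_call(x, k, r, sigma, t):
    """Closed-form Black-Scholes Call premium (scalar arguments)."""
    phi = lambda u: 0.5 * (1.0 + erf(u / sqrt(2.0)))
    d1 = (log(x / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    return x * phi(d1) - k * exp(-r * t) * phi(d2)

def spread_preconditioned(x1, x2, strike, sigma1, sigma2, rho, r, t, m_sim, seed=0):
    """Pre-conditioned MC price of e^{-rT} E (X_T^1 - X_T^2 - K)_+ :
    Z_1 is integrated out in closed form, only Z_2 is simulated."""
    rng = np.random.default_rng(seed)
    z2 = rng.standard_normal(m_sim)
    xt2 = x2 * np.exp((r - 0.5 * sigma2 ** 2) * t + sigma2 * sqrt(t) * z2)
    x1_cond = x1 * np.exp(-0.5 * rho ** 2 * sigma1 ** 2 * t
                          + sigma1 * rho * sqrt(t) * z2)
    sig_cond = sigma1 * sqrt(1.0 - rho ** 2)
    prices = [bs_call(a, b + strike, r, sig_cond, t) for a, b in zip(x1_cond, xt2)]
    return float(np.mean(prices))
```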


2. Barrier options. This example will be detailed in Section 8.2.3 devoted to the pricing of (some
classes) of barrier options in a general model using the simulation of a continuous Euler scheme
(using the so-called Brownian bridge method).

3.5 Stratified sampling


The starting idea of stratification is to localize the Monte Carlo method on the elements of a
measurable partition of the state space E of a random variable X : (Ω, A, P) → (E, E).
Let (Ai )i∈I be a finite E-measurable partition of the state space E. The Ai ’s are called strata.
Assume that the weights
pi = P(X ∈ Ai ), i ∈ I,
are known, (strictly) positive and that, still for every i ∈ I,
$$\mathcal L(X\,|\,X\in A_i) \stackrel{d}{=} \varphi_i(U)$$

where U is uniformly distributed over [0, 1]^{r_i} (with ri ∈ ℕ ∪ {∞}, the case ri = ∞ corresponding to the acceptance-rejection method) and ϕi : [0, 1]^{r_i} → E is an (easily) computable function. This second condition simply means that L(X | X ∈ Ai) is easy to simulate on a computer. To be more precise, we implicitly assume in what follows that the simulation of X and that of the conditional distributions L(X | X ∈ Ai), i ∈ I, or equivalently of the random vectors ϕi(U), have the same complexity. One must always keep that in mind since it is a major constraint for practical implementations of stratification methods.
This simulability condition usually has a strong impact on the possible design of the strata.
For convenience, we will assume in what follows that ri = r.
Let F : (E, E) → (ℝ, Bor(ℝ)) be such that E F²(X) < +∞. Then

$$\mathbb E\,F(X) = \sum_{i\in I}\mathbb E\big(\mathbf 1_{\{X\in A_i\}}\,F(X)\big) = \sum_{i\in I}p_i\,\mathbb E\big(F(X)\,|\,X\in A_i\big) = \sum_{i\in I}p_i\,\mathbb E\big(F(\varphi_i(U))\big).$$

The stratification idea takes place now. Let M be the global “budget” allocated to the simula-
tion of E F (X). We split this budget into |I| groups by setting

Mi = qi M, i ∈ I,
be the budget allocated to the computation of E F(ϕi(U)) in each stratum Ai. This leads to define the following (unbiased) estimator:

$$\widehat{F(X)}^I_M := \sum_{i\in I}p_i\,\frac{1}{M_i}\sum_{k=1}^{M_i}F\big(\varphi_i(U^k_i)\big)$$

where the (U^k_i)_{1≤k≤M_i, i∈I} are i.i.d. random variables, uniformly distributed over [0, 1]^r. Then, elementary computations show that

$$\operatorname{Var}\Big(\widehat{F(X)}^I_M\Big) = \frac{1}{M}\sum_{i\in I}\frac{p_i^2}{q_i}\,\sigma^2_{F,i}$$

where, for every i ∈ I,

$$\sigma^2_{F,i} = \operatorname{Var}\big(F(\varphi_i(U))\big) = \operatorname{Var}\big(F(X)\,|\,X\in A_i\big) = \mathbb E\Big(\big(F(X)-\mathbb E(F(X)\,|\,X\in A_i)\big)^2\,\Big|\,X\in A_i\Big) = \frac{\mathbb E\Big(\big(F(X)-\mathbb E(F(X)\,|\,X\in A_i)\big)^2\,\mathbf 1_{\{X\in A_i\}}\Big)}{p_i}.$$
Optimizing the simulation allocation to each stratum amounts to solving the following minimization problem:

$$\min_{(q_i)\in\mathcal P_I}\ \sum_{i\in I}\frac{p_i^2}{q_i}\,\sigma^2_{F,i}\qquad\text{where}\qquad \mathcal P_I := \Big\{(q_i)_{i\in I}\in\mathbb R_+^I\ \Big|\ \sum_{i\in I}q_i = 1\Big\}.$$
 A sub-optimal choice, but natural and simple, is to set

qi = pi , i ∈ I.

Such a choice is first motivated by the fact that the weights pi are known and of course because it
does reduce the variance since
$$\sum_{i\in I}\frac{p_i^2}{q_i}\,\sigma^2_{F,i} = \sum_{i\in I}p_i\,\sigma^2_{F,i} = \sum_{i\in I}\mathbb E\Big(\mathbf 1_{\{X\in A_i\}}\big(F(X)-\mathbb E(F(X)\,|\,X\in A_i)\big)^2\Big) = \big\|F(X)-\mathbb E\big(F(X)\,|\,\sigma(\{X\in A_i\},\,i\in I)\big)\big\|_2^2, \tag{3.4}$$

where σ({X ∈ Ai}, i ∈ I) denotes the σ-field spanned by the partition {X ∈ Ai}, i ∈ I. Consequently,

$$\sum_{i\in I}\frac{p_i^2}{q_i}\,\sigma^2_{F,i} \le \|F(X)-\mathbb E(F(X))\|_2^2 = \operatorname{Var}(F(X)) \tag{3.5}$$

with equality if and only if E(F(X) | σ({X ∈ Ai}, i ∈ I)) = E F(X) P-a.s. or, equivalently, if and only if

$$\mathbb E\big(F(X)\,|\,X\in A_i\big) = \mathbb E\,F(X),\qquad i\in I.$$



So this choice always reduces the variance of the estimator since we assumed that the stratification
is not trivial. It corresponds in the opinion poll world to the so-called quota method.
 The optimal choice is the solution to the above constrained minimization problem. It follows from a simple application of the Schwarz Inequality (and its equality case) that

$$\sum_{i\in I}p_i\,\sigma_{F,i} = \sum_{i\in I}\frac{p_i\,\sigma_{F,i}}{\sqrt{q_i}}\,\sqrt{q_i} \le \Big(\sum_{i\in I}\frac{p_i^2\,\sigma^2_{F,i}}{q_i}\Big)^{\frac12}\Big(\sum_{i\in I}q_i\Big)^{\frac12} = \Big(\sum_{i\in I}\frac{p_i^2\,\sigma^2_{F,i}}{q_i}\Big)^{\frac12}.$$

Consequently, the optimal choice for the allocation parameters qi, i.e. the solution to the above constrained minimization problem, is given by

$$q^*_{F,i} = \frac{p_i\,\sigma_{F,i}}{\sum_j p_j\,\sigma_{F,j}},\qquad i\in I,$$

with a resulting minimal variance given by

$$\Big(\sum_{i\in I}p_i\,\sigma_{F,i}\Big)^2.$$

At this stage the problem is that the local inertia σ²_{F,i} are not known, which makes the implementation less straightforward. Some attempts have been made to circumvent this problem, see e.g. [45] for a recent reference based on an adaptive procedure for the computation of the local F-inertia σ²_{F,i}.
However, using that the L^p-norms with respect to a probability measure are non-decreasing in p, one derives that σ_{F,i} ≥ E(|F(X) − E(F(X) | X ∈ Ai)| | X ∈ Ai), so that

$$\Big(\sum_{i\in I}p_i\,\sigma_{F,i}\Big)^2 \ge \big\|F(X)-\mathbb E\big(F(X)\,|\,\sigma(\{X\in A_i\},\,i\in I)\big)\big\|_1^2.$$
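Given pilot estimates of the conditional standard deviations σF,i, the optimal split of the total budget M can be computed as follows (a small Python helper; names are ours):

```python
import numpy as np

def optimal_allocation(p, sigma_f, m_total):
    """Optimal allocation q*_i = p_i sigma_{F,i} / sum_j p_j sigma_{F,j};
    sigma_f holds (pilot estimates of) the conditional std deviations sigma_{F,i}.
    Returns (q_star, per-stratum integer budgets M_i)."""
    w = np.asarray(p, dtype=float) * np.asarray(sigma_f, dtype=float)
    q = w / w.sum()
    m_i = np.maximum(1, np.round(q * m_total)).astype(int)  # >= 1 draw per stratum
    return q, m_i
```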

Examples. Stratifications for the computation of E F(X), X ∼ N(0; Id), d ≥ 1.
(a) Stripes. Let v be a fixed unit vector (a simple and natural choice for v is v = e1 = (1, 0, 0, · · · , 0)): it is natural to define the strata as hyper-stripes orthogonal to the main axis ℝv. So, we set, for a given size N of the stratification (I = {1, . . . , N}),

$$A_i := \big\{x\in\mathbb R^d\ \text{s.t.}\ (v|x)\in[y_{i-1},y_i]\big\},\qquad i = 1,\dots,N,$$



where the yi are defined by Φ0(yi) = i/N, i = 0, . . . , N (with y0 = −∞ and yN = +∞: the N-quantiles of the N(0; 1) distribution). Then, if Z denotes an N(0; 1)-distributed random variable,

$$p_i = \mathbb P(X\in A_i) = \mathbb P(Z\in[y_{i-1},y_i]) = \Phi_0(y_i)-\Phi_0(y_{i-1}) = \frac{1}{N},$$

where Φ0 denotes the distribution function of the N(0; 1) distribution. Other choices are possible for the yi, leading to a non-uniform distribution of the pi's. The simulation of the conditional distributions follows from the fact that

$$\mathcal L\big(X\,|\,(v|X)\in[a,b]\big) \stackrel{d}{=} \xi_1\,v + \pi_{v^\perp}(\xi_2)$$

where ξ1 ∼ L(Z | Z ∈ [a, b]) is independent of ξ2 ∼ N(0; Id),

$$\mathcal L(Z\,|\,Z\in[a,b]) \stackrel{d}{=} \Phi_0^{-1}\big((\Phi_0(b)-\Phi_0(a))\,U+\Phi_0(a)\big),\qquad U\sim\mathcal U([0,1]),$$

and π_{v⊥} denotes the orthogonal projection on v⊥. When v = e1, this reads simply

$$\mathcal L\big(X\,|\,(v|X)\in[a,b]\big) = \mathcal L(Z\,|\,Z\in[a,b])\otimes\mathcal N(0; I_{d-1}).$$
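The resulting stratified estimator with the sub-optimal proportional allocation qi = pi = 1/N can be sketched as follows (Python, using the inverse normal distribution function from the standard library; names are ours):

```python
import numpy as np
from statistics import NormalDist

def stratified_normal(f, dim, n_strata, m_total, seed=0):
    """Stratified estimator of E F(X), X ~ N(0; I_d), with N equiprobable
    'stripe' strata along the first coordinate (v = e_1, p_i = q_i = 1/N)."""
    rng = np.random.default_rng(seed)
    inv_cdf = NormalDist().inv_cdf
    m_i = m_total // n_strata                        # budget per stratum
    estimate = 0.0
    for i in range(n_strata):
        lo, hi = i / n_strata, (i + 1) / n_strata    # Phi_0-levels of stratum A_i
        u = np.clip(lo + (hi - lo) * rng.random(m_i), 1e-12, 1.0 - 1e-12)
        z1 = np.array([inv_cdf(p) for p in u])       # first coord. conditioned to A_i
        z = np.column_stack([z1, rng.standard_normal((m_i, dim - 1))])
        estimate += np.mean(f(z)) / n_strata         # p_i * conditional empirical mean
    return estimate
```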

d
(b) Hyper-rectangles. We still consider X = (X 1 , . . . , X d ) = N (0; Id ), d ≥ 2. Let (e1 , . . . , ed ) denote
the canonical basis of Rd . We define the strata as hyper-rectangles. Let N1 , . . . , Nd ≥ 1.

$$A_{\underline i} := \Big\{x\in\mathbb R^d\ \text{s.t.}\ (e_\ell|x)\in[y^\ell_{i_\ell-1},\,y^\ell_{i_\ell}],\ \ell=1,\dots,d\Big\},\qquad \underline i\in\prod_{\ell=1}^{d}\{1,\dots,N_\ell\},$$

where the y^ℓ_i ∈ ℝ are defined by Φ0(y^ℓ_i) = i/Nℓ, i = 0, . . . , Nℓ. Then, for every multi-index i = (i1, . . . , id) ∈ Π^d_{ℓ=1}{1, . . . , Nℓ},

$$\mathcal L(X\,|\,X\in A_{\underline i}) = \bigotimes_{\ell=1}^{d}\mathcal L\big(Z\,|\,Z\in[y^\ell_{i_\ell-1},\,y^\ell_{i_\ell}]\big). \tag{3.6}$$

Optimizing the allocation to each stratum in the simulation, for a given function F, in order to reduce the variance is of course interesting and can be highly efficient, but with the drawback of being strongly F-dependent, especially when this allocation needs an extra procedure like in [45]. An alternative and somewhat dual approach is to try optimizing the strata themselves, uniformly with respect to a class of functions F (namely Lipschitz continuous functions), prior to the optimization of the allocation.
This approach emphasizes the connections between stratification and optimal quantization and provides bounds on the best possible variance reduction factor that can be expected from a stratification. Some elements are provided in Chapter 5, see also [38] for further developments in infinite dimension.

3.6 Importance sampling


3.6.1 The abstract paradigm of importance sampling
The basic principle of importance sampling is the following: let X : (Ω, A, P) → (E, E) be an E-valued random variable. Let µ be a σ-finite reference measure on (E, E) such that PX ≪ µ, i.e. there exists a probability density f : (E, E) → (ℝ+, B(ℝ+)) such that

PX = f·µ.

In practice, we will have to simulate several r.v., whose distributions are all absolutely continuous
with respect to this reference measure µ. For a first reading one may assume that E = R and µ is
the Lebesgue measure but what follows can also be applied to more general measured spaces like
the Wiener space (equipped with the Wiener measure), etc. Let h ∈ L¹(PX). Then,

$$\mathbb E\,h(X) = \int_E h(x)\,\mathbb P_X(dx) = \int_E h(x)f(x)\,\mu(dx).$$

Now, for any µ-a.s. positive probability density function g defined on (E, E) (with respect to µ), one has

$$\mathbb E\,h(X) = \int_E h(x)f(x)\,\mu(dx) = \int_E \frac{h(x)f(x)}{g(x)}\,g(x)\,\mu(dx).$$
One can always enlarge (if necessary) the original probability space (Ω, A, P) to design a random
variable Y : (Ω, A, P) → (E, E) having g as a probability density with respect to µ. Then, going
back to the probability space yields, for every non-negative or PX-integrable function h : E → ℝ,

$$\mathbb E\,h(X) = \mathbb E\left[\frac{h(Y)\,f(Y)}{g(Y)}\right]. \tag{3.7}$$

So, in order to compute E h(X) one can also implement a Monte Carlo simulation based on the
simulation of independent copies of the random variable Y, i.e.

$$\mathbb E\,h(X) = \mathbb E\left[\frac{h(Y)f(Y)}{g(Y)}\right] = \lim_{M\to+\infty}\frac{1}{M}\sum_{k=1}^{M}h(Y_k)\,\frac{f(Y_k)}{g(Y_k)}\qquad a.s.$$
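A one-dimensional illustration with a Gaussian mean shift, for which the ratio f/g has the closed form (f/g)(y) = exp(−θy + θ²/2) (a Python sketch; the rare-event example and names are ours):

```python
import numpy as np

def shifted_gaussian_is(h, theta, m_sim, seed=0):
    """Importance sampling of E h(X), X ~ N(0;1), simulating Y ~ N(theta;1):
    the density ratio is f/g (y) = exp(-theta*y + theta^2/2) (the structural
    constants (2 pi)^{-1/2} cancel)."""
    rng = np.random.default_rng(seed)
    y = theta + rng.standard_normal(m_sim)           # samples of Y
    weights = np.exp(-theta * y + 0.5 * theta ** 2)  # f(Y)/g(Y)
    return np.mean(h(y) * weights)
```

For instance, the rare-event probability P(X > 4) ≈ 3.17·10⁻⁵ is estimated accurately with θ = 4, whereas a crude simulation of the same size would typically produce only a handful of non-zero samples.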

Practitioner’s corner
 Practical requirements (to undertake the simulation). To proceed, it is necessary to simulate independent copies of Y and to compute the ratio of density functions f/g at a reasonable cost (note that only the ratio is needed, which makes the computation of some "structural" constants, like (2π)^{d/2} when both f and g are Gaussian densities, e.g. with different means, unnecessary; see below). By "reasonable cost" for the simulation of Y, we mean at the same cost as that of X (in terms of complexity). As concerns the ratio f/g, this means that its computation remains negligible with respect to that of h.
 Sufficient conditions (to undertake the simulation). Once the above conditions are fulfilled,
the question is: is it profitable to proceed like that? So is the case if the complexity of the
simulation for a given accuracy (in terms of confidence interval) is lower with the second method.

If one assumes as above that,simulating X and Y on the one hand, and computing h(x) and
(hf /g)(x) on the other hand are both comparable in terms of complexity, the question amounts to
comparing the variances or equivalently the squared quadratic norm of the estimators since they
have the same expectation E h(X).

Now

$$\mathbb E\left[\left(\frac{h(Y)f(Y)}{g(Y)}\right)^2\right] = \int_E\left(\frac{h(x)f(x)}{g(x)}\right)^2 g(x)\,\mu(dx) = \int_E h(x)^2\,\frac{f(x)}{g(x)}\,f(x)\,\mu(dx) = \mathbb E\left[h(X)^2\,\frac{f}{g}(X)\right].$$
As a consequence, simulating (hf/g)(Y) rather than h(X) will reduce the variance if and only if

$$\mathbb E\left[h(X)^2\,\frac{f}{g}(X)\right] < \mathbb E\,h(X)^2. \tag{3.8}$$
Remark. In fact, theoretically, as soon as h is non-negative and E h(X) ≠ 0, one may reduce the variance of the new simulation to. . . 0. As a matter of fact, using the Schwarz Inequality one gets, as if trying to "reprove" that Var(h(Y)f(Y)/g(Y)) ≥ 0,

$$\big(\mathbb E\,h(X)\big)^2 = \left(\int_E h(x)f(x)\,\mu(dx)\right)^2 = \left(\int_E\frac{h(x)f(x)}{\sqrt{g(x)}}\,\sqrt{g(x)}\,\mu(dx)\right)^2 \le \int_E\frac{(h(x)f(x))^2}{g(x)}\,\mu(dx)\times\int_E g\,d\mu = \int_E\frac{(h(x)f(x))^2}{g(x)}\,\mu(dx)$$

since g is a probability density function. Now the equality case in the Schwarz Inequality says that the variance is 0 if and only if √g(x) and h(x)f(x)/√g(x) are proportional µ(dx)-a.s., i.e. h(x)f(x) = c g(x) µ(dx)-a.s. for a (non-negative) real constant c. Finally this leads, when h has a constant sign and E h(X) ≠ 0, to

$$g(x) = \frac{h(x)}{\mathbb E\,h(X)}\,f(x)\qquad \mu(dx)\text{-a.s.}$$
This choice is clearly impossible to make since it would mean that E h(X) is already known, as it is involved in the formula. . . and the method would then be of no use. A contrario, this may suggest a direction in which to design the distribution of Y.

3.6.2 How to design and implement importance sampling


The intuition that must guide practitioners when designing an importance sampling method is to replace a random variable X by a random variable Y so that (hf/g)(Y) is in some way often "closer" than h(X) to their common mean. Let us be more specific.
We consider a Call on the risky asset (Xt)_{t∈[0,T]} with strike price K and maturity T > 0 (with interest rate r ≡ 0 for simplicity). If X0 = x ≪ K, i.e. the option is deep out-of-the-money at the origin of time, then in most scenarii ω the payoff will satisfy XT(ω) ≤ K or, equivalently, (XT(ω) − K)+ = 0. In such a setting, the event {(XT − K)+ > 0} – the payoff is positive – is a rare event, so that the number of scenarii producing a nonzero value for (XT − K)+ will be small, inducing a too rough estimate of the quantity of interest E(XT − K)+.
By contrast, if we switch from (Xt)_{t∈[0,T]} to (Yt)_{t∈[0,T]} so that
– we can compute the ratio f_{XT}/g_{YT}(y), where f_{XT} and g_{YT} are the probability densities of XT and YT respectively,
– YT takes most, or at least a significant part, of its values in [K, +∞),
then

$$\mathbb E\,(X_T-K)_+ = \mathbb E\left[(Y_T-K)_+\,\frac{f_{X_T}}{g_{Y_T}}(Y_T)\right]$$

and we can reasonably hope to simulate more significant scenarii for (YT − K)+ (f_{XT}/g_{YT})(YT) than for (XT − K)+. This effect will be measured by the variance reduction.
This interpretation in terms of "rare events" is in fact the core of importance sampling, even more than the plain "variance reduction" feature. In particular, this is what a practitioner must keep in mind when searching for a "good" probability density $g$: importance sampling is more a matter of "focusing light where it is needed" than of reducing variance.
When dealing with vanilla options in simple models (typically local volatility), one usually works on the state space $E = \mathbb R_+$ and importance sampling amounts to a change of variable in one-dimensional integrals, as emphasized above. However, in more involved frameworks, one considers the scenario space as a state space, typically $E = \Omega = \mathcal C(\mathbb R_+, \mathbb R^d)$, and uses Girsanov's Theorem instead of the usual change of variable with respect to the Lebesgue measure.

3.6.3 Parametric importance sampling


In practice, the starting idea is to introduce a parametric family of random variables $(Y_\theta)_{\theta\in\Theta}$ (often defined on the same probability space $(\Omega, \mathcal A, \mathbb P)$ as $X$) such that:
– for every $\theta \in \Theta$, $Y_\theta$ has a probability density $g_\theta > 0$ $\mu$-a.e. with respect to a reference measure $\mu$, and $Y_\theta$ is as easy to simulate as $X$ in terms of complexity;
– the ratio $\frac{f}{g_\theta}$ has a small computational cost, where $f$ is the probability density of the distribution of $X$ with respect to $\mu$.
Furthermore we may always assume, by adding a value to $\Theta$ if necessary, that $Y_{\theta_0} = X$ (at least in distribution) for some value $\theta_0 \in \Theta$ of the parameter.
3.6. IMPORTANCE SAMPLING 81

The problem becomes a parametric optimization problem, typically solving the minimization problem
$$ \min_{\theta\in\Theta}\bigg\{\, E\Big[h(Y_\theta)^2\Big(\frac{f}{g_\theta}(Y_\theta)\Big)^2\Big] = E\Big[h(X)^2\,\frac{f}{g_\theta}(X)\Big] \,\bigg\}. $$

Of course there is no reason why the solution to the above problem should be θ0 (if so, such a
parametric model is inappropriate). At this stage one can follow two strategies:
– Try to solve by numerical means the above minimization problem,
– Use one’s intuition to select a priori a good (although sub-optimal) θ ∈ Θ by applying the
heuristic principle: “focus light where it is needed”.

Example (Cameron-Martin formula and Importance sampling by mean translation).


This example takes place in a Gaussian framework. We consider (as a starting motivation) a 1-dimensional Black-Scholes model defined by
$$ X_T^x = x\,e^{\mu T + \sigma W_T} \overset{d}{=} x\,e^{\mu T + \sigma\sqrt T\,Z}, \qquad Z \sim \mathcal N(0;1), $$

with $x > 0$, $\sigma > 0$ and $\mu = r - \frac{\sigma^2}{2}$. Then the premium of an option with payoff $\varphi : (0,+\infty) \to (0,+\infty)$ reads
$$ e^{-rT}\,E\,\varphi(X_T^x) = E\,h(Z) = \int_{\mathbb R} h(z)\,e^{-\frac{z^2}{2}}\,\frac{dz}{\sqrt{2\pi}}, $$
where $h(z) = e^{-rT}\varphi\big(x\,e^{\mu T+\sigma\sqrt T\,z}\big)$, $z \in \mathbb R$.

From now on we forget about the financial framework and deal with
$$ E\,h(Z) = \int_{\mathbb R} h(z)\,f(z)\,dz \quad\text{where}\quad f(z) = \frac{e^{-\frac{z^2}{2}}}{\sqrt{2\pi}}, $$
and the random variable $Z$ will play the role of $X$ in the above theoretical part. The idea is to introduce the parametric family
$$ Y_\theta = Z + \theta, \qquad \theta \in \Theta := \mathbb R. $$

We consider the Lebesgue measure $\lambda_1$ on the real line as a reference measure, so that
$$ g_\theta(y) = \frac{e^{-\frac{(y-\theta)^2}{2}}}{\sqrt{2\pi}}, \qquad y \in \mathbb R,\ \theta \in \Theta := \mathbb R. $$
Elementary computations then show that
$$ \frac{f}{g_\theta}(y) = e^{-\theta y + \frac{\theta^2}{2}}, \qquad y \in \mathbb R,\ \theta \in \Theta := \mathbb R. $$


Hence we derive the Cameron-Martin formula
$$ E\,h(Z) = e^{\frac{\theta^2}{2}}\,E\big[h(Y_\theta)\,e^{-\theta Y_\theta}\big] = e^{\frac{\theta^2}{2}}\,E\big[h(Z+\theta)\,e^{-\theta(Z+\theta)}\big] = e^{-\frac{\theta^2}{2}}\,E\big[h(Z+\theta)\,e^{-\theta Z}\big]. $$

In fact, a standard change of variable based on the invariance of the Lebesgue measure under translations yields the same result in a much more straightforward way: setting $z = u + \theta$ shows that
$$ E\,h(Z) = \int_{\mathbb R} h(u+\theta)\,e^{-\frac{\theta^2}{2}-\theta u-\frac{u^2}{2}}\,\frac{du}{\sqrt{2\pi}} = e^{-\frac{\theta^2}{2}}\,E\big[e^{-\theta Z}\,h(Z+\theta)\big] = e^{\frac{\theta^2}{2}}\,E\big[h(Z+\theta)\,e^{-\theta(Z+\theta)}\big]. $$

It is to be noticed again that there is no need to account for the normalization constants to compute the ratio $\frac{f}{g_\theta} = \frac{g_0}{g_\theta}$.
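The Cameron-Martin identity can be checked by simulation; in the sketch below (the function $h$ and the shift $\theta$ are arbitrary illustrative choices) the crude estimator of $E\,h(Z)$ and the translated estimator $e^{-\theta^2/2}\frac1N\sum_k h(Z_k+\theta)e^{-\theta Z_k}$ are computed on the same Gaussian sample and agree up to the Monte Carlo error.

```python
import math
import random
import statistics

random.seed(1)

h = lambda z: max(math.exp(z) - 2.0, 0.0)   # arbitrary non-negative test function
theta = 1.0                                  # arbitrary translation parameter
zs = [random.gauss(0.0, 1.0) for _ in range(200_000)]

crude = statistics.fmean(h(z) for z in zs)
# Cameron-Martin: E h(Z) = e^{-θ²/2} E[ h(Z+θ) e^{-θZ} ]
shifted = math.exp(-theta**2 / 2) * statistics.fmean(
    h(z + theta) * math.exp(-theta * z) for z in zs
)
```

Here $E\,h(Z) = e^{1/2}\Phi(1-\log 2) - 2\,\Phi(-\log 2) \approx 0.535$, and both estimators should land within a few standard errors of this value.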

The next step is to choose a "good" $\theta$ which significantly reduces the variance, i.e., following Condition (3.8) (in the formulation involving $Y_\theta = Z + \theta$), such that
$$ E\Big[\Big(e^{\frac{\theta^2}{2}}\,h(Z+\theta)\,e^{-\theta(Z+\theta)}\Big)^2\Big] < E\,h^2(Z), $$
i.e.
$$ e^{-\theta^2}\,E\big[h^2(Z+\theta)\,e^{-2\theta Z}\big] < E\,h^2(Z), $$
or, equivalently, in the formulation of (3.8) based on the original random variable (here $Z$),
$$ E\big[h^2(Z)\,e^{\frac{\theta^2}{2}-\theta Z}\big] < E\,h^2(Z). $$

Consequently, the variance minimization amounts to the following problem:
$$ \min_{\theta\in\mathbb R}\Big\{\, e^{\frac{\theta^2}{2}}\,E\big[h^2(Z)\,e^{-\theta Z}\big] = e^{-\theta^2}\,E\big[h^2(Z+\theta)\,e^{-2\theta Z}\big] \,\Big\}. $$

It is clear that the solution of this optimization problem, and hence the resulting choice of $\theta$, highly depends on the function $h$.
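For a given $h$, the objective $\theta \mapsto e^{\theta^2/2}\,E[h^2(Z)e^{-\theta Z}]$ can be estimated on a grid of $\theta$ values from a single Gaussian sample and then minimized; the sketch below does this for an out-of-the-money Call-type function (all numerical choices are illustrative).

```python
import math
import random
import statistics

random.seed(2)

h = lambda z: max(70.0 * math.exp(0.2 * z) - 100.0, 0.0)  # deep OTM payoff (illustrative)
zs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
h2 = [h(z) ** 2 for z in zs]                              # precomputed h²(Z_k)

def objective(theta):
    # Monte Carlo estimate of e^{θ²/2} E[ h²(Z) e^{-θZ} ]
    w = math.exp(theta**2 / 2)
    return w * statistics.fmean(h2k * math.exp(-theta * z) for h2k, z in zip(h2, zs))

grid = [0.25 * i for i in range(17)]       # θ in {0, 0.25, ..., 4}
theta_star = min(grid, key=objective)      # crude grid search for the minimizer
```

In this example the minimizer lies near $\theta \approx 2$, close to the heuristic shift $\log(K/x)/(\sigma\sqrt T)$ of the next paragraph, and the minimal objective is orders of magnitude below the value at $\theta = 0$ (which is just $E\,h^2(Z)$).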

– Optimization approach: When h is smooth enough, an approach based on large deviation estimates has been proposed by Glasserman et al. (see [57]). We propose a simple recursive/adaptive

approach in Chapter 6, based on Stochastic Approximation methods, which does not depend upon the regularity of the function h (see also [4] for a pioneering work in that direction).

– Heuristic suboptimal approach: Let us come back temporarily to our pricing problem, involving the specified function $h(z) = \big(x\exp(\mu T + \sigma\sqrt T\,z) - K\big)_+$, $z \in \mathbb R$. When $x \ll K$ (a deep out-of-the-money option), most simulations of $h(Z)$ will produce 0 as a result. A first simple idea – if one does not wish to carry out the above optimization... – can be to "re-center the simulation" of $X_T^x$ around $K$ by replacing $Z$ by $Z + \theta$, where $\theta$ satisfies
$$ E\Big[x\exp\big(\mu T + \sigma\sqrt T\,(Z+\theta)\big)\Big] = K, $$

which yields, since $E\,X_T^x = x\,e^{rT}$,
$$ \theta := -\frac{\log(x/K) + rT}{\sigma\sqrt T}. \tag{3.9} $$
Solving the similar, although slightly different, equation
$$ E\Big[x\exp\big(\mu T + \sigma\sqrt T\,(Z+\theta)\big)\Big] = e^{rT} K $$
would lead to
$$ \theta := -\frac{\log(x/K)}{\sigma\sqrt T}. \tag{3.10} $$
A third simple, intuitive idea is to search for $\theta$ such that
$$ P\Big(x\exp\big(\mu T + \sigma\sqrt T\,(Z+\theta)\big) < K\Big) = \frac12, $$
which yields
$$ \theta := -\frac{\log(x/K) + \mu T}{\sigma\sqrt T}. \tag{3.11} $$
This choice is also the solution to the equation $x\,e^{\mu T + \sigma\sqrt T\,\theta} = K$, etc.
All these choices are suboptimal but reasonable when $x \ll K$. However, if we need to price a whole portfolio including many options with various strikes, maturities (and underlyings...), the above approach is no longer feasible and a data-driven optimization method like the one developed in Chapter 6 becomes mandatory.
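The heuristic shifts can be made concrete on the numerical example of Exercise 2 below ($r = 0$, $\sigma = 0.2$, $x = 70$, $K = 100$, $T = 1$, reference price $\approx 0.248$); the sketch below compares the crude estimator with the shift-based estimator built on (3.9) (an illustrative computation; with $r = 0$, the shifts (3.9) and (3.10) coincide).

```python
import math
import random
import statistics

random.seed(3)

# deep out-of-the-money Call, parameters of Exercise 2 below
x, K, sigma, T, r = 70.0, 100.0, 0.2, 1.0, 0.0
mu = r - sigma**2 / 2
h = lambda z: max(x * math.exp(mu * T + sigma * math.sqrt(T) * z) - K, 0.0)

# heuristic shift (3.9)
theta = -(math.log(x / K) + r * T) / (sigma * math.sqrt(T))

zs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
crude = statistics.fmean(h(z) for z in zs)
# Cameron-Martin representation with the shift θ (same sample)
shifted = math.exp(-theta**2 / 2) * statistics.fmean(
    h(z + theta) * math.exp(-theta * z) for z in zs
)
```

Both estimators target the Black-Scholes price $\approx 0.248$; the standard error of the shifted one is markedly smaller since the translated scenarios concentrate where the payoff is positive.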
Other parametric methods can be introduced, especially in non-Gaussian frameworks, like for example the so-called "exponential tilting" (or Esscher transform) for distributions having a Laplace transform on the whole real line (see e.g. [102]). Thus, when dealing with the NIG (Normal Inverse Gaussian) distribution, this transform has an impact on the thickness of the tail of the distribution. Of course there is no a priori limit to what can be designed for a specific problem. When dealing with path-dependent options, one usually relies on the Girsanov theorem to modify likewise the drift of the risky asset dynamics (see [102]). Of course all this can be adapted to multi-dimensional models...

▷ Exercises. 1. (a) Show that, under appropriate integrability assumptions on $h$ to be specified, the function
$$ \theta \mapsto E\big[h^2(Z)\,e^{\frac{\theta^2}{2}-\theta Z}\big] $$

is strictly convex and differentiable on the whole real line, with a derivative given by
$$ \theta \mapsto E\big[h^2(Z)\,(\theta - Z)\,e^{\frac{\theta^2}{2}-\theta Z}\big]. $$
(b) Show that if h is an even function, then this parametric importance sampling procedure by
mean translation is useless. Give a necessary and sufficient condition (involving h and Z) that
makes it always useful.
2. Set r = 0, σ = 0.2, X0 = x = 70, T = 1. One wishes to price a Call option with strike price
K = 100 (i.e. deep out-of-the-money). The true Black-Scholes price is 0.248.
Compare the performances of
(i) a “crude” Monte Carlo simulation,
(ii) the above “intuitively guided” heuristic choices for θ.
Assume now that x = K = 100. What do you think of the heuristic suboptimal choices?
3. Write out all that precedes when $Z \sim \mathcal N(0; I_d)$.
4. Randomization of an integral. Let $h \in L^1(\mathbb R^d, \mathrm{Bor}(\mathbb R^d), \lambda_d)$.
(a) Show that for any $\mathbb R^d$-valued random vector $Y$ having an absolutely continuous distribution $P_Y = g\cdot\lambda_d$ with $g > 0$ $\lambda_d$-a.s. on $\{h \neq 0\}$, one has
$$ \int_{\mathbb R^d} h\,d\lambda_d = E\Big[\frac hg(Y)\Big]. $$
Derive a probabilistic method to compute $\int_{\mathbb R^d} h\,d\lambda_d$.
(b) Propose an importance sampling approach to that problem inspired by the above examples.

3.6.4 Computing the Value-at-Risk by Monte Carlo simulation: first approach.


Let X be a real-valued random variable defined on a probability space, representative of a loss. For
the sake of simplicity we suppose here that X has a continuous distribution i.e. that its distribution
function, defined for every x ∈ R by F(x) := P(X ≤ x), is continuous. For a given confidence level α ∈ (0, 1) (usually close to 1), the Value-at-Risk at level α (denoted V@R_α or V@R_{α,X},
following [49]) is any real number satisfying the equation

P(X ≤ V @Rα,X ) = α ∈ (0, 1). (3.12)

Since F is continuous, Equation (3.12) has at least one solution, which may not be unique in general. For convenience, one often takes the lowest solution of Equation (3.12) as V@R_{α,X}. In fact, Value-at-Risk (of X) is not consistent as a measure of risk (as emphasized in [49]), but it is still widely used nowadays to measure financial risk.
One naive way to compute $V@R_{\alpha,X}$ is to estimate the empirical distribution function of a (large enough) Monte Carlo simulation at points $\xi$ lying in a grid $\Gamma := \{\xi_i,\ i \in I\}$, namely
$$ \widehat F(\xi)_M := \frac1M\sum_{k=1}^M \mathbf 1_{\{X_k\le\xi\}}, \qquad \xi\in\Gamma, $$
3.6. IMPORTANCE SAMPLING 85

where $(X_k)_{k\ge1}$ is an i.i.d. sequence of $X$-distributed random variables. Then one solves the equation $\widehat F(\xi)_M = \alpha$ (using an interpolation step, of course).
Such an approach based on the empirical distribution of $X$ requires simulating extreme values of $X$, since $\alpha$ is usually close to 1. So implementing a Monte Carlo simulation directly on the above equation is usually a somewhat meaningless exercise. Importance sampling becomes the natural way to "re-center" the equation.
Assume e.g. that
$$ X := \varphi(Z), \qquad Z \sim \mathcal N(0,1). $$
Then, for every $\xi \in \mathbb R$,
$$ P(X\le\xi) = e^{-\frac{\theta^2}{2}}\,E\big[\mathbf 1_{\{\varphi(Z+\theta)\le\xi\}}\,e^{-\theta Z}\big], $$
so that the above Equation (3.12) now reads ($\theta$ being fixed)
$$ E\big[\mathbf 1_{\{\varphi(Z+\theta)\le V@R_{\alpha,X}\}}\,e^{-\theta Z}\big] = e^{\frac{\theta^2}{2}}\,\alpha. $$

It remains to find good $\theta$'s. Of course this choice depends on $\xi$, but in practice it should be fitted to reduce the variance in the neighbourhood of $V@R_{\alpha,X}$. We will see in Chapter 6 that more efficient methods based on Stochastic Approximation can be devised. But they also need variance reduction to be implemented. Furthermore, similar ideas can be used to compute a consistent measure of risk called the Conditional Value-at-Risk (or Average Value-at-Risk). All these topics will be investigated in Chapter 6.
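A minimal sketch of this idea (with an invented monotone loss $\varphi(z) = e^z$, so that the true value $\exp(\Phi^{-1}(\alpha))$ is known in closed form): since a positive shift $\theta$ controls the variance of the tail estimate rather than that of its complement, we estimate $P(X > \xi) = e^{-\theta^2/2}E\big[\mathbf 1_{\{\varphi(Z+\theta)>\xi\}}e^{-\theta Z}\big]$ and solve $P(X > \xi) = 1 - \alpha$ by bisection (a grid with interpolation would do as well).

```python
import math
import random
import statistics

random.seed(4)

phi = lambda z: math.exp(z)      # illustrative monotone loss: X = e^Z
alpha, theta = 0.999, 3.0        # θ re-centers the Gaussian sample near the α-quantile

# precompute the shifted scenarios φ(Z+θ) together with their weights e^{-θZ}
zs = [random.gauss(0.0, 1.0) for _ in range(100_000)]
vals = [(phi(z + theta), math.exp(-theta * z)) for z in zs]

def tail_hat(xi):
    # importance sampling estimate of P(X > ξ) = e^{-θ²/2} E[ 1_{φ(Z+θ) > ξ} e^{-θZ} ]
    w = math.exp(-theta**2 / 2)
    return w * statistics.fmean(wt if v > xi else 0.0 for v, wt in vals)

# tail_hat is non-increasing in ξ: bisect tail_hat(ξ) = 1 - α on a bracket
lo, hi = 1.0, 100.0
for _ in range(30):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if tail_hat(mid) > 1 - alpha else (lo, mid)
var_alpha = 0.5 * (lo + hi)      # estimate of V@R_α; true value exp(Φ^{-1}(0.999)) ≈ 21.98
```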
Chapter 4

The Quasi-Monte Carlo method

In this chapter we present the so-called quasi-Monte Carlo (QMC) method, which can be seen as a deterministic alternative to the standard Monte Carlo method: the pseudo-random numbers are replaced by deterministic, computable sequences of $[0,1]^d$-valued vectors which, once substituted mutatis mutandis for pseudo-random numbers in the Monte Carlo method, may significantly speed up its rate of convergence, making it almost independent of the structural dimension $d$ of the simulation.

4.1 Motivation and definitions


Computing an expectation $E\,\varphi(X)$ using a Monte Carlo simulation ultimately amounts to computing either a finite-dimensional integral
$$ \int_{[0,1]^d} f(u^1,\dots,u^d)\,du^1\cdots du^d $$
or an infinite-dimensional integral
$$ \int_{[0,1]^{(\mathbb N^*)}} f(u)\,\lambda_\infty(du), $$
where $[0,1]^{(\mathbb N^*)}$ denotes the set of finite $[0,1]$-valued sequences (or, equivalently, sequences vanishing after a finite range) and $\lambda_\infty = \lambda^{\otimes\mathbb N}$ is the Lebesgue measure on $([0,1]^{\mathbb N}, \mathrm{Bor}([0,1]^{\mathbb N}))$. Integrals of the first type show up when $X$ can be simulated by standard methods like the inverse distribution function, Box-Müller, etc., so that $X = g(U)$, $U = (U^1,\dots,U^d) \sim \mathcal U([0,1]^d)$, whereas the second type is typical of a simulation using an acceptance-rejection method.
As concerns finite-dimensional integrals, we saw that if $(U_n)_{n\ge1}$ denotes an i.i.d. sequence of uniformly distributed random vectors on $[0,1]^d$, then for every function $f \in L^1_{\mathbb R}([0,1]^d, \lambda_d)$,
$$ P(d\omega)\text{-a.s.}\qquad \frac1n\sum_{k=1}^n f(U_k(\omega)) \longrightarrow E\,f(U_1) = \int_{[0,1]^d} f(u^1,\dots,u^d)\,du^1\cdots du^d, \tag{4.1} $$

where the subset $\Omega_f$ of $P$-probability 1 on which this convergence does hold depends on the function $f$. In particular, the above a.s.-convergence holds for any continuous function on $[0,1]^d$. But in fact, taking advantage of the separability of the space of continuous functions, we will show below that this convergence simultaneously holds for all continuous functions on $[0,1]^d$, and even on the larger class of Riemann integrable functions on $[0,1]^d$.
First we briefly recall the basic definition of weak convergence of probability measures on metric
spaces (see [24] Chapter 1, for a general introduction to weak convergence of measures on metric
spaces).

Definition 4.1 (Weak convergence) Let $(S, \delta)$ be a metric space and $\mathcal S := \mathrm{Bor}_\delta(S)$ its Borel $\sigma$-field. Let $(\mu_n)_{n\ge1}$ be a sequence of probability measures on $(S, \mathcal S)$ and $\mu$ a probability measure on the same space. The sequence $(\mu_n)_{n\ge1}$ weakly converges to $\mu$ (denoted $\mu_n \overset{(S)}{\Longrightarrow} \mu$ throughout this chapter) if, for every function $f \in \mathcal C_b(S, \mathbb R)$,
$$ \int_S f\,d\mu_n \longrightarrow \int_S f\,d\mu \quad\text{as } n\to+\infty. \tag{4.2} $$

One important result on weak convergence of probability measures, which we will use repeatedly in this chapter, is the following.

Proposition 4.1 (see Theorem 5.1 in [24]) If $\mu_n \overset{(S)}{\Longrightarrow} \mu$, then the convergence (4.2) holds for every bounded Borel function $f : (S, \mathcal S) \to \mathbb R$ such that
$$ \mu(\mathrm{Disc}(f)) = 0, \quad\text{where}\quad \mathrm{Disc}(f) := \{x \in S :\ f \text{ is discontinuous at } x\}. $$
Functions $f$ such that $\mu(\mathrm{Disc}(f)) = 0$ are called $\mu$-a.s. continuous functions.

Theorem 4.1 (Glivenko-Cantelli) If $(U_n)_{n\ge1}$ is an i.i.d. sequence of uniformly distributed random variables on the unit hypercube $[0,1]^d$, then
$$ P(d\omega)\text{-a.s.}\qquad \frac1n\sum_{k=1}^n \delta_{U_k(\omega)} \overset{([0,1]^d)}{\Longrightarrow} \lambda_{d|[0,1]^d} = \mathcal U([0,1]^d), $$
i.e.
$$ P(d\omega)\text{-a.s.}\quad \forall f\in\mathcal C([0,1]^d,\mathbb R),\qquad \frac1n\sum_{k=1}^n f(U_k(\omega)) \longrightarrow \int_{[0,1]^d} f(x)\,\lambda_d(dx). \tag{4.3} $$

Proof. The vector space $\mathcal C([0,1]^d, \mathbb R)$, endowed with the sup-norm $\|f\|_\infty := \sup_{x\in[0,1]^d}|f(x)|$, is separable in the sense that there exists a sequence $(f_m)_{m\ge1}$ of continuous functions on $[0,1]^d$ which is everywhere dense in $\mathcal C([0,1]^d, \mathbb R)$ for the sup-norm ($^1$). Then, a countable union of $P$-negligible sets remaining $P$-negligible, one derives from (4.1) the existence of $\Omega_0 \subset \Omega$ such that $P(\Omega_0) = 1$ and
$$ \forall\,\omega\in\Omega_0,\ \forall\,m\ge1,\qquad \frac1n\sum_{k=1}^n f_m(U_k(\omega)) \longrightarrow E\,f_m(U_1) = \int_{[0,1]^d} f_m(u)\,\lambda_d(du) \quad\text{as } n\to+\infty. $$
$^1$ When d = 1, an easy way to construct this sequence is to consider the countable family of piecewise affine functions with monotony breaks at rational points of the unit interval and taking rational values at these break points (and at 0 and 1). The density follows from that of the set Q of rational numbers. When d ≥ 2, one proceeds likewise by considering functions which are affine on hyper-rectangles with rational vertices which tile the unit hypercube.

On the other hand, it is straightforward that, for any $f \in \mathcal C([0,1]^d, \mathbb R)$, every $n \ge 1$ and every $\omega \in \Omega_0$,
$$ \Big|\frac1n\sum_{k=1}^n f_m(U_k(\omega)) - \frac1n\sum_{k=1}^n f(U_k(\omega))\Big| \le \|f-f_m\|_\infty \quad\text{and}\quad \big|E\,f_m(U_1) - E\,f(U_1)\big| \le \|f-f_m\|_\infty. $$
As a consequence, for every $m \ge 1$,
$$ \limsup_n \Big|\frac1n\sum_{k=1}^n f(U_k(\omega)) - E\,f(U_1)\Big| \le 2\,\|f-f_m\|_\infty. $$

Now, the fact that the sequence $(f_m)_{m\ge1}$ is everywhere dense in $\mathcal C([0,1]^d, \mathbb R)$ for the sup-norm exactly means that
$$ \liminf_m \|f - f_m\|_\infty = 0. $$
This completes the proof since it shows that, for every $\omega \in \Omega_0$, the expected convergence holds for every continuous function on $[0,1]^d$. ♦

Corollary 4.1 Owing to Proposition 4.1, one may replace in (4.3) the set of continuous functions on $[0,1]^d$ by that of all bounded Borel $\lambda_d$-a.s. continuous functions on $[0,1]^d$.

Remark. In fact, one may even relax the Borel measurability into "Lebesgue" measurability: namely, for a function $f : [0,1]^d \to \mathbb R$, replace the $(\mathcal B([0,1]^d), \mathrm{Bor}(\mathbb R))$-measurability by the $(\mathcal L([0,1]^d), \mathrm{Bor}(\mathbb R))$-measurability, where $\mathcal L([0,1]^d)$ denotes the completion of the Borel $\sigma$-field on $[0,1]^d$ by the $\lambda_d$-negligible sets (see [30], Chapter 13). Such bounded, $\lambda_d$-a.s. continuous, Lebesgue-measurable functions are known as Riemann integrable functions on $[0,1]^d$ (see again [30], Chapter 13).

What precedes suggests that, as long as one wishes to compute quantities $E\,f(U)$ for (reasonably) smooth functions $f$, one only needs access to a sequence whose empirical distribution satisfies the above convergence property. Furthermore, we know from the first chapter, devoted to simulation, that this situation is generic since all distributions can be simulated from a uniformly distributed random variable. This leads to the following definition of a uniformly distributed sequence.

Definition 4.2 A $[0,1]^d$-valued sequence $(\xi_n)_{n\ge1}$ is uniformly distributed (u.d.) on $[0,1]^d$ if
$$ \frac1n\sum_{k=1}^n \delta_{\xi_k} \overset{([0,1]^d)}{\Longrightarrow} \mathcal U([0,1]^d) \quad\text{as } n\to+\infty. $$

We need some characterizations of uniform distribution that are easier to check on examples; these are provided by the proposition below. To this end, we first introduce some further definitions and notations.

Definition 4.3 (a) A partial order, denoted "$\le$", on $[0,1]^d$ is defined by: for every $x = (x^1,\dots,x^d)$, $y = (y^1,\dots,y^d) \in [0,1]^d$,
$$ x \le y \quad\text{if}\quad x^i \le y^i,\ 1\le i\le d. $$

(b) The “box” [[x, y]] is defined for every x = (x1 , . . . , xd ), y = (y 1 , . . . , y d ) ∈ [0, 1]d , x ≤ y, by

[[x, y]] := {ξ ∈ [0, 1]d , x ≤ ξ ≤ y}.

Note that $[[x,y]] \ne \emptyset$ if and only if $x \le y$ and, in that case, $[[x,y]] = \prod_{i=1}^d [x^i, y^i]$.

Notation. In particular the unit hypercube [0, 1]d can be denoted [[0, 1]].

Proposition 4.2 (Portemanteau Theorem) (see among others [86, 27]) Let $(\xi_n)_{n\ge1}$ be a $[0,1]^d$-valued sequence. The following assertions are equivalent.

(i) $(\xi_n)_{n\ge1}$ is uniformly distributed on $[0,1]^d$.

(ii) For every $x \in [[0,1]] = [0,1]^d$,
$$ \frac1n\sum_{k=1}^n \mathbf 1_{[[0,x]]}(\xi_k) \longrightarrow \lambda_d([[0,x]]) = \prod_{i=1}^d x^i \quad\text{as } n\to+\infty. $$

(iii) ("Discrepancy at the origin" or "star discrepancy")
$$ D_n^*(\xi) := \sup_{x\in[0,1]^d}\Big|\frac1n\sum_{k=1}^n \mathbf 1_{[[0,x]]}(\xi_k) - \prod_{i=1}^d x^i\Big| \longrightarrow 0 \quad\text{as } n\to+\infty. $$

(iv) ("Extreme discrepancy")
$$ D_n^\infty(\xi) := \sup_{x,y\in[0,1]^d}\Big|\frac1n\sum_{k=1}^n \mathbf 1_{[[x,y]]}(\xi_k) - \prod_{i=1}^d (y^i - x^i)\Big| \longrightarrow 0 \quad\text{as } n\to+\infty. $$

(v) (Weyl's criterion) For every integer $p \in \mathbb N^d \setminus \{0\}$,
$$ \frac1n\sum_{k=1}^n e^{2\tilde\imath\pi(p|\xi_k)} \longrightarrow 0 \quad\text{as } n\to+\infty \qquad(\text{where } \tilde\imath^2 = -1). $$

(vi) (Bounded Riemann integrable functions) For every bounded, $\lambda_d$-a.s. continuous, Lebesgue-measurable function $f : [0,1]^d \to \mathbb R$,
$$ \frac1n\sum_{k=1}^n f(\xi_k) \longrightarrow \int_{[0,1]^d} f(x)\,\lambda_d(dx) \quad\text{as } n\to+\infty. $$

Remark. The two moduli introduced in items (iii) and (iv) define the discrepancy at the origin
and the extreme discrepancy respectively.

Sketch of proof. The ingredients of the proof come from the theory of weak convergence of
probability measures. For more details in the multi-dimensional setting we refer to [24] (Chapter 1
devoted to the general theory of weak convergence of probability measures on a Polish space)

or [86] (an old but pleasant book devoted to u.d. sequences). We provide some elements of proof in one dimension.
The equivalence (i) $\Longleftrightarrow$ (ii) is simply the characterization of weak convergence of probability measures by the convergence of their distribution functions ($^2$), since the distribution function $F_{\mu_n}$ of $\mu_n = \frac1n\sum_{1\le k\le n}\delta_{\xi_k}$ is given by $F_{\mu_n}(x) = \frac1n\sum_{1\le k\le n}\mathbf 1_{\{0\le\xi_k\le x\}}$.
Owing to the second Dini Lemma, this convergence of non-decreasing (distribution) functions is uniform as soon as it holds pointwise, since the pointwise limiting function $F_{\mathcal U([0,1])}(x) = x$ is continuous. This remark yields the equivalence (ii) $\Longleftrightarrow$ (iii). Although more technical, the $d$-dimensional extension remains elementary and relies on a similar principle.
The equivalence (iii) $\Longleftrightarrow$ (iv) is trivial since $D_n^*(\xi) \le D_n^\infty(\xi) \le 2\,D_n^*(\xi)$ in one dimension. In $d$ dimensions, the inequality on the right-hand side holds with $2^d$ instead of 2.
Item (v) is based on the fact that weak convergence of finite measures on $[0,1]^d$ is characterized by the convergence of the sequences of their Fourier coefficients (the Fourier coefficients of a finite measure $\mu$ on $([0,1]^d, \mathrm{Bor}([0,1]^d))$ being defined by $c_p(\mu) := \int_{[0,1]^d} e^{2\tilde\imath\pi(p|u)}\,\mu(du)$, $p \in \mathbb N^d$, $\tilde\imath^2 = -1$). One checks that the Fourier coefficients of $\lambda_{d|[0,1]^d}$ are simply $c_p(\lambda_{d|[0,1]^d}) = 0$ if $p \ne 0$ and $c_0(\lambda_{d|[0,1]^d}) = 1$.
Item (vi) follows from (i) and Proposition 4.1 since, for every $x \in [0,1]$, $f_x(\xi) := \mathbf 1_{\{\xi\le x\}}$ is continuous outside $\{x\}$, which is clearly Lebesgue negligible. Conversely, (vi) implies the pointwise convergence of the distribution functions $F_{\mu_n}$ defined above toward $F_{\mathcal U([0,1])}$. ♦

The discrepancy at the origin $D_n^*(\xi)$ plays a central rôle in the theory of uniformly distributed sequences: it does not only provide a criterion for uniform distribution, it also appears as an upper error modulus for numerical integration when the function $f$ has the appropriate regularity (see the Koksma-Hlawka Inequality below).

Remark. One can define the discrepancy at the origin $D_n^*(\xi_1,\dots,\xi_n)$ of a given $n$-tuple $(\xi_1,\dots,\xi_n) \in ([0,1]^d)^n$ by the expression in item (iii) of the above Proposition.

The geometric interpretation of the discrepancy is the following: if $x := (x^1,\dots,x^d) \in [[0,1]]$, the hyper-volume of $[[0,x]]$ is equal to the product $x^1\cdots x^d$ and
$$ \frac1n\sum_{k=1}^n \mathbf 1_{[[0,x]]}(\xi_k) = \frac{\mathrm{card}\,\{k\in\{1,\dots,n\} :\ \xi_k\in[[0,x]]\}}{n} $$
is but the frequency with which the first $n$ points $\xi_k$ of the sequence fall into $[[0,x]]$. The discrepancy measures the maximal induced error as $x$ runs over $[[0,1]]$.

The first exercise below provides a first example of a uniformly distributed sequence.

▷ Exercises. 1. Rotations of the torus $([0,1]^d, +)$. Let $(\alpha_1,\dots,\alpha_d) \in (\mathbb R\setminus\mathbb Q)^d$ (irrational numbers) be such that $1, \alpha_1, \dots, \alpha_d$ are linearly independent over $\mathbb Q$ ($^3$). Let $x = (x^1,\dots,x^d) \in \mathbb R^d$. Set, for every $n \ge 1$,
$$ \xi_n := \big(\{x^i + n\,\alpha_i\}\big)_{1\le i\le d}, $$
$^2$ The distribution function $F_\mu$ of a probability measure $\mu$ on $[0,1]$ is defined by $F_\mu(x) = \mu([0,x])$. One shows that a sequence of probability measures $\mu_n$ converges weakly toward a probability measure $\mu$ if and only if their distribution functions satisfy $F_{\mu_n}(x) \to F_\mu(x)$ at every $x \in \mathbb R$ such that $F_\mu$ is continuous at $x$ (see [24], Chapter 1).
3
This means that if λi ∈ Q, i = 0, . . . , d satisfy λ0 + λ1 α1 + · · · + λd αd = 0 then λ0 = λ1 = · · · = λd = 0.

where {x} denotes the fractional part of a real number x. Show that the sequence (ξn )n≥1 is
uniformly distributed on [0, 1]d (and can be recursively generated). [Hint: use the Weyl criterion.]
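A minimal sketch of this sequence in dimension 1 ($\alpha = \sqrt2 \bmod 1$, $x = 0$; all choices illustrative): the sequence is generated recursively and the Weyl exponential sums, which characterize uniform distribution, are checked to be small.

```python
import cmath
import math

alpha = math.sqrt(2.0) % 1.0          # irrational rotation number
seq, xi = [], 0.0
for _ in range(100_000):              # ξ_{n+1} = {ξ_n + α}: recursive generation
    xi = (xi + alpha) % 1.0
    seq.append(xi)

# Weyl criterion: (1/n) Σ_k e^{2iπ p ξ_k} should tend to 0 for every p ≠ 0
weyl = {p: abs(sum(cmath.exp(2j * math.pi * p * x) for x in seq)) / len(seq)
        for p in (1, 2, 3)}
```

For a rotation, these exponential sums are geometric and stay bounded, so dividing by $n$ makes them tiny; the empirical mean of the sequence is correspondingly close to $1/2$.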

2. (a) Assume $d = 1$. Show that for every $n$-tuple $(\xi_1,\dots,\xi_n) \in [0,1]^n$,
$$ D_n^*(\xi_1,\dots,\xi_n) = \max_{1\le k\le n}\max\Big(\xi_k^{(n)} - \frac{k-1}{n},\ \frac kn - \xi_k^{(n)}\Big), $$
where $(\xi_k^{(n)})_{1\le k\le n}$ is the non-decreasing reordering of the $n$-tuple $(\xi_1,\dots,\xi_n)$. [Hint: how does the "càdlàg" function $x \mapsto \frac1n\sum_{k=1}^n \mathbf 1_{\{\xi_k\le x\}} - x$ reach its supremum?]
(b) Deduce that
$$ D_n^*(\xi_1,\dots,\xi_n) = \frac1{2n} + \max_{1\le k\le n}\Big|\xi_k^{(n)} - \frac{2k-1}{2n}\Big|. $$
(c) Minimal discrepancy at the origin. Show that the $n$-tuple with the lowest discrepancy (at the origin) is $\big(\frac{2k-1}{2n}\big)_{1\le k\le n}$ (the "mid-point" uniform $n$-tuple), with discrepancy $\frac1{2n}$.

4.2 Application to numerical integration: functions with finite variation
Definition 4.4 (see [133, 27]) A function f : [0, 1]d → R has finite variation in the measure sense
if there exists a signed measure (4 ) ν on ([0, 1]d , Bor([0, 1]d )) such that ν({0}) = 0 and

∀ x ∈ [0, 1]d , f (x) = f (1) + ν([[0, 1 − x]])

(or equivalently f (x) = f (0) − ν(c [[0, 1 − x]])). The variation V (f ) is defined by

V (f ) := |ν|([0, 1]d )

where |ν| is the variation measure of ν.

▷ Exercises. 1. Show that $f$ has finite variation in the measure sense if and only if there exists a signed measure $\nu$ with $\nu(\{1\}) = 0$ such that
$$ \forall\,x\in[0,1]^d,\qquad f(x) = f(1) + \nu([[x,1]]) = f(0) - \nu\big({}^c[[x,1]]\big), $$
and that its variation is then given by $|\nu|([0,1]^d)$. This could of course be taken as an equivalent definition.
2. Show that the function $f$ defined on $[0,1]^2$ by
$$ f(x^1, x^2) := (x^1 + x^2)\wedge 1, \qquad (x^1,x^2)\in[0,1]^2, $$


$^4$ A signed measure $\nu$ on a space $(X, \mathcal X)$ is a mapping from $\mathcal X$ to $\mathbb R$ which satisfies the two axioms of a measure, namely $\nu(\emptyset) = 0$ and, if $A_n$, $n \ge 1$, are pairwise disjoint, $\nu(\cup_n A_n) = \sum_{n\ge1}\nu(A_n)$ (the series is commutatively convergent, hence absolutely convergent). Such a measure is finite and can be decomposed as $\nu = \nu_1 - \nu_2$ where $\nu_1$, $\nu_2$ are non-negative finite measures supported by disjoint sets, i.e. there exists $A \in \mathcal X$ such that $\nu_1(A^c) = \nu_2(A) = 0$ (see [145]).

has finite variation in the measure sense [Hint: consider the distribution of $(U, 1-U)$, $U \sim \mathcal U([0,1])$].

For the class of functions with finite variation, the Koksma-Hlawka Inequality provides a bound, based on the star discrepancy, for the integration error
$$ \Big|\frac1n\sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(x)\,dx\Big|. $$
k=1

Proposition 4.3 (Koksma-Hlawka Inequality; 1943 when $d = 1$) Let $\xi = (\xi_1,\dots,\xi_n)$ be an $n$-tuple of $[0,1]^d$-valued vectors and let $f : [0,1]^d \to \mathbb R$ be a function with finite variation (in the measure sense). Then
$$ \Big|\frac1n\sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(x)\,\lambda_d(dx)\Big| \le V(f)\,D_n^*(\xi). $$
Pn
Proof. Set $\tilde\mu_n = \frac1n\sum_{k=1}^n \delta_{\xi_k} - \lambda_{d|[0,1]^d}$. It is a signed measure with zero total mass. Then, if $f$ has finite variation with respect to a signed measure $\nu$,
$$\begin{aligned}
\frac1n\sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(x)\,\lambda_d(dx) &= \int_{[0,1]^d} f(x)\,\tilde\mu_n(dx)\\
&= f(1)\,\tilde\mu_n([0,1]^d) + \int_{[0,1]^d} \nu([[0,1-x]])\,\tilde\mu_n(dx)\\
&= 0 + \int_{[0,1]^d}\Big(\int \mathbf 1_{\{v\le 1-x\}}\,\nu(dv)\Big)\,\tilde\mu_n(dx)\\
&= \int_{[0,1]^d} \tilde\mu_n([[0,1-v]])\,\nu(dv),
\end{aligned}$$
where we used Fubini's Theorem to interchange the order of integration (which is possible since $|\nu|\otimes|\tilde\mu_n|$ is a finite measure). Finally, using the extended triangle inequality for integrals with respect to signed measures,
$$\begin{aligned}
\Big|\frac1n\sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(x)\,\lambda_d(dx)\Big| &= \Big|\int_{[0,1]^d} \tilde\mu_n([[0,1-v]])\,\nu(dv)\Big|\\
&\le \int_{[0,1]^d} \big|\tilde\mu_n([[0,1-v]])\big|\,|\nu|(dv)\\
&\le \sup_{v\in[0,1]^d}\big|\tilde\mu_n([[0,v]])\big|\;|\nu|([0,1]^d)\\
&= D_n^*(\xi)\,V(f). \qquad\diamond
\end{aligned}$$
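A one-dimensional sanity check of the inequality (illustrative choices): the mid-point $n$-tuple has $D_n^* = 1/(2n)$ (see the exercise in Section 4.1), and $f(x) = x^2$, being non-decreasing with $f(1) - f(0) = 1$, has variation $V(f) = 1$, so the Koksma-Hlawka bound reads $|\frac1n\sum_k f(\xi_k) - \frac13| \le \frac1{2n}$.

```python
n = 64
pts = [(2 * k - 1) / (2 * n) for k in range(1, n + 1)]   # mid-point n-tuple: D_n* = 1/(2n)
f = lambda x: x * x                                       # V(f) = f(1) - f(0) = 1
error = abs(sum(f(x) for x in pts) / n - 1.0 / 3.0)
bound = 1.0 * (1.0 / (2 * n))                             # V(f) · D_n*
```

Here the actual error is $1/(12n^2)$, well inside the bound, illustrating that Koksma-Hlawka is a worst-case estimate.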

Remarks. • The notion of finite variation in the measure sense was introduced in [27] and [133]. When $d = 1$, it coincides with the notion of left-continuous functions with finite variation. When $d \ge 2$, it is a slightly more restrictive notion than "finite variation" in the Hardy and Krause sense (see e.g. [86, 118] for a definition of this slightly more general notion). However, it is much easier to handle, in particular to establish the above Koksma-Hlawka Inequality! Furthermore, $V(f) \le V_{H\&K}(f)$. Conversely, one shows that a function with finite variation in the Hardy and Krause sense is $\lambda_d$-a.s. equal to a function $\tilde f$ having finite variation in the measure sense and satisfying $V(\tilde f) \le V_{H\&K}(f)$. In one dimension, finite variation in the Hardy and Krause sense exactly coincides with the standard definition of finite variation.
• A classical criterion (see [27, 133]) for finite variation in the measure sense is the following: if $f : [0,1]^d \to \mathbb R$ has a cross derivative $\frac{\partial^d f}{\partial x^1\cdots\partial x^d}$ in the distribution sense which is an integrable function, i.e.
$$ \int_{[0,1]^d} \Big|\frac{\partial^d f}{\partial x^1\cdots\partial x^d}(x^1,\dots,x^d)\Big|\,dx^1\cdots dx^d < +\infty, $$
then $f$ has finite variation in the measure sense. This class includes the functions $f$ defined by
$$ f(x) = f(1) + \int_{[[0,1-x]]} \varphi(u^1,\dots,u^d)\,du^1\dots du^d, \qquad \varphi\in L^1([0,1]^d,\lambda_d). \tag{4.4} $$

▷ Exercises. 1. Show that the function $f$ on $[0,1]^3$ defined by
$$ f(x^1,x^2,x^3) := (x^1+x^2+x^3)\wedge 1 $$
does not have finite variation in the measure sense [Hint: its third derivative in the distribution sense is not a measure] ($^5$).
2. (a) Show directly that if $f$ satisfies (4.4), then for any $n$-tuple $(\xi_1,\dots,\xi_n)$,
$$ \Big|\frac1n\sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(x)\,\lambda_d(dx)\Big| \le \|\varphi\|_{L^1(\lambda_{d|[0,1]^d})}\,D_n^*(\xi_1,\dots,\xi_n). $$

(b) Show that if $\varphi$ also lies in $L^p([0,1]^d, \lambda_d)$ for an exponent $p \in [1,+\infty]$ with Hölder conjugate $q$, then
$$ \Big|\frac1n\sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(x)\,\lambda_d(dx)\Big| \le \|\varphi\|_{L^p(\lambda_{d|[0,1]^d})}\,D_n^{(q)}(\xi_1,\dots,\xi_n), $$
where
$$ D_n^{(q)}(\xi_1,\dots,\xi_n) = \bigg(\int_{[0,1]^d}\Big|\frac1n\sum_{k=1}^n \mathbf 1_{[[0,x]]}(\xi_k) - \prod_{i=1}^d x^i\Big|^q\,\lambda_d(dx)\bigg)^{\!1/q}. $$
This modulus is called the $L^q$-discrepancy of $(\xi_1,\dots,\xi_n)$.


3. Other forms of finite variation and of the Koksma-Hlawka inequality. Let $f : [0,1]^d \to \mathbb R$ be defined by
$$ f(x) = \nu([[0,x]]), \qquad x\in[0,1]^d, $$
where $\nu$ is a signed measure on $[0,1]^d$. Show that
$$ \Big|\frac1n\sum_{k=1}^n f(\xi_k) - \int_{[0,1]^d} f(x)\,\lambda_d(dx)\Big| \le |\nu|([0,1]^d)\;D_n^\infty(\xi_1,\dots,\xi_n). $$
$^5$ In fact it does not have finite variation in the Hardy & Krause sense either.

4. $L^1$-discrepancy in one dimension. Let $d = 1$ and $q = 1$. Show that the $L^1$-discrepancy satisfies
$$ D_n^{(1)}(\xi_1,\dots,\xi_n) = \sum_{k=0}^n \int_{\xi_k^{(n)}}^{\xi_{k+1}^{(n)}} \Big|\frac kn - u\Big|\,du, $$
where $\xi_0^{(n)} = 0$, $\xi_{n+1}^{(n)} = 1$ and $\xi_1^{(n)} \le \dots \le \xi_n^{(n)}$ is the reordering of $(\xi_1,\dots,\xi_n)$.
where ξ0 = 0, ξn+1 = 1 and ξ1 ≤ · · · ≤ ξn is the reordering of (ξ1 , . . . , ξn ).

4.3 Sequences with low discrepancy: definition(s) and examples


4.3.1 Back again to the Monte Carlo method on [0, 1]^d
Let $(U_n)_{n\ge1}$ be an i.i.d. sequence of random vectors uniformly distributed over $[0,1]^d$, defined on a probability space $(\Omega, \mathcal A, \mathbb P)$. We saw that

P(dω)-a.s., (Un (ω))n≥1 is uniformly distributed.

So, it is natural to evaluate its (random) discrepancy $D_n^*\big((U_k(\omega))_{k\ge1}\big)$ as a measure of its uniform distribution, and one may wonder at which rate it goes to zero: to be more precise, is there a transfer to this discrepancy modulus of the Central Limit Theorem (CLT) and of the Law of the Iterated Logarithm (LIL), which rule the weak and strong rates of convergence in the SLLN? The answer is positive, since $D_n^*\big((U_n)_{n\ge1}\big)$ satisfies both a CLT and a LIL. These results are due to Chung (see e.g. [33]).

▷ Chung's CLT for the star discrepancy.
$$ \sqrt n\,D_n^*\big((U_k)_{k\ge1}\big) \overset{\mathcal L}{\longrightarrow} \sup_{x\in[0,1]^d}\big|Z_x^d\big| \quad\text{and}\quad E\,D_n^*\big((U_k)_{k\ge1}\big) \sim \frac{E\big(\sup_{x\in[0,1]^d}|Z_x^d|\big)}{\sqrt n} \quad\text{as } n\to+\infty, $$

where $(Z_x^d)_{x\in[0,1]^d}$ denotes the centered Gaussian multi-index process (or "bridged hyper-sheet") with covariance structure given by
$$ \forall\,x=(x^1,\dots,x^d),\ y=(y^1,\dots,y^d)\in[0,1]^d,\qquad \mathrm{Cov}(Z_x^d,Z_y^d) = \prod_{i=1}^d x^i\wedge y^i - \prod_{i=1}^d x^i\,\prod_{i=1}^d y^i. $$

When $d = 1$, $Z^1$ is simply the Brownian bridge over the unit interval $[0,1]$, and the distribution of its sup-norm is given, through its tail (or survival) function, by
$$ \forall\,z\in\mathbb R_+,\qquad P\Big(\sup_{x\in[0,1]}|Z_x|\ge z\Big) = 2\sum_{k\ge1}(-1)^{k+1}\,e^{-2k^2z^2} $$
(see [40] or [24], Chapter 2).
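The alternating series above converges extremely fast and is easy to evaluate numerically (a sketch; the function name is ours):

```python
import math

def bridge_sup_tail(z, terms=50):
    # P( sup_{x in [0,1]} |Z_x| >= z ) = 2 Σ_{k>=1} (-1)^{k+1} e^{-2 k² z²},  z > 0
    return 2.0 * sum(
        (-1) ** (k + 1) * math.exp(-2.0 * k * k * z * z) for k in range(1, terms + 1)
    )
```

For instance `bridge_sup_tail(1.36)` is about 0.049, the classical 5% critical value region of the Kolmogorov-Smirnov statistic.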



▷ Chung's LIL for the star discrepancy.
$$ \limsup_n \sqrt{\frac{2n}{\log(\log n)}}\;D_n^*\big((U_k(\omega))_{k\ge1}\big) = 1 \qquad P(d\omega)\text{-a.s.} $$

At this stage, one can provide a preliminary definition of a sequence with low discrepancy on $[0,1]^d$ as a $[0,1]^d$-valued sequence $\xi := (\xi_n)_{n\ge1}$ such that
$$ D_n^*(\xi) = o\bigg(\sqrt{\frac{\log(\log n)}{n}}\bigg) \quad\text{as } n\to+\infty, $$
which means that its use with a function of finite variation will speed up the convergence rate of numerical integration by the empirical measure with respect to the worst-case rate of the Monte Carlo simulation.

▷ Exercise. Show, using the standard LIL, the easy part of Chung's LIL, namely that, $P(d\omega)$-a.s.,
$$ \limsup_n \sqrt{\frac{2n}{\log(\log n)}}\;D_n^*\big((U_k(\omega))_{k\ge1}\big) \;\ge\; 2\sup_{x\in[0,1]^d}\sqrt{\lambda_d([[0,x]])\big(1-\lambda_d([[0,x]])\big)} \;=\; 1. $$

4.3.2 Roth's lower bounds for the discrepancy


Before providing examples of sequences with low discrepancy, let us first give some elements about the known lower bounds for the asymptotics of the (star) discrepancy of a uniformly distributed sequence.
The first results are due to Roth ([144]): there exists a universal constant $c_d \in (0,\infty)$ such that, for any $[0,1]^d$-valued $n$-tuple $(\xi_1,\dots,\xi_n)$,
$$ D_n^*(\xi_1,\dots,\xi_n) \ge c_d\,\frac{(\log n)^{\frac{d-1}{2}}}{n}. \tag{4.5} $$
Furthermore, there exists a real constant $\tilde c_d \in (0,\infty)$ such that, for every sequence $\xi = (\xi_n)_{n\ge1}$,
$$ D_n^*(\xi) \ge \tilde c_d\,\frac{(\log n)^{\frac d2}}{n} \quad\text{for infinitely many } n. \tag{4.6} $$
This second lower bound can be derived from the first one, using the Hammersley procedure introduced and analyzed in the next section.

On the other hand, we know (see Section 4.3.3 below) that there exist sequences for which
\[
\forall\, n\ge1,\qquad D_n^*(\xi)\ \le\ C(\xi)\,\frac{(\log n)^d}{n}\quad\text{where } C(\xi)<+\infty.
\]
Based on this, one can derive from the Hammersley procedure (see Section 4.3.4 below) the existence of a real constant $C_d\in(0,+\infty)$ such that
\[
\forall\, n\ge1,\ \exists\,(\xi_1,\ldots,\xi_n)\in([0,1]^d)^n,\qquad D_n^*(\xi_1,\ldots,\xi_n)\ \le\ C_d\,\frac{(\log n)^{d-1}}{n}.
\]

In spite of more than fifty years of investigation, the gap between these asymptotic lower and upper bounds has not been narrowed: it is still not known whether there exists a sequence for which $C(\xi)=0$, i.e. for which the rate $\frac{(\log n)^d}{n}$ is not optimal.

In fact, it is widely believed in the $QMC$ community that, in the above lower bounds, $\frac{d-1}2$ can be replaced by $d-1$ in (4.5) and $\frac d2$ by $d$ in (4.6), so that the rate $O\big(\frac{(\log n)^d}{n}\big)$ seems to be the lowest possible for a u.d. sequence. When $d=1$, Schmidt proved that this conjecture is true.

This leads to a more convincing definition of a sequence with low discrepancy.

Definition 4.5 A $[0,1]^d$-valued sequence $(\xi_n)_{n\ge1}$ is a sequence with low discrepancy if
\[
D_n^*(\xi)=O\Big(\frac{(\log n)^d}{n}\Big)\quad\text{as }n\to+\infty.
\]

For more insight about the other measures of uniform distribution ($L^p$-discrepancy $D_n^{(p)}(\xi)$, diaphony, etc.), we refer e.g. to [25].

4.3.3 Examples of sequences


▶ Van der Corput and Halton sequences. Let $p_1,\ldots,p_d$ be the first $d$ prime numbers (or simply, $d$ pairwise prime numbers). The $d$-dimensional Halton sequence is defined, for every $n\ge1$, by
\[
\xi_n=\big(\Phi_{p_1}(n),\ldots,\Phi_{p_d}(n)\big)
\]
where the so-called "radical inverse functions" $\Phi_p$ are defined by
\[
\Phi_p(n)=\sum_{k=0}^r\frac{a_k}{p^{k+1}}\quad\text{with}\quad n=a_0+a_1p+\cdots+a_rp^r,\ a_i\in\{0,\ldots,p-1\},\ a_r\ne0,
\]
the latter being the $p$-adic expansion of $n$. Then, for every $n\ge1$,
\[
D_n^*(\xi)\ \le\ \frac1n\prod_{i=1}^d\Big(1+(p_i-1)\frac{\log(p_in)}{\log(p_i)}\Big)\ =\ O\Big(\frac{(\log n)^d}{n}\Big)\quad\text{as }n\to+\infty. \tag{4.7}
\]

The proof of this bound essentially relies on the Chinese Remainder Theorem (known as "Théorème chinois" in French). Several improvements of this classical bound have been established: some of numerical interest (see e.g. [120]), some more theoretical like the one established by Faure (see [46]):
\[
D_n^*(\xi)\ \le\ \frac1n\Bigg(d+\prod_{i=1}^d\Big((p_i-1)\frac{\log n}{2\log p_i}+\frac{p_i+2}2\Big)\Bigg),\quad n\ge1.
\]
Remark. When $d=1$, the sequence $(\Phi_p(n))_{n\ge1}$ is called the $p$-adic Van der Corput sequence (and the integer $p$ need not be prime).

One easily checks that the first terms of the $VdC(2)$ sequence are as follows:
\[
\xi_1=\frac12,\ \xi_2=\frac14,\ \xi_3=\frac34,\ \xi_4=\frac18,\ \xi_5=\frac58,\ \xi_6=\frac38,\ \xi_7=\frac78,\ \ldots
\]
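The radical inverse functions are straightforward to implement; the following sketch (our illustration, with exact rational arithmetic) reproduces the first terms of the $VdC(2)$ sequence listed above:

```python
from fractions import Fraction

def radical_inverse(n, p):
    """p-adic radical inverse Phi_p(n): mirror the p-adic digits of n
    about the radix point (exact rational arithmetic)."""
    x, scale = Fraction(0), Fraction(1, p)
    while n > 0:
        n, digit = divmod(n, p)
        x += digit * scale
        scale /= p
    return x

def halton(n, primes=(2, 3)):
    """First n terms of the Halton sequence for the given pairwise prime bases."""
    return [tuple(radical_inverse(k, p) for p in primes) for k in range(1, n + 1)]

# First terms of the VdC(2) sequence: 1/2, 1/4, 3/4, 1/8, 5/8, 3/8, 7/8
print([radical_inverse(k, 2) for k in range(1, 8)])
```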

▶ Exercise. Let $\xi=(\xi_n)_{n\ge1}$ denote the $p$-adic Van der Corput sequence and let $\xi_1^{(n)}\le\cdots\le\xi_n^{(n)}$ be the reordering of the first $n$ terms of $\xi$.

(a) Show that, for every $k\in\{1,\ldots,n\}$,
\[
\xi_k^{(n)}\le\frac k{n+1}.
\]
[Hint: use an induction on $q$ where $n=qp+r$, $0\le r\le p-1$.]

(b) Derive that, for every $n\ge1$,
\[
\frac{\xi_1+\cdots+\xi_n}n\le\frac12.
\]

(c) One considers the $p$-adic Van der Corput sequence $(\tilde\xi_n)_{n\ge1}$ starting at $0$, i.e.
\[
\tilde\xi_1=0,\qquad \tilde\xi_n=\xi_{n-1},\ n\ge2,
\]
where $(\xi_n)_{n\ge1}$ is the regular $p$-adic Van der Corput sequence. Show that $\tilde\xi_k^{(n+1)}\le\frac{k-1}{n+1}$, $k=1,\ldots,n+1$. Deduce that the $L^1$-discrepancy of the sequence $\tilde\xi$ satisfies
\[
D_n^{(1)}(\tilde\xi)=\frac12-\frac{\tilde\xi_1+\cdots+\tilde\xi_n}n.
\]

▶ The Kakutani sequences. This denotes a family of sequences first obtained as a by-product when trying to generate the Halton sequence as the orbit of an ergodic transform (see [90, 92, 121]). This extension is based on the $p$-adic addition on $[0,1]$, also known as the Kakutani adding machine, defined on regular $p$-adic expansions of reals of $[0,1]$ as the addition from the left to the right of the regular $p$-adic expansions with carrying over (the $p$-adic regular expansion of $1$ is conventionally set to $1=0.(p-1)(p-1)(p-1)\ldots$). Let $\oplus_p$ denote this addition. Thus, as an example,
\[
0.12333\ldots\ \oplus_{10}\ 0.412777\ldots\ =\ 0.535011\ldots
\]
In a more formal way, if $x,y\in[0,1]$ have $x=0.x_1x_2\cdots x_k\cdots$ and $y=0.y_1y_2\cdots y_k\cdots$ as regular $p$-adic expansions, then
\[
(x\oplus_p y)_k=\big(x_k+y_k\ \mathrm{mod}\ p\big)\mathbf1_{\{x_{k-1}+y_{k-1}\le p-1\}}+\big(1+x_k+y_k\ \mathrm{mod}\ p\big)\mathbf1_{\{x_{k-1}+y_{k-1}\ge p\}},
\]
with the convention $x_0=y_0=0$.


Then, for every $y\in[0,1]$, one defines the associated $p$-adic rotation with angle $y$ by
\[
T_{p,y}(x):=x\oplus_p y.
\]
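A direct digit-level sketch of $\oplus_p$ on truncated expansions (our illustration; carries are propagated fully from left to right, and truncation to finitely many digits can miss a carry arriving from deeper digits):

```python
def kakutani_add(x_digits, y_digits, p):
    """Left-to-right p-adic addition with rightward carry propagation
    on truncated digit expansions x = 0.x1 x2 ... and y = 0.y1 y2 ..."""
    out, carry = [], 0
    for xk, yk in zip(x_digits, y_digits):
        s = xk + yk + carry
        out.append(s % p)
        carry = s // p          # the carry moves to the NEXT (rightward) digit
    return out

# 0.12333... (+)_10 0.412777... = 0.535011...   (six digits kept)
print(kakutani_add([1, 2, 3, 3, 3, 3], [4, 1, 2, 7, 7, 7], 10))
```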

Proposition 4.4 Let $p_1,\ldots,p_d$ denote the first $d$ prime numbers, $y_1,\ldots,y_d\in(0,1)$, where $y_i$ is a $p_i$-adic rational number satisfying $y_i\ge1/p_i$, $i=1,\ldots,d$, and $x_1,\ldots,x_d\in[0,1]$. Then the sequence $(\xi_n)_{n\ge1}$ defined by
\[
\xi_n:=\big(T_{p_i,y_i}^{\,n-1}(x_i)\big)_{1\le i\le d},\quad n\ge1,
\]
has a discrepancy at the origin $D_n^*(\xi)$ satisfying the same upper bound (4.7) as the Halton sequence, see [92, 121].

Remark. Note that if $y_i=x_i=1/p_i$ ($=0.1$ in base $p_i$), $i=1,\ldots,d$, the sequence $\xi$ is simply the regular Halton sequence.
One asset of this approach is that it provides an easy recursive form for the computation of $\xi_n$, since
\[
\xi_n=\xi_{n-1}\oplus_p(y_1,\ldots,y_d)
\]
where, with a slight abuse of notation, $\oplus_p$ denotes here the componentwise pseudo-addition. Appropriate choices of the starting vector and the "angle" can significantly reduce the discrepancy, at least over a finite range (see below).

Practitioner's corner. Heuristic arguments not developed here suggest that a good choice for the "angles" $y_i$ and the starting values $x_i$ is
\[
y_i=1/p_i,\qquad x_i=1/5,\ i\ne3,4,\qquad x_i=\frac{2p_i-1-\sqrt{(p_i+2)^2+4p_i}}3,\ i=3,4.
\]
This specified Kakutani sequence is much easier to implement than the Sobol' sequences and behaves quite well up to medium dimensions $d$, say $1\le d\le20$ (see [133, 27]).

▶ The Faure sequences. These sequences were introduced in [47]. Let $p$ be the smallest prime integer not lower than $d$ (i.e. $p\ge d$). The $d$-dimensional Faure sequence is defined, for every $n\ge1$, by
\[
\xi_n=\big(\Phi_p(n-1),\,C_p(\Phi_p(n-1)),\,\ldots,\,C_p^{d-1}(\Phi_p(n-1))\big)
\]
where $\Phi_p$ still denotes the $p$-adic radical inverse function and, for every $p$-adic rational number $u$ with (regular) $p$-adic expansion $u=\sum_{k\ge0}u_kp^{-(k+1)}\in[0,1]$,
\[
C_p(u)=\sum_{k\ge0}\Big(\underbrace{\sum_{j\ge k}\binom jk u_j\ \mathrm{mod}\ p}_{\in\{0,\ldots,p-1\}}\Big)\,p^{-(k+1)}.
\]
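The construction can be sketched as follows (our illustration, assuming the digit conventions above; exact rational arithmetic):

```python
from math import comb
from fractions import Fraction

def faure(n_points, d, p):
    """First n_points terms of the d-dimensional Faure sequence in base p
    (p should be a prime >= d); exact rational output."""
    width = 1
    while p ** width < n_points:
        width += 1                       # enough digits for all indices used
    pts = []
    for n in range(n_points):            # the point xi_{n+1} uses the integer n
        digs, m = [], n
        for _ in range(width):
            m, r = divmod(m, p)
            digs.append(r)               # digits u_0, ..., u_{width-1} of n
        coords = []
        for _i in range(d):
            coords.append(sum(Fraction(u, p ** (k + 1)) for k, u in enumerate(digs)))
            # next coordinate: apply the Pascal-matrix map C_p to the digit vector
            digs = [sum(comb(j, k) * digs[j] for j in range(k, width)) % p
                    for k in range(width)]
        pts.append(tuple(coords))
    return pts

print(faure(4, 2, 2))
```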

The discrepancy at the origin of these sequences satisfies (see [47])
\[
D_n^*(\xi)\ \le\ \frac1n\Bigg(\frac1{d!}\Big(\frac{p-1}{2\log p}\Big)^d(\log n)^d+O\big((\log n)^{d-1}\big)\Bigg). \tag{4.8}
\]

It has been shown later on in [118] that they share the so-called "$P_{p,d}$" property:

Proposition 4.5 Let $m\in\mathbb N$ and let $\ell\in\mathbb N^*$. For any $r_1,\ldots,r_d\in\mathbb N$ such that $r_1+\cdots+r_d=m$ and every $x_1,\ldots,x_d\in\mathbb N$, there is exactly one term of the sequence $\xi_{\ell p^m+i}$, $i=0,\ldots,p^m-1$, lying in the hyper-cube $\displaystyle\prod_{k=1}^d\big[x_kp^{-r_k},(x_k+1)p^{-r_k}\big)$.

The prominent feature of Faure's estimate (4.8) is that the coefficient of the leading error term (in the $\log n$-scale) satisfies
\[
\lim_d\ \frac1{d!}\Big(\frac{p-1}{2\log p}\Big)^d=0,
\]
which seems to suggest that the rate is asymptotically better than $O\big(\frac{(\log n)^d}{n}\big)$.
A non-asymptotic upper bound is provided in [133] (due to Y.J. Xiao in his PhD thesis [158]): for every $n\ge1$,
\[
D_n^*(\xi)\ \le\ \frac1n\,\frac1{d!}\Big(\frac{p-1}2\Big)^d\Big(\frac{\log(2n)}{\log p}+d+1\Big)^d.
\]
Note that this bound has the same coefficient in its leading term as the asymptotic error bound obtained by Faure although, from a numerical point of view, it becomes efficient only for very large $n$: thus, if $d=p=5$ and $n=1\,000$,
\[
D_n^*(\xi)\ \le\ \frac1n\,\frac1{d!}\Big(\frac{p-1}2\Big)^d\Big(\frac{\log(2n)}{\log p}+d+1\Big)^d\ \approx\ 1.18,
\]
which is of little interest if one keeps in mind that, by construction, the discrepancy takes its values in $[0,1]$. This can be explained by the form of the "constant" term (in the $\log n$-scale) since
\[
\lim_{d\to+\infty}\ \frac{(d+1)^d}{d!}\Big(\frac{p-1}2\Big)^d=+\infty.
\]
A better bound is provided in Xiao's PhD thesis, provided $n\ge p^{d+2}/2$ (of less interest for applications as $d$ increases, since $p^{d+2}/2\ge d^{d+2}/2$). Furthermore, it still suffers from the same drawback mentioned above.

▶ The Sobol' sequences. Sobol's sequence, although a pioneering and striking contribution to sequences with low discrepancy, now appears as a specified sequence within the family of Niederreiter sequences defined below. The usually implemented sequence is a variant due to Antonov and Saleev (see [3]). For recent developments on this topic, see e.g. [156].

▶ The Niederreiter sequences. These sequences were designed as generalizations of the Faure and Sobol' sequences (see [118]).

Let $q$ be the smallest primary integer not lower than $d$ (a primary integer reads $q=p^r$ with $p$ prime). The $(0,d)$-Niederreiter sequence is defined, for every integer $n\ge1$, by
\[
\xi_n=\big(\Psi_{q,1}(n-1),\Psi_{q,2}(n-1),\ldots,\Psi_{q,d}(n-1)\big)
\]
where, $n=\sum_k a_kq^k$ denoting the $q$-adic expansion of $n$,
\[
\Psi_{q,i}(n):=\sum_j\Psi^{-1}\Big(\sum_k C^{(i)}_{(j,k)}\Psi(a_k)\Big)q^{-j},
\]
$\Psi:\{0,\ldots,q-1\}\to\mathbb F_q$ is a one-to-one correspondence between $\{0,\ldots,q-1\}$ and the finite field $\mathbb F_q$ with cardinality $q$ satisfying $\Psi(0)=0$, and
\[
C^{(i)}_{(j,k)}=\binom k{j-1}\,\Psi(i-1)^{k-j+1}.
\]

These sequences coincide with the Faure sequence when $q$ is the lowest prime number not lower than $d$. When $q=2^r$, with $2^{r-1}<d\le2^r$, the sequence coincides with the Sobol' sequence (in its original form).

The sequences of this family all share the $P_{p,d}$ property, and consequently their discrepancy satisfies an upper bound with a structure similar to that of the Faure sequence.

4.3.4 The Hammersley procedure

The Hammersley procedure is a canonical method to design a $[0,1]^d$-valued $n$-tuple from a $[0,1]^{d-1}$-valued one, with a discrepancy ruled by the $(d-1)$-dimensional one.

Proposition 4.6 Let $d\ge2$ and let $(\zeta_1,\ldots,\zeta_n)$ be a $[0,1]^{d-1}$-valued $n$-tuple. Then the $[0,1]^d$-valued $n$-tuple defined by
\[
(\xi_k)_{1\le k\le n}=\Big(\zeta_k,\frac kn\Big)_{1\le k\le n}
\]
satisfies
\[
\frac{\max_{1\le k\le n}kD_k^*(\zeta_1,\ldots,\zeta_k)}n\ \le\ D_n^*\big((\xi_k)_{1\le k\le n}\big)\ \le\ \frac{1+\max_{1\le k\le n}kD_k^*(\zeta_1,\ldots,\zeta_k)}n.
\]
Proof. It follows from the very definition of the discrepancy at the origin that
\begin{align*}
D_n^*\big((\xi_k)_{1\le k\le n}\big)&=\sup_{(x,y)\in[0,1]^{d-1}\times[0,1]}\Big|\frac1n\sum_{k=1}^n\mathbf1_{\{\zeta_k\in[[0,x]],\,\frac kn\le y\}}-\prod_{i=1}^{d-1}x^i\,y\Big|\\
&=\sup_{x\in[0,1]^{d-1}}\ \sup_{y\in[0,1]}|\cdots|\\
&=\max_{1\le k\le n}\ \sup_{x\in[0,1]^{d-1}}\Big|\frac1n\sum_{\ell=1}^k\mathbf1_{\{\zeta_\ell\in[[0,x]]\}}-\frac kn\prod_{i=1}^{d-1}x^i\Big|\vee\sup_{x\in[0,1]^{d-1}}\Big|\frac1n\sum_{\ell=1}^{k-1}\mathbf1_{\{\zeta_\ell\in[[0,x]]\}}-\frac kn\prod_{i=1}^{d-1}x^i\Big|
\end{align*}
since a function $y\mapsto\frac1n\sum_{k=1}^na_k\mathbf1_{\{\frac kn\le y\}}-b\,y$ ($a_k,b\ge0$) reaches its supremum at some $y=\frac kn$ or at its left limit "$\frac kn-$", $k\in\{1,\ldots,n\}$. Consequently,
\[
D_n^*\big((\xi_k)_{1\le k\le n}\big)=\frac1n\max_{1\le k\le n}\Bigg(kD_k^*(\zeta_1,\ldots,\zeta_k)\vee(k-1)\sup_{x\in[0,1]^{d-1}}\Big|\frac1{k-1}\sum_{\ell=1}^{k-1}\mathbf1_{\{\zeta_\ell\in[[0,x]]\}}-\frac k{k-1}\prod_{i=1}^{d-1}x^i\Big|\Bigg) \tag{4.9}
\]
\begin{align*}
&\le\ \frac1n\max_{1\le k\le n}\Big(kD_k^*(\zeta_1,\ldots,\zeta_k)\vee\big((k-1)D_{k-1}^*(\zeta_1,\ldots,\zeta_{k-1})+1\big)\Big)\\
&\le\ \frac{1+\max_{1\le k\le n}kD_k^*(\zeta_1,\ldots,\zeta_k)}n.
\end{align*}
The lower bound is obvious from (4.9). ♦

Corollary 4.2 Let $d\ge1$. There exists a real constant $C_d\in(0,\infty)$ such that, for every $n\ge1$, there exists an $n$-tuple $(\xi_1^n,\ldots,\xi_n^n)\in([0,1]^d)^n$ satisfying
\[
D_n^*(\xi_1^n,\ldots,\xi_n^n)\ \le\ C_d\,\frac{1+(\log n)^{d-1}}n.
\]

Proof. If $d=1$, a solution is given for a fixed integer $n\ge1$ by setting $\xi_k=\frac kn$, $k=1,\ldots,n$ (or $\xi_k=\frac{2k-1}{2n}$, $k=1,\ldots,n$, etc.). If $d\ge2$, it suffices to apply for every $n\ge1$ the Hammersley procedure to any $(d-1)$-dimensional sequence $\zeta=(\zeta_n)_{n\ge1}$ with low discrepancy in the sense of Definition 4.5. If so, the constant $C_d$ can be taken equal to $2c_{d-1}(\zeta)$ where $D_n^*(\zeta)\le c_{d-1}(\zeta)\big(1+(\log n)^{d-1}\big)/n$ (for every $n\ge1$). ♦
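With the $VdC(2)$ sequence as the one-dimensional input, the procedure yields the classical two-dimensional Hammersley point set; a brief sketch (our illustration, exact rationals):

```python
from fractions import Fraction

def vdc2(n):
    """n-th term of the dyadic Van der Corput sequence (exact rational)."""
    x, s = Fraction(0), Fraction(1, 2)
    while n > 0:
        n, d = divmod(n, 2)
        x += d * s
        s /= 2
    return x

def hammersley2d(n):
    """Two-dimensional Hammersley n-tuple xi_k = (zeta_k, k/n) built
    from the VdC(2) sequence, as in Proposition 4.6."""
    return [(vdc2(k), Fraction(k, n)) for k in range(1, n + 1)]

print(hammersley2d(4))
```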

The main drawback of this procedure is that if one starts from a sequence with low discrepancy (often defined recursively), one loses the "telescopic" feature of such a sequence: if one wishes, for a given function $f$ defined on $[0,1]^d$, to increase $n$ in order to improve the accuracy of the approximation, all the terms of the sum in the empirical mean have to be re-computed.

▶ Exercise. Derive the theoretical lower bound (4.6) for infinite sequences from the finite-tuple bound (4.5).

4.3.5 Pros and Cons of sequences with low discrepancy

The use of sequences with low discrepancy to compute integrals instead of the Monte Carlo method (based on pseudo-random numbers) is known as the Quasi-Monte Carlo method ($QMC$). This terminology extends to so-called good lattice points, not described here (see [118]).

The Pros. ▶ The main attractive feature of sequences with low discrepancy is undoubtedly the Koksma-Hlawka inequality combined with the rate of decay of the discrepancy, which suggests that the $QMC$ method is almost dimension free. Unfortunately, in practice, standard bounds for the discrepancy do not allow the use of this inequality to provide (100%-)"confidence intervals".
▶ Furthermore, one can show that for smoother functions on $[0,1]^d$ the integration rate can be $O(1/n)$ when the sequence $\xi$ can be obtained as the orbit $\xi_n=T^{n-1}(\xi_1)$, $n\ge1$, of a(n ergodic) transform $T:[0,1]^d\to[0,1]^d$ ($^6$) and $f$ is a coboundary, i.e. can be written
\[
f-\int_{[0,1]^d}f(u)\,du=g-g\circ T
\]

$^6$ Let $(X,\mathcal X,\mu)$ be a probability space. A mapping $T:(X,\mathcal X)\to(X,\mathcal X)$ is ergodic if

(i) $\mu\circ T^{-1}=\mu$ ($\mu$ is invariant for $T$);

(ii) $\forall\,A\in\mathcal X$, $T^{-1}(A)=A\Longrightarrow\mu(A)=0$ or $1$.

Then Birkhoff's pointwise ergodic theorem (see [85]) implies that, for every $f\in L^1(\mu)$,
\[
\mu(dx)\text{-a.s.}\qquad \frac1n\sum_{k=1}^nf(T^{k-1}(x))\longrightarrow\int f\,d\mu.
\]
The mapping $T$ is uniquely ergodic if $\mu$ is the only $T$-invariant distribution. If $X$ is a topological space, $\mathcal X=\mathrm{Bor}(X)$ and $T$ is continuous, then, for any continuous function $f:X\to\mathbb R$,
\[
\sup_{x\in X}\Big|\frac1n\sum_{k=1}^nf(T^{k-1}(x))-\int_Xf\,d\mu\Big|\longrightarrow0.
\]
In particular this shows that every orbit $(T^{n-1}(x))_{n\ge1}$ is $\mu$-distributed. When $X=[0,1]^d$ and $\mu=\lambda_d$, one retrieves the notion of uniformly distributed sequence. This provides a powerful tool to devise uniformly distributed sequences (like the Kakutani sequences).

where $g$ is a bounded Borel function (see e.g. [121]). As a matter of fact,
\[
\frac1n\sum_{k=1}^nf(\xi_k)-\int_{[0,1]^d}f(u)\,du=\frac{g(\xi_1)-g(\xi_{n+1})}n=O\Big(\frac1n\Big).
\]
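As a toy illustration of this $O(1/n)$ behaviour (our own example, not from the text): for the rotation $T(x)=\{x+\alpha\}$ with $\alpha$ irrational, the zero-mean function $f(x)=\cos(2\pi x)$ is a coboundary of $T$ with bounded $g$, and the empirical means along an orbit decay at rate $O(1/n)$:

```python
import math

alpha = (math.sqrt(5) - 1) / 2           # golden-ratio angle, irrational
n = 10_000
x, total = 0.1, 0.0
for _ in range(n):
    total += math.cos(2 * math.pi * x)   # f(x) = cos(2 pi x), zero mean on [0,1]
    x = (x + alpha) % 1.0                # orbit of the rotation T(x) = {x + alpha}
print(abs(total / n))                    # O(1/n): bounded by 1/(n sin(pi alpha))
```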

The Kakutani transforms (rotations with respect to $\oplus_p$) and the automorphisms of the torus are typical examples of ergodic transforms, the first ones being uniquely ergodic (keep in mind that the $p$-adic Van der Corput sequence is an orbit of the transform $T_{p,1/p}$). The fact that the Kakutani transforms (or, to be precise, their representations on $[0,1]^d$) are not continuous can be circumvented (see [121]), and one can take advantage of this unique ergodicity to characterize their coboundaries. Easy-to-check criteria based on the rate of decay of the Fourier coefficients $c_p(f)$, $p=(p_1,\ldots,p_d)\in\mathbb N^d$, of a function $f$ as $\|p\|:=p_1\times\cdots\times p_d$ goes to infinity have been provided (see [121, 158, 159]) for the $p$-adic Van der Corput sequences in one dimension and for other orbits of the Kakutani transforms.
Extensive numerical tests on problems involving smooth (periodic) functions on $[0,1]^d$, $d\ge2$, have been carried out, see e.g. [133, 27]. They suggest that this improvement still holds in higher dimension, at least partially. So much for the pros.
The cons. ▶ As concerns the cons, the first one is that all the available non-asymptotic bounds are very poor from a numerical point of view. We again refer to [133] for some examples which show that these bounds cannot be used to provide any kind of (deterministic) error interval. By contrast, one must always keep in mind that the regular Monte Carlo method automatically provides a confidence interval.

▶ The second drawback concerns the family of functions for which $QMC$ speeds up the convergence through the Koksma-Hlawka inequality. This family (mainly the functions with some kind of finite variation) somehow becomes smaller and smaller as the dimension $d$ increases, since the requested condition becomes more and more stringent. If one is concerned with the usual regularity of functions, like Lipschitz continuity, the following striking theorem due to Proïnov ([139]) shows that the curse of dimensionality comes into play with no possible escape.
Theorem 4.2 (Proïnov) Assume $\mathbb R^d$ is equipped with the $\ell^\infty$-norm defined by $|(x^1,\ldots,x^d)|_\infty:=\max_{1\le i\le d}|x^i|$. Let
\[
w(f,\delta):=\sup_{x,y\in[0,1]^d,\,|x-y|_\infty\le\delta}|f(x)-f(y)|,\quad\delta\in(0,1),
\]
denote the uniform continuity modulus of $f$ (with respect to the $\ell^\infty$-norm).

(a) Let $(\xi_1,\ldots,\xi_n)\in([0,1]^d)^n$. For every continuous function $f:[0,1]^d\to\mathbb R$,
\[
\Big|\int_{[0,1]^d}f(x)\,dx-\frac1n\sum_{k=1}^nf(\xi_k)\Big|\ \le\ C_d\,w\big(f,D_n^*(\xi_1,\ldots,\xi_n)^{\frac1d}\big)
\]
where $C_d\in(0,\infty)$ is an optimal real constant depending only on $d$.

In particular, if $f$ is Lipschitz continuous with coefficient $[f]_{\mathrm{Lip}}:=\sup_{x\ne y\in[0,1]^d}\frac{|f(x)-f(y)|}{|x-y|_\infty}$, then
\[
\Big|\int_{[0,1]^d}f(x)\,dx-\frac1n\sum_{k=1}^nf(\xi_k)\Big|\ \le\ C_d\,[f]_{\mathrm{Lip}}\,D_n^*(\xi_1,\ldots,\xi_n)^{\frac1d}.
\]
(b) If $d=1$, $C_d=1$ and if $d\ge2$, $C_d\in[1,4]$.

▶ Exercise. Show, using the Van der Corput sequence starting at $0$ (defined in Section 4.3.3, see the exercise in the paragraph devoted to the Van der Corput and Halton sequences) and the function $f(x)=x$ on $[0,1]$, that the above Proïnov inequality cannot be improved for Lipschitz continuous functions, even in one dimension. [Hint: reformulate some results of the exercise in Section 4.3.3.]

▶ A third drawback of $QMC$ for practical numerical integration is that all functions need to be defined on the unit hypercube. One way to get partially rid of this constraint may be to consider integration over domains $C\subset[0,1]^d$ having a regular boundary in the Jordan sense ($^7$). Then a Koksma-Hlawka-like inequality holds true:
\[
\Big|\int_Cf(x)\,\lambda_d(dx)-\frac1n\sum_{k=1}^n\mathbf1_{\{\xi_k\in C\}}f(\xi_k)\Big|\ \le\ V(f)\,D_n^\infty(\xi)^{\frac1d}
\]
where $V(f)$ denotes the variation of $f$ (in the Hardy & Krause or measure sense) and $D_n^\infty(\xi)$ denotes the extreme discrepancy of $(\xi_1,\ldots,\xi_n)$ (see again [118]). The simple fact of integrating over such a set annihilates the low discrepancy effect (at least from a theoretical point of view).

▶ Exercise. Prove Proïnov's theorem when $d=1$. [Hint: read the next chapter and compare discrepancy and quantization error.]

This suggests that the rate of numerical integration of Lipschitz continuous functions in $d$ dimensions by a sequence with low discrepancy is $O\Big(\frac{\log n}{n^{1/d}}\Big)$ as $n\to\infty$ (and $O\Big(\frac{(\log n)^{\frac{d-1}d}}{n^{1/d}}\Big)$ when considering an $n$-tuple, if one relies on the Hammersley method). This emphasizes that sequences with low discrepancy are not spared by the curse of dimensionality when implemented on functions with standard regularity...

▶ WARNING: the dimensional trap! A fourth drawback, undoubtedly the most dangerous for beginners, is of course that a given (one-dimensional) sequence $(\xi_n)_{n\ge1}$ does not "simulate" independence, as emphasized by the classical exercise below.

▶ Exercise. Let $\xi=(\xi_n)_{n\ge1}$ denote the dyadic Van der Corput sequence. Show that, for every $n\ge0$,
\[
\xi_{2n+1}=\xi_{2n}+\frac12\qquad\text{and}\qquad\xi_{2n}=\frac{\xi_n}2,
\]
with the convention $\xi_0=0$. Deduce that
\[
\lim_n\ \frac1n\sum_{k=1}^n\xi_{2k}\xi_{2k+1}=\frac5{24}.
\]
Compare with $\mathbb E(UV)$, where $U$, $V$ are independent with uniform distribution over $[0,1]$. Conclude.
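A quick numerical check of this limit (our illustration; the helper name is ours):

```python
def vdc2(n):
    """n-th term of the dyadic Van der Corput sequence (float version)."""
    x, s = 0.0, 0.5
    while n > 0:
        n, d = divmod(n, 2)
        x += d * s
        s /= 2
    return x

n = 50_000
avg = sum(vdc2(2 * k) * vdc2(2 * k + 1) for k in range(1, n + 1)) / n
print(avg)   # close to 5/24 = 0.20833..., not to E(UV) = 1/4
```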

In fact this phenomenon is typical of the price to be paid for "filling up the gaps" faster than random numbers do. This is the reason why it is absolutely mandatory to use $d$-dimensional sequences with low discrepancy to perform $QMC$ computations related to a random vector $X$ of the form $X=\Psi(U_d)$, $U_d\sim U([0,1]^d)$ ($d$ is sometimes called the structural dimension of the simulation). The $d$ components of these $d$-dimensional sequences do simulate independence.

This has important consequences on very standard simulation methods, as illustrated below.

$^7$ Namely that, for every $\varepsilon>0$, $\lambda_d\big(\{u\in[0,1]^d\mid\mathrm{dist}(u,\partial C)<\varepsilon\}\big)\le\kappa_C\,\varepsilon$.

Application to the "$QMC$ Box-Müller method": to adapt the Box-Müller method of simulation of a normal distribution $\mathcal N(0;1)$ introduced in Corollary 1.3, we proceed as follows: let $\xi=(\xi_n^1,\xi_n^2)_{n\ge1}$ be a u.d. sequence over $[0,1]^2$ (in practice chosen with low discrepancy). We set, for every $n\ge1$,
\[
\zeta_n=(\zeta_n^1,\zeta_n^2):=\Big(\sqrt{-2\log(\xi_n^1)}\,\sin\big(2\pi\xi_n^2\big),\ \sqrt{-2\log(\xi_n^1)}\,\cos\big(2\pi\xi_n^2\big)\Big),\quad n\ge1.
\]

Then, for every bounded continuous function $f_1:\mathbb R\to\mathbb R$,
\[
\lim_n\ \frac1n\sum_{k=1}^nf_1(\zeta_k^1)=\mathbb E\,f_1(Z_1),\quad Z_1\overset d=\mathcal N(0;1),
\]
and, for every bounded continuous function $f_2:\mathbb R^2\to\mathbb R$,
\[
\lim_n\ \frac1n\sum_{k=1}^nf_2(\zeta_k)=\mathbb E\,f_2(Z),\quad Z\overset d=\mathcal N(0;I_2).
\]

These continuity assumptions on $f_1$ and $f_2$ can be relaxed, e.g. for $f_2$, into: the function $\tilde f_2$ defined on $(0,1]^2$ by
\[
(\xi^1,\xi^2)\longmapsto\tilde f_2(\xi^1,\xi^2):=f_2\Big(\sqrt{-2\log(\xi^1)}\,\sin\big(2\pi\xi^2\big),\ \sqrt{-2\log(\xi^1)}\,\cos\big(2\pi\xi^2\big)\Big)
\]
is Riemann integrable. Likewise, the Koksma-Hlawka inequality applies provided $\tilde f_2$ has finite variation. The same holds for $f_1$ and admits a straightforward extension to functionals $f(Z)$, $Z\overset d=\mathcal N(0;I_d)$. Establishing that these functions do have finite variation is, in practice, usually quite demanding (if true at all).
The extension of the multivariate Box-Müller method (1.1) should be performed following the same rule of the structural dimension: it requires a u.d. sequence over $[0,1]^d$.

In particular, we will see further on in Chapter 7 that simulating the Euler scheme with step $\frac Tm$ of a $d$-dimensional diffusion over $[0,T]$ with an underlying $q$-dimensional Brownian motion consumes $m$ independent $\mathcal N(0;I_q)$-distributed random vectors, i.e. $m\times q$ independent $\mathcal N(0;1)$ random variables. To perform a $QMC$ simulation of a function of this Euler scheme at time $T$, we consequently should consider a sequence with low discrepancy over $[0,1]^{mq}$. Existing error bounds on sequences with low discrepancy and the sparsity of functions with finite variation make essentially meaningless any use of Koksma-Hlawka's inequality or Proïnov's theorem to produce error bounds, not to mention that, in the latter case, the curse of dimensionality leads to extremely poor theoretical bounds for Lipschitz functions (like $\tilde f_2$ in two dimensions).

▶ Exercise. (a) Implement a $QMC$-adapted Box-Müller simulation method for $\mathcal N(0;I_2)$ (based on the sequence with low discrepancy of your choice) and organize a race $MC$ vs $QMC$ to compute various calls, say $\mathrm{Call}_{BS}(K,T)$ ($T=1$, $K\in\{95,96,\ldots,104,105\}$), in a Black-Scholes model (with $r=2\%$, $\sigma=30\%$, $x=100$, $T=1$). To simulate this underlying Black-Scholes risky asset, first use the closed expression
\[
X_t^x=x\,e^{(r-\frac{\sigma^2}2)t+\sigma\sqrt t\,Z},\qquad Z\overset d=\mathcal N(0;1).
\]

(b) Anticipating on Chapter 7, implement the Euler scheme (7.3) of the Black-Scholes dynamics
\[
dX_t^x=X_t^x\,(r\,dt+\sigma\,dW_t).
\]
Consider steps of the form $\frac Tm$ with $m=10,20,50,100$. What conclusions can be drawn?
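A minimal sketch of the $QMC$ leg of such a race (our own illustration: Halton(2,3) pairs feed a Box-Müller transform, the closed-form dynamics is used, and the reference value is the Black-Scholes formula computed via `math.erf`; all names are ours):

```python
import math

def radical_inverse(n, p):
    x, s = 0.0, 1.0 / p
    while n > 0:
        n, d = divmod(n, p)
        x += d * s
        s /= p
    return x

def bs_call(x, K, r, sigma, T):
    """Closed-form Black-Scholes call price."""
    d1 = (math.log(x / K) + (r + sigma ** 2 / 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    N = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))
    return x * N(d1) - K * math.exp(-r * T) * N(d2)

x0, K, r, sigma, T, n = 100.0, 100.0, 0.02, 0.30, 1.0, 50_000
acc = 0.0
for k in range(1, n + 1):
    u1, u2 = radical_inverse(k, 2), radical_inverse(k, 3)            # Halton(2,3) pair
    z = math.sqrt(-2.0 * math.log(u1)) * math.cos(2 * math.pi * u2)  # QMC Box-Muller
    s_T = x0 * math.exp((r - sigma ** 2 / 2) * T + sigma * math.sqrt(T) * z)
    acc += math.exp(-r * T) * max(s_T - K, 0.0)
qmc_price = acc / n
print(qmc_price, bs_call(x0, K, r, sigma, T))   # both close to 12.8
```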
▶ However, in practice, even the statistical independence between the coordinates of a $d$-dimensional sequence with low discrepancy only holds asymptotically: thus, in high dimension, for small values of $n$, the coordinates of the Halton sequence remain highly correlated. As a matter of fact, the $i$-th component of the canonical $d$-dimensional Halton sequence (i.e. designed from the $d$ first prime numbers $p_1,\ldots,p_d$) starts as follows:
\[
\xi_n^i=\frac n{p_i},\ n\in\{1,\ldots,p_i-1\},\qquad \xi_n^i=\frac1{p_i^2}+\frac{n-p_i}{p_i},\ n\in\{p_i,\ldots,2p_i-1\},
\]
so it is clear that the $i$-th and $(i+1)$-th components will remain highly correlated if $i$ is close to $d$ and $d$ is large: $p_{80}=503$ and $p_{81}=509$, etc.

To overcome this correlation observed for (not so) small values of $n$, the usual method is to discard the first values of the sequence.
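This initial correlation is easy to exhibit (our illustration; $p_8=19$ and $p_9=23$ are the 8th and 9th primes):

```python
from fractions import Fraction

def radical_inverse(n, p):
    x, s = Fraction(0), Fraction(1, p)
    while n > 0:
        n, d = divmod(n, p)
        x += d * s
        s /= p
    return x

# 8th and 9th Halton components over the first 18 indices: n/19 vs n/23,
# i.e. two exactly proportional (hence perfectly correlated) segments.
xs = [radical_inverse(n, 19) for n in range(1, 19)]
ys = [radical_inverse(n, 23) for n in range(1, 19)]
print(all(19 * x == 23 * y for x, y in zip(xs, ys)))  # True
```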
As a conclusion to this section, let us emphasize graphically, in terms of texture, the differences between $MC$ and $QMC$, i.e. between (pseudo-)randomly generated points (say $60\,000$) and the same number of terms of the Halton sequence (with $p_1=2$ and $p_2=3$).

[Figures temporarily not reproduced]

4.4 Randomized QMC

The idea is to introduce some randomness in the $QMC$ method in order to produce a confidence interval or, from a dual viewpoint, to use the $QMC$ approach as a variance reducer in the Monte Carlo method.

Let $(\xi_n)_{n\ge1}$ be a u.d. sequence over $[0,1]^d$.

Proposition 4.7 (a) Let $a:=(a^1,\ldots,a^d)\in\mathbb R^d$; the sequence $(\{a+\xi_n\})_{n\ge1}$ is u.d., where $\{x\}$ denotes the componentwise fractional part of $x=(x^1,\ldots,x^d)\in\mathbb R^d$ (defined as $\{x\}=(\{x^1\},\ldots,\{x^d\})$).

(b) Let $U$ be a uniformly distributed random variable on $[0,1]^d$. Then, for every $a\in\mathbb R^d$,
\[
\{a+U\}\overset d=U.
\]

Figure 4.1: Left: $6\cdot10^4$ randomly generated points; Right: $6\cdot10^4$ terms of the Halton(2,3)-sequence.

Proof. (a) This follows from the (static) Weyl criterion: for every $p\in\mathbb N^d\setminus\{0\}$,
\[
\frac1n\sum_{k=1}^ne^{2i\pi(p|\{a+\xi_k\})}=\frac1n\sum_{k=1}^ne^{2i\pi(p^1(a^1+\xi_k^1)+\cdots+p^d(a^d+\xi_k^d))}=e^{2i\pi(p|a)}\,\frac1n\sum_{k=1}^ne^{2i\pi(p|\xi_k)}\longrightarrow0\quad\text{as }n\to+\infty.
\]
(b) One easily checks that both random variables have the same characteristic function: the random variable $\{a+U\}$ (supported by $[0,1]^d$) has the same Fourier coefficients as $U$, hence the same distribution. ♦

Consequently, if $U$ is a uniformly distributed random variable on $[0,1]^d$ and $f:[0,1]^d\to\mathbb R$ is a Riemann integrable function, the random variable
\[
\chi=\chi(f,U):=\frac1n\sum_{k=1}^nf(\{U+\xi_k\})
\]
satisfies
\[
\mathbb E\,\chi=\frac1n\times n\,\mathbb E f(U)=\mathbb E f(U)
\]

owing to claim (b). Then, starting from an $M$-sample $U_1,\ldots,U_M$, one defines the Monte Carlo estimator of size $M$ attached to $\chi$, namely
\[
\widehat I(f,\xi)_{n,M}:=\frac1M\sum_{m=1}^M\chi(f,U_m)=\frac1{nM}\sum_{m=1}^M\sum_{k=1}^nf(\{U_m+\xi_k\}).
\]
This estimator has a complexity approximately equal to $\kappa\times nM$, where $\kappa$ is the unitary complexity induced by the computation of one value of the function $f$. By what precedes and the Strong Law of Large Numbers, it satisfies
\[
\widehat I(f,\xi)_{n,M}\overset{a.s.}\longrightarrow\mathbb E f(U)
\]
as well as a CLT
\[
\sqrt M\,\big(\widehat I(f,\xi)_{n,M}-\mathbb E f(U)\big)\overset{\mathcal L}\longrightarrow\mathcal N\big(0;\sigma_n^2(f,\xi)\big)\quad\text{with}\quad \sigma_n^2(f,\xi)=\mathrm{Var}\Big(\frac1n\sum_{k=1}^nf(\{U+\xi_k\})\Big).
\]

Hence, the specific rate of convergence of the $QMC$ is irremediably lost. So, this hybrid method should be compared to a regular Monte Carlo simulation of size $nM$ through their respective variances. It is clear that we will observe a variance reduction if and only if
\[
\frac{\sigma_n^2(f,\xi)}M<\frac{\mathrm{Var}(f(U))}{nM},
\]
i.e.
\[
\mathrm{Var}\Big(\frac1n\sum_{k=1}^nf(\{U+\xi_k\})\Big)\le\frac{\mathrm{Var}(f(U))}n.
\]
The only natural upper bound for the left-hand side of this inequality is
\[
\sigma_n^2(f,\xi)=\int_{[0,1]^d}\Big(\frac1n\sum_{k=1}^nf(\{u+\xi_k\})-\int_{[0,1]^d}f\,d\lambda_d\Big)^2du\ \le\ \sup_{u\in[0,1]^d}\Big|\frac1n\sum_{k=1}^nf(\{u+\xi_k\})-\int_{[0,1]^d}f\,d\lambda_d\Big|^2.
\]
One can show that $f_u:v\mapsto f(\{u+v\})$ has finite variation as soon as $f$ has, with $\sup_{u\in[0,1]^d}V(f_u)<+\infty$ (more precise results can be established), so that
\[
\sigma_n^2(f,\xi)\ \le\ \sup_{u\in[0,1]^d}V(f_u)^2\,D_n^*(\xi_1,\ldots,\xi_n)^2.
\]
Hence, if $\xi=(\xi_n)_{n\ge1}$ is a sequence with low discrepancy (Halton, Kakutani, Sobol', etc.),
\[
\sigma_n^2(f,\xi)\ \le\ C_\xi^2\,\frac{(\log n)^{2d}}{n^2},\quad n\ge1.
\]
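A sketch of the shifted estimator and of its Monte Carlo averaging (our illustration, with the toy integrand $f(u)=u$ on $[0,1]$, whose integral is $1/2$):

```python
import random

def vdc2(n):
    """n-th term of the dyadic Van der Corput sequence."""
    x, s = 0.0, 0.5
    while n > 0:
        n, d = divmod(n, 2)
        x += d * s
        s /= 2
    return x

random.seed(1)
n, M = 256, 100
xi = [vdc2(k) for k in range(1, n + 1)]
f = lambda u: u                       # toy integrand with integral 1/2

chis = []
for _ in range(M):
    U = random.random()               # one uniform shift per replication
    chis.append(sum(f((U + x) % 1.0) for x in xi) / n)
estimate = sum(chis) / M
print(estimate)                       # unbiased estimator of E f(U) = 1/2
```

Each replication $\chi(f,U_m)$ is already accurate up to the discrepancy of the shifted sequence, which drives the variance above.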

Consequently, in that case, it is clear that randomized $QMC$ provides a very significant variance reduction (for the same complexity), of a magnitude proportional to $\frac{(\log n)^{2d}}n$ (with an impact of magnitude $\frac{(\log n)^d}{\sqrt n}$ on the confidence interval). But one must keep in mind once again that such functions become dramatically sparse as $d$ increases.

In fact, even better bounds can be obtained for some classes of functions whose Fourier coefficients $c_p(f)$, $p=(p_1,\ldots,p_d)\in\mathbb N^d$, satisfy some decay rate assumptions as $\|p\|:=p_1\times\cdots\times p_d$ goes to infinity, since in that case one has $\sigma_n^2(f,\xi)\le\frac{C_{f,\xi}^2}{n^2}$, so that the gain in terms of variance becomes proportional to $\frac1n$ for such functions (a global budget/complexity being prescribed for the simulation).
By contrast, if we consider the Lipschitz setting, things go radically differently: assume that $f:[0,1]^d\to\mathbb R$ is Lipschitz continuous and isotropically periodic, i.e. for every $x\in[0,1]^d$ and every vector $e_i=(\delta_{ij})_{1\le j\le d}$, $i=1,\ldots,d$, of the canonical basis of $\mathbb R^d$ ($\delta_{ij}$ stands for the Kronecker symbol), $f(x+e_i)=f(x)$ as soon as $x+e_i\in[0,1]^d$. Then $f$ can be extended to the whole of $\mathbb R^d$ as a Lipschitz continuous function with the same Lipschitz coefficient, say $[f]_{\mathrm{Lip}}$; furthermore, it satisfies $f(x)=f(\{x\})$ for every $x\in\mathbb R^d$. It then follows from Proïnov's Theorem 4.2 that
\[
\sup_{u\in[0,1]^d}\Big|\frac1n\sum_{k=1}^nf(\{u+\xi_k\})-\int_{[0,1]^d}f\,d\lambda_d\Big|^2\ \le\ C_d^2\,[f]_{\mathrm{Lip}}^2\,D_n^*(\xi_1,\ldots,\xi_n)^{\frac2d}\ \le\ C_d^2\,[f]_{\mathrm{Lip}}^2\,C_\xi^{\frac2d}\,\frac{(\log n)^2}{n^{\frac2d}}
\]
(where $C_d$ is Proïnov's constant). This time, still for a prescribed budget, the "gain" factor in terms of variance is proportional to $n^{1-\frac2d}(\log n)^2$, which is not a gain... but a loss as soon as $d\ge2$!

For more results and details, we refer to the survey [153] on randomized $QMC$ and the references therein.

Finally, randomized $QMC$ is a specific (and not so easy to handle) variance reduction method, not a $QMC$ speeding-up method, which suffers from one drawback shared by all $QMC$-based simulation methods: the sparsity of the class of functions with finite variation and the difficulty to identify them in practice when $d>1$.

4.5 QMC in unbounded dimension: the acceptance-rejection method

If one looks at the remark "The user's viewpoint" in Section 1.4 devoted to Von Neumann's acceptance-rejection method, it is a simple exercise to check that one can replace pseudo-random numbers by a u.d. sequence in the procedure (almost) mutatis mutandis, except for some more stringent regularity assumptions.

We adopt the notations of that section and assume that the $\mathbb R^d$-valued random vectors $X$ and $Y$ have absolutely continuous distributions with respect to a reference $\sigma$-finite measure $\mu$ on $(\mathbb R^d,\mathrm{Bor}(\mathbb R^d))$. Assume that $\mathbb P_X$ has a density proportional to $f$, that $\mathbb P_Y=g.\mu$, and that $f$ and $g$ satisfy
\[
f\le c\,g\ \ \mu\text{-a.e.}\qquad\text{and}\qquad g>0\ \ \mu\text{-a.e.},
\]
where $c$ is a positive real constant.

Furthermore, we make the assumption that $Y$ can be simulated at a reasonable cost, like in the original acceptance-rejection method, i.e. that
\[
Y=\Psi(U),\qquad U\sim U([0,1]^r)
\]
for some $r\in\mathbb N^*$, where $\Psi:[0,1]^r\to\mathbb R^d$.


Additional “QM C assumptions”:
 The first additional assumption in this QM C framework is that we ask Ψ to be a Riemann
integrable function (i.e. Borel, bounded and λr -a.s. continuous).
 We also assume that the function

I : (u1 , u2 ) 7→ 1{c u1 g(Ψ(u2 ))≤f (Ψ(u2 ))} is λr+1 -a.s. continuous on [0, 1]r+1 (4.10)

(which also amounts to Riemann integrability since I is bounded).


 Our aim is to compute E ϕ(X), where ϕ ∈ L1 (PX ). Since we will use ϕ(Y ) = ϕ ◦ Ψ(Y ) to
perform this integration (see below), we also ask ϕ to be such that

ϕ ◦ Ψ is Riemann integrable.

This is classically holds true if ϕ is continuous (see e.g. [30], Chapter 3).

Let $\xi=(\xi_n^1,\xi_n^2)_{n\ge1}$ be a $[0,1]\times[0,1]^r$-valued sequence, assumed to be with low discrepancy (or simply uniformly distributed) over $[0,1]^{r+1}$. Hence, $(\xi_n^1)_{n\ge1}$ and $(\xi_n^2)_{n\ge1}$ are in particular uniformly distributed over $[0,1]$ and $[0,1]^r$ respectively.

If $(U,V)\sim U([0,1]\times[0,1]^r)$, then $(U,\Psi(V))\sim(U,Y)$. Consequently, the product of two Riemann integrable functions being Riemann integrable,
\[
\frac{\sum_{k=1}^n\mathbf1_{\{c\,\xi_k^1g(\Psi(\xi_k^2))\le f(\Psi(\xi_k^2))\}}\,\varphi(\Psi(\xi_k^2))}{\sum_{k=1}^n\mathbf1_{\{c\,\xi_k^1g(\Psi(\xi_k^2))\le f(\Psi(\xi_k^2))\}}}\ \overset{n\to+\infty}\longrightarrow\ \frac{\mathbb E\big(\mathbf1_{\{c\,Ug(Y)\le f(Y)\}}\varphi(Y)\big)}{\mathbb P\big(c\,Ug(Y)\le f(Y)\big)}=\frac{\int_{\mathbb R^d}\varphi f\,d\mu}{\int_{\mathbb R^d}f\,d\mu}=\mathbb E\,\varphi(X), \tag{4.11}
\]
where the last two equalities follow from the computations carried out in Section 1.4.
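A toy sketch of this ratio estimator (our own example: target density $f(x)=6x(1-x)$ on $[0,1]$, uniform proposal $g\equiv1$ with $\Psi=\mathrm{id}$, $r=1$, domination constant $c=3/2$, and $\varphi(x)=x^2$, for which $\mathbb E\,\varphi(X)=6(\frac14-\frac15)=0.3$):

```python
def radical_inverse(n, p):
    x, s = 0.0, 1.0 / p
    while n > 0:
        n, d = divmod(n, p)
        x += d * s
        s /= p
    return x

f = lambda x: 6.0 * x * (1.0 - x)      # Beta(2,2) density on [0,1]
phi = lambda x: x * x
c, n = 1.5, 20_000
num = den = 0.0
for k in range(1, n + 1):
    u1, u2 = radical_inverse(k, 2), radical_inverse(k, 3)  # (xi_k^1, xi_k^2), Halton(2,3)
    if c * u1 * 1.0 <= f(u2):          # acceptance test: c xi^1 g(Psi(xi^2)) <= f(Psi(xi^2))
        num += phi(u2)
        den += 1.0
print(num / den)    # close to E phi(X) = 0.3
```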
The main obstacle to applying the method in a $QMC$ framework is the a.s. continuity assumption (4.10). The following proposition yields an easy and natural criterion.

Proposition 4.8 If the function $\frac fg\circ\Psi$ is $\lambda_r$-a.s. continuous on $[0,1]^r$, then Assumption (4.10) is satisfied.

Proof. First we note that
\[
\mathrm{Disc}(I)\subset\Big([0,1]\times\mathrm{Disc}\Big(\frac fg\circ\Psi\Big)\Big)\cup\Big\{(\xi^1,\xi^2)\in[0,1]^{r+1}\ \text{s.t.}\ c\,\xi^1=\frac fg\circ\Psi(\xi^2)\Big\},
\]

where $I$ denotes the function defined in (4.10). Now, it is clear that
\[
\lambda_{r+1}\Big([0,1]\times\mathrm{Disc}\Big(\frac fg\circ\Psi\Big)\Big)=\lambda_1([0,1])\times\lambda_r\Big(\mathrm{Disc}\Big(\frac fg\circ\Psi\Big)\Big)=1\times0=0,
\]
owing to the $\lambda_r$-a.s. continuity of $\frac fg\circ\Psi$. Consequently,
\[
\lambda_{r+1}\big(\mathrm{Disc}(I)\big)\ \le\ \lambda_{r+1}\Big(\Big\{(\xi^1,\xi^2)\in[0,1]^{r+1}\ \text{s.t.}\ c\,\xi^1=\frac fg\circ\Psi(\xi^2)\Big\}\Big).
\]
In turn, this subset of $[0,1]^{r+1}$ is negligible for the Lebesgue measure $\lambda_{r+1}$ since, coming back to the independent random variables $U$ and $Y$ and having in mind that $g(Y)>0$ $\mathbb P$-a.s.,
\[
\lambda_{r+1}\Big(\Big\{(\xi^1,\xi^2)\in[0,1]^{r+1}\ \text{s.t.}\ c\,\xi^1=\frac fg\circ\Psi(\xi^2)\Big\}\Big)=\mathbb P\big(c\,Ug(Y)=f(Y)\big)=\mathbb P\Big(U=\frac f{cg}(Y)\Big)=0,
\]
where we used (see the exercise below) that $U$ and $Y$ are independent by construction and that $U$ has a diffuse distribution (no atom). ♦
Remark. The criterion of the proposition is trivially satisfied when f/g and Ψ are continuous on R^d and [0,1]^r respectively.

 Exercise. Show that if X and Y are independent and X or Y has no atom then

P(X = Y ) = 0.

To conclude this section, note that we provide no rate of convergence for this acceptance-rejection method by quasi-Monte Carlo. In fact, no such error bound is available under realistic assumptions on f, g, ϕ and Ψ; only empirical evidence can justify its use in practice.
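To fix ideas, here is a minimal Python sketch of the ratio estimator (4.11) (our own illustration, not from the text): we take r = 1, Ψ = identity and g ≡ 1 on [0,1], so that Y = ξ² itself, with the toy target density f(x) = 2x (hence c = 2) and ϕ(x) = x; the uniform driver is a two-dimensional Halton sequence (bases 2 and 3). All names and numerical choices are ours.

```python
def van_der_corput(n, base):
    """Radical inverse of the integer n in the given base (one Halton coordinate)."""
    q, bk = 0.0, 1.0 / base
    while n > 0:
        n, r = divmod(n, base)
        q += r * bk
        bk /= base
    return q

def qmc_acceptance_rejection(f, c, phi, n):
    """Ratio estimator of E phi(X), X with density f on [0,1], dominated by c*g
    with g = 1 (so Psi = identity and Y = xi2), driven by a 2-d Halton sequence."""
    num = den = 0.0
    for k in range(1, n + 1):
        xi1 = van_der_corput(k, 2)      # plays the role of U
        xi2 = van_der_corput(k, 3)      # plays the role of Y = Psi(V)
        if c * xi1 <= f(xi2):           # acceptance test  c * U * g(Y) <= f(Y)
            num += phi(xi2)
            den += 1.0
    return num / den

# Toy target: f(x) = 2x on [0,1] (c = 2), phi(x) = x, so E phi(X) = 2/3.
est = qmc_acceptance_rejection(lambda x: 2.0 * x, 2.0, lambda x: x, 20000)
```

As predicted by (4.11), the ratio tends to E ϕ(X) = ∫₀¹ x·2x dx = 2/3 as n grows.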

4.6 Quasi-Stochastic Approximation


It is natural to try replacing regular pseudo-random numbers by quasi-random numbers in other procedures where the former are commonly implemented. Such is the case of Stochastic Approximation, which can be seen as the stochastic counterpart of recursive zero search or optimization procedures like the Newton-Raphson algorithm. These aspects of QMC will be investigated in Chapter 6.
Chapter 5

Optimal Quantization methods (cubatures) I

Optimal Vector Quantization is a method coming from Signal Processing, devised to approximate a continuous signal by a discrete one in an optimal way. Originally developed in the 1950's (see [55]), it was introduced as a quadrature formula for numerical integration in the early 1990's (see [122]), and for conditional expectation approximation in the early 2000's, in order to price multi-asset American style options [12, 13, 11]. In this brief chapter, we focus on the cubature formulas for numerical integration with respect to the distribution of a random vector X taking values in R^d.
In view of applications, we only deal with the canonical Euclidean quadratic optimal quantization in R^d, although the general theory of optimal vector quantization can be developed in a much more general framework, both in finite dimension (for general norms on R^d and general probability spaces (Ω, A, P)) and in infinite dimension (so-called functional quantization, see [106, 131]).
We recall that the canonical Euclidean norm on the vector space Rd is denoted | . |.

5.1 Theoretical background on vector quantization


Let X be an Rd -valued random vector defined on a probability space (Ω, F, P). The purpose of
vector quantization is to study the best approximation of X by random vectors taking at most N
fixed values x1 , . . . , xN ∈ Rd .

Definition 5.1.1 Let x = (x_1, ..., x_N) ∈ (R^d)^N. A Borel partition (C_i(x))_{i=1,...,N} of R^d is a Voronoi partition of the N-quantizer x (or codebook; the term grid being used for the set of values {x_1, ..., x_N}) if, for every i ∈ {1, ..., N},

    C_i(x) ⊂ { ξ ∈ R^d : |ξ − x_i| ≤ min_{j≠i} |ξ − x_j| }.

The Borel sets C_i(x) are called Voronoi cells of the partition induced by x (note that, if the components of x are not pairwise distinct, then some of the cells C_i(x) are empty).

Remark. In the above definition, | . | denotes the canonical Euclidean norm. However (see [66]), many results in what follows can be established for any norm on R^d, except e.g. some differentiability results on the quadratic distortion function (see Proposition 6.3.1 in Section 6.3.6).


Let {x_1, ..., x_N} be the set of values of the N-tuple x (whose cardinality is at most N). The nearest neighbour projection Proj_x : R^d → {x_1, ..., x_N} induced by a Voronoi partition is defined by

    Proj_x(ξ) := Σ_{i=1}^N x_i 1_{C_i(x)}(ξ),   ξ ∈ R^d.

Then, we define an x-quantization of X by

    X̂^x = Proj_x(X) = Σ_{i=1}^N x_i 1_{{X ∈ C_i(x)}}.    (5.1)

The pointwise error induced when replacing X by X̂^x is given by

    |X − X̂^x| = dist(X, {x_1, ..., x_N}) = min_{1≤i≤N} |X − x_i|.

When X has a strongly continuous distribution, i.e. P(X ∈ H) = 0 for any hyperplane H of R^d, any two x-quantizations of X are P-a.s. equal.

Definition 5.1.2 The mean quadratic quantization error induced by the N-tuple x ∈ (R^d)^N is defined as the quadratic norm of the pointwise error, i.e.

    ||X − X̂^x||_2 = ( E min_{1≤i≤N} |X − x_i|² )^{1/2} = ( ∫_{R^d} min_{1≤i≤N} |ξ − x_i|² P_X(dξ) )^{1/2}.
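In code, the nearest neighbour projection (5.1) and the pointwise error are immediate. A minimal Python sketch (our own illustration; the 4-point grid below is arbitrary and not claimed to be optimal):

```python
import math

def proj(xi, grid):
    """Nearest neighbour projection Proj_x(xi) onto the grid {x_1, ..., x_N} of R^2
    (ties broken by the first index, i.e. one particular Voronoi partition)."""
    return min(grid, key=lambda g: math.dist(xi, g))

# A 4-point quantizer of R^2 (illustrative values only).
grid = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]

xi = (0.3, 0.8)
xhat = proj(xi, grid)          # the x-quantization of the point xi, as in (5.1)
err = math.dist(xi, xhat)      # pointwise error = dist(xi, {x_1, ..., x_N})
```

Averaging err² over samples of X (and taking a square root) yields a Monte Carlo estimate of the mean quadratic quantization error.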

We briefly recall some classical facts about theoretical and numerical aspects of Optimal Quan-
tization. For further details, we refer e.g. to [66, 129, 130, 124, 131].

Theorem 5.1.1 [66, 122] Let X ∈ L²_{R^d}(P). The quadratic distortion function at level N (the squared mean quadratic quantization error, viewed as a function on (R^d)^N)

    x = (x_1, ..., x_N) ↦ E( min_{1≤i≤N} |X − x_i|² ) = ||X − X̂^x||₂²    (5.2)

reaches a minimum at at least one quantizer x* ∈ (R^d)^N.
Furthermore, if the distribution P_X has an infinite support, then x*^{(N)} = (x*_1, ..., x*_N) has pairwise distinct components and the sequence N ↦ min_{x∈(R^d)^N} ||X − X̂^x||_2 decreases to 0 as N → +∞.

(To do: add the proof.)
This leads naturally to the following definition.

Definition 5.1.3 Any N -tuple solution to the above distortion minimization problem is called an
optimal N -quantizer or an optimal quantizer at level N .

Remark. • When N = 1, the optimal 1-quantizer is always unique, equal to E X.
• Optimal quantizers are never unique when N ≥ 2, at least because the distortion function is symmetric, so that any permutation of an optimal quantizer is still optimal. The question of geometric uniqueness is much more involved. However, in one dimension, when the distribution of X is unimodal, i.e. has a log-concave density, then there is a unique optimal quantizer (up to a permutation).

Proposition 5.1.1 [122, 124] Any L²-optimal N-quantizer x* ∈ (R^d)^N is stationary in the following sense: for every Voronoi quantization X̂^{x*} of X,

    E(X | X̂^{x*}) = X̂^{x*}.    (5.3)

Proof. Let x* be an optimal quadratic quantizer and X̂^{x*} an optimal quantization of X given by (5.1), where (C_i(x*))_{1≤i≤N} is a Voronoi partition induced by x*. Then, for every y ∈ (R^d)^N, ||X − X̂^{x*}||_2 ≤ ||X − X̂^y||_2. Now consider the random variable E(X | X̂^{x*}). Being measurable with respect to the σ-field A* = σ({X ∈ C_i(x*)}, i = 1, ..., N) spanned by X̂^{x*}, it reads

    E(X | X̂^{x*}) = Σ_{i=1}^N y_i 1_{{X ∈ C_i(x*)}}

where y_1, ..., y_N ∈ R^d. Let y = (y_1, ..., y_N) (with possibly repeated components). Then, by the very definition of a nearest neighbour projection (on {y_1, ..., y_N}),

    ||X − E(X | X̂^{x*})||_2 ≥ ||X − X̂^y||_2 ≥ ||X − X̂^{x*}||_2.

On the other hand, by the very definition of conditional expectation as an orthogonal projection on L²(A*) and the fact that X̂^{x*} obviously lies in L²_{R^d}(Ω, A*, P),

    ||X − E(X | X̂^{x*})||_2 ≤ ||X − X̂^{x*}||_2.

Combining these two strings of inequalities yields

    ||X − X̂^{x*}||_2 = ||X − E(X | X̂^{x*})||_2 = min{ ||X − Y||_2 : Y A*-measurable }.

The uniqueness of this best quadratic approximation (the orthogonal projection onto L²_{R^d}(Ω, A*, P)) finally implies X̂^{x*} = E(X | X̂^{x*}) P-a.s. ♦

Remark. • It is shown in [66] (Theorem 4.2, p. 38) that optimal quantizers satisfy an additional property: the boundaries of any of their Voronoi partitions are P_X-negligible (even if P_X does have atoms).
• Let x ∈ (R^d)^N be an N-tuple with pairwise distinct components whose Voronoi partitions (all) have a P-negligible boundary, i.e. P( ∪_{i=1}^N ∂C_i(x) ) = 0. Then the Voronoi quantization X̂^x of X is P-a.s. unique and, if x is a local minimum of the quadratic distortion function, then x is a stationary quantizer, still in the sense that

    E(X | X̂^x) = X̂^x.

This is an easy consequence of the differentiability result for the quadratic distortion established further on in Proposition 6.3.1 in Section 6.3.6.
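Stationarity can be checked numerically on a classical example: for X ∼ U([0,1]) the optimal quadratic 2-quantizer is x* = (1/4, 3/4). The sketch below (our own, replacing the distribution by a deterministic midpoint rule on [0,1]) verifies that each component equals the conditional mean of its Voronoi cell, as in (5.3):

```python
def cell_means(grid, m=100000):
    """Approximate E(X | X in C_i(x)) for X ~ U([0,1]) and a 1-d grid,
    using a deterministic midpoint rule over [0,1]."""
    sums = [0.0] * len(grid)
    counts = [0] * len(grid)
    for k in range(m):
        xi = (k + 0.5) / m
        i = min(range(len(grid)), key=lambda j: abs(xi - grid[j]))  # Voronoi cell of xi
        sums[i] += xi
        counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]

# x* = (1/4, 3/4): stationarity means each x_i is the mean of its cell.
means = cell_means([0.25, 0.75])
```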

Figure 5.1 shows a quadratic optimal – or at least close to optimal – quantization grid for a bivariate normal distribution N(0, I₂). The rate of convergence to 0 of the optimal quantization error is ruled by the so-called Zador Theorem.
Figure 5.2 illustrates on the bivariate normal distribution the intuitive fact that optimal quantization does not produce quantizers whose Voronoi cells all have the same "weight". In fact, optimizing a quantizer tends to equalize the local inertia of each cell, i.e. E( 1_{{X ∈ C_i(x)}} |X − x_i|² ), i = 1, ..., N.
Figure 5.1: Optimal quadratic quantization of size N = 200 of the bivariate normal distribution N(0, I₂).

Figure 5.2: Voronoi tessellation of an optimal N-quantizer (N = 500). Color code: the heavier a cell is for the normal distribution, the lighter it looks.

Theorem 5.1.2 (Zador's Theorem) Let d ≥ 1.
(a) Sharp rate (see [66]). Let X ∈ L^{2+δ}_{R^d}(P) for some δ > 0. Assume P_X(dξ) = ϕ(ξ) λ_d(dξ) + ν(dξ), ν ⊥ λ_d (λ_d Lebesgue measure on R^d). Then there is a constant J̃_{2,d} ∈ (0, ∞) such that

    lim_{N→+∞} N^{1/d} min_{x∈(R^d)^N} ||X − X̂^x||_2 = J̃_{2,d} ( ∫_{R^d} ϕ^{d/(d+2)} dλ_d )^{1/2 + 1/d}.

(b) Non-asymptotic upper bound (see [107]). Let δ > 0. There exists a real constant C_{d,δ} ∈ (0, ∞) such that, for every R^d-valued random vector X,

    ∀ N ≥ 1,   min_{x∈(R^d)^N} ||X − X̂^x||_2 ≤ C_{d,δ} σ_{2+δ}(X) N^{−1/d},

where, for any p ∈ (0, ∞), σ_p(X) = min_{a∈R^d} ||X − a||_p.

Figure 5.3: Optimal N-quantization (N = 500) of (W₁, sup_{t∈[0,1]} W_t) depicted with its Voronoi tessellation, W standard Brownian motion.

Remarks. • The N^{1/d} factor is known as the curse of dimensionality: this is the optimal rate at which a d-dimensional space can be "filled" by 0-dimensional objects.
• The real constant J̃_{2,d} clearly corresponds to the case of the uniform distribution U([0,1]^d) over the unit hypercube [0,1]^d, for which the slightly more precise statement holds:

    lim_N N^{1/d} min_{x∈(R^d)^N} ||X − X̂^x||_2 = inf_N N^{1/d} min_{x∈(R^d)^N} ||X − X̂^x||_2 = J̃_{2,d}.

One key of the proof is a self-similarity argument "à la Halton" which establishes the theorem for the U([0,1]^d) distribution.
• Zador's Theorem holds true for any general – possibly non-Euclidean – norm on R^d, and the value of J̃_{2,d} depends on the reference norm on R^d. When d = 1, elementary computations show that J̃_{2,1} = 1/(2√3). When d = 2, with the canonical Euclidean norm, one shows (see [117] for a proof, see also [66]) that J̃_{2,2} = √(5/(18√3)). Its exact value is unknown for d ≥ 3 but, still for the canonical Euclidean norm, one has (see [66]), using some random quantization arguments,

    J̃_{2,d} ∼ √(d/(2πe)) ≈ √(d/17.08)   as d → +∞.
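The one-dimensional case can be checked numerically: for X ∼ U([0,1]) the midpoint grids x_i = (2i−1)/(2N) are the optimal N-quantizers, and Zador's theorem predicts N·e_N → 1/(2√3) ≈ 0.2887. A minimal sketch (our own, with a deterministic quadrature in place of the expectation):

```python
import math

def quad_distortion(grid, m=50000):
    """Deterministic midpoint-rule approximation of the quadratic distortion
    E min_i |X - x_i|^2 for X ~ U([0,1])."""
    s = 0.0
    for k in range(m):
        xi = (k + 0.5) / m
        s += min((xi - g) ** 2 for g in grid)
    return s / m

# Zador in d = 1: N * e_N should be close to 1/(2*sqrt(3)) for each N.
scaled = []
for N in (8, 16):
    grid = [(2 * i - 1) / (2 * N) for i in range(1, N + 1)]
    scaled.append(N * math.sqrt(quad_distortion(grid)))
```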

5.2 Cubature formulas

The random vector X̂^x takes its values in the finite set {x_1, ..., x_N}, so for every continuous function f : R^d → R with f(X) ∈ L²(P), we have

    E(f(X̂^x)) = Σ_{i=1}^N f(x_i) P(X ∈ C_i(x)),

which is the quantization-based cubature formula to approximate E(f(X)) [122, 126]. As X̂^x is close to X, it is natural to estimate E(f(X)) by E(f(X̂^x)) when f is continuous. Furthermore, when f is smooth enough, one can provide an upper bound for the resulting error using the quantization error ||X − X̂^x||_2, or its square (when the quantizer x is stationary).
The same idea can be used to approximate the conditional expectation E(f(X) | Y) by E(f(X̂) | Ŷ), but one then also needs the transition probabilities

    P(X ∈ C_j(x) | Y ∈ C_i(y)).

Numerical computation of E(F(X̂^x)) is possible as soon as F(ξ) can be computed at any ξ ∈ R^d and the distribution (P(X̂^x = x_i))_{1≤i≤N} of X̂^x is known. The induced quantization error ||X − X̂^x||_2 is used to control the error (see below). These quantities related to the quantizer x are also called companion parameters.
Likewise, one can consider a priori the σ(X̂^x)-measurable random variable F(X̂^x) as a good approximation of the conditional expectation E(F(X) | X̂^x).
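The cubature formula is a plain weighted sum once the grid and its companion weights are known. A minimal sketch (our own toy example: X ∼ U([0,1]), the stationary midpoint N-grid with equal weights, f(x) = x²), together with the second-order error bound (5.7) it satisfies:

```python
N = 20
grid = [(2 * i - 1) / (2 * N) for i in range(1, N + 1)]   # stationary quantizer of U([0,1])
weights = [1.0 / N] * N                                   # companion weights P(X in C_i(x))

f = lambda x: x * x                                       # E f(X) = 1/3
cubature = sum(w * f(x) for w, x in zip(weights, grid))   # E f(X^x) = sum_i f(x_i) P(X in C_i(x))

exact = 1.0 / 3.0
e2_squared = 1.0 / (12.0 * N * N)   # squared quadratic quantization error of this grid
# For this smooth f and stationary grid: |E f(X) - E f(X^x)| <= [Df]_Lip * e2_squared,
# with [Df]_Lip = 2 here.
```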

5.2.1 Lipschitz continuous functionals

Assume that the functional F is Lipschitz continuous on R^d. Then

    |E(F(X) | X̂^x) − F(X̂^x)| ≤ [F]_Lip E( |X − X̂^x| | X̂^x )

so that, for every real exponent r ≥ 1,

    || E(F(X) | X̂^x) − F(X̂^x) ||_r ≤ [F]_Lip ||X − X̂^x||_r

(where we applied the conditional Jensen inequality to the convex function u ↦ u^r). In particular, using that E F(X) = E( E(F(X) | X̂^x) ), one derives (with r = 1) that

    | E F(X) − E F(X̂^x) | ≤ || E(F(X) | X̂^x) − F(X̂^x) ||_1 ≤ [F]_Lip ||X − X̂^x||_1.

Finally, the monotonicity of the L^r(P)-norms as a function of r yields

    | E F(X) − E F(X̂^x) | ≤ [F]_Lip ||X − X̂^x||_1 ≤ [F]_Lip ||X − X̂^x||_2.    (5.4)

In fact, considering the Lipschitz continuous functional F(ξ) := dist(ξ, {x_1, ..., x_N}) shows that

    ||X − X̂^x||_1 = sup_{[F]_Lip ≤ 1} | E F(X) − E F(X̂^x) |.    (5.5)

The Lipschitz continuous functionals making up a characterizing family for the weak convergence of probability measures on R^d, one derives that, for any sequence of N-quantizers x^N satisfying ||X − X̂^{x^N}||_1 → 0 as N → +∞,

    Σ_{1≤i≤N} P(X̂^{x^N} = x_i^N) δ_{x_i^N}  ⇒^{(R^d)}  P_X

where ⇒^{(R^d)} denotes the weak convergence of probability measures on R^d.

5.2.2 Convex functionals

If F is a convex functional and X̂ is a stationary quantization of X, a straightforward application of the conditional Jensen inequality yields

    E( F(X) | X̂ ) ≥ F( E(X | X̂) ) = F(X̂)

so that

    E(F(X̂)) ≤ E(F(X)).

5.2.3 Differentiable functionals with Lipschitz continuous differentials

Assume now that F is differentiable on R^d with a Lipschitz continuous differential DF, and that the quantizer x is stationary (see Equation (5.3)).
A Taylor expansion yields

    | F(X) − F(X̂^x) − DF(X̂^x).(X − X̂^x) | ≤ [DF]_Lip |X − X̂^x|².

Taking conditional expectation given X̂^x yields

    | E(F(X) | X̂^x) − F(X̂^x) − E( DF(X̂^x).(X − X̂^x) | X̂^x ) | ≤ [DF]_Lip E( |X − X̂^x|² | X̂^x ).

Now, using that the random variable DF(X̂^x) is σ(X̂^x)-measurable and the stationarity of x, one has

    E( DF(X̂^x).(X − X̂^x) | X̂^x ) = DF(X̂^x).E( X − X̂^x | X̂^x ) = 0

so that

    | E(F(X) | X̂^x) − F(X̂^x) | ≤ [DF]_Lip E( |X − X̂^x|² | X̂^x ).

Then, for every real exponent r ≥ 1,

    || E(F(X) | X̂^x) − F(X̂^x) ||_r ≤ [DF]_Lip ||X − X̂^x||²_{2r}.    (5.6)

In particular, when r = 1, one derives as in the former setting that

    | E F(X) − E F(X̂^x) | ≤ [DF]_Lip ||X − X̂^x||₂².    (5.7)

In fact, the above inequality holds provided F is C¹ with a Lipschitz continuous differential on every Voronoi cell C_i(x). A characterization similar to (5.5), based on these functionals, could be established.
Some variants of these cubature formulas can be found in [129] or [67] for functions or functionals F having only some local Lipschitz continuous regularity.

5.2.4 Quantized approximation of E(F(X) | Y)

Let X and Y be two R^d-valued random vectors defined on the same probability space (Ω, A, P) and let F : R^d → R be a Borel functional. The natural idea to approximate E(F(X) | Y) by quantization is to replace mutatis mutandis the random vectors X and Y by their quantizations X̂ and Ŷ. The resulting approximation is then

    E(F(X) | Y) ≈ E(F(X̂) | Ŷ).

At this stage a natural question is to look for a priori estimates in L^p of the resulting error, given the L^p-quantization errors ||X − X̂||_p and ||Y − Ŷ||_p.
To this end, we need further assumptions on F. Let ϕ_F : R^d → R be a (Borel) version of the conditional expectation, i.e. satisfying

    E(F(X) | Y) = ϕ_F(Y).

Usually, no closed form is available for the function ϕ_F, but some regularity property can be established, especially in a (Feller) Markovian framework. Thus, assume that both F and ϕ_F are Lipschitz continuous with Lipschitz coefficients [F]_Lip and [ϕ_F]_Lip.

 Quadratic case (p = 2). We get

    E(F(X) | Y) − E(F(X̂) | Ŷ) = E(F(X) | Y) − E(F(X) | Ŷ) + E( F(X) − F(X̂) | Ŷ )
                               = E(F(X) | Y) − E( E(F(X) | Y) | Ŷ ) + E( F(X) − F(X̂) | Ŷ ),

where we used that Ŷ is σ(Y)-measurable.
Now, E(F(X) | Y) − E( E(F(X) | Y) | Ŷ ) and E( F(X) − F(X̂) | Ŷ ) are clearly orthogonal in L²(P), so that

    || E(F(X) | Y) − E(F(X̂) | Ŷ) ||₂² = || E(F(X) | Y) − E( E(F(X) | Y) | Ŷ ) ||₂² + || E( F(X) − F(X̂) | Ŷ ) ||₂².

Using this time the very definition of the conditional expectation given Ŷ as the best quadratic approximation among σ(Ŷ)-measurable random variables, we get

    || E(F(X) | Y) − E( E(F(X) | Y) | Ŷ ) ||_2 = || ϕ_F(Y) − E( ϕ_F(Y) | Ŷ ) ||_2 ≤ || ϕ_F(Y) − ϕ_F(Ŷ) ||_2 ≤ [ϕ_F]_Lip ||Y − Ŷ||_2.

On the other hand, still using that E( . | σ(Ŷ)) is an L²-contraction and this time that F is Lipschitz continuous,

    || E( F(X) − F(X̂) | Ŷ ) ||_2 ≤ || F(X) − F(X̂) ||_2 ≤ [F]_Lip ||X − X̂||_2.

Finally,

    || E(F(X) | Y) − E(F(X̂) | Ŷ) ||₂² ≤ [F]²_Lip ||X − X̂||₂² + [ϕ_F]²_Lip ||Y − Ŷ||₂².

 L^p-case (p ≠ 2). In the non-quadratic case, a counterpart of the above inequality remains valid for the L^p-norm itself, provided [ϕ_F]_Lip is replaced by 2[ϕ_F]_Lip, namely

    || E(F(X) | Y) − E(F(X̂) | Ŷ) ||_p ≤ [F]_Lip ||X − X̂||_p + 2 [ϕ_F]_Lip ||Y − Ŷ||_p.

Exercise. Prove the above L^p-error bound when p ≠ 2.

5.3 How to get optimal quantization?

This is often considered as the prominent drawback of optimal quantization, at least with respect to the Monte Carlo method. Computing optimal or optimized quantization grids (and their weights) is less flexible (and more time consuming) than simulating a random vector. This means that optimal quantization is mainly useful when one has to compute many integrals (or conditional expectations) with respect to the same probability distribution, e.g. the Gaussian distributions.
As soon as d ≥ 2, all procedures to optimize the quantization error are based on some nearest neighbour search. Let us cite the randomized Lloyd I procedure and the Competitive Learning Vector Quantization algorithm: the first one is a fixed point procedure and the second one a recursive stochastic approximation procedure. For further details we refer to [129] and to Section 6.3.6 in Chapter 6. Some optimal/optimized grids of the Gaussian distribution are available on the web site

    www.quantize.maths-fi.com

as well as several papers dealing with quantization optimization.
Recent implementations of exact or approximate fast nearest neighbour search procedures confirmed that the computation time can be considerably reduced in higher dimension.

5.3.1 Competitive Learning Vector Quantization algorithm

See Section 6.3.6 in Chapter 6.

5.3.2 Randomized Lloyd's I procedure

See Section 6.3.6 in Chapter 6.
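Although the detailed study of these procedures is deferred to Chapter 6, the fixed-point structure of Lloyd's I is simple enough to sketch here on an empirical sample (our own minimal one-dimensional illustration; real implementations work with much larger samples and fast nearest neighbour search):

```python
import random

def lloyd_1d(samples, grid, n_iter=10):
    """Randomized Lloyd I in one dimension on an empirical sample: alternately
    (i) assign every sample to its nearest center (nearest neighbour projection)
    and (ii) move each center to the mean of its cell. Each sweep is a fixed-point
    step that cannot increase the empirical distortion."""
    grid = list(grid)
    for _ in range(n_iter):
        sums = [0.0] * len(grid)
        counts = [0] * len(grid)
        for s in samples:
            i = min(range(len(grid)), key=lambda j: abs(s - grid[j]))
            sums[i] += s
            counts[i] += 1
        grid = [sums[i] / counts[i] if counts[i] else grid[i]
                for i in range(len(grid))]
    return grid

def distortion(samples, grid):
    return sum(min((s - g) ** 2 for g in grid) for s in samples) / len(samples)

rng = random.Random(1)
samples = [rng.gauss(0.0, 1.0) for _ in range(5000)]   # empirical N(0,1) measure
init = [-0.1, 0.0, 0.1, 0.2]                           # a deliberately poor initial 4-quantizer
opt = lloyd_1d(samples, init)
```

By construction each Lloyd sweep decreases the empirical distortion, so the optimized grid always does at least as well as the initialization.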

5.4 Numerical integration (II): Richardson-Romberg extrapolation

The challenge is to fight against the curse of dimensionality in order to increase the critical dimension beyond which the theoretical rate of convergence of the Monte Carlo method outperforms that of optimal quantization. Combining the above cubature formulas (5.4) and (5.7) with the rate of convergence of the (optimal) quantization error, it seems natural to conclude that the critical dimension for quantization based cubature formulas is d = 4 (when dealing with continuously differentiable functions), at least when compared to Monte Carlo simulation. Several tests have been carried out and reported in [129, 127] to refine this a priori theoretical bound. The benchmark was made of several options on a geometric index on d independent assets in a Black-Scholes model: puts, put spreads and the same in a smoothed version, always without any control variate. Of course, uncorrelated assets are not a realistic assumption, but this is clearly the most challenging setting as far as numerical integration is concerned. Once the dimension d and the number of points N have been chosen, we compared the resulting integration error with a one standard deviation confidence interval of the corresponding Monte Carlo estimator for the same number of integration points N. This standard deviation is computed thanks to a Monte Carlo simulation carried out using 10⁴ trials.
The results turned out to be more favourable to quantization than predicted by the theoretical bounds, mainly because we carried out our tests with rather small values of N, whereas the curse of dimensionality is an asymptotic statement. Up to dimension 4, the larger N is, the more quantization outperforms Monte Carlo simulation. When the dimension d ≥ 5, quantization always outperforms Monte Carlo (in the above sense) up to a critical size N_c(d) which decreases as d increases.
In this section, we provide a method to push these critical sizes further, at least for smooth enough functionals. Let F : R^d → R be a twice differentiable functional with Lipschitz continuous Hessian D²F. Let (X̂^{(N)})_{N≥1} be a sequence of optimal quadratic quantizations of X. Then

    E(F(X)) = E(F(X̂^{(N)})) + (1/2) E( D²F(X̂^{(N)}).(X − X̂^{(N)})^{⊗2} ) + O( E|X − X̂^{(N)}|³ ).    (5.8)

Under some assumptions which are satisfied by most usual distributions (including the normal distribution), it is proved in [67], as a special case of a more general result, that

    E|X − X̂^{(N)}|³ = O(N^{−3/d}),

or at least (in particular when d = 2) E|X − X̂^{(N)}|³ = O(N^{−(3−ε)/d}), ε > 0. If, furthermore, we make the conjecture that

    E( D²F(X̂^{(N)}).(X − X̂^{(N)})^{⊗2} ) = c_{F,X} / N^{2/d} + o( 1/N^{3/d} ),    (5.9)

one can use a Richardson-Romberg extrapolation to compute E(F(X)). Namely, one considers two sizes N₁ and N₂ (in practice one often sets N₁ = N/2 and N₂ = N). Then, combining (5.8) written with N₁ and N₂,

    E(F(X)) = ( N₂^{2/d} E(F(X̂^{(N₂)})) − N₁^{2/d} E(F(X̂^{(N₁)})) ) / ( N₂^{2/d} − N₁^{2/d} )
              + O( 1 / ( (N₁ ∧ N₂)^{1/d} (N₂^{2/d} − N₁^{2/d}) ) ).
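The extrapolation itself is a one-line combination once the two quantized cubature values are available. The sketch below (our own toy check, in d = 1 where the midpoint-grid cubature of f(x) = x² for X ∼ U([0,1]) has an error exactly of the form c/N²) shows the cancellation at work:

```python
def rr_extrapolation(i1, n1, i2, n2, d=1):
    """Two-level Richardson-Romberg combination of quantized cubature values
    whose leading error term is c / N^(2/d)."""
    w1, w2 = n1 ** (2.0 / d), n2 ** (2.0 / d)
    return (w2 * i2 - w1 * i1) / (w2 - w1)

def cubature(N):
    """Midpoint-grid cubature of f(x) = x^2 for X ~ U([0,1]);
    its value is exactly 1/3 - 1/(12 N^2)."""
    return sum(((2 * i - 1) / (2 * N)) ** 2 for i in range(1, N + 1)) / N

i1, i2 = cubature(8), cubature(16)
combined = rr_extrapolation(i1, 8, i2, 16)
# Since the error here is exactly c/N^2, the combined value recovers 1/3 up to round-off,
# while each crude cubature is off by order 1/N^2.
```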
 Numerical illustration. In order to see the effect of the extrapolation technique described above, numerical computations have been carried out in the case of the regularized version of some put spread options on geometric indices in dimension d = 4, 6, 8, 10. By "regularized", we mean that the payoff at maturity T has been replaced by its price function at time T′ < T (usually T′ ≈ T). Numerical integration was performed using the Gaussian optimal grids of size N = 2^k, k = 2, ..., 12 (available at the web site www.quantize.maths-fi.com).
We consider again one of the test functions implemented in [129] (p. 152). These test functions were borrowed from classical option pricing in mathematical finance, namely a put spread option (on a geometric index, which is less classical). Moreover, we will use a "regularized" version of the payoff. One considers d independent traded assets S^1, ..., S^d following a d-dimensional Black & Scholes dynamics (under its risk-neutral probability)

    S_t^i = s_0 exp( (r − σ_i²/2) t + σ_i √t Z^{i,t} ),   i = 1, ..., d,

where Z^{i,t} = W_t^i/√t and W = (W^1, ..., W^d) is a d-dimensional standard Brownian motion. Independence is unrealistic but corresponds to the most unfavourable case for numerical experiments. We also assume that S_0^i = s_0 > 0, i = 1, ..., d, and that the d assets share the same volatility σ_i = σ > 0. One considers the geometric index

    I_t = (S_t^1 ⋯ S_t^d)^{1/d}.

One shows that e^{−(σ²/2)(1/d − 1)t} I_t itself has a risk-neutral Black-Scholes dynamics. We want to test the regularized put spread option on this geometric index with strikes K_1 < K_2 (at time T/2). Let ψ(s_0, K_1, K_2, r, σ, T) denote the premium at time 0 of a put spread on any of the assets S^i:

    ψ(x, K_1, K_2, r, σ, T) = π(x, K_2, r, σ, T) − π(x, K_1, r, σ, T),
    π(x, K, r, σ, T) = K e^{−rT} erf(−d_2) − x erf(−d_1),
    d_1 = ( log(x/K) + (r + σ²/(2d)) T ) / ( σ √(T/d) ),   d_2 = d_1 − σ √(T/d).

Using the martingale property of the discounted premium of a European option, the premium e^{−rT} E((K_1 − I_T)_+ − (K_2 − I_T)_+) of the put spread option on I satisfies, on the one hand,

    e^{−rT} E((K_1 − I_T)_+ − (K_2 − I_T)_+) = ψ( s_0 e^{(σ²/2)(1/d − 1)T}, K_1, K_2, r, σ/√d, T )

and, on the other hand,

    e^{−rT} E((K_1 − I_T)_+ − (K_2 − I_T)_+) = E g(Z)

where

    g(Z) = e^{−rT/2} ψ( e^{(σ²/2)(1/d − 1)T/2} I_{T/2}, K_1, K_2, r, σ, T/2 )

and Z = (Z^{1,T/2}, ..., Z^{d,T/2}) ∼ N(0; I_d). The numerical specifications of the function g are as follows: s_0 = 100, K_1 = 98, K_2 = 102, r = 5%, σ = 20%, T = 2.
The results are displayed below (see Fig. 5.4) in a log-log scale for the dimensions d = 4, 6, 8, 10.
First, we recover the theoretical rates (namely −2/d) of convergence for the error bounds. Indeed, some slopes β(d) can be derived (using a regression) from the quantization errors, and we found β(4) = −0.48, β(6) = −0.33, β(8) = −0.25 and β(10) = −0.23 (see Fig. 5.4). These rates plead for the implementation of the Richardson-Romberg extrapolation. Also note that, as already reported in [129], when d ≥ 5, quantization still outperforms MC simulations (in the above sense) up to a critical number N_c(d) of points (N_c(6) ∼ 5000, N_c(7) ∼ 1000, N_c(8) ∼ 500, etc.).
As concerns the Richardson-Romberg extrapolation method itself, note first that it always gives better results than "crude" quantization. As regards the comparison with Monte Carlo simulation, no critical number of points N_Romb(d) comes out beyond which MC simulation outperforms the Richardson-Romberg extrapolation. This means that N_Romb(d) is greater than the range of use of quantization based cubature formulas in our benchmark, namely 5 000.
Romberg extrapolation techniques are commonly known to be unstable and, indeed, it has not always been possible to estimate satisfactorily its rate of convergence on our benchmark. But when a significant slope (in a log-log scale) can be estimated from the Richardson-Romberg errors (as for d = 8 and d = 10 in Fig. 5.4 (c), (d)), its absolute value is larger than 1/2, so that these extrapolations always outperform the MC method even for large values of N. As a by-product, our results plead in favour of the conjecture (5.9) and lead to think that Richardson-Romberg extrapolation is a powerful tool to accelerate numerical integration by optimal quantization, even in higher dimension.

Figure 5.4: Errors and standard deviations as functions of the number of points N in a log-log scale. The quantization error is displayed by the cross + and the Richardson-Romberg extrapolation error by the cross ×. The dashed line without crosses denotes the standard deviation of the Monte Carlo estimator. (a) d = 4, (b) d = 6, (c) d = 8, (d) d = 10.

5.5 Hybrid quantization-Monte Carlo methods

In this section we explore two aspects of variance reduction by quantization. First we propose to use (optimal) quantization as a control variate, then we present a stratified sampling method relying on a quantization based stratification. This second method can be seen as a guided Monte Carlo method or a hybrid quantization/Monte Carlo method. It was originally introduced in [129, 131] to deal with Lipschitz continuous functionals of the Brownian motion. Here, we will deal with a finite dimensional setting.

5.5.1 Optimal quantization as a control variate

Let X : (Ω, A, P) → R^d be a square integrable random vector. We assume that we have access to an N-quantizer x := (x_1, ..., x_N) ∈ (R^d)^N and we denote by

    X̂^N = Proj_x(X)

(one of) its (Borel) nearest neighbour projections. We also assume that we have access to the numerical values of the "companion" probability distribution of x, that is the distribution of X̂^N given by

    P(X̂^N = x_i) = P(X ∈ C_i(x)),   i = 1, ..., N,

where (C_i(x))_{i=1,...,N} denotes the Voronoi tessellation of the N-quantizer induced by the above nearest neighbour projection.
Let F : R^d → R be a Lipschitz continuous function such that F(X) ∈ L²(P). In order to compute E(F(X)), one writes

    E(F(X)) = E(F(X̂^N)) + E( F(X) − F(X̂^N) )
            = E(F(X̂^N)) + (1/M) Σ_{m=1}^M ( F(X^{(m)}) − F(X̂^{N,(m)}) ) + R_{N,M}    (5.10)

where X^{(m)}, m = 1, ..., M, are M independent copies of X, X̂^{N,(m)} denotes the nearest neighbour projection of X^{(m)} on the fixed N-quantizer x ∈ (R^d)^N and R_{N,M} is the remainder term defined by (5.10). The first term can be computed by quantization-based cubature and the second one by a Monte Carlo simulation. Then,

    ||R_{N,M}||_2 = σ( F(X) − F(X̂^N) ) / √M ≤ || F(X) − F(X̂^N) ||_2 / √M ≤ [F]_Lip ||X − X̂^N||_2 / √M,

where σ(Y) denotes the standard deviation of a random variable Y. Furthermore,

    √M R_{N,M} →^{L} N( 0; Var(F(X) − F(X̂^N)) )   as M → +∞.

Consequently, if F is simply a Lipschitz continuous function and if (X̂^N)_{N≥1} is a sequence of optimal quadratic quantizations of X, then

    ||R_{N,M}||_2 ≤ || F(X) − F(X̂^N) ||_2 / √M ≤ C_{d,δ} [F]_Lip σ_{2+δ}(X) / ( √M N^{1/d} ),    (5.11)

where C_{d,δ} is the constant coming from the non-asymptotic version of Zador's Theorem.
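The identity (5.10) translates directly into code. Below is a minimal sketch (our own toy setting: X ∼ U([0,1]), the midpoint 8-grid with its uniform companion weights, F(x) = x²); the Monte Carlo part only has to estimate the small residual E(F(X) − F(X̂^N)), whose standard deviation is controlled by [F]_Lip ||X − X̂^N||_2:

```python
import random

def hybrid_estimator(f, grid, weights, sampler, proj, M, rng):
    """Quantization as a control variate, as in (5.10): cubature term E f(X^N)
    plus a Monte Carlo estimate of the residual E(f(X) - f(X^N))."""
    cub = sum(w * f(x) for w, x in zip(weights, grid))
    corr = 0.0
    for _ in range(M):
        x = sampler(rng)
        corr += f(x) - f(proj(x))
    return cub + corr / M

N = 8
grid = [(2 * i - 1) / (2 * N) for i in range(1, N + 1)]    # midpoint grid for U([0,1])
weights = [1.0 / N] * N                                    # companion weights
proj = lambda x: min(grid, key=lambda g: abs(x - g))       # nearest neighbour projection

rng = random.Random(0)
est = hybrid_estimator(lambda x: x * x, grid, weights,
                       lambda r: r.random(), proj, M=4000, rng=rng)
# Target value: E X^2 = 1/3; the estimator is unbiased with a much reduced variance.
```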

About practical implementation

As concerns the practical implementation of this quantization based variance reduction method, the main obstacle is the nearest neighbour search needed at each step to compute X̂^{N,(m)} from X^{(m)}.
In one dimension, an (optimal) N-quantizer is usually directly obtained as a sorted N-tuple (with non-decreasing components) and the complexity of a nearest neighbour search on the real line based on a dichotomy procedure is approximately log N / log 2. Of course, this case is of little interest for applications.
In d dimensions, there exist nearest neighbour search procedures with an O(log N) complexity, once the N-quantizer has been given an appropriate tree structure (which costs O(N log N)). The most popular tree based procedure for nearest neighbour search is undoubtedly the Kd-tree (see [53]). During the last ten years, several attempts to improve it have been carried out, among which one can mention the Principal Axis Tree algorithm (see [108]). These procedures are efficient for quantizers with a large size N lying in a vector space of medium dimension (say up to 10).
An alternative to speed up the nearest neighbour search is to restrict oneself to product quantizers, whose Voronoi cells are hyper-parallelepipeds. In that case the nearest neighbour search reduces to d searches on the marginals, with an approximate resulting complexity of d log N / log 2.
However, this nearest neighbour search procedure slows down the global procedure.
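In one dimension, the dichotomy search mentioned above can be implemented as a bisection on the N − 1 cell boundaries (the midpoints of consecutive grid points). A minimal sketch (our own, using the standard library bisect module):

```python
import bisect, random

def nn_dichotomy(xi, grid, mids):
    """Nearest neighbour index in a sorted 1-d N-quantizer in about log N / log 2
    comparisons: locate xi among the N-1 cell boundaries by dichotomy.
    `mids` must be precomputed as the midpoints of consecutive grid points."""
    return bisect.bisect_right(mids, xi)

N = 64
grid = sorted(random.Random(2).uniform(-3.0, 3.0) for _ in range(N))
mids = [(grid[i] + grid[i + 1]) / 2.0 for i in range(N - 1)]

# Sanity check against the brute-force O(N) search on random query points.
rng = random.Random(3)
ok = all(
    abs(x - grid[nn_dichotomy(x, grid, mids)]) == min(abs(x - g) for g in grid)
    for x in (rng.uniform(-4.0, 4.0) for _ in range(100))
)
```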

5.5.2 Universal stratified sampling

The main drawback of what precedes is the repeated use of nearest neighbour search procedures. Using a quantization based stratification may be a means to take advantage of quantization to reduce the variance without having to implement such time consuming procedures. On the other hand, one important drawback of the regular stratification method as described in Section 3.5 is that it depends on the function F, at least as concerns the optimal choice of the allocation parameters q_i. But, in fact, one can show that stratification has a uniform efficiency over the class of Lipschitz continuous functions. This follows from the easy proposition below, where we use some notations already introduced in Section 3.5.

Proposition 5.1 (Universal stratification) Let (A_i)_{i∈I} be a stratification of R^d. For every i ∈ I, define the local inertia of the random vector X by

    σ_i² = E( |X − E(X | X ∈ A_i)|² | X ∈ A_i ).

(a) Then, for every Lipschitz continuous function F : (R^d, |.|) → (R^d, |.|),

    ∀ i ∈ I,   sup_{[F]_Lip ≤ 1} σ_{F,i} = σ_i    (5.12)

where, for every i ∈ I, σ_{F,i} is nonnegative and defined by

    σ_{F,i}² = min_{a∈R^d} E( |F(X) − a|² | X ∈ A_i ) = E( |F(X) − E(F(X) | X ∈ A_i)|² | X ∈ A_i ).

(b) Suboptimal choice (q_i = p_i):

    sup_{[F]_Lip ≤ 1} Σ_{i∈I} p_i σ_{F,i}² = Σ_{i∈I} p_i σ_i² = || X − E( X | σ({X ∈ A_i}, i ∈ I) ) ||₂².    (5.13)

(c) Optimal choice (of the q_i's).

sup_{[F]_Lip ≤ 1} ( ∑_{i∈I} p_i σ_{F,i} )² = ( ∑_{i∈I} p_i σ_i )².    (5.14)

Furthermore

( ∑_{i∈I} p_i σ_i )² ≥ ‖X − E(X | σ({X ∈ A_i}, i ∈ I))‖_1².

Remark. Any real-valued Lipschitz continuous function can be seen as an R^d-valued Lipschitz function, but then the above equalities (5.12), (5.13) and (5.14) only hold as inequalities.

Proof. (a) In fact

σ_{F,i}² = Var(F(X) | X ∈ A_i)
         = E(|F(X) − E(F(X) | X ∈ A_i)|² | X ∈ A_i)
         ≤ E(|F(X) − F(E(X | X ∈ A_i))|² | X ∈ A_i)

owing to the very definition of conditional expectation as a minimizer with respect to the conditional distribution. Now, using that F is Lipschitz continuous, it follows that

σ_{F,i}² ≤ [F]_Lip² (1/p_i) E( 1_{{X∈A_i}} |X − E(X | X ∈ A_i)|² ) = [F]_Lip² σ_i².

Equalities in (b) and (c) straightforwardly follow from (a). Finally, the monotony of the L^p-norms implies

∑_{i=1}^{N} p_i σ_i ≥ ‖X − E(X | σ({X ∈ A_i}, i ∈ I))‖_1.    ♦
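Claim (b) can also be checked numerically. The sketch below is our own illustration (not from the text): it takes X ∼ N(0; 1), ten equal-probability strata, the 1-Lipschitz function F(x) = |x|, and compares a Monte Carlo estimate of the residual variance ∑_i p_i σ_{F,i}² of the proportional-allocation stratified estimator with the plain variance of F(X). The inverse normal c.d.f. is computed by bisection, a crude but self-contained choice.

```python
import math, random

def Phi(x):                      # standard normal c.d.f. via erf
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p):                  # crude inverse c.d.f. by bisection
    lo, hi = -10.0, 10.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return 0.5 * (lo + hi)

random.seed(1)
N, M = 10, 2000                  # number of strata, samples per stratum
F = abs                          # a 1-Lipschitz test function
resid = 0.0                      # estimate of sum_i p_i sigma_{F,i}^2, p_i = 1/N
for i in range(N):
    # conditional sampling in stratum i: X = Phi_inv((i + U)/N), U ~ U(0,1)
    ys = [F(Phi_inv((i + random.random()) / N)) for _ in range(M)]
    m = sum(ys) / M
    resid += (sum((y - m) ** 2 for y in ys) / M) / N
xs = [F(Phi_inv(random.random())) for _ in range(N * M)]   # plain sampling
mx = sum(xs) / len(xs)
total = sum((x - mx) ** 2 for x in xs) / len(xs)           # Var F(X) ~ 0.363
print(resid, total)              # residual variance is much smaller than total
```

The gap between `resid` and `total` is exactly the between-strata variance removed by stratification with proportional allocation.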

5.5.3 An (optimal) quantization based universal stratification: a minimax approach

The starting idea is to use the Voronoi diagram of an N-quantizer x = (x_1, . . . , x_N) of X to design the strata in a stratification procedure. Firstly, this amounts to setting I = {1, . . . , N} and

A_i = C_i(x),  i ∈ I.

Then, for every i ∈ {1, . . . , N}, there exists a Borel function ϕ(x_i, ·) : [0, 1]^q → R^d such that

ϕ(x_i, U) has distribution L(X | X̂^x = x_i) = 1_{C_i(x)} P_X(dξ) / P(X ∈ C_i(x))

where U ∼ U([0, 1]^q). Note that the dimension q is arbitrary: one may always assume that q = 1 by the fundamental theorem of simulation, but in order to obtain some closed forms for ϕ(x_i, ·),

we are led to consider situations where q ≥ 2 (or even infinite when considering a von Neumann acceptance-rejection method).
Now let (ξ, U) be a couple of independent random vectors such that ξ ∼ X̂^x and U ∼ U([0, 1]^q). Then, one checks that

ϕ(ξ, U) ∼ X

so that one may assume without loss of generality that X = ϕ(ξ, U), which in turn implies that ξ = X̂^x, i.e.

X = ϕ(X̂^x, U),   U ∼ U([0, 1]^q),   U, X̂^x independent.
In terms of implementation, as mentioned above, we need a closed formula for the function ϕ, which induces some stringent constraints on the choice of the N-quantizers. In particular there is no reasonable hope to use a true optimal quadratic quantizer for that purpose. A reasonable compromise is to consider some optimal product quantization for which the function ϕ can easily be made explicit (see Section 3.5).

Let Ad (N ) denote the family of all Borel partitions of Rd having (at most) N elements.

Proposition 5.2 (a) Suboptimal choice. One has

inf_{(A_i)_{1≤i≤N} ∈ A_d(N)}  sup_{[F]_Lip ≤ 1}  ∑_{i∈I} p_i σ_{F,i}² = min_{x∈(R^d)^N} ‖X − X̂^x‖_2².

(b) Optimal choice. One has

inf_{(A_i)_{1≤i≤N} ∈ A_d(N)}  sup_{[F]_Lip ≤ 1}  ∑_{i∈I} p_i σ_{F,i} ≥ min_{x∈(R^d)^N} ‖X − X̂^x‖_1.

Proof. First we rewrite (5.13) and (5.14) in terms of quantization, i.e. with A_i = C_i(x):

sup_{[F]_Lip ≤ 1} ∑_{i∈I} p_i σ_{F,i}² = ∑_{i∈I} p_i σ_i² = ‖X − E(X | X̂^x)‖_2²    (5.15)

and

sup_{[F]_Lip ≤ 1} ( ∑_{i∈I} p_i σ_{F,i} )² = ( ∑_{i∈I} p_i σ_i )² ≥ ‖X − E(X | X̂^x)‖_1²    (5.16)

where we used the obvious fact that σ({X ∈ C_i(x)}, i ∈ I) = σ(X̂^x).
(a) It follows from (5.13) that

inf_{(A_i)_{1≤i≤N}} ∑_{i∈I} p_i σ_i² = inf_{(A_i)_{1≤i≤N}} ‖X − E(X | σ({X ∈ A_i}, i ∈ I))‖_2².

Now (see e.g. [66] or [124]),

‖X − X̂^{x*,N}‖_2 = min{ ‖X − Y‖_2, Y : (Ω, A, P) → R^d, |Y(Ω)| ≤ N }

and

X̂^{x*,N} = E(X | X̂^{x*,N}).

Consequently, (5.15) completes the proof.


(b) It follows from (5.14) that

∑_{i∈I} p_i σ_i ≥ ‖X − E(X | σ({X ∈ A_i}, i ∈ I))‖_1 ≥ ‖dist(X, x(A))‖_1

where x(A) := { E(X | σ({X ∈ A_i}, i ∈ I))(ω), ω ∈ Ω } has at most N elements. Now ‖dist(X, x(A))‖_1 = E dist(X, x(A)) = ‖X − X̂^{x(A)}‖_1, consequently

∑_{i∈I} p_i σ_i ≥ ‖X − X̂^{x(A)}‖_1 ≥ min_{x∈(R^d)^N} ‖X − X̂^x‖_1.    ♦

As a conclusion, we see that the notions of universal stratification (with respect to Lipschitz continuous functions) and quantization are closely related, since the variance reduction factor that can be obtained by such an approach is essentially ruled by the quantization rate of the state space of the random vector X by its distribution.

One dimension. In that case the method applies straightforwardly provided both the distribution function F_X(u) := P(X ≤ u) of X (on R̄) and its right continuous (canonical) inverse on [0, 1], denoted F_X^{−1}, are computable.
We also need the additional assumption that the N-quantizer x = (x_1, . . . , x_N) satisfies the following continuity assumption

P(X = x_i) = 0,  i = 1, . . . , N.

Note that this is always the case if X has a density.


Then set

x_{i+1/2} = (x_i + x_{i+1})/2,  i = 1, . . . , N − 1,   x_{1/2} = −∞,   x_{N+1/2} = +∞.

Elementary computations then show that, with q = 1,

∀ u ∈ [0, 1],  ϕ_N(x_i, u) = F_X^{−1}( F_X(x_{i−1/2}) + (F_X(x_{i+1/2}) − F_X(x_{i−1/2})) u ),  i = 1, . . . , N.    (5.17)

Higher dimension. We consider a random vector X = (X^1, . . . , X^d) whose marginals X^i are independent. This may appear as a rather stringent restriction in full generality, although it is often possible to “extract” in a model an innovation with that correlation structure. At least in a Gaussian framework, such a reduction is always possible after an orthogonal diagonalization of its covariance matrix. One considers a product quantizer (see e.g. [123, 124]) defined as follows: for every ℓ ∈ {1, . . . , d}, let x^{N_ℓ} = (x_1^{N_ℓ}, . . . , x_{N_ℓ}^{N_ℓ}) be an N_ℓ-quantizer of the marginal X^ℓ and set N := N_1 × · · · × N_d. Then, define for every multi-index i := (i_1, . . . , i_d) ∈ I := ∏_{ℓ=1}^{d} {1, . . . , N_ℓ},

x_i = (x_{i_1}^{(N_1)}, . . . , x_{i_d}^{(N_d)}).

Then, one defines ϕ_{N,X}(x_i, u) by setting q = d and

ϕ_{N,X}(x_i, (u_1, . . . , u_d)) = ( ϕ_{N_ℓ, X^ℓ}(x_{i_ℓ}^{(N_ℓ)}, u_ℓ) )_{1≤ℓ≤d}

where ϕ_{N_ℓ, X^ℓ} is defined by (5.17).


Chapter 6

Stochastic approximation and applications to finance

6.1 Motivation
In Finance, one often faces optimization problems or zero search problems. The former often reduce to the latter since, at least in a convex framework, minimizing a function amounts to finding a zero of its gradient. The most commonly encountered examples are the extraction of implicit parameters (implicit volatility of an option, implicit correlations in the credit markets or for a single best-of option), calibration, and the optimization of an exogenous parameter for variance reduction (regression, importance sampling, etc.). All these situations share a common feature: the involved functions all have a representation as an expectation, namely they read h(y) = E H(y, Z) where Z is a q-dimensional random vector. The aim of this chapter is to provide a toolbox – stochastic approximation – based on simulation to solve these optimization or zero search problems. It can be seen as an extension of the Monte Carlo method.
Stochastic approximation can be presented as a probabilistic extension of Newton-Raphson like
zero search recursive procedures of the form

∀ n ≥ 0, yn+1 = yn − γn+1 h(yn ) (0 < γn ≤ γ0 ) (6.1)

where h : Rd → Rd is a continuous vector field satisfying a sub-linear growth assumption at infinity.


Under some appropriate mean-reverting assumptions, one shows that such a procedure is bounded and eventually converges to a zero y_* of h. As an example, if one sets γ_n = (Dh(y_{n−1}))^{−1}, the above recursion is but the regular Newton-Raphson procedure for zero search of the function h (one can also set γ_n = 1 and replace h by (Dh)^{−1} ◦ h).
In one dimension, mean-reversion may be obtained by a non-decreasing assumption made on the function h or, more simply, by assuming that h(y)(y − y_*) > 0 for every y ≠ y_*: if so, y_n decreases as long as y_n > y_* and increases whenever y_n < y_*. In higher dimension, this assumption becomes (h(y) | y − y_*) > 0, y ≠ y_*, and will be extensively called upon further on. More generally, mean-reversion may follow from a monotony assumption on h (in one or higher dimension) and, most generally, from the existence of a so-called Lyapunov function. To introduce this notion, let us make a (light) connection with Ordinary Differential Equations (ODE).


Let us make a connection with differential dynamical systems: when γ_n = γ > 0, Equation (6.1) is but the Euler scheme with step γ > 0 of the ODE

ODE_h ≡ ẏ = −h(y).

A Lyapunov function for ODE_h is a function L : R^d → R_+ such that, for any solution t ↦ y(t) of the equation, t ↦ L(y(t)) is non-increasing as t increases. If L is differentiable, this is mainly equivalent to the condition (∇L|h) ≥ 0 since

(d/dt) L(y(t)) = (∇L(y(t)) | ẏ(t)) = −(∇L|h)(y(t)).

If such a Lyapunov function does exist (which is not always the case!), the system is said to be dissipative.
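This Euler-scheme viewpoint can be checked on the simplest dissipative example (our own illustration, with h(y) = y and L(y) = y², so that the ODE solution is y_0 e^{−t}): the constant-step recursion decreases L at each iteration and stays within O(γ) of the exact flow.

```python
import math

h = lambda y: y        # the ODE y' = -h(y) then has solution y(t) = y0 * exp(-t)
L = lambda y: y * y    # a Lyapunov function: (L'(y) | h(y)) = 2*y*y >= 0

gamma, n_steps = 0.01, 100
traj = [1.0]
for _ in range(n_steps):                     # Euler scheme with constant step gamma
    traj.append(traj[-1] - gamma * h(traj[-1]))
print(traj[-1], math.exp(-gamma * n_steps))  # ~0.3660 vs ~0.3679: O(gamma) apart
```

Letting the step decrease to 0 along the iterations, as in the stochastic procedures below, removes this O(γ) discretization bias asymptotically.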
Basically one meets two frameworks:
– the function L is identified a priori, it is the object of interest for optimization purpose, and
one designs a function h from L e.g. by setting h = ∇L (or possibly h proportional to ∇L).
– The function of interest is h and one has to search for a Lyapunov function L (which may not
exist). This usually requires a deep understanding of the problem from a dynamical point of view.

This duality also occurs in discrete time Stochastic Approximation Theory from its very beginning in the early 1950s (see [142, 82]).
As concerns the constraints on the Lyapunov function, due to the specificities of the discrete time setting, we will require some further regularity assumptions on ∇L: typically, ∇L is Lipschitz continuous and |∇L|² ≤ C(1 + L) (“essentially quadratic” property).

 Exercises. 1. Show that if a function h : Rd → Rd is non-decreasing in the following sense

∀ x, y ∈ Rd , (h(y) − h(x)|y − x) ≥ 0

and if h(y∗ ) = 0 then L(y) = |y − y∗ |2 is a Lyapunov function for ODEh .


2. Assume furthermore that {h = 0} = {y∗ } and that h satisfies a sub-linear growth assumption:
|h(y)| ≤ C(1 + |y|), y ∈ Rd . Show that the sequence (yn )n≥0 defined by (6.1) converges toward y∗ .

Now imagine that no straightforward access to numerical values of h(y) is available but that h has an integral representation with respect to an R^q-valued random vector Z, say

h(y) = E H(y, Z),   H : R^d × R^q → R^d Borel,   Z ∼ µ,    (6.2)

(satisfying E |H(y, Z)| < +∞ for every y ∈ R^d). Assume that


– H(y, z) is easy to compute for any couple (y, z)
– the distribution µ of Z can be simulated at a reasonable cost.
One idea can be to simply “randomize” the above zero search procedure (6.1) by using at each
iterate a Monte Carlo simulation to approximate h(yn ).
A more sophisticated idea is to try doing both simultaneously: on the one hand, use H(y_n, Z_{n+1}) instead of h(y_n) and, on the other hand, let the step γ_n go to 0 to asymptotically smoothen the chaotic (stochastic. . . ) effect induced by this “local” randomization. However, one should not make γ_n go to 0 too fast, so that an averaging effect occurs like in the Monte Carlo method. In fact, one should impose that ∑_n γ_n = +∞ to ensure that the initial value of the procedure will be forgotten.
Based on this heuristic analysis, we can reasonably hope that the recursive procedure

∀n ≥ 0, Yn+1 = Yn − γn+1 H(Yn , Zn+1 ) (6.3)

where
(Zn )n≥1 is an i.i.d. sequence with distribution µ defined on (Ω, A, P)
and Y0 is an Rd -valued random vector (independent of the sequence (Zn )n≥1 ) defined on the same
probability space also converges to a zero y∗ of h at least under appropriate assumptions, to be
specified further on, on both H and the gain sequence γ = (γn )n≥1 .
What precedes can be seen as the “meta-theorem” of stochastic approximation. In this frame-
work, the Lyapunov functions mentioned above are called upon to ensure the stability of the
procedure.
 A first toy-example: the Strong Law of Large Numbers. As a first example, note that the sequence of empirical means (Z̄_n)_{n≥1} of an i.i.d. sequence (Z_n)_{n≥1} of integrable random variables satisfies

Z̄_{n+1} = Z̄_n − (1/(n+1)) (Z̄_n − Z_{n+1}),  n ≥ 0,  Z̄_0 = 0,

i.e. a stochastic approximation procedure with H(y, z) = y − z and h(y) := y − z_* with z_* = E Z_1 (so that Y_n = Z̄_n). Then the procedure converges a.s. (and in L¹) to the unique zero z_* of h.
The (weak) rate of convergence of (Z̄_n)_{n≥1} is ruled by the CLT, which may suggest that the generic rate of convergence of this kind of procedure is of the same type. In particular, the deterministic counterpart with the same gain parameter, y_{n+1} = y_n − (1/(n+1))(y_n − z_*), converges at a 1/n-rate to z_* (and this is clearly not the optimal choice for γ_n in this deterministic framework).
However, if we do not know the value of the mean z_* = E Z_1 but are able to simulate µ-distributed random vectors, the first recursive stochastic procedure can be easily implemented whereas the deterministic one cannot. The stochastic procedure we are speaking about is simply the regular Monte Carlo method!
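The identification with the empirical mean is exact: iterating the recursion reproduces the running average term by term, as the following short check (our own illustration) confirms.

```python
import random

random.seed(0)
zs = [random.gauss(2.0, 1.0) for _ in range(10_000)]   # i.i.d. sample with E Z = 2

ybar = 0.0
for n, z in enumerate(zs):        # Ybar_{n+1} = Ybar_n - (Ybar_n - Z_{n+1})/(n+1)
    ybar -= (ybar - z) / (n + 1)

emp_mean = sum(zs) / len(zs)
print(ybar, emp_mean)             # identical up to rounding, both close to E Z = 2
```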

 A second toy-model: extracting implicit volatility in a Black-Scholes model. A second toy-example


is the extraction of implicit volatility in a Black-Scholes model for a vanilla Call or Put option. In
practice it is carried out by a deterministic Newton procedure (see e.g. [109]) since closed forms
are available for both the premium and the vega of the option. But let us forget about that for a
few lines to illustrate the basic principle of Stochastic Approximation. Let x, K, T ∈ (0, +∞), let r ∈ R and set, for every σ ∈ R,

X_t^{x,σ} = x e^{(r − σ²/2)t + σ W_t},  t ≥ 0.

We know that σ ↦ Put_BS(x, K, σ, r, T) = e^{−rT} E(K − X_T^{x,σ})_+ is an even function, increasing on (0, +∞), continuous, with lim_{σ→0} Put_BS(x, K, σ, r, T) = (e^{−rT}K − x)_+ and lim_{σ→+∞} Put_BS(x, K, σ, r, T) = e^{−rT}K (these bounds are model-free and can be directly derived by arbitrage arguments). Let

Pmarket (x, K, r, T ) ∈ [(e−rT K − x)+ , e−rT K] be a consistent mark-to-market price for the Put op-
tion with maturity T and strike price K. Then the implied volatility σimpl := σimpl (x, K, r, T ) is
defined as the unique (positive) solution to the equation
P utBS (x, K, σimpl , r, T ) = Pmarket (x, K, r, T )
i.e.
E e−rT (K − XTx,σ )+ − Pmarket (x, K, r, T ) = 0.


This naturally leads to devising the following stochastic algorithm to solve this equation numerically:

σ_{n+1} = σ_n − γ_{n+1} ( e^{−rT} (K − x e^{(r − σ_n²/2)T + σ_n √T Z_{n+1}})_+ − P_market(x, K, r, T) )

where (Z_n)_{n≥1} is an i.i.d. sequence of N(0; 1)-distributed random variables and the step sequence γ = (γ_n)_{n≥1} is e.g. given by γ_n = c/n for some parameter c > 0. After a necessary “tuning” of the constant c (try c = 2/(x + K)), one observes that

σ_n −→ σ_impl  a.s. as n → +∞.

A priori one could imagine that σ_n could converge toward −σ_impl (which would not be a real problem) but this a.s. never happens because this negative solution is repulsive for the related ODE_h and “noisy”. This is an important topic often known in the literature as “how stochastic algorithms never fall into noisy traps” (see [29, 96, 136], etc.).
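A possible implementation of this toy algorithm is sketched below (our own illustration, with several assumptions flagged in the comments): the “market” quote is fabricated from the Black-Scholes closed form with σ = 0.2; the step sequence γ_n = c n^{−2/3} is an ad-hoc variant of c/n that still satisfies the decreasing step assumption (6.7) and makes the toy example more robust; and the iterate is projected onto (0, +∞) as a standard safeguard against the repulsive negative zero (neither choice comes from the text's display).

```python
import math, random

def bs_put(x, K, r, T, sigma):
    # closed-form Black-Scholes put, used here only to fabricate a market quote
    N = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
    d1 = (math.log(x / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return K * math.exp(-r * T) * N(-d2) - x * N(-d1)

def implied_vol_sa(x, K, r, T, p_market, n_iter=200_000, sigma0=0.5, seed=123):
    random.seed(seed)
    c = 2.0 / (x + K)                      # the tuning suggested in the text
    sigma = sigma0
    for n in range(1, n_iter + 1):
        z = random.gauss(0.0, 1.0)
        payoff = max(K - x * math.exp((r - 0.5 * sigma ** 2) * T
                                      + sigma * math.sqrt(T) * z), 0.0)
        gamma = c / n ** (2.0 / 3.0)       # ad-hoc step, satisfies (6.7)
        # projection on (0, +inf): rules out the repulsive negative solution
        sigma = max(sigma - gamma * (math.exp(-r * T) * payoff - p_market), 1e-3)
    return sigma

p_mkt = bs_put(100.0, 100.0, 0.03, 1.0, 0.2)            # quote with sigma = 0.2
print(implied_vol_sa(100.0, 100.0, 0.03, 1.0, p_mkt))   # close to 0.2
```

In practice a deterministic Newton procedure on the closed-form premium is of course preferable, as the text notes; this sketch only illustrates that the noisy recursion does locate σ_impl.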

To conclude this introductory section, let us briefly come back to the case where h = ∇L and L(y) = E(Λ(y, Z)), so that ∇L(y) = E H(y, Z) with H(y, z) := ∂Λ(y, z)/∂y. The function H is sometimes called a local gradient (of L) and the procedure (6.3) is known as a stochastic gradient procedure. When Y_n converges to some zero y_* of h = ∇L at which the algorithm is “noisy enough” – say, e.g., E(H(y_*, Z) H(y_*, Z)^*) is a positive definite symmetric matrix – then y_* is necessarily a local minimum of the potential L: y_* cannot be a trap. So, if L is strictly convex and lim_{|y|→+∞} L(y) = +∞, ∇L has a single zero y_* which is but the global minimum of L: the stochastic gradient turns out to be a minimization procedure.
However, most recursive stochastic algorithms (6.3) are not stochastic gradients and the Lya-
punov function, if any, is not naturally associated to the algorithm: finding a Lyapunov function
to “stabilize” the algorithm (by bounding a.s. its paths, see Robbins-Siegmund Lemma below) is
often a difficult task which requires a deep understanding of the related ODEh .
As concerns the rate of convergence, one must keep in mind that it is usually ruled by a CLT at a 1/√γ_n rate, which can reach at most the √n-rate of the regular CLT. So, such a “toolbox” is clearly not competitive compared to a deterministic procedure when available, but this rate should be compared to that of the Monte Carlo method (i.e. the SLLN) since their fields of application are similar: stochastic approximation is the natural extension of the Monte Carlo method to solve inverse or optimization problems related to functions having a representation as an expectation of simulatable random functionals.
Recently, several contributions (see [4, 101, 102]) drew the attention of the quants world to stochastic approximation as a tool for variance reduction, implicitation of parameters, model calibration, risk management. . . It is also used in other fields of finance, like algorithmic trading, as an on-line optimizing device for the execution of orders (see e.g. [94]). We will briefly discuss several (toy-)examples of application.

6.2 Typical a.s. convergence results


Stochastic approximation theory provides various theorems that guarantee the a.s. and/or L^p-convergence of stochastic approximation procedures. We provide below a general (multi-dimensional) preliminary result known as the Robbins-Siegmund Lemma, from which the main convergence results will be easily derived (although, strictly speaking, the lemma does not provide any a.s. convergence result).
In what follows, the function H and the sequence (Z_n)_{n≥1} are defined by (6.2) and h is the vector field from R^d to R^d defined by h(y) = E H(y, Z_1).

Theorem 6.1 (Robbins-Siegmund Lemma) Let h : R^d → R^d and H : R^d × R^q → R^d satisfy (6.2). Suppose that there exists a continuously differentiable function L : R^d → R_+ satisfying

∇L is Lipschitz continuous and |∇L|² ≤ C(1 + L)    (6.4)

such that h satisfies the mean-reverting assumption

(∇L|h) ≥ 0.    (6.5)

Furthermore, suppose that H satisfies the following (pseudo-)linear growth assumption

∀ y ∈ R^d,  ‖H(y, Z)‖_2 ≤ C √(1 + L(y))    (6.6)

(which implies |h| ≤ C √(1 + L)).
Let γ = (γ_n)_{n≥1} be a sequence of positive real numbers satisfying the so-called decreasing step assumption

∑_{n≥1} γ_n = +∞  and  ∑_{n≥1} γ_n² < +∞.    (6.7)

Finally, assume that Y_0 is independent of (Z_n)_{n≥1} and E L(Y_0) < +∞.
Then, the recursive procedure defined by (6.3) satisfies the following four properties:

(i) Y_n − Y_{n−1} −→ 0 a.s. and in L²(P) as n → +∞,
(ii) the sequence (L(Y_n))_{n≥0} is L¹(P)-bounded,
(iii) L(Y_n) −→ L_∞ ∈ L¹(P) a.s. as n → +∞,
(iv) ∑_{n≥1} γ_n (∇L|h)(Y_{n−1}) < +∞ a.s.

Remarks and terminology. • The sequence (γ_n)_{n≥1} is called a step sequence or a gain parameter sequence.
• If the function L satisfies (6.4), (6.5), (6.6) and moreover lim_{|y|→+∞} L(y) = +∞, then L is called a Lyapunov function of the system, like in Ordinary Differential Equation Theory.
• Note that Assumption (6.4) on L implies that ∇√(1 + L) is bounded. Hence √L has at most a linear growth, so that L itself has at most a quadratic growth.
• In spite of the standard terminology, the step sequence does not need to be decreasing in Assumption (6.7).
• A careful reading of the proof below shows that the assumption ∑_{n≥1} γ_n = +∞ is not needed. However, we leave it in the statement because it is dramatically useful for any application of this Lemma since it implies, combined with (iv), that

lim inf_n (∇L|h)(Y_{n−1}) = 0.

These assumptions are known as “Robbins-Siegmund assumptions”.


• When H(y, z) := h(y) (i.e. the procedure is noiseless), the above theorem provides a convergence
result for the original deterministic procedure (6.1).

The key of the proof is the following convergence theorem for non negative super-martingales
(see [116]).

Theorem 6.2 Let (Sn )n≥0 be a non-negative super-martingale with respect to a filtration (Fn )n≥0
on a probability space (Ω, A, P) (i.e. for every n ≥ 0, Sn ∈ L1 (P) and E(Sn+1 |Fn ) ≤ Sn a.s.) then,
Sn converges P-a.s. to an integrable (non-negative) random variable S∞ .

For general convergence theorems for sub-, super- and true martingales we refer to any standard
course on Probability Theory or, preferably to [116].

Proof. Set F_n := σ(Y_0, Z_1, . . . , Z_n), n ≥ 1, and, for notational convenience, ∆Y_n := Y_n − Y_{n−1}, n ≥ 1. It follows from the fundamental formula of calculus that there exists ξ_{n+1} ∈ (Y_n, Y_{n+1}) (geometric interval) such that

L(Y_{n+1}) = L(Y_n) + (∇L(ξ_{n+1}) | ∆Y_{n+1})
           ≤ L(Y_n) + (∇L(Y_n) | ∆Y_{n+1}) + ([∇L]_Lip / 2) |∆Y_{n+1}|²
           = L(Y_n) − γ_{n+1} (∇L(Y_n) | H(Y_n, Z_{n+1})) + ([∇L]_Lip / 2) γ_{n+1}² |H(Y_n, Z_{n+1})|²    (6.8)
           = L(Y_n) − γ_{n+1} (∇L(Y_n) | h(Y_n)) − γ_{n+1} (∇L(Y_n) | ∆M_{n+1}) + ([∇L]_Lip / 2) γ_{n+1}² |H(Y_n, Z_{n+1})|²    (6.9)

where

∆M_{n+1} = H(Y_n, Z_{n+1}) − h(Y_n).

We aim at showing that (∆M_{n+1})_{n≥0} is a sequence of (square integrable) F_n-martingale increments satisfying E(|∆M_{n+1}|² | F_n) ≤ C(1 + L(Y_n)) for an appropriate real constant C > 0.
First note that L(Y_n) ∈ L¹(P) and H(Y_n, Z_{n+1}) ∈ L²(P) for every n ≥ 0: this follows from (6.8) and an easy induction since

E |(∇L(Y_n) | H(Y_n, Z_{n+1}))| ≤ (1/2)(E |∇L(Y_n)|² + E |H(Y_n, Z_{n+1})|²) ≤ C(1 + E L(Y_n))

(where we used that |(a|b)| ≤ (1/2)(|a|² + |b|²), a, b ∈ R^d).
Now, Y_n being F_n-measurable and Z_{n+1} being independent of F_n,

E(H(Y_n, Z_{n+1}) | F_n) = E(H(y, Z_1))|_{y=Y_n} = h(Y_n).

Consequently E(∆M_{n+1} | F_n) = 0. The inequality for E(|∆M_{n+1}|² | F_n) follows from |∆M_{n+1}|² ≤ 2(|H(Y_n, Z_{n+1})|² + |h(Y_n)|²) and Assumption (6.6).

Now, one derives from Assumptions (6.7) and (6.9) that there exists a positive real constant C_L = C_{[∇L]_Lip} > 0 such that

S_n = ( L(Y_n) + ∑_{k=0}^{n−1} γ_{k+1} (∇L(Y_k) | h(Y_k)) + C_L ∑_{k≥n+1} γ_k² ) / ∏_{k=1}^{n} (1 + C_L γ_k²)

is a (non-negative) super-martingale with S_0 = L(Y_0) + C_L ∑_{k≥1} γ_k² ∈ L¹(P). This uses that (∇L|h) ≥ 0. Hence S_n is P-a.s. converging toward an integrable random variable S_∞. Consequently, using that ∑_{k≥n+1} γ_k² → 0, one gets

L(Y_n) + ∑_{k=0}^{n−1} γ_{k+1} (∇L|h)(Y_k) −→ S̃_∞ = S_∞ ∏_{n≥1} (1 + C_L γ_n²) ∈ L¹(P)  a.s.    (6.10)

The super-martingale (S_n)_{n≥0} being L¹(P)-bounded, one derives likewise that (L(Y_n))_{n≥0} is L¹-bounded since

L(Y_n) ≤ ( ∏_{k=1}^{n} (1 + C_L γ_k²) ) S_n,  n ≥ 0.

Now, a series with non-negative terms which is upper bounded by an (a.s.) converging sequence a.s. converges in R_+, so that

∑_{n≥0} γ_{n+1} (∇L|h)(Y_n) < +∞  P-a.s.

It follows from (6.10) that, P-a.s., L(Y_n) −→ L_∞ as n → +∞, which is integrable since (L(Y_n))_{n≥0} is L¹-bounded.
Finally,

∑_{n≥1} E |∆Y_n|² ≤ ∑_{n≥1} γ_n² E(|H(Y_{n−1}, Z_n)|²) ≤ C ∑_{n≥1} γ_n² (1 + E L(Y_{n−1})) < +∞

so that ∑_{n≥1} |∆Y_n|² < +∞ a.s., which yields Y_n − Y_{n−1} → 0 a.s. (and in L²(P)). ♦

Remark. The same argument which shows that (L(Y_n))_{n≥1} is L¹(P)-bounded shows that

∑_{n≥1} γ_n E (∇L|h)(Y_{n−1}) = E ( ∑_{n≥1} γ_n (∇L|h)(Y_{n−1}) ) < +∞.

Corollary 6.1 (a) Robbins-Monro algorithm. Assume that the mean function h of the algorithm is continuous and satisfies

∀ y ∈ R^d, y ≠ y_*,  (y − y_* | h(y)) > 0    (6.11)

(which implies that {h = 0} = {y_*}). Suppose furthermore that Y_0 ∈ L²(P) and that H satisfies

∀ y ∈ R^d,  ‖H(y, Z)‖_2 ≤ C(1 + |y|).

Assume that the step sequence (γ_n)_{n≥1} satisfies (6.7). Then

Y_n −→ y_*  a.s.

and in every L^p(P), p ∈ (0, 2).

(b) Stochastic gradient. Let L : R^d → R_+ be a differentiable function satisfying (6.4), lim_{|y|→+∞} L(y) = +∞ and {∇L = 0} = {y_*}. Assume that the mean function of the algorithm is given by h = ∇L, that H satisfies E |H(y, Z)|² ≤ C(1 + L(y)) and that L(Y_0) ∈ L¹(P). Assume that the step sequence (γ_n)_{n≥1} satisfies (6.7). Then L(y_*) = min_{R^d} L and

Y_n −→ y_*  a.s.

Moreover ∇L(Y_n) converges to 0 in every L^p(P), p ∈ (0, 2).

Proof. (a) Assumption (6.11) is but the mean-reverting assumption related to the Lyapunov function L(y) = |y − y_*|². The assumption on H is clearly the linear growth assumption (6.6) for this function L. Consequently, it follows from the above Robbins-Siegmund Lemma that

|Y_n − y_*|² −→ L_∞ ∈ L¹(P)  and  ∑_{n≥1} γ_n (h(Y_{n−1}) | Y_{n−1} − y_*) < +∞  P-a.s.

Furthermore, (|Y_n − y_*|²)_{n≥0} is L¹(P)-bounded.
Let ω ∈ Ω be such that |Y_n(ω) − y_*|² converges in R_+ and ∑_{n≥1} γ_n (Y_{n−1}(ω) − y_* | h(Y_{n−1}(ω))) < +∞. If lim inf_n (Y_{n−1}(ω) − y_* | h(Y_{n−1}(ω))) > 0, the convergence of the series would contradict ∑_{n≥1} γ_n = +∞; hence

lim inf_n (Y_{n−1}(ω) − y_* | h(Y_{n−1}(ω))) = 0

and there exists a subsequence (φ(n, ω))_n such that (Y_{φ(n,ω)}(ω) − y_* | h(Y_{φ(n,ω)}(ω))) → 0 as n → +∞. Now, since (Y_n(ω))_{n≥0} is bounded, up to one further extraction, one may assume that Y_{φ(n,ω)}(ω) → y_∞ = y_∞(ω). It follows by continuity of h that (y_∞ − y_* | h(y_∞)) = 0, which in turn implies that y_∞ = y_*. Now

lim_n |Y_n(ω) − y_*|² = lim_n |Y_{φ(n,ω)}(ω) − y_*|² = 0.

Finally, for every p ∈ (0, 2), (|Y_n − y_*|^p)_{n≥0} is L^{2/p}(P)-bounded, hence uniformly integrable. As a consequence, the a.s. convergence of |Y_n − y_*|^p to 0 also holds in L¹, i.e. Y_n → y_* in L^p(P).
(b) One may apply the Robbins-Siegmund Lemma with L as Lyapunov function since (h|∇L)(x) = |∇L(x)|² ≥ 0. The assumption on H is but the quadratic linear growth assumption (6.6). As a consequence,

L(Y_n) −→ L_∞ ∈ L¹(P)  and  ∑_{n≥1} γ_n |∇L(Y_{n−1})|² < +∞  P-a.s.
Let ω ∈ Ω be such that L(Y_n(ω)) converges in R_+, ∑_{n≥1} γ_n |∇L(Y_{n−1}(ω))|² < +∞ and Y_n(ω) − Y_{n−1}(ω) → 0. The same argument as above shows that

lim inf_n |∇L(Y_n(ω))|² = 0.

From the convergence of L(Y_n(ω)) toward L_∞(ω) and lim_{|y|→∞} L(y) = +∞, one derives the boundedness of (Y_n(ω))_{n≥0}. Then there exists a subsequence (φ(n, ω))_{n≥1} such that Y_{φ(n,ω)}(ω) → ỹ, lim_n ∇L(Y_{φ(n,ω)}(ω)) = 0 and L(Y_{φ(n,ω)}(ω)) → L_∞(ω).
Then ∇L(ỹ) = 0, which implies ỹ = y_* and L_∞(ω) = L(y_*). The function L being non-negative, differentiable and going to infinity at infinity, it attains its unique global minimum at y_*. In particular, {L = L(y_*)} = {∇L = 0} = {y_*}. Consequently, the only possible limiting value of the bounded sequence (Y_n(ω))_{n≥1} is y_*, i.e. Y_n(ω) converges toward y_*.
The L^p(P)-convergence to 0 of |∇L(Y_n)|, p ∈ (0, 2), follows from the same uniform integrability argument as in (a). ♦

 Exercises. 1. Show that Claim (a) remains true if one only assumes that

y ↦ (h(y) | y − y_*) is lower semi-continuous.

2. (Non-homogeneous L²-strong law of large numbers by stochastic approximation). Let (Z_n)_{n≥1} be an i.i.d. sequence of square integrable random vectors. Let (γ_n)_{n≥1} be a sequence of positive real numbers satisfying the decreasing step Assumption (6.7). Show that the recursive procedure defined by

Y_{n+1} = Y_n − γ_{n+1} (Y_n − Z_{n+1})

a.s. converges toward y_* = E Z_1.

The above settings are in fact special cases of a more general result, the so-called “pseudo-gradient” setting, stated below. However, its proof, in particular in a multi-dimensional setting, needs additional arguments, mainly the so-called ODE method (for Ordinary Differential Equation) originally introduced by Ljung (see [103]). The underlying idea is to consider that a stochastic algorithm is a perturbed Euler scheme with decreasing step of the ODE ẏ = −h(y). For a detailed proof we refer to classical textbooks on Stochastic Approximation like [21, 43, 89].

Theorem 6.3 (Pseudo-Stochastic Gradient) Assume that L, h and the step sequence (γ_n)_{n≥1} satisfy all the assumptions of the Robbins-Siegmund Lemma. Assume furthermore that

lim_{|y|→+∞} L(y) = +∞  and  (∇L|h) is lower semi-continuous.

Then, P(dω)-a.s., there exists ℓ = ℓ(ω) ≥ 0 and a connected component Y_∞(ω) of {(∇L|h) = 0} ∩ {L = ℓ} such that

dist(Y_n(ω), Y_∞(ω)) −→ 0  as n → +∞.

In particular, if for every ℓ ≥ 0, {(∇L|h) = 0} ∩ {L = ℓ} is locally finite (1), then P-a.s. there exists ℓ_∞ such that Y_n converges toward a point of {(∇L|h) = 0} ∩ {L = ℓ_∞}.

(1) By locally finite we mean “finite on every compact set”.

Proof (One-dimensional case). We consider ω ∈ Ω for which all the conclusions of Robbins-
Siegmund Lemma are true. Combining Yn (ω) − Yn−1 (ω) → 0 with the boundedness of the sequence
(Yn (ω))n≥1 , one can show that the set Y∞ (ω) of the limiting values of (Yn (ω))n≥0 is a connected
compact set (2 ).
On the other hand, Y∞ (ω) ⊂ {L = L∞ (ω)} since L(Yn (ω)) → L∞ (ω). Furthermore, reasoning
like in the proof of claim (b) of the above Corollary shows that there exists a limiting value y∗ ∈
Y∞ (ω) such that (∇L(y∗ )|h(y∗ )) = 0 so that y∗ ∈ {(∇L|h) = 0} ∩ {L = L∞ (ω)}.
At this stage, we do assume that d = 1. Either Y∞ = {y∗ } and the proof is complete. Or Y∞ (ω)
is an interval as a connected subset of R. The function L is constant on this interval, consequently
the derivative L0 is zero on Y∞ (ω). Hence the conclusion. ♦

6.3 Applications to Finance

6.3.1 Application to recursive variance reduction by importance sampling

This section was originally motivated by the paper [4] but follows the lines of [102], which provides an easier-to-implement solution. Assume we want to compute the expectation

E ϕ(Z) = ∫_{R^d} ϕ(z) e^{−|z|²/2} dz / (2π)^{d/2}

where ϕ : R^d → R is integrable with respect to the normalized Gaussian measure. In order to deal with a consistent problem, we assume throughout this section that

P(ϕ(Z) ≠ 0) > 0.

A typical example is provided by option pricing in a d-dimensional Black-Scholes model where, with the usual notations,

ϕ(z) = e^{−rT} φ( ( x_0^i e^{(r − σ_i²/2)T + σ_i √T (Az)^i} )_{1≤i≤d} ),   x_0 = (x_0^1, . . . , x_0^d) ∈ (0, +∞)^d,

where A is a lower triangular matrix such that the covariance matrix R = AA^* has diagonal entries equal to 1 and φ is a (non-negative) continuous payoff function. This can also be the Gaussian vector resulting from the Euler scheme of the diffusion dynamics of a risky asset, etc.

Variance reduction by mean translation: first approach (see [4]).

A standard change of variable (the Cameron-Martin formula, which can be seen here either as a degenerate version of the Girsanov change of probability or as an importance sampling procedure) leads to

E ϕ(Z) = e^{−|θ|²/2} E( ϕ(Z + θ) e^{−(θ|Z)} ).    (6.12)

(2) The method of proof is to first establish that Y_∞(ω) is a “bien enchaîné” set. A subset A ⊂ R^d is “bien enchaîné” if for every a, a′ ∈ A and every ε > 0, there exist p ∈ N^* and b_0, b_1, . . . , b_p ∈ A such that b_0 = a, b_p = a′ and |b_i − b_{i−1}| ≤ ε. Any connected set A is “bien enchaîné” and the converse is true if A is compact.

One natural way to optimize the computation by Monte Carlo simulation of the Premium is to
choose among the above representations depending on the parameter θ ∈ Rd the one with the lowest
variance. This means solving, at least roughly, the following minimization problem

min L(θ)
θ∈Rd

with  
2
L(θ) = e−|θ| E ϕ2 (Z + θ)e−2(Z|θ)
|θ|2
since Var(e− 2 ϕ(Z + θ)e−(θ|Z) ) = L(θ) − (E ϕ(Z))2 . A reverse change of variable shows that
|θ|2
 
L(θ) = e 2 E ϕ2 (Z)e−(Z|θ) . (6.13)
 
Hence, if E( ϕ(Z)² |Z| e^{a|Z|} ) < +∞ for every a ∈ (0, ∞), one can always differentiate L owing to Theorem 2.1(b), with

∇L(θ) = e^{|θ|²/2} E( ϕ²(Z) e^{−(θ|Z)} (θ − Z) ).    (6.14)

Rewriting Equation (6.13) as

L(θ) = E( ϕ²(Z) e^{−|Z|²/2} e^{|Z−θ|²/2} )

clearly shows that L is strictly convex, since θ ↦ e^{|θ−z|²/2} is strictly convex for every z ∈ R^d (and ϕ(Z) is not identically 0). Furthermore, Fatou's Lemma implies lim_{|θ|→+∞} L(θ) = +∞.
Consequently, L has a unique global minimum θ_*, which is also local and hence satisfies ∇L(θ_*) = 0.
We now prove the classical lemma which shows that if L is strictly convex then θ ↦ |θ − θ_*|² is a Lyapunov function (in the Robbins-Monro sense defined by (6.11)).

Lemma 6.1 (a) Let L : R^d → R_+ be a differentiable convex function. Then

∀ θ, θ′ ∈ R^d,  (∇L(θ) − ∇L(θ′) | θ − θ′) ≥ 0.

If, furthermore, L is strictly convex, the above inequality is strict provided θ ≠ θ′.

(b) If L is twice differentiable and D²L ≥ α I_d for some real constant α > 0 (in the sense that u^* D²L(θ) u ≥ α|u|² for every θ, u ∈ R^d), then lim_{|θ|→+∞} L(θ) = +∞ and, for every θ, θ′ ∈ R^d,

(i) (∇L(θ) − ∇L(θ′) | θ − θ′) ≥ α |θ − θ′|²,
(ii) L(θ′) ≥ L(θ) + (∇L(θ) | θ′ − θ) + (α/2) |θ′ − θ|².

Proof. (a) One introduces the differentiable function defined on the unit interval

    g(t) = L(θ + t(θ′ − θ)) − L(θ),   t ∈ [0, 1].

The function g is convex and differentiable. Hence its derivative

    g′(t) = (∇L(θ + t(θ′ − θ)) | θ′ − θ)

is non-decreasing, so that g′(1) ≥ g′(0), which yields the announced inequality. If L is strictly convex, then g′(1) > g′(0) (otherwise g′(t) ≡ 0, which would imply that L is affine on the geometric interval [θ, θ′]).
(b) The function g is twice differentiable under this assumption and g″(t) = (θ′ − θ)* D²L(θ + t(θ′ − θ)) (θ′ − θ) ≥ α|θ′ − θ|². One concludes by noting that g′(1) − g′(0) ≥ inf_{s∈[0,1]} g″(s), which yields (i). Moreover, noting that g(1) ≥ g(0) + g′(0) + ½ inf_{s∈[0,1]} g″(s) yields the inequality (ii). Finally, setting θ = 0 in (ii) yields

    L(θ′) ≥ L(0) + ½ α|θ′|² − |∇L(0)| |θ′| → +∞  as |θ′| → ∞.   ♦
This suggests (as noted in [4]) to consider the essentially quadratic function W defined by W(θ) := |θ − θ*|² as a Lyapunov function rather than L, which is usually not essentially quadratic: as soon as ϕ(z) ≥ ε₀ > 0, it is obvious that L(θ) ≥ ε₀² e^{|θ|²}, and this growth is also observed when ϕ is merely bounded away from zero outside a ball. Hence ∇L cannot be Lipschitz continuous either.
If one uses the representation (6.14) of ∇L as an expectation to design a stochastic gradient algorithm (i.e. considering the local gradient H(θ, z) := ϕ²(z) e^{−|z|²/2} ∂/∂θ (e^{|z−θ|²/2})), the "classical" convergence results like those established in Corollary 6.1 do not apply, mainly because the mean linear growth assumption (6.6) is not fulfilled by this choice for H. In fact, this "naive" procedure does explode at almost every implementation, as pointed out in [4]. This led Arouna to introduce in [4] some variants based on repeated re-initializations – the so-called projections "à la Chen" – to force the stabilization of the algorithm and subsequently prevent explosion. An alternative approach has already been explored in [5], where the authors change the function to be minimized, introducing a criterion based on entropy, which often turns out to be close to the original one.

An “unconstrained” approach based on a third change of variable (see [102]).


A new local gradient function: It is often possible to modify this local gradient H(θ, z) in order to apply the standard results mentioned above. We know – and already used above – that the Gaussian density is smooth, which is not the case of the payoff ϕ (at least in finance): to differentiate L, we already switched the parameter θ from ϕ to the Gaussian density by a change of variable (or equivalently an importance sampling procedure). At this stage, we face the converse problem: we know the behaviour of ϕ at infinity, whereas we cannot control efficiently the behaviour of e^{−(θ|Z)} inside the expectation as θ goes to infinity. The idea is now to cancel this exponential term by plugging θ back into the payoff ϕ.
The first step is to perform a new change of variable. Starting from (6.14), one gets

    ∇L(θ) = e^{|θ|²/2} ∫_{R^d} ϕ²(z) (θ − z) e^{−(θ|z) − |z|²/2} dz/(2π)^{d/2}
          = e^{|θ|²} ∫_{R^d} ϕ²(ζ − θ) (2θ − ζ) e^{−|ζ|²/2} dζ/(2π)^{d/2}   (z := ζ − θ)
          = e^{|θ|²} E[ϕ²(Z − θ)(2θ − Z)].

6.3. APPLICATIONS TO FINANCE 143

Consequently,

    ∇L(θ) = 0 ⟺ E[ϕ²(Z − θ)(2θ − Z)] = 0.

From now on, we assume that there exist two positive real constants a, C > 0 such that

    0 ≤ ϕ(z) ≤ C e^{(a/2)|z|},   z ∈ R^d.        (6.15)

Exercises. 1. Show that under this assumption, E[ϕ(Z)² |Z| e^{|θ||Z|}] < +∞ for every θ ∈ R^d, which implies that (6.14) holds true.

2. Show that in fact E[ϕ(Z)² |Z|^m e^{|θ||Z|}] < +∞ for every θ ∈ R^d and every m ≥ 1, which in turn implies that L is C^∞. In particular, show that for every θ ∈ R^d,

    D²L(θ) = e^{|θ|²/2} E[ϕ²(Z) e^{−(θ|Z)} (I_d + (θ − Z)(θ − Z)*)]

(keep in mind that * stands for transpose). Derive that

    D²L(θ) ≥ e^{|θ|²/2} E[ϕ²(Z) e^{−(θ|Z)}] I_d > 0

(in the sense of symmetric matrices), which proves again that L is strictly convex.
Taking this assumption into account, we set

    H_a(θ, z) = e^{−a(|θ|²+1)^{1/2}} ϕ(z − θ)² (2θ − z).        (6.16)

One checks that

    E|H_a(θ, Z)|² ≤ 2C⁴ e^{−2a(|θ|²+1)^{1/2}} E[e^{2a|Z| + 2a|θ|} (4|θ|² + |Z|²)]
                  ≤ 2C⁴ (4|θ|² E e^{2a|Z|} + E[e^{2a|Z|} |Z|²])
                  ≤ C′(1 + |θ|²)

since |Z| has a Laplace transform defined on the whole real line, which in turn implies E[e^{a|Z|} |Z|^m] < +∞ for every a, m > 0.


On the other hand, it follows that the derived mean function h_a is given by

    h_a(θ) = e^{−a(|θ|²+1)^{1/2}} E[ϕ(Z − θ)² (2θ − Z)].

Note that the function h_a satisfies

    h_a(θ) = e^{−a(|θ|²+1)^{1/2} − |θ|²} ∇L(θ)

so that h_a is continuous, (θ − θ* | h_a(θ)) > 0 for every θ ≠ θ* and {h_a = 0} = {θ*}.
Applying Corollary 6.1(a) (known as the Robbins-Monro Theorem), one derives that for any step sequence γ = (γ_n)_{n≥1} satisfying (6.7), the sequence (θ_n)_{n≥0} defined by

    ∀ n ≥ 0,   θ_{n+1} = θ_n − γ_{n+1} H_a(θ_n, Z_{n+1}),        (6.17)



where (Zn )n≥1 is an i.i.d. sequence with distribution N (0; Id ) and θ0 is an independent Rd -valued
random vector, both defined on a probability space (Ω, A, P), satisfies
    θ_n → θ*  a.s.  as n → ∞.
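A minimal one-dimensional sketch of recursion (6.17) follows; the payoff (a renormalized Black-Scholes call with T = 1, r = 0.1, σ = 0.5, x₀ = K = 1, as in the numerical illustration further below), the choice a = 2σ√T and the heuristic step γ_n = 20/(20 + n) are all illustrative assumptions, not prescriptions of the theory:

```python
import numpy as np

# Sketch of the Robbins-Monro recursion (6.17) with H_a from (6.16), d = 1.
# Payoff: renormalized B-S call phi(z) = (x e^{(r - s^2/2)T + s sqrt(T) z} - K)_+,
# which satisfies (6.15) with a = 2*s*sqrt(T) (an assumed choice).
rng = np.random.default_rng(123)
x, K, r, s, T = 1.0, 1.0, 0.10, 0.5, 1.0
a = 2 * s * np.sqrt(T)

def phi(z):
    return np.maximum(x * np.exp((r - s**2 / 2) * T + s * np.sqrt(T) * z) - K, 0.0)

def H_a(theta, z):  # local gradient, cf. (6.16)
    return np.exp(-a * np.sqrt(theta**2 + 1.0)) * phi(z - theta)**2 * (2 * theta - z)

theta = 1.0                        # starting point, as in the illustration below
for n in range(1, 200_001):
    gamma = 20.0 / (20.0 + n)      # heuristic step choice
    theta -= gamma * H_a(theta, rng.standard_normal())

# Variance criterion (6.13) estimated on a fresh sample: L(theta) should be
# markedly smaller than L(0) = E phi(Z)^2 near the optimizer theta* ≈ 1.5.
Z = rng.standard_normal(500_000)
L = lambda t: np.exp(t**2 / 2) * (phi(Z)**2 * np.exp(-t * Z)).mean()
```

The final check compares the estimated variance functional at the learned θ against θ = 0, i.e. the variance reduction actually achieved.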
Remarks. • The only reason for introducing (|θ|² + 1)^{1/2} is that this function behaves like |θ| (which can be successfully implemented in practice) but is also everywhere differentiable, which simplifies the discussion about the rate of convergence detailed further on.
• Note that no regularity assumption is made on the payoff ϕ.
• An alternative based on large deviation principle but which needs some regularity assumption on
the payoff ϕ is developed in [58]. See also [137].
• To prevent a possible “freezing” of the procedure, for example when the step sequence has
been misspecified or when the payoff function is too anisotropic, one can replace the above proce-
dure (6.17) by the following fully data-driven variant of the algorithm
    ∀ n ≥ 0,   θ̃_{n+1} = θ̃_n − γ_{n+1} H̃_a(θ̃_n, Z_{n+1}),   θ̃₀ = θ₀,        (6.18)

where

    H̃_a(θ, z) := ϕ(z − θ)²/(1 + ϕ(−θ)²) × (2θ − z).

This procedure also converges a.s. under a sub-multiplicativity assumption on the payoff function ϕ (see [102]).
• A final – and often crucial – trick to boost the convergence when dealing with rare events, as is often the case with importance sampling, is to "drive" a parameter from a "regular" value to the value that makes the event rare. Typically, when trying to reduce the variance of a deep-out-of-the-money Call option as in the numerical illustrations below, a strategy can be to implement the above algorithm with a slowly varying strike K_n which goes from K₀ = x₀ to the "target" strike K (see below) during the first iterations.

– About the rate of convergence: Assume that the step sequence has the parametric form γ_n = α/(β + n), n ≥ 1, and that Dh_a(θ*) is positive in the following sense: all the eigenvalues of Dh_a(θ*) have a positive real part. Then, the rate of convergence of θ_n toward θ* is ruled by a CLT (at rate √n) if and only if (see Section 6.4.3)

    α > 1/(2 ℜe(λ_{a,min})) > 0

where λ_{a,min} is the eigenvalue of Dh_a(θ*) with the lowest real part. Moreover, one can show that the (theoretical. . . ) best choice for α is α_opt := 1/ℜe(λ_{a,min}). (We give in Section 6.4.3 some insight about the asymptotic variance.) Let us focus now on Dh_a(θ*).
Starting from the expression

    h_a(θ) = e^{−a(|θ|²+1)^{1/2} − |θ|²} ∇L(θ)
           = e^{−a(|θ|²+1)^{1/2} − |θ|²/2} × E[ϕ(Z)² (θ − Z) e^{−(θ|Z)}]   (by (6.14))
           = g_a(θ) × E[ϕ(Z)² (θ − Z) e^{−(θ|Z)}],   where g_a(θ) := e^{−a(|θ|²+1)^{1/2} − |θ|²/2}.

Then

    Dh_a(θ) = g_a(θ) E[ϕ(Z)² e^{−(θ|Z)} (I_d + ZZ^t − θZ^t)] + e^{−|θ|²/2} ∇L(θ) ⊗ ∇g_a(θ)

(where u ⊗ v = [u_i v_j]_{i,j}). Using that ∇L(θ*) = h_a(θ*) = 0 (so that h_a(θ*)(θ*)^t = 0),

    Dh_a(θ*) = g_a(θ*) E[ϕ(Z)² e^{−(θ*|Z)} (I_d + (Z − θ*)(Z − θ*)^t)].

Hence Dh_a(θ*) is a positive definite symmetric matrix. Its lowest eigenvalue λ_{a,min} satisfies

    λ_{a,min} ≥ g_a(θ*) E[ϕ(Z)² e^{−(θ*|Z)}] > 0.

These computations show that if the behaviour of the payoff ϕ at infinity is mis-evaluated, this leads to a bad calibration of the algorithm. Indeed, if one considers two real numbers a, a′ satisfying (6.15) with 0 < a < a′, then one checks, with obvious notations, that

    1/(2λ_{a,min}) = (g_{a′}(θ*)/g_a(θ*)) × 1/(2λ_{a′,min}) = e^{(a−a′)(|θ*|²+1)^{1/2}} × 1/(2λ_{a′,min}) < 1/(2λ_{a′,min}).

So the condition on α is more stringent with a′ than it is with a. Of course, in practice, the user does not know these values (since she/he does not know the target θ*); however, she/he will be led to consider higher values of α than requested, which will have the effect of deteriorating the asymptotic variance (see again Section 6.4.3).

Implementation in the computation of E ϕ(Z).

At this stage, as for variance reduction by regression, we may follow two strategies to reduce the variance: "batch" or adaptive.

The batch strategy is the simplest and most elementary one.
Phase 1: One first computes a hopefully good approximation of the optimal variance reducer, which we will denote θ_{n₀} for a large enough n₀ (and which will remain fixed during the second phase, devoted to the computation of E ϕ(Z)).
Phase 2: As a second step, one implements a Monte Carlo simulation based on ϕ(Z + θ_{n₀}) e^{−(θ_{n₀}|Z) − |θ_{n₀}|²/2}, i.e.

    E ϕ(Z) = lim_M (1/M) Σ_{m=n₀+1}^{n₀+M} ϕ(Z_m + θ_{n₀}) e^{−(θ_{n₀}|Z_m) − |θ_{n₀}|²/2}

where (Z_m)_{m≥n₀+1} is an i.i.d. sequence of N(0; I_d)-distributed random vectors. This procedure satisfies a CLT with (conditional) variance L(θ_{n₀}) − (E ϕ(Z))² (given θ_{n₀}).
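As an illustration of Phase 2 alone, here is a hypothetical sketch in dimension d = 1: we take the renormalized call payoff of the numerical illustration further below and freeze θ_{n₀} at 1.5, close to the optimizer θ* ≈ 1.51 reported there:

```python
import numpy as np

# Phase 2 of the batch strategy, d = 1: plain MC versus the importance-sampled
# estimator phi(Z + theta) e^{-theta Z - theta^2/2} for a frozen theta (assumed
# here to be 1.5, near the optimizer of the at-the-money example below).
rng = np.random.default_rng(7)
x, K, r, s, T = 1.0, 1.0, 0.10, 0.5, 1.0
phi = lambda z: np.maximum(x * np.exp((r - s**2 / 2) * T + s * np.sqrt(T) * z) - K, 0.0)

theta = 1.5
Z = rng.standard_normal(400_000)
plain = phi(Z)
shifted = phi(Z + theta) * np.exp(-theta * Z - theta**2 / 2)

est_plain, sd_plain = plain.mean(), plain.std()
est_is, sd_is = shifted.mean(), shifted.std()
```

The two estimators target the same expectation; the importance-sampled one should display a visibly smaller empirical standard deviation.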

The adaptive strategy is to devise a procedure fully based on the simultaneous computation of the optimal variance reducer and of E ϕ(Z) from the same sequence (Z_n)_{n≥1} of i.i.d. N(0; I_d)-distributed random vectors used in (6.17). This approach was introduced in [4]. To be precise, this leads to devising the following adaptive estimator of E ϕ(Z):

    (1/M) Σ_{m=1}^{M} ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2},   M ≥ 1,        (6.19)

while the sequence (θ_m)_{m≥0} is obtained by iterating (6.17).

We will briefly prove that the above estimator is unbiased and convergent. Let F_m := σ(θ₀, Z₁, . . . , Z_m), m ≥ 0, be the filtration of the (whole) simulation process. Using that θ_m is F_m-measurable and Z_{m+1} is independent of F_m with the same distribution as Z, one derives classically that

    E[ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2} | F_{m−1}] = E[ϕ(Z + θ) e^{−(θ|Z) − |θ|²/2}]_{|θ=θ_{m−1}} = E ϕ(Z).

As a first consequence, the estimator defined by (6.19) is unbiased. Now let us define the (F_M)-martingale

    N_M := Σ_{m=1}^{M} (1/m) (ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2} − E ϕ(Z)) 1_{{|θ_{m−1}|≤m}}.

It is clear that

    E[(ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2})² | F_{m−1}] 1_{{|θ_{m−1}|≤m}} = L(θ_{m−1}) 1_{{|θ_{m−1}|≤m}} → L(θ*)  a.s.

as m → +∞, which in turn implies that (N_m)_{m≥1} has square integrable increments, so that N_m ∈ L²(P) for every m ∈ N* and

    ⟨N⟩_∞ ≤ (sup_m L(θ_m)) Σ_{m≥1} 1/m² < +∞  a.s.

Consequently (see Chapter 11, Proposition 11.4), N_M → N_∞, where N_∞ is an a.s. finite random variable. Finally, Kronecker's Lemma (see again Chapter 11, Lemma 11.1) implies

    (1/M) Σ_{m=1}^{M} (ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2} − E ϕ(Z)) 1_{{|θ_{m−1}|≤m}} → 0  as M → +∞.

Since θ_n → θ* a.s. as n → +∞, 1_{{|θ_{m−1}|≤m}} = 1 for large enough m, so that

    (1/M) Σ_{m=1}^{M} ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2} → E ϕ(Z)  as M → +∞.

One can show, using the CLT for triangular arrays of martingale increments (see [70] and Chapter 11, Theorem ??), that

    √M ((1/M) Σ_{m=1}^{M} ϕ(Z_m + θ_{m−1}) e^{−(θ_{m−1}|Z_m) − |θ_{m−1}|²/2} − E ϕ(Z)) → N(0, (σ*)²)  in distribution,

where (σ*)² = L(θ*) − (E ϕ(Z))² is the minimal variance.


As such, this second approach seems to perform better, owing to its asymptotically minimal variance. For practical use, the verdict is more balanced, and the batch approach turns out to be quite satisfactory.


Figure 6.1: B-S Vanilla Call option.T = 1, r = 0.10, σ = 0.5, X0 = 100, K = 100. Left:
convergence toward θ∗ (up to n = 10 000). Right: Monte Carlo simulation of size M = 106 ; dotted
line: θ = 0, solid line: θ = θ10 000 ≈ θ∗ .

Numerical illustrations. (a) At-the-money Black-Scholes setting. We consider a vanilla Call in a B-S model,

    X_T = x₀ e^{(r − σ²/2)T + σ√T Z},   Z ~ N(0; 1),

with the following parameters: T = 1, r = 0.10, σ = 0.5, x₀ = 100, K = 100. The Black-Scholes reference price of the vanilla Call is 23.93.
The recursive optimization of θ has been achieved by running the data-driven version (6.18) with a sample (Z_n)_n of size 10 000. A first renormalization has been made prior to the computation: we considered the equivalent problem (as far as variance reduction is concerned) where the starting value of the asset is 1 and the strike is the moneyness K/X₀. The procedure was initialized at θ₀ = 1. (Using (3.9) would have led to set θ₀ = −0.2.)
We did not try to optimize the choice of the step γ_n according to the results on the rate of convergence, but applied the heuristic rule that if the function H (here H_a) takes its (reasonable) values within a few units, then choosing γ_n = c × 20/(20 + n) with c ≈ 1 (say e.g. c ∈ [1/2, 2]) leads to a satisfactory performance of the algorithm.
The resulting value θ_{10 000} has been used in a standard Monte Carlo simulation of size M = 1 000 000 based on (6.12) and compared to a regular Monte Carlo simulation (i.e. with θ = 0). The numerical results are as follows:
– θ = 0: Confidence interval (95 %) = [23.92, 24.11] (pointwise estimate: 24.02).
– θ = θ_{10 000} ≈ 1.51: Confidence interval (95 %) = [23.919, 23.967] (pointwise estimate: 23.94).

The gain ratio in terms of standard deviations is 42.69/11.01 = 3.88 ≈ 4. This is observed on most simulations we made, although the convergence of θ_n may be more chaotic than what is observed in the figure (where the convergence is almost instantaneous). The behaviour of the optimization of θ and of the Monte Carlo simulations is depicted in Figure 6.1. The alternative original "parametrized" version of the algorithm (H_a(θ, z)) with a = 2σ√T yields quite similar results (when implemented with the same step and the same starting value).

Further comments: As developed in [124], all that precedes can be extended to non-Gaussian random vectors Z provided their distribution has a log-concave probability density p satisfying, for some positive ρ,

    log(p) + ρ| . |²  is convex.

One can also replace the mean translation by other importance sampling procedures, like those based on the Esscher transform. This has applications e.g. when Z = X_T is the value at time T of a process belonging to the family of subordinated (to Brownian motion) Lévy processes, i.e. of the form Z_t = W_{Y_t} where Y is an increasing Lévy process independent of the standard Brownian motion W (see [22, 147] for more insight on that topic).

6.3.2 Application to implicit correlation search

One considers a 2-dimensional B-S "toy" model as defined by (2.1), i.e. X_t⁰ = e^{rt} (riskless asset) and

    X_t^i = x₀^i e^{(r − σ_i²/2)t + σ_i W_t^i},   x₀^i > 0,   i = 1, 2,

for the two risky assets, where ⟨W¹, W²⟩_t = ρt, ρ ∈ [−1, 1] denotes the correlation between W¹ and W² (that is, the correlation between the yields of the risky assets X¹ and X²).
In this market, we consider a best-of call option characterized by its payoff

    (max(X_T¹, X_T²) − K)₊.


A market of such best-of calls is a market of the correlation ρ (the respective volatilities being
obtained from the markets of vanilla options on each asset as implicit volatilities). In this 2-
dimensional B-S setting there is a closed formula for the premium involving the bi-variate standard
normal distribution (see [77]), but what follows can be applied as soon as the asset dynamics – or
their time discretization – can be simulated (at a reasonable cost, as usual. . . ).
We will use a stochastic recursive procedure to solve the inverse problem in ρ

    P_BoC(x₀¹, x₀², K, σ₁, σ₂, r, ρ, T) = P_market   [Mark-to-market premium]        (6.20)

where

    P_BoC(x₀¹, x₀², K, σ₁, σ₂, r, ρ, T) := e^{−rT} E[(max(X_T¹, X_T²) − K)₊]
        = e^{−rT} E[(max(x₀¹ e^{μ₁T + σ₁√T Z¹}, x₀² e^{μ₂T + σ₂√T (ρZ¹ + √(1−ρ²) Z²)}) − K)₊]

where μ_i = r − σ_i²/2, i = 1, 2, and Z = (Z¹, Z²) ~ N(0; I₂).
We assume from now on that the mark-to-market premium Pmarket is consistent i.e. that Equa-
tion (6.20) in ρ has at least one solution, say ρ∗ . Once again, this is a toy example since in this
very setting, more efficient deterministic procedures can be called upon, based on the closed form
for the option premium. On the other hand, what we propose below is a universal approach.
The most convenient way to prevent edge effects due to the fact that ρ ∈ [−1, 1] is to use a
trigonometric parametrization of the correlation by setting

ρ = cos θ, θ ∈ R.

At this stage, note that

    √(1 − ρ²) Z² = |sin θ| Z² =_d sin θ Z²

since Z² =_d −Z². Consequently, as soon as ρ = cos θ,

    ρZ¹ + √(1 − ρ²) Z² =_d cos θ Z¹ + sin θ Z²

owing to the independence of Z¹ and Z².


This introduces an over-parametrization (inside [0, 2π]) since θ and 2π − θ yield the same solution, but this is not a significant problem for practical implementation: a more careful examination would show that one of these two equilibrium points is "repulsive" and the other one is "attractive". From now on, for convenience, we will just mention the dependence of the premium function on the variable θ, namely

    θ ↦ P_BoC(θ) := e^{−rT} E[(max(x₀¹ e^{μ₁T + σ₁√T Z¹}, x₀² e^{μ₂T + σ₂√T (cos θ Z¹ + sin θ Z²)}) − K)₊].

The function PBoC is a 2π-periodic continuous function. Extracting the implicit correlation from
the market amounts to solving

PBoC (θ) = Pmarket (with ρ = cos θ)

where P_market is the quoted premium of the option (mark-to-market). We need an additional assumption (which is in fact necessary with almost any zero search procedure): we assume that

    P_market ∈ (min_θ P_BoC(θ), max_θ P_BoC(θ)),

i.e. that the (consistent) value P_market is not an extremal value of P_BoC. So we are looking for a zero of the function h defined on R by

h(θ) = PBoC (θ) − Pmarket .

This function admits a representation as an expectation given by

h(θ) = E H(θ, Z)

where H : R × R² → R is defined for every θ ∈ R and every z = (z¹, z²) ∈ R² by

    H(θ, z) = e^{−rT} (max(x₀¹ e^{μ₁T + σ₁√T z¹}, x₀² e^{μ₂T + σ₂√T (z¹ cos θ + z² sin θ)}) − K)₊ − P_market

and Z = (Z¹, Z²) ~ N(0; I₂).

Proposition 6.1 Under the above assumptions on P_market and the function P_BoC and if, moreover, the equation P_BoC(θ) = P_market has finitely many solutions on [0, 2π], the stochastic zero search recursive procedure defined by

    θ_{n+1} = θ_n − γ_{n+1} H(θ_n, Z_{n+1}),   θ₀ ∈ R,   with (Z_n)_{n≥1} i.i.d. ~ N(0; I₂),

with a step sequence satisfying the decreasing step assumption (6.7), a.s. converges toward a solution θ* of P_BoC(θ) = P_market.

Proof. For every z ∈ R², θ ↦ H(θ, z) is continuous, 2π-periodic and dominated by a function g(z) such that g(Z) ∈ L²(P) (g is obtained by replacing z¹ cos θ + z² sin θ by |z¹| + |z²| in the above formula for H). One derives that the mean function h and θ ↦ E H²(θ, Z) are both continuous and 2π-periodic as well (hence bounded).
The main difficulty in applying the Robbins-Siegmund Lemma is to find out the appropriate Lyapunov function.
The quoted value P_market is not an extremum of the function P_BoC, hence ∫₀^{2π} h^±(θ) dθ > 0, where h^± := max(±h, 0). We consider θ₀, any (fixed) solution to the equation h(θ) = 0, and two real numbers β^± such that

    0 < β⁺ < (∫₀^{2π} h⁺(θ) dθ)/(∫₀^{2π} h⁻(θ) dθ) < β⁻

and we set, for every θ ∈ R,

    ℓ(θ) := h⁺(θ) − β⁺ h⁻(θ) 1_{{θ≥θ₀}} − β⁻ h⁻(θ) 1_{{θ≤θ₀}}.

The function ℓ is clearly continuous and is 2π-periodic "on the right" on [θ₀, +∞) and "on the left" on (−∞, θ₀]. In particular, it is a bounded function. Furthermore, owing to the definition of β^±,

    ∫_{θ₀}^{θ₀+2π} ℓ(θ) dθ > 0   and   ∫_{θ₀−2π}^{θ₀} ℓ(θ) dθ < 0

so that

    lim_{θ→±∞} ∫_{θ₀}^{θ} ℓ(u) du = +∞.

As a consequence, there exists a real constant C > 0 such that the function

    L(θ) = ∫₀^{θ} ℓ(u) du + C

is non-negative. Its derivative is given by L′ = ℓ, so that

    L′h ≥ (h⁺)² + β⁺(h⁻)² ≥ 0   and   {L′h = 0} = {h = 0}.

It remains to prove that L′ = ℓ is Lipschitz continuous. Calling upon the usual arguments to interchange expectation and differentiation (i.e. Theorem 2.1(b)), one shows that the function P_BoC is differentiable at every θ ∈ R \ 2πZ with

    P′_BoC(θ) = σ₂ √T E[1_{{X_T² > max(X_T¹, K)}} X_T² (cos(θ) Z² − sin(θ) Z¹)].

Furthermore,

    |P′_BoC(θ)| ≤ σ₂ √T E[x₀² e^{μ₂T + σ₂√T (|Z¹| + |Z²|)} (|Z²| + |Z¹|)] < +∞

so that P_BoC is clearly Lipschitz continuous on the interval [0, 2π], hence on the whole real line by periodicity. Consequently, h and h^± are Lipschitz continuous, which implies in turn that ℓ is Lipschitz continuous as well.


Figure 6.2: B-S Best-of-Call option. T = 1, r = 0.10, σ₁ = σ₂ = 0.30, X₀¹ = X₀² = 100, K = 100. Left: convergence of θ_n toward a θ* (up to n = 100 000). Right: convergence of ρ_n := cos(θ_n) toward −0.5.

Moreover, one can show that the equation PBoC (θ) = Pmarket has finitely many solutions on
every interval of length 2π.
One may apply Theorem 6.3 (for which we provide a self-contained proof in one dimension) to
derive that θn will converge toward a solution θ∗ of the equation PBoC (θ) = Pmarket . ♦
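The procedure of Proposition 6.1 can be sketched directly on the numerical experiment reported below (parameters x₀¹ = x₀² = 100, r = 0.1, σ₁ = σ₂ = 0.3, T = 1, K = 100, quoted price 30.75, step γ_n = 0.5/n and θ₀ = 0 are taken from the text; the iteration count and seed are arbitrary):

```python
import numpy as np

# Stochastic zero search for the implied correlation of a best-of call:
# theta_{n+1} = theta_n - gamma_{n+1} H(theta_n, Z_{n+1}), rho = cos(theta).
rng = np.random.default_rng(3)
x1 = x2 = 100.0; r = 0.10; s1 = s2 = 0.30; T = 1.0; K = 100.0
P_market = 30.75                      # quoted price, consistent with rho = -0.5
mu1, mu2 = r - s1**2 / 2, r - s2**2 / 2

def H(theta, z1, z2):
    X1 = x1 * np.exp(mu1 * T + s1 * np.sqrt(T) * z1)
    X2 = x2 * np.exp(mu2 * T + s2 * np.sqrt(T) * (np.cos(theta) * z1 + np.sin(theta) * z2))
    return np.exp(-r * T) * max(max(X1, X2) - K, 0.0) - P_market

theta = 0.0
for n in range(1, 200_001):
    z1, z2 = rng.standard_normal(2)
    theta -= (0.5 / n) * H(theta, z1, z2)

rho = np.cos(theta)   # should approach the target correlation -0.5
```

Since P_BoC depends on θ only through cos θ, all attractive equilibria share the same cosine, so testing ρ_n = cos θ_n (rather than θ_n itself) is the meaningful convergence check.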

Exercises. 1. Show that if x₀¹ ≠ x₀² or σ₁ ≠ σ₂, then P_BoC is continuously differentiable on the whole real line.
2. Extend what precedes to any payoff ϕ(X_T¹, X_T²) where ϕ : R²₊ → R₊ is a Lipschitz continuous function. In particular, show without the help of differentiation that the corresponding function θ ↦ P(θ) is Lipschitz continuous.

Numerical experiment. We set the model parameters to the following values:

    x₀¹ = x₀² = 100,   r = 0.10,   σ₁ = σ₂ = 0.30,   ρ = −0.50

and the payoff parameters

    T = 1,   K = 100.

The reference "Black-Scholes" price 30.75 is used as a market price, so that the target of the stochastic algorithm is θ* ∈ arccos(−0.5). The stochastic approximation procedure parameters are

    θ₀ = 0,   n = 10⁵,   γ_n = 0.5/n.

The choice of θ₀ is "blind" on purpose. No re-scaling of the procedure has been made in the example below.

        n    | ρ_n := cos(θ_n)
    ---------|----------------
       1000  |  −0.5606
      10000  |  −0.5429
      25000  |  −0.5197
      50000  |  −0.5305
      75000  |  −0.4929
     100000  |  −0.4952

Exercise (another toy example: extracting the B-S implied volatility by stochastic approximation). Devise a similar procedure to compute the implied volatility in a standard Black-Scholes model (starting at x > 0 at t = 0, with interest rate r and maturity T).
(a) Show that the B-S premium C_BS(σ) is even, increasing on [0, +∞) and continuous as a function of the volatility. Show that lim_{σ→0} C_BS(σ) = (x − e^{−rT}K)₊ and lim_{σ→+∞} C_BS(σ) = x.
(b) Derive from (a) that for any market price P_market ∈ [(x − e^{−rT}K)₊, x] there is a unique B-S implied volatility for this price.
(c) Consider, for every σ ∈ R,

    H(σ, z) = χ(σ) ((x e^{−σ²T/2 + σ√T z} − K e^{−rT})₊ − P_market)

where χ(σ) = (1 + |σ|) e^{−σ²T/2}. Justify carefully this choice for H and implement the algorithm with x = K = 100, r = 0.1 and a market price equal to 16.73. Choose the step parameter of the form γ_n = (c/x) × 1/n with c ∈ [0.5, 2] (this is simply a suggestion).

Warning. The above exercise is definitely a toy exercise! More efficient methods for extracting
standard implied volatility are available (see e.g. [109] which is based on a Newton algorithm; a
dichotomy approach is also very efficient).
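For comparison, the dichotomy approach mentioned in the warning takes only a few lines, since the closed-form B-S call price is standard and increasing in σ (the target price 16.73 and the parameters are those of the exercise above):

```python
import math

# Dichotomy (bisection) search for the B-S implied volatility, using the
# closed-form call price; sigma -> C_BS(sigma) is increasing on (0, +infinity).
def bs_call(x, K, r, T, sigma):
    d1 = (math.log(x / K) + (r + sigma**2 / 2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    Phi = lambda u: 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))  # standard normal cdf
    return x * Phi(d1) - K * math.exp(-r * T) * Phi(d2)

def implied_vol(price, x, K, r, T, lo=1e-6, hi=3.0, tol=1e-8):
    # bisection on the monotone map sigma -> bs_call(..., sigma)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bs_call(x, K, r, T, mid) < price:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

sigma_imp = implied_vol(16.73, x=100.0, K=100.0, r=0.1, T=1.0)
```

With these parameters, the price 16.73 corresponds to a volatility of about 0.30.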

Exercise (extension to more general asset dynamics). One considers now a couple of risky assets following two correlated local volatility models,

    dX_t^i = X_t^i (r dt + σ_i(X_t) dW_t^i),   X₀^i = x^i > 0,   i = 1, 2,

where the functions σ_i : R²₊ → R₊ are Lipschitz continuous and bounded, and the Brownian motions W¹ and W² are correlated with correlation ρ ∈ [−1, 1], so that

    d⟨W¹, W²⟩_t = ρ dt.

(This ensures the existence and uniqueness of strong solutions for this SDE, see Chapter 7.)
Assume that we know how to simulate (X_T¹, X_T²), either exactly or at least as an approximation by an Euler scheme, from a d-dimensional normal vector Z = (Z¹, . . . , Z^d) ~ N(0; I_d).
Show that the above approach can be extended mutatis mutandis.

6.3.3 Application to correlation search (II): reducing variance using higher order Richardson-Romberg extrapolation

This section requires the reading of Section 7.7, in which we introduced the higher order Richardson-Romberg extrapolation (of order R = 3). We showed that considering consistent increments in the (three) Euler schemes made it possible to control the variance of the Richardson-Romberg estimator of E(f(X_T)). However, we mentioned, as emphasized in [123], that this choice cannot be shown to be optimal, unlike at the order R = 2 corresponding to the standard Richardson-Romberg extrapolation.

Stochastic approximation can be a way to search for the optimal correlation structure between the Brownian increments. For the sake of simplicity, we will start from the continuous time diffusion dynamics. Let

    dX_t^i = b(X_t^i) dt + σ(X_t^i) dB_t^i,   i = 1, 2, 3,

where B = (B¹, B², B³) is a correlated Brownian motion with (symmetric) correlation matrix R = [r_ij]_{1≤i,j≤3} (r_ii = 1). We consider the 3rd-order Richardson-Romberg weights α_i, i = 1, 2, 3. The expectation E f(X_T) is estimated by E[α₁ f(X̄_T¹) + α₂ f(X̄_T²) + α₃ f(X̄_T³)] using a Monte Carlo simulation. Our objective is then to minimize the variance of this estimator, i.e. to solve the problem

    min_R E[(α₁ f(X̄_T¹) + α₂ f(X̄_T²) + α₃ f(X̄_T³))²],

which converges asymptotically (e.g. if f is continuous) toward the problem

    min_R E[(α₁ f(X_T¹) + α₂ f(X_T²) + α₃ f(X_T³))²]        (6.21)

or equivalently, since E(f(X_T^i)²) does not depend on R, to

    min_R E[Σ_{1≤i<j≤3} α_i α_j f(X_T^i) f(X_T^j)].

The starting idea is to use the trigonometric representation of correlation matrices derived from the Cholesky decomposition (see e.g. [119]):

    R = T T*,   T lower triangular,

so that B = T W, where W is a standard 3-dimensional Brownian motion.
The entries t_ij of T satisfy Σ_{1≤j≤i} t_ij² = r_ii = 1, i = 1, 2, 3, so that one may write, for i = 2, 3,

    t_ij = cos(θ_ij) Π_{k=1}^{j−1} sin(θ_ik),   1 ≤ j ≤ i − 1,        t_ii = Π_{k=1}^{i−1} sin(θ_ik).

From now on, we will denote T = T(θ) where θ = [θ_ij]_{1≤j<i≤3} ∈ R³.
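For d = 3, this parametrization can be checked in a few lines (the ordering of the three angles follows the formulas above; their numeric values are arbitrary):

```python
import numpy as np

# Trigonometric parametrization of a 3x3 correlation matrix: build the lower
# triangular factor T(theta) from angles (theta_21, theta_31, theta_32) and
# check that R = T T* is a bona fide correlation matrix (unit diagonal, PSD).
def T_of(theta):
    t21, t31, t32 = theta
    T = np.zeros((3, 3))
    T[0, 0] = 1.0
    T[1, 0], T[1, 1] = np.cos(t21), np.sin(t21)        # row 2
    T[2, 0] = np.cos(t31)                              # row 3: j = 1
    T[2, 1] = np.cos(t32) * np.sin(t31)                #        j = 2
    T[2, 2] = np.sin(t31) * np.sin(t32)                #        j = i = 3
    return T

theta = np.array([0.8, 2.1, -0.4])   # arbitrary angles
R = T_of(theta) @ T_of(theta).T
```

Each row of T(θ) has unit norm by construction (telescoping cos²/sin² products), which is what guarantees r_ii = 1 for any θ ∈ R³.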


We assume that b and σ are smooth enough (say C_b^{1+α}, see [87], Theorem 3.1)³ to ensure the existence of the θ-sensitivity processes (∂X_T/∂θ_ij)_{1≤j<i≤3} satisfying the following linear stochastic differential system: for every i = 2, 3 and every j ∈ {1, . . . , i − 1},

    d(∂X_t^i(θ)/∂θ_ij) = b′_x(X_t^i(θ)) (∂X_t^i(θ)/∂θ_ij) dt + σ′_x(X_t^i(θ)) (∂X_t^i(θ)/∂θ_ij) (T(θ) dW_t)^i
                         + σ(X_t^i(θ)) Σ_{1≤j′≤i} (∂T_{ij′}(θ)/∂θ_ij) dW_t^{j′}.

In practice, at that stage, one switches to the Euler schemes X̄^i of the X^i, i = 1, 2, 3, with their respective steps. We proceed likewise with the θ-sensitivity processes (∂X_t^i/∂θ_ij)_{1≤j≤i−1}, whose Euler schemes are denoted (∂X̄_t^i/∂θ_ij)_{1≤j≤i−1}.

³ This means that the derivatives of the function are bounded and α-Hölder for some α > 0.

Suppose now that f : R → R is smooth enough (say differentiable, or only λ-a.e. differentiable if L(X̄_T) has a density). Then, setting Φ_f(x) = (Σ_{i=1}^{3} α_i f(x^i))², we define the potential function L : R³ → R by

    L(θ) := E(Φ_f(X̄_T(θ))).

Our aim is to minimize this potential function, or at least to exhibit values of θ such that

    L(θ) < L((0, 0, 0)),

i.e. to do better than considering Euler schemes with consistent increments. Theoretical results from [123], as well as empirical evidence, strongly suggest that such θ do exist.
The function L is differentiable and its gradient at θ is given by

    ∇L(θ) := E[∇Φ_f(X̄_T(θ))^t (∂X̄_T/∂θ_ij)_{1≤j<i≤3}].

Then, one devises the resulting stochastic gradient procedure (Θ_m)_{m≥0}, recursively defined by

    Θ_{m+1} = Θ_m − γ_{m+1} ∇Φ_f(X̄_T^{(m)}(Θ_m))^t (∂X̄_T^{(m)}/∂θ)(Θ_m),   Θ₀ ∈ (0, 2π)³,

where (X̄_T^{(m)}, ∂X̄_T^{(m)}/∂θ), m ≥ 1, are independent copies of the joint Euler scheme at time T, computed using the correlation structure induced by Θ_m.
By construction, the function L plays the role of a Lyapunov function for the algorithm (except for its behaviour at infinity. . . ).
Under natural assumptions, one shows that Θ_m a.s. converges toward a random vector Θ* which takes its values in the set of zeros of ∇L. Although the function L is bounded, one can prove this convergence by taking advantage of the periodicity of L, applying e.g. results from [50], and then relying on results on traps (see [29, 96], etc.) to discuss convergence within the subset of local minima of L. Hopefully, the procedure may converge to the global minimum of L, or at least induce a significant variance reduction with respect to the Richardson-Romberg extrapolation carried out with consistent Brownian increments (i.e. simulated from a unique underlying Brownian motion).

Numerical illustration: This is illustrated by the numerical experiments carried out by A. Grangé-Cabane in [68].
[In progress...]

6.3.4 The paradigm of model calibration by simulation

Let Θ ⊂ R^d be an open convex set of R^d. Let

    Y : (Θ × Ω, Bor(Θ) ⊗ A) → (R^p, Bor(R^p)),   (θ, ω) ↦ Y_θ(ω) = (Y_θ¹(ω), . . . , Y_θ^p(ω))

be a random vector representative of p payoffs, "re-centered" by their mark-to-market prices (see examples below). In particular, for every i ∈ {1, . . . , p}, E Y_θ^i is representative of the error between

the "theoretical" price obtained with parameter θ and the quoted price. To make the problem consistent, we assume throughout this section that

    ∀ θ ∈ Θ,   Y_θ ∈ L¹_{R^p}(Ω, A, P).

Let S ∈ S₊(p, R) ∩ GL(p, R) be a (positive definite) matrix. The resulting inner product is defined by

    ∀ u, v ∈ R^p,   ⟨u|v⟩_S := u* S v

and the associated Euclidean norm | . |_S by |u|_S := ⟨u|u⟩_S^{1/2}.
A natural choice for the matrix S can be a simple diagonal matrix S = Diag(w₁, . . . , w_p) with "weights" w_i > 0, i = 1, . . . , p.
The paradigm of model calibration is to find the parameter θ* that minimizes the "aggregated error" with respect to the | . |_S-norm. This leads to the following minimization problem

    (C) ≡ argmin_θ |E Y_θ|_S = argmin_θ ½ |E Y_θ|²_S.
Here are two simple examples to illustrate this somewhat abstract definition.

Examples. 1. Black-Scholes model. Let, for any x, σ > 0, r ∈ R,

    X_t^{x,σ} = x e^{(r − σ²/2)t + σW_t},   t ≥ 0,

where W is a standard Brownian motion. Then let (K_i, T_i)_{i=1,...,p} be p couples "maturity-strike price". Set

    θ := σ,   Θ := (0, ∞)

and

    Y_θ := (e^{−rT_i} (X_{T_i}^{x,σ} − K_i)₊ − P_market(T_i, K_i))_{i=1,...,p}

where P_market(T_i, K_i) is the mark-to-market price of the option with maturity T_i > 0 and strike price K_i.
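A toy version of criterion (C) in this Black-Scholes example can be scripted directly; everything below – the sample size, the weights, the maturities, and the grid search standing in for a genuine stochastic algorithm – is an illustrative assumption:

```python
import numpy as np

# Toy calibration, theta = sigma: "market" prices are generated at sigma_true,
# then the weighted criterion 0.5 * |E Y_sigma|_S^2 is minimized over a grid.
rng = np.random.default_rng(1)
x, r, sigma_true = 100.0, 0.10, 0.3
TK = [(1.0, 100.0), (2.0, 110.0)]            # (maturity, strike) pairs (assumed)
Z = rng.standard_normal((200_000, len(TK)))  # common random numbers for all sigma

def payoffs(sigma):
    # columns of discounted payoffs e^{-r T_i}(X_{T_i} - K_i)_+, one per (T_i, K_i)
    cols = []
    for j, (T, K) in enumerate(TK):
        XT = x * np.exp((r - sigma**2 / 2) * T + sigma * np.sqrt(T) * Z[:, j])
        cols.append(np.exp(-r * T) * np.maximum(XT - K, 0.0))
    return np.column_stack(cols)

P_market = payoffs(sigma_true).mean(axis=0)  # consistent quotes by construction
S = np.diag([1.0, 1.0])                      # weight matrix S = Diag(w_1, w_2)

def criterion(sigma):
    err = payoffs(sigma).mean(axis=0) - P_market   # MC estimate of E Y_sigma
    return 0.5 * err @ S @ err

grid = [0.2, 0.25, 0.3, 0.35, 0.4]
sigma_hat = min(grid, key=criterion)
```

Because the same Gaussian draws are reused for every σ (common random numbers), the criterion vanishes exactly at σ_true, so the grid search recovers it.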
2. Merton model (mini-krach). Now, for every x, σ, λ > 0, a ∈ (0, 1), set

    X_t^{x,σ,λ,a} = x e^{(r − σ²/2 + λa)t + σW_t} (1 − a)^{N_t},   t ≥ 0,

where W is as above and N = (N_t)_{t≥0} is a standard Poisson process with jump intensity λ. Set

    θ = (σ, λ, a),   Θ = (0, +∞)² × (0, 1),

and

    Y_θ = (e^{−rT_i} (X_{T_i}^{x,σ,λ,a} − K_i)₊ − P_market(T_i, K_i))_{i=1,...,p}.

We will also have to make simulability assumptions on Y_θ and, if necessary, on its derivatives with respect to θ (see below); otherwise our simulation-based approach would be meaningless.

At this stage, basically two approaches can be considered as far as solving this problem by
simulation is concerned:

• A Robbins-Siegmund zero search approach applied to ∇L, which needs to have access to a
representation of the gradient of L – supposed to exist – as an expectation.

• A more direct treatment based on the so-called Kiefer-Wolfowitz procedure, which is a kind of
counterpart of the Robbins-Siegmund approach based on a finite difference method (with
decreasing step) which does not require the existence of a representation of ∇L as an expec-
tation.

The Robbins-Siegmund approach
We make the following assumptions: for every θ0 ∈ Θ,

(i) P(dω)-a.s., θ ↦ Yθ(ω) is differentiable at θ0 with Jacobian ∂θ0Yθ(ω),

(ii) ∃ Uθ0, a neighbourhood of θ0 in Θ, such that ( (Yθ − Yθ0)/|θ − θ0| )_{θ∈Uθ0\{θ0}} is uniformly integrable.

One checks – using the exercise “Extension to uniform integrability” that follows Theorem 2.1 –
that θ 7−→ EYθ is differentiable and that its Jacobian is given by

∂θ E Yθ = E ∂θ Yθ .

Then, the function L is differentiable everywhere on Θ and its gradient (with respect to the canonical
Euclidean norm) is given by

∀ θ ∈ Θ, ∇L(θ) = (∂θ E Yθ)∗ S E Yθ = ( E ∂θYθ )∗ S E Yθ.
At this stage we need a representation of ∇L(θ) as an expectation. To this end, we construct for
every θ ∈ Θ, an independent copy Ỹθ of Yθ defined as follows: we consider the product probability
space (Ω², A⊗², P⊗²) and set, for every (ω, ω̃) ∈ Ω², Yθ(ω, ω̃) = Yθ(ω) (extension of Yθ to Ω², still
denoted Yθ) and Ỹθ(ω, ω̃) = Yθ(ω̃). It is straightforward by the product measure theorem that the
two families (Yθ )θ∈Θ and (Yeθ )θ∈Θ are independent with the same distribution. From now on we
will make the usual abuse of notations consisting in assuming that these two independent copies
live on the probability space (Ω, A, P).
Now, one can write

∀ θ ∈ Θ, ∇L(θ) = ( E ∂θYθ )∗ S E Ỹθ = E( (∂θYθ)∗ S Ỹθ ).

The standard situation, as announced above, is that Yθ is a vector of payoffs written on d traded
risky assets, recentered by their respective quoted prices. The model dynamics of the d risky assets
depends on the parameter θ, say

Yθ = (Fi (θ, XTi (θ)))i=1,...,p

where the price dynamics (X(θ)t )t≥0 , of the d traded assets is driven by a parametrized diffusion
process
dX(θ)t = b(θ, t, Xt (θ)) dt + σ(θ, t, Xt (θ)).dWt , X0 (θ) = x0 (θ) ∈ Rd ,

where W is an Rq-valued standard Brownian motion defined on a probability space (Ω, A, P), b is
an Rd -valued vector field defined on Θ × [0, T ] × Rd and σ is an M(d, q)-valued field defined on the
same product space, both satisfying appropriate regularity assumptions.
The pathwise differentiability of Yθ in θ needs that of Xt(θ) with respect to θ. This question
is closely related to the θ-tangent process ( ∂Xt(θ)/∂θ )_{t≥0} of X(θ). A precise statement is provided
in Section 9.2.2 which ensures that if b and σ are smooth enough with respect to the variable θ,
then such a θ-tangent process does exist and is solution to a linear SDE (involving X(θ) in its
coefficients).
Some differentiability properties are also required on the functions Fi in order to fulfill the above
differentiability Assumption (i). In model calibrations on vanilla derivative products as performed in
finance, Fi is never everywhere differentiable – typically Fi(y) := e^{−rT_i}(y − K_i)_+ − P_market(T_i, K_i) –
but, if Xt (θ) has an absolutely continuous distribution (i.e. a probability density) for every time
t > 0 and every θ ∈ Θ, then Fi only needs to be differentiable outside a Lebesgue negligible subset
of R+ . Finally, we can write formally

H(θ, W(ω)) := (∂θYθ(ω))∗ S Ỹθ(ω)

where W stands for an abstract random innovation taking values in an appropriate space. We
denote by the capital letter W the innovation because when the underlying dynamics is a Brow-
nian diffusion or its Euler-Maruyama scheme, it refers to a finite-dimensional functional of (two
independent copies of) the Rq -valued standard Brownian motion on an interval [0, T ]: either (two
independent copies of) (W_{T_1}, . . . , W_{T_p}) or (two independent copies of) the sequence (∆W_{kT/n})_{1≤k≤n}
of Brownian increments with step T/n over the interval [0, T]. Thus, these increments naturally
appear in the simulation of the Euler scheme (X̄^n_{kT/n}(θ))_{0≤k≤n} of the process (X(θ)_t)_{t∈[0,T]} when the
latter cannot be simulated directly (see Chapter 7, entirely devoted to the Euler scheme of Brownian
diffusions). Of course other situations may occur, especially when dealing with jump diffusions
where W usually becomes the increment process of the driving Lévy process.
At any rate, we make the following reasonable meta-assumption that
– the process W is simulatable,
– the functional H(θ, w) can be easily computed for any input (θ, w).
Then, one may define recursively the following zero search algorithm for ∇L(θ) = E H(θ, W ),
by setting
θn+1 = θn − γn+1 H(θn , W n+1 )
where (W n )n≥1 is an i.i.d. sequence of copies of W and (γn )n≥1 is a sequence of steps satisfying
the usual decreasing step assumptions
Σ_n γn = +∞ and Σ_n γn² < +∞.
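As an illustration, here is a minimal, self-contained sketch of this recursion on the one-dimensional Black-Scholes example (θ = σ, S = I_p). Everything below is a hedged toy setting, not the general procedure: the “market” quotes are synthetic, generated from a hypothetical target volatility σ* = 0.2; the Jacobian ∂θYθ is the pathwise vega; small i.i.d. batches (still an unbiased Robbins-Monro innovation) tame the variance of H; and a projection keeps θn inside Θ = (0, ∞). The step and batch-size values are illustrative choices, not tuned recommendations.

```python
import numpy as np

rng = np.random.default_rng(1)
x0, r = 100.0, 0.02
TK = [(0.5, 95.0), (1.0, 100.0), (1.5, 105.0)]     # the (T_i, K_i) instruments
sigma_star = 0.2                                   # hypothetical vol behind the "quotes"

def mc_price(sig, T, K, n=200_000):
    """Monte Carlo Black-Scholes call price, used to fabricate synthetic quotes."""
    w = np.sqrt(T) * rng.standard_normal(n)
    x = x0 * np.exp((r - 0.5 * sig**2) * T + sig * w)
    return np.exp(-r * T) * np.maximum(x - K, 0.0).mean()

quotes = [mc_price(sigma_star, T, K) for T, K in TK]

def H(sig, M=256):
    """Estimate of grad L(sigma): independent batches for Y~ and for dY/dsigma."""
    g = 0.0
    for (T, K), P in zip(TK, quotes):
        w1 = np.sqrt(T) * rng.standard_normal(M)   # innovation for the copy Y~
        w2 = np.sqrt(T) * rng.standard_normal(M)   # independent innovation for dY
        x1 = x0 * np.exp((r - 0.5 * sig**2) * T + sig * w1)
        x2 = x0 * np.exp((r - 0.5 * sig**2) * T + sig * w2)
        y_tilde = np.exp(-r * T) * np.maximum(x1 - K, 0.0) - P
        vega = np.exp(-r * T) * (x2 > K) * x2 * (w2 - sig * T)   # pathwise vega
        g += vega.mean() * y_tilde.mean()          # unbiased by independence
    return g

theta = 0.35                                       # initial guess theta_0
for n in range(1, 4001):
    theta = max(theta - 0.02 / (100.0 + n) * H(theta), 1e-3)   # projected step
```

With these (heuristic) choices θn drifts back toward the volatility that generated the quotes, which is exactly the calibration behaviour described above.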

In such a general framework, of course, one cannot ensure that the functions L and H will satisfy
the basic assumptions needed to make stochastic gradient algorithms converge, typically
∇L is Lipschitz continuous and ∀ θ ∈ Θ, ‖H(θ, .)‖₂ ≤ C(1 + √L(θ))

or one of their numerous variants (see e.g. [21] for a large overview of possible assumptions). How-
ever, in many situations, one can make the problem fit into a converging setting by an appropriate
change of variable on θ or by modifying the function L and introducing an appropriate explicit
(strictly) positive “weight function” χ(θ) that makes the product χ(θ)H(θ, W (ω)) fit with these
requirements.
Even so, the topological structure of the set {∇L = 0} can be nontrivial, in particular
disconnected. Nonetheless, as seen in Proposition 6.3, one can show, under natural assumptions,
that

θn converges to a connected component of {χ|∇L|2 = 0} = {∇L = 0}.

The next step is that if ∇L has several zeros, they cannot all be local minima of L especially when
there are more than two of them (this is a consequence of the well-known Mountain-Pass Lemma,
see [79]). Some are local maxima or saddle points of various kinds. These equilibrium points
which are not local minima are called traps. An important fact is that, under some non-degeneracy
assumptions on H at such a parasitic equilibrium point θ∞ (typically E H(θ∞ , W )∗ H(θ∞ , W ) is
positive definite at least in the direction of an unstable manifold of h at θ∞ ), the algorithm will
a.s. never converge toward such a “trap”. This question has been extensively investigated in the
literature in various settings for many years (see [96, 29, 136, 20, 51]).
A final problem may arise due to the incompatibility between the geometry of the parameter
set Θ and the above recursive algorithm: to be really defined by the above recursion, we need Θ
to be left stable by (almost) all the mappings θ ↦ θ − γH(θ, w), at least for γ small enough. If this is not
the case, we need to introduce some constraints on the algorithm by projecting it onto Θ whenever
θn skips outside Θ. This question has been originally investigated in [31] when Θ is a convex set.
Once all these technical questions have been circumvented, we may state the following meta-
theorem which says that θn a.s. converges toward a local minimum of L.
At this stage it is clear that calibration looks like quite a generic problem for stochastic op-
timization and that almost all difficulties arising in the field of Stochastic Approximation can be
encountered when implementing such a (pseudo-)stochastic gradient to solve it.

The Kiefer-Wolfowitz approach

Practical implementations of the Robbins-Siegmund approach point out a specific technical diffi-
culty: the random functions θ 7→ Yθ (ω) are not always pathwise differentiable (nor in the Lr (P)-
sense which could be enough). More important in some way, even if one shows that θ 7→ E Y (θ) is
differentiable, possibly by calling upon other techniques (log-likelihood method, Malliavin weights,
etc.), the resulting representation for ∂θY(θ) may turn out to be difficult to simulate, requiring
much programming care, whereas the random vectors Yθ can be simulated in a standard way. In
such a setting, an alternative is provided by the Kiefer-Wolfowitz algorithm (K-W ) which combines
the recursive stochastic approximation principle with a finite difference approach to differentiation.
The idea is simply to approximate the gradient ∇L by

∂L/∂θ^i (θ) ≈ ( L(θ + η^i e_i) − L(θ − η^i e_i) ) / (2η^i), 1 ≤ i ≤ p,

where (ei )1≤i≤p denotes the canonical basis of Rp and η = (η i )1≤i≤p . This finite difference term
has an integral representation given by

( L(θ + η^i e_i) − L(θ − η^i e_i) ) / (2η^i) = E[ ( Λ(θ + η^i e_i, W) − Λ(θ − η^i e_i, W) ) / (2η^i) ]

where, with obvious temporary notations,

Λ(θ, W) := ⟨Y(θ, W), Y(θ, W)⟩_S = Y(θ, W)∗ S Y(θ, W)

(Y (θ, W ) is related to the innovation W ). Starting from this representation, we may derive a
recursive updating formula for θn as follows
θ^i_{n+1} = θ^i_n − γ_{n+1} ( Λ(θn + η^i_{n+1} e_i, W^{n+1}) − Λ(θn − η^i_{n+1} e_i, W^{n+1}) ) / (2η^i_{n+1}), 1 ≤ i ≤ p.
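To make the mechanics concrete, here is a toy sketch (an assumption-laden illustration, not the calibration setting itself): a single parameter θ with Y(θ, W) = θ − Z, Z ~ N(m, 1), hypothetical target m, and S = 1, so Λ(θ, W) = (θ − Z)² and its mean is minimal at θ = m. Note that the same innovation W^{n+1} is reused in both finite-difference terms.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 1.5                        # hypothetical target: E[Y_theta] = theta - m
theta = 0.0                    # theta_0

def Lam(th, z):
    """Lambda(theta, W) = <Y, Y>_S with S = 1 and Y(theta, W) = theta - Z."""
    return (th - z) ** 2

for n in range(1, 50_001):
    gamma = 0.5 / n            # sum gamma_n = inf, sum gamma_n^2 < inf
    eta = 1.0 / n ** 0.25      # eta_n -> 0, sum eta_n = inf, sum (gamma_n/eta_n)^2 < inf
    z = m + rng.standard_normal()          # one innovation W^{n+1}, reused twice
    theta -= gamma * (Lam(theta + eta, z) - Lam(theta - eta, z)) / (2 * eta)
```

In this quadratic toy case the finite difference is exactly 2(θ − z), so the K-W recursion reduces to a Robbins-Monro scheme and θn settles around m.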

We reproduce below a typical convergence result for K-W procedures (see [21]) which is the natural
counterpart of the stochastic gradient framework.

Theorem 6.4 Assume that the function θ 7→ L(θ) is twice differentiable with a Lipschitz continu-
ous Hessian. We assume that

θ 7→ kΛ(θ, W )k2 has (at most) linear growth

and that the two step sequences respectively satisfy

Σ_{n≥1} γn = Σ_{n≥1} η^i_n = +∞, Σ_{n≥1} γn² < +∞, ηn → 0, Σ_{n≥1} ( γn/η^i_n )² < +∞.

Then θn a.s. converges to a connected component of {L = ℓ} ∩ {∇L = 0} for some level ℓ ≥ 0.

A special case of this procedure in a linear framework is proposed in Section 9.1.2: the decreas-
ing step finite difference method for greeks computation. The traps problem for the K-W algorithm
(convergence toward a local minimum of L) has been more specifically investigated in [96].
Users must keep in mind that this procedure needs some care in the tuning of the step pa-
rameters γn and ηn . This may need some preliminary numerical experiments. Of course, all the
recommendations made for the R-Z procedures remain valid. For more details on the K-W proce-
dure we refer to [21].

Numerical implementation: In progress.

6.3.5 Recursive computation of the V @R and the CV @R (I)


For an introduction to Conditional Value-at-Risk (CV@R), see e.g. [143].

 Theoretical background. Let X : (Ω, A, P) → R be a random variable representative of a
loss (i.e. X ≥ 0 stands for a loss equal to X).

Definition 6.1 The Value at Risk at level α ∈ (0, 1) is the (lowest) α-quantile of the distribution
of X i.e.
V @Rα (X) := inf{ξ | P(X ≤ ξ) ≥ α}. (6.22)
The Value-at-Risk does exist since lim_{ξ→+∞} P(X ≤ ξ) = 1 and satisfies

P(X < V @Rα (X)) ≤ α ≤ P(X ≤ V @Rα (X)).


As soon as the distribution function of X is continuous (the distribution of X has no atom),
the value at risk satisfies
P(X ≤ V @Rα (X)) = α
and if the distribution function FX of X is also (strictly) increasing, then it is the unique solution
of the above equation (otherwise it is the lowest one). In that case we will say that the Value-at-Risk
is unique.
Roughly speaking the random variable X represents a loss (i.e. X is a loss when non-negative)
and α represents a confidence level, typically 0.95 or 0.99.
However this measure of risk is not consistent, for several reasons which are discussed e.g. by
Föllmer & Schied in [49].
When X ∈ L1 (P) with a continuous distribution (no atom), a consistent measure of risk is
provided by the Conditional Value-at-Risk (at level α).
Definition 6.2 Let X ∈ L1 (P) with an atomless distribution. The Conditional Value-at-Risk (at
level α) is defined by
CV @Rα(X) := E( X | X ≥ V @Rα(X) ). (6.23)
Remark. Note that in case of non-uniqueness, the Conditional Value-at-risk is still well-defined
since the above conditional expectation does not depend upon the choice of the above α-quantile.
 Exercise. Assume that the distribution of X has no atom. Show that
CV @Rα(X) = V @Rα(X) + (1/(1 − α)) ∫_{V @Rα(X)}^{+∞} P(X > u) du.

[Hint: use that if the r.v. Y is non-negative, E Y = ∫_0^{+∞} P(Y ≥ y) dy.]
The following formulation of the V @Rα and CV @Rα as solutions to an optimization problem
is due to Rockafellar and Uryasev in [143].
Proposition 6.2 (Rockafellar and Uryasev’s representation formula) Let X ∈ L1 (P) with
an atomless distribution. The function L : R → R defined by
L(ξ) = ξ + (1/(1 − α)) E(X − ξ)₊
is convex and lim_{|ξ|→+∞} L(ξ) = +∞. Furthermore, L attains a minimum
CV @Rα(X) = min_{ξ∈R} L(ξ) ≥ E X

at

V @Rα(X) = inf argmin_{ξ∈R} L(ξ).

Proof. The function L is clearly convex and Lipschitz continuous since both functions ξ ↦ ξ
and ξ 7→ (x − ξ)+ are convex and 1-Lipschitz continuous for every x ∈ R. If X has no atom, then,
calling upon Theorem 2.1(a), the function L is also differentiable on the whole real line with a
derivative given for every ξ ∈ R by

L′(ξ) = 1 − (1/(1 − α)) P(X > ξ) = (1/(1 − α)) ( P(X ≤ ξ) − α ).

This follows from the interchange of differentiation and expectation allowed by Theorem 2.1(a)
since ξ ↦ ξ + (1/(1 − α))(X − ξ)₊ is differentiable at a given ξ0 outside the event {X = ξ0}, i.e. P-a.s. since X
is atomless and, on the other hand, is Lipschitz continuous in ξ with (deterministic, hence integrable)
ratio 1 + 1/(1 − α). The second equality is obvious. Then L attains an absolute minimum, if any,
at any solution ξα of the equation P(X > ξα ) = 1 − α i.e. P(X ≤ ξα ) = α. Hence, L does attain a
minimum at the value-at-risk (which is the lowest solution of this equation). Furthermore

L(ξα) = ξα + E((X − ξα)₊) / P(X > ξα)
      = ( ξα E 1_{X>ξα} + E((X − ξα) 1_{X>ξα}) ) / P(X > ξα)
      = E( X 1_{X>ξα} ) / P(X > ξα)
      = E( X | X > ξα ).

Finally, by Jensen’s inequality,

L(ξ) ≥ ξ + (1/(1 − α)) ( E X − ξ )₊.

The function L satisfies

lim_{ξ→+∞} L(ξ)/ξ = lim_{ξ→+∞} ( 1 + (1/(1 − α)) E(X/ξ − 1)₊ ) = 1

and

lim_{ξ→+∞} L(−ξ)/ξ = lim_{ξ→+∞} ( −1 + (1/(1 − α)) E(X/ξ + 1)₊ ) = −1 + 1/(1 − α) = α/(1 − α).

Hence lim_{ξ→±∞} L(ξ) = +∞.
One checks that the function on the right-hand side of the above inequality attains its minimum
at its only break of monotony, i.e. at ξ = E X. This completes the proof. ♦
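A quick Monte Carlo sanity check of the representation formula (a hedged illustration, not part of the proof): for X ~ N(0, 1) the closed-form values are V@R₀.₉₅ = Φ⁻¹(0.95) ≈ 1.6449 and CV@R₀.₉₅ = φ(Φ⁻¹(0.95))/0.05 ≈ 2.0627, and minimizing an empirical version of L over a grid should recover both.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.95
X = rng.standard_normal(400_000)                 # sample of the loss X ~ N(0, 1)

def L(xi):
    """Monte Carlo version of L(xi) = xi + E(X - xi)_+ / (1 - alpha)."""
    return xi + np.maximum(X - xi, 0.0).mean() / (1.0 - alpha)

grid = np.linspace(0.0, 3.0, 301)
values = np.array([L(xi) for xi in grid])
var_hat = grid[values.argmin()]                  # argmin of L ~ V@R_alpha(X)
cvar_hat = values.min()                          # min of L    ~ CV@R_alpha(X)
```

The argmin lands near the 95% normal quantile and the minimal value near the normal conditional tail expectation, as the proposition predicts.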

Exercise. Show that the conditional value-at-risk CV @Rα (X) is a consistent measure of risk i.e.
that it satisfies the following three properties

• ∀ λ > 0, CV @Rα (λ X) = λCV @Rα (X).

• ∀ a ∈ R, CV @Rα (X + a) = CV @Rα (X) + a.

• Let X, Y ∈ L1 (P), CV @Rα (X + Y ) ≤ CV @Rα (X) + CV @Rα (Y ).



 First step: a stochastic gradient to compute the Value-at-Risk. The Rockafellar-Uryasev
representation suggests implementing a stochastic gradient descent since the function L has a
representation as an expectation
L(ξ) = E( ξ + (1/(1 − α)) (X − ξ)₊ ).
Furthermore, if the distribution PX has no atom, we know that the function L, being convex
and differentiable, satisfies
∀ ξ, ξ′ ∈ R, ( L′(ξ) − L′(ξ′) )(ξ − ξ′) = (1/(1 − α)) ( F(ξ) − F(ξ′) )(ξ − ξ′) ≥ 0
and if the Value-at-Risk V @Rα(X) is the unique solution to F(ξ) = α,

∀ ξ ∈ R, ξ ≠ V @Rα(X), L′(ξ)( ξ − V @Rα(X) ) = (1/(1 − α)) ( F(ξ) − α )( ξ − V @Rα(X) ) > 0.
Proposition 6.3 Assume that X ∈ L1 (P) with a unique Value-at-Risk V @Rα (X). Let (Xn )n≥1 be
an i.i.d. sequence of random variables with the same distribution as X, let ξ0 ∈ L1 (P), independent
of (Xn)n≥1, and let (γn)n≥1 be a positive sequence of real numbers satisfying the decreasing step
assumption (6.7). The stochastic algorithm (ξn )n≥0 defined by

ξn+1 = ξn − γn+1 H(ξn , Xn+1 ), n ≥ 0, (6.24)

where
H(ξ, x) := 1 − (1/(1 − α)) 1_{x≥ξ} = (1/(1 − α)) ( 1_{x<ξ} − α ), (6.25)
a.s. converges toward the Value-at-Risk, i.e.

ξn −→ V @Rα(X) a.s.

Furthermore the sequence (L(ξn ))n≥0 is L1 -bounded so that L(ξn ) → CV @Rα (X) a.s. and in every
Lp (P), p ∈ (0, 1].

Proof. First assume that ξ0 ∈ L2 (P). The sequence (ξn )n≥0 defined by (6.24) is the stochastic
gradient related to the Lyapunov function L̃(ξ) = L(ξ) − E X, but owing to the convexity of L,
it is more convenient to rely on Corollary 6.1(a) (Robbins-Monro algorithm) since it is clear that
the function (ξ, x) ↦ H(ξ, x) is bounded by α/(1 − α), so that ξ ↦ ‖H(ξ, X)‖₂ is bounded as well. The
conclusion directly follows from the Robbins-Monro setting.
2
In the general case – ξ0 ∈ L1 (P) – one introduces the Lyapunov function L̃(ξ) = √ (ξ−ξα )
1+(ξ−ξα )2
(ξ−ξα )(2++(ξ−ξα )2 )
where we set ξ = ξα for convenience. The derivative of L̃ is given by L̃0 (ξ) = 3 ).
(1+(ξ−ξα )2 ) 2
One checks on the one hand that L̃0 is Lipschitz continuous continuous over the real line (e.g.
because L̃” is bounded) and, on the other hand, {L̃0 = 0} ∩ {L0 = 0} = {L0 = 0} = {V @Rα (X)}.
Then Theorem 6.3 (pseudo-Stochastic Gradient) applies and yields the announced conclusion. ♦

 Exercises. 1. Show that if X has a bounded density fX, then a direct application of the
stochastic gradient convergence result (Corollary 6.1(b)) yields the announced result under the
assumption ξ0 ∈ L1(P) [Hint: show that L′ is Lipschitz continuous].

2. Under the additional assumption of Exercise 1, show that the mean function h satisfies h′(x) =
L″(x) = fX(x)/(1 − α). Deduce a way to optimize the step sequence of the algorithm based on the CLT
for stochastic algorithms stated further on in Section 6.4.3.

Exercise 2 is inspired by a simpler “α-quantile” approach. It leads to a more general a.s.
convergence result for our algorithm, stated in the proposition below.

Proposition 6.4 If X ∈ Lp(P) for some p > 0 and is atomless with a unique Value-at-Risk and if
ξ0 ∈ Lp (P), then the algorithm (6.24) a.s. converges toward V @Rα (X).

Remark. The uniqueness of the Value-at-Risk can also be relaxed. The conclusion becomes that ξn
a.s. converges to a random variable taking values in the “V @Rα(X) set” {ξ ∈ R | P(X ≤ ξ) = α}
(see [19] for a statement in that direction).

 Second step: adaptive computation of CV @Rα (X). The main aim of this section was to
compute the CV @Rα (X). How can we proceed? The idea is to devise a companion procedure of
the above stochastic gradient. Still set temporarily ξα = V @Rα (X) for convenience. It follows
from what precedes that ( L(ξ0) + · · · + L(ξn−1) )/n → CV @Rα(X) a.s., owing to Césaro’s principle. On
the other hand, we know that, for every ξ ∈ R,
L(ξ) = E Λ(ξ, X) where Λ(ξ, x) = ξ + (x − ξ)₊/(1 − α).
Using that Xn+1 and (ξ0, ξ1, . . . , ξn) are independent, one can rewrite the above Césaro convergence as
E[ ( Λ(ξ0, X1) + · · · + Λ(ξn−1, Xn) ) / n ] −→ CV @Rα(X)

which naturally suggests considering the sequence (Cn)n≥0 defined by C0 = 0 and

Cn = (1/n) Σ_{k=0}^{n−1} Λ(ξk, Xk+1), n ≥ 1,

as a candidate estimator of CV @Rα(X). This sequence can clearly be defined recursively
since, for every n ≥ 0,

Cn+1 = Cn − (1/(n + 1)) ( Cn − Λ(ξn, Xn+1) ). (6.26)

Proposition 6.5 Assume that X ∈ L^{1+ρ}(P) for some ρ ∈ (0, 1] and that ξn → V @Rα(X) a.s. Then
Cn → CV @Rα(X) a.s. as n → +∞.
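Recursions (6.24)-(6.25) and (6.26) translate into a dozen lines of code. The sketch below is a hedged toy illustration: the loss is taken to be X ~ N(0, 1), so the targets are V@R₀.₉₅ ≈ 1.6449 and CV@R₀.₉₅ ≈ 2.0627, and the step schedule is a heuristic choice, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.95
xi, C = 0.0, 0.0                       # xi_0 (V@R estimate), C_0 (CV@R estimate)

for n in range(1, 200_001):
    x = rng.standard_normal()          # X_n, i.i.d. copy of the loss X
    gamma = (50.0 + n) ** -0.75        # decreasing step (heuristic choice)
    Lam = xi + max(x - xi, 0.0) / (1.0 - alpha)       # Lambda(xi_{n-1}, X_n)
    xi -= gamma * (1.0 - (x >= xi) / (1.0 - alpha))   # recursion (6.24)-(6.25)
    C -= (C - Lam) / n                 # recursion (6.26): running mean of Lambda
```

Note that Λ is evaluated at the pre-update ξ, matching the index ξ_{n−1} in Cn; as the warning below points out, for α very close to 1 this naive version converges slowly without importance sampling.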

Proof. We will prove this claim in detail in the quadratic case ρ = 1. The proof in the general
case relies on the Chow theorem (see [43] or the second exercise right after the proof). First, one
decomposes

Cn − L(ξα) = (1/n) Σ_{k=0}^{n−1} ( L(ξk) − L(ξα) ) + (1/n) Σ_{k=1}^{n} Yk

with Yk := Λ(ξk−1, Xk) − L(ξk−1), k ≥ 1. It is clear that (1/n) Σ_{k=0}^{n−1} ( L(ξk) − L(ξα) ) → 0 as n → +∞ by
Césaro’s principle.
As concerns the second term, we first note that

Λ(ξ, x) − L(ξ) = (1/(1 − α)) ( (x − ξ)₊ − E(X − ξ)₊ )

so that, x ↦ x₊ being 1-Lipschitz continuous,

|Λ(ξ, x) − L(ξ)| ≤ (1/(1 − α)) E|X − ξ| ≤ (1/(1 − α)) ( E|X| + |ξ| ).

Consequently, for every k ≥ 1,

E Yk² ≤ (2/(1 − α)²) ( (E X)² + E X² ).

We consider the natural filtration of the algorithm, Fn := σ(ξ0, X1, . . . , Xn). One checks that, for
every k ≥ 1,

E( Yk | Fk−1 ) = E( Λ(ξk−1, Xk) | Fk−1 ) − L(ξk−1) = L(ξk−1) − L(ξk−1) = 0.

We consider the martingale defined by N0 = 0 and

Nn := Σ_{k=1}^{n} Yk/k, n ≥ 1.

This martingale is in L2(P) for every n and its predictable bracket process is given by

⟨N⟩n = Σ_{k=1}^{n} E( Yk² | Fk−1 )/k²

so that

E ⟨N⟩∞ ≤ sup_n E Yn² × Σ_{k≥1} 1/k² < +∞.

Consequently Nn → N∞ a.s. and in L2 as n → +∞. Then the Kronecker lemma (see Lemma 11.1)
implies that

(1/n) Σ_{k=1}^{n} Yk −→ 0 as n → +∞,

which finally implies that

Cn −→ CV @Rα(X) as n → +∞. ♦

Remark. For practical implementation one may prefer estimating first the V @Rα (X) and, once
it is done, use a regular Monte Carlo procedure to evaluate the CV @Rα (X).
6.3. APPLICATIONS TO FINANCE 165

 Exercises. 1. Show that an alternative method to compute CV @Rα (X) is to design the
following recursive procedure

Cn+1 = Cn − γn+1 (Cn − Λ(ξn , Xn+1 )), n ≥ 0, C0 = 0. (6.27)

where (γn )n≥1 is the step sequence implemented in the algorithm (6.24) which computes V @Rα (X).
2. (Proof of Proposition 6.5). Show that the conclusion of Proposition 6.5 remains valid if
X ∈ L1+ρ (P). [Hint: rely on the Chow theorem (4 ).]

 Warning! Toward an operating procedure. As it is presented, what precedes is essentially
a toy exercise, for the following reason: in practice, the convergence of the above algorithm
will be very slow and chaotic. . . since P(X > V @Rα(X)) = 1 − α is close to 0. For a practical imple-
mentation on real-life portfolios the above procedure must be combined with importance sampling
to recenter the simulation where things do happen. . . A more realistic procedure is developed and
analyzed in [19].

6.3.6 Numerical methods for Optimal Quantization


Let X : (Ω, A, P) → Rd be a random vector and let N be a nonzero integer. We want to produce
an optimal quadratic quantization of X at level N i.e. to produce an N -quantizer which minimizes
the quadratic quantization error as introduced in Chapter 5.
Everything starts from the fact that any optimal (or even locally optimal) quantizer x =
(x1 , . . . , xN ) with N -components satisfies the stationarity equation as briefly recalled below.

Competitive Learning Vector Quantization


This procedure is a stochastic gradient descent derived from the squared quantization error modulus
(see below). Unfortunately it does not fulfill any of the usual assumptions needed on a
potential/Lyapunov function to ensure the a.s. convergence of such a procedure. In particular,
the potential does not go to infinity when the norm of the N-quantizer goes to infinity. Except in
one dimension – where it is of little interest since deterministic Newton-like procedures can be
implemented in general – very partial results are known about the asymptotic behaviour of this
procedure (see e.g. [122, 21]).
However, the theoretical gaps (in particular the possible asymptotic “merge” of components)
are never observed in practical implementations. The procedure is recommended to get sharp results
when N is small and d not too high.

The quadratic distortion function is defined as the squared quadratic mean quantization error, i.e.

∀ x = (x¹, . . . , x^N) ∈ (Rd)^N, D_N^X(x) := ‖X − X̂^x‖₂² = E dis_{loc,N}(x, X)
4 Let (Mn)n≥0 be an (Fn, P)-martingale null at 0 and let ρ ∈ (0, 1]; then

Mn −→ M∞ a.s. on the event { Σ_{n≥1} E( |∆Mn|^{1+ρ} | Fn−1 ) < +∞ }.

with (x, ξ) ↦ dis_{loc,N}(x, ξ) a local potential defined by

dis_{loc,N}(x, ξ) = min_{1≤i≤N} |ξ − x^i|² = dist( ξ, {x¹, . . . , x^N} )².

Proposition 6.3.1 Let N ∈ N∗. The distortion function D_N^X is continuously differentiable at
N-tuples x ∈ (Rd)^N with pairwise distinct components as a consequence of the local Lebesgue dif-
ferentiation theorem (Theorem 2.1(a)) and one easily checks that

∂D_N^X/∂x^i (x) := E[ ∂dis_{loc,N}/∂x^i (x, X) ] = ∫_{Rd} ∂dis_{loc,N}/∂x^i (x, ξ) PX(dξ),

with a local gradient given by

∂dis_{loc,N}/∂x^i (x, ξ) := 2(x^i − ξ) 1_{Proj_x(ξ) = x^i}, 1 ≤ i ≤ N,

where Projx denotes a (Borel) projection following the nearest neighbour rule on the grid {x1 , . . . , xN }.

As emphasized in the introduction of this chapter, the gradient ∇D_N^X having an integral represen-
tation, it is formally possible to minimize D_N^X using a stochastic gradient descent.
Unfortunately, it is easy to check that lim inf_{|x|→+∞} D_N^X(x) < +∞ (though lim_{min_i |x^i|→+∞} D_N^X(x) = +∞).
Consequently, it is hopeless to apply the standard convergence theorem “à la Robbins-Siegmund”
for stochastic gradient procedures like that established in Corollary 6.1(b). But of course, we can
still write it down formally and implement it. . .

 Ingredients:
– A sequence ξ 1 , . . . , ξ t , . . . of (simulated) independent copies of X,
– A step sequence γ1 , . . . , γt , . . . .
One usually chooses the step in the parametric families γt = A/(B + t) ↓ 0 (decreasing step) or
γt = η ≈ 0 (small constant step).

 Stochastic Gradient Descent formula.


• The procedure formally reads

x(t + 1) = x(t) − γ_{t+1} ∇_x dis_{loc,N}( x(t), ξ^{t+1} ), x(0) ∈ (Rd)^N,

where x(0) = (x(0)¹, . . . , x(0)^N) has pairwise distinct components x(0)^i in Rd.

• Quantizer updating (t ⇝ t + 1): x(t) := {x¹(t), . . . , x^N(t)},

Competition: winner selection i(t + 1) ∈ argmin_i |x^i(t) − ξ^{t+1}| (nearest neighbour search),

Learning: x^{i(t+1)}(t + 1) := Homothety(ξ^{t+1}, 1 − γ_{t+1})( x^{i(t+1)}(t) ),
          x^i(t + 1) := x^i(t), i ≠ i(t + 1).


One can easily check that if x(t) has pairwise distinct components in Rd, this is preserved by
the “learning” phase, so that the above procedure is well-defined (up to the convention to be made
in case of conflict between several components x^j(t) in the “competition” phase).
The name of the procedure – Competitive Learning Vector Quantization algorithm – is of course
inspired by these two phases.

 Heuristics: x(t) −→ x∗ ∈ argmin_{(loc), x∈(Rd)^N} D_N^X(x) as t → +∞.

 On-line computation of the “companion parameters”:

• Weights π^{i,∗} = P(X̂^{x∗} = x^{i,∗}), i = 1, . . . , N:

π^{i,t+1} := (1 − γ_{t+1}) π^{i,t} + γ_{t+1} 1_{i = i(t+1)} −→ π^{i,∗} = P(X̂^{x∗} = x^{i,∗}) a.s.

• (Quadratic) quantization error D_N^X(x∗) = ‖X − X̂^{x∗}‖₂²:

D_N^{X,t+1} := (1 − γ_{t+1}) D_N^{X,t} + γ_{t+1} |x^{i(t+1)}(t) − ξ^{t+1}|² −→ D_N^X(x∗) a.s.

Note that, since the ingredients involved in the above computations are those used in the com-
petitive learning phase, there is (almost) no extra C.P.U. time cost induced by these additional
terms, especially if one has in mind (see below) that the costly part of the algorithm (as well as
that of the Lloyd procedure below) is the “competition phase” (winner selection), since it amounts
to a nearest neighbour search.

In some way the CLVQ algorithm can be seen as a Non Linear Monte Carlo Simulation devised
to design an optimal skeleton of the distribution of X.
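The whole CLVQ loop — competition, learning and the two companion estimators — fits in a few lines. The sketch below is a hedged illustration for X ~ N(0, I₂); the size N, number of iterations and step schedule are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, T = 20, 2, 100_000
x = rng.standard_normal((N, d))        # x(0): N-quantizer with distinct components
pi = np.full(N, 1.0 / N)               # companion weight estimates pi^{i,t}
D = 0.0                                # companion distortion estimate D_N^{X,t}

for t in range(1, T + 1):
    gamma = 1.0 / (100.0 + t) ** 0.75  # decreasing step (heuristic choice)
    xi = rng.standard_normal(d)        # fresh simulated copy xi^{t+1} of X
    i = int(np.argmin(((x - xi) ** 2).sum(axis=1)))  # competition: winner
    err2 = ((x[i] - xi) ** 2).sum()    # squared distance to the winner
    x[i] += gamma * (xi - x[i])        # learning: homothety centred at xi
    pi *= 1.0 - gamma                  # on-line weight estimator ...
    pi[i] += gamma                     # ... (1-gamma) pi + gamma 1_{winner}
    D += gamma * (err2 - D)            # on-line (quadratic) distortion estimator
```

The weight updates preserve Σᵢ πᵢ = 1 by construction, and the distortion estimate D tracks the quantization error of the evolving grid at essentially no extra cost, as noted above.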

For (partial) theoretical results on the convergence of the CLVQ algorithm, we refer to [122]
when X has a compactly supported distribution. No theoretical convergence results are known to
us when X has an unbounded support. As for the convergence of the online adaptive companion
procedures, which rely on classical martingale arguments, we refer again to [122], but also to [12].

Randomized Lloyd’s I procedure


This randomized fixed point procedure is recommended to get good results when N is large and
d is medium. We start from the fact that a local minimizer of a differentiable function is a zero
of its gradient. One checks that a local minimizer of the quadratic distortion function D_N^X must
have pairwise distinct components (see [66, 122] among others). Consequently, if x = (x¹, . . . , x^N)
denotes a local minimum, the gradient of the squared quantization error at x must be zero, i.e.

∂D_N^X/∂x^i (x) = 2 E( (x^i − X) 1_{Proj_x(X) = x^i} ) = 0, 1 ≤ i ≤ N,

or equivalently

x^i = E( X | X̂^x = x^i ), 1 ≤ i ≤ N. (6.28)

This fixed point identity can also be rewritten in terms of conditional expectation as follows:

X̂^x = E( X | X̂^x )

since, by the characterization of the conditional expectation of a discrete random vector,

E( X | X̂^x ) = Σ_{i=1}^{N} E( X | X̂^x = x^i ) 1_{X̂^x = x^i}.

 Regular Lloyd’s procedure. The Lloyd procedure is simply the recursive procedure asso-
ciated to the fixed point identity (6.28):

x^i(t + 1) = E( X | X̂^{x(t)} = x^i(t) ), 1 ≤ i ≤ N, t ∈ N, x(0) ∈ (Rd)^N,

where x(0) has pairwise distinct components in Rd. We leave as an exercise to show that this
procedure is entirely determined by the distribution of the random vector X.
 Exercise. Prove that this recursive procedure only involves the distribution µ = PX of the
random vector X.
Lloyd’s algorithm can be viewed as a two step procedure acting on random vectors as follows:

(i) Grid updating: X̃(t + 1) = E( X | X̂^{x(t)} ), x(t + 1) = X̃(t + 1)(Ω), (6.29)
(ii) Distribution/weight updating: X̂^{x(t+1)} ← X̃(t + 1). (6.30)

The first step updates the grid; the second step re-assigns to each element of the grid its Voronoi
cell, which can be viewed as a weight updating.

Proposition 6.6 Lloyd’s algorithm makes the quadratic quantization error decrease, i.e.

t ↦ ‖X − X̂^{x(t)}‖₂ is non-increasing.

Proof. It follows from the above decomposition of the procedure and the very definitions of nearest
neighbour projection and conditional expectation as an orthogonal projector in L2(P) that, for
every t ∈ N,

‖X − X̂^{x(t+1)}‖₂ = ‖dist( X, x(t + 1) )‖₂
                 ≤ ‖X − X̃(t + 1)‖₂ = ‖X − E( X | X̂^{x(t)} )‖₂
                 ≤ ‖X − X̂^{x(t)}‖₂. ♦

 Randomized Lloyd’s procedure. It relies on the computation of E( X | X̂^x = x^i ), 1 ≤ i ≤ N,
by a Monte Carlo simulation. This means that we have independent copies ξ1, . . . , ξM, . . . of X so
that

E( X | X̂^{x(t)} = x^i(t) ) ≈ Σ_{m=1}^{M} ξm 1_{ξ̂m^{x(t)} = x^i(t)} / |{1 ≤ m ≤ M, ξ̂m^{x(t)} = x^i(t)}|,

having in mind that the convergence holds when M → +∞. This amounts to setting at every iteration

x^i(t + 1) = Σ_{m=1}^{M} ξm 1_{ξ̂m^{x(t)} = x^i(t)} / |{1 ≤ m ≤ M, ξ̂m^{x(t)} = x^i(t)}|, i = 1, . . . , N, x(0) ∈ (Rd)^N,

where x(0) has pairwise distinct components in Rd.
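A minimal sketch of the randomized Lloyd procedure for X ~ N(0, I₂) follows (size N = 20 and sample size M are arbitrary illustrative choices). The same Monte Carlo sample is reused at every iteration, so the empirical counterpart of Proposition 6.6 holds: the recorded empirical distortion is non-increasing; empty cells are simply left unchanged here, one possible convention.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, M = 20, 2, 50_000
sample = rng.standard_normal((M, d))   # xi_1, ..., xi_M: i.i.d. copies of X
x = rng.standard_normal((N, d))        # initial quantizer x(0), distinct points

hist = []                              # empirical distortion along the iterations
for t in range(25):
    d2 = ((sample[:, None, :] - x[None, :, :]) ** 2).sum(axis=2)   # (M, N)
    cell = d2.argmin(axis=1)           # nearest-neighbour (Voronoi) assignment
    hist.append(d2.min(axis=1).mean()) # empirical distortion of the current grid
    for i in range(N):
        sel = cell == i
        if sel.any():                  # empty cells are left unchanged
            x[i] = sample[sel].mean(axis=0)   # x^i <- empirical E[X | cell i]
```

The monotone decrease of `hist` is the practical signature of Proposition 6.6 applied to the empirical measure.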



This “randomized procedure” amounts to replacing the distribution of X by the empirical
measure

(1/M) Σ_{m=1}^{M} δ_{ξm}.

In particular, if we use the same sample at each iteration, we still have the property that the
procedure makes a quantization error modulus (at level N) related to this empirical measure decrease.
This suggests that the random i.i.d. sample (ξm)m≥1 can also be replaced by deterministic
copies obtained through a QMC procedure based on a representation of X of the form X = ψ(U),
U ∼ U([0, 1]^r).
When computing larger and larger quantizers of the same distribution, a significant improvement
of the method is to initialize the randomized Lloyd procedure at level N + 1 by adding one
component to the N-quantizer resulting from the procedure applied at level N, namely to start
from the (N + 1)-tuple (x^{∗,(N)}, ξ) where x^{∗,(N)} denotes the limiting value of the procedure at level
N (assumed to exist, which is the case in practice).

Fast nearest neighbour procedure in Rd


This is the key step in all the stochastic procedures devised to compute optimal (or at least “good”)
quantizers in higher dimension. Speeding it up, especially when d increases, is one of the major
challenges of computer science.

 The Partial Distance Search paradigm (Chen, 1970).


We want to check whether x = (x¹, . . . , x^d) ∈ Rd is closer to 0 for the canonical Euclidean
distance than a given former “minimal record distance” δmin > 0. The “trick” is the following:

(x¹)² ≥ δmin² ⟹ |x| ≥ δmin
...
(x¹)² + · · · + (x^ℓ)² ≥ δmin² ⟹ |x| ≥ δmin
...

This is the simplest and easiest idea to implement, but it seems that it is also the only one
that still works as d increases.
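In code, the partial distance “trick” amounts to abandoning the coordinate-wise accumulation of the squared distance as soon as it reaches the best record found so far; a minimal sketch:

```python
def nearest_neighbour_pds(point, grid):
    """Nearest-neighbour search with the partial distance trick: the running sum
    of squared coordinate differences is abandoned as soon as it reaches the
    current minimal record squared distance."""
    best_i, best_d2 = -1, float("inf")
    for i, c in enumerate(grid):
        s = 0.0
        for a, b in zip(point, c):
            s += (a - b) ** 2
            if s >= best_d2:        # partial distance test: c cannot win
                break
        else:                       # full distance computed: new record
            best_i, best_d2 = i, s
    return best_i, best_d2
```

In the CLVQ and Lloyd procedures above, `grid` would be the current N-quantizer and `point` the simulated innovation ξ^{t+1}.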
 The K-d tree (Friedman, Bentley, Finkel, 1977, see [53]): store the N points of Rd in a tree
of depth O(log(N)) based on their coordinates in the canonical basis of Rd.
 Further improvements are due to McNames (see [108]): the idea is to perform a pre-processing
of the dataset of N points using a Principal Component Analysis (PCA) and then implement
the K-d tree method in the new orthogonal basis induced by the PCA.

Numerical optimization of quantizers for the normal distributions N (0; Id ) on Rd , d ≥ 1


The procedures that minimize the quantization error are usually stochastic (except in one dimension).
The most famous ones are undoubtedly the so-called Competitive Learning Vector Quantization
170 CHAPTER 6. STOCHASTIC APPROXIMATION AND APPLICATIONS TO FINANCE

Figure 6.3: An optimal quantization of the bi-variate normal distribution with size N = 500

algorithm (see [129] or [127]) and the Lloyd I procedure (see [129, 124, 56]), which have just been
described and briefly analyzed above. More algorithmic details are also made available on the web
site
www.quantize.maths-fi.com
For normal distributions a large scale optimization has been carried out based on a mixed
CLV Q-Lloyd procedure. To be precise, grids have been computed for d = 1 up to 10 and
1 ≤ N ≤ 5 000. Furthermore several companion parameters have also been computed (still
by simulation): weights, L1 -quantization error, (squared) L2 -quantization error (also known as
distortion), local L1 - and L2 -pseudo-inertia of each Voronoi cell. All these grids can be downloaded
on the above website.
Thus Figure 6.3 depicts an optimal quadratic N -quantization of the bi-variate normal distri-
bution N (0; I2 ) with N = 500.

6.4 Further results on Stochastic Approximation


This section can be skipped at a first reading. It deals with the connection between the mean
function h in Stochastic Approximation and the underlying Ordinary Differential Equation
ẏ = −h(y) mentioned in the introduction. The second part of this section is devoted to the rate
of convergence of stochastic algorithms.

6.4.1 The ODE method


Toward the ODE
The starting idea (due to Ljung in [103]) of the ODE method is to consider the dynamics of a
stochastic algorithm as a perturbed Euler scheme with decreasing step of an Ordinary Differential
Equation. In this textbook we mainly deal with algorithms having a Markov representation with
6.4. FURTHER RESULTS ON STOCHASTIC APPROXIMATION 171

an i.i.d. sequence of innovations of the form (6.3), that is

Yn+1 = Yn − γn+1 H(Yn , Zn+1 )

where (Zn )n≥1 is an i.i.d. innovation sequence of Rq -valued random vectors, Y0 is independent of
the innovation sequence, H : Rd × Rq → Rd is a Borel function such that. . . and (γn )n≥1 is a
sequence of step parameters.
We saw that this algorithm can be represented in a canonical way as follows:
\[
Y_{n+1} = Y_n - \gamma_{n+1}\, h(Y_n) + \gamma_{n+1}\, \Delta M_{n+1}
\tag{6.31}
\]
where h(y) = E H(y, Z1 ) is the mean field or mean function of the algorithm and ∆Mn+1 =
h(Yn ) − H(Yn , Zn+1 ), n ≥ 0, is a sequence of martingale increments with respect to the
filtration Fn = σ(Y0 , Z1 , . . . , Zn ), n ≥ 0.
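As a toy illustration of ours of this canonical form: with H(y, z) = y − z and Z ∼ N (m; 1), the mean function is h(y) = y − m and the procedure converges to the target m (with steps γn = c/n):

```python
import random

def robbins_monro(H, z_sampler, y0, n_iter, c=1.0):
    """Y_{n+1} = Y_n - gamma_{n+1} H(Y_n, Z_{n+1}) with steps gamma_n = c/n."""
    y = y0
    for n in range(1, n_iter + 1):
        y -= (c / n) * H(y, z_sampler())
    return y

random.seed(1)
# zero of h(y) = E[y - Z] = y - 2, i.e. the mean of Z
y_star = robbins_monro(lambda y, z: y - z, lambda: random.gauss(2.0, 1.0), 0.0, 50000)
```

With c = 1 this particular recursion is exactly the running empirical mean of the innovations.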
In what precedes we established criteria (see the Robbins-Siegmund Lemma) based on the exis-
tence of Lyapunov functions which ensure that the sequence (Yn )n≥1 is a.s. bounded and that the
martingale
\[
M_n^{\gamma} = \sum_{k=1}^{n} \gamma_k\, \Delta M_k \quad \text{is a.s. convergent in } \mathbb{R}^d.
\]

To derive the a.s. convergence of the algorithm itself, we used pathwise arguments based on
elementary topology and functional analysis. The main improvement provided by the ODE method
is to study the asymptotics of the sequence (Yn (ω))n≥0 , assumed a priori to be bounded, through
the sequences (Yk (ω))k≥n , n ≥ 1, represented as sequences of functions in the time scale of the
cumulative sums of the steps. We will also need an assumption on the paths of the martingale (Mnγ )n≥1 ,
however significantly lighter than the above a.s. convergence property. Let us be more specific:
first we consider a discrete time dynamics
\[
y_{n+1} = y_n - \gamma_{n+1}\big( h(y_n) + \pi_{n+1} \big), \qquad y_0 \in \mathbb{R}^d,
\tag{6.32}
\]

where (πn )n≥1 is a sequence of Rd -valued vectors and h : Rd → Rd is a continuous Borel function.
We set Γ0 = 0 and, for every integer n ≥ 1,
\[
\Gamma_n = \sum_{k=1}^{n} \gamma_k .
\]
Then we define the stepwise constant càdlàg function (y_t^{(0)})_{t∈R_+} by
\[
y^{(0)}_t = y_n \quad \text{if } t \in [\Gamma_n, \Gamma_{n+1})
\]
and the sequence of time shifted functions defined by
\[
y^{(n)}_t = y^{(0)}_{\Gamma_n + t}, \qquad t \in \mathbb{R}_+ .
\]
Finally, we set, for every t ∈ R_+ ,
\[
N(t) = \max\big\{ k : \Gamma_k \le t \big\} = \min\big\{ k : \Gamma_{k+1} > t \big\}
\]

so that N (t) = n if and only if t ∈ [Γn , Γn+1 ) (in particular N (Γn ) = n).
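The cumulative step scale and the inverse clock N (t) can be sketched as follows (helper names are ours):

```python
import bisect

def make_clock(gammas):
    """Return the cumulative steps Gamma_0, ..., Gamma_n and the clock N(t),
    with N(t) = n if and only if t lies in [Gamma_n, Gamma_{n+1})."""
    Gamma = [0.0]
    for g in gammas:
        Gamma.append(Gamma[-1] + g)
    def N(t):
        # largest k such that Gamma_k <= t
        return bisect.bisect_right(Gamma, t) - 1
    return Gamma, N

Gamma, N = make_clock([1.0 / n for n in range(1, 101)])  # gamma_n = 1/n
```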
Developing the recursive Equation (6.32), we get
\[
y_n = y_0 - \sum_{k=1}^{n} \gamma_k\, h(y_{k-1}) - \sum_{k=1}^{n} \gamma_k\, \pi_k ,
\]
which can be rewritten, for every t ∈ [Γn , Γn+1 ),
\[
y^{(0)}_t = y_0 - \sum_{k=1}^{n} \int_{\Gamma_{k-1}}^{\Gamma_k} h\big(\underbrace{y^{(0)}_s}_{=\,y_{k-1}}\big)\, ds - \sum_{k=1}^{n} \gamma_k\, \pi_k
          = y_0 - \int_0^{\Gamma_n} h(y^{(0)}_s)\, ds - \sum_{k=1}^{n} \gamma_k\, \pi_k .
\]
As a consequence, for every t ∈ R_+ ,
\[
y^{(0)}_t = y_0 - \int_0^{\Gamma_{N(t)}} h(y^{(0)}_s)\, ds - \sum_{k=1}^{N(t)} \gamma_k\, \pi_k
          = y_0 - \int_0^{t} h(y^{(0)}_s)\, ds + \int_{\Gamma_{N(t)}}^{t} h(y^{(0)}_s)\, ds - \sum_{k=1}^{N(t)} \gamma_k\, \pi_k .
\tag{6.33}
\]

Then, by the very definition of the shifted functions y (n) and taking advantage of the fact that
Γ_{N(Γn)} = Γn , we derive, by subtracting (6.33) at times Γn + t and Γn , that for every t ∈ R_+ ,
\[
y^{(n)}_t = y^{(0)}_{\Gamma_n} + \big( y^{(0)}_{\Gamma_n+t} - y^{(0)}_{\Gamma_n} \big)
          = y^{(0)}_{\Gamma_n} - \int_{\Gamma_n}^{\Gamma_n+t} h(y^{(0)}_s)\, ds
            + \int_{\Gamma_{N(\Gamma_n+t)}}^{\Gamma_n+t} h(y^{(0)}_s)\, ds
            - \sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k\, \pi_k ,
\]
where we keep in mind that y^{(0)}_{Γn} = yn . The term
\[
\int_{\Gamma_{N(\Gamma_n+t)}}^{\Gamma_n+t} h(y^{(0)}_s)\, ds - \sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k\, \pi_k
\]
is a candidate to be a remainder term as n goes to infinity. Our aim is to make a connection between the asymptotic
behavior of the sequence of vectors (yn )n≥0 and that of the sequence of functions y (n) , n ≥ 0.

Proposition 6.7 (ODE method I) Assume that

• H1 ≡ Both sequences (yn )n≥0 and (h(yn ))n≥0 are bounded,


• H2 ≡ ∀ n ≥ 0, γn > 0, limn γn = 0 and \sum_{n\ge 1} \gamma_n = +\infty,

• H3 ≡ ∀ T ∈ (0, +∞), \displaystyle \lim_{n\to+\infty}\ \sup_{t\in[0,T]} \Big| \sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k\, \pi_k \Big| = 0.

(a) The set Y ∞ := {limiting points of (yn )n≥0 } is a compact connected set.
(b) The sequence (y (n) )n≥1 is sequentially relatively compact (5 ) for the topology of the uniform
convergence on compact sets on the space B(R+ , Rd ) of bounded functions from R+ to Rd (6 ), and
all its limiting points lie in C(R+ , Y ∞ ).

Proof. Let L = supn∈N |h(yn )|.


(a) Let T0 = supn∈N γn < +∞. Then it follows from H3 that
\[
|\gamma_{n+1}\pi_{n+1}| \le \sup_{t\in[0,T_0]} \Big| \sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k\, \pi_k \Big| \longrightarrow 0 \quad \text{as } n \to +\infty.
\]
It then follows from (6.32) that
\[
|y_{n+1} - y_n| \le \gamma_{n+1} L + |\gamma_{n+1}\pi_{n+1}| \to 0 \quad \text{as } n \to +\infty.
\]

As a consequence (see [6] for details) the set Y ∞ is compact and “bien enchaîné” (7 ), hence connected.
(b) The sequence \big( \int_0^{\,\cdot} h(y^{(n)}_s)\, ds \big)_{n\ge 0} is uniformly Lipschitz continuous with Lipschitz
coefficient L since, for every s, t ∈ R+ , s ≤ t,
\[
\Big| \int_0^{t} h(y^{(n)}_u)\, du - \int_0^{s} h(y^{(n)}_u)\, du \Big| \le \int_s^{t} |h(y^{(n)}_u)|\, du \le L(t-s).
\]
Hence, it follows from Arzelà-Ascoli's Theorem that (y (n) )n≥0 is relatively compact in C(R+ , Rd )
endowed with the topology, denoted UK , of the uniform convergence on compact sets. On the other
hand, for every T ∈ (0, +∞),

\[
\sup_{t\in[0,T]} \Big| y^{(n)}_t - y^{(n)}_0 + \int_0^{t} h(y^{(n)}_s)\, ds \Big|
\le \sup_{k\ge n+1} \gamma_k L + \sup_{t\in[0,T]} \Big| \sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k\, \pi_k \Big| \longrightarrow 0 \quad \text{as } n \to +\infty
\]
owing to H2 and H3 , i.e.
\[
y^{(n)} - y^{(n)}_0 + \int_0^{\,\cdot} h(y^{(n)}_s)\, ds \overset{U_K}{\longrightarrow} 0 \quad \text{as } n \to +\infty.
\tag{6.34}
\]
The sequence (yn )n≥0 is bounded and, as a consequence, the sequence (y (n) )n≥0 is UK -relatively
compact with the same UK -limiting values as \big( y^{(n)}_0 - \int_0^{\,\cdot} h(y^{(n)}_s)\, ds \big)_{n\ge 0} . Hence, these limiting values
are continuous functions having values in Y ∞ . ♦
5
In a metric space (E, d), a sequence (xn )n≥0 is sequentially relatively compact if from any subsequence one can
extract a converging subsequence for the distance d.
6
This topology is defined by the metric d(f, g) = \sum_{k\ge 1} 2^{-k} \min\big( \sup_{t\in[0,k]} |f(t)-g(t)|, 1 \big).
7
A set A in a metric space (E, d) is “bien enchaîné” if for every a, a′ ∈ A and every ε > 0, there exist an integer
n ≥ 1 and a0 , . . . , an such that a0 = a, an = a′ and d(ai , ai+1 ) ≤ ε for every i = 0, . . . , n − 1. Any connected set C
in E is bien enchaîné. The converse is true if C is compact.

Connection algorithm-ODE
To state the first theorem on the so-called ODE method, we introduce the reverse differ-
ential equation
\[
ODE^* \equiv \dot y = h(y).
\]

Theorem 6.5 (ODE II) Assume H1 -H3 hold and that the mean field function h is continuous.
(a) Any limiting function of the sequence (y (n) )n≥0 is a Y ∞ -valued solution of
ODE ≡ ẏ = −h(y).
(b) Assume that ODE satisfies the following uniqueness property: for every y0 ∈ Y ∞ , ODE admits
a unique Y ∞ -valued solution (Φt (y0 ))t∈R+ starting at Φ0 (y0 ) = y0 . Also assume uniqueness for
ODE ∗ (defined likewise). Then, the set Y ∞ is a compact, connected set, flow-invariant for
both ODE and ODE ∗ .

Proof: (a) Given the above proposition, one has to check that any limiting function y (∞) =
UK − limn→+∞ y (ϕ(n)) is a solution of ODE. For every t ∈ R+ , y^{(ϕ(n))}_t → y^{(∞)}_t , hence h(y^{(ϕ(n))}_t ) →
h(y^{(∞)}_t ) since the mean field function h is continuous. Then, by the Lebesgue dominated convergence
theorem, one derives that for every t ∈ R+ ,
\[
\int_0^{t} h(y^{(\varphi(n))}_s)\, ds \longrightarrow \int_0^{t} h(y^{(\infty)}_s)\, ds.
\]
One also has y^{(ϕ(n))}_0 → y^{(∞)}_0 so that finally, letting ϕ(n) → +∞ in (6.34), we obtain
\[
y^{(\infty)}_t = y^{(\infty)}_0 - \int_0^{t} h(y^{(\infty)}_s)\, ds.
\]

(b) Any y0 ∈ Y ∞ is the limit of a sequence (yϕ(n) ). Up to a new extraction, still denoted ϕ(n) for
convenience, we may assume that y (ϕ(n)) → y (∞) as n → ∞, uniformly on compact sets of R+ .
The function y (∞) is a Y ∞ -valued solution of ODE and y^{(∞)}_t = Φt (y0 ) owing to the uniqueness
assumption, which implies the invariance of Y ∞ under the flow of ODE.
For every p ∈ N, we consider for large enough n, say n ≥ np , the sequence (yN (Γϕ(n) −p) )n≥np . It
is clear, by mimicking the proof of Proposition 6.7, that all the sequences of functions (y (N (Γϕ(n) −p)) )n≥np
are UK -relatively compact. By a diagonal extraction procedure, we may assume that, for every
p ∈ N,
\[
y^{(N(\Gamma_{\varphi(n)}-p))} \overset{U_K}{\longrightarrow} y^{(\infty),p} \quad \text{as } n \to +\infty.
\]
Since y^{(N(\Gamma_{\varphi(n)}-p-1))}_{t+1} = y^{(N(\Gamma_{\varphi(n)}-p))}_t for every t ∈ R+ and n ≥ np+1 , one has
\[
\forall\, p \in \mathbb{N},\ \forall\, t \in \mathbb{R}_+, \qquad y^{(\infty),p+1}_{t+1} = y^{(\infty),p}_t .
\]
Furthermore, it follows from (a) that the functions y (∞),p are Y ∞ -valued solutions of ODE. One
defines
\[
\tilde y^{(\infty)}_t = y^{(\infty),p}_{p-t}, \qquad t \in [p-1, p],
\]
which satisfies in fact, for every p ∈ N, \tilde y^{(\infty)}_t = y^{(\infty),p}_{p-t} , t ∈ [0, p]. This implies that ỹ (∞) is
a Y ∞ -valued solution of ODE ∗ starting from y0 , defined on ∪p≥0 [0, p] = R+ . Uniqueness implies that
\tilde y^{(\infty)}_t = Φ∗t (y0 ), which completes the proof. ♦

Remark. If uniqueness fails either for ODE or for ODE ∗ , one still has that Y ∞ is left invariant
by ODE and ODE ∗ in the sense that, from every y0 ∈ Y ∞ , there exist Y ∞ -valued solutions of
ODE and ODE ∗ .

This property is the first step of the deep connection between the asymptotic behavior of a
recursive stochastic algorithm and its associated mean field ODE. Item (b) can be seen as a first
criterion to screen possible candidates for the set of limiting values of the algorithm. Thus any zero y ∗
of h, or equivalently any equilibrium point of ODE, satisfies the requested invariance condition since
Φt (y ∗ ) = Φ∗t (y ∗ ) = y ∗ for every t ∈ R+ . No other single point can satisfy this invariance property.
More generally, we have the following result.

Corollary 6.2 If the ODE ≡ ẏ = −h(y) has finitely many compact connected two-sided invariant
sets Xi , i ∈ I (I finite), then the sequence (yn )n≥1 converges toward one of these sets, i.e. dist(yn , Xi ) → 0 for some i ∈ I.

As an elementary example let us consider the ODE
\[
\dot y = (1 - |y|)\, y + \varsigma\, y^{\perp}, \qquad y_0 \in \mathbb{R}^2,
\]
where y = (y1 , y2 ) and y ⊥ = (−y2 , y1 ). Then the unit circle C(0; 1) is clearly a connected compact
set invariant by ODE and ODE ∗ . The singleton {0} also satisfies this invariance property. In fact
C(0; 1) is an attractor of ODE and {0} is a repeller. So, we know that any stochastic algorithm
which satisfies H1 -H3 with mean field function h(y1 , y2 ) = −\big( (1 - |y|) y + \varsigma y^{\perp} \big) will converge
either toward C(0; 1) or {0}.
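This dichotomy can be observed numerically. The sketch below (our own toy experiment; ς = 0.5, the noise level and the seed are arbitrary choices) runs a noisy Euler scheme with decreasing steps on the drift (1 − |y|)y + ςy⊥ with y⊥ = (−y2, y1): the path escapes the repeller {0} and its radius settles near the attracting circle.

```python
import math
import random

random.seed(42)
sigma_rot = 0.5                 # the constant "ς" in the drift
y1, y2 = 0.3, -0.2              # start away from both invariant sets
for n in range(1, 200001):
    g = 1.0 / n                 # decreasing steps gamma_n = 1/n
    r = math.hypot(y1, y2)
    d1 = (1.0 - r) * y1 - sigma_rot * y2      # drift (1 - |y|) y + ς y^perp
    d2 = (1.0 - r) * y2 + sigma_rot * y1      # with y^perp = (-y2, y1)
    y1 += g * (d1 + random.gauss(0.0, 0.1))   # noisy Euler step
    y2 += g * (d2 + random.gauss(0.0, 0.1))
radius = math.hypot(y1, y2)     # ends up close to 1: captured by C(0; 1)
```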
Sharper characterizations of the possible sets of limiting points of the sequence (yn )n≥0 have
been established in close connection with the theory of perturbed dynamical systems (see [20], see
also [50] when uniqueness fails and the ODE has no flow). To be slightly more precise, it has
been shown that the set of limiting points of the sequence (yn )n≥0 is internally chain recurrent
or, equivalently, contains no strict attractor for the dynamics of the ODE, i.e. no subset A ⊂ Y ∞ ,
A 6= Y ∞ , such that Φt (y) converges to A uniformly in y ∈ Y ∞ .

Unfortunately, the above results are not able to discriminate between these two sets, though it
seems more likely that the algorithm converges toward the unit circle, like the flow of the ODE
does (except when starting from 0). This intuition can be confirmed under additional assumptions
ensuring that 0 is a noisy equilibrium for the algorithm, e.g. if the algorithm is, at the origin, a
Markovian algorithm of the form (6.3) such that the symmetric nonnegative matrix
\[
\mathbb{E}\, H(0, Z_1) H(0, Z_1)^* \neq 0.
\]

Practical aspects of assumption H3 To make the connection with the original form of stochas-
tic algorithms, we come back to hypothesis H3 in the following proposition. In particular, it em-
phasizes that this condition is less stringent than a standard convergence assumption on the series.

Proposition 6.8 Assumption H3 is satisfied in two situations (or their “additive” combination):
(a) Remainder term: If πn → 0 and γn → 0 as n → +∞, then
\[
\sup_{t\in[0,T]} \Big| \sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k\, \pi_k \Big|
\le \sup_{k\ge n+1} |\pi_k|\, \big( \Gamma_{N(\Gamma_n+T)} - \Gamma_n \big)
\le T \sup_{k\ge n+1} |\pi_k| \longrightarrow 0 \quad \text{as } n \to +\infty.
\]
(b) Converging martingale term: The series M_n^\gamma = \sum_{k=1}^{n} \gamma_k\, \pi_k converges in Rd . Consequently
it satisfies a Cauchy condition so that
\[
\sup_{t\in[0,T]} \Big| \sum_{k=n+1}^{N(\Gamma_n+t)} \gamma_k\, \pi_k \Big|
\le \sup_{\ell\ge n+1} \Big| \sum_{k=n+1}^{\ell} \gamma_k\, \pi_k \Big| \longrightarrow 0 \quad \text{as } n \to +\infty.
\]
In practical situations, one meets the combination of these two situations:
\[
\pi_n = \Delta M_n + r_n
\]
where rn is a remainder term which goes to 0 a.s. and M_n^\gamma = \sum_{k=1}^{n} \gamma_k\, \Delta M_k is an a.s. convergent martingale.
where rn is a remainder term which goes to 0 a.s. and Mnγ is an a.s. convergent martingale.
This convergence property can even be relaxed for the martingale term. Thus we have the
following classical results where H3 is satisfied while the martingale Mnγ may diverge.

Proposition 6.9 (a) [Métivier-Priouret [113]] Let (∆Mn )n≥1 be a sequence of martingale incre-
ments and (γn )n≥1 be a sequence of nonnegative steps satisfying \sum_n \gamma_n = +\infty. Then, H3
is a.s. satisfied (with πn = ∆Mn ) as soon as there exists a couple of Hölder conjugate exponents
(p, q) ∈ [1, +∞)2 (i.e. \frac1p + \frac1q = 1) such that
\[
\sup_n \mathbb{E}\, |\Delta M_n|^p < +\infty \quad \text{and} \quad \sum_n \gamma_n^{1+\frac{q}{2}} < +\infty.
\]
This allows for the use of steps of the form γn ∼ c1 n−a , a > \frac{2}{2+q} = \frac{2(p-1)}{3p-2} .
(b) Assume that there exists a real number c > 0 such that
\[
\forall\, \lambda \in \mathbb{R}, \qquad \mathbb{E}\, e^{\lambda \Delta M_n} \le e^{c\frac{\lambda^2}{2}}.
\]
Then, for every step sequence (γn )n≥1 such that \sum_{n\ge 1} e^{-c/\gamma_n} < +\infty, Assumption H3 is satisfied with πn =
∆Mn . This allows for the use of steps of the form γn ∼ c1 n−a , a > 0, and γn = c1 (log n)−(1+a) ,
a > 0.

Examples: Typical examples where the sub-Gaussian assumption is satisfied are the following:
• |∆Mn | ≤ K ∈ R+ since, owing to Hoeffding's Lemma, \mathbb{E}\, e^{\lambda \Delta M_n} \le e^{\frac{\lambda^2 (2K)^2}{8}} = e^{\frac{\lambda^2 K^2}{2}} .
• ∆Mn ∼ N (0; σn2 ) with σn ≤ K, since then \mathbb{E}\, e^{\lambda \Delta M_n} = e^{\frac{\lambda^2 \sigma_n^2}{2}} \le e^{\frac{\lambda^2 K^2}{2}} .
The first case is very important since in many situations the perturbation term is a martingale
increment which is structurally bounded.
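The first bound can be checked directly for a Rademacher increment (K = 1), for which E e^{λ∆M} = cosh(λ):

```python
import math

# Hoeffding's bound for |dM| <= K = 1: E[exp(lam * dM)] = cosh(lam) <= exp(lam**2 / 2)
ok = all(
    math.cosh(lam) <= math.exp(0.5 * lam * lam)
    for lam in (x / 10.0 for x in range(-60, 61))
)
```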

Application to an extended Lyapunov approach


Relying on claim (a) in Proposition 6.7, one can also derive directly a.s. convergence results for
an algorithm.

Proposition 6.10 (G-Lemma, see [50]) Assume H1 -H3 . Let G : Rd → R+ be a function satis-
fying
\[
(G) \equiv \Big( \lim_{n\to+\infty} y_n = y_\infty \ \text{and} \ \lim_{n\to+\infty} G(y_n) = 0 \Big) \Longrightarrow G(y_\infty) = 0.
\]
Assume that the sequence (yn )n≥0 satisfies
\[
\sum_{n\ge 0} \gamma_{n+1}\, G(y_n) < +\infty.
\tag{6.35}
\]
Then, there exists a connected component X ∗ of the set {G = 0} such that \lim_n \mathrm{dist}(y_n, X^*) = 0.
Proof. First, it follows from Proposition 6.7 that the sequence (y (n) )n≥0 is UK -relatively compact
with limiting functions lying in C(R+ , Y ∞ ), where Y ∞ still denotes the compact connected set of
limiting values of (yn )n≥0 .
Set, for every y ∈ Rd ,
\[
\widetilde G(y) = \liminf_{x\to y} G(x) = \inf\big\{ \liminf_n G(x_n),\ x_n \to y \big\}, \quad \text{so that } 0 \le \widetilde G \le G.
\]
The function G̃ is the l.s.c. envelope of the function G, i.e. the highest l.s.c. function not greater
than G. In particular, under Assumption (G),
\[
\{G = 0\} = \{\widetilde G = 0\} \quad \text{is closed.}
\]
Assumption (6.35) reads
\[
\int_0^{+\infty} G(y^{(0)}_s)\, ds < +\infty.
\]

Let y∞ ∈ Y ∞ . Up to at most two extractions, one may assume that y (ϕ(n)) → y (∞) for the UK
topology, where y^{(∞)}_0 = y∞ . It follows that
\[
0 \le \int_0^{+\infty} \widetilde G(y^{(\infty)}_s)\, ds
   = \int_0^{+\infty} \widetilde G\big( \lim_n y^{(\varphi(n))}_s \big)\, ds
   \le \int_0^{+\infty} \liminf_n \widetilde G(y^{(\varphi(n))}_s)\, ds \quad (\text{since } \widetilde G \text{ is l.s.c.})
\]
\[
   \le \liminf_n \int_0^{+\infty} \widetilde G(y^{(\varphi(n))}_s)\, ds \quad (\text{owing to Fatou's Lemma})
   \le \liminf_n \int_0^{+\infty} G(y^{(\varphi(n))}_s)\, ds
   = \liminf_n \int_{\Gamma_{\varphi(n)}}^{+\infty} G(y^{(0)}_s)\, ds = 0.
\]
Consequently, \int_0^{+\infty} \widetilde G(y^{(\infty)}_s)\, ds = 0, which implies that \widetilde G(y^{(\infty)}_s) = 0\ ds-a.s. Now, s \mapsto y^{(\infty)}_s
is continuous, so it follows that \widetilde G(y^{(\infty)}_0) = 0 since G̃ is l.s.c., which in turn implies G(y^{(∞)}_0 ) = 0, i.e.
G(y∞ ) = 0. As a consequence Y ∞ ⊂ {G = 0}, which yields the result since on the other hand Y ∞ is
a connected set. ♦

Now we are in a position to prove the convergence of stochastic gradient and stochastic pseudo-gradient
procedures in the multi-dimensional case.

Corollary 6.3

Proof. ♦

6.4.2 L2 -rate of convergence and application to convex optimization


Proposition 6.11 Let (Yn )n≥1 be a stochastic algorithm defined by (6.3) where the function H
satisfies the linear quadratic growth assumption (6.6), namely
\[
\forall\, y \in \mathbb{R}^d, \qquad \|H(y, Z)\|_2 \le C(1 + |y|).
\]
Assume Y0 ∈ L2 (P) is independent of the i.i.d. sequence (Zn )n≥1 . Assume there exist y ∗ ∈ Rd and
α > 0 such that the strong mean-reverting assumption
\[
\forall\, y \in \mathbb{R}^d, \qquad (y - y^* \,|\, h(y)) \ge \alpha |y - y^*|^2
\tag{6.36}
\]
holds. Finally, assume that the step sequence (γn )n≥1 satisfies the usual decreasing step assumption (6.7)
and the additional assumption
\[
(G)_\alpha \equiv \limsup_n a_n := \limsup_n \frac{1}{\gamma_{n+1}} \Big( \frac{\gamma_n}{\gamma_{n+1}}\big( 1 - 2\,\alpha\,\gamma_{n+1} \big) - 1 \Big) = -\kappa^* < 0.
\]
Then
\[
Y_n \overset{a.s.}{\longrightarrow} y^* \quad \text{and} \quad \|Y_n - y^*\|_2 = O(\sqrt{\gamma_n}).
\]

Examples. • If γn = \frac{\gamma}{n^\vartheta} , \frac12 < ϑ < 1, then (G)α is satisfied for any α > 0.
• If γn = \frac{\gamma}{n} , Condition (G)α reads \frac{1 - 2\alpha\gamma}{\gamma} = -\kappa^* < 0, or equivalently γ > \frac{1}{2\alpha} .

Proof: The fact that Yn → y ∗ a.s. is a straightforward consequence of Corollary 6.1 (Robbins-
Monro). As concerns the quadratic rate of convergence, we re-start from the classical proof of
the Robbins-Siegmund Lemma:
\[
|Y_{n+1} - y^*|^2 = |Y_n - y^*|^2 - 2\gamma_{n+1}\big( H(Y_n, Z_{n+1}) \,\big|\, Y_n - y^* \big) + \gamma_{n+1}^2\, |H(Y_n, Z_{n+1})|^2 .
\]
Since Yn − y ∗ is Fn -measurable, one has
\[
\mathbb{E}\big( H(Y_n, Z_{n+1}) \,\big|\, Y_n - y^* \big)
= \mathbb{E}\Big( \big( \mathbb{E}\big( H(Y_n, Z_{n+1}) \,\big|\, \mathcal F_n \big) \,\big|\, Y_n - y^* \big) \Big)
= \mathbb{E}\big( h(Y_n) \,\big|\, Y_n - y^* \big).
\]
Likewise we get
\[
\mathbb{E}\big( |H(Y_n, Z_{n+1})|^2 \big) \le C^2 (1 + |Y_n|)^2 .
\]

This implies
\[
\mathbb{E}\,|Y_{n+1} - y^*|^2 = \mathbb{E}\,|Y_n - y^*|^2 - 2\gamma_{n+1}\, \mathbb{E}\big( h(Y_n) \,\big|\, Y_n - y^* \big) + \gamma_{n+1}^2\, \mathbb{E}\,|H(Y_n, Z_{n+1})|^2
\]
\[
\le \mathbb{E}\,|Y_n - y^*|^2 - 2\,\alpha\,\gamma_{n+1}\, \mathbb{E}\,|Y_n - y^*|^2 + \gamma_{n+1}^2\, C' \big( 1 + \mathbb{E}\,|Y_n - y^*|^2 \big)
\]
owing successively to the linear quadratic growth and the strong mean-reverting assumptions.
Finally,
\[
\mathbb{E}\,|Y_{n+1} - y^*|^2 \le \mathbb{E}\,|Y_n - y^*|^2 \big( 1 - 2\,\alpha\,\gamma_{n+1} + C'\gamma_{n+1}^2 \big) + C'\gamma_{n+1}^2 .
\]
If we set, for every n ≥ 1,
\[
u_n = \frac{\mathbb{E}\,|Y_n - y^*|^2}{\gamma_n},
\]
the above inequality can be rewritten, using the expression for an ,
\[
u_{n+1} \le u_n\, \frac{\gamma_n}{\gamma_{n+1}} \big( 1 - 2\,\alpha\,\gamma_{n+1} + C'\gamma_{n+1}^2 \big) + C'\gamma_{n+1}
        = u_n \big( 1 + \gamma_{n+1}(a_n + C'\gamma_n) \big) + C'\gamma_{n+1} .
\]
Let n0 be an integer such that, for every n ≥ n0 , an ≤ −3κ∗ /4 and C ′ γn ≤ κ∗ /4. For these integers n,
\[
u_{n+1} \le u_n \Big( 1 - \frac{\kappa^*}{2}\,\gamma_{n+1} \Big) + C'\gamma_{n+1} .
\]
Then, one derives by induction that
\[
\forall\, n \ge n_0, \qquad 0 \le u_n \le \max\Big( u_{n_0}, \frac{2C'}{\kappa^*} \Big),
\]
which completes the proof. ♦
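The rate ‖Yn − y∗‖₂ = O(√γn) can be checked by simulation on a toy example of ours: H(y, Z) = y + Z with Z ∼ N(0; 1), so that h(y) = y, y∗ = 0 and α = 1, with γn = 1/n (admissible since γ = 1 > 1/(2α)); the normalized quantity un = E|Yn − y∗|²/γn should remain bounded (here it is close to 1):

```python
import random

random.seed(7)
M, n_steps, gamma0 = 1000, 1000, 1.0   # gamma_n = gamma0 / n
sq_err = 0.0
for _ in range(M):
    y = 5.0                            # Y_0 far from the target y* = 0
    for n in range(1, n_steps + 1):
        y -= (gamma0 / n) * (y + random.gauss(0.0, 1.0))  # H(y, Z) = y + Z
    sq_err += y * y
u_n = (sq_err / M) / (gamma0 / n_steps)  # Monte Carlo estimate of E|Y_n|^2 / gamma_n
```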

 Exercise. Show a similar result (under appropriate assumptions) for an algorithm of the form
\[
Y_{n+1} = Y_n - \gamma_{n+1}\big( h(Y_n) + \Delta M_{n+1} \big)
\]
where h : Rd → Rd is a Borel continuous function and (∆Mn )n≥1 is a sequence of Fn -martingale
increments satisfying
\[
|h(y)| \le C(1 + |y - y^*|) \quad \text{and} \quad \sup_n \mathbb{E}\big( |\Delta M_{n+1}|^2 \,\big|\, \mathcal F_n \big) \le C(1 + |Y_n|^2).
\]

 Application to strictly convex optimization: Let L : Rd → R+ be a twice differentiable convex
function with a Lipschitz continuous gradient and D2 L ≥ αId where α > 0 (in the sense that
D2 L(x) − αId is a nonnegative symmetric matrix for every x). Then, it follows from Lemma 6.1(b) that
lim|y|→+∞ L(y) = +∞ and that, for every x, y ∈ Rd ,
\[
\big( \nabla L(x) - \nabla L(y) \,\big|\, x - y \big) \ge \alpha |x - y|^2 .
\]
Hence L, being non-negative, attains its minimum at a point y ∗ ∈ Rd . In fact, the above inequality
straightforwardly implies that y ∗ is unique since {∇L = 0} is clearly reduced to {y ∗ }.
Moreover, for every y, u ∈ Rd ,
\[
0 \le u^* D^2 L(y)\, u = \lim_{t\to 0} \frac{\big( \nabla L(y + tu) - \nabla L(y) \,\big|\, u \big)}{t} \le [\nabla L]_{\mathrm{Lip}}\, |u|^2 < +\infty.
\]
In that framework, the following proposition shows that the “value function” L(Yn ) of the stochastic
gradient descent converges in L1 at a O(γn )-rate. This is a straightforward consequence of the above
Proposition 6.11.

Proposition 6.4.1 Let L : Rd → R+ be a twice differentiable convex function with a Lipschitz
continuous gradient and such that D2 L ≥ αId where α > 0. Let (Yn )n≥0 be a stochastic gradient
descent associated to L, i.e. a stochastic algorithm defined by (6.3) with mean function h = ∇L.
Assume that the linear growth assumption (6.6) on the state function H, the assumptions on Y0
and on the innovation sequence (Zn )n≥1 from Proposition 6.11 hold true. Finally, assume that the
step assumptions (6.7) and (G)α are both satisfied. Then
\[
\mathbb{E}\, L(Y_n) - L(y^*) = \big\| L(Y_n) - L(y^*) \big\|_1 = O(\gamma_n).
\]

Proof. It is clear that such a stochastic algorithm satisfies the assumptions of the above Proposi-
tion 6.11, especially the strong mean-reverting assumption (6.36), owing to the preliminaries on L
that precede the proposition. By the fundamental theorem of calculus, for every n ≥ 0 there exists
Ξn ∈ (y ∗ , Yn ) (geometric interval in Rd ) such that
\[
L(Y_n) - L(y^*) = \big( \nabla L(y^*) \,\big|\, Y_n - y^* \big) + \frac12 (Y_n - y^*)^* D^2 L(\Xi_n)(Y_n - y^*)
               \le \frac12\, [\nabla L]_{\mathrm{Lip}}\, |Y_n - y^*|^2
\]
since ∇L(y ∗ ) = 0. One concludes by taking the expectation in the above inequality and applying Proposition 6.11. ♦
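A toy simulation of ours illustrating this O(γn) rate for the value function: L(y) = ½(y₁² + 2y₂²), so that D²L = diag(1, 2) ≥ I₂ (α = 1), H(y, Z) = ∇L(y) + Z and γn = 1/n; the scaled gap n·E[L(Yn) − L(y∗)] should remain bounded:

```python
import random

random.seed(11)
M, n_steps = 1000, 1000
hess = (1.0, 2.0)                      # diagonal Hessian of L
acc = 0.0
for _ in range(M):
    y = [3.0, -3.0]
    for n in range(1, n_steps + 1):
        g = 1.0 / n                    # gamma_n = 1/n, admissible since alpha = 1
        y = [y[i] - g * (hess[i] * y[i] + random.gauss(0.0, 1.0)) for i in range(2)]
    acc += 0.5 * (hess[0] * y[0] ** 2 + hess[1] * y[1] ** 2)  # L(Y_n), with L(y*) = 0
scaled_gap = n_steps * acc / M         # n * E[L(Y_n) - L(y*)] stays bounded
```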

 Exercise. Show a similar result for a stochastic pseudo-gradient under appropriate assumptions
on the mean function h.

6.4.3 Weak rate of convergence: CLT


The CLT for stochastic algorithms. In standard settings, a stochastic algorithm converges to its
target at a √γn -rate (which suggests using steps γn = nc ). To be precise, \frac{Y_n - y_*}{\sqrt{\gamma_n}} converges in
distribution to some normal distribution with a dispersion matrix based on H(y∗ , Z).
The Central Limit Theorem has given rise to an extensive literature, starting from some pio-
neering (independent) works by Bouton and Kushner in the early 1980’s. We give here a result
obtained by Pelletier whose main asset is to be “local” in the following sense: the CLT holds
on the set of convergence of the algorithm to an equilibrium, which makes possible a straightforward
application to multi-target algorithms. It can probably be of significant help to elucidate the rate
of convergence of algorithms with constraints or multiple projections like those recently introduced
by Chen. So is the case of Arouna’s adaptive variance reduction procedure, for which a CLT has
been recently established by Lelong ([101]) using a direct approach.

Theorem 6.6 (Pelletier (1995) [135]) We consider the stochastic procedure defined by (6.3).
Let y∗ ∈ {h = 0} be an equilibrium point.
(i) The equilibrium point y∗ is an attractor for ODEh : Assume that y∗ is a “strong” attractor
for the ODE ≡ ẏ = −h(y) in the following sense (8 ):
h is differentiable at y∗ and all the eigenvalues of Dh(y∗ ) have positive real parts.
(ii) Regularity and growth control of H: Assume that the function H satisfies the following regu-
larity and growth control property:
\[
y \mapsto \mathbb{E}\, H(y, Z) H(y, Z)^t \ \text{is continuous at } y_* \quad \text{and} \quad y \mapsto \mathbb{E}\,|H(y, Z)|^{2+\delta} \ \text{is locally bounded on } \mathbb{R}^d
\]
for some δ > 0.
(iii) Non-degenerate asymptotic variance: Assume that the covariance matrix of H(y∗ , Z),
\[
\Sigma_H(y_*) := \mathbb{E}\, H(y_*, Z)\big( H(y_*, Z) \big)^t, \ \text{is positive definite in } \mathcal S(d, \mathbb{R}).
\tag{6.37}
\]
(iv) Specification of the step sequence:
\[
\forall\, n \ge 1, \qquad \gamma_n = \frac{c}{b + n^\vartheta}, \qquad b, c > 0, \quad \frac12 < \vartheta \le 1,
\]
with the additional constraint, when ϑ = 1,
\[
c > \frac{1}{2\,\Re e(\lambda_{\min})}
\tag{6.38}
\]
(λmin denotes the eigenvalue of Dh(y∗ ) with the lowest real part).
Then the a.s. convergence is ruled, on the convergence event {Yn → y∗ }, by the following Central
Limit Theorem:
\[
\frac{Y_n - y_*}{\sqrt{\gamma_n}} \overset{\mathcal L_{stably}}{\longrightarrow} \mathcal N(0; \Sigma),
\tag{6.39}
\]
with
\[
\Sigma := \int_0^{+\infty} e^{-\big( Dh(y_*) - \frac{I_d}{2c_*} \big)s}\, \Sigma_H(y_*)\, \Big( e^{-\big( Dh(y_*) - \frac{I_d}{2c_*} \big)s} \Big)^t ds,
\]
where c∗ = +∞ if ϑ 6= 1 and c∗ = c if ϑ = 1.

The “Lstably ” stable convergence in distribution mentioned in (6.39) means that for every
bounded continuous function f and every A ∈ F,
\[
\mathbb{E}\Big( 1_{\{Y_n \to y_*\} \cap A}\, f\Big( \frac{Y_n - y_*}{\sqrt{\gamma_n}} \Big) \Big)
\overset{n\to\infty}{\longrightarrow}
\mathbb{E}\Big( 1_{\{Y_n \to y_*\} \cap A}\, f\big( \sqrt{\Sigma}\,\zeta \big) \Big), \qquad \zeta \sim \mathcal N(0; I_d).
\]

Remarks. • Other rates can be obtained when ϑ = 1 and c < \frac{1}{2\Re e(\lambda_{\min})} or, more generally, when
the step goes to 0 faster while still satisfying the usual decreasing step Assumption (6.7). So is the
case for example when γn = \frac{c}{b\, n \log(n+1)} . Thus, in the latter case, one shows that there exists an
Rd -valued random vector Ξ such that
\[
Y_n = y_* + \sqrt{\gamma_n}\,\Xi + o(\sqrt{\gamma_n}) \quad \text{a.s. as } n \to +\infty.
\]
8
This is but a locally uniform attractivity condition for y∗ viewed as an equilibrium of the ODE ẏ = −h(y).

• When d = 1, then λmin = h′ (y∗ ), ΣH (y∗ ) = Var(H(y∗ , Z)) and (with γn = c/n)
\[
c\,\Sigma = \mathrm{Var}\big( H(y_*, Z) \big)\, \frac{c^2}{2c\, h'(y_*) - 1},
\]
which reaches its minimum as a function of c at
\[
c_{\mathrm{opt}} = \frac{1}{h'(y_*)}
\]
with a resulting asymptotic variance
\[
\frac{\mathrm{Var}\big( H(y_*, Z) \big)}{h'(y_*)^2}.
\]
One shows that this is the lowest possible variance in such a procedure. Consequently, the best
choice for the step sequence (γn ) is
\[
\gamma_n := \frac{1}{h'(y_*)\, n} \qquad \text{or, more generally,} \qquad \gamma_n := \frac{1}{h'(y_*)(b + n)}
\]

where b can be tuned to “control” the step at the beginning of the simulation (when n is small).
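On a sample-mean toy example of ours — H(y, Z) = y − Z with Z ∼ N(1; σ²), so h′(y∗) = 1 and c_opt = 1 — one can check by simulation that the variance of √n(Yn − y∗) is close to Var(H(y∗, Z))/h′(y∗)² = σ²:

```python
import math
import random

random.seed(3)
M, n_steps, sigma = 2000, 1000, 2.0
vals = []
for _ in range(M):
    y = 0.0
    for n in range(1, n_steps + 1):
        y -= (1.0 / n) * (y - random.gauss(1.0, sigma))  # optimal step c_opt = 1/h'(y*) = 1
    vals.append(math.sqrt(n_steps) * (y - 1.0))          # sqrt(n) (Y_n - y*)
var_hat = sum(v * v for v in vals) / M   # should be close to sigma^2 = 4
```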
At this stage, one encounters the same difficulties as with deterministic procedures: y∗
being unknown, h′ (y∗ ) is unknown as well. One can imagine estimating this quantity by a companion
procedure of the algorithm, but this turns out not to be very efficient. A more efficient approach,
although not completely satisfactory in practice, is to implement the algorithm in its averaged
version (see Section 6.4.4 below).
However, one must keep in mind that this tuning of the step sequence is intended to optimize
the rate of convergence of the algorithm during its final convergence phase. In real applications,
this class of recursive procedures spends most of its time “exploring” the state space before getting
trapped in some attracting basin (which can be the basin of a local minimum in case of multiple
critical points). The CLT rate occurs once the algorithm is trapped.
An alternative to these procedures is to design some simulated annealing procedure which
“super-excites” the algorithm in order to improve the exploring phase. Thus, when the mean
function h is a gradient (h = ∇L), it finally converges – but only in probability – to the true
minimum of the potential/Lyapunov function L. The “final” convergence rate is worse owing
to the additional exciting noise. Practitioners often use the above Robbins-Monro or stochastic
gradient procedure with a sequence of steps (γn )n≥1 which decreases to a positive limit γ.
We now prove this CLT in the one-dimensional framework when the algorithm a.s. converges toward a
unique “target” y∗ . Our method of proof is the so-called SDE method, which heavily relies on
weak functional convergence arguments. We will have to admit a few important results about weak
convergence of processes, for which we provide precise references. An alternative proof can be
carried out relying on the CLT for triangular arrays of martingale increments (see [70]); such
an alternative proof – in a one-dimensional setting – can be found in [105].
We propose below a partial proof of Theorem 6.6, dealing only with the case where the equilibrium
point y ∗ is unique. The extension to a multi-target algorithm is not really more involved and we
refer to the original paper [135].

Proof of Theorem 6.6. We will not need the Markovian feature of the stochastic algorithms we
are dealing with in this chapter. In fact it will be more useful to decompose the algorithm in its
canonical form
\[
Y_{n+1} = Y_n - \gamma_{n+1}\big( h(Y_n) + \Delta M_{n+1} \big)
\]
where
\[
\Delta M_n = H(Y_{n-1}, Z_n) - h(Y_{n-1}), \qquad n \ge 1,
\]
is a sequence of Fn -martingale increments, where Fn = σ(Y0 , Z1 , . . . , Zn ), n ≥ 0. The so-called
SDE method follows the same principle as the ODE method but with the quantity of interest
\[
\Upsilon_n := \frac{Y_n - y_*}{\sqrt{\gamma_n}}, \qquad n \ge 1
\]
(this normalization is strongly suggested by the above L2 -convergence rate theorem). The under-
lying idea is to write a recursion on Υn which appears as an Euler scheme with decreasing step γn
of an SDE having a stationary/steady regime.
Step 1 (Toward the SDE): So, we assume that
\[
Y_n \overset{a.s.}{\longrightarrow} y^* \in \{h = 0\}.
\]
We may assume (up to a change of variable by the translation y ← y − y ∗ ) that
\[
y^* = 0.
\]
The differentiability of h at y ∗ = 0 reads
\[
h(Y_n) = Y_n\, h'(0) + Y_n\, \eta(Y_n) \quad \text{with} \quad \lim_{y\to 0} \eta(y) = \eta(y^*) = 0.
\]
Moreover the function η is bounded on the real line owing to the linear growth of h. For every
n ≥ 1, we have
\[
\Upsilon_{n+1} = \sqrt{\frac{\gamma_n}{\gamma_{n+1}}}\, \Upsilon_n
 - \gamma_{n+1}\sqrt{\frac{\gamma_n}{\gamma_{n+1}}}\, \Upsilon_n \big( h'(0) + \eta(Y_n) \big)
 - \sqrt{\gamma_{n+1}}\, \Delta M_{n+1}
\]
\[
\phantom{\Upsilon_{n+1}} = \Upsilon_n - \gamma_{n+1} \Big[ \sqrt{\frac{\gamma_n}{\gamma_{n+1}}}\, \Upsilon_n \big( h'(0) + \eta(Y_n) \big)
 - \frac{1}{\gamma_{n+1}}\Big( \sqrt{\frac{\gamma_n}{\gamma_{n+1}}} - 1 \Big) \Upsilon_n \Big]
 - \sqrt{\gamma_{n+1}}\, \Delta M_{n+1}.
\]

Assume that the sequence (γn )n≥1 is such that there exists c ∈ (0, +∞] satisfying
\[
\lim_n a'_n := \lim_n \frac{1}{\gamma_{n+1}}\Big( \sqrt{\frac{\gamma_n}{\gamma_{n+1}}} - 1 \Big) = \frac{1}{2c}.
\]
Note that this implies limn γn /γn+1 = 1. In fact it is satisfied by our two families of step sequences
of interest since
– if γn = \frac{c}{b+n} , c > 0, b ≥ 0, then \lim_n a'_n = \frac{1}{2c} > 0,
– if γn = \frac{c}{n^\vartheta} , c > 0, \frac12 < ϑ < 1, then \lim_n a'_n = 0, i.e. c = +∞.
Consequently, for every n ≥ 1,
\[
\Upsilon_{n+1} = \Upsilon_n - \gamma_{n+1}\, \Upsilon_n \Big( h'(0) - \frac{1}{2c} + \alpha^1_n + \alpha^2_n\, \eta(Y_n) \Big) - \sqrt{\gamma_{n+1}}\, \Delta M_{n+1}
\]
where (αni )n≥1 , i = 1, 2, are two deterministic sequences such that αn1 → 0 and αn2 → 1 as n → +∞.
Step 2 (Localization(s)): Since Yn → 0 a.s., one can write the scenario space Ω as follows:
\[
\forall\, \varepsilon > 0, \qquad \Omega = \bigcup_{N\ge 1} \Omega_{\varepsilon,N} \ \text{a.s.}, \qquad \text{where } \Omega_{\varepsilon,N} := \Big\{ \sup_{n\ge N} |Y_n| \le \varepsilon \Big\}.
\]
Let ε > 0 and N ≥ 1 be temporarily free parameters. We define the function h̃ = h̃ε by
\[
\forall\, y \in \mathbb{R}, \qquad \tilde h(y) = h(y)\, 1_{\{|y|\le\varepsilon\}} + K y\, 1_{\{|y|>\varepsilon\}}
\]
(K = K(ε) is also a parameter to be specified further on) and
\[
\tilde Y^{\varepsilon,N}_N = Y_N\, 1_{\{|Y_N|\le\varepsilon\}}, \qquad
\tilde Y^{\varepsilon,N}_{n+1} = \tilde Y^{\varepsilon,N}_n - \gamma_{n+1}\Big( \tilde h_\varepsilon\big( \tilde Y^{\varepsilon,N}_n \big) + 1_{\{|Y_n|\le\varepsilon\}}\, \Delta M_{n+1} \Big), \qquad n \ge N.
\]
It is straightforward to show by induction that, for every ω ∈ Ωε,N ,
\[
\forall\, n \ge N, \qquad \tilde Y^{\varepsilon,N}_n(\omega) = Y_n(\omega).
\]

To alleviate notation we will drop the exponent ε,N in what follows and write Ỹn instead of Ỹnε,N .
The “characteristics” (mean function and Fn -martingale increments) of this new algorithm are
\[
\tilde h \qquad \text{and} \qquad \Delta \tilde M_{n+1} = 1_{\{|Y_n|\le\varepsilon\}}\, \Delta M_{n+1}, \qquad n \ge N,
\]
which satisfy
\[
\sup_{n\ge N} \mathbb{E}\big( |\Delta \tilde M_{n+1}|^{2+\rho} \,\big|\, \mathcal F_n \big)
\le 2^{1+\rho} \sup_{|\theta|\le\varepsilon} \big( \mathbb{E}\,|H(\theta, Z)|^{2+\rho} \big) \le A(\varepsilon) < +\infty \quad \text{a.s.}
\]
In what follows we will study the normalized error defined by
\[
\tilde\Upsilon_n := \frac{\tilde Y_n}{\sqrt{\gamma_n}}, \qquad n \ge N.
\]

Step 3 (Specification of ε and K = K(ε)): Since h is differentiable at 0 (and h′ (0) > 0),
\[
h(y) = y\big( h'(0) + \eta(y) \big) \quad \text{with} \quad \lim_{y\to 0} \eta(y) = \eta(0) = 0.
\]
If γn = \frac{c}{n+b} with c > \frac{1}{2h'(0)} and b ≥ 0 (as prescribed in the statement of the theorem), we may
choose ρ = ρ(h) > 0 small enough so that
\[
c > \frac{1}{2 h'(0)(1-\rho)}.
\]
If γn = \frac{c}{(n+b)^\vartheta} , \frac12 < ϑ < 1, and, more generally, as soon as \lim_n a'_n = 0, any choice of ρ ∈ (0, 1) is
possible. Now let ε(ρ) > 0 be such that |y| ≤ ε(ρ) implies |η(y)| ≤ ρ h′ (0). It follows that
\[
y\, h(y) = y^2 \big( h'(0) + \eta(y) \big) \ge y^2 (1-\rho)\, h'(0) \qquad \text{if } |y| \le \varepsilon(\rho).
\]
Now we specify
\[
\varepsilon = \varepsilon(\rho) \qquad \text{and} \qquad K = (1-\rho)\, h'(0) > 0.
\]
As a consequence, the function h̃ satisfies
\[
\forall\, y \in \mathbb{R}, \qquad y\, \tilde h(y) \ge K y^2,
\]
so that, owing to the L2 -rate theorem (Proposition 6.11),
\[
\|\tilde Y_n\|_2 = O(\sqrt{\gamma_n})
\]
since (G)α is satisfied with α = K and κ∗ = 2K − \frac1c > 0.
Step 4 (The SDE method): First we apply Step 1 to our localized framework and we write
$$\widetilde\Upsilon_{n+1}=\widetilde\Upsilon_n-\gamma_{n+1}\,\widetilde\Upsilon_n\Big(h'(0)-\frac{1}{2c}+\tilde\alpha^1_n+\tilde\alpha^2_n\,\tilde\eta(\tilde Y_n)\Big)+\sqrt{\gamma_{n+1}}\,\widetilde{\Delta M}_{n+1}$$
where $\tilde\alpha^1_n\to0$ and $\tilde\alpha^2_n\to1$ are two deterministic sequences and $\tilde\eta$ is a bounded function.
At this stage, we rewrite the above recursive equation in continuous time, exactly like we did for the ODE method. Namely, we set
$$\Gamma_n=\gamma_1+\cdots+\gamma_n\qquad\text{and}\qquad \widetilde\Upsilon^{(0)}_{(\Gamma_n)}=\widetilde\Upsilon_n,\quad n\ge N,$$
and
$$\forall\,t\in\mathbb{R}_+,\qquad N(t)=\min\{k\,|\,\Gamma_{k+1}\ge t\}.$$
Hence, setting $a=h'(0)-\frac{1}{2c}>0$, we get for every $n\ge N$,
$$\widetilde\Upsilon^{(0)}_{(\Gamma_n)}=\widetilde\Upsilon^{(0)}_{(\Gamma_N)}-\int_{\Gamma_N}^{\Gamma_n}\widetilde\Upsilon^{(0)}_{(t)}\Big(a+\tilde\alpha^1_{N(t)}+\tilde\alpha^2_{N(t)}\,\tilde\eta\big(\tilde Y_{N(t)}\big)\Big)\,dt+\sum_{k=N}^{n}\sqrt{\gamma_k}\,\widetilde{\Delta M}_{k+1}.$$
Finally, for every $t\ge\Gamma_N$, we set
$$\widetilde\Upsilon^{(0)}_{(t)}=\widetilde\Upsilon^{(0)}_{(\Gamma_N)}-\underbrace{\int_{\Gamma_N}^{t}\widetilde\Upsilon^{(0)}_{(s)}\Big(a+\tilde\alpha^1_{N(s)}+\tilde\alpha^2_{N(s)}\,\tilde\eta\big(\tilde Y_{N(s)}\big)\Big)\,ds}_{=:\widetilde A^{(0)}_{(t)}}+\underbrace{\sum_{k=N}^{N(t)}\sqrt{\gamma_k}\,\widetilde{\Delta M}_{k+1}}_{=:\widetilde M^{(0)}_{(t)}}.$$
186 CHAPTER 6. STOCHASTIC APPROXIMATION AND APPLICATIONS TO FINANCE

As for the ODE, we will be interested in the functional asymptotics at infinity of $\widetilde\Upsilon^{(0)}_{(t)}$, but this time in a weak sense. We set
$$\widetilde\Upsilon^{(n)}_{(t)}=\widetilde\Upsilon^{(0)}_{(\Gamma_n+t)},\quad t\ge0,\ n\ge N,$$
and
$$\widetilde A^{(n)}_{(t)}:=\widetilde A^{(0)}_{(\Gamma_n+t)}\qquad\text{and}\qquad \widetilde M^{(n)}_{(t)}:=\widetilde M^{(0)}_{(\Gamma_n+t)}.$$

At this stage, we need two fundamental results about functional weak convergence. The first one is a criterion which implies the functional tightness of the distributions of a sequence of right continuous left limited (càdlàg) processes $X^{(n)}$ (viewed as probability measures on the space $D(\mathbb{R}_+,\mathbb{R})$ of càdlàg functions from $\mathbb{R}_+$ to $\mathbb{R}$). The second one is an extension of Donsker's Theorem for sequences of martingales.
We need to introduce the uniform continuity modulus defined, for every function $f:\mathbb{R}_+\to\mathbb{R}$ and $\delta,T>0$, by
$$w(f,\delta,T)=\sup_{s,t\in[0,T],\,|s-t|\le\delta}|f(t)-f(s)|.$$

The terminology comes from the seminal property of this modulus: f is (uniformly) continuous
over [0, T ] if and only if limδ→0 w(f, δ, T ) = 0.
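As a quick numerical illustration of this characterization (not part of the argument above), the sketch below approximates $w(f,\delta,T)$ on a time grid for a smooth test path and checks that the modulus shrinks with $\delta$; the grid size and the test path are arbitrary choices.

```python
import numpy as np

def modulus_of_continuity(f_values, times, delta):
    """Approximate w(f, delta, T) = sup_{s,t in [0,T], |s-t|<=delta} |f(t)-f(s)| on a grid."""
    w = 0.0
    for i, s in enumerate(times):
        mask = (times >= s) & (times <= s + delta)   # pairs with s <= t <= s + delta
        w = max(w, float(np.max(np.abs(f_values[mask] - f_values[i]))))
    return w

# A (uniformly) continuous path has w(f, delta, T) -> 0 as delta -> 0.
t = np.linspace(0.0, 1.0, 2001)
f = np.sin(2 * np.pi * t)
print(modulus_of_continuity(f, t, 0.1) > modulus_of_continuity(f, t, 0.01))  # True
```

For a discontinuous path, by contrast, the modulus stays bounded away from 0 as $\delta\to0$, which is how the criterion below detects C-tightness.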
Theorem 6.7 (A criterion for C-tightness) ([24], Theorem 15.5, p. 127) (a) Let $(X^{(n)}_{(t)})_{t\ge0}$, $n\ge1$, be a sequence of càdlàg processes null at $t=0$. If, for every $T>0$ and every $\varepsilon>0$,
$$\lim_{\delta\to0}\limsup_n\ \mathbb{P}\big(w(X^{(n)},\delta,T)\ge\varepsilon\big)=0,$$
then the sequence $(X^{(n)})_{n\ge1}$ is C-tight in the following sense: from any subsequence $(X^{(n')})$ one may extract a further subsequence $(X^{(n'')})$ such that $X^{(n'')}$ converges in distribution toward a process $X^\infty$ for the weak topology on the space $D(\mathbb{R}_+,\mathbb{R})$ (endowed with the topology of uniform convergence on compact sets) such that $\mathbb{P}\big(X^\infty\in C(\mathbb{R}_+,\mathbb{R})\big)=1$.
(b) (See [24], proof of Theorem 8.3, p. 56) If, for every $T>0$ and every $\varepsilon>0$,
$$\lim_{\delta\to0}\sup_{s\in[0,T]}\limsup_n\ \mathbb{P}\Big(\sup_{s\le t\le s+\delta}\big|X^{(n)}_t-X^{(n)}_s\big|\ge\varepsilon\Big)=0,$$
then the above condition in (a) is satisfied.

The second theorem provides a tightness criterion for a sequence of martingales based on the sequence of their bracket processes.

Theorem 6.8 (Weak functional limit of a sequence of martingales) ([73]) Let $(M^{(n)}_{(t)})_{t\ge0}$, $n\ge1$, be a sequence of càdlàg (local) martingales, null at 0, C-tight, with (existing) predictable bracket processes $\langle M^{(n)}\rangle$. If
$$\forall\,t\ge0,\qquad \langle M^{(n)}\rangle_{(t)}\xrightarrow{\ n\to+\infty\ }\sigma^2 t,\qquad \sigma>0,$$
then
$$M^{(n)}\xrightarrow{\ \mathcal{L}_{D(\mathbb{R}_+,\mathbb{R})}\ }\sigma\,W,$$
where $W$ denotes a standard Brownian motion ($^9$).


Now we can apply these results to the processes $\widetilde A^{(n)}_{(t)}$ and $\widetilde M^{(n)}_{(t)}$.
First we aim at showing that the sequence of continuous processes $(\widetilde A^{(n)})_{n\ge N}$ is C-tight. Since $\sup_{n\ge N,\,t\ge0}|\tilde\eta(\tilde Y_{N(t)})|\le\|\tilde\eta\|_{\sup}<+\infty$, the sequence $(\widetilde A^{(n)})_{n\ge N}$ of time integrals satisfies
$$\forall\,s\in[\Gamma_N,+\infty),\qquad \mathbb{E}\sup_{s\le t\le s+\delta}\big|\widetilde A^{(0)}_{(t)}-\widetilde A^{(0)}_{(s)}\big|^2\le C_{\|\tilde\eta\|_{\sup},\|\tilde\alpha^i\|_{\sup}}\ \sup_{n\ge N}\mathbb{E}|\widetilde\Upsilon_n|^2\times\delta^2.$$
Hence
$$\forall\,n\ge N,\ \forall\,s\in\mathbb{R}_+,\qquad \mathbb{P}\Big(\sup_{s\le t\le s+\delta}\big|\widetilde A^{(n)}_{(t)}-\widetilde A^{(n)}_{(s)}\big|\ge\varepsilon\Big)\le C_{\|\tilde\eta\|_{\sup},\|\tilde\alpha^i\|_{\sup}}\,\frac{\sup_n\mathbb{E}|\widetilde\Upsilon_n|^2\,\delta^2}{\varepsilon^2},$$
and one concludes that the sequence $(\widetilde A^{(n)})_{n\ge N}$ is C-tight by applying Theorem 6.7(b).
We can also apply this result to the weak asymptotics of the martingales $(\widetilde M^{(n)}_{(t)})$. We start from the definition of the martingale $\widetilde M^{(0)}$,
$$\widetilde M^{(0)}_{(t)}=\sum_{k=1}^{N(t)}\sqrt{\gamma_k}\,\widetilde{\Delta M}_{k+1}$$
(the related filtration is $\mathcal{F}^{(0)}_t=\mathcal{F}_n$, $t\in[\Gamma_n,\Gamma_{n+1})$). Since we know that
$$\sup_n\mathbb{E}\big|\widetilde{\Delta M}_{n+1}\big|^{2+\rho}\le A(\varepsilon)<+\infty,$$

we get, owing to the B.D.G. inequality, that, for every $s\in\mathbb{R}_+$,
$$\mathbb{E}\sup_{s\le t\le s+\delta}\big|\widetilde M^{(0)}_{(t)}-\widetilde M^{(0)}_{(s)}\big|^{2+\rho}\le C_\rho\,\mathbb{E}\Big(\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\big(\widetilde{\Delta M}_k\big)^2\Big)^{1+\frac\rho2}$$
$$\le C_\rho\Big(\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\Big)^{1+\frac\rho2}\,\mathbb{E}\left(\frac{\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k(\widetilde{\Delta M}_k)^2}{\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k}\right)^{1+\frac\rho2}$$
$$\le C_\rho\Big(\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\Big)^{1+\frac\rho2}\,\mathbb{E}\left(\frac{\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\big|\widetilde{\Delta M}_k\big|^{2+\rho}}{\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k}\right)$$
$$\le C_\rho\Big(\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\Big)^{\frac\rho2}\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\,\mathbb{E}\big|\widetilde{\Delta M}_k\big|^{2+\rho}$$
(the second inequality follows from Jensen's inequality applied with the normalized weights $\gamma_k/\sum\gamma_k$),

$^9$ This means that for every bounded functional $F:D(\mathbb{R}_+,\mathbb{R})\to\mathbb{R}$, measurable with respect to the $\sigma$-field spanned by the finite projections $\alpha\mapsto\alpha(t)$, $t\in\mathbb{R}_+$, and continuous at every $\alpha\in C(\mathbb{R}_+,\mathbb{R})$, one has $\mathbb{E}F(M^{(n)})\to\mathbb{E}F(\sigma W)$ as $n\to+\infty$. This remains true in fact for measurable functionals $F$ which are $\mathbb{P}_{\sigma W}(d\alpha)$-a.s. continuous on $C(\mathbb{R}_+,\mathbb{R})$ and such that $(F(M^{(n)}))_{n\ge1}$ is uniformly integrable.

where $C_\rho$ is a positive real constant. One finally derives that, for every $s\in\mathbb{R}_+$,
$$\mathbb{E}\sup_{s\le t\le s+\delta}\big|\widetilde M^{(0)}_{(t)}-\widetilde M^{(0)}_{(s)}\big|^{2+\rho}\le C_\rho\,A(\varepsilon)\Big(\sum_{k=N(s)+1}^{N(s+\delta)}\gamma_k\Big)^{1+\frac\rho2}\le C_\rho\,A(\varepsilon)\Big(\delta+\sup_{k\ge N(s)+1}\gamma_k\Big)^{1+\frac\rho2},$$
which in turn implies that
$$\forall\,n\ge N,\qquad \mathbb{E}\sup_{s\le t\le s+\delta}\big|\widetilde M^{(n)}_{(t)}-\widetilde M^{(n)}_{(s)}\big|^{2+\rho}\le C'_\rho\Big(\delta+\sup_{k\ge N(\Gamma_N)+1}\gamma_k\Big)^{1+\frac\rho2}.$$

Then, by the Markov inequality,
$$\limsup_n\ \mathbb{P}\Big(\sup_{s\le t\le s+\delta}\big|\widetilde M^{(n)}_{(t)}-\widetilde M^{(n)}_{(s)}\big|\ge\varepsilon\Big)\le C'_\rho\,\frac{\delta^{1+\frac\rho2}}{\varepsilon^{2+\rho}}.$$
The C-tightness of the sequence $(\widetilde M^{(n)})_{n\ge N}$ follows from Theorem 6.7(b).
Furthermore, for every $n\ge N$,
$$\langle\widetilde M^{(n)}\rangle_{(t)}=\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k\,\mathbb{E}\big((\widetilde{\Delta M}_k)^2\,\big|\,\mathcal{F}_{k-1}\big)=\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k\Big(\big(\mathbb{E}\,H(y,Z_k)^2\big)_{|y=Y_{k-1}}-h(Y_{k-1})^2\Big)\sim\big(\mathbb{E}\,H(0,Z)^2-h(0)^2\big)\sum_{k=n+1}^{N(\Gamma_n+t)}\gamma_k.$$
Hence
$$\langle\widetilde M^{(n)}\rangle_{(t)}\longrightarrow \mathbb{E}\,H(0,Z)^2\times t\quad\text{as }n\to+\infty,$$
where we used that $h$ and $\theta\mapsto\mathbb{E}\,H(\theta,Z)^2$ are both continuous at 0 (and that $h(0)=0$). Theorem 6.8 then implies
$$\widetilde M^{(n)}\xrightarrow{\ \mathcal{L}_{C(\mathbb{R}_+,\mathbb{R})}\ }\sigma\,W,\qquad W\ \text{a standard Brownian motion, with }\sigma^2=\mathbb{E}\,H(0,Z)^2.$$
Step 5 (Synthesis and conclusion): The sequence $(\widetilde\Upsilon^{(n)}_{(t)})$ satisfies, for every $n\ge N$,
$$\forall\,t\ge0,\qquad \widetilde\Upsilon^{(n)}_{(t)}=\widetilde\Upsilon_n-\widetilde A^{(n)}_{(t)}+\widetilde M^{(n)}_{(t)}.$$
Consequently, the sequence $(\widetilde\Upsilon^{(n)})$ is C-tight since C-tightness is additive, $\widetilde\Upsilon_n$ is tight (by $L^2$-boundedness) and $(\widetilde A^{(n)})_{n\ge N}$, $(\widetilde M^{(n)})_{n\ge N}$ are both C-tight. Hence the sequence of processes $(\widetilde\Upsilon^{(n)},\widetilde M^{(n)})$ is C-tight as well.

Let $(\widetilde\Upsilon^{(\infty)}_{(t)},\widetilde M^{(\infty)}_{(t)})$ be a weak functional limiting value. It will solve the Ornstein-Uhlenbeck (O.U.) SDE
$$d\widetilde\Upsilon^{(\infty)}_{(t)}=-a\,\widetilde\Upsilon^{(\infty)}_{(t)}\,dt+\sigma\,dW_t \qquad (6.40)$$
starting from a random variable $\widetilde\Upsilon^{(\infty)}_{(0)}\in L^2$ such that
$$\big\|\widetilde\Upsilon^{(\infty)}_{(0)}\big\|_2\le\sup_n\big\|\widetilde\Upsilon_n\big\|_2.$$

Let $\nu_0:=\mathcal{L}\text{-}\lim_n\widetilde\Upsilon_{\varphi(n)}$ be a weak (functional) limiting value of $(\widetilde\Upsilon_{\varphi(n)})_{n\ge1}$. For every $t>0$, one considers the sequence of integers $\psi_t(n)$ uniquely defined by
$$\Gamma_{\psi_t(n)}:=\Gamma_{\varphi(n)}-t.$$
Up to an extraction, we may assume that we also have the convergence
$$\widetilde\Upsilon^{(\varphi(n))}\xrightarrow{\mathcal{L}}\widetilde\Upsilon^{(\infty,0)}\quad\text{starting from}\quad \widetilde\Upsilon^{(\infty,0)}_{(0)}\sim\nu_0$$
and
$$\widetilde\Upsilon^{(\psi_t(n))}\xrightarrow{\mathcal{L}}\widetilde\Upsilon^{(\infty,-t)}\quad\text{starting from}\quad \widetilde\Upsilon^{(\infty,-t)}_{(0)}\sim\nu_{-t}.$$
One checks, by strong uniqueness of the solutions of the above Ornstein-Uhlenbeck SDE, that
$$\widetilde\Upsilon^{(\infty,-t)}_{(t)}=\widetilde\Upsilon^{(\infty,0)}_{(0)}.$$

Now let $(P_t)_{t\ge0}$ denote the semi-group of the Ornstein-Uhlenbeck process. From what precedes, for every $t\ge0$,
$$\nu_0=\nu_{-t}P_t.$$
Moreover, $(\nu_{-t})_{t\ge0}$ is tight since it is $L^2$-bounded. Let $\nu_{-\infty}$ be a weak limiting value of $\nu_{-t}$ as $t\to+\infty$.
Let $\Upsilon^\mu_{(t)}$ denote a solution to (6.40) starting from a $\mu$-distributed random variable. We know, by the confluence property of O.U. paths, that
$$\big|\Upsilon^\mu_t-\Upsilon^{\mu'}_t\big|\le\big|\Upsilon^\mu_0-\Upsilon^{\mu'}_0\big|\,e^{-at}.$$

For every Lipschitz continuous function $f$ with compact support,
$$\begin{aligned}
\big|\nu_{-\infty}P_t(f)-\nu_{-t}P_t(f)\big|&=\big|\mathbb{E}f\big(\Upsilon^{\nu_{-\infty}}_{(t)}\big)-\mathbb{E}f\big(\Upsilon^{\nu_{-t}}_{(t)}\big)\big|\\
&\le[f]_{\mathrm{Lip}}\,\mathbb{E}\big|\Upsilon^{\nu_{-\infty}}_{(t)}-\Upsilon^{\nu_{-t}}_{(t)}\big|\\
&\le[f]_{\mathrm{Lip}}\,e^{-at}\,\mathbb{E}\big|\Upsilon^{\nu_{-\infty}}_{(0)}-\Upsilon^{\nu_{-t}}_{(0)}\big|\\
&\le[f]_{\mathrm{Lip}}\,e^{-at}\,\big\|\Upsilon^{\nu_{-\infty}}_{(0)}-\Upsilon^{\nu_{-t}}_{(0)}\big\|_2\\
&\le2\,[f]_{\mathrm{Lip}}\,e^{-at}\,\sup_n\big\|\widetilde\Upsilon_n\big\|_2\longrightarrow0\quad\text{as }t\to+\infty.
\end{aligned}$$

Consequently
$$\nu_0=\lim_{t\to+\infty}\nu_{-\infty}P_t=\mathcal{N}\Big(0;\frac{\sigma^2}{2a}\Big).$$
We have just proved that the distribution $\mathcal{N}\big(0;\frac{\sigma^2}{2a}\big)$ is the only possible limiting value, hence
$$\widetilde\Upsilon_n\xrightarrow{\mathcal{L}}\mathcal{N}\Big(0;\frac{\sigma^2}{2a}\Big).$$

Now we come back to $\Upsilon_n$ (prior to the localization). We have just proved that, for $\varepsilon=\varepsilon(\rho)$ and every $N\ge1$,
$$\widetilde\Upsilon^{\varepsilon,N}_n\xrightarrow{\mathcal{L}}\mathcal{N}\Big(0;\frac{\sigma^2}{2a}\Big)\quad\text{as }n\to+\infty. \qquad (6.41)$$
On the other hand, we already saw that $Y_n\to0$ a.s. implies that $\Omega=\bigcup_{N\ge1}\Omega_{\varepsilon,N}$ a.s. where $\Omega_{\varepsilon,N}=\{\tilde Y^{\varepsilon,N}_n=Y_n,\ n\ge N\}=\{\widetilde\Upsilon^{\varepsilon,N}_n=\Upsilon_n,\ n\ge N\}$. Moreover, the events $\Omega_{\varepsilon,N}$ are non-decreasing as $N$ increases so that
$$\lim_{N\to\infty}\mathbb{P}(\Omega_{\varepsilon,N})=1.$$
Owing to the localization principle, for every bounded Borel function $f$,
$$\forall\,n\ge N,\qquad \mathbb{E}\big|f(\Upsilon_n)-f\big(\widetilde\Upsilon^{\varepsilon,N}_n\big)\big|\le2\|f\|_\infty\,\mathbb{P}\big(\Omega^c_{\varepsilon,N}\big).$$

Combined with (6.41), if $f$ is continuous and bounded, we get, for every $N\ge1$,
$$\limsup_n\Big|\mathbb{E}f(\Upsilon_n)-\mathbb{E}f\Big(\frac{\sigma}{\sqrt{2a}}\,\Xi\Big)\Big|\le2\|f\|_\infty\,\mathbb{P}\big(\Omega^c_{\varepsilon,N}\big)$$
where $\Xi\sim\mathcal{N}(0;1)$. One concludes, by letting $N$ go to infinity, that for every bounded continuous function $f$,
$$\lim_n\mathbb{E}f(\Upsilon_n)=\mathbb{E}f\Big(\frac{\sigma}{\sqrt{2a}}\,\Xi\Big),$$
i.e. $\Upsilon_n\xrightarrow{\mathcal{L}}\mathcal{N}\big(0;\frac{\sigma^2}{2a}\big)$. ♦
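The limiting variance $\sigma^2/(2a)$ is easy to observe numerically. The sketch below (an illustration with hypothetical choices, not taken from the text) runs many independent copies of a toy Robbins-Monro procedure and checks the variance of the normalized error $\Upsilon_n=Y_n/\sqrt{\gamma_n}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Robbins-Monro procedure (hypothetical example): H(y, z) = y - z with
# Z ~ N(0;1), so h(y) = y, y* = 0, h'(0) = 1 and sigma^2 = Var H(0, Z) = 1.
# With gamma_n = c/n and c = 1 > 1/(2h'(0)), the CLT above predicts
# Var(Y_n / sqrt(gamma_n)) -> sigma^2/(2a) = 1, where a = h'(0) - 1/(2c) = 1/2.
c, n_steps, n_paths = 1.0, 2000, 10_000
Y = np.zeros(n_paths)
for n in range(1, n_steps + 1):
    Y -= (c / n) * (Y - rng.standard_normal(n_paths))
Upsilon = Y / np.sqrt(c / n_steps)   # normalized error Upsilon_n across the paths
print(Upsilon.var())                 # close to sigma^2/(2a) = 1
```

In this particular example $Y_n$ is just the empirical mean of the innovations, so the observed variance matches the prediction almost exactly.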

 Exercise. V@R, CV@R, etc. To be drawn from ... (remove the comments from the script).

6.4.4 The averaging principle for stochastic approximation


Practical implementations of recursive stochastic algorithms show that the convergence, although ruled by a CLT, is chaotic, even in the final convergence phase, except if the step is optimized to produce the lowest asymptotic variance. Of course, this optimal choice is not realistic in practice.
The original motivation to introduce the averaging principle was to “smoothen” the behaviour of a converging stochastic algorithm by considering the arithmetic mean of the past values up to the nth iteration rather than the value computed at the nth iteration. In fact, if this averaging procedure is combined with the use of a “slowly decreasing” step parameter γn, one reaches for free the best possible rate of convergence!

To be precise: let $(\gamma_n)_{n\ge1}$ be a step sequence satisfying
$$\gamma_n\sim\Big(\frac{\alpha}{\beta+n}\Big)^{\vartheta},\qquad \vartheta\in(1/2,1).$$
Then, we implement the standard procedure (6.3) and set
$$\bar Y_n:=\frac{Y_0+\cdots+Y_{n-1}}{n}.$$
Note that, of course, this empirical mean itself satisfies a recursive formulation:
$$\forall\,n\ge0,\qquad \bar Y_{n+1}=\bar Y_n-\frac{1}{n+1}\big(\bar Y_n-Y_n\big),\qquad \bar Y_0=0.$$
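A minimal sketch of this averaging on a toy example (the mean function, step and seed are hypothetical choices, not from the text): the raw iterate keeps oscillating at scale $\sqrt{\gamma_n}$ while its running mean settles at the $\sqrt n$-CLT scale.

```python
import numpy as np

rng = np.random.default_rng(1)

# Standard procedure (6.3) on a toy problem: H(y, Z) = y - Z, Z ~ N(0;1),
# so the target is y* = 0 (illustrative example), with the recursive
# Cesaro mean of the previous display computed on the fly.
n_steps = 50_000
Y, Ybar = 1.0, 0.0
for n in range(1, n_steps + 1):
    gamma = n ** -0.75                    # "slowly decreasing" step, theta = 3/4
    Y -= gamma * (Y - rng.standard_normal())
    Ybar += (Y - Ybar) / n                # bar Y_n, updated recursively
print(Ybar, Y)   # both near the target 0; the averaged iterate is far less noisy
```

No tuning of the step constant is needed here: the averaging alone recovers the optimal $1/\sqrt n$ rate announced below.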
Under some natural assumptions (see [135]), one shows in the different cases investigated above (stochastic gradient, Robbins-Monro, pseudo-stochastic gradient) that
$$\bar Y_n\xrightarrow{\ a.s.\ }y_*$$
where $y_*$ is the target of the algorithm and, furthermore,
$$\sqrt n\,(\bar Y_n-y_*)\xrightarrow{\mathcal{L}}\mathcal{N}(0;\Sigma_*)$$
where $\Sigma_*$ is the lowest possible variance-covariance matrix: thus, if $d=1$, $\Sigma_*=\frac{\operatorname{Var}(H(y_*,Z))}{h'(y_*)^2}$. As for the CLT of the algorithm itself, we again state the CLT for the averaged procedure in the framework of Markovian algorithms, although it can be done for a general recursive procedure of the form $Y_{n+1}=Y_n-\gamma_{n+1}\big(h(Y_n)+\Delta M_{n+1}\big)$ where $(\Delta M_n)_{n\ge1}$ is a sequence of martingale increments.

Theorem 6.9 (Ruppert & Polyak, see [43, 146, 138]) Let $H:\mathbb{R}^d\times\mathbb{R}^q\to\mathbb{R}^d$ be a Borel function and let $(Z_n)_{n\ge1}$ be a sequence of i.i.d. $\mathbb{R}^q$-valued random vectors defined on a probability space $(\Omega,\mathcal{A},\mathbb{P})$ such that, for every $y\in\mathbb{R}^d$, $H(y,Z)\in L^2(\mathbb{P})$, so that the recursively defined procedure
$$Y_{n+1}=Y_n-\gamma_{n+1}H(Y_n,Z_{n+1})$$
and the mean function $h(y)=\mathbb{E}\,H(y,Z)$ are well-defined. We make the following assumptions:
(i) The function $h$ has a unique zero $y_*$ and is “fast” differentiable at $y_*$ in the sense that
$$\forall\,y\in\mathbb{R}^d,\qquad h(y)=A(y-y_*)+O(|y-y_*|^2)$$
where all eigenvalues of the Jacobian matrix $A=J_h(y_*)$ of $h$ at $y_*$ have a positive real part.
(ii) The algorithm $(Y_n)$ converges toward $y_*$ with positive probability.
(iii) There exists an exponent $c>2$ and a real constant $C>0$ such that
$$\forall\,K>0,\quad \sup_{|y|\le K}\mathbb{E}\big(|H(y,Z)|^c\big)<+\infty\quad\text{and}\quad y\mapsto\mathbb{E}\big(H(y,Z)H(y,Z)^t\big)\ \text{is continuous at }y_*. \qquad (6.42)$$
Then, if $\gamma_n=\frac{\gamma_0}{n^a+b}$, $n\ge1$, where $1/2<a<1$ and $b\ge0$, the empirical mean sequence defined by
$$\bar Y_n=\frac{Y_0+\cdots+Y_{n-1}}{n}$$
satisfies a CLT with the optimal asymptotic variance on the event $\{Y_n\to y_*\}$, namely
$$\sqrt n\,(\bar Y_n-y_*)\xrightarrow{\mathcal{L}}\mathcal{N}\big(0;A^{-1}\Gamma_*A^{-1}\big)\quad\text{on }\{Y_n\to y_*\},$$
where $\Gamma_*=\mathbb{E}\big(H(y_*,Z)H(y_*,Z)^t\big)$.

We will prove this result in a more restrictive setting and refer e.g. to [43] for the general case. First, we assume (only for convenience) that d = 1. Then, we assume that h satisfies the coercivity assumption (6.36) from Proposition 6.11 and has a Lipschitz continuous derivative. Finally, we assume that Yn → y∗ a.s. Note that the step sequences under consideration all satisfy Condition (G)α of Proposition 6.11.

Proof (one-dimensional case). We consider the case of a scalar algorithm ($d=1$). We assume without loss of generality that $y_*=0$. We start from the canonical Markov representation
$$\forall\,n\ge0,\qquad Y_{n+1}=Y_n-\gamma_{n+1}h(Y_n)-\gamma_{n+1}\Delta M_{n+1}\quad\text{where}\quad \Delta M_{n+1}=H(Y_n,Z_{n+1})-h(Y_n)$$
defines a sequence of $\mathcal{F}^Z_n$-martingale increments. Then, as $Y_n\to0$ a.s., $h(Y_n)=h'(0)Y_n+O(|Y_n|^2)$ where $y\mapsto O(|y|^2)/|y|^2$ is bounded by $[h']_{\mathrm{Lip}}$, so that, for every $n\ge0$,
$$h'(0)Y_n=\frac{Y_n-Y_{n+1}}{\gamma_{n+1}}-\Delta M_{n+1}+O(|Y_n|^2)$$
which in turn implies, by summing up from 0 to $n-1$,
$$h'(0)\sqrt n\,\bar Y_n=-\frac{1}{\sqrt n}\sum_{k=1}^{n}\frac{Y_k-Y_{k-1}}{\gamma_k}-\frac{1}{\sqrt n}M_n+\frac{1}{\sqrt n}\sum_{k=0}^{n-1}O(|Y_k|^2).$$

We will inspect successively the three sums on the right-hand side of this equation. First, by an Abel transform, we get
$$\sum_{k=1}^{n}\frac{Y_k-Y_{k-1}}{\gamma_k}=\frac{Y_n}{\gamma_n}-\frac{Y_0}{\gamma_1}-\sum_{k=2}^{n}Y_{k-1}\Big(\frac{1}{\gamma_k}-\frac{1}{\gamma_{k-1}}\Big).$$
Hence, using that the sequence $(\frac{1}{\gamma_n})_{n\ge1}$ is non-decreasing, we derive
$$\begin{aligned}
\Big|\sum_{k=1}^{n}\frac{Y_k-Y_{k-1}}{\gamma_k}\Big|&\le\frac{|Y_n|}{\gamma_n}+\frac{|Y_0|}{\gamma_1}+\sum_{k=2}^{n}|Y_{k-1}|\Big(\frac{1}{\gamma_k}-\frac{1}{\gamma_{k-1}}\Big)\\
&\le\frac{|Y_n|}{\gamma_n}+\frac{|Y_0|}{\gamma_1}+\sup_{k\ge0}|Y_k|\sum_{k=2}^{n}\Big(\frac{1}{\gamma_k}-\frac{1}{\gamma_{k-1}}\Big)\\
&=\frac{|Y_n|}{\gamma_n}+\frac{|Y_0|}{\gamma_1}+\sup_{k\ge0}|Y_k|\Big(\frac{1}{\gamma_n}-\frac{1}{\gamma_1}\Big)\\
&\le3\sup_{k\ge0}|Y_k|\,\frac{1}{\gamma_n}\quad\text{a.s.}
\end{aligned}$$
As a consequence, it follows from the assumption $\lim_n n\gamma_n=+\infty$ that
$$\lim_n\frac{1}{\sqrt n}\sum_{k=1}^{n}\frac{Y_k-Y_{k-1}}{\gamma_k}=0.$$

As for the second (martingale) term, it is straightforward that, if we set P si(y) = E H(y, Z) −
2
h(y) , then
n
hM in 1X a.s.
= Ψ(Yn ) −→ Ψ(y∗ ).
n n
k=1
It follows from the assumptions made on the function H and Lindeberg’s CLT (Theorem 11.7 in
the Miscellany chapter 11) that
1 Mn d  Ψ(y ) 

0
√ −→ N 0; 0 .
h (0) n (h (0))2
The third term is handled as follows. Under Assumption (6.36) of Proposition 6.11, we know that $\mathbb{E}\,Y_n^2\le C\gamma_n$ since the class of steps we consider satisfies Condition $(G)_\alpha$ (see the remark below Proposition 6.11). On the other hand, it follows from the assumption made on the step sequence that
$$\sum_{k\ge1}\frac{\mathbb{E}\,O(|Y_k|^2)}{\sqrt k}\le[h']_{\mathrm{Lip}}\sum_{k\ge1}\frac{\mathbb{E}\,Y_k^2}{\sqrt k}\le C\,[h']_{\mathrm{Lip}}\sum_{k\ge1}\frac{\gamma_k}{\sqrt k}<+\infty.$$
One concludes by Kronecker's Lemma that
$$\lim_n\frac{1}{\sqrt n}\sum_{k=0}^{n-1}O(|Y_k|^2)=0\quad\text{a.s.}$$
Slutsky's Lemma completes the proof. ♦

Remark. As far as the step sequence is concerned, we only used that $(\gamma_n)$ is decreasing and satisfies $(G)_\alpha$,
$$\sum_k\frac{\gamma_k}{\sqrt k}<+\infty\qquad\text{and}\qquad \lim_n n\gamma_n=+\infty.$$
Indeed, we have seen in the former section that this variance is the lowest possible asymptotic variance in the CLT when specifying the step parameter in an optimal way ($\gamma_n=\frac{c_{\mathrm{opt}}}{n+b}$). In fact, this discussion and its conclusions can easily be extended to higher dimensions (if one considers some matrix-valued step sequences), as emphasized e.g. in [43].
So, the Ruppert & Polyak principle performs like the regular stochastic algorithm with the lowest asymptotic variance, for free!

 Exercise. Test the above averaging principle on the former exercises and “numerical illustrations” by considering $\gamma_n=\alpha\,n^{-\frac34}$, $n\ge1$. Compare with a direct approach with a step $\tilde\gamma_n=\frac{\alpha}{\beta+n}$.

Practitioner's corner. In practice, one should not start the averaging at the true beginning of the procedure but rather wait for its stabilization, ideally once the “exploration/search” phase is finished. On the other hand, the compromise consisting in using a moving window (typically of length n after 2n iterations) does not yield the optimal asymptotic variance, as pointed out in [101].

6.4.5 Traps
In the presence of multiple equilibrium points, i.e. of points at which the mean function h of the procedure vanishes, some of them can be seen as parasitic equilibria. This is the case of saddle points and local maxima in the framework of stochastic gradient descent (to some extent, local minima are parasitic too, but this is another story and usual stochastic approximation does not provide satisfactory answers to this “second order problem”).
There is a wide literature on this problem which says, roughly speaking, that a sufficiently excited parasitic equilibrium point is a.s. not a possible limit point of a stochastic approximation procedure. Although natural and expected, such a conclusion is far from straightforward to establish, as testified by the various works on that topic (see [96, 136, 43, 20, 51], etc.).

6.4.6 (Back to) V @Rα and CV @Rα computation (II): weak rate
We can apply both above CLTs to the $V@R_\alpha$ and $CV@R_\alpha(X)$ algorithms (6.24) and (6.26). Since
$$h(\xi)=\frac{1}{1-\alpha}\big(F(\xi)-\alpha\big)\qquad\text{and}\qquad \mathbb{E}\,H(\xi,X)^2=\frac{1}{(1-\alpha)^2}F(\xi)\big(1-F(\xi)\big),$$
one easily derives from Theorems 6.6 and 6.9 the following results.

Theorem 6.10 Assume that $\mathbb{P}_X=f(x)dx$ where $f$ is a continuous density function (at least at $\xi_\alpha^*=V@R_\alpha(X)$).
(a) If $\gamma_n=\frac{\kappa}{n^a+b}$, $\frac12<a<1$, $b\ge0$, then
$$n^{\frac a2}\big(\xi_n-\xi_\alpha^*\big)\xrightarrow{\mathcal{L}}\mathcal{N}\Big(0;\frac{\kappa\,\alpha}{2f(\xi_\alpha^*)}\Big).$$
(b) If $\gamma_n=\frac{\kappa}{n+b}$, $b\ge0$, and $\kappa>\frac{1-\alpha}{2f(\xi_\alpha^*)}$, then
$$\sqrt n\,\big(\xi_n-\xi_\alpha^*\big)\xrightarrow{\mathcal{L}}\mathcal{N}\Big(0;\frac{\kappa^2\alpha}{2\kappa f(\xi_\alpha^*)-(1-\alpha)}\Big)$$
so that the minimal asymptotic variance is attained with $\kappa_\alpha^*=\frac{1-\alpha}{f(\xi_\alpha^*)}$, with an asymptotic variance equal to $\frac{\alpha(1-\alpha)}{f(\xi_\alpha^*)^2}$.
(c) Ruppert & Polyak's averaging principle: If $\gamma_n=\frac{\kappa}{n^a+b}$, $\frac12<a<1$, $b\ge0$, then the averaged sequence $(\bar\xi_n)$ satisfies
$$\sqrt n\,\big(\bar\xi_n-\xi_\alpha^*\big)\xrightarrow{\mathcal{L}}\mathcal{N}\Big(0;\frac{\alpha(1-\alpha)}{f(\xi_\alpha^*)^2}\Big).$$
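The quantile search behind this theorem can be sketched as follows. The recursion below is our reconstruction from the function $H(\xi,X)=\frac{1}{1-\alpha}(\mathbf 1_{\{X\le\xi\}}-\alpha)$, whose mean is the $h$ given above (algorithm (6.24) itself is not reproduced in this section), with averaging started after a burn-in phase as recommended in the Practitioner's corner above; the distribution, step and burn-in are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

# Recursive V@R_alpha search for X ~ N(0;1); target xi* = Phi^{-1}(0.95) ~ 1.645.
# Update: xi_{n+1} = xi_n - gamma_{n+1} (1{X_{n+1} <= xi_n} - alpha)/(1 - alpha),
# averaged after a burn-in phase (all tuning choices here are illustrative).
alpha, n_steps, burn_in = 0.95, 200_000, 10_000
xi, xi_bar, n_avg = 0.0, 0.0, 0
for n in range(1, n_steps + 1):
    gamma = n ** -0.75                        # gamma_n = kappa/n^a with a = 3/4
    X = rng.standard_normal()
    xi -= gamma * ((X <= xi) - alpha) / (1 - alpha)
    if n > burn_in:                           # start averaging once stabilized
        n_avg += 1
        xi_bar += (xi - xi_bar) / n_avg
print(xi_bar)   # close to the exact 0.95-quantile 1.645
```

The raw iterate still fluctuates visibly at the end of the run, while the averaged estimate is stable to a few thousandths, in line with (c).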

The algorithm for the $CV@R_\alpha(X)$ satisfies the same kind of CLT. In progress [. . . ]
This result is not satisfactory because, in general, the asymptotic variance remains huge since $f(\xi_\alpha^*)$ is usually very close to 0. Thus, if $X$ has a normal distribution $\mathcal{N}(0;1)$, then it is clear that $\xi_\alpha^*\to+\infty$ as $\alpha\to1$. Consequently
$$1-\alpha=\mathbb{P}(X\ge\xi_\alpha^*)\sim\frac{f(\xi_\alpha^*)}{\xi_\alpha^*}\quad\text{as }\alpha\to1,$$

so that
$$\frac{\alpha(1-\alpha)}{f(\xi_\alpha^*)^2}\sim\frac{1}{\xi_\alpha^*\,f(\xi_\alpha^*)}\to+\infty\quad\text{as }\alpha\to1.$$
This is simply an illustration of the “rare event” effect: when $\alpha$ is close to 1, the event $\{X_{n+1}>\xi_n\}$ is rare, especially when $\xi_n$ gets close to its limit $\xi_\alpha^*=V@R_\alpha(X)$.
The way out is to add an importance sampling procedure to somewhat “re-center” the distribution around its $V@R_\alpha(X)$. To proceed, we will take advantage of our recursive variance reduction by importance sampling described and analyzed in Section 6.3.1. This is the object of the next section.

6.4.7 V @Rα and CV @Rα computation (III)


In progress [. . . ]
As emphasized in the previous section, the asymptotic variances of our “naive” algorithms for $V@R_\alpha$ and $CV@R_\alpha$ computation are not satisfactory, in particular when $\alpha$ is close to 1. To improve them, the idea is to mix the recursive data-driven variance reduction procedure introduced in Section 6.3.1 with the above algorithms.
First we make the (not so) restrictive assumption that the r.v. $X$, representative of a loss, can be represented as a function of a standard Gaussian vector $Z\sim\mathcal{N}(0;I_d)$, namely
$$X=\chi(Z),\qquad \chi:\mathbb{R}^d\to\mathbb{R}\ \text{a Borel function}.$$
Hence, for a level $\alpha\in(0,1)$, in a (temporarily) static framework (i.e. fixed $\xi\in\mathbb{R}$), the function of interest for variance reduction is defined by
$$\varphi_{\alpha,\xi}(z)=\frac{1}{1-\alpha}\big(\mathbf{1}_{\{\chi(z)\le\xi\}}-\alpha\big),\qquad z\in\mathbb{R}^d.$$
So, still following Section 6.3.1 and taking advantage of the fact that $\varphi_{\alpha,\xi}$ is bounded, we design the following data-driven procedure for the variance reducer (with the notations of this section):
$$\theta_{n+1}=\theta_n-\gamma_{n+1}\,\varphi_{\alpha,\xi}(Z_{n+1}-\theta_n)^2\,(2\theta_n-Z_{n+1}),$$




so that $\mathbb{E}\,\varphi_{\alpha,\xi}(Z)$ can be computed adaptively by
$$\mathbb{E}\,\varphi_{\alpha,\xi}(Z)=e^{-\frac{|\theta|^2}{2}}\,\mathbb{E}\big[\varphi_{\alpha,\xi}(Z+\theta)\,e^{-(\theta|Z)}\big]=\lim_{n\to+\infty}\frac{1}{n}\sum_{k=1}^{n}e^{-\frac{|\theta_{k-1}|^2}{2}}\varphi_{\alpha,\xi}(Z_k+\theta_{k-1})\,e^{-(\theta_{k-1}|Z_k)}.$$
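The first equality above is the Cameron-Martin change of mean, which a static Monte Carlo experiment can verify. In the sketch below the choices $d=1$, $\chi(z)=z$ and the tail function $\mathbf 1_{\{z>\xi\}}$ (a rare-event variant of $\varphi_{\alpha,\xi}$) are ours, and the shifted estimator shows the drastic variance reduction obtained by re-centering at the quantile.

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of the Cameron-Martin identity used above:
#   E f(Z) = e^{-|theta|^2/2} E[ f(Z + theta) e^{-(theta|Z)} ],  Z ~ N(0;1),
# for f(z) = 1{z > xi} (all numerical choices below are illustrative).
xi = 3.719                     # ~ Phi^{-1}(0.9999): P(Z > xi) ~ 1e-4
theta = xi                     # re-centre the distribution around the quantile
M = 400_000
Z = rng.standard_normal(M)
naive = (Z > xi).mean()
is_est = np.exp(-theta**2 / 2) * ((Z + theta > xi) * np.exp(-theta * Z)).mean()
print(naive, is_est)           # both ~ 1e-4; the shifted estimator is far less noisy
```

With the shift $\theta=\xi$ the indicator becomes $\mathbf 1_{\{Z>0\}}$ (hit half of the time instead of once in ten thousand draws) and the weight $e^{-\theta Z-\theta^2/2}$ stays bounded on that event.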

In progress [. . . ]
Considering now a dynamic version of these procedures (i.e. one which recursively adapts $\xi$) leads to the following procedure:
$$\tilde\xi_{n+1}=\tilde\xi_n-\frac{\gamma_{n+1}}{1-\alpha}\,e^{-\frac{|\tilde\theta_n|^2}{2}}e^{-(\tilde\theta_n|Z_{n+1})}\big(\mathbf{1}_{\{\chi(Z_{n+1}+\tilde\theta_n)\le\tilde\xi_n\}}-\alpha\big),$$
$$\tilde\theta_{n+1}=\tilde\theta_n-\gamma_{n+1}\big(\mathbf{1}_{\{\chi(Z_{n+1})\le\tilde\xi_n\}}-\alpha\big)^2\,\big(2\tilde\theta_n-Z_{n+1}\big).$$
This procedure converges a.s. toward its target $(\xi_\alpha,\theta_\alpha^*)$ and the averaged component $(\bar{\tilde\xi}_n)_{n\ge0}$ of $(\tilde\xi_n)_{n\ge0}$ satisfies a CLT.

Theorem 6.11 (see [19]) (a) If the step sequence satisfies the decreasing step assumption (6.7), then
$$(\tilde\xi_n,\tilde\theta_n)\xrightarrow{\ n\to+\infty\ }(\xi_\alpha,\theta_\alpha^*)\qquad\text{with}\qquad \xi_\alpha=V@R_\alpha(X).$$
(b) If the step sequence satisfies $\gamma_n=\frac{\kappa}{n^a+b}$, $\frac12<a<1$, $b\ge0$, $\kappa>0$, then the averaged sequence satisfies
$$\sqrt n\,\big(\bar{\tilde\xi}_n-\xi_\alpha\big)\xrightarrow{\mathcal{L}}\mathcal{N}\Big(0,\frac{V_{\alpha,\xi_\alpha}(\theta_\alpha^*)}{f(\xi_\alpha)^2}\Big)$$
where
$$V_{\alpha,\xi}(\theta)=e^{-|\theta|^2}\,\mathbb{E}\Big(\big(\mathbf{1}_{\{\chi(Z+\theta)\le\xi\}}-\alpha\big)^2e^{-2(\theta|Z)}\Big).$$
Note that
$$V_{\alpha,\xi_\alpha}(0)=F(\xi_\alpha)\big(1-F(\xi_\alpha)\big).$$

6.5 From Quasi-Monte Carlo to Quasi-Stochastic Approximation


Plugging quasi-random numbers into a recursive stochastic approximation procedure instead of
pseudo-random numbers is a rather natural idea given the performances of QM C methods for
numerical integration. It goes back, to our knowledge, to the early 1990’s. As expected, several
numerical tests showed that it may significantly accelerate the convergence of the procedure like it
does in Monte Carlo simulations.
In [93], this question is investigated from a theoretical viewpoint. These papers are based on an extension of uniformly distributed sequences on unit hypercubes called averaging systems. The two main results are based, on the one hand, on a contraction assumption and, on the other hand, on a monotonicity assumption which requires rather stringent conditions on the function H. In the first setting, some a priori error bounds emphasize that quasi-stochastic approximation does accelerate the convergence rate of the procedure. Both results are one-dimensional, although the first setting can easily be extended to multi-dimensional state variables.
In this section, we will simply give the counterpart of the Robbins-Siegmund Lemma established
in Section 6.2. It relies on a pathwise Lyapunov function which is of course a rather restrictive
assumption. It also emphasizes what kind of assumption is needed to establish theoretical results
when using deterministic uniformly distributed sequences. An extended version including several
examples of applications are developed in [95].

Theorem 6.12 (Robbins-Siegmund Lemma in a QMC framework) (a) Let $h:\mathbb{R}^d\to\mathbb{R}^d$ and $H:\mathbb{R}^d\times[0,1]^q\to\mathbb{R}^d$ be Borel functions satisfying
$$h(y)=\mathbb{E}\,H(y,U),\qquad y\in\mathbb{R}^d,\quad U\sim\mathcal{U}([0,1]^q).$$
Suppose that
$$\{h=0\}=\{y_*\},\qquad y_*\in\mathbb{R}^d,$$
and that there exists a continuously differentiable function $L:\mathbb{R}^d\to\mathbb{R}_+$ with a Lipschitz continuous gradient $\nabla L$ satisfying
$$|\nabla L|\le C_L\sqrt{1+L}$$

such that $H$ satisfies the following pathwise mean-reverting assumption: the function $\Phi_H$ defined by
$$\forall\,y\in\mathbb{R}^d,\qquad \Phi_H(y):=\inf_{u\in[0,1]^q}\big(\nabla L(y)\,|\,H(y,u)-H(y_*,u)\big)\ \text{is l.s.c. and positive on }\mathbb{R}^d\setminus\{y_*\}. \qquad (6.43)$$
Furthermore, assume that
$$\forall\,y\in\mathbb{R}^d,\ \forall\,u\in[0,1]^q,\qquad |H(y,u)|\le C_H\big(1+L(y)\big)^{\frac12} \qquad (6.44)$$
(which implies that $h$ is bounded) and that the function $u\mapsto H(y_*,u)$ has finite variation in the measure sense.
Let $\xi:=(\xi_n)_{n\ge1}$ be a uniformly distributed sequence over $[0,1]^q$ with low discrepancy, namely
$$\ell_n:=\max_{1\le k\le n}\big(k\,D_k^*(\xi)\big)=O\big((\log n)^q\big).$$

Let $\gamma=(\gamma_n)_{n\ge1}$ be a non-increasing sequence of gain parameters satisfying
$$\sum_{n\ge1}\gamma_n=+\infty,\qquad \gamma_n(\log n)^q\to0\qquad\text{and}\qquad \sum_{n\ge1}\max\big(\gamma_n-\gamma_{n+1},\gamma_n^2\big)(\log n)^q<+\infty. \qquad (6.45)$$
Then, the recursive procedure defined by
$$\forall\,n\ge0,\qquad y_{n+1}=y_n-\gamma_{n+1}H(y_n,\xi_{n+1}),\qquad y_0\in\mathbb{R}^d,$$
satisfies
$$y_n\longrightarrow y_*\quad\text{as }n\to+\infty.$$
(b) If $(y,u)\mapsto H(y,u)$ is continuous, then Assumption (6.43) reads
$$\forall\,y\in\mathbb{R}^d\setminus\{y_*\},\ \forall\,u\in[0,1]^q,\qquad \big(\nabla L(y)\,|\,H(y,u)-H(y_*,u)\big)>0. \qquad (6.46)$$

Proof. (a) Step 1 (The regular part): The beginning of the proof is rather similar to the “regular” stochastic case, except that we will use as a Lyapunov function
$$\Lambda=\sqrt{1+L}.$$
First note that $\nabla\Lambda=\frac{\nabla L}{2\sqrt{1+L}}$ is bounded (by the constant $C_L$) so that $\Lambda$ is Lipschitz continuous. Furthermore, for every $y,y'\in\mathbb{R}^d$,
$$\begin{aligned}
|\nabla\Lambda(y)-\nabla\Lambda(y')|&\le\frac{|\nabla L(y)-\nabla L(y')|}{\sqrt{1+L(y)}}+|\nabla L(y')|\,\Big|\frac{1}{\sqrt{1+L(y)}}-\frac{1}{\sqrt{1+L(y')}}\Big| \qquad (6.47)\\
&\le[\nabla L]_{\mathrm{Lip}}\frac{|y-y'|}{\sqrt{1+L(y)}}+\frac{C_L}{\sqrt{1+L(y)}}\big|\sqrt{1+L(y)}-\sqrt{1+L(y')}\big|\\
&\le[\nabla L]_{\mathrm{Lip}}\frac{|y-y'|}{\sqrt{1+L(y)}}+\frac{C_L^2}{\sqrt{1+L(y)}}\,|y-y'|\\
&\le C_\Lambda\frac{|y-y'|}{\sqrt{1+L(y)}} \qquad (6.48)
\end{aligned}$$

where $C_\Lambda=[\nabla L]_{\mathrm{Lip}}+C_L^2$.
It follows, by using successively the fundamental theorem of calculus applied to $\Lambda$ between $y_n$ and $y_{n+1}$ and Hölder's inequality, that there exists $\zeta_{n+1}\in(y_n,y_{n+1})$ (geometric interval) such that
$$\begin{aligned}
\Lambda(y_{n+1})&=\Lambda(y_n)-\gamma_{n+1}\big(\nabla\Lambda(y_n)\,|\,H(y_n,\xi_{n+1})\big)+\gamma_{n+1}\big(\nabla\Lambda(y_n)-\nabla\Lambda(\zeta_{n+1})\,|\,H(y_n,\xi_{n+1})\big)\\
&\le\Lambda(y_n)-\gamma_{n+1}\big(\nabla\Lambda(y_n)\,|\,H(y_n,\xi_{n+1})\big)+\gamma_{n+1}\big|\nabla\Lambda(y_n)-\nabla\Lambda(\zeta_{n+1})\big|\,\big|H(y_n,\xi_{n+1})\big|.
\end{aligned}$$
Now, the above inequality (6.48) applied with $y=y_n$ and $y'=\zeta_{n+1}$ yields, knowing that $|\zeta_{n+1}-y_n|\le|y_{n+1}-y_n|=\gamma_{n+1}|H(y_n,\xi_{n+1})|$,
$$\Lambda(y_{n+1})\le\Lambda(y_n)-\gamma_{n+1}\big(\nabla\Lambda(y_n)\,|\,H(y_n,\xi_{n+1})\big)+\gamma_{n+1}^2\,\frac{C_\Lambda}{\sqrt{1+L(y_n)}}\,\big|H(y_n,\xi_{n+1})\big|^2,$$
so that
$$\Lambda(y_{n+1})\le\Lambda(y_n)-\gamma_{n+1}\big(\nabla\Lambda(y_n)\,|\,H(y_n,\xi_{n+1})-H(y_*,\xi_{n+1})\big)-\gamma_{n+1}\big(\nabla\Lambda(y_n)\,|\,H(y_*,\xi_{n+1})\big)+\gamma_{n+1}^2\,C_\Lambda\,\Lambda(y_n),$$
where we used (6.44) in the last term. It follows that
$$\Lambda(y_{n+1})\le\Lambda(y_n)\big(1+C_\Lambda\gamma_{n+1}^2\big)-\gamma_{n+1}\Phi_H(y_n)-\gamma_{n+1}\big(\nabla\Lambda(y_n)\,|\,H(y_*,\xi_{n+1})\big). \qquad (6.49)$$

Set, for every $n\ge0$,
$$s_n:=\frac{\Lambda(y_n)+\sum_{k=1}^{n}\gamma_k\Phi_H(y_{k-1})}{\prod_{k=1}^{n}(1+C_\Lambda\gamma_k^2)}$$
with the usual convention $\sum_\emptyset=0$. It follows from (6.43) that the sequence $(s_n)_{n\ge0}$ is non-negative since all the terms involved in its numerator are non-negative.
Now (6.49) reads
$$\forall\,n\ge0,\qquad 0\le s_{n+1}\le s_n-\tilde\gamma_{n+1}\big(\nabla\Lambda(y_n)\,|\,H(y_*,\xi_{n+1})\big) \qquad (6.50)$$
where
$$\tilde\gamma_n=\frac{\gamma_n}{\prod_{k=1}^{n}(1+C_\Lambda\gamma_k^2)},\qquad n\ge1.$$

Step 2 (The “QMC part”): Set, for every $n\ge1$,
$$m_n:=\sum_{k=1}^{n}\tilde\gamma_k\big(\nabla\Lambda(y_{k-1})\,|\,H(y_*,\xi_k)\big)\qquad\text{and}\qquad S_n^*=\sum_{k=1}^{n}H(y_*,\xi_k).$$
First note that the low discrepancy assumption combined with the Koksma-Hlawka inequality implies
$$|S_n^*|\le C_\xi\,V\big(H(y_*,\cdot)\big)\,(\log n)^q \qquad (6.51)$$



where $V(H(y_*,\cdot))$ denotes the variation in the measure sense of $H(y_*,\cdot)$. An Abel transform yields (with the convention $S_0^*=0$)
$$\begin{aligned}
m_n&=\tilde\gamma_n\big(\nabla\Lambda(y_{n-1})\,|\,S_n^*\big)-\sum_{k=1}^{n-1}\big(\tilde\gamma_{k+1}\nabla\Lambda(y_k)-\tilde\gamma_k\nabla\Lambda(y_{k-1})\,|\,S_k^*\big)\\
&=\underbrace{\tilde\gamma_n\big(\nabla\Lambda(y_{n-1})\,|\,S_n^*\big)}_{(a)}-\underbrace{\sum_{k=1}^{n-1}\tilde\gamma_k\big(\nabla\Lambda(y_k)-\nabla\Lambda(y_{k-1})\,|\,S_k^*\big)}_{(b)}-\underbrace{\sum_{k=1}^{n-1}\Delta\tilde\gamma_{k+1}\big(\nabla\Lambda(y_k)\,|\,S_k^*\big)}_{(c)}
\end{aligned}$$
where $\Delta\tilde\gamma_{k+1}:=\tilde\gamma_{k+1}-\tilde\gamma_k$. We aim at showing that $m_n$ converges in $\mathbb{R}$ toward a finite limit by inspecting the above three terms.
One gets, using that $\tilde\gamma_n\le\gamma_n$,
$$|(a)|\le\gamma_n\|\nabla\Lambda\|_\infty\,O\big((\log n)^q\big)=O\big(\gamma_n(\log n)^q\big)\to0\quad\text{as }n\to+\infty.$$


Owing to (6.48), the terms of the partial sum (b) satisfy
$$\tilde\gamma_k\big|\big(\nabla\Lambda(y_k)-\nabla\Lambda(y_{k-1})\,|\,S_k^*\big)\big|\le C_\Lambda\,\tilde\gamma_k\gamma_k\,\frac{|H(y_{k-1},\xi_k)|}{\sqrt{1+L(y_{k-1})}}\,|S_k^*|\le C_\Lambda C_H C_\xi\,V\big(H(y_*,\cdot)\big)\,\gamma_k^2(\log k)^q,$$
where we used (6.44) and (6.51) in the second inequality.
Consequently, the series $\sum_{k\ge1}\tilde\gamma_k\big(\nabla\Lambda(y_k)-\nabla\Lambda(y_{k-1})\,|\,S_k^*\big)$ is (absolutely) convergent owing to Assumption (6.45).
Finally, one deals with term (c). Notice that
$$|\tilde\gamma_{n+1}-\tilde\gamma_n|\le|\gamma_{n+1}-\gamma_n|+C_\Lambda\gamma_{n+1}^2\gamma_n\le C'_\Lambda\max\big(\gamma_n^2,|\gamma_{n+1}-\gamma_n|\big).$$
One checks that the series (c) is also (absolutely) convergent, owing to the boundedness of $\nabla\Lambda$, Assumption (6.45) and the upper bound (6.51) for $S_n^*$.
Hence $m_n$ converges toward a finite limit $m_\infty$. This implies that the sequence $(s_n+m_n)$ is bounded from below since $(s_n)$ is non-negative. Now we know from (6.50) that $(s_n+m_n)$ is also non-increasing, hence convergent in $\mathbb{R}$, which in turn implies that the sequence $(s_n)_{n\ge0}$ itself converges toward a finite limit. The same arguments as in the regular stochastic case yield
$$L(y_n)\xrightarrow{\ n\to+\infty\ }L_\infty\qquad\text{and}\qquad \sum_{n\ge1}\gamma_n\Phi_H(y_{n-1})<+\infty.$$
Once again, one concludes as in the stochastic case that $(y_n)$ is bounded and eventually converges toward the unique zero of $\Phi_H$, i.e. $y_*$.

(b) is obvious. ♦

Comments for practical implementation. • The step assumption (6.45) includes all the steps $\gamma_n=\frac{c}{n^\alpha}$, $\alpha\in(0,1]$. Note that, as soon as $q\ge2$, the condition $\gamma_n(\log n)^q\to0$ is redundant (it follows from the convergence of the series on the right owing to an Abel transform).
• One can replace the (slightly unrealistic) finite variation assumption on $H(y_*,\cdot)$ by a Lipschitz continuity assumption, provided one strengthens the step assumption (6.45) into
$$\sum_{n\ge1}\gamma_n=+\infty,\qquad \gamma_n(\log n)\,n^{1-\frac1q}\to0\qquad\text{and}\qquad \sum_{n\ge1}\max\big(\gamma_n-\gamma_{n+1},\gamma_n^2\big)(\log n)\,n^{1-\frac1q}<+\infty. \qquad (6.52)$$

This is a straightforward consequence of Proïnov's Theorem (Theorem 4.2), which says that
$$|S_n^*|\le C(\log n)\,n^{1-\frac1q}.$$
Note that the above assumptions are satisfied by the step sequences $\gamma_n=\frac{c}{n^\rho}$, $1-\frac1q<\rho\le1$.
• It is clear that the Lyapunov assumption on $H$ is much more stringent in this QMC setting.
• It remains that the theoretical spectrum of application of the above theorem is dramatically narrower than the original one. From a practical viewpoint, one observes on simulations a very satisfactory behaviour of such quasi-stochastic procedures, including an improvement of the rate of convergence with respect to the regular MC implementation.

 Exercise. We assume now that the recursive procedure satisfied by the sequence $(y_n)_{n\ge0}$ is given by
$$\forall\,n\ge0,\qquad y_{n+1}=y_n-\gamma_{n+1}\big(H(y_n,\xi_{n+1})+r_{n+1}\big),\qquad y_0\in\mathbb{R}^d,$$
where the sequence $(r_n)_{n\ge1}$ is a disturbance term. Show that if $\sum_{n\ge1}\gamma_nr_n$ is a convergent series, then the conclusion of the above theorem remains true.

Numerical experiment: We reproduce here (without even trying to check any kind of assumption, indeed) the implicit correlation search recursive procedure tested in Section 6.3.2, implemented this time with a sequence of quasi-random normal numbers, namely
$$(\zeta_n^1,\zeta_n^2)=\Big(\sqrt{-2\log(\xi_n^1)}\,\sin(2\pi\xi_n^2),\ \sqrt{-2\log(\xi_n^1)}\,\cos(2\pi\xi_n^2)\Big),\qquad n\ge1,$$
where $\xi_n=(\xi_n^1,\xi_n^2)$, $n\ge1$, is simply a regular 2-dimensional Halton sequence.
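The quasi-random normal pairs above are easy to generate from scratch; the sketch below builds the 2-dimensional Halton sequence from the van der Corput sequences in bases 2 and 3 and applies the Box-Muller map (the correlation-search procedure itself is not reproduced here).

```python
import math

def halton(n, base):
    """n-th term (n >= 1) of the van der Corput sequence in the given base."""
    x, f = 0.0, 1.0
    while n > 0:
        f /= base
        x += f * (n % base)
        n //= base
    return x

# Quasi-random normal pairs: Box-Muller applied to the 2-d Halton sequence
# (van der Corput in bases 2 and 3), exactly as in the display above.
pairs = []
for n in range(1, 10_001):
    u1, u2 = halton(n, 2), halton(n, 3)
    r = math.sqrt(-2.0 * math.log(u1))
    pairs.append((r * math.sin(2 * math.pi * u2), r * math.cos(2 * math.pi * u2)))

m1 = sum(z1 for z1, _ in pairs) / len(pairs)       # ~ 0 (first moment of N(0;1))
m2 = sum(z1 * z1 for z1, _ in pairs) / len(pairs)  # ~ 1 (second moment)
print(f"mean = {m1:.3f}, second moment = {m2:.3f}")
```

Since $u^1_n>0$ for every $n\ge1$, the logarithm is always defined; the empirical moments of the resulting pairs match those of the standard normal distribution.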

n ρn := cos(θn )
1000 -0.4964
10000 -0.4995
25000 -0.4995
50000 -0.4994
75000 -0.4996
100000 -0.4998

Figure 6.4: B-S Best-of-Call option. $T=1$, $r=0.10$, $\sigma_1=\sigma_2=0.30$, $X_0^1=X_0^2=100$, $K=100$. Convergence of $\rho_n=\cos(\theta_n)$ toward $\rho^*=-0.5$ (up to $n=100\,000$). Left: MC implementation. Right: QMC implementation.

6.5.1 Further readings


From a probabilistic viewpoint, many other moment, regularity assumptions on the Lyapunov
function entail some a.s. convergence results. From the dynamical point of view, stochastic ap-
proximation is rather closely connected to the dynamics of the autonomous Ordinary Differential
Equation (ODE) of its mean function, namely ẋ = −h(x). However, the analysis of a stochastic
algorithm cannot be really “reduced” to that of its mean ODE as emphasized by several authors
([20, 50]).
There is a huge literature about stochastic approximation, motivated by several fields: opti-
mization, statistics, automatic learning, robotics (?), artificial neural networks, self-organization,
etc. For further insight on stochastic approximation, the main textbooks are probably [89, 43, 21]
for prominently probabilistic aspects. One may read [20] for dynamical system oriented point of
view. For an occupation measure approach, one may also see [51].
Chapter 7

Discretization scheme(s) of a
Brownian diffusion

One considers a d-dimensional Brownian diffusion process (Xt )t∈[0,T ] solution to the following
Stochastic Differential Equation (SDE)

(SDE) ≡ dXt = b(t, Xt )dt + σ(t, Xt )dWt , (7.1)

where b : [0, T ] × Rd → Rd , σ : [0, T ] × Rd → M(d, q, R) are continuous functions, (Wt )t∈[0,T ] denotes a q-dimensional standard Brownian motion defined on a probability space (Ω, A, P) and X0 : (Ω, A, P) → Rd is a random vector, independent of W . We assume that b and σ are Lipschitz continuous in x, uniformly with respect to t ∈ [0, T ], i.e., if | . | denotes any norm on Rd and k . k any norm on the matrix space,

∀ t ∈ [0, T ], ∀ x, y ∈ Rd , |b(t, x) − b(t, y)| + kσ(t, x) − σ(t, y)k ≤ K|x − y|. (7.2)

We consider the so-called augmented filtration generated by X0 and σ(Ws , 0 ≤ s ≤ t) i.e.

∀ t ∈ [0, T ], Ft := σ(X0 , NP , Ws , 0 ≤ s ≤ t)

where NP denotes the class of P-negligible sets of A (i.e. all negligible sets if the σ-algebra A is
supposed to be P-complete). One shows using the Kolmogorov 0-1 law that this completed filtration
is right continuous i.e. Ft = ∩s>t Fs for every t ∈ [0, T ). Such a combination of completeness and
right continuity of a filtration is also known as “usual conditions”.

Theorem 7.1 (see e.g. [78], theorem 2.9, p.289) Under the above assumptions on b, σ, X0 and
W , the above SDE has a unique (Ft )-adapted solution X = (Xt )t∈[0,T ] defined on the probability
space (Ω, A, P), starting from X0 at time 0, in the following sense:
Z t Z t
P-a.s. ∀ t ∈ [0, T ], Xt = X0 + b(s, Xs )ds + σ(s, Xs )dWs .
0 0

This solution has P-a.s. continuous paths.


Notation. When X0 = x ∈ Rd , one denotes the solution of (SDE) on [0, T ] by X x or (Xtx )t∈[0,T ] .

Remark. • A solution as described in the above theorem is known as a strong solution in the sense
that it is defined on the probability space on which W lives.
• The continuity assumption on b and σ can be relaxed into Borel measurability, if we add the linear
growth assumption
∀ t ∈ [0, T ], ∀ x ∈ Rd , |b(t, x)| + kσ(t, x)k ≤ K 0 (1 + |x|).
In fact if b and σ are continuous this condition follows from (7.2) applied with (t, x) and (t, 0),
given the fact that t 7→ b(t, 0) is bounded on [0, T ].
• By adding the 0th component t to X, i.e. by setting Yt := (t, Xt ), one may always assume that
the (SDE) is homogeneous, i.e. that the coefficients b and σ only depend on the space variable.
This is often enough for applications, although it induces unnecessarily stringent assumptions on the
time variable in many theoretical results. Furthermore, when some ellipticity assumptions are
required, this way of considering the equation no longer works since the equation dt = 1dt + 0dWt
is completely degenerate.

7.1 Euler-Maruyama schemes


Except for some very specific equations, it is impossible to devise an exact simulation of the
process X, even at a fixed time T (by exact simulation, we mean writing XT = χ(U ), U ∼ U ([0, 1]);
see nevertheless [23] for the case d = 1 and σ ≡ 1). Consequently, to approximate E(f (XT )) by a
Monte Carlo method, one needs to approximate X by a process that can be simulated (at least at
a fixed number of instants). To this end, we will introduce three types of Euler schemes associated
to the SDE: the discrete time Euler scheme $\bar X = (\bar X_{t^n_k})_{0\le k\le n}$ with step $\frac Tn$, its càdlàg stepwise
constant extension known as the stepwise constant (Brownian) Euler scheme and the continuous
(Brownian) Euler scheme.

7.1.1 Discrete time and stepwise constant Euler scheme(s)


 The discrete time Euler scheme is defined by
$$\bar X^n_{t^n_{k+1}} = \bar X^n_{t^n_k} + \frac Tn\, b(t^n_k, \bar X^n_{t^n_k}) + \sigma(t^n_k, \bar X^n_{t^n_k})\sqrt{\frac Tn}\, U_{k+1},\qquad \bar X^n_0 = X_0,\quad k = 0, \dots, n-1, \qquad(7.3)$$
where $t^n_k = \frac{kT}{n}$, $k = 0,\dots,n$, and $(U_k)_{1\le k\le n}$ denotes a sequence of i.i.d. N (0; Iq )-distributed random
vectors given by
$$U_k := \sqrt{\frac nT}\,\big(W_{t^n_k} - W_{t^n_{k-1}}\big),\quad k = 1,\dots,n.$$
(Strictly speaking we should write $U^n_k$ rather than $U_k$.)
Notation: For convenience, we denote from now on $\underline t := t^n_k$ if $t \in [t^n_k, t^n_{k+1})$.

 The stepwise constant Euler scheme, denoted $(\widetilde X_t)_{t\in[0,T]}$ for convenience, is defined by
$$\widetilde X_t = \bar X_{\underline t},\quad t \in [0, T]. \qquad(7.4)$$
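The recursion (7.3) is elementary to implement. Below is a minimal Python sketch for a one-dimensional SDE; the model (a geometric Brownian motion) and all parameter values are illustrative, not taken from the text:

```python
import math
import random

def euler_scheme(b, sigma, x0, T, n, rng):
    """Simulate one path of the discrete time Euler scheme (7.3) for the
    1-dimensional SDE dX_t = b(t, X_t) dt + sigma(t, X_t) dW_t."""
    h = T / n
    x = x0
    path = [x]
    for k in range(n):
        t_k = k * h
        u = rng.gauss(0.0, 1.0)          # U_{k+1} ~ N(0; 1)
        x = x + b(t_k, x) * h + sigma(t_k, x) * math.sqrt(h) * u
        path.append(x)
    return path

# Illustrative model (hypothetical parameters): dX = 0.1 X dt + 0.3 X dW
rng = random.Random(42)
path = euler_scheme(lambda t, x: 0.1 * x, lambda t, x: 0.3 * x,
                    x0=100.0, T=1.0, n=250, rng=rng)
print(len(path))   # prints 251: the n + 1 values at the dates kT/n
```

The stepwise constant scheme (7.4) is then obtained for free: between two discretization dates one simply reuses the last computed value.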

7.1.2 Genuine (continuous) Euler scheme


At this stage it is natural to extend the definition (7.3) of the Euler scheme at every instant
t ∈ [0, T ] by interpolating the drift with respect to time and the diffusion coefficient with respect
to the Brownian motion, namely

$$\forall\, k \in \{0,\dots,n-1\},\ \forall\, t \in [t^n_k, t^n_{k+1}),\qquad \bar X_t = \bar X_{\underline t} + (t - \underline t)\, b(\underline t, \bar X_{\underline t}) + \sigma(\underline t, \bar X_{\underline t})\,(W_t - W_{\underline t}). \qquad(7.5)$$

It is clear that $\lim_{t\to t^n_{k+1},\, t<t^n_{k+1}} \bar X_t = \bar X_{t^n_{k+1}}$ since W has continuous paths. Consequently, so defined,
$(\bar X_t)_{t\in[0,T]}$ is an $\mathcal F^W_t$-adapted process with continuous paths.


The following proposition is the key property of the genuine (or continuous) Euler scheme.

Proposition 7.1 Assume that b and σ are continuous functions on [0, T ] × Rd . The genuine Euler
scheme is a (continuous) Itô process satisfying the pseudo-SDE with frozen coefficients
$$d\bar X_t = b(\underline t, \bar X_{\underline t})\,dt + \sigma(\underline t, \bar X_{\underline t})\,dW_t,\qquad \bar X_0 = X_0,$$
that is
$$\bar X_t = X_0 + \int_0^t b(\underline s, \bar X_{\underline s})\,ds + \int_0^t \sigma(\underline s, \bar X_{\underline s})\,dW_s. \qquad(7.6)$$

Proof. It is clear from (7.5), the recursive definition (7.3) at the discretization dates $t^n_k$ and the
continuity of b and σ that $\bar X_t \to \bar X_{t^n_{k+1}}$ as $t \to t^n_{k+1}$. Consequently, for every $t \in [t^n_k, t^n_{k+1}]$,
$$\bar X_t = \bar X_{t^n_k} + \int_{t^n_k}^t b(\underline s, \bar X_{\underline s})\,ds + \int_{t^n_k}^t \sigma(\underline s, \bar X_{\underline s})\,dW_s,$$
so that the conclusion follows by concatenating the above identities between $0$ and $t^n_1$, ..., $t^n_k$ and $t$. ♦
Notation: In the main statements, we will write $\bar X^n$ instead of $\bar X$ to recall the dependence of the
Euler scheme on its step T /n. Idem for $\widetilde X$, etc.

Then, the main (classical) result is that under the assumptions on the coefficients b and σ
mentioned above, supt∈[0,T ] |Xt − X̄t | goes to zero in every Lp (P), 0 < p < ∞ as n → +∞. Let us
be more specific on that topic by providing error rates under slightly more stringent assumptions.
How to use this continuous scheme for practical simulation is not obvious, at least not as
obvious as for the stepwise constant Euler scheme. However it turns out to be an important tool
to improve the convergence rate of Monte Carlo simulations, e.g. for option pricing. Using this scheme in
simulation relies on the so-called diffusion bridge method and will be detailed further on.
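The elementary fact underlying the diffusion bridge method announced above is the Brownian bridge property: conditionally on $W_{t_k}$ and $W_{t_{k+1}}$, the value $W_s$ for $s\in(t_k,t_{k+1})$ is Gaussian with mean the linear interpolation of the endpoint values and variance $(s-t_k)(t_{k+1}-s)/(t_{k+1}-t_k)$. A minimal Python sketch of this conditional sampling (all names and parameter values are illustrative):

```python
import math
import random

def brownian_bridge_point(w_left, w_right, t_left, t_right, s, rng):
    """Sample W_s given W_{t_left} = w_left and W_{t_right} = w_right
    (t_left < s < t_right): the conditional law is Gaussian with mean the
    linear interpolation and variance (s-t_left)(t_right-s)/(t_right-t_left)."""
    mean = w_left + (s - t_left) / (t_right - t_left) * (w_right - w_left)
    var = (s - t_left) * (t_right - s) / (t_right - t_left)
    return mean + math.sqrt(var) * rng.gauss(0.0, 1.0)

# Refine a coarse Brownian path at an intermediate date
rng = random.Random(0)
t = [0.0, 0.5, 1.0]
w = [0.0]
for k in range(2):
    w.append(w[-1] + math.sqrt(t[k + 1] - t[k]) * rng.gauss(0.0, 1.0))
w_mid = brownian_bridge_point(w[0], w[1], t[0], t[1], 0.25, rng)
print(isinstance(w_mid, float))
```

Repeating this midpoint refinement allows one to fill in the Brownian path between already simulated dates without re-simulating it, which is what the continuous Euler scheme requires in practice.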

7.2 Strong error rate and polynomial moments (I)


7.2.1 Main results and comments
We consider the SDE and its Euler-Maruyama scheme(s) as defined by (7.1) and (7.3), (7.5). A
first version of Theorem 7.2 below (including the second remark that follows) is mainly due to
O. Faure in his PhD thesis (see [48]).

 Polynomial moment control. It is often useful to have at hand the following uniform bounds for
the solution(s) of (SDE) and its Euler schemes. They first appear as a step in the proof of the
convergence rate but have many other applications; in particular, they are a key step in proving the
existence of global strong solutions to (SDE).

Proposition 7.2 Assume that the coefficients b and σ of the SDE (7.1) are Borel functions that
simply satisfy the following linear growth assumption:

∀ t ∈ [0, T ], ∀ x ∈ Rd , |b(t, x)| + kσ(t, x)k ≤ C(1 + |x|) (7.7)

for some real constant C > 0 and a “horizon” T > 0. Then, for every p ∈ (0, +∞), there exists a
universal positive real constant κp such that every strong solution (Xt )t∈[0,T ] (if any) satisfies
$$\Big\|\sup_{t\in[0,T]}|X_t|\Big\|_p \le 2\,e^{\kappa_p C T}\,(1+\|X_0\|_p)$$
and, for every n ≥ 1, the Euler scheme with step T /n satisfies
$$\Big\|\sup_{t\in[0,T]}|\bar X^n_t|\Big\|_p \le 2\,e^{\kappa_p C T}\,(1+\|X_0\|_p).$$

One noticeable consequence of this proposition is that, if b and σ satisfy (7.7) with the same real
constant C for every T > 0, then the conclusion holds true for every T > 0, providing a “rough”
exponential control in T of any solution.
 Uniform convergence rate in Lp (P). First we introduce the following condition (HTβ ) which
strengthens Assumption (7.2) by adding a time regularity assumption of Hölder type:
$$(H_T^\beta)\ \equiv\ \exists\,\beta\in[0,1],\ \exists\,C_{b,\sigma,T}>0\ \text{such that}\ \forall\,s,t\in[0,T],\ \forall\,x,y\in\mathbb R^d,$$
$$|b(t,x)-b(s,y)|+\|\sigma(t,x)-\sigma(s,y)\| \le C_{b,\sigma,T}\big(|t-s|^\beta+|y-x|\big). \qquad(7.8)$$

Theorem 7.2 (Strong Rate for the Euler scheme) (a) Continuous Euler scheme. Suppose
the coefficients b and σ of the SDE (7.1) satisfy the above regularity condition (HTβ ) for a
real constant Cb,σ,T > 0 and an exponent β ∈ (0, 1]. Then the continuous Euler scheme $(\bar X^n_t)_{t\in[0,T]}$
converges toward (Xt )t∈[0,T ] in every Lp (P), p > 0, such that X0 ∈ Lp , at a $O\big(n^{-(\frac12\wedge\beta)}\big)$-rate. To be
precise, there exists a universal constant κp > 0 only depending on p such that, for every n ≥ T ,
$$\Big\|\sup_{t\in[0,T]}|X_t-\bar X^n_t|\Big\|_p \le K(p,b,\sigma,T)\,(1+\|X_0\|_p)\Big(\frac Tn\Big)^{\beta\wedge\frac12} \qquad(7.9)$$
where
$$K(p,b,\sigma,T) = 2\,\kappa_p\,(C'_{b,\sigma,T})^2\,e^{\kappa_p(1+C'_{b,\sigma,T})T}$$
and
$$C'_{b,\sigma,T} = C_{b,\sigma,T} + \sup_{t\in[0,T]}|b(t,0)| + \sup_{t\in[0,T]}\|\sigma(t,0)\| < +\infty. \qquad(7.10)$$

In particular (7.9) is satisfied when the supremum is restricted to discretization instants, namely
$$\Big\|\sup_{0\le k\le n}|X_{t^n_k}-\bar X^n_{t^n_k}|\Big\|_p \le K(p,b,\sigma,T)\,(1+\|X_0\|_p)\Big(\frac Tn\Big)^{\beta\wedge\frac12}. \qquad(7.11)$$

(a′ ) If b and σ are defined on the whole R+ × Rd and satisfy (HTβ ) with the same real constant Cb,σ
not depending on T and if b( . , 0) and σ( . , 0) are bounded on R+ , then $C'_{b,\sigma,T}$ does not depend on T.
This is in particular the case in the homogeneous setting, i.e. if b(t, x) = b(x) and σ(t, x) = σ(x), t ∈ R+ ,
x ∈ Rd , with b and σ Lipschitz continuous on Rd .
(b) Stepwise constant Euler scheme.  As soon as b and σ satisfy the linear growth assumption (7.7) with a real constant Lb,σ,T > 0, then, for every p ∈ (0, +∞) and every n ≥ T ,
$$\Big\|\sup_{t\in[0,T]}|\bar X^n_t-\bar X^n_{\underline t}|\Big\|_p \le \tilde\kappa_p\,e^{\tilde\kappa_p L_{b,\sigma,T}T}\,(1+\|X_0\|_p)\sqrt{\frac Tn(1+\log n)} = O\Big(\sqrt{\frac{1+\log n}{n}}\Big)$$
where $\tilde\kappa_p>0$ is a positive real constant only depending on p (and increasing in p).
 In particular, if b and σ satisfy the assumptions of item (a), then the stepwise constant Euler
scheme $(\widetilde X^n_t)_{t\in[0,T]}$ converges toward (Xt )t∈[0,T ] in every Lp (P), p > 0, such that X0 ∈ Lp , and for
every n ≥ T ,
$$\Big\|\sup_{t\in[0,T]}|X_t-\widetilde X^n_t|\Big\|_p \le \widetilde K(p,b,\sigma,T)\,(1+\|X_0\|_p)\Big(\sqrt{\frac Tn(1+\log n)}+\Big(\frac Tn\Big)^{\beta\wedge\frac12}\Big) = O\Big(\Big(\frac1n\Big)^{\beta}+\sqrt{\frac{1+\log n}{n}}\Big)$$
where
$$\widetilde K(p,b,\sigma,T) = \tilde\kappa'_p\,(1+C'_{b,\sigma,T})^2\,e^{\tilde\kappa'_p(1+C'_{b,\sigma,T})T},$$
$\tilde\kappa'_p>0$ is a positive real constant only depending on p (increasing in p) and $C'_{b,\sigma,T}$ is given by (7.10).

Warning! The complete and detailed proof of this theorem in its full generality, i.e. including
the management of the constants, is postponed to Section 7.8. It makes use of stochastic calculus.
A first approach to the proof in the one-dimensional quadratic case is proposed in Section 7.2.2.
However, owing to its importance for applications, the optimality of the upper bound for the
stepwise constant Euler scheme will be discussed right after the remarks below.

Remarks. • When n ≤ T , the above explicit bounds still hold true with the same constants
provided one replaces
$$\Big(\frac Tn\Big)^{\beta\wedge\frac12}\ \text{by}\ \frac12\Big(\Big(\frac Tn\Big)^{\beta}+\Big(\frac Tn\Big)^{\frac12}\Big)\quad\text{and}\quad \sqrt{\frac Tn(1+\log n)}\ \text{by}\ \frac12\Big(\sqrt{\frac Tn(1+\log n)}+\frac Tn\Big).$$
Note that, as soon as $\frac Tn\le1$, the original formulation applies, which significantly simplifies the error bound in item (b) of the above theorem.

• As a consequence, note that the time regularity exponent β rules the convergence rate of the
scheme as soon as β < 1/2. In fact, the method of proof itself will emphasize this fact: the idea
is to use a Gronwall Lemma to upper-bound the error $X-\bar X$ in Lp (P) by the Lp (P)-norm of the
increments $\bar X_s-\bar X_{\underline s}$.
• If b(t, x) and σ(t, x) are globally Lipschitz continuous on R+ × Rd with Lipschitz
coefficient Cb,σ , one may consider time t as a (d + 1)th spatial component of X and
apply directly item (a′ ) of the above theorem.
The following corollary is a straightforward consequence of claims (a) of the theorem (to be
precise (7.11)): it yields a (first) convergence rate for the pricing of “vanilla” European options
(payoff ϕ(XT )).

Corollary 7.1 Let ϕ : Rd → R be an α-Hölder function for an exponent α ∈ (0, 1], i.e. a function
such that $[\varphi]_\alpha := \sup_{x\ne y}\frac{|\varphi(x)-\varphi(y)|}{|x-y|^\alpha} < +\infty$. Then, there exists a real constant Cb,σ,T ∈ (0, ∞) such
that, for every n ≥ 1,
$$\big|E\,\varphi(X_T)-E\,\varphi(\bar X^n_T)\big| \le E\,\big|\varphi(X_T)-\varphi(\bar X^n_T)\big| \le C_{b,\sigma,T}\,[\varphi]_\alpha\Big(\frac Tn\Big)^{\frac\alpha2}.$$
We will see further on that this rate can be considerably improved when b, σ and ϕ share higher
regularity properties.
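The strong rate (7.11) can be checked numerically on a model with a closed-form solution. The following sketch is a hedged illustration (the model, a geometric Brownian motion, and all parameter values are arbitrary choices, not taken from the text): it estimates $E|X_T-\bar X^n_T|$ by driving the exact solution and the Euler scheme with the same Brownian increments, so that dividing the step by 16 should roughly divide the error by 4.

```python
import math
import random

def strong_error_gbm(n, M, rng, mu=0.1, sig=0.3, x0=100.0, T=1.0, n_fine=160):
    """Monte Carlo estimate of E|X_T - \\bar X_T^n| for geometric Brownian
    motion, using the closed-form solution and common Brownian increments."""
    assert n_fine % n == 0
    h_fine, acc = T / n_fine, 0.0
    for _ in range(M):
        dw = [math.sqrt(h_fine) * rng.gauss(0.0, 1.0) for _ in range(n_fine)]
        # exact solution: X_T = x0 * exp((mu - sig^2/2) T + sig W_T)
        w_T = sum(dw)
        x_exact = x0 * math.exp((mu - 0.5 * sig * sig) * T + sig * w_T)
        # Euler scheme with step T/n driven by the same Brownian path
        x, r = x0, n_fine // n
        for k in range(n):
            dW = sum(dw[k * r:(k + 1) * r])
            x += mu * x * (T / n) + sig * x * dW
        acc += abs(x_exact - x)
    return acc / M

rng = random.Random(123)
e_coarse = strong_error_gbm(10, 2000, rng)
e_fine = strong_error_gbm(160, 2000, rng)
print(e_coarse > e_fine)   # the error decreases as the step shrinks
```

The observed ratio e_coarse / e_fine is close to 4 up to Monte Carlo noise, consistent with the $O(n^{-1/2})$ rate (here β = 1, so β ∧ 1/2 = 1/2).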

 About the universality of the rate for the stepwise constant Euler scheme (p ≥ 2). Note that the
rate in claim (b) of the theorem is universal since it holds as a sharp rate for the Brownian motion
itself (here we deal with the case d = 1). Indeed, since W is its own continuous Euler scheme,
$$\Big\|\sup_{t\in[0,T]}|W_t-W_{\underline t}|\Big\|_p = \Big\|\max_{k=1,\dots,n}\ \sup_{t\in[t^n_{k-1},t^n_k)}|W_t-W_{t^n_{k-1}}|\Big\|_p = \sqrt{\frac Tn}\,\Big\|\max_{k=1,\dots,n}\ \sup_{t\in[k-1,k)}|\widetilde W_t-\widetilde W_{k-1}|\Big\|_p$$
where $\widetilde W_t := \sqrt{\frac nT}\,W_{\frac Tn t}$ is a standard Brownian motion owing to the scaling property. Hence
$$\Big\|\sup_{t\in[0,T]}|W_t-W_{\underline t}|\Big\|_p = \sqrt{\frac Tn}\,\Big\|\max_{k=1,\dots,n}\zeta_k\Big\|_p$$
where the random variables $\zeta_k := \sup_{t\in[k-1,k)}|W_t-W_{k-1}|$ are i.i.d.


Lower bound. Note that, for every k ≥ 1,
$$\zeta_k \ge Z_k := |W_k-W_{k-1}|$$
since the Brownian motion W has continuous paths. The sequence (Zk )k≥1 is i.i.d. as well, with the
same distribution as |W1 |. Hence, the random variables Zk2 are still i.i.d. with a χ2 (1)-distribution
so that (see item (b) of the exercises below)
$$\forall\,p\ge2,\quad \Big\|\max_{k=1,\dots,n}|Z_k|\Big\|_p = \Big\|\max_{k=1,\dots,n}Z_k^2\Big\|_{p/2}^{1/2} \ge c_p\sqrt{\log n},$$
so that, finally,
$$\forall\,p\ge2,\quad \Big\|\sup_{t\in[0,T]}|W_t-W_{\underline t}|\Big\|_p \ge c_p\sqrt{\frac Tn\log n}.$$
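The $\sqrt{\log n}$ growth of the maximum of n i.i.d. (absolute) Gaussian variables used in the lower bound is easy to observe numerically: $E\max_{k\le n}|Z_k|$ behaves like $\sqrt{2\log n}$ up to bounded corrections. A quick Monte Carlo check (sample sizes are illustrative):

```python
import math
import random

def mean_max_abs_gauss(n, M, rng):
    """Monte Carlo estimate of E max_{k<=n} |Z_k| for i.i.d. N(0,1)."""
    acc = 0.0
    for _ in range(M):
        acc += max(abs(rng.gauss(0.0, 1.0)) for _ in range(n))
    return acc / M

rng = random.Random(7)
ratios = []
for n in (100, 10000):
    est = mean_max_abs_gauss(n, 300, rng)
    ratios.append(est / math.sqrt(2 * math.log(n)))
print(all(0.75 < r < 1.05 for r in ratios))
```

Both ratios sit close to (and slightly below) 1, in line with the classical extreme value asymptotics for Gaussian maxima.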
Upper bound. To establish the upper bound, we proceed as follows. First, note that
$$\zeta_1 = \max\Big(\sup_{t\in[0,1)}W_t,\ \sup_{t\in[0,1)}(-W_t)\Big).$$
We also know that
$$\sup_{t\in[0,1)}W_t \overset{d}{=} |W_1|$$
(see e.g. [141], Reflection principle, p.105). Hence, using that for every a, b ≥ 0, $e^{(a\vee b)^2}\le e^{a^2}+e^{b^2}$,
that $\sup_{t\in[0,1)}W_t\ge0$ and that −W is also a standard Brownian motion, we derive that
$$E\,e^{\theta\zeta_1^2} \le E\,e^{\theta(\sup_{t\in[0,1)}W_t)^2} + E\,e^{\theta(\sup_{t\in[0,1)}(-W_t))^2} = 2\,E\,e^{\theta(\sup_{t\in[0,1)}W_t)^2} = 2\,E\,e^{\theta W_1^2} = 2\int_{\mathbb R}\exp\Big(-\frac{u^2}{2\big(\frac1{\sqrt{1-2\theta}}\big)^2}\Big)\frac{du}{\sqrt{2\pi}} = \frac2{\sqrt{1-2\theta}} < +\infty$$
as long as $\theta\in(0,\frac12)$. Consequently, it follows from Lemma 7.1 below applied with the sequence
(ζn2 )n≥1 that
$$\Big\|\max_{k=1,\dots,n}\zeta_k\Big\|_p = \Big\|\max_{k=1,\dots,n}\zeta_k^2\Big\|_{p/2}^{1/2} \le C_p\sqrt{1+\log n},$$
i.e., for every p ∈ (0, ∞),
$$\Big\|\sup_{t\in[0,T]}|W_t-W_{\underline t}|\Big\|_p \le C_{W,p}\sqrt{\frac Tn(1+\log n)}. \qquad(7.12)$$

i.e., for every p ∈ (0, ∞),


r
T
k sup |Wt − Wt |kp ≤ CW,p (1 + log n). (7.12)
t∈[0,T ] n
Lemma 7.1 Let Y1 , . . . , Yn be non-negative random variables with the same distribution satisfying
E(eλY1 ) < +∞ for some λ > 0. Then,
$$\forall\,p\in(0,+\infty),\quad \big\|\max(Y_1,\dots,Y_n)\big\|_p \le \frac1\lambda\big(\log n + C_{p,Y_1,\lambda}\big).$$
Proof : We may assume without loss of generality that p ≥ 1 since the k . kp -norm is non-decreasing
in p. First, assume λ = 1. Let p ≥ 1. One sets
$$\varphi_p(x) = \big(\log(e^{p-1}+x)\big)^p - (p-1)^p,\quad x>0.$$
The function ϕp is continuous, increasing, concave and one-to-one from R+ onto R+ (the term ep−1
is introduced to ensure the concavity). It follows that $\varphi_p^{-1}(y) = e^{((p-1)^p+y)^{1/p}} - e^{p-1} \le e^{y^{1/p}}$ for
every y ≥ 0 (since $(u+v)^{\frac1p}\le u^{\frac1p}+v^{\frac1p}$, u, v ≥ 0) so that
$$E\,\max_{k=1,\dots,n}Y_k^p = E\,\max_{k=1,\dots,n}\big(\varphi_p\circ\varphi_p^{-1}(Y_k^p)\big) = E\,\varphi_p\Big(\max\big(\varphi_p^{-1}(Y_1^p),\dots,\varphi_p^{-1}(Y_n^p)\big)\Big)$$

since ϕp is non-decreasing. Then Jensen’s Inequality implies
$$E\,\max_{k=1,\dots,n}Y_k^p \le \varphi_p\Big(E\,\max_{k=1,\dots,n}\varphi_p^{-1}(Y_k^p)\Big) \le \varphi_p\Big(\sum_{k=1}^n E\,\varphi_p^{-1}(Y_k^p)\Big) = \varphi_p\Big(n\,E\,\varphi_p^{-1}(Y_1^p)\Big) \le \varphi_p\big(n\,E\,e^{Y_1}\big) \le \Big(\log\big(e^{p-1}+n\,E\,e^{Y_1}\big)\Big)^p.$$
Hence
$$\Big\|\max_{k=1,\dots,n}Y_k\Big\|_p \le \log\big(e^{p-1}+n\,E\,e^{Y_1}\big) = \log n + \log\Big(E\,e^{Y_1}+\frac{e^{p-1}}n\Big) \le \log n + C_{p,Y_1}$$
where $C_{p,Y_1} = \log\big(E\,e^{Y_1}+e^{p-1}\big)$.
Let us come back to the general case, i.e. $E\,e^{\lambda Y_1}<+\infty$ for some λ > 0. Then
$$\big\|\max(Y_1,\dots,Y_n)\big\|_p = \frac1\lambda\big\|\max(\lambda Y_1,\dots,\lambda Y_n)\big\|_p \le \frac1\lambda\big(\log n + C_{p,\lambda Y_1}\big). \qquad\diamond$$
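Lemma 7.1 can be sanity-checked on exponential variables: if the $Y_k$ are i.i.d. Exp(1) (so that $E\,e^{\lambda Y_1}<+\infty$ for every $\lambda\in(0,1)$), the exact value of $E\max(Y_1,\dots,Y_n)$ is the harmonic number $H_n\approx\log n+0.577$, in line with the $\frac1\lambda(\log n+C)$ bound and showing that the $\log n$ growth is sharp. A short illustrative check (parameters arbitrary):

```python
import math
import random

def mean_max_exp(n, M, rng):
    """Monte Carlo estimate of E max(Y_1,...,Y_n), Y_k i.i.d. Exp(1).
    The exact value is the harmonic number H_n = 1 + 1/2 + ... + 1/n."""
    acc = 0.0
    for _ in range(M):
        acc += max(rng.expovariate(1.0) for _ in range(n))
    return acc / M

rng = random.Random(1)
n = 1000
h_n = sum(1.0 / k for k in range(1, n + 1))   # ~ log n + 0.577 (Euler's constant)
est = mean_max_exp(n, 2000, rng)
print(abs(est - h_n) < 0.3)   # the estimate matches H_n up to Monte Carlo noise
```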

 Exercises. (a) Let Z be a non-negative random variable with distribution function F (z) =
P(Z ≤ z) and a continuous probability density function f . Show that if the survival function
F̄ (z) := P(Z > z) satisfies
$$\forall\,z\ge a>0,\quad \bar F(z) \ge c\,f(z),$$
then, if (Zn )n≥1 is i.i.d. with distribution PZ (dz),
$$\forall\,p\ge1,\quad \big\|\max(Z_1,\dots,Z_n)\big\|_p \ge c\sum_{k=1}^n\frac{1-F^k(a)}k = c\big(\log n+\log(1-F(a))\big)+C+\varepsilon_n$$
with $\lim_n\varepsilon_n=0$.
[Hint: one may assume p = 1. Then use the classical representation formula
$$E\,U = \int_0^{+\infty}P(U\ge u)\,du$$
for any non-negative random variable U and some basic facts about the Stieltjes integral like dF (z) =
f (z)dz, etc.]
(b) Show that the χ2 (1) distribution, with density $f(u) := \frac{e^{-\frac u2}}{\sqrt{2\pi u}}\,1_{\{u>0\}}$, satisfies the above inequality
for any a > 0. [Hint: use an integration by parts and usual comparison theorems on integrals to
show that
$$\bar F(z) = \frac{2e^{-\frac z2}}{\sqrt{2\pi z}} - \int_z^{+\infty}\frac{e^{-\frac u2}}{u\sqrt{2\pi u}}\,du\ \sim\ \frac{2e^{-\frac z2}}{\sqrt{2\pi z}}\quad\text{as}\ z\to+\infty.\Big]$$

 A.s. convergence rate(s). The last important result of this section is devoted to the a.s. conver-
gence of the Euler schemes toward the diffusion process with a first (elementary) approach to its
rate of convergence.

Theorem 7.3 If b and σ satisfy (HTβ ) for a β ∈ (0, 1] and if X0 is a.s. finite, the continuous
Euler scheme $\bar X^n = (\bar X^n_t)_{t\in[0,T]}$ a.s. converges toward the diffusion X for the sup-norm over [0, T ].
Furthermore, for every α ∈ [0, β ∧ 1/2 ),
$$n^\alpha\sup_{t\in[0,T]}|X_t-\bar X^n_t| \xrightarrow{\ a.s.\ } 0.$$

The proof follows from the Lp -convergence theorem by an approach “à la Borel-Cantelli”. The
details are deferred to Section 7.8.6.

7.2.2 Proofs in the quadratic Lipschitz case for homogeneous diffusions


We provide below a proof of both Proposition 7.2 and Theorem 7.2 in a restricted 1-dimensional,
homogeneous and quadratic (p = 2) setting. This means that b(t, x) = b(x) and σ(t, x) = σ(x)
are defined as Lipschitz continuous functions on the real line. Then (SDE) admits a
unique strong solution starting from X0 on every interval [0, T ], which means that there exists a
unique strong solution (Xt )t≥0 starting from X0 .
Furthermore, we will not care about the structure of the real constants that come out, in
particular no control in T is provided in this simplified version of the proof. The complete and
detailed proof is postponed to Section 7.8.

Lemma 7.2 (Gronwall Lemma) Let f : R+ → R+ be a Borel non-negative locally bounded function
and let ψ : R+ → R+ be a non-decreasing function satisfying
$$(G)\ \equiv\ \forall\,t\ge0,\quad f(t) \le \alpha\int_0^t f(s)\,ds + \psi(t)$$
for a real constant α > 0. Then
$$\forall\,t\ge0,\quad \sup_{0\le s\le t}f(s) \le e^{\alpha t}\,\psi(t).$$

Proof. It is clear that the non-decreasing (finite) function $\varphi(t) := \sup_{0\le s\le t}f(s)$ satisfies (G)
instead of f . Now the function $e^{-\alpha t}\int_0^t\varphi(s)\,ds$ has a right derivative at every t ≥ 0 and
$$\Big(e^{-\alpha t}\int_0^t\varphi(s)\,ds\Big)' = e^{-\alpha t}\Big(\varphi(t+)-\alpha\int_0^t\varphi(s)\,ds\Big) \le e^{-\alpha t}\,\psi(t+)$$
where ϕ(t+) and ψ(t+) denote the right limits of ϕ and ψ at t. Then, it follows from the fundamental
theorem of calculus that
$$t\longmapsto e^{-\alpha t}\int_0^t\varphi(s)\,ds - \int_0^t e^{-\alpha s}\,\psi(s+)\,ds\quad\text{is non-increasing.}$$
Hence, applying that between 0 and t yields
$$\int_0^t\varphi(s)\,ds \le e^{\alpha t}\int_0^t e^{-\alpha s}\,\psi(s+)\,ds.$$
Plugging this in the inequality (G) satisfied by ϕ implies
$$\varphi(t) \le \alpha\,e^{\alpha t}\int_0^t e^{-\alpha s}\,\psi(s+)\,ds + \psi(t) = \alpha\,e^{\alpha t}\int_0^t e^{-\alpha s}\,\psi(s)\,ds + \psi(t) \le e^{\alpha t}\,\psi(t)$$
where we used successively that a monotone function is ds-a.s. continuous and that ψ is non-decreasing. ♦

Now we recall the classical Doob’s Inequality that is needed to carry out the proof (instead of
the more sophisticated Burkholder-Davis-Gundy Inequality which is necessary in the non-quadratic
case).

Doob’s Inequality. (see e.g. [91]) (a) Let M = (Mt )t≥0 be a continuous martingale with M0 = 0.
Then, for every T > 0,
$$E\Big(\sup_{t\in[0,T]}M_t^2\Big) \le 4\,E\,M_T^2 = 4\,E\,\langle M\rangle_T.$$
(b) If M is simply a continuous local martingale with M0 = 0, then, for every T > 0,
$$E\Big(\sup_{t\in[0,T]}M_t^2\Big) \le 4\,E\,\langle M\rangle_T.$$

Proof of Proposition 7.2 (a first partial proof). We may assume without loss of generality that
$E\,X_0^2<+\infty$ (otherwise the inequality is trivially fulfilled). Let $\tau_L := \min\{t : |X_t-X_0|\ge L\}$,
L ∈ N \ {0} (with the usual convention min ∅ = +∞). It is a positive F-stopping time as the
hitting time of a closed set by a process with continuous paths. Furthermore, for every t ∈ [0, ∞),
$$|X_t^{\tau_L}| \le L + |X_0|.$$
In particular this implies that
$$E\,\sup_{t\in[0,T]}|X_t^{\tau_L}|^2 \le 2\,(L^2+E\,X_0^2) < +\infty.$$

Then,
$$X_t^{\tau_L} = X_0 + \int_0^{t\wedge\tau_L}b(X_s)\,ds + \int_0^{t\wedge\tau_L}\sigma(X_s)\,dW_s = X_0 + \int_0^{t\wedge\tau_L}b(X_s^{\tau_L})\,ds + \int_0^{t\wedge\tau_L}\sigma(X_s^{\tau_L})\,dW_s$$
owing to the local feature of (standard and) stochastic integral(s). The stochastic integral
$$M_t^{(L)} := \int_0^{t\wedge\tau_L}\sigma(X_s^{\tau_L})\,dW_s$$
is a continuous local martingale null at zero with bracket process defined by
$$\langle M^{(L)}\rangle_t = \int_0^{t\wedge\tau_L}\sigma^2(X_s^{\tau_L})\,ds.$$

Now, using that t ∧ τL ≤ t, we derive that
$$|X_t^{\tau_L}| \le |X_0| + \int_0^t|b(X_s^{\tau_L})|\,ds + \sup_{s\in[0,t]}|M_s^{(L)}|,$$
which in turn immediately implies that
$$\sup_{s\in[0,t]}|X_s^{\tau_L}| \le |X_0| + \int_0^t|b(X_s^{\tau_L})|\,ds + \sup_{s\in[0,t]}|M_s^{(L)}|.$$
The elementary inequality (a + b + c)2 ≤ 3(a2 + b2 + c2 ) (a, b, c ≥ 0), combined with the Schwarz
Inequality, successively yields
$$\sup_{s\in[0,t]}(X_s^{\tau_L})^2 \le 3\Big(X_0^2+\Big(\int_0^t|b(X_s^{\tau_L})|\,ds\Big)^2+\sup_{s\in[0,t]}|M_s^{(L)}|^2\Big) \le 3\Big(X_0^2+t\int_0^t|b(X_s^{\tau_L})|^2\,ds+\sup_{s\in[0,t]}|M_s^{(L)}|^2\Big).$$

We know that the functions b and σ satisfy a linear growth assumption
$$|b(x)|+|\sigma(x)| \le C_{b,\sigma}\,(1+|x|),\quad x\in\mathbb R,$$
as Lipschitz continuous functions. Then, taking expectation and using Doob’s Inequality
for the local martingale M (L) yields, for an appropriate real constant Cb,σ,T > 0 (that may vary from
line to line),
$$E\Big(\sup_{s\in[0,t]}(X_s^{\tau_L})^2\Big) \le 3\Big(E\,X_0^2 + T\,C_{b,\sigma}^2\int_0^t\big(1+E|X_s^{\tau_L}|\big)^2\,ds + 4\,E\int_0^{t\wedge\tau_L}\sigma^2(X_s^{\tau_L})\,ds\Big)$$
$$\le C_{b,\sigma,T}\Big(E\,X_0^2 + \int_0^t\big(1+E|X_s^{\tau_L}|^2\big)\,ds + E\int_0^t\big(1+|X_s^{\tau_L}|\big)^2\,ds\Big) \le C_{b,\sigma,T}\Big(E\,X_0^2 + \int_0^t\big(1+E|X_s^{\tau_L}|^2\big)\,ds\Big)$$
where we used again (in the first inequality) that τL ∧ t ≤ t. Finally, this can be rewritten
$$E\Big(\sup_{s\in[0,t]}(X_s^{\tau_L})^2\Big) \le C_{b,\sigma,T}\Big(1+E\,X_0^2+\int_0^t E\Big(\sup_{u\in[0,s]}(X_u^{\tau_L})^2\Big)\,ds\Big)$$
for a new real constant Cb,σ,T . Then the Gronwall Lemma applied to the bounded function $f_L(t) := E\big(\sup_{s\in[0,t]}(X_s^{\tau_L})^2\big)$ (at time t = T ) implies
$$E\Big(\sup_{s\in[0,T]}(X_s^{\tau_L})^2\Big) \le C_{b,\sigma,T}\,(1+E\,X_0^2)\,e^{C_{b,\sigma,T}T}.$$
This holds for every L ≥ 1. Now τL ↑ +∞ a.s. as L ↑ +∞ since $\sup_{0\le s\le t}|X_s|<+\infty$ for every t ≥ 0
a.s. Consequently,
$$\lim_{L\to+\infty}\ \sup_{s\in[0,T]}|X_s^{\tau_L}| = \sup_{s\in[0,T]}|X_s|.$$
Then Fatou’s Lemma implies
$$E\Big(\sup_{s\in[0,T]}X_s^2\Big) \le C_{b,\sigma,T}\,(1+E\,X_0^2)\,e^{C_{b,\sigma,T}T} = C'_{b,\sigma,T}\,(1+E\,X_0^2).$$

As for the Euler scheme, the same proof works perfectly if we introduce the stopping time
$$\bar\tau_L = \min\big\{t : |\bar X_t-X_0|\ge L\big\}$$
and if we note that, for every s ∈ [0, T ], $\sup_{u\in[0,s]}|\bar X_{\underline u}| \le \sup_{u\in[0,s]}|\bar X_u|$. Then one shows that
$$\sup_{n\ge1}E\Big(\sup_{s\in[0,T]}(\bar X_s^n)^2\Big) \le C_{b,\sigma,T}\,(1+E\,X_0^2)\,e^{C_{b,\sigma,T}T}. \qquad\diamond$$

Proof of Theorem 7.2 (partial). (a) (Convergence rate of the continuous Euler scheme). Combining the equations satisfied by X and its (continuous) Euler scheme yields
$$X_t-\bar X_t = \int_0^t\big(b(X_s)-b(\bar X_{\underline s})\big)\,ds + \int_0^t\big(\sigma(X_s)-\sigma(\bar X_{\underline s})\big)\,dW_s.$$
Consequently, using that b and σ are Lipschitz continuous, the Schwarz and Doob Inequalities lead to
$$E\sup_{s\in[0,t]}|X_s-\bar X_s|^2 \le 2\,E\Big(\int_0^t[b]_{\rm Lip}\,|X_s-\bar X_{\underline s}|\,ds\Big)^2 + 2\,E\sup_{s\in[0,t]}\Big|\int_0^s\big(\sigma(X_u)-\sigma(\bar X_{\underline u})\big)\,dW_u\Big|^2$$
$$\le 2\,T\,[b]^2_{\rm Lip}\int_0^t E|X_s-\bar X_{\underline s}|^2\,ds + 8\,[\sigma]^2_{\rm Lip}\int_0^t E|X_u-\bar X_{\underline u}|^2\,du$$
$$\le C_{b,\sigma,T}\int_0^t E\sup_{u\in[0,s]}|X_u-\bar X_u|^2\,ds + C_{b,\sigma,T}\int_0^t E|\bar X_s-\bar X_{\underline s}|^2\,ds.$$
The function $f(t) := E\sup_{s\in[0,t]}|X_s-\bar X_s|^2$ is locally bounded owing to Step 1. Consequently, it
follows from the Gronwall Lemma (at t = T ) that
$$E\sup_{s\in[0,T]}|X_s-\bar X_s|^2 \le C_{b,\sigma,T}\Big(\int_0^T E|\bar X_s-\bar X_{\underline s}|^2\,ds\Big)\,e^{C_{b,\sigma,T}T}.$$
Now
$$\bar X_s-\bar X_{\underline s} = b(\bar X_{\underline s})\,(s-\underline s) + \sigma(\bar X_{\underline s})\,(W_s-W_{\underline s}) \qquad(7.13)$$
so that, using Step 1 (for the Euler scheme) and the fact that $W_s-W_{\underline s}$ and $\bar X_{\underline s}$ are independent,
$$E|\bar X_s-\bar X_{\underline s}|^2 \le C_{b,\sigma}\Big(E\,b^2(\bar X_{\underline s})\Big(\frac Tn\Big)^2 + E\,\sigma^2(\bar X_{\underline s})\,E(W_s-W_{\underline s})^2\Big) \le C_{b,\sigma}\Big(1+E\sup_{t\in[0,T]}|\bar X_t|^2\Big)\Big(\Big(\frac Tn\Big)^2+\frac Tn\Big) \le C_{b,\sigma,T}\,\big(1+E\,X_0^2\big)\,\frac Tn.$$
(b) Stepwise constant Euler scheme. We assume here – for pure convenience – that X0 ∈ L4 . One
derives from (7.13) and the linear growth assumption satisfied by b and σ (since they are Lipschitz
continuous) that
$$\sup_{t\in[0,T]}|\bar X_t-\bar X_{\underline t}| \le C_{b,\sigma}\Big(1+\sup_{t\in[0,T]}|\bar X_t|\Big)\Big(\frac Tn+\sup_{t\in[0,T]}|W_t-W_{\underline t}|\Big)$$
so that
$$\Big\|\sup_{t\in[0,T]}|\bar X_t-\bar X_{\underline t}|\Big\|_2 \le C_{b,\sigma}\Big\|\Big(1+\sup_{t\in[0,T]}|\bar X_t|\Big)\Big(\frac Tn+\sup_{t\in[0,T]}|W_t-W_{\underline t}|\Big)\Big\|_2.$$
Now, if U and V are real valued random variables, the Schwarz Inequality implies
$$\|UV\|_2 = \sqrt{\|U^2V^2\|_1} \le \sqrt{\|U^2\|_2\,\|V^2\|_2} = \|U\|_4\,\|V\|_4.$$
Consequently,
$$\Big\|\sup_{t\in[0,T]}|\bar X_t-\bar X_{\underline t}|\Big\|_2 \le C_{b,\sigma}\Big(1+\Big\|\sup_{t\in[0,T]}|\bar X_t|\Big\|_4\Big)\Big(\frac Tn+\Big\|\sup_{t\in[0,T]}|W_t-W_{\underline t}|\Big\|_4\Big).$$
Now, as already mentioned in the first remark that follows Theorem 7.2,
$$\Big\|\sup_{t\in[0,T]}|W_t-W_{\underline t}|\Big\|_4 \le C_W\sqrt{\frac Tn}\,\sqrt{1+\log n},$$
which completes the proof (if we admit that $\big\|\sup_{t\in[0,T]}|\bar X_t|\big\|_4 \le C_{b,\sigma,T}\,(1+\|X_0\|_4)$). ♦

Remarks. • The proof in the general Lp framework follows exactly the same lines, except that
one replaces Doob’s Inequality for continuous (local) martingales (Mt )t≥0 by the so-called Burkholder-Davis-Gundy Inequality (see e.g. [141]), which holds for every exponent p > 0 (only in the
continuous setting):
$$\forall\,t\ge0,\quad \Big\|\sup_{s\in[0,t]}|M_s|\Big\|_p \le C_p\,\big\|\langle M\rangle_t\big\|_{p/2}^{1/2}$$
where Cp is a positive real constant only depending on p. This general setting is developed in full
detail in Section 7.8 (in the one dimensional case to alleviate notations).
• In some so-called mean-reverting situations one may even get boundedness over t ∈ (0, ∞).

7.3 Non asymptotic deviation inequalities for the Euler scheme


It will be convenient to introduce in this section temporary notations.
• We denote x1:d = (x1 , . . . , xd ) ∈ Rd . Let $|x_{1:d}|^2 = x_1^2+\cdots+x_d^2$ define the canonical Euclidean
norm on Rd .
• Let $\|A\| = \big({\rm Tr}(AA^*)\big)^{\frac12}$ denote the Euclidean norm of A ∈ Md,q (R) and let $|||A||| = \sup_{|x|=1}|Ax|$
be the operator norm of A (with respect to the Euclidean norms on Rd and Rq ). Note that |||A||| ≤ kAk.

We still consider a Brownian diffusion process with drift b and diffusion coefficient σ,
starting at X0 , solution to the SDE (7.1). Furthermore, we assume that b : [0, T ] × Rd → Rd and
σ : [0, T ] × Rd → Md,q (R) are Borel functions satisfying a Lipschitz continuity assumption in x
uniformly in t ∈ [0, T ], i.e.
$$[b]_{\rm Lip} = \sup_{t\in[0,T],\,x\ne y}\frac{|b(t,x)-b(t,y)|}{|x-y|} < +\infty\quad\text{and}\quad [\sigma]_{\rm Lip} = \sup_{t\in[0,T],\,x\ne y}\frac{\|\sigma(t,x)-\sigma(t,y)\|}{|x-y|} < +\infty.$$
The (augmented) natural filtration of the driving standard q-dimensional Brownian motion W
defined on a probability space (Ω, A, P) will be denoted (FtW )t∈[0,T ] .
The definition of the Brownian Euler scheme with step T /n (starting at X0 ) is unchanged but,
to alleviate notations, we will write $\bar X^n_k$ (rather than $\bar X^n_{t^n_k}$). So we have $\bar X^n_0 = X_0$ and
$$\bar X^n_{k+1} = \bar X^n_k + \frac Tn\,b(t^n_k,\bar X^n_k) + \sigma(t^n_k,\bar X^n_k)\sqrt{\frac Tn}\,Z_{k+1},\quad k=0,\dots,n-1,$$
where $t^n_k = \frac{kT}n$, $k=0,\dots,n$, and $Z_k = \sqrt{\frac nT}\,\big(W_{t^n_k}-W_{t^n_{k-1}}\big)$, $k=1,\dots,n$, is an i.i.d. sequence of
N (0, Iq )-distributed random vectors. When X0 = x ∈ Rd , we may occasionally denote by $(\bar X^{n,x}_k)_{0\le k\le n}$ the
Euler scheme starting at x.


The Euler scheme defines a Markov chain with transitions Pk,k+1 (x, dy), k = 0, . . . , n − 1,
reading on bounded or non-negative Borel functions f : Rd → R,
$$P_{k,k+1}(f)(x) = E\big(f(\bar X^n_{k+1})\,|\,\bar X^n_k=x\big) = E\,f\Big(x+\frac Tn\,b(t^n_k,x)+\sigma(t^n_k,x)\sqrt{\frac Tn}\,Z\Big),\quad k=0,\dots,n-1,$$
where Z ∼ N (0, Iq ). Then, for every k, ℓ ∈ {0, . . . , n − 1}, k ≤ ℓ, we set
$$P_{k,\ell} = P_{k,k+1}\circ\cdots\circ P_{\ell-1,\ell}$$
so that we have, still for bounded or nonnegative Borel functions f ,
$$P_{k,\ell}(f)(x) = E\big(f(\bar X_\ell)\,|\,\bar X_k=x\big).$$
Still for simplicity we write Pk instead of Pk,k+1 .

Lemma 7.3 Under the above assumptions on b and σ, the transitions Pk satisfy
$$[P_kf]_{\rm Lip} \le [P_k]_{\rm Lip}\,[f]_{\rm Lip},\quad k=0,\dots,n-1,$$
with
$$[P_k]_{\rm Lip} = \Big(\Big[{\rm Id}+\frac Tn\,b(t^n_k,\,.\,)\Big]^2_{\rm Lip} + \frac Tn\,\big[\sigma(t^n_k,\,.\,)\big]^2_{\rm Lip}\Big)^{\frac12} \le \Big(1+\frac Tn\Big(C_{b,\sigma}+\kappa_b\,\frac Tn\Big)\Big)^{\frac12}$$
where $C_{b,\sigma} = 2[b]_{\rm Lip}+[\sigma]^2_{\rm Lip}$ and $\kappa_b = [b]^2_{\rm Lip}$.
Proof. This is a straightforward consequence of the fact that
$$|P_kf(x)-P_kf(y)| \le [f]_{\rm Lip}\Big\|(x-y)+\frac Tn\big(b(t^n_k,x)-b(t^n_k,y)\big)+\sqrt{\frac Tn}\big(\sigma(t^n_k,x)-\sigma(t^n_k,y)\big)Z\Big\|_1 \le [f]_{\rm Lip}\Big\|(x-y)+\frac Tn\big(b(t^n_k,x)-b(t^n_k,y)\big)+\sqrt{\frac Tn}\big(\sigma(t^n_k,x)-\sigma(t^n_k,y)\big)Z\Big\|_2. \qquad\diamond$$

The key property is the following classical exponential inequality for the Gaussian measure, for
which we refer to [100].
Proposition 7.3 Let f : Rq → R be a Lipschitz continuous function (with respect to the
canonical Euclidean norm) and let Z be an N (0; Iq )-distributed random vector. Then
$$\forall\,\lambda>0,\quad E\,e^{\lambda(f(Z)-E f(Z))} \le e^{\frac{\lambda^2}2[f]^2_{\rm Lip}}. \qquad(7.14)$$
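Inequality (7.14) can be illustrated numerically: for the 1-Lipschitz function $f(z)=|z|$ (the Euclidean norm), the empirical exponential moment should stay below $e^{\lambda^2/2}$. A hedged sketch (the choices of f, λ, q and the sample size are illustrative; the true mean $E f(Z)$ is replaced by its empirical counterpart):

```python
import math
import random

def mgf_centered(f, lam, q, M, rng):
    """Monte Carlo estimate of E exp(lam*(f(Z) - E f(Z))), Z ~ N(0; I_q);
    the true mean E f(Z) is replaced by its empirical counterpart."""
    zs = [[rng.gauss(0.0, 1.0) for _ in range(q)] for _ in range(M)]
    vals = [f(z) for z in zs]
    m = sum(vals) / M
    return sum(math.exp(lam * (v - m)) for v in vals) / M

rng = random.Random(3)
f = lambda z: math.sqrt(sum(x * x for x in z))   # Euclidean norm: 1-Lipschitz
results = []
for lam in (0.5, 1.0):
    est = mgf_centered(f, lam, 2, 20000, rng)
    results.append((est, math.exp(lam * lam / 2.0)))  # bound e^{lam^2 / 2}
print(all(est <= bound for est, bound in results))
```

Both empirical moments sit well below the Gaussian bound, with a comfortable margin relative to the Monte Carlo noise.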

Proof. Step 1 (Preliminaries): We consider a standard Ornstein-Uhlenbeck process starting at
x ∈ Rq , solution to the Stochastic Differential Equation
$$d\,\Xi^x_t = -\frac12\,\Xi^x_t\,dt + dW_t,\quad \Xi^x_0 = x,$$
where W is a standard q-dimensional Brownian motion. This equation has a (unique) explicit
solution given by
$$\Xi^x_t = x\,e^{-\frac t2} + e^{-\frac t2}\int_0^t e^{\frac s2}\,dW_s,\quad t\in\mathbb R_+.$$

This shows that (Ξxt )t≥0 is a Gaussian process with
$$E\,\Xi^x_t = x\,e^{-\frac t2}$$
and, using the Wiener isometry, we derive the covariance matrix $\Sigma_{\Xi^x_t}$ of $\Xi^x_t$, given by
$$\Sigma_{\Xi^x_t} = e^{-t}\,E\Big(\int_0^t e^{\frac s2}\,dW^k_s\int_0^t e^{\frac s2}\,dW^\ell_s\Big)_{1\le k,\ell\le q} = e^{-t}\int_0^t e^s\,ds\;I_q = (1-e^{-t})\,I_q.$$
(The time covariance structure of the process can also be computed easily but is of no use in this
proof.) As a consequence, for every Borel function g : Rq → R with polynomial growth,
$$Q_tg(x) := E\,g(\Xi^x_t) = E\,g\big(x\,e^{-t/2}+\sqrt{1-e^{-t}}\,Z\big)\quad\text{where}\ Z\sim N(0;I_q).$$

Hence, owing to the Lebesgue dominated convergence theorem,
$$\lim_{t\to+\infty}Q_tg(x) = E\,g(Z).$$
Moreover, if g is differentiable with a gradient ∇g with polynomial growth,
$$\nabla\big(Q_tg\big)(x) = e^{-\frac t2}\,E\,\nabla g(\Xi^x_t). \qquad(7.15)$$
Now, if g is twice continuously differentiable with a Hessian D2 g having polynomial growth, it
follows from Itô’s formula and Fubini’s Theorem that
$$Q_tg(x) = g(x) + \int_0^t E\,(Lg)(\Xi^x_s)\,ds$$
where L is the infinitesimal generator of the above equation (known as the Ornstein-Uhlenbeck
operator) which maps g to
$$\xi\longmapsto Lg(\xi) = \frac12\Big(\Delta g(\xi) - \big(\xi\,|\,\nabla g(\xi)\big)\Big),$$
where $\Delta g(\xi) = \sum_{i=1}^q g''_{x_i^2}(\xi)$ denotes the Laplacian of g and $\nabla g(\xi) = \big(g'_{x_1}(\xi),\dots,g'_{x_q}(\xi)\big)^*$ its gradient. Now, if h and g
are both twice differentiable, one has
$$E\big(\nabla g(Z)\,|\,\nabla h(Z)\big) = \sum_{k=1}^q\int_{\mathbb R^q} g'_{x_k}(z)\,h'_{x_k}(z)\,e^{-\frac{|z|^2}2}\,\frac{dz}{(2\pi)^{\frac q2}}.$$
An integration by parts in the k th integral, after noting that
$$\frac{\partial}{\partial z_k}\Big(h'_{x_k}(z)\,e^{-\frac{|z|^2}2}\Big) = e^{-\frac{|z|^2}2}\big(h''_{x_k^2}(z)-z_k\,h'_{x_k}(z)\big),$$
yields
$$E\big(\nabla g(Z)\,|\,\nabla h(Z)\big) = -2\,E\big(g(Z)\,Lh(Z)\big) \qquad(7.16)$$
since h and all its existing partial derivatives have at most polynomial growth. One also derives from
the above identity and the continuity of $s\mapsto E\,(Lg)(\Xi^x_s)$ that
$$\frac d{dt}Q_tg(x) = E\,Lg(\Xi^x_t) = Q_tLg(x).$$
We temporarily admit the classical fact that $LQ_tg(x) = Q_tLg(x)$.


Step 2 (The smooth case): Let f : Rq → R be a twice continuously differentiable function with
bounded existing partial derivatives and such that E f (Z) = 0. Let λ ∈ R+ be fixed. We define the
function Hλ : R+ → R by
$$H_\lambda(t) = E\,e^{\lambda Q_tf(Z)}.$$
It follows from what precedes that $|Q_tf(Z)| \le e^{-\frac t2}\,\|f'\|_{\sup}\,|Z| + |f(0)| + \big(1+e^{-\frac t2}\big)\,\|f'\|_{\sup}$, which
ensures the existence of Hλ since $E\,e^{a|Z|}<+\infty$ for every a ≥ 0. Moreover, $Q_tf(Z)\to f(Z)$ as $t\to0$,
so that $H_\lambda(t)\to H_\lambda(0) = E\,e^{\lambda f(Z)}$ as $t\to0$, and, by the Lebesgue dominated convergence theorem,
$$\lim_{t\to+\infty}H_\lambda(t) = e^{\lambda\,E f(Z)} = e^0 = 1.$$
Furthermore, one shows using the same arguments that Hλ is differentiable on (0, +∞) and
$$H'_\lambda(t) = \lambda\,E\big(e^{\lambda Q_tf(Z)}\,Q_tLf(Z)\big) = \lambda\,E\big(e^{\lambda Q_tf(Z)}\,LQ_tf(Z)\big) = -\frac\lambda2\,E\Big(\nabla_z\big(e^{\lambda Q_tf(z)}\big)_{|z=Z}\,\Big|\,\nabla Q_tf(Z)\Big) = -\frac{\lambda^2}2\,e^{-t}\,E\Big(e^{\lambda Q_tf(Z)}\,\big|Q_t\nabla f(Z)\big|^2\Big),$$
where we used successively (7.16) and (7.15) in the last two equalities.

Consequently, as $\lim_{t\to+\infty}H_\lambda(t)=1$,
$$H_\lambda(t) = 1 - \int_t^{+\infty}H'_\lambda(s)\,ds \le 1 + \frac{\lambda^2}2\,\|\nabla f\|^2_{\sup}\int_t^{+\infty}e^{-s}\,E\,e^{\lambda Q_sf(Z)}\,ds = 1 + K\int_t^{+\infty}e^{-s}\,H_\lambda(s)\,ds\quad\text{where}\ K = \frac{\lambda^2}2\,\|\nabla f\|^2_{\sup}$$
and $\|\nabla f\|_{\sup} = \sup_{\xi\in\mathbb R^q}|\nabla f(\xi)|$ (| . | denotes the Euclidean norm). One derives by induction
(using that Hλ is non-increasing since its derivative is non-positive) that, for every integer m ∈ N∗ ,
$$H_\lambda(t) \le \sum_{k=0}^m K^k\,\frac{e^{-kt}}{k!} + H_\lambda(0)\,K^{m+1}\,\frac{e^{-(m+1)t}}{(m+1)!}.$$
Letting m → +∞ finally yields
$$H_\lambda(t) \le e^K = e^{\frac{\lambda^2}2\|\nabla f\|^2_{\sup}}.$$
One completes this step of the proof by applying the above inequality to the function f − E f (Z).
Step 3 (The Lipschitz continuous case): This step relies on an approximation technique which
is closely related to sensitivity computation for options attached to non-regular payoffs (but in
a situation where the Brownian motion is the pseudo-asset). Let f : Rq → R be a Lipschitz
continuous function with Lipschitz coefficient [f ]Lip and let ζ ∼ N (0; Iq ). One considers, for every
ε > 0,
$$f_\varepsilon(z) = E\,f\big(z+\sqrt\varepsilon\,\zeta\big) = \int_{\mathbb R^q}f(u)\,e^{-\frac{|u-z|^2}{2\varepsilon}}\,\frac{du}{(2\pi\varepsilon)^{\frac q2}}.$$
Since $|f_\varepsilon(z)-f(z)| \le \sqrt\varepsilon\,[f]_{\rm Lip}\,E\,|\zeta|$ for every z ∈ Rq , fε converges uniformly on Rq toward f .
One checks that fε is differentiable with a gradient given by
$$\nabla f_\varepsilon(z) = \frac1\varepsilon\int_{\mathbb R^q}f(u)\,e^{-\frac{|u-z|^2}{2\varepsilon}}\,(u-z)\,\frac{du}{(2\pi\varepsilon)^{\frac q2}} = \frac1{\sqrt\varepsilon}\,E\big(f(z+\sqrt\varepsilon\,\zeta)\,\zeta\big) = \frac1{\sqrt\varepsilon}\,E\Big(\big(f(z+\sqrt\varepsilon\,\zeta)-f(z)\big)\,\zeta\Big)$$
since E ζ = 0. Moreover, as a mixture of the [f ]Lip -Lipschitz functions $z\mapsto f(z+\sqrt\varepsilon\,u)$, fε is itself Lipschitz continuous with
$$\|\nabla f_\varepsilon\|_{\sup} \le [f]_{\rm Lip}.$$
The Hessian of fε is also bounded (by a constant depending on ε, but this has no consequence
on the inequality of interest). One concludes by Fatou’s lemma:
$$E\,e^{\lambda(f(Z)-E f(Z))} \le \liminf_{\varepsilon\to0}E\,e^{\lambda(f_\varepsilon(Z)-E f_\varepsilon(Z))} \le \liminf_{\varepsilon\to0}e^{\frac{\lambda^2}2\|\nabla f_\varepsilon\|^2_{\sup}} \le e^{\frac{\lambda^2}2[f]^2_{\rm Lip}}. \qquad\diamond$$
ε→0 ε→0

We are now in a position to state the main result of this section and its application to the design
of confidence intervals (see also [54]).
7.3. NON ASYMPTOTIC DEVIATION INEQUALITIES FOR THE EULER SCHEME 221

Theorem 7.4 Assume |||σ|||_sup := sup_{(t,x)∈[0,T]×R^d} |||σ(t, x)||| < +∞. Then, for every integer
n₀ ≥ 1, there exists a real constant K(b, σ, T, n₀) ∈ (0, +∞) such that, for every n ≥ n₀ and every
Lipschitz continuous function f : R^d → R,

    ∀ λ > 0,   E e^{λ( f(X̄_n^n) − E f(X̄_n^n) )} ≤ e^{(λ²/2) |||σ|||²_sup [f]²_Lip K(b,σ,T,n)}.    (7.17)

Moreover, lim_n K(b, σ, T, n) = (1/C_{b,σ}) e^{C_{b,σ} T}, and the choice K(b, σ, T, n₀) = (1/C_{b,σ}) e^{(C_{b,σ} + κ_b T/n₀) T} is admissible.

Application. Let us briefly recall that such exponential inequalities yield concentration inequal-
ities in the strong law of large numbers. Let (X̄^{n,ℓ})_{ℓ≥1} be independent copies of the Euler scheme
X̄^n = (X̄_k^n)_{0≤k≤n}. Then, for every ε > 0, Markov's inequality and independence imply, for every
integer n ≥ n₀,

    P( (1/M) Σ_{ℓ=1}^M f(X̄_n^{n,ℓ}) − E f(X̄_n^n) > ε ) ≤ inf_{λ>0} e^{−λMε + M (λ²/2) |||σ|||²_sup [f]²_Lip K(b,σ,T,n₀)}
                                                        = e^{− ε²M / ( 2 |||σ|||²_sup [f]²_Lip K(b,σ,T,n₀) )},

so that, by symmetry,

    P( | (1/M) Σ_{ℓ=1}^M f(X̄_n^{n,ℓ}) − E f(X̄_n^n) | > ε ) ≤ 2 e^{− ε²M / ( 2 |||σ|||²_sup [f]²_Lip K(b,σ,T,n₀) )}.

The crucial fact in the above inequality, beyond its holding for possibly unbounded Lipschitz
continuous functions f, is that the right-hand upper bound does not depend on the time step T/n
of the Euler scheme. Consequently we can design confidence intervals for Monte Carlo simulations
based on the Euler scheme uniformly in the time discretization step T/n.
Doing so, we can design non-asymptotic confidence intervals when computing E f(X_T) by a
Monte Carlo simulation. We know that the bias is due to the discretization scheme and only
depends on the step T/n: usually (see Section 7.6 for the (expansion of the) weak error), one has

    E f(X̄_T^n) = E f(X_T) + c/n^α + o(n^{−α})   with α = 1/2 or 1.

Remark. Under the assumptions we make on b and σ, the Euler scheme converges a.s. and in
every L^p space (provided X₀ lies in L^p) for the sup norm over [0, T] toward the diffusion process
X, so that we retrieve the result for (independent copies of) the diffusion itself, namely

    P( | (1/M) Σ_{ℓ=1}^M f(X_T^ℓ) − E f(X_T) | > ε ) ≤ 2 e^{− ε²M / ( 2 |||σ|||²_sup [f]²_Lip K(b,σ,T) )}

with K(b, σ, T) = e^{C_{b,σ} T}/C_{b,σ} (where C_{b,σ} = 2[b]_Lip + [σ]²_Lip).

Proof of Theorem 7.4. The starting point is the following: let k ∈ {0, . . . , n−1} and let

    E_k(x, z) = x + (T/n) b(t_k^n, x) + √(T/n) σ(t_k^n, x) z,   x ∈ R^d, z ∈ R^q, k = 0, . . . , n−1,

denote the Euler scheme operator. Then it follows from Proposition 7.3 that, for every Lipschitz
continuous function f : R^d → R,

    ∀ λ > 0,   E e^{λ( f(E_k(x, Z)) − P_k f(x) )} ≤ e^{(λ²/2) (T/n) |||σ(t_k^n, x)|||² [f]²_Lip},

since z ↦ f(E_k(x, z)) is Lipschitz continuous with respect to the canonical Euclidean norm, with a
coefficient upper bounded by [f]_Lip √(T/n) times the norm of σ(t_k^n, x). Consequently,

    ∀ λ > 0,   E( e^{λ f(X̄^n_{k+1})} | F^W_{t_k^n} ) ≤ e^{λ P_k f(X̄^n_k) + (λ²/2) (T/n) |||σ|||²_sup [f]²_Lip},

where |||σ|||_sup := sup_{(t,x)∈[0,T]×R^d} |||σ(t, x)|||.


Applying this inequality to P_{k+1,n} f and taking expectation on both sides then yields

    ∀ λ > 0,   E e^{λ P_{k+1,n} f(X̄^n_{k+1})} ≤ E e^{λ P_{k,n} f(X̄^n_k)} · e^{(λ²/2) (T/n) |||σ|||²_sup [P_{k+1,n} f]²_Lip}.

By a straightforward backward induction from k = n−1 down to k = 0, combined with the fact
that P_{0,n} f(X₀) = E( f(X̄_T) | X₀ ),

    ∀ λ > 0,   E e^{λ f(X̄_n^n)} ≤ E e^{λ E( f(X̄_T) | X₀ )} · e^{(λ²/2) |||σ|||²_sup (T/n) Σ_{k=0}^{n−1} [P_{k+1,n} f]²_Lip}.

First note that, by Jensen's inequality,

    E e^{λ E( f(X̄_T) | X₀ )} ≤ E( E( e^{λ f(X̄_T)} | X₀ ) ) = E e^{λ f(X̄_T)}.

On the other hand, it is clear from their very definition that

    [P_{k,n}]_Lip ≤ Π_{ℓ=k}^{n−1} [P_ℓ]_Lip

(with the consistent convention that an empty product is equal to 1). Hence, owing to Lemma 7.3,

    [P_{k,n}]²_Lip ≤ ( 1 + C^{(n)}_{b,σ} T/n )^{n−k}   with   C^{(n)}_{b,σ} = C_{b,σ} + κ_b T/n,

and

    (T/n) Σ_{k=1}^{n} [P_{k,n} f]²_Lip = (T/n) Σ_{k=0}^{n−1} [P_{n−k,n} f]²_Lip
                                       ≤ [f]²_Lip ( (1 + C^{(n)}_{b,σ} T/n)^n − 1 ) / C^{(n)}_{b,σ}
                                       ≤ (1/C^{(n)}_{b,σ}) e^{(C_{b,σ} + κ_b T/n) T} [f]²_Lip
                                       ≤ (1/C_{b,σ}) e^{(C_{b,σ} + κ_b T/n₀) T} [f]²_Lip

whenever n ≥ n0 . ♦

Further comments. It seems hopeless to get a concentration inequality for the supremum of the
Monte Carlo error over all Lipschitz continuous functions (with Lipschitz coefficient bounded
by 1), as emphasized below. Let Lip₁(R^d, R) be the set of Lipschitz continuous functions from R^d
to R with Lipschitz coefficient [f]_Lip ≤ 1.
Let X_ℓ : (Ω, A, P) → R^d, ℓ ≥ 1, be independent copies of an integrable random vector X with
distribution denoted P_X. For every ω ∈ Ω and every M ≥ 1, consider the function defined for every
ξ ∈ R^d by

    f_{ω,M}(ξ) = min_{1≤ℓ≤M} |X_ℓ(ω) − ξ|.

It is clear from its very definition that f_{ω,M} ∈ Lip₁(R^d, R). Then, for every ω ∈ Ω,

    sup_{f∈Lip₁(R^d,R)} | (1/M) Σ_{ℓ=1}^M f(X_ℓ(ω)) − ∫_{R^d} f(ξ) P_X(dξ) |
        ≥ | (1/M) Σ_{ℓ=1}^M f_{ω,M}(X_ℓ(ω)) − ∫_{R^d} f_{ω,M}(ξ) P_X(dξ) |   (the first sum being 0)
        = ∫_{R^d} f_{ω,M}(ξ) P_X(dξ)
        ≥ e_{1,M}(X) := inf_{x₁,...,x_M ∈ R^d} E min_{1≤i≤M} |X − x_i|.

The lower bound in the last line is but the optimal L¹-quantization error of the (distribution of
the) random vector X. It follows from Zador's Theorem (see e.g. [66]) that

    lim inf_M M^{1/d} e_{1,M}(X) ≥ J_{1,d} ||ϕ_X||_{d/(d+1)},

where ϕ_X denotes the density – possibly equal to 0 – of the nonsingular part of the distribution
P_X of X with respect to the Lebesgue measure, J_{1,d} ∈ (0, ∞) is a universal constant and the
pseudo-norm ||ϕ_X||_{d/(d+1)} is finite as soon as X ∈ L^{1+} = ∪_{η>0} L^{1+η}. Furthermore, it is clear that as
soon as the support of P_X is infinite, e_{1,M}(X) > 0. Consequently, for non-singular random vector
distributions, we have

    M^{1/d} sup_{f∈Lip₁(R^d,R)} | (1/M) Σ_{ℓ=1}^M f(X_ℓ) − E f(X) | ≥ inf_{M≥1} M^{1/d} e_{1,M}(X) > 0.

In fact, a careful reading shows that

    sup_{f∈Lip₁(R^d,R)} | (1/M) Σ_{ℓ=1}^M f(X_ℓ(ω)) − E f(X) | = | (1/M) Σ_{ℓ=1}^M f_{ω,M}(X_ℓ(ω)) − ∫_{R^d} f_{ω,M}(ξ) P_X(dξ) |
                                                              = ∫_{R^d} f_{ω,M}(ξ) P_X(dξ).

If we denote e_{1,M}(x₁, . . . , x_M, X) = E min_{1≤i≤M} |X − x_i|, we have

    ∫_{R^d} f_{ω,M}(ξ) P_X(dξ) = e_{1,M}(X₁(ω), . . . , X_M(ω), X).

The right hand side of the above equation corresponds to the self-random quantization of the distribution
P_X. It has been shown in [36, 37] that, under appropriate assumptions on the distribution
P_X (satisfied e.g. by the normal distribution), one has, P-a.s.,

    lim sup_M M^{1/d} e_{1,M}(X₁(ω), . . . , X_M(ω), X) < +∞,

which shows that, for a wide class of random vector distributions, a.s.,

    sup_{f∈Lip₁(R^d,R)} | (1/M) Σ_{ℓ=1}^M f(X_ℓ(ω)) − E f(X) | ≍ C(ω) M^{−1/d}.

This illustrates that the strong law of large numbers/Monte Carlo method is not as “dimension
free” as it is commonly admitted.
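This dimensional effect is easy to observe numerically. The sketch below (uniform distribution on [0,1]^d and brute-force nearest neighbours, both arbitrary illustrative choices) estimates the random quantization error E min_{ℓ≤M} |X − X_ℓ| and checks that multiplying M by 8 divides it by roughly 8^{1/d}, i.e. by almost nothing when d is large.

```python
import numpy as np

rng = np.random.default_rng(1)

def e1M(d, M, n_test=500):
    """Monte Carlo estimate of E min_{l<=M} |X - X_l|, X uniform on [0,1]^d."""
    pts = rng.random((M, d))
    test = rng.random((n_test, d))
    # brute-force nearest-neighbour distances of fresh draws to the sample
    dists = np.linalg.norm(test[:, None, :] - pts[None, :, :], axis=2)
    return dists.min(axis=1).mean()

# Multiplying M by 8 should divide e_{1,M} by about 8^(1/d):
ratios = {d: e1M(d, 200) / e1M(d, 1600) for d in (1, 5)}
print(ratios)
```

For d = 1 the ratio is close to 8, while for d = 5 it is close to 8^{1/5} ≈ 1.5, illustrating the M^{−1/d} behaviour.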

7.4 Pricing path-dependent options (I) (Lookback, Asian, etc)


Let

    D([0, T], R^d) := { ξ : [0, T] → R^d, ξ càdlàg }

(càdlàg is the French acronym for "right continuous with left limits"). The above result shows that
if F : D([0, T], R) → R is a Lipschitz continuous functional for the sup norm, i.e. satisfies

    |F(ξ) − F(ξ')| ≤ [F]_Lip sup_{t∈[0,T]} |ξ(t) − ξ'(t)|,

then, for the stepwise constant Euler scheme,

    | E F((X_t)_{t∈[0,T]}) − E F((X̄^n_{\underline{t}})_{t∈[0,T]}) | ≤ [F]_Lip C_{b,σ,T} √( (1 + log n)/n )

and, for the continuous Euler scheme,

    | E F((X_t)_{t∈[0,T]}) − E F((X̄_t^n)_{t∈[0,T]}) | ≤ C n^{−1/2}.

Typical example in option pricing. Assume that a one dimensional diffusion process X =
(Xt )t∈[0,T ] models the dynamics of a single risky asset (we do not take into account here the con-
sequences on the drift and diffusion coefficient term induced by the preservation of non-negativity
and the martingale property under a risk-neutral probability for the discounted asset. . . ).
 The (partial) Lookback payoffs:

    h_T := ( X_T − λ min_{t∈[0,T]} X_t )_+

where λ = 1 in the regular Lookback case and λ > 1 in the so-called "partial Lookback" case.
 "Vanilla" payoffs on extrema (like Calls and Puts):

    h_T = ϕ( sup_{t∈[0,T]} X_t ),

where ϕ is Lipschitz continuous on R+ .



 Asian payoffs of the form

    h_T = ϕ( (1/(T − T₀)) ∫_{T₀}^T X_s ds ),   0 ≤ T₀ < T,

where ϕ is Lipschitz continuous on R. In fact such Asian payoffs are continuous with respect to
the pathwise L²-norm, i.e. ||f||_{L²_T} := ( ∫_0^T f²(s) ds )^{1/2}.
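As a toy illustration of such path-dependent payoffs, the sketch below prices a partial Lookback option on the stepwise constant Euler scheme of the dynamics dX_t = σ X_t dW_t (zero interest rate); x₀, σ, λ and the discretization parameters are arbitrary placeholder values, not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(2)

x0, sig, T, lam = 100.0, 0.3, 1.0, 1.1   # lam > 1: "partial" Lookback
n, M = 100, 50_000
h = T / n

x = np.full(M, x0)
run_min = np.full(M, x0)        # running minimum observed at the t_k^n only
for _ in range(n):
    x += sig * x * np.sqrt(h) * rng.standard_normal(M)
    run_min = np.minimum(run_min, x)

payoff = np.maximum(x - lam * run_min, 0.0)   # h_T = (X_T - lam * min X)_+
price = payoff.mean()
print(price)
```

The minimum over the grid overestimates the true pathwise minimum, which is one source of the √((1 + log n)/n)-type error for stepwise constant schemes discussed above; Chapter 8 (diffusion bridges) explains how this can be corrected.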

7.5 Milstein scheme (looking for better strong rates. . . )


Throughout this section we consider a homogeneous diffusion just for notational convenience, but
the extension to general non homogeneous diffusions is straightforward (in particular it adds no
further terms to the discretization scheme). The Milstein scheme has been designed (see [114])
to produce a O(1/n)-error (in Lp ) like standard schemes in a deterministic framework. This is a
higher order scheme. In 1-dimension, its expression is simple and it can easily be implemented,
provided b and σ have enough regularity. In higher dimension, some theoretical and simulation
problems make its use more questionable, especially when compared to the results about the weak
error in the Euler scheme that will be developed in the next section.

7.5.1 The 1-dimensional setting


Assume that b and σ are smooth functions (say twice differentiable with bounded existing derivatives).
The starting idea is the following. Let X^x = (X_t^x)_{t∈[0,T]} denote the (homogeneous) diffusion
solution to (7.1) starting at x ∈ R at time 0. For small t, one has

    X_t^x = x + ∫_0^t b(X_s^x) ds + ∫_0^t σ(X_s^x) dW_s.

We wish to analyze the behaviour of the diffusion process Xtx for small values of t in order
to detect the term of order at most 1 i.e. that goes to 0 like t when t → 0 (with respect to the
L2 (P)-norm). Let us inspect the two integral terms successively. First,
    ∫_0^t b(X_s^x) ds = b(x) t + o(t)   as t → 0

since X_t^x → x as t → 0 and b is continuous. Furthermore, as b is Lipschitz continuous and
E sup_{s∈[0,t]} |X_s^x − x|² → 0 as t → 0 (see e.g. Proposition 7.6 further on), this expansion holds for
the L²(P)-norm as well (i.e. o(t) = o_{L²}(t)).
As concerns the stochastic integral term, we have in mind that E(W_{t+Δt} − W_t)² = Δt, so that
one may heuristically consider that in a scheme a Brownian increment dW_t := W_{t+dt} − W_t between
t and t + dt behaves like √dt. Then, by Itô's Lemma,

    σ(X_s^x) = σ(x) + ∫_0^s ( σ'(X_u^x) b(X_u^x) + ½ σ''(X_u^x) σ²(X_u^x) ) du + ∫_0^s σ'(X_u^x) σ(X_u^x) dW_u

so that

    ∫_0^t σ(X_s^x) dW_s = σ(x) W_t + ∫_0^t ∫_0^s σ(X_u^x) σ'(X_u^x) dW_u dW_s + O_{L²}(t^{3/2})    (7.18)
                        = σ(x) W_t + σσ'(x) ∫_0^t W_s dW_s + o_{L²}(t) + O_{L²}(t^{3/2})           (7.19)
                        = σ(x) W_t + ½ σσ'(x) (W_t² − t) + o_{L²}(t).

The O_{L²}(t^{3/2}) in (7.18) comes from the fact that u ↦ σ'(X_u^x) b(X_u^x) + ½ σ''(X_u^x) σ²(X_u^x) is L²(P)-
bounded in the neighbourhood of 0 (note that b and σ have at most linear growth and use
Proposition 7.2). Consequently, using the fundamental isometry property of stochastic integration,
the Fubini–Tonelli Theorem and Proposition 7.2,

    E( ∫_0^t ∫_0^s ( σ'(X_u^x) b(X_u^x) + ½ σ''(X_u^x) σ²(X_u^x) ) du dW_s )²
        = ∫_0^t E( ∫_0^s ( σ'(X_u^x) b(X_u^x) + ½ σ''(X_u^x) σ²(X_u^x) ) du )² ds
        ≤ C(1 + x⁴) ∫_0^t ( ∫_0^s du )² ds = C(1 + x⁴) t³/3.

The o_{L²}(t) in Equation (7.19) also follows from the fundamental isometry property of the stochastic
integral (applied twice) and the Fubini–Tonelli Theorem, which yield

    E( ∫_0^t ∫_0^s ( σσ'(X_u^x) − σσ'(x) ) dW_u dW_s )² = ∫_0^t ∫_0^s ε(u) du ds,

where ε(u) = E( σσ'(X_u^x) − σσ'(x) )² goes to 0 as u → 0 by the Lebesgue dominated convergence
Theorem. Finally note that

    ||W_t² − t||₂² = E W_t⁴ − 2t E W_t² + t² = 3t² − 2t² + t² = 2t²,

so that this term is of order one.

Now we consider the SDE

    dX_t = b(X_t) dt + σ(X_t) dW_t,

starting at a random vector X₀, independent of the standard Brownian motion W. Using the
Markov property of the diffusion, one can reproduce the above reasoning on each time step [t_k^n, t_{k+1}^n)
given the value of the scheme at time t_k^n. This leads to the following simulatable scheme:

    X̃_0^mil = X₀,
    X̃^mil_{t^n_{k+1}} = X̃^mil_{t^n_k} + ( b(X̃^mil_{t^n_k}) − ½ σσ'(X̃^mil_{t^n_k}) ) T/n + σ(X̃^mil_{t^n_k}) √(T/n) U_{k+1}
                       + ½ σσ'(X̃^mil_{t^n_k}) (T/n) U²_{k+1},    (7.20)

where U_k = √(n/T) (W_{t^n_k} − W_{t^n_{k−1}}), k = 1, . . . , n.

By interpolating the above scheme between the discretization times we define the continuous
Milstein scheme (with our standard notations) by

    X̃_0^mil = X₀,
    X̃_t^mil = X̃^mil_{\underline{t}} + ( b(X̃^mil_{\underline{t}}) − ½ σσ'(X̃^mil_{\underline{t}}) )(t − \underline{t}) + σ(X̃^mil_{\underline{t}}) (W_t − W_{\underline{t}})
             + ½ σσ'(X̃^mil_{\underline{t}}) (W_t − W_{\underline{t}})².    (7.21)
The following theorem gives the rate of strong pathwise convergence of the Milstein scheme.

Theorem 7.5 (Strong rate for the Milstein scheme) (See e.g. [83]) (a) Assume that b and
σ are C¹ on R with bounded, α_{b'}- and α_{σ'}-Hölder continuous first derivatives respectively, α_{b'},
α_{σ'} ∈ (0, 1]. Then, for every p ∈ (0, ∞), there exists a real constant C_{b,σ,T,p} > 0 such that, for every
X₀ ∈ L^p(P), independent of the Brownian motion W, one has

    || max_{0≤k≤n} |X_{t^n_k} − X̃^{mil,n}_{t^n_k}| ||_p ≤ || sup_{t∈[0,T]} |X_t − X̃^{mil,n}_t| ||_p ≤ C_{b,σ,T,p} (1 + ||X₀||_p) (T/n)^{(1+α)/2},

where α = min(α_{b'}, α_{σ'}). In particular, if b' and σ' are Lipschitz continuous,

    || max_{0≤k≤n} |X_{t^n_k} − X̃^{mil,n}_{t^n_k}| ||_p ≤ || sup_{t∈[0,T]} |X_t − X̃^{mil,n}_t| ||_p ≤ C_{p,b,σ,T} (1 + ||X₀||_p) T/n.

(b) As concerns the stepwise constant Milstein scheme, one has

    || sup_{t∈[0,T]} |X_t − X̃^{mil,n}_{\underline{t}}| ||_p ≤ C_{b,σ,T} (1 + ||X₀||_p) √( (T/n)(1 + log n) ).

Remarks. • As soon as the derivatives b' and σ' are Hölder continuous, the (continuous) Milstein
scheme converges faster than the Euler scheme.
• The O(1/n)-rate obtained here should be compared to the weak rate investigated in Section 7.6,
which is also O(1/n). Comparing the performances of both approaches for the computation of E f(X_T)
should rely on numerical evidence and (may) depend on the specified diffusion or function.
• The second claim of the theorem shows that the stepwise constant Milstein scheme does not
converge faster than the stepwise constant Euler scheme. To get convinced of this rate without a
rigorous proof, one has just to think of the Brownian motion itself: in that case σ' ≡ 0, so that
the stepwise constant Milstein and Euler schemes coincide and subsequently converge at the same
rate! As a consequence, since it is the only simulatable version of the Milstein scheme, its use
for the approximate computation (by Monte Carlo simulation) of expectations of functionals
E F((X_t)_{t∈[0,T]}) of the diffusion should not provide better results than implementing the standard
stepwise constant Euler scheme as briefly described in Section 7.4.
By contrast, it happens that functionals of the continuous Euler scheme can be simulated: this
is the purpose of Chapter 8 devoted to diffusion bridges.
A detailed proof in the case X0 = x ∈ R and p ∈ [2, ∞) is provided in Section 7.8.8.

 Exercise. Derive from these Lp -rates an a.s. rate of convergence for the Milstein scheme.

 Exercise. We consider a diffusion process (X^x_t)_{t∈[0,T]} with drift b : R → R and diffusion
coefficient σ : R → R, both assumed continuously differentiable with bounded derivatives. Furthermore,
we suppose that σ and σ' never vanish and that b and σ² are non-decreasing.
(a) Show that, if

    ∀ ξ ∈ R,   b(ξ) − ¼ (σ²)'(ξ) ≥ 0   and   σ(ξ)/σ'(ξ) ≤ 2ξ,

then the Milstein scheme with step T/n starting at x ≥ 0 satisfies (with the usual notations)

    ∀ k ∈ {0, . . . , n},   X̃^mil_{t^n_k} ≥ 0.

(b) Show that, under the assumptions of (a), the continuous Milstein scheme also satisfies X̃_t^mil ≥ 0
for every t ∈ [0, T].
(c) Let Y_t = e^{κt} X_t, t ∈ [0, T]. Show that Y is solution to a stochastic differential equation

    dY_t = b̃(t, Y_t) dt + σ̃(t, Y_t) dW_t,

where b̃, σ̃ : [0, T] × R → R are functions depending on b, σ and κ. Deduce, by mimicking the proof
of (a), that the Milstein scheme of Y is non-negative as soon as b/(σ²)' ≥ η > 0 on the real line.

7.5.2 Higher dimensional Milstein scheme


In higher dimension, when the underlying diffusion process (X_t)_{t∈[0,T]} is d-dimensional or the driving
Brownian motion W is q-dimensional – which means that the drift is a function b : R^d → R^d and
the diffusion coefficient σ = [σ_ij] : R^d → M(d × q, R) – the same reasoning as in the 1-dimensional
setting leads to the following (discrete time) scheme:

    X̃_0^mil = X₀,
    X̃^mil_{t^n_{k+1}} = X̃^mil_{t^n_k} + (T/n) b(X̃^mil_{t^n_k}) + σ(X̃^mil_{t^n_k}) ΔW_{t^n_{k+1}}
                       + Σ_{1≤i,j≤q} ∂σ_{.i} σ_{.j}(X̃^mil_{t^n_k}) ∫_{t^n_k}^{t^n_{k+1}} (W_s^i − W^i_{t^n_k}) dW_s^j,    (7.22)

k = 0, . . . , n − 1, where ΔW_{t^n_{k+1}} := W_{t^n_{k+1}} − W_{t^n_k}, σ_{.k}(x) denotes the k-th column of the matrix σ and

    ∀ x = (x¹, . . . , x^d) ∈ R^d,   ∂σ_{.i} σ_{.j}(x) := Σ_{ℓ=1}^d (∂σ_{.i}/∂x_ℓ)(x) σ_{ℓj}(x) ∈ R^d.
Having a look at this formula when d = 1 and q = 2 shows that simulating the Milstein scheme
at times t_k^n in a general setting amounts to being able to simulate the joint distribution of the
triplet

    ( W_t¹, W_t², ∫_0^t W_s¹ dW_s² )   at time t = t_1^n,

that is the joint distribution of two independent Brownian motions and their Lévy area, i.e. the
simulation of

    ( W¹_{t^n_k} − W¹_{t^n_{k−1}}, W²_{t^n_k} − W²_{t^n_{k−1}}, ∫_{t^n_{k−1}}^{t^n_k} (W_s¹ − W¹_{t^n_{k−1}}) dW_s² ),   k = 1, . . . , n.

More generally, one has to simulate the joint (q²-dimensional) distribution of

    ( W_t¹, . . . , W_t^q, ∫_0^t W_s^i dW_s^j, 1 ≤ i, j ≤ q, i ≠ j ).

No convincing (i.e. efficient) method to achieve that has been proposed so far in the literature, at
least when q ≥ 3 or 4.

Proposition 7.4 (a) If the "rectangular" terms commute, i.e. if

    ∀ i ≠ j,   ∂σ_{.i} σ_{.j} = ∂σ_{.j} σ_{.i},

then the Milstein scheme reduces to

    X̃^mil_{t^n_{k+1}} = X̃^mil_{t^n_k} + (T/n) ( b(X̃^mil_{t^n_k}) − ½ Σ_{i=1}^q ∂σ_{.i} σ_{.i}(X̃^mil_{t^n_k}) ) + σ(X̃^mil_{t^n_k}) ΔW_{t^n_{k+1}}
                       + ½ Σ_{1≤i,j≤q} ∂σ_{.i} σ_{.j}(X̃^mil_{t^n_k}) ΔW^i_{t^n_{k+1}} ΔW^j_{t^n_{k+1}},       X̃_0^mil = X₀.    (7.23)

(b) When q = 1 the commutation property is trivially satisfied.

Proof. As a matter of fact, if i ≠ j, an integration by parts shows that

    ∫_{t^n_k}^{t^n_{k+1}} (W_s^i − W^i_{t^n_k}) dW_s^j + ∫_{t^n_k}^{t^n_{k+1}} (W_s^j − W^j_{t^n_k}) dW_s^i = ΔW^i_{t^n_{k+1}} ΔW^j_{t^n_{k+1}}

and

    ∫_{t^n_k}^{t^n_{k+1}} (W_s^i − W^i_{t^n_k}) dW_s^i = ½ ( (ΔW^i_{t^n_{k+1}})² − T/n ).

The announced form for the scheme follows. ♦

In this very special situation, the scheme (7.23) can still be simulated easily in dimension d
since it only involves some Brownian increments.
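The integration-by-parts identity underlying Proposition 7.4 can be checked pathwise by brute force. The sketch below discretizes two independent Brownian motions on a fine grid (the grid size is an arbitrary choice) and verifies that the sum of the two Itô integrals over [0, t] matches W_t¹ W_t² up to the cross-variation term, which is negligible.

```python
import numpy as np

rng = np.random.default_rng(4)

# Check: int_0^t W^1 dW^2 + int_0^t W^2 dW^1 ~ W_t^1 * W_t^2
N, t = 200_000, 1.0
dw = np.sqrt(t / N) * rng.standard_normal((2, N))
w = np.cumsum(dw, axis=1)
w_prev = np.hstack([np.zeros((2, 1)), w[:, :-1]])  # left endpoints (Ito sums)

i12 = np.sum(w_prev[0] * dw[1])   # ~ int W^1 dW^2
i21 = np.sum(w_prev[1] * dw[0])   # ~ int W^2 dW^1
print(i12 + i21, w[0, -1] * w[1, -1])
```

The exact discrete identity is i12 + i21 = W_t¹W_t² − Σ ΔW¹ΔW², and the cross-variation sum has standard deviation t/√N, hence the close agreement.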
The rate of convergence of the Milstein scheme is formally the same in higher dimension as in
1-dimension: Theorem 7.5 remains true with a d-dimensional diffusion driven by a q-dimensional
Brownian motion, provided b : R^d → R^d and σ : R^d → M(d, q, R) are C² with bounded existing
partial derivatives.
Theorem 7.6 (See e.g. [83]) Assume that b and σ are C² on R^d with bounded existing partial
derivatives. Then, for every p ∈ (0, ∞), there exists a real constant C_{p,b,σ,T} > 0 such that, for every
X₀ ∈ L^p(P), independent of the q-dimensional Brownian motion W, the error bound established in
the former Theorem 7.5(a) in the Lipschitz continuous case remains valid.
However, one must keep in mind that this result has nothing to do with the ability to simulate
this scheme.
In a way, one important consequence of the above theorem about the strong rate of convergence
of the Milstein scheme concerns the Euler scheme.

Corollary 7.2 If the drift b is C 2 (Rd , Rd ) with bounded existing partial derivatives and if σ(x) = Σ
is constant, then the Euler scheme and the Milstein scheme coincide at the discretization times tnk .
As a consequence, the strong rate of convergence of the Euler scheme is, in that very specific case,
O(1/n). Namely, for every p ∈ (0, ∞), one has

    || max_{0≤k≤n} |X_{t^n_k} − X̄^n_{t^n_k}| ||_p ≤ C_{b,σ,T} (1 + ||X₀||_p) T/n.

7.6 Weak error for the Euler scheme (I)


In many situations, like the pricing of so-called "vanilla" European options, a discretization scheme
X̄^n = (X̄_t^n)_{t∈[0,T]} of a d-dimensional diffusion process X = (X_t)_{t∈[0,T]} is introduced in order to compute
by a Monte Carlo simulation an approximation E f(X̄_T) of E f(X_T), i.e. only at a fixed (terminal)
time. If one relies on the former strong (or pathwise) rates of convergence, we get, as soon as
f : R^d → R is Lipschitz continuous and b, σ satisfy (H^β_T) for β ≥ ½,

    | E f(X_T) − E f(X̄_T) | ≤ [f]_Lip E |X_T − X̄_T^n| ≤ [f]_Lip E sup_{t∈[0,T]} |X_t − X̄_t^n| = O( 1/√n ).

In fact the first inequality in this chain turns out to be highly non-optimal since it switches from
a weak error (the difference only depending on the respective (marginal) distributions of X_T and
X̄_T^n) to a pathwise approximation X_T − X̄_T^n. To improve asymptotically the other two inequalities
is hopeless since it has been shown (see the remark and comments in Section 7.8.6 for a brief
introduction) that under appropriate assumptions X_T − X̄_T^n satisfies a central limit theorem at rate
√n with non-zero asymptotic variance. In fact one even has a functional form of this central limit
theorem for the whole process (X_t − X̄_t^n)_{t∈[0,T]} (see [88, 74]). As a consequence, a rate faster than
n^{−1/2} in an L¹ sense would not be consistent with this central limit behaviour.
Furthermore, numerical experiments confirm that the weak rate of convergence between the
above two expectations is usually much faster than n^{−1/2}. This fact has been known for long and
has been extensively investigated in the literature, starting with the two seminal papers [152] by
Talay–Tubaro and [9] by Bally–Talay, leading to an expansion of the time discretization error
at an arbitrary accuracy when b and σ are smooth enough as functions. Two main settings have
drawn attention: the case where f is a smooth function and the case where the diffusion itself is
"regularizing", i.e. propagates the regularizing effect of the driving Brownian motion (1) thanks to
a non-degeneracy assumption on the diffusion coefficient σ, typically uniform ellipticity of σ
(see below) or a weaker assumption of Hörmander type.
The same kind of question has been investigated for specific classes of functionals F of the
whole path of the diffusion X with some applications to path-dependent option pricing. These
(although partial) results show that the resulting weak rate is the same as the strong rate that can
be obtained with the Milstein scheme for this type of functionals (when Lipschitz continuous with
respect to the sup-norm over [0, T ]).
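The gap between the two notions of error can be seen on a toy case. Below, for dX_t = σ X_t dW_t and f(x) = x, the weak error E f(X̄_T^n) − E f(X_T) is exactly 0 (the diffusion and its Euler scheme are both martingales started at x₀), while the strong error E|X̄_T^n − X_T| is not; all parameters are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(5)

x0, sig, T, n, M = 1.0, 0.5, 1.0, 20, 100_000
h = T / n

eul = np.full(M, x0)
w = np.zeros(M)
for _ in range(n):
    dw = np.sqrt(h) * rng.standard_normal(M)
    eul += sig * eul * dw
    w += dw

exact = x0 * np.exp(sig * w - 0.5 * sig**2 * T)   # coupled to the same W
weak_est = (eul - exact).mean()          # ~ 0 up to Monte Carlo noise
strong_est = np.abs(eul - exact).mean()  # ~ C n^(-1/2) > 0
print(weak_est, strong_est)
```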
1
The regularizing property of the Brownian motion should be understood as follows: if f is a Borel bounded
function on Rd , then fσ (x) := E(f (x + σW )) is a C ∞ function for every σ > 0 and converges towards f as σ → 0
in every Lp space, p > 0. This result is but a classical convolution result with a Gaussian kernel rewritten in a
probabilistic form.

As a second step, we will show how the so-called Richardson–Romberg extrapolation method
provides a systematic procedure to take optimal advantage of these weak rates, including their
higher order form.
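The mechanism can be previewed on a toy case where everything is in closed form: for dX_t = σ X_t dW_t and f(x) = x², one Euler step multiplies the second moment by (1 + σ²T/n), so both the bias E f(X̄_T^n) − E f(X_T) and its two-step Richardson–Romberg combination 2 E f(X̄_T^{2n}) − E f(X̄_T^n) can be evaluated without simulation (the parameters below are arbitrary).

```python
import numpy as np

x0, sig, T, n = 1.0, 0.4, 1.0, 10
exact = x0**2 * np.exp(sig**2 * T)          # E X_T^2 for the exact diffusion

def euler_moment2(m):
    # each Euler step multiplies E X_bar^2 by (1 + sig^2 * h), h = T/m
    return x0**2 * (1.0 + sig**2 * T / m) ** m

bias_n = euler_moment2(n) - exact
rr = 2.0 * euler_moment2(2 * n) - euler_moment2(n)   # cancels the c/n term
print(bias_n, rr - exact)   # the residual bias is far smaller
```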

7.6.1 Main results for E f (XT ): the Talay-Tubaro and Bally-Talay Theorems
We adopt the notations of the former Section 7.1, except that we consider, mostly for convenience,
a homogeneous SDE starting at x ∈ R^d,

dXtx = b(Xtx )dt + σ(Xtx )dWt , X0x = x.

The notations (X_t^x)_{t∈[0,T]} and (X̄_t)_{t∈[0,T]} denote the diffusion and the Euler scheme of the diffusion
starting at x at time 0 (an exponent n will sometimes be added to emphasize that the step of the
scheme is T/n).
Our first result on the weak error is the one which can be obtained under the least stringent
assumptions on b and σ.

Theorem 7.7 (see [152]) Assume b and σ are 4 times continuously differentiable on R^d with
bounded existing partial derivatives (this implies that b and σ are Lipschitz continuous). Assume f : R^d → R
is 4 times differentiable with polynomial growth, as well as its existing partial derivatives. Then, for
every x ∈ R^d,

    E f(X_T^x) − E f(X̄_T^{n,x}) = O(1/n)   as n → +∞.    (7.24)

Sketch of proof. Assume d = 1 for notational convenience. We further assume b ≡ 0,
σ bounded and f with bounded existing derivatives, for simplicity. The diffusion (X_t^x)_{t≥0, x∈R} is a
homogeneous Markov process with transition semi-group (P_t)_{t≥0} given (see e.g. [141, 78]) by

    P_t g(x) := E g(X_t^x).

On the other hand, the Euler scheme with step T/n starting at x ∈ R, denoted (X̄^x_{t^n_k})_{0≤k≤n}, is a
discrete time homogeneous Markov chain (indexed by k) with transition

    P̄ g(x) = E g( x + σ(x) √(T/n) Z ),   Z =_d N(0; 1).

To be more precise, this means for the diffusion process that, for any Borel bounded or non-negative
test function g,

    ∀ s, t ≥ 0,   P_t g(x) = E( g(X_{s+t}) | X_s = x ) = E g(X_t^x)

and, for its Euler scheme (still with t_k^n = kT/n),

    P̄ g(x) = E( g(X̄^x_{t^n_{k+1}}) | X̄^x_{t^n_k} = x ) = E g(X̄^x_{t^n_1}),   k = 0, . . . , n − 1.

Now

    E f(X_T^x) = P_T f(x) = P_{T/n}^n(f)(x)

and

    E f(X̄_T^x) = P̄^n(f)(x),

so that

    E f(X_T^x) − E f(X̄_T^x) = Σ_{k=1}^n ( P_{kT/n}(P̄^{n−k} f)(x) − P_{(k−1)T/n}(P̄^{n−(k−1)} f)(x) )
                             = Σ_{k=1}^n P_{(k−1)T/n}( (P_{T/n} − P̄)(P̄^{n−k} f) )(x).    (7.25)

This "domino" sum suggests that we have two tasks to complete:
– the first one is to estimate precisely the asymptotic behaviour of

    P_{T/n} g(x) − P̄ g(x)

with respect to the step T/n and the first (four) derivatives of the function g;
– the second one is to control the (four first) derivatives of the functions P̄^ℓ g for the sup norm
when g is regular, uniformly in ℓ ∈ {1, . . . , n} and n ≥ 1, in order to propagate the above local
error bound.
Let us deal with the first task. To be precise, let g : R → R be a four times differentiable
function with bounded existing derivatives. First, Itô's formula yields

    P_t g(x) := E g(X_t^x) = g(x) + E ∫_0^t (g'σ)(X_s^x) dW_s + ½ E ∫_0^t g''(X_s^x) σ²(X_s^x) ds,

where the first integral term vanishes: since g'σ is bounded, the stochastic integral is a true
(centered) martingale. A Taylor expansion of g( x + σ(x)√(T/n) Z ) at x yields, for the transition
of the Euler scheme (after taking expectation),

    P̄ g(x) = E g(X̄^x_{t^n_1})
            = g(x) + g'(x) σ(x) √(T/n) E Z + ½ (g''σ²)(x) (T/n) E Z²
              + g⁽³⁾(x) (σ³(x)/3!) (T/n)^{3/2} E Z³ + (σ⁴(x)/4!) E( g⁽⁴⁾(ξ) ( √(T/n) Z )⁴ )
            = g(x) + (T/(2n)) (g''σ²)(x) + (σ⁴(x)/4!) E( g⁽⁴⁾(ξ) ( √(T/n) Z )⁴ )
            = g(x) + (T/(2n)) (g''σ²)(x) + ( σ⁴(x) T²/(4! n²) ) c_n(g),

where |c_n(g)| ≤ 3 ||g⁽⁴⁾||_∞. This follows from the well-known facts that E Z = E Z³ = 0, E Z² = 1
and E Z⁴ = 3. Consequently,
    P_{T/n} g(x) − P̄ g(x) = ½ ∫_0^{T/n} E( (g''σ²)(X_s^x) − (g''σ²)(x) ) ds + ( σ⁴(x) T²/(4! n²) ) c_n(g).    (7.26)

Applying again Itô's formula to the C² function γ := g''σ² yields

    E( (g''σ²)(X_s^x) − (g''σ²)(x) ) = E( ½ ∫_0^s γ''(X_u^x) σ²(X_u^x) du ),

so that

    ∀ s ≥ 0,   sup_{x∈R} | E( (g''σ²)(X_s^x) − (g''σ²)(x) ) | ≤ (s/2) ||γ''σ²||_sup.
Elementary computations show that

    ||γ''||_∞ ≤ C_σ max_{k=2,3,4} ||g⁽ᵏ⁾||_∞,

where C_σ depends on ||σ⁽ᵏ⁾||_∞, k = 0, 1, 2, but not on g (with the standard convention σ⁽⁰⁾ = σ).
Consequently, we derive from (7.26) that

    | P_{T/n}(g)(x) − P̄(g)(x) | ≤ C'_{σ,T} max_{k=2,3,4} ||g⁽ᵏ⁾||_∞ (T/n)².

The fact that the first derivative g' is not involved in these bounds is simply a consequence
of our artificial assumption that b ≡ 0.
Now we pass to the second task. In order to plug this estimate into (7.25), we need to control
the first four derivatives of P̄^ℓ f, ℓ = 1, . . . , n, uniformly with respect to ℓ and n. In fact we do
not directly need to control the first derivative since b ≡ 0, but we will do it as a first example, to
illustrate the method in a simpler case.
Let us consider again the generic function g and its four bounded derivatives:

    (P̄ g)'(x) = E( g'( x + σ(x)√(T/n) Z ) ( 1 + σ'(x)√(T/n) Z ) ),

so that

    |(P̄ g)'(x)| ≤ ||g'||_∞ E | 1 + σ'(x)√(T/n) Z |
                ≤ ||g'||_∞ || 1 + σ'(x)√(T/n) Z ||₂
                = ||g'||_∞ ( E( 1 + 2σ'(x)√(T/n) Z + σ'(x)² (T/n) Z² ) )^{1/2}
                = ||g'||_∞ ( 1 + σ'(x)² T/n )^{1/2}
                ≤ ||g'||_∞ ( 1 + σ'(x)² T/(2n) )


since √(1 + u) ≤ 1 + u/2, u ≥ 0. Hence, we derive by induction that, for every n ≥ 1 and every
ℓ ∈ {1, . . . , n},

    ∀ x ∈ R,   |(P̄^ℓ f)'(x)| ≤ ||f'||_∞ ( 1 + σ'(x)² T/(2n) )^ℓ ≤ ||f'||_∞ e^{σ'(x)² T/2}

(where we used that (1 + u)^ℓ ≤ e^{ℓu}, u ≥ 0). This yields

    ||(P̄^ℓ f)'||_sup ≤ ||f'||_∞ e^{||σ'||²_∞ T/2}.
Let us now deal with the second derivative:

    (P̄ g)''(x) = (d/dx) E( g'( x + σ(x)√(T/n) Z )( 1 + σ'(x)√(T/n) Z ) )
               = E( g''( x + σ(x)√(T/n) Z )( 1 + σ'(x)√(T/n) Z )² ) + E( g'( x + σ(x)√(T/n) Z ) σ''(x)√(T/n) Z ).

Then

    | E( g''( x + σ(x)√(T/n) Z )( 1 + σ'(x)√(T/n) Z )² ) | ≤ ||g''||_∞ ( 1 + σ'(x)² T/n )

and, using that g'( x + σ(x)√(T/n) Z ) = g'(x) + g''(ζ) σ(x)√(T/n) Z owing to the fundamental formula
of Calculus, we get

    | E( g'( x + σ(x)√(T/n) Z ) σ''(x)√(T/n) Z ) | ≤ ||g''||_∞ ||σσ''||_∞ (T/n) E(Z²),

so that

    ∀ x ∈ R,   |(P̄ g)''(x)| ≤ ||g''||_∞ ( 1 + ( 2||σσ''||_∞ + ||(σ')²||_∞ ) T/n ),

which implies the boundedness of (P̄^ℓ f)'', ℓ = 0, . . . , n − 1, n ≥ 1.
The same reasoning yields the boundedness of all the derivatives (P̄^ℓ f)⁽ⁱ⁾, i = 1, 2, 3, 4, ℓ = 1, . . . , n,
n ≥ 1.
Now we can combine our local error bound with the control of the derivatives. Plugging these
estimates into each term of (7.25) finally yields

    | E f(X_T^x) − E f(X̄_T^x) | ≤ C'_{σ,T} max_{1≤ℓ≤n, i=1,...,4} ||(P̄^ℓ f)⁽ⁱ⁾||_sup Σ_{k=1}^n (T/n)² ≤ C_{σ,T,f} T/n,

which completes the proof. ♦

 Exercises. 1. Complete the above proof by inspecting the case of higher order derivatives
(k = 3, 4).
2. Extend the proof to a (bounded) non zero drift b.

If one assumes more regularity on the coefficients or some uniform ellipticity on the diffusion
coefficient σ it is possible to obtain an expansion of the error at any order.

Theorem 7.8 (a) Talay–Tubaro's Theorem (see [152]). Assume b and σ are infinitely differentiable
with bounded partial derivatives. Assume f : R^d → R is infinitely differentiable with partial
derivatives having polynomial growth. Then, for every integer R ∈ N*,

    (E_{R+1}) ≡   E f(X_T^x) − E f(X̄_T^{x,n}) = Σ_{k=1}^R c_k/n^k + O(n^{−(R+1)})   as n → +∞,    (7.27)

where the real valued coefficients c_k depend on f, T, b and σ.


(b) Bally–Talay's Theorem (see [9]). If b and σ are bounded, infinitely differentiable with bounded
partial derivatives, and if σ is uniformly elliptic, i.e.

    ∀ x ∈ R^d,   σσ*(x) ≥ ε₀ I_d   for some ε₀ > 0,

then the conclusion of (a) holds true for any bounded Borel function f.

One method of proof for (a) is to rely on the PDE method, i.e. considering the solution of the
parabolic equation

    ( ∂/∂t + L )(u)(t, x) = 0,   u(T, ·) = f,

where

    (Lg)(x) = g'(x) b(x) + ½ g''(x) σ²(x)

denotes the infinitesimal generator of the diffusion. It follows from the Feynman–Kac formula that
(under some appropriate regularity assumptions)

    u(0, x) = E f(X_T^x).

Formally (in one dimension), one assumes that u is regular enough to apply Itô's formula, so that

    f(X_T^x) = u(T, X_T^x) = u(0, x) + ∫_0^T ( ∂/∂t + L )(u)(t, X_t^x) dt + ∫_0^T ∂_x u(t, X_t^x) σ(X_t^x) dW_t.

Taking expectation (and assuming that the local martingale is a true martingale. . . ) and using
that u is a solution to the above parabolic PDE yields the announced result.
Then, one uses a domino strategy based on the Euler scheme as follows

E f (XT ) − E f (X̄Tx ) = E(u(0, x) − u(T, X̄Tx ))


n
X  
= − E u(tnk , X̄txnk ) − u(tnk−1 , X̄tnk−1 ) .
k=1

The core of the proof consists in applying Itô’s formula (to u, b and σ) to show that

  E φ(tnk , Xtxnk )
E u(tnk , X̄txnk ) − u(tnk−1 , X̄tnk−1 ) = + o(n−2 ).
n2

for some continuous function φ. Then, one derives (after further computations) that

E f(X_T^x) − E f(X̄_T^x) = −(1/n) ∫_0^T E φ(t, X_t^x) dt + o(n^{−1}).

This approach will be developed in Section 7.8.9.

Remark. The last important information about the weak error is that the weak error induced by
the Milstein scheme has exactly the same order as that of the Euler scheme, i.e. O(1/n). So the
Milstein scheme seems of little interest as long as one wishes to compute E f(X_T) in a reasonable
framework like the ones described in the above theorem since, even when it can be implemented
without restriction, its complexity is higher than that of the standard Euler scheme. Furthermore,
the next paragraph about Richardson-Romberg extrapolation will show that it is possible to take
advantage of a higher order time discretization error expansion, which makes the resulting estimators
dramatically faster than the Milstein scheme.

7.7 Standard and multistep Richardson-Romberg extrapolation


7.7.1 Richardson-Romberg extrapolation with consistent increments
 Bias-variance decomposition of the quadratic error in a Monte Carlo simulation. Let V be a
vector space of continuous functions with linear growth satisfying (E₂) (the case of non continuous
functions is investigated in [123]). Let f ∈ V. For notational convenience, in view of what follows,
we set W^(1) = W and X^(1) := X. A regular Monte Carlo simulation based on M independent
copies (X̄_T^(1))_m, m = 1, . . . , M, of the Euler scheme X̄_T^(1) with step T/n induces the following
global (squared) quadratic error

‖ E(f(X_T)) − (1/M) Σ_{m=1}^M f((X̄_T^(1))_m) ‖_2² = ( E f(X_T) − E f(X̄_T^(1)) )² + ‖ E(f(X̄_T^(1))) − (1/M) Σ_{m=1}^M f((X̄_T^(1))_m) ‖_2²

                                                 = (c₁/n)² + Var(f(X̄_T^(1)))/M + O(n^{−3}).    (7.28)

The above formula is but the bias-variance decomposition of the approximation error of the
Monte Carlo estimator. This quadratic error bound (7.28) does not take full advantage of the
above expansion (E₂).

 Richardson-Romberg extrapolation. To take advantage of the expansion, we will perform a


Richardson-Romberg extrapolation. In that framework (originally introduced in [152]), one con-
siders the strong solution X (2) of a “copy” of Equation (7.1), driven by a second Brownian motion
W (2) defined on the same probability space (Ω, A, P). One may always choose such a Brownian
motion by enlarging Ω if necessary.
Then we consider the Euler scheme with the twice smaller step T/(2n), denoted X̄^(2), of the
diffusion process X^(2).
We assume from now on that (E3 ) (as defined in (7.27)) holds for f to get more precise estimates
but the principle would work with a function simply satisfying (E2 ). Then combining the two time
discretization error expansions related to X̄ (1) and X̄ (2) , we get

E(f(X_T)) = E( 2f(X̄_T^(2)) − f(X̄_T^(1)) ) − (1/2) c₂/n² + O(n^{−3}).

Then, the new global (squared) quadratic error becomes

‖ E(f(X_T)) − (1/M) Σ_{m=1}^M ( 2f((X̄_T^(2))_m) − f((X̄_T^(1))_m) ) ‖_2² = ( c₂/(2n²) )² + Var( 2f(X̄_T^(2)) − f(X̄_T^(1)) )/M + O(n^{−5}).
                                                                                                     (7.29)
The structure of this quadratic error suggests the following natural question:
Is it possible to reduce the (asymptotic) time discretization error without increasing the Monte
Carlo error (at least asymptotically in n. . . )?
Or, put differently, to what extent is it possible to control the variance term Var(2f (X̄T(2) ) −
f (X̄T(1) ))?

– Lazy simulation. If one adopts a somewhat "lazy" approach by using the pseudo-random
number generator purely sequentially to simulate the two Euler schemes, this amounts, from a
theoretical point of view, to considering independent Gaussian white noises (U_k^(1))_k and (U_k^(2))_k to
simulate the Brownian increments in both schemes, or equivalently to assuming that W^(1) and W^(2)
are two independent Brownian motions. Then

Var( 2f(X̄_T^(2)) − f(X̄_T^(1)) ) = 4 Var(f(X̄_T^(2))) + Var(f(X̄_T^(1))) −→ 5 Var(f(X_T))   as n → +∞.

In this approach the cost of gaining one order on the time discretization error (switching from n^{−1} to n^{−2})
is the increase of the variance by a factor 5 and of the complexity by a factor (approximately) 3.

– Consistent simulation (of the Brownian increments). If W^(2) = W^(1) (= W) then

Var( 2f(X̄_T^(2)) − f(X̄_T^(1)) ) −→ Var( 2f(X_T) − f(X_T) ) = Var(f(X_T))   as n → +∞,

since the Euler schemes X̄^(i), i = 1, 2, converge in L^p(P) to X.

In this second approach, the same gain in terms of time discretization error has no cost in terms
of variance (at least for n large enough). Of course, the complexity remains approximately 3 times
higher. Finally, the pseudo-random number generator is requested less, by a factor 2/3.
In fact, it is shown in [123] that this choice W^(2) = W^(1) = W, leading to consistent Brownian
increments for the two schemes, is asymptotically optimal among all possible choices of Brownian
motions W^(1) and W^(2). This result can be extended to Borel functions f when the diffusion
is uniformly elliptic (and b, σ bounded, infinitely differentiable with bounded partial derivatives,
see [123]).
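The two variance regimes above are easy to observe empirically. The following sketch (assuming NumPy; the toy dynamics dX_t = σ X_t dW_t with f(x) = x, the parameter values and the function name are illustrative choices, not taken from the text) estimates Var(2f(X̄_T^(2)) − f(X̄_T^(1))) with independent ("lazy") and with consistent Brownian increments.

```python
import numpy as np

def euler_pair(M, n, sigma, T, rng, consistent):
    """Terminal values of the Euler schemes with steps T/n and T/(2n)
    for dX_t = sigma X_t dW_t, X_0 = 1, with lazy or consistent noises."""
    h = T / (2 * n)
    Xf = np.ones(M)  # fine scheme, step T/(2n)
    Xc = np.ones(M)  # coarse scheme, step T/n
    for _ in range(n):
        G1, G2 = rng.standard_normal(M), rng.standard_normal(M)
        Xf = Xf * (1.0 + sigma * np.sqrt(h) * G1)
        Xf = Xf * (1.0 + sigma * np.sqrt(h) * G2)
        if consistent:
            Gc = (G1 + G2) / np.sqrt(2.0)   # aggregated fine noise
        else:
            Gc = rng.standard_normal(M)     # independent ("lazy") noise
        Xc = Xc * (1.0 + sigma * np.sqrt(2 * h) * Gc)
    return Xf, Xc

rng = np.random.default_rng(0)
M, n, sigma, T = 40_000, 20, 0.5, 1.0
Xf_l, Xc_l = euler_pair(M, n, sigma, T, rng, consistent=False)
Xf_c, Xc_c = euler_pair(M, n, sigma, T, rng, consistent=True)
var_lazy = np.var(2 * Xf_l - Xc_l)   # close to 5 Var(X_T)
var_cons = np.var(2 * Xf_c - Xc_c)   # close to Var(X_T)
```

With these (illustrative) parameters the empirical ratio var_lazy/var_cons is close to the theoretical factor 5.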
 Practitioner's corner. From a practical viewpoint, one first simulates the Euler scheme with step
T/(2n) using a Gaussian white noise (U_k^(2))_{k≥1}; then one simulates the Gaussian white noise
(U_k^(1))_{k≥1} of the Euler scheme with step T/n by setting

U_k^(1) = ( U_{2k−1}^(2) + U_{2k}^(2) ) / √2,   k ≥ 1.
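A minimal sketch of this aggregation rule (assuming NumPy; the function name is hypothetical): simulate the fine noise (U_k^(2)) first, then average consecutive pairs, so that both schemes read the same Brownian path.

```python
import numpy as np

def coarse_from_fine(U2):
    """Build the white noise of the Euler scheme with step T/n from the
    white noise of the scheme with step T/(2n):
    U1_k = (U2_{2k-1} + U2_{2k}) / sqrt(2)."""
    return (U2[0::2] + U2[1::2]) / np.sqrt(2.0)

rng = np.random.default_rng(0)
n, T = 8, 1.0
U2 = rng.standard_normal(2 * n)   # fine Gaussian white noise
U1 = coarse_from_fine(U2)         # coarse noise, consistent with U2
```

By construction, √(T/n) U1_k equals the sum √(T/(2n)) (U2_{2k−1} + U2_{2k}) of the two corresponding fine Brownian increments, i.e. both Euler schemes are driven by the same Brownian motion.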

 Exercise. Show that if X, Y ∈ L²(Ω, A, P) have the same distribution, then, for every
α ∈ [1, +∞),

Var(αX + (1 − α)Y) ≥ Var(X).

7.7.2 Toward a multistep Richardson-Romberg extrapolation


In [123], a more general approach to multistep Richardson-Romberg extrapolation with consistent
Brownian increments is developed. Given the state of the technology and the needs, it seems
interesting up to R = 4. We sketch here the case R = 3, which also works for path-dependent
options (see below). Assume (E₄) holds. Set

α₁ = 1/2,   α₂ = −4,   α₃ = 9/2.

Then, easy computations show that

E( α₁ f(X̄_T^(1)) + α₂ f(X̄_T^(2)) + α₃ f(X̄_T^(3)) ) = E f(X_T) − c₃ ( Σ_{1≤i≤3} α_i/i³ ) / n³ + O(n^{−4}),

where X̄^(r) denotes the Euler scheme with step T/(rn), r = 1, 2, 3, with respect to the same Brownian
motion W. Once again this choice induces a control of the variance of the estimator. However, note
that this choice is theoretically no longer optimal as it is for the regular Romberg extrapolation.
But it is natural, and better choices are not easy to specify a priori (keeping in mind that they will
depend on the function f).
In practice, the three Gaussian white noises can be simulated using the following consistency
rules, which give the most efficient way to simulate the three white noises (U_k^(r))_{1≤k≤r}, r = 1, 2, 3,
on one coarse time step T/n. Let Z₁, Z₂, Z₃, Z₄ be four i.i.d. copies of N(0; I_q). Set

U₁^(3) = Z₁,   U₂^(3) = (Z₂ + Z₃)/√2,   U₃^(3) = Z₄,

U₁^(2) = (√2 Z₁ + Z₂)/√3,   U₂^(2) = (Z₃ + √2 Z₄)/√3,

U₁^(1) = ( U₁^(2) + U₂^(2) )/√2.
A general formula for the weights α_r^(R), r = 1, . . . , R, as well as the consistent increment tables,
are provided in [123].
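The refinement table above can be implemented and checked directly. The sketch below (assuming NumPy and scalar noise, q = 1; the function name is hypothetical) builds the three white noises from Z₁, . . . , Z₄ and verifies that they all encode the same Brownian increment over the coarse step T/n.

```python
import numpy as np

def consistent_noises_R3(rng):
    """Build the white noises U^(r) of the Euler schemes with steps T/(rn),
    r = 1, 2, 3, on one coarse step, from four i.i.d. N(0,1) variables,
    following the consistency table above. Each U entry is N(0,1)."""
    Z1, Z2, Z3, Z4 = rng.standard_normal(4)
    U3 = np.array([Z1, (Z2 + Z3) / np.sqrt(2.0), Z4])
    U2 = np.array([(np.sqrt(2.0) * Z1 + Z2) / np.sqrt(3.0),
                   (Z3 + np.sqrt(2.0) * Z4) / np.sqrt(3.0)])
    U1 = np.array([(U2[0] + U2[1]) / np.sqrt(2.0)])
    return U1, U2, U3

rng = np.random.default_rng(1)
T, n = 1.0, 1
U1, U2, U3 = consistent_noises_R3(rng)
# the same Brownian increment W_{T/n} - W_0, read from each scheme's noise:
W1 = np.sqrt(T / n) * U1.sum()
W2 = np.sqrt(T / (2 * n)) * U2.sum()
W3 = np.sqrt(T / (3 * n)) * U3.sum()
```

The three quantities W1, W2, W3 coincide, which is exactly the consistency requirement.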
Guided by some complexity considerations, one shows that the parameters of this multistep
Richardson-Romberg extrapolation should satisfy some constraints. Typically, if M denotes the
size of the Monte Carlo simulation and n the discretization parameter, they should be chosen so that

M ∝ n^{2R}.

For details we refer to [123]. The practical limitation of these results about Richardson-Romberg
extrapolation is that the control of the variance is only asymptotic (as n → +∞), whereas the
method is usually implemented for small values of n. However, it is efficient up to R = 4 for Monte
Carlo simulations of sizes M = 10⁶ to M = 10⁸.

 Exercise: We consider the simplest option pricing model, the (risk-neutral) Black-Scholes
dynamics, but with an unusually high volatility. (We are aware that this model, used to price Call
options, does not fulfill the theoretical assumptions made above.) To be precise,

dX_t = X_t (r dt + σ dW_t),

with the following values for the parameters:

X₀ = 100,  K = 100,  r = 0.15,  σ = 1.0,  T = 1.

Note that a volatility σ = 100% per year is equivalent to a 4 year maturity with volatility 50% (or
16 years with volatility 25%). The Black-Scholes reference premium is C₀^BS = 42.96.

• We consider the Euler scheme with step T/n of this equation,

X̄_{t_{k+1}} = X̄_{t_k} ( 1 + r T/n + σ √(T/n) U_{k+1} ),   X̄₀ = X₀,

where t_k = kT/n, k = 0, . . . , n. We want to price a vanilla Call option, i.e. to compute

C₀ = e^{−rT} E( (X_T − K)₊ )

using a Monte Carlo simulation with M sample paths, M = 10⁴, M = 10⁶, etc.

• Test now the standard Richardson-Romberg extrapolation (R = 2) based on Euler schemes


with steps T /n and T /(2n), n = 2, 4, 6, 8, 10, respectively with
– independent Brownian increments,
– consistent Brownian increments.
Compute an estimator of the variance of the estimator.
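A possible starting point for this exercise is sketched below (assuming NumPy; function names, seed and simulation sizes are illustrative choices): the closed-form Black-Scholes premium as reference value, a crude Euler Monte Carlo estimator, and the Richardson-Romberg estimator 2f(X̄^(2)) − f(X̄^(1)) with consistent Brownian increments.

```python
import numpy as np
from math import erf, exp, log, sqrt

def bs_call(x0, K, r, sigma, T):
    """Closed-form Black-Scholes Call premium (the reference value C0^BS)."""
    N = lambda u: 0.5 * (1.0 + erf(u / sqrt(2.0)))
    d1 = (log(x0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    return x0 * N(d1) - K * exp(-r * T) * N(d1 - sigma * sqrt(T))

def euler_call(n, M, rng, x0=100.0, K=100.0, r=0.15, sigma=1.0, T=1.0):
    """Crude Monte Carlo Call price based on the Euler scheme with step T/n."""
    h = T / n
    X = np.full(M, x0)
    for _ in range(n):
        X = X * (1.0 + r * h + sigma * np.sqrt(h) * rng.standard_normal(M))
    return exp(-r * T) * np.mean(np.maximum(X - K, 0.0))

def rr_call(n, M, rng, x0=100.0, K=100.0, r=0.15, sigma=1.0, T=1.0):
    """Richardson-Romberg estimator 2 f(Xbar^(2)) - f(Xbar^(1)) with
    consistent Brownian increments (coarse noise = aggregated fine noise)."""
    h2 = T / (2 * n)
    Xf = np.full(M, x0)   # fine scheme, step T/(2n)
    Xc = np.full(M, x0)   # coarse scheme, step T/n
    for _ in range(n):
        G1 = rng.standard_normal(M)
        G2 = rng.standard_normal(M)
        Xf = Xf * (1.0 + r * h2 + sigma * np.sqrt(h2) * G1)
        Xf = Xf * (1.0 + r * h2 + sigma * np.sqrt(h2) * G2)
        Xc = Xc * (1.0 + 2 * r * h2 + sigma * np.sqrt(h2) * (G1 + G2))
    payoff = 2.0 * np.maximum(Xf - K, 0.0) - np.maximum(Xc - K, 0.0)
    return exp(-r * T) * np.mean(payoff)

rng = np.random.default_rng(42)
ref = bs_call(100.0, 100.0, 0.15, 1.0, 1.0)   # about 42.96
est = euler_call(n=100, M=200_000, rng=rng)
rr = rr_call(n=10, M=200_000, rng=rng)
```

The lazy variant of rr_call is obtained by replacing G1 + G2 in the coarse scheme by √2 times a fresh standard normal sample; comparing the empirical variances of the two variants is the last question of the exercise.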

7.8 Further proofs and results


In this section, to alleviate notation, we will drop the superscript n in t_k^n = kT/n. Furthermore, for the reader's
convenience, we will always consider that d = q = 1.

7.8.1 Some useful inequalities


In the non-quadratic case, Doob's Inequality is not sufficient to carry out the proof: we need the more
general Burkholder-Davis-Gundy Inequality. Furthermore, to get real constants having the announced
behaviour as a function of T, we will also need to use the generalized Minkowski Inequality established
below.
 Generalized Minkowski Inequality: For any (bi-measurable) non-negative process X = (X_t)_{t≥0} and
for every p ∈ [1, ∞),

∀ T ∈ [0, ∞],   ‖ ∫_0^T X_t dt ‖_p ≤ ∫_0^T ‖X_t‖_p dt.    (7.30)

Proof. If p = 1 the inequality is obvious. Assume now p ∈ (1, ∞). Let T ∈ (0, ∞) and let Y be a non-
negative random variable defined on the same probability space as (X_t)_{t∈[0,T]}. Let M > 0. It follows from
Fubini's Theorem and the Hölder Inequality that

E( ∫_0^T (X_s ∧ M) ds · Y ) = ∫_0^T E( (X_s ∧ M) Y ) ds
                            ≤ ∫_0^T ‖X_s ∧ M‖_p ‖Y‖_q ds   where q = p/(p−1)
                            = ‖Y‖_q ∫_0^T ‖X_s ∧ M‖_p ds.

The above inequality applied with Y := ( ∫_0^T X_s ∧ M ds )^{p−1}, where M is a positive real number, yields

E( ∫_0^T X_s ∧ M ds )^p ≤ ( E( ∫_0^T X_s ∧ M ds )^p )^{1−1/p} ∫_0^T ‖X_s‖_p ds.

If E( ∫_0^T X_s ∧ M_n ds )^p = 0 for every sequence M_n ↑ +∞, the inequality is obvious since, by Beppo Levi's
monotone convergence Theorem, ∫_0^T X_s ds = 0 P-a.s. Otherwise, there is a sequence M_n ↑ ∞ such that all
these integrals are non-zero (and finite since X_s ∧ M_n is bounded by M_n and T is finite). Consequently, one can
divide both sides of the former inequality to obtain

∀ n ≥ 1,   ( E( ∫_0^T X_s ∧ M_n ds )^p )^{1/p} ≤ ∫_0^T ‖X_s‖_p ds.

Now letting M_n ↑ +∞ yields exactly the expected result owing to two successive applications of Beppo
Levi's monotone convergence Theorem, the first with respect to the Lebesgue measure ds, the second with
respect to dP. When T = +∞, one concludes by Fatou's Lemma by letting T go to infinity in the inequality
obtained for finite T. ♦

 Burkholder-Davis-Gundy Inequality: For every p ∈ (0, ∞), there exist two positive real constants
c_p^BDG and C_p^BDG such that, for every continuous local martingale (X_t)_{t∈[0,T]} null at 0,

c_p^BDG ‖ √⟨X⟩_T ‖_p ≤ ‖ sup_{t∈[0,T]} |X_t| ‖_p ≤ C_p^BDG ‖ √⟨X⟩_T ‖_p.

For a detailed proof based on a stochastic calculus approach, we refer to [141], p. 160.

7.8.2 Polynomial moments (II)


Proposition 7.5 (a) For every p ∈ (0, +∞), there exists a positive real constant κ′_p > 0 (increasing in p)
such that, if b, σ satisfy

∀ t ∈ [0, T], ∀ x ∈ R^d,   |b(t, x)| + |σ(t, x)| ≤ C(1 + |x|),    (7.31)

then every strong solution of Equation (7.1) starting from the finite random vector X₀ (if any) satisfies

∀ p ∈ (0, ∞),   ‖ sup_{s∈[0,T]} |X_s| ‖_p ≤ 2 e^{κ′_p CT} ( 1 + ‖X₀‖_p ).

(b) The same conclusion holds under the same assumptions for the continuous Euler schemes with step T/n,
n ≥ 1, as defined by (7.5), with the same constant κ′_p (which does not depend on n), i.e.

∀ p ∈ (0, ∞), ∀ n ≥ 1,   ‖ sup_{s∈[0,T]} |X̄_s^n| ‖_p ≤ 2 e^{κ′_p CT} ( 1 + ‖X₀‖_p ).

Remarks. • Note that this proposition makes no assumption either on the existence of strong solutions
to (7.1) or on some (strong) uniqueness assumption on a time interval or the whole real line. Furthermore,
the inequality is meaningless when X₀ ∉ L^p(P).
• The case p ∈ (0, 2) will be discussed at the end of the proof.

Proof: To alleviate the notation we assume from now on that d = q = 1.

(a) Step 1 (The process: first reduction): Assume p ∈ [2, ∞). First we introduce, for every integer N ≥ 1,
the stopping time τ_N := inf{ t ∈ [0, T] | |X_t − X₀| > N } (convention inf ∅ = +∞). This is a stopping time
since {τ_N < t} = ∪_{r∈[0,t)∩Q} { |X_r − X₀| > N } ∈ F_t. Moreover, {τ_N ≤ t} = ∩_{k≥k₀} { τ_N < t + 1/k } for every k₀ ≥ 1,
hence {τ_N ≤ t} ∈ ∩_{k₀≥1} F_{t+1/k₀} = F_{t+} = F_t since the filtration is càd (²). Furthermore,

sup_{t∈[0,T]} |X_t^{τ_N}| ≤ N + |X₀|,

so that the non-decreasing function f_N defined by f_N(t) := ‖ sup_{s∈[0,t]} |X_{s∧τ_N}| ‖_p, t ∈ [0, T], is bounded by
N + ‖X₀‖_p. On the other hand,

sup_{s∈[0,t]} |X_{s∧τ_N}| ≤ |X₀| + ∫_0^{t∧τ_N} |b(s, X_s)| ds + sup_{s∈[0,t]} | ∫_0^{s∧τ_N} σ(u, X_u) dW_u |.
It follows from successive applications of both the regular and the generalized Minkowski Inequalities and
of the BDG Inequality that

f_N(t) ≤ ‖X₀‖_p + ∫_0^t ‖1_{{s≤τ_N}} b(s, X_s)‖_p ds + C_p^BDG ‖ √( ∫_0^{t∧τ_N} σ(s, X_s)² ds ) ‖_p

       ≤ ‖X₀‖_p + ∫_0^t ‖b(s ∧ τ_N, X_{s∧τ_N})‖_p ds + C_p^BDG ‖ √( ∫_0^t σ(s ∧ τ_N, X_{s∧τ_N})² ds ) ‖_p

       ≤ ‖X₀‖_p + C ∫_0^t ( 1 + ‖X_{s∧τ_N}‖_p ) ds + C_p^BDG C ‖ √( ∫_0^t (1 + |X_{s∧τ_N}|)² ds ) ‖_p

       ≤ ‖X₀‖_p + C ∫_0^t ( 1 + ‖X_{s∧τ_N}‖_p ) ds + C_p^BDG C ‖ √t + √( ∫_0^t X_{s∧τ_N}² ds ) ‖_p,

(²) This holds true for any hitting time of an open set by an F_t-adapted càd process.

where we used in the last line the Minkowski Inequality in L²([0, T], dt) endowed with its usual Hilbert
norm. Hence, the L^p(P)-Minkowski Inequality and the obvious identity ‖√·‖_p = ‖·‖_{p/2}^{1/2} yield

f_N(t) ≤ ‖X₀‖_p + C ∫_0^t ( 1 + ‖X_{s∧τ_N}‖_p ) ds + C_p^BDG C ( √t + ‖ ∫_0^t X_{s∧τ_N}² ds ‖_{p/2}^{1/2} ).

Now the generalized L^{p/2}(P)-Minkowski Inequality (7.30) (use p/2 ≥ 1) yields

f_N(t) ≤ ‖X₀‖_p + C ∫_0^t ( 1 + ‖X_{s∧τ_N}‖_p ) ds + C_p^BDG C ( √t + ( ∫_0^t ‖X_{s∧τ_N}²‖_{p/2} ds )^{1/2} )

       = ‖X₀‖_p + C ∫_0^t ( 1 + ‖X_{s∧τ_N}‖_p ) ds + C_p^BDG C ( √t + ( ∫_0^t ‖X_{s∧τ_N}‖_p² ds )^{1/2} ).

Consequently the function f_N satisfies

f_N(t) ≤ C ( ∫_0^t f_N(s) ds + C_p^BDG ( ∫_0^t f_N(s)² ds )^{1/2} ) + ψ(t)

where

ψ(t) = ‖X₀‖_p + C ( t + C_p^BDG √t ).
Step 2: "À la Gronwall" Lemma.

Lemma 7.4 Let f : [0, T] → R₊ and ψ : [0, T] → R₊ be two non-negative non-decreasing functions
satisfying

∀ t ∈ [0, T],   f(t) ≤ A ∫_0^t f(s) ds + B ( ∫_0^t f(s)² ds )^{1/2} + ψ(t),

where A, B are two positive real constants. Then

∀ t ∈ [0, T],   f(t) ≤ 2 e^{(2A+B²)t} ψ(t).

Proof. First, it follows from the elementary inequality √(xy) ≤ (1/2)(x/B + By), x, y ≥ 0, B > 0, that

( ∫_0^t f(s)² ds )^{1/2} ≤ ( f(t) ∫_0^t f(s) ds )^{1/2} ≤ f(t)/(2B) + (B/2) ∫_0^t f(s) ds.

Plugging this into the original inequality yields

f(t) ≤ (2A + B²) ∫_0^t f(s) ds + 2 ψ(t).

The regular Gronwall Lemma finally yields the announced result. ♦

Step 3: Applying the above generalized Gronwall Lemma to the functions f_N and ψ defined in Step 1
leads to

∀ t ∈ [0, T],   ‖ sup_{s∈[0,t]} |X_{s∧τ_N}| ‖_p = f_N(t) ≤ 2 e^{(2+(C_p^BDG)²)Ct} ( ‖X₀‖_p + C(t + C_p^BDG √t) ).

The sequence of stopping times τ_N is non-decreasing and converges toward τ_∞ taking values in [0, T] ∪ {∞}.
On the event {τ_∞ ≤ T}, |X_{τ_N} − X₀| ≥ N, so that |X_{τ_∞} − X₀| = lim_{N→+∞} |X_{τ_N} − X₀| = +∞ since X
has continuous paths. This is a.s. impossible since the process (X_t)_{t≥0} has continuous paths on [0, T]. As a
consequence τ_∞ = ∞ a.s., which in turn implies that

lim_N sup_{s∈[0,t]} |X_{s∧τ_N}| = sup_{s∈[0,t]} |X_s|   a.s.

Then Fatou's Lemma implies, by letting N go to infinity, that

∀ t ∈ [0, T],   ‖ sup_{s∈[0,t]} |X_s| ‖_p ≤ lim inf_N ‖ sup_{s∈[0,t]} |X_{s∧τ_N}| ‖_p ≤ 2 e^{(2+(C_p^BDG)²)Ct} ( ‖X₀‖_p + C(t + C_p^BDG √t) ),

which finally yields, using that max(√u, u) ≤ e^u, u ≥ 0,

∀ t ∈ [0, T],   ‖ sup_{s∈[0,t]} |X_s| ‖_p ≤ 2 e^{(2+(C_p^BDG)²)Ct} ( ‖X₀‖_p + e^{Ct} + e^{(C_p^BDG)²Ct} ).

One derives the existence of a positive real constant κ′_p > 0, only depending on p, such that

∀ t ∈ [0, T],   ‖ sup_{s∈[0,t]} |X_s| ‖_p ≤ 2 e^{κ′_p Ct} ( 1 + ‖X₀‖_p ).

Step 4 (p ∈ (0, 2)): The extension can be carried out as follows. For every x ∈ R^d, the diffusion process
starting at x, denoted (X_t^x)_{t∈[0,T]}, satisfies the following two obvious facts:
– the process X^x is F_t^W-adapted, where F_t^W := σ(W_s, s ≤ t) ∨ N_P (N_P denoting the class of P-negligible sets);
– if X₀ is an R^d-valued random vector defined on (Ω, A, P), independent of W, then the process X =
(X_t)_{t∈[0,T]} starting from X₀ satisfies

X_t = X_t^{X₀}.

Consequently, using that p ↦ ‖·‖_p is non-decreasing, it follows that

‖ sup_{s∈[0,t]} |X_s^x| ‖_p ≤ ‖ sup_{s∈[0,t]} |X_s^x| ‖_2 ≤ 2 e^{κ′_2 Ct} ( 1 + |x| ).

Now

E( sup_{t∈[0,T]} |X_t|^p ) = ∫_{R^d} P_{X₀}(dx) E( sup_{t∈[0,T]} |X_t^x|^p ) ≤ 2^{(p−1)₊} 2^p e^{pκ′_2 CT} ( 1 + E|X₀|^p )

(where we used that (u + v)^p ≤ 2^{(p−1)₊} (u^p + v^p), u, v ≥ 0), so that

‖ sup_{t∈[0,T]} |X_t| ‖_p ≤ 2^{(1−1/p)₊} 2 e^{κ′_2 CT} 2^{(1/p−1)₊} ( 1 + ‖X₀‖_p )
                         = 2^{|1−1/p|} 2 e^{κ′_2 CT} ( 1 + ‖X₀‖_p ).

As concerns the SDE (7.1) itself, the same reasoning can be carried out only if (7.1) satisfies an existence
and uniqueness assumption for any starting value X₀.
(b) (Euler scheme) The proof follows the same lines as above. One starts from the integral form (7.6) of the
continuous Euler scheme and introduces, for every n, N ≥ 1, the stopping times

τ̄_N = τ̄_N^n := inf{ t ∈ [0, T] | |X̄_t^n − X₀| > N }.

At this stage, one notes that b (as well as σ²) satisfies an inequality of the type

∀ s ∈ [0, t],   ‖1_{{s≤τ̄_N}} b(s, X̄_s)‖_p ≤ C ( 1 + ‖ sup_{s∈[0,t∧τ̄_N]} |X̄_s| ‖_p ),

which makes it possible to reproduce formally the above proof for the continuous Euler scheme. ♦

7.8.3 Lp -pathwise regularity


Lemma 7.5 Let p ≥ 1 and let (Y_t)_{t∈[0,T]} be an Itô process defined on [0, T] by

Y_t = Y₀ + ∫_0^t G_s ds + ∫_0^t H_s dW_s,

where H and G are (F_t)-progressively measurable and satisfy ∫_0^T ( |G_s| + H_s² ) ds < +∞ a.s.
(a) For every p ≥ 2,

∀ s, t ∈ [0, T],   ‖Y_t − Y_s‖_p ≤ C_p^BDG sup_{t∈[0,T]} ‖H_t‖_p |t − s|^{1/2} + sup_{t∈[0,T]} ‖G_t‖_p |t − s|
                                 ≤ ( C_p^BDG sup_{t∈[0,T]} ‖H_t‖_p + √T sup_{t∈[0,T]} ‖G_t‖_p ) |t − s|^{1/2}.

In particular, if sup_{t∈[0,T]} ‖H_t‖_p + sup_{t∈[0,T]} ‖G_t‖_p < +∞, the process t ↦ Y_t is Hölder with exponent 1/2 from
[0, T] into L^p(P).
(b) If p ∈ [1, 2), then

∀ s, t ∈ [0, T],   ‖Y_t − Y_s‖_p ≤ C_p^BDG ‖ sup_{t∈[0,T]} |H_t| ‖_p |t − s|^{1/2} + sup_{t∈[0,T]} ‖G_t‖_p |t − s|
                                 ≤ ( C_p^BDG ‖ sup_{t∈[0,T]} |H_t| ‖_p + √T sup_{t∈[0,T]} ‖G_t‖_p ) |t − s|^{1/2}.

Proof. (a) Let 0 ≤ s ≤ t ≤ T. It follows from the standard and generalized Minkowski Inequalities and the
BDG Inequality (applied to the continuous local martingale ( ∫_s^{s+u} H_r dW_r )_{u≥0}) that

‖Y_t − Y_s‖_p ≤ ‖ ∫_s^t |G_u| du ‖_p + ‖ ∫_s^t H_u dW_u ‖_p

             ≤ ∫_s^t ‖G_u‖_p du + C_p^BDG ‖ √( ∫_s^t H_u² du ) ‖_p

             ≤ sup_{t∈[0,T]} ‖G_t‖_p (t − s) + C_p^BDG ‖ ∫_s^t H_u² du ‖_{p/2}^{1/2}

             ≤ sup_{t∈[0,T]} ‖G_t‖_p (t − s) + C_p^BDG sup_{u∈[0,T]} ‖H_u²‖_{p/2}^{1/2} (t − s)^{1/2}

             = sup_{t∈[0,T]} ‖G_t‖_p (t − s) + C_p^BDG sup_{u∈[0,T]} ‖H_u‖_p (t − s)^{1/2}.

The second inequality of the statement simply follows from |t − s| ≤ √T |t − s|^{1/2}.
(b) If p ∈ [1, 2), one simply uses that

‖ ( ∫_s^t H_u² du )^{1/2} ‖_p = ‖ ∫_s^t H_u² du ‖_{p/2}^{1/2} ≤ |t − s|^{1/2} ‖ sup_{u∈[0,T]} H_u² ‖_{p/2}^{1/2} = |t − s|^{1/2} ‖ sup_{u∈[0,T]} |H_u| ‖_p,

and one concludes likewise. ♦



Remark: If H, G and Y are defined on the whole real line R₊ and sup_{t∈R₊} ( ‖G_t‖_p + ‖H_t‖_p ) < +∞, then
t ↦ Y_t is locally 1/2-Hölder on R₊. If H = 0, the process is in fact Lipschitz continuous on [0, T].
Combining the above result for Itô processes with those of Proposition 7.5 leads to the following result
on the pathwise regularity of the diffusion solution to (7.1) (when it exists) and of the related Euler schemes.

Proposition 7.6 If the coefficients b and σ satisfy the linear growth assumption (7.31) over [0, T] × R^d
with a real constant C > 0, then the Euler scheme with step T/n and any strong solution of (7.1) satisfy, for
every p ≥ 1,

∀ n ≥ 1, ∀ s, t ∈ [0, T],   ‖X_t − X_s‖_p + ‖X̄_t^n − X̄_s^n‖_p ≤ κ″_p C e^{κ″_p CT} (1 + √T)( 1 + ‖X₀‖_p ) |t − s|^{1/2},

where κ″_p ∈ (0, ∞) is a real constant only depending on p (increasing in p).

Proof. As concerns the process X, this is a straightforward consequence of the above Lemma 7.5 by setting

G_t = b(t, X_t)   and   H_t = σ(t, X_t),

since

max( ‖ sup_{t∈[0,T]} |G_t| ‖_p , ‖ sup_{t∈[0,T]} |H_t| ‖_p ) ≤ C ( 1 + ‖ sup_{t∈[0,T]} |X_t| ‖_p ).

One specifies the real constant κ″_p using Proposition 7.5. ♦

7.8.4 Lp -converge rate (II): proof of Theorem 7.2


Step 1 (p ≥ 2): One sets

ε_t := X_t − X̄_t^n,  t ∈ [0, T]
    = ∫_0^t ( b(s, X_s) − b(s̲, X̄_s̲) ) ds + ∫_0^t ( σ(s, X_s) − σ(s̲, X̄_s̲) ) dW_s,

so that

sup_{s∈[0,t]} |ε_s| ≤ ∫_0^t |b(s, X_s) − b(s̲, X̄_s̲)| ds + sup_{s∈[0,t]} | ∫_0^s ( σ(u, X_u) − σ(u̲, X̄_u̲) ) dW_u |.

One sets, for every t ∈ [0, T],

f(t) := ‖ sup_{s∈[0,t]} |ε_s| ‖_p.

It follows from the regular and generalized Minkowski Inequalities and the BDG Inequality that

f(t) ≤ ∫_0^t ‖b(s, X_s) − b(s̲, X̄_s̲)‖_p ds + C_p^BDG ‖ ( ∫_0^t ( σ(s, X_s) − σ(s̲, X̄_s̲) )² ds )^{1/2} ‖_p

     = ∫_0^t ‖b(s, X_s) − b(s̲, X̄_s̲)‖_p ds + C_p^BDG ‖ ∫_0^t ( σ(s, X_s) − σ(s̲, X̄_s̲) )² ds ‖_{p/2}^{1/2}

     ≤ ∫_0^t ‖b(s, X_s) − b(s̲, X̄_s̲)‖_p ds + C_p^BDG ( ∫_0^t ‖σ(s, X_s) − σ(s̲, X̄_s̲)‖_p² ds )^{1/2}

     ≤ C_{b,σ} ( ∫_0^t ( (s − s̲)^β + ‖X_s − X̄_s̲‖_p ) ds + C_p^BDG ( ∫_0^t ( (s − s̲)^{2β} + ‖X_s − X̄_s̲‖_p² ) ds )^{1/2} )

     ≤ C_{b,σ} ( (√t + C_p^BDG) √t (T/n)^β + ∫_0^t ‖X_s − X̄_s̲‖_p ds + C_p^BDG ( ∫_0^t ‖X_s − X̄_s̲‖_p² ds )^{1/2} )

where we used that 0 ≤ s − s̲ ≤ T/n and the elementary inequality √(u + v) ≤ √u + √v, u, v ≥ 0; C_{b,σ}
denotes a positive real constant only depending on b and σ. Now, noting that

‖X_s − X̄_s̲‖_p ≤ ‖X_s − X_s̲‖_p + ‖X_s̲ − X̄_s̲‖_p
             = ‖X_s − X_s̲‖_p + ‖ε_s̲‖_p
             ≤ ‖X_s − X_s̲‖_p + f(s),

it follows that

f(t) ≤ C_{b,σ} ( ∫_0^t f(s) ds + √2 C_p^BDG ( ∫_0^t f(s)² ds )^{1/2} + ψ(t) )    (7.32)

where

ψ(t) := (√t + C_p^BDG) √t (T/n)^β + ∫_0^t ‖X_s − X_s̲‖_p ds + √2 C_p^BDG ( ∫_0^t ‖X_s − X_s̲‖_p² ds )^{1/2}.    (7.33)

Step 2: It follows from Lemma 7.4 (applied with A = C_{b,σ} and B = √2 C_{b,σ} C_p^BDG) that

f(t) ≤ 2 C_{b,σ} e^{2C_{b,σ}(1 + C_{b,σ}(C_p^BDG)²)t} ψ(t).    (7.34)

Now, we will use the pathwise regularity of the diffusion process X in L^p(P) obtained in Proposition 7.6 to
provide an upper bound for the function ψ. We first note that, since b and σ satisfy (H_T^β) with a positive
real constant C_{b,σ}, they satisfy the linear growth assumption (7.31) with

C′_{b,σ,T} := C_{b,σ} + sup_{t∈[0,T]} ( |b(t, 0)| + |σ(t, 0)| ) < +∞

(b(·, 0) and σ(·, 0) are β-Hölder, hence bounded on [0, T]). It follows from (7.33) and Proposition 7.6 that

ψ(t) ≤ (√t + C_p^BDG) √t (T/n)^β + κ″_p C′_{b,σ,T} e^{κ″_p C′_{b,σ,T} t} ( 1 + ‖X₀‖_p )(1 + √t)( t + √2 C_p^BDG √t ) (T/n)^{1/2}

     ≤ C_p^BDG (1 + t)(T/n)^β + 2(1 + 2C_p^BDG)(1 + t)² κ″_p C′_{b,σ,T} e^{κ″_p C′_{b,σ,T} t} ( 1 + ‖X₀‖_p ) (T/n)^{1/2},

where we used the elementary inequality √u ≤ 1 + u, which implies that (√t + C_p^BDG)√t ≤ C_p^BDG (1 + t)
and (1 + √t)( t + √2 C_p^BDG √t ) ≤ 2(1 + 2C_p^BDG)(1 + t)².
Hence, there exists a real constant κ̃_p > 0 such that

ψ(t) ≤ κ̃_p (1 + t)(T/n)^β + κ̃_p C′_{b,σ,T} e^{κ̃_p C′_{b,σ,T} t} ( 1 + ‖X₀‖_p )(1 + t)² (T/n)^{1/2}

     ≤ κ̃_p e^t e^{κ̃_p(1+C′_{b,σ,T})t} ( 1 + ‖X₀‖_p ) ( (T/n)^β + 2 κ̃_p C′_{b,σ,T} (T/n)^{1/2} ),

since 1 + u ≤ e^u and (1 + u)² ≤ 2e^u, u ≥ 0. Finally, one plugs this bound into (7.34) (at time T) to get the
announced upper bound by setting the real constant κ_p at an appropriate value.
Step 3 (p ∈ (0, 2)): It remains to deal with the case p ∈ (0, 2). In fact, once it is noticed that Assumption
(H_T^β) ensures global existence and uniqueness of the solution X of (7.1) starting from a given random
variable X₀ (independent of W), it can be handled following the approach developed in Step 4 of the proof of
Proposition 7.5. We leave the details to the reader. ♦

Corollary 7.3 (Lipschitz continuous framework) If b and σ satisfy Condition (H_T^1), i.e.

∀ s, t ∈ [0, T], ∀ x, y ∈ R^d,   |b(s, x) − b(t, y)| + |σ(s, x) − σ(t, y)| ≤ C_{b,σ,T} ( |t − s| + |x − y| ),

then for every p ∈ [1, ∞),

∀ n ≥ 1,   ‖ sup_{t∈[0,T]} |X_t − X̄_t^n| ‖_p ≤ κ_p C_{b,σ,T} e^{κ_p(1+C_{b,σ,T})T} ( 1 + ‖X₀‖_p ) √(T/n).

7.8.5 The stepwise constant Euler scheme


The aim of this section is to prove in full generality Claim (b) of Theorem 7.2. We recall that the stepwise
constant Euler scheme is defined by

∀ t ∈ [0, T],   X̃_t := X̄_t̲,   i.e.   X̃_t = X̄_{t_k} if t ∈ [t_k, t_{k+1}).

It has been seen in Section 7.2.1 that, when X = W, a log n factor comes out in the error rate. One must
again keep in mind that this question is quite crucial since, at least in higher dimension, the simulation
of the continuous Euler scheme may raise difficult problems, whereas the simulation of the stepwise constant
Euler scheme remains straightforward in any dimension (provided b and σ are known).

Proof of Theorem 7.2(b). Step 1 (Deterministic X₀): We first assume that X₀ = x, and one may
assume without loss of generality that p ∈ [1, ∞). Then

X̄_t^n − X̃_t^n = X̄_t^n − X̄_t̲^n = ∫_t̲^t b(s̲, X̄_s̲) ds + ∫_t̲^t σ(s̲, X̄_s̲) dW_s.

One derives that

sup_{t∈[0,T]} |X̄_t^n − X̃_t^n| ≤ (T/n) sup_{t∈[0,T]} |b(t̲, X̄_t̲)| + sup_{t∈[0,T]} |σ(t̲, X̄_t̲)(W_t − W_t̲)|.    (7.35)

Now, it follows from Proposition 7.5(b) that

‖ sup_{t∈[0,T]} |b(t̲, X̄_t̲^n)| ‖_p ≤ 2 e^{κ′_p C_{b,σ,T} T} (1 + |x|).

On the other hand, using the extended Hölder Inequality: for every p ∈ (0, ∞),

∀ r, s ≥ 1, 1/r + 1/s = 1,   ‖f g‖_p ≤ ‖f‖_{rp} ‖g‖_{sp},

with r = 1 + η and s = 1 + 1/η, η > 0, leads to

‖ sup_{t∈[0,T]} |σ(t̲, X̄_t̲)(W_t − W_t̲)| ‖_p ≤ ‖ sup_{t∈[0,T]} |σ(t̲, X̄_t̲)| ‖_{p(1+η)} ‖ sup_{t∈[0,T]} |W_t − W_t̲| ‖_{p(1+1/η)}

(for convenience we set η = 1 in what follows). Now, as for the drift b, one has

‖ sup_{t∈[0,T]} |σ(t̲, X̄_t̲)| ‖_{2p} ≤ 2 e^{κ′_{2p} C_{b,σ,T} T} (1 + |x|).

As concerns the Brownian term, one has

‖ sup_{t∈[0,T]} |W_t − W_t̲| ‖_{2p} ≤ C_{W,2p} √( (T/n)(1 + log n) )

owing to (7.12) in Section 7.2.1. Finally, plugging these estimates into (7.35) yields

‖ sup_{t∈[0,T]} |X̄_t^n − X̃_t^n| ‖_p ≤ 2 e^{κ′_p C_{b,σ,T} T} (1 + |x|) (T/n)
                                      + 2 e^{κ′_{2p} C_{b,σ,T} T} (1 + |x|) C_{W,2p} √( (T/n)(1 + log n) )

                                   ≤ 2 (C_{W,2p} + 1) e^{κ′_{2p} C_{b,σ,T} T} (1 + |x|) ( √( (T/n)(1 + log n) ) + T/n ).

One concludes by noting that √( (T/n)(1 + log n) ) + T/n ≤ 2 √( (T/n)(1 + log n) ) for every integer n ≥ 1 and by setting

κ̃_p := 2 max( 2(C_{W,2p} + 1), κ′_{2p} ).


Step 2 (Random X₀): When X₀ is no longer deterministic, one uses that X₀ and W are independent, so
that, with obvious notations,

E( sup_{t∈[0,T]} |X̄^{n,X₀} − X̃^{n,X₀}|^p ) = ∫_{R^d} P_{X₀}(dx₀) E( sup_{t∈[0,T]} |X̄^{n,x₀} − X̃^{n,x₀}|^p ),

which yields the announced result.


Step 3 (Combination of the upper bounds): This is a straightforward consequence of Claims (a) and (b). ♦

7.8.6 Application to the a.s.-convergence of the Euler schemes and its rate.
One can derive from the above L^p-rate of convergence an a.s.-convergence result. The main result is given
in the following theorem (which extends Theorem 7.3, stated in the homogeneous Lipschitz continuous case).

Theorem 7.9 If (H_T^β) holds and if X₀ is a.s. finite, the continuous Euler scheme X̄^n = (X̄_t^n)_{t∈[0,T]} a.s.
converges toward the diffusion X for the sup-norm over [0, T]. Furthermore, for every α ∈ [0, β ∧ 1/2),

n^α sup_{t∈[0,T]} |X_t − X̄_t^n| −→ 0   a.s.

The same convergence rate holds for the stepwise constant Euler scheme (X̃_t^n)_{t∈[0,T]}.

Proof: We make no a priori integrability assumption on X₀. We rely on the localization principle (at the
origin). Let N > 1; set X₀^(N) := X₀ 1_{{|X₀|≤N}} + N (X₀/|X₀|) 1_{{|X₀|>N}}. Stochastic integration being a local
operator, the solutions (X_t)_{t∈[0,T]} and (X_t^(N))_{t∈[0,T]} of the SDE (7.1) are equal on {X₀ = X₀^(N)}, namely
on {|X₀| ≤ N}. The same property obviously holds for the Euler schemes X̄^n and X̄^{n,(N)} starting from X₀ and
X₀^(N) respectively. For a fixed N, we know from Theorem 7.2(a) that, for every p ≥ 1,

∃ C_{p,b,σ,β,T} > 0 such that, ∀ n ≥ 1,   E( sup_{t∈[0,T]} |X̄_t^(N) − X_t^(N)|^p ) ≤ C_{p,b,σ,β,T} ( 1 + ‖X₀^(N)‖_p )^p (T/n)^{p(β∧1/2)}.

In particular

E( 1_{{|X₀|≤N}} sup_{t∈[0,T]} |X̄_t^{n,(N)} − X_t|^p ) = E( 1_{{|X₀|≤N}} sup_{t∈[0,T]} |X̄_t^{n,(N)} − X_t^(N)|^p )
                                                   ≤ E( sup_{t∈[0,T]} |X̄_t^{n,(N)} − X_t^(N)|^p )
                                                   ≤ C_{p,b,σ,β,T} ( 1 + ‖X₀^(N)‖_p )^p (T/n)^{p(β∧1/2)}.

One chooses p > 1/(β ∧ 1/2), so that Σ_{n≥1} 1/n^{p(β∧1/2)} < +∞. Consequently, Beppo Levi's Theorem for series with
non-negative terms implies

E( 1_{{|X₀|≤N}} Σ_{n≥1} sup_{t∈[0,T]} |X̄_t^n − X_t|^p ) < +∞.

Hence

Σ_{n≥1} sup_{t∈[0,T]} |X̄_t^n − X_t|^p < +∞   P-a.s. on {|X₀| ≤ N},   with   ∪_{N≥1} {|X₀| ≤ N} = {X₀ ∈ R^d} = Ω   P-a.s.,

so that sup_{t∈[0,T]} |X̄_t^n − X_t| −→ 0 P-a.s. as n → ∞.
One can improve the above approach to get a much more powerful result: let α ∈ (0, β ∧ 1/2) and p > 1/(β∧1/2 − α).
Then

n^{pα} E( sup_{t∈[0,T]} |X̄_t^(N) − X_t^(N)|^p ) ≤ C_{p,b,σ,β,T} ( 1 + ‖X₀^(N)‖_p )^p n^{pα} (T/n)^{p(β∧1/2)}
                                              = C′_{p,b,σ,β,T} ( 1 + ‖X₀^(N)‖_p )^p n^{−p(β∧1/2−α)}.

Finally, reproducing the above argument, one gets

n^α sup_{t∈[0,T]} |X̄_t^n − X_t| −→ 0   P-a.s. as n → ∞.

The proof for the stepwise constant Euler scheme follows exactly the same lines since the additional log n
term plays no role in the convergence of the above series. ♦

Remarks and comments. • The above rate result strongly suggests that the critical index for the a.s. rate
of convergence is β ∧ 1/2. The question is then: what happens when α = β ∧ 1/2? It is shown in [88, 74] that
(when β = 1), √n (X_t − X̄_t^n) →^L Ξ_t, where Ξ = (Ξ_t)_{t∈[0,T]} is a diffusion process. This weak convergence holds
in a functional sense, namely for the topology of uniform convergence on C([0, T], R^d). The process Ξ is not
P-a.s. ≡ 0 if σ 6≡ 0, and is even a.s. non-zero if σ never vanishes. The "weak functional" feature means first
that we consider the processes as random variables taking values in their natural path space, namely the
separable Banach space (C([0, T], R^d), ‖·‖_sup). Then, one may consider the weak convergence of probability
measures defined on (the Borel σ-field of) this space (see [24] for an introduction). The connection with the
above elementary results is the following:

if Y_n →^L Y, then, for every ε > 0, P-a.s.,

lim sup_n n^ε ‖Y_n‖_sup = +∞ on {Y 6= 0}   and   Y = 0 on { lim_n ‖Y_n‖_sup = 0 }.

(This follows either from the Skorokhod representation theorem or from a direct approach.)

• When β = 1 (Lipschitz continuous case), one checks that, for every ε > 0, lim sup_n (√n)^{1+ε} sup_{t∈[0,T]} |X_t − X̄_t^n| =
+∞ P-a.s. If one changes the time scale, it is of iterated logarithm type.

 Exercise. One considers the geometric Brownian motion X_t = e^{−t/2+W_t}, solution to

dX_t = X_t dW_t,   X₀ = 1.

(a) Show that for every n ≥ 1 and every k ≥ 0,

X̄_{t_k^n}^n = ∏_{ℓ=1}^k ( 1 + ∆W_{t_ℓ^n} )   where t_ℓ^n = ℓT/n, ∆W_{t_ℓ^n} = W_{t_ℓ^n} − W_{t_{ℓ−1}^n}, ℓ ≥ 1.

(b) Show that

∀ ε > 0,   lim sup_n (√n)^{1+ε} |X_T − X̄_T^n| = +∞   P-a.s.
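Part (a) of this exercise can be checked numerically: the following sketch (assuming NumPy; the values T = 1, n = 64 and the seed are illustrative) runs the Euler recursion for dX_t = X_t dW_t and compares it with the product formula.

```python
import numpy as np

# Euler scheme of dX_t = X_t dW_t, X_0 = 1 (b = 0, sigma(x) = x):
# Xbar_{t_{k+1}} = Xbar_{t_k} (1 + DeltaW_{t_{k+1}}),
# hence, by induction, the product formula of part (a).
rng = np.random.default_rng(7)
T, n = 1.0, 64
dW = np.sqrt(T / n) * rng.standard_normal(n)   # Brownian increments

X, path = 1.0, []
for inc in dW:
    X = X * (1.0 + inc)        # one Euler step
    path.append(X)
path = np.array(path)

product_form = np.cumprod(1.0 + dW)            # closed form of part (a)
```

The two arrays coincide exactly (up to floating-point round-off), which is the content of part (a).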

7.8.7 Flow of an SDE, Lipschitz continuous regularity


If Assumption (7.2) holds, there exists a unique solution to the SDE (7.1) starting from x ∈ R^d, defined
on [0, T] and denoted (X_t^x)_{t∈[0,T]} from now on. The mapping (x, t) ↦ X_t^x defined on [0, T] × R^d is called the
flow of the SDE (7.1). One defines likewise the flow of the Euler schemes (which always exists). We will
now elucidate the regularity of these flows when Assumption (H_T^β) holds.

Theorem 7.10 There exists a function κ′₃ : (R*₊)² → R₊ such that, if the coefficients b and σ of (7.1) satisfy
Assumption (H_T^β) for a real constant C > 0, then the unique strong solution (X_t^x)_{t∈[0,T]} starting from x ∈ R^d
on [0, T] and the continuous Euler scheme (X̄_t^{n,x})_{t∈[0,T]} satisfy

∀ x, y ∈ R^d, ∀ n ≥ 1,   ‖ sup_{t∈[0,T]} |X_t^x − X_t^y| ‖_p + ‖ sup_{t∈[0,T]} |X̄_t^{n,x} − X̄_t^{n,y}| ‖_p ≤ κ′₃(p, C) e^{κ′₃(p,C)T} |x − y|.

Proof: We focus on the diffusion process (X_t)_{t∈[0,T]}. First note that if the above bound holds for some
p > 0, then it holds true for any p' ∈ (0, p) since the ‖·‖_p-norm is non-decreasing in p. Starting from

    X_t^x − X_t^y = (x − y) + ∫_0^t (b(s, X_s^x) − b(s, X_s^y)) ds + ∫_0^t (σ(s, X_s^x) − σ(s, X_s^y)) dW_s,

one gets

    sup_{s∈[0,t]} |X_s^x − X_s^y| ≤ |x − y| + ∫_0^t |b(s, X_s^x) − b(s, X_s^y)| ds + sup_{s∈[0,t]} |∫_0^s (σ(u, X_u^x) − σ(u, X_u^y)) dW_u|.
7.8. FURTHER PROOFS AND RESULTS 251
Then, setting for every p ≥ 2, f(t) := ‖ sup_{s∈[0,t]} |X_s^x − X_s^y| ‖_p, it follows from the generalized Minkowski and
the BDG Inequalities that

    f(t) ≤ |x − y| + C ∫_0^t ‖X_s^x − X_s^y‖_p ds + C_p^{BDG} ‖ ( ∫_0^t |σ(s, X_s^x) − σ(s, X_s^y)|² ds )^{1/2} ‖_p
         ≤ |x − y| + C ∫_0^t ‖X_s^x − X_s^y‖_p ds + C_p^{BDG} C ‖ ( ∫_0^t |X_s^x − X_s^y|² ds )^{1/2} ‖_p
         ≤ |x − y| + C ∫_0^t ‖X_s^x − X_s^y‖_p ds + C_p^{BDG} C ( ∫_0^t ‖X_s^x − X_s^y‖_p² ds )^{1/2}.

Consequently the function f satisfies

    f(t) ≤ |x − y| + C ∫_0^t f(s) ds + C_p^{BDG} C ( ∫_0^t f(s)² ds )^{1/2}.

One concludes like in Step 2 of Theorem 7.2 that

    ∀ t ∈ [0, T],   f(t) ≤ e^{C(2+C_p^{BDG}) t} |x − y|.   ♦
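For the Euler scheme of the geometric Brownian motion dX = X dW (an illustrative special case with parameters of our choosing), the Lipschitz property of the flow in the initial value is exact pathwise, since the scheme is linear in x. The following sketch checks it on one simulated path:

```python
import math
import random

random.seed(2)
T, n = 1.0, 64
sd = math.sqrt(T / n)
x, y = 1.0, 1.3

# For dX = X dW the Euler scheme reads X̄_{t_k}^{n,x} = x * prod_{l<=k} (1 + ΔW_l),
# so the flows from x and y share the same random factor and
# |X̄^{n,x} - X̄^{n,y}| = |x - y| * |prod(1 + ΔW)| pathwise.
prods = []
p = 1.0
for _ in range(n):
    p *= 1.0 + random.gauss(0.0, sd)
    prods.append(p)

sup_diff = max(abs(x * q - y * q) for q in prods)
lip_const = sup_diff / abs(x - y)   # pathwise Lipschitz constant of the flow map x -> X̄^{n,x}
```

The constant is random but does not depend on (x, y), which mirrors the |x − y| factor in the theorem.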

7.8.8 Strong error rate for the Milstein scheme: proof of Theorem 7.5

In this section, we prove Theorem 7.5, i.e. the scalar case d = q = 1. Throughout this section C_{b,σ,p,T} and
K_{b,σ,p,T} are positive real constants that may vary from line to line.
First we note that the (interpolated) continuous Milstein scheme as defined by (7.21) can be written in
an integral form as follows:

    X̃_t^{mil} = x + ∫_0^t b(X̃_s̲^{mil}) ds + ∫_0^t σ(X̃_s̲^{mil}) dW_s + ∫_0^t ( ∫_s̲^s (σσ')(X̃_u̲^{mil}) dW_u ) dW_s     (7.36)

with our usual notation t̲ (note that u̲ = s̲ for u ∈ [s̲, s]). For notational convenience, we will also drop throughout
this section the superscript mil since we will deal exclusively with the Milstein scheme.
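To see the improvement brought by the σσ'-correction concretely, here is a minimal side-by-side simulation (a sketch of ours; the model dX = X dW, the sample size and the step number are illustrative choices, not the text's), comparing one Euler step x ← x + x ΔW with one Milstein step x ← x + x ΔW + ½ x (ΔW² − h):

```python
import math
import random

random.seed(1)
T, n, M = 1.0, 50, 20_000
h = T / n
sd = math.sqrt(h)

err_eul = err_mil = 0.0
for _ in range(M):
    x_eul = x_mil = 1.0
    w = 0.0
    for _ in range(n):
        dw = random.gauss(0.0, sd)
        x_eul += x_eul * dw                                # Euler step
        x_mil += x_mil * dw + 0.5 * x_mil * (dw * dw - h)  # Milstein step: +(1/2)σσ'(ΔW² - h), σ(x)=x
        w += dw
    x_exact = math.exp(w - T / 2)                          # exact solution at T
    err_eul += (x_exact - x_eul) ** 2
    err_mil += (x_exact - x_mil) ** 2
err_eul = math.sqrt(err_eul / M)
err_mil = math.sqrt(err_mil / M)
```

With n = 50 steps the Milstein RMS error is already several times smaller than the Euler one, consistent with strong orders 1 versus 1/2.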
(a) Step 1 (Moment control): Our first aim is to prove that the Milstein scheme has uniformly controlled
moments at any order, namely that, for every p ∈ (0, ∞), there exists a real constant C_{p,b,σ,T} > 0 such that

    sup_{n≥1} ‖ sup_{t∈[0,T]} |X̃_t^n| ‖_p ≤ C_{p,b,σ,T} (1 + ‖X_0‖_p).     (7.37)

Set

    H_s := σ(X̃_s̲) + ∫_s̲^s (σσ')(X̃_u̲) dW_u = σ(X̃_s̲) + (σσ')(X̃_s̲)(W_s − W_s̲)

so that

    X̃_t = X_0 + ∫_0^t b(X̃_s̲) ds + ∫_0^t H_s dW_s.
It follows from the boundedness of b' and σ' that b and σ satisfy a linear growth assumption.
We will follow the lines of the proof of Proposition 7.5, the specificity of the Milstein framework being
that the diffusion coefficient is replaced by the process H_s. So, our task is to control the term

    sup_{s∈[0,t]} | ∫_0^{s∧τ̃_N} H_u dW_u |
in L^p, where τ̃_N = τ̃_N^n := inf{t ∈ [0, T] : |X̃_t̲^n − X_0| > N}, n, N ≥ 1.
First assume that p ∈ [2, ∞). Since ∫_0^{t∧τ̃_N} H_s dW_s is a continuous local martingale, it follows from
the BDG Inequality that

    ‖ sup_{s∈[0,t]} | ∫_0^{s∧τ̃_N} H_u dW_u | ‖_p ≤ C_p^{BDG} ‖ ( ∫_0^{t∧τ̃_N} H_s² ds )^{1/2} ‖_p.

Consequently, using the generalized Minkowski Inequality,

    ‖ sup_{s∈[0,t]} | ∫_0^{s∧τ̃_N} H_u dW_u | ‖_p ≤ C_p^{BDG} ( ∫_0^t ‖ 1_{{s≤τ̃_N}} H_s ‖_p² ds )^{1/2}
                                               = C_p^{BDG} ( ∫_0^t ‖ 1_{{s≤τ̃_N}} H_{s∧τ̃_N} ‖_p² ds )^{1/2}.

Now, for every s ∈ [0, t],

    ‖1_{{s≤τ̃_N}} H_{s∧τ̃_N}‖_p ≤ ‖σ(X̃_{s̲∧τ̃_N})‖_p + ‖1_{{s≤τ̃_N}} (σσ')(X̃_{s̲∧τ̃_N})(W_{s∧τ̃_N} − W_{s̲∧τ̃_N})‖_p
                              ≤ ‖σ(X̃_{s̲∧τ̃_N})‖_p + ‖(σσ')(X̃_{s̲∧τ̃_N})(W_s − W_s̲)‖_p
                              = ‖σ(X̃_{s̲∧τ̃_N})‖_p + ‖(σσ')(X̃_{s̲∧τ̃_N})‖_p ‖W_s − W_s̲‖_p

where we used that (σσ')(X̃_{s̲∧τ̃_N}) and W_s − W_s̲ are independent. Consequently, using that σ' is bounded so
that σ and σσ' have at most linear growth, we get

    ‖1_{{s≤τ̃_N}} H_{s∧τ̃_N}‖_p ≤ C_{b,σ,T} (1 + ‖X̃_{s̲∧τ̃_N}‖_p).

Finally, following the lines of Step 1 of the proof of Proposition 7.5 leads to

    ‖ sup_{s∈[0,t]} | ∫_0^{s∧τ̃_N} H_u dW_u | ‖_p ≤ C_{b,σ,T} C_p^{BDG} ( √t + ( ∫_0^t ‖ sup_{u∈[0,s∧τ̃_N]} |X̃_u| ‖_p² ds )^{1/2} ).

One concludes still following the lines of the proof of Proposition 7.5 (including Step 4 to deal with the
case p ∈ (0, 2)).
Furthermore, as a by-product we get that, for every p > 0 and every n ≥ 1,

    ‖ sup_{t∈[0,T]} |H_t| ‖_p ≤ K_{b,σ,T,p} (1 + ‖X_0‖_p) < +∞     (7.38)

where K_{b,σ,T,p} does not depend on the discretization step n. As a matter of fact, this follows from

    sup_{t∈[0,T]} |H_t| ≤ C_{b,σ} ( 1 + sup_{t∈[0,T]} |X̃_t^n| ) ( 1 + 2 sup_{t∈[0,T]} |W_t| )

so that, by the Schwarz Inequality, when p ≥ 1,

    ‖ sup_{t∈[0,T]} |H_t| ‖_p ≤ C_{b,σ} ( 1 + sup_{n≥1} ‖ sup_{t∈[0,T]} |X̃_t^n| ‖_{2p} ) ( 1 + 2 ‖ sup_{t∈[0,T]} |W_t| ‖_{2p} ).

A similar bound holds when p ∈ (0, 1).


Now, by Lemma 7.5 devoted to the L^p-regularity of Itô processes, one derives the existence of a real
constant K_{b,σ,p,T} ∈ (0, ∞) (not depending on n ≥ 1) such that

    ∀ t ∈ [0, T], ∀ n ≥ 1,   ‖X̃_t^n − X̃_t̲^n‖_p ≤ K_{b,σ,p,T} (1 + ‖X_0‖_p) (T/n)^{1/2}.     (7.39)

Step 2 (Decomposition and analysis of the error, p ∈ [2, +∞), X_0 = x ∈ R^d): Set ε_t := X_t − X̃_t, t ∈ [0, T],
and, for every p ∈ [1, +∞),

    f(t) := ‖ sup_{s∈[0,t]} |ε_s| ‖_p,   t ∈ [0, T].

Using the diffusion equation and the continuous Milstein scheme one gets

    ε_t = ∫_0^t (b(X_s) − b(X̃_s̲)) ds + ∫_0^t (σ(X_s) − σ(X̃_s̲)) dW_s − ∫_0^t ( ∫_s̲^s (σσ')(X̃_u̲) dW_u ) dW_s
        = ∫_0^t (b(X_s) − b(X̃_s)) ds + ∫_0^t (σ(X_s) − σ(X̃_s)) dW_s
          + ∫_0^t (b(X̃_s) − b(X̃_s̲)) ds + ∫_0^t ( σ(X̃_s) − σ(X̃_s̲) − (σσ')(X̃_s̲)(W_s − W_s̲) ) dW_s.
First one derives that

    sup_{s∈[0,t]} |ε_s| ≤ ‖b'‖_sup ∫_0^t sup_{u∈[0,s]} |ε_u| ds + sup_{s∈[0,t]} |∫_0^s (σ(X_u) − σ(X̃_u)) dW_u|
        + sup_{s∈[0,t]} |∫_0^s (b(X̃_u) − b(X̃_u̲)) du| + sup_{s∈[0,t]} |∫_0^s ( σ(X̃_u) − σ(X̃_u̲) − (σσ')(X̃_u̲)(W_u − W_u̲) ) dW_u|

so that, using the generalized Minkowski Inequality (twice) and the BDG Inequality, one gets classically

    f(t) ≤ ‖b'‖_sup ∫_0^t f(s) ds + C_p^{BDG} ‖σ'‖_sup ( ∫_0^t f(s)² ds )^{1/2} + B + C

where

    B := ‖ sup_{s∈[0,t]} |∫_0^s (b(X̃_u) − b(X̃_u̲)) du| ‖_p,
    C := ‖ sup_{s∈[0,t]} |∫_0^s ( σ(X̃_u) − σ(X̃_u̲) − (σσ')(X̃_u̲)(W_u − W_u̲) ) dW_u| ‖_p.

Now, using that b' is α_{b'}-Hölder continuous, one gets for every u ∈ [0, T],

    b(X̃_u) − b(X̃_u̲) = b'(X̃_u̲)(X̃_u − X̃_u̲) + ρ_b(u) |X̃_u − X̃_u̲|^{1+α_{b'}}
                     = (bb')(X̃_u̲)(u − u̲) + b'(X̃_u̲) ∫_u̲^u H_v dW_v + ρ_b(u) |X̃_u − X̃_u̲|^{1+α_{b'}}

where ρ_b(u) is defined by the above equation on the event {X̃_u ≠ X̃_u̲} and is equal to 0 otherwise. This
defines an (F_u)-adapted process, bounded by the Hölder coefficient [b']_{α_{b'}} of b'. Using that for every
x ∈ R, |bb'|(x) ≤ ‖b'‖_sup (|b(0)| + ‖b'‖_sup |x|), together with (7.39), yields

    B ≤ ‖b'‖_sup (‖b'‖_sup + |b(0)|) ‖ 1 + sup_{t∈[0,T]} |X̃_t| ‖_p (T/n)
        + [b']_{α_{b'}} ( K_{b,σ,p,T} (1 + |x|) )^{1+α_{b'}} (T/n)^{(1+α_{b'})/2}
        + ‖ sup_{s∈[0,t]} |∫_0^s b'(X̃_u̲) ( ∫_u̲^u H_v dW_v ) du| ‖_p.
The last term in the right-hand side of the above bound needs a specific treatment: a naive approach
would yield a √(T/n) term that would make the whole proof crash down. So we will transform the regular
Lebesgue integral into a stochastic integral (hence a local martingale). This can be done either by a stochastic
Fubini theorem or, in a more elementary way, by an integration by parts.

Lemma 7.8.1 Let G : Ω × R_+ → R be an (F_t)_{t∈[0,T]}-progressively measurable process such that
∫_0^T G_s² ds < +∞ a.s. For every t ∈ [0, T],
    ∫_0^t ( ∫_s̲^s G_u dW_u ) ds = ∫_0^t (ū∧t − u) G_u dW_u

where ū := kT/n if u ∈ [(k−1)T/n, kT/n).

Proof. For every k = 1, . . . , n, an integration by parts yields

    ∫_{(k−1)T/n}^{kT/n} ( ∫_{(k−1)T/n}^s G_u dW_u ) ds = − ∫_{(k−1)T/n}^{kT/n} ( s − kT/n ) G_s dW_s.

Likewise, if t ∈ ((ℓ−1)T/n, ℓT/n], then

    ∫_{(ℓ−1)T/n}^t ( ∫_{(ℓ−1)T/n}^s G_u dW_u ) ds = ∫_{(ℓ−1)T/n}^t (t − s) G_s dW_s,

which completes the proof by summing up all the terms. ♦
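The identity of the lemma can be checked numerically for G ≡ 1 by discretizing both sides on a fine grid refining the coarse one (a sketch with illustrative grid sizes; the two discrete sums agree up to a small discretization error):

```python
import math
import random

random.seed(3)
T, n, m = 1.0, 4, 2000        # n coarse steps, m fine steps per coarse step
N = n * m
delta = T / N
sd = math.sqrt(delta)

dW = [random.gauss(0.0, sd) for _ in range(N)]
W = [0.0]
for dw in dW:
    W.append(W[-1] + dw)

lhs = rhs = 0.0
for j in range(N):                    # fine grid point u_j = j * delta
    k = j // m                        # index of the coarse interval containing u_j
    left = k * m                      # index of the coarse point below u_j (the underlined time)
    u_bar = (k + 1) * m * delta       # next coarse point ū (here ū∧T = ū since t = T)
    lhs += (W[j] - W[left]) * delta   # left-point sum for the iterated integral (G = 1)
    rhs += (u_bar - j * delta) * dW[j]   # Itô sum for the kernel (ū − u)
```

Both sums approximate the same random variable; their difference is of order delta pathwise.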

We apply this lemma to the continuous adapted process G_t = b'(X̃_t̲) H_t. We can derive by standard
arguments

    ‖ sup_{s∈[0,t]} |∫_0^s b'(X̃_u̲) ( ∫_u̲^u H_v dW_v ) du| ‖_p ≤ C_p^{BDG} ( ∫_0^T ‖ (t̄∧T − t) b'(X̃_t̲) H_t ‖_p² dt )^{1/2}
        ≤ C_p^{BDG} ‖b'‖_sup (T/n) ( ∫_0^T ‖H_t‖_p² dt )^{1/2}
        ≤ C_{b,σ,p,T} (1 + ‖X_0‖_p) (T/n)

where we used first that 0 ≤ t̄∧T − t ≤ T/n and then (7.38). Finally, one gets that

    B ≤ C_{b,σ,p,T} (1 + |x|) (T/n)^{(1+α_{b'})/2}.

We adopt a similar approach for C. Elementary computations show that

    σ(X̃_u) − σ(X̃_u̲) − (σσ')(X̃_u̲)(W_u − W_u̲)
        = (σ'b)(X̃_u̲)(u − u̲) + ½ σ(σ')²(X̃_u̲) ( (W_u − W_u̲)² − (u − u̲) ) + ρ_σ(u) |X̃_u − X̃_u̲|^{1+α_{σ'}}

where ρ_σ(u) is an (F_u)-adapted process bounded by the Hölder coefficient [σ']_{α_{σ'}} of σ'. Consequently,
for every p ≥ 1,

    ‖ σ(X̃_u) − σ(X̃_u̲) − (σσ')(X̃_u̲)(W_u − W_u̲) ‖_p
        ≤ ‖(σ'b)(X̃_u̲)‖_p (u − u̲) + ½ ‖σ(σ')²(X̃_u̲)‖_p ‖(W_u − W_u̲)² − (u − u̲)‖_p + [σ']_{α_{σ'}} ‖ |X̃_u − X̃_u̲|^{1+α_{σ'}} ‖_p
        ≤ C_{b,σ,p,T} (1 + |x|) ( (u − u̲) + ‖Z² − 1‖_p (u − u̲) + [σ']_{α_{σ'}} (u − u̲)^{(1+α_{σ'})/2} )
        ≤ C_{b,σ,p,T} (1 + |x|) (T/n)^{(1+α_{σ'})/2}

where Z denotes an N(0; 1)-distributed random variable.

Now, owing to the BDG Inequality, we derive that for every p ≥ 2,

    C ≤ C_p^{BDG} ‖ ( ∫_0^t | σ(X̃_u) − σ(X̃_u̲) − (σσ')(X̃_u̲)(W_u − W_u̲) |² du )^{1/2} ‖_p
      ≤ C_p^{BDG} C_{b,σ,p,T} (1 + |x|) (T/n)^{(1+α_{σ'})/2}.

Finally, combining the upper bounds for B and C leads to

    f(t) ≤ ‖b'‖_sup ∫_0^t f(s) ds + C_p^{BDG} ‖σ'‖_sup ( ∫_0^t f(s)² ds )^{1/2} + C_{b,σ,p,T} (1 + |x|) (T/n)^{(1+α_{σ'}∧α_{b'})/2}

so that, by the “à la Gronwall” Lemma,

    f(T) ≤ C_{b,σ,p,T} (1 + |x|) (T/n)^{(1+α_{σ'}∧α_{b'})/2}.

Step 3 (Extension to p ∈ (0, 2) and random starting values X_0): First one uses that p ↦ ‖·‖_p is non-
decreasing to extend the above bound to p ∈ (0, 2). Then one uses that, if X_0 and W are independent, for
any non-negative functional Φ : C([0, T], R^d)² → R_+, one has, with obvious notations,

    E Φ(X, X̃) = ∫_{R^d} P_{X_0}(dx) E Φ(X^x, X̃^x).

Applying this identity with Φ(x, x̃) = sup_{t∈[0,T]} |x(t) − x̃(t)|^p completes the proof of this item.

(b) This second claim follows from the error bound established for the Brownian motion itself: as concerns
the Brownian motion, both the stepwise constant and continuous versions of the Milstein and Euler schemes
coincide. So a better convergence rate is hopeless. ♦

7.8.9 Weak error expansion for the Euler scheme by the PDE method

We make the following regularity assumption on b and σ on the one hand, and on the function f on the
other hand:

    (R_∞) ≡ b, σ ∈ C^∞([0, T] × R) and ∀ k_1, k_2 ∈ N, k_1 + k_2 ≥ 1,
           sup_{(t,x)∈[0,T]×R} ( | ∂^{k_1+k_2} b/∂t^{k_1}∂x^{k_2} (t, x) | + | ∂^{k_1+k_2} σ/∂t^{k_1}∂x^{k_2} (t, x) | ) < +∞.
In particular, b and σ are Lipschitz continuous in (t, x) ∈ [0, T] × R since, e.g., for every t, t' ∈ [0, T] and
every x, x' ∈ R,

    |b(t', x') − b(t, x)| ≤ sup_{(s,ξ)∈[0,T]×R} | ∂b/∂x (s, ξ) | |x' − x| + sup_{(s,ξ)∈[0,T]×R} | ∂b/∂t (s, ξ) | |t' − t|.

Consequently (see e.g. [27]), the SDE (7.1) always has a unique strong solution starting from any R^d-
valued random vector X_0 (independent of the Brownian motion W). Furthermore, since |b(t, x)| + |σ(t, x)| ≤
sup_{t∈[0,T]} (|b(t, 0)| + |σ(t, 0)|) + C|x| ≤ C(1 + |x|), any such strong solution (X_t)_{t∈[0,T]} satisfies (see
Proposition 7.5):

    ∀ p ≥ 1,   X_0 ∈ L^p(P) ⟹ E sup_{t∈[0,T]} |X_t|^p + sup_n E sup_{t∈[0,T]} |X̄_t^n|^p < +∞.     (7.40)

The following regularity and growth assumption is made on f:

    (F) ≡ f ∈ C^∞(R) and ∃ r ∈ N, ∃ C ∈ R*_+ such that |f(x)| ≤ C(1 + |x|^r).

The infinitesimal generator L of the diffusion is defined on every function g ∈ C^{1,2}([0, T] × R) by

    L(g)(t, x) = b(t, x) ∂g/∂x (t, x) + ½ σ²(t, x) ∂²g/∂x² (t, x).
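As a small sanity check of the generator and the parabolic PDE below (an illustrative example of our own, with hypothetical coefficients b ≡ 0 and σ(x) = x): the function u(t, x) = x² e^{T−t} satisfies ∂u/∂t + Lu = 0 with terminal condition f(x) = x², which finite differences confirm:

```python
import math

T = 1.0

def u(t, x):
    # candidate solution u(t, x) = x^2 * exp(T - t)
    return x * x * math.exp(T - t)

def Lu(t, x, eps=1e-5):
    # L g = b * dg/dx + (1/2) sigma^2 * d2g/dx2 with b = 0, sigma(x) = x (finite differences)
    d2 = (u(t, x + eps) - 2.0 * u(t, x) + u(t, x - eps)) / (eps * eps)
    return 0.5 * x * x * d2

def dtu(t, x, eps=1e-5):
    # central difference in time
    return (u(t + eps, x) - u(t - eps, x)) / (2.0 * eps)

residual = dtu(0.3, 1.7) + Lu(0.3, 1.7)   # should be close to 0
```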

Theorem 7.11 (a) If both Assumptions (R_∞) and (F) hold, the parabolic PDE

    ∂u/∂t + Lu = 0,   u(T, x) = f(x)     (7.41)

has a unique solution u ∈ C^∞([0, T] × R, R). This solution satisfies

    ∀ k ≥ 0,   sup_{t∈[0,T]} | ∂^k u/∂x^k (t, x) | ≤ C_{k,T} (1 + |x|^{r(k,T)}).

(b) Feynman-Kac's formula: The solution u admits the following representation

    ∀ t ∈ [0, T],   u(t, x) = E( f(X_T) | X_t = x ) = E f(X_T^{x,t})

where (X_s^{x,t})_{s∈[t,T]} denotes the unique solution of the SDE (7.1) starting at x at time t. If furthermore
b(t, x) = b(x) and σ(t, x) = σ(x), then

    ∀ t ∈ [0, T],   u(t, x) = E f(X_{T−t}^x).

Notation: To alleviate notation we will use throughout this section ∂_x f, ∂_t f, ∂²_{xt} f, etc., for the partial
derivatives instead of ∂f/∂x, ∂f/∂t, ∂²f/∂x∂t, . . .

⊳ Exercise. Combining the above bound for the spatial partial derivatives with ∂_t u = −Lu, show that

    ∀ k = (k_1, k_2) ∈ N²,   sup_{t∈[0,T]} | ∂^{k_1+k_2} u/∂t^{k_1}∂x^{k_2} (t, x) | ≤ C_{k,T} (1 + |x|^{r(k,T)}).

Proof. (a) For this result on PDEs, we refer to [].

(b) This representation is the so-called Feynman-Kac formula. Let u be the solution to the PDE (7.41).
It is smooth enough to apply Itô's formula. Let (X_t)_{t∈[0,T]} be the unique (strong) solution to the SDE
starting at X_0, a random vector independent of the Brownian motion W on a probability space (Ω, A, P).
Such a solution does exist since b and σ are Lipschitz continuous in x uniformly with respect to t and
continuous in (t, x).
For every t ∈ [0, T],

    u(T, X_T) = u(t, X_t) + ∫_t^T ∂_t u(s, X_s) ds + ∫_t^T ∂_x u(s, X_s) dX_s + ½ ∫_t^T ∂²_{xx} u(s, X_s) d⟨X⟩_s
              = u(t, X_t) + ∫_t^T (∂_t u + Lu)(s, X_s) ds + ∫_t^T ∂_x u(s, X_s) σ(s, X_s) dW_s
              = u(t, X_t) + ∫_t^T ∂_x u(s, X_s) σ(s, X_s) dW_s

since u satisfies the PDE (7.41). Now the local martingale M_t := ∫_0^t ∂_x u(s, X_s) σ(s, X_s) dW_s is a true
martingale since

    ⟨M⟩_t = ∫_0^t ( ∂_x u(s, X_s) )² σ²(s, X_s) ds ≤ C ( 1 + sup_{t∈[0,T]} |X_t|^θ ) ∈ L¹(P)

for an exponent θ ≥ 0. The above inequality follows from the assumptions on b and σ and the induced
growth properties of u; the integrability is a consequence of Proposition 7.31. Consequently (M_t)_{t∈[0,T]} is
a true martingale and, using the terminal condition u(T, ·) = f, one derives that

    ∀ t ∈ [0, T],   E( f(X_T) | F_t ) = u(t, X_t).

The announced representation follows from the Markov property satisfied by (X_t)_{t≥0} (with respect to the
filtration F_t = σ(X_0) ∨ F_t^W, t ≥ 0).
One shows likewise that, when b and σ do not depend on t, the (weak) uniqueness of the solutions
of (7.1) (³) implies that (X_{t+s}^{x,t})_{s∈[0,T−t]} and (X_s^x)_{s∈[0,T−t]} have the same distribution. ♦

Theorem 7.12 (Talay-Tubaro [152]): If (R_∞) and (F) hold, then there exists a non-decreasing function
C_{b,σ,f} : R_+ → R_+ such that

    max_{0≤k≤n} | E f(X̄_{t_k}^{n,x}) − E f(X_{t_k}^x) | ≤ C_{b,σ,f}(T) (T/n),   t_k = kT/n.

Remark. The result also holds under some weaker smoothness assumptions on b, σ and f (say C_b^5).

Proof. Step 1 (Representing and estimating E f(X_T^x) − E f(X̄_T^x)): To alleviate notation we temporarily
drop the superscript x in X_t^x or X̄_t^{n,x}. Noting that

    E f(X_T^x) = u(0, x)   and   E f(X̄_T^n) = E u(T, X̄_T^n),

it follows that

    E( f(X̄_T^n) − f(X_T) ) = E( u(T, X̄_T^n) − u(0, X̄_0^n) ) = Σ_{k=1}^n E( u(t_k, X̄_{t_k}^n) − u(t_{k−1}, X̄_{t_{k−1}}^n) ).

We apply Itô's formula between t_{k−1} and t_k to the function u and use that the Euler scheme satisfies the
“frozen” SDE

    dX̄_t^n = b(t̲, X̄_t̲^n) dt + σ(t̲, X̄_t̲^n) dW_t.

³ This in turn follows from the strong existence and uniqueness for this SDE, see e.g. [141].

This yields

    u(t_k, X̄_{t_k}^n) − u(t_{k−1}, X̄_{t_{k−1}}^n)
        = ∫_{t_{k−1}}^{t_k} ∂_t u(s, X̄_s^n) ds + ∫_{t_{k−1}}^{t_k} ∂_x u(s, X̄_s^n) dX̄_s^n + ½ ∫_{t_{k−1}}^{t_k} ∂²_{xx} u(s, X̄_s^n) d⟨X̄^n⟩_s
        = ∫_{t_{k−1}}^{t_k} (∂_t + L̄)u(s̲, s, X̄_s̲^n, X̄_s^n) ds + ∫_{t_{k−1}}^{t_k} σ(s̲, X̄_s̲^n) ∂_x u(s, X̄_s^n) dW_s

where L̄ is the “frozen” infinitesimal generator defined by

    L̄g(s̲, s, x̲, x) = b(s̲, x̲) ∂_x g(s, x) + ½ σ²(s̲, x̲) ∂²_{xx} g(s, x)

(and ∂_t u(s̲, s, x̲, x) := ∂_t u(s, x)). The bracket process of the local martingale M_t = ∫_0^t ∂_x u(s, X̄_s^n) σ(s̲, X̄_s̲^n) dW_s
is given for every t ∈ [0, T] by

    ⟨M⟩_T = ∫_0^T (∂_x u)²(s, X̄_s^n) σ²(s̲, X̄_s̲^n) ds.

Consequently

    ⟨M⟩_T ≤ C ( 1 + sup_{t∈[0,T]} |X̄_t^n|² + sup_{t∈[0,T]} |X̄_t^n|^r ) ∈ L¹(P)

so that (M_t)_{t∈[0,T]} is a true martingale. Consequently,

    E( u(t_k, X̄_{t_k}^n) − u(t_{k−1}, X̄_{t_{k−1}}^n) ) = E ∫_{t_{k−1}}^{t_k} (∂_t + L̄)u(s̲, s, X̄_s̲^n, X̄_s^n) ds

(the integrability of the integral term follows from ∂_t u = −Lu which ensures the polynomial growth of
(∂_t + L̄)u(s̲, s, x̲, x)). At this stage, the idea is to expand the above expectation into a term φ̄(s̲, X̄_s̲^n) T/n + O((T/n)²).
To this end we will apply Itô's formula again to ∂_t u(s, X̄_s^n), ∂_x u(s, X̄_s^n) and ∂²_{xx} u(s, X̄_s^n), taking
advantage of the regularity of u.
– Term 1. The function ∂_t u being C^{1,2}([0, T] × R), Itô's formula yields between s̲ = t_{k−1} and s

    ∂_t u(s, X̄_s) = ∂_t u(s̲, X̄_s̲) + ∫_s̲^s ( ∂²_{tt} u(r, X̄_r) + L̄(∂_t u)(r̲, r, X̄_r̲, X̄_r) ) dr + ∫_s̲^s σ(r̲, X̄_r̲) ∂²_{xt} u(r, X̄_r) dW_r.

First let us show that the local martingale term is the increment between s̲ and s of a true martingale
(denoted (M_t^{(1)}) from now on). Note that ∂_t u = −Lu so that ∂²_{xt} u = −∂_x(Lu), which is clearly a function
with polynomial growth in x uniformly in t ∈ [0, T] since

    | ∂_x ( b(t, x) ∂_x u + ½ σ²(t, x) ∂²_{xx} u ) (t, x) | ≤ C (1 + |x|^{θ₀}).

Consequently, (M_t^{(1)})_{t∈[0,T]} is a true martingale since E⟨M^{(1)}⟩_T < +∞. On the other hand, using (twice)
that ∂_t u = −Lu leads to

    ∂²_{tt} u(r, X̄_r) + L̄(∂_t u)(r̲, r, X̄_r̲, X̄_r) = −∂_t(Lu)(r, X̄_r) − (L̄ ∘ Lu)(r̲, r, X̄_r̲, X̄_r) =: φ̄^{(1)}(r̲, r, X̄_r̲, X̄_r)

where φ̄^{(1)} satisfies, for every x̲, x ∈ R and every t̲, t ∈ [0, T],

    | φ̄^{(1)}(t̲, t, x̲, x) | ≤ C₁ (1 + |x̲|^{θ₁} + |x|^{θ₁}).

This follows from the fact that φ̄^{(1)} is a linear combination of products of b, ∂_t b, ∂_x b, ∂²_{xx} b, σ, ∂_t σ,
∂_x σ, ∂²_{xx} σ, ∂_x u, ∂²_{xx} u at (t, x) or (t̲, x̲) (with “x = X̄_r” and “x̲ = X̄_r̲”).
– Term 2. The function ∂_x u being C^{1,2}, Itô's formula yields

    ∂_x u(s, X̄_s) = ∂_x u(s̲, X̄_s̲) + ∫_s̲^s ( ∂²_{tx} u(r, X̄_r) + L̄(∂_x u)(r̲, r, X̄_r̲, X̄_r) ) dr + ∫_s̲^s ∂²_{xx} u(r, X̄_r) σ(r̲, X̄_r̲) dW_r.

The stochastic integral is the increment of a true martingale (denoted (M_t^{(2)}) in what follows) and, using
that ∂²_{tx} u = ∂_x(−Lu), one shows likewise that

    ∂²_{tx} u(r, X̄_r) + L̄(∂_x u)(r̲, r, X̄_r̲, X̄_r) = ( L̄(∂_x u) − ∂_x(Lu) )(r̲, r, X̄_r̲, X̄_r) =: φ̄^{(2)}(r̲, r, X̄_r̲, X̄_r)

where (t̲, t, x̲, x) ↦ φ̄^{(2)}(t̲, t, x̲, x) has polynomial growth in (x̲, x) uniformly in t̲, t ∈ [0, T].
– Term 3. Following the same lines one shows that

    ∂²_{xx} u(s, X̄_s) = ∂²_{xx} u(s̲, X̄_s̲) + ∫_s̲^s φ̄^{(3)}(r̲, r, X̄_r̲, X̄_r) dr + M_s^{(3)} − M_s̲^{(3)}

where (M_t^{(3)}) is a true martingale and φ̄^{(3)} has polynomial growth in (x̲, x) uniformly in (t̲, t).
Step 2: Collecting all the results obtained in Step 1 yields

    u(t_k, X̄_{t_k}) − u(t_{k−1}, X̄_{t_{k−1}}) = (∂_t + L)u(t_{k−1}, X̄_{t_{k−1}}) (t_k − t_{k−1})
        + ∫_{t_{k−1}}^{t_k} ( ∫_{t_{k−1}}^s φ̄(r̲, r, X̄_r̲, X̄_r) dr ) ds
        + ∫_{t_{k−1}}^{t_k} ( (M_s^{(1)} − M_{t_{k−1}}^{(1)}) + b(t_{k−1}, X̄_{t_{k−1}})(M_s^{(2)} − M_{t_{k−1}}^{(2)}) + ½ σ²(t_{k−1}, X̄_{t_{k−1}})(M_s^{(3)} − M_{t_{k−1}}^{(3)}) ) ds
        + M_{t_k} − M_{t_{k−1}}

where

    φ̄(r̲, r, x̲, x) = φ̄^{(1)}(r̲, r, x̲, x) + b(r̲, x̲) φ̄^{(2)}(r̲, r, x̲, x) + ½ σ²(r̲, x̲) φ̄^{(3)}(r̲, r, x̲, x).

Hence the function φ̄ satisfies a polynomial growth assumption

    | φ̄(t̲, t, x̲, x) | ≤ C_φ (1 + |x̲|^θ + |x|^{θ'}),   θ, θ' ∈ N.

The first term on the right-hand side of the equality vanishes since ∂_t u + Lu = 0.
As concerns the third term, we will show that it has zero expectation. One can use Fubini's Theorem
since sup_{t∈[0,T]} |X̄_t| ∈ L^p(P) for every p > 0 (this ensures the integrability of the integrand). Consequently,

    E ∫_{t_{k−1}}^{t_k} ( (M_s^{(1)} − M_{t_{k−1}}^{(1)}) + b(t_{k−1}, X̄_{t_{k−1}})(M_s^{(2)} − M_{t_{k−1}}^{(2)}) + ½ σ²(t_{k−1}, X̄_{t_{k−1}})(M_s^{(3)} − M_{t_{k−1}}^{(3)}) ) ds
    = ∫_{t_{k−1}}^{t_k} ( E( M_s^{(1)} − M_{t_{k−1}}^{(1)} ) + E( b(t_{k−1}, X̄_{t_{k−1}})(M_s^{(2)} − M_{t_{k−1}}^{(2)}) ) + ½ E( σ²(t_{k−1}, X̄_{t_{k−1}})(M_s^{(3)} − M_{t_{k−1}}^{(3)}) ) ) ds.

Now all three expectations inside the integral are zero since the M^{(i)} are true martingales. Indeed,

    E( b(t_{k−1}, X̄_{t_{k−1}})(M_s^{(2)} − M_{t_{k−1}}^{(2)}) ) = E( b(t_{k−1}, X̄_{t_{k−1}}) E( M_s^{(2)} − M_{t_{k−1}}^{(2)} | F_{t_{k−1}} ) ) = 0

since b(t_{k−1}, X̄_{t_{k−1}}) is F_{t_{k−1}}-measurable and the inner conditional expectation vanishes, etc. Finally
the original expansion amounts to

    E( u(t_k, X̄_{t_k}) − u(t_{k−1}, X̄_{t_{k−1}}) ) = E ∫_{t_{k−1}}^{t_k} ( ∫_{t_{k−1}}^s φ̄(r̲, r, X̄_r̲, X̄_r) dr ) ds     (7.42)

so that

    | E( u(t_k, X̄_{t_k}) − u(t_{k−1}, X̄_{t_{k−1}}) ) | ≤ ∫_{t_{k−1}}^{t_k} ∫_{t_{k−1}}^s E | φ̄(r̲, r, X̄_r̲, X̄_r) | dr ds
        ≤ C_φ ( 1 + 2 E sup_{t∈[0,T]} |X̄_t^n|^{θ∨θ'} ) (t_k − t_{k−1})²/2
        ≤ C_{b,σ,f}(T) (T/n)²

where, owing to Proposition 6.6, the function C_{b,σ,f}(·) only depends on T (in a non-decreasing manner).
Summing over k = 1, . . . , n yields the result at time T.
Step 3: Let k ∈ {0, . . . , n − 1}. It follows from the obvious equalities T/n = t_k/k, k = 1, . . . , n, and what
precedes (applied with horizon t_k and k steps) that

    | E f(X̄_{t_k}^{n,x}) − E f(X_{t_k}^x) | ≤ C_{b,σ,f}(t_k) (t_k/k) ≤ C_{b,σ,f}(T) (T/n),

which yields the announced result. ♦

⊳ Exercise. Compute an explicit (closed) form for the function φ̄.

7.8.10 Toward a higher order expansion

To obtain an expansion one must come back to the identity (7.42):

    E( u(t_k, X̄_{t_k}^n) − u(t_{k−1}, X̄_{t_{k−1}}^n) ) = E ∫_{t_{k−1}}^{t_k} ( ∫_{t_{k−1}}^s φ̄(r̲, r, X̄_r̲^n, X̄_r^n) dr ) ds.

The function φ̄ can be written explicitly as a polynomial in b, σ, u and (some of) their partial derivatives.
So, under suitable assumptions on b, σ and f (like (R_∞) and (F)), one can show that φ̄ satisfies in turn:

(i) φ̄ is continuous in (t̲, t, x̲, x);

(ii) | ∂^{m+m'}_{x̲^m x^{m'}} φ̄(t̲, t, x̲, x) | ≤ C_T (1 + |x̲|^{θ(m,T)} + |x|^{θ'(m,T)}),   t̲, t ∈ [0, T];

(iii) | ∂^{m+m'}_{t̲^m t^{m'}} φ̄(t̲, t, x̲, x) | ≤ C_T (1 + |x̲|^{θ(m,T)} + |x|^{θ'(m,T)}),   t̲, t ∈ [0, T].
In fact, as above, a C^{1,2}-regularity in (t, x) is sufficient to get a second order expansion. The idea is once
again to apply Itô's formula, this time to φ̄. Let r ∈ [t_{k−1}, t_k] (so that r̲ = t_{k−1}).

    φ̄(r̲, r, X̄_r̲^n, X̄_r^n) = φ̄(r̲, r̲, X̄_r̲^n, X̄_r̲^n) + ∫_r̲^r ∂_x φ̄(r̲, v, X̄_r̲^n, X̄_v^n) dX̄_v^n
        + ∫_r̲^r ( ∂_t φ̄(r̲, v, X̄_r̲^n, X̄_v^n) + ½ ∂²_{xx} φ̄(r̲, v, X̄_r̲^n, X̄_v^n) σ²(v̲, X̄_v̲^n) ) dv
    = φ̄(r̲, r̲, X̄_r̲^n, X̄_r̲^n) + ∫_r̲^r ( ∂_t φ̄ + L̄φ̄(r̲, ·, X̄_r̲^n, ·) )(v̲, v, X̄_v̲^n, X̄_v^n) dv
        + ∫_r̲^r ∂_x φ̄(r̲, v, X̄_r̲^n, X̄_v^n) σ(v̲, X̄_v̲^n) dW_v.

The stochastic integral turns out to be a true square integrable martingale (increment) on [0, T] since

    sup_{v,r∈[0,T]} | ∂_x φ̄(r̲, v, X̄_r̲^n, X̄_v^n) σ(v̲, X̄_v̲^n) | ≤ C ( 1 + sup_{v∈[0,T]} |X̄_v^n|^{α_T} + sup_{r∈[0,T]} |X̄_r^n|^{β_T} ) ∈ L²(P).

Then, using Fubini's Theorem,

    E( u(t_k, X̄_{t_k}^n) − u(t_{k−1}, X̄_{t_{k−1}}^n) ) = E ∫_{t_{k−1}}^{t_k} ( ∫_{t_{k−1}}^s φ̄(r̲, r̲, X̄_r̲^n, X̄_r̲^n) dr ) ds
        + E ∫_{t_{k−1}}^{t_k} ∫_{t_{k−1}}^s ∫_{t_{k−1}}^r ( ∂_t φ̄ + L̄φ̄(r̲, ·, X̄_r̲^n, ·) )(v̲, v, X̄_v̲^n, X̄_v^n) dv dr ds
        + ∫_{t_{k−1}}^{t_k} ∫_{t_{k−1}}^s E( M_r − M_{t_{k−1}} ) dr ds,

where the last term is 0. Now,

    E sup_{v,r∈[0,T]} | L̄φ̄(r̲, ·, X̄_r̲^n, ·)(v̲, v, X̄_v̲^n, X̄_v^n) | < +∞

owing to (7.40) and the polynomial growth of b, σ, u and their partial derivatives. The same holds for
∂_t φ̄(r̲, v, X̄_r̲^n, X̄_v^n), so that

    | E ∫_{t_{k−1}}^{t_k} ∫_{t_{k−1}}^s ∫_{t_{k−1}}^r ( ∂_t φ̄ + L̄φ̄(r̲, ·, X̄_r̲^n, ·) )(v̲, v, X̄_v̲^n, X̄_v^n) dv dr ds | ≤ C_{b,σ,f,T} (T/n)³.

Summing up from k = 1 to n yields

    E( f(X̄_T^n) − f(X_T) ) = Σ_{k=1}^n E( u(t_k, X̄_{t_k}^n) − u(t_{k−1}, X̄_{t_{k−1}}^n) )
                           = (T/2n) E ∫_0^T φ̄(s̲, s̲, X̄_s̲^n, X̄_s̲^n) ds + O((T/n)²)
                           = (T/2n) E ∫_0^T ψ̄(s̲, X̄_s̲^n) ds + O((T/n)²)

where ψ̄(t, x) := φ̄(t, t, x, x) is at least a C^{1,2}-function. In turn, for every k ∈ {0, . . . , n}, the function
ψ̄(t_k, ·) satisfies Assumption (F) with some uniform bounds in k (this follows from the fact that its time
partial derivative exists and is continuous on [0, T]). Consequently

    max_{0≤k≤n} | E ψ̄(t_k, X̄_{t_k}^{n,x}) − E ψ̄(t_k, X_{t_k}^x) | ≤ C'_{b,σ,f}(T) (T/n)

so that

    | E ∫_0^T ψ̄(s̲, X̄_s̲^n) ds − E ∫_0^T ψ̄(s̲, X_s̲) ds | = | ∫_0^T ( E ψ̄(s̲, X̄_s̲^n) − E ψ̄(s̲, X_s̲) ) ds | ≤ T C'_{b,σ,f}(T) (T/n).

Applying Itô's formula to ψ̄(u, X_u) between s̲ and s shows that

    ψ̄(s, X_s) = ψ̄(s̲, X_s̲) + ∫_s̲^s (∂_t + L)ψ̄(r, X_r) dr + ∫_s̲^s ∂_x ψ̄(r, X_r) σ(r, X_r) dW_r,

which implies

    sup_{s∈[0,T]} | E ψ̄(s̲, X_s̲) − E ψ̄(s, X_s) | ≤ C''_{b,σ,f,T} (T/n).

Hence

    | E ∫_0^T ψ̄(s̲, X_s̲) ds − E ∫_0^T ψ̄(s, X_s) ds | ≤ C''_{b,σ,f,T} (T²/n).

Finally, combining all these bounds yields

    E( f(X̄_T^{n,x}) − f(X_T^x) ) = (T/2n) ∫_0^T E ψ̄(s, X_s^x) ds + O(1/n²),

which completes the proof. ♦

Remark. One can make ψ̄ explicit and by induction we eventually obtain the following theorem.

Theorem 7.13 (Talay-Tubaro, 1989) If both (F) and (R_∞) hold, then the weak error can be expanded at
any order, that is,

    ∀ R ∈ N*,   E f(X̄_T^n) − E f(X_T) = Σ_{k=1}^R a_k/n^k + O(1/n^{R+1}).

The main application is of course the Richardson-Romberg extrapolation developed in Section 7.7. For more
recent results on weak errors we refer to [69].
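The first-order expansion and the Richardson-Romberg idea can be seen without any simulation on dX_t = X_t dW_t with f(x) = x² (an illustrative computation of ours, not taken from the text), since E[(X̄_T^n)²] = (1 + T/n)^n is explicit while E X_T² = e^T:

```python
import math

T = 1.0
exact = math.exp(T)   # E[X_T^2] for dX = X dW, X_0 = 1

def euler_mean_square(n):
    # E[(X̄_T^n)^2] = prod_k E(1 + ΔW_k)^2 = (1 + T/n)^n : closed form, no Monte Carlo
    return (1.0 + T / n) ** n

def richardson_romberg(n):
    # 2 E f(X̄^{2n}) - E f(X̄^{n}) kills the a_1/n term of the weak error expansion
    return 2.0 * euler_mean_square(2 * n) - euler_mean_square(n)

err_n = abs(euler_mean_square(100) - exact)
err_2n = abs(euler_mean_square(200) - exact)
err_rr = abs(richardson_romberg(100) - exact)
```

Doubling n halves the weak error (ratio close to 2), while the extrapolated value is an order of magnitude more accurate, as predicted by Theorem 7.13.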

Remark. What happens when f is no longer smooth? In fact the result still holds provided the diffusion
coefficient satisfies a uniform ellipticity condition (or at least a uniform Hörmander hypo-ellipticity
assumption). This is the purpose of Bally-Talay's Theorem, see [9].
Chapter 8

The diffusion bridge method:


application to the pricing of
path-dependent options (II)

8.1 Theoretical results about time discretization and path-dependent payoffs

In this section we deal with some “path-dependent” (European) options. Such contracts are charac-
terized by the fact that their payoffs depend on the whole past of the underlying asset(s) between
the origin t = 0 of the contract and its maturity T. This means that these payoffs are of the
form F((X_t)_{t∈[0,T]}) where F is a functional, usually naturally defined from D([0, T], R^d) → R_+, and
X = (X_t)_{t∈[0,T]} stands for the dynamics of the underlying asset. We will assume from now on that
X = (X_t^x)_{t∈[0,T]} is still a solution to an SDE of type (7.1), i.e.:

    dX_t = b(t, X_t) dt + σ(t, X_t) dW_t,   X_0 = x.

In recent years, several papers have provided weak rates of convergence for some families of
functionals F. These works were essentially motivated by the pricing of path-dependent (European)
options, like Asian or lookback options in one dimension, corresponding to functionals defined for
every α ∈ D([0, T], R^d) by

    F(α) := f( ∫_0^T α(s) ds ),      F(α) := f( α(T), sup_{t∈[0,T]} α(t), inf_{t∈[0,T]} α(t) )

and, this time possibly in higher dimension, to barrier options with functionals of the form

F (α) = f (α(T ))1{τD (α)>T }

where D is an (open) domain of R^d and τ_D(α) := inf{s ∈ [0, T] : α(s) or α(s−) ∉ D} is the escape
time from D by the generic càdlàg path α (¹). In both frameworks f is usually Lipschitz

¹ When α is continuous or stepwise constant and càdlàg, τ_D(α) := inf{s ∈ [0, T] : α(s) ∉ D}.


continuous. Let us quote two well-known examples of results (in a homogeneous framework i.e.
b(t, x) = b(x) and σ(t, x) = σ(x)):
– In [63] is established the following theorem.

Theorem 8.1 (a) If the domain D is bounded and has a smooth enough boundary (in fact C³), if
b ∈ C³(R^d, R^d), σ ∈ C³(R^d, M(d, q, R)), σ uniformly elliptic on D (i.e. σσ*(x) ≥ ε₀ I_d, ε₀ > 0),
then for every bounded Borel function f vanishing in a neighbourhood of ∂D,

    E( f(X_T) 1_{{τ_D(X) > T}} ) − E( f(X̃_T^n) 1_{{τ_D(X̃^n) > T}} ) = O(1/√n)   as n → +∞     (8.1)

(where X̃ denotes the stepwise constant Euler scheme).

(b) If furthermore b and σ are C⁵ and D is a half-space, then the continuous Euler scheme satisfies

    E( f(X_T) 1_{{τ_D(X) > T}} ) − E( f(X̄_T^n) 1_{{τ_D(X̄^n) > T}} ) = O(1/n)   as n → +∞.     (8.2)

Note however that these assumptions are not satisfied by usual barrier options (see below).
– It is suggested in [150] (including a rigorous proof when X = W) that if b, σ ∈ C_b⁴(R, R), σ is
uniformly elliptic and f ∈ C⁴'²_pol(R²) (existing partial derivatives with polynomial growth), then

    E f( X_T, max_{t∈[0,T]} X_t ) − E f( X̃_T^n, max_{0≤k≤n} X̃_{t_k}^n ) = O(1/√n)   as n → +∞.     (8.3)

A similar improvement – an O(1/n) rate – can be expected (but this remains a conjecture) when
replacing X̃ by the continuous Euler scheme X̄, namely

    E f( X_T, max_{t∈[0,T]} X_t ) − E f( X̄_T^n, max_{t∈[0,T]} X̄_t^n ) = O(1/n)   as n → +∞.

If we forget about the regularity assumptions, the formal intersection between these two classes
of path-dependent options is not empty since the payoff of a barrier option with domain D = (−∞, L)
can be written

    f(X_T) 1_{{τ_D(X) > T}} = g( X_T, sup_{t∈[0,T]} X_t )   with   g(x, y) = f(x) 1_{{y < L}}.

Unfortunately, such a function g is never a smooth function, so even if the second result is true it
does not solve the first one.
By contrast with the “vanilla case”, these results are somewhat disappointing since they point
out that the weak error obtained with the stepwise constant Euler scheme is not significantly
better than the strong error (the only gain is a √(1 + log n) factor). The positive part is that we
can reasonably hope that using the continuous Euler scheme yields again the O(1/n) rate and the
expansion of the time discretization error, provided we know how to simulate this scheme.
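The √n-type gap between the discrete-time maximum and the true maximum is easy to observe on the Brownian motion itself, where E sup_{[0,T]} W_t = E|W_T| = √(2T/π) by the reflection principle (an illustrative sketch of ours; the sample sizes are arbitrary choices):

```python
import math
import random

random.seed(5)
T, M = 1.0, 20_000
exact = math.sqrt(2.0 * T / math.pi)   # E sup_{[0,T]} W_t = E|W_T| (reflection principle)

def discrete_max_mean(n):
    """Monte Carlo estimate of E max_{0<=k<=n} W_{t_k} on the uniform grid."""
    sd = math.sqrt(T / n)
    acc = 0.0
    for _ in range(M):
        w = mx = 0.0                   # the grid contains t_0 = 0
        for _ in range(n):
            w += random.gauss(0.0, sd)
            if w > mx:
                mx = w
        acc += mx
    return acc / M

bias_25 = exact - discrete_max_mean(25)
bias_100 = exact - discrete_max_mean(100)
```

The bias roughly halves when n is multiplied by 4, i.e. an O(n^{-1/2}) rate, which is why the stepwise constant scheme cannot do better for maximum-type payoffs.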

8.2 From Brownian to diffusion bridge: how to simulate functionals of the genuine Euler scheme

To take advantage of the above rates, we do not in fact need to simulate the continuous Euler
scheme itself (which is meaningless) but only some specific functionals of this scheme like the maximum,
the minimum, the time average, etc., between two time discretization instants t_k^n and t_{k+1}^n, given
the (simulated) values of the stepwise constant Euler scheme. First we deal with the standard
Brownian motion itself.

8.2.1 The Brownian bridge method


We denote by (FtW )t≥0 the (completed) natural filtration of a standard Brownian motion W . Keep
in mind that, owing to the Kolmogorov 0-1 law, this filtration coincides with the “naive” natural
filtration of W up to the negligible sets of the probability space on which W is defined.
Proposition 8.1 Let W = (W_t)_{t≥0} be a standard Brownian motion.
(a) Let T > 0. Then the so-called standard Brownian bridge on [0, T] defined by

    Y_t^{W,T} := W_t − (t/T) W_T,   t ∈ [0, T],     (8.4)

is an F_T^W-measurable centered Gaussian process, independent of (W_{T+s})_{s≥0}, characterized by its
covariance structure

    E( Y_s^{W,T} Y_t^{W,T} ) = s∧t − st/T = (s∧t)(T − s∨t)/T,   0 ≤ s, t ≤ T.

(b) Let 0 ≤ T_0 < T_1 < +∞. Then

    L( (W_t)_{t∈[T_0,T_1]} | W_s, s ∉ (T_0, T_1) ) = L( (W_t)_{t∈[T_0,T_1]} | W_{T_0}, W_{T_1} )

so that (W_t)_{t∈[T_0,T_1]} and (W_s)_{s∉(T_0,T_1)} are independent given (W_{T_0}, W_{T_1}), and this conditional
distribution is given by

    L( (W_t)_{t∈[T_0,T_1]} | W_{T_0} = x, W_{T_1} = y ) = L( ( x + ((t − T_0)/(T_1 − T_0))(y − x) + Y_{t−T_0}^{B, T_1−T_0} )_{t∈[T_0,T_1]} )

where B is a generic standard Brownian motion.
Proof. (a) The process Y^{W,T} is clearly centered since W is. Elementary computations based on
the covariance structure of the standard Brownian motion,

    E( W_s W_t ) = s∧t,   s, t ∈ [0, T],

show that, for every s, t ∈ [0, T],

    E( Y_t^{W,T} Y_s^{W,T} ) = E W_t W_s − (s/T) E W_t W_T − (t/T) E W_s W_T + (st/T²) E W_T²
                            = s∧t − st/T − ts/T + ts/T
                            = s∧t − ts/T
                            = (s∧t)(T − s∨t)/T.
Likewise one shows that, for every u ≥ T, E( Y_t^{W,T} W_u ) = 0, so that Y_t^{W,T} ⊥ span(W_u, u ≥ T).
Now, W being a Gaussian process, independence and absence of correlation coincide in the closed
subspace of L²(Ω, A, P) spanned by W. The process Y^{W,T} belongs to this space by construction.
Consequently Y^{W,T} is independent of (W_{T+u})_{u≥0}.
(b) First note that for every t ∈ [T_0, T_1],

    W_t = W_{T_0} + W_{t−T_0}^{(T_0)}

where W_t^{(T_0)} := W_{T_0+t} − W_{T_0}, t ≥ 0, is a standard Brownian motion, independent of F_{T_0}^W. Rewrit-
ing (8.4) for W^{(T_0)} leads to

    W_s^{(T_0)} = ( s/(T_1 − T_0) ) W_{T_1−T_0}^{(T_0)} + Y_s^{W^{(T_0)}, T_1−T_0}.

Plugging this identity into the above equality (at time s = t − T_0) leads to

    W_t = W_{T_0} + ((t − T_0)/(T_1 − T_0)) (W_{T_1} − W_{T_0}) + Y_{t−T_0}^{W^{(T_0)}, T_1−T_0}.     (8.5)

It follows from (a) that the process Ỹ := (Y_{t−T_0}^{W^{(T_0)}, T_1−T_0})_{t∈[T_0,T_1]} is a Gaussian process, measurable
with respect to F_{T_1−T_0}^{W^{(T_0)}}; hence it is independent of F_{T_0}^W since W^{(T_0)} is. Consequently Ỹ is
independent of (W_t)_{t∈[0,T_0]}. Furthermore, Ỹ is independent of (W_{T_1−T_0+u}^{(T_0)})_{u≥0} by (a), i.e. it is
independent of (W_{T_1+u})_{u≥0} since W_{T_1+u} = W_{T_1−T_0+u}^{(T_0)} + W_{T_0}.
Finally, the same argument (no correlation implies independence in the Gaussian space spanned
by W) implies the independence of Ỹ from σ(W_s, s ∉ (T_0, T_1)). The end of the proof follows from
the above identity (8.5) and the exercises below. ♦
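The covariance structure and the independence from W_T stated in Proposition 8.1(a) can be checked by direct simulation (a sketch with illustrative parameters s = 0.3, t = 0.7, T = 1 of our choosing):

```python
import math
import random

random.seed(6)
M = 100_000
s, t, T = 0.3, 0.7, 1.0

acc_cov = acc_ind = 0.0
for _ in range(M):
    # Build (W_s, W_t, W_T) from independent Brownian increments
    ws = random.gauss(0.0, math.sqrt(s))
    wt = ws + random.gauss(0.0, math.sqrt(t - s))
    wT = wt + random.gauss(0.0, math.sqrt(T - t))
    ys = ws - (s / T) * wT          # Y_s^{W,T} = W_s - (s/T) W_T
    yt = wt - (t / T) * wT
    acc_cov += ys * yt              # the bridge is centered, so this estimates Cov(Y_s, Y_t)
    acc_ind += yt * wT              # should vanish: the bridge is uncorrelated with W_T
emp_cov = acc_cov / M
emp_ind = acc_ind / M
theo_cov = min(s, t) * (T - max(s, t)) / T   # (s∧t)(T - s∨t)/T = 0.09 here
```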

 Exercises. 1. Let X, Y , Z be three random vectors defined on a probability space (Ω, A, P)


taking values in RkX , RkY and RkZ respectively. Assume that Y and (X, Z) are independent.
Show, using (8.5) that for every bounded Borel function f : RkY → R and every Borel function
g : RkX → RkY ,
E(f (g(X) + Y ) | (X, Z)) = E(f (g(X) + Y ) | X).
Deduce that L(g(X) + Y | (X, Z)) = L(g(X) + Y | X).
2. Deduce from the previous exercise that

L( (W_t)_{t∈[T_0,T_1]} | (W_t)_{t∉(T_0,T_1)} ) = L( (W_t)_{t∈[T_0,T_1]} | (W_{T_0}, W_{T_1}) ).

[Hint: consider the finite-dimensional marginal distributions (W_{t_1}, ..., W_{t_n}) given (X, W_{s_1}, ..., W_{s_p}) where t_i ∈ [T_0, T_1], i = 1, ..., n, s_j ∈ [T_0, T_1]^c, j = 1, ..., p, and X = (W_{T_0}, W_{T_1}), and use the decomposition (8.5).]
3. The conditional distribution of (W_t)_{t∈[T_0,T_1]} given W_{T_0}, W_{T_1} is that of a Gaussian process, hence it can also be characterized by its expectation and covariance structure. Show that these are given respectively by

E( W_t | W_{T_0} = x, W_{T_1} = y ) = ((T_1 − t)/(T_1 − T_0)) x + ((t − T_0)/(T_1 − T_0)) y,   t ∈ [T_0, T_1],
8.2. FROM BROWNIAN TO DIFFUSION BRIDGE: HOW TO SIMULATE FUNCTIONALS OF THE GENUINE
and

Cov( W_t, W_s | W_{T_0} = x, W_{T_1} = y ) = (T_1 − t)(s − T_0)/(T_1 − T_0),   s ≤ t, s, t ∈ [T_0, T_1].
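The conditional mean and covariance above can be checked by a quick simulation. The sketch below is ours, not from the text: it builds the standard Brownian bridge Y_u = W_u − (u/T) W_T from Gaussian increments and estimates Cov(Y_s, Y_t), which should be close to s(T − t)/T (the covariance formula with T_0 = 0, T_1 = T and x = y = 0).

```python
import math, random

def bridge_cov_estimate(s, t, T, N=200_000, seed=1):
    """Estimate Cov(Y_s, Y_t) for the Brownian bridge Y_u = W_u - (u/T) W_T,
    s <= t, by simulating (W_s, W_t, W_T) from independent Gaussian increments."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(N):
        ws = math.sqrt(s) * rng.gauss(0.0, 1.0)
        wt = ws + math.sqrt(t - s) * rng.gauss(0.0, 1.0)
        wT = wt + math.sqrt(T - t) * rng.gauss(0.0, 1.0)
        ys = ws - (s / T) * wT
        yt = wt - (t / T) * wT
        acc += ys * yt            # both bridge values are centered
    return acc / N
```

For s = 0.3, t = 0.7, T = 1 the target value is 0.3 × 0.3 = 0.09.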

8.2.2 The diffusion bridge (bridge of the Euler scheme)

Now we come back to the (continuous) Euler scheme.
Proposition 8.2 (Bridge of the Euler scheme) Assume that σ(t, x) ≠ 0 for every t ∈ [0, T], x ∈ R.
(a) The processes (X̄_t)_{t∈[t^n_k, t^n_{k+1}]}, k = 0, ..., n−1, are conditionally independent given the σ-field σ(X̄_{t^n_k}, k = 0, ..., n).
(b) Furthermore, for every k ∈ {0, ..., n−1}, the conditional distribution is given by

L( (X̄_t)_{t∈[t^n_k, t^n_{k+1}]} | X̄_{t^n_k} = x_k, X̄_{t^n_{k+1}} = x_{k+1} ) = L( ( x_k + ((t − t^n_k)/(t^n_{k+1} − t^n_k)) (x_{k+1} − x_k) + σ(t^n_k, x_k) Y^{B, T/n}_{t−t^n_k} )_{t∈[t^n_k, t^n_{k+1}]} )

where (Y_s^{B, T/n})_{s∈[0, T/n]} is a Brownian bridge (related to a generic Brownian motion B) as defined by (8.4). The distribution of this Gaussian process (sometimes called a diffusion bridge) is entirely characterized by:
– its expectation function

( x_k + ((t − t^n_k)/(t^n_{k+1} − t^n_k)) (x_{k+1} − x_k) )_{t∈[t^n_k, t^n_{k+1}]}

and
– its covariance operator

σ²(t^n_k, x_k) (s ∧ t − t^n_k)(t^n_{k+1} − s ∨ t)/(t^n_{k+1} − t^n_k),   s, t ∈ [t^n_k, t^n_{k+1}].

Proof. Elementary computations show that, for every t ∈ [t^n_k, t^n_{k+1}],

X̄_t = X̄_{t^n_k} + ((t − t^n_k)/(t^n_{k+1} − t^n_k)) (X̄_{t^n_{k+1}} − X̄_{t^n_k}) + σ(t^n_k, X̄_{t^n_k}) Y^{W^{(t^n_k)}, T/n}_{t−t^n_k}

(with in mind that t^n_{k+1} − t^n_k = T/n). Consequently the conditional independence claim will follow if the processes (Y^{W^{(t^n_k)}, T/n}_t)_{t∈[0, T/n]}, k = 0, ..., n−1, are independent given σ(X̄_{t^n_ℓ}, ℓ = 0, ..., n). Now, it follows from the assumption on σ that

σ(X̄_{t^n_ℓ}, ℓ = 0, ..., n) = σ(X_0, W_{t^n_ℓ}, ℓ = 1, ..., n).

So we have to establish the conditional independence of the processes (Y^{W^{(t^n_k)}, T/n}_t)_{t∈[0, T/n]}, k = 0, ..., n−1, given σ(X_0, W_{t^n_k}, k = 1, ..., n), or equivalently given σ(W_{t^n_k}, k = 1, ..., n), since X_0

and W are independent (note that all the above bridges are F^W_T-measurable). First note that all the bridges (Y^{W^{(t^n_k)}, T/n}_t)_{t∈[0, T/n]}, k = 0, ..., n−1, and W live in a Gaussian space. We know from Proposition 8.1(a) that each bridge (Y^{W^{(t^n_k)}, T/n}_t)_{t∈[0, T/n]} is independent of both F^W_{t^n_k} and σ(W_{t^n_{k+1}+s} − W_{t^n_k}, s ≥ 0), hence in particular of σ({W_{t^n_ℓ}, ℓ = 1, ..., n}) (we use here a specificity of Gaussian processes). On the other hand, all the bridges are mutually independent since they are built from independent Brownian motions (W^{(t^n_k)}_t)_{t∈[0, T/n]}. Hence the bridges (Y^{W^{(t^n_k)}, T/n}_t)_{t∈[0, T/n]}, k = 0, ..., n−1, are i.i.d. and independent of σ(W_{t^n_k}, k = 1, ..., n).
Now X̄_{t^n_k} is σ({W_{t^n_ℓ}, ℓ = 1, ..., k})-measurable; consequently σ(X̄_{t^n_k}, W_{t^n_k}, X̄_{t^n_{k+1}}) ⊂ σ({W_{t^n_ℓ}, ℓ = 1, ..., n}), so that (Y^{W^{(t^n_k)}, T/n}_t)_{t∈[0, T/n]} is independent of (X̄_{t^n_k}, W_{t^n_k}, X̄_{t^n_{k+1}}). The conclusion follows. ♦

We now know the distribution of the continuous Euler scheme between two successive discretization times t^n_k and t^n_{k+1}, conditionally on the values of the Euler scheme at its discretization times. We are thus in position to simulate some functionals of the continuous Euler scheme, in particular its supremum.

Proposition 8.3 The distribution of the supremum of the Brownian bridge starting at 0 and arriving at y at time T, defined by Y_t^{W,T,y} = (t/T) y + W_t − (t/T) W_T on [0, T], is given by

P( sup_{t∈[0,T]} Y_t^{W,T,y} ≤ z ) = 1 − exp( −(2/T) z(z − y) )   if z ≥ max(y, 0),
                                   = 0                            if z ≤ max(y, 0).

Proof. The key is to have in mind that the distribution of Y^{W,T,y} is the conditional distribution of W given {W_T = y}. So we can derive the result from an expression of the joint distribution of (sup_{t∈[0,T]} W_t, W_T), e.g. from

P( sup_{t∈[0,T]} W_t ≥ z, W_T ≤ y ).

It is well known from the symmetry principle that, for every z ≥ max(y, 0),

P( sup_{t∈[0,T]} W_t ≥ z, W_T ≤ y ) = P( W_T ≥ 2z − y ).

We briefly reproduce the proof for the reader's convenience. One introduces the hitting time τ_z := inf{ s > 0 | W_s = z }. This random time is in fact a stopping time with respect to the filtration (F^W_t) of the Brownian motion W, since it is the hitting time of the closed set [z, +∞) by the continuous process W (this uses that z ≥ 0). Furthermore, τ_z is a.s. finite since lim sup_t W_t = +∞ a.s. Consequently, still by continuity of its paths, W_{τ_z} = z a.s. and (W_{τ_z+t} − W_{τ_z})_{t≥0} is independent of F^W_{τ_z}. As a consequence, for every z ≥ max(y, 0),

P( sup_{t∈[0,T]} W_t ≥ z, W_T ≤ y ) = P( τ_z ≤ T, W_T − W_{τ_z} ≤ y − z )
  = P( τ_z ≤ T, −(W_T − W_{τ_z}) ≤ y − z )
  = P( τ_z ≤ T, W_T ≥ 2z − y )
  = P( W_T ≥ 2z − y )

since 2z − y ≥ z. Consequently, one may write for every z ≥ max(y, 0),

P( sup_{t∈[0,T]} W_t ≥ z, W_T ≤ y ) = ∫_{2z−y}^{+∞} h(ξ) dξ

where h(ξ) = e^{−ξ²/(2T)} / √(2πT). Hence, since the involved functions are differentiable, one has

P( sup_{t∈[0,T]} W_t ≥ z | W_T = y ) = lim_{η→0} [ ( P(W_T ≥ 2z − (y+η)) − P(W_T ≥ 2z − y) )/η ] / [ ( P(W_T ≤ y+η) − P(W_T ≤ y) )/η ]
  = ( ∂/∂y P(W_T ≥ 2z − y) ) / ( ∂/∂y P(W_T ≤ y) ) = h(2z − y) / h(y)
  = e^{−((2z−y)² − y²)/(2T)} = e^{−2z(z−y)/T}. ♦
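One can confront this formula with a crude Monte Carlo experiment (our illustration; the function name is ours). The maximum of the bridge over a fine grid slightly underestimates the true supremum, so only loose agreement can be expected.

```python
import math, random

def bridge_sup_cdf_mc(z, y, T=1.0, steps=600, N=5000, seed=3):
    """Estimate P(sup_t Y_t^{W,T,y} <= z), with Y_t = (t/T) y + W_t - (t/T) W_T,
    by taking the maximum of the bridge over a regular time grid."""
    rng = random.Random(seed)
    h = T / steps
    count = 0
    for _ in range(N):
        w = [0.0]
        for _ in range(steps):
            w.append(w[-1] + math.sqrt(h) * rng.gauss(0.0, 1.0))
        wT = w[-1]
        sup = max((i * h / T) * y + w[i] - (i * h / T) * wT
                  for i in range(steps + 1))
        if sup <= z:
            count += 1
    return count / N
```

For z = 1, y = 0.5, T = 1 the closed form gives 1 − e^{−2·1·0.5} = 1 − e^{−1} ≈ 0.632.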

Corollary 8.1 Let λ > 0 and let x, y ∈ R. If Y^{W,T} denotes the standard Brownian bridge related to W between 0 and T, then for every z ∈ R,

P( sup_{t∈[0,T]} ( x + (t/T)(y − x) + λ Y_t^{W,T} ) ≤ z ) = 1 − exp( −(2/(T λ²)) (z − x)(z − y) )   if z ≥ max(x, y),
                                                         = 0                                      if z < max(x, y).   (8.6)

Proof. First note that

x + (t/T)(y − x) + λ Y_t^{W,T} = λ ( x′ + (t/T) y′ + Y_t^{W,T} )   with x′ = x/λ, y′ = (y − x)/λ.

Then one concludes by the previous proposition, using that for any real-valued random variable ξ, every α ∈ R and every β ∈ (0, ∞),

P( α + β ξ ≤ z ) = P( ξ ≤ (z − α)/β ) = 1 − P( ξ > (z − α)/β ). ♦

Exercise. Show, using that −W (d)= W, that

P( inf_{t∈[0,T]} ( x + (t/T)(y − x) + λ Y_t^{W,T} ) ≤ z ) = exp( −(2/(T λ²)) (z − x)(z − y) )   if z ≤ min(x, y),
                                                          = 1                                   if z > min(x, y).

8.2.3 Application to Lookback like path-dependent options

In this section, we focus on Lookback like options (including general barrier options), i.e. exotic options related to payoffs of the form h_T := f( X_T, sup_{t∈[0,T]} X_t ).

We want to compute an approximation of e^{−rT} E h_T using a Monte Carlo simulation based on the continuous-time Euler scheme, i.e. we want to compute e^{−rT} E f( X̄^n_T, sup_{t∈[0,T]} X̄^n_t ). The key is to simulate efficiently the couple ( X̄^n_T, sup_{t∈[0,T]} X̄^n_t ). We first note, owing to the chaining rule for conditional expectation, that

e^{−rT} E f( X̄^n_T, sup_{t∈[0,T]} X̄^n_t ) = e^{−rT} E f( X̄^n_T, max_{0≤k≤n−1} M^{n,k}_{X̄_{t^n_k}, X̄_{t^n_{k+1}}} )

where

M^{n,k}_{x,y} := sup_{t∈[0,T/n]} ( x + (nt/T)(y − x) + σ(t^n_k, x) Y_t^{W^{(t^n_k)}, T/n} ).

It follows from Proposition 8.2 that the random variables M^{n,k}_{X̄_{t^n_k}, X̄_{t^n_{k+1}}}, k = 0, ..., n−1, are conditionally independent given X̄_{t^n_k}, k = 0, ..., n. Following Corollary 8.1, the distribution function G^{n,k}_{x,y} of M^{n,k}_{x,y} is given by

G^{n,k}_{x,y}(z) = ( 1 − exp( −(2n/(T σ²(t^n_k, x))) (z − x)(z − y) ) ) 1_{{z ≥ max(x, y)}},   z ∈ R.

Then the inverse distribution simulation rule yields

sup_{t∈[0,T/n]} ( x + (nt/T)(y − x) + σ(t^n_k, x) Y_t^{W^{(t^n_k)}, T/n} ) (d)= (G^{n,k}_{x,y})^{−1}(U),   U (d)= U([0, 1]),
                                                                            (d)= (G^{n,k}_{x,y})^{−1}(1 − U),   (8.7)

where we used that U (d)= 1 − U. From a computational viewpoint, ζ := (G^{n,k}_{x,y})^{−1}(1 − u) is the admissible solution (i.e. such that ζ ≥ max(x, y)) of the equation

1 − exp( −(2n/(T σ²(t^n_k, x))) (ζ − x)(ζ − y) ) = 1 − u

or, equivalently, of

ζ² − (x + y) ζ + x y + (T/(2n)) σ²(t^n_k, x) log(u) = 0.

This leads to

(G^{n,k}_{x,y})^{−1}(1 − u) = (1/2) ( x + y + √( (x − y)² − 2 T σ²(t^n_k, x) log(u)/n ) ).

Finally,

L( max_{t∈[0,T]} X̄_t | { X̄_{t^n_k} = x_k, k = 0, ..., n } ) = L( max_{0≤k≤n−1} (G^{n,k}_{x_k, x_{k+1}})^{−1}(1 − U_k) )

where (U_k)_{0≤k≤n−1} are i.i.d. random variables, uniformly distributed over the unit interval.
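In code, the inversion above reads as follows (our sketch; sig2 stands for σ(t^n_k, x)²):

```python
import math

def G_inv(x, y, sig2, T, n, u):
    """(G^{n,k}_{x,y})^{-1}(1-u): admissible root (>= max(x,y)) of the quadratic
    zeta^2 - (x+y) zeta + x*y + (T/(2n)) * sig2 * log(u) = 0."""
    disc = (x - y) ** 2 - 2.0 * T * sig2 * math.log(u) / n
    return 0.5 * (x + y + math.sqrt(disc))

def G(x, y, sig2, T, n, z):
    """Conditional distribution function of the interval maximum (Corollary 8.1)."""
    if z < max(x, y):
        return 0.0
    return 1.0 - math.exp(-2.0 * n * (z - x) * (z - y) / (T * sig2))
```

By construction, G_inv returns a value dominating both endpoints, and plugging it back into G recovers 1 − u.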

Pseudo-code for Lookback like options.

We assume for the sake of simplicity that the interest rate r is 0. By Lookback like options we mean the class of options whose payoff possibly involves both X̄_T and the maximum of (X̄_t) over [0, T], i.e.

E f( X̄_T, sup_{t∈[0,T]} X̄_t ).

The regular Call on maximum is obtained by setting f(x, y) = (y − K)_+, the regular maximum Lookback option by setting f(x, y) = y − x.
We want to compute a Monte Carlo approximation of E f( X̄_T, sup_{t∈[0,T]} X̄_t ) using the continuous Euler scheme. We reproduce below a pseudo-script to illustrate how to use the above result on the conditional distribution of the maximum of the Brownian bridge.

• Set S^f_0 = 0.
for m = 1 to M
  • Simulate a path of the discrete time Euler scheme and set x_k := X̄^{(m)}_{t^n_k}, k = 0, ..., n.
  • Simulate Ξ^{(m)} := max_{0≤k≤n−1} (G^{n,k}_{x_k, x_{k+1}})^{−1}(1 − U^{(m)}_k), where (U^{(m)}_k)_{0≤k≤n−1} are i.i.d. with U([0, 1]) distribution.
  • Compute f( X̄^{(m)}_T, Ξ^{(m)} ).
  • Compute S^f_m := f( X̄^{(m)}_T, Ξ^{(m)} ) + S^f_{m−1}.
end(m)
• Eventually,

E f( X̄_T, sup_{t∈[0,T]} X̄_t ) ≈ S^f_M / M,

at least for large enough M (2).
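The pseudo-script above can be turned into a minimal Python sketch (ours), assuming r = 0, a driftless Black–Scholes-type coefficient σ(t, x) = σx and the Call-on-maximum payoff f(x, y) = (y − K)_+; all names are illustrative.

```python
import math, random

def lookback_call_mc(x0, K, sigma, T, n, M, seed=42):
    """MC price (r = 0) of a Call on the maximum, E[(sup_t Xbar_t - K)+],
    combining a discrete Euler path with an exact draw, on each interval, of the
    conditional maximum of the Brownian bridge via the inverse CDF (8.7)."""
    rng = random.Random(seed)
    h = T / n
    acc = 0.0
    for _ in range(M):
        # discrete Euler scheme for dX = sigma(t,x) dW with sigma(t,x) = sigma*x
        x = [x0]
        for _ in range(n):
            x.append(x[-1] + sigma * x[-1] * math.sqrt(h) * rng.gauss(0.0, 1.0))
        sup_path = max(x)
        for k in range(n):
            u = 1.0 - rng.random()              # uniform on (0, 1]
            s2 = (sigma * x[k]) ** 2            # sigma(t_k, x_k)^2
            disc = (x[k] - x[k + 1]) ** 2 - 2.0 * T * s2 * math.log(u) / n
            sup_path = max(sup_path, 0.5 * (x[k] + x[k + 1] + math.sqrt(disc)))
        acc += max(sup_path - K, 0.0)
    return acc / M
```

With K far above any reachable level the price collapses to 0, while for K = X_0 it yields a sizeable premium.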
Once one can simulate sup_{t∈[0,T]} X̄_t (and its minimum, see the exercise below), it is easy to price by simulation the exotic options mentioned in the former section (Lookback, options on maximum), but also the barrier options, since one can decide whether or not the continuous Euler scheme strikes a barrier (up or down). The Brownian bridge is also involved in the methods designed for pricing Asian options.
Exercise. Derive a formula similar to (8.7) for the conditional distribution of the minimum of the continuous Euler scheme, using now the inverse distribution functions

(F^{n,k}_{x,y})^{−1}(u) = (1/2) ( x + y − √( (x − y)² − 2 T σ²(t^n_k, x) log(u)/n ) ),   u ∈ [0, 1],
2. ... Of course one needs to compute the empirical variance, (approximately) given by

(1/M) Σ_{m=1}^M f(Ξ^{(m)})² − ( (1/M) Σ_{m=1}^M f(Ξ^{(m)}) )²,

in order to design a confidence interval, without which the method simply makes no sense...

of !
t W,T /n
inf x+ T
(y − x) + σ(x)Yt .
t∈[0,T /n]
n

8.2.4 Application to regular barrier options: variance reduction by pre-conditioning

By regular barrier options we mean barrier options having a constant level as a barrier. An up-and-out Call is a typical example of such options, with payoff given by

h_T = (X_T − K)_+ 1_{{sup_{t∈[0,T]} X_t ≤ L}}

where K denotes the strike price of the option and L (L > K) its barrier.
In practice, the “Call” part is activated at T only if the process (X_t) does not hit the barrier L between 0 and T. In fact, as far as simulation is concerned, this “Call part” can be replaced by any Borel function f such that both f(X_T) and f(X̄_T) are integrable (this is always true if f has polynomial growth, owing to Proposition 7.2). Note that these so-called barrier options are in fact a sub-class of generalized maximum Lookback options, having the specificity that the maximum only shows up through an indicator function.
Then one may derive a general weighted formula for E( f(X̄_T) 1_{{sup_{t∈[0,T]} X̄_t ≤ L}} ), which is an approximation of E( f(X_T) 1_{{sup_{t∈[0,T]} X_t ≤ L}} ).
Proposition 8.4

E( f(X̄_T) 1_{{sup_{t∈[0,T]} X̄_t ≤ L}} ) = E( f(X̄_T) 1_{{max_{0≤k≤n} X̄_{t^n_k} ≤ L}} ∏_{k=0}^{n−1} ( 1 − e^{−(2n/(T σ²(t^n_k, X̄_{t^n_k}))) (X̄_{t^n_k} − L)(X̄_{t^n_{k+1}} − L)} ) ).   (8.8)

Proof of Equation (8.8). This formula is a typical application of the pre-conditioning described in Section 3.4. We start from the chaining identity for conditional expectation:

E( f(X̄_T) 1_{{sup_{t∈[0,T]} X̄_t ≤ L}} ) = E( E( f(X̄_T) 1_{{sup_{t∈[0,T]} X̄_t ≤ L}} | X̄_{t^n_k}, k = 0, ..., n ) ).

Then we use the conditional independence of the Brownian bridges given the values X̄_{t^n_k}, k = 0, ..., n, of the Euler scheme, established in Proposition 8.2. It follows that

E( f(X̄_T) 1_{{sup_{t∈[0,T]} X̄_t ≤ L}} ) = E( f(X̄_T) P( sup_{t∈[0,T]} X̄_t ≤ L | X̄_{t^n_k}, k = 0, ..., n ) )
  = E( f(X̄_T) ∏_{k=0}^{n−1} G^{n,k}_{X̄_{t^n_k}, X̄_{t^n_{k+1}}}(L) )
  = E( f(X̄_T) 1_{{max_{0≤k≤n} X̄_{t^n_k} ≤ L}} ∏_{k=0}^{n−1} ( 1 − e^{−(2n/(T σ²(t^n_k, X̄_{t^n_k}))) (X̄_{t^n_k} − L)(X̄_{t^n_{k+1}} − L)} ) ). ♦

Furthermore, we know that the random variable inside the expectation on the right-hand side always has a lower variance, since it is a conditional expectation of the random variable on the left-hand side, namely

Var( f(X̄_T) 1_{{max_{0≤k≤n} X̄_{t^n_k} ≤ L}} ∏_{k=0}^{n−1} ( 1 − e^{−(2n/(T σ²(t^n_k, X̄_{t^n_k}))) (X̄_{t^n_k} − L)(X̄_{t^n_{k+1}} − L)} ) ) ≤ Var( f(X̄_T) 1_{{sup_{t∈[0,T]} X̄_t ≤ L}} ).
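The weighted estimator can be sketched in code as follows (ours, again with a driftless scheme σ(t, x) = σx for simplicity): each surviving path is multiplied by the product of the conditional non-crossing probabilities.

```python
import math, random

def uo_call_preconditioned(x0, K, L, sigma, T, n, M, seed=0):
    """Pre-conditioned (Brownian-bridge weighted) estimator of
    E[(Xbar_T - K)+ 1{sup Xbar <= L}], in the spirit of (8.8), for a driftless
    scheme dX = sigma*X dW. Each interval contributes the conditional survival
    probability 1 - exp(-2n (x_k - L)(x_{k+1} - L) / (T * sigma(x_k)^2))."""
    rng = random.Random(seed)
    h = T / n
    acc = 0.0
    for _ in range(M):
        x = [x0]
        for _ in range(n):
            x.append(x[-1] + sigma * x[-1] * math.sqrt(h) * rng.gauss(0.0, 1.0))
        if max(x) > L:                      # discrete path already knocked out
            continue
        w = 1.0
        for k in range(n):
            s2 = (sigma * x[k]) ** 2
            w *= 1.0 - math.exp(-2.0 * n * (x[k] - L) * (x[k + 1] - L) / (T * s2))
        acc += max(x[-1] - K, 0.0) * w
    return acc / M
```

Each factor lies in [0, 1), so the weighted payoff never exceeds the plain payoff; the estimator has the same expectation as the indicator-based one, with a lower variance.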

Exercises. 1. Show likewise that, for every Borel function f ∈ L¹(P_{X̄_T}),

E( f(X̄_T) 1_{{inf_{t∈[0,T]} X̄_t ≥ L}} ) = E( f(X̄_T) 1_{{min_{0≤k≤n} X̄_{t^n_k} ≥ L}} ∏_{k=0}^{n−1} ( 1 − e^{−(2n/(T σ²(t^n_k, X̄_{t^n_k}))) (X̄_{t^n_k} − L)(X̄_{t^n_{k+1}} − L)} ) )   (8.9)

and that the expression inside the second expectation has a lower variance.
2. Extend the above results to barriers of the form

L(t) := e^{at+b},   a, b ∈ R.

8.2.5 Weak errors and Richardson–Romberg extrapolation for path-dependent options
It is clear that all the asymptotic control of the variance obtained in the former section for the estimator Σ_{r=1}^R α_r f(X̄^{(r)}_T) of E(f(X_T)) when f is continuous can be extended to functionals F : ID([0, T], R^d) → R which are P_X-a.s. continuous with respect to the sup-norm defined by ‖x‖_sup := sup_{t∈[0,T]} |x(t)|, with polynomial growth (i.e. F(x) = O(‖x‖^ℓ_sup) for some natural integer ℓ as ‖x‖_sup → ∞). This simply follows from the fact that the (continuous) Euler scheme X̄ (with step T/n) defined by

∀ t ∈ [0, T],   X̄_t = x_0 + ∫_0^t b(X̄_{s̲}) ds + ∫_0^t σ(X̄_{s̲}) dW_s,   s̲ = ⌊ns/T⌋ T/n,

converges for the sup-norm toward X in every L^p(P).
Furthermore, this asymptotic control of the variance holds true for any R-tuple α = (α_r)_{1≤r≤R} of weight coefficients satisfying Σ_{1≤r≤R} α_r = 1, so these coefficients can be adapted to the structure of the weak error expansion.
For both classes of functionals (with D as a half-line in 1-dimension in the first setting), the practical
implementation of the continuous Euler scheme is known as the Brownian bridge method.
At this stage there are two ways to implement the (multistep) Romberg extrapolation with consistent
Brownian increments in order to improve the performances of the original (stepwise constant or continuous)
Euler schemes. Both rely on natural conjectures about the existence of a higher order expansion of the time
discretization error suggested by the above rates of convergence (8.1), (8.2) and (8.3).
• Stepwise constant Euler scheme: as concerns the standard Euler scheme, this means the existence of a vector space V (stable by product) of admissible functionals satisfying

(E^{(1/2)}_{R,V}) ≡ ∀ F ∈ V,   E(F(X)) = E(F(X̃)) + Σ_{k=1}^{R−1} c_k / n^{k/2} + O(n^{−R/2}),   (8.10)

still for some real constants c_k depending on b, σ, F, etc. For small values of R, one checks that
R = 2 :   α_1^{(1/2)} = −(1 + √2),   α_2^{(1/2)} = √2 (1 + √2).

R = 3 :   α_1^{(1/2)} = (√3 − √2)/(2√2 − √3 − 1),   α_2^{(1/2)} = −2(√3 − 1)/(2√2 − √3 − 1),   α_3^{(1/2)} = 3(√2 − 1)/(2√2 − √3 − 1).
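These values can be recovered numerically. Assuming R schemes with steps corresponding to n, 2n, ..., Rn and the expansion (8.10), the weights solve the linear system Σ_r α_r = 1 and Σ_r α_r r^{−k/2} = 0, k = 1, ..., R − 1. A small self-contained solver (ours):

```python
def romberg_weights(R):
    """Solve sum_r alpha_r * r^(-k/2) = delta_{k,0}, k = 0..R-1, r = 1..R,
    by Gaussian elimination with partial pivoting (no external libraries)."""
    A = [[(r + 1) ** (-k / 2.0) for r in range(R)] for k in range(R)]
    b = [1.0] + [0.0] * (R - 1)
    for i in range(R):
        p = max(range(i, R), key=lambda j: abs(A[j][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for j in range(i + 1, R):
            m = A[j][i] / A[i][i]
            for l in range(i, R):
                A[j][l] -= m * A[i][l]
            b[j] -= m * b[i]
    x = [0.0] * R
    for i in range(R - 1, -1, -1):
        x[i] = (b[i] - sum(A[i][l] * x[l] for l in range(i + 1, R))) / A[i][i]
    return x
```

For instance, romberg_weights(2) returns approximately [−2.414, 3.414], i.e. [−(1+√2), √2(1+√2)].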

Note that these coefficients have greater absolute values than in the standard case. Thus, if R = 4, Σ_{1≤r≤4} (α_r^{(1/2)})² ≈ 10 900 (!), which induces an increase of the variance term for too small values of the time discretization parameter n, even when the increments are consistently generated. The complexity computations of

the procedure needs to be updated but grosso modo the optimal choice for the time discretization parameter
n as a function of the M C size M is
M ∝ nR .
• The continuous Euler scheme: the conjecture is simply to assume that the expansion (E_R) now holds for a vector space V of functionals F (with polynomial growth with respect to the sup-norm). The increase of complexity induced by the Brownian bridge method is difficult to quantify: it amounts to computing log(U_k) and the inverse distribution functions F^{−1}_{x,y} and G^{−1}_{x,y}.
The second difficulty is that simulating (the extrema of) some of the continuous Euler schemes using the Brownian bridge in a consistent way is not straightforward at all. However, one can reasonably expect that using independent Brownian bridges “relying” on stepwise constant Euler schemes with consistent Brownian increments will have a small impact on the global variance (although slightly increasing it).

To illustrate and compare these approaches, we carried out some numerical tests on partial Lookback and barrier options in the Black–Scholes model presented in the previous section.

Partial Lookback options: the partial Lookback Call option is defined by its payoff functional

F(x) = e^{−rT} ( x(T) − λ min_{s∈[0,T]} x(s) )_+,   x ∈ C([0, T], R),

where λ > 0 (if λ ≤ 1, the (·)_+ can be dropped). The premium

Call^{Lkb}_0 = e^{−rT} E( (X_T − λ min_{t∈[0,T]} X_t)_+ )

is given by

Call^{Lkb}_0 = X_0 Call^{BS}(1, λ, σ, r, T) + λ (σ²/(2r)) X_0 Put^{BS}( λ^{2r/σ²}, 1, σ, r, T ).

We took the same values for the B-S parameters as in the former section and set the coefficient λ at λ = 1.1. For this set of parameters, Call^{Lkb}_0 = 57.475.
As concerns the MC simulation size, we still set M = 10^6. We compared the following three methods for every choice of n:
– a 3-step Richardson–Romberg extrapolation (R = 3) of the stepwise constant Euler scheme (for which a O(n^{−3/2})-rate can be expected from the conjecture);
– a 3-step Richardson–Romberg extrapolation (R = 3) based on the continuous Euler scheme (Brownian bridge method), for which a O(1/n³)-rate can be conjectured (see [63]);
– a continuous Euler scheme (Brownian bridge method) of equivalent complexity, i.e. with discretization parameter 6n, for which a O(1/n)-rate can be expected (see [63]).
The three procedures have the same complexity if one neglects the cost of the bridge simulation with
respect to that of the diffusion coefficients (note this is very conservative in favour of “bridged schemes”).
We do not reproduce the results obtained for the standard stepwise constant Euler scheme, which is clearly out of the game (as already emphasized in [63]). In Figure 8.1, the abscissas represent the size of the Euler scheme with equivalent complexity (i.e. 6n, n = 2, 4, 6, 8, 10). Figure 8.1(a) (left) shows that both 3-step Richardson–Romberg extrapolation methods converge significantly faster than the “bridged” Euler scheme with equivalent complexity in this high-volatility framework. The standard deviations depicted in Figure 8.1(a) (right) show that the 3-step Richardson–Romberg extrapolation of the Brownian bridge is controlled even for small values of n. This is not the case with the 3-step Richardson–Romberg extrapolation of the stepwise constant Euler scheme. Other simulations – not reproduced here – show this is already true for the standard Romberg extrapolation and the bridged Euler scheme. In any case the multistep Romberg extrapolation with R = 3 significantly outperforms the bridged Euler scheme.

[Figure 8.1: plots omitted – two panels (a) and (b); in each, left: Error = MC Premium − BS Premium, right: Std Deviation of the MC Premium, both against 6n.]

Figure 8.1: B-S Euro Partial Lookback Call option. (a) M = 10^6. Richardson–Romberg extrapolation (R = 3) of the Euler scheme with Brownian bridge: o−−o−−o. Consistent Richardson–Romberg extrapolation (R = 3): ×—×—×. Euler scheme with Brownian bridge with equivalent complexity: +−−+−−+. X_0 = 100, σ = 100%, r = 15%, λ = 1.1. Abscissas: 6n, n = 2, 4, 6, 8, 10. Left: Premia. Right: Standard Deviations. (b) Idem with M = 10^8.

When M = 10^8, one verifies (see Figure 8.1(b)) that the time discretization error of the 3-step Richardson–Romberg extrapolation vanishes. In fact, for n = 10 the 3-step bridged Euler scheme yields a premium equal to 57.480, which corresponds to less than half a cent error, i.e. 0.05% accuracy! This result is obtained without any control variate.
The Richardson–Romberg extrapolation of the standard Euler scheme also provides excellent results. In fact it seems difficult to discriminate them from those obtained with the bridged schemes, which is slightly unexpected if one thinks about the natural conjecture on the time discretization error expansion.

As a theoretical conclusion, these results strongly support both conjectures about the existence of expansions of the weak error in the scales (n^{−p/2})_{p≥1} and (n^{−p})_{p≥1} respectively.

Up-and-out Call option: let 0 ≤ K ≤ L. The up-and-out Call option with strike K and barrier L is defined by its payoff functional

F(x) = e^{−rT} ( x(T) − K )_+ 1_{{max_{s∈[0,T]} x(s) ≤ L}},   x ∈ C([0, T], R).



It is again classical background that, in a B-S model,

Call^{U&O}(X_0, r, σ, T) = Call^{BS}(X_0, K, r, σ, T) − Call^{BS}(X_0, L, r, σ, T) − e^{−rT}(L − K) Φ(d_−(L))
  − (L/X_0)^{1+µ} ( Call^{BS}(X_0, K′, r, σ, T) − Call^{BS}(X_0, L′, r, σ, T) − e^{−rT}(L′ − K′) Φ(d_−(L′)) )

with

K′ = K (X_0/L)²,   L′ = L (X_0/L)²,   d_−(L) = ( log(X_0/L) + (r − σ²/2) T ) / (σ√T),   Φ(x) := ∫_{−∞}^x e^{−ξ²/2} dξ/√(2π),

and µ = 2r/σ².
We took again the same values for the B-S parameters as for the vanilla Call and set the barrier value at L = 300. For this set of parameters, C^{UO}_0 = 8.54. We tested the same three schemes. The numerical results are depicted in Figure 8.2.
The conclusion (see Figure 8.2(a) (left)) is that, at this very high level of volatility, when M = 10^6 (which is a standard size given the high-volatility setting), the (quasi-)consistent 3-step Romberg extrapolation with Brownian bridge clearly outperforms the continuous Euler scheme (Brownian bridge) of equivalent complexity, while the 3-step Richardson–Romberg extrapolation based on the stepwise constant Euler schemes with consistent Brownian increments is not competitive at all: it suffers both from a too high variance (see Figure 8.2(a) (right)) for the considered sizes of the Monte Carlo simulation and from its too slow rate of convergence in time.
When M = 10^8 (see Figure 8.2(b) (left)), one verifies again that the time discretization error of the 3-step Richardson–Romberg extrapolation almost vanishes, like for the partial Lookback option. This is no longer the case with the 3-step Richardson–Romberg extrapolation of stepwise constant Euler schemes. It seems clear that the discretization time error is more prominent for the barrier option: thus, with n = 10, the relative error is (9.09 − 8.54)/8.54 ≈ 6.5% for this first Richardson–Romberg extrapolation, whereas the 3-step Romberg method based on the quasi-consistent “bridged” method yields an approximate premium of 8.58, corresponding to a relative error of (8.58 − 8.54)/8.54 ≈ 0.4%. These specific results (obtained without any control variate) are representative of the global behaviour of the methods, as emphasized by Figure 8.2(b) (left).

8.2.6 The case of Asian options
The family of Asian options is related to payoffs of the form

h_T := h( ∫_0^T X_s ds )

where h : R_+ → R_+ is a non-negative Borel function. This kind of option needs some specific treatment to improve the rate of convergence of its time discretization. This is due to the continuity of the functional f ↦ ∫_0^T f(s) ds on (L¹([0, T], dt), ‖·‖_1).
This problem has been extensively investigated (essentially for a Black–Scholes dynamics) by E. Temam in his PhD thesis (see [97]). What follows comes from this work.

Approximation phase: let

X^x_t = x exp( µt + σW_t ),   µ = r − σ²/2,   x > 0.

[Figure 8.2: plots omitted – two panels (a) and (b); in each, left: Error = MC Premium − BS Premium, right: Std Deviation of the MC Premium, both against 6n.]

Figure 8.2: B-S Euro up-&-out Call option. (a) M = 10^6. Richardson–Romberg extrapolation (R = 3) of the Euler scheme with Brownian bridge: o−−o−−o. Consistent Richardson–Romberg extrapolation (R = 3): ×—×—×. Euler scheme with Brownian bridge and equivalent complexity: +−−+−−+. X_0 = K = 100, L = 300, σ = 100%, r = 15%. Abscissas: 6n, n = 2, 4, 6, 8, 10. Left: Premia. Right: Standard Deviations. (b) Idem with M = 10^8.

Then

∫_0^T X^x_s ds = Σ_{k=0}^{n−1} ∫_{t^n_k}^{t^n_{k+1}} X^x_s ds = Σ_{k=0}^{n−1} X^x_{t^n_k} ∫_0^{T/n} exp( µs + σ W^{(t^n_k)}_s ) ds.

So we need to approximate the integral appearing in the above equation. Let B be a standard Brownian motion. Roughly speaking, sup_{s∈[0,T/n]} |B_s| is proportional to √(T/n). Although this is not true in the a.s. sense (owing to a missing log log term), it holds true in every L^p(P), p ∈ [1, ∞), by a simple scaling, so that we may write “almost” rigorously

∀ s ∈ [0, T/n],   exp( µs + σB_s ) = 1 + µs + σB_s + (σ²/2) B_s² + “O(n^{−3/2})”.

Hence

∫_0^{T/n} exp( µs + σB_s ) ds = T/n + (µ/2) T²/n² + σ ∫_0^{T/n} B_s ds + (σ²/2) T²/(2n²) + (σ²/2) ∫_0^{T/n} (B_s² − s) ds + “O(n^{−5/2})”
  = T/n + (r/2) T²/n² + σ ∫_0^{T/n} B_s ds + (σ²/2) (T/n)² ∫_0^1 (B̃_u² − u) du + “O(n^{−5/2})”

where B̃_u := √(n/T) B_{uT/n}, u ∈ [0, 1], is a standard Brownian motion (on the unit interval), since, combining a scaling and a change of variable,

∫_0^{T/n} (B_s² − s) ds = (T/n)² ∫_0^1 (B̃_u² − u) du.

Owing to the fact that the random variable ∫_0^1 (B̃_u² − u) du is centered and that, when B is successively replaced by W^{(t^n_k)}, k = 0, ..., n−1, the resulting random variables are i.i.d., one can in fact consider that the contribution of this term is O(n^{−5/2}) (3). This leads us to use the following approximation:

∫_0^{T/n} exp( µs + σ W^{(t^n_k)}_s ) ds ≈ I^n_k := T/n + (r/2) T²/n² + σ ∫_0^{T/n} W^{(t^n_k)}_s ds.

3 To be more precise, the random variable ∫_0^1 ((W̃^{(t^n_k)}_u)² − u) du is independent of F^W_{t^n_k}, k = 0, ..., n−1, so that

‖ Σ_{k=0}^{n−1} X^x_{t^n_k} (T/n)² ∫_0^1 ((W̃^{(t^n_k)}_u)² − u) du ‖²_2 = Σ_{k=0}^{n−1} ‖ X^x_{t^n_k} (T/n)² ∫_0^1 ((W̃^{(t^n_k)}_u)² − u) du ‖²_2
  ≤ n (T/n)^4 ‖ ∫_0^1 (B_u² − u) du ‖²_2 ‖ X^x_T ‖²_2

since (X^x_t)² is a sub-martingale. As a consequence,

‖ Σ_{k=0}^{n−1} X^x_{t^n_k} (T/n)² ∫_0^1 ((W̃^{(t^n_k)}_u)² − u) du ‖_2 = O(n^{−3/2}),

which justifies considering “conventionally” the contribution of each term to be O(n^{−5/2}), i.e. negligible.

Simulation phase: now, it follows from Proposition 8.2 applied to the Brownian motion (which is its own continuous Euler scheme) that the n-tuple of processes (W_t)_{t∈[t^n_k, t^n_{k+1}]}, k = 0, ..., n−1, are conditionally independent given σ(W_{t^n_k}, k = 1, ..., n). This also reads: the processes (W^{(t^n_k)}_t)_{t∈[0, T/n]}, k = 0, ..., n−1, are conditionally independent given σ(W_{t^n_k}, k = 1, ..., n). The same proposition implies that, for every ℓ ∈ {0, ..., n−1},

L( (W^{(t^n_ℓ)}_t)_{t∈[0, T/n]} | W_{t^n_k} = w_k, k = 1, ..., n ) = L( ( (nt/T)(w_{ℓ+1} − w_ℓ) + Y_t^{W, T/n} )_{t∈[0, T/n]} )

(with w_0 := 0). Consequently, the random variables ∫_0^{T/n} W^{(t^n_ℓ)}_s ds, ℓ = 0, ..., n−1, are conditionally independent given σ(W_{t^n_k}, k = 1, ..., n), with conditional Gaussian distributions whose (conditional) means are given by

∫_0^{T/n} (nt/T)(w_{ℓ+1} − w_ℓ) dt = (T/(2n)) (w_{ℓ+1} − w_ℓ).
As concerns the variance, we can use the exercise below Proposition 8.1 but, at this stage, we will detail the computation for a generic Brownian motion, say W, between 0 and T/n:

Var( ∫_0^{T/n} W_s ds | W_{T/n} ) = E( ( ∫_0^{T/n} Y_s^{W, T/n} ds )² )
  = ∫_{[0, T/n]²} E( Y_s^{W, T/n} Y_t^{W, T/n} ) ds dt
  = (n/T) ∫_{[0, T/n]²} (s ∧ t) ( T/n − (s ∨ t) ) ds dt
  = (T/n)³ ∫_{[0,1]²} (u ∧ v)(1 − (u ∨ v)) du dv
  = (1/12) (T/n)³.
Exercise. Use stochastic calculus to show directly that

E( ( ∫_0^T Y_s^{W,T} ds )² ) = E( ( ∫_0^T W_s ds − (T/2) W_T )² ) = ∫_0^T ( T/2 − s )² ds = T³/12.
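The closed-form value T³/12 can also be cross-checked numerically (our sketch) by integrating the standard bridge covariance Cov(Y_s, Y_t) = s ∧ t − st/T over [0, T]²:

```python
def bridge_integral_variance(T, m=300):
    """Midpoint-rule approximation of the double integral of the Brownian
    bridge covariance  min(s,t) - s*t/T  over [0,T]^2; the limit is T^3/12."""
    h = T / m
    total = 0.0
    for i in range(m):
        s = (i + 0.5) * h
        for j in range(m):
            t = (j + 0.5) * h
            total += (min(s, t) - s * t / T) * h * h
    return total
```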
Now we can describe the pseudo-code for the pricing of an Asian option with payoff h( ∫_0^T X^x_s ds ) by a Monte Carlo simulation.
for m := 1 to M
  • Simulate the Brownian increments ∆W^{(m)}_{t^n_k} := √(T/n) Z^{(m)}_k, k = 1, ..., n, with (Z^{(m)}_k) i.i.d. with distribution N(0; 1); set
    – w^{(m)}_0 := 0 and w^{(m)}_k := √(T/n) ( Z^{(m)}_1 + ··· + Z^{(m)}_k ), k = 1, ..., n,
    – x^{(m)}_k := x exp( µ t^n_k + σ w^{(m)}_k ), k = 0, ..., n.
  • Simulate independently ζ^{(m)}_k, k = 0, ..., n−1, i.i.d. with distribution N(0; 1), and set

    I^{n,(m)}_k := T/n + (r/2) T²/n² + σ ( (T/(2n)) (w^{(m)}_{k+1} − w^{(m)}_k) + (1/√12) (T/n)^{3/2} ζ^{(m)}_k ),   k = 0, ..., n−1.

  • Compute

    h^{(m)}_T := h( Σ_{k=0}^{n−1} x^{(m)}_k I^{n,(m)}_k ).

end(m)

Premium ≈ e^{−rT} (1/M) Σ_{m=1}^M h^{(m)}_T.
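The pseudo-code above can be turned into a short Python sketch (ours), here for the average-price Call h(a) = (a/T − K)_+ in a Black–Scholes model; all names are illustrative.

```python
import math, random

def asian_call_mc(x0, K, r, sigma, T, n, M, seed=7):
    """MC sketch of the scheme above for an average-price Asian call:
    each interval contributes x_k * I_k with
    I_k = T/n + (r/2)(T/n)^2 + sigma*((T/2n) dW_k + (T/n)^{3/2}/sqrt(12) * zeta_k)."""
    rng = random.Random(seed)
    mu = r - 0.5 * sigma * sigma
    h = T / n
    acc = 0.0
    for _ in range(M):
        w = [0.0]
        for _ in range(n):
            w.append(w[-1] + math.sqrt(h) * rng.gauss(0.0, 1.0))
        integral = 0.0
        for k in range(n):
            xk = x0 * math.exp(mu * k * h + sigma * w[k])
            ik = (h + 0.5 * r * h * h
                  + sigma * (0.5 * h * (w[k + 1] - w[k])
                             + h ** 1.5 / math.sqrt(12.0) * rng.gauss(0.0, 1.0)))
            integral += xk * ik
        acc += math.exp(-r * T) * max(integral / T - K, 0.0)
    return acc / M
```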
Time discretization error estimates: set

A_T = ∫_0^T X^x_s ds   and   Ā^n_T = Σ_{k=0}^{n−1} X^x_{t^n_k} I^n_k.

This scheme induces the following time discretization error.

Proposition 8.5 (see [97]) For every p ≥ 2,

‖ A_T − Ā^n_T ‖_p = O(n^{−3/2}),

so that, if g : R² → R is Lipschitz continuous, then

‖ g(X^x_T, A_T) − g(X^x_T, Ā^n_T) ‖_p = O(n^{−3/2}).

Proof. In progress. ♦

Remark. The main reason for not considering higher-order expansions of exp(µt + σB_t) is that we are not able to simulate at a reasonable cost the triplet (B_t, ∫_0^t B_s ds, ∫_0^t B_s² ds), which is no longer a Gaussian vector, nor, consequently, (∫_0^t B_s ds, ∫_0^t B_s² ds) given B_t.
Chapter 9

Back to sensitivity computation

Let Z : (Ω, A, P) → (E, E) be an E-valued random variable, where (E, E) is an abstract measurable space, let I be a nonempty open interval of R and let F : I × E → R be a Bor(I) ⊗ E-measurable function such that, for every x ∈ I, F(x, Z) ∈ L²(P) (1). Then set

f(x) = E F(x, Z).

Assume that the function f is regular, at least at some points. Our aim is to devise a method to
compute by simulation f 0 (x) at such points (or higher derivatives f (k) (x), k ≥ 1). If the functional
F (x, z) is differentiable at x (with respect to its first variable) for PZ almost every z, if a domination
or uniform integrability property holds (like (9.1) or (9.5) below), if the partial derivative ∂F ∂x (x, z)
can be computed at a reasonable cost and Z is a simulatable random vector (still at a reasonable. . . ),
it is natural to compute f 0 (x) using a Monte Carlo simulation based on the representation formula
 
∂F
f 0 (x) = E (x, Z) .
∂x
This approach has already been introduced in Chapter 2 and will be more deeply developed further
on in Section 9.2 mainly devoted to the tangent process.
Otherwise, when ∂F ∂x (x, z) does not exist or cannot be compute easily (whereas F can), a natural
idea is to introduce a stochastic finite difference approach. Other methods based on the introduction
of an appropriate weight will be introduced in the last two sections of this chapter.

9.1 Finite difference method(s)


The finite difference method is in some way the most elementary and natural method to compute
sensitivity parameters (Greeks) although it is an approximate method in its standard form. It
is also known in financial engineering as the “Bump method” or “Shock method”. It can be
described in a very general setting (which does correspond to its wide field of application). These
methods have been originally investigated in [59, 60, 99].
¹ In fact E can be replaced by the probability space Ω itself: Z becomes the canonical variable/process on this probability space (endowed with the distribution P = P_Z of the process). In particular Z can be the Brownian motion or any process at time T starting at x, or its entire path, etc. The notation is essentially formal and could be replaced by the more general F(x, ω).


9.1.1 The constant step approach


We consider the framework described in the introduction of this chapter. We will distinguish two cases:
in the first one – called “regular setting” – the function x 7→ F (x, Z(ω)) is “not far” from being
pathwise differentiable whereas in the second one – called “singular setting” – f remains smooth
but F becomes “singular”.

The regular setting


Proposition 9.1 Let x ∈ R. Assume that F satisfies the following local mean quadratic Lipschitz continuity assumption (“at x”):
$$\exists\,\varepsilon_0>0,\ \forall\,x'\in(x-\varepsilon_0,x+\varepsilon_0),\quad \|F(x,Z)-F(x',Z)\|_2 \le C_{F,Z}\,|x-x'|. \qquad (9.1)$$
Assume the function f is twice differentiable with a Lipschitz continuous second derivative on (x−ε₀, x+ε₀). Let (Z_k)_{k≥1} be a sequence of i.i.d. random vectors with the same distribution as Z. Then, for every ε ∈ (0, ε₀),
$$\left\|f'(x)-\frac1M\sum_{k=1}^M\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,Z_k)}{2\varepsilon}\right\|_2 \le [f'']_{\mathrm{Lip}}\,\frac{\varepsilon^2}{2}+\sqrt{\frac{C_{F,Z}^2-\Big(\big(f'(x)-\frac{\varepsilon^2}{2}[f'']_{\mathrm{Lip}}\big)_+\Big)^2}{M}} \qquad (9.2)$$
$$\le [f'']_{\mathrm{Lip}}\,\frac{\varepsilon^2}{2}+\sqrt{\frac{C_{F,Z}^2}{M}} \qquad (9.3)$$
$$\le [f'']_{\mathrm{Lip}}\,\frac{\varepsilon^2}{2}+\frac{C_{F,Z}}{\sqrt M}.$$
Furthermore, if f is three times differentiable on (x−ε₀, x+ε₀) with a bounded third derivative, then one can replace $[f'']_{\mathrm{Lip}}$ by $\frac13\sup_{|\xi-x|\le\varepsilon}|f^{(3)}(\xi)|$.

Remark. In the above sum, $[f'']_{\mathrm{Lip}}\frac{\varepsilon^2}{2}$ represents the bias and $\frac{C_{F,Z}}{\sqrt M}$ the statistical error.

Proof. Let ε ∈ (0, ε₀). It follows from the Taylor formula applied to f between x and x ± ε respectively that
$$\left|f'(x)-\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\right|\le [f'']_{\mathrm{Lip}}\,\frac{\varepsilon^2}{2}. \qquad (9.4)$$
On the other hand,
$$\mathbb E\left(\frac{F(x+\varepsilon,Z)-F(x-\varepsilon,Z)}{2\varepsilon}\right)=\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}$$
and
$$\mathrm{Var}\left(\frac{F(x+\varepsilon,Z)-F(x-\varepsilon,Z)}{2\varepsilon}\right)=\mathbb E\left(\frac{F(x+\varepsilon,Z)-F(x-\varepsilon,Z)}{2\varepsilon}\right)^2-\left(\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\right)^2$$
$$=\frac{\mathbb E\big(F(x+\varepsilon,Z)-F(x-\varepsilon,Z)\big)^2}{4\varepsilon^2}-\left(\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\right)^2$$
$$\le C_{F,Z}^2-\left(\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\right)^2 \qquad (9.5)$$
$$\le C_{F,Z}^2. \qquad (9.6)$$

Using the Huygens theorem (²), we get
$$\left\|f'(x)-\frac1M\sum_{k=1}^M\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,Z_k)}{2\varepsilon}\right\|_2^2=\left(f'(x)-\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\right)^2+\mathrm{Var}\left(\frac1M\sum_{k=1}^M\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,Z_k)}{2\varepsilon}\right)$$
$$=\left(f'(x)-\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\right)^2+\frac1M\,\mathrm{Var}\left(\frac{F(x+\varepsilon,Z)-F(x-\varepsilon,Z)}{2\varepsilon}\right)$$
$$\le\left([f'']_{\mathrm{Lip}}\,\frac{\varepsilon^2}{2}\right)^2+\frac1M\left(C_{F,Z}^2-\left(\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\right)^2\right),$$
where we combined (9.4) and (9.5) to derive the last inequality, i.e. (9.3). To get the improved bound (9.2), we first derive from (9.4) that
$$\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\ge f'(x)-\frac{\varepsilon^2}{2}\,[f'']_{\mathrm{Lip}}$$
so that
$$\left|\mathbb E\left(\frac{F(x+\varepsilon,Z)-F(x-\varepsilon,Z)}{2\varepsilon}\right)\right|=\left|\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\right|\ge\left(\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}\right)_+\ge\left(f'(x)-\frac{\varepsilon^2}{2}\,[f'']_{\mathrm{Lip}}\right)_+.$$
Plugging this lower bound into the definition of the variance yields the announced inequality. ♦

Numerical recommendations. The above result suggests to choose M = M(ε) (and ε) so that
$$M(\varepsilon)=o\big(\varepsilon^{-4}\big).$$
As a matter of fact, it is useless to pursue the Monte Carlo simulation once the statistical error becomes smaller than that induced by the approximation of the derivative. From a practical point of view this means that, in order to reduce the error by a factor 2, we need to reduce ε and increase M as follows:
$$\varepsilon \rightsquigarrow \varepsilon/\sqrt2 \quad\text{and}\quad M \rightsquigarrow 4\,M.$$
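In the Black-Scholes model with a Lipschitz call payoff (the setting of Example (a) below), the estimator of Proposition 9.1 can be sketched as follows. The closed-form delta Φ₀(d₁) is only used as a benchmark, and all parameter values are illustrative assumptions:

```python
import numpy as np
from math import erf, exp, log, sqrt

def F(x, z, K=100.0, r=0.05, sigma=0.2, T=1.0):
    """Discounted call payoff F(x, z) = e^{-rT}(x e^{(r - s^2/2)T + s sqrt(T) z} - K)_+."""
    s = x * np.exp((r - 0.5 * sigma ** 2) * T + sigma * sqrt(T) * z)
    return exp(-r * T) * np.maximum(s - K, 0.0)

def delta_fd(x, eps, M, seed=0):
    """Symmetric finite difference estimator with the SAME innovations Z_k
    at x + eps and x - eps (crucial to keep the variance bounded)."""
    Z = np.random.default_rng(seed).standard_normal(M)
    return float(np.mean((F(x + eps, Z) - F(x - eps, Z)) / (2.0 * eps)))

def bs_delta(x, K=100.0, r=0.05, sigma=0.2, T=1.0):
    """Benchmark: closed-form Black-Scholes delta Phi_0(d1)."""
    d1 = (log(x / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    return 0.5 * (1.0 + erf(d1 / sqrt(2.0)))
```

With, e.g., ε = 0.5 and M = 4·10⁵, `delta_fd(100.0, 0.5, 400_000)` agrees with `bs_delta(100.0)` up to the statistical error of order $C_{F,Z}/\sqrt M$.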
Remark (what should never be done). Imagine that one uses two independent samples (Z_k)_{k≥1} and $(\widetilde Z_k)_{k\ge1}$ to simulate F(x + ε, Z) and F(x − ε, Z). Then it is straightforward that
$$\mathrm{Var}\left(\frac1M\sum_{k=1}^M\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,\widetilde Z_k)}{2\varepsilon}\right)=\frac{1}{4M\varepsilon^2}\Big(\mathrm{Var}\big(F(x+\varepsilon,Z)\big)+\mathrm{Var}\big(F(x-\varepsilon,Z)\big)\Big)\approx\frac{\mathrm{Var}\big(F(x,Z)\big)}{2M\varepsilon^2}.$$
Hence the asymptotic variance of this estimator of $\frac{f(x+\varepsilon)-f(x-\varepsilon)}{2\varepsilon}$ explodes as ε → 0 and the resulting quadratic error reads approximately
$$[f'']_{\mathrm{Lip}}\,\frac{\varepsilon^2}{2}+\frac{\sigma\big(F(x,Z)\big)}{\varepsilon\sqrt{2M}},$$
where $\sigma(F(x,Z))=\sqrt{\mathrm{Var}(F(x,Z))}$ is the standard deviation of F(x, Z). This leads to the unrealistic constraint M(ε) ∝ ε⁻⁶ to keep the balance between the bias term and the variance term, or equivalently to
$$\varepsilon \rightsquigarrow \varepsilon/\sqrt2 \quad\text{and}\quad M \rightsquigarrow 8\,M$$
to reduce the error by a factor 2.

² Which formally simply reads $\mathbb E|X-a|^2=(a-\mathbb E\,X)^2+\mathrm{Var}(X)$.

Examples: (a) Black-Scholes model. The basic field of application in finance is the computation of Greeks. This corresponds to functions F of the form
$$F(x,z)=e^{-rT}\,h\Big(x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt T\,z}\Big),\quad z\in\mathbb R,\ x\in(0,\infty),$$
where h : R₊ → R is a Borel function with linear growth. If h is Lipschitz continuous, then
$$|F(x,Z)-F(x',Z)|\le [h]_{\mathrm{Lip}}\,|x-x'|\,e^{-\frac{\sigma^2}{2}T+\sigma\sqrt T\,Z},$$
so that elementary computations show, using that $Z\overset{d}{=}\mathcal N(0;1)$,
$$\|F(x,Z)-F(x',Z)\|_2\le [h]_{\mathrm{Lip}}\,|x-x'|\,e^{\frac{\sigma^2}{2}T}.$$
The regularity of f follows from the following easy change of variable:
$$f(x)=e^{-rT}\int_{\mathbb R}h\big(x\,e^{\mu T+\sigma\sqrt T\,z}\big)\,e^{-\frac{z^2}{2}}\,\frac{dz}{\sqrt{2\pi}}=e^{-rT}\int_{(0,\infty)}h(y)\,e^{-\frac{(\log(y/x)-\mu T)^2}{2\sigma^2T}}\,\frac{dy}{y\,\sigma\sqrt{2\pi T}},$$
where $\mu=r-\frac{\sigma^2}{2}$. This change of variable makes the integral appear as a “log”-convolution with similar regularizing effects as the standard convolution. Under appropriate growth assumptions on the function h (say polynomial growth), one shows from the above identity that the function f is in fact infinitely differentiable over (0, +∞). In particular it is twice differentiable with a Lipschitz continuous second derivative over any compact interval included in (0, +∞).
(b) Diffusion model with Lipschitz continuous coefficients. Let $X^x=(X_t^x)_{t\in[0,T]}$ denote the Brownian diffusion solution to the SDE
$$dX_t=b(X_t)\,dt+\vartheta(X_t)\,dW_t,\quad X_0=x,$$
where b and ϑ are locally Lipschitz continuous functions (on the real line) with at most linear growth (which implies the existence and uniqueness of a strong solution $(X_t^x)_{t\in[0,T]}$ starting from $X_0^x=x$). In such a case one should rather write
$$F(x,\omega)=h\big(X_T^x(\omega)\big).$$
The Lipschitz continuity of the flow of the above SDE (see Theorem 7.10) shows that
$$\|F(x,\cdot)-F(x',\cdot)\|_2\le C_{b,\vartheta}\,[h]_{\mathrm{Lip}}\,|x-x'|\,e^{C_{b,\vartheta}T},$$
where $C_{b,\vartheta}$ is a positive constant only depending on the Lipschitz coefficients of b and ϑ. In fact this also holds for multi-dimensional diffusion processes and for path-dependent functionals.
The regularity of the function f is a less straightforward question. But the answer is positive in two situations: either h, b and ϑ are regular enough to apply results on the flow of the SDE which allow pathwise differentiation of x ↦ F(x, ω) (see Theorem 9.1 further on in Section 9.2.2), or ϑ satisfies a uniform ellipticity assumption ϑ ≥ ε₀ > 0.
(c) Euler scheme of a diffusion model with Lipschitz continuous coefficients. The same holds for the Euler scheme. Furthermore, Assumption (9.1) then holds uniformly with respect to n if T/n is the step size of the Euler scheme.
(d) F can also stand for a functional of the whole path of a diffusion, provided F is Lipschitz continuous with respect to the sup-norm over [0, T].
As emphasized in the section devoted to the tangent process below, the generic parameter x can be the maturity T (in practice the residual maturity T − t, also known as seniority), or any finite dimensional parameter on which the diffusion coefficients depend, since such parameters can always be seen as a component or a starting value of the diffusion.

 Exercises. 1. Adapt the results of this section to the case where f′(x) is estimated by its “forward” approximation
$$f'(x)\approx\frac{f(x+\varepsilon)-f(x)}{\varepsilon}.$$
2. Apply the above method to approximate the γ-parameter, based on the approximation
$$f''(x)\approx\frac{f(x+\varepsilon)+f(x-\varepsilon)-2f(x)}{\varepsilon^2},$$
under suitable assumptions on f and its derivatives.

The singular setting

In the setting described in the above Proposition 9.1, we are close to a framework in which one can interchange differentiation and expectation: the (local) Lipschitz continuity assumption on $x'\mapsto F(x',Z)$ implies that $\Big(\frac{F(x',Z)-F(x,Z)}{x'-x}\Big)_{x'\in(x-\varepsilon_0,x+\varepsilon_0)}$ is a uniformly integrable family. Hence, as soon as $x'\mapsto F(x',Z)$ is P-a.s. pathwise differentiable at x (or even simply in L²(P)), one has $f'(x)=\mathbb E\,F_x'(x,Z)$.
Consequently it is important to investigate the singular setting in which f is differentiable at x but F(·, Z) is not Lipschitz continuous in L². This is the purpose of the next proposition (whose proof is quite similar to that of the Lipschitz case and is subsequently left to the reader as an exercise).

Proposition 9.2 Let x ∈ R. Assume that F satisfies, in a neighbourhood (x−ε₀, x+ε₀), ε₀ > 0, of x, the following local mean quadratic θ-Hölder assumption (θ ∈ (0, 1]) (“at x”), i.e. there exists a positive real constant $C_{\text{Höl},F,Z}$ such that
$$\forall\,x',x''\in(x-\varepsilon_0,x+\varepsilon_0),\quad \|F(x'',Z)-F(x',Z)\|_2\le C_{\text{Höl},F,Z}\,|x''-x'|^{\theta}.$$
Assume the function f is twice differentiable with a Lipschitz continuous second derivative on (x−ε₀, x+ε₀). Let (Z_k)_{k≥1} be a sequence of i.i.d. random vectors with the same distribution as Z. Then, for every ε ∈ (0, ε₀),
$$\left\|f'(x)-\frac1M\sum_{k=1}^M\frac{F(x+\varepsilon,Z_k)-F(x-\varepsilon,Z_k)}{2\varepsilon}\right\|_2\le [f'']_{\mathrm{Lip}}\,\frac{\varepsilon^2}{2}+\sqrt{\frac{C^2_{\text{Höl},F,Z}}{(2\varepsilon)^{2(1-\theta)}M}} \qquad (9.7)$$
$$= [f'']_{\mathrm{Lip}}\,\frac{\varepsilon^2}{2}+\frac{C_{\text{Höl},F,Z}}{(2\varepsilon)^{1-\theta}\sqrt M}.$$
Once again, the variance of the finite difference estimator explodes as ε → 0 as soon as θ ≠ 1. As a consequence, in such a framework, to divide the quadratic error by a factor 2, we need
$$\varepsilon \rightsquigarrow \varepsilon/\sqrt2 \quad\text{and}\quad M \rightsquigarrow 2^{1-\theta}\times 4\,M.$$
A slightly different point of view in this singular case is to (roughly) optimize the parameter ε = ε(M), given a simulation size M, in order to minimize the quadratic error, or at least its natural upper bound. Such an optimization performed on (9.7) yields
$$\varepsilon_{\mathrm{opt}}=\left(\frac{2\theta\,C_{\text{Höl},F,Z}}{[f'']_{\mathrm{Lip}}\sqrt M}\right)^{\frac{1}{3-\theta}},$$
which of course depends on M, so that it breaks the recursiveness of the estimator. Moreover its sensitivity to $[f'']_{\mathrm{Lip}}$ makes its use rather unrealistic in practice.
The resulting rate of decay of the quadratic error is $O\big(M^{-\frac{1}{3-\theta}}\big)$. This rate shows that when θ ∈ (0, 1), the lack of L²-regularity of x ↦ F(x, Z) slows down the convergence of the finite difference method, by contrast with the Lipschitz case where the standard rate of convergence of the Monte Carlo method is preserved.

Example of the digital option. A typical example of such a situation is the pricing of digital options (or, equivalently, the computation of the δ-hedge of Call or Put options).
Let us consider, still in the standard risk-neutral Black-Scholes model, a digital Call option with strike price K > 0 defined by its payoff
$$h(\xi)=\mathbf 1_{\{\xi\ge K\}}$$
and set $F(x,z)=e^{-rT}\,h\big(x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt T\,z}\big)$, z ∈ R, x ∈ (0, ∞) (r denotes the constant interest rate as usual). We know that the premium of this option is given, for every initial price x > 0 of the underlying risky asset, by
$$f(x)=\mathbb E\,F(x,Z)\quad\text{where}\quad Z\overset{d}{=}\mathcal N(0;1).$$
Set $\mu=r-\frac{\sigma^2}{2}$. It is clear (using that Z and −Z have the same distribution) that
$$f(x)=e^{-rT}\,\mathbb P\big(x\,e^{\mu T+\sigma\sqrt T\,Z}\ge K\big)=e^{-rT}\,\mathbb P\Big(Z\ge-\frac{\log(x/K)+\mu T}{\sigma\sqrt T}\Big)=e^{-rT}\,\Phi_0\Big(\frac{\log(x/K)+\mu T}{\sigma\sqrt T}\Big),$$
where Φ₀ denotes the distribution function of the N(0; 1) distribution. Hence the function f is infinitely differentiable on (0, ∞), since the probability density of the standard normal distribution is C^∞ on the real line.
On the other hand, still using that Z and −Z have the same distribution, for every x, x′ ∈ (0, ∞),
$$\|F(x,Z)-F(x',Z)\|_2^2=e^{-2rT}\,\Big\|\mathbf 1_{\{Z\ge-\frac{\log(x/K)+\mu T}{\sigma\sqrt T}\}}-\mathbf 1_{\{Z\ge-\frac{\log(x'/K)+\mu T}{\sigma\sqrt T}\}}\Big\|_2^2$$
$$=e^{-2rT}\,\mathbb E\,\Big|\mathbf 1_{\{Z\le\frac{\log(x/K)+\mu T}{\sigma\sqrt T}\}}-\mathbf 1_{\{Z\le\frac{\log(x'/K)+\mu T}{\sigma\sqrt T}\}}\Big|^2$$
$$=e^{-2rT}\left(\Phi_0\Big(\frac{\log(\max(x,x')/K)+\mu T}{\sigma\sqrt T}\Big)-\Phi_0\Big(\frac{\log(\min(x,x')/K)+\mu T}{\sigma\sqrt T}\Big)\right).$$
Using that Φ₀′ is bounded (by $\kappa_0=\frac{1}{\sqrt{2\pi}}$), we derive that
$$\|F(x,Z)-F(x',Z)\|_2^2\le\frac{\kappa_0\,e^{-2rT}}{\sigma\sqrt T}\,\big|\log x-\log x'\big|.$$
Consequently, for every interval I ⊂ (0, ∞) bounded away from 0, there exists a real constant $C_{r,\sigma,T,I}>0$ such that
$$\forall\,x,x'\in I,\quad \|F(x,Z)-F(x',Z)\|_2\le C_{r,\sigma,T,I}\,\sqrt{|x-x'|},$$
i.e. the functional F is ½-Hölder in L²(P) and the above proposition applies.
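The ½-Hölder behaviour just established can be observed numerically: with the digital payoff, the empirical variance of the symmetric finite difference quotient grows roughly like 1/ε as the step shrinks. A minimal sketch, under illustrative Black-Scholes parameter values of our choosing:

```python
import numpy as np

def digital_fd_quotients(x, eps, Z, K=100.0, r=0.05, sigma=0.2, T=1.0):
    """Per-sample symmetric finite difference quotients for the digital
    payoff h(xi) = 1_{xi >= K}, using common innovations Z."""
    def F(xx):
        s = xx * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * Z)
        return np.exp(-r * T) * (s >= K).astype(float)
    return (F(x + eps) - F(x - eps)) / (2.0 * eps)

Z = np.random.default_rng(1).standard_normal(200_000)
v_coarse = digital_fd_quotients(100.0, 2.0, Z).var()   # large step
v_fine = digital_fd_quotients(100.0, 0.25, Z).var()    # small step
# v_fine is roughly (2.0 / 0.25) = 8 times larger, as Var ~ C / eps predicts
```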

 Exercises. 1. Prove Proposition 9.2 above.
2. Digital option. (a) Consider in the risk-neutral Black-Scholes model a digital option defined by its payoff
$$h(\xi)=\mathbf 1_{\{\xi\ge K\}}$$
and set $F(x,z)=e^{-rT}\,h\big(x\,e^{(r-\frac{\sigma^2}{2})T+\sigma\sqrt T\,z}\big)$, z ∈ R, x ∈ (0, ∞) (r is a constant interest rate as usual). We still consider the computation of f(x) = E F(x, Z) where $Z\overset{d}{=}\mathcal N(0;1)$.
(b) Verify on a numerical simulation that the variance of the finite difference estimator introduced in Proposition 9.1 explodes as ε → 0 at the rate expected from the computations that precede.
(c) Show that the ½-Hölder property established above holds in any fixed neighbourhood of x ∈ (0, +∞), and derive from what precedes a way to “synchronize” the step ε and the size M of the simulation.

9.1.2 A recursive approach: finite difference with decreasing step

In the former finite difference method with constant step, the bias never fades. Consequently, increasing the accuracy of the sensitivity computation requires resuming it from the beginning with a new ε. In fact, it is easy to propose a recursive version of the above finite difference procedure by considering variable steps ε_k which go to 0. This can be seen as an application of the Kiefer-Wolfowitz principle, originally developed for Stochastic Approximation purposes.
We will focus on the “regular setting” (F Lipschitz) in this section; the singular setting is proposed as an exercise. Let (ε_k)_{k≥1} be a sequence of positive real numbers decreasing to 0. With the notations and the assumptions of the former section, consider the estimator
$$\widehat{f'(x)}_M:=\frac1M\sum_{k=1}^M\frac{F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)}{2\varepsilon_k}. \qquad (9.8)$$
It can be computed in a recursive way since
$$\widehat{f'(x)}_{M+1}=\widehat{f'(x)}_M+\frac{1}{M+1}\left(\frac{F(x+\varepsilon_{M+1},Z_{M+1})-F(x-\varepsilon_{M+1},Z_{M+1})}{2\varepsilon_{M+1}}-\widehat{f'(x)}_M\right).$$
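As a minimal sketch (in plain Python, with an illustrative Black-Scholes call payoff and the step choice ε_k = k^{−(1/4+δ)}; none of these numerical choices are prescribed by the text), the recursion above can be implemented online:

```python
from math import exp, sqrt
import random

def call_F(x, z, K=100.0, r=0.05, sigma=0.2, T=1.0):
    """Discounted Black-Scholes call payoff F(x, z)."""
    s = x * exp((r - 0.5 * sigma ** 2) * T + sigma * sqrt(T) * z)
    return exp(-r * T) * max(s - K, 0.0)

def delta_fd_recursive(x, F, M, delta=0.01, seed=2):
    """Finite difference estimator (9.8) with decreasing steps
    eps_k = k^{-(1/4 + delta)}, updated online (one new Z_k at a time)."""
    rng = random.Random(seed)
    est = 0.0
    for k in range(1, M + 1):
        eps = k ** -(0.25 + delta)
        z = rng.gauss(0.0, 1.0)
        quot = (F(x + eps, z) - F(x - eps, z)) / (2.0 * eps)
        est += (quot - est) / k          # recursive mean update
    return est
```

For the call at x = K = 100 the target delta is Φ₀(d₁) ≈ 0.6368, and the estimator converges to it at the usual 1/√M Monte Carlo rate with no residual bias at that scale.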

Elementary computations show that the squared quadratic error satisfies
$$\left\|f'(x)-\widehat{f'(x)}_M\right\|_2^2=\left(f'(x)-\frac1M\sum_{k=1}^M\frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k}\right)^2+\frac{1}{M^2}\sum_{k=1}^M\frac{\mathrm{Var}\big(F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)\big)}{4\varepsilon_k^2}$$
$$\le\frac{[f'']^2_{\mathrm{Lip}}}{4M^2}\left(\sum_{k=1}^M\varepsilon_k^2\right)^2+\frac{C^2_{F,Z}}{M} \qquad (9.9)$$
$$=\frac1M\left(\frac{[f'']^2_{\mathrm{Lip}}}{4M}\left(\sum_{k=1}^M\varepsilon_k^2\right)^2+C^2_{F,Z}\right),$$
where we used again (9.4) to get Inequality (9.9). As a consequence,
$$\left\|f'(x)-\widehat{f'(x)}_M\right\|_2\le\frac{1}{\sqrt M}\sqrt{\frac{[f'']^2_{\mathrm{Lip}}}{4M}\left(\sum_{k=1}^M\varepsilon_k^2\right)^2+C^2_{F,Z}}. \qquad (9.10)$$


Standard √M-L² rate of convergence. In order to recover a $\frac{1}{\sqrt M}$ rate (like in a standard Monte Carlo simulation) we need the sequence (ε_m)_{m≥1} and the size M to satisfy
$$\left(\sum_{k=1}^M\varepsilon_k^2\right)^2=O(M).$$
This leads to choose ε_k of the form
$$\varepsilon_k=O\big(k^{-\frac14}\big)\quad\text{as }k\to+\infty,$$
since
$$\sum_{k=1}^M\frac{1}{k^{\frac12}}=\sqrt M\,\frac1M\sum_{k=1}^M\Big(\frac kM\Big)^{-\frac12}\sim\sqrt M\int_0^1\frac{dx}{\sqrt x}=2\sqrt M\quad\text{as }M\to+\infty.$$

Erasing the asymptotic bias. A more efficient way to take advantage of a decreasing step approach is to erase the bias by considering steps of the form
$$\varepsilon_k=o\big(k^{-\frac14}\big)\quad\text{as }k\to+\infty.$$
Then the first term in the right hand side of (9.10) goes to zero faster than $\frac{1}{\sqrt M}$, so the bias asymptotically fades at the scale $\sqrt M$.
However, the choice of too small steps ε_k may introduce numerical instability in the computations, so we recommend choosing the ε_k of the form
$$\varepsilon_k\propto k^{-(\frac14+\delta)}\quad\text{with }\delta>0\text{ small enough.}$$

One can refine the bound obtained in (9.10): note that
$$\sum_{k=1}^M\frac{\mathrm{Var}\big(F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)\big)}{4\varepsilon_k^2}=\sum_{k=1}^M\frac{\big\|F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)\big\|_2^2}{4\varepsilon_k^2}-\sum_{k=1}^M\left(\frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k}\right)^2.$$
Now, since $\frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k}\to f'(x)$ as k → +∞,
$$\frac1M\sum_{k=1}^M\left(\frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k}\right)^2\to f'(x)^2\quad\text{as }M\to+\infty.$$
Plugging this into the above computations yields the refined asymptotic upper-bound
$$\limsup_{M\to+\infty}\sqrt M\,\left\|f'(x)-\widehat{f'(x)}_M\right\|_2\le\sqrt{C^2_{F,Z}-\big(f'(x)\big)^2}.$$
This approach has the same quadratic rate of convergence as a regular Monte Carlo simulation (e.g. a simulation carried out with $\frac{\partial F}{\partial x}(x,Z)$ if it exists).
Now we show that the estimator $\widehat{f'(x)}_M$ is consistent, i.e. converges toward f′(x).

Proposition 9.3 Under the assumptions of Proposition 9.1 and if ε_k goes to zero as k goes to infinity, the estimator $\widehat{f'(x)}_M$ a.s. converges to its target f′(x).

Proof. It amounts to showing that
$$\frac1M\sum_{k=1}^M\left(\frac{F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)}{2\varepsilon_k}-\frac{f(x+\varepsilon_k)-f(x-\varepsilon_k)}{2\varepsilon_k}\right)\xrightarrow{a.s.}0. \qquad (9.11)$$
This is (again) a straightforward consequence of the a.s. convergence of L²-bounded martingales combined with the Kronecker Lemma (see Lemma 11.1). First define the martingale
$$L_M=\sum_{k=1}^M\frac1k\,\frac{F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)-\big(f(x+\varepsilon_k)-f(x-\varepsilon_k)\big)}{2\varepsilon_k},\quad M\ge1.$$
One checks that
$$\mathbb E\,L_M^2=\sum_{k=1}^M\mathbb E\,(\Delta L_k)^2=\sum_{k=1}^M\frac{1}{4k^2\varepsilon_k^2}\,\mathrm{Var}\big(F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)\big)$$
$$\le\sum_{k=1}^M\frac{1}{4k^2\varepsilon_k^2}\,\mathbb E\big(F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)\big)^2\le\sum_{k=1}^M\frac{1}{4k^2\varepsilon_k^2}\,C^2_{F,Z}\,4\varepsilon_k^2=C^2_{F,Z}\sum_{k=1}^M\frac{1}{k^2},$$
so that
$$\sup_M\mathbb E\,L_M^2<+\infty.$$
Consequently, L_M a.s. converges to a square integrable (hence a.s. finite) random variable L_∞ as M → +∞. The announced a.s. convergence in (9.11) follows from the Kronecker Lemma (see Lemma 11.1). ♦

 Exercises. 1. (Central Limit Theorem) Assume that x ↦ F(x, Z) is Lipschitz continuous from R to $L^{2+\eta}(\mathbb P)$ for some η > 0. Show that the finite difference estimator with decreasing step $\widehat{f'(x)}_M$ defined in (9.8) satisfies the following property: from every subsequence (M′) of (M) one may extract a further subsequence (M″) such that
$$\sqrt{M''}\left(\widehat{f'(x)}_{M''}-f'(x)\right)\xrightarrow{\mathcal L}\mathcal N(0;v),\quad v\in[0,\bar v],\quad\text{as }M''\to+\infty,$$
where $\bar v=C^2_{F,Z}-f'(x)^2$.
[Hint: Note that the sequence $\mathrm{Var}\Big(\frac{F(x+\varepsilon_k,Z_k)-F(x-\varepsilon_k,Z_k)}{2\varepsilon_k}\Big)$, k ≥ 1, is bounded and use the following Central Limit Theorem: if (Y_n)_{n≥1} is a sequence of independent centered random variables such that there exists η > 0 satisfying
$$\sup_n\mathbb E|Y_n|^{2+\eta}<+\infty\quad\text{and}\quad\exists\,N_n\to+\infty\ \text{with}\ \frac{1}{N_n}\sum_{k=1}^{N_n}\mathrm{Var}(Y_k)\xrightarrow{n\to+\infty}\sigma^2>0,$$
then
$$\frac{1}{\sqrt{N_n}}\sum_{k=1}^{N_n}Y_k\xrightarrow{\mathcal L}\mathcal N(0;\sigma^2)\quad\text{as }n\to+\infty.]$$

2. (Hölder framework) Assume that x ↦ F(x, Z) is only θ-Hölder from R to L²(P) with θ ∈ (0, 1), like in Proposition 9.2.
(a) Show that a natural upper-bound for the quadratic error induced by the symmetric finite difference estimator with decreasing step $\widehat{f'(x)}_M$ defined in (9.8) is given by
$$\frac1M\sqrt{\frac{[f'']^2_{\mathrm{Lip}}}{4}\left(\sum_{k=1}^M\varepsilon_k^2\right)^2+\frac{C^2_{\text{Höl},F,Z}}{2^{2(1-\theta)}}\sum_{k=1}^M\frac{1}{\varepsilon_k^{2(1-\theta)}}}.$$
(b) Show that the resulting estimator $\widehat{f'(x)}_M$ a.s. converges to its target f′(x) as soon as
$$\sum_{k\ge1}\frac{1}{k^2\,\varepsilon_k^{2(1-\theta)}}<+\infty.$$
(c) Assume that $\varepsilon_k=\frac{c}{k^a}$, k ≥ 1, where c is a positive real constant and a ∈ (0, 1). Show that the exponent a corresponds to an admissible step iff $a\in\big(0,\frac{1}{2(1-\theta)}\big)$. Justify the choice of $a^*=\frac{1}{2(3-\theta)}$ for the exponent a and derive that the resulting rate of decay of the quadratic error is $O\big(M^{-\frac{1}{3-\theta}}\big)$.

The above exercise shows that the (lack of) regularity of x ↦ F(x, Z) in L²(P) does impact the rate of convergence of the finite difference method.

9.2 Pathwise differentiation method

9.2.1 (Temporary) abstract point of view

We keep the notations of the former section. We assume that there exists p ∈ [1, +∞) such that
$$\forall\,\xi\in(x-\varepsilon_0,x+\varepsilon_0),\quad F(\xi,Z)\in L^p(\mathbb P).$$

Definition 9.1 The function ξ ↦ F(ξ, Z) from (x−ε₀, x+ε₀) to L^p(P) is L^p-differentiable at x if there exists a random vector denoted $\partial_xF(x,\cdot)\in L^p(\mathbb P)$ such that
$$\lim_{\xi\to x,\ \xi\ne x}\left\|\frac{F(x,Z)-F(\xi,Z)}{x-\xi}-\partial_xF(x,\cdot)\right\|_p=0.$$

Proposition 9.4 If ξ ↦ F(ξ, Z) is L^p-differentiable at x, then the function f defined by f(ξ) = E F(ξ, Z) is differentiable at x and
$$f'(x)=\mathbb E\,\partial_xF(x,\cdot).$$

Proof. It is a straightforward consequence of the inequality |E Y| ≤ ‖Y‖_p (which holds for any Y ∈ L^p(P)) applied to $Y(\omega)=\frac{F(x,Z(\omega))-F(\xi,Z(\omega))}{x-\xi}-\partial_xF(x,\omega)$ by letting ξ converge to x. ♦

Practical application. As soon as ∂_xF(x, ω) can be simulated at a reasonable cost, one may compute f′(x) by a standard Monte Carlo simulation.

The usual criterion to establish L^p-differentiability, especially when the underlying source of randomness comes from a diffusion (see below), is to establish pathwise differentiability of ξ ↦ F(ξ, Z(ω)) combined with an L^p-uniform integrability property of the ratio $\frac{F(x,Z)-F(\xi,Z)}{x-\xi}$ (see Theorem 11.2 and the corollary that follows in Chapter 11 for a short background on uniform integrability).
Usually this is applied with p = 2, since one needs ∂_xF(x, ω) to be in L²(P) to ensure that the Central Limit Theorem applies to rule the rate of convergence of the Monte Carlo simulation.
This can be summed up in the following proposition, whose proof is obvious.

Proposition 9.5 Let p ∈ [1, +∞). If
(i) there exists a random variable ∂_xF(x, ·) such that P(dω)-a.s., ξ ↦ F(ξ, Z(ω)) is differentiable at x with derivative ∂_xF(x, ω),
(ii) there exists ε₀ > 0 such that the family $\Big(\frac{F(x,Z)-F(\xi,Z)}{x-\xi}\Big)_{\xi\in(x-\varepsilon_0,x+\varepsilon_0)\setminus\{x\}}$ is L^p-uniformly integrable,
then ∂_xF(x, ·) ∈ L^p(P) and ξ ↦ F(ξ, Z) is L^p-differentiable at x with derivative ∂_xF(x, ·).

9.2.2 Tangent process of a diffusion and application to sensitivity computation

In a diffusion framework, it is important to have in mind that, more or less like in a deterministic setting, the solution of an SDE is usually smooth when viewed as a (random) function of its starting value. This smoothness even holds in a pathwise sense, with an (almost) explicit differential. The main result in that direction is due to Kunita (see [87], Theorem 3.1).
This result must be understood as follows: when a sensitivity (the δ-hedge and the γ parameter, but also other “Greek parameters”, as will be seen further on) related to the premium E h(X_T^x) of an option cannot be computed by “simply” interchanging differentiation and expectation, this lack of differentiability comes from the payoff function h. This also holds true for path-dependent options.
Let us come to Kunita’s theorem on the regularity of the flow of an SDE.

Theorem 9.1 (a) Let b : R₊ × R^d → R^d and ϑ : R₊ × R^d → M(d, q, R) be $C_b^1$ with bounded α-Hölder partial derivatives for an α > 0. Let $X^x=(X_t^x)_{t\ge0}$ denote the unique strong solution of the SDE
$$dX_t=b(t,X_t)\,dt+\vartheta(t,X_t)\,dW_t,\quad X_0=x\in\mathbb R^d, \qquad (9.12)$$
where W = (W¹, …, W^q) is a q-dimensional Brownian motion defined on a probability space (Ω, A, P). Then, at every t ∈ R₊, the mapping x ↦ X_t^x is a.s. continuously differentiable and its gradient $Y_t(x):=\nabla_xX_t^x=\Big[\frac{\partial(X_t^x)^i}{\partial x^j}\Big]_{1\le i,j\le d}$ satisfies the linear stochastic differential system: for 1 ≤ i, j ≤ d,
$$\forall\,t\in\mathbb R_+,\quad Y_t^{ij}(x)=\delta_{ij}+\sum_{\ell=1}^d\int_0^t\frac{\partial b^i}{\partial y^\ell}(s,X_s^x)\,Y_s^{\ell j}(x)\,ds+\sum_{\ell=1}^d\sum_{k=1}^q\int_0^t\frac{\partial\vartheta^{ik}}{\partial y^\ell}(s,X_s^x)\,Y_s^{\ell j}(x)\,dW_s^k,$$
where δ_{ij} denotes the Kronecker symbol.
(b) (To be checked.) Furthermore, the tangent process Y(x) takes values in the set GL(d, R) of invertible square matrices (see Theorem ?? in [?]).

Remark. One easily derives from the above theorem the slightly more general result about the tangent process of the solution $(X_s^{t,x})_{s\in[t,T]}$ starting from x at time t. This process, denoted $(Y(t,x)_s)_{s\ge t}$, can be deduced from Y(x) by
$$\forall\,s\ge t,\quad Y(t,x)_s=Y(x)_s\,Y(x)_t^{-1}.$$
This is a consequence of the uniqueness of the solution of a linear SDE.

Remark. Higher order differentiability properties hold true if b and ϑ are smoother. For a more precise statement, see Section 9.2.2 below.

Example. If d = q = 1, the above SDE for the tangent process reads
$$dY_t(x)=Y_t(x)\big(b_x'(t,X_t^x)\,dt+\vartheta_x'(t,X_t^x)\,dW_t\big),\quad Y_0(x)=1,$$
and elementary computations show that
$$Y_t(x)=\exp\left(\int_0^t\Big(b_x'(s,X_s^x)-\frac12\big(\vartheta_x'(s,X_s^x)\big)^2\Big)ds+\int_0^t\vartheta_x'(s,X_s^x)\,dW_s\right) \qquad (9.13)$$
so that, in the Black-Scholes model (b(t, x) = rx, ϑ(t, x) = ϑx), one retrieves that
$$\frac{d}{dx}X_t^x=Y_t(x)=\frac{X_t^x}{x}.$$
 Exercise. Let d = q = 1. Show that, under the assumptions of Theorem 9.1, the tangent process at x, $Y_t(x)=\frac{d}{dx}X_t^x$, satisfies
$$\sup_{s,t\in[0,T]}\frac{Y_t(x)}{Y_s(x)}\in L^p(\mathbb P),\quad p>0.$$

Applications to δ-hedging. The tangent process and the δ-hedge are closely related. Assume that the interest rate is 0 (for convenience) and that a basket is made up of d risky assets whose price dynamics $(X_t^x)_{t\in[0,T]}$, $X_0^x=x\in(0,+\infty)^d$, is solution to (9.12).
Then the premium of the payoff h(X_T^x) on the basket is given by
$$f(x):=\mathbb E\,h(X_T^x).$$
The δ-hedge vector of this option (at time 0 and) at x = (x¹, …, x^d) ∈ (0, +∞)^d is given by ∇f(x).
We have the following proposition that establishes the existence and the representation of ∇f(x) as an expectation (with in view its computation by a Monte Carlo simulation). It is a straightforward application of Theorem 2.1(b).

Proposition 9.6 (δ-hedge of vanilla European options) Assume a Borel function h : R^d → R satisfies the following assumptions:
(i) A.s. differentiability: ∇h(y) exists $\mathbb P_{X_T^x}(dy)$-a.s.;
(ii) Uniform integrability property (³): there exists a neighbourhood (x−ε₀, x+ε₀) (ε₀ > 0) of x such that
$$\left(\frac{|h(X_T^x)-h(X_T^{x'})|}{|x-x'|}\right)_{|x'-x|<\varepsilon_0,\ x'\ne x}\quad\text{is uniformly integrable.}$$
Then f is differentiable at x and ∇f(x) has the following representation as an expectation:
$$\frac{\partial f}{\partial x^i}(x)=\mathbb E\,\Big\langle\nabla h(X_T^x)\,\Big|\,\frac{\partial X_T^x}{\partial x^i}\Big\rangle,\quad i=1,\dots,d. \qquad (9.14)$$

Remark. One can also consider a “forward start” payoff $h(X_{T_1}^x,\dots,X_{T_N}^x)$. Then, under similar assumptions, its premium f(x) is differentiable and
$$\nabla f(x)=\sum_{j=1}^N\mathbb E\,\big\langle\nabla_{y_j}h(X_{T_1}^x,\dots,X_{T_N}^x)\,\big|\,\nabla_xX_{T_j}^x\big\rangle.$$

Computation by simulation. One uses these formulas to compute sensitivities by Monte Carlo simulation, since sensitivity parameters are functions of the couple made up of the diffusion process X^x, solution to the SDE starting at x, and its tangent process ∇_xX^x at x: it suffices to consider the Euler scheme of this couple $(X_t^x,\nabla_xX_t^x)$ over [0, T] with step T/n.
Assume d = 1 for notational convenience and set $Y_t=\nabla_xX_t^x$:
$$dX_t^x=b(X_t^x)\,dt+\vartheta(X_t^x)\,dW_t,\quad X_0^x=x\in\mathbb R,$$
$$dY_t=Y_t\big(b'(X_t^x)\,dt+\vartheta'(X_t^x)\,dW_t\big),\quad Y_0=1.$$
In one dimension, one can take advantage of the semi-closed formula (9.13) obtained above for the tangent process.
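A sketch of this simulation for d = 1, with Black-Scholes coefficients b(x) = rx, ϑ(x) = σx chosen only so that the resulting pathwise delta of a call can be checked against the closed form Φ₀(d₁). All numerical values are illustrative assumptions:

```python
import numpy as np

def euler_delta_call(x=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0,
                     n=50, M=200_000, seed=3):
    """Euler scheme of the couple (X, Y) on [0, T] with step T/n and
    pathwise delta estimate e^{-rT} E[ 1_{X_T >= K} Y_T ] for a call."""
    rng = np.random.default_rng(seed)
    dt = T / n
    X = np.full(M, x)
    Y = np.ones(M)                       # tangent process, Y_0 = 1
    for _ in range(n):
        dW = np.sqrt(dt) * rng.standard_normal(M)
        # b'(x) = r and vartheta'(x) = sigma are constant here, so the
        # order of the two updates does not matter
        Y = Y * (1.0 + r * dt + sigma * dW)
        X = X + r * X * dt + sigma * X * dW
    return float(np.exp(-r * T) * np.mean((X >= K) * Y))
```

For general coefficients, replace r and σ in the two update lines by b′(X) and ϑ′(X) (for Y) and by b(X), ϑ(X) (for X), evaluated along the scheme.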

Extension to an exogenous parameter θ

All the theoretical results obtained for the δ, i.e. for the differentiation of the flow of an SDE with respect to its initial value, can be extended to any parameter, provided no ellipticity is required. This follows from the remark that if the coefficient(s) b and/or ϑ of a diffusion depend(s) on a parameter θ, then the couple (X_t, θ) is still a diffusion process, namely
$$dX_t=b(\theta,X_t)\,dt+\vartheta(\theta,X_t)\,dW_t,\quad X_0=x,$$
$$d\theta_t=0,\quad \theta_0=\theta.$$

³ If h is Lipschitz continuous and X^x is a solution to an SDE with Lipschitz continuous coefficients b and ϑ in the sense of (7.2), this uniform integrability is always satisfied since it follows from Theorem 7.10 applied with p > 1.

Set $\tilde x=(x,\theta)$ and $\tilde X_t^{\tilde x}:=(X_t^x,\theta_t)$. Thus, following Theorems 3.1 and 3.3 of Section 3 in [87], if (θ, x) ↦ b(θ, x) and (θ, x) ↦ ϑ(θ, x) are $C_b^{k+\alpha}$ (0 < α < 1) with respect to x and θ (⁴), then the solution of the SDE at a given time t will be $C^{k+\beta}$ (0 < β < α) as a function of (x, θ). A more specific approach would show that some regularity in the sole variable θ would be enough, but this result does not follow for free from the general theorem on the differentiability of flows.
Assume b = b(θ, ·), ϑ = ϑ(θ, ·) and the initial value x = x(θ), θ ∈ Θ, Θ ⊂ R^q (open set). One can also differentiate the SDE with respect to this parameter θ. We may assume that q = 1 (by considering a partial derivative if necessary). Then one gets
$$\frac{\partial X_t(\theta)}{\partial\theta}=\frac{\partial x(\theta)}{\partial\theta}+\int_0^t\left(\frac{\partial b}{\partial\theta}\big(\theta,X_s(\theta)\big)+\frac{\partial b}{\partial x}\big(\theta,X_s(\theta)\big)\frac{\partial X_s(\theta)}{\partial\theta}\right)ds+\int_0^t\left(\frac{\partial\vartheta}{\partial\theta}\big(\theta,X_s(\theta)\big)+\frac{\partial\vartheta}{\partial x}\big(\theta,X_s(\theta)\big)\frac{\partial X_s(\theta)}{\partial\theta}\right)dW_s.$$

Toward practical implementation (practitioner’s corner)

Once coupled with the original diffusion process X^x, this yields expressions for the sensitivity with respect to the parameter θ, possibly closed, but usually computable by a Monte Carlo simulation of the Euler scheme of the couple $\big(X_t(\theta),\frac{\partial X_t(\theta)}{\partial\theta}\big)$.
As a conclusion, let us mention that this tangent process approach is close to the finite difference method applied to F(x, ω) = h(X_T^x(ω)): it appears as a limit case of the finite difference method.

9.3 Sensitivity computation for non smooth payoffs

The tangent process based approach needs a smoothness assumption on the payoff function, typically almost everywhere differentiability, as emphasized above (and in Section 2.3). Unfortunately, this assumption is not fulfilled by many usual payoff functions, like the digital payoff
$$h_T=h(X_T)\quad\text{with}\quad h(x):=\mathbf 1_{\{x\ge K\}},$$
whose δ-hedge parameter cannot be computed by the tangent process method (in fact $\partial_x\,\mathbb E\,h(X_T^x)$ is but the probability density of $X_T^x$ at the strike K). For similar reasons, so is the case for the γ sensitivity parameter of a vanilla Call option.
We also saw in Section 2.3 that, in the Black-Scholes model, this problem can be overcome since integration by parts or differentiation of the log-likelihood leads to sensitivity formulas for non smooth payoffs. Is it possible to extend this idea to more general models?

9.3.1 The log-likelihood approach (II)

A general abstract result. We saw in Section 2.3 that, given a family of random vectors X(θ) indexed by a parameter θ ∈ Θ, Θ open interval of R, all having a probability density p(θ, y) with respect to a reference nonnegative measure µ on R^d, assumed positive on a domain D of R^d for every θ ∈ Θ, one can derive the sensitivity of the functions
$$f(\theta):=\mathbb E\,\varphi\big(X(\theta)\big)=\int_D\varphi(y)\,p(\theta,y)\,\mu(dy)$$
(with respect to θ ∈ Θ, φ Borel function with appropriate integrability assumptions) provided the density function p is smooth enough as a function of θ, regardless of the regularity of φ. In fact, this was briefly developed in a one-dimensional Black-Scholes framework, but the extension to an abstract framework is straightforward and yields the following result.

Proposition 9.7 If the probability density p(θ, y), as a function defined on Θ × D, satisfies:
(i) θ ↦ p(θ, y) is differentiable on Θ, µ(dy)-a.e.;
(ii) there exists a Borel function g : R^d → R such that gφ ∈ L¹(µ) and, for every θ ∈ Θ, |∂_θ p(θ, y)| ≤ g(y) µ(dy)-a.e.;
then
$$\forall\,\theta\in\Theta,\quad f'(\theta)=\mathbb E\left(\varphi\big(X(\theta)\big)\,\frac{\partial\log p}{\partial\theta}\big(\theta,X(\theta)\big)\right).$$

The log-likelihood method for the Euler scheme


At a first glance this approach is attractive, unfortunately, in most situations we have no explicitly
computable form for the density p(θ, y) even when its existence is proved. However, if one thinks
about diffusion approximation (in a non-degenerate setting) by some discretization schemes like
Euler schemes, the application of the log-likelihood method becomes much less unrealistic at least
to compute by simulation some proxies of the greek parameters by a Monte Carlo simulation.
As a matter of fact, under a mild ellipticity assumption, the (constant step) Euler scheme of
a diffusion does have a probability density at each time t ≥ 0 which can be made explicit (in
some sense, see below). This is a straightforward consequence of the fact that it is a discrete time
Markov process with conditionally Gaussian increments. The principle is the following. We consider
a diffusion (X_t(θ))_{t∈[0,T]} depending on a parameter θ ∈ Θ, say

dXt (θ) = b(θ, Xt (θ)) dt + ϑ(θ, Xt (θ))dWt , X0 (θ) = x.

Let p_T(θ, x, y) and p̄_T(θ, x, y) denote the densities of X_T^x(θ) and of its Euler scheme X̄_T^x(θ)
(with step size T/n). Then one may naturally propose the following naive approximation

   
    f′(θ) = E[ ϕ(X_T^x(θ)) (∂ log p_T/∂θ)(θ, x, X_T^x(θ)) ] ≈ E[ ϕ(X̄_T^x(θ)) (∂ log p̄_T/∂θ)(θ, x, X̄_T^x(θ)) ].

In fact the story is not as straightforward because what can be made explicit and tractable is
the density of the whole n-tuple (X̄^x_{t_1^n}, . . . , X̄^x_{t_k^n}, . . . , X̄^x_{t_n^n}) (with t_n^n = T).

Proposition 9.8 Let q ≥ d and assume ϑϑ*(θ, x) ∈ GL(d, R) for every x ∈ R^d, θ ∈ Θ.

(a) Then the distribution P_{X̄^x_{T/n}}(dy) of X̄^x_{T/n} has a probability density given by

    p̄_{T/n}(θ, x, y) = (2πT/n)^{−d/2} (det ϑϑ*(θ, x))^{−1/2}
                       × exp( −(n/2T) (y − x − (T/n) b(θ, x))* (ϑϑ*(θ, x))^{−1} (y − x − (T/n) b(θ, x)) ).

(b) The distribution P_{(X̄^x_{t_1^n}, ..., X̄^x_{t_n^n})}(dy_1, . . . , dy_n) of the n-tuple (X̄^x_{t_1^n}, . . . , X̄^x_{t_k^n}, . . . , X̄^x_{t_n^n})
has a probability density given by

    p̄_{t_1^n, ..., t_n^n}(θ, x, y_1, . . . , y_n) = ∏_{k=1}^n p̄_{T/n}(θ, y_{k−1}, y_k)

with the convention y_0 = x.

Proof. (a) is a straightforward consequence of the definition of the Euler scheme at time T/n and
the formula for the density of a Gaussian vector. Claim (b) follows from an easy induction based
on the Markov property satisfied by the Euler scheme [Details in progress, to be continued...]. ♦

The above proposition shows that every marginal X̄^x_{t_k^n} has a density which, unfortunately, cannot
be made explicit. So, to take advantage of the above closed form for the density of the n-tuple, we
can write

    f′(θ) ≈ E[ ϕ(X̄_T^x(θ)) (∂ log p̄_{t_1^n,...,t_n^n}/∂θ)(θ, x, X̄^x_{t_1^n}(θ), . . . , X̄^x_{t_n^n}(θ)) ]
          = E[ ϕ(X̄_T^x(θ)) Σ_{k=1}^n (∂ log p̄_{T/n}/∂θ)(θ, X̄^x_{t_{k−1}^n}(θ), X̄^x_{t_k^n}(θ)) ].

At this stage it appears clearly that the method also works for path-dependent problems, i.e.
when considering Φ((X_t(θ))_{t∈[0,T]}) instead of ϕ(X_T(θ)) (at least for specific functionals Φ involving
time averaging, a finite number of instants, supremum, infimum, etc.). This leads to new difficulties,
in connection with the Brownian bridge method for diffusions, that need to be overcome.
Finally, let us mention that evaluating the rate of convergence of these approximations from
a theoretical point of view is quite a challenging problem since it involves not only the rate of
convergence of the Euler scheme itself but also that of the probability density functions of the
scheme toward that of the diffusion (see [10]).
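The resulting estimator, namely simulate the Euler scheme and multiply the payoff by the cumulated score Σ_k ∂_θ log p̄_{T/n}, can be sketched as follows in a scalar toy model. The geometric dynamics dX = θX dt + σX dW_t, the linear payoff ϕ(x) = x and all parameter values are illustrative assumptions of this sketch; in this model the one-step score simplifies by hand to (x_k − x_{k−1}(1 + θh))/(σ² x_{k−1}), h = T/n.

```python
import numpy as np

rng = np.random.default_rng(1)

def euler_loglik_sensitivity(x0=1.0, theta=0.1, sigma=0.2, T=1.0, n=50, M=200_000):
    """Log-likelihood proxy of d/dtheta E[phi(X_T(theta))] for the toy model
    dX = theta*X dt + sigma*X dW, phi(x) = x, using the Euler transition scores."""
    h = T / n
    X = np.full(M, x0)
    weight = np.zeros(M)
    for _ in range(n):
        Z = rng.standard_normal(M)
        X_next = X * (1.0 + theta * h + sigma * np.sqrt(h) * Z)
        # one-step score of the Gaussian Euler transition, differentiated in theta
        weight += (X_next - X * (1.0 + theta * h)) / (sigma ** 2 * X)
        X = X_next
    phi = X                       # phi(x) = x
    return (phi * weight).mean()  # proxy of f'(theta)

est = euler_loglik_sensitivity()
# Exact continuous-time value: d/dtheta [x0*exp(theta*T)] = x0*T*exp(theta*T) ≈ 1.105
print(est)
```

The estimator is unbiased for the sensitivity of the Euler scheme itself; the remaining gap to the diffusion sensitivity is precisely the discretization error discussed above.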

 Exercise. Apply what precedes to the case θ = x (starting value) when d = 1.

9.4 A flavour of stochastic variational calculus: from Bismut to Malliavin
9.4.1 Bismut’s formula
In this section, for the sake of simplicity, we assume that d = 1 and q = 1 (scalar Brownian motion).

Theorem 9.2 (Bismut formula) Let W = (Wt )t∈[0,T ] be a standard Brownian motion on a prob-
ability space (Ω, A, P) and let F := (Ft )t∈[0,T ] be its augmented (hence càd) natural filtration. Let
X x = (Xtx )t∈[0,T ] be a diffusion process solution to the SDE

dXt = b(Xt )dt + ϑ(Xt )dWt , X0 = x

where b and ϑ are Cb1 (hence Lipschitz continuous). Let f : R → R be a continuously differentiable
function such that  
    E[ f²(X_T^x) + (f′)²(X_T^x) ] < +∞.
Let (H_t)_{t∈[0,T]} be an F-progressively measurable (5) process lying in L²([0, T] × Ω, dt ⊗ dP), i.e.
satisfying E ∫_0^T H_s² ds < +∞. Then

    E[ f(X_T^x) ∫_0^T H_s dW_s ] = E[ f′(X_T^x) Y_T ∫_0^T (ϑ(X_s^x) H_s / Y_s) ds ]

where Y_t = dX_t^x/dx is the tangent process of X^x at x.
Proof (Sketch). To simplify the arguments of the proof, we will assume that |H_t| ≤ C < +∞,
C ∈ R_+, and that f and f′ are bounded functions. Let ε ≥ 0. Set, on the probability space
(Ω, F_T, P),

    P^(ε) = L_T^(ε) · P

where

    L_t^(ε) = exp( −ε ∫_0^t H_s dW_s − (ε²/2) ∫_0^t H_s² ds ),   t ∈ [0, T],

is a P-martingale since H is bounded. It follows from Girsanov's Theorem that

    W̃_t^(ε) := W_t + ε ∫_0^t H_s ds,   t ∈ [0, T],   is a (P^(ε), (F_t)_{t∈[0,T]})-Brownian motion.

Now it follows from Theorem 2.1 that

    E[ f(X_T) ∫_0^T H_s dW_s ] = − (∂/∂ε) E( f(X_T) L_T^(ε) )|_{ε=0}.

On the other hand,

    (∂/∂ε) E( f(X_T) L_T^(ε) )|_{ε=0} = (∂/∂ε) E_{P^(ε)}( f(X_T) )|_{ε=0}.
Now we can rewrite the SDE satisfied by X as follows

    dX_t = b(X_t) dt + ϑ(X_t) dW_t
         = ( b(X_t) − ε H_t ϑ(X_t) ) dt + ϑ(X_t) dW̃_t^(ε).

Consequently (see [141], Theorem 1.11, p. 372), X has the same distribution under P^(ε) as X^(ε),
solution to

    dX_t^(ε) = ( b(X_t^(ε)) − ε H_t ϑ(X_t^(ε)) ) dt + ϑ(X_t^(ε)) dW_t,   X_0^(ε) = x.
(5) This means that, for every t ∈ [0, T], (H_s(ω))_{(s,ω)∈[0,t]×Ω} is Bor([0, t]) ⊗ F_t-measurable.

Now we can write

    E[ f(X_T) ∫_0^T H_s dW_s ] = − (∂/∂ε) E( f(X_T^(ε)) )|_{ε=0}
                               = − E[ f′(X_T) (∂X_T^(ε)/∂ε)|_{ε=0} ]

where we used once again Theorem 2.1 and the obvious fact that X^(0) = X.
Using the tangent process method with ε as an auxiliary variable, one derives that the process
U_t := (∂X_t^(ε)/∂ε)|_{ε=0} satisfies

    dU_t = U_t ( b′(X_t) dt + ϑ′(X_t) dW_t ) − H_t ϑ(X_t) dt.

Plugging the regular tangent process Y into this equation yields

    dU_t = (U_t/Y_t) dY_t − H_t ϑ(X_t) dt.                                        (9.15)
We know that Y_t is never 0, so (up to some localization if necessary) we can apply Itô's formula
to the ratio U_t/Y_t: elementary computations of the partial derivatives of the function (u, y) ↦ u/y
on R × (0, +∞), combined with Equation (9.15), show that

    d(U_t/Y_t) = dU_t/Y_t − (U_t/Y_t²) dY_t + (1/2) ( −2 d⟨U, Y⟩_t/Y_t² + 2 U_t d⟨Y⟩_t/Y_t³ )
               = − (H_t ϑ(X_t)/Y_t) dt + (1/2) ( −2 d⟨U, Y⟩_t/Y_t² + 2 U_t d⟨Y⟩_t/Y_t³ ).
Then we derive from (9.15) that

    d⟨U, Y⟩_t = (U_t/Y_t) d⟨Y⟩_t

which yields

    d(U_t/Y_t) = − (ϑ(X_t) H_t/Y_t) dt.

Noting that U_0 = dX_0^(ε)/dε = dx/dε = 0 finally leads to

    U_t = − Y_t ∫_0^t (ϑ(X_s) H_s / Y_s) ds,   t ∈ [0, T],
which completes this step of the proof.
The extension to more general processes H can be done by introducing, for every n ≥ 1,

    H_t^(n)(ω) := H_t(ω) 1_{|H_t(ω)| ≤ n}.

It is clear by the Lebesgue dominated convergence Theorem that H (n) converges to H in L2 ([0, T ]×
Ω, dt ⊗ dP). Then one checks that both sides of Bismut’s identity are continuous with respect to
this topology (using Hölder’s Inequality).

The extension to unbounded functions f (and derivatives f′) follows by approximation of f by
bounded C_b^1 functions. ♦

Application to the computation of the δ-parameter. Assume b and ϑ are C_b^1. If f is
continuous with polynomial growth and satisfies

    E[ f²(X_T^x) + ∫_0^T (Y_t/ϑ(X_t^x))² dt ] < +∞,

then

    (∂/∂x) E[ f(X_T^x) ] = E[ f(X_T^x) · (1/T) ∫_0^T (Y_s/ϑ(X_s^x)) dW_s ],          (9.16)

where the stochastic integral (1/T) ∫_0^T (Y_s/ϑ(X_s^x)) dW_s is the weight.
Proof. We proceed as we did with the Black-Scholes model in Section 2.3: we first assume that
f is regular, namely bounded and differentiable with a bounded derivative. Then, using the tangent
process approach,

    (∂/∂x) E f(X_T^x) = E[ f′(X_T^x) Y_T ],

still with the notation Y_t = dX_t^x/dx. Then we set

    H_t = Y_t / ϑ(X_t^x).

Under the above assumption, we can apply Bismut's formula to get

    T E[ f′(X_T^x) Y_T ] = E[ f(X_T^x) ∫_0^T (Y_t/ϑ(X_t^x)) dW_t ],

which yields the announced result. The extension to continuous functions with polynomial growth
relies on an approximation argument. ♦

Remarks. • One retrieves, in the case of a Black-Scholes model, the formula (2.6) obtained for the
δ in Section 2.3 by an elementary integration by parts, since Y_t = X_t^x/x and ϑ(x) = σx.
• Note that the assumption

    ∫_0^T (Y_t/ϑ(X_t^x))² dt < +∞

is basically an ellipticity assumption. Thus if ϑ²(x) ≥ ε_0 > 0, one checks that the assumption is
always satisfied.
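In a Black-Scholes model dX_t = rX_t dt + σX_t dW_t one has Y_s = X_s/x and ϑ(ξ) = σξ, so the weight in (9.16) collapses to W_T/(xσT). The sketch below is an illustration of this special case (payoff, parameter values and the closed-form benchmark e^{rT}N(d_1) for ∂_x E(X_T − K)_+ are assumptions of this example, not taken from the text).

```python
import numpy as np
from math import log, sqrt, exp
from statistics import NormalDist

rng = np.random.default_rng(2)

def bismut_delta(x=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0, M=1_000_000):
    """Weighted (Bismut) estimator of d/dx E[(X_T^x - K)+] in Black-Scholes:
    the weight (1/T) int_0^T Y_s/vartheta(X_s) dW_s equals W_T/(x*sigma*T)."""
    W_T = sqrt(T) * rng.standard_normal(M)
    X_T = x * np.exp((r - 0.5 * sigma ** 2) * T + sigma * W_T)
    payoff = np.maximum(X_T - K, 0.0)
    weight = W_T / (x * sigma * T)
    return (payoff * weight).mean()

est = bismut_delta()
# Closed-form benchmark: d/dx E[(X_T - K)+] = exp(r*T) * N(d1)
x, K, r, sigma, T = 100.0, 100.0, 0.05, 0.2, 1.0
d1 = (log(x / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
exact = exp(r * T) * NormalDist().cdf(d1)
print(est, exact)
```

No derivative of the payoff is ever computed, which is why the same weight also applies to the (discontinuous) digital payoffs of Section 9.4.4.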

 Exercises. 1. Apply what precedes to get a formula for the γ-parameter in a general diffusion
model. [Hint: apply the above "derivative free" formula to the δ-formula obtained using the
tangent process method.]

2. Show that if b′ − b ϑ′/ϑ − (1/2) ϑ″ϑ = c ∈ R, then

    (∂/∂x) E f(X_T^x) = (e^{cT}/ϑ(x)) E[ f(X_T^x) W_T ].

9.4.2 The Haussmann-Clark-Ocone formula: toward Malliavin calculus

In this section we state an elementary version of the so-called Haussmann-Clark-Ocone formula,
following the seminal paper by Haussmann [71]. We still consider the standard SDE

dXt = b(Xt )dt + ϑ(Xt )dWt , X0 = x, t ∈ [0, T ].

with Lipschitz continuous coefficients b and ϑ. We denote by X^x = (X_t^x)_{t∈[0,T]} its unique solution
starting at x and by (F_t)_{t∈[0,T]} the (augmented) filtration of the Brownian motion W. We state
the result in a one-dimensional setting for (at least notational) convenience.

Theorem 9.3 Let F : (C([0, T], R), ‖·‖_sup) → R be a differentiable functional with differential
DF. Then

    F(X^x) = E F(X^x) + ∫_0^T E( DF(X^x).(1_{[t,T]} Y_·^(t)) | F_t ) ϑ(X_t^x) dW_t

where Y^(t) is the tangent process of X^x at time t, solution to

    dY_s^(t) = Y_s^(t) ( b′(X_s^x) ds + ϑ′(X_s^x) dW_s ),   s ∈ [t, T],   Y_t^(t) = 1.

This tangent process Y^(t) also reads

    Y_u^(t) = Y_u/Y_t,   u ∈ [t, T],   where Y = Y^(0) is the tangent process of X^x at the origin.
Remarks. • The starting point to understand this formula is to see it as a more explicit version
of the classical representation formula for Brownian martingales, namely M_t = E(F(X) | F_t), which
admits a formal representation as a Brownian stochastic integral

    M_t = M_0 + ∫_0^t H_s dW_s.

So, the Haussmann-Clark-Ocone formula provides a kind of closed form for the process H.
• The differential DF(ξ) of the functional F at an element ξ ∈ C([0, T], R) is a continuous linear
form on C([0, T], R). Hence, by the Riesz representation Theorem (see e.g. [30]), it can be
represented by a finite signed measure, say µ_{DF(ξ)}(ds), so that the term DF(ξ).(1_{[t,T]} Y_·^(t)) reads

    DF(ξ).(1_{[t,T]} Y_·^(t)) = ∫_0^T 1_{[t,T]}(s) Y_s^(t) µ_{DF(ξ)}(ds) = ∫_t^T Y_s^(t) µ_{DF(ξ)}(ds).

Toward the Malliavin derivative.

Assume F(x) = f(x(t_0)), x ∈ C([0, T], R), with f : R → R differentiable with derivative f′. Then
µ_{DF(x)}(ds) = f′(x(t_0)) δ_{t_0}(ds), where δ_{t_0}(ds) denotes the Dirac mass at time t_0. Consequently

    DF(X).(1_{[t,T]} Y_·^(t)) = f′(X_{t_0}^x) Y_{t_0}^(t) 1_{[0,t_0]}(t),

whence one derives that

    f(X_{t_0}^x) = E f(X_{t_0}^x) + ∫_0^{t_0} E( f′(X_{t_0}) Y_{t_0}^(t) | F_t ) ϑ(X_t^x) dW_t.          (9.17)

This leads us to introduce the Malliavin derivative D_t F(X^x) of F(X^x) at time t, which is a
derivative with respect to the path of the Brownian motion W (viewed at time t) itself, defined by

    D_t F(X^x) := DF(X^x).(1_{[t,T]} Y_·^(t)) ϑ(X_t^x)  if t ≤ t_0,      D_t F(X^x) := 0  if t > t_0.
The simplest interpretation is to write the following formal chain rule for differentiation:

    D_t F(X^x) = (∂F(X^x)/∂X_t^x) × (∂X_t^x/∂W).

If one notes that, for any s ≥ t, X_s^x = X_s^{X_t^x, t}, the first term "∂F(X^x)/∂X_t^x" in the above
product is clearly equal to DF(X).(1_{[t,T]} Y_·^(t)), whereas the second term is the result of a formal
differentiation of the SDE at time t with respect to W, namely ϑ(X_t^x).
An interesting feature of this derivative is that it satisfies the usual chain rules, like D_t F²(X^x) =
2 F(X^x) D_t F(X^x) and, more generally, D_t Φ(F(X^x)) = Φ′(F(X^x)) D_t F(X^x), etc.

What is called Malliavin calculus is a way to extend this notion of differentiation to more
general functionals using functional analysis arguments (closure of operators, etc.), e.g. via the
domain of the operator D_t (see []).

Using the Haussmann-Clark-Ocone formula to recover Bismut's formula. As a first
conclusion, we will show that the Haussmann-Clark-Ocone formula contains the Bismut formula.
Let X^x, H, f and T be as in Section 9.4.1. We consider the two true martingales

    M_t = ∫_0^t H_s dW_s   and   N_t = E(f(X_T^x)) + ∫_0^t E( f′(X_T) Y_T^(s) | F_s ) ϑ(X_s^x) dW_s,   t ∈ [0, T],

and perform a (stochastic) integration by parts. Owing to (9.17), we get, under appropriate inte-
grability conditions,

    E[ f(X_T^x) ∫_0^T H_s dW_s ] = E(M_T N_T)
        = 0 + E ∫_0^T [. . .] dM_t + E ∫_0^T [. . .] dN_t + E ∫_0^T E( f′(X_T) Y_T^(s) | F_s ) ϑ(X_s^x) H_s ds
        = E ∫_0^T E( f′(X_T) Y_T^(s) | F_s ) ϑ(X_s^x) H_s ds

owing to Fubini's Theorem. Next, using the characterization of conditional expectation to get rid
of the conditioning, we obtain

    E[ f(X_T^x) ∫_0^T H_s dW_s ] = ∫_0^T E( f′(X_T) Y_T^(s) ϑ(X_s^x) H_s ) ds.

Finally, a reverse application of Fubini's Theorem and the identity Y_T^(s) = Y_T/Y_s leads to

    E[ f(X_T^x) ∫_0^T H_s dW_s ] = E[ f′(X_T) Y_T ∫_0^T (ϑ(X_s^x) H_s / Y_s) ds ],

which is nothing but Bismut's formula (Theorem 9.2).


 Exercise. (a) Consider the functional ξ ↦ F(ξ) = ϕ( ∫_0^T f(ξ(s)) ds ), where ϕ : R → R and
f : R → R are differentiable functions. Show that

    F(X^x) = E F(X^x) + ∫_0^T E( ϕ′( ∫_0^T f(X_u^x) du ) ∫_t^T f′(X_s^x) (Y_s/Y_t) ds | F_t ) ϑ(X_t^x) dW_t.

(b) Derive, using the homogeneity of Equation (9.12), that

    E( ϕ′( ∫_0^T f(X_u^x) du ) ∫_t^T f′(X_s^x) (Y_s/Y_t) ds | F_t )
        = E( ϕ′( x̄ + ∫_0^{T−t} f(X_s^x) ds ) ∫_0^{T−t} f′(X_s^x) Y_s ds )|_{x=X_t^x, x̄=∫_0^t f(X_s^x) ds}

        =: Φ( T − t, X_t^x, ∫_0^t f(X_s^x) ds ).

9.4.3 Toward practical implementation: the paradigm of localization


For practical implementation, one should be aware that the weighted estimators often suffer from
a high variance compared to formulas derived from the tangent process, as can be measured when
the two formulas co-exist (the one with a differentiation of the payoff and the weighted one). This
can be seen on the formula for the δ-hedge when the maturity is small. Consequently, weighted
formulas usually need to be speeded up by variance reduction methods. The usual approach is
to isolate the singular part (where differentiation does not apply) from the smooth part.
Let us illustrate the principle of localization functions on a very simple toy example (d = 1):
assume that, for every ε > 0,

    |F(x, z) − F(x′, z)| ≤ C_{F,ε} |x − x′|,   x, x′, z ∈ R,   |x − z|, |x′ − z| ≥ ε > 0,

with lim inf_{ε→0} C_{F,ε} = +∞. Assume furthermore that

    ∀ x, z ∈ R, x ≠ z,   F′_x(x, z) does exist

(hence is bounded by C_{F,ε} if |x − z| ≥ ε).


On the other hand, assume that, e.g., F(x, Z) = h(G(x, Z)), where G(x, Z) has a probability
density which is regular in x, whereas h is "highly" singular in the neighbourhood of {G(z, z), z ∈ R}
(think of an indicator function). Then the function f(x) := E F(x, Z) is differentiable.
One then considers a function ϕ ∈ C^∞(R, [0, 1]) such that ϕ ≡ 1 on [−ε, ε] and supp(ϕ) ⊂
[−2ε, 2ε], and one may decompose

    F(x, Z) = (1 − ϕ(x − Z)) F(x, Z) + ϕ(x − Z) F(x, Z) := F_1(x, Z) + F_2(x, Z).

Such functions ϕ can be obtained as mollifiers in convolution theory, but other choices are possible,
like simply Lipschitz continuous functions (see the numerical illustration in Section 9.4.4).

Set f_i(x) = E F_i(x, Z), so that f(x) = f_1(x) + f_2(x). Then one may use a direct differentiation to
compute

    f_1′(x) = E[ ∂F_1(x, Z)/∂x ]

(or a finite difference method with constant or decreasing increments). As concerns f_2′(x), since
F_2(x, Z) is singular, it is natural to look for a weighted estimator

    f_2′(x) = E( F_2(x, Z) Π ),

obtained e.g. by the above described method if we are in a diffusion framework.


When working in a diffusion setting at a fixed time T , the above Bismut formula makes the job.
When we work with a functional of the whole trajectory, typically some path-depedent options (like
barriers, etc) in local or stochastic volatility models, Malliavin calculus methods is a convenient
and powerful tool even if, in most settings, more elementary approaches can often be used to derive
these explicit weights. . . which may lead to different results. For an example of weight computation
by means of Malliavin calculus in the case of Lookback or barriers options, we refer to [64].
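The localization paradigm above can be sketched on the simplest possible toy model, which is entirely our assumption: X = x + Z with Z ∼ N(0;1), the digital payoff h = 1_{·≥K}, and a Lipschitz ramp h_η as localizing approximation. Finite differences handle the smooth part; the likelihood-ratio weight Z = ∂_x log p(x, X) handles the singular remainder.

```python
import numpy as np
from math import exp, sqrt, pi

rng = np.random.default_rng(3)

def ramp(xi, K, eta):
    """Lipschitz approximation of 1_{xi >= K}, linear on [K - eta, K + eta]."""
    return np.clip((xi - (K - eta)) / (2 * eta), 0.0, 1.0)

def localized_delta(x=0.0, K=0.5, eta=0.5, eps=1e-2, M=1_000_000):
    Z = rng.standard_normal(M)
    # smooth part: central finite difference on h_eta (common random numbers)
    fd = (ramp(x + eps + Z, K, eta) - ramp(x - eps + Z, K, eta)).mean() / (2 * eps)
    # singular part: weighted estimator on h - h_eta, weight = score Z
    sing = ((x + Z >= K).astype(float) - ramp(x + Z, K, eta)) * Z
    return fd + sing.mean()

est = localized_delta()
exact = exp(-0.5 * 0.5 ** 2) / sqrt(2 * pi)   # f'(x) = Gaussian density at K - x
print(est, exact)
```

The variance of the weighted term stays bounded because the localized payoff h − h_η vanishes outside the band |ξ − K| ≤ η, which is precisely what localization is designed to achieve.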

9.4.4 Numerical illustration: what localization is useful for (with V. Lemaire)


Let us consider in a standard Black-Scholes model (St )t∈[0,T ] with interest rate r > 0 and volatility
σ > 0, two binary options:
– a digital Call with strike K > 0 and
– an asset-or-nothing Call with strike K > 0,
defined respectively by their payoff functions

h1 (ξ) = 1{ξ≥K} and hS (ξ) = ξ1{ξ≥K} ,

reproduced in Figures 9.1 and 9.2 (pay attention to the scales of the y-axis in each figure).
Figure 9.1: Payoff h_1 of the digital Call with strike K = 50.

Figure 9.2: Payoff h_S of the asset-or-nothing Call with strike K = 50.
Denoting F(x, z) = e^{−rT} h_1( x e^{(r−σ²/2)T + σ√T z} ) and F(x, z) = e^{−rT} h_S( x e^{(r−σ²/2)T + σ√T z} )
in the digital Call case and in the asset-or-nothing Call case respectively, we consider f(x) = E F(x, Z), where

Z is a standard Gaussian variable. With both payoff functions, we are in the singular setting in
which F(·, Z) is not Lipschitz continuous but only 1/2-Hölder in L². As expected, we are interested
in computing the delta of the two options, i.e. f′(x).
In such a singular case, the variance of the finite difference estimator explodes as ε → 0 (see
Proposition 9.2) and ξ ↦ F(ξ, Z) is not L^p-differentiable for p ≥ 1, so that the tangent process
approach cannot be used (see Section 9.2.2).
We first illustrate the variance explosion in Figures 9.3 and 9.4 where the parameters have been
set to r = 0.04, σ = 0.1, T = 1/12 (one month), x_0 = K = 50 and M = 10^6.
Figure 9.3: Variance of the two estimators as a function of ε (digital Call).

Figure 9.4: Variance of the two estimators as a function of ε (asset-or-nothing Call).

To avoid the explosion of the variance, one considers a smooth (namely Lipschitz continuous)
approximation of both payoffs. Given a small parameter η > 0, one defines

    h_{1,η}(ξ) = h_1(ξ) if |ξ − K| > η,       h_{1,η}(ξ) = (1/(2η)) ξ + (1/2)(1 − K/η) if |ξ − K| ≤ η,

and

    h_{S,η}(ξ) = h_S(ξ) if |ξ − K| > η,       h_{S,η}(ξ) = ((K+η)/(2η)) ξ + ((K+η)/2)(1 − K/η) if |ξ − K| ≤ η.

We define Fη (x, Z) and fη (x) similarly as in the singular case.
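For concreteness, here is a direct NumPy transcription of the two smoothed payoffs (a sketch; K and η below are arbitrary test values), together with a check that each ramp is continuous at the knots ξ = K ± η.

```python
import numpy as np

# Smoothed payoffs h_{1,eta} and h_{S,eta} defined above.

def h1_eta(xi, K, eta):
    ramp = xi / (2 * eta) + 0.5 * (1 - K / eta)
    return np.where(np.abs(xi - K) > eta, (xi >= K).astype(float), ramp)

def hS_eta(xi, K, eta):
    ramp = (K + eta) / (2 * eta) * xi + (K + eta) / 2 * (1 - K / eta)
    return np.where(np.abs(xi - K) > eta, xi * (xi >= K), ramp)

K, eta = 50.0, 2.0
xi = np.array([K - eta, K, K + eta])
print(h1_eta(xi, K, eta))   # linear ramp: 0, 1/2, 1 across [K - eta, K + eta]
print(hS_eta(xi, K, eta))   # linear ramp: 0, (K + eta)/2, K + eta
```

Both functions agree with the original payoffs outside the band |ξ − K| ≤ η and are globally Lipschitz continuous, which is what tames the variance of the finite difference estimator.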


In this numerical section, we introduce a Richardson-Romberg (RR) extrapolation of the finite
difference estimator. The extrapolation is done using a linear combination of the finite difference
estimator with step ε and the one with step ε/2. This linear extrapolation allows us to "kill" the
first bias term in the expansion of the error. Similarly to the proofs of Propositions 9.1 and 9.2,
one then proves that

    f′(x) − ( (4/3) f̂′_{ε/2,M}(x) − (1/3) f̂′_{ε,M}(x) ) = O(ε³) + O(1/√M),          (9.18)

where f̂′_{ε,M}(x) = (1/M) Σ_{k=1}^M (F(x + ε, Z_k) − F(x − ε, Z_k))/(2ε), and

    f_η′(x) − ( (4/3) f̂′_{η,ε/2,M}(x) − (1/3) f̂′_{η,ε,M}(x) ) = O(ε³) + O(1/√M),          (9.19)

where, as usual,

    f̂′_{η,ε,M}(x) = (1/M) Σ_{k=1}^M (F_η(x + ε, Z_k) − F_η(x − ε, Z_k))/(2ε).

 Exercise. Prove (9.18) and (9.19).

The control of the variance in the smooth case is illustrated in Figures 9.5 and 9.6 when η = 1,
and in Figures 9.7 and 9.8 when η = 0.5. The variance increases when η decreases to 0 but does
not explode as ε goes to 0.
For a given ε, note that the variance is usually higher using the RR extrapolation. However,
in the Lipschitz continuous case the variance of the RR estimator and that of the crude finite
difference estimator converge toward the same value when ε goes to 0. Moreover, from (9.18) we
deduce the choice ε = O(M^{−1/6}) to keep the balance between the bias term of the RR estimator
and the variance term.
As a consequence, for a given level of the L²-error, we can choose a bigger ε with the RR
estimator, which reduces the bias without increasing the variance.
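The RR-extrapolated finite difference scheme of (9.18) can be sketched in a few lines. To isolate the bias cancellation, the target below is a smooth toy example of our own choosing, f(x) = E[(x + Z)³] = x³ + 3x with Z ∼ N(0;1), for which the central difference bias is exactly ε² and the 4/3, −1/3 combination removes it.

```python
import numpy as np

rng = np.random.default_rng(4)

def fd(x, eps, Z):
    """Central finite difference with common random numbers."""
    F = lambda y: (y + Z) ** 3
    return ((F(x + eps) - F(x - eps)) / (2 * eps)).mean()

def rr_fd(x, eps, Z):
    """Richardson-Romberg combination of steps eps/2 and eps."""
    return (4.0 * fd(x, eps / 2, Z) - fd(x, eps, Z)) / 3.0

Z = rng.standard_normal(500_000)
x, eps = 2.0, 0.5
exact = 3 * x ** 2 + 3                 # f'(x) = 3x**2 + 3 = 15
print(fd(x, eps, Z) - exact)           # bias ~ eps**2 = 0.25 (plus MC noise)
print(rr_fd(x, eps, Z) - exact)        # bias removed, ~ 0 up to MC noise
```

Reusing the same sample Z for all four evaluations (common random numbers) keeps the variance of the difference quotients under control, exactly as in the M-sample estimators f̂′ of the text.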

Figure 9.5: Variance of the two estimators as a function of ε (digital Call); smooth payoff with η = 1.

Figure 9.6: Variance of the two estimators as a function of ε (asset-or-nothing Call); smooth payoff with η = 1.

Figure 9.7: Variance of the two estimators as a function of ε (digital Call); smooth payoff with η = 0.5.

Figure 9.8: Variance of the two estimators as a function of ε (asset-or-nothing Call); smooth payoff with η = 0.5.

The parameters of the model are the following: x_0 = 50, K = 50, r = 0.04, σ = 0.1 and T = 1/52
(one week) or T = 1/12 (one month). The number of Monte Carlo simulations is fixed at M = 10^6.
We now compare the following estimators in the two cases, with the two maturities T = 1/12
(one month) and T = 1/52 (one week):
• Finite difference estimator on the non-smooth payoffs h_1 and h_S with ε = M^{−1/4} ≃ 0.03.

• Finite difference estimator with Richardson-Romberg extrapolation on the non-smooth
  payoffs with ε = M^{−1/6} = 0.1.

• Crude weighted estimator (with the standard Black-Scholes δ-weight) on the non-smooth
  payoffs h_1 and h_S.

• Localization: finite difference estimator on the smooth payoffs h_{1,η} and h_{S,η} with η = 1.5
  and ε = M^{−1/4}, combined with the weighted estimator on the (non-smooth) differences
  h_1 − h_{1,η} and h_S − h_{S,η}.

• Localization: finite difference estimator with Richardson-Romberg extrapolation on the
  smooth payoffs h_{1,η} and h_{S,η} with η = 1.5 and ε = M^{−1/6}, combined with the weighted
  estimator on the (non-smooth) differences h_1 − h_{1,η} and h_S − h_{S,η}.

The results are summarized in Tables ?? and ?? for the delta of the digital Call option and in
Tables ?? and ?? for that of the asset-or-nothing Call option.
[In progress...]
• Multi-dimensional case
• Variance reduction?
Chapter 10

Multi-asset American/Bermuda
Options, swing options

10.1 Introduction
In this chapter, devoted to numerical methods for multi-asset American and Bermuda options, we
will consider a slightly more general framework than the nonnegative dynamics of traded assets in
complete or incomplete markets. That is, we will switch from a dynamics (S_t)_{t∈[0,T]} to a more
general R^d-valued Brownian diffusion (X_t)_{t∈[0,T]} satisfying the SDE

    dX_t = b(t, X_t) dt + σ(t, X_t) dW_t,   X_0 ∈ L^1_{R^d}(Ω, A, P),          (10.1)

where b : [0, T] × R^d → R^d and σ : [0, T] × R^d → M(d, q, R) are Lipschitz continuous in x,
uniformly in t ∈ [0, T], and W is a q-dimensional standard Brownian motion defined on (Ω, A, P),
independent of X_0. The filtration of interest is F_t = σ(X_0, N_P) ∨ F_t^W.
Each component X^i, i = 1, . . . , d, of X can still be seen as the price process of a traded risky
asset (although, in what follows, it may a priori be negative).
To keep the link with American option pricing, we will call an American "vanilla" payoff any
process of the form (f(t, X_t))_{t∈[0,T]}, where f : [0, T] × R^d → R_+ (temporarily) is a Borel function.
If r denotes the interest rate (assumed constant for the sake of simplicity), the "obstacle process"
is given by

    Z̃_t = e^{−rt} f(t, X_t).
We will always make the following assumptions on the payoff function f

(Lip) ≡ f is Lipschitz continuous in x ∈ R^d, uniformly in t ∈ [0, T].

This assumption will be sometimes refined into a semi-convex condition

(SC) ≡ f satisfies (Lip) and is semi-convex in the following sense: there exist δf : [0, T] × R^d →
R^d, Borel and bounded, and ρ ∈ (0, ∞) such that

    ∀ t ∈ [0, T], ∀ x, y ∈ R^d,   f(t, y) − f(t, x) ≥ ( δf(t, x) | y − x ) − ρ |y − x|².


Examples. 1. If f is Lipschitz continuous and convex, then f is semi-convex with

    δf(t, x) = ( (∂f/∂x_i)_r (t, x) )_{1≤i≤d}

where (∂f/∂x_i)_r (t, x) denotes the right derivative with respect to x_i at (t, x).
2. If f(t, ·) ∈ C¹(R^d) and, for every t ∈ [0, T], ∇_x f(t, ·) is Lipschitz continuous in x, uniformly in
t ∈ [0, T], then

    f(t, y) − f(t, x) ≥ ( ∇_x f(t, x) | y − x ) − [∇_x f]_Lip |ξ − x| |y − x|,   ξ ∈ [x, y],
                      ≥ ( ∇_x f(t, x) | y − x ) − [∇_x f]_Lip |y − x|²,

where [∇_x f]_Lip = sup_{t∈[0,T]} [∇_x f(t, ·)]_Lip.

Proposition 10.1 (see [148]) There exists a continuous function F : [0, T] × R^d → R_+ such that

    F(t, X_t) = P-esssup { E( e^{−r(τ−t)} f(τ, X_τ) | F_t ), τ ∈ T_t^F }

where T_t^F denotes the set of (F_s)_{s∈[0,T]}-stopping times with values in [t, T].

Note that F(0, x) is but the premium of the American option with payoff f(t, X_t^x), t ∈ [0, T].

10.2 Time discretization


We will discretize the optimal stopping problem and, if necessary, the diffusion itself, to have at
hand a simulatable underlying structure process.
First we note that, if t_k^n = kT/n, k = 0, . . . , n, then (X_{t_k^n})_{0≤k≤n} is an (F_{t_k^n})_{0≤k≤n}-Markov chain
with transitions

    P_k(x, dy) = P( X_{t_{k+1}^n} ∈ dy | X_{t_k^n} = x ),   k = 0, . . . , n − 1.

Proposition 10.2 Let n ∈ N*. Set

    T_k^n = { τ : (Ω, A, P) → {t_ℓ^n, ℓ = k, . . . , n}, (F_{t_ℓ^n})-stopping time }

and, for every x ∈ R^d,

    F_n(t_k^n, x) = sup_{τ∈T_k^n} E e^{−r(τ−t_k^n)} f(τ, X_τ^{t_k^n,x}),

where (X_s^{t_k^n,x})_{s∈[t_k^n,T]} is the unique solution of the (SDE) starting from x at time t_k^n.

(a) ( e^{−rt_k^n} F_n(t_k^n, X_{t_k^n}) )_{k=0,...,n} = (P, (F_{t_k^n}))-Snell envelope of ( e^{−rt_k^n} f(t_k^n, X_{t_k^n}) )_{k=0,...,n}.

(b) ( F_n(t_k^n, X_{t_k^n}) )_{k=0,...,n} satisfies the "pathwise" backward dynamic programming principle
(denoted (BDPP) in the sequel):

    F_n(T, X_T) = f(T, X_T)

and

    F_n(t_k^n, X_{t_k^n}) = max( f(t_k^n, X_{t_k^n}), e^{−rT/n} E( F_n(t_{k+1}^n, X_{t_{k+1}^n}) | F_{t_k^n}^W ) ),   k = 0, . . . , n − 1.

(c) The functions F_n(t_k^n, ·) satisfy the "functional" backward dynamic programming principle:

    F_n(T, x) = f(T, x)   and   F_n(t_k^n, x) = max( f(t_k^n, x), e^{−rT/n} (P_k F_n(t_{k+1}^n, ·))(x) ),   k = 0, . . . , n − 1.

Proof. We proceed backward as well. We consider the functions F_n(t_k^n, ·), k = 0, . . . , n, defined
as in (c).
(c) ⇒ (b). The result follows from the Markov property, which implies, for every k = 0, . . . , n − 1,

    E( F_n(t_{k+1}^n, X_{t_{k+1}^n}) | F_{t_k^n}^W ) = (P_k F_n(t_{k+1}^n, ·))(X_{t_k^n}).

(b) ⇒ (a). This is a trivial consequence of the (BDPP) since ( e^{−rt_k^n} F_n(t_k^n, X_{t_k^n}) )_{k=0,...,n} is the
Snell envelope associated to the obstacle sequence ( e^{−rt_k^n} f(t_k^n, X_{t_k^n}) )_{k=0,...,n}.

Applying what precedes to the case X_0 = x, we derive from the general theory of optimal stopping
that

    F_n(0, x) = sup { E e^{−rτ} f(τ, X_τ^x), τ ∈ T_0^n }.

The extension to the times t_k^n, k = 1, . . . , n, follows likewise from the same reasoning carried out
with (X_{t_ℓ^n}^{t_k^n,x})_{ℓ=k,...,n}. ♦
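The functional (BDPP) of claim (c) becomes effective as soon as the transitions P_k are computable. The toy sketch below is our own illustration, not taken from the text: it swaps the diffusion transitions for those of a CRR binomial approximation of a Black-Scholes asset (all parameter values are arbitrary) and prices a Bermudan put by the backward recursion F_n(T, ·) = f(T, ·), F_n(t_k^n, ·) = max(f, e^{−rT/n} P_k F_n(t_{k+1}^n, ·)).

```python
import numpy as np
from math import exp, sqrt

def bermudan_put_bdpp(x0=100.0, K=100.0, r=0.05, sigma=0.2, T=1.0, n=200):
    """Backward dynamic programming on a recombining binomial chain."""
    h = T / n
    u, d = exp(sigma * sqrt(h)), exp(-sigma * sqrt(h))
    p = (exp(r * h) - d) / (u - d)          # risk-neutral up-probability
    disc = exp(-r * h)
    # terminal layer: F_n(T, x) = f(T, x) = (K - x)+
    x = x0 * u ** np.arange(n, -1, -1) * d ** np.arange(0, n + 1)
    F = np.maximum(K - x, 0.0)
    for k in range(n - 1, -1, -1):          # backward recursion over the grid
        x = x0 * u ** np.arange(k, -1, -1) * d ** np.arange(0, k + 1)
        cont = disc * (p * F[:-1] + (1 - p) * F[1:])   # e^{-rT/n} P_k F
        F = np.maximum(np.maximum(K - x, 0.0), cont)   # max(obstacle, continuation)
    return F[0]

price = bermudan_put_bdpp()
print(price)   # ≈ 6.1: the usual at-the-money American put value for these parameters
```

The Monte Carlo methods of this chapter address the case where, unlike here, the conditional expectation P_k F_n(t_{k+1}^n, ·) cannot be computed in closed form.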

Remark-Exercise. Given that the flow of the SDE is Lipschitz continuous from R^d to any
L^p(P), 1 ≤ p < +∞ (see Theorem 7.10), in the sense that

    ‖ sup_{t∈[0,T]} |X_t^x − X_t^y| ‖_p ≤ C_{b,σ,T} |x − y|,

it is straightforward to establish that the functions F_n(t_k^n, ·) are Lipschitz continuous, uniformly
in t_k^n, k = 0, . . . , n, n ≥ 1.

Now we pass to the (discrete time) Euler scheme as defined by Equation (7.3) in Chapter 7.
We recall its definition for convenience (with a slight change of notation concerning the Gaussian
noise):

    X̄_{t_{k+1}^n} = X̄_{t_k^n} + (T/n) b(t_k^n, X̄_{t_k^n}) + √(T/n) σ(t_k^n, X̄_{t_k^n}) Z_{k+1},   X̄_0 = X_0,   k = 0, . . . , n − 1,   (10.2)

where (Z_k)_{1≤k≤n} denotes the sequence of i.i.d. N(0; I_q)-distributed random vectors given by

    Z_k := √(n/T) ( W_{t_k^n} − W_{t_{k−1}^n} ),   k = 1, . . . , n.

(Strictly speaking we should write Z_k^n rather than Z_k.)


Proposition 10.3 (Euler scheme) The above proposition remains true when replacing the se-
quence (X_{t_k^n})_{0≤k≤n} by its Euler scheme with step T/n, still with the filtration (σ(X_0, F_{t_k^n}^W))_{0≤k≤n}.
In both cases, one just has to replace the transitions P_k(x, dy) of the original process by those of its
Euler scheme with step T/n, namely P̄^n_{k,k+1}(x, dy), defined by

    P̄^n_{k,k+1} f(x) = E f( x + (T/n) b(t_k^n, x) + √(T/n) σ(t_k^n, x) Z ),   Z ∼ N(0; I_q),

and F_n by F̄_n.

Remark. In fact, the result holds true as well with the (smaller) natural innovation filtration
F_k^{X_0,Z} = σ(X_0, Z_1, . . . , Z_k), 0 ≤ k ≤ n, since the Euler scheme remains a Markov chain with the
same transitions with respect to this filtration.

Now we state a convergence rate result for the "réduites" and the value functions. In what
follows we consider the value function of the original continuous time optimal stopping problem,
denoted F(t, x) and defined by

    F(t, x) = sup { E e^{−rτ} f(τ, X_τ^x), τ : (Ω, A) → [t, T], (F_t^W)-stopping time }          (10.3)
            = E[ P-esssup { E( e^{−rτ} f(τ, X_τ^x) | F_t^W ), τ : (Ω, A) → [t, T], (F_t^W)-stopping time } ].   (10.4)

Furthermore, one can also prove that

    F(t, X_t^x) = P-esssup { E( e^{−rτ} f(τ, X_τ^x) | F_t^W ), τ : (Ω, A) → [t, T], (F_t^W)-stopping time }.
General interpretation. In an optimal stopping framework, F(t, x) is also known as the
"réduite" at time t of the optimal stopping problem (with horizon T). The second equality is (true
but) not trivial and we refer to [149].
If (X_t^x)_{t∈[0,T]} denotes an underlying (Markov) structure process and f(t, X_t) is the gain you
receive if you leave the "game" at time t, then the right hand side of the above equality represents
the supremum of your mean gains over all "honest" stopping strategies. By honest we mean here
non-anticipative with respect to the "history" (or filtration) of the process W. In mathematical
terms, it means that admissible strategies τ are stopping times with respect to the filtration (F_t^W),
i.e. random variables τ : (Ω, A) → [0, T] satisfying

    ∀ t ∈ [0, T],   {τ ≤ t} ∈ F_t^W.

This means that you decide to leave the game between 0 and t based on your observation of the
process W between 0 and t.
Asking {τ = t} to lie in F_t^W (I decide to stop exactly at time t) may seem a more natural
assumption, but in a continuous time framework this condition turns out to be technically not
strong enough to make the theory work.

 Exercise. Show that, for any F^W-stopping time τ, one has, for every t ∈ [0, T], {τ = t} ∈ F_t^W
(the converse is not true in general in a continuous time setting).

In practice, the underlying assumption that the Brownian motion W is observable by the player
is highly unrealistic, or at least induces dramatic limitations on the modeling of (X_t)_{t∈[0,T]}, which
in Finance models the vector of market prices of traded assets.
In fact, since the Brownian diffusion process X is an F^W-Markov process, one shows (see [149])
that one can replace, in the above definition and characterizations of F(t, x), the natural filtration
of the Brownian motion W by that of X. This is clearly more in accordance with usual models in
Finance (and elsewhere) since X represents in most models the observable structure process.

Interpretation in terms of American options. As far as derivative pricing is concerned, the
(non-negative) function f defined on [0, T] × R_+^d is an American payoff function. This means that
if the holder of the contract exercises his/her option at time t, he/she will receive a monetary flow
equal to f(t, x) if the vector of market prices is equal to x ∈ R_+^d at time t. In a complete market
model, one shows that, for hedging purposes, one should price all the American payoffs under the
unique risk-neutral probability measure, i.e. the unique probability measure P* equivalent to the
historical probability that makes the price process X^x = (X_t^x)_{t∈[0,T]} a P*-martingale (x is here the
starting value of the process X^x). However, as far as numerical aspects are concerned, this kind
of restriction is of little interest in the sense that it has no impact on the methods or on their
performances. This is the reason why, in what follows, we will keep a general form for the drift b of
our Brownian diffusion dynamics (and still denote by P the probability measure on (Ω, A)).

Theorem 10.1 (see [13], 2003) (a) Discretization of the stopping rules for the structure process X: If f satisfies (Lip), i.e. is Lipschitz continuous in x uniformly in t ∈ [0, T ], then so are the value functions Fn (tnk , ·) and F (tnk , ·), uniformly with respect to tnk , k = 0, . . . , n, n ≥ 1. Furthermore F (tnk , ·) ≥ Fn (tnk , ·) and
\[
\Big\| \max_{0\le k\le n} \big( F(t^n_k, X_{t^n_k}) - F_n(t^n_k, X_{t^n_k}) \big) \Big\|_p \le \frac{C_{b,\sigma,f,T}}{\sqrt{n}}
\]
and, for every compact set K ⊂ Rd ,
\[
0 \le \sup_{x\in K}\, \max_{0\le k\le n} \big( F(t^n_k, x) - F_n(t^n_k, x) \big) \le \frac{C_{b,\sigma,f,T,K}}{\sqrt{n}}.
\]

(b) If f is semi-convex, then there exist real constants Cb,σ,f,T and Cb,σ,f,T,K > 0 such that
\[
\Big\| \max_{0\le k\le n} \big( F_n(t^n_k, X_{t^n_k}) - F(t^n_k, X_{t^n_k}) \big) \Big\|_p \le \frac{C_{b,\sigma,f,T}}{n}
\]
and, for every compact set K ⊂ Rd ,
\[
0 \le \sup_{x\in K}\, \max_{0\le k\le n} \big( F(t^n_k, x) - F_n(t^n_k, x) \big) \le \frac{C_{b,\sigma,f,T,K}}{n}.
\]

(c) Euler approximation scheme X̄ n : There exist real constants Cb,σ,f,T and Cb,σ,f,T,K > 0 such that
\[
\Big\| \max_{0\le k\le n} \big| F_n(t^n_k, X_{t^n_k}) - \bar F_n(t^n_k, \bar X^n_{t^n_k}) \big| \Big\|_p \le \frac{C_{b,\sigma,f,T}}{\sqrt{n}}
\]
and, for every compact set K ⊂ Rd ,
\[
0 \le \sup_{x\in K}\, \max_{0\le k\le n} \big( F(t^n_k, x) - \bar F_n(t^n_k, x) \big) \le \frac{C_{b,\sigma,f,T,K}}{\sqrt{n}}.
\]
How to proceed in practice?
 If the diffusion process X can be simulated exactly at the times tnk , it is useless to introduce the Euler scheme, and one can take advantage of the semi-convexity of the payoff/obstacle function f to get a time discretization error of the réduite F by Fn at a O(1/n)-rate.
A typical example of this situation is provided by the multi-dimensional Black-Scholes model (and its avatars for FX (Garman-Kohlhagen) or futures (Black)) and, more generally, by models where the process (Xtx ) can be written at each time t ∈ [0, T ] as an explicit function of Wt , namely
\[
\forall\, t \in [0, T], \quad X_t = \varphi(t, W_t),
\]
where ϕ(t, x) can be computed at very low cost. When d = 1 (although of smaller interest for applications, in view of the available analytical methods based on variational inequalities), one can also rely on the exact simulation method for one-dimensional diffusions (see [23]).
 In the general case, we will rely on two successive discretization steps: one for the optimal stopping rules and one for the underlying process, to make it simulatable. In both cases, as far as numerics are concerned, we will rely on the BDPP, which itself requires the computation of conditional expectations.
In both cases we now have access to a simulatable Markov chain (either the Euler scheme or the process itself sampled at times tnk ). Making the computation of the discrete time Snell envelope (and its réduite) tractable requires a new discretization phase.

10.3 A generic discrete time Markov chain model


We consider a standard discrete time Markovian framework: let (Xk )0≤k≤n be an Rd -valued (Fk )0≤k≤n -Markov chain defined on a filtered probability space (Ω, A, (Fk )0≤k≤n , P) with transitions
\[
P_k(x, dy) = \mathbb P\big( X_{k+1} \in dy \,|\, X_k = x \big), \quad k = 0, \dots, n-1,
\]
and let Z = (Zk )0≤k≤n be an (Fk )-adapted obstacle/payoff sequence of non-negative integrable random variables of the form
\[
0 \le Z_k = f_k(X_k), \quad k = 0, \dots, n.
\]
We want to compute the so-called (P, (Fk )0≤k≤n )-Snell envelope U = (Uk )0≤k≤n defined by
\[
U_k = \operatorname*{esssup}\Big\{ \mathbb E\big( f(\tau, X_\tau) \,|\, \mathcal F_k \big),\ \tau : (\Omega, \mathcal A) \to \{k, \dots, n\}\ (\mathcal F_\ell)\text{-stopping time} \Big\}
\]
and its “réduite”, i.e. E Uk .

Proposition 10.4 The Snell envelope (Uk )0≤k≤n is the solution to the following Backward Dynamic Programming Principle:
\[
U_n = Z_n \quad\text{and}\quad U_k = \max\big( Z_k, \mathbb E( U_{k+1} \,|\, \mathcal F_k ) \big), \quad k = 0, \dots, n-1.
\]
Furthermore, for every k ∈ {0, . . . , n} there exists a Borel function uk : Rd → R such that
\[
U_k = u_k(X_k), \quad k = 0, \dots, n,
\]
where
\[
u_n = f_n \quad\text{and}\quad u_k = \max\big( f_k, P_k u_{k+1} \big), \quad k = 0, \dots, n-1.
\]

Proof. Temporarily left as an exercise (see also [116]). ♦
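When the transitions P_k are finitely supported, the recursion u_k = max(f_k, P_k u_{k+1}) can be run exactly. As an illustration only (all parameters below are hypothetical), here is a minimal sketch of this backward descent on a binomial (CRR-type) chain with a discounted put payoff:

```python
import numpy as np

# Hypothetical binomial chain: X_{k+1} = X_k * u or X_k * d with prob. q, 1 - q,
# so P_k v(x) = q v(xu) + (1-q) v(xd) is computable exactly at every lattice node.
s0, K, r, sig, T, n = 100.0, 100.0, 0.05, 0.2, 1.0, 200
dt = T / n
u, d = np.exp(sig * np.sqrt(dt)), np.exp(-sig * np.sqrt(dt))
q = (np.exp(r * dt) - d) / (u - d)            # risk-neutral up-weight

def f(k, x):
    """Discounted put payoff f_k(x) = e^{-r t_k} (K - x)^+."""
    return np.exp(-r * k * dt) * np.maximum(K - x, 0.0)

# Backward Dynamic Programming: u_n = f_n, then u_k = max(f_k, P_k u_{k+1}).
x = s0 * u ** np.arange(n + 1) * d ** np.arange(n, -1, -1)  # lattice at time n
v = f(n, x)
for k in range(n - 1, -1, -1):
    x = s0 * u ** np.arange(k + 1) * d ** np.arange(k, -1, -1)
    v = np.maximum(f(k, x), q * v[1:] + (1 - q) * v[:-1])
reduite = v[0]                                # U_0 = u_0(x_0)
```

The returned value v[0] is the réduite u_0(x_0); the same backward structure underlies all the approximation schemes discussed below.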

Moreover, general discrete time optimal stopping theory (with finite horizon) ensures that, for every k ∈ {0, . . . , n}, there exists an optimal stopping time τk for this problem when starting to play at time k, i.e. such that
\[
U_k = \mathbb E\big( Z_{\tau_k} \,|\, \mathcal F_k \big).
\]
This optimal stopping time may not be unique, but
\[
\tau_k = \min\big\{ \ell \in \{k, \dots, n\} : U_\ell = Z_\ell \big\}
\]
is in any case the lowest one.


An alternative to the above “regular” BDPP formula is its dual form involving these optimal stopping times. This second programming principle is used in Longstaff and Schwartz's original paper on regression methods (see next section).

Proposition 10.5 The sequence of optimal stopping times (τk )0≤k≤n satisfies
\[
\tau_n = n \quad\text{and}\quad \tau_k = k\,\mathbf 1_{\{Z_k \ge \mathbb E( Z_{\tau_{k+1}} | X_k )\}} + \tau_{k+1}\,\mathbf 1_{\{Z_k \ge \mathbb E( Z_{\tau_{k+1}} | X_k )\}^c}, \quad k = 0, \dots, n-1.
\]

Proof. We proceed by a backward induction on k. If k = n the result is obvious. If U_{k+1} = E( Z_{τ_{k+1}} | F_{k+1} ), then
\[
\mathbb E\big( Z_{\tau_{k+1}} | X_k \big) = \mathbb E\big( \mathbb E( Z_{\tau_{k+1}} | \mathcal F_{k+1} ) \,|\, X_k \big) = \mathbb E\big( U_{k+1} | X_k \big) = \mathbb E\big( u_{k+1}(X_{k+1}) | X_k \big) = \mathbb E\big( U_{k+1} | \mathcal F_k \big)
\]
by the Markov property, so that
\[
\mathbb E\big( Z_{\tau_k} | \mathcal F_k \big) = Z_k\,\mathbf 1_{\{Z_k \ge \mathbb E( U_{k+1} | \mathcal F_k )\}} + \mathbb E\big( Z_{\tau_{k+1}} | \mathcal F_k \big)\,\mathbf 1_{\{Z_k \ge \mathbb E( U_{k+1} | \mathcal F_k )\}^c}
= \max\big( Z_k, \mathbb E( U_{k+1} | \mathcal F_k ) \big) = U_k.
\]
Furthermore,
\[
\tau_k = k\,\mathbf 1_{\{Z_k \ge \mathbb E( U_{k+1} | \mathcal F_k )\}} + \tau_{k+1}\,\mathbf 1_{\{Z_k \ge \mathbb E( U_{k+1} | \mathcal F_k )\}^c},
\]
which implies by a straightforward induction that
\[
\tau_k = \min\big\{ \ell \in \{k, \dots, n\} : U_\ell = Z_\ell \big\}. \qquad \diamond
\]

10.3.1 Principle of the regression method


These methods are also known as the Longstaff-Schwartz method, in reference to the paper [104] (see also [32]). Assume that all the random variables Zk , k = 0, . . . , n, are square integrable; then so are the Z_{τ_k} 's. The idea is to replace the conditional expectation operator E( · | Xk ) by a linear regression on the first N elements of a Hilbert basis of (L2 (Ω, σ(Xk ), P), ⟨·, ·⟩_{L2 (P)} ). This is a very natural idea to approximate conditional expectations (see e.g. [27] for an introduction in a general framework).
Mainly for convenience we will consider a sequence ei : Rd → R, i ∈ N∗ , of Borel functions such
that, for every k ∈ {0, . . . , n}, (ei (Xk ))i∈N∗ is a Hilbert basis of L2 (Ω, σ(Xk ), P), i.e.

{ei (Xk ), i ∈ N∗ }⊥ = {0}, k = 0, . . . , n.

In practice, one may choose different functions at every time k, i.e. families ei,k so that
(ei,k (Xk ))i≥1 makes up a Hilbert basis of L2 (Ω, σ(Xk ), P).

 Example. If Xk = Wtnk , k = 0, . . . , n, show that Hermite polynomials provide a possible solution (up to an appropriate normalization at every time k). What is its specificity?

Meta-script of a regression procedure.

• Approximation 1: dimension truncation

 At every time k ∈ {0, . . . , n}, truncate at level Nk :
\[
e^{[N_k]}(X_k) := \big( e_1(X_k), e_2(X_k), \dots, e_{N_k}(X_k) \big)
\]
and set
\[
\tau_n^{[N_n]} := n, \qquad \tau_k^{[N_k]} := k\,\mathbf 1_{\{Z_k > (\alpha_k^{[N_k]} | e^{[N_k]}(X_k))\}} + \tau_{k+1}^{[N_{k+1}]}\,\mathbf 1_{\{Z_k \le (\alpha_k^{[N_k]} | e^{[N_k]}(X_k))\}}, \quad k = 0, \dots, n-1,
\]
where
\[
\alpha_k^{[N_k]} := \operatorname*{argmin}\Big\{ \mathbb E\Big( Z_{\tau_{k+1}^{[N_{k+1}]}} - (\alpha \,|\, e^{[N_k]}(X_k)) \Big)^2, \ \alpha \in \mathbb R^{N_k} \Big\}.
\]
In fact this finite dimensional optimization problem has a well-known solution, given by
\[
\alpha_k^{[N_k]} = \mathrm{Gram}\big( e^{[N_k]}(X_k) \big)^{-1} \Big[ \big\langle Z_{\tau_{k+1}^{[N_{k+1}]}} \,\big|\, e_\ell^{[N_k]}(X_k) \big\rangle_{L^2(\mathbb P)} \Big]_{1 \le \ell \le N_k},
\]
where the so-called Gram matrix of e^{[N_k]}(X_k) is defined by
\[
\mathrm{Gram}\big( e^{[N_k]}(X_k) \big) = \Big[ \big\langle e_\ell^{[N_k]}(X_k) \,\big|\, e_{\ell'}^{[N_k]}(X_k) \big\rangle_{L^2(\mathbb P)} \Big]_{1 \le \ell, \ell' \le N_k}.
\]

• Approximation 2: Monte Carlo approximation

This second approximation phase is itself decomposed into two successive steps:
– a forward Monte Carlo simulation of the underlying “structure” Markov process (Xk )0≤k≤n ,
– followed by a backward approximation of τk^{[N]} , k = 0, . . . , n. In a more formal way, the idea is to replace the true distribution of the Markov chain (Xk )0≤k≤n by the empirical measure of a simulated sample of size M of the chain.
For notational convenience we will set N = (N0 , . . . , Nn ) ∈ Nn+1 and will write ei (Xk ) instead of ei^{[N_k]} (Xk ), etc.

 Forward Monte Carlo simulation phase: simulate (and store) M independent copies X (1) , . . . , X (m) , . . . , X (M ) of X = (Xk )0≤k≤n in order to have access to the empirical measure
\[
\frac1M \sum_{m=1}^M \delta_{X^{(m)}}.
\]

 Backward phase:
– At time n: for every m ∈ {1, . . . , M },
\[
\tau_n^{[N],m,M} := n.
\]
– For k = n − 1 down to 0: compute
\[
\alpha_k^{[N],M} := \operatorname*{argmin}_{\alpha \in \mathbb R^{N_k}}\ \frac1M \sum_{m=1}^M \Big( Z^{(m)}_{\tau_{k+1}^{[N],m,M}} - \alpha \cdot e^{[N]}(X_k^{(m)}) \Big)^2
\]
using the closed form formula
\[
\alpha_k^{[N],M} = \left[ \frac1M \sum_{m=1}^M e_\ell^{[N]}(X_k^{(m)})\, e_{\ell'}^{[N]}(X_k^{(m)}) \right]_{1\le \ell,\ell' \le N_k}^{-1} \left[ \frac1M \sum_{m=1}^M Z^{(m)}_{\tau_{k+1}^{[N],m,M}}\, e_\ell^{[N]}(X_k^{(m)}) \right]_{1 \le \ell \le N_k}.
\]
– For every m ∈ {1, . . . , M }, set
\[
\tau_k^{[N],m,M} := k\,\mathbf 1_{\{Z_k^{(m)} > \alpha_k^{[N],M} \cdot e^{[N]}(X_k^{(m)})\}} + \tau_{k+1}^{[N],m,M}\,\mathbf 1_{\{Z_k^{(m)} \le \alpha_k^{[N],M} \cdot e^{[N]}(X_k^{(m)})\}}.
\]
Finally, the resulting approximation of the mean value at the origin of the Snell envelope, also called the “réduite”, reads
\[
\mathbb E\, U_0 = \mathbb E( Z_{\tau_0} ) \approx \mathbb E\big( Z_{\tau_0^{[N]}} \big) \approx \frac1M \sum_{m=1}^M Z^{(m)}_{\tau_0^{[N],m,M}} \quad\text{as } M \to +\infty.
\]
m=1

Note that when F0 = {∅, Ω} (so that X0 = x0 ∈ Rd ), U0 = E U0 = u0 (x0 ).
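The whole meta-script can be condensed into a short Longstaff-Schwartz-type sketch. The setting below is an assumption-laden illustration (hypothetical parameters; a Bermudan put in a Black-Scholes model, basis e^{[3]}(x) = (1, x, x²), regression restricted to in-the-money paths as in [104]), not a reference implementation:

```python
import numpy as np

# Hypothetical Bermudan put, n exercise dates t_k = kT/n, discount folded into f.
s0, K, r, sigma, T = 100.0, 110.0, 0.05, 0.2, 1.0
n, M = 10, 100_000
dt = T / n
rng = np.random.default_rng(1)

# Forward phase: M exact Black-Scholes paths of X_k = S_{t_k}.
Z = rng.normal(size=(M, n))
logS = np.log(s0) + np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z, axis=1)
S = np.concatenate([np.full((M, 1), s0), np.exp(logS)], axis=1)

def payoff(k, s):                        # discounted payoff f(t_k, s)
    return np.exp(-r * k * dt) * np.maximum(K - s, 0.0)

def basis(s):                            # e^{[3]}(x) = (1, x, x^2)
    return np.column_stack([np.ones_like(s), s, s * s])

# Backward phase: tau_n = n, then regress Z_{tau_{k+1}} on e(X_k).
cash = payoff(n, S[:, n])                # Z_{tau_k} along each path
for k in range(n - 1, 0, -1):
    itm = payoff(k, S[:, k]) > 0         # regression on in-the-money paths only
    E = basis(S[itm, k])
    alpha, *_ = np.linalg.lstsq(E, cash[itm], rcond=None)
    cont = E @ alpha                     # estimated continuation value E(Z_{tau_{k+1}} | X_k)
    ex = payoff(k, S[itm, k]) >= cont
    idx = np.where(itm)[0][ex]
    cash[idx] = payoff(k, S[idx, k])

price = cash.mean()                      # ~ E U_0 (réduite)
```

At k = 0 the chain is deterministic, so the estimated réduite would simply be compared with the immediate payoff f(0, x_0) if exercise at time 0 is allowed.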

Remarks. • One may rewrite formally the second approximation phase by simply replacing the distribution of the chain (Xk )0≤k≤n by the empirical measure
\[
\frac1M \sum_{m=1}^M \delta_{X^{(m)}}.
\]
• The Gram matrices \big[\mathbb E\big( e_i^{(N_k)}(X_k)\, e_j^{(N_k)}(X_k) \big)\big]_{1\le i,j\le N_k} can be computed “off line”, in the sense that they are not “payoff dependent”, by contrast with the terms \big[ \langle Z_{\tau_{k+1}^{[N]}} \,|\, e_\ell^{[N_k]}(X_k) \rangle_{L^2(\mathbb P)} \big]_{1\le \ell \le N_k}. In various situations, it is even possible to have a closed form for this Gram matrix, e.g. when (ei (Xk ))i≥1 happens to be an orthonormal basis of L2 (Ω, σ(Xk ), P). In that case the Gram matrix reduces to the identity matrix! Such is the case, for example, when Xk = W_{t^n_k}/\sqrt{t^n_k}, t^n_k = kT/n, k = 1, . . . , n, and ei = Hi−1 , where (Hi )i≥0 is the (normalized) basis of Hermite polynomials (see [78], chapter 3, p. 167).
• The algorithmic analysis of the above described procedure shows that its implementation requires
– a forward simulation of M paths of the Markov chain,
– a backward nonlinear optimization phase in which all the (stored) paths have to interact through the computation of αk^{[N],M} , which depends on all the values Xk^{(m)} , k ≥ 1.
However, still in very specific situations, the forward phase can be skipped if a backward simulation method for the Markov chain (Xk )0≤k≤n is available. Such is the case for the Brownian motion at times kT/n, using Brownian bridge simulation methods (see Chapter 8).
The rate of convergence of the Monte Carlo step of the procedure is ruled by a Central Limit Theorem stated below.

Theorem 10.2 (Clément-Lamberton-Protter (2003), see [34]) The Monte Carlo approximation satisfies a CLT:
\[
\sqrt M \left( \frac1M \sum_{m=1}^M Z^{(m)}_{\tau_k^{[N],m,M}} - \mathbb E\, Z_{\tau_k^{[N]}},\ \Big( \alpha_k^{[N],M} - \alpha_k^{[N]} \Big)_{0\le k\le n-1} \right) \xrightarrow{\ \mathcal L\ } \mathcal N(0; \Sigma)
\]
where Σ is a non-degenerate covariance matrix.

Pros & Cons of the regression method

Pros
• The method is “natural”: approximation of the conditional expectation by the (affine) regression operator on a truncated basis of L2 (σ(Xk ), P).
• The method is “flexible”: one has the opportunity to change or adapt the (truncated) basis of L2 (σ(Xk ), P) at each step of the procedure (e.g. by including the payoff function from time n in the (truncated) basis at each step).

Cons
• From a purely theoretical point of view, the regression approach does not provide error bounds or rates of approximation for the convergence of E( Z_{τ_0^{[N]}} ) toward E Z_{τ_0} = E U0 , which is mostly ruled by the rate at which the family e^{[N_k]} (Xk ) “fills” L2 (Ω, σ(Xk ), P) as Nk goes to infinity. However, in practice this information would be of little interest since, especially in higher dimension, the possible choices for the size Nk are limited by the storage capacity of the computing device in use.

• Almost all computations are made on-line since they are payoff dependent. However, note that the Gram matrix of (e_\ell^{[N]}(Xk ))_{1\le \ell\le N} can be computed off-line since it only depends on the structure process.

• The choice of the functions ei (x), i ≥ 1, is crucial and needs much care (and intuition). In practical implementations, it may vary at every time step. Furthermore, it may have a biasing effect for options deep in or deep out of the money, since the coordinates of the functions ei (Xk ) are computed locally, “where things happen most of the time”, which affects the prices at long range through their behaviour at infinity. On the other hand, this choice of functions, if they are smooth, has a smoothing effect which can be of interest to users (as long as it does not induce hidden arbitrages. . . ). To overcome the first problem one may choose local functions, like indicator functions of a Voronoi diagram (see the next Section 10.4, devoted to quantization tree methods, or Chapter ??), with the counterpart that no smoothing effect can be expected any more.
When there is a family of distributions “related” to the underlying Markov structure process, a natural idea is to consider an orthonormal basis of L2 (µ0 ), where µ0 is a normalized distribution of the family. A typical example is the sequence of Hermite polynomials for the normal distribution N (0, 1).
When no simple solution is available, the simple basis (t` )`≥0 remains a quite natural and efficient choice in dimension 1.
In higher dimensions (in fact the only case of practical interest, since the one-dimensional setting is usually solved by PDE methods!), this choice becomes more and more influenced by the payoff itself.

• Huge RAM capacities are needed to store all the paths of the simulated Markov chain (forward phase), except when a backward simulation procedure is available. This induces a stringent limitation on the size M of the simulation, even with recent devices, to prevent a swapping effect which would dramatically slow down the procedure. By swapping effect we mean that when the quantity of data to be stored becomes too large, the computer uses its hard disk to store it, but access to this ROM memory is incredibly slow compared to access to RAM memory.

• Regression methods are strongly payoff dependent, in the sense that a significant part of the procedure (product of the inverted Gram matrix with the projection of the payoff at every time k) has to be redone for each payoff.

 Exercise. Write a regression algorithm based on the “primal” BDPP.



10.4 Quantization (tree) methods (II)


10.4.1 Approximation by a quantization tree
In this section, we still deal with the simple discrete time Markovian optimal stopping problem introduced in the former section. The underlying idea of the quantization tree method is to approximate the whole Markovian dynamics of the chain (Xk )0≤k≤n using a skeleton of its distribution supported by a tree.

For every k ∈ {0, . . . , n}, we replace the marginal Xk by a function X̂k of Xk taking values in a grid Γk , namely X̂k = πk (Xk ), where πk : Rd → Γk is a Borel function. The grid Γk = πk (Rd ) (also known as a codebook in Signal Processing or Information Theory) will always be supposed finite in practice, with size |Γk | = Nk ∈ N∗ , although the error bounds established below still hold if the grids are infinite, provided πk is sublinear (|πk (x)| ≤ C(1 + |x|)) so that X̂k has at least as many moments as Xk .
We saw in Chapter 5 an optimal way to specify the function πk (including Γk ) by minimizing the induced Lp -mean quantization error ‖Xk − X̂k ‖p . This is the purpose of optimal quantization theory. We will come back to these aspects further on.
The starting point, being aware that the sequence (X̂k )0≤k≤n has no reason to share the Markov property, is to force this Markov property in the Backward Dynamic Programming Principle. This means defining by induction a quantized pseudo-Snell envelope of (fk (Xk ))0≤k≤n (assumed to lie at least in L1 ), namely
\[
\hat U_n = f_n(\hat X_n), \qquad \hat U_k = \max\Big( f_k(\hat X_k), \mathbb E\big( \hat U_{k+1} \,|\, \hat X_k \big) \Big). \tag{10.5}
\]
The forced Markov property results from conditioning by X̂k rather than by the σ-field F̂k := σ(X̂` , 0 ≤ ` ≤ k).
It is straightforward by induction that, for every k ∈ {0, . . . , n},
\[
\hat U_k = \hat u_k(\hat X_k), \qquad \hat u_k : \mathbb R^d \to \mathbb R_+ \ \text{a Borel function}.
\]
See Subsection 10.4.4 for the detailed implementation.

10.4.2 Error bounds


The following theorem establishes the control of the approximation of the true Snell envelope (Uk )0≤k≤n by the quantized pseudo-Snell envelope (Ûk )0≤k≤n in terms of the Lp -mean approximation errors ‖Xk − X̂k ‖p .

Theorem 10.3 (see [12] (2001), [132] (2011)) Assume that all the functions fk : Rd → R+ are Lipschitz continuous and that the transitions Pk (x, dy) = P(Xk+1 ∈ dy | Xk = x) are Lipschitz continuous in the following sense:
\[
[P_k]_{\mathrm{Lip}} = \sup_{[g]_{\mathrm{Lip}} \le 1} [P_k g]_{\mathrm{Lip}} < +\infty, \quad k = 0, \dots, n-1.
\]
Set [P ]Lip = max_{0≤k≤n−1} [Pk ]Lip and [f ]Lip = max_{0≤k≤n} [fk ]Lip .
Let p ∈ [1, +∞). We assume that \sum_{k=1}^n \big( \|X_k\|_p + \|\hat X_k\|_p \big) < +\infty.
(a) For every k ∈ {0, . . . , n},
\[
\| U_k - \hat U_k \|_p \le 2\,[f]_{\mathrm{Lip}} \sum_{\ell=k}^n \big( [P]_{\mathrm{Lip}} \vee 1 \big)^{n-\ell}\, \| X_\ell - \hat X_\ell \|_p.
\]
(b) If p = 2, for every k ∈ {0, . . . , n},
\[
\| U_k - \hat U_k \|_2 \le \sqrt 2\,[f]_{\mathrm{Lip}} \left( \sum_{\ell=k}^n \big( [P]_{\mathrm{Lip}} \vee 1 \big)^{2(n-\ell)}\, \| X_\ell - \hat X_\ell \|_2^2 \right)^{1/2}.
\]

Proof. Step 1. First, we control the Lipschitz constants of the functions uk . It follows from the classical inequality
\[
\forall\, a_i, b_i \in \mathbb R, \quad \big| \sup_{i\in I} a_i - \sup_{i\in I} b_i \big| \le \sup_{i\in I} |a_i - b_i|
\]
that
\[
[u_k]_{\mathrm{Lip}} \le \max\big( [f_k]_{\mathrm{Lip}}, [P_k u_{k+1}]_{\mathrm{Lip}} \big) \le \max\big( [f]_{\mathrm{Lip}}, [P_k]_{\mathrm{Lip}} [u_{k+1}]_{\mathrm{Lip}} \big),
\]
with the convention [u_{n+1}]_{Lip} = 0. An easy backward induction yields [u_k]_{\mathrm{Lip}} \le [f]_{\mathrm{Lip}} \big( [P]_{\mathrm{Lip}} \vee 1 \big)^{n-k}.
Step 2. We focus on claim (b), i.e. p = 2. First we show that
\[
\big\| P_1 g(X_1) - \mathbb E\big( h(\hat X_2) \,|\, \hat X_1 \big) \big\|_2^2 \le [P_1 g]_{\mathrm{Lip}}^2\, \big\| X_1 - \hat X_1 \big\|_2^2 + \big\| g(X_2) - h(\hat X_2) \big\|_2^2. \tag{10.6}
\]
We have
\[
P_1 g(X_1) - \mathbb E\big( h(\hat X_2) | \hat X_1 \big) = \Big( \mathbb E\big( g(X_2)|X_1 \big) - \mathbb E\big( \mathbb E( g(X_2)|X_1 ) \,|\, \hat X_1 \big) \Big) \perp \Big( \mathbb E\big( g(X_2)|\hat X_1 \big) - \mathbb E\big( h(\hat X_2)|\hat X_1 \big) \Big),
\]
where we used the chaining rule for conditional expectations, E(E( · |X1 )|X̂1 ) = E( · |X̂1 ), which holds since X̂1 is σ(X1 )-measurable. The orthogonality follows from the very definition of conditional expectation as a projection. Consequently,
\[
\big\| P_1 g(X_1) - \mathbb E\big( h(\hat X_2) | \hat X_1 \big) \big\|_2^2 \le \big\| P_1 g(X_1) - \mathbb E\big( P_1 g(X_1) | \hat X_1 \big) \big\|_2^2 + \big\| \mathbb E\big( g(X_2) | \hat X_1 \big) - \mathbb E\big( h(\hat X_2) | \hat X_1 \big) \big\|_2^2
\le \big\| P_1 g(X_1) - P_1 g(\hat X_1) \big\|_2^2 + \big\| g(X_2) - h(\hat X_2) \big\|_2^2,
\]
where we used in the last line the very definition of the conditional expectation E( · |X̂1 ) as the best L2 -approximation by a σ(X̂1 )-measurable random vector, and the fact that conditional expectation operators are projections (hence contractions with operator norm at most 1).

Now, it follows from both dynamic programming formulas (original and quantized) that
\[
|U_k - \hat U_k| \le \max\Big( |f_k(X_k) - f_k(\hat X_k)|,\ \big| \mathbb E( U_{k+1}|\mathcal F_k ) - \mathbb E( \hat U_{k+1}|\hat X_k ) \big| \Big),
\]
so that
\[
|U_k - \hat U_k|^2 \le |f_k(X_k) - f_k(\hat X_k)|^2 + \big| \mathbb E( U_{k+1}|\mathcal F_k ) - \mathbb E( \hat U_{k+1}|\hat X_k ) \big|^2.
\]
Now, we derive from (10.6), applied with Pk , g = u_{k+1} and h = û_{k+1} , that
\[
\big\| \mathbb E( U_{k+1}|\mathcal F_k ) - \mathbb E( \hat U_{k+1}|\hat X_k ) \big\|_2^2 \le [P_k u_{k+1}]_{\mathrm{Lip}}^2\, \big\| X_k - \hat X_k \big\|_2^2 + \big\| u_{k+1}(X_{k+1}) - \hat u_{k+1}(\hat X_{k+1}) \big\|_2^2.
\]
Plugging this inequality into the previous one yields, for every k ∈ {0, . . . , n},
\[
\big\| U_k - \hat U_k \big\|_2^2 \le \Big( [f]_{\mathrm{Lip}}^2 + [P]_{\mathrm{Lip}}^2 [u_{k+1}]_{\mathrm{Lip}}^2 \Big) \big\| X_k - \hat X_k \big\|_2^2 + \big\| U_{k+1} - \hat U_{k+1} \big\|_2^2,
\]
still with the convention [u_{n+1}]_{Lip} = 0. Now,
\[
[f]_{\mathrm{Lip}}^2 + [P]_{\mathrm{Lip}}^2 [u_{k+1}]_{\mathrm{Lip}}^2 \le [f]_{\mathrm{Lip}}^2 + [P]_{\mathrm{Lip}}^2\, [f]_{\mathrm{Lip}}^2 \big( 1 \vee [P]_{\mathrm{Lip}} \big)^{2(n-(k+1))} \le 2\,[f]_{\mathrm{Lip}}^2 \big( 1 \vee [P]_{\mathrm{Lip}} \big)^{2(n-k)}.
\]
Consequently,
\[
\big\| U_k - \hat U_k \big\|_2^2 \le 2\,[f]_{\mathrm{Lip}}^2 \sum_{\ell=k}^{n-1} \big( 1 \vee [P]_{\mathrm{Lip}} \big)^{2(n-\ell)} \big\| X_\ell - \hat X_\ell \big\|_2^2 + [f]_{\mathrm{Lip}}^2 \big\| X_n - \hat X_n \big\|_2^2
\le 2\,[f]_{\mathrm{Lip}}^2 \sum_{\ell=k}^n \big( 1 \vee [P]_{\mathrm{Lip}} \big)^{2(n-\ell)} \big\| X_\ell - \hat X_\ell \big\|_2^2,
\]
which completes the proof. ♦

Remark. The above control emphasizes the interest of minimizing the “quantization” error ‖Xk − X̂k ‖p at each time step of the Markov chain in order to reduce the final resulting error.

 Exercise. Prove claim (a), starting from the inequality
\[
\big\| P_1 g(X_1) - \mathbb E\big( h(\hat X_2) \,|\, \hat X_1 \big) \big\|_p \le [P_1 g]_{\mathrm{Lip}} \big\| X_1 - \hat X_1 \big\|_p + \big\| g(X_2) - h(\hat X_2) \big\|_p.
\]

Example of application: the Euler scheme. Let (X̄^n_{t^n_k})_{0\le k\le n} be the Euler scheme with step T/n of the d-dimensional diffusion solution to the SDE (10.1). It defines a homogeneous Markov chain with transition
\[
\bar P^n_k g(x) = \mathbb E\, g\Big( x + b(t^n_k, x)\,\frac Tn + \sigma(t^n_k, x)\sqrt{\frac Tn}\, Z \Big), \qquad Z \sim \mathcal N(0; I_q).
\]
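One step of this transition is exactly what a path-wise Euler simulator iterates; a generic sketch (drift b and diffusion coefficient σ passed as callables, all parameter values hypothetical):

```python
import numpy as np

def euler_paths(b, sigma, x0, T, n, M, rng):
    """Simulate M paths of the Euler scheme
    X_{t_{k+1}} = X_{t_k} + b(t_k, X_{t_k}) T/n + sigma(t_k, X_{t_k}) sqrt(T/n) Z."""
    dt = T / n
    X = np.empty((M, n + 1))
    X[:, 0] = x0
    for k in range(n):
        Z = rng.normal(size=M)
        t = k * dt
        X[:, k + 1] = X[:, k] + b(t, X[:, k]) * dt + sigma(t, X[:, k]) * np.sqrt(dt) * Z
    return X

# Toy usage: geometric Brownian motion dX = 0.05 X dt + 0.2 X dW, X_0 = 100
rng = np.random.default_rng(0)
X = euler_paths(lambda t, x: 0.05 * x, lambda t, x: 0.2 * x, 100.0, 1.0, 50, 10_000, rng)
```

The simulated paths (or their grid projections) are the raw material of both the regression and the quantization tree methods.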

If g is Lipschitz continuous,
\[
\big| \bar P^n_k g(x) - \bar P^n_k g(x') \big|^2 \le [g]^2_{\mathrm{Lip}}\, \mathbb E\,\Big| x - x' + \frac Tn \big( b(t^n_k,x) - b(t^n_k,x') \big) + \sqrt{\frac Tn}\, \big( \sigma(t^n_k,x) - \sigma(t^n_k,x') \big) Z \Big|^2
\]
\[
\le [g]^2_{\mathrm{Lip}} \left( \Big| x - x' + \frac Tn \big( b(t^n_k,x) - b(t^n_k,x') \big) \Big|^2 + \frac Tn\, \big\| \sigma(t^n_k,x) - \sigma(t^n_k,x') \big\|^2 \right)
\]
\[
\le [g]^2_{\mathrm{Lip}}\, |x - x'|^2 \left( 1 + \frac Tn [\sigma]^2_{\mathrm{Lip}} + \frac{T^2}{n^2} [b]^2_{\mathrm{Lip}} + 2\,\frac Tn [b]_{\mathrm{Lip}} \right).
\]
As a consequence,
\[
[\bar P^n_k g]_{\mathrm{Lip}} \le \Big( 1 + \frac{C_{b,\sigma,T}}{n} \Big) [g]_{\mathrm{Lip}}, \quad k = 0, \dots, n-1, \qquad\text{i.e.}\qquad [\bar P^n]_{\mathrm{Lip}} \le 1 + \frac{C_{b,\sigma,T}}{n}.
\]
Applying the control established in claim (b) of the above theorem yields, with obvious notations,
\[
\big\| \bar U_k - \hat{\bar U}_k \big\|_2 \le \sqrt 2\,[f]_{\mathrm{Lip}} \left( \sum_{\ell=k}^n \Big( 1 + \frac{C_{b,\sigma,T}}{n} \Big)^{2(n-\ell)} \big\| \bar X_\ell - \hat{\bar X}_\ell \big\|_2^2 \right)^{1/2}
\le \sqrt 2\, e^{C_{b,\sigma,T}}\,[f]_{\mathrm{Lip}} \left( \sum_{\ell=k}^n e^{-2 C_{b,\sigma,T}\, t^n_\ell / T}\, \big\| \bar X_\ell - \hat{\bar X}_\ell \big\|_2^2 \right)^{1/2}.
\]

 Exercise. Derive a similar result in the case p ≠ 2, based on claim (a) of the theorem.

10.4.3 Background on optimal quantization

See Chapter 5.

10.4.4 Implementation of a quantization tree method


 Quantization tree. The pathwise quantized Backward Dynamic Programming Principle (10.5) can be rewritten in distribution as follows. Let, for every k ∈ {0, . . . , n},
\[
\Gamma_k = \big\{ x_1^k, \dots, x_{N_k}^k \big\}.
\]
Keeping in mind that Ûk = ûk (X̂k ), k = 0, . . . , n, we first get
\[
\begin{cases}
\hat u_n = f_n \ \text{on } \Gamma_n, \\[2pt]
\hat u_k(x_i^k) = \max\Big( f_k(x_i^k),\ \mathbb E\big( \hat u_{k+1}(\hat X_{k+1}) \,|\, \hat X_k = x_i^k \big) \Big), \quad i = 1, \dots, N_k,\ k = 0, \dots, n-1,
\end{cases} \tag{10.7}
\]
which finally leads to
\[
\begin{cases}
\hat u_n(x_i^n) = f_n(x_i^n), \quad i = 1, \dots, N_n, \\[2pt]
\hat u_k(x_i^k) = \max\Big( f_k(x_i^k),\ \sum_{j=1}^{N_{k+1}} p_{ij}^k\, \hat u_{k+1}(x_j^{k+1}) \Big), \quad i = 1, \dots, N_k,\ k = 0, \dots, n-1,
\end{cases} \tag{10.8}
\]

where the transition weight “super-matrix” [pkij ] is defined by
\[
p_{ij}^k = \mathbb P\big( \hat X_{k+1} = x_j^{k+1} \,\big|\, \hat X_k = x_i^k \big), \quad 1 \le i \le N_k,\ 1 \le j \le N_{k+1},\ k = 0, \dots, n-1. \tag{10.9}
\]
Although the above super-matrix defines a family of Markov transitions, the sequence (X̂k )0≤k≤n is definitely not a Markov chain, since there is no reason why P( X̂_{k+1} = x_j^{k+1} | X̂_k = x_i^k ) and P( X̂_{k+1} = x_j^{k+1} | X̂_k = x_i^k, X̂_\ell = x_{i_\ell}^\ell, \ell = 0, \dots, k-1 ) should coincide. In fact one should rather see the quantized transitions
\[
\hat P_k(x_i^k, dy) = \sum_{j=1}^{N_{k+1}} p_{ij}^k\, \delta_{x_j^{k+1}}, \quad x_i^k \in \Gamma_k,\ k = 0, \dots, n-1,
\]
as spatial discretizations of the transitions Pk (x, dy) of the original Markov chain.

Definition 10.1 The family of grids Γk , 0 ≤ k ≤ n, and the transition super-matrix [pkij ] defined by (10.9) define a quantization tree of size N = N0 + · · · + Nn .

Remark. A quantization tree in the sense of the above definition does not characterize the distribution of the sequence (X̂k )0≤k≤n .

 Implementation. The implementation of the whole quantization tree method relies on the computation of this transition super-matrix. Once the grids (optimal or not) have been specified and the weights of the super-matrix have been computed — or, to be more precise, estimated — the computation of the pseudo-réduite E Û0 at the origin amounts to an almost instantaneous “backward descent” of the quantization tree based on (10.8).
If we can simulate M independent copies of the Markov chain (Xk )0≤k≤n , denoted X (1) , . . . , X (M ) , then the weights pkij can be estimated by a standard Monte Carlo estimator:
\[
p_{ij}^{(M),k} = \frac{\#\big\{ m \in \{1,\dots,M\} : \hat X_{k+1}^{(m)} = x_j^{k+1}\ \&\ \hat X_k^{(m)} = x_i^k \big\}}{\#\big\{ m \in \{1,\dots,M\} : \hat X_k^{(m)} = x_i^k \big\}} \ \xrightarrow{\ M\to+\infty\ }\ p_{ij}^k.
\]
Remark. By contrast with the regression methods, for which theoretical results are mostly focused on the rate of convergence of the Monte Carlo phase, we will not analyze here this part of the procedure, for which we refer to [8], where the Monte Carlo procedure to compute the transition super-matrix and its impact on the quantization tree are analyzed in depth.
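Putting (10.8) and the Monte Carlo estimator of the p^k_{ij} together gives a compact one-dimensional sketch of the whole descent (nearest-neighbour projection on user-supplied grids; the grids and payoff below are ad hoc toy inputs, not optimal quantizers):

```python
import numpy as np

def quantization_tree_descent(paths, grids, payoffs):
    """Backward descent (10.8) on a 1D quantization tree whose weights p^k_{ij}
    are estimated from M simulated paths via nearest-neighbour (Voronoi) projection.
    paths: (M, n+1) array; grids: list of n+1 1D arrays; payoffs: list of f_k."""
    M, n1 = paths.shape
    n = n1 - 1
    # Project each X_k^{(m)} on its grid: index of the Voronoi cell it falls in
    idx = [np.abs(paths[:, [k]] - grids[k][None, :]).argmin(axis=1) for k in range(n1)]
    u = payoffs[n](grids[n])                                  # u_hat_n = f_n on Gamma_n
    for k in range(n - 1, -1, -1):
        counts = np.zeros((len(grids[k]), len(grids[k + 1])))
        np.add.at(counts, (idx[k], idx[k + 1]), 1.0)          # joint transition counts
        p = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
        u = np.maximum(payoffs[k](grids[k]), p @ u)           # u_hat_k = max(f_k, P_hat_k u_hat_{k+1})
    return u

# Toy usage: Gaussian random walk, put-like payoff (hypothetical inputs)
rng = np.random.default_rng(0)
n, M = 5, 20_000
inc = rng.normal(size=(M, n))
paths = np.concatenate([np.zeros((M, 1)), np.cumsum(inc, axis=1)], axis=1)
grids = [np.array([0.0])] + [np.linspace(-3, 3, 20) * np.sqrt(k) for k in range(1, n + 1)]
payoffs = [lambda x: np.maximum(1.0 - x, 0.0)] * (n + 1)
u0 = quantization_tree_descent(paths, grids, payoffs)         # pseudo-réduite on Gamma_0
```

Note that, unlike the regression method, once the weights are estimated the descent itself is essentially free and can be re-run for many payoffs.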

Application. We can apply what precedes, still within the framework of an Euler scheme, to our original optimal stopping problem. We assume that all the random vectors Xk lie in Lp′ (P) for a real exponent p′ > 2 and that they have been optimally quantized (in L2 ) by grids of size Nk , k = 0, . . . , n. Then, relying on the non-asymptotic Zador Theorem (claim (b) from Theorem 5.1.2), we get, with obvious notations,
\[
\big\| \bar U^n_0 - \hat{\bar U}{}^n_0 \big\|_2 \le \sqrt 2\, e^{C_{b,\sigma,T}}\,[f]_{\mathrm{Lip}}\, C_{p',d} \left( \sum_{k=0}^n e^{2C_{b,\sigma,T}}\, \sigma_{p'}\big( \bar X^n_{t^n_k} \big)^2\, N_k^{-\frac2d} \right)^{1/2}.
\]

Quantization tree optimal design. At this stage, one can proceed to an optimization of the quantization tree. To be precise, one can optimize the sizes of the grids Γk subject to a “budget” (or total allocation) constraint, typically
\[
\min\left\{ \sum_{k=0}^n e^{2C_{b,\sigma,T}}\, \sigma_{p'}\big( \bar X^n_{t^n_k} \big)^2\, N_k^{-\frac2d},\quad N_k \ge 1,\ N_0 + \cdots + N_n = N \right\}.
\]
 Exercise. (a) Solve rigorously the constrained optimization problem
\[
\min\left\{ \sum_{k=0}^n e^{2C_{b,\sigma,T}}\, \sigma_{p'}\big( \bar X^n_{t^n_k} \big)^2\, x_k^{-\frac2d},\quad x_k \in \mathbb R_+,\ x_0 + \cdots + x_n = N \right\}.
\]
(b) Derive an asymptotically optimal (suboptimal) choice for the grid size allocation (as N → +∞).

This optimization turns out to have a significant numerical impact, even if, in terms of rate, the uniform choice Nk = N̄ = N/n (doing so we assume implicitly that X0 = x0 ∈ Rd , so that X̂0 = X0 = x0 , N0 = 1 and Û0 = û0 (x0 )) leads to a quantization error of the form
\[
|\bar u^n_0(x_0) - \hat{\bar u}{}^n_0(x_0)| \le \big\| \bar U^n_0 - \hat{\bar U}{}^n_0 \big\|_2
\le \kappa_{b,\sigma,T}\,[f]_{\mathrm{Lip}} \max_{0\le k\le n} \sigma_{p'}\big( \bar X^n_{t^n_k} \big) \times \frac{\sqrt n}{\bar N^{1/d}}
\le [f]_{\mathrm{Lip}}\, \kappa'_{b,\sigma,T,p'}\, \frac{\sqrt n}{\bar N^{1/d}},
\]
since we know (see Proposition 7.2 in Chapter 7) that \sup_n \mathbb E\big( \max_{0\le k\le n} |\bar X^n_{t^n_k}|^{p'} \big) < +\infty. If we plug this into the global estimate obtained in Theorem 10.1(c), we obtain the typical error bound
\[
|u_0(x_0) - \hat{\bar u}{}^n_0(x_0)| \le C_{b,\sigma,T,f,d} \left( \frac1{n^{1/2}} + \frac{n^{1/2}}{\bar N^{1/d}} \right). \tag{10.10}
\]

Remark. If we can directly simulate the sample (Xtnk )0≤k≤n of the diffusion instead of its Euler scheme, and if the obstacle/payoff function is semi-convex in the sense of Condition (SC), then we get the typical error bound
\[
|u_0(x_0) - \hat u^n_0(x_0)| \le C_{b,\sigma,T,f,d} \left( \frac1n + \frac{n^{1/2}}{\bar N^{1/d}} \right).
\]

Remarks. • The rate of decay N̄ −1/d , which becomes worse and worse as the dimension d of the structure Markov process increases, cannot be avoided by such tree methods. It is a consequence of Zador's Theorem and is known as the curse of dimensionality.
• These rates can be significantly improved by introducing a Romberg-like extrapolation method or/and some martingale corrections to the quantization tree (see [132]).

10.4.5 How to improve convergence performances?


 Tree optimization vs BDPP complexity. If, n being fixed, we set
\[
N_k = \left\lceil \frac{\big( \sigma_{2+\delta}(X_k) \big)^{\frac{2d}{d+2}}}{\sum_{0\le \ell\le n} \big( \sigma_{2+\delta}(X_\ell) \big)^{\frac{2d}{d+2}}}\, N \right\rceil \vee 1, \quad k = 0, \dots, n
\]
(see the exercise in the previous section), we (asymptotically) minimize the resulting quantization error induced by the BDPP descent. This error reads with Ñ = N0 + · · · + Nn (usually > N ).

Examples. • Brownian motion, Xk = Wtk : then Ŵ0 = 0 and
\[
\| W_{t_k} \|_{2+\delta} = C_\delta\, \sqrt{t_k}, \quad k = 0, \dots, n.
\]
Hence N0 = 1 and
\[
N_k \approx \frac{2(d+1)}{d+2} \Big( \frac kn \Big)^{\frac d{d+2}}\, \frac Nn, \quad k = 1, \dots, n;
\]
\[
|V_0 - \hat v_0(0)| \le C_{W,\delta} \Big( \frac{2(d+1)}{d+2} \Big)^{1-\frac1d} \frac{n^{\frac12+\frac1d}}{N^{\frac1d}} = O\Big( \frac{\sqrt n}{\bar N^{1/d}} \Big)
\]
with N̄ = N/n. Theoretically this choice may not look crucial, since it has no impact on the convergence rate, but in practice it does influence the numerical performances.
• Stationary process: the process (Xk )0≤k≤n is stationary and X0 ∈ L2+δ for some δ > 0. A typical example in the Gaussian world is, as expected, the stationary Ornstein-Uhlenbeck process (sampled at times tnk = kT/n , k = 0, . . . , n). The essential feature of such a setting is that the quantization tree relies on a single optimal N̄ -grid (say, L2 -optimal for the distribution of X̂0 ), Γ = Γ∗_{N̄} = {x1 , . . . , x_{N̄} }, and one quantized transition matrix \big[ \mathbb P( \hat X^\Gamma_1 = x_j \,|\, \hat X^\Gamma_0 = x_i ) \big]_{1\le i,j\le \bar N}.
– For every k ∈ {0, . . . , n}, ‖Xk ‖2+δ = ‖X0 ‖2+δ , hence Nk = ⌈N/(n+1)⌉, k = 0, . . . , n, and
\[
\| V_0 - \hat v_0(\hat X_0) \|_2 \le C_{X,\delta}\, \frac{n^{\frac12+\frac1d}}{N^{\frac1d}} \le C_{X,\delta}\, \frac{\sqrt n}{\bar N^{\frac1d}} \quad\text{with } \bar N = \frac N{n+1}.
\]

 Richardson-Romberg extrapolation(s). Romberg extrapolation in this framework is based on a heuristic guess: there exists a “sharp rate” of convergence of the quantization tree method as the total budget N goes to infinity, and this rate is given by (10.10) when replacing N̄ by N/n.
One can proceed to a Romberg extrapolation in N for fixed n, or even a full Romberg extrapolation involving both the time discretization step n and the size N of the quantization tree.

 Exercise (temporary). Assume that the error of a quantization scheme admits a first order expansion of the form
\[
\mathrm{Err}(n, N) = \frac{c_1}{n^\alpha} + \frac{c_2\, n^{\frac12+\frac1d}}{N^{\frac1d}}.
\]
Devise a Richardson-Romberg extrapolation based on two quantization trees with sizes N (1) and N (2) respectively.

 Martingale correction. When (Xk )0≤k≤n is a martingale, one can force this martingale property on the quantized chain by freezing the transition weight super-matrix and moving, in a backward way, the grids Γk so that the resulting grids Γ̃k satisfy
\[
\mathbb E\big( \hat X_k^{\tilde\Gamma_k} \,\big|\, \hat X_{k-1}^{\tilde\Gamma_{k-1}} \big) = \hat X_{k-1}^{\tilde\Gamma_{k-1}}.
\]
In fact this identity defines the grids Γ̃k by a backward induction. As a final step, one translates the grids so that Γ̃0 and Γ0 have the same mean. Of course, such a procedure is entirely heuristic.

10.4.6 Numerical optimization of the grids: Gaussian and non-Gaussian vectors


The procedures that minimize the quantization error are usually stochastic (except in dimension 1). The most famous ones are undoubtedly the so-called Competitive Learning Vector Quantization (CLVQ) algorithm (see [129] or [127]) and the Lloyd procedure (see [129, 124, 56]), which have been described in detail and briefly analyzed in Section 6.3.6 of Chapter 6. More algorithmic details can also be found on the website
www.quantize.maths-fi.com
as well as optimal quantizers of the normal distributions for d = 1 to d = 10 and N = 1 up to 10^5.

10.4.7 The case of normal distribution N (0; Id ) on Rd , d ≥ 1


For these distributions a large scale optimization has been carried out, based on a mixed CLVQ-Lloyd procedure. To be precise, grids have been computed for d = 1 up to 10 and 1 ≤ N ≤ 5 000. Furthermore, several companion parameters have also been computed (still by simulation): weights, L1 -quantization error, (squared) L2 -distortion, local L1 - and L2 -pseudo-inertia of each Voronoi cell. All these grids can be downloaded from the website.
Figure 10.1 depicts an optimal quadratic N -quantization of the bi-variate normal distribution N (0; I2 ) with N = 500.
On the above website are also available some (free access) grids for the d-variate normal distribution.

The one-dimensional setting. Although of little interest for applications (since deterministic methods like the PDE approach are available), we propose below, for the sake of completeness, a Newton-Raphson method to compute optimal quantizers of scalar unimodal distributions, i.e. absolutely continuous distributions whose density is log-concave. The starting point is the following theorem due to Kieffer [81], which states the uniqueness of the optimal quantizer in that setting.

Figure 10.1: An optimal quantization of the bi-variate normal distribution with size N = 500

Theorem 10.4 (1D-distributions, see [81]) If d = 1 and PX (dξ) = ϕ(ξ) dξ with log ϕ concave, then there is exactly one stationary N -quantizer (up to the reordering of its components, i.e. as a set). This unique stationary quantizer is a global (and local) minimum of the distortion function, i.e.
\[
\forall\, N \ge 1, \quad \operatorname*{argmin}_{\mathbb R^N} D_N^X = \{ x^{(N)} \}.
\]

Remark. Absolutely continuous distributions on the real line with a log-concave density are sometimes called unimodal distributions. The support of such a distribution is a (closed) interval [a, b] with −∞ ≤ a ≤ b ≤ +∞.

Examples of unimodal distributions: the Gaussian (normal) distributions, the gamma distributions γ(α, β) with α ≥ 1, the Beta B(a, b)-distributions with a, b ≥ 1, etc.

In 1D-setting, a deterministic optimization approach can be developed by proceeding as follows.

 Specification of the Voronoi cells of x = (x1, . . . , xN), a < x1 < x2 < · · · < xN < b, where [a, b] denotes the support of PX (which is that of ϕ as well):

        Ci(x) = [x^{i−1/2}, x^{i+1/2}),   x^{i+1/2} = (x^i + x^{i+1})/2,  i = 1, . . . , N − 1,   x^{1/2} = a,  x^{N+1/2} = b.

 Computation of the gradient:

        ∇D_N^X(x) = 2 ( ∫_{x^{i−1/2}}^{x^{i+1/2}} (x^i − ξ) ϕ(ξ) dξ )_{1≤i≤N}.
The Hessian ∇²D_N^X(x^1, . . . , x^N) can in turn be computed (at least at N-tuples whose components are continuity points of the density ϕ, which is the case everywhere, except possibly at the endpoints of the support, which is an interval). It can be expressed in terms of the functions

        Φ(u) = ∫_{−∞}^{u} ϕ(v) dv   and   Ψ(u) = ∫_{−∞}^{u} v ϕ(v) dv.

        [Hessian formula to be completed.]

Thus, if X ∼ N(0; 1), the function Φ is (tabulated or) computable with high accuracy at low computational cost (see Section 11.3) and

        Ψ(x) = −e^{−x²/2}/√(2π),   x ∈ R.

This allows for an almost instantaneous search for the unique optimal N-quantizer, using a Newton-Raphson descent on R^N, with the requested accuracy.
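As an illustration of the stationarity equations x^i = E[X | X ∈ Ci(x)], the Python sketch below computes a stationary N-quantizer of N(0; 1) by a Lloyd-type fixed-point iteration — a simpler substitute for the Newton-Raphson descent described above (all function names are ours):

```python
import math

def phi(x):
    """Density of N(0; 1)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def Phi(x):
    """Distribution function of N(0; 1), via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))

def lloyd_quantizer(N, n_iter=200):
    """Fixed-point (Lloyd) iteration on the stationarity equations
    x^i = E[X | X in C_i(x)] for X ~ N(0; 1); for a cell [a, b),
    E[X 1_{a <= X < b}] = phi(a) - phi(b), since -phi is an
    antiderivative of v phi(v)."""
    x = [-2 + 4 * i / (N - 1) for i in range(N)] if N > 1 else [0.0]
    for _ in range(n_iter):
        # mid-points x^{i+1/2}; the extreme cells stretch to +/- infinity
        mids = [-math.inf] + [(x[i] + x[i + 1]) / 2 for i in range(N - 1)] + [math.inf]
        x = []
        for i in range(N):
            a, b = mids[i], mids[i + 1]
            pa = phi(a) if math.isfinite(a) else 0.0
            pb = phi(b) if math.isfinite(b) else 0.0
            x.append((pa - pb) / (Phi(b) - Phi(a)))
    return x
```

For N = 2 this converges in one step to ±√(2/π) ≈ ±0.7979, the unique stationary (hence optimal) 2-quantizer of N(0; 1).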
 For the normal distribution N(0; 1) and N = 1, . . . , 500, tabulations within 10⁻¹⁴ accuracy of both the optimal N-quantizers

        x^(N) = (x^(N),1, . . . , x^(N),N)

and the companion parameters

        P(X ∈ Ci(x^(N))), i = 1, . . . , N,   and   ‖X − X̂^{x^(N)}‖₂

can also be downloaded at the website www.quantize.maths-fi.com.
Chapter 11

Miscellany

11.1 More on the normal distribution

11.2 Characteristic function


Proposition 11.1 If Z ∼ N(0; 1), then

        ∀ u ∈ R,   Φ_Z(u) := ∫_{−∞}^{+∞} e^{iux} e^{−x²/2} dx/√(2π) = e^{−u²/2}.

Proof. Differentiating under the integral sign yields

        Φ′_Z(u) = i ∫_{−∞}^{+∞} e^{iux} x e^{−x²/2} dx/√(2π).

Now, an integration by parts based on

        (e^{iux})′ = iu e^{iux}   and   x e^{−x²/2} = (−e^{−x²/2})′

yields

        Φ′_Z(u) = i² u ∫_{−∞}^{+∞} e^{iux} e^{−x²/2} dx/√(2π) = −u Φ_Z(u),

so that

        Φ_Z(u) = Φ_Z(0) e^{−u²/2} = e^{−u²/2}.   ♦

11.3 Numerical approximation of the distribution function


To compute the distribution function of the normal distribution, one usually relies on the fast approximation formula obtained by continued fraction expansion techniques (see [1]):

        ∀ x ∈ R₊,   Φ₀(x) = 1 − (e^{−x²/2}/√(2π)) (a₁t + a₂t² + a₃t³ + a₄t⁴ + a₅t⁵) + O(e^{−x²/2} t⁶),

where

        t := 1/(1 + px),   p := 0.231 6419,

        a₁ := 0.319 381 530,   a₂ := −0.356 563 782,
        a₃ := 1.781 477 937,   a₄ := −1.821 255 978,
        a₅ := 1.330 274 429,

and

        |O(e^{−x²/2} t⁶)| ≤ 7.5 × 10⁻⁸.
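The formula translates directly into code; the sketch below (function names are ours) implements it, using the symmetry Φ₀(−x) = 1 − Φ₀(x) for negative arguments, and can be cross-checked against an error-function-based reference value:

```python
import math

P_COEF = 0.2316419
A_COEF = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)

def Phi0_approx(x):
    """Polynomial approximation of the N(0; 1) distribution function,
    with the coefficients quoted above; accurate within 7.5e-8."""
    if x < 0:
        return 1.0 - Phi0_approx(-x)
    t = 1.0 / (1.0 + P_COEF * x)
    poly = sum(a * t ** (k + 1) for k, a in enumerate(A_COEF))
    return 1.0 - math.exp(-x * x / 2) / math.sqrt(2 * math.pi) * poly

def Phi0_exact(x):
    """Reference value via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2)))
```

One checks for instance that Φ₀(1.23) ≈ 0.8907, the value read off the table of the next section.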

11.4 Table of the distribution function of the normal distribution

The distribution function of the N(0; 1) distribution is given for every real number t by

        Φ₀(t) := (1/√(2π)) ∫_{−∞}^{t} e^{−x²/2} dx.

Since the probability density is even, one easily checks that Φ₀(−t) = 1 − Φ₀(t), so that

        Φ₀(t) − Φ₀(−t) = 2 Φ₀(t) − 1.

The following tables give the values of Φ₀(t) for t = x₀.x₁x₂, where x₀ ∈ {0, 1, 2}, x₁ ∈ {0, 1, . . . , 9} and x₂ ∈ {0, 1, . . . , 9}.

For example, if t = 1.23 (i.e. row 1.2 and column 0.03) one has Φ0 (t) ≈ 0.8907.

t 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6661 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7290 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9779 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986

One notes that Φ₀(t) = 0.9986 for t = 2.99. This comes from the fact that the mass of the normal distribution is mainly concentrated on the interval [−3, 3], as emphasized by the table of “large” values hereafter (for instance, one remarks that P({|X| ≤ 4.5}) ≥ 0.99999!).

t       3.0     3.1     3.2     3.3     3.4     3.5     3.6      3.8      4.0      4.5
Φ₀(t) .99865  .99904  .99931  .99952  .99966  .99976  .999841  .999928  .999968  .999997

11.5 Uniform integrability as a domination property


In this section, we propose a brief background on uniform integrability for random variables taking values in R^d. Mostly for notational convenience, all these random variables are defined on the same probability space (Ω, A, P), though this is by no means mandatory. We leave as an exercise to check in what follows that each random variable Xi can be defined on its own probability space (Ωi, Ai, Pi), with a straightforward adaptation of the statements.

Theorem 11.1 (Equivalent definitions of uniform integrability I) A family (Xi)_{i∈I} of R^d-valued random vectors, defined on a probability space (Ω, A, P), is said to be uniformly integrable if it satisfies one of the following equivalent properties:

(i)  lim_{R→+∞} sup_{i∈I} E[ |Xi| 1_{{|Xi|≥R}} ] = 0.

(ii) (α) sup_{i∈I} E|Xi| < +∞,
     (β) ∀ ε > 0, ∃ η = η(ε) > 0 such that, ∀ A ∈ A,  P(A) ≤ η  ⟹  sup_{i∈I} ∫_A |Xi| dP ≤ ε.

Remark. • All norms being strongly equivalent on R^d, claims (i) and (ii) do not depend on the selected norm on R^d.
• L¹-uniform integrability of a family of probability distributions (µi)_{i∈I} defined on (R^d, Bor(R^d)) can be defined accordingly by

        lim_{R→+∞} sup_{i∈I} ∫_{{|x|≥R}} |x| µi(dx) = 0.

All what follows can be straightforwardly “translated” in terms of probability distributions.

More generally, given a non-zero Borel function f : R^d → R₊ such that lim inf_{|x|→+∞} f(x) = +∞, a family of probability distributions (µi)_{i∈I} defined on (R^d, Bor(R^d)) is f-uniformly integrable if

        lim_{R→+∞} sup_{i∈I} ∫_{{f(x)≥R}} f(x) µi(dx) = 0.

Proof. Assume first that (i) holds. It is clear that

        sup_{i∈I} E|Xi| ≤ R + sup_{i∈I} E[ |Xi| 1_{{|Xi|≥R}} ] < +∞,

at least for large enough R ∈ (0, ∞). Now, for every i ∈ I and every A ∈ A,

        ∫_A |Xi| dP ≤ R P(A) + ∫_{{|Xi|≥R}} |Xi| dP.

Owing to (i), there exists a real number R = R(ε) > 0 such that sup_{i∈I} ∫_{{|Xi|≥R}} |Xi| dP ≤ ε/2. Then setting η = η(ε) = ε/(2R) yields (ii).

Conversely, for every real number R > 0, the Markov inequality implies

        sup_{i∈I} P({|Xi| ≥ R}) ≤ sup_{i∈I} E|Xi| / R.

Let η = η(ε) be given by (ii)(β). As soon as R > sup_{i∈I} E|Xi| / η, one has sup_{i∈I} P({|Xi| ≥ R}) ≤ η, and (ii)(β) implies that

        sup_{i∈I} E[ |Xi| 1_{{|Xi|≥R}} ] ≤ ε,

which completes the proof. ♦

As a consequence, one easily derives that

P0. (Xi )i∈I is uniformly integrable iff (|Xi |)i∈I is.

P1. If X ∈ L1 (P) then the family (X) is uniformly integrable.

P2. If (Xi )i∈I and (Yi )i∈I are two families of uniformly integrable Rd -valued random vectors, then
(Xi + Yi )i∈I is uniformly integrable.

P3. If (Xi )i∈I is a family of Rd -valued random vectors dominated by a uniformly integrable family
(Yi )i∈I of random variables in the sense that

∀ i ∈ I, |Xi | ≤ Yi P-a.s.
then (Xi )i∈I is uniformly integrable.

These four properties follow from characterization (i). To be precise, P1 is a consequence of the Lebesgue dominated convergence theorem, whereas P2 follows from the elementary inequality

        E[ |Xi + Yi| 1_{{|Xi+Yi|≥R}} ] ≤ 2 E[ |Xi| 1_{{|Xi|≥R/2}} ] + 2 E[ |Yi| 1_{{|Yi|≥R/2}} ].

Now let us pass to a simple criterion of uniform integrability.

Corollary 11.1 Let (Xi)_{i∈I} be a family of R^d-valued random vectors defined on a probability space (Ω, A, P) and let Φ : R^d → R₊ satisfy lim_{|x|→+∞} Φ(x)/|x| = +∞. If

        sup_{i∈I} E[Φ(Xi)] < +∞,

then the family (Xi)_{i∈I} is uniformly integrable.
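To see why a uniform bound on the first moments alone (condition (ii)(α)) is not enough, consider the classical family Xn = n·1_{An} with P(An) = 1/n (our own illustration, not from the text); the tiny script below computes its tail expectations exactly:

```python
def tail_expectation(n, R):
    """E[|X_n| 1_{|X_n| >= R}] for X_n = n with probability 1/n, else 0:
    it equals 1 whenever n >= R (for R > 1), although E|X_n| = 1 for all n."""
    return 1.0 if n >= R else 0.0

def sup_tail(R, n_max=10_000):
    """Supremum over the family of the tail expectations at level R: it stays
    equal to 1 as R grows, so the family is NOT uniformly integrable.
    Consistently, E[X_n^2] = n is unbounded, so the criterion of
    Corollary 11.1 with Phi(x) = x^2 does not apply."""
    return max(tail_expectation(n, R) for n in range(1, n_max + 1))
```

Conversely, by Markov's inequality, E[|X| 1_{{|X|≥R}}] ≤ E[X²]/R, so any family with sup_i E[Xi²] < +∞ is uniformly integrable, in accordance with the corollary.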

Theorem 11.2 (Uniform integrability II) Let (Xn)_{n≥1} be a sequence of R^d-valued random vectors defined on a probability space (Ω, A, P) and let X be an R^d-valued random vector defined on the same probability space. If

(i) (Xn)_{n≥1} is uniformly integrable,

(ii) Xn → X in probability,

then

        E|Xn − X| → 0 as n → +∞.

(In particular, E Xn → E X.)

Proof. One derives from (ii) the existence of a subsequence (Xn′)_{n≥1} such that Xn′ → X a.s. Hence, by Fatou’s lemma,

        E|X| ≤ lim inf_n E|Xn′| ≤ sup_n E|Xn| < +∞.

Hence X ∈ L¹(P) and, owing to P1 and P2, (Xn − X)_{n≥1} is a uniformly integrable sequence. Now, for every integer n ≥ 1 and every M > 0,

        E|Xn − X| ≤ E(|Xn − X| ∧ M) + E[ |Xn − X| 1_{{|Xn−X|≥M}} ].

The Lebesgue dominated convergence theorem implies lim_n E(|Xn − X| ∧ M) = 0, so that

        lim sup_n E|Xn − X| ≤ lim_{M→+∞} sup_n E[ |Xn − X| 1_{{|Xn−X|≥M}} ] = 0.   ♦

Corollary 11.2 (L^p-uniform integrability) Let p ∈ [1, +∞). Let (Xn)_{n≥1} be a sequence of R^d-valued random vectors defined on a probability space (Ω, A, P) and let X be an R^d-valued random vector defined on the same probability space. If

(i) (Xn)_{n≥1} is L^p-uniformly integrable (i.e. (|Xn|^p)_{n≥1} is uniformly integrable),

(ii) Xn → X in probability,

then

        ‖Xn − X‖_p → 0 as n → +∞.

(In particular, E Xn → E X.)

Proof. By the same argument as above X ∈ Lp (P) so that (|Xn − X|p )n≥1 is uniformly integrable
by P2 and P3. One concludes by the above theorem since |Xn − X|p → 0 in probability as
n → +∞. ♦

11.6 Interchanging. . .
Theorem 11.3 (Interchanging continuity and expectation) (see [30], Chapter 8) (a) Let
(Ω, A, P) be a probability space, let I be a nontrivial interval of R and let Ψ : I × Ω → R be a
Bor(I) ⊗ A-measurable function. Let x0 ∈ I. If the function Ψ satisfies:
(i) for every x ∈ I, the random variable Ψ(x, . ) ∈ L1R (P),
(ii) P(dω)-a.s., x 7→ Ψ(x, ω) is continuous at x0 ,
(iii) There exists Y ∈ L1R+ (P) such that for every x ∈ I,

P(dω)-a.s. |Ψ(x, ω)| ≤ Y (ω),


then the function ψ(x) := E(Ψ(x, · )) is well-defined at every x ∈ I and continuous at x0.
(b) The domination property (iii) in the above theorem can be replaced mutatis mutandis by a
uniform integrability assumption on the family (Ψ(x, .))x∈I .

The same extension based on uniform integrability as Claim (b) holds true for the differentiation
Theorem 2.1 (see the exercise following the theorem).

11.7 Measure Theory


Theorem 11.4 (Baire σ-field Theorem) Let (S, dS ) be a metric space. Then

Bor(S, dS ) = σ (C(S, R))

where C(S, R) denotes the set of continuous functions from (S, dS ) to R. When S is σ-compact
(i.e. is a countable union of compact sets), one may replace the space C(S, R) by the space CK (S, R)
of continuous functions with compact support.

Theorem 11.5 (Functional monotone class theorem) Let (S, S) be a measurable space. Let V be a vector space of real-valued bounded measurable functions defined on (S, S). Let C be a subset of V, stable under the product of two of its elements. Assume furthermore that V satisfies

(i) 1 ∈ V ,

(ii) V is closed under uniform convergence

(iii) V is closed under “bounded non-decreasing convergence”: if ϕn ∈ V , n ≥ 1, ϕn ≤ ϕn+1 ,


|ϕn | ≤ K (real constant) and ϕn (x) → ϕ(x) for every x ∈ S, then ϕ ∈ V .

Then V contains the vector subspace of all σ(C)-measurable bounded functions.

We refer to [115] for a proof of this result.

11.8 Weak convergence of probability measures on a Polish space


The main reference for this topic is [24]. See also [134].
The basic result of weak convergence theory is the so-called Portmanteau Theorem stated below
(the definition and notation of weak convergence of probability measures are recalled in section 4.1).

Theorem 11.6 Let (µn)_{n≥1} be a sequence of probability measures on a Polish (metric) space (S, δ) equipped with its Borel σ-field S and let µ be a probability measure on the same space. The following properties are equivalent:

(i) µn ⟹ µ weakly as n → +∞,

(ii) for every open set O of (S, δ),

        µ(O) ≤ lim inf_n µn(O),

(iii) for every closed set F of (S, δ),

        µ(F) ≥ lim sup_n µn(F),

(iv) for every Borel set A ∈ S such that µ(∂A) = 0 (where ∂A = Ā \ Å is the boundary of A),

        lim_n µn(A) = µ(A),

(v) for every nonnegative lower semi-continuous function f : S → R₊,

        0 ≤ ∫_S f dµ ≤ lim inf_n ∫_S f dµn,

(vi) for every bounded Borel function f : S → R such that µ(Disc(f)) = 0, where Disc(f) denotes the set of discontinuity points of f,

        lim_n ∫_S f dµn = ∫_S f dµ.

For a proof we refer to [24] Chapter 1.


When dealing with unbounded functions,there exists a kind of weak Lebesque dominated con-
vergence theorem.

Proposition 11.2 Let (µn)_{n≥1} be a sequence of probability measures on a Polish (metric) space (S, δ) weakly converging to µ.
(a) Let g : S → R₊ be a (non-negative) µ-integrable Borel function and let f : S → R be a µ-a.s. continuous Borel function. If

        |f| ≤ g   and   ∫_S g dµn → ∫_S g dµ as n → +∞,

then f ∈ L¹(µ) and ∫_S f dµn → ∫_S f dµ as n → +∞.
(b) The conclusion still holds if (µn)_{n≥1} is f-uniformly integrable, i.e.

        lim_{R→+∞} sup_{n≥1} ∫_{{|f|≥R}} |f| dµn = 0.

Proof. [To be completed.]
If S = R^d, weak convergence is also characterized by the Fourier transform µ̂, defined on R^d by

        µ̂(u) = ∫_{R^d} e^{i(u|x)} µ(dx),   u ∈ R^d.

Proposition 11.3 Let (µn)_{n≥1} be a sequence of probability measures on (R^d, Bor(R^d)) and let µ be a probability measure on the same space. Then

        ( µn ⟹ µ weakly )   ⟺   ( ∀ u ∈ R^d,  µ̂n(u) → µ̂(u) ).

For a proof we refer e.g. to [74], or to any textbook on a first course in Probability Theory.
Remark. The convergence in distribution of a sequence (Xn)_{n≥1} of random variables taking values in a Polish space (S, S), each Xn being defined on its own probability space (Ωn, An, Pn), is defined as the weak convergence of their distributions µn = P_n^{Xn} = Pn ◦ Xn⁻¹ on (S, S).
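As a small numerical illustration of this characterization (our own example, not from the text): the characteristic function of a standardized Bin(n, 1/2) random variable converges pointwise to that of N(0; 1), in accordance with the central limit theorem.

```python
import cmath, math

def char_std_binomial(u, n, p=0.5):
    """Characteristic function of (S_n - np)/sqrt(np(1-p)) with S_n ~ Bin(n, p):
    E[e^{iuX}] for a Bernoulli(p) variable is (1-p) + p e^{iu}; raise it to the
    n-th power and recenter/rescale."""
    s = math.sqrt(n * p * (1 - p))
    return cmath.exp(-1j * u * n * p / s) * ((1 - p) + p * cmath.exp(1j * u / s)) ** n
```

For n = 2000 the pointwise gap to e^{−u²/2} is already below 10⁻³ for moderate u.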

11.9 Martingale Theory


Proposition 11.4 Let (Mn)_{n≥0} be a square integrable discrete time (Fn)-martingale defined on a filtered probability space (Ω, A, P, (Fn)_{n≥0}).
(a) There exists a unique non-decreasing (Fn)-predictable process null at 0, denoted (⟨M⟩n)_{n≥0}, such that

        (Mn − M0)² − ⟨M⟩n   is an (Fn)-martingale.

This process reads

        ⟨M⟩n = Σ_{k=1}^{n} E[(Mk − Mk−1)² | Fk−1].

(b) Set ⟨M⟩∞ := lim_n ⟨M⟩n. Then

        Mn → M∞ as n → +∞, a.s. on the event {⟨M⟩∞ < +∞},

where M∞ is a finite random variable. If furthermore E⟨M⟩∞ < +∞, then M∞ ∈ L²(P).

We refer to [116] for a proof of this result.
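For the symmetric ±1 random walk (a toy example of ours, not from the text), E[(Mk − Mk−1)² | Fk−1] = 1, so that ⟨M⟩n = n and Mn² − n is a martingale; in particular E[Mn²] = n, which a short simulation confirms:

```python
import random

random.seed(7)

def endpoint_squared_mean(n_steps=100, n_paths=20_000):
    """Monte Carlo estimate of E[M_n^2] for the symmetric +/-1 random walk;
    since <M>_n = n here, the estimate should be close to n_steps."""
    total = 0.0
    for _ in range(n_paths):
        m = sum(random.choice((-1, 1)) for _ in range(n_steps))
        total += m * m
    return total / n_paths
```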

Lemma 11.1 (Kronecker lemma) Let (an)_{n≥1} be a sequence of real numbers and let (bn)_{n≥1} be a non-decreasing sequence of positive real numbers with lim_n bn = +∞. Then

        ( Σ_{n≥1} an/bn converges in R as a series )   ⟹   ( (1/bn) Σ_{k=1}^{n} ak → 0 as n → +∞ ).

Proof. Set Cn = Σ_{k=1}^{n} ak/bk, n ≥ 1, and C0 = 0. The assumption says that Cn → C∞ = Σ_{k≥1} ak/bk ∈ R. Now, with ΔCk = Ck − Ck−1 and Δbk = bk − bk−1,

        (1/bn) Σ_{k=1}^{n} ak = (1/bn) Σ_{k=1}^{n} bk ΔCk
                              = (1/bn) ( bn Cn − Σ_{k=1}^{n} Δbk Ck−1 )
                              = Cn − (1/bn) Σ_{k=1}^{n} Δbk Ck−1,

where we used an Abel transform (summation by parts). One concludes by the extended Cesàro theorem, since Δbk ≥ 0 and lim_n bn = +∞. ♦
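A quick numerical illustration (our own): with a_k = (−1)^{k+1}√k and b_k = k, the series Σ a_k/b_k = Σ (−1)^{k+1}/√k converges by the alternating series test, so the lemma predicts that (1/n) Σ_{k≤n} a_k → 0 (in fact at rate ≈ 1/(2√n)):

```python
import math

def kronecker_average(n):
    """(1/b_n) * sum_{k<=n} a_k with a_k = (-1)^(k+1) sqrt(k) and b_k = k;
    by the Kronecker lemma this tends to 0 as n grows."""
    s = sum((-1) ** (k + 1) * math.sqrt(k) for k in range(1, n + 1))
    return s / n
```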
To establish central limit theorems outside the i.i.d. setting, we will rely on Lindeberg’s central limit theorem for arrays of martingale increments (see Theorem 3.2 and its Corollary 3.1, p. 58, in [70]), stated below in a simple form involving a single square integrable martingale.

Theorem 11.7 (Lindeberg central limit theorem for martingale increments) ([70]) Let (Mn)_{n≥1} be a square integrable martingale with respect to a filtration (Fn)_{n≥1} and let (an)_{n≥1} be a non-decreasing sequence of real numbers going to infinity as n goes to infinity. If the following conditions hold:

(i)  (1/an) Σ_{k=1}^{n} E[ |ΔMk|² | Fk−1 ] → σ² ∈ [0, +∞) in probability,

(ii) ∀ ε > 0,  (1/an) Σ_{k=1}^{n} E[ (ΔMk)² 1_{{|ΔMk| ≥ ε√an}} | Fk−1 ] → 0 in probability,

then

        Mn/√an → N(0; σ²) in distribution.

 Derive from this result a d-dimensional theorem [Hint: Consider a linear combination of the
d-dimensional martingale under consideration.]
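As a sanity check of the normalization (our own toy example): for Mn = ε1 + · · · + εn with i.i.d. signs εk and an = n, condition (i) holds with σ² = 1 and the Lindeberg terms in (ii) vanish as soon as √n > 1/ε, since |ΔMk| = 1; accordingly Mn/√n is approximately N(0; 1):

```python
import math, random

random.seed(123)

def normalized_endpoint(n_steps=100):
    """One sample of M_n / sqrt(a_n) for the symmetric +/-1 random walk
    with a_n = n, which Theorem 11.7 predicts to be asymptotically N(0; 1)."""
    return sum(random.choice((-1, 1)) for _ in range(n_steps)) / math.sqrt(n_steps)

sample = [normalized_endpoint() for _ in range(20_000)]
```

The empirical mean and second moment of the sample should match 0 and σ² = 1 up to Monte Carlo error.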
Bibliography

[1] M. Abramovicz, I.A. Stegun (1964). Handbook of Mathematical Functions, National Bu-
reau of Standards, Washington.

[2] A. Alfonsi, B. Jourdain and A. Kohatsu-Higa (2013). Pathwise optimal transport bounds between a one-dimensional diffusion and its Euler scheme, to appear in The Annals of Applied Probability.

[3] I.A. Antonov and V.M. Saleev (1979). An economic method of computing LPτ-sequences, Zh. Vychisl. Mat. Mat. Fiz., 19, 243-245. English translation: U.S.S.R. Comput. Maths. Math. Phys., 19, 252-256.

[4] B. Arouna (2004). Adaptive Monte Carlo method, a variance reduction technique. Monte
Carlo Methods and Appl., 10(1):1-24.

[5] B. Arouna, O. Bardou (2004). Efficient variance reduction for functionals of diffusions by
relative entropy, CERMICS-ENPC, technical report.

[6] M.D. Ašić, D.D. Adamović (1970): Limit points of sequences in metric spaces, The American
Mathematical Monthly, 77(6):613-616.

[7] J. Barraquand, D. Martineau (1995). Numerical valuation of high dimensional multivari-


ate American securities, Journal of Finance and Quantitative Analysis, 30.

[8] V. Bally (2004). The central limit theorem for a nonlinear algorithm based on quantization,
Proc. of The Royal Soc., 460:221-241.

[9] V. Bally, D. Talay (1996). The distribution of the Euler scheme for stochastic differential
equations: I. Convergence rate of the distribution function, Probab. Theory Related Fields,
104(1): 43-60 (1996).

[10] V. Bally, D. Talay (1996). The law of the Euler scheme for stochastic differential equations.
II. Convergence rate of the density, Monte Carlo Methods Appl., 2(2):93-128.

[11] V. Bally, G. Pagès and J. Printems (2001). A Stochastic quantization method for non-
linear problems, Monte Carlo Methods and Appl., 7(1):21-34.

[12] V. Bally, G. Pagès (2003). A quantization algorithm for solving discrete time multidimen-
sional optimal stopping problems, Bernoulli, 9(6):1003-1049.


[13] V. Bally, G. Pagès (2003). Error analysis of the quantization algorithm for obstacle prob-
lems, Stochastic Processes & Their Applications, 106(1):1-40.

[14] V. Bally, G. Pagès, J. Printems (2003). First order schemes in the numerical quantization
method, Mathematical Finance 13(1):1-16.

[15] V. Bally, G. Pagès, J. Printems (2005). A quantization tree method for pricing and
hedging multidimensional American options, Mathematical Finance, 15(1):119-168.

[16] C. Barrera-Esteve and F. Bergeret and C. Dossal and E. Gobet and A. Meziou
and R. Munos and D. Reboul-Salze (2006). Numerical methods for the pricing of Swing
options: a stochastic control approach, Methodology and Computing in Applied Probability.

[17] O. Bardou, S. Bouthemy, G. Pagès (2009). Pricing swing options using Optimal Quanti-
zation, Applied Mathematical Finance, 16(2):183-217.

[18] O. Bardou, S. Bouthemy, G. Pagès (2006). When are Swing options bang-bang?, Inter-
national Journal for Theoretical and Applied Finance, 13(6):867-899, 2010.

[19] O. Bardou, N. Frikha, G. Pagès (2009). Computing VaR and CVaR using Stochastic
Approximation and Adaptive Unconstrained Importance Sampling, Monte Carlo and Appli-
cations Journal, 15(3):173-210.

[20] M. Benaïm (1999). Dynamics of Stochastic Approximation Algorithms, Séminaire de Probabilités XXXIII, J. Azéma, M. Émery, M. Ledoux, M. Yor eds., Lecture Notes in Mathematics no. 1709, 1-68.

[21] A. Benveniste, M. Métivier and P. Priouret (1987). Algorithmes adaptatifs et approximations stochastiques, Masson, Paris.

[22] J. Bertoin, Lévy processes. (1996) Cambridge tracts in Mathematics, 121, Cambridge Uni-
versity Press, 262p.

[23] A. Beskos, G.O. Roberts (2005). Exact simulation of diffusions, Ann. appl. Prob., 15(4):
2422-2444.

[24] P. Billingsley (1968). Convergence of probability measure, Wiley.

[25] J.-P. Borel, G. Pagès, Y.J. Xiao (1992). Suites à discrépance faible et intégration numérique, Probabilités Numériques, N. Bouleau, D. Talay eds., coll. didactique, INRIA, ISBN-10: 2726107087.

[26] B. Bouchard, I. Ekeland, N. Touzi (2004) On the Malliavin approach to Monte Carlo
approximation of conditional expectations. (English summary) Finance Stoch., 8 (1):45-71.

[27] N. Bouleau, D. Lépingle (1994). Numerical methods for stochastic processes, Wiley Se-
ries in Probability and Mathematical Statistics: Applied Probability and Statistics. A Wiley-
Interscience Publication. John Wiley & Sons, Inc., New York, 359 pp. ISBN: 0-471-54641-0.

[28] P. P. Boyle (1977). Options: A Monte Carlo approach, Journal of Financial Economics,
4(3):323-338

[29] O. Brandière, M. Duflo (1996). Les algorithmes stochastiques contournent-ils les pièges?
(French) [Do stochastic algorithms go around traps?], Ann. Inst. H. Poincaré Probab. Statist.,
32(3):395-427.

[30] M. Briane, G. Pagès, Théorie de l’intégration, Cours & exercices, Vuibert, 1999.

[31] R. Buche, H.J. Kushner (2002). Rate of convergence for constrained stochastic approximation algorithms, SIAM J. Control Optim., 40(4):1001-1041.

[32] J.F. Carrière (1996). Valuation of the early-exercise price for options using simulations and nonparametric regression, Insurance Math. Econ., 19:19-30.

[33] K.L. Chung (1949) An estimate concerning the Kolmogorov limit distribution, Trans. Amer.
Math. Soc., 67, 36-50.

[34] E. Clément, A. Kohatsu-Higa, D. Lamberton (2006). A duality approach for the weak
approximation of Stochastic Differential Equations, Annals of Applied Probability, 16(3):449-
471.

[35] É. Clément, P. Protter, D. Lamberton (2002). An analysis of a least squares regression
method for American option pricing, Finance & Stochastics, 6(2):449-47.

[36] P. Cohort (2000). Sur quelques problèmes de quantification, thèse de l'Université Paris 6, Paris, 187p.

[37] P. Cohort (2004). Limit theorems for random normalized distortion, Ann. Appl. Probab.,
14(1):118-143.

[38] S. Corlay, G. Pagès (2014). Functional quantization based stratified sampling methods,
pre-print LPMA, 1341, to appear in Monte Carlo and Applications Journal.

[39] R. Cranley, T.N.L. Patterson (1976). Randomization of number theoretic methods for multiple integration, SIAM J. Numer. Anal., 13(6):904-914.

[40] D. Dacunha-Castelle, M. Duflo (1986). Probabilités, Problèmes à temps mobile, Masson, Paris.

[41] R.B. Davies, D.S. Harte (1987). Tests for Hurst effect, Biometrika, 74:95-101.

[42] L. Devroye (1986). Non uniform random variate generation, Springer, New York.

[43] M. Duflo (1996). Algorithmes stochastiques, coll. Mathématiques & Applications, 23, Springer-Verlag, Berlin, 319p.

[44] M. Duflo, Random iterative models, translated from the 1990 French original by Stephen
S. Wilson and revised by the author. Applications of Mathematics (New York), 34 Springer-
Verlag, Berlin, 385 pp.

[45] P. Étoré, B. Jourdain (2010). Adaptive optimal allocation in stratified sampling methods,
Methodol. Comput. Appl. Probab., 12(3):335-360.

[46] H. Faure (1986). Suite à discrépance faible dans T s , Technical report, Univ. Limoges (France).

[47] H. Faure (1982). Discrépance associée à un système de numération (en dimension s), Acta
Arithmetica, 41:337-361.

[48] O. Faure (1992). Simulation du mouvement brownien et des diffusions, Thèse de l'ENPC (Marne-la-Vallée, France), 133p.

[49] H. Föllmer, A. Schied (2002). Stochastic Finance, An introduction in discrete time, De


Gruyter Studies in Mathematics 27, De Gruyter, Berlin, 422p, second edition in 2004.

[50] J.C Fort, G. Pagès (1996). Convergence of stochastic algorithms: from the Kushner &
Clark Theorem to the Lyapunov functional, Advances in Applied Probability, 28:1072-1094.

[51] J.C Fort, G. Pagès (2002). Decreasing step stochastic algorithms: a.s. behaviour of weighted
empirical measures, Monte Carlo Methods & Applications, 8(3):237-270.

[52] A. Friedman (1975). Stochastic differential equations and applications. Vol. 1-2. Probability
and Mathematical Statistics, Vol. 28. Academic Press [Harcourt Brace Jovanovich, Publishers],
New York-London, 528p.

[53] J. H. Friedman, J.L. Bentley and R.A. Finkel (1977). An Algorithm for Finding
Best Matches in Logarithmic Expected Time, ACM Transactions on Mathematical Software,
3(3):209-226.

[54] N. Frikha, S. Menozzi (2012). Concentration bounds for Stochastic Approximations, Electron. Commun. Probab., 17(47):1-15.

[55] A. Gersho and R.M. Gray (1988). Special issue on Quantization, I-II (A. Gersho and R.M.
Gray eds.), IEEE Trans. Inform. Theory, 28.

[56] A. Gersho and R.M. Gray (1992). Vector Quantization and Signal Compression. Kluwer,
Boston.

[57] P. Glasserman (2003). Monte Carlo Methods in Financial Engineering, Springer-Verlag,


New York, 596p.

[58] P. Glasserman, P. Heidelberger, P. Shahabuddin (1999). Asymptotically optimal


importance sampling and stratification for pricing path-dependent options. Math. Finance
9(2):117-152.

[59] P. Glasserman, D.D. Yao (1992). Some guidelines and guarantees for common random
numbers, Manag. Sci., 36(6):884-908.

[60] P. Glynn (1989). Optimization of stochastic systems via simulation. In Proceedings of the
1989 Winter simulation Conference, Society for Computer Simulation, San Diego, 90-105.

[61] E. Gobet, G. Pagès, H. Pham, J. Printems (2007). Discretization and simulation for a
class of SPDE’s with applications to Zakai and McKean-Vlasov equations, SIAM Journal on
Numerical Analysis, 44(6):2505-2538.

[62] E. Gobet, S. Menozzi (2004): Exact approximation rate of killed hypo-elliptic diffusions
using the discrete Euler scheme, Stochastic Process. Appl., 112(2):201-223.

[63] E. Gobet (2000): Weak approximation of killed diffusion using Euler schemes, Stoch. Proc.
and their Appl., 87:167-197.

[64] E. Gobet, A. Kohatsu-Higa (2003). Computation of Greeks for barrier and look-back
options using Malliavin calculus, Electron. Comm. Probab., 8:51-62.

[65] E. Gobet, R. Munos (2005). Sensitivity analysis using Itô-Malliavin calculus and martin-
gales. Application to optimal control, SIAM J. Control Optim., 43(5):1676-1713.

[66] S. Graf, H. Luschgy (2000). Foundations of quantization for probability distributions,


LNM 1730, Springer, Berlin, 230p.

[67] S. Graf, H. Luschgy, and G. Pagès (2008). Distortion mismatch in the quantization of
probability measures, ESAIM P&S, 12:127-154.

[68] A. Grangé-Cabane (2011). Algorithmes stochastiques, applications à la finance, Master thesis (supervision G. Pagès), 49p.

[69] J. Guyon (2006). Euler scheme and tempered distributions, Stochastic Processes and their
Applications, 116(6), 877-904.

[70] P. Hall, C.C. Heyde (1980). Martingale Limit Theory and Its Applications, Academic Press,
308p.

[71] U.G. Haussmann (1979). On the Integral Representation of Functionals of Itô Processes, ???,
3:17-27.

[72] S.L. Heston (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options, The Review of Financial Studies, 6(2):327-343.

[73] J. Jacod, A.N. Shiryaev (2003). Limit theorems for stochastic processes. Second edition
(first edition 1987). Grundlehren der Mathematischen Wissenschaften [Fundamental Principles
of Mathematical Sciences], 288. Springer-Verlag, Berlin, 2003. xx+661 pp.

[74] J. Jacod, P. Protter (1998). Asymptotic error distributions for the Euler method for
stochastic differential equations. Annals of Probability, 26(1):267-307.

[75] J. Jacod, P. Protter Probability essentials. Second edition. Universitext. Springer-Verlag,


Berlin, 2003. x+254 pp.

[76] P. Jaillet, D. Lamberton, and B. Lapeyre (1990). Variational inequalities and the
pricing of American options, Acta Appl. Math., 21:263-289.

[77] H. Johnson (1987). Call on the maximum or minimum of several assets, J. of Financial and
Quantitative Analysis, 22(3):277-283.

[78] I. Karatzas, S.E. Shreve (1998). Brownian Motion and Stochastic Calculus, Graduate
Texts in Mathematics, Springer-Verlag, New York, 470p.

[79] O. Kavian (1993). Introduction à la théorie des points critiques et applications aux problèmes
elliptiques. (French) [Introduction to critical point theory and applications to elliptic problems].
coll. Mathématiques & Applications [Mathematics & Applications], 13, Springer-Verlag, Paris
[Berlin], 325 pp. ISBN: 2-287-00410-6.

[80] A.G. Kemna, A.C. Vorst (1990). A pricing method for options based on average asset
value, J. Banking Finan., 14:113-129.

[81] J.C. Kieffer (1982) Exponential rate of convergence for Lloyd’s method I, IEEE Trans. on
Inform. Theory, Special issue on quantization, 28(2):205-210.

[82] J. Kiefer, J. Wolfowitz (1952). Stochastic estimation of the maximum of a regression


function, Ann. Math. Stat., 23:462-466.

[83] P.E. Kloeden, E. Platen (1992). Numerical solution of stochastic differential equations, Applications of Mathematics, 23, Springer-Verlag, Berlin (New York), 632 pp. ISBN: 3-540-54062-8.

[84] A. Kohatsu-Higa, R. Pettersson (2002). Variance reduction methods for simulation of


densities on Wiener space, SIAM J. Numer. Anal., 40(2):431-450.

[85] U. Krengel (1987). Ergodic Theorems, De Gruyter Studies in Mathematics, 6, Berlin, 357p.

[86] L. Kuipers, H. Niederreiter (1974). Uniform distribution of sequences, Wiley.

[87] H. Kunita (1982). Stochastic differential equations and stochastic flows of diffeomorphisms,
Cours d’école d’été de Saint-Flour, LN-1097, Springer-Verlag, 1982.

[88] T.G. Kurtz, P. Protter (1991). Weak limit theorems for stochastic integrals and stochastic
differential equations, Annals of Probability, 19:1035-1070.

[89] H.J. Kushner, G.G. Yin (2003). Stochastic approximation and recursive algorithms and ap-
plications, 2nd edition, Applications of Mathematics, Stochastic Modelling and Applied Proba-
bility, 35, Springer-Verlag, New York.

[90] J.P. Lambert (1985). Quasi-Monte Carlo, low-discrepancy, and ergodic transformations, J.
Computational and Applied Mathematics, 13/13, 419-423.

[91] D. Lamberton, B. Lapeyre (1996). Introduction to stochastic calculus applied to finance.


Chapman & Hall, London, 185 pp. ISBN: 0-412-71800-6.

[92] B. Lapeyre, G. Pagès (1989). Familles de suites à discrépance faible obtenues par itération
d’une transformation de [0, 1], Comptes rendus de l’Académie des Sciences de Paris, Série I,
308:507-509.

[93] B. Lapeyre, G. Pagès, K. Sab (1990). Sequences with low discrepancy. Generalization and
application to Robbins-Monro algorithm, Statistics, 21(2):251-272.

[94] S. Laruelle, C.-A. Lehalle, G. Pagès (2009). Optimal split of orders across liquidity
pools: a stochastic algorithm approach, pre-pub LPMA 1316, to appear in SIAM J. on Fi-
nancial Mathematics.

[95] S. Laruelle, G. Pagès (2012). Stochastic Approximation with averaging innovation, Monte
Carlo and Appl. J.. 18:1-51.

[96] V.A. Lazarev (1992). Convergence of stochastic approximation procedures in the case of a regression equation with several roots, (translated from) Problemy Peredachi Informatsii, 28(1):66-78.

[97] B. Lapeyre, E. Temam (2001). Competitive Monte Carlo methods for the pricing of Asian
Options, J. of Computational Finance, 5(1):39-59.

[98] M. Ledoux, M. Talagrand (2011). Probability in Banach spaces. Isoperimetry and pro-
cesses. Reprint of the 1991 edition. Classics in Mathematics. Springer-Verlag, Berlin, 2011.
xii+480 pp.

[99] P. L'Ecuyer, G. Perron (1994). On the convergence rates of IPA and FDC derivative
estimators, Oper. Res., 42:643-656.

[100] M. Ledoux (1997) Concentration of measure and logarithmic Sobolev inequalities, technical
report, http://www.math.duke.edu/ rtd/CPSS2007/Berlin.pdf.

[101] J. Lelong (2007). Algorithmes stochastiques et options parisiennes, Thèse de l'École
Nationale des Ponts et Chaussées.

[102] V. Lemaire, G. Pagès (2010). Unconstrained recursive importance sampling, Annals of
Applied Probability, 20(3):1029-1067.

[103] L. Ljung (1977). Analysis of recursive stochastic algorithms, IEEE Trans. on Automatic
Control, AC-22(4):551-575.

[104] F.A. Longstaff, E.S. Schwartz (2001). Valuing American options by simulation: a simple
least-squares approach, Review of Financial Studies, 14:113-148.

[105] H. Luschgy (2013). Martingale in diskreter Zeit. Theorie und Anwendungen, Springer, Berlin,
452p.

[106] H. Luschgy, G. Pagès (2001). Functional quantization of Gaussian processes, J. of Funct.
Analysis, 196(2):486-531.

[107] H. Luschgy, G. Pagès (2008). Functional quantization rate and mean regularity of pro-
cesses with an application to Lévy processes, Annals of Applied Probability, 18(2):427-469.

[108] J. McNames (2001). A fast nearest-neighbor algorithm based on a principal axis search tree,
IEEE Trans. Pattern Anal. Mach. Intell., 23(9):964-976.

[109] S. Manaster, G. Koehler (1982). The calculation of Implied Variance from the Black-
Scholes Model: A Note, The Journal of Finance, 37(1):227-230.

[110] G. Marsaglia, T.A. Bray (1964) A convenient method for generating normal variables.
SIAM Rev., 6 , 260-264.

[111] M. Matsumoto, T. Nishimura (1998). Mersenne twister: A 623-dimensionally equidis-
tributed uniform pseudorandom number generator, ACM Trans. on Modeling and Computer
Simulation, 8(1):3-30.

[112] G. Marsaglia, W.W. Tsang (2000). The Ziggurat Method for Generating Random Vari-
ables, Journal of Statistical Software, 5(8):363-372.

[113] M. Métivier, P. Priouret (1987). Théorèmes de convergence presque sûre pour une classe
d'algorithmes stochastiques à pas décroissant (French, English summary) [Almost sure con-
vergence theorems for a class of decreasing-step stochastic algorithms], Probab. Theory Related
Fields, 74(3):403-428.

[114] G.N. Milstein (1976). A method of second order accuracy for stochastic differential equa-
tions, Theor. Probab. Appl. (USSR), 23:396-401.

[115] J. Neveu (1964). Bases mathématiques du calcul des Probabilités, Masson, Paris; English
translation Mathematical Foundations of the Calculus of Probability (1965), Holden Day, San
Francisco.

[116] J. Neveu (1972). Martingales à temps discret, Masson, 1972, 218p. English translation:
Discrete-parameter martingales, North-Holland, New York, 1975, 236p.

[117] D.J. Newman (1982). The Hexagon Theorem, IEEE Trans. Inform. Theory (A. Gersho and
R.M. Gray eds.), 28:137-138.

[118] H. Niederreiter (1992) Random Number Generation and Quasi-Monte Carlo Methods,
CBMS-NSF regional conference series in Applied mathematics, SIAM, Philadelphia.

[119] Numerical Recipes: textbook available at
www.ma.utexas.edu/documentation/nr/bookcpdf.html. See also [140].

[120] G. Pagès (1987). Sur quelques problèmes de convergence, thèse Univ. Paris 6, Paris.

[121] G. Pagès (1992). Van der Corput sequences, Kakutani transform and one-dimensional nu-
merical integration, J. Computational and Applied Mathematics, 44:21-39.

[122] G. Pagès (1998). A space vector quantization method for numerical integration, J. Compu-
tational and Applied Mathematics, 89, 1-38 (Extended version of “Voronoi Tessellation, space
quantization algorithms and numerical integration”, in: M. Verleysen (Ed.), Proceedings of
the ESANN’ 93, Bruxelles, Quorum Editions, (1993), 221-228).

[123] G. Pagès (2007). Multistep Richardson-Romberg extrapolation: controlling variance and
complexity, Monte Carlo Methods and Applications, 13(1):37-70.

[124] G. Pagès (2007). Quadratic optimal functional quantization methods and numerical appli-
cations, Proceedings of MCQMC, Ulm’06, Springer, Berlin, 101-142.

[125] G. Pagès (2013). Functional co-monotony of processes with an application to peacocks and
barrier options, Séminaire de Probabilités XLV, LNM 2078, Springer, Cham, 365-400.

[126] G. Pagès (2014). Introduction to optimal quantization for numerics, to appear in ESAIM
Proc. & Surveys, Paris, France.

[127] G. Pagès, H. Pham, J. Printems (2004). Optimal quantization methods and applications
to numerical problems in finance, Handbook on Numerical Methods in Finance (S. Rachev,
ed.), Birkhauser, Boston, 253-298.

[128] G. Pagès, H. Pham, J. Printems (2004). An Optimal Markovian Quantization Algorithm
for Multidimensional Stochastic Control Problems, Stochastics and Dynamics, 4(4):501-545.

[129] G. Pagès, J. Printems (2003). Optimal quadratic quantization for numerics: the Gaussian
case, Monte Carlo Methods and Appl. 9(2):135-165.

[130] G. Pagès, J. Printems (2005). www.quantize.maths-fi.com, web site devoted to optimal
quantization.

[131] G. Pagès, J. Printems (2009). Optimal quantization for finance: from random vectors to
stochastic processes, chapter from Mathematical Modeling and Numerical Methods in Finance
(special volume) (A. Bensoussan, Q. Zhang guest eds.), coll. Handbook of Numerical Analysis
(P.G. Ciarlet ed.), North Holland, 595-649.

[132] G. Pagès, B. Wilbertz (2012). Optimal Delaunay and Voronoi quantization methods for
pricing American options, pre-pub PMA, 2011, Numerical Methods in Finance (R. Carmona,
P. Hu, P. Del Moral, N. Oudjane eds.), Springer, New York, 171-217.

[133] G. Pagès, Y.J. Xiao (1988). Sequences with low discrepancy and pseudo-random numbers:
theoretical results and numerical tests, J. of Statist. Comput. Simul., 56:163-188.

[134] K.R. Parthasarathy (1967). Probability measures on metric spaces. Probability and Math-
ematical Statistics, No. 3 Academic Press, Inc., New York-London, xi+276 pp.

[135] M. Pelletier (1998). Weak convergence rates for stochastic approximation with application
to multiple targets and simulated annealing, Ann. Appl. Probab., 8(1):10-44.

[136] R. Pemantle (1990), Nonconvergence to unstable points in urn models and stochastic ap-
proximations, Annals of Probability, 18(2):698-712.

[137] H. Pham (2007). Some applications and methods of large deviations in finance and insurance
in Paris-Princeton Lectures on Mathematical Finance 2004, LNM 1919, New York, Springer,
191-244.

[138] B.T. Polyak (1990). A new method of stochastic approximation type (Russian), Avtomat.
i Telemekh., 7:98-107; translation in Automat. Remote Control, 51(7), part 2:937-946 (1991).

[139] P.D. Proinov (1988). Discrepancy and integration of continuous functions, J. of Approx.
Theory, 52:121-131.

[140] W.H. Press, S.A. Teukolsky, W.T. Vetterling, B.P. Flannery (2002). Numerical
recipes in C++. The art of scientific computing, Second edition, updated for C++. Cambridge
University Press, Cambridge, 2002. 1002p. ISBN: 0-521-75033-4 65-04.

[141] D. Revuz, M. Yor (1999). Continuous martingales and Brownian motion. Third edition.
Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical
Sciences], 293, Springer-Verlag, Berlin, 602 p. ISBN: 3-540-64325-7 60G44 (60G07 60H05).

[142] H. Robbins, S. Monro (1951). A stochastic approximation method, Ann. Math. Stat.,
22:400-407.

[143] R.T. Rockafellar, S. Uryasev (1999). Optimization of Conditional Value at Risk, pre-
print, Univ. Florida, http://www.ise.ufl.edu/uryasev.

[144] K.F. Roth (1954). On irregularities of distributions, Mathematika, 1:73-79.

[145] W. Rudin (1966). Real and complex analysis, McGraw-Hill, New York.

[146] W. Ruppert . . .

[147] K.-I. Sato (1999). Lévy processes and Infinitely Divisible Distributions, Cambridge Univer-
sity Press.

[148] A.N. Shiryaev (2008). Optimal stopping rules. Translated from the 1976 Russian second
edition by A.B. Aries. Reprint of the 1978 translation. Stochastic Modelling and Applied
Probability, 8, Springer-Verlag, Berlin, 2008, 217 pp.

[149] A.N. Shiryaev (1984). Probability, Graduate Texts in Mathematics, Springer, New York,
577p.

[150] P. Seumen-Tonou (1997). Méthodes numériques probabilistes pour la résolution d'équations
du transport et pour l'évaluation d'options exotiques, Thèse de l'Université de Provence
(Marseille, France), 116p.

[151] M. Sussman, W. Crutchfield, M. Papakipos (2006). Pseudo-random number generation
on GPU, in Graphics Hardware 2006, Proceedings of the Eurographics Symposium, Vienna,
Austria, September 3-4, 2006, M. Olano & P. Slusallek (eds.), A K Peters Ltd., 86-94, 424p.,
ISBN 1568813546.

[152] D. Talay, L. Tubaro (1990). Expansion of the global error for numerical schemes solving
stochastic differential equations, Stoch. Anal. Appl., 8:94-120.

[153] B. Tuffin (2004). Randomization of Quasi-Monte Carlo Methods for error estimation: Sur-
vey and Normal Approximation, Monte Carlo Methods and Applications, 10, (3-4):617-628.

[154] S. Villeneuve, A. Zanette (2002). Parabolic A.D.I. methods for pricing American options
on two stocks, Mathematics of Operations Research, 27(1):121-149.

[155] D. S. Watkins (2010) Fundamentals of matrix computations (3rd edition), Pure and Applied
Mathematics (Hoboken), John Wiley & Sons, Inc., Hoboken, NJ, xvi+644 pp.

[156] M. Winiarski (2006). Quasi-Monte Carlo Derivative valuation and Reduction of simulation
bias, Master thesis, Royal Institute of Technology (KTH), Stockholm (Sweden).

[157] A. Wood, G. Chan (1994). Simulation of stationary Gaussian processes in [0, 1]^d, Journal
of Computational and Graphical Statistics, 3(4):409-432.

[158] Y.J. Xiao (1990). Contributions aux méthodes arithmétiques pour la simulation accélérée,
Thèse de l’ENPC, Paris.

[159] Y.J. Xiao (1990). Suites équiréparties associées aux automorphismes du tore, C.R. Acad.
Sci. Paris (Série I), 317:579-582.

[160] P.L. Zador (1963). Development and evaluation of procedures for quantizing multivariate
distributions. Ph.D. dissertation, Stanford Univ.

[161] P.L. Zador (1982). Asymptotic quantization error of continuous signals and the quantization
dimension, IEEE Trans. Inform. Theory, 28(2):139-149.
Index

QMC, 87
QMC (randomized), 106
SDE, 203
SDE (flow), 250
α-quantile, 160
χ²-distribution, 34
a.s. rate (Euler schemes), 211
p-adic rotation, 98
[Glivenko-Cantelli] theorem, 88

acceptance-rejection method, 16
antithetic method, 57
antithetic random variables, 59
asian option, 54
averaging principle (stochastic approximation), 190
backward dynamic programming principle (functional), 311
backward dynamic programming principle (pathwise), 310
Bally-Talay theorem, 235
barrier option, 74
basket option, 52
Bernoulli distribution, 15
Best-of-call option, 56
binomial distribution, 15
Birkhoff's pointwise ergodic Theorem, 102
Bismut's formula, 297
Black-Scholes formula, 37
Blackwell-Rao method, 73
Box-Muller method, 26
bridge of a diffusion, 267
Brownian bridge, 265
Brownian bridge method (barrier), 269
Brownian bridge method (Lookback), 269
Brownian bridge method (max/min), 269
Brownian motion (simulation), 28
bump method, 281
Burkholder-Davis-Gundy inequality, 240
Cameron-Martin formula, 82
Central Limit Theorem (Lindeberg), 340
Cholesky decomposition, 28
Cholesky matrix, 28
Chow Theorem, 165
Chung's CLT, 95
Chung's LIL, 96
co-monotony, 57
Competitive Learning Vector Quantization algorithm, 121
complexity, 50
confidence interval, 35
confidence interval (theoretical), 33
confidence level, 33
continuous Euler scheme, 204
control variate (adaptive), 60
control variate (static), 49
convergence (weak), 88
covariance matrix, 27
cubature formula (quantization), 118
curse of dimensionality, 117
curse of dimensionality (discrepancy), 103
diffusion process (Brownian), 203
digital option, 286
discrepancy (extreme), 90
discrepancy (star), 90
discrepancy at the origin, 90
discrete time Euler scheme, 204
distortion function, 114
distribution function, 12
ergodic transform, 102
Euler function, 10
Euler scheme (continuous), 205
Euler scheme (discrete time), 204
Euler scheme (genuine), 205
Euler scheme (stepwise constant), 204
Euler schemes, 204
Euler-Maruyama schemes, 204
exchange spread option, 73
exponential distribution, 14
extreme discrepancy, 90
Faure sequences, 99
Feynman-Kac's formula, 256
finite difference (constant step), 282
finite difference (decreasing step), 288
finite difference methods (greeks), 281
finite variation function (Hardy and Krause sense), 93
finite variation function (measure sense), 92
flow of an SDE, 250
fractional Brownian motion (simulation), 30
functional monotone class theorem, 337
gamma distribution, 22
generalized Minkowski inequality, 239
generator (homogeneous), 10
geometric distribution(s), 15
geometric mean option, 56
greek computation, 281
Halton sequence, 97
Hammersley procedure, 101
Haussmann-Clark-Ocone formula, 301
homogeneous generator, 10
implication (parameter), 134
importance sampling, 78
importance sampling (recursive), 140
index option, 52
integrability (uniform), 41
Jensen inequality, 52
Kakutani sequences, 98
Kakutani's adding machine, 98
Kemna-Vorst control variate, 54
Koksma-Hlawka's Inequality, 93
Kronecker Lemma, 65, 339
Lindeberg Central Limit Theorem, 340
Lloyd I procedure (randomized), 121
log-likelihood method, 45
low discrepancy (sequence with), 97
Malliavin-Monte Carlo method, 43
mean quadratic quantization error, 114
Milstein scheme, 225
Milstein scheme (higher dimensional), 228
Monte Carlo method, 31
Monte Carlo simulation, 31
nearest neighbour projection, 114
negative binomial distribution(s), 16
negatively correlated variables, 57
Niederreiter sequences, 100
normal distribution, 331
numerical integration (quantization), 121
optimal quantizer, 114
option on a basket, 52
option on an index, 52
Pareto distribution, 14
parity equation, 66
parity variance reduction method, 66
partial Lookback option, 224
Poisson distribution, 24
Poisson process, 24
polar method, 21, 27
Polish space, 12
Portmanteau Theorem, 337
pre-conditioning, 73
Proinov's Theorem, 103
quantile, 77, 160
quantile (two-sided α-), 33
quantization error (mean quadratic), 114
quantization tree, 320
quasi-Monte Carlo method, 87
quasi-Monte Carlo method (unbounded dimension), 109
quasi-stochastic approximation, 111, 196
residual maturity, 285
Richardson-Romberg extrapolation (asian option), 276
Richardson-Romberg extrapolation (Euler scheme), 236
Richardson-Romberg extrapolation (integral), 276
Richardson-Romberg extrapolation (multistep, Euler scheme), 238
Richardson-Romberg extrapolation (path-dependent), 273
Richardson-Romberg extrapolation (quantization), 121
Riemann integrable functions, 89
risk-neutral probability, 66
Robbins-Monro algorithm, 137
Robbins-Siegmund Lemma, 135
Romberg extrapolation (quantization), 121
Roth's lower bound (discrepancy), 96
semi-convex function, 309
seniority, 285
sensitivity computation, 281
sensitivity computation (localization), 303
sensitivity computation (tangent process), 292
sequence with low discrepancy, 97
shock method, 281
signed measure, 92
simulation complexity, 50
Sobol' sequences, 100
square root of a symmetric matrix, 27
star discrepancy, 90
static control variate, 49
stationary quantizer, 115
stepwise constant Euler scheme, 204
stochastic differential equation, 203
stochastic gradient, 138
stratified sampling, 74
stratified sampling (universal), 126
strong L^p-rate (Euler schemes), 206
strong rate (Milstein scheme), 227
strongly continuous distribution, 114
structural dimension, 105
Student distribution, 35
survival function, 13
Talay-Tubaro Theorem, 231, 235
tangent process method, 47
theoretical confidence interval, 33
two-sided α-quantile, 33
uniform integrability, 41
uniformly distributed sequence, 89
unimodal distribution, 114
uniquely ergodic transform, 102
up & out call option, 275
Van der Corput sequence, 97
Von Neumann acceptance-rejection method, 16
Voronoi partition, 113
weak convergence, 88
weak rate (Euler scheme), 230
Weyl's criterion, 90
yield (of a simulation), 14
Ziggurat method, 24
