Convex and Stochastic Optimization
J. Frédéric Bonnans
Universitext
Series Editors
Sheldon Axler
San Francisco State University
Carles Casacuberta
Universitat de Barcelona
Angus MacIntyre
Queen Mary University of London
Kenneth Ribet
University of California, Berkeley
Claude Sabbah
École Polytechnique, CNRS, Université Paris-Saclay, Palaiseau
Endre Süli
University of Oxford
Wojbor A. Woyczyński
Case Western Reserve University
Thus as research topics trickle down into graduate-level teaching, first textbooks
written for new, cutting-edge courses may make their way into Universitext.
J. Frédéric Bonnans
Inria-Saclay
and
Centre de Mathématiques Appliquées
École Polytechnique
Palaiseau, France
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book is dedicated to Viviane, Juliette,
Antoine, and Na Yeong
Preface
These lecture notes are an extension of those given in the master programs at the
Universities Paris VI and Paris-Saclay, and in the École Polytechnique. They give
an introduction to convex analysis and its applications to stochastic programming,
i.e., to optimization problems where the decision must be taken in the presence of
uncertainties. This is an active subject of research that covers many applications.
Classical textbooks are Birge and Louveaux [21], Kall and Wallace [62]. The book
[123] by Wallace and Ziemba is dedicated to applications. Some more advanced
material is presented in Ruszczynski and Shapiro [105], Shapiro et al. [113],
Föllmer and Schied [49], and Carpentier et al. [32]. Let us also mention the his-
torical review paper by Wets [124].
The basic tool for studying such problems is the combination of convex analysis
with measure theory. Classical sources in convex analysis are Rockafellar [96],
Ekeland and Temam [46]. An introduction to integration and probability theory is
given in Malliavin [76].
The author expresses his thanks to Alexander Shapiro (Georgia Tech) for
introducing him to the subject, Darinka Dentchev (Stevens Institute of Technology),
Andrzej Ruszczyńki (Rutgers), Michel de Lara, and Jean-Philippe Chancelier (Ecole
des Ponts-Paris Tech) for stimulating discussions, and Pierre Carpentier with whom
he shared the course on stochastic optimization in the optimization masters at the
Université Paris-Saclay.
Summary This chapter presents the duality theory for optimization problems, by
both the minimax and perturbation approach, in a Banach space setting. Under some
stability (qualification) hypotheses, it is shown that the dual problem has a nonempty
and bounded set of solutions. This leads to the subdifferential calculus, which appears
to be nothing but a partial subdifferential rule. Applications are provided to the infimal
convolution, as well as recession and perspective functions. The relaxation of some
nonconvex problems is analyzed thanks to the Shapley–Folkman theorem.
We say that f is proper if its domain is not empty, and if f (x) > −∞, for all x ∈ X .
The feasible set and value of (P f,K ) are resp.
Since the infimum over the empty set is +∞, we have that val(P_{f,K}) < +∞ iff F(P_{f,K}) ≠ ∅. The solution set of (P_{f,K}) is defined as
Example 1.1 Consider the problem of minimizing the exponential function over R.
The value is finite, but the solution set is empty. Note that minimizing subsequences
have no limit point in R.
Note that there is no ambiguity in this definition (taking the usual addition rules in
the presence of ±∞). The domain of the sum is the intersection of the domains.
Let f_K(x) := f(x) + I_K(x). Then (P_{f,K}) has the same feasible set, value and set of solutions as (P_{f_K,X}).
1.1 Convex Functions
Observe that, for λ > 0, (P_{λf,K}) has the same feasible set and set of solutions as (P_{f,K}), and the values are related by
val(P_{λf,K}) = λ val(P_{f,K}). (1.8)
Then (P_{0f,K}) has the same feasible set as (P_{f,K}), and its set of solutions is F(P_{f,K}) if f is proper.
Example 1.3 Consider the entropy function f (x) = x log x (with the convention
that 0 log 0 = 0) if x ≥ 0, and +∞ otherwise. Then 0 f is the indicatrix of R+ .
More generally, if f is proper, then 0 f is the indicatrix of its domain.
Indeed these two problems have the same feasible set and set of solutions, and they
have opposite values.
For any x and y in K , and α ∈ (0, 1), we have that αx + (1 − α)y ∈ K . (1.10)
We recall without proof the Hahn–Banach theorem, valid in a vector space setting,
and deduce from it some results of separation of convex sets in normed vector spaces.
Remark 1.6 (a) Taking x = 0 in (1.13)(i), we obtain that p(0) = 0, and so we could
as well take α = 0 in (1.13)(i).
(b) If β ∈ (0, 1), combining the above relations, we obtain that
We say that a real vector space X is a normed space when endowed with a mapping X → R, x ↦ ‖x‖, satisfying the three axioms
‖x‖ ≥ 0, with equality iff x = 0;
‖αx‖ = |α| ‖x‖, for all α ∈ R, x ∈ X; (1.16)
‖x + x′‖ ≤ ‖x‖ + ‖x′‖ (triangle inequality).
Corollary 1.8 Let x₁* be a continuous linear form on a linear subspace X₁ of the normed space X. Then there exists an x* ∈ X* whose restriction to X₁ coincides with x₁*, and such that
‖x*‖_* = ‖x₁*‖_{1,*}. (1.19)
Proof Apply Theorem 1.7 with p(x) := ‖x₁*‖_{1,*} ‖x‖. Since ⟨x*, ±x⟩ ≤ p(x), we have that ‖x‖ ≤ 1 implies ⟨x*, ±x⟩ ≤ ‖x₁*‖_{1,*}. The result follows.
1 A Convex Optimization Toolbox
Corollary 1.9 Let x₀ belong to the normed vector space X. Then there exists an x* ∈ X* such that ‖x*‖ = 1 and ⟨x*, x₀⟩ = ‖x₀‖.
Proof Apply Corollary 1.8 with X₁ = Rx₀ and x₁*(t x₀) = t‖x₀‖, for t ∈ R.
a (closed) half-space of X .
Definition 1.11 Let A and B be two subsets of X . We say that the hyperplane Hx ∗ ,α
separates A and B if
We say that x ∗ ∈ X ∗ (nonzero) separates A and B if (1.24) holds for some α, strictly
separates A and B if (1.25) holds, and strongly separates A and B if (1.26) holds for
some ε > 0 and α. If A is the singleton {a}, then we say that x ∗ separates a and B,
etc.
Given two subsets A and B of a vector space X , we define their Minkowski sum
and difference as
A + B = {a + b; a ∈ A, b ∈ B},  A − B = {a − b; a ∈ A, b ∈ B}. (1.27)
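In finite dimensions these operations are easy to experiment with. The following small sketch (ours, not from the book) forms the Minkowski sum and difference of two finite point sets in R², directly from (1.27).

```python
# Minkowski sum and difference of finite subsets of R^2, as in (1.27).
def minkowski_sum(A, B):
    """A + B = {a + b; a in A, b in B}."""
    return {(a[0] + b[0], a[1] + b[1]) for a in A for b in B}

def minkowski_diff(A, B):
    """A - B = {a - b; a in A, b in B}."""
    return {(a[0] - b[0], a[1] - b[1]) for a in A for b in B}

A = {(0, 0), (1, 0)}          # a horizontal segment (two endpoints)
B = {(0, 0), (0, 1)}          # a vertical segment
print(minkowski_sum(A, B))    # the four vertices of the unit square
assert minkowski_sum(A, B) == {(0, 0), (1, 0), (0, 1), (1, 1)}
assert (0, 0) in minkowski_diff(A, A)   # A - A always contains 0
```

Note that A − A contains 0 for any nonempty A, which is why the difference is useful in separation arguments.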
Example 1.14 If C is the closed unit ball of X, then g_C(x) = ‖x‖ for all x ∈ X.
A gauge function is obviously positively homogeneous and finite. If B(0, ε) ⊂ C for some ε > 0, then
g_C(x) ≤ ‖x‖/ε, for all x ∈ X, (1.30)
so it is bounded over bounded sets. In addition, for any β > g_C(x) and γ > 0, since x ∈ βC and B(0, γε) ⊂ γC, we get x + B(0, γε) ⊂ (β + γ)C, so that g_C(y) ≤ g_C(x) + γ, for all y ∈ B(x, γε). We have proved that
Proof Let x and y belong to X. For all β_x > g_C(x) and β_y > g_C(y), we have that (β_x)⁻¹x ∈ C and (β_y)⁻¹y ∈ C, so that
(x + y)/(β_x + β_y) = (β_x/(β_x + β_y)) (β_x)⁻¹x + (β_y/(β_x + β_y)) (β_y)⁻¹y ∈ C. (1.33)
Therefore, g_C(x + y) ≤ β_x + β_y. Since this holds for any β_x > g_C(x) and β_y > g_C(y), we obtain that g_C is subadditive. Since g_C is positively homogeneous, it easily follows that g_C is convex.
C := {a − b + x0 , b ∈ B, a ∈ A}. (1.34)
Corollary 1.16 Let E be a closed convex subset of the normed space X. Then there exists a hyperplane that strongly separates any x₀ ∉ E and E.
Proof For ε > 0 small enough, the open convex set A := B(x₀, ε) has empty intersection with E. By Theorem 1.12, there exists an x* ≠ 0 separating A and E, that is,
⟨x*, x₀⟩ + ε‖x*‖_* = sup{⟨x*, a⟩; a ∈ A} ≤ inf{⟨x*, b⟩; b ∈ E}. (1.36)
Remark 1.17 Corollary 1.16 can be reformulated as follows: any closed convex subset of a normed space is the intersection of the half-spaces containing it.
The following example shows that, even in a Hilbert space, one cannot in general
separate two convex sets with empty intersection.
Example 1.18 Let X = ℓ² be the space of square-summable real sequences. Let C be the subset of X of sequences with finitely many nonzero coefficients, the last one being positive. Then C is a convex cone that does not contain 0. Let x* separate 0 and C. We can identify the Hilbert space X with its dual, and therefore x* with an element of X. Since each element e_i of the natural basis belongs to the cone C, we must have x_i* ≥ 0 for all i, and x_j* > 0 for some j. For any ε > 0 small enough, x := −e_j + ε e_{j+1} belongs to C, but ⟨x*, x⟩ = −x_j* + ε x_{j+1}* < 0. So, 0 and C cannot be separated.
Proposition 1.19 Let A and B be two nonempty subsets of X , with empty intersec-
tion. If A − B is convex and has a nonempty relative interior, then there exists a
hyperplane Hx ∗ ,α separating A and B, and such that
Remark 1.20 By the previous proposition when B = {b} is a singleton, noting that rint(A − b) = rint(A) − b, we obtain that when A is convex, if b ∉ rint(A), then there exists an x* ∈ X* such that
1 Except maybe when the set is a singleton and then the dimension is zero, where this is a matter
of definition. However the case when A − B reduces to a singleton means that both A and B are
singletons and then it is easy to separate them.
Corollary 1.21 Let A and B be two convex and nonempty subsets of a Euclidean
space, with empty intersection. Then there exists a hyperplane Hx ∗ ,α separating A
and B, such that (1.37) holds.
Let X and Y be two sets and let L : X × Y → R. Then we have the weak duality
inequality
sup_{y∈Y} inf_{x∈X} L(x, y) ≤ inf_{x∈X} sup_{y∈Y} L(x, y). (1.39)
Removing the middle term, maximizing the left-hand side w.r.t. y0 and minimizing
the right-hand side w.r.t. x0 , we obtain (1.39). We next define the primal and dual
values, resp., for x ∈ X and y ∈ Y , by
The weak duality inequality says that val(D) ≤ val(P). We say that (x̄, ȳ) ∈ X × Y
is a saddle point of L over X × Y if
An equivalent relation is
Minorizing the left-hand term by changing x̄ into the infimum w.r.t. x ∈ X and
majorizing symmetrically the right-hand term, we obtain
inf_{x∈X} sup_{y∈Y} L(x, y) ≤ L(x̄, ȳ) ≤ sup_{y∈Y} inf_{x∈X} L(x, y), (1.44)
which, combined with the weak duality inequality, shows that x̄ ∈ S(P), ȳ ∈ S(D)
and
val(D) = val(P) = L(x̄, ȳ). (1.45)
But we have more, in fact, if we denote by S P(L) the set of saddle points, then:
val(D) = inf_{x∈X} L(x, ȳ) ≤ L(x̄, ȳ) ≤ sup_{y∈Y} L(x̄, y) = val(P). (1.47)
If val(D) = val(P), then these inequalities are equalities, so that (1.43) holds,
and therefore (x̄, ȳ) is a saddle point. The converse implication has already been
obtained.
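A toy finite case makes the weak duality inequality (1.39), and the possible absence of a saddle point, concrete. In the sketch below (our own illustration: X and Y are index sets of size two and L is given by a payoff matrix), the inequality is strict, so no saddle point exists.

```python
# Weak duality (1.39) for L(x, y) = M[x][y] over finite index sets.
M = [[3, 1],
     [0, 2]]

inf_sup = min(max(row) for row in M)                 # inf_x sup_y L(x, y)
sup_inf = max(min(M[i][j] for i in range(2))         # sup_y inf_x L(x, y)
              for j in range(2))

assert sup_inf <= inf_sup      # weak duality always holds
assert sup_inf < inf_sup       # here it is strict: no saddle point exists
print(sup_inf, inf_sup)        # prints: 1 2
```

By the equivalence above, a saddle point would force the two values to coincide; the gap 1 < 2 certifies that none exists for this L.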
We assume in this section that X is a vector space (with no associated topology; this
abstract setting does not make the proofs more complicated). Consider the infinite-
dimensional linear program
where c and a_i, i = 1, …, p, are linear forms over X, b ∈ R^p, and ⟨·, ·⟩ denotes the action of a linear form over X. The associated Lagrangian function L : X × R^{p*} → R is defined as
L(x, λ) := ⟨c, x⟩ + Σ_{i=1}^p λ_i (⟨a_i, x⟩ − b_i), (1.48)
where the multiplier λ has to belong to R₊^{p*}. The primal value satisfies
p(x) = sup_{λ∈R₊^{p*}} L(x, λ) = ⟨c, x⟩ if x ∈ F(LP), and +∞ otherwise. (1.49)
Therefore (LP) and the primal problem (of minimizing p(x)) have the same value and set of solutions. Since L(x, λ) = ⟨c + Σ_{i=1}^p λ_i a_i, x⟩ − λb, we have that
d(λ) = inf_x L(x, λ) = −λb if c + Σ_{i=1}^p λ_i a_i = 0, and −∞ otherwise. (1.50)
The dual problem has therefore the same value and set of solutions as the following
problem, called dual to (L P):
Max_{λ∈R₊^{p*}} −λb; c + Σ_{i=1}^p λ_i a_i = 0. (LD)
Lemma 1.23 The pair (x, λ) ∈ F(L P) × F(L D) is a saddle point of the
Lagrangian iff (1.52) holds.
Proof Let x ∈ F(L P) and λ ∈ F(L D). Then (1.52)(i) holds, implying that the dif-
ference of cost function is equal to
⟨c, x⟩ + λb = Σ_{i=1}^p λ_i (b_i − ⟨a_i, x⟩). (1.53)
This sum of nonnegative terms is equal to zero iff the last relation of (1.52)(ii)
holds, and then x ∈ S(L P) and λ ∈ S(L D), proving that (x, λ) is a saddle point.
The converse implication is easily obtained.
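These optimality conditions can be checked numerically on a small instance of (LP). The sketch below (an illustration of ours, assuming scipy is available) solves min ⟨c, x⟩ subject to ⟨a_i, x⟩ ≤ b_i, recovers the multipliers λ from the solver, and verifies dual feasibility, complementarity as in (1.53), and the equality of primal and dual values.

```python
# Saddle-point conditions for a small instance of (LP):
#   min <c, x>  s.t.  <a_i, x> <= b_i,  i = 1, 2.
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -1.0])
A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])

res = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2, method="highs")
x = res.x
lam = -res.ineqlin.marginals   # HiGHS marginals are <= 0 for ub constraints

assert np.all(lam >= -1e-9)                  # lam in R_+^p
assert np.allclose(c + A.T @ lam, 0.0)       # dual feasibility: c + sum_i lam_i a_i = 0
assert np.allclose(lam * (b - A @ x), 0.0)   # complementarity, cf. (1.53)
assert np.isclose(c @ x, -lam @ b)           # equal primal and dual values
print(x, lam)                                # optimum (1, 1), multipliers (1, 1)
```

The pair (x, λ) satisfying these assertions is exactly a saddle point of the Lagrangian in the sense of Lemma 1.23.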
Lemma 1.24 If (L P) (resp. (L D)) has a finite value, then its set of solutions is
nonempty.
Proof The proof of the two cases being similar, it suffices to prove the first statement.
Since (L P) has a finite value, there exists a minimizing sequence x k . Extracting a
subsequence if necessary, we may assume that I (x k ) is constant, say equal to J .
Among such minimizing sequences we may assume that J is of maximal cardinality.
If ⟨c, x^k⟩ has, for large enough k, a constant value, then the corresponding x^k is a solution of (LP). Otherwise, extracting a subsequence if necessary, we may assume that ⟨c, x^{k+1}⟩ < ⟨c, x^k⟩ for all k. Set d^k := x^{k+1} − x^k, and consider the set E_k := {ρ ≥ 0; x^k + ρd^k ∈ F(LP)}. Since ⟨c, d^k⟩ < 0 and val(LP) > −∞, this
set is bounded. The maximal element is
C := {d ∈ X; ⟨a_i, d⟩ ≤ 0, i ∈ J}. (1.55)
Consider the mapping Ax := (⟨a₁, x⟩, …, ⟨a_j, x⟩, ⟨c, x⟩) over X, with image E₁ ⊂ R^{j+1}. We claim that the point z := (0, …, 0, −1) does not belong to the set E₂ := E₁ + R₊^{j+1}. Indeed, otherwise we would have z ≥ Ax, for some x ∈ X, and so ⟨a_i, x⟩ ≤ 0, for all i ∈ J, whereas z_{j+1} = −1 contradicts (1.56).
Let z¹, …, z^q be a basis of the vector space E₁. Then E₂ is the set of nonnegative linear combinations of {±z¹, …, ±z^q, e¹, …, e^{j+1}}, where by e^i we denote the elements of the natural basis of R^{j+1}. By Lemma 1.25, E₂ is closed. Corollary 1.16 allows us to strictly separate z and E₂. That is,
Since E₂ is a cone, the above infimum is zero, whence λ_{j+1} > 0. Changing λ into λ/λ_{j+1} if necessary, we may assume that λ_{j+1} = 1. Since E₂ = E₁ + R₊^{j+1} it follows that λ ∈ R₊^{j+1}, and since E₁ ⊂ E₂ we deduce that 0 ≤ ⟨Σ_{i∈J} λ_i a_i + c, d⟩ for all d ∈ X, meaning that Σ_{i∈J} λ_i a_i + c = 0. Let us now set λ_i = 0 for i ≤ p, i ∉ J. Then (1.52) holds at the point x̄. By Lemma 1.23, (x̄, λ) is a saddle point and the conclusion follows.
Remark 1.27 It may happen, even in a finite-dimensional setting, that both the primal and dual problems are infeasible, so that they have values +∞ and −∞ resp.; consider for instance the problem Min_{x∈R} {−x; 0 × x = 1; −x ≤ 0}, whose dual is Max_{λ∈R²} {−λ₁; −1 − λ₂ = 0; λ₂ ≥ 0}.
Then there exists a Hoffman constant M > 0, not depending on b, such that, if C_b ≠ ∅, then
dist(x, C_b) ≤ M Σ_{i=1}^p (⟨a_i, x⟩ − b_i)₊, for all x ∈ X. (1.59)
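In the special case where a_i = e_i (so that C_b = {x; x_i ≤ b_i}) the projection of x onto C_b is min(x, b), hence dist(x, C_b) = ‖(x − b)₊‖, and the bound (1.59) holds with M = 1. The following sketch (our simplified illustration, not the general case) checks this on random points.

```python
# Hoffman bound (1.59) with a_i = e_i: C_b = {x; x_i <= b_i}, projection
# of x is min(x, b), so dist(x, C_b) = ||(x - b)_+||_2 and M = 1 works.
import numpy as np

rng = np.random.default_rng(0)
for _ in range(100):
    b = rng.normal(size=5)
    x = rng.normal(size=5)
    violation = np.maximum(x - b, 0.0)       # the terms (<a_i, x> - b_i)_+
    dist = np.linalg.norm(violation)         # Euclidean distance to C_b
    assert dist <= violation.sum() + 1e-12   # the bound with M = 1
print("Hoffman bound with M = 1 verified on 100 random points")
```

The point of the lemma is that such an M exists for an arbitrary matrix of constraints, uniformly in b; this box-constrained case is merely the easiest one to verify by hand.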
ABx = Ax − Σ_{j=1}^q α_j(x) Ax^j = 0, (1.61)
Specifically, let x̂ be such an element of Cb for which |β| is minimum, that is, β is a
solution of the problem
Min_{γ∈R^q} ½|γ|²; ⟨a_i, x + Σ_{j=1}^q γ_j x^j⟩ ≤ b_i, i = 1, …, p. (1.63)
The following optimality conditions hold: there exists a λ ∈ R₊^p such that
γ_j + Σ_{i=1}^p λ_i ⟨a_i, x^j⟩ = 0, j = 1, …, q, (1.64)
and
λ_i (⟨a_i, x + Σ_{j=1}^q γ_j x^j⟩ − b_i) = 0, i = 1, …, p. (1.65)
|γ|² = Σ_{j=1}^q γ_j² = −Σ_{i,j} λ_i ⟨a_i, γ_j x^j⟩ = Σ_{i=1}^p λ_i (⟨a_i, x⟩ − b_i) ≤ ‖λ‖_∞ Σ_{i=1}^p (⟨a_i, x⟩ − b_i)₊. (1.66)
(c) Among all possible multipliers λ we may take one with minimal support. From (1.64) we deduce the existence of an M₁ > 0, not depending on x and b, such that
‖λ‖_∞ ≤ M₁ |γ|. (1.67)
Combining this with (1.66), and dividing by |γ| when γ ≠ 0, we obtain
|γ| ≤ M₁ Σ_{i=1}^p (⟨a_i, x⟩ − b_i)₊. (1.68)
The conclusion follows since, as noticed before, the Euclidean norm on H is equiv-
alent to the one induced by the norm of X .
We will generalize the previous result, and for this we need the following fundamental
result in functional analysis.
Theorem 1.29 (Open mapping theorem) Let X and Y be Banach spaces, and let
A ∈ L(X, Y ) be surjective. Then α BY ⊂ AB X , for some α > 0.
Proof See e.g. [28].
Corollary 1.30 Let A and α be as in Theorem 1.29. Then Im(A*) is closed, and
α ‖λ‖_{Y*} ≤ ‖A*λ‖_{X*}, for all λ ∈ Y*, (1.69)
proving (1.69). Let us now check that Im(A*) is closed. Let x_k* in Im(A*) converge to x*. There exists a sequence λ_k ∈ Y* such that x_k* = A*λ_k. In view of (1.69), λ_k is a Cauchy sequence and hence has a limit λ̄ ∈ Y*. Therefore x* = A*λ̄ ∈ Im(A*). The conclusion follows.
Proposition 1.31 Let X and Y be Banach spaces, and A ∈ L(X, Y). Then Im(A*) ⊂ (Ker A)^⊥, with equality if A has a closed range.
Proof (a) Let x* ∈ Im(A*), i.e., x* = A*y*, for some y* ∈ Y*, and let x ∈ Ker A. Then ⟨x*, x⟩_X = ⟨y*, Ax⟩_Y = 0. Therefore, Im(A*) ⊂ (Ker A)^⊥.
(b) Assume now that A has closed range. Let x* ∈ (Ker A)^⊥. For y ∈ Y, set
Since x* ∈ (Ker A)^⊥, any x such that Ax = y gives the same value of ⟨x*, x⟩_X, and therefore v(y) is well-defined. It is easily checked that it is a linear function. By the open mapping theorem, applied to the restriction of A from X to its image (the latter being a Banach space by hypothesis), there exists an x ∈ α⁻¹‖y‖_Y B_X such that Ax = y, so that |v(y)| ≤ α⁻¹‖x*‖ ‖y‖_Y. So, v is a linear and continuous mapping, i.e., there exists a y* ∈ Y* such that v(y) = ⟨y*, y⟩_Y. For all x ∈ X, we have therefore ⟨x*, x⟩_X = ⟨y*, Ax⟩_Y = ⟨A*y*, x⟩_X, so that x* = A*y*, as was to be proved.
Remark 1.32 See in Example 1.115 another proof, based on duality theory.
Example 1.33 Let X := L²(0, 1), Y := L¹(0, 1), and A ∈ L(X, Y) be the injection of X into Y. Then Ker A is reduced to 0, and therefore its orthogonal is X*. On the other hand, we have that for y* ∈ L^∞(0, 1), A*y* is the operator in X* defined by x ↦ ∫₀¹ y*(t)x(t) dt. So the image of A* is a dense subspace of X*, but A* is not surjective.
Proof By the open mapping Theorem 1.29, there exists an x′ ∈ Ker A such that ‖x − x′‖ ≤ α⁻¹‖Ax‖, where α is given by Theorem 1.29. Therefore, for some M > 0 not depending on x:
(⟨a_i, x′⟩)₊ ≤ (⟨a_i, x⟩)₊ + M‖Ax‖. (1.73)
Applying Lemma 1.28 to x′, with Ker A in place of X, we obtain the desired conclusion.
1.1.5 Conjugacy
This can be motivated as follows. Let us look for an affine minorant of f of the form ⟨x*, x⟩ − β. For given x*, the best (i.e., minimal) value of β is precisely f*(x*).
Being a supremum of affine functions, f ∗ is l.s.c. convex. We obviously have the
Fenchel–Young inequality
Since the supremum over an empty set is −∞, we may always express f ∗ by maxi-
mizing over dom( f ):
If f(x) is finite we can write the Fenchel–Young inequality in the more symmetric form
⟨x*, x⟩ ≤ f(x) + f*(x*). (1.79)
Lemma 1.36 Let f be proper. Then the symmetric form (1.79) of the Fenchel–Young
inequality is valid for any (x, x ∗ ) in X × X ∗ .
Proof By (1.77), f ∗ (x ∗ ) > −∞, and f (x) > −∞, so that (1.79) makes sense and
is equivalent to the Fenchel–Young inequality.
(x, y) ≤ ½‖x‖² + ½‖y‖², for all x, y in X. (1.80)
Example 1.38 Let p > 1. Define f : R → R by f(x) := |x|^p/p. For y ∈ R, the maximum of x ↦ xy − f(x) is attained at 0 if y = 0, and otherwise for some x ≠ 0 of the same sign as y such that |x|^{p−1} = |y|. Introducing the conjugate exponent p* such that 1/p* + 1/p = 1, we get f*(y) = |y|^{p*}/p*, so that xy ≤ |x|^p/p + |y|^{p*}/p*, for all x, y in R. Similarly, for some p > 1, let f : Rⁿ → R be defined by f(x) := ‖x‖_p^p/p, where ‖x‖_p^p = Σ_{i=1}^n |x_i|^p. We easily obtain that f*(y) = ‖y‖_{p*}^{p*}/p*, leading to the Young inequality
Σ_{i=1}^n x_i y_i ≤ (1/p)‖x‖_p^p + (1/p*)‖y‖_{p*}^{p*}, for all x, y in Rⁿ. (1.81)
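The conjugate pair (|x|^p/p, |y|^{p*}/p*) and the Young inequality can be verified numerically; the sketch below (ours, with a grid-based approximation of the supremum defining f*) checks both for p = 3.

```python
# f(x) = |x|^p / p has conjugate f*(y) = |y|^{p*}/p*, with 1/p + 1/p* = 1.
import numpy as np

p = 3.0
ps = p / (p - 1.0)                       # conjugate exponent p* = 1.5
xs = np.linspace(-10.0, 10.0, 200001)    # grid for the supremum

def conj(y):
    """Grid approximation of f*(y) = sup_x (x y - |x|^p / p)."""
    return float(np.max(xs * y - np.abs(xs) ** p / p))

for y in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert abs(conj(y) - abs(y) ** ps / ps) < 1e-3

# Young's inequality in dimension 1, with equality when |x|^{p-1} = |y|
# and x, y have the same sign:
x, y = 2.0, 2.0 ** (p - 1.0)
assert np.isclose(x * y, abs(x) ** p / p + abs(y) ** ps / ps)
print("f* and the Young inequality verified for p =", p)
```

The grid supremum is only an approximation; the tolerance 1e-3 reflects the spacing of the grid, not an exact identity.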
Exercise 1.39 Let A be a symmetric, positive definite n × n matrix. (i) Check that the conjugate of f(x) := ½x⊤Ax is f*(y) := ½y⊤A⁻¹y. Taking x = y, deduce the Young inequality
|x|² ≤ ½x⊤Ax + ½x⊤A⁻¹x, for all x ∈ Rⁿ. (1.82)
Conclude that
Exercise 1.40 Check that the conjugate of the indicatrix of the (open or closed) unit
ball of X is the dual norm.
Exercise 1.41 Let f (x) := αg(x) with α > 0 and g : X → R̄. Show that
f ∗ (x ∗ ) = αg ∗ (x ∗ /α). (1.84)
Exercise 1.42 Show that the Fenchel conjugate of the exponential is the entropy function H with value H(x) = x(log x − 1) if x > 0, H(0) = 0, and H(x) = ∞ if x < 0. Deduce the inequality xy ≤ eˣ + y(log y − 1), for all x ∈ R and y > 0.
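The conjugate in Exercise 1.42 can likewise be approximated on a grid; this sketch (ours) compares sup_x (xy − eˣ) with H(y) = y(log y − 1) for several y > 0 and checks the resulting inequality.

```python
# Exercise 1.42 numerically: the conjugate of exp is the entropy H.
import numpy as np

xs = np.linspace(-20.0, 20.0, 400001)

def conj_exp(y):
    """Grid approximation of sup_x (x y - e^x)."""
    return float(np.max(xs * y - np.exp(xs)))

for y in (0.5, 1.0, 2.0, 5.0):
    assert abs(conj_exp(y) - y * (np.log(y) - 1.0)) < 1e-3  # H(y) = y(log y - 1)

assert abs(conj_exp(0.0)) < 1e-6     # the sup tends to 0 as x -> -inf: H(0) = 0
x, y = 1.3, 2.7
assert x * y <= np.exp(x) + y * (np.log(y) - 1.0)   # x y <= e^x + H(y)
print("conjugate of exp matches the entropy function H")
```

For y > 0 the supremum is attained at x = log y, which is what makes the closed form H(y) = y(log y − 1) drop out.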
The biconjugate of f is the function f ∗∗ : X → R̄ defined by
i.e., −x ∗ separates x0 from dom( f ). Otherwise, we say that the separating hyperplane
is oblique. We may then assume that β = 1 and we obtain that
meaning that the above r.h.s. is an affine minorant of f. At the same time, its value at x₀ is βε + ⟨x*, x₀⟩ + γ, which for β > 0 large enough is larger than α₀. So this r.h.s. is an oblique hyperplane separating (x₀, α₀) from epi(f). The conclusion follows.
Definition 1.45 (i) Let E ⊂ X . The convex hull conv(E) is the smallest convex
set containing E, i.e., the set of finite convex combinations (linear combinations
with nonnegative weights whose sum is 1) of elements of E. The convex closure
of E, denoted by conv(E), is the smallest closed convex set containing E (i.e., the intersection of the closed convex sets containing E).
(ii) Let f : X → R̄. The convex closure of f is the function conv( f ) : X → R̄
whose epigraph is conv(epi( f )) (note that conv( f ) is the supremum of l.s.c. convex
minorants of f ).
We obviously have that f = conv( f ) iff f is convex and l.s.c.
Theorem 1.46 (Fenchel–Moreau–Rockafellar) Let f : X → R̄. We have the fol-
lowing alternative: either
(i) f ∗∗ = −∞ identically, conv( f ) has no finite value, and has value −∞ at some
point, or
(ii) f ∗∗ = conv( f ) and conv( f )(x) > −∞, for all x ∈ X .
Proof If f is identically equal to +∞, the conclusion is obvious. So we may
assume that dom( f ) = ∅. Since f ∗∗ is an l.s.c. convex minorant of f , we have
that f ∗∗ ≤ conv( f ). So, if conv( f )(x1 ) = −∞ for some x1 ∈ X , then f has no
affine minorant and f ∗∗ = −∞. In addition, since conv( f ) is l.s.c. convex, for any
x ∈ dom(conv( f )), setting x θ := θ x + (1 − θ )x1 , we have that
conv( f )(x) ≤ lim conv( f )(x θ ) ≤ θ conv( f )(x) + (1 − θ ) conv( f )(x1 ) = −∞,
θ↑1
(1.90)
so that (i) holds. On the contrary, if (i) does not hold, then f has a continuous affine
minorant, so that then conv( f )(x) > −∞, for all x ∈ X . Being proper, l.s.c. and
convex, conv( f ) is by Theorem 1.44 the supremum of its affine minorants, which
coincide with the affine minorants of f . The conclusion follows.
Corollary 1.47 Let f be convex X → R̄. Then
(i) conv(f)(x) = lim inf_{x′→x} f(x′), for all x ∈ X,
(ii) if f is finite-valued and l.s.c. at some x₀ ∈ X, then f(x₀) = f**(x₀).
Proof (i) Set g(x) := lim inf_{x′→x} f(x′). It is easily checked that g is an l.s.c. convex minorant of f, and therefore g ≤ conv(f). On the other hand, since conv(f) is an l.s.c. minorant of f, we have that conv(f)(x) ≤ lim inf_{x′→x} f(x′) = g(x), proving (i).
(ii) By point (i), since f is finite-valued and l.s.c. at x0 , we have that f (x0 ) =
conv( f )(x0 ) > −∞, and we conclude by Theorem 1.46.
Example 1.48 Let K be a nonempty closed convex subset of X, and set f(x) = −∞ if x ∈ K, and f(x) = +∞ otherwise. Then f is l.s.c. convex, and f** has value −∞ everywhere, so that f ≠ f** (unless K = X).
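Theorem 1.46 can be visualized numerically: discretizing the Legendre transform twice yields an approximation of f**, i.e. of the closed convex hull. The sketch below (ours; the double well f(x) = (x² − 1)² is our own choice of nonconvex example) shows that f** fills in the well, with f**(0) ≈ 0 < 1 = f(0).

```python
# Biconjugate of the double well f(x) = (x^2 - 1)^2 via two discrete
# Legendre transforms; f** approximates conv(f) (Theorem 1.46).
import numpy as np

xs = np.linspace(-3.0, 3.0, 2001)
ys = np.linspace(-60.0, 60.0, 2001)
f = (xs**2 - 1.0) ** 2

fstar = np.max(ys[:, None] * xs[None, :] - f[None, :], axis=1)    # f*(y)
fss = np.max(xs[:, None] * ys[None, :] - fstar[None, :], axis=1)  # f**(x)

i0 = int(np.argmin(np.abs(xs)))      # grid index of x = 0
assert f[i0] > 0.9                   # f(0) = 1
assert fss[i0] < 1e-2                # conv(f)(0) = 0: the well is filled in
assert np.all(fss <= f + 1e-8)       # f** is a minorant of f
print("f(0) =", f[i0], " f**(0) ~", fss[i0])
```

Here the closed convex hull vanishes on [−1, 1] and agrees with f outside, which is exactly what the discrete biconjugate reproduces up to grid error.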
Let X be a Banach space, and g : X ∗ → R̄. Its (Legendre–Fenchel) conjugate (in the
dual sense) is the function g ∗ : X → R̄ defined by
f ∗∗∗ = f ∗ . (1.95)
Example 1.51 Recall the definition (1.6) of the indicatrix function. The support
function σ K : X ∗ → R̄ (sometimes also denoted by σ (·, K )) is defined by
Equivalently, ∂ f (x) is the set of slopes of affine minorants of f that are exact (i.e.,
equal to f ) at the point x. The inequality in (1.98) may be written as
are exact at x for f ∗∗ are those which attain the supremum in (1.85), i.e.
Also, if ∂ f (x) = ∅, then the corresponding affine minorants, exact at x, are also
minorants of f ∗∗ exact at x, and therefore
Similarly to what was done before we can express the above inequality as
This means that x ∈ ∂g(x ∗ ) iff x ∗ attains the maximum in the definition of g ∗ (x),
i.e., we have that
Lemma 1.52 Let f : X → R̄ have a finite value at some x ∈ X. If equality holds in the Fenchel–Young inequality (1.79), then x ∈ ∂f*(x*); the converse holds if f is proper, l.s.c. convex.
Proof If (1.79) holds with equality, we know that f (x) = f ∗∗ (x) and so, by (1.105)
applied to g = f ∗ , x ∈ ∂ f ∗ (x ∗ ) holds. Conversely, if x ∈ ∂ f ∗ (x ∗ ), then by (1.105)
applied to g = f ∗ , we have that f ∗ (x ∗ ) + f ∗∗ (x) = x ∗ , x. When f is proper, l.s.c.
convex, f ∗∗ (x) = f (x), so that equality holds in the Fenchel–Young inequality, as
was to be proved.
In the analysis of stochastic problems we will need the following sensitivity anal-
ysis results for linear programs.
Lemma 1.55 Let f have a finite value at x̄ ∈ X. Then ∂f(x̄) is nonempty and satisfies
∂f(x̄) = {−M⊤λ; λ ∈ S(D_{x̄})}. (1.109)
f*(x*) = sup_{x; y≥0} {x*·x − d·y; Ay = b + Mx} = −inf_{x; y≥0} {d·y − x*·x; Ay = b + Mx}. (1.110)
Since f(x̄) is finite, the linear program involved in the above r.h.s. is feasible. By linear programming duality (Lemma 1.26) it has the same value as its dual, and hence,
0 ≤ f(x̄) + f*(x*) − x*·x̄ = f(x̄) − x*·x̄ + inf_{λ∈R^p} {λ·b; x* = −M⊤λ; d + A⊤λ ≥ 0}. (1.112)
Since f (x̄) is the finite value of a feasible linear program, it is equal to val(Dx̄ ). So,
let λ̄ ∈ S(Dx̄ ). The Fenchel–Young inequality (1.112) is equivalent to
When equality holds, the linear program on the r.h.s. has a solution, say λ, and −x*·x̄ = λ⊤Mx̄, so that equality holds iff
Recall that this is the case of equality in the Fenchel–Young inequality, and therefore
it holds iff x ∗ ∈ ∂ f (x̄). Since the cost function and last constraint correspond to those
of (D_{x̄}), it follows that any solution λ̂ of the linear program on the r.h.s. belongs to S(D_{x̄}). We have proved that, if x* ∈ ∂f(x̄), then x* = −M⊤λ for some λ ∈ S(D_{x̄}). The converse obviously holds in view of (1.114).
Remark 1.56 Consider the particular case when b = 0 and M is the opposite of the
identity. Rewriting as b the variable x, we obtain that the function
has, over its domain, a subdifferential equal to the set of solutions of the dual problem
Max_{λ∈R^p} λ·b; d + A⊤λ ≥ 0. (1.116)
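Remark 1.56 is the familiar "shadow price" interpretation of the dual, and it can be tested numerically. The sketch below (ours, assuming scipy; note that we use the standard dual max{λ·b; A⊤λ ≤ d}, whose sign convention differs from (1.116) by λ ↦ −λ) compares a finite difference of the value function b ↦ v(b) with the dual variable returned by the solver.

```python
# Sensitivity of the value v(b) of min{d.y : Ay = b, y >= 0}:
# at points of differentiability, dv/db equals the dual solution.
import numpy as np
from scipy.optimize import linprog

d = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0]])

def v(b):
    res = linprog(d, A_eq=A, b_eq=[b], bounds=[(0, None)] * 2, method="highs")
    return res.fun, res.eqlin.marginals[0]   # value and equality multiplier

val, lam = v(1.0)                    # optimal y = (1, 0), so v(1) = 1
eps = 1e-4
fd = (v(1.0 + eps)[0] - val) / eps   # finite-difference derivative of v
assert np.isclose(val, 1.0)
assert abs(fd - lam) < 1e-6          # multiplier = marginal value of b
print("v(1) =", val, " dual multiplier =", lam)
```

At points where v is not differentiable, the subdifferential is the whole dual solution set rather than a single multiplier; the finite-difference check above only works at differentiability points such as this one.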
We now show that, for convex functions, a local uniform upper bound implies a
Lipschitz property as well as subdifferentiability.
It follows that
Proof Let x̄ ∈ int dom(f). There exist x⁰, …, xⁿ in dom(f) such that x̄ ∈ int E, where E := conv({x⁰, …, xⁿ}). Then f(x) ≤ max{f(x⁰), …, f(xⁿ)} over E. We conclude by Lemma 1.57.
⟨λ, x̂⟩ + αf(x̂) ≤ ⟨λ, x⟩ + αγ, for all x ∈ dom(f), γ > f(x). (1.121)
Taking x = x₀ and letting γ → +∞, we see that α ≥ 0. The separating hyperplane cannot be vertical, since x̂ ∈ int(dom(f)), so that we may take α = 1. Minimizing w.r.t. γ we obtain that f(x) ≥ f(x̂) − ⟨λ, x − x̂⟩, proving that −λ ∈ ∂f(x̂).
We now check that ∂f(x) ⊂ B̄(0, L). Assume that x* ∈ ∂f(x), with ‖x*‖_* > L. Then there exists a d ∈ X with ‖d‖ = 1 and ⟨x*, d⟩ > L. Therefore, by the definition of a subdifferential,
lim_{σ↓0} (f(x + σd) − f(x))/σ ≥ ⟨x*, d⟩ > L, (1.122)
Example 1.60 Consider the entropy function f (x) = x log x if x ≥ 0 (with value 0
at zero), and f (x) = +∞ otherwise. Then f is l.s.c. convex, and the subdifferential
is empty at x = 0. So, in general, even in a Euclidean space, an l.s.c. convex function
may have an empty subdifferential at some points of its domain.
Lemma 1.62 We have that int(S) ⊂ core(S), and the converse holds in the following
cases: (i) int(S) = ∅, (ii) S is finite-dimensional, (iii) S is closed and convex.
We next give an example of a set with an empty interior, and a nonempty core.
Example 1.63 Let X be an infinite-dimensional Banach space. It is known that there
exists a non-continuous linear form on X , that we denote by a(x). Set
Clearly, 0 ∈ core(A). However, since a(x) is not continuous, and therefore not
bounded in a neighbourhood of 0, A has an empty interior.
Proposition 1.64 Let f be a convex function Rn → R̄. Then it is continuous over
the interior of its domain.
Proof Let x̄ belong to the interior of dom(f). Then there exist x⁰, …, xⁿ in dom(f) whose convex hull E is such that B(x̄, ε) ⊂ int(E), for some ε > 0. Since f is convex
it follows that f (x) ≤ maxi f (x i ) for all x in B(x̄, ε). So, the conclusion follows
from Lemma 1.57.
Proposition 1.65 Let f be a proper l.s.c. convex function X → R̄. Then it is con-
tinuous over the interior of its domain.
Proof Let x0 ∈ int(dom( f )), and set S := {x ∈ X ; f (x) ≤ f (x0 ) + 1}. Since f is
l.s.c., this is a closed set. Fix h ∈ X ; for t ∈ R, the function ϕ(t) := f (x0 + th) has a
finite value at 0, and its domain contains [−ε, ε] for some ε > 0. For t ∈ [−ε, ε], we
have that ϕ(t) ≤ max( f (x0 − εh), f (x0 + εh)). By Lemma 1.57, ϕ is continuous
at 0, proving that x0 ∈ core(S). By Lemma 1.62, x0 ∈ int(S), meaning that f is
bounded from above near x0 . We conclude with Lemma 1.57.
If f is convex, then for all x and h in X, (f(x + th) − f(x))/t is nondecreasing w.r.t. t ∈ (0, ∞). Therefore the directional derivative
f′(x, h) := lim_{t↓0} (f(x + th) − f(x))/t (1.124)
always exists. Let us see how its value is related to the subdifferential. For this we
need a preliminary lemma on positively homogeneous functions.
Lemma 1.66 Let F : X → R̄ be positively homogeneous, i.e.
Proof (i) The function F is positively homogeneous, with value 0 at 0, and is easily
proved to be convex. Let x ∗ ∈ ∂ F(0). Then
We then deduce the equality in (1.130) from Lemma 1.66(i). The first inequality
being trivial, the conclusion follows.
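The monotonicity of the difference quotient behind (1.124) is easy to observe numerically; in this sketch (ours, with the smooth convex choice f(x) = x²) the quotient at x = 1, h = 1 equals 2 + t, decreasing to f′(1, 1) = 2 as t ↓ 0.

```python
# For convex f, t -> (f(x + t h) - f(x)) / t is nondecreasing on (0, inf).
def f(x):
    return x * x

x, h = 1.0, 1.0
ts = [10.0 ** (-k) for k in range(6)]             # t = 1, 0.1, ..., 1e-5
quots = [(f(x + t * h) - f(x)) / t for t in ts]   # here equal to 2 + t
assert all(a >= b - 1e-9 for a, b in zip(quots, quots[1:]))  # monotone in t
assert abs(quots[-1] - 2.0) < 1e-4                # the limit is f'(1; 1) = 2
print(quots)
```

Monotonicity is what guarantees that the limit in (1.124) always exists in R̄, even for nonsmooth convex f.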
Definition 1.68 Let F : X → Y, where X and Y are Banach spaces. We say that F is Gâteaux differentiable (or G-differentiable) at x̄ ∈ X if, for any h ∈ X, the directional derivative F′(x̄, h) exists and the mapping h ↦ F′(x̄, h) is linear and continuous. We denote by DF(x̄) ∈ L(X, Y) the derivative of F, defined by DF(x̄)h = F′(x̄, h), for all h ∈ X.
Corollary 1.69 Let f : X → R̄ be convex, and continuous at x̄. Then
(iv) the space X has a differentiable norm if the dual norm is strictly subadditive.
Example 1.73 (Example of strict inequality in (1.130)) Let f(x) := ½x₁²/x₂, with domain the elements x ∈ R² such that x₁ > 0 and x₂ > 0. Since
∇f(x) = (x₁/x₂, −½x₁²/x₂²)⊤;  D²f(x) = [ 1/x₂, −x₁/x₂²; −x₁/x₂², x₁²/x₂³ ], (1.136)
we easily check that D 2 f (x) is positive semidefinite, and hence, f is convex over
its convex domain. We set f (x) = +∞ if x1 < 0 or x2 < 0, and examine how to
define f on R2+ when min(x1 , x2 ) = 0, in order to make f l.s.c., i.e. we compute
f(x) := lim inf{f(x′); min(x′₁, x′₂) > 0}. Clearly, when x₂ > 0 (resp. x₁ > 0) there exists a limit of value 0 (resp. +∞), and so, since f is nonnegative, its value at 0 should be 0 (resp. +∞). So we finally set
f(x) := 0 if x = 0; ½x₁²/x₂ if x₁ ≥ 0 and x₂ > 0; +∞ otherwise. (1.137)
We easily check that f has the following strange property: while min(f) = 0, there exists a sequence x^k ∈ dom(f) such that Df(x^k) → 0 and f(x^k) → +∞.
The directional derivatives of f at x = 0 are, for h ≠ 0:
f′(0, h) := 0 if h₁ ≥ 0 and h₂ > 0; +∞ otherwise. (1.138)
For h̄ = (1, 0)⊤, we have that lim inf_{h′→h̄} f′(0, h′) = 0 < +∞ = f′(0, h̄). In (1.130), the inequality is strict, and the supremum is attained for x* = 0.
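The "strange property" noted above can be checked on an explicit sequence; the book does not specify one, so the choice x^k = (k², k³) below is ours. Along it, f(x^k) = k/2 → +∞ while the gradient from (1.136) tends to 0.

```python
# Example 1.73's "strange property" along x^k = (k^2, k^3) (our choice):
# f(x^k) -> +inf while the gradient of f tends to 0.
import numpy as np

def f(x1, x2):
    return 0.5 * x1**2 / x2

def grad(x1, x2):                     # the gradient from (1.136)
    return np.array([x1 / x2, -0.5 * x1**2 / x2**2])

vals = [f(k**2, k**3) for k in (10, 100, 1000)]            # equals k / 2
gnorms = [float(np.linalg.norm(grad(k**2, k**3))) for k in (10, 100, 1000)]
assert vals == [5.0, 50.0, 500.0]         # f blows up along the sequence
assert gnorms[0] > gnorms[1] > gnorms[2]  # while the gradient vanishes
print(vals, gnorms)
```

Indeed grad(k², k³) = (1/k, −1/(2k²)), so its norm is of order 1/k; no such sequence can exist for a convex function that is bounded below and Lipschitz on its domain.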
We have already discussed in Example 1.51 the link between the indicatrix and
support functions.
Let K ∗ be a subset of X ∗ , and x0∗ ∈ X ∗ . The (negative) polar set of K ∗ w.r.t. x0∗ is
the set
Observe that we obtain the same polar sets if we replace K or K ∗ by their convex
closure. We also define the positive polar sets as
and similarly K*⁺(x₀*) := −K*⁻(x₀*). When x₀ = 0 (resp. x₀* = 0), we simply denote the polar set by K⁻ (resp. K*⁻). The bipolar set is defined as e.g. K⁻⁻ := (K⁻)⁻.
Exercise 1.75 Let C be the closed unit ball of the Banach space X . Check that C −
is the closed unit ball of X ∗ , and that C −− = C.
Hint: use Corollary 1.8.
Proof It suffices to prove the first statement. It is easily seen that both K and 0 belong to K⁻⁻. Since K⁻⁻ is closed and convex, it contains K̂ := conv(K ∪ {0}). Now let x̄ ∉ K̂. We can strictly separate K̂ from x̄, i.e., there exists an x* ∈ X* such that sup_{x∈K̂} ⟨x*, x⟩ < ⟨x*, x̄⟩. Since 0 ∈ K̂, ⟨x*, x̄⟩ > 0. For any positive α < ⟨x*, x̄⟩, close enough to ⟨x*, x̄⟩, we have that y* := α⁻¹x* is such that ⟨y*, x̄⟩ > 1, and ⟨y*, x⟩ ≤ 1 for all x ∈ K̂, so that y* ∈ K⁻ and then x̄ cannot belong to K⁻⁻. The conclusion follows.
Exercise 1.78 Check that, when K (resp. K∗) is a cone, then (i) K− (resp. K∗−) is
itself a cone, called the (negative) polar cone, such that
\[
K^- := \{x^* \in X^*;\ \langle x^*, x\rangle \le 0,\ \text{for all } x \in K\}; \qquad
K^{*-} := \{x \in X;\ \langle x^*, x\rangle \le 0,\ \text{for all } x^* \in K^*\},
\tag{1.142}
\]
and (ii) the Fenchel conjugates of the corresponding indicatrix functions satisfy
\[
\sigma_K = I_K^* = I_{K^-}; \qquad I_{K^*}^* = I_{K^{*-}}.
\tag{1.143}
\]
Exercise 1.79 Let X be a Banach space and C1 and C2 be two convex cones of the
same space Y, with either Y = X or Y = X∗. Check that
\[
(C_1 + C_2)^- = C_1^- \cap C_2^-.
\tag{1.144}
\]
\[
\partial\sigma_K(x^*) = \{x \in K;\ \langle x^*, x\rangle \ge \langle x^*, x'\rangle,\ \text{for all } x' \in K\};
\qquad \partial\sigma_K(x^*) = N_K^{-1}(x^*).
\tag{1.147}
\]
Exercise 1.82 Let C be a closed convex cone of a Banach space X , and let x̄ ∈ C.
Check that
NC (x̄) = C − ∩ (x̄)⊥ ; TC (x̄) = C + Rx̄. (1.148)
Hint: for the second relation, apply (1.144) with C1 = C and C2 = Rx̄.
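As a finite-dimensional illustration of the polar cone calculus above (a sketch with made-up data, not from the text): for K = R²₊ one has K⁻ = R²₋ and K⁻⁻ = K, which can be checked numerically from the defining inequalities of (1.142).

```python
import numpy as np

# For a cone K generated by a finite family, x* is in K^- iff
# <x*, k> <= 0 for every generator k.
def in_polar_cone(x_star, K_gens, tol=1e-12):
    return all(np.dot(x_star, k) <= tol for k in K_gens)

K_gens = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # generators of R^2_+

# Points of R^2_- belong to K^-; points with a positive coordinate do not.
assert in_polar_cone(np.array([-1.0, -2.0]), K_gens)
assert not in_polar_cone(np.array([0.5, -1.0]), K_gens)

# Bipolar: K^- is generated by -e1, -e2, and K^{--} recovers R^2_+.
Kminus_gens = [-g for g in K_gens]
assert in_polar_cone(np.array([3.0, 0.7]), Kminus_gens)  # x in K => x in K^{--}
```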
Observe that
\[
v^*(y^*) = \sup_y\Big\{\langle y^*, y\rangle - \inf_x\big(\varphi(x, y) - \langle x^*, x\rangle\big)\Big\}
= \sup_{x,y}\big\{\langle y^*, y\rangle + \langle x^*, x\rangle - \varphi(x, y)\big\} = \varphi^*(x^*, y^*).
\tag{1.150}
\]
It follows that
\[
v^{**}(y) = \sup_{y^* \in Y^*}\big\{\langle y^*, y\rangle - \varphi^*(x^*, y^*)\big\}.
\tag{1.151}
\]
1.2 Duality Theory 33
\[
\operatorname*{Max}_{y^* \in Y^*}\ \langle y^*, y\rangle - \varphi^*(x^*, y^*).
\tag{$D_y$}
\]
Additionally:
If ∂v(y) = ∅, then ∂v(y) = S(D y ). (1.154)
In the sequel we will analyze the case of strong duality, i.e., when v(y) = v∗∗(y), in
order to get some information on ∂v(y).
Remark 1.84 The dual problem can also be obtained by dualizing in the usual way
an equality constraint. Indeed, write the primal problem in the form below, with
z ∈ Y:
\[
\operatorname*{Min}_{x,z}\ \varphi(x, z) - \langle x^*, x\rangle; \quad y - z = 0.
\tag{1.155}
\]
We have that
\[
\sup_{y^*} \mathcal L(x, z, y, y^*) = \begin{cases}\varphi(x, y) - \langle x^*, x\rangle & \text{if } y = z,\\ +\infty & \text{otherwise;}\end{cases}
\qquad
\inf_{x,z} \mathcal L(x, z, y, y^*) = \langle y^*, y\rangle - \varphi^*(x^*, y^*).
\tag{1.157}
\]
The dual problem obtained in the present perturbation duality framework may there-
fore be viewed as a particular case of the minimax duality discussed in Sect. 1.1.3.
so that
\[
v_D^{**}(x^*) = \sup_x\big\{\langle x^*, x\rangle - \psi^*(x, y)\big\}.
\tag{1.159}
\]
\[
v_D^{**}(x^*) = \operatorname{val}(D^D_{x^*}) \le \operatorname{val}(P^D_{x^*}) = v_D(x^*),
\tag{1.160}
\]
\[
S(D^D_{x^*}) = \partial v_D^{**}(x^*).
\tag{1.161}
\]
Additionally,
\[
\text{if } \partial v_D(x^*) \ne \emptyset, \text{ then } \partial v_D(x^*) = S(D^D_{x^*}).
\tag{1.162}
\]
Now starting from a problem of type (Py ), and rewriting its dual (D y ) as a mini-
mization problem, we can dualize it. Writing the obtained bidual as a minimization
problem, we see that its expression is nothing but
By Theorem 1.44, the duality mapping is involutive in the class of proper, l.s.c.
convex functions, in the following sense:
Lemma 1.85 Let ϕ be proper, l.s.c. and convex. Then (Py ) and its bidual problem
coincide.
Remark 1.86 If X and Y are reflexive, then the bidual problem is the classical dual
of the dual one, so that we will be able to apply the duality theory that follows to the
dual problem.
We call the relation of equality between a primal and a dual cost, that is, for
(x, y, y ∗ ) ∈ X × Y × Y ∗ :
\[
\varphi(x, y) - \langle x^*, x\rangle = \langle y^*, y\rangle - \varphi^*(x^*, y^*)
\tag{1.163}
\]
an optimality condition (in the context of duality theory). By weak duality, this
implies that the primal and dual problem have the same value. If the latter is finite,
then x ∈ S(Py ) and y ∗ ∈ S(D y ), and (1.163) is equivalent to
and
If (1.163) holds with finite value, then
\[
x \in S(P_y),\quad y^* \in S(D_y),\quad \operatorname{val}(P_y) = \operatorname{val}(D_y), \quad\text{and}\quad \partial v(y) = S(D_y).
\tag{1.167}
\]
Proof Relation (1.166) follows from Proposition 1.83, and is easily seen to imply
(1.167).
We next need stronger assumptions that guarantee the equality of the primal and
dual cost.
Theorem 1.88 Assume that v is convex, uniformly upper bounded near y, and
finitely-valued at y. Then (i) val(D y ) = val(Py ), (ii) x ∈ S(Py ) iff there exists a
y ∗ ∈ Y ∗ such that the optimality condition (1.163) holds, (iii) ∂v(y) = S(D y ), the
latter being nonempty and bounded, and (iv) the directional derivatives of v satisfy,
for all z ∈ Y :
\[
v'(y, z) = \sup\{\langle y^*, z\rangle;\ y^* \in S(D_y)\}.
\tag{1.168}
\]
Remark 1.89 (i) A sufficient condition for v to be convex is that ϕ is convex. (ii)
A sufficient condition for having a uniform upper bound near y is that ϕ(x0 , ·) is
continuous at y, for some x0 ∈ X .
It may happen, however, that while ϕ is l.s.c. convex, v is not l.s.c., and this
prevents us from deducing its continuity from Proposition 1.65.
Exercise 1.90 Let X = L ∞ (0, 1), Y = L 2 (0, 1), and denote by A the injection from
X into Y . Take x ∗ = 0 and
Check that ϕ is l.s.c. convex, but v(y), equal to the indicatrix of L ∞ (0, 1), is not l.s.c.
(see the related analysis in Example 1.136).
We next state a stability condition, also called a qualification condition, that pro-
vides a sufficient condition for the continuity of the value function. The condition is
that y ∈ int(dom(v)), or equivalently:
\[
\text{For all } y' \in Y \text{ close enough to } y, \text{ there exists an } x' \in X \text{ such that } \varphi(x', y') < \infty.
\tag{1.170}
\]
Lemma 1.91 Assume that ϕ is l.s.c. convex, the stability condition (1.170) holds,
and v( ȳ) is finite. Then v is continuous at ȳ.
Proof See e.g. [26, Prop. 2.152]; the proof is too technical to be reproduced here.
Corollary 1.92 Under the assumptions of Lemma 1.91, the conclusion of Theorem
1.88 holds.
Example 1.93 (A strange example) Consider the reverse entropy function, where
x ∈ R:
where δ(λ) := inf x L(x, λ). By the duality theory, the dual problem has a bounded
and nonempty set of solutions and the primal and dual value are equal, i.e., λ is a dual
solution iff δ(λ) = 0, with infimum in the Lagrangian attained at 0. Now if λ > 0,
the infimum is attained at a positive point. So, the unique dual solution is λ̄ = 0 and
the optimality condition reads
\[
0 \in \operatorname*{argmin}_{x\in\mathbb R}\ \big(x + 0 \times \hat H(x)\big).
\tag{1.174}
\]
This indeed holds if we correctly interpret the product 0 × Ĥ(x) as being equal to
+∞ whenever Ĥ(x) = +∞, see Sect. 1.1.1.2.
In many applications, we can check in a direct way the continuity of the value
function. Here is a specific example.
Proposition 1.94 Let K be a closed convex subset of the Hilbert space X . Then the
function v(y) := 21 dist(y, K )2 is convex and of class C 1 , with derivative Dv(y) =
y − PK (y).
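Proposition 1.94 can be checked numerically in a simple case. The sketch below (an illustration, not from the text) takes K the closed unit ball of R², whose projection has a closed form, and compares the claimed derivative Dv(y) = y − P_K(y) with a finite-difference approximation.

```python
import numpy as np

def proj_ball(y):
    """Projection onto the closed unit ball of R^2."""
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

def v(y):
    """v(y) = (1/2) dist(y, K)^2 for K the closed unit ball."""
    return 0.5 * np.linalg.norm(y - proj_ball(y))**2

y = np.array([2.0, -1.0])
grad = y - proj_ball(y)        # claimed derivative Dv(y) = y - P_K(y)

# Compare with a central finite-difference approximation of Dv(y).
h = 1e-6
num = np.array([(v(y + h*e) - v(y - h*e)) / (2*h) for e in np.eye(2)])
assert np.allclose(num, grad, atol=1e-5)
```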
\[
\tfrac12\|x - y\|^2 + \tfrac12\|y^*\|^2 - (y^*, y - x) + I_K(x) + \sigma_K(y^*) - (y^*, x) = 0.
\tag{1.177}
\]
The sum of the three first terms is ½‖y − x − y∗‖², and the sum of the three last is,
by the Fenchel–Young inequality, nonnegative. Therefore (1.177) is equivalent to
Exercise 1.95 Given a Hilbert space X (identified with its dual), f l.s.c. proper
convex X → R̄, y ∈ X , and ε > 0, consider the problem
\[
\operatorname*{Min}_{x\in X}\ f(x) + \tfrac{\varepsilon}{2}\|x - y\|^2.
\tag{1.179}
\]
(i) Show that this problem has a unique solution xε (y) (hint: the cost is strongly
convex), called the proximal point to y.
(ii) Check that the dual problem is
\[
\operatorname*{Max}_{y^*\in X}\ (y^*, y) - \frac{1}{2\varepsilon}\|y^*\|^2 - f^*(y^*).
\tag{1.180}
\]
(iii) Show that the primal and dual values are equal, and that the dual problem has a
unique solution yε∗ (y) = ε(y − xε (y)).
(iv) Show that fε(y) := inf x ( f(x) + (ε/2)‖x − y‖²) (the Moreau–Yosida approximation)
has a continuous derivative D fε(y) = ε(y − xε(y)).
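For X = R and f(x) = |x|, the proximal point of Exercise 1.95 has the well-known soft-thresholding form (a standard computation, not derived in the text), and items (iii)–(iv) can be checked numerically; the brute-force grid minimization below is only for illustration.

```python
import numpy as np

eps = 2.0

def f_eps(y):
    """Moreau-Yosida approximation of f(x) = |x|, by brute-force minimization."""
    xs = np.linspace(-10, 10, 400001)
    return np.min(np.abs(xs) + 0.5 * eps * (xs - y)**2)

def x_eps(y):
    """Proximal point: soft-thresholding at level 1/eps."""
    return np.sign(y) * max(abs(y) - 1.0/eps, 0.0)

for y in [-3.0, -0.2, 0.1, 2.5]:
    # dual solution of (1.180) and derivative of f_eps, per items (iii)-(iv)
    y_star = eps * (y - x_eps(y))
    h = 1e-5
    num_grad = (f_eps(y + h) - f_eps(y - h)) / (2*h)
    assert abs(num_grad - y_star) < 1e-3
```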
Here G : X → Y and F : Y → R̄. This enters into our general framework, and we
have that the dual problem is
\[
\operatorname*{Max}_{y^*\in Y^*}\ \langle y^*, y\rangle - F^*(y^*) + \inf_x\big(L(x, y^*) - \langle x^*, x\rangle\big).
\tag{$D_y$}
\]
\[
\big(F(G(x) + y) + F^*(y^*) - \langle y^*, G(x) + y\rangle\big)
+ \big(L(x, y^*) - \langle x^*, x\rangle - \inf_{x'}(L(x', y^*) - \langle x^*, x'\rangle)\big) = 0.
\tag{1.184}
\]
Each row above being nonnegative by the Fenchel–Young inequality, this is equiva-
lent to the relations
\[
\text{(i) } y^* \in \partial F(G(x) + y); \qquad
\text{(ii) } x \in \operatorname{argmin}\big(L(\cdot, y^*) - \langle x^*, \cdot\rangle\big).
\tag{1.185}
\]
Remark 1.96 Since, as we have seen, these relations express nothing but the Fenchel–
Young equality for ϕ, we deduce that if ϕ(x, y) is finite, then
Remark 1.97 Since (Py ) is feasible iff y ∈ dom(F) − G(x) for some x ∈ dom( f ),
we have that dom(v) = dom(F) − G(dom( f )), and the stability condition (1.170)
reads:
y ∈ int (dom(F) − G(dom( f ))) . (1.187)
Proposition 1.98 Let ϕ be l.s.c. convex, and y ∈ Y be such that v(y) is finite, and
(1.187) holds. Then v is continuous at y, and the conclusion of Theorem 1.88 holds.
enters into the previous framework with F = I K , the indicatrix of K . In that case
the optimality conditions (1.185) reduce to
(i) y ∗ ∈ N K (G(x) + y);
(1.189)
(ii) x ∈ argmin(L(·, y ∗ ) − x ∗ , ·).
Under which conditions is the function ϕ defined in (1.181) jointly convex and l.s.c.?
Varying only y, we see that F must be l.s.c. convex. An obvious case is when f and
F are l.s.c. convex, and G is affine and continuous. But there are some other cases
when this property holds, although G is nonlinear.
Example 1.100 Let F be a nondecreasing, l.s.c. proper convex function over R, and
G be an l.s.c. proper convex function over X . We claim that ψ(x, y) := F(G(x) +
y) is l.s.c. convex. Setting X := X × Y and G (x, y) := G(x) + y, we reduce the
discussion to the l.s.c. and convexity of F(G(x)). Let xk → x̄ in X . Then
The first inequality uses the fact that F is nondecreasing and G is l.s.c.; the second
inequality uses the l.s.c. of F. So, F ◦ G is l.s.c. Now for α ∈ (0, 1) and x′, x′′ in X,
setting x := αx′ + (1 − α)x′′:
Example 1.101 More generally, consider the case when F is an l.s.c. proper convex
function over R p that is nondecreasing (for the usual order relation y ≤ z if yi ≤ z i ,
for i = 1 to p), and G(x) = (G 1 (x), . . . , G p (x)) with G i (x) an l.s.c. proper convex
function over X , for i = 1 to p). By similar arguments we get that ψ(x, y) :=
F(G(x) + y) is l.s.c. convex. A particular case is that of the supremum of convex
functions, see Sect. 1.2.3.
A more general analysis of the case of composite functions in the format (1.181)
is as follows. Assume F to be l.s.c. proper convex. By Theorem 1.44, it is equal to
its biconjugate, and hence,
Therefore,
Since the supremum of l.s.c. convex functions is l.s.c. convex, we deduce that
Lemma 1.102 Let F be l.s.c. proper convex, and x → y ∗ , G(x) be l.s.c. convex
for any y ∗ ∈ dom F ∗ . Then ϕ is l.s.c. convex.
Definition 1.103 The recession cone of the closed convex subset K of Y is the
closed convex cone defined by
K ∞ := {y ∈ Y ; K ⊂ K + y}. (1.194)
Remark 1.104 (i) If K is bounded, its recession cone reduces to {0}. The con-
verse holds if Y is finite-dimensional. In infinite-dimensional spaces, there may
exist unbounded convex sets with recession cone reducing to {0}: see [26, Example
2.43].
(ii) We have that K ∞ = K if K is a closed convex cone.
Remark 1.106 We slightly changed the classical definition [26, Def. 2.103], but
the theory is essentially the same. Note that any affine mapping is K -convex. The
converse holds if K ∞ = {0}. On the other hand, if K = Y then any mapping is
K -convex.
Proof The l.s.c. being obvious, it suffices to check that IK(G(x) + y) is convex. Let
α ∈ (0, 1), x′, x′′ in X, y′, y′′ in Y. Set (x, y) := α(x′, y′) + (1 − α)(x′′, y′′). Then
Lemma 1.108 We have that G : X → Y is K -convex iff, for any λ ∈ (K ∞ )−, the
function G λ (x) := λ, G(x) is convex.
Proof Since K ∞ is closed and convex, by Lemma 1.77, it is the negative polar cone of
(K ∞ )−, i.e., y0 ∈ K ∞ iff λ, y0 ≤ 0, for all λ ∈ (K ∞ )− . Therefore, G is K -convex
iff
\[
\big\langle \lambda,\ G(\alpha x' + (1-\alpha)x'') - \alpha G(x') - (1-\alpha)G(x'') \big\rangle \le 0,
\tag{1.198}
\]
Remark 1.109 Lemma 1.107 can be deduced from Lemma 1.102, where F = I K
and F ∗ = σ K , observing that dom(σ K ) ⊂ (K ∞ )− .
Example 1.110 Let Y := Rp and K := Rp− (the case of finitely many inequality
constraints). Then (K∞)− = K− = Rp+. As expected, we obtain that G is K-convex
iff each component Gi is convex.
When G(x) = Ax, with A ∈ L(X, Y ), the Lagrangian defined in (1.182) is such that
\[
\operatorname*{Max}_{y^*\in Y^*}\ \langle y^*, y\rangle - f^*(x^* - A^\top y^*) - F^*(y^*).
\tag{$D_y$}
\]
The function ϕ(x, y) = f (x) + F(Ax + y) is l.s.c. convex if f and F are l.s.c.
convex, and the stability condition (1.170) reads, since dom(v) = y + dom(F) −
A dom( f ):
y ∈ int (dom(F) − A dom( f )) . (1.203)
Theorem 1.113 (Fenchel duality) Let f and F be l.s.c. convex, and (1.203) hold.
Then
is the particular case of the previous example in which F(y) = I K (y) and x ∗ = 0,
and therefore the dual problem is
\[
\operatorname*{Max}_{y^*\in Y^*}\ \langle y^*, y\rangle - \sigma_K(y^*) - f^*(x^* - A^\top y^*).
\tag{$D_y$}
\]
Example 1.115 Consider the particular case of the previous example in which
A is surjective, K = {0}, x ∗ = 0, and f (x) = c, x, with c ∈ (Ker A)⊥ . By the
open mapping theorem, for some constant c0 > 0, there exists a feasible x(y) such that
‖x(y)‖ ≤ c0‖y‖. Since c ∈ (Ker A)⊥, x(y) is a primal solution. The value function
v(·), being both locally upper bounded and finite, is locally Lipschitz. By the discus-
sion in Example 1.114, we have that c = A λ, for some λ ∈ Y ∗ . We have proved
that (Ker A)⊥ ⊂ Im(A ). Since the converse inclusion is easily proved, we have
obtained another proof of Proposition 1.31.
Exercise 1.116 (Tychonoff and Lasso [120] type regression) Assuming that Y is
a Hilbert space identified with its topological dual, and given A ∈ L(X, Y ), b ∈ Y ,
ε > 0 and a ‘regularizing function’ R : X → R̄, consider the regularized linear least-
square problem
\[
\operatorname*{Min}_{x\in X}\ \tfrac12\|Ax - b\|_Y^2 + \varepsilon R(x).
\tag{$P_y$}
\]
\[
\operatorname*{Max}_{\lambda\in Y}\ -(\lambda, b)_Y - \tfrac12\|\lambda\|_Y^2 - \varepsilon R^*(-A^\top\lambda/\varepsilon).
\tag{1.207}
\]
\[
\lambda = Ax - b; \qquad -\tfrac{1}{\varepsilon}A^\top\lambda \in \partial R(x).
\tag{1.208}
\]
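For R(x) = ‖x‖₁ (the Lasso case of Exercise 1.116), the optimality system (1.208) can be verified numerically. The sketch below, with made-up data, solves the primal by proximal gradient (ISTA) iterations and checks that λ = Ax − b and that −A⊤λ/ε lies in the subdifferential of ‖·‖₁.

```python
import numpy as np

rng = np.random.default_rng(0)          # made-up problem data
A = rng.standard_normal((8, 3))
b = rng.standard_normal(8)
eps = 0.5

def soft(z, tau):
    """Proximal operator of tau*||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

t = 1.0 / np.linalg.norm(A.T @ A, 2)    # step size 1/L
x = np.zeros(3)
for _ in range(20000):                  # ISTA: x <- prox_{t*eps*||.||_1}(x - t*grad)
    x = soft(x - t * (A.T @ (A @ x - b)), t * eps)

lam = A @ x - b                         # dual solution, first part of (1.208)
w = -A.T @ lam / eps                    # must lie in the subdifferential of ||.||_1
assert np.all(np.abs(w) <= 1.0 + 1e-6)
assert np.allclose(w[x != 0], np.sign(x[x != 0]), atol=1e-6)
```

The two final assertions are exactly the coordinatewise description of ∂‖x‖₁: components in [−1, 1], equal to sign(xi) where xi ≠ 0.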
Theorem 1.117 If (1.212) holds, then x ∗ ∈ ∂x ϕ(x, y). Conversely, if ϕ is l.s.c. con-
vex and the stability condition (1.170) holds, then x ∗ ∈ ∂x ϕ(x, y) iff the set of y ∗ ∈ Y ∗
satisfying (1.212) is nonempty and bounded.
Proof That x ∗ ∈ ∂x ϕ(x, y) when (1.212) holds follows from the definition of full
and partial subdifferentials. Now let ϕ be as in the theorem. It suffices to prove that
if x∗ ∈ ∂xϕ(x, y), then there exists a y∗ ∈ Y∗ such that (1.212) holds. Since x∗ ∈
∂xϕ(x, y), we have that the function x′ → ϕ(x′, y) − ⟨x∗, x′⟩ attains its minimum at
x. By the duality result in Corollary 1.92, the set of solutions y∗ of the dual problem,
satisfying the optimality condition (1.163), which (by the discussion after (1.163))
is equivalent to (1.212), is nonempty and bounded. The conclusion follows.
We now specialize the previous theorem to the case of the composite function
recalling that the (standard) Lagrangian was defined in (1.182). We give a direct
proof of the expression of the subdifferential of ϕ, already obtained in Remark 1.96:
or equivalently
\[
\begin{aligned}
\varphi(x', y') &= f(x') + F(G(x') + y')\\
&\ge f(x') + F(G(x) + y) + \langle y^*,\ G(x') - G(x) + y' - y\rangle\\
&= \varphi(x, y) + L(x', y^*) - L(x, y^*) + \langle y^*, y' - y\rangle\\
&\ge \varphi(x, y) + \langle x^*, x' - x\rangle + \langle y^*, y' - y\rangle,
\end{aligned}
\tag{1.216}
\]
Theorem 1.119 Assume that ϕ(x, y) = f (x) + F(G(x) + y) is l.s.c. convex, and
that the stability condition (1.187) holds. Then x ∗ ∈ ∂x ϕ(x, y) iff the set of y ∗ ∈ Y ∗
such that (1.185) holds is nonempty and bounded.
In the case of Fenchel’s duality, i.e., when G(x) = Ax with A ∈ L(X, Y ), we see
that (1.185)(ii) holds iff
We have that x ∗ ∈ ∂x ϕ(x, y) iff (1.218) holds for some y ∗ ∈ Y ∗ , whenever the
stability condition (1.203) is satisfied.
Corollary 1.121 Let f and g be l.s.c. convex functions X → R̄, with finite value at
x0 . If 0 ∈ int (dom( f ) − dom(g)) (which holds in particular if f or g is continuous
at x0 ), then ∂( f + g)(x0 ) = ∂ f (x0 ) + ∂g(x0 ).
Example 1.122 Let gi , i = 1 to n, be l.s.c. proper convex functions over the Banach
space X. We set
\[
G(x) := \sum_{i=1}^n g_i(x), \quad\text{with } \operatorname{dom}(G) = \cap_{i=1}^n \operatorname{dom}(g_i),
\tag{1.219}
\]
\[
F(x_1, \ldots, x_n) := \sum_{i=1}^n g_i(x_i), \quad\text{with } \operatorname{dom}(F) = \Pi_{i=1}^n \operatorname{dom}(g_i).
\tag{1.220}
\]
For (x1∗, . . . , xn∗) ∈ (X∗)n, we have that A⊤(x1∗, . . . , xn∗) = x1∗ + · · · + xn∗ (the transpose
of the copy operator is the sum). Since BY = (BX)n, the qualification condition (1.203)
can be written as (1.221), and then
\[
\partial G(x) = \sum_{i=1}^n \partial g_i(x), \quad\text{for all } x \in X, \text{ if (1.221) holds.}
\tag{1.222}
\]
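In R, the sum rule (1.222) can be illustrated on a toy example (made up, not from the text): for g1(x) = |x| and g2(x) = max(x, 0), one has ∂g1(0) = [−1, 1] and ∂g2(0) = [0, 1], hence ∂(g1 + g2)(0) = [−1, 2], which matches the one-sided difference quotients of the sum.

```python
# For a convex function on R, the subdifferential at a point is the interval
# between the left and right derivatives, computed here by difference quotients.
def g(x):
    return abs(x) + max(x, 0.0)   # g = g1 + g2

h = 1e-9
right = (g(h) - g(0.0)) / h           # right derivative G'(0; 1)  = 2
left = (g(-h) - g(0.0)) / (-h)        # left derivative            = -1

# The subdifferential at 0 is [left, right] = [-1, 2] = [-1, 1] + [0, 1].
assert abs(left - (-1.0)) < 1e-6 and abs(right - 2.0) < 1e-6
```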
We show here how subdifferential calculus gives calculus rules for normal and tangent
cones, starting with the simple case of the intersection of two convex sets.
Lemma 1.124 Let K1 and K2 be two closed convex subsets of X, let K :=
K1 ∩ K2, and x̄ ∈ K. Then
\[
N_{K_1}(\bar x) + N_{K_2}(\bar x) \subset N_K(\bar x); \qquad T_K(\bar x) \subset T_{K_1}(\bar x) \cap T_{K_2}(\bar x),
\tag{1.224}
\]
\[
\text{with equality in both relations if } 0 \in \operatorname{int}(K_1 - K_2).
\tag{1.225}
\]
Proof The relations in (1.224) are easy consequences of the definition of tangent
and normal cones. We next apply Corollary 1.121 with f := I K 1 and g := I K 2 , so
that f + g = I K . Since dom( f ) − dom(g) = K 1 − K 2 and ∂ I K (x) = N K (x), we
deduce that if 0 ∈ int(K 1 − K 2 ), then N K (x̄) = N K 1 (x̄) + N K 2 (x̄). Computing the
normal cones (we have seen in (1.144) that the polar of a sum of convex cones is
the intersection of their polar cones), it follows that TK (x̄) = TK 1 (x̄) ∩ TK 2 (x̄). The
conclusion follows.
By similar techniques one can prove various extensions of this result, given as
exercises.
\[
N_K(\bar x) = \sum_{i=1}^n N_{K_i}(\bar x); \qquad T_K(\bar x) = \cap_{i=1}^n T_{K_i}(\bar x).
\tag{1.226}
\]
Hint: apply Example 1.122 with gi (x) = I K i (x), and use ∂ I K (x̄) = N K (x̄).
K := {x ∈ K X ; Ax + b ∈ K }. (1.227)
Exercise 1.128 Let K X and K be closed convex subsets of X and Y resp., and
G : X → Y . Set for ȳ ∈ Y :
\[
\hat K := \{(x, y') \in K_X \times Y;\ G(x) + y' \in K\}; \qquad
\bar K := \{x \in K_X;\ G(x) + \bar y \in K\}.
\tag{1.229}
\]
Remark 1.129 In the framework of the previous exercise, assume in addition that
G(x) is G-differentiable and x → L(x, y∗) is convex, for all y∗ ∈ NK(G(x̄) + ȳ).
Then x̄ ∈ argmin x∈KX (L(·, y∗) − ⟨x∗, ·⟩) iff x∗ ∈ NKX(x̄) + DG(x̄)⊤y∗, so that
This holds in particular if G is affine (and continuous): we then recover the conclusion
of Exercise 1.127.
In this section we start from a relatively general Lagrangian function and see how to
obtain the minmax duality thanks to the perturbation duality. Let X and Y be Banach
spaces, X 0 ⊂ X and Y0∗ ⊂ Y ∗ , both nonempty, L : X 0 × Y0∗ → R. By (1.39), we
have the weak duality inequality:
has value v(y) = val(Py ) equal to the r.h.s. of (1.235). We know by (1.151) that
v∗∗ (y) = sup y ∗ y ∗ , y − ϕ ∗ (0, y ∗ ). Define L̂ : X × Y ∗ → R̄ by
\[
\hat L(x, y^*) := \begin{cases}
+\infty & \text{if } y^* \notin Y_0^*,\\
-L(x, y^*) & \text{if } (x, y^*) \in X_0 \times Y_0^*,\\
-\infty & \text{otherwise.}
\end{cases}
\tag{1.237}
\]
Denoting by L̂ ∗y (x, y) the partial Fenchel–Legendre transform (in the dual space Y ∗ )
of L̂(x, ·) w.r.t. the second variable, we have that for all x ∈ X :
\[
\varphi(x, y) = \sup_{y^* \in Y^*}\big\{\langle y^*, y\rangle - \hat L(x, y^*)\big\} = \hat L_y^*(x, y).
\tag{1.238}
\]
\[
\mathcal L(x, y, y^*) = \langle y^*, y\rangle - \hat L(x, y^*) \le \langle y^*, y\rangle - \hat L_y^{**}(x, y^*).
\tag{1.240}
\]
\[
\sup_{y^*\in Y_0^*}\ \inf_{x\in X_0}\ \mathcal L(x, y, y^*) \ \le\ v^{**}(y) \ \le\ v(y) \ =\ \inf_{x\in X_0}\ \sup_{y^*\in Y_0^*}\ \mathcal L(x, y, y^*).
\tag{1.241}
\]
\[
\hat L(x, y^*) = \hat L_y^{**}(x, y^*), \quad\text{for all } x \in X_0.
\tag{1.242}
\]
Theorem 1.130 Assume that X 0 and Y0∗ are nonempty and convex subsets, X 0 is
closed, L(·, y ∗ ) is l.s.c. convex for each y ∗ ∈ Y0∗ , (1.242) holds, and Y0∗ is bounded.
Then equality holds in (1.233), and the set of y ∗ for which the supremum on the left
is attained is nonempty and bounded.
Proof (a) Since X 0 is convex and closed, for each y ∗ ∈ Y0∗ , the function (x, y) →
L (x, y, y ∗ ) extended by +∞ if x ∈ / X 0 is an l.s.c. convex function of (x, y), and
hence, its supremum w.r.t. y ∗ ∈ Y0∗ , i.e. ϕ(x, y), is itself l.s.c. convex.
(b) Let us check that v(y) < +∞. Fix x0 ∈ X0. Since y∗ → L(x0, y∗) is an infimum
of affine functions, we have that for some (y0, c0) ∈ Y × R (depending on x0):
(c) If the primal value is −∞, the conclusion follows from the weak duality inequality,
the maximum of the dual cost being attained at each y ∗ ∈ Y0∗ .
(d) In view of the expression of ϕ in (1.236), if v is finite at some y ∈ Y , we have
that
\[
|v(y') - v(y)| \le \sup_{y^*\in Y_0^*} |\langle y^*, y' - y\rangle| \le \sup_{y^*\in Y_0^*} \|y^*\|\, \|y' - y\|,
\tag{1.245}
\]
proving that v is everywhere finite and Lipschitz. Since v is convex and Lipschitz, by
Lemma 1.59, ∂v(y) is nonempty and bounded, and therefore v(y) = v∗∗ (y) and the
set of dual solutions is not empty and bounded. We conclude by (1.241), in which
by (1.242) the first inequality is an equality.
A direct consequence of the previous result is, see [94, Corollary 37.3.2]:
Lemma 1.131 Let A and B be nonempty closed convex subsets of Rn and Rq , resp.,
with B bounded, and L be a continuous convex-concave mapping A × B → R. Then
1.2.4 Calmness
Definition 1.132 Let f : X → R̄ have a finite value at x̄. We say that f is calm at
x̄ with constant r > 0 if f(x) ≥ f(x̄) − r‖x − x̄‖, for all x in a neighborhood of x̄.
Lemma 1.133 Let f : X → R̄ be convex, and calm at x̄ with constant r > 0. Then
(i) f is l.s.c. at x̄, and (ii) ∂ f (x̄) has at least an element of norm at most r .
which shows that f is calm at x̄ with constant r . So, if f is convex and f (x̄) is finite,
then ∂ f (x̄) is nonempty iff f is calm at x̄.
Corollary 1.135 In the framework of the perturbation duality theory presented in
Sect. 1.2.1.1, assume that ϕ is convex (not necessarily l.s.c.), and that the value
function v(·) is calm at y ∈ Y , with constant r > 0. Then val(Py ) = val(D y ), and
∂v(y) = S(D y ) has at least one element of norm at most r .
Since, by Remark 1.134, calmness characterizes subdifferentiability for convex
functions, the difficulty is of course to check this condition! We first present a “patho-
logical” example that illustrates the theory.
Example 1.136 Let X = L 2 (0, 1), Y = L 1 (0, 1), g ∈ X , and A be the canonical
injection X → Y . Denote by (·, ·) X the scalar product in X . Consider the problem
\[
\operatorname*{Max}_{y^*\in Z}\ \langle y^*, y\rangle; \quad g = -A^\top y^*.
\tag{$D_y$}
\]
and so the primal and dual values are equal, and the dual problem has solution −g,
in accordance with Corollary 1.135. Of course it can be checked by direct means that
∂v(y) = −g.
Here
A supremum less than +∞ implies y∗ ≥ 0, and the optimal choice for y is then
yi = −⟨ai, x⟩, so that ϕ∗(0, y∗) = 0 if c + Σi=1..p yi∗ai = 0, and +∞ otherwise.
By Hoffman’s Lemma 1.28, calmness is satisfied whenever v(y) is finite, and hence,
the primal and dual values are equal and the dual problem has a solution, in agreement
with Lemma 1.26.
Remark 1.138 The stability condition (1.170) does not hold in Example 1.136, and
does not necessarily hold in Example 1.137. So these examples show the usefulness
of the concept of calmness.
Coming back to the minimization of composite functions in Sect. 1.2.1.5, assume that
f is proper, l.s.c. convex, and that F is l.s.c., convex, and positively homogeneous
with value 0 at 0. Then F(x) > −∞ for all x. By Theorem 1.44, F is equal to
its biconjugate, and by Lemma 1.66, F(y) = σ K ∗ (y), and F ∗ = I K ∗ , where K ∗ =
∂ F(0). So problem (Py ) in Sect. 1.2.1.5 is of the form
Remark 1.139 An obvious choice for Y is the space of bounded functions over Ω. If
Ω is a compact metric space, we can also choose the space of continuous and bounded
functions over Ω (indeed, by the Heine–Cantor theorem, a continuous function over
a compact set is uniformly continuous, and this easily implies that a uniform limit
of continuous functions is continuous).
The dual space Y ∗ is endowed with the norm
Theorem 1.141 Let Y be a Banach space endowed with the norm (1.257), contain-
ing the constant functions. We assume that f is proper, l.s.c., convex, x → G(x) is
continuous and that for any y ∗ ∈ S (Ω), x → y ∗ , G(x) is convex. Then problems
(Py ) and (D y ) have the same value, that is finite or equal to −∞. If this value is
finite, then S(D y ) is nonempty (necessarily bounded since S (Ω) is). In addition,
x ∈ S(Py ) and y ∗ ∈ S(D y ) iff (x, y ∗ ) satisfies
Using the subdifferential calculus rule in Example 1.122 and especially (1.223), we
see that the optimality condition (1.260) reduces to
\[
0 \in \partial f(x) + \sum_{i=1}^p y_i^*\, \partial_x G_i(x); \qquad y^* \in S_p; \qquad
y_j^* = 0,\ j \notin \operatorname*{argmax}_i G_i(x).
\tag{1.262}
\]
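A toy instance of (1.262), with f = 0 and two affine functions on R (hypothetical data, not from the text), can be checked numerically: both functions are active at the minimizer of their maximum, and the multiplier y∗ in the simplex annihilates the combination of their slopes.

```python
import numpy as np

# Minimize max(G1, G2) with G1(x) = 2x and G2(x) = 1 - x.
g = [lambda x: 2*x, lambda x: 1 - x]
slopes = np.array([2.0, -1.0])

xs = np.linspace(-2, 2, 400001)
vals = np.maximum(g[0](xs), g[1](xs))
x_bar = xs[np.argmin(vals)]                    # minimizer, x_bar = 1/3

# Multiplier y* in the simplex with 0 = y1*G1'(x_bar) + y2*G2'(x_bar).
y_star = np.array([1/3, 2/3])
assert abs(x_bar - 1/3) < 1e-4
assert abs(y_star @ slopes) < 1e-12
assert abs(g[0](x_bar) - g[1](x_bar)) < 1e-3   # both indices in the argmax
```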
Example 1.143 Let Y = C(Ω), the space of continuous functions on the compact
metric space Ω. The dual space is the space of finite Borel measures over Ω, and
S (Ω) is nothing but the set P(Ω) of Borel probability measures over Ω. We can
define the support of a measure, denoted by supp(·), as the complement of the largest
open set where it is equal to 0. Then the two last relations of (1.260) are equivalent
to
y ∗ ∈ P(Ω); supp(y ∗ ) ⊂ argmax(G(x) + y). (1.263)
The literature often refers to linear conical optimization problems, which are as
follows. Given two Banach spaces X and Y , consider the problem
If we prefer to use the positive polar set C + = −C − , the expression of the dual
problem becomes
\[
\operatorname*{Max}_{\eta\in C^+}\ \langle \eta, b\rangle; \quad A^\top \eta = c.
\tag{1.267}
\]
The dual problem is itself in the conical linear class, except of course that the spaces
are of dual type. It can be rewritten, setting K := C − × {0} (zero in X ∗ ), in the form
(formally close to (1.264)):
We can also dualize the dual problem (1.268); in view of Lemma 1.85, the resulting
bidual problem will coincide with the original one.
ε BY ⊂ C + b + Im(A) (1.269)
(in particular, if there exists an x0 ∈ X such that Ax0 − b ∈ int C), then (1.264) and
(1.266) have the same value. If the latter is finite, then the solution set of the dual
problem (1.266) is nonempty and bounded.
then (1.264) and (1.266) have the same value, and if this common value is finite, then
the solution set of the primal problem (1.264) is nonempty and bounded.
Proof It suffices to check that (1.270) is equivalent to the stability condition for the
dual problem. The latter holds iff there exists an ε > 0 such that
This holds iff, for all (μ, η) close to 0 in Y∗ × X∗, there exists a λ ∈ Y∗ such that
\[
\mu \in C^- - \lambda; \qquad \eta = -c - A^\top\lambda.
\tag{1.272}
\]
1.3.3 Polyhedra
Proof Let N̂P(x̄) denote the r.h.s. of (1.275). If x∗ = Σi∈I(x̄) λi ai with λ ≥ 0, and
x ∈ P, then
\[
\langle x^*, x - \bar x\rangle = \sum_{i\in I(\bar x)} \lambda_i \langle a_i, x - \bar x\rangle
= \sum_{i\in I(\bar x)} \lambda_i \big(\langle a_i, x\rangle - b_i\big) \le 0,
\tag{1.276}
\]
Let Φ : X → R̄ be defined by
Proof Since g is convex and continuous, and I Q is l.s.c. convex, by the subdifferential
calculus rules (Lemma 1.120), we have that ∂Φ(x̄) = ∂g(x̄) + ∂ I Q (x̄). We conclude
by noting that ∂ I Q (x̄) = N Q (x̄), whose expression is given by Lemma 1.146, and
by Lemma 1.147.
Lemma 1.150 With the above notations, let M ∈ L(Z , X ), where Z is a Banach
space, and let Ψ = Φ ◦ M have a finite value at z̄ ∈ Z . Set x̄ = M z̄. Then
Theorem 1.151 Let P satisfy (1.273) and be nonempty. Then there exists an element
xi ∈ X , with i ∈ I ∪ J , I and J finite sets, such that
\[
\operatorname*{Max}_{\lambda\in\mathbb R^p_+}\ \lambda\cdot y; \quad c + \sum_{i=1}^p \lambda_i a_i = 0.
\tag{$LD_y$}
\]
Theorem 1.152 Fix z̄ ∈ Z , set ȳ = M z̄, and let x̄ ∈ S(L Pȳ ). Then
Proof By linear programming duality and the general duality theory (Lemma 1.26
and Theorem 1.87), we have that
It is easily seen that the infimal convolution operation (that to two extended real-valued
functions over X associates their infimal convolution) is commutative and associative. More
generally, the infimal convolution of n extended real-valued functions f1, . . . , fn
over X is defined as
\[
\big(\Box_{i=1}^n f_i\big)(y) := \inf_{x\in X^n}\Big\{\sum_{i=1}^n f_i(x_i);\ \sum_{i=1}^n x_i = y\Big\}.
\tag{1.288}
\]
One easily checks that ( f1 □ f2 ) □ f3 equals the infimal convolution of f1, f2, f3. In
order to fit with our duality theory, consider the related problem
\[
\operatorname*{Min}_{x\in X^n}\ \sum_{i=1}^n \big(f_i(x_i) - \langle x_i^*, x_i\rangle\big); \quad \sum_{i=1}^n x_i = y,
\tag{$P_y$}
\]
as well as
\[
\begin{aligned}
v^*(y^*) &= \sup_y\ \big\{\langle y^*, y\rangle - v(y)\big\}\\
&= \sup_{x,y}\Big\{\langle y^*, y\rangle + \sum_{i=1}^n\big(\langle x_i^*, x_i\rangle - f_i(x_i)\big);\ \sum_{i=1}^n x_i = y\Big\}\\
&= \sup_x\ \sum_{i=1}^n\big(\langle y^* + x_i^*, x_i\rangle - f_i(x_i)\big) = \sum_{i=1}^n f_i^*(y^* + x_i^*).
\end{aligned}
\tag{1.290}
\]
Taking all xi∗ equal to zero, we obtain that the Fenchel conjugate of the infimal
convolution is the sum of conjugates, i.e.
\[
\big(\Box_{i=1}^n f_i\big)^*(y^*) = \sum_{i=1}^n f_i^*(y^*).
\tag{1.291}
\]
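Identity (1.291) can be checked numerically in a classical case (an illustration with discretized conjugates, not from the text): for f1(x) = x²/2 and f2(x) = |x| on R, the infimal convolution is the Huber function, and its conjugate equals f1∗ + f2∗, i.e., s²/2 on [−1, 1] and +∞ outside.

```python
import numpy as np

xs = np.linspace(-20.0, 20.0, 4001)    # grid for the inner minimization

def infconv(y):
    """(f1 box f2)(y) = inf_x { x^2/2 + |y - x| }, the Huber function."""
    return np.min(0.5 * xs**2 + np.abs(y - xs))

def conj_infconv(y_star):
    """Discrete Legendre transform of f1 box f2."""
    ys = np.linspace(-10.0, 10.0, 2001)
    return max(y_star * y - infconv(y) for y in ys)

for y_star in [-0.9, 0.0, 0.5]:
    # sum of conjugates: f1*(s) + f2*(s) = s^2/2 + I_[-1,1](s), per (1.291)
    assert abs(conj_infconv(y_star) - 0.5 * y_star**2) < 1e-3
```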
\[
\operatorname*{Max}_{y^*\in Y^*}\ \langle y^*, y\rangle - \sum_{i=1}^n f_i^*(y^* + x_i^*).
\tag{$D_y$}
\]
Since dom(v) = Σi=1..n dom(fi), the stability condition is
\[
y \in \operatorname{int}\Big(\sum_{i=1}^n \operatorname{dom}(f_i)\Big).
\tag{1.292}
\]
We deduce that
Proposition 1.153 Assume that the f i are l.s.c. convex, and let (1.292) hold. Then
v(y) = val(D y ), and if the value is finite, S(D y ) is nonempty and bounded.
When the f i are proper, l.s.c. convex, the cost function of (Py ) is itself a proper,
l.s.c. and convex function of (x, y). By Lemma 1.85, (Py ) is the dual of (D y ). In
view of Remark 1.86, when X is reflexive, we may regard (Py ) as the “classical” dual
of (D y ), with perturbation parameter x ∗ . Clearly (D y ) is feasible iff there exists a
y∗ ∈ Y∗ such that y∗ + xi∗ ∈ dom(fi∗), for i = 1 to n, i.e., if x∗ ∈ Πi dom(fi∗) − Ay∗,
where the operator A : Y ∗ → (Y ∗ )n is defined by Ay ∗ = (y ∗ , . . . , y ∗ ) (n times). The
dual stability condition is therefore
\[
(x_1^*, \ldots, x_n^*) \in \operatorname{int}\big(\Pi_{i=1}^n \operatorname{dom}(f_i^*) - AY^*\big).
\tag{1.293}
\]
Max y ∗ , y − f 1∗ (y ∗ ) − f 2∗ (y ∗ ). (D y )
y ∗ ∈Y ∗
The unique feasible point is y ∗ = 0, which is also the unique dual solution. The
primal stability condition holds, and accordingly we find that the primal and dual
values are equal and that the dual solution (equal to 0) is the subgradient of the infimal
convolution.
The dual stability condition cannot hold since the infimum in the infimal convolution
is not attained. Indeed this condition is that for any x∗ close to 0 in R², there
exists a y∗ ∈ R such that x2∗ ≤ y∗ ≤ x1∗, which is impossible.
Let f be a proper l.s.c. convex function over X . Given x0 ∈ dom( f ), we define the
recession function f ∞ : X → R̄ by
\[
f^\infty(d) := \sup_{\tau>0}\ \frac{f(x_0 + \tau d) - f(x_0)}{\tau}.
\tag{1.295}
\]
It is easily checked that f∞ is convex and positively homogeneous, and that the
supremum equals the limit as τ → +∞.
Lemma 1.156 The recession function is the support function of the domain of f ∗ ,
that is,
\[
f^\infty(d) = \sup_{x^*\in X^*}\ \big\{\langle x^*, d\rangle;\ f^*(x^*) < +\infty\big\}.
\tag{1.296}
\]
Proof Being proper l.s.c. convex, f is equal to its biconjugate, that is,
Therefore,
\[
f^\infty(d) = \sup_{\tau>0}\ \sup_{x^*\in X^*}\ \frac{\langle x^*, x_0 + \tau d\rangle - f^*(x^*) - f(x_0)}{\tau}.
\tag{1.298}
\]
1.3 Specific Structures, Applications 63
By the above lemma, the recession function does not depend on the element x0 ∈
dom( f ) used in its definition.
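As an illustration (a sketch, not from the text), take f(x) = exp(x) on R: dom(f∗) = [0, +∞), so by (1.296) f∞(d) = 0 for d ≤ 0 and +∞ for d > 0, which the difference quotients of (1.295) reproduce.

```python
import math

def recession_quotient(d, tau, x0=0.0):
    """Difference quotient of (1.295) for f(x) = exp(x)."""
    return (math.exp(x0 + tau * d) - math.exp(x0)) / tau

# For d = -1 the quotients increase to 0 (the supremum over tau > 0) ...
assert abs(recession_quotient(-1.0, 1e6)) < 1e-5
# ... and they are increasing in tau, so the sup is the limit:
assert recession_quotient(-1.0, 10.0) < recession_quotient(-1.0, 100.0) <= 0.0
# For d = +1 the quotients blow up, consistent with f_inf(1) = +inf.
assert recession_quotient(1.0, 100.0) > 1e40
```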
Being proper l.s.c. convex, f has an affine minorant, say ⟨a, x⟩X + b; then g has the
affine minorant ⟨a, x⟩X + bt.
Lemma 1.157 The perspective function is convex and positively homogeneous; its
conjugate is the indicatrix of the set
Proof Note that the domain of g is convex. Let x1, x2 in X, t1 > 0, t2 > 0, and
θ ∈ ]0, 1[. Set t := θt1 + (1 − θ)t2, x := θx1 + (1 − θ)x2, and θ′ := θt1/t.
Then θ′ ∈ ]0, 1[, (1 − θ′) = (1 − θ)t2/t, and x/t = θ′x1/t1 + (1 − θ′)x2/t2. Using
the convexity of f, we get
\[
g(x, t) \le t\big(\theta' f(x_1/t_1) + (1-\theta') f(x_2/t_2)\big)
= \theta t_1 f(x_1/t_1) + (1-\theta) t_2 f(x_2/t_2)
= \theta g(x_1, t_1) + (1-\theta) g(x_2, t_2),
\tag{1.303}
\]
Dividing by t > 0 and setting y := x/t we see that the above set of inequalities is
equivalent to x ∗ , y − f (y) + t ∗ ≤ 0, for all y ∈ X . Maximizing in y we obtain the
conclusion.
Since f is proper l.s.c. convex, so is f ∗ . Therefore C is nonempty. It follows that
∗∗
g = σC is never equal to −∞ (this also follows from the fact that g has, as already
established, an affine minorant). By the Fenchel–Moreau–Rockafellar Theorem 1.46,
g ∗∗ is equal to the convex closure of g.
Lemma 1.158 The biconjugate of the perspective function satisfies, for all (x, t) ∈ X × R:
\[
g^{**}(x, t) = \begin{cases}
+\infty & \text{(i) if } t < 0,\\
g(x, t) & \text{(ii) if } t > 0,\\
f^\infty(x) & \text{(iii) if } t = 0.
\end{cases}
\tag{1.305}
\]
(i) If t < 0, we may take x0∗ in the nonempty set dom(f∗) and set (x∗, t∗) :=
(x0∗, −f∗(x0∗) − s), with s → +∞; it follows that g∗∗(x, t) = +∞.
(ii) If t > 0, maximizing in t ∗ in (1.306) and since f = f ∗∗ , we get
(iii) If t = 0, then
We next relate the perspective function to the resolution of the nonconvex problem
Lemma 1.159 Problems (P12 ) and (P12 ) have the same value.
Proof Let (x1, x2, t1, t2) be in the feasible set of (P′12). Setting x′i := xi/ti, for i =
1, 2, one easily checks that (P′12) has the same value as the problem (P′′12).
Minimizing w.r.t. (x′1, x′2) first, we see that the value of problem (P′′12) is equal to
\[
\inf_{t_i>0,\ t_1+t_2=1}\Big( t_1 \inf_{x_1\in K_1}\langle c, x_1\rangle + t_2 \inf_{x_2\in K_2}\langle c, x_2\rangle \Big)
= \min\Big( \inf_{x_1\in K_1}\langle c, x_1\rangle,\ \inf_{x_2\in K_2}\langle c, x_2\rangle \Big).
\tag{1.308}
\]
The result easily follows.
\[
L(x, \lambda) := f(x) + \sum_{i=1}^p \lambda_i g_i(x),
\tag{1.309}
\]
This is obviously an l.s.c. convex function, everywhere greater than −∞. We will
assume that it is proper; this holds, for instance, if inf f > −∞, since then 0 ∈ dom(d).
Min d(λ). (D )
λ∈R p
We denote it by (D ) to take into account the change of sign, but call the amount
− val(D ) the dual value in order to remain coherent with the general duality theory.
Proposition 1.160 We assume that (i) the function d(·) is proper, and (ii) there exists an ε > 0 such that
Then the dual problem has a nonempty and compact set of solutions.
Remark 1.161 Note that (1.311) is equivalent to the same relation in which we write
K instead of K .
½ ε B ⊂ conv ({ki − g(xi ), i = 1, . . . , r }) .   (1.312)
We have that

d(λ) ≥ max_i {− f (xi ) + ⟨λ, ki − g(xi )⟩}
     ≥ min_i {− f (xi )} + max_i {⟨λ, ki − g(xi )⟩}   (1.313)
     ≥ min_i {− f (xi )} + ½ ε|λ|,
the last inequality using the fact that a maximum of linear forms is equal to the
maximum over their convex hull. It follows that a minimizing sequence λk (which
exists since d is proper) is bounded and therefore has a subsequence converging to
some λ̄. Since d(·) is l.s.c. (being a supremum of linear forms), λ̄ ∈ S(D ). That
S(D ) is bounded is a consequence of the coercivity property (1.313).
Let us now add some hypotheses for ensuring the existence of points minimizing the
Lagrangian in the vicinity of dual solutions.
Proposition 1.162 Assume that there exists a metric compact set Ω ⊂ X such that,
if λ is close enough to S(D ), the set of minima of L(·, λ) has at least one point in
Ω, and that f and g are continuous over Ω.
Then λ ∈ S(D ) iff there exists a Borel probability measure μ over Ω such that, denoting by Eμ g(x) = ∫Ω g(x) dμ(x) the associated expectation, the following
holds:
Proof Set δ(λ) := supx∈X {−L(x, λ)}. By our assumptions, when λ is close enough
to S(D ), δ(λ) is equal to the continuous function
Since δ(·) and δ (·) are convex, and coincide near S(D ), they have the same subdifferential near S(D ).
Let λ ∈ S(D ). Since δ (·) is continuous at λ, Corollary 1.121 implies that
We next give an expression for ∂δ (λ). We have that δ (λ) = F[G(λ)], where F :
C(Ω) → R is defined by F(y) := max{yx , x ∈ Ω}, and G affine R p → C(Ω) is
defined by (denoting the value at x ∈ Ω by a subindex) G(λ)x := −L(x, λ). Set A :=
DG(λ). Since F is Lipschitz, the subdifferential calculus rules (Theorem 1.119)
apply, so that by (1.317):
Remark 1.163 When X is a metric compact set we can also consider the following
relaxed formulation
Min_{μ∈P (X )} ∫X f (x)dμ(x); ∫X g(x)dμ(x) ∈ K .   (1.319)
The stability condition for this convex problem is precisely (1.311). So, under this
condition, if the above problem is feasible, there is no duality gap. The Lagrangian
is
L (μ, λ) = ∫X ( f (x) + Σ_{i=1}^p λi gi (x) ) dμ(x) − σK (λ) = ∫X L(x, λ) dμ(x) − σK (λ).   (1.320)
Therefore the infimum of the Lagrangian w.r.t. the primal variable μ ∈ P can be
expressed as
That is, (1.319) has the same dual as the original problem. If (1.311) holds, then
the stability condition holds for the convex problem (1.319), and hence, there is no
duality gap. We can therefore interpret the dual problem as the dual of the relaxed
problem.
Proposition 1.164 (i) Let λ̄ ∈ S(D ). If x̄ ∈ argmin L(·, λ̄) is such that g(x̄) ∈ K and λ̄ ∈ NK (g(x̄)), then x̄ ∈ S(P), and the primal and dual problems have the same
value.
(ii) Under the hypotheses of Proposition 1.162, if K is closed and convex, and
λ̄ ∈ S(D ) is such that x → g(x) is constant over argmin L(·, λ̄) (which is the
case in particular if L(·, λ̄) attains its minimum at a single point), then any
x̄ ∈ argmin L(·, λ̄) is a solution of (P) and the conclusion of point (i) is therefore
satisfied.
Proof (i) Since L(x̄, λ̄) = inf x L(x, λ̄) and λ̄ ∈ NK (g(x̄)), and consequently σK (λ̄) is equal to ⟨λ̄, g(x̄)⟩, we have that
f (x̄) = L(x̄, λ̄) − σK (λ̄) = inf L(x, λ̄) − σK (λ̄) = −d(λ̄), (1.322)
x
i.e., x̄ and λ̄ are primal and dual feasible with equal cost, meaning that x̄ is a solution
of the primal problem, λ̄ is a solution of the dual one, and the primal and dual values
are equal.
(ii) We apply Proposition 1.162. Since g(x) is constant over argmin L(·, λ̄), we obtain
the existence of a probability measure with support over argmin L(·, λ̄), such that
for any x̄ ∈ argmin L(·, λ̄), g(x̄) = Eμ g(x) ∈ K . We conclude using point (i).
Remark 1.165 In most nonconvex problems there is a duality gap: the hypotheses of the proposition above are not satisfied. By contrast, the hypotheses of Propositions 1.160 and 1.162 are weak. So, in general, the dual problem has a compact and nonempty set of solutions, but for each of them the minimum of the Lagrangian is attained at several points (with different values of the constraint g(x)).
Remark 1.166 We will apply point (ii) of Proposition 1.164 (in a case when the set
of minima of the Lagrangian is in general not a singleton) to the study of controlled
Markov chains with expectation constraints, see Theorem 7.34.
Exercise 1.167 For x ∈ R, let f (x) = 1 − x 2 and g(x) = x. The problem of mini-
mizing f (x) over X := [−1, 1], under the constraint g(x) = 0, has a unique solution
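A numerical check of this example can be instructive. The sketch below (our addition, not part of the text) computes the dual function by grid search for the data of the exercise; the unique feasible point is x = 0, with primal value f (0) = 1, while the dual value is 0, exhibiting the duality gap.

```python
# Sketch (our addition): the problem of Exercise 1.167, minimizing
# f(x) = 1 - x^2 over X = [-1, 1] under the constraint g(x) = x = 0.

def lagrangian(x, lam):
    """L(x, lam) = f(x) + lam * g(x)."""
    return 1.0 - x * x + lam * x

def d(lam, grid=2001):
    """Dual function d(lam) = inf over x in [-1, 1] of L(x, lam), by grid search."""
    xs = [-1.0 + 2.0 * k / (grid - 1) for k in range(grid)]
    return min(lagrangian(x, lam) for x in xs)

primal_value = 1.0   # the only feasible point is x = 0, and f(0) = 1
# Here d(lam) = -|lam|: L(., lam) is concave, so its infimum over [-1, 1]
# is attained at an endpoint. The dual value is max over lam of d(lam) = 0.
dual_value = max(d(k / 10.0 - 5.0) for k in range(101))
duality_gap = primal_value - dual_value   # equals 1
```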
We assume here that X is a convex subset of a Banach space X and that g(x) =
Ax, with A ∈ L(X , Rn ). The convexification of f : X → R̄ is defined, for x ∈
conv(dom( f )), by
conv( f )(x) := inf { Σ_{i∈I} αi f (x i ) ; Σ_{i∈I} αi x i = x },   (1.323)

over all finite families (αi , x i )_{i∈I} with αi ≥ 0, Σ_{i∈I} αi = 1 and x i ∈ X . This is the largest convex function minorizing f . It satisfies, for x ∈ X :
where the last inequality follows from the definition of ρ X ( f ). The conclusion follows.
Remark 1.169 Proposition 1.168 was obtained by Aubin and Ekeland [11, Thm. A].
We will next see how to improve this estimate in the case of decomposable problems.
5 If the minimal number of a nonnegative combination was greater than n + p, then, adding some linear combination of these elements equal to 0 and with nonzero elements, we could easily find another nonnegative combination of z with fewer nonzero coefficients, which would give a contradiction.
z = Σ_{i=1}^p Σ_{j∈Ji} βi j z i j with at most n + p nonzero βi j . Since Σ_{j∈Ji} βi j = 1 for all i, at least one βi j is nonzero for each i, meaning that at most n indices i have more than one nonzero βi j . As x = Σ_{i=1}^p Σ_{j∈Ji} βi j yi j , the conclusion follows.
with K a convex subset of R p , assuming that the constraints are linear and the cost function decomposable:

f (x) = Σ_{k=1}^N f k (xk ); g(x) = Σ_{k=1}^N gk (xk ), gk (xk ) = Ak xk , k = 1, . . . , N ,   (1.331)
with X = X 1 × · · · × X N , X k a closed convex subset of a Banach space X k , Ak ∈
L(X k , R p ), and xk ∈ X k for each k. The associated Lagrangian defined in (1.309)
satisfies
L(x, λ) = Σ_{k=1}^N L k (xk , λ), where L k (xk , λ) := f k (xk ) + λ · gk (xk ).   (1.332)
So, we have a decomposability property for the Lagrangian: the opposite of the dual cost satisfies
d(λ) = Σ_{k=1}^N dk (λ), where dk (λ) := inf_{xk ∈X k} L k (xk , λ).   (1.333)
We recall the definition of the measure of lack of convexity in (1.324), and set
ρk := ρ X k ( f k ), for k = 1 to N .
Min s ; s ∈ K . (1.335)
s∈conv(S)
Let s be feasible for this problem. By the Shapley–Folkman Theorem 1.170, we may assume that sk = ( f k (x̄k ), gk (x̄k )), with x̄k ∈ X k , except for at most p + 1 indexes, say the first p + 1. For k = 1 to p + 1, since X k is convex, there exists an x̄k ∈ X k such that sk = Ak x̄k . Then x̄ (which is well defined as an element of X ) is feasible and satisfies f k (x̄k ) ≤ sk + ρk , for k = 1 to p + 1. The result follows.
Remark 1.172 This result is due to Aubin and Ekeland [11, Thm. D]. In this decomposable setting, it is easily checked that ρ X ( f ) = Σ_{k=1}^N ρk . So, in general, the duality estimate improves on the one in Proposition 1.168 when p + 1 < N .
While these notes are mainly devoted to convex problems, it is useful to discuss
optimality conditions in the case of nonlinear equality constraints. The (general)
result below will have an application in the theory of semidefinite programming, see
the proof of Lemma 2.12. So, consider the following problem
Theorem 1.174 Let x̄ be a local solution of (P), such that Dg(x̄) is onto. Then there exists a unique λ ∈ Y ∗ such that ∂ f (x̄) + Dg(x̄)∗ λ ∋ 0.
The proof of the theorem is based on Liusternik’s theorem that essentially gives a
sufficient condition for an element of Ker Dg(x̃) to be a tangent direction to g −1 (0).
Theorem 1.175 Let x̃ be such that g(x̃) = 0 and Dg(x̃) is onto. Let h ∈ Ker Dg(x̃).
Then there exists a path R+ → X , t → x(t) such that x(t) = x̃ + th + o(t) and
g(x(t)) = 0.
Proof Set A := Dg(x̃) and denote by c(·) the modulus of continuity of Dg at x̃, such that

‖Dg(x) − Dg(x̃)‖ ≤ c(r ) whenever ‖x − x̃‖ ≤ r.   (1.336)
By the open mapping Theorem 1.29, there exists a c A > 0 such that, for any b ∈ Y , there exists an a ∈ X that satisfies Aa = b and ‖a‖ ≤ c A ‖b‖. So, given t > 0, consider the sequence x k in X such that x 0 = x̃ + th and
Let R > 0 be such that c(‖x − x̃‖) ≤ 1/(2c A ) whenever ‖x − x̃‖ ≤ R. Let K be an integer such that, for all 0 ≤ k ≤ K + 1, we have ‖x k − x̃‖ ≤ R. Then by induction we obtain that

and so

Now let x be such that ‖x − x̃‖ + 2c A ‖g(x)‖ ≤ R. The above relations imply that (1.341) holds for all k. Therefore x k converges to some x a such that g(x a ) = 0 and, in addition, ‖x a − x‖ ≤ 2c A ‖g(x 0 )‖. Since ‖g(x 0 )‖ = ‖g(x̃ + th)‖ = o(t), the result follows by taking x(t) := x a .
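The corrector scheme used in this proof can be tried on a toy constraint. The sketch below (our addition; the data g, x̃ and h are chosen for illustration only) runs the iteration x^{k+1} = x^k + a^k with A a^k = −g(x^k) on the unit circle.

```python
import math

# Sketch of the corrector iteration from the proof of Theorem 1.175, for the
# toy constraint g(x) = x1^2 + x2^2 - 1 (the unit circle), with
# x_tilde = (1, 0) and h = (0, 1) in Ker Dg(x_tilde), since A = Dg(x_tilde) = (2, 0).
# The minimum-norm solution of A a = b is a = (b / 2, 0).

def corrected_point(t, iters=60):
    x1, x2 = 1.0, t                     # x^0 = x_tilde + t * h
    for _ in range(iters):
        g = x1 * x1 + x2 * x2 - 1.0
        x1 += -g / 2.0                  # x^{k+1} = x^k + a^k with A a^k = -g(x^k)
    return x1, x2

t = 0.1
x1, x2 = corrected_point(t)
# The limit satisfies g(x(t)) = 0 and x(t) = x_tilde + t*h + o(t):
deviation = abs(x1 - 1.0)               # about t^2 / 2, i.e. o(t)
```

Here the iterates converge to (√(1 − t²), t), so the deviation from the straight path x̃ + th is of order t², as the theorem predicts.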
Proof (Proof of Theorem 1.174) (a) The difference of two multipliers belongs to the kernel of Dg(x̄)∗ . But Dg(x̄) being onto implies that its transpose is injective. The uniqueness of the multiplier follows.
(b) We prove the existence of the multiplier. Given h ∈ Ker Dg(x̄), let x(t) be the associated feasible path provided by Theorem 1.175. As f is locally Lipschitz, we have that f (x(t)) = f (x̄ + th) + o(t). Since x̄ is a local solution, it follows that
1.5 Notes
Let us endow L(R p , Rn ) with the Frobenius norm and its associated scalar product;
for p × n matrices A and B:
‖A‖ F := ( Σ_{i, j} (Ai j )² )^{1/2} ; ⟨A, B⟩ F := Σ_{i, j} Ai j Bi j .   (2.1)
To prove this, it suffices to check the first relation and use the fact that the Frobenius scalar product is symmetric, together with the identity ⟨A, B⟩ F = ⟨A⊤ , B ⊤ ⟩ F .
Being the sum of the eigenvalues, the trace of a matrix is invariant under a basis change: for all square matrices M and P, with P invertible, we have that trace(P M P −1 ) = trace(M).
In other words, the Frobenius scalar product is invariant under orthonormal basis
changes in Rn and R p . In particular, let x 1 , . . . , x n be an orthonormal system (i.e.,
the columns of an orthonormal matrix Q). Then
‖A‖ F ² = trace(Q ⊤ A⊤ AQ) = Σ_i |Ax i |² .   (2.6)
Consider now the case of symmetric matrices. We know that A ∈ S n can be diag-
onalized by an orthonormal basis change. Denoting by λi (A) the eigenvalues of A,
counted with their multiplicity, and arranged in nonincreasing order, we obtain by
(2.5):
‖A‖ F ² = trace(A² ) = Σ_{i=1}^n λi (A)² .   (2.7)
⟨A, B⟩ F = Σ_{i, j} λi μ j (x i · y j )² .   (2.8)
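These identities are easy to check numerically. The sketch below (our addition) verifies formula (2.7) for a small symmetric matrix, with the eigenvalues obtained by hand from the characteristic polynomial.

```python
import math

# Sketch checking ||A||_F^2 = trace(A^2) = sum of squared eigenvalues (2.7)
# for the 2x2 symmetric matrix A = [[2, 1], [1, 3]].
A = [[2.0, 1.0], [1.0, 3.0]]

frob_sq = sum(A[i][j] ** 2 for i in range(2) for j in range(2))

A2 = [[sum(A[i][k] * A[k][j] for k in range(2)) for j in range(2)]
      for i in range(2)]
trace_A2 = A2[0][0] + A2[1][1]

# Eigenvalues from the characteristic polynomial lambda^2 - tr*lambda + det.
tr = A[0][0] + A[1][1]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
disc = math.sqrt(tr * tr - 4.0 * det)
lam1, lam2 = (tr + disc) / 2.0, (tr - disc) / 2.0
eig_sq_sum = lam1 ** 2 + lam2 ** 2      # all three quantities equal 15
```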
Theorem 2.1 (Fejer) The symmetric square matrix A is positive semidefinite iff we
have A, B F ≥ 0 for all symmetric positive semidefinite B.
Denote by S+n the set of positive semidefinite matrices. By the Fejer theorem this is a selfdual cone (i.e., equal to its positive polar).
Proof The Frobenius norm endows S n with a Hilbert space structure. The projec-
tion, say B, of A over the nonempty closed convex set S+n is therefore well defined,
and characterized by the relation
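As an illustration (our addition, not the book's), the projection of a 2x2 matrix onto S₊ can be computed by diagonalizing and clipping the negative eigenvalue, and one can then check the orthogonality ⟨A − B, B⟩_F = 0 that the projection characterization implies.

```python
import math

# Sketch: Frobenius projection of A = [[0, 2], [2, 0]] onto the PSD cone S_+^2,
# obtained by zeroing the negative eigenvalue in the spectral decomposition.
A = [[0.0, 2.0], [2.0, 0.0]]

# Eigenpairs of A: lambda = 2 with v = (1, 1)/sqrt(2),
#                  lambda = -2 with v = (1, -1)/sqrt(2).
pairs = [(2.0, (1 / math.sqrt(2), 1 / math.sqrt(2))),
         (-2.0, (1 / math.sqrt(2), -1 / math.sqrt(2)))]

# B = sum over eigenpairs of max(lambda, 0) * v v^T, i.e. B = [[1, 1], [1, 1]].
B = [[sum(max(l, 0.0) * v[i] * v[j] for l, v in pairs) for j in range(2)]
     for i in range(2)]

# At the projection B, the residual A - B is Frobenius-orthogonal to B.
inner = sum((A[i][j] - B[i][j]) * B[i][j] for i in range(2) for j in range(2))
```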
2.1.2.1 Framework
Positive semidefinite linear programs (SDP) are optimization problems of the form
Min_{x∈Rn} c · x; A0 + Σ_{i=1}^n x i Ai ⪰ 0,   (S D P)
where the Ai , i = 0 to n, are symmetric matrices of size p, and, given two symmetric matrices A and B of the same size, “A ⪰ B” means that A − B is positive semidefinite. (Similarly, we will use ≻ to denote positive definiteness, and ⪯ and ≺ for negative semidefiniteness and negative definiteness, resp.) Let us see how
to reduce some optimization problems to the SDP format. It is trivial to reduce linear
constraints to SDP constraints1 :
Ax − b ≤ 0 ⇔ −diag(Ax − b) ⪰ 0.
1 We denote by diag the operator that to a vector associates the diagonal matrix having this vector
for its diagonal, and also the operator that to a square matrix associates its diagonal.
This allows us to reduce problems with convex quadratic cost functions and constraints to the SDP format. Another type of example is the minimization of the greatest eigenvalue:

Min_{(x,t)} t ; t I − A(x) ⪰ 0.
Min_{x∈Rn} c · x ; A0 + Σ_{i=1}^n x i Ai + y ⪰ 0,   (S D Py )
with here y ∈ S p . Set v(y) := val(S D Py ). The (strong duality) Corollary 1.92
implies the following.
Theorem 2.5 Assume that val(S D P) is finite, and that the following stability condition holds: there exists an x̂ ∈ Rn such that A(x̂) ≻ 0. Then
(a) we have the equality val(DS D P) = val(S D P),
(b) the set S(DS D P) is nonempty and bounded,
(c) for all z ∈ S n , we have v ′ (0, z) = max{⟨y ∗ , z⟩ F ; y ∗ ∈ S(DS D P)}.
By Lemma 1.85, the primal problem is also the dual of its dual. So, consider the
perturbation of equality constraints
Exercise 2.8 Show that the following problem has a nonzero duality gap, although
both the primal and dual problems have solutions:
Min_{x∈R2} x2 ; [ x2 + 1  0  0 ; 0  x1  x2 ; 0  x2  0 ] ⪰ 0.
Hint: check that the feasible set is R+ × {0}, and so the primal value is 0, while the dual is

Max_{λ∈S−3} λ11 ; λ22 = 0, 1 + λ11 + 2λ23 = 0,

and therefore any dual feasible λ satisfies λ23 = 0, and so λ11 = −1, so that the dual value is −1.
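The feasibility claim in the hint can be verified numerically. The sketch below (our addition) tests positive semidefiniteness of the 3x3 matrix through all its principal minors, showing that any x2 ≠ 0 is infeasible.

```python
from itertools import combinations

# Sketch: feasibility check for Exercise 2.8. A symmetric matrix is PSD iff
# all its principal minors are nonnegative.
def minor(M, idx):
    """Determinant of the principal submatrix of M on index set idx (size <= 3)."""
    S = [[M[i][j] for j in idx] for i in idx]
    n = len(S)
    if n == 1:
        return S[0][0]
    if n == 2:
        return S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return (S[0][0] * (S[1][1] * S[2][2] - S[1][2] * S[2][1])
            - S[0][1] * (S[1][0] * S[2][2] - S[1][2] * S[2][0])
            + S[0][2] * (S[1][0] * S[2][1] - S[1][1] * S[2][0]))

def is_psd(M):
    n = len(M)
    return all(minor(M, idx) >= -1e-12
               for k in range(1, n + 1) for idx in combinations(range(n), k))

def M(x1, x2):
    return [[x2 + 1.0, 0.0, 0.0], [0.0, x1, x2], [0.0, x2, 0.0]]

feasible_origin = is_psd(M(1.0, 0.0))   # points of R_+ x {0} are feasible
infeasible = is_psd(M(1.0, 0.1))        # the minor on rows {1, 2} is -x2^2 < 0
```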
Let F be a function from S n to R̄. One says that F is rotationally invariant if, for all orthonormal matrices Q of size n, we have
Let f : Rn → R̄. One says that f is symmetric if, for all permutations π of {1, . . . , n}
(i.e., a bijective mapping from {1, . . . , n} into itself), we have
We will call f the spectral function associated with F. Let us see how to compute
the Fenchel conjugate of a rotationally invariant function.
The first step of the proof deals with the cone of nonincreasing vectors:
K d := {x ∈ Rn ; x1 ≥ x2 ≥ · · · ≥ xn }.   (2.13)
K d− = { y ∈ Rn ; Σ_{i=1}^j yi ≤ 0, j = 1, . . . , n − 1; Σ_{i=1}^n yi = 0 }.   (2.14)
(xi−1 − xi ) Σ_{k=1}^{i−1} yk = 0, i = 2, . . . , n.   (2.15)
Qx = x; Q P z = z. (2.16)
x · y = (x1 − x2 )y1 + (x2 − x3 )(y1 + y2 ) + · · · + (xn−1 − xn ) Σ_{k=1}^{n−1} yk + xn Σ_{k=1}^n yk .   (2.17)
It follows that the r.h.s. of (2.14) is included in K d− . Conversely, let 1 ≤ p ≤ n − 1, y ∈ K d− , and let x ∈ Rn have its first p coordinates equal to 1 and the other ones equal to 0. Then x ∈ K d , and so 0 ≥ x · y = Σ_{k=1}^p yk . Choosing x = ±1, we obtain 0 ≥ ±Σ_{k=1}^n yk , whence (i). Point (ii) is an immediate consequence of (i) and (2.17). Let us show (iii). By the definition of y and (i), it is clear that y ∈ K d− . If
(2.16) is satisfied, then
x · P z = x · (Q ⊤ Q P z) = (Qx) · (Q P z) = x · z   (2.18)
and so x · y = 0. Let us show the converse. By (ii), x · y = 0 iff (2.15) is satisfied. Let I be the set of equivalence classes of components of x, and let Q be a permutation; then Qx = x iff every I ∈ I is stable under Q. In particular, there exists a permutation Q for which Q P z is nonincreasing over each I ∈ I . If i(I ) denotes the smallest index of each class, we observe that (2.15) is equivalent to Σ_{k=1}^{i(I )−1} yk = 0 for all I ∈ I , and so Σ_{i∈I} yi = 0 for all I ∈ I , i.e., Σ_{i∈I} (Q P z)i = Σ_{i∈I} z i for all I ∈ I . But Q P z ≤ z, and these two vectors have nonincreasing components over each I ∈ I ; they are therefore equal.
with equality iff there exists an orthonormal matrix U diagonalizing these two matri-
ces, and such that
− Z̄ W − W Z̄ = A, (2.22)
By Theorem 1.174, there exists a unique Lagrange multiplier Λ ∈ S n such that the
above Lagrangian has a zero derivative w.r.t. Z at Z̄ . In other words, for all W ∈ M n ,
we have that
trace( W ⊤ X Z̄ Y + Z̄ ⊤ X W Y − W ⊤ Z̄ Λ − Z̄ ⊤ W Λ ) = 0.   (2.24)

Z̄ ⊤ X Z̄ Y = Λ = Λ⊤ = Y Z̄ ⊤ X Z̄ .   (2.26)
where P1 and P2 are permutation matrices. We can assume that P2 = I . We get then,
since Z̄ is a solution of (2.21), that
and
U ⊤ XU = Q ⊤ V ⊤ X V Q = Q ⊤ diag(λ(X ))Q = diag(λ(X )).   (2.30)
F ∗ (Y ) = sup_{X ∈S n} {⟨X, Y ⟩ F − F(X )} ≤ sup_{X ∈S n} {λ(X ) · λ(Y ) − f (λ(X ))} ≤ f ∗ (λ(Y )).   (2.31)

Taking Y = U diag(λ(Y ))U ⊤ , with U orthonormal, and X of the form U diag(x)U ⊤ , with x ∈ Rn , we get
One deduces from the previous results an expression for the subdifferential of a
rotationally invariant function.
2.2.2 Examples
When applying the previous results, it is convenient to rewrite (2.20) in the form
Y = U diag(λ(Y ))U ⊤ = Σ_{i=1}^n λi (Y )Ui Ui⊤ .   (2.34)
Denote by U the set of orthonormal matrices whose first p columns form a basis of the eigenspace E 1 associated with λ1 (X ). By (2.34), Y ∈ ∂ F(X ) iff, for a certain
U ∈U:
Y = Σ_{i=1}^p αi Ui Ui⊤ ; α ∈ R+p , Σ_{i=1}^p αi = 1.   (2.37)
Setting

P p = { α ∈ R+p ; Σ_{i=1}^p αi = 1 },   (2.38)
f (λ) := − Σ_{i=1}^n log λi if λi > 0, i = 1, . . . , n; +∞ otherwise,   (2.41)
is l.s.c. convex, and differentiable over its domain Rn++ . The associated matrix func-
tion, called the logarithmic barrier of the cone S+n , is
where here still the inversion of the vector is computed componentwise. Since the conjugate of f (t) = − log t (with domain R++ ) is f ∗ (t ∗ ) = −1 − log(−t ∗ ) (with domain R−− ), the conjugate of f (x) = −Σ_{i=1}^n log xi (with domain Rn++ ) is f ∗ (x ∗ ) = −n − Σ_{i=1}^n log(−xi∗ ) (with domain Rn−− ), and the conjugate function of F is

F ∗ (Y ∗ ) = −n − log det(−Y ∗ ),   (2.44)
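The scalar computation underlying (2.44) is easy to confirm numerically. The sketch below (our addition) maximizes t ↦ t∗ t + log t on a grid and compares with the closed form −1 − log(−t∗).

```python
import math

# Sketch checking that the conjugate of f(t) = -log t (domain (0, inf)) is
# f*(t*) = -1 - log(-t*) for t* < 0, by crude grid maximization of
# t* t - f(t) = t* t + log t.
def conj_numeric(tstar, grid=200000, tmax=50.0):
    best = -float("inf")
    for k in range(1, grid + 1):
        t = tmax * k / grid
        best = max(best, tstar * t + math.log(t))
    return best

tstar = -2.0
closed_form = -1.0 - math.log(-tstar)   # the maximizer is t = -1/t* = 0.5
numeric = conj_numeric(tstar)
```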
The logarithmic barrier allows the extension to linear SDP problems of the interior
point algorithms for linear programming. Here we just give a brief discussion of
the penalized problem. With problem (S D P) of Sect. 2.1.2.1 we associate one with logarithmic penalty, where μ > 0 is the penalty parameter, setting A(x) := A0 + Σ_{i=1}^n x i Ai :

Min_{x∈Rn} c · x − μ log det(A(x)).   (S D Pμ )
We apply Fenchel's duality (Chap. 1, Sect. 1.2.1.8), taking into account that (i) the conjugate of x → c · x is the indicator function of {c}, (ii) F ∗ (μ−1 Y ) = F ∗ (Y ) + n log μ,
(iii) the conjugate of F1 (A) := μF(A0 + A) is
(DS D Pμ )
It is usually written in terms of S = −Y as
One may prefer to rewrite the first relation in a symmetrized form (which is equivalent
since S A(x) = μId implies that S and A(x) commute):
with
f i (x) = ½ x ⊤ Ai x + bi · x + ci , i = 0, . . . , p,   (2.49)
Min_{x∈Rn , X ∈S n} ½⟨A0 , X ⟩ F + b0 · x; ½⟨Ai , X ⟩ F + bi · x + ci ≤ 0, i = 1, . . . , p; X = x x ⊤ .   (QC P )
We will call the SDP relaxation of problem (QC P) the variant of the formulation (QC P ) in which we relax X = x x ⊤ into X ⪰ x x ⊤ . By Example 2.4, an equivalent formulation of the relaxed problem is

Min_{x∈Rn , X ∈S n} ½⟨A0 , X ⟩ F + b0 · x; ½⟨Ai , X ⟩ F + bi · x + ci ≤ 0, i = 1, . . . , p; [ 1 x ⊤ ; x X ] ⪰ 0.   (R QC P)
Proposition 2.17 We have that val(R QC P) ≤ val(QC P). If in addition the matrices Ai , i = 0 to p, are positive semidefinite (in other words, if the criterion and the constraints are convex), then val(R QC P) = val(QC P).
Proof Since problem (R QC P) has the same criterion as (QC P ), and a larger feasible set, we have that val(R QC P) ≤ val(QC P). If the matrices Ai , i = 0 to p, are positive semidefinite, let (x, X ) ∈ F(R QC P). Define X ′ := x x ⊤ and ϕ(x, X ) := ½⟨A0 , X ⟩ F + b0 · x. Since X ′ ⪯ X , by Fejer's theorem, ⟨Ai , X ′ ⟩ F ≤ ⟨Ai , X ⟩ F , so that (x, X ′ ) ∈ F(QC P ) and ϕ(x, X ′ ) ≤ ϕ(x, X ); so val(QC P ) ≤ val(R QC P) and the result follows.
In the sequel we will show that SDP relaxation is strongly related to classical duality, whose discussion needs a generalization of the Schur lemma 2.3. We introduce the pseudoinverse of A ∈ S n :

A† := Σ_{i=1}^n λi† xi xi⊤ .   (2.50)
L(x, λ) = ½ x ⊤ A(λ)x + b(λ) · x + c(λ),
where λ ∈ R p and (setting λ0 = 1)

A(λ) = Σ_{i=0}^p λi Ai ; b(λ) = Σ_{i=0}^p λi bi ; c(λ) = Σ_{i=1}^p λi ci .
We will denote the dual criterion by q(λ) := inf x L(x, λ). The dual problem is therefore:

Max_{λ∈R p} q(λ); λ ≥ 0.   (D QC P)
(ii) The dual problem is equivalent to the following SDP problem (in the sense that
it has the same value, and their solutions have the same components for λ)
Max_{λ≥0, w∈R} w ; [ c(λ) − w  ½ b(λ)⊤ ; ½ b(λ)  ½ A(λ) ] ⪰ 0.   (D QC P )
Proof Point (i) is an elementary computation, and point (ii) is an immediate appli-
cation of the generalized Schur lemma.
Since (D QC P ) is an SDP linear problem, we know how to compute its dual problem; it is convenient to call the latter the bidual problem of (QC P). Let us write the multiplier, an element of S n+1 , in the form [ α x ⊤ ; x X ]. The Lagrangian of problem (D QC P ) can be expressed as
L (λ, w, α, x, X ) := w + α(c(λ) − w) + b(λ) · x + ½⟨A(λ), X ⟩ F .
Define

C := { (α, x, X ) ∈ R × Rn × S n ; [ α x ⊤ ; x X ] ⪰ 0 }.
We get

L (λ, w, α, x, X ) = (1 − α)w + Σ_{i=0}^p λi ( ½⟨Ai , X ⟩ F + bi · x + ci ).
This hypothesis is satisfied, for example, if A0 ≻ 0 (take λ = 0). The following theorem sums up the main results of the section.
(ii) If the criterion and constraints of (QC P) are convex, val(R QC P) = val(QC P).
(iii) If problem (D QC P ) satisfies the qualification hypothesis (2.53), then (D QC P)
and (R QC P) have the same value, i.e., the SDP relaxation has the same value as
the classical dual.
SDP relaxation therefore does as well as classical duality and, in many cases, both
have the same value.
Consider a variant of the previous problem, where the f i are still defined by (2.49), with an additional integrality constraint:
Remark 2.22 One easily reduces the more usual constraint x ∈ {0, 1}n to the above integrality constraint.
Remark 2.23 In the case of a linear programming problem with the above integrality constraints, all the Ai are equal to 0, and hence the formulation of the relaxed problem reduces to
Min_{x∈Rn , X ∈S n} b0 · x; bi · x + ci ≤ 0, i = 1, . . . , p; [ 1 x ⊤ ; x X ] ⪰ 0; X ii = 1, i = 1, . . . , n.   (2.55)
The associated order relation is, given x and y in Rm+1 , x ⪰ Q m+1 y if x − y ∈ Q m+1 .
We will see how to rewrite various relations in the form
Ax + b ∈ R−p × Q m 1 +1 × · · · × Q m q +1 .   (2.57)
Min_{x∈Rn} Σ_{i=1}^p 1/(ai · x + bi ); ai · x + bi > 0, i = 1, . . . , p; ci · x + di ≥ 0, i = 1, . . . , q.   (2.59)
Min_{x∈Rn , t∈R p} Σ_{i=1}^p ti ; ti (ai · x + bi ) ≥ 1, ti ≥ 0, ai · x + bi ≥ 0, i = 1, . . . , p; ci · x + di ≥ 0, i = 1, . . . , q.   (2.60)
with the implicit constraint ai · x > 0 for all i. Show that a reformulation of this
problem is
Min_{x∈Rn , t∈R} t; 1/t ≤ (ai · x)/bi ≤ t.   (2.62)
Apply to the inequalities on the left, rewritten in the form tai · x ≥ bi , the result of
Exercise 2.24, and conclude that we have a linear SOC reformulation of this problem.
Example 2.27 Let ℓ be a positive integer. Let us show that we can rewrite “linearly” the relation

x ∈ R+^{2^ℓ} ; t ∈ R; t ≤ (x1 x2 · · · x_{2^ℓ} )^{1/2^ℓ} .   (2.63)
x ∈ R4+ ; y ∈ R2+ ; t ∈ R; t ≤ τ ; 0 ≤ τ ≤ √(y1 y2 ); y1 ≤ √(x1 x2 ); y2 ≤ √(x3 x4 ),   (2.65)
which itself can be rewritten as
We again apply Exercise 2.24. We leave to the reader the generalization to arbitrary ℓ, and to check that one obtains O(2^ℓ ) “linear” relations in R3 .
where the xi are replaced by x1 for the first p1 indexes, by x2 for the following p2 indexes, and so on until xn , then by t for the following 2^ℓ − p indexes, and finally by 1 for the p − Σ_i pi remaining indexes. Computing the power 2^ℓ of both sides of (2.68), we get

t^{2^ℓ} ≤ x1^{p1} x2^{p2} · · · xn^{pn} t^{2^ℓ − p} .   (2.69)
Simplifying by t^{2^ℓ − p} and taking the pth root, we see that (2.68) is equivalent to (2.67); using Example 2.27, it follows that (2.67) has a linear SOC reformulation.
Note that, in particular, the geometric mean can be SOC linearly rewritten.
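The basic square-root step used in these reductions admits a direct numerical check. The identity below is the standard rotated-cone rewriting (stated here as our reading of the Exercise 2.24 reduction, which is not reproduced in this excerpt): for x1 , x2 ≥ 0, t² ≤ x1 x2 iff ‖(2t, x1 − x2 )‖ ≤ x1 + x2 .

```python
import math

# Sketch (our addition): for x1, x2 >= 0,
#   t^2 <= x1 * x2   iff   ||(2t, x1 - x2)||_2 <= x1 + x2,
# since (2t)^2 + (x1 - x2)^2 <= (x1 + x2)^2 rearranges to 4 t^2 <= 4 x1 x2.
def soc_form(t, x1, x2):
    return math.hypot(2.0 * t, x1 - x2) <= x1 + x2 + 1e-12

def direct(t, x1, x2):
    return t * t <= x1 * x2 + 1e-12

samples = [(0.5, 1.0, 0.3), (0.7, 0.5, 0.5), (-0.2, 0.1, 0.9), (2.0, 1.0, 1.0)]
agree = all(soc_form(*s) == direct(*s) for s in samples)
```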
Min_{x∈Rn} c · x ; A j x − b j ⪰_{Q m j +1} 0, j = 1, . . . , J.   (L S OC P)
In order to compute the dual problem, we introduce the operator Rm+1 → Rm+1 ,
y → ỹ := (y0 , − ȳ), that leaves Q m+1 invariant.
Lemma 2.29 (i) The second-order cone Q m+1 is selfdual (equal to its positive polar
cone). (ii) In addition, when x and y are two nonzero elements of Q m+1 , we have
x · y = 0 iff x0 = |x̄| and y ∈ R+ x̃.
Proof Let x and y belong to Q m+1 . If x0 = 0, then x is zero and x · y = 0, and the same for y. Assume now that x0 and y0 are positive. Then

x · y = x0 y0 + x̄ · ȳ ≥ x0 y0 − |x̄|| ȳ| = x0 y0 ( 1 − (|x̄|/x0 )(| ȳ|/y0 ) ).   (2.70)

By definition of Q m+1 , the above fractions have values in [0, 1], whence x · y ≥ 0, which proves that Q m+1 ⊂ Q+m+1 . In addition, x · y = 0 iff x0 = |x̄|, y0 = | ȳ| and x̄ · ȳ = −|x̄|| ȳ|. Since x̄ ≠ 0, the last relation is equivalent to ȳ ∈ R− x̄, whence (ii).
It remains to show that Q+m+1 ⊂ Q m+1 . Let y ∈ Q+m+1 . If ȳ = 0, let x ∈ Q m+1 be such that x0 = 1. We get that 0 ≤ x · y = y0 , so y ∈ Q m+1 . If on the contrary ȳ ≠ 0, set z := (| ȳ|, − ȳ). Then z ∈ Q m+1 , and so 0 ≤ y · z = y0 | ȳ| − | ȳ|² = | ȳ|(y0 − | ȳ|), implying y0 ≥ | ȳ|, as was to be shown.
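Point (ii) can be illustrated numerically. The sketch below (our addition) builds a boundary point x of the cone and a point y ∈ R+ x̃, and checks membership and orthogonality.

```python
import math

# Sketch checking Lemma 2.29(ii): take x on the boundary of Q_3 (x0 = |x_bar|)
# and y = alpha * x~, where x~ = (x0, -x_bar); then both lie in Q_3 and x.y = 0.
def in_cone(z):
    return z[0] >= math.hypot(z[1], z[2]) - 1e-12

x_bar = [3.0, 4.0]
x = [5.0] + x_bar                                   # x0 = |x_bar| = 5
alpha = 2.0
y = [alpha * x[0], -alpha * x_bar[0], -alpha * x_bar[1]]

dot = sum(a * b for a, b in zip(x, y))              # = alpha*(x0^2 - |x_bar|^2) = 0
```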
The dual of (LSOCP) (which again is a particular case of conical linear optimization) can therefore be expressed as
Max_{y∈Π_{j=1}^J Q m j +1} Σ_{j=1}^J b j · y j ; Σ_{j=1}^J (A j )⊤ y j = c.   (LSOCP∗ )
We deduce the optimality conditions: Primal and dual feasibility and complemen-
tarity, the latter being obtained for each j:
A j x − b j ∈ Q m j +1 , y j ∈ Q m j +1 , (A j x − b j ) · y j = 0, j = 1, . . . , J ; Σ_{j=1}^J (A j )⊤ y j = c.   (2.71)
Let us show how to represent a linear SOC constraint as a linear SDP constraint. Given s ∈ Rm+1 , we define the “arrow” mapping Arw : Rm+1 → S m+1 (we recall that S m+1 is the space of symmetric matrices of size m + 1):

Arw(s) := [ s0 s̄ ⊤ ; s̄ s0 Im ].   (2.72)
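A small sketch (our addition) of the arrow matrix, together with the equivalence Arw(s) ⪰ 0 ⟺ s0 ≥ |s̄| (a Schur-complement consequence), tested through the quadratic form:

```python
# Sketch (our addition): the arrow map (2.72), and the fact that Arw(s) is PSD
# exactly when s lies in the second-order cone (s0 >= |s_bar|).
def arw(s):
    s0, sbar = s[0], s[1:]
    m = len(sbar)
    M = [[0.0] * (m + 1) for _ in range(m + 1)]
    M[0][0] = s0
    for i, v in enumerate(sbar, start=1):
        M[0][i] = M[i][0] = v          # first row/column carry s_bar
        M[i][i] = s0                   # diagonal block s0 * I_m
    return M

def qform(M, z):
    return sum(M[i][j] * z[i] * z[j] for i in range(len(M)) for j in range(len(M)))

s_in = [2.0, 1.0, 0.0]                 # s0 >= |s_bar|: inside the cone
s_out = [1.0, 2.0, 0.0]                # s0 < |s_bar|: outside
M_in, M_out = arw(s_in), arw(s_out)
neg_dir = qform(M_out, [1.0, -1.0, 0.0])   # certifies Arw(s_out) is not PSD
```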
Min_{x∈Rn} c · x; Arw(A j x − b j ) ⪰ 0, j = 1, . . . , J.   (L S D P)
we obtain that Arw(s) · Y = s0 trace(Y ) + 2s̄ · Ȳ0 , and so Arw⊤ : S m+1 → Rm+1 can be expressed as

Arw⊤ Y := (trace(Y ), 2Ȳ0 ).   (2.74)
Max_{Y ∈Π_{j=1}^J S+^{m j +1}} Σ_{j=1}^J ( b0 j trace(Y j ) + 2 b̄ j · Ȳ0 j ) ; Σ_{j=1}^J (A j )⊤ (trace(Y j ), 2Ȳ0 j ) = c.   (LSDP∗ )
Proposition 2.31 (i) The dual problems (LSOCP∗ ) and (LSDP∗ ) have the same value. (ii) The feasible set of (LSOCP∗ ) is the image under the mapping Arw⊤ of the feasible set of (LSDP∗ ).
Proof It suffices to check point (ii), which, in view of the dual costs, implies point (i). Let us show that Arw⊤ (S+^{m+1} ) ⊂ Q m+1 . Indeed, if s ∈ Q m+1 and Y ∈ S+^{m+1} , we have by Fejer's theorem 2.1

and we conclude by Lemma 2.29. On the other hand, Arw is injective; its transpose operator is therefore surjective. We conclude by identifying the feasible points of (LSOCP∗ ) with the elements of the form (trace(Y j ), 2Ȳ0 j ), where Y j ∈ S+^{m j +1} .
Note that no qualification hypothesis was made, so that the primal and dual values
can be different. In order to obtain an expression for the solutions of (LSDP∗ ) as a
function of those of (LSOCP∗ ), we must, given y ∈ Q m+1 , express the set
We will only discuss the most interesting case when y0 = | ȳ| > 0.
Lemma 2.32 Let y ∈ Q m+1 be such that y0 = | ȳ| > 0. Then Arw− (y) reduces to the single element

Y (y) = ½ [ y0 ȳ ⊤ ; ȳ ȳ ȳ ⊤/y0 ].   (2.77)
Proof We have that Arw⊤ Y (y) = y, and by the Schur lemma 2.3, Y (y) ⪰ 0. Let now Y ∈ Arw− (y). Since Ȳ0 = ½ ȳ, and Y00 cannot be zero, the Schur lemma implies Ȳ = ¼ ȳ ȳ ⊤/Y00 + M, with M ⪰ 0. Therefore,
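Formula (2.77) can be checked directly. The sketch below (our addition) builds Y(y) as the rank-one matrix (1/(2 y0)) v v⊤ with v = (y0 , ȳ), which coincides with (2.77), and verifies Arw⊤ Y(y) = y.

```python
import math

# Sketch checking Lemma 2.32 for y = (5, 3, 4), so that y0 = |y_bar| = 5 > 0:
# Y(y) = (1/(2*y0)) * v v^T with v = (y0, y_bar) is PSD (rank one, positive
# factor) and satisfies Arw^T Y(y) = (trace Y, 2 * first-column tail) = y.
y_bar = [3.0, 4.0]
y0 = math.hypot(y_bar[0], y_bar[1])                 # = 5
v = [y0] + y_bar
Y = [[v[i] * v[j] / (2.0 * y0) for j in range(3)] for i in range(3)]

trace_Y = Y[0][0] + Y[1][1] + Y[2][2]               # should equal y0
tail = [2.0 * Y[1][0], 2.0 * Y[2][0]]               # should equal y_bar
```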
2.5.1 Framework
Min_{x∈Rn} c · x; aω · x ≤ bω , ω ∈ Ω.   (S I L)
Min_{h∈Rn} c · h; aω · h ≤ 0, ω ∈ Ω(x̄).   (L x̄ )
The following lemma allows us to reduce the study of first-order optimality conditions to those of a homogeneous problem.
Lemma 2.33 (i) If x̄ ∈ F(S I L) is such that h = 0 is a solution of the linearized
problem, then x̄ ∈ S(S I L).
(ii) If the qualification condition (2.80) holds, and x̄ ∈ F(S I L), then x̄ ∈ S(S I L)
iff h = 0 is a solution of the linearized problem.
Proof (i) If, on the contrary, x̄ ∉ S(S I L), then there exists an x̃ ∈ F(S I L) such
that c · x̃ < c · x̄, and then h := x̃ − x̄ is feasible for the linearized problem, and
c · h < 0, so that 0 is not a solution of the linearized problem.
(ii) In view of step (i), it suffices to prove that, if x̄ ∈ S(S I L), then h = 0 is a solution of the linearized problem. Assume on the contrary that c · h < 0, for some h ∈ F(L x̄ ). Let ε > 0 be small enough that h ε := h + ε(x̂ − x̄) satisfies c · h ε < 0. Set x(t) := x̄ + th ε . Let us show that, for t > 0 small enough, we have x(t) ∈ F(S I L). If this is not the case, there exists a sequence tk ↓ 0 and ωk ∈ Ω such that aωk · x(tk ) > bωk . Extracting a subsequence if necessary, we can assume that ωk → ω̄. Passing to the limit in the previous inequality, we get ω̄ ∈ Ω(x̄), and so aω̄ · h ε ≤ εaω̄ · (x̂ − x̄) < 0. For ω close to ω̄, we therefore have aω · h ε < 0, and so, if k is large enough:
We know that the topological dual of C(Ω) is the space M(Ω) of (signed) finite Borel measures over Ω (see Malliavin [77, Chap. 2]), or measures, for short. We will rather show how to obtain in a “direct way” the existence of Lagrange multipliers as measures with finite support, i.e., linear combinations of finitely many Dirac measures, acting in the form ⟨λ, y⟩ = Σ_{ω∈supp(λ)} λω yω . Here the set supp(λ) is a finite subset of Ω, called the support of λ, and such that λω ≠ 0 for all ω ∈ supp(λ). Denote by M(Ω)+ the cone of positive measures.
If λ is a measure with finite support, we call {λω , ω ∈ supp(λ)} the components of λ, and we will say that λ is positive if its components are. We will denote by M F (Ω) the set of finite-support measures over Ω, by M F p (Ω) the set of finite-support measures with support of cardinality at most p, and by M F (Ω)+ , M F p (Ω)+ the corresponding positive cones.
One defines the dual problem “with finite support”, or “finite dual”, as
Max_{λ∈M F (Ω)+} − Σ_{ω∈supp(λ)} bω λω ; c + Σ_{ω∈supp(λ)} λω aω = 0.   (F S I D)
Taking the infimum over x ∈ F(S I L) and the supremum over λ ∈ F(F S I D), we
obtain (i). In addition, if the primal and dual values are equal, this relation implies
that x ∈ S(S I L) and λ ∈ S(F S I D) iff the inequality is in fact an equality, whence
(ii) and by the same type of argument (iii).
Let us now state the main result of the section. We will say that E ⊂ M F (Ω)+ is bounded if there exists an α > 0 such that Σ_{ω∈supp(λ)} λω ≤ α, for all λ ∈ E.
Theorem 2.35 Let the Slater hypothesis (2.80) hold, and val(S I L) be finite. Then
val(S I L) = val(F S I D), and S(F S I D) is nonempty and bounded. In addition,
(F S I D) has at least one solution with support of cardinality at most n.
The proof is based on the next lemmas, which have their own interest.
Remark 2.36 Applying the duality theory of Chap. 1, we obtain the existence of Lagrange multipliers in M(Ω)+ . The Krein–Milman theorem [65] then allows us to obtain the existence of multipliers with finite support. Our approach, on the other hand, uses only elementary computations.
Lemma 2.38 Let the Slater hypothesis (2.80) hold. If the dual problem (F S I D) is
feasible, then it has a nonempty and bounded solution set.
Proof Let λ ∈ F(F S I D). Using (2.81), we get for some ε > 0:

−c · x̂ = Σ_{ω∈supp(λ)} λω aω · x̂ ≤ Σ_{ω∈supp(λ)} λω (bω − ε),

and so

ε Σ_{ω∈supp(λ)} λω ≤ c · x̂ + Σ_{ω∈supp(λ)} λω bω .
If λ is an ε′ -solution of (F S I D), with ε′ > 0 (this exists since, the primal being feasible, the dual value is finite), then

− Σ_{ω∈supp(λ)} bω λω ≥ val(F S I D) − ε′ ,   (2.86)
Lemma 2.39 Let the Slater hypothesis (2.80) hold. If (S I L) has a solution, then it
has the same value as (F S I D), and S(F S I D) is nonempty and bounded.
Proof (a) Let x̄ ∈ S(S I L). By Lemma 2.33, h = 0 is a solution of the linearized
problem (L x̄ ). Set
C (x̄) := { Σ_{ω∈supp(λ)} λω aω ; λ ∈ M F (Ω)+ , supp(λ) ⊂ Ω(x̄) } ∪ {0}.   (2.87)
(d) Since −c ∈ C (x̄), there exists a λ ∈ F(F S I D) with support in Ω(x̄). By Proposition 2.34(iii), λ ∈ S(F S I D), and problems (S I L) and (F S I D) have the same value. Finally, S(F S I D) is nonempty and bounded in view of Lemma 2.38.
Lemma 2.40 Let val(S I L) be finite, and let the Slater hypothesis (2.80) hold. Then (S I L) and (F S I D) have the same value, and S(F S I D) has at least one element with support of cardinality at most n.
Min_{x∈Rn} c · x + γ Σ_{i=1}^n |xi |; aω · x ≤ bω , ω ∈ Ω,   (S I L γ )
where γ > 0. We first show that this problem, whose value is finite, has a solution.
Given ε > 0, let x be an ε solution of (S I L γ ). We then have
val(SIL) + γ ∑_{i=1}^n |x_i| ≤ c · x + γ ∑_{i=1}^n |x_i| ≤ val(SIL_γ) + ε,   (2.89)
and so
γ ∑_{i=1}^n |x_i| ≤ val(SIL_γ) + ε − val(SIL).
Min_{x∈R^n, z∈R^n}  c · x + γ ∑_{i=1}^n z_i ;  ±x_i ≤ z_i, i = 1, …, n;  a_ω · x ≤ b_ω, ω ∈ Ω.   (SIL_γ)
Indeed, the first relation follows from the equality val(SIL_γ) = val(FSID_γ), and
from the equality lim_{γ↓0} val(SIL_γ) = val(SIL), which can be easily checked. The
second is an immediate consequence of the definition of (FSID_γ). These relations
allow us to show that λ^γ is bounded; indeed,
o(1) = ( c + ∑_{ω∈supp(λ^γ)} λ^γ_ω a_ω ) · x̂ ≤ c · x̂ + ∑_{ω∈supp(λ^γ)} λ^γ_ω (b_ω − ε),
To obtain λ ∈ S(FSID), via Proposition 2.34(i) it then suffices to pass to the limit
(along a subsequence) in (2.90).
Proof (Proof of Theorem 2.35) Under the hypotheses of the theorem, Lemma 2.40
ensures the equality val(SIL) = val(FSID) as well as the existence of an element
of S(FSID) with support of cardinality at most n. Combining this with Lemma 2.38,
we obtain that S(FSID) is bounded.
Let a and b be two real numbers, with a < b. The problem of the best uniform
approximation of a continuous function f over [a, b] by a polynomial of degree n
can be written as
Min_{p∈P_n}  max { |p(x) − f(x)| ;  x ∈ [a, b] },   (AT)
where P_n denotes the set of polynomials of degree at most n with real coefficients.
We denote by I_+(p) (resp. I_−(p)) the set of points where p(x) − f(x) attains
its maximum (resp. minimum), and we set I(p) := I_+(p) ∪ I_−(p). We recall that
‖f‖_∞ := sup{|f(x)| ; x ∈ [a, b]}.
The cost function and constraints are affine functions of the optimization parameters,
and the Slater condition (2.80) is satisfied (take h̄ = (v̄, p̄) with v̄ = 1 + ‖f‖_∞ and
p̄ = 0). By Lemma 2.33, (v, p) ∈ S(AT) iff (w, r) = 0 is a solution of the linearized
problem. The latter can be written as follows:
In other words, (v, p) ∈ S(AT ) iff there exists no polynomial r ∈ Pn such that
r (x) < 0 when x ∈ I+ ( p) and r (x) > 0 when x ∈ I− ( p). The conclusion
follows.
Proof By Lemma 2.41, it suffices to check that (2.92)–(2.93) is satisfied iff (2.91)
has no solution. If (2.92)–(2.93) is satisfied, then by (2.91), r changes sign at least
n + 1 times, and therefore has at least n + 1 distinct roots, which is impossible. If
on the contrary (2.92)–(2.93) is not satisfied, then (p − f) changes sign at most n
times over I(p). So there exist m ≤ n and numbers α_0, …, α_{m+1}, with α_i ∉ I(p)
for all i, such that
a = α_0 < α_1 < ⋯ < α_{m+1} = b,
and (p − f) is non-zero and has a constant sign, alternately +1 and −1, over
I(p) ∩ ]α_i, α_{i+1}[ for all i = 0 to m. The same holds for r(x) := ∏_{i=1}^m (x − α_i). Therefore either r or −r satisfies (2.91).
We say that the set of points x0 < x1 < · · · < xn+1 in [a, b] is a reference of the
polynomial p if (2.92)–(2.93) is satisfied.
imply that either r (xi ) is equal to zero, or it has the sign of p(xi ) − f (xi ), for all i.
Set
The previous results allow us to present the theory of Chebyshev polynomials, and
their application to Lagrange interpolation.
The Chebyshev polynomial of degree n, denoted by T_n, is defined over [−1, 1]
by the equality T_n(cos θ) = cos(nθ), or equivalently T_n(x) = cos(n arccos x). The
formula
cos((n + 1)θ) + cos((n − 1)θ) = 2 cos θ cos(nθ)
shows that T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x), so that each T_n is indeed a polynomial.
In particular,
T_0(x) = 1,  T_1(x) = x,  T_2(x) = 2x² − 1.
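As a quick numerical illustration (not from the book; the helper names are ours), the recurrence T_{k+1}(x) = 2x T_k(x) − T_{k−1}(x) generates the coefficients of T_n, and we can check the defining identity T_n(cos θ) = cos(nθ) as well as the leading coefficient 2^n used in the next proposition:

```python
import math

def chebyshev(n):
    """Coefficients of T_n (constant term first), via T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t_cur = [1.0], [0.0, 1.0]   # T_0 = 1, T_1 = x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        shifted = [0.0] + [2.0 * c for c in t_cur]             # coefficients of 2x T_k
        padded = t_prev + [0.0] * (len(shifted) - len(t_prev))
        t_prev, t_cur = t_cur, [s - p for s, p in zip(shifted, padded)]
    return t_cur

def evaluate(coeffs, x):
    return sum(c * x**i for i, c in enumerate(coeffs))

assert chebyshev(2) == [-1.0, 0.0, 2.0]                        # T_2(x) = 2x^2 - 1
theta = 0.7
assert abs(evaluate(chebyshev(5), math.cos(theta)) - math.cos(5 * theta)) < 1e-12
assert chebyshev(6)[-1] == 2.0**5                              # leading coefficient of T_{n+1} is 2^n
```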
Proposition 2.44 For all n ∈ N, the polynomial (1/2)^n T_{n+1} is, among all polynomials
of degree n + 1 whose coefficient of x^{n+1} is 1, the one of minimal uniform norm over
[−1, 1].
Proof (a) One easily checks by induction that the coefficient of x^{n+1} of T_{n+1} is 2^n.
The coefficient of x^{n+1} of (1/2)^n T_{n+1} is therefore 1.
104 2 Semidefinite and Semi-infinite Programming
(b) We want to show that the coefficients of degree 0 to n of (1/2)^n T_{n+1} are a solution
of the problem
Min_{z_0,…,z_n}  max_{x∈[−1,1]}  | x^{n+1} − ∑_{i=0}^n z_i x^i |.   (2.95)
Given j ∈ {0, …, n}, there is a unique polynomial of degree n that vanishes at all
points x_i, except at x_j where it is equal to one, that is, ℓ_j(x) := ∏_{i≠j} (x − x_i)/(x_j − x_i),
and (2.96) therefore has the unique solution p(x) = ∑_{j=0}^n f(x_j) ℓ_j(x). A naive
choice of the interpolation points is to take them with constant increments. This leads
to significant errors. We will see that, in some sense, the zeros of the Chebyshev
polynomial are the best possible choice.
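The Lagrange formula just stated can be sketched in a few lines (the helper names are ours; math.prod needs Python ≥ 3.8):

```python
import math

def lagrange_basis(nodes, j, x):
    """ell_j(x) = prod_{i != j} (x - x_i)/(x_j - x_i): 1 at nodes[j], 0 at the others."""
    return math.prod((x - xi) / (nodes[j] - xi)
                     for i, xi in enumerate(nodes) if i != j)

def interpolate(nodes, values, x):
    """The unique polynomial of degree <= n with p(nodes[j]) = values[j], at x."""
    return sum(v * lagrange_basis(nodes, j, x) for j, v in enumerate(values))

# degree-2 interpolation of f(x) = x^2 is exact everywhere
nodes, values = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
assert abs(interpolate(nodes, values, 3.0) - 9.0) < 1e-12
```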
Lemma 2.45 Let f ∈ C^{n+1}[a, b]. Then the error e(x) = f(x) − p(x) satisfies
e(x) = (1/(n + 1)!) ∏_{i=0}^n (x − x_i) · f^{(n+1)}(ξ),   (2.97)
Proof (a) If a function of class C¹ over [a, b] vanishes at two distinct points, then
by Rolle's theorem, its derivative has at least one zero between these two points. By
induction, we deduce that if g ∈ C^{n+1}[a, b] vanishes at n + 2 distinct points, then
its derivative of order n + 1 has at least one zero in [a, b].
(b) If x ∈ [a, b] is an interpolation point, (2.97) is trivially satisfied. Otherwise, set
g(t) = f(t) − p(t) − e(x) ∏_{i=0}^n (t − x_i)/(x − x_i).   (2.98)
Since g is of class C^{n+1}[a, b] and vanishes at all interpolation points and at x,
there exists a ξ ∈ [a, b] such that g^{(n+1)}(ξ) = 0. Computing g^{(n+1)}(ξ), the result
follows.
The previous lemma suggests, in the absence of specific information about f^{(n+1)},
choosing the interpolation points that minimize the uniform norm of the product
function
prod(x) := ∏_{i=0}^n (x − x_i).   (2.99)
x_i = ½(a + b) + ½(b − a) cos( [2(n − i) + 1]π / (2(n + 1)) ),  i = 0, …, n.   (2.100)
‖f − p‖_∞ ≤ ‖f^{(n+1)}‖_∞ / (2^n (n + 1)!).   (2.101)
Proof It suffices to check the result when [a, b] = [−1, 1]. We must find the polynomial
of degree n + 1, with coefficient of x^{n+1} equal to 1 and roots in [−1, 1], of
minimal uniform norm. By Proposition 2.44, (1/2)^n T_{n+1} is the unique solution of this
problem, and the interpolation points are the zeros of T_{n+1}, whence (2.100). Using
‖T_{n+1}‖_∞ = 1 and (2.97), we obtain (2.101).
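A numerical sanity check of the points (2.100) and of the bound (2.101), here for f = exp on [−1, 1] with n = 6 (a sketch; the helper names are ours, and ‖f^{(7)}‖_∞ = e on this interval):

```python
import math

def cheb_nodes(a, b, n):
    """The n + 1 interpolation points of (2.100): images in [a, b] of the zeros of T_{n+1}."""
    return [0.5 * (a + b) + 0.5 * (b - a)
            * math.cos((2 * (n - i) + 1) * math.pi / (2 * (n + 1)))
            for i in range(n + 1)]

def interp(nodes, f, x):
    """Lagrange interpolation of f at the given nodes, evaluated at x."""
    return sum(f(xj) * math.prod((x - xi) / (xj - xi)
                                 for i, xi in enumerate(nodes) if i != j)
               for j, xj in enumerate(nodes))

n = 6
nodes = cheb_nodes(-1.0, 1.0, n)
grid = [-1.0 + 2.0 * k / 2000 for k in range(2001)]
err = max(abs(interp(nodes, math.exp, x) - math.exp(x)) for x in grid)
bound = math.e / (2**n * math.factorial(n + 1))   # right-hand side of (2.101)
assert 0.0 < err <= bound
```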
Proof Since f^{(n+1)} is constant, by Lemma 2.45, the maximal error is proportional
to the uniform norm of the product function. By Proposition 2.46, this quantity is
minimal if the interpolation points are given by (2.100).
Remark 2.48 The corollary is useful in the following situation. Let p be a polynomial
of degree n + 1, which is a candidate for the approximation of f. One may wonder
about the quality of the approximation of f by a polynomial of degree n.
Taking the polynomial q obtained by using the interpolation points given by (2.100),
we have by (2.101) the estimate
‖f − q‖_∞ ≤ ‖f − p‖_∞ + ‖p^{(n+1)}‖_∞ / (2^n (n + 1)!).   (2.102)
Lemma 2.49 Let n be even. Then the polynomial p_z(ω) is nonnegative over R iff
there exists a symmetric, positive semidefinite matrix Φ = {Φ_ij}, 0 ≤ i, j ≤ n/2,
such that
z_k = ∑_{i+j=k} Φ_ij,  k = 0, …, n.   (2.104)
p_z(ω) = ∑_{k=0}^n ( ∑_{i+j=k} Φ_ij ) ω^k = ∑_{k=0}^n ∑_{i+j=k} Φ_ij ω^i ω^j = y^⊤ Φ y ≥ 0,
where y := (1, ω, …, ω^{n/2})^⊤.
in the form q(ω) = A(ω) + i B(ω), where A and B are polynomials with real coefficients.
Then p_z(ω) = A(ω)² + B(ω)². We have obtained a decomposition of p_z(ω) as
a sum of two squares of polynomials, of degree at most n/2. Consider a polynomial
of degree at most n/2: ∑_{k=0}^{n/2} c_k ω^k. Its square is of the desired form with Φ_ij = c_i c_j
for all i and j (this rank 1 matrix is positive semidefinite). The same holds for a sum
of squares (it suffices to sum the corresponding matrices).
Remark 2.50 (i) We have shown that a polynomial is nonnegative over R iff it is the
sum of at most two squares of polynomials. (ii) Lemma 2.49 allows us to check the
nonnegativity of a polynomial by solving an SDP problem.
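For instance (our own toy example and helper names), starting from the sum of two squares p(ω) = (ω² − 1)² + (2ω)², the matrix Φ below is positive semidefinite by construction and reproduces the coefficients of p through (2.104):

```python
def gram_from_squares(sq_coeffs, m):
    """Phi = sum of c c^T over the squared polynomials (degree <= m), PSD by construction."""
    size = m + 1
    Phi = [[0.0] * size for _ in range(size)]
    for c in sq_coeffs:
        c = list(c) + [0.0] * (size - len(c))
        for i in range(size):
            for j in range(size):
                Phi[i][j] += c[i] * c[j]
    return Phi

def coeffs_from_gram(Phi, n):
    """Recover z_k = sum_{i+j=k} Phi_ij, k = 0..n, as in (2.104)."""
    size = len(Phi)
    return [sum(Phi[i][k - i] for i in range(size) if 0 <= k - i < size)
            for k in range(n + 1)]

# p(w) = (w^2 - 1)^2 + (2w)^2 = w^4 + 2w^2 + 1, nonnegative over R
Phi = gram_from_squares([[-1.0, 0.0, 1.0], [0.0, 2.0, 0.0]], m=2)
assert coeffs_from_gram(Phi, 4) == [1.0, 0.0, 2.0, 0.0, 1.0]
```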
2.6 Nonnegative Polynomials over R 107
Min_z  ∑_{i=0}^n c_i z_i ;  z ∈ K ;  ∑_{k=0}^n z_k ω^k ≥ 0, for all ω ∈ R,
Min_{z,Φ}  ∑_{i=0}^n c_i z_i ;  z ∈ K ;  z_k = ∑_{i+j=k} Φ_ij, k = 0, …, n;  Φ ⪰ 0.
The previous result allows us to deduce an analogous result in the case of the
nonnegativity of a polynomial over R+ .
Lemma 2.52 A polynomial p_z(ω) of degree n is nonnegative over R_+ iff there exists
a symmetric, positive semidefinite matrix Φ = {Φ_ij}, 0 ≤ i, j ≤ n, such that
0 = ∑_{i+j=2k−1} Φ_ij,  1 ≤ k ≤ n;   z_k = ∑_{i+j=2k} Φ_ij,  0 ≤ k ≤ n.   (2.105)
z_0 + z_1 ω² + ⋯ + z_n ω^{2n}   (2.106)
Lemma 2.53 Let a, f_1 and f_2 be three functions R → R such that f_i(ω) = q_i(ω)² +
a(ω) r_i(ω)², where q_i (resp. r_i) are polynomials of degree n_i (resp. n_i − 1). Then the
function f(ω) = f_1(ω) f_2(ω) is of the form q(ω)² + a(ω) r(ω)², where q and r are
functions R → R that are polynomial if a(·) is. If in addition a(·) is a polynomial of
degree at most 2, we can choose polynomials q and r of degree at most n_1 + n_2 and
n_1 + n_2 − 1, respectively.
We denote by ⌊x⌋ the integer part of x (the greatest integer less than or equal to x).
(i) There exist two polynomials q and r of degree at most ⌊n/2⌋ and ⌊(n − 1)/2⌋,
respectively, such that
p_z(ω) = q(ω)² + ω r(ω)².   (2.108)
(ii) There exist two positive semidefinite matrices Φ and Ψ, of indexes varying from
0 to ⌊n/2⌋ and from 0 to ⌊(n − 1)/2⌋ respectively, such that
z_0 = Φ_00 ;   z_k = ∑_{i+j=k} Φ_ij + ∑_{i+j=k−1} Ψ_ij,  k = 1, …, n.   (2.109)
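A tiny numeric instance of (2.108)–(2.109) (our own example): take q(ω) = ω − 1 and r(ω) = 1, so that p_z(ω) = (ω − 1)² + ω = ω² − ω + 1 is nonnegative over R_+:

```python
# p(w) = (w - 1)^2 + w * 1^2 = w^2 - w + 1, nonnegative over R_+
Phi = [[1.0, -1.0], [-1.0, 1.0]]   # c c^T for q(w) = w - 1, i.e. c = (-1, 1)
Psi = [[1.0]]                      # for r(w) = 1
z = [Phi[0][0],                            # z_0 = Phi_00
     Phi[0][1] + Phi[1][0] + Psi[0][0],    # z_1 = sum_{i+j=1} Phi_ij + sum_{i+j=0} Psi_ij
     Phi[1][1]]                            # z_2 = sum_{i+j=2} Phi_ij (the Psi sum is empty)
assert z == [1.0, -1.0, 1.0]
for w in (0.0, 0.5, 2.0):
    assert (w - 1.0)**2 + w == z[0] + z[1] * w + z[2] * w * w
```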
Lemma 2.55 Let a and b be two real numbers, with a < b. A polynomial p_z(ω) of
degree n is nonnegative over [a, b] iff it satisfies one of the two following conditions:
(i) There exist two polynomials q and r of degree at most n/2 and n/2 − 1 resp. if n is
even, and at most (n − 1)/2 if n is odd, such that
p_z(ω) = q(ω)² + (b − ω)(ω − a) r(ω)²   if n is even,
p_z(ω) = (ω − a) q(ω)² + (b − ω) r(ω)²   otherwise.   (2.111)
(ii) There exist two positive semidefinite matrices Φ and Ψ, with index varying from
0 to n/2 and n/2 − 1 resp. if n is even, and from 0 to (n − 1)/2 if n is odd, such that,
if n is even:
z_k = ∑_{i+j=k} Φ_ij − ab ∑_{i+j=k} Ψ_ij + (a + b) ∑_{i+j=k−1} Ψ_ij − ∑_{i+j=k−2} Ψ_ij,  k = 1, …, n,   (2.112)
and if n is odd:
z_k = −a ∑_{i+j=k} Φ_ij + ∑_{i+j=k−1} Φ_ij + b ∑_{i+j=k} Ψ_ij − ∑_{i+j=k−1} Ψ_ij,  k = 1, …, n.   (2.113)
We denote the set of possible values of the first n + 1 moments of positive measures
over Ω by
M^n := { m ∈ R^{n+1} ;  m_k = ∫_Ω ω^k dμ(ω),  k = 0, …, n,  for some μ ∈ M(Ω)_+ }.
Similarly, we denote by M_F^n the set of moments of positive measures with finite support
over Ω; of course M_F^n ⊂ M^n. We will, in this section, study characterizations
of the sets M^n and M_F^n. The latter are obviously convex cones of R^{n+1}.
Lemma 2.57 The set M_F^n has a nonempty interior, and R^{n+1} = M_F^n − M_F^n.
Proof We will prove a more precise result: the conclusion remains true if Ω includes
at least n + 1 distinct points.
(a) Let ω0 , . . . , ωn be distinct points of Ω. Let us show that the set of moments of
measures with support over ω0 , . . . , ωn is equal to Rn+1 . Indeed these moments form
a vector subspace; let z belong to its orthogonal. We then have, for all (λ0 , . . . , λn ),
0 = ∑_{i=0}^n z_i ∑_{k=0}^n λ_k ω_k^i = ∑_{k=0}^n λ_k p_z(ω_k).
This proves that the polynomial pz vanishes at the points ω0 , . . . , ωn , i.e., it has more
roots than its degree, implying that z = 0, as was to be proved.
(b) We show that the interior of M Fn is nonempty, by checking that the set E of
moments of positive measures with support over the distinct points ω0 , . . . , ωn of Ω
has a nonempty interior. Since E is convex, if it has an empty interior, it is included
in a hyperplane with normal λ; then λ is also normal to E − E, but we checked that
E − E = R^{n+1}, which is a contradiction.
Remark 2.58 The set of moments of positive measures with support over the points
{ω0 , . . . , ωn } is of course the cone generated by the n + 1 Dirac measures asso-
ciated with {ω0 , . . . , ωn }. It is therefore characterized by a finite number of linear
inequalities (Pulleyblank [90]).
The aim of this section is to present a method of characterization of the set M n .
We first recall a classical result, based on the following matrices (often called moment
matrices in the literature)
M_0(m) :=
⎛ m_0      m_1      ⋯   m_n      ⎞
⎜ m_1      m_2      ⋯   m_{n+1}  ⎟
⎜  ⋮        ⋮       ⋱    ⋮       ⎟
⎝ m_n      m_{n+1}  ⋯   m_{2n}   ⎠ ;
M_1(m) :=
⎛ m_1      m_2      ⋯   m_{n+1}  ⎞
⎜ m_2      m_3      ⋯   m_{n+2}  ⎟
⎜  ⋮        ⋮       ⋱    ⋮       ⎟
⎝ m_{n+1}  m_{n+2}  ⋯   m_{2n+1} ⎠ .
Lemma 2.59 (i) Let (m_0, …, m_{2n+1}) ∈ M^{2n+1}. Then M_0(m) ⪰ 0. (ii) If in addition
Ω ⊂ R_+, then M_1(m) ⪰ 0.
Proof (i) Set x(ω) = (1, ω, ω², …, ω^n)^⊤. From x(ω) x(ω)^⊤ ⪰ 0 and μ ≥ 0, we
deduce that
M_0(m) = ∫_Ω x(ω) x(ω)^⊤ dμ(ω) ⪰ 0.   (2.116)
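These necessary conditions are easy to test numerically. The sketch below (helper names ours) builds M_0(m) and M_1(m) for the measure δ_1 + δ_2 + δ_3 on Ω = R_+ with n = 1, and checks positive definiteness via leading principal minors:

```python
def moments(points, weights, order):
    """Moments m_k = sum_i w_i p_i^k of a finitely supported positive measure."""
    return [sum(w * p**k for p, w in zip(points, weights)) for k in range(order + 1)]

def hankel(m, start, n):
    """M_0 (start = 0) or M_1 (start = 1): entry (i, j) is m_{start+i+j}, 0 <= i, j <= n."""
    return [[m[start + i + j] for j in range(n + 1)] for i in range(n + 1)]

def is_positive_definite(M):
    """Sylvester's criterion via cofactor determinants (fine for tiny matrices)."""
    def det(A):
        if len(A) == 1:
            return A[0][0]
        return sum((-1)**j * A[0][j] * det([r[:j] + r[j+1:] for r in A[1:]])
                   for j in range(len(A)))
    return all(det([row[:k] for row in M[:k]]) > 0 for k in range(1, len(M) + 1))

m = moments([1.0, 2.0, 3.0], [1.0, 1.0, 1.0], order=3)   # measure on R_+
assert m == [3.0, 6.0, 14.0, 36.0]
assert is_positive_definite(hankel(m, 0, 1))   # M_0(m) >= 0 (here even > 0)
assert is_positive_definite(hankel(m, 1, 1))   # M_1(m) >= 0 since the support is in R_+
```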
Remark 2.60 We can give other examples of necessary conditions based on similar
arguments. For instance, when Ω = [0, 1], the vector
Min_{z∈R^{n+1}}  ∑_{k=0}^n m_k z_k ;  p_z(ω) ≥ 0 over Ω.   (PM)
The criterion is linear, and the feasible domain is a cone; the value of this problem
is therefore 0 or −∞. The “finite dual” problem (in the sense of Sect. 2.5) is
Max_{μ∈M_F(Ω)_+}  0 ;  M_k(μ) = m_k,  k = 0, …, n.   (DM_F)
Its value is 0 if m ∈ M n , and −∞ otherwise. Its feasible set is the set of positive
measures having m for first moments.
Lemma 2.61 (i) We have val(D M F ) ≤ val(P M).
(ii) If in addition Ω is compact, then val(D M F ) = val(P M), and M n = M Fn .
Proof (i) If the dual is not feasible, so that its value is −∞, then val(D M F ) ≤
val(P M) trivially holds. Otherwise, let μ ∈ M(Ω) be such that Mk (μ) = m k , k = 0
to n, and z ∈ F(P M). Then
∑_{k=0}^n m_k z_k ≥ ∑_{k=0}^n m_k z_k − ∫_Ω p_z(ω) dμ(ω) = ∑_{k=0}^n z_k (m_k − M_k(μ)) = 0.   (2.118)
Min_{z,Φ}  ∑_{k=0}^n m_k z_k ;  z = ∑_{ℓ=1}^L A_ℓ Φ_ℓ ,  Φ_ℓ ⪰ 0,  ℓ = 1, …, L,   (2.119)
Min_Φ  ∑_{ℓ=1}^L ⟨Φ_ℓ , A_ℓ^⊤ m⟩ ;  Φ_ℓ ⪰ 0,  ℓ = 1, …, L.   (PM′)
Proposition 2.65 Let m ∈ int M n . Then there exists a finite measure, with cardi-
nality of support at most n + 1, having m for first moments.
Proof Let r > 0 and set Ω_r := Ω ∩ [−r, r]. If the conclusion does not hold, there
exists no measure with support in Ω_r having m for first moments. By Lemma 2.61,
there exists a z^r ∈ R^{n+1} such that p_{z^r} is nonnegative over Ω_r, and ∑_k m_k z^r_k < 0. Let
z̄ ≠ 0 be a limit point of z^r/|z^r|. Then p_z̄ is nonnegative over Ω, and ∑_k m_k z̄_k ≤ 0.
Choose m′ so close to m that ∑_k m′_k z̄_k < 0. Then problem (PM) for m′ has
value −∞, which by (2.118) implies that m′ ∉ M^n, in contradiction with m ∈
int M^n.
Example 2.66 Let us show that if Ω = R_+, the set M^n is not closed. Let r >
1. To the measure μ_r = (1 − r^{−n}) δ_0 + r^{−n} δ_r are associated the moments m^r =
(1, r^{1−n}, r^{2−n}, …, 1), with limit m = (1, 0, …, 0, 1). It is clear that m ∉ M^n.
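The moment computation in this example can be reproduced numerically (a sketch, with n = 3):

```python
def moments_mu_r(r, n):
    """Moments m_k, k = 0..n, of mu_r = (1 - r^-n) delta_0 + r^-n delta_r."""
    w0, wr = 1.0 - r**(-n), r**(-n)
    return [w0 * (1.0 if k == 0 else 0.0) + wr * r**k for k in range(n + 1)]

n = 3
for r in (10.0, 100.0, 1000.0):
    m = moments_mu_r(r, n)
    assert abs(m[0] - 1.0) < 1e-12 and abs(m[n] - 1.0) < 1e-12
# the interior moments tend to 0, so m^r -> (1, 0, ..., 0, 1), which is not in M^n
assert max(moments_mu_r(1000.0, n)[1:n]) < 1e-2
```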
Min_{z∈R^{n+1}}  ∑_{k=0}^n m_k z_k ;  p_z(ω) − χ_S(ω) ≥ 0 over Ω.   (PM_S)
Min_{z∈R^{n+1}}  ∑_{k=0}^n m_k z_k ;  p_z(ω) ≥ 1 over S;  p_z(ω) ≥ 0 over Ω∖S.   (PM_S)
Given measures μ_1 and μ_2 with support in S and Ω∖S resp., the Lagrangian of the
problem is, denoting by M_{0:n}(·) the vector of moments of order 0 to n:
L(μ, z) := ∑_{k=0}^n m_k z_k − ∫_S (p_z(ω) − 1) dμ_1(ω) − ∫_{Ω∖S} p_z(ω) dμ_2(ω)
         = (m − M_{0:n}(μ_1 + μ_2)) · z + ∫_S dμ_1(ω).   (2.121)
Remark 2.69 The previous results can be useful in the context of risk control.
Assume that certain moments of a probability distribution of gains, with values in a
bounded interval Ω, are known. We can then compute the maximal possible probability
that the gain lies below a certain threshold s, by solving a maximal loading problem,
with here S := ]−∞, s] ∩ Ω.
2.7 Notes
questions of sensitivity are dealt with in Bonnans and Ramírez [25], and an overview
is given in Alizadeh and Goldfarb [4].
About semi-infinite programming, see Bonnans and Shapiro [26, Sect. 5.4], or
Goberna and Lopez [53]. The problem of moments is discussed in Chap. 16 of [125];
a classical reference is Akhiezer [2]. The related work by Lasserre [69] deals with
the minimization of polynomial functions of several variables, with polynomial con-
straints. Our discussion of Chebyshev interpolation follows Powell’s book [89], a
classical reference in approximation theory.
Chapter 3
An Integration Toolbox
Let Ω be a set; we denote by P(Ω) the set of its subsets. We say that F ⊂ P(Ω)
is an algebra (resp. σ-algebra) if it contains ∅ and Ω, the complement of each
of its elements, and the finite (resp. countable) unions of its elements. Note that an
algebra (resp. a σ-algebra) also contains the finite (resp. countable) intersections of its
elements. The trivial σ-algebra is the algebra {∅, Ω}. An intersection of algebras
(resp. σ-algebras) is an algebra (resp. σ-algebra). Therefore, if E ⊂ P(Ω), we
may define its generated algebra (resp. generated σ-algebra) as the intersection of the
algebras (resp. σ-algebras) containing it, or equivalently the smallest algebra (resp.
σ-algebra) containing it. The above intersections are not over an empty set since
they contain the trivial σ-algebra. If F is a σ-algebra of Ω, we say that (Ω, F) is
a measurable space, and call the elements of F measurable sets.
Remark 3.1 We can build the algebra (resp. σ -algebra) generated by E ⊂ P(Ω) as
follows. Consider the sequence Ek ⊂ P(Ω), k ∈ N, such that E0 := E , and Ek+1 is
the subset of P(Ω) whose elements are the elements of Ek , as well as their com-
plements and finite (resp. countable) unions. We can call Ek the k steps completion
Definition 3.3 Given two sets Ω1 and Ω2 , and subsets Fˆi of P(Ωi ), with generated
σ -algebras denoted by Fi , for i = 1, 2, we set
1 This means that there exists a subset O of P (Ω) that contains Ω and ∅, and is stable under finite
intersection and arbitrary union. Its elements are called open sets. The complements of open sets
are called closed sets.
3.1 Measure Theory 119
If (Y, ρ) is a metric space, its open balls with center y ∈ Y and radius r > 0 are
denoted by
B(y, r) := {y′ ∈ Y ;  ρ(y, y′) < r}.   (3.4)
The open subsets of Y are defined as unions of open balls. This makes Y a topological
space. In the sequel metric spaces will always be endowed with their Borel σ -algebra.
Measurable Mappings
Let (X, F X ) and (Y, FY ) be two measurable spaces. The mapping f : X → Y is said
to be measurable if, for all Y1 ∈ FY , f −1 (Y1 ) ∈ F X . A composition of measurable
mappings is therefore measurable, if the σ -algebra of the intermediate space is the
same for the two mappings.
Lemma 3.7 Let (X, F X ) and (Y, FY ) be two measurable spaces, the σ -algebra
FY being generated by G ⊂ P(Y ). Then f : X → Y is measurable iff f −1 (g) is
measurable, for all g ∈ G .
120 3 An Integration Toolbox
Proof Let O be an open subset of Y. Clearly, O = ∪_{r>0} O_r, and hence, x ∈ f̄^{−1}(O)
iff there exists an r_0 > 0 such that x ∈ f̄^{−1}(O_{r_0}), i.e., there exists a y ∈ O_{r_0} such that
y = f̄(x) = lim_k f_k(x). This holds iff, for any r_1 > r_0, f_k(x) ∈ O_{r_1} for large enough
k, i.e., iff x belongs to ∩_{ℓ≥k} f_ℓ^{−1}(O_{r_1}) for large enough k: relation (3.6) follows.
2 By the definition, A ∖ B := {x ∈ A ; x ∉ B}.
Definition 3.12 Let ⌊a⌋ denote the integer part of a, i.e., the greatest integer m such
that m ≤ a. For k ∈ N, set
a_k := 2^{−k} ⌊2^k a⌋  if a ≥ 0;   a_k := −2^{−k} ⌊−2^k a⌋  otherwise.   (3.7)
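A numeric sketch of (3.7) (the function name is ours): a_k truncates a to k dyadic digits, rounding toward zero, so that |a − a_k| < 2^{−k} and the map is odd:

```python
import math

def dyadic(a, k):
    """The approximation a_k of (3.7): k dyadic digits, rounded toward zero."""
    if a >= 0:
        return 2.0**(-k) * math.floor(2.0**k * a)
    return -2.0**(-k) * math.floor(-2.0**k * a)

a = math.pi
for k in range(1, 12):
    assert 0.0 <= abs(a) - abs(dyadic(a, k)) < 2.0**(-k)   # |a_k| <= |a| < |a_k| + 2^-k
    assert dyadic(-a, k) == -dyadic(a, k)                  # truncation toward zero is odd
```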
the sum is nonzero). Finally, in the general case, by the previous discussion, there
exists a measurable function g_k such that Y_k = g_k(X), and for m > k, we have
|g_m(x) − g_k(x)| ≤ 2^{−k}, proving that the sequence g_k simply converges to some g,
which is measurable by Lemma 3.26.
Remark 3.16 For a measurable function f with values in a separable (i.e., containing
a dense sequence) metric space (E, ρ), we can make a somewhat similar construction.
Let e_k, k ∈ N, be a dense sequence in E. Set E_{k,ℓ} := {e_j ; k ≤ j ≤ ℓ}. Define f_k
by induction: f_k(x) = e_0 if ρ(e_0, f(x)) ≤ ρ(e, f(x)) for all e ∈ E_{0,k}, and at step i,
1 ≤ i ≤ k, if f_k(x) has not been set yet, then f_k(x) = e_i if ρ(e_i, f(x)) ≤ ρ(e, f(x))
for all e ∈ E_{i,k}. In this way we obtain a sequence of simple functions that simply
converges to f. It follows that the above Doob–Dynkin lemma holds when replacing
R^p by a separable metric space.
3.1.2 Measures
and σ-finiteness:
There exists an exhaustion sequence A_k in F, i.e., such that
μ(A_k) < ∞ and Ω = ∪_k A_k.   (3.10)
We also have
μ(∪_k A_k) = lim_k μ(A_k)  if A_k ⊂ A_{k+1} for all k.   (3.13)
Indeed, apply (3.9) to the disjoint family A_k ∖ A_{k−1} (for k ≥ 1, assuming w.l.o.g.
that A_0 = ∅). Let us show next that (3.13) implies
μ(∩_k A_k) = lim_k μ(A_k)  if {A_k} is a nonincreasing family of measurable sets
having finite measure for large enough k.   (3.14)
Construction of Measures
In the case of Lebesgue measure over R the starting point is the length of finite
intervals; one has to check that the latter can be extended to a measure over the
Borelian σ -algebra. More generally let Fˆ be an algebra of subsets of Ω, and let F
denote the generated σ -algebra. Let μ : Fˆ → R+ ∪ {+∞} be a σ -additive function
over Fˆ , i.e., if ∪i∈I Ai ∈ Fˆ , then
μ(∪_{i∈I} A_i) = ∑_{i∈I} μ(A_i),  for I finite or countable
and {A_i}_{i∈I} ⊂ F̂ such that A_i ∩ A_j = ∅ if i ≠ j.   (3.15)
Product Spaces
Proposition 3.21 Let (Ω_i, F_i, μ_i), for i = 1, 2, be two measure spaces. Let μ be the
set function over F_1 × F_2 defined by μ(F_1 × F_2) := μ_1(F_1) μ_2(F_2), for all F_1 ∈ F_1
and F_2 ∈ F_2. Then there exists a unique measure μ̄ over the product σ-algebra
F_1 ⊗ F_2 that extends μ, in the sense that μ̄(F) = μ(F), for all F ∈ F_1 × F_2.
Fc := {A := A1 ∪ A2 ; A1 ∈ F ; A2 is negligible} (3.17)
A Useful Lemma
Then we can define a finitely additive set function μ̂ over the set of cylinders of X
by
μ̂(C(A)) := μ_n(A),  for each measurable A in R^{p×n}.   (3.21)
If A and A′ are Borelian sets in R^{p×i} and R^{p×j} resp. with j > i, then C(A) coincides
with C(A′) iff A′ = A × R^{p×(j−i)}. So, in view of (3.20), μ̂(C(A)) is well defined.
Theorem 3.24 The set function μ̂ has a unique extension to a probability measure
μ on F .
Proof (Taken from Shiryaev [115, Chap. 2, Sect. 3]). By Carathéodory’s extension
Theorem 3.17, it suffices to prove that μ̂ is σ -additive.
Since μ̂ is finitely additive, its σ -additivity is equivalent (taking complements)
to the property of "continuity at 0", i.e., if C_k := C(A_k) is a decreasing sequence in
F̂ with empty intersection, then μ̂(C_k) → 0. Note that we may assume that A_k is
a Borelian set in R^{p×n_k}, with n_k ≥ k. Assume on the contrary that μ̂(C_k) → δ > 0.
We prove (independently) in Lemma 5.4 that any Borelian subset A of Rn is such
that
For any ε > 0, there exist F, G, resp. closed and open subsets of R^n,
such that F ⊂ A ⊂ G and P(G ∖ F) < ε.   (3.22)
Intersecting F with a closed ball of arbitrarily large radius, it is easily seen that we
may in addition assume F to be compact. So, let F_k ⊂ A_k be compact sets such that
μ_{n_k}(A_k ∖ F_k) < δ/2^{k+1}. Let F̂_k := ∩_{q≤k} C(F_q). Then
C_k ∖ F̂_k = C_k ∖ ∩_{q≤k} C(F_q) = ∪_{q≤k} (C_k ∖ C(F_q)),   (3.23)
and therefore
Since μ̂(C_k) → δ > 0 it follows that lim_k μ̂(F̂_k) ≥ δ/2. So, there exists a sequence
x^k in R^∞ such that x^k ∈ F̂_k for all k. Let p ≥ 1 be an integer. Since F_k is a compact
subset of R^{p×n_k}, with n_k ≥ k, the sequence k ↦ x^k_p is bounded. By a diagonal argument we can, up
to the extraction of a subsequence, assume that x^k_p is convergent for each p to,
say, x_p. We easily see that x belongs to ∩_k C_k, contradicting our hypothesis.
Remark 3.25 As mentioned in [115], the proof has an immediate extension to the
case when X = Y ∞ , where Y is a metric space endowed with the Borelian σ -algebra.
We need probabilities μn on Y n , satisfying the consistency property μn+1 (A × Y ) =
μn (A), for all n = 1, . . .. Then we are able to build an extension of the μn on the
Let (X, F_X), (Y, F_Y) be two measurable spaces. We endow (X, F_X) with a measure
μ. Denote by M the vector space of functions X → Y that are measurable, after
modification on a negligible set, and by M_μ the quotient space through the equivalence
relation f ∼ f′ iff f(x) = f′(x) a.e. We call an element of an equivalence
class a representative of that class, and say that the sequence f_k ∈ M_μ converges
a.e. to f ∈ M_μ if the convergence holds a.e. for some representative.
Lemma 3.26 Let (X, F X , μ) be a measure space, and (Y, dY ) be a metric space. If
a sequence in Mμ converges a.e., then its limit is a measurable function.
Theorem 3.27 (Egoroff) Let (X, F_X, μ) be a measure space such that μ(X) < ∞,
(Y, ρ) a metric space, and f̄_k a sequence of M_μ(X, Y). If f̄_k converges a.e. to ḡ, then
for any representatives (f_k, g) of (f̄_k, ḡ) and ε > 0, there exists a K ∈ F_X such that
μ(X∖K) ≤ ε and f_k uniformly converges to g on K.
For all ε > 0, we have that μ ({x ∈ X ; ρ( f k (x), g(x)) > ε}) → 0. (3.28)
Theorem 3.28 Let (X, F X , μ) be a measure space, and (Y, ρ) be a metric space.
Let f k be a sequence of measurable mappings X → Y . Then: (i) Convergence in
measure implies convergence a.e. of a subsequence. (ii) If μ(X ) < ∞, convergence
a.e. implies convergence in measure.
We briefly review some results, referring for the proof to [77, Chap. 1]. For f, g in
L^0(Ω) set
e(f, g) := inf_{ε>0} {ε + μ(|f − g| > ε)}.   (3.30)
3.1.5 Integration
Let (Ω, F, μ) be a measure space. The spaces L^0(Ω) and E^0(Ω) of measurable
and simple functions, resp., were introduced in Definition 3.10. If f ∈ E^0(Ω) has
values a_1 < ⋯ < a_n, the sets A_i := f^{−1}(a_i), i = 1 to n, are measurable and give a
partition of Ω. Denote by E^1(Ω) the subspace of E^0(Ω) for which the A_i have a
finite measure whenever a_i ≠ 0. If f ∈ E^1(Ω), we define the integral of f as
∫_Ω f(ω) dμ(ω) := ∑_{i=1}^n a_i μ(A_i).   (3.32)
Proof We follow [77, French edition, p. 37]. If the conclusion does not hold, then
h_k := f_k − f′_k is a Cauchy sequence that converges a.e. to zero and such that
∫_Ω h_k(ω) dμ(ω) has a nonzero limit, say γ, so that, for large enough k, ‖h_k‖_1 ≥
½|γ| > 0 and ‖h_k − h_ℓ‖_1 < |γ|/8 for ℓ > k. Fix such a k and write h_k = ∑_q ξ_q 1_{A_q}.
Then, for ℓ > k:
∑_q ∫_{A_q} |ξ_q − h_ℓ(ω)| dμ(ω) ≤ ‖h_k − h_ℓ‖_1 ≤ ¼ ‖h_k‖_1 = ¼ ∑_q ∫_{A_q} |ξ_q| dμ(ω).   (3.37)
So, we must have that
∫_{A_q} |ξ_q − h_ℓ(ω)| dμ(ω) ≤ ¼ ∫_{A_q} |ξ_q| dμ(ω) = ¼ |ξ_q| μ(A_q),  for some q.   (3.38)
For this particular q, by the Tchebycheff inequality:
μ({ω ∈ A_q ;  |ξ_q − h_ℓ(ω)| > ½ |ξ_q|}) < ½ μ(A_q),   (3.39)
so that
μ({ω ∈ A_q ;  |h_ℓ(ω)| ≥ ½ |ξ_q|}) ≥ ½ μ(A_q).   (3.40)
With the above definition we have the usual calculus rules such as
∫_Ω (f + g)(ω) dμ(ω) = ∫_Ω f(ω) dμ(ω) + ∫_Ω g(ω) dμ(ω),   (3.41)
whenever the integrals of f and g are defined, except of course if f and g have
infinite integrals of opposite sign.
Proof The σ-finiteness axiom (3.10) holds since ρ_f(Ω) = ‖f‖_1 < ∞. It remains to
show that, if the A_i satisfy the assumptions in (3.9), then ρ_f(∪_{i∈I} A_i) = ∑_{i∈I} ρ_f(A_i),
or equivalently ∫_{∪_{i∈I} A_i} f(ω) dμ(ω) = ∑_{i∈I} ∫_{A_i} f(ω) dμ(ω). This follows from the monotone
convergence Theorem 3.34, where we set f_k(ω) := f(ω) ∑_{ℓ≤k} 1_{A_ℓ}(ω).
Corollary 3.37 Let {B_k} ⊂ F be such that B_{k+1} ⊂ B_k, and B := ∩_k B_k has zero
measure. Then ∫_{B_k} f(ω) dμ(ω) → 0, for all f ∈ L^1(Ω).
α_ℓ := sup_k ∫_{K_ℓ} |f_k(ω) − g(ω)| dμ(ω) ≤ 2 ∫_{K_ℓ} h(ω) dμ(ω) → 0.   (3.45)
So, given γ > 0, take ℓ such that α_ℓ ≤ γ. By (3.46), we have that lim sup_k ‖f_k −
g‖_1 ≤ γ. It follows that f_k → g in L^1(Ω).
(c) Assume now that μ(Ω) = ∞, and let A_ℓ be the exhaustion sequence in (3.10).
Set B_ℓ := Ω ∖ A_ℓ. By step (b), ∫_{A_ℓ} |f_k(ω) − g(ω)| dμ(ω) → 0, and so
lim sup_k ‖f_k − g‖_1 = lim sup_k ∫_{B_ℓ} |f_k(ω) − g(ω)| dμ(ω) ≤ 2 ∫_{B_ℓ} |h(ω)| dμ(ω).   (3.47)
Now ∫_{B_ℓ} |h(ω)| dμ(ω) = ρ_h(B_ℓ). Since ∩_ℓ B_ℓ has zero measure, by Corollary 3.37, the
above r.h.s. converges to 0 when ℓ ↑ +∞. The conclusion follows.
Remark 3.39 We have proved in step (a) that a measurable function belongs to
L 1 (Ω) whenever it is dominated by some h ∈ L 1 (Ω).
Example 3.40 Define the functions f_k and g : R → R by f_k(x) := e^{−(x−k)²} and
g(x) := 0. Then f_k and g are integrable, and f_k → g a.e. However, the integral of f_k does not
converge to that of g. The above theorem does not apply, since the domination
hypothesis does not hold.
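Numerically (a sketch with our own helper names and a midpoint quadrature), each f_k has integral √π while f_k(x) → 0 for every fixed x, showing the mass escaping to +∞:

```python
import math

def f(k, x):
    return math.exp(-(x - k)**2)

def integral(k, lo=-50.0, hi=50.0, steps=100000):
    """Midpoint-rule approximation of the integral of f_k over R (tails are negligible)."""
    h = (hi - lo) / steps
    return h * sum(f(k, lo + (i + 0.5) * h) for i in range(steps))

for k in (0, 5, 20):
    assert abs(integral(k) - math.sqrt(math.pi)) < 1e-5   # mass sqrt(pi) for every k
assert f(40, 0.0) < 1e-300                                # yet f_k(x) -> 0 for fixed x
```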
Remark 3.42 Fatou’s lemma allows us to prove the l.s.c. of some integral functionals,
see e.g. after (3.134).
Proof Apply Lemma 3.41 to the sequence | f k |, which converges a.e. to | f |, with
g = 0.
Example 3.44 The integrable sequence f_k(x) := e^{−(x−k)²} simply converges to 0, and
gives an example of strict inequality in (3.48). It also shows that the convergence
in L^1(Ω) does not necessarily occur in the setting of Corollary 3.43. Taking now
f_k(x) := −e^{−(x−k)²}, we verify that, to obtain (3.48), the hypothesis that f_k ≥ g, with
We have until now presented the standard theorems of integration theory. We now
end this section with some more advanced results. Let us first show that Fatou’s
lemma implies an easy and useful generalized dominated convergence theorem, see
Royden [105, Chap. 4, Thm. 17].
for all k. Changing E_j into ∪_{i≤j} E_i if necessary, we may assume that F_j is a nonincreasing
sequence, whose intersection has zero measure. By Corollary 3.37, we
may fix j such that ∫_{F_j} |f(ω)| dμ(ω) ≤ ε and ε_j M_ε + ε ≤ 2ε, so that
∫_{F_j} |f_k − f|(ω) dP(ω) ≤ 3ε. Since f_k → f uniformly on E_j, the conclusion follows.
Remark 3.48 The theorem does not hold over a measure space when μ(Ω) is not
finite, as Example 3.44 shows.
Exercise 3.49 Let Ω := [0, 1] be endowed with Lebesgue's measure. Let f_k(ω) = k
over [0, 1/k] and f_k(ω) = 0 otherwise. Show that this sequence is not uniformly
integrable, and does not satisfy the conclusion of Vitali's theorem.
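A sketch of the computation behind the exercise (exact arithmetic; the helper names are ours): each f_k has integral 1, while f_k(ω) → 0 for every ω ∈ (0, 1], so no Vitali-type conclusion can hold:

```python
from fractions import Fraction

def f(k, w):
    """f_k = k on [0, 1/k] and 0 elsewhere, on Omega = [0, 1]."""
    return Fraction(k) if w <= Fraction(1, k) else Fraction(0)

def integral(k):
    """Lebesgue integral of f_k: the value k on a set of measure 1/k."""
    return Fraction(k) * Fraction(1, k)

# every integral equals 1, yet f_k(w) -> 0 for each w in (0, 1]
assert all(integral(k) == 1 for k in range(1, 100))
w = Fraction(1, 100)
assert f(99, w) == 99 and f(101, w) == 0
```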
3.1.6 L^p Spaces
Let
L^∞(Ω) := {f ∈ L^0(Ω) ;  ‖f‖_∞ < ∞}.   (3.53)
It is easily checked that this space, endowed with the norm · ∞ , is a Banach space.
Now, for p ∈ [1, ∞) set
‖f‖_p := ( ∫_Ω |f(ω)|^p dμ(ω) )^{1/p} ;   L^p(Ω) := {f ∈ L^0(Ω) ;  ‖f‖_p < ∞}.
We will check in Lemma 3.53 that this is a norm. Let us prove that L^p(Ω) is a vector
space. It is enough to check that if f, g are in L^p(Ω), then f + g is in L^p(Ω). Indeed, the
function x ↦ |x|^p being convex, we have that
2^{−p} ‖f + g‖_p^p = ∫_Ω |½(f + g)|^p ≤ ∫_Ω ( ½|f|^p + ½|g|^p ) = ½‖f‖_p^p + ½‖g‖_p^p.   (3.56)
Let p ∈ [1, ∞], and q be the conjugate exponent, such that 1/p + 1/q = 1. The
following lemma shows that to every element of L^q(Ω) is associated a continuous
linear form on L^p(Ω):
Let p ∈ [1, ∞], and q be the conjugate exponent, such that 1/ p + 1/q = 1. The
following lemma shows that to every element of L q (Ω) is associated a continuous
linear form on L p (Ω):
Proof The result is obvious if p ∈ {1, ∞}. So, let p ∈ (1, ∞). Since the inequality
(3.57) is positively homogeneous w.r.t. f and g, it is enough to check that ‖fg‖_1 ≤ 1
whenever ‖f‖_p = ‖g‖_q = 1. So, given f ∈ L^p(Ω), ‖f‖_p = 1, we need to check
that the convex problem below has value not less than −1:
Min_{g∈L^q(Ω)}  −∫_Ω f(ω) g(ω) dμ(ω) ;  (1/q) ∫_Ω |g(ω)|^q dμ(ω) ≤ 1/q.   (3.58)
We may always assume that fg ≥ 0 a.e., since otherwise we obtain a lower cost
by changing g(ω) into −g(ω) on {ω ∈ Ω; f(ω)g(ω) < 0}. We may also assume that
f(ω) ≥ 0 a.e., in view of the discussion on the sign of fg. We will solve this qualified
convex problem by finding a solution to the optimality system, with multiplier λ > 0.
The Lagrangian function can be expressed as
136 3 An Integration Toolbox
∫_Ω ( −f(ω) g(ω) + λ |g(ω)|^q / q ) dμ(ω) − λ/q,   (3.59)
whose minimum is attained for g ≥ 0 such that −f(ω) + λ g(ω)^{q−1} = 0 a.e., i.e.,
g(ω) = (f(ω)/λ)^{p/q}, which is an element of L^q(Ω). Since λ > 0 the constraint is
binding, and so,
1 = ∫_Ω |g(ω)|^q dμ(ω) = λ^{−p} ∫_Ω |f(ω)|^p dμ(ω) = λ^{−p},   (3.60)
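On a finite set with the counting measure, both the Hölder inequality (3.57) and the extremal g(ω) = (f(ω)/λ)^{p/q} found in the proof can be checked directly (a sketch; the names are ours):

```python
def norm(f, p):
    """||f||_p for the counting measure on finitely many points."""
    return sum(abs(x)**p for x in f) ** (1.0 / p)

f = [0.3, -1.2, 2.0, 0.7]
g = [1.5, 0.4, -0.8, 2.2]
p = 3.0
q = p / (p - 1.0)                                    # conjugate exponent: 1/p + 1/q = 1
assert sum(abs(a * b) for a, b in zip(f, g)) <= norm(f, p) * norm(g, q) + 1e-12

# the minimizer found in the proof: with ||f||_p = 1 and g = f^{p/q}, Holder is tight
fn = [abs(x) / norm(f, p) for x in f]                # normalized, ||fn||_p = 1
g_star = [x ** (p / q) for x in fn]
assert abs(norm(g_star, q) - 1.0) < 1e-12
assert abs(sum(a * b for a, b in zip(fn, g_star)) - 1.0) < 1e-12
```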
Corollary 3.52 Let μ(Ω) < ∞, and 1/p + 1/q = 1/r with r ∈ (1, ∞). Then
L^p(Ω) ⊂ L^r(Ω) and if f ∈ L^p(Ω), we have that
‖f‖_r ≤ μ(Ω)^{1/q} ‖f‖_p.   (3.62)
Lemma 3.53 The space L^p(Ω) is a normed vector space; in particular, for any f, g in
L^p(Ω), the following Minkowski inequality holds:
‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p.   (3.63)
Proof It is enough to check the triangle inequality (3.63) when f and g are nonnegative.
Let q be such that 1/p + 1/q = 1. By Lemma 3.50, since p − 1 = p/q:
∫_Ω (f + g)^p = ∫_Ω f (f + g)^{p−1} + ∫_Ω g (f + g)^{p−1} ≤ (‖f‖_p + ‖g‖_p) ‖(f + g)^{p/q}‖_q.   (3.64)
Note that
‖(f + g)^{p/q}‖_q = ( ∫_Ω (f + g)^p )^{1/q} = ‖f + g‖_p^{p/q} = ‖f + g‖_p^{p−1}.   (3.65)
We obtain that ‖f + g‖_p^p ≤ (‖f‖_p + ‖g‖_p) ‖f + g‖_p^{p−1}. The conclusion follows.
Note the following variant in L p (Ω) of the dominated convergence Theorem 3.38:
Remark 3.55 Under the hypotheses of the above theorem, if μ(Ω) < ∞, by Corollary
3.52, for all r ∈ [1, p), f_k and g belong to L^r(Ω) and f_k → g in L^r(Ω). Taking
Next we will check that, for p ∈ (1, ∞), L p (Ω) is complete, by characterizing it
as a dual space.
In the sequel we will characterize the dual of L p (Ω) spaces. See also Royden [105,
Chap. 11] or Lang [68, Chap. VII].
and therefore G is a continuous linear form over L^2(Ω). By the Riesz representation theorem for Hilbert spaces, there exists a g ∈ L^2(Ω) such that G(f) = ∫_Ω g(ω)f(ω) dμ(ω), for all f ∈ L^2(Ω). We next prove that g ∈ L^∞(Ω). Let f_k be the characteristic function of the set {ω; g(ω) ≥ k}. Then G(f_k) ≥ k‖f_k‖_1 and therefore we must have f_k = 0 for large enough k. This proves that ess sup g < ∞, and by a symmetric argument we obtain that g ∈ L^∞(Ω). Since L^2(Ω) is a dense subset of L^1(Ω), it easily follows that G(f) = ∫_Ω g(ω)f(ω) dμ(ω), for all f ∈ L^1(Ω), and so the conclusion holds.
(b) When Ω = ∪_k Ω_k with μ(Ω_k) < ∞, by the previous arguments, for each k we have that G(f) = ∫_{Ω_k} g_k(ω) f(ω) dμ(ω), for all f ∈ L^1(Ω_k), with ‖g_k‖_∞ ≤ ‖G‖. We may assume that the Ω_k are nondecreasing. Then we may define g ∈ L^∞(Ω) by g(ω) = g_k(ω) for all ω ∈ Ω_k and all k. Given f ∈ L^1(Ω), let f_k(ω) = f(ω) if
Proof That f ∈ L^p(Ω) easily follows from Corollary 3.43. We check in Remark 3.59
below that, for any ε > 0, there exists a C_ε > 0 such that, for any a, b in R:

|a + b|^p − |a|^p ≤ ε|a|^p + C_ε |b|^p.   (3.70)
follows.
Remark 3.59 The inequality (3.70) can be justified as follows. For p = 1 it is trivial.
Now let p ∈ (1, ∞). If |b| > 2|a|, then

|a + b|^p − |a|^p ≤ |a + b|^p ≤ 2^p |b|^p,   (3.73)

and the desired relation holds with C_ε = 2^p. Otherwise, by the mean value theorem,
we have that, for some θ ∈ ]0, 1[:

|a + b|^p − |a|^p = p|a + θb|^{p−1} |b| ≤ 3^{p−1} p |a|^{p−1} |b|.   (3.74)
We need to discuss integrals with values in a Banach space Y. Given a measure space (Ω, F, μ), by L^0(Ω; Y) we denote the space of measurable functions over Ω with values in Y; remember that the Banach space Y is implicitly endowed with the Borel σ-algebra (the one generated by its open subsets), so that f ∈ L^0(Ω; Y) iff, for any Borel subset A of Y, f^{−1}(A) is measurable. The subspace of simple functions (with finitely many values except on a null subset of Ω) is denoted by L_{00}(Ω; Y). Simple functions can be written as f = Σ_{i=1}^n y_i 1_{A_i}, where y_i ∈ Y, and the A_i are measurable subsets of Ω with negligible intersections. We may define the integral and norm of the simple function f by

∫_Ω f(ω) dω := Σ_{i=1}^n y_i mes(A_i);   ‖f‖_{1,Y} := Σ_{i=1}^n ‖y_i‖_Y mes(A_i).   (3.76)
Note that

‖f‖_{1,Y} = ∫_Ω ‖f(ω)‖_Y dω, for all f ∈ L_{00}(Ω; Y).   (3.77)
The space L^1(Ω; Y) of (Bochner) integrable functions is obtained, as is done for the Lebesgue integral, by passing to the limit in Cauchy sequences of simple functions. If f_k is such a sequence, extracting a subsequence if necessary, we may assume that ‖f_q − f_p‖_{1,Y} ≤ 2^{−q} for any q < p, so that the series Σ_k ‖f_{k+1} − f_k‖_{1,Y} is convergent. Consider the series of terms s_k(ω) := ‖f_{k+1}(ω) − f_k(ω)‖_Y and the corresponding partial sums S_k(ω) := Σ_{j≤k} s_j(ω). By the monotone convergence theorem, S_k converges
in L^1(Ω) to some S_∞, and (being nondecreasing) converges also for a.a. ω. So, for
a.a. ω, the normally convergent sequence f_k(ω) has a limit f(ω) in Y, such that
‖f(ω) − f_k(ω)‖_Y ≤ S_∞(ω) − S_k(ω). Therefore

∫_Ω ‖f(ω) − f_k(ω)‖_Y dω ≤ ∫_Ω |S_∞(ω) − S_k(ω)| dω = o(1).   (3.79)
We define the integral and norm of f as the limits of those of the f_k. This integral
and norm of f are well defined, since they coincide for every Cauchy sequence of
simple functions having the same limit. Indeed, let f′_k be another Cauchy sequence
of simple functions for the L^1 norm, converging to f for a.a. ω. By (3.77), (3.79)
applied to f_k and f′_k, and the triangle inequality:

‖f_k − f′_k‖_{1,Y} = ∫_Ω ‖f_k(ω) − f′_k(ω)‖_Y dω
  ≤ ∫_Ω ‖f_k(ω) − f(ω)‖_Y dω + ∫_Ω ‖f(ω) − f′_k(ω)‖_Y dω = o(1),   (3.80)

the last equality being a consequence of the dominated convergence theorem; the
domination hypothesis holds since ‖f_k(ω)‖_Y ≤ ‖f_0(ω)‖_Y + S_∞(ω) and the r.h.s.
belongs to L^1(Ω).
Note that there is a version of the dominated convergence theorem for Bochner
integrals, see also Aliprantis and Border [3, Thm. 11.46]:
Proof Let g_k(ω) := ‖f_k(ω) − f(ω)‖_Y. Then g_k → 0 a.e. and g_k(ω) ≤ 2‖g(ω)‖_Y
a.e. By the (standard) dominated convergence theorem, g_k → 0 in L^1(Ω). Extracting
a subsequence if necessary, we may assume that ‖g_k‖_{L^1(Ω)} ≤ 2^{−k}. Then

‖f_q − f_k‖_{L^1(Ω;Y)} ≤ ∫_Ω ‖f_q(ω) − f(ω)‖_Y dμ(ω) + ∫_Ω ‖f(ω) − f_k(ω)‖_Y dμ(ω)
  = ∫_Ω (g_q(ω) + g_k(ω)) dμ(ω)   (3.83)

converges to 0. That is, f_k is a Cauchy sequence in L^1(Ω; Y). Being constructed
as a set of limits of Cauchy sequences, L^1(Ω; Y) is necessarily complete, and we
have seen that convergence in this space implies convergence a.e. for a subsequence.
Since f_k → f a.e., it follows that f_k → f in L^1(Ω; Y). The conclusion follows.
Example 3.63 Let Y = C(X), the space of continuous functions over the compact
metric set X, known to be separable (as a consequence of the Stone–Weierstrass
theorem). Then L^1(Ω; Y) coincides with the set of measurable functions f : Ω → Y
such that ‖f(·, ω)‖_Y is integrable, and

‖f‖_{L^1(Ω,Y)} = ∫_Ω max_{x∈X} |f(x, ω)| dμ(ω).   (3.84)
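To make formula (3.84) concrete, here is a small numerical sketch (the integrand f, the set X = [0, π], and the four-point Ω are illustrative assumptions, not from the book): it approximates the norm by sampling the inner maximum over a grid of X.

```python
import math

# Illustration (with a made-up f) of the norm formula (3.84): for Y = C(X),
# ||f||_{L^1(Omega,Y)} integrates, over omega, the sup over x of |f(x, omega)|.
# Here X = [0, pi] (sampled on a grid containing 0, pi/2, pi) and Omega is a
# four-point space with uniform weights.
X_grid = [math.pi * i / 200 for i in range(201)]
Omega = [0.5, 1.0, 1.5, 2.0]
w = [1.0 / len(Omega)] * len(Omega)

def f(x, omega):                 # hypothetical integrand, continuous in x
    return omega * math.sin(x) - 0.3 * omega ** 2

# For this f the sup is attained at x = pi/2 or at x = 0, both grid points,
# so the grid computation is exact; by hand the norm equals 0.7875.
norm = sum(wi * max(abs(f(x, om)) for x in X_grid) for om, wi in zip(Omega, w))
print(abs(norm - 0.7875) < 1e-9)
```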
where p ∈ [1, ∞], and U ⊂ Rm . We adopt the following definition of the domain of
an integral cost, valid in the context of a minimization problem.
Intuitively, we would expect that the infimum in (3.86) is obtained by the exchange
property of the minimization and integration operators, i.e.,

inf_{u∈L^p(Ω;U)} ∫_Ω f(ω, u(ω)) dμ(ω) = ∫_Ω inf_{v∈U} f(ω, v) dμ(ω).   (3.89)
This, however, raises some technical issues, the first of them being to check that
the r.h.s. integral is well-defined. We will solve this problem assuming that f is a
Carathéodory function, and then in the case when in addition the local constraint
depends on ω. We will analyze conjugate functions, also in the case of more general
convex integrands, and discuss the related problem of minimizing such an integral
subject to the constraint that some integrals of the same type are nonpositive.
proving that the r.h.s. is measurable. If ∫_Ω f(ω, u_k(ω)) dμ(ω) = −∞ for large enough
k, then the equality (3.89) holds with value −∞. Otherwise, we conclude by the
monotone convergence Theorem 3.34.
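The exchange property (3.89) can be observed directly on a finite model, where measurable selections reduce to arbitrary assignments of a value of U to each atom. In the sketch below, the integrand f, the three-point Ω, and the finite U are made-up data:

```python
import itertools

# Toy check (not from the book) of the exchange property (3.89) on a finite
# model: Omega = {0,1,2} with uniform weights, U = {-1, 0, 1}. Minimizing the
# integral over all functions u: Omega -> U equals integrating the pointwise
# infimum, because u may be chosen separately on each atom.
Omega, U = [0, 1, 2], [-1, 0, 1]
w = 1.0 / len(Omega)

def f(omega, v):                          # hypothetical integrand
    return (v - 0.4 * omega) ** 2 + 0.1 * omega

lhs = min(sum(w * f(om, u[om]) for om in Omega)
          for u in itertools.product(U, repeat=len(Omega)))
rhs = sum(w * min(f(om, v) for v in U) for om in Omega)
print(abs(lhs - rhs) < 1e-12)   # exchange of min and integral holds
```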
We next discuss a more general case where we have a constraint of the form
π_{z_0,…,z_m} C := P_{z_0} ⋯ P_{z_m} C.   (3.95)
Proof Since (ii) easily follows from (i), it suffices to prove the latter. We essentially reproduce the arguments in [102]. Let a ∈ R^m. Since P̂_a is a composition of projections, it suffices to prove that, if Γ(ω) is a measurable closed-valued multimapping, then P_a(ω) := P_a Γ(ω) is measurable. For this, consider the sequence of multimappings
Next, let D be a countable dense subset of C, which always exists. We claim that
We conclude, as in step (b) of the proof of Proposition 3.67, by the monotone con-
vergence Theorem 3.34.
Remark 3.72 If U(ω) is, for a.a. ω, a finite set, then the above minimizing sequence
u_k converges a.e. to some ū ∈ L^0(Ω), with values in U(ω). By the monotone convergence Theorem 3.34, we have that

inf_{u∈L^p(Ω;U)} ∫_Ω f(ω, u(ω)) dμ(ω) = ∫_Ω f(ω, ū(ω)) dμ(ω).   (3.103)
In the case of convex integrands (such that f (ω, ·) is, for a.a. ω, convex) we can deal
with integral functionals using the following result.
Lemma 3.73 Let g be a proper, l.s.c. convex function Rm → R̄, and E be a dense
subset of dom(g). Then, for all y ∈ dom(g), we have that
Proof Denote by ĝ(y) the r.h.s. of (3.104). Since g is l.s.c., g(y) ≤ ĝ(y). We next
prove the opposite inequality. Changing if necessary R^m into the affine space spanned
by dom(g), we may assume that the latter has a nonempty interior. We know that g
is continuous over the interior of its domain. Since E is a dense part of dom(g), if
y ∈ int(dom(g)), then (3.104) holds. Hence, given y ∈ dom(g), take y_0 ∈ int(dom(g))
and set y_t := (1 − t)y_0 + t y; then y_t ∈ int(dom(g)) for t ∈ [0, 1), and

ĝ(y) ≤ lim sup_{t↑1} g(y_t) ≤ lim sup_{t↑1} ((1 − t)g(y_0) + t g(y)) = g(y),   (3.106)

as was to be shown.
Proof The proof is similar to that of Proposition 3.67, with here U(ω) = R^m. Given
a dense sequence (a_k) in R^m, set

u_k(ω) := u_{k−1}(ω) if f(ω, u_{k−1}(ω)) ≤ f(ω, a_k), and u_k(ω) := a_k otherwise.   (3.107)

Then u_k is measurable, and lim_k f(ω, u_k(ω)) = inf_{v∈R^m} f(ω, v), in view of
Lemma 3.73.
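The recursion (3.107) is effectively an algorithm: scan a dense sequence and keep, at each ω, the best candidate seen so far. A minimal numerical sketch (the integrand, the sample points of Ω, and the dyadic dense sequence are illustrative choices, not from the book):

```python
# A computational reading of the construction (3.107): scanning a dense
# sequence (a_k) and keeping, at each omega, the best candidate seen so far
# yields functions u_k with f(omega, u_k(omega)) decreasing to the pointwise
# infimum. Here f(omega, v) = (v - omega)^2, whose argmin is v = omega, and
# a_k enumerates dyadic rationals in [0, 1] (dense in the relevant range).
def f(omega, v):
    return (v - omega) ** 2

a = [i / 2 ** level for level in range(12) for i in range(2 ** level + 1)]

Omega = [0.1, 0.35, 0.77]               # a few sample points of Omega
u = {om: a[0] for om in Omega}          # u_0 identically a_0
for ak in a[1:]:
    for om in Omega:                    # keep a_k wherever it improves f
        if f(om, ak) < f(om, u[om]):
            u[om] = ak

print(all(abs(u[om] - om) < 1e-3 for om in Omega))  # near the true argmin
```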
We next deal with the more general situation when dom( f (ω, ·)) may have an
empty interior.
Proof The proof is an easy variant of that of Proposition 3.74. The details are left to
the reader.
Remark 3.77 The difficulty here is to check the existence of a Castaing representa-
tion in Definition 3.75. If dom( f (ω, ·)) is a closed-valued measurable multimapping,
this follows from Proposition 3.70. If f is a convex integrand and dom( f (ω, ·)) has
a nonempty interior a.e., a Castaing representation is the sequence u k constructed in
the proof of Proposition 3.74.
As we have seen, when p ∈ [1, ∞), the dual of L p (Ω)m is L q (Ω)m , where 1/ p +
1/q = 1, and when p = ∞, its dual contains L 1 (Ω)m . Let U (·) be a measurable
with domain

dom(F_U) := dom(F) ∩ L^p(Ω; U).   (3.111)

This amounts to minimizing the integral of ω ↦ f(ω, u(ω)) − u*(ω)·u(ω) over
L^p(Ω; U). The latter integrand is a Carathéodory function (resp. a normal convex integrand) iff the same
holds for f(ω, u). Set
Corollary 3.79 Let f , p and u ∗ be as in Proposition 3.78, and let F have a finite
value at u. Then u ∗ ∈ ∂ FU (u) iff u ∗ (ω) ∈ ∂ fU (ω, u(ω)) a.e.
and u ∗ ∈ ∂ FU (u) iff equality holds, i.e., iff the above integrand is equal to 0 a.e. The
conclusion follows.
Proposition 3.80 Let f : Ω × R^m → R̄ be a normal convex integrand. Let p ∈
[1, ∞], and u* ∈ L^q(Ω)^m. If dom(F) ≠ ∅, then

F*(u*) = ∫_Ω f*(ω, u*(ω)) dμ(ω).   (3.117)
Corollary 3.81 Let f , p and u ∗ be as in Proposition 3.80, and let F have a finite
value at u. Then u ∗ ∈ ∂ F(u) iff u ∗ (ω) ∈ ∂ f (ω, u(ω)) a.e.
Proof The argument is similar to the one in the proof of Corollary 3.79.
Example 3.82 We extend Example 1.38 to the present setting as follows. Take
f(x) := |x|^p / p with p > 1. Then for u* ∈ L^q(Ω), with 1/p + 1/q = 1, we have
that

F*(u*) = (1/q) ∫_Ω |u*(ω)|^q dμ(ω).   (3.118)
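The pointwise identity behind (3.118), namely f*(s) = |s|^q/q for f(x) = |x|^p/p, can be checked by brute force; in the sketch below, the grid bounds and the test points are arbitrary choices:

```python
# Numerical check (not from the book) of the pointwise conjugate behind
# (3.118): for f(x) = |x|^p / p, one has f*(s) = sup_x (s x - |x|^p / p)
# = |s|^q / q, with 1/p + 1/q = 1. The supremum is approximated on a grid.
p = 3.0
q = p / (p - 1.0)
grid = [i / 1000.0 for i in range(-4000, 4001)]   # x in [-4, 4], step 1e-3

def conj(s):
    return max(s * x - abs(x) ** p / p for x in grid)

for s in [-1.5, 0.0, 0.7, 2.0]:
    assert abs(conj(s) - abs(s) ** q / q) < 1e-3
print("f*(s) matches |s|^q / q on the test points")
```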
Proof The second relation follows from Proposition 3.80; let us prove the first one.
For all α < F*(u*_1), there exists a u_α ∈ dom(F) such that

α < ⟨u*_1, u_α⟩ − ∫_Ω f(ω, u_α(ω)) dμ(ω).   (3.120)
For all β < σ(u*_s, dom(F)), there exists a u_β ∈ dom(F) such that ⟨u*_s, u_β⟩ > β. Let
A_k be as in the definition of the singular part of an element of L^∞(Ω)*. Set

u_{α,β,k}(ω) := u_α(ω) if ω ∈ A_k, and u_{α,β,k}(ω) := u_β(ω) otherwise.   (3.121)
Then

⟨u*_s, u_{α,β,k}⟩ = ⟨u*_s, 1_{Ω∖A_k} u_β⟩ = ⟨u*_s, u_β⟩ > β.   (3.122)
u_{α,β,k}(ω) → u_α(ω) and f(ω, u_{α,β,k}(ω)) → f(ω, u_α(ω)) a.e., when k ↑ ∞.   (3.123)
So, by the dominated convergence theorem (note that, by the definition of u_α and
u_β, f(ω, u_α(ω)) and f(ω, u_β(ω)) are integrable), we get

lim_k ∫_Ω ( u*_1(ω) · u_{α,β,k}(ω) − f(ω, u_{α,β,k}(ω)) ) dμ(ω)
  = ∫_Ω ( u*_1(ω) · u_α(ω) − f(ω, u_α(ω)) ) dμ(ω),   (3.124)

and so

F*(u*) ≥ lim_k ( ⟨u*_1 + u*_s, u_{α,β,k}⟩ − ∫_Ω f(ω, u_{α,β,k}(ω)) dμ(ω) ) > α + β.   (3.125)
and u ∗ ∈ ∂ F(u) iff the sum equals 0, i.e., iff both the integral and
Consider now the case when the decision x ∈ Rm should not depend on ω. We have
to minimize
f̄(x) := ∫_Ω f(ω, x) dμ(ω),   (3.132)
it follows from Fatou's Lemma 3.41 and the a.s. lower semicontinuity of f(ω, ·) that f̄ is l.s.c. Given
a nonempty closed, convex subset K of R^m, consider the problem
We are in the framework of the Fenchel duality theory in Example 1.2.1.8. The
expression of the stability condition (1.203) becomes
Definition 3.88 Let y*_s be a singular element of L^∞(Ω, R^m)*. Let 1 denote the
constant function of L^∞(Ω) with value 1. Then we define the expectation of y*_s by,
for i = 1, …, m:

(E y*_s)_i = ⟨y*_{s,i}, 1⟩.   (3.137)
Lemma 3.89 (i) If y* ∈ L^q(Ω, R^m), then A⊤y* = ∫_Ω y*(ω) dμ(ω) = E y*.
(ii) When p = ∞, if y* ∈ L^∞(Ω, R^m)* has the decomposition y* = y*_1 + y*_s, with
y*_1 ∈ L^1(Ω, R^m), and y*_s singular with components denoted y*_{s,i}, i = 1, …, m, we
have

A⊤y* = E y*_1 + E y*_s.   (3.138)

Point (ii) follows from ⟨y*, Ax⟩ = ∫_Ω y*_1(ω) dμ(ω) · x + Σ_{i=1}^m x_i ⟨y*_{s,i}, 1⟩.
Proposition 3.90 Let f be a normal convex integrand such that f̄ has a finite value
at x and that (3.134), (3.135) hold. Then, when p ∈ [1, ∞[:

∂f̄(x) = { ∫_Ω x*(ω) dμ(ω); x* ∈ L^q(Ω); x*(ω) ∈ ∂f(ω, x) a.e. },   (3.140)
Remark 3.92 The sum in the first line of (3.141) reduces to E(x*_1 + x*_s)
(where the expectation of the sum is defined as the sum of the expectations, which is
correct since the decomposition is unique).
We consider a more general situation where the decision is a Banach space different
from L p (Ω), having in mind the case when ω is a vector and the decision might
depend on some components of the vector. So let X be a Banach space and let
A ∈ L(X, L p (Ω)). Given a closed convex subset K of X , we consider the problem
where F(y) := ∫_Ω f(ω, y(ω)) dμ(ω), and f̄(x) := F(Ax). We assume that the following stability condition (similar to (3.135)) holds:

ε B_Y ⊂ dom(F) − AK.   (3.142)
Proposition 3.93 Let f be a normal convex integrand such that f̄ has a finite value
at x, and that (3.134) and (3.142) hold. Then (i) when p ∈ [1, ∞[:

∂f̄(x) = { A⊤x*; x* ∈ L^q(Ω); x*(ω) ∈ ∂f(ω, x) a.e. },   (3.143)

(ii) If x is a solution of (P), say when p = ∞, we have, with x*_1 and x*_s as above, that
Let us consider the following problem of linear programming with simple recourse:

Min_{x,y} c · x + E_ω[d_ω · y_ω];  x ∈ R^n_+;  A^0 x ≤ b^0;  y_ω ∈ R^m_+;  A^ω y_ω = b^ω + M^ω x, a.s.   (3.148)

Min_{y_ω} d_ω · y_ω;  y_ω ∈ R^m_+;  A^ω y_ω = b^ω + M^ω x.   (P_ω(x))

Max_{λ^ω} −λ^ω · (b^ω + M^ω x);  d_ω + (A^ω)⊤ λ^ω ≥ 0.   (D_ω(x))

Since its feasible set does not depend on x, it is natural to assume that it is nonempty
a.s. (otherwise (3.148) would have infimum −∞ whenever it is feasible). Denote
by v_ω(x) the value of problem (P_ω(x)). By linear programming duality theory
(Lemma 1.26), we have that
Proof (i) It is easily checked that vω (x) is a.s. convex. Since vω (x) = val(Dω (x))
a.s., the latter being a supremum of affine functions of x, it is also l.s.c.
(ii) Let (y^k) be a dense sequence in R^m_+. Let |·|_1 denote the ℓ^1 norm in finite-dimensional spaces.

the infimum being attained a.e. since it corresponds to the value of a linear program.
Therefore, ϕ_ω(x) = 0 iff x ∈ dom v_ω, and dom v_ω = ϕ_ω^{−1}(0) is a.s. nonempty. In
addition, let the sequence x^j → x be such that ϕ_ω(x^j) → 0. Then there exists a
sequence y^j ∈ R^m_+ such that |A^ω y^j − b^ω − M^ω x^j|_1 → 0. It follows that |A^ω y^j −
b^ω − M^ω x|_1 → 0, that is,
Set

G_ω := {(y_ω, x) ∈ R^m_+ × R^n_+; A^ω y_ω − b^ω − M^ω x = 0}.   (3.153)

Minimizing the r.h.s. over y_ω ∈ R^m_+, we obtain
Now it is enough to check that, for any bounded closed subset C of R^n, (dom v)^{−1}(C)
is measurable. Let (c^k) be a dense sequence in C. We claim that
There exist ε > 0 and x̂ ∈ R^n such that B(x̂, ε) ⊂ dom v_ω a.s., x̂ > 0 and A^0 x̂ < b^0.
(3.159)
Define F : L ∞ (Ω)n → R̄ by F(z) := Eω vω (z ω ). We recall the definition of expec-
tation of elements of L ∞ (Ω, Rn )∗ given in Definition 3.88.
Proof Denote by ∂v_ω(x) the partial subdifferential of v_ω(x) w.r.t. x. By Lemma 1.55,
we have that

∂v_ω(x) = −(M^ω)⊤ S(D_ω(x)) a.s.   (3.161)
We conclude by (3.161).
Let (Ω, F, μ) be a probability space. We assume that μ is non-atomic, i.e., for any
A ∈ F with μ(A) > 0, there exists a B ∈ F, B ⊂ A, such that 0 < μ(B) < μ(A).
This is known to be equivalent to the Darboux property:3

for all A ∈ F and all α ∈ (0, 1), there exists a B ∈ F, B ⊂ A, such that μ(B) = αμ(A).   (3.165)
3 For a proof of the Darboux property, based on Zorn’s lemma, see [3, Theorem 10.52, p. 395].
Set

A := ∫_Ω F,  A^c := ∫_Ω conv F,  S^p := { α ∈ R^p_+; Σ_i α_i = 1 },   (3.170)

and

S^c_I := { α ∈ L^∞(Ω)^p; α(ω) ∈ S^p; α_i(ω) = 0, i ∉ I(ω), a.e. },   (3.171)

S_I := { α ∈ S^c_I; α_i(ω) ∈ {0, 1} a.e. }.   (3.172)
Proposition 3.97 Let (3.168) hold. Then A is equal to A^c, is convex and compact,
and any x ∈ A has the following representation:

x = Σ_{i=1}^p ∫_Ω α_i(ω) f_i(ω) dμ(ω), for some α ∈ S_I.   (3.173)
Proof (a) By Theorem 3.96, A is convex; since the f_i are integrable, it is bounded.
Let f be an integrable selection of F. Set

E_i := { ω ∈ Ω; i ∈ I(ω); ω ∉ E_j, j < i; f_i(ω) = f(ω) }.   (3.175)
By Remark 3.72, there exists an α ∈ S that reaches the infimum in (3.176). Set
We have proved that Acλ contains an element of Aλ . On the other hand, since Ac
is bounded, to any nonzero λ ∈ R p is associated some x̄ ∈ ∂ Ac such that (3.176)
holds. Therefore A and Ac have the same support function, and since these sets are
convex, they have the same closure.4 So it suffices to prove that A is closed, which
is equivalent to the equality Acλ = Aλ , for any pair (x̄, λ) as above.
We conclude by an induction argument over the dimension of A. If A is one-
dimensional, then Acλ = {x̄}, and since it contains one point in A, it follows that
Acλ = Aλ . Let the result hold when the dimension of A is q − 1, for q ≥ 2, and let
A have dimension q. Define
4 The indicatrix function of a nonempty closed convex set is l.s.c. convex, and hence equal to its
biconjugate. Since the conjugate of the indicatrix is the corresponding support function, two closed
convex sets having the same associated support function are equal.
with the convention that F_0(u) is equal to +∞ if f_0(ω, u(ω))_+ is not integrable. We
assume (for the sake of simplicity) that, for any u ∈ L^p(Ω; U), f_i(ω, u(ω)) is integrable,
i = 0, …, q. The Lagrangian of the problem, L : L^p(Ω)^m × R^{q*} → R̄, is defined by

L(u, λ) := F_0(u) + Σ_{i=1}^q λ_i F_i(u).   (3.181)
By Theorem 3.96, this set is convex; its components are indexed from 0 to q. We
may rewrite the primal problem in the form

Min_{e∈E} e_0;  e_{1:q} ∈ K,

where e_{1:q} ∈ R^q has components e_1 to e_q, and set E_{1:q} := { e_{1:q}; e ∈ E }. The primal
problem is feasible iff 0 ∈ E_{1:q} − K, and the stability condition (1.170) of perturbation duality is

ε B ⊂ E_{1:q} − K, for some ε > 0.   (3.183)
Theorem 3.98 Let (3.183) hold. Then val(P I ) = val(D I ), and ū ∈ S(P I ) iff there
exists a λ ∈ N K (F(ū)) such that
Proof The convex sets E and E′ := (−∞, val(P_I)) × K are disjoint, since any
point in the intersection is the image of a feasible u ∈ L^p(Ω) with cost function
lower than the value of (P_I). By Corollary 1.21, we can separate E and E′, i.e., there
exists a nonzero pair (β, λ) ∈ R × R^{q*} such that
Minimizing the r.h.s. over (k, u) we obtain that val(P I ) ≤ d(λ) ≤ val(D I ). Since
the converse inequality obviously holds, the primal and dual values are equal.
Assume now that ū ∈ S(P_I). Then, by (3.186),

L(ū, λ) − λ F_{1:q}(ū) = val(P_I) ≤ L(u, λ) − λ k, for all (k, u) ∈ K × L^p(Ω),   (3.187)

or equivalently

0 ≤ inf_{u∈L^p(Ω)} ( L(u, λ) − L(ū, λ) ) + inf_{k∈K} λ( F_{1:q}(ū) − k ).   (3.188)
Taking u = ū and k = F1:q (ū), we see that each infimum is nonpositive. Therefore
they are both equal to zero, i.e., λ ∈ N K (F(ū)) and (3.184) holds. Conversely, if
ū ∈ dom(F) is such that λ ∈ N K (F(ū)) and (3.184) holds, then for all u ∈ dom(F):
Remark 3.99 While problem (P_I) is not convex in general (for instance, its cost
function is not convex), we have been able to reformulate it as a convex problem. Set
f[λ](ω, u) := f_0(ω, u) + λ · f_{1:q}(ω, u). Since the Lagrangian is itself an integral, to
which Proposition 3.71 applies, we deduce that, under the hypotheses of the above
theorem, if ū ∈ S(P_I), then
Exercise 3.100 Discuss the case when Ω = [0, 1], the integrands f i do not depend
on ω and are polynomials of degree at most n, and U (ω) = R.
We have observations

∫_Ω a_i(ω) u(ω) dω = b_i,  i = 1, …, N.   (3.192)
The strictly convex l.s.c. function Ĥ(x) has domain R_+, with value 0 at 0. We have in
view cases when u is a probability density, and so we assume that a_1(ω) = 1. In the
crystallographic applications that we have in mind, u(ω) is the probability density
for atoms to be at position ω, and the observations correspond to the computation of
Fourier modes, see [39]. The problem to be considered is
where (Au)_i := ∫_Ω a_i(ω) u(ω) dω. The cost function is obviously convex, and is l.s.c.
in view of Fatou's Lemma 3.41 (where we can take g(ω) = −c, c being the maximum
of −Ĥ). So, the Fenchel duality framework is applicable. Set

Ĥ_λ(ω, v) := Ĥ(v) + Σ_{i=1}^N λ_i a_i(ω) v.   (3.195)
Since Ĥ (v) ≥ −c, the primal value is not less than −c|Ω|. So, the primal problem
has a finite value. By (3.199), the primal and dual values are equal and the set of dual
solutions is nonempty and bounded. Let λ̄ be a dual solution. Then u ∈ dom Ĥ is a
primal solution iff it satisfies the optimality condition
It follows that

Ĥ(ū(ω)) = −(a(ω) · λ̄ + 1) e^{−a(ω)·λ̄ − 1},   (3.202)
Example 3.102 Consider the particular case of the previous example when N = 1
and the constraint states that u is a probability density, i.e., a(ω) = 1 a.e. and K = {1}. Then the
dual cost is −|Ω| e^{−λ−1} − λ, which attains its maximum when |Ω| e^{−λ−1} = 1, i.e.,
for λ̄ = log|Ω| − 1; the optimal density is ū = e^{−λ̄−1} = 1/|Ω|, as expected (the
uniform law maximizes the entropy).
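This closed-form computation is easy to confirm numerically; in the sketch below (not from the book), the value of |Ω| is an arbitrary choice:

```python
import math

# Check (not from the book) of Example 3.102: with a(omega) = 1 and the
# entropy-type cost, the dual function is d(lambda) = -|Omega| e^{-lambda-1}
# - lambda; its maximizer and the associated primal density are computed.
vol = 2.5                                  # |Omega|, an arbitrary choice

def d(lam):
    return -vol * math.exp(-lam - 1.0) - lam

lam_bar = math.log(vol) - 1.0              # claimed maximizer
eps = 1e-4                                 # crude first-order check
assert d(lam_bar) >= d(lam_bar - eps) and d(lam_bar) >= d(lam_bar + eps)

u_bar = math.exp(-lam_bar - 1.0)           # optimal density
assert abs(u_bar - 1.0 / vol) < 1e-12      # the uniform density 1/|Omega|
assert abs(u_bar * vol - 1.0) < 1e-12      # integrates to one
print("dual maximizer log|Omega| - 1 recovers the uniform density")
```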
Example 3.103 (Phase transition models, see [80]) Let f : R → R, f(u) := u(1 − u),
and let Ω be a measurable subset of R^n. We choose the function space X :=
L^p(Ω), p ∈ [1, ∞). For u ∈ X, set F(u) := ∫_Ω f(u(ω)) dμ(ω), where μ is the
Lebesgue measure. Consider the problem of minimizing F(u) under the constraints
u(ω) ∈ U a.e., U := [0, 1], and ∫_Ω u(ω) dμ(ω) = a, a ∈ (0, mes(Ω)).
Given λ ∈ R, the Lagrangian of this problem is

L(u, λ) := F_U(u) + λ( ∫_Ω u(ω) dμ(ω) − a ) = ∫_Ω ( f(u(ω)) + λ u(ω) ) dμ(ω) − λa.   (3.204)
We compute

f_U^*(z) := sup_{u∈U} ( uz − u(1 − u) ).   (3.206)
Clearly it attains its maximum at λ̄ = 0, and so, the primal and dual values are equal,
although the problem is nonconvex.
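The conjugate (3.206) can be evaluated by brute force. Since uz − u(1 − u) is convex in u, the supremum over U = [0, 1] is attained at an endpoint, which gives f_U^*(z) = max(0, z); the sketch below (grid size is an arbitrary choice) confirms this:

```python
# Numeric evaluation (not from the book) of (3.206): since u*z - u*(1-u) is
# convex in u (second derivative 2 > 0), its supremum over U = [0, 1] is
# attained at an endpoint, giving f_U^*(z) = max(0, z) (values 0 at u = 0
# and z at u = 1).
grid = [i / 1000.0 for i in range(1001)]       # u in [0, 1]

def f_U_star(z):
    return max(u * z - u * (1.0 - u) for u in grid)

for z in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    assert abs(f_U_star(z) - max(0.0, z)) < 1e-12
print("f_U^*(z) = max(0, z)")
```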
Example 3.104 This example illustrates how singular multipliers occur in optimality
systems. Consider the problem
We choose ℓ^∞ (the space of bounded sequences) as the constraint space and denote
by 1 and b the sequences with generic terms 1 and 1/(k + 1), respectively. Thus we
are considering the problem

Min_{x∈R} x;  x + 1/(k + 1) ≥ 0,  k = 0, 1, …,   (3.210)

where we have used the natural order relation for sequences. Let K = ℓ^∞_+ be the
convex cone of elements of ℓ^∞ with nonnegative components, and let A : R → ℓ^∞,
Ax := x1. The constraint can be written as Ax + b ∈ K. The duality Lagrangian is
The problem is convex and the stability condition obviously holds, and so, primal
and dual values are equal. The optimality condition is, in view of the dual constraint:
We now use the structure of the elements of (ℓ^∞)*. Any λ ∈ (ℓ^∞)* can be uniquely
decomposed as λ = λ_1 + λ_s, where λ_1 ∈ ℓ^1 and the singular part λ_s depends only
on the behavior at infinity. For any y ∈ K, we have that ⟨λ, y⟩ ≤ 0.
Taking, for i ∈ N, y = e^i (the sequence with all components equal to 0 except the
ith, equal to 1), we obtain that λ_1 ∈ K^−. Then let y ∈ K. Denote by y^N the sequence
whose N first terms are zero, the others being equal to those of y. We have that
⟨λ_1, y^N⟩ = o(1) and ⟨λ_s, y^N⟩ = ⟨λ_s, y⟩. Since λ ∈ K^−, we deduce that λ_s ∈ K^−.
Finally take y = e^k/(k + 1). Then x1 + b ± y ∈ K, and therefore 0 ≥ ⟨λ, ±y⟩ =
±λ_{1,k}/(k + 1), proving that λ_{1,k} = 0, and therefore λ_1 = 0. In view of the dual constraint,
it follows that λ_s ≠ 0.
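A finite truncation makes the mechanism visible. In the sketch below (a reformulation for illustration, not from the book), the problem with only its first N constraints has a unique multiplier concentrated on the last index; as N grows, the multiplier mass escapes to infinity, so the limit cannot carry any ℓ^1 part:

```python
# Illustration (not from the book) of why the multiplier in Example 3.104 must
# be singular: truncating the problem to its first N constraints,
#     min x  s.t.  x + 1/(k+1) >= 0,  k = 0,...,N-1,
# the solution is x_N = -1/N, only the last constraint is active, and the
# unique multiplier is the unit vector e_{N-1} (total mass 1 from the
# stationarity of x - sum_k lam_k (x + 1/(k+1))). Componentwise, these
# multipliers converge to 0: the weak-* limit has no l^1 part.
def truncated(N):
    x = max(-1.0 / (k + 1) for k in range(N))        # x_N = -1/N
    lam = [0.0] * N
    lam[N - 1] = 1.0                                 # only active constraint
    return x, lam

for N in [1, 5, 50]:
    x, lam = truncated(N)
    assert abs(x + 1.0 / N) < 1e-12                  # x_N = -1/N -> 0
    assert abs(sum(lam) - 1.0) < 1e-12               # stationarity: mass 1
    assert lam.index(1.0) == N - 1                   # mass on the last index
print("multiplier mass drifts to infinity as N grows")
```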
3.5 Notes
For complements on Sect. 3.1 (measure theory) we refer to e.g. Malliavin [77]. The
integral functionals discussed in Sect. 3.2 were studied in Rockafellar [96, 98, 99];
see also Castaing and Valadier [33], Aubin and Frankowska [12]. The proof of the
Shapley–Folkman theorem is taken from Zhou [127]. We use it to prove the convexity
of integrals of multimappings. See Tardella [119] and its references on the Lyapunov
theorem. Maréchal [79] introduced useful generalizations of the perspective function.
Chapter 4
Risk Measures
Summary Minimizing an expectation gives little control of the risk of a reward that is
far from the expected value. So, it is useful to design functionals whose minimization
will allow one to make a tradeoff between the risk and expected value. This chapter
gives a concise introduction to the corresponding theory of risk measures. After an
introduction to utility functions, the monetary measures of risk are introduced and
connected to their acceptance sets. Then the cases of the deviation and semi-deviation,
as well as of the (conditional) value at risk, are discussed.
4.1 Introduction
4.2.1 Framework
Note that classical economic theory deals with gain maximization and (often con-
cave) utility functions. However, since we choose to analyze minimization problems,
we will use disutility functions (which will be the opposite of the utility functions).
Definition 4.2 Let (Ω, F , μ) be a probability space and let s ∈ [1, ∞]. The prefer-
ence function associated with the disutility function u is the function U : L s (Ω) → R̄
defined by
U (y) := E[u(y)], (4.1)
with domain
D(U ) = {x ∈ L s (Ω); u(x) is measurable and E[max(u(x), 0)] < +∞}. (4.2)
This expresses the preference for getting the mean value of a random variable rather
than the variable itself.
Definition 4.4 A certainty equivalent price (also called “utility equivalence price”)
of y ∈ D(U ) is defined as an amount α ∈ R such that
so that

ce(y + a) = log(e^a U(y)) = a + log(U(y)) = a + ce(y).   (4.6)
So, in the case of the exponential utility function, the certainty equivalent price
satisfies the relation of translation invariance: ce(y + a) = a + ce(y).
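The identity (4.6) can be verified by simulation; in the sketch below (not from the book), the Gaussian sample standing in for y is an arbitrary choice:

```python
import math, random

# Monte Carlo check (not from the book) of (4.6): for the exponential
# disutility u(y) = e^y, the certainty equivalent ce(y) = log E[e^y] is
# translation invariant: ce(y + a) = a + ce(y), exactly (not just up to
# sampling error), since E[e^{y+a}] = e^a E[e^y].
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(20000)]

def ce(shift):
    return math.log(sum(math.exp(y + shift) for y in sample) / len(sample))

a = 0.7
assert abs(ce(a) - (a + ce(0.0))) < 1e-9
print("ce(y + a) = a + ce(y)")
```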
We now interpret y as the gain of a portfolio that can be combined with other random
variables, called free assets. If a financial asset z, an element of L s (Ω), has price pz
on the market, then the asset z − pz has a zero price. These market prices should not
be confused with utility indifference prices that apply to assets that are not priced in
the market. So if z 1 , . . . , z n are zero value assets, for any θ ∈ Rn , we may choose to
have the portfolio
y(θ ) = y + θ1 z 1 + · · · + θn z n . (4.7)
Assuming that the above function of θ is differentiable, and that the rule for differentiating under the expectation holds, we see that the optimality condition of this
problem is

0 = ∂U[y(θ)]/∂θ_i = E_μ[u′(y(θ)) z_i],  i = 1, …, n.   (4.9)

Assume that u′(·) is positive everywhere. Let the random variable η_θ be defined by

η_θ := u′(y(θ)) / E_μ[u′(y(θ))].   (4.10)
Being positive and with unit expectation, η_θ is the density of the equivalent probability
measure μ_θ such that dμ_θ = η_θ dμ. We may write the optimality condition (4.9) as

0 = E_{μ_θ}[z_i].   (4.11)

In other words, optimal portfolios are those for which the financial assets have zero
expectation under the associated probability μ_θ. In such a case, we say that μ_θ is
a risk-neutral probability.
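On a finite probability space the whole computation is elementary. The sketch below (all data are made-up, and the exponential disutility u(y) = e^y is an illustrative choice) solves (4.9) for a single asset by bisection and checks (4.11):

```python
import math

# Finite-state illustration (not from the book) of (4.9)-(4.11): with the
# exponential disutility u(y) = e^y, solve E[u'(y + theta*z) z] = 0 for theta
# by bisection (the l.h.s. is increasing in theta), then check that z has
# zero mean under the adjusted probability d mu_theta = eta_theta d mu.
w = [0.25, 0.25, 0.25, 0.25]             # uniform mu on four states
y = [1.0, 0.2, -0.5, 0.3]                # initial outcome (made-up)
z = [1.0, -1.0, 2.0, -2.0]               # zero-price asset (made-up)

def foc(theta):                           # first-order condition E[u'(y(theta)) z]
    return sum(wi * math.exp(yi + theta * zi) * zi for wi, yi, zi in zip(w, y, z))

lo, hi = -10.0, 10.0                      # foc(lo) < 0 < foc(hi) here
for _ in range(200):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if foc(mid) < 0 else (lo, mid)
theta = 0.5 * (lo + hi)

up = [math.exp(yi + theta * zi) for yi, zi in zip(y, z)]   # u'(y(theta))
mean_up = sum(wi * ui for wi, ui in zip(w, up))
eta = [ui / mean_up for ui in up]                          # density eta_theta
assert abs(sum(wi * ei for wi, ei in zip(w, eta)) - 1.0) < 1e-9   # unit mass
assert abs(sum(wi * ei * zi for wi, ei, zi in zip(w, eta, z))) < 1e-6
print("z is centered under the risk-neutral probability mu_theta")
```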
Remark 4.7 If short positions are forbidden, meaning that we have the constraint
θ ≥ 0, then the optimality conditions may be expressed as
This will hold if s = 1 and u satisfies a linear growth condition, since in this case
dom(U) = Y. Since f* = σ_K, and U*(·) = E(u*(·)) by Proposition 3.80, the expression of the dual problem is, assuming s ∈ [1, ∞) and 1/s + 1/s′ = 1:
Lemma 4.10 (i) The set of monetary measures of risk is convex, and invariant under
the addition of a constant (translation).1 (ii) Let f_i, i ∈ I, be a family of MMRs. If the
Proof The proof of (i)–(ii) being immediate, it suffices to prove (iii). Let x and
y be in X , and M := supω |x(ω) − y(ω)|. Then x ≥ y − M. By monotonicity and
translation invariance, ρ(x) ≥ ρ(y − M) = ρ(y) − M. Exchanging x and y, we
obtain the converse inequality.
We recall (see Sect. 1.3.4, Chap. 1) that the infimal convolution of a family f i of
extended real-valued functions over X , i ∈ I finite, is defined as
(i∈I f i ) (x) := inf f i (xi ); xi = x . (4.18)
i∈I i∈I
Lemma 4.11 The infimal convolution of a finite family of monetary measures of risk
is, whenever it is finite-valued, a monetary measure of risk.
Proof Let the f_i be extended real-valued functions over X, for i = 1, …, n. Then

(□_{i∈I} f_i)(x) = inf_{x_1,…,x_{n−1}} [ Σ_{1≤i≤n−1} f_i(x_i) + f_n( x − Σ_{1≤i≤n−1} x_i ) ].   (4.19)
Taking the supremum over x, we obtain ρ*(Q) ≥ ρ*(Q) − ⟨Q, y⟩. Since ρ*(Q) >
−∞ and ⟨Q, y⟩ < 0, this implies ρ*(Q) = +∞. If on the other hand ⟨Q, 1⟩ = 1,
by translation invariance, we get
In other words, ρ(x) is the smallest constant reduction of losses that allows one to
get a nonpositive risk. Acceptance sets satisfy:

(i) A − X_+ ⊂ A;
(ii) for all x ∈ X, ρ(x) := min{ α ∈ R; x − α1 ∈ A } is finite.   (4.26)
Lemma 4.13 A monetary measure of risk ρ is convex iff its associated acceptance
set is convex.
⟨Q, 1⟩ ρ(x) ≥ ⟨Q, x⟩ − α.   (4.29)
We now show the converse inclusion by checking that conv(A) satisfies the axioms
of an acceptance set. Condition (4.26)(i) is a consequence of the one satisfied by A,
and for (4.26)(ii), set
Since A ⊂ conv(A), we have that r(x) ≤ ρ(x). The affine minorant ℓ(x) := ⟨Q, x⟩
− α of ρ(x) is such that ℓ(x) ≤ 0 for all x ∈ A, and hence for all x ∈ conv(A).
So, x − γ1 ∈ conv(A) implies α ≥ ⟨Q, x − γ1⟩ = ⟨Q, x⟩ − γ, and so γ ≥
⟨Q, x⟩ − α. Therefore, the infimum in (4.31) is finite and, as conv(A) is closed, is
attained. The associated MMR r is an l.s.c. convex minorant of ρ, and so r ≤ conv(ρ).
But, since the mapping ρ ↦ A_ρ is nonincreasing, the converse inclusion in (4.31)
holds. The conclusion follows.
This model will illustrate the above concepts. It involves two agents, the issuer A and
the buyer B. An asset F is to be sold to the buyer at a price π to be determined. Initially,
A and B have outcome functions X and Y , in the Banach space X , and assess risk
with risk measures ρ A and ρ B . The buyer will find the transaction advantageous if
ρ B (Y + F + π ) ≤ ρ B (Y ). (4.32)
In view of the translation invariance property, the best (highest) price is π(F) :=
ρ B (Y ) − ρ B (Y + F). The financial product F minimizing the risk of the issuer in a
class F is then the solution of
Using again the translation invariance property, we obtain the equivalent problem
Min ρ A (X − F) + ρ B (Y + F) − ρ B (Y ). (4.34)
F∈F
is, for all c ≥ 0, convex and continuous, translation invariant, and satisfies
If c ∈ [0, 1/2], any z ∈ ∂ρ1 (0) is nonnegative and has unit expectation. We deduce
that:
Lemma 4.15 For c ∈ [0, 1/2], the function ρ1 (x) defined in (4.38) is a convex MMR.
4.3.5.1 Semi-deviation
As in the case of the deviation function, since E(y) ≤ 1 when y ∈ (Bq )+ the subd-
ifferential of Φ p is a.s. greater than or equal to −1. We deduce the following result:
Lemma 4.17 For p ∈ [1, ∞) and c ∈ [0, 1], the following function is a convex
MMR:

ρ̂_p(x) := E(x) + c Φ_p(x).   (4.43)

Remark 4.18 The function ρ_p^+ is of practical interest since it penalizes losses, and
not gains, w.r.t. the average revenue.
Risk models often involve a constraint on the probability that losses are no more than
a given level. Denote by
H_X(a) := P[X ≤ a]   (4.44)

the cumulative distribution function (CDF) of the real random variable X. This is a
nondecreasing, right-continuous function with limit 0 at −∞ and 1 at +∞.
Setting H_X(a⁻) := lim_{b↑a} H_X(b), we have that
an α quantile. Having in view the minimization of losses, we define the value at risk
of level α ∈]0, 1[ as
Since the acceptance set is nonconvex, the value at risk is also nonconvex.
where X is a Banach space. Let us see how to compute a convex function G(X )
such that G(X ) > 0 if VaRα (X ) > 0; the related problem
might be easier to solve, and its value will provide an upper bound for that of (4.50).
Observe that, for any γ > 0:
Lemma 4.19 Assume that E|X| is a continuous function over X. Then CVaR_α is
a continuous, convex risk measure.
Proof Clearly, CVaR_α is nondecreasing and translation invariant, and so is a risk measure. Since (δ, X) ↦ δ + α^{−1}E[X − δ]_+ is convex, CVaR_α, which is the infimum
w.r.t. δ, is convex. Taking δ = 0, we get CVaR_α(X) ≤ α^{−1}E|X|, proving that CVaR_α
is locally upper bounded, and hence, by Proposition 1.65, continuous.
Lemma 4.20 The infimum in the r.h.s. of (4.55) is attained for δ = VaRα (X ), and
hence,
As a consequence, for the function G(X ) we may choose the CVaR function.
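As an illustration, here is a minimal numerical sketch of this choice, computing CVaR through the representation CVaR_α(X) = inf_δ {δ + α^{−1}E[X − δ]_+} of Lemma 4.20; the confidence level and the Gaussian loss sample are illustrative assumptions, not data from the text.

```python
import random

random.seed(0)
alpha = 0.05
# hypothetical sample of losses
xs = sorted(random.gauss(0.0, 1.0) for _ in range(10_000))

# empirical (1 - alpha)-quantile of the losses: a value at risk
var = xs[int((1 - alpha) * len(xs))]

def rockafellar_uryasev(delta):
    """delta + alpha^{-1} * empirical mean of (x - delta)_+."""
    return delta + sum(max(x - delta, 0.0) for x in xs) / (alpha * len(xs))

cvar = rockafellar_uryasev(var)
assert cvar >= var                              # CVaR dominates VaR
assert cvar <= rockafellar_uryasev(var + 0.3)   # delta = VaR attains the infimum
assert cvar <= rockafellar_uryasev(var - 0.3)
```

On the empirical distribution the infimum over δ is attained at any empirical quantile of level 1 − α, which is why evaluating at `var` already gives the minimum.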
4.4 Notes
Risk measures were introduced by Artzner et al. [9] with an axiomatic approach.
The most commonly used are the VaR and CVaR. See Shapiro et al. [114, Chap. 6].
A reference book on this subject, with applications in finance, is Föllmer and
Schied [49]. An important extension is the concept of dynamic risk measure, see
Ruszczyński and Shapiro [107]. For the link with utility functions, see Dentcheva
and Ruszczyński [43].
Chapter 5
Sampling and Optimizing
(1/N) ∑_{i=1}^{N} log ϕ(θ, ω_i). (5.1)
This can be interpreted as a sampling approach for maximizing the following expec-
tation:
Φ(θ) := E_{θ̄} log[ϕ(θ, ·)] = ∫_Ω log[ϕ(θ, ω)] ϕ(θ̄, ω) dμ(ω). (5.2)
Lemma 5.1 We have that Φ(θ) ≤ Φ(θ̄) for all θ ∈ Θ, with equality iff ϕ(θ, ω) = ϕ(θ̄, ω) a.s.
© Springer Nature Switzerland AG 2019 177
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_5
Proof Set ϕ̄(θ, ω) = ϕ(θ, ω)/ϕ(θ̄ , ω). Since log(s) ≤ s − 1, with equality iff s = 1,
we deduce that log[ϕ̄(θ, ω)] ≤ ϕ̄(θ, ω) − 1 and so,
Φ(θ) − Φ(θ̄) = ∫_Ω log[ϕ̄(θ, ω)] ϕ(θ̄, ω) dμ(ω) ≤ ∫_Ω (ϕ(θ, ω) − ϕ(θ̄, ω)) dμ(ω) = 0,
the last equality being due to the fact that ϕ(θ, ω) and ϕ(θ̄ , ω) are density functions
of probabilities, and so, have unit integral. The result follows.
The maximum likelihood approach to the parameter estimation problem can therefore be interpreted as an expectation maximization based on a sample.
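As a sketch of this interpretation, assume (purely for illustration) that ϕ(θ, ·) is the N(θ, 1) density; the sample log-likelihood (5.1) is then maximized at the sample mean, which converges to the true parameter.

```python
import math
import random

random.seed(1)
theta_bar = 2.0   # true parameter, known only to the simulator (assumption)
omegas = [random.gauss(theta_bar, 1.0) for _ in range(5000)]

def avg_loglik(theta):
    """(1/N) sum_i log phi(theta, omega_i) for the N(theta, 1) density."""
    return sum(-0.5 * (w - theta) ** 2 - 0.5 * math.log(2 * math.pi)
               for w in omegas) / len(omegas)

mle = sum(omegas) / len(omegas)   # closed-form maximizer for this model
assert avg_loglik(mle) >= avg_loglik(mle + 0.1)
assert avg_loglik(mle) >= avg_loglik(mle - 0.1)
assert abs(mle - theta_bar) < 0.1   # consistent with Lemma 5.1
```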
Remark 5.2 The log-likelihood approach is related to the following notion. Given a strictly convex function ϕ over R, such that ϕ(1) = 0 and ∂ϕ(1) ≠ ∅, and given p, q, densities of the probability laws P, Q over (Ω, F, μ), the ϕ divergence, or Csiszár divergence [34], is the function
I_ϕ(Q, P) := ∫_Ω ϕ(q(ω)/p(ω)) p(ω) dμ(ω), (5.3)
assuming that p(ω) > 0 a.s. Clearly Iϕ (P, P) = 0, and for a ∈ ∂ϕ(1):
I_ϕ(Q, P) = ∫_Ω ϕ(1 + (q(ω) − p(ω))/p(ω)) p(ω) dμ(ω)
          ≥ a ∫_Ω ((q(ω) − p(ω))/p(ω)) p(ω) dμ(ω) = 0,    (5.4)
since p and q are densities. In addition, since ϕ is strictly convex, equality holds iff
q(ω) = p(ω) a.s. Taking ϕ = − log we recover, up to a constant, the (opposite of
the) above function Φ.
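On a finite probability space the integral in (5.3) becomes a sum; here is a small sketch with the (assumed) choice ϕ(s) = s log s, which gives the Kullback–Leibler divergence, checking ϕ(1) = 0 at P = Q and nonnegativity (5.4).

```python
import math

def phi_divergence(q, p, phi):
    """Csiszar phi-divergence (5.3) on a finite space: sum of phi(q/p) * p."""
    return sum(phi(qi / pi) * pi for qi, pi in zip(q, p))

kl = lambda s: s * math.log(s) if s > 0 else 0.0   # phi(s) = s log s
p = [0.2, 0.3, 0.5]    # hypothetical densities (probability vectors)
q = [0.25, 0.25, 0.5]

assert phi_divergence(p, p, kl) == 0.0   # I_phi(P, P) = 0
assert phi_divergence(q, p, kl) > 0.0    # (5.4): strict positivity for q != p
```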
In this section we will discuss random variables with image in a metric space. So, Y will be a metric space and ρ the associated distance. An example that will be
considered in applications is that of the space of continuous functions over a compact
set.
As a σ -field over Y we take the Borelian field (generated by open sets; its elements
are called the Borelian subsets).
Definition 5.3 We say that the probability measure P over Y is regular if any
Borelian subset A of Y is such that,
For any ε > 0, there exist F, G, resp. closed and open subsets of Y, such that F ⊂ A ⊂ G and P(G \ F) < ε. (5.5)
Proof We follow [20, Ch. 1]. If A is closed, take F = A and for some δ > 0,
G := G δ , where G δ := {y ∈ Y ; ρ(y, A) < δ}. Then P(G δ \ A) = E1{0<ρ(y,A)<δ} .
By the dominated convergence theorem, E1{0<ρ(y,A)<δ} → 0 when δ → 0 and so,
the regularity property holds for closed sets.
Since the closed sets generate the Borelian σ -field, it suffices to check that the
set of regular Borelian subsets of Y is closed under (i) complementation and (ii)
countable unions. Indeed, let (5.5) hold for a given Borelian set A. Denoting by Ac
the complement of A, etc., we have that G c ⊂ Ac ⊂ F c , G c is closed, F c is open,
and F c \ G c = G \ F has probability less than ε. Point (i) follows. Now let An ,
n ∈ N, be a sequence of regular Borelian sets and ε > 0. Let F_n, G_n be respectively closed and open subsets such that F_n ⊂ A_n ⊂ G_n, and P(G_n \ F_n) < 2^{−(n+2)} ε. Then
(5.5) holds with G := ∪n G n and F := ∪n≤k Fn , for large enough k. The conclusion
follows.
Let (Ω, F , μ) be a probability space. We know that a random variable (r.v.) y over
Ω with image in Y induces over Y the image probability of μ by y, called the law
or distribution of y, denoted by y∗ μ, and defined by
E_{y_∗μ} f = ∑_{i=1}^{n} a_i (y_∗μ)(A_i) = ∑_{i=1}^{n} a_i μ(y^{−1}(A_i)), (5.8)
so that (5.7) holds. In the general case, we can build a sequence f k of simple functions
converging a.s. to f , and dominated by | f |, so that f k ◦ y → f ◦ y in L 1 (Ω). Then,
(5.8) and the dominated convergence theorem imply
Given F ⊂ Y and y ∈ Y , we denote the distance to F by
Definition 5.6 Let x and x′ be two r.v.s (with possibly different associated probability spaces) with values in the same metric space Y, and laws denoted by P and P′. We say that x ∼ᴸ x′ if x and x′ have the same law.
Proof Clearly, if x and x′ have the same law, then (5.11) holds. Conversely, let (5.11) hold. Let F be a closed subset of Y. For ε > 0, define f_ε : Y → R by f_ε(y) := (1 − ρ(y, F)/ε)_+. By monotone convergence,
P(F) = lim_{ε↓0} ∫_Y f_ε(y) dP = lim_{ε↓0} ∫_Y f_ε(y) dP′ = P′(F). (5.12)
So the two probabilities are equal over closed sets, and so also over open sets. Since
by Lemma 5.4 any probability measure over a metric space is regular, the result
follows.
Definition 5.8 We say that a sequence Pk of measures over the metric space Y
narrowly converges to a measure P over Y , if
∫_Y f(x) dP_k(x) → ∫_Y f(x) dP(x), for all f ∈ C_b(Y). (5.13)
Definition 5.9 Let X , X k (for k ∈ N) be r.v.s over the probability spaces (Ω, F , P),
and (Ω_k, F_k, P_k) resp., both with image in Y. We say that the sequence of r.v.s X_k over Ω_k converges in law to the r.v. X, and write X_k →ᴸ X, if the laws of X_k narrowly converge to the law of X.
Definition 5.10 One says that the sequence of r.v.s X_k is bounded in probability if, for any¹ y_0 ∈ Y, we have, setting |X|∼ := ρ(X, y_0):

for all ε > 0, there exists a c_ε > 0 such that P_k(|X_k|∼ > c_ε) ≤ ε. (5.15)

for all ε > 0, there exists a κ_ε > 0 such that μ(|X|∼ > κ_ε) ≤ ½ ε. (5.16)
Lemma 5.11 Let X k be a sequence of r.v.s with image in the metric space Y , con-
verging in law to X . Then X k is bounded in probability.
Proof Let ε > 0, κε be given by (5.16), and f : Y → R be continuous with image
in [0, 1], with value 0 if |y|∼ ≤ κε , and 1 if |y|∼ ≥ κε + 1. Then
¹ The definition is independent of y_0. In the applications Y will be a Banach space and we will take y_0 = 0, so that |·|∼ will be equal to the norm of Y.
² Indeed, the family A_n := {ω ∈ Ω; |X|∼ > n} being nonincreasing with empty intersection, μ(A_n) ↓ 0 by (3.14).
f_λ(y′) − f_λ(y) ≤ (1/λ) sup_{z∈Y} (ρ(y′, z) − ρ(y, z)) ≤ (1/λ) ρ(y′, y). (5.19)
f_λ(y) = min( inf_{ρ(z,y)≤α} { f(z) + (1/λ)ρ(z, y) }, inf_{ρ(z,y)>α} { f(z) + (1/λ)ρ(z, y) } )
       ≥ min( f(y) − ε, inf f + α/λ ),    (5.20)
and hence, lim inf λ↓0 f λ (y) ≥ f (y) − ε. The conclusion follows.
Proof The condition is obviously necessary; let us show that it is sufficient. So, let
(5.21) be satisfied, and let f : Y → R be continuous and bounded. By symmetry, it
suffices to show that lim inf k Ek f (X k ) ≥ E f (X ). The Lipschitz regularization f λ of
f being Lipschitz and bounded, it satisfies
Ek f λ (X k ) → E f λ (X ). (5.22)
By monotone convergence and in view of Lemma 5.14(ii), we have that for all ε > 0,
there exists a λε such that
E f λ (X ) ≥ E f (X ) − ε if λ < λε . (5.23)
as was to be shown.
Actually we can use as test functions Lipschitz functions with bounded support:
Lemma 5.18 Let X_k be bounded in probability. Then X_k →ᴸ X iff

E_k f(X_k) → E f(X), for all f : Y → R Lipschitz with bounded support. (5.26)
Proof It suffices to check that (5.26) implies the convergence in law. Let f be Lipschitz and bounded. For M > 0, let ϕ_M be Lipschitz R → [0, 1], with value 1 over [0, M] and 0 over [M + 1, ∞[. By dominated convergence,
Ek f (X k ) = Ek f (X k )ϕ Mε (X k ) + Ek f (X k )ψε (X k ). (5.28)
Definition 5.19 Let yk be a sequence of r.v.s with image in Y . One says that yk
converges in probability to ȳ ∈ Y (deterministic) if it converges in probability to the
constant function with value ȳ over Y , i.e., if
0 = lim_k E_k [f(y_k) − f(ȳ)] = lim_k E_k f(y_k) ≥ ε lim sup_k P_k{ρ(y_k, ȳ) ≥ ε}. (5.33)
and so lim inf_k E_k[f(y_k) − f(y′_k)] ≥ −εL_f, which by symmetry implies E_k[f(y_k) − f(y′_k)] → 0, as was to be shown.
³ The separability of Y ensures that ρ(y_k(ω), y′_k(ω)) is measurable, see Billingsley [20, Appendix II].
Definition 5.26 We say that a measure μ over Y is Gaussian if, for all nonzero
y ∗ ∈ Y ∗ , the following measure over R is Gaussian:
measurable in ω for all x. We assume that f is Lipschitz (in x) with a square integrable
Lipschitz constant, in the sense that
which combined with the previous hypotheses implies the existence of a finite second
moment of f (·, x) for any x ∈ X .
Then ω → f (ω, ·) is an r.v. with image in the Banach space Y = Cb (X ) p , with
expectation denoted by f¯(x). We denote the sample approximation of f¯ by
f̂_N(x) := (1/N) ∑_{i=1}^{N} f(ω_i, x). (5.41)
We next state a Functional Central Limit Theorem (FCLT; functional here means
infinite-dimensional).
Theorem 5.27 If (5.38)–(5.40) hold, then √N ( f̂_N(x) − f̄(x) ) converges in law to the Gaussian of covariance equal to that of f.
Proof See Araujo and Giné [8, Cor. 7.17] for the proof of this difficult result.
We now establish some differential calculus rules for r.v.s converging in law.
Theorem 5.28 (Delta theorem) Let Y_k be a sequence of r.v.s with values in a separable Banach space Y containing η, τ_k ↑ ∞, and Z an r.v. with values in Y, such that Z_k := τ_k(Y_k − η) converges in law to Z. Let G : Y → W, where W is a Banach space, be differentiable at η. Then τ_k(G(Y_k) − G(η)) converges in law to G′(η)Z.
Proof In view of the representation Theorem 5.23, we may suppose that the Yk are
r.v.s over the same probability space (Ω, F , P) and that Z k → Z a.s. Since G is
differentiable at η,
the limit of the r.h.s. of (5.44) is G′(x, h). The result follows.
We next introduce the “Hadamard” version of the Delta theorem.
Theorem 5.32 (Hadamard Delta Theorem) Let Y and W be Banach spaces, with
Y separable, K a subset of Y , G : K → W Hadamard differentiable at η ∈ K
tangentially to K , and Yk a sequence of r.v.s with values in K . Let τk ↑ ∞, and Z
an r.v. with values in Y , such that Z k := τk (Yk − η) converges in law to Z . Then
τ_k(G(Y_k) − G(η)) converges in law to G′(η, Z).
Proof The proof is similar to that of Theorem 5.28, replacing (5.42) with
the limit being independent of the sequence (tk , xk ). If this holds for all h ∈ TK (x̄),
one says that G is second-order Hadamard differentiable at x̄ tangentially to K .
When K = Y , one says that G is second-order Hadamard differentiable at x̄.
2τ_k²( G(Y_k) − G(η) − G′(η, Y_k − η) ) →ᴸ G″(η, Z). (5.49)
Proof The arguments are similar to those of the first-order delta theorem.
Since ϕ̄′(z̄) is invertible this provides an expression for G″(ϕ̄)(ψ)². The first and last terms are identical and we can eliminate them. So,

G″(ϕ̄)(ψ)² = ϕ̄′(z̄)^{−1}[ 2ψ′(z̄)ϕ̄′(z̄)^{−1}ψ(z̄) − ϕ̄″(z̄)(ϕ̄′(z̄)^{−1}ψ(z̄))² ]. (5.57)
We assume that it has a regular root x̄, i.e., f¯(x̄) = 0 and D f¯(x̄) is invertible. There
exists an ε > 0 such that f¯ has no other root in B̄(x̄, ε), and D f¯ is uniformly
invertible over B̄(x̄, ε). Let Y denote the separable Banach space of C 1 functions
over B̄(x̄, ε) with image in Rn , endowed with the natural norm
We set x̂ N := χ ( fˆN ).
Theorem 5.35 Let f , x̄ be as above and Z (x̄) denote the covariance of f (·, x̄).
Then
N^{1/2}(x̂_N − x̄) →ᴸ −D f̄(x̄)^{−1} Z(x̄), (5.61)
We also assume that f has finite second moments and denote its expectation by f¯(x).
When optimizing an expectation, it frequently occurs that the law μ is not known,
but nevertheless it is possible to get a sample of realizations that follows the law of
μ. Given an integer N > 0, we obtain an empirical distribution μ̂_N that associates
the probability 1/N with each element ω1 , . . . , ω N of the sample (or rather gives
probability p/N in the case of p identical realizations) and zero to the others. This
empirical distribution is an r.v. We recall that we denote by
f̂_N(x) = (1/N) ∑_{i=1}^{N} f(ω_i, x) (5.64)
the mean value of the empirical distribution, also called the standard estimator of
the mean value. We recall that this estimator is unbiased, since
E f̂_N(x) = (1/N) ∑_{i=1}^{N} E f(ω_i, x) = f̄(x). (5.65)
E( f̂_N(x) − f̄(x) )² = (1/N²) ∑_{i=1}^{N} E( f(ω_i, x) − f̄(x) )² = (1/N) V(f, x). (5.66)
So, the standard deviation (square root of the variance) of fˆN (x) is N −1/2 V ( f, x)1/2 .
We recall the classical estimator of the variance.
V̂(f, x) := (1/(N−1)) ∑_{i=1}^{N} ( f(ω_i, x) − f̂_N(x) )². (5.67)
(N − 1)V̂(f) := ∑_{i=1}^{N} f(ω_i)² − 2 f̂_N ∑_{i=1}^{N} f(ω_i) + N f̂_N² = ∑_{i=1}^{N} f(ω_i)² − N f̂_N², (5.68)
and so (N − 1)EV̂ ( f ) = N V ( f ) − V ( f ); the result follows.
Remark 5.37 It follows that the naive estimator below has a negative bias of order
1/N :
Ṽ(f, x) := (1/N) ∑_{i=1}^{N} ( f(ω_i, x) − f̂_N(x) )². (5.69)
can be approximated by the problem of minimizing the standard estimate of the mean
value:
Proof (a) We first show that E val( P̂N ) ≤ val(P). Since f¯(x) = E fˆN (x), this is
equivalent to
inf_{x∈X} E[ f̂_N(x) ] ≥ E[ inf_{x∈X} f̂_N(x) ], (5.71)
v_{N+1} = E[ inf_{x∈X} (1/(N+1)) ∑_{i=1}^{N+1} (1/N) ∑_{j≠i} f(ω_j, x) ]
        ≥ (1/(N+1)) ∑_{i=1}^{N+1} E[ inf_{x∈X} (1/N) ∑_{j≠i} f(ω_j, x) ]    (5.72)
        = (1/(N+1)) ∑_{i=1}^{N+1} v_N = v_N,
as was to be shown.
By the above lemma, val( P̂N ) is an estimate of val(P) with a nonpositive bias.
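A toy illustration of this nonpositive bias, with an assumed two-point decision set: f(ω, 1) = ω with E ω = 0 and f(ω, 2) = 0, so that val(P) = 0 while the sampled value min(ω̄_N, 0) is nonpositive for every sample.

```python
import random

random.seed(3)
N, reps = 50, 2000
mean_val = 0.0
for _ in range(reps):
    # empirical mean of the N sampled omegas
    omega_bar = sum(random.gauss(0.0, 1.0) for _ in range(N)) / N
    # val(P_hat_N) = min over x in {1, 2} of the sampled costs
    mean_val += min(omega_bar, 0.0) / reps

assert mean_val < 0.0    # strictly negative bias, although val(P) = 0
assert mean_val > -0.2   # the bias is small, of order N^{-1/2}
```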
Min_{e∈R} (1/N) ∑_{i=1}^{N} ( f(ω_i) − e )²,
Set g(ω) := max_{x∈X} | f(ω, x)|. Since X is compact, it contains a dense sequence x^k, and since f(ω, ·) ∈ C_b(X) a.s., we have that g(ω) = max_k | f(ω, x^k)| a.s., proving that g is measurable.
Theorem 5.40 Let (5.63) hold. If g is integrable, then fˆN (x) → f¯(x) uniformly,
with probability (w.p.) 1.
lim sup_N sup_{x′∈B(x,α_ε)} | f̂_N(x) − f̂_N(x′) | ≤ lim sup_N (1/N) ∑_{i=1}^{N} sup_{x′∈B(x,α_ε)} h(ω_i, x′) ≤ ε  w.p. 1,    (5.74)

where the first inequality uses the triangle inequality, and the second one uses the separability of B(x, α_ε) to ensure the measurability of sup_{x′∈B(x,α_ε)} h(ω, x′), and the law of large numbers.
Covering the compact set X by finitely many open balls with radius αε and center
x k , k = 1 to K ε , and using fˆN (x k ) → f¯(x k ) w.p. 1, we get, for x ∈ B(x k , αε ):
lim sup_N | f̂_N(x) − f̄(x) | ≤ lim sup_N | f̂_N(x) − f̂_N(x^k) | + lim sup_N | f̄(x^k) − f̄(x) |. (5.75)
The first limit in the r.h.s. is w.p. 1 no more than ε by (5.74), and the second one can
be made arbitrarily small by taking αε small enough. The conclusion follows.
Corollary 5.41 Under the hypotheses of Theorem 5.40, val(P̂_N) → val(P) with probability 1.
Proof Indeed, the theorem ensures w.p. 1 the uniform convergence of the cost func-
tion, and the function f → min x∈X f (x) is continuous over Cb (X ).
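A numerical sketch of this uniform convergence, for the (assumed) integrand f(ω, x) = sin(ω + x) over X = [0, 1] with ω standard Gaussian, for which f̄(x) = e^{−1/2} sin x.

```python
import math
import random

random.seed(4)
omegas = [random.gauss(0.0, 1.0) for _ in range(5000)]
grid = [i / 100 for i in range(101)]   # discretization of X = [0, 1]

def fhat(x):
    """Sample mean fhat_N(x) over the fixed sample."""
    return sum(math.sin(w + x) for w in omegas) / len(omegas)

def fbar(x):
    """E sin(omega + x) = exp(-1/2) sin(x) for omega ~ N(0, 1)."""
    return math.exp(-0.5) * math.sin(x)

sup_err = max(abs(fhat(x) - fbar(x)) for x in grid)
min_hat = min(fhat(x) for x in grid)
min_bar = min(fbar(x) for x in grid)

assert sup_err < 0.1                       # uniform closeness (Theorem 5.40)
assert abs(min_hat - min_bar) <= sup_err   # values converge (Corollary 5.41)
```

The last assertion is just the Lipschitz property of f → min_x f(x) used in the proof of the corollary.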
Let f : X → R. We set
Proof Since |min(f) − min(f′)| ≤ sup_x | f(x) − f′(x)|, we have that min(·) is Lipschitz. By Lemma 5.31, it suffices to check that it has directional derivatives satisfying (5.77). Let f and g belong to C_b(X). We have, for ε > 0:
Proof By Theorem 5.27, N 1/2 ( fˆN (x) − f¯(x)) converges in law to Z . We conclude
by combining Proposition 5.42 and the Hadamard Delta Theorem 5.32.
Remark 5.44 The asymptotic law of N 1/2 (val( P̂N ) − val(P)), when the minimum
of f¯ over X is not unique, is therefore in general not Gaussian.
Example 5.45 Let ω be a standard Gaussian variable, X = {1, 2}, f(ω, 1) = ω, f(ω, 2) = 0. Then √N f̂_N(1) is a standard Gaussian variable, and √N f̂_N(2) = 0. So the law of √N val(P̂_N) is that of min(0, Z_1), where Z_1 is a standard Gaussian variable. We have that √N f̂_N(·) converges in law to Z := (Z_1, 0), so that, as follows from the above theorem, √N min_x f̂_N(x) converges in law (since in fact here the law is constant over the sequence) to min(0, Z_1).
Problems with expectation type constraints need a more involved analysis. We restrict
ourselves to the convex setting, which is the only one that is well understood.
with solution set denoted by Λ( f, G). We know that, if the duality gap is zero, then
(x̄, λ̄) is a primal–dual solution⁴ iff
One easily checks that the stability condition (1.170) of the duality theory holds iff
There exists β f,G > 0, and x f,G ∈ X, such that
(5.82)
G i (x f,G ) < −β f,G , i = 1, . . . , p.
vk := val( f + εk φ k , G + εk Ψ k ). (5.85)
Let x̄ ∈ S( f, G), γ ∈ (0, 1) and set x γ := γ x f,G + (1 − γ )x̄ with x f,G defined in
(5.82). Then x γ ∈ X and
lim sup_k f(x^k) = lim sup_k v_k ≤ inf_{γ∈(0,1)} lim_k ( f + ε_k φ^k )(x^γ) = val(f, G). (5.87)
v_k = f(x^k) + ε_k φ^k(x^k)
    = min_{x∈X} L[f + ε_k φ^k, G + ε_k Ψ^k](x, λ^k)
    ≤ L[f + ε_k φ^k, G + ε_k Ψ^k](x̄, λ^k)    (5.92)
    = L[f, G](x̄, λ^k) + ε_k L[φ^k, Ψ^k](x̄, λ^k)
    ≤ val(f, G) + ε_k L[φ, Ψ](x̄, λ̄) + o(ε_k).
The second inequality uses the relation L[ f, G](x̄, λk ) ≤ L[ f, G](x̄, λ̄) = val( f, G),
a consequence of the fact that λ̄ is a dual solution. Since Λ( f, G) is bounded, we get
v_k ≤ val(f, G) + ε_k max_{λ∈Λ(f,G)} L[φ, Ψ](x̄, λ) + o(ε_k), (5.93)
with X a compact and convex subset of Rn . The sample approximation of this problem
is
Min_x f̂_N(x); Ĝ_N(x) ≤ 0, x ∈ X, (P_{f̂_N,Ĝ_N})
where fˆN is the empirical estimate (5.41), and the same convention for Ĝ N (with the
same sample). We need the qualification condition
The set S( f¯, Ḡ) of solutions of (P f¯,Ḡ ) is a convex and compact subset of X . We
recall that we denote by L[ f¯, Ḡ](x, λ) := f¯(x) + λ · Ḡ(x) the Lagrangian and by
Λ( f¯, Ḡ) the set of Lagrange multipliers, solutions of the dual problem
Theorem 5.48 Let f (ω, x) and G(ω, x) satisfy (5.38)–(5.40), and let the qualifi-
cation condition (5.94) hold. Let (Z f¯ , Z Ḡ ) denote the components of the Gaussian
variable with image in Cb (X ) p+1 , of covariance equal to that of ( f¯, Ḡ). Let Z i
denote the component associated with Ḡ i . Then we have the convergence in law of
N^{1/2}( val(P_{f̂_N,Ĝ_N}) − val(P_{f̄,Ḡ}) ) towards

min_{x∈S(f̄,Ḡ)} max_{λ∈Λ(f̄,Ḡ)} [ Z_{f̄}(x) + ∑_{i=1}^{p} λ_i Z_i(x) ]. (5.95)
In this section we briefly recall the starting point of the theory of large deviations,
and show how to apply this theory to stochastic optimization problems.
E[e^{t Z_N}] = ∏_{i=1}^{N} E[e^{t X_i/N}] = M(t/N)^N. (5.97)

Hence, by Markov's inequality applied to e^{t Z_N},

(1/N) log P(Z_N ≥ a) ≤ −(t/N) a + L M(t/N). (5.98)
Minimizing over t > 0, we obtain
(1/N) log P(Z_N ≥ a) ≤ −I⁺(a), (5.99)

where

I⁺(a) := sup_{τ>0} { aτ − L M(τ) }. (5.100)
This definition is close to that of the Fenchel transform of the logarithmic moment-generating function, also called the rate function:

I(a) := sup_{τ∈R} { aτ − L M(τ) }. (5.101)
We have of course I + (a) ≤ I (a). The interesting case is when a > E(X ); then the
probability of Z N ≥ a tends to zero as N ↑ ∞. We will see then that I + (a) = I (a)
under weak hypotheses, and this gives the following large deviations estimate:
Theorem 5.49 (Cramér’s theorem) Let a > E(X ). If M(τ ) has a finite value in
[−τ, τ] for some τ > 0, then I⁺(a) = I(a), and so with (5.99):

(1/N) log P(Z_N ≥ a) ≤ −I(a). (5.102)
|e^{t X(ω)} − 1| / t ≤ (1/τ) τ|X(ω)| e^{τ|X(ω)|/2} ≤ (1/τ) e^{τ|X(ω)|}. (5.103)
Since M(τ ) has a finite value in [−τ, τ ], the r.h.s. has an expectation majorized by
(M(τ ) + M(−τ ))/τ . By the dominated convergence theorem, (M(t) − M(0))/t
has when t ↓ 0 a limit equal to M (0+ ) = E(X ). Consequently,
Let us come back to the stochastic optimization problem (P) and its sampled version
( P̂N ) of Sect. 5.3.2. Assume for the sake of simplicity that (P) has at least one solution
x̄, and that the moment-generating function M(t) is finitely-valued for t > 0 small
enough. By the large deviations principle, for all a > f¯(x̄), we have, denoting by Ix
the rate function associated with f (ω, x):
So, the value of the sampled problem has an “exponentially weak” probability of
being more than f¯(x̄) plus a given positive amount.
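A sketch for the standard Gaussian case, where L M(τ) = τ²/2 and hence the rate function is I(a) = sup_τ {aτ − τ²/2} = a²/2; the grid and sample sizes are illustrative choices.

```python
import math
import random

random.seed(5)

def rate(a):
    """I(a) = sup_{tau >= 0} {a*tau - L M(tau)} on a grid, for L M(tau) = tau^2/2."""
    taus = [i / 1000 for i in range(4001)]
    return max(a * t - t * t / 2 for t in taus)

a = 1.0
assert abs(rate(a) - a * a / 2) < 1e-3   # analytic value a^2/2

# Empirical check of the Chernoff bound (5.99): P(Z_N >= a) <= exp(-N I(a)).
N, reps = 25, 5000
hits = sum(
    sum(random.gauss(0.0, 1.0) for _ in range(N)) / N >= a
    for _ in range(reps)
)
assert hits / reps <= math.exp(-N * rate(a)) + 0.01
```

With N = 25 and a = 1 the bound e^{−N a²/2} = e^{−12.5} is already tiny, which is the "exponentially weak" probability referred to above.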
Remark 5.50 When minimizing over a finite set, it follows by similar arguments that
the value of the sampled problem has an “exponentially weak” probability of being
less than f¯(x̄) minus a given positive amount.
5.5 Notes
The state of the art on the subject of the chapter is given in Ruszczynski and Shapiro
[106], and Shapiro et al. [114]. The Hadamard Delta Theorems 5.32 and 5.43 are due
to Shapiro [111]. Theorem 5.47 is also due to Shapiro [112].
Linderoth et al. [75] made extensive numerical tests to obtain statistical estimates
of the value function for simple recourse problems.
⁵ It suffices to check this in the case of a finite sum. Let L M(t) = log(∑_{i=1}^{n} p_i e^{t x_i}), with the p_i positive of sum one. Then L M′(t) = M(t)^{−1} ∑_{i=1}^{n} p_i x_i e^{t x_i} and L M″(t) = M(t)^{−1} ∑_{i=1}^{n} p_i x_i² e^{t x_i} − M(t)^{−2} (∑_{i=1}^{n} p_i x_i e^{t x_i})². We conclude by the Cauchy–Schwarz inequality.
Chapter 6
Dynamic Stochastic Optimization
L^s(F) := L^s(Ω, F); H_F := L²(F)^m, (6.1)
Both HF and HG are Hilbert spaces, and the norm on HG is induced by the norm
of HF . It follows that HG is a closed subspace of HF . The (orthogonal) projection
operator from HF onto HG is called the conditional expectation (over G ) and usually
denoted by E[·|G ]; but this notation is often too heavy and so it is convenient to write
PG instead. So, if X ∈ HF , its projection Y onto HG is such that
Y = PG X = E[X |G ]. (6.3)
PG (α1 · X 1 + α2 · X 2 ) = α1 · PG X 1 + α2 · PG X 2 . (6.4)
PG (a + X ) = a + PG X. (6.5)
X ≤ X′ ⇒ P_G X ≤ P_G X′. (6.8)
Proof (i) The expectations in (6.6) are the scalar products in HF of Z with Y and
Z . So we can rewrite this equation as (X − Y, Z )F = 0, for all Z ∈ HG , which is
the characterization of the projection onto a subspace.
(ii) Set Y Z := PG (Z X ). By point (i), Y Z is characterized by
We now present the conditional Jensen’s inequality and some of its consequences.
(iii) We have the integral Jensen inequality (the expectations having values in R ∪
{+∞}):
Eϕ(Y ) ≤ Eϕ(X ). (6.14)
Proof (i) Since ϕ is proper, l.s.c. and convex, it is the supremum of its affine minorants, i.e., there exists an A ⊂ R^m × R such that, for all x ∈ R^m:
In view of (6.4), (6.5) and (6.8), we have that for any (a, b) ∈ A:
a · Y + b = a · PG X + b = PG [a · X + b] ≤ PG (ϕ(X )) . (6.17)
We next show how to extend the conditional expectation from HF to the larger
space L 1 (F )m . By (6.15) we already know that, for all X ∈ HF :
‖P_G X‖₁ ≤ ‖X‖₁, (6.18)
Lemma 6.4 (i) Relation (6.18) is also satisfied by the extension of the conditional expectation to L¹(F)^m.
(ii) The latter satisfies (6.4), (6.5), (6.7), (6.8), and (6.12)–(6.15). If X ∈ L 1 (F )m ,
then Y = PG X is characterized by the relation
Proof (i) That (6.18) holds for all X ∈ HF follows from Lemma 6.2(iv) with s = 1.
Given X ∈ L 1 (G )m , and k ∈ N, k = 0, consider the truncation
For s ∈ [1, ∞], we denote by s′ its conjugate number, such that 1/s + 1/s′ = 1, and set U_s := L^s(F)^m. We denote by P_s the restriction of the conditional expectation (over U₁) to U_s, and view it as an element of L(U_s).
Proof (i) By Lemma 6.4(ii), Ps u is characterized by the equality in (6.22), for all
Z ∈ U∞ . So we only have to prove that (6.22) holds for s ∈ (1, ∞]. Let v ∈ Us . The
componentwise truncated sequence:
proving that P_s^∗ = P_{s′}.
(iii) By the same arguments, when v ∈ U_∞ and u ∈ U₁ (which is a subset of U_∞^∗), we have that P_∞^∗ u = P₁ u.
This holds in particular when taking for E the set of characteristic functions of G-measurable sets.
E[v^∗|G] := P_∞^∗ v^∗. (6.27)
Remark 6.8 (i) In view of Lemma 6.5(iii), when v∗ ∈ L 1 (F ), we recover the usual
conditional expectation.
(ii) By the same lemma, for all s ∈ [1, ∞], Ps is a conditional expectation (but of
course P∞ = P1 ).
Let (Ω, F , P) be, as before, a probability space, and let G be a sub σ -algebra of
F . Denote by L 0 (Ω, F ) the set of measurable functions w.r.t. the σ -algebra F ,
and by L 0+ (Ω, F ) the set of such measurable functions that are nonnegative a.s. To
f ∈ L 0+ (Ω, F ) we associate the sequence of truncated functions f k , k ∈ N, such
that f k (ω) := min( f (ω), k) and their conditional expectation gk := E[ f k , G ]. The
latter are well-defined since f k ∈ L ∞ (Ω, F ). The conditional expectation being
a nondecreasing mapping, the sequence gk is nondecreasing and converges a.s. to
some g ∈ L 0+ (Ω, F ). We say that g is the conditional expectation of f and write
g = E[ f, G ].
More generally, if f ∈ L 0 (Ω, F ) is such that f ≥ h a.s. for some h in L 1 (Ω, F ),
we can define E[ f, G ] as the limit a.s. of the nondecreasing sequence E[ f k , G ].
Proof Let z ∈ L ∞ + (Ω, G ). Using the monotone convergence Theorem 3.34 twice,
and the fact that gk = E[ f k |G ], we get
Let s ∈ [1, ∞]. The two extreme cases are: when G is the trivial σ -algebra, the con-
ditional expectation coincides with the expectation; when G = F , the conditional
expectation is the identity operator in L s (F ).
Example 6.11 Let (Ω1 , F1 ) and (Ω2 , F2 ) be measurable spaces, and let F be the
product σ -algebra (the one generated by F1 × F2 ). Set Ω := Ω1 × Ω2 , and let
P be a probability measure on (Ω, F ). Set G := F1 × {Ω2 , ∅}. The associated
random functions are those that do not depend on ω2 . Then, roughly speaking, Y :=
E[X |G ] is obtained by averaging for each ω1 ∈ Ω1 the value of X (ω1 , ·). More
precise statements follow.
Example 6.12 In the framework of Example 6.11, assume that Ω1 and Ω2 are finite
sets, say equal to {1, . . . , p} and {1, . . . , q} resp., with elements denoted by i and j;
let pi j be the probability of (i, j). Taking for Z the characteristic function 1{i0 } (i, j),
for any i_0 ∈ {1, . . . , p}, in (6.22), we deduce that

Y(i) = ( ∑_{j∈Ω₂} p_{ij} X(i, j) ) / ( ∑_{j∈Ω₂} p_{ij} ), for all i ∈ Ω₁. (6.30)
Example 6.13 (Independent noises) In the framework of Example 6.11, let P be the
product of the probability P1 over (Ω1 , F1 ) and P2 over (Ω2 , F2 ), so that ω1 and
ω2 are independent. Then Y := E[X |G ] is given by, a.s.:
Y(ω₁) = ∫_{Ω₂} X(ω₁, ω₂) dP₂(ω₂). (6.31)
Remark 6.14 More general expressions can be obtained using the disintegration
theorem [40, Chap. III]. In most applications we have (reformulating the model if
necessary) independent noises.
The main convergence theorems of integration theory have their counterparts for
conditional expectations.
Theorem 6.15 (Monotone convergence) Let f k be a nondecreasing sequence of
L 1 (F ), with limit a.s. f ∈ L 1 (F ). Set gk := E[ f k |G ] and g := E[ f |G ]. Then gk is
nondecreasing, and converges to g both a.s. and in L 1 (G ).
Proof Since f k is nondecreasing, by (6.8) (which is valid in L 1 (F )) so is gk , and
hence, gk → ĝ a.s. for some measurable function ĝ, such that gk ≤ ĝ ≤ g. By dom-
inated convergence, ĝ is integrable. Let A ∈ G with characteristic function z = 1 A .
Using the monotone convergence Theorem 3.34 twice, we get:
Proof Set f̂_k := inf{ f_ℓ ; ℓ ≥ k }, and ĝ_k := E[ f̂_k |G ]. Then f̂_k is nondecreasing and
converges a.s. to f . Since h ≤ fˆk ≤ f k , fˆk is integrable. By the monotone conver-
gence Theorem 6.15, ĝk ↑ g a.s. Since fˆk ≤ f k , we have that ĝk ≤ gk . The conclusion
follows.
Proof We may assume that EX = EY = 0 and it is enough to prove the result when
X is a scalar. Then
Now
E(X − Y )2 = EEG (X − Y )2 = Evar G X (6.37)
and
E[X Y ] = EEG [X Y ] = E(Y EG [X ]) = EY 2 (6.38)
Remark 6.20 The law of total variance (6.35) can be interpreted as the decomposition
of the variance as the sum of the term varY explained, or predicted by G , and of the
unexplained, or unpredicted term Evar G X .
Remark 6.24 We proved in Lemma 1.124 the following geometric calculus rule: the
normal cone of an intersection of closed convex sets is the sum of normal cones
to these sets, provided that the qualification condition 0 ∈ int(K 1 − K 2 ) holds. In
the above lemma we obtained the geometric calculus rule without the qualification
condition.
An easy application of the above result, that we state for future reference, is as
follows. Consider the problem
Proposition 6.25 Let ū ∈ F(6.45) satisfy (6.46), and set ȳ = Aū. Then ū is a solution of (6.45) iff there exist y^∗ ∈ N_{K_Y}(ȳ), u^∗ ∈ ∂F(ū) and q ∈ N_K(ū) such that

P A^∗ y^∗ + u^∗ + q = 0. (6.47)

A^∗ y^∗ + u^∗ + q_1 = 0. (6.48)
U = ∏_{t=0}^{T−1} U_t;  K = ∏_{t=0}^{T−1} K_t;  Y = ∏_{t=1}^{T} Y_t;  K^Y = ∏_{t=1}^{T} K_t^Y, (6.51)

y_τ[u] = ∑_{t=0}^{T−1} A_{τt} u_t,  τ = 1, . . . , T,  where A_{τt} ∈ L(U_t, Y_τ), (6.52)

such that

P_t ∑_{τ∈T} A_{τt}^∗ y_τ^∗ + u_t^∗ + q_t = 0,  t = 0, . . . , T − 1. (6.54)
So (by the subdifferential calculus rule for a sum) it is also equivalent to the fact that ū is a solution of the problem

min_u F(u) + ∑_{t=0}^{T−1} ∑_{τ=1}^{T} ⟨y_τ^∗, A_{τt} u_t⟩;  u_t ∈ K_t ∩ V_t, t = 0, . . . , T − 1. (6.56)
We apply the results of the previous section in the case of measurability constraints,
i.e., (Ω, F , μ) is a probability space, and G is a σ -algebra included in F . For
some s ∈ [1, ∞], we assume that U = L s (F )m and V = L s (G )m . We recall that Ps
denotes the conditional expectation operator in L s (F )m .
Definition 6.29 Let K be a closed convex subset of L s (F )m , for some s ∈ [1, ∞].
We say that K is compatible with G if PG K ⊂ K , i.e., if any x ∈ K is such that
PG x ∈ K .
Proposition 6.30 Let ū ∈ F(6.45) satisfy the qualification condition (6.46); set ȳ = Aū. Then ū ∈ S(6.45) iff there exist y^∗ ∈ N_{K_Y}(ȳ), u^∗ ∈ ∂F(ū) and q ∈ N_K(ū) such that P_s A^∗ y^∗ + u^∗ + q = 0, or equivalently,
(ii) We say that K defines an integral Jensen type constraint if, for some proper l.s.c.
convex function ϕ over Rm , we have that
(iii) We say that K defines a local constant constraint if there exists a nonempty
closed convex subset K of Rm such that
where I is a countable set and the (ai , bi )i∈I are F -measurable and essentially
bounded. We say that ϕ is an F -adapted function, and that it is G -adapted if in
addition any (ai , bi ), for i ∈ I , is G -measurable.
Definition 6.34 Let K be a nonempty, closed convex subset of L s (F )m , for some
s ∈ [1, ∞].
(i) We say that K defines a generalized Jensen type constraint if, for some function
ϕ satisfying (6.61), G -adapted, we have that
(ii) We say that K defines a generalized integral Jensen type constraint if, for some
function ϕ satisfying (6.61), G -adapted, we have that
6.1.9 No Recourse
The problem without recourse is a particular case of the previous theory, when G = {∅, Ω} is the trivial σ-algebra. Then the conditional expectation in L^s(F) coincides with the expectation.
A very simple example illustrates the fact that, in the presence of constraints to be
satisfied a.e., the multipliers in the dual of L ∞ typically have singular parts.
Set ȳ(ω) := ū − ω. Taking Y = L ∞ (Ω) as constraint space, and observing that the
constraint is qualified, we obtain the existence of a multiplier λ such that
For any ε > 0 and y ∈ Y with zero value on (dm , dm + ε), there exists a ρ > 0 such
that ȳ ± ρy ∈ Y− . Since λ ∈ NY− ( ȳ), and so λ, ȳ = 0, it follows that λ, y = 0.
We have proved that λ is equal to its singular part; note that it is nonzero in view of
(6.67), since p > 0.
Random variables such as prices, temperatures, etc., that depend on time are modelled as time series, say y_t ∈ R^n with t ∈ Z. Quite often the y_t are not independent variables, and we can express them as functions of past values:
where the random variables et ∈ Rm , called innovations, are “white noise”, i.e., i.i.d.
with zero mean and unit variance. A simple example is the one of autoregressive (AR)
models
yt = a1 yt−1 + · · · + aq yt−q + Φ̂et , (6.69)
where the ai are n × n matrices and Φ̂ is a given matrix; this model of order q is
also called ARq. Then the vector Yt := (yt , yt−1 , . . . , yt−q+1 ) has the first-order
dynamics

          ⎛ a_1  a_2  ⋯  a_{q−1}  a_q ⎞         ⎛ Φ̂ ⎞
          ⎜ 1    0                    ⎟         ⎜ 0 ⎟
Y_{t+1} = ⎜          ⋱               ⎟ Y_t  +  ⎜ ⋮ ⎟ e_t.    (6.70)
          ⎝ 0    ⋯    1       0      ⎠         ⎝ 0 ⎠
So this type of model is suitable for our framework. For more on AR models and
their nonlinear extensions, we refer to [55].
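The reduction (6.70) is easy to exercise numerically; below is a minimal sketch (Python with NumPy; the coefficients a_1 = 0.5, a_2 = 0.3 and Φ̂ = 1 are hypothetical) checking that the first-order companion recursion reproduces the direct AR(2) recursion:

```python
import numpy as np

def companion(a, phi):
    """Companion matrix A and input vector G of the first-order form (6.70)."""
    q = len(a)
    A = np.zeros((q, q))
    A[0, :] = a                  # first row carries a_1, ..., a_q
    A[1:, :-1] = np.eye(q - 1)   # subdiagonal identity shifts the memory down
    G = np.zeros(q)
    G[0] = phi                   # the innovation enters the first component only
    return A, G

a, phi = [0.5, 0.3], 1.0         # hypothetical AR(2) coefficients
A, G = companion(a, phi)

rng = np.random.default_rng(0)
e = rng.standard_normal(200)     # innovations e_t

# First-order recursion Y_{t+1} = A Y_t + G e_t, with Y_t = (y_t, y_{t-1}).
Y = np.zeros(2)
y_companion = []
for t in range(200):
    Y = A @ Y + G * e[t]
    y_companion.append(Y[0])

# Direct AR(2) recursion y_t = a_1 y_{t-1} + a_2 y_{t-2} + phi * e_t, for comparison.
y = [0.0, 0.0]
for t in range(200):
    y.append(a[0] * y[-1] + a[1] * y[-2] + phi * e[t])
y_direct = y[2:]
```

Both recursions start from zero initial values, so the two trajectories coincide.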
We start with the general setting of an abstract problem in product form of Exam-
ple 6.27. We call u the control, and y the state, and assume that the control to state
mapping is defined by the state equation
with At ∈ L(Yt , Yt+1 ), Bt ∈ L(Ut , Yt+1 ), dt ∈ Yt+1 , and solution denoted by y[u],
and that the cost function has the following form:
F(u) = J(u, y[u]),  with  J(u, y) := Σ_{t=0}^{T−1} ℓ_t(u_t, y_t) + ϕ(y_T).  (6.72)
z t+1 = At z t + Bt vt , t = 0, . . . , T − 1; z 0 = 0. (6.73)
We first give a means to express the subdifferential of F, using the adjoint state (or
costate) approach.
Definition 6.37 Set P := Y1∗ × · · · × YT∗ as costate space. Let ū ∈ U have associ-
ated state ȳ := y[ū]. The costate p ∈ P (i.e., pt ∈ Yt∗ , t = 1 to T ) associated with
ū, y ∗ ∈ Y ∗ and w∗ ∈ Y ∗ (we distinguish these two dual variables since they will play
different roles) is defined as the solution of the backward equation ( pt is computed
by backward induction)
p_t = y_t^* + w_t^* + A_t^⋆ p_{t+1}, t = 1, …, T − 1;
p_T = y_T^* + w_T^*.  (6.74)
We note the useful identity, where (v, z) satisfies the linearized state equation
(6.73):
Σ_{t=1}^{T} ⟨y_t^* + w_t^*, z_t⟩ = ⟨p_T, z_T⟩ + Σ_{t=1}^{T−1} ⟨p_t − A_t^⋆ p_{t+1}, z_t⟩
 = Σ_{t=1}^{T} ⟨p_t, z_t⟩ − Σ_{t=0}^{T−1} ⟨p_{t+1}, A_t z_t⟩
 = Σ_{t=1}^{T} ⟨p_t, z_t⟩ + Σ_{t=0}^{T−1} ⟨p_{t+1}, B_t v_t − z_{t+1}⟩
 = Σ_{t=0}^{T−1} ⟨B_t^⋆ p_{t+1}, v_t⟩.  (6.75)
Note that (v^*, y^*) ∈ U^* × Y^* belongs to ∂J(ū, ȳ) iff v_0^* ∈ ∂_u ℓ_0(ū_0, ȳ_0), (v_t^*, y_t^*) ∈ ∂ℓ_t(ū_t, ȳ_t) for t = 1 to T − 1, and y_T^* ∈ ∂ϕ(ȳ_T).
Lemma 6.38 We have that u ∗ ∈ ∂ F(ū) iff there exists (v∗ , y ∗ ) ∈ ∂ J (ū, ȳ) such that
the costate p associated with y ∗ and w∗ = 0 satisfies
Proof We have that the state satisfies y[u] = Au + d for some linear continuous operator A and some d in an appropriate space. Since F(u) = J(u, y[u]), by the subdifferential calculus rules in Lemma 1.120, we have that u^* ∈ ∂F(ū) iff u^* = v^* + A^⋆ y^* for some (v^*, y^*) ∈ ∂J(ū, ȳ), or equivalently, if
Σ_{t=0}^{T−1} ⟨u_t^*, v_t⟩ = Σ_{t=0}^{T−1} ⟨v_t^*, v_t⟩ + Σ_{t=1}^{T} ⟨y_t^*, z_t⟩.  (6.77)
Theorem 6.39 Let ū be feasible, with associated state ȳ. Assume that the quali-
fication condition (6.46) holds, and that the constraints that ū t belongs to Kt are
compatible with the projector Pt , for t = 0 to T − 1. Then ū is a solution of the
abstract optimal control problem (6.45) iff there exists yT∗ ∈ ∂ϕ( ȳT ), and
We now particularize the previous setting by assuming that the spaces Ut and Yt
do not depend on t, so we may denote them as U0 , Y0 , and that, if y = y[u] with
u t ∈ Vt for all t, then yt belongs to some closed subspace Z t of Y0 , with which is
associated a projector Q t . We assume that the operators Pt ∈ L(U0 ) and Q t ∈ L(Y0 )
(which in our stochastic programming applications correspond to some conditional
expectations) satisfy PT = I , Q T = I as well as the following identities:
Pt = Pt Pτ = Pτ Pt ; Q t = Q t Q τ = Q τ Q t , t = 0, . . . , τ − 1, (6.80)
and
Q t+1 At = At Q t+1 ; t = 0, . . . , T − 1, (6.81)
Note that (6.80) implies that the sequences of spaces Vt and Z t are nondecreasing.
We introduce the adapted costate
p̄_t = Q_t^⋆ p_t, t = 1, …, T.  (6.83)
Remark 6.41 By Remark 6.8, the transposes of conditional expectations are conditional expectations (in a generalized sense for the L^∞ norm), so that (at least in the case of L^s spaces with s ∈ [1, ∞)), in the stochastic optimization applications, p̄_t will be adapted. This justifies the terminology of adapted costate.
Lemma 6.42 Under the assumptions of Lemma 6.38, if (6.80)–(6.82) hold, then the
following adapted costate equation holds
p̄_t = Q_t^⋆(y_t^* + w_t^* + A_t^⋆ p̄_{t+1}), t = 1, …, T − 1;
p̄_T = y_T^* + w_T^*.  (6.84)
Remark 6.43 Similarly to Remark 6.40, we observe that (6.79) is equivalent to the fact that, for t = 0 to T − 1, ū_t minimizes u ↦ ℓ_t(u, ȳ_t) + ⟨p̄_{t+1}, B_t u⟩ over K_t.
Definition 6.44 We say that a measurable mapping (with values in a Banach space)
u = (u 0 , . . . , u T −1 ) is adapted to the filtration if u t is Ft measurable for t = 0 to
T − 1.
we have indeed that yt ∈ Z t , for t = 0 to T . We assume next that the cost function
is an expectation with the property of additivity w.r.t. time, i.e.,
ℓ_t(u_t, y_t) = E ℓ̂_t(ω, u_t(ω), y_t(ω)), t = 0, …, T − 1;
ϕ(y_T) = E ϕ̂(ω, y_T(ω)),  (6.91)
where the functions ℓ̂_t(ω, ·, ·) and ϕ̂(ω, ·) are a.s. convex functions. Under technical
conditions seen in Sect. 3.2 of Chap. 3, we have that, for t = 0 to T − 1:
∂ℓ_t(u_t, y_t) = {(v_t^*, y_t^*) ∈ U_t^* × Y_t^*; (v_t^*(ω), y_t^*(ω)) ∈ ∂ℓ̂_t(ω, u_t(ω), y_t(ω)) a.s.},  (6.92)
∂ϕ(y_T) = {y_T^* ∈ Y_T^*; y_T^*(ω) ∈ ∂ϕ̂(ω, y_T(ω)) a.s.}.  (6.93)
We may denote the conditional expectation over Ft by Et . Noticing that the operators
Pt and Q t as well as their adjoints are conditional expectations over Ft , we may
write the adapted costate equation (6.84) in the following form:
p̄_t = E_t[y_t^* + w_t^* + A_t^⋆ p̄_{t+1}], t = 1, …, T − 1;
p̄_T = y_T^* + w_T^*.  (6.94)
Remark 6.45 In practice it is not easy to deal with functions of several variables.
Storing them, or computing conditional expectations becomes very expensive when
the dimension increases. The optimality conditions are nevertheless of interest for
studying theoretical properties (such as sensitivity analysis).
This case of a local operator is quite common in practice. Assuming that Bt has the
same structure and identifying the operators At and Ât , Bt and B̂t , we can express
the optimality conditions in the following form:
y_{t+1}(ω) = Â_t(ω) y_t(ω) + B̂_t(ω) u_t(ω) + d_t(ω) a.s., t = 0, …, T − 1;  y_0 ∈ Z_0 given;  (6.100)
p̄_t = E_t[y_t^* + w_t^* + Â_t^⊤ p̄_{t+1}], t = 1, …, T − 1;  p̄_T = y_T^* + w_T^*;  (6.101)
E_t[v_t^* + B̂_t^⊤ p̄_{t+1}] + q_t = 0, t = 0, …, T − 1.  (6.102)
6.2.7.1 Framework
Let yt ∈ [ym , y M ] denote the amount of water at a dam at the beginning of day t.
We can turbine an amount u t ∈ [u m , u M ], and spill an amount st ≥ 0. The natural
increment of water is bt ≥ 0. So the dynamics is
yt+1 = yt + bt − u t − st , t = 0, . . . , T − 1. (6.103)
In a deterministic version of this problem, where bt and ct are known for all time t,
the problem of maximizing the revenue can be written as
Min_{u,s}  − Σ_{t=0}^{T−1} c_t u_t − C_T y_T  s.t. (6.103)–(6.104).  (6.106)
where
w_t^* ≤ 0 if y_t = y_m;  w_t^* = 0 if y_t ∈ (y_m, y_M);  w_t^* ≥ 0 if y_t = y_M.  (6.108)
We can interpret p̂t as the marginal value of storing, called in this context the water
price. If the market price ct is strictly smaller (resp. strictly greater) than the water
value, then one should store (resp. turbine) as much as possible. Observe that the
water value decreases (resp. increases) when the storage attains the minimum (resp.
maximum) value.
For the spilling variable s the policy is to take st = 0 as long as the water value is
positive, and st ≥ 0 otherwise (with a value compatible with the constraint yt+1 ≤
y M ).
This is in agreement with the following observation. If during some time interval the inflows are large, it may be worth turbining even if the market price is low. So the water price should be small, and may increase afterwards.
Exercise 6.46 If ym = −∞ and y M = +∞, show that the optimal strategies are to
take u t = u m if ct < C T , and u t = u M if ct > C T , and u t ∈ [u m , u M ] otherwise.
We may assume that randomness occurs only in the variables bt and ct . Here we will
assume that
bt is deterministic; ym = 0; y M = +∞, (6.111)
V_t = Z_t = L^∞(F_t).  (6.112)
6.3 Notes
The discussion of conditional expectation is classical, see e.g. Malliavin [77], and
Dellacherie and Meyer [40]. For more on first-order optimality conditions, see
Rockafellar and Wets [103, 104], Wets [124] and Dallagi [37] for the numerical
aspects.
Chapter 7
Markov Decision Processes
Any element of X_i^k has the representation x = (i, x^{k+1}, …, x^N). Let the set of events (denoted by Ω in probability theory) be X_i^k, with σ-field P(X_i^k). We denote by P_i^k
© Springer Nature Switzerland AG 2019
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_7
which remains meaningful for a process starting at a time possibly less than k.
In the next lemma, we check that, given the knowledge of the state at some time ℓ < N, the additional knowledge of past states (for times up to k − 1) is useless for the estimation of x^{ℓ+1} (and so, by induction, for x^j, j > ℓ + 1).
(x^k, …, x^ℓ) ∈ A  (7.6)
Therefore by the Bayes rule
as was to be shown.
We can view M^k = {M^k_{ij}}_{(i,j)∈S×S} as a possibly 'infinite matrix' with a (nonnegative) element M^k_{ij} in row i and column j, the sum over each row being equal to 1. We call such a 'matrix' having these two properties a transition operator. If S is finite, a transition operator M reduces to a stochastic matrix (a matrix with nonnegative elements whose sum over each row is 1).
We have the following calculus rules that extend the usual matrix calculus: prod-
ucts between transition operators, and the product of a transition operator with a
7.1 Controlled Markov Chains 225
horizontal vector on the left, or a vertical vector on the right, under appropriate
conditions on these vectors.
More precisely, let ℓ^1 and ℓ^∞, respectively, denote the spaces of summable and bounded sequences, whose elements are represented as horizontal (for ℓ^1) and vertical (for ℓ^∞) vectors. These spaces are endowed, respectively, with the norms
‖π‖_1 := Σ_{i∈S} |π_i|;  ‖v‖_∞ := sup_{i∈S} |v_i|.  (7.8)
We recall that ℓ^∞ is the topological dual (the set of continuous linear forms) of ℓ^1. We denote the duality pairing by
π v := Σ_{i∈S} π_i v_i, for all π ∈ ℓ^1 and v ∈ ℓ^∞.  (7.9)
This is in accordance with the rules for products of vectors in the case of a finite state space. Let π ∈ ℓ^1, v ∈ ℓ^∞, and M be a transition operator. We define the products πM ∈ ℓ^1 and Mv ∈ ℓ^∞ by
(πM)_j := Σ_{i∈S} π_i M_{ij};  (Mv)_i := Σ_{j∈S} M_{ij} v_j, for all i, j in S.  (7.10)
It is easy to check that the product of two transition operators is a transition operator.
We interpret
P := { π ∈ ℓ^1; π_i ≥ 0, i ∈ S; Σ_{i∈S} π_i = 1 }  (7.14)
as a set of probability laws over S, and ℓ^∞ as a space of values. The (left) product of a
probability law π with a transition operator is a probability law, and we can interpret
the pairing (7.9) as the expectation of v under the probability law π . One can interpret
the ith row of M^k as the probability law of x^{k+1}, knowing that the process x satisfies x^k = i ∈ S.
Let x ∈ X^k, the class of processes starting at time k. It may happen that the initial state x^k is unknown, but has a known probability law π^k; we then write x^k ∼ π^k. Then we may define the event set as X^k and the probability of x ∈ X^k as
We note that
P^k_{π^k}(x) := π^k_{x^k} P^k_{x^k}(x),  (7.16)
and that, for ℓ > k, the probability law π^ℓ of x^ℓ, i.e., π^ℓ := P(x^ℓ | x^k ∼ π^k), satisfies the forward Kolmogorov equation
π^{ℓ+1}_j = Σ_i π^ℓ_i P[x^{ℓ+1} = j | x^ℓ = i] = Σ_i π^ℓ_i M^ℓ_{ij}, for ℓ = k to N − 1,  (7.17)
or equivalently
π^{ℓ+1} = π^ℓ M^ℓ = π^k Π_{q=k}^{ℓ} M^q, for ℓ = k to N − 1.  (7.18)
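The recursion (7.18) is just a row-vector/matrix product; a small sketch for a three-state chain (the transition probabilities are hypothetical):

```python
import numpy as np

# A stationary transition operator on S = {0, 1, 2} (hypothetical probabilities).
M = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

pi = np.array([1.0, 0.0, 0.0])   # law of x^k: all mass at state 0

# Forward Kolmogorov equation (7.18): the law is propagated by pi <- pi M.
laws = [pi]
for _ in range(5):
    pi = pi @ M
    laws.append(pi)
```

Each iterate remains a probability law, since the left product with a transition operator preserves nonnegativity and total mass.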
V_i^k = c_i^k + Σ_{ℓ=k+1}^{N} π^ℓ c^ℓ.  (7.20)
Denote by e_j the probability law concentrated at state j, i.e., the element of ℓ^1 with all components equal to 0, except for the jth one, which equals 1.
V_i^k = c_i^k + Σ_{j∈S} P[x^{k+1} = j | x^k = i] E[ Σ_{ℓ=k+1}^{N} c^ℓ_{x^ℓ} | x^{k+1} = j ].  (7.22)
Now P[x^{k+1} = j | x^k = i] = M^k_{ij} and E[ Σ_{ℓ=k+1}^{N} c^ℓ_{x^ℓ} | x^{k+1} = j ] = V_j^{k+1}. The conclusion follows.
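The backward recursion V^k = c^k + M^k V^{k+1} just established is easily cross-checked against the forward expression (7.20); a sketch with hypothetical data and a single stationary transition operator:

```python
import numpy as np

N = 4
rng = np.random.default_rng(1)
c = rng.uniform(size=(N + 1, 3))     # stage costs c^k over three states
M = np.full((3, 3), 1.0 / 3.0)       # one stationary transition operator, for simplicity

# Backward induction: V^N = c^N, then V^k = c^k + M V^{k+1}.
V = c[N].copy()
for k in range(N - 1, -1, -1):
    V = c[k] + M @ V

# Forward formula (7.20) for the chain started at state 0.
pi, total = np.eye(3)[0], c[0][0]
for l in range(1, N + 1):
    pi = pi @ M
    total += pi @ c[l]
```

After the loop, V holds V^0 and total holds the forward evaluation for the initial state 0; the two coincide.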
In the case of an infinite horizon, the probability space can be defined by Kol-
mogorov’s extension of finite horizon probabilities, see Theorem 3.24. We first con-
sider a problem with discount rate β ∈ (0, 1) and non-autonomous data, i.e., ck and
M k depend on the time k. We assume that
V_i^k := (1 − β) E[ Σ_{ℓ=k}^{∞} β^{ℓ−k} c^ℓ_{x^ℓ} | x^k = i ].  (7.24)
V_i^k/(1 − β) = c_i^k + e_i Σ_{ℓ=k+1}^{∞} β^{ℓ−k} M^k ⋯ M^{ℓ−1} c^ℓ
 = c_i^k + Σ_{j∈S} M^k_{ij} ( β c^{k+1}_j + Σ_{ℓ=k+2}^{∞} β^{ℓ−k} e_j M^{k+1} ⋯ M^{ℓ−1} c^ℓ ),  (7.27)
so that
V_i^k/(1 − β) = c_i^k + Σ_{j∈S} M^k_{ij} E[ Σ_{ℓ=k+1}^{∞} β^{ℓ−k} c^ℓ_{x^ℓ} | x^{k+1} = j ],  (7.28)
and the above expectation is nothing else than β V_j^{k+1}/(1 − β). The conclusion follows.
Remark 7.4 Lemma 7.3 allows us to compute V^k given V^{k+1}. In practice, we can compute an approximation of V^k given a horizon N > 0, setting
V_i^{k,N} := (1 − β) E[ Σ_{ℓ=k}^{N−1} β^{ℓ−k} c^ℓ_{x^ℓ} | x^k = i ].  (7.30)
This is the value function of a problem with finite horizon N and can therefore be computed by induction, starting from V^{N,N} = 0. We have the error estimate
‖V^{k,N} − V^k‖_∞ ≤ (1 − β) Σ_{ℓ≥N} β^{ℓ−k} ‖c^ℓ‖_∞ ≤ β^{N−k} ‖c‖_∞.  (7.31)
Remark 7.5 In the autonomous case, i.e., when (c^k, M^k) does not depend on time, and is then denoted by (c, M), it is easily checked that V^k actually does not depend on k, and is therefore denoted by V. Then Lemma 7.3 tells us that V satisfies
V = (1 − β)c + β M V. (7.32)
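Equation (7.32) is a linear system with matrix I − βM, invertible since β < 1; a minimal sketch with hypothetical data:

```python
import numpy as np

beta = 0.9
c = np.array([1.0, 0.0, 2.0])                 # hypothetical stage costs
M = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.3, 0.7]])               # hypothetical transition operator

# Solve (I - beta M) V = (1 - beta) c, i.e. V = (1 - beta) c + beta M V.
V = np.linalg.solve(np.eye(3) - beta * M, (1 - beta) * c)

# The solution is a fixed point of the affine map on the right of (7.32).
residual = np.max(np.abs(V - ((1 - beta) * c + beta * M @ V)))
```

With the (1 − β) normalization, V is a weighted average of future stage costs, so it stays within [min c, max c].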
Remark 7.6 We often have periodic data (think of seasonal effects in economic modelling), i.e., (c^k, M^k) = (c^{k+K}, M^{k+K}), where the positive integer K is called
Consider now a Markov chain whose transition probabilities M^k_{ij}(u) depend on a control variable u ∈ U_i^k, where U_i^k is an arbitrary set depending on the time k and state i ∈ S. We have costs depending on the control and state, c_i^k : U_i^k → ℝ, and final values ϕ ∈ ℓ^∞, such that
Let Φ^k denote the set of feedback mappings (at time k), which with each i ∈ S associate some u_i ∈ U_i^k. Given a horizon N > k, we choose a feedback policy, i.e., an element u of the set
Φ^{(0,N−1)} := Φ^0 × ⋯ × Φ^{N−1},  (7.36)
V_i^k(u) := E^u [ Σ_{ℓ=k}^{N−1} c^ℓ_{x^ℓ}(u^ℓ_{x^ℓ}) + ϕ_{x^N} | x^k = i ], k ∈ ℕ, i ∈ S.  (7.37)
Here, by the short notation ck (u), we mean the function of i ∈ S with value cik (u i ).
Also, the following holds:
‖V^k(u)‖_∞ ≤ Σ_{ℓ=k}^{N−1} ‖c^ℓ‖_∞ + ‖ϕ‖_∞, k = 0, …, N.  (7.39)
‖V^k‖_∞ ≤ Σ_{ℓ=k}^{N−1} ‖c^ℓ‖_∞ + ‖ϕ‖_∞, k = 0, …, N.  (7.41)
Given ε ≥ 0 and k ∈ {0, . . . , N − 1}, we define the set Φ k,ε of ε-optimal feedback
policies at time k, as
Φ^{k,ε} = { û ∈ Φ^k; û_i ∈ ε-argmin_{u∈U_i^k} { c_i^k(u) + Σ_j M^k_{ij}(u) V_j^{k+1} }, for all i ∈ S }.  (7.42)
By ε-argminu∈Uik , we mean the set of points where the infimum is attained up to ε,
that is, in the present setting, the set of û i ∈ Uik such that
c_i^k(û_i) + Σ_j M^k_{ij}(û_i) V_j^{k+1} ≤ ε + inf_{u∈U_i^k} { c_i^k(u) + Σ_j M^k_{ij}(u) V_j^{k+1} }.  (7.43)
Note that this set may be empty if ε = 0. Consider the dynamic programming equation: find v = (v^0, …, v^N) ∈ (ℓ^∞)^{N+1} such that
v_i^k = inf_{u∈U_i^k} { c_i^k(u) + Σ_j M^k_{ij}(u) v_j^{k+1} }, i ∈ S, k = 0, …, N − 1;
v^N = ϕ.  (7.44)
Proposition 7.7 The (minimal) value function V^k is the unique solution of the dynamic programming equation. If the policy ū is such that, for some ε_k ≥ 0, ū^k ∈ Φ^{k,ε_k} for all k, then
V_i^k ≤ V_i^k(ū) ≤ V_i^k + ε̄^k,  ε̄^k := Σ_{ℓ=k}^{N−1} ε_ℓ,  k = 0, …, N − 1.  (7.45)
In particular, if the above relation holds with ε_k = 0 for all k, then the policy ū is optimal in the sense that V_i^k = V_i^k(ū), for all k = 0, …, N − 1 and i ∈ S.
So, we have proved that v_i^k ≤ V_i^k ≤ V_i^k(ū) ≤ ε̄^k + v_i^k. Since ε̄^k can be taken arbitrarily small, v^k = V^k for all k, and the conclusion follows.
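The backward sweep solving (7.44) is immediate to implement; a sketch for a chain with two actions (all data hypothetical), which also records a greedy, hence optimal, feedback at each time:

```python
import numpy as np

N = 5
phi = np.zeros(3)                                        # final cost
# c[a], M[a]: cost vector and transition operator under action a (time-independent here).
c = [np.array([1.0, 2.0, 0.5]), np.array([1.5, 0.5, 1.0])]
M = [np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]),
     np.array([[0.5, 0.5, 0.0], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]])]

# Backward sweep of (7.44): v^N = phi, v^k_i = min_a { c_i(a) + sum_j M_ij(a) v^{k+1}_j }.
v = phi.copy()
policy = []
for k in range(N - 1, -1, -1):
    q = np.stack([c[a] + M[a] @ v for a in range(2)])    # q[a, i]: value of action a at i
    policy.append(q.argmin(axis=0))                      # an optimal feedback at time k
    v = q.min(axis=0)                                    # v becomes v^k
```

After the loop, v is the value function V^0 and policy[::-1] lists optimal feedbacks for times 0, …, N − 1; in particular v is dominated by the value of any fixed-action policy.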
In this section, we assume that the data are autonomous: the cost function, transition
operator and control sets do not depend on time, and we have a discount coefficient
β ∈ (0, 1). The following theorem characterizes the optimal policies, and shows in particular that we can limit ourselves to autonomous (time-independent) feedback policies Φ, which with each i ∈ S associate an element u_i of U_i. Sometimes we will use the following hypothesis:
For all i and j in S, U_i is a compact metric space,
and the functions c_i(u) and M_{ij}(u) are continuous.  (7.48)
Given the discount factor β ∈ (0, 1), the (minimal) value function is defined by
Theorem 7.8 (i) The value function is the unique solution of the dynamic program-
ming equation: find v ∈ ∞ such that
v_i = inf_{u∈U_i} { (1 − β)c_i(u) + β Σ_j M_{ij}(u) v_j },  i ∈ S.  (7.51)
Set ε′ := (1 − β)^{−1} ε. Then the policy u is ε′-suboptimal, in the sense that the associated value V(u) satisfies
(iii) Let (7.48) hold. Then there exists (at least) an optimal policy.
We recall that
| inf_{u∈U} a(u) − inf_{u∈U} b(u) | ≤ sup_{u∈U} |a(u) − b(u)|,  (7.55)
Proof (a) Let us show first that (7.51) has a unique solution. This equation is of the form v = Tv. Since ‖Tw‖_∞ ≤ (1 − β)‖c‖_∞ + β‖w‖_∞, the operator T indeed maps ℓ^∞ into itself. Given w and w′ in ℓ^∞, using (7.55) and the fact that M(u) is a transition operator, we get
β^{−1} |(Tw′)_i − (Tw)_i| ≤ sup_{u∈U_i} Σ_{j∈S} M_{ij}(u) |(w′ − w)_j| ≤ sup_{u∈U_i} Σ_{j∈S} M_{ij}(u) ‖w′ − w‖_∞,
and the r.h.s. is equal to ‖w′ − w‖_∞. So, T is a contraction with coefficient β and, by the Banach–Picard theorem, the equation v = Tv has a unique solution, denoted by v. We next prove that v is equal to the minimal value V.
(b) Let u ∈ Φ be a policy, with associated value V (u). Since
we deduce using (7.52) that v − V (u) ≤ β M(u)(v − V (u)). Lemma 7.9 below
ensures that v ≤ V (u). Since this holds for all policies, we also have v ≤ V .
(c) If (7.53) is satisfied, using (7.55) we get
V_i(u) − v_i ≤ ε + β sup_{ũ∈U_i} Σ_{j∈S} M_{ij}(ũ)(V_j(u) − v_j) ≤ ε + β sup(V(u) − v).  (7.58)
Taking the supremum in i, we deduce that sup(V(u) − v) ≤ ε′. Since v ≤ V(u) for any u ∈ Φ, we deduce (7.54), whence (ii).
(d) It follows from (ii) that a policy satisfying the dynamic programming equation
(7.51) is optimal. Such a policy exists whenever (7.48) holds. Points (i) and (iii)
follow.
Lemma 7.9 Let M be a transition operator, β ∈ (0, 1), ε ≥ 0 and w ∈ ℓ^∞ satisfy w ≤ ε1 + βMw. Then w ≤ (1 − β)^{−1} ε1.
Proof We have Mw ≤ (sup w)1 since M is a transition operator, and so w ≤ (ε +
β sup w)1. Therefore, sup w ≤ ε + β sup w, whence the conclusion.
Definition 7.10 We say that the sequence {u^q} of autonomous feedback policies simply converges to ū ∈ Φ if u_i^q → ū_i, for all i ∈ S. We define in the same way the simple convergence in ℓ^1 and ℓ^∞.
Lemma 7.11 Let {u q } simply converge to ū in Φ. Then the associated value sequence
V (u q ) simply converges to V (ū).
Proof Since V(u^q) is bounded in ℓ^∞, by a diagonal argument there exists a subsequence of V(u^q) that simply converges to some V̄ ∈ ℓ^∞. We will show that V̄ = V(ū); it then easily follows that the whole sequence V(u^q) simply converges to V(ū). So, extracting a subsequence if necessary, we may assume that V(u^q) simply converges to V̄ ∈ ℓ^∞. Fix ε ∈ (0, 1) and i ∈ S. There exists a partition (I, J) of S such that
such that
I has a finite cardinality and Mi j (ū) ≥ 1 − 21 ε. (7.59)
j∈I
Since I is finite and u q simply converges to ū, for q large enough, we have that
j∈I Mi j (u ) ≥ 1 − ε, and so
q
Σ_{j∈J} M_{ij}(ū) ≤ ε;  Σ_{j∈J} M_{ij}(u^q) ≤ ε.  (7.60)
Set, for i ∈ S, Δ_i := lim sup_q |V_i(u^q) − V_i(ū)|. Since I is finite, we have that
Δ_i = lim sup_q | (1 − β)(c_i(u_i^q) − c_i(ū_i)) + β Σ_j (M_{ij}(u_i^q) V_j(u^q) − M_{ij}(ū_i) V_j(ū)) |
 ≤ β lim sup_q | Σ_{j∈I} (M_{ij}(u^q) V_j(u^q) − M_{ij}(ū) V_j(ū)) | + ε(‖V(u^q)‖_∞ + ‖V(ū)‖_∞)
 ≤ ε(sup_q ‖V(u^q)‖_∞ + ‖V(ū)‖_∞).
We now want to characterize optimal policies when starting from a given point, say
i ∈ S . That is, a policy u ∈ Φ such that the associated value satisfies Vi (u) = Vi .
Definition 7.13 Consider an autonomous Markov chain with transition operator M.
Let i ∈ S . We say that j ∈ S is q-steps accessible from i (with q ≥ 1) if a Markov
chain starting at state i and time 0 has a nonzero probability of having its state equal
to j at time q. We say that j is accessible from i if it is n-steps accessible for some
n ≥ 1. The union of such j is called the accessible set from state i.
Let here M^q denote the q-fold product of M. It is easily checked by induction that M^q_{ij} > 0 iff the Markov chain starting at i at time 0 has a positive probability of being at j at time q. Therefore the accessible set is
S_i = ∪_{q=1}^{∞} { j ∈ S; M^q_{ij} > 0 }.  (7.61)
In the case of a controlled Markov chain, we denote by Si (u) the accessible set when
starting from i, with the policy u ∈ Φ. Set Sˆi (u) := {i} ∪ Si (u).
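For a finite chain with m states, checking the powers M, M², …, M^m suffices in (7.61), since any accessible state is reached by a walk of length at most m; a sketch (transition matrix hypothetical):

```python
import numpy as np

def accessible(M, i):
    """Accessible set (7.61) from state i: all j with (M^q)_{ij} > 0 for some q in 1..m."""
    m = M.shape[0]
    P = np.eye(m)
    acc = set()
    for _ in range(m):
        P = P @ M
        acc |= {j for j in range(m) if P[i, j] > 0}
    return acc

# Deterministic chain 0 -> 1 -> 2 -> 2 and 3 -> 0 (hypothetical).
M = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0, 0.0]])
S0 = accessible(M, 0)   # states reachable from 0 in at least one step
```

Note that here state 3 is not accessible from 0, while every state but 3 is accessible from 3.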
Theorem 7.14 A policy u ∈ Φ is optimal, when starting from i 0 ∈ S , iff it satisfies
the dynamic programming equation over Sˆi0 (u), i.e.,
u_i ∈ argmin_{v∈U_i} { (1 − β)c_i(v) + β Σ_j M_{ij}(v) V_j },  for all i ∈ Ŝ_{i_0}(u).  (7.62)
In the case of autonomous infinite horizon problems, the simplest method for solving
the dynamic programming principle (7.51) is the value iteration algorithm: compute
the sequence v^q in ℓ^∞, for q ∈ ℕ, solution of
v_i^{q+1} = inf_{u∈U_i} { (1 − β)c_i(u) + β Σ_j M_{ij}(u) v_j^q },  i ∈ S, q ∈ ℕ.  (7.64)
v_i^q = (1 − β) min_{u∈Φ^{(0,q−1)}} E^u [ Σ_{ℓ=0}^{q−1} β^ℓ c_{x^ℓ}(u^ℓ_{x^ℓ}) + β^q v^0_{x^q} | x^0 = i ],  i ∈ S,  (7.65)
where the set Φ (0,N −1) of feedback policies was defined in (7.36).
Proposition 7.15 The value iteration algorithm converges to the unique solution V
of (7.51), and we have
Proof We showed in the proof of Theorem 7.8 that the Bellman operator T , defined
in (7.56), is a contraction with ratio β in the uniform norm. We conclude by the
Banach–Picard theorem.
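A minimal implementation of the iteration (7.64) for a two-action chain (data hypothetical); since the Bellman operator is a β-contraction, the residual of the fixed-point equation (7.51) decays geometrically:

```python
import numpy as np

beta = 0.8
c = [np.array([1.0, 2.0, 0.5]), np.array([1.5, 0.5, 1.0])]
M = [np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]),
     np.array([[0.5, 0.5, 0.0], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]])]

def bellman(v):
    # Dynamic programming operator of (7.51): minimum over the two actions.
    return np.min([(1 - beta) * c[a] + beta * M[a] @ v for a in range(2)], axis=0)

v = np.zeros(3)
for _ in range(500):
    v = bellman(v)

residual = np.max(np.abs(bellman(v) - v))   # fixed-point residual, essentially zero here
```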
Remark 7.16 When taking v0 = 0 we obtain the explicit estimate of distance to the
solution:
When β is close to 1, the value iteration algorithm can be very slow. A possible alternative is policy iteration, or the Howard algorithm. Roughly speaking, the idea is, for a given policy, to compute the associated value, and then to update the policy by computing the argument of the minimum in the dynamic programming operator. We assume that the compactness hypothesis (7.48) holds. Each iteration of the algorithm has two steps:
Algorithm 7.18 (Howard algorithm)
1. Initialization: choose a policy u 0 ∈ Φ; set q := 0.
2. Compute the value function vq associated with the policy u q ∈ Φ, i.e., the solu-
tion of the linear equation
4. q := q + 1; go to step 2.
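A sketch of the two-step iteration for a finite chain with two actions (data hypothetical); the loop stops when the greedy policy no longer changes, at which point the evaluated value solves the dynamic programming equation (7.51):

```python
import numpy as np

beta = 0.8
c = [np.array([1.0, 2.0, 0.5]), np.array([1.5, 0.5, 1.0])]
M = [np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]),
     np.array([[0.5, 0.5, 0.0], [0.3, 0.4, 0.3], [0.1, 0.1, 0.8]])]
S = 3

u = np.zeros(S, dtype=int)                         # initial policy
for _ in range(50):
    # Policy evaluation: solve v = (1-beta) c_u + beta M_u v as a linear system.
    Mu = np.array([M[u[i]][i] for i in range(S)])
    cu = np.array([c[u[i]][i] for i in range(S)])
    v = np.linalg.solve(np.eye(S) - beta * Mu, (1 - beta) * cu)
    # Policy improvement: greedy w.r.t. the current value.
    q = np.stack([(1 - beta) * c[a] + beta * M[a] @ v for a in range(2)])
    u_new = q.argmin(axis=0)
    if np.array_equal(u_new, u):
        break                                      # greedy policy unchanged: v solves (7.51)
    u = u_new

residual = np.max(np.abs(q.min(axis=0) - v))       # DP-equation residual at termination
```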
Denote by V the value function, the unique solution of the dynamic programming
principle (7.51).
Proposition 7.19 Let (7.48) hold. Then the Howard algorithm is well-defined. The
sequence vq is nonincreasing and satisfies
In addition, denote by v̄q+1 the value obtained by applying the value iteration to vq .
Then vq+1 ≤ v̄q+1 .
Proof The linear system (7.68) has a unique solution in ℓ^∞, since it is the fixed-point equation of a contraction. In view of (7.48), the minimum in the second step is attained. The sequence v^q is bounded in ℓ^∞ since we have
whence
since M(u q+1 ) has nonnegative elements and vq+1 ≤ vq . The conclusion
follows.
Remark 7.20 The previous proof shows that policy iteration converges at least as rapidly as value iteration. However, each of its iterations requires solving a linear system. This can be expensive, especially if the transition operators are not sparse.
Remark 7.21 The contraction constant β is optimal for the Howard algorithm, as Example 7.22 shows. In addition, in this example the sequences computed by the value iteration and Howard algorithms coincide. So, in general, the Howard algorithm does not converge more rapidly than value iteration.
Example 7.22 Here is a variant of an example due to Tsitsiklis, see Santos and Rust [109], showing that the Howard algorithm does not necessarily converge faster than the value iteration algorithm. Let S = ℕ and, for all i ∈ ℕ, i ≠ 0, U_i = {0, 1}. The decision 0 (resp. 1) represents a (deterministic) move from state i to state i − 1 (resp. to itself). The only possible decision at state 0 is to remain there. The cost is 1 at any state i ≠ 0, and 0 at state 0. So the optimal policy is to choose u = 0 whenever i ≠ 0. The optimal value is V_0 = 0 and, for i > 0, V_i = 1 − β^i.
We choose to initialize Howard's algorithm with the policy u = 1 for any i > 0, so that v_i^0 = 1 for all i > 0.
We then have
‖v^0 − V‖_∞ = v_1^0 − V_1 = 1 − (1 − β) = β.  (7.75)
At each iteration q of the algorithm the decision at state q changes from 1 to 0, and
this is the only change, so that
v_i^q = V_i, 0 ≤ i ≤ q;  v_i^q = 1, i > q.  (7.76)
It follows that
‖v^q − V‖_∞ = v^q_{q+1} − V_{q+1} = 1 − (1 − β^{q+1}) = β^{q+1} = β^q ‖v^0 − V‖_∞.  (7.77)
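This behaviour can be reproduced on a truncated version of the example (states 0, …, L), provided the improvement step keeps the current decision unless a strictly better one exists; a sketch verifying (7.77) for the first iterations:

```python
import numpy as np

beta, L = 0.5, 40
V_opt = 1.0 - beta ** np.arange(L + 1)     # optimal value: V_0 = 0 and V_i = 1 - beta^i

def evaluate(u):
    # Value of policy u: decision 0 moves i -> i-1, decision 1 stays (cost 1 for i > 0).
    v = np.zeros(L + 1)
    for i in range(1, L + 1):
        v[i] = (1 - beta) + beta * v[i - 1] if u[i] == 0 else 1.0
    return v

u = np.ones(L + 1, dtype=int)              # initial policy: stay everywhere
errors = []
for _ in range(5):
    v = evaluate(u)
    errors.append(np.max(np.abs(v - V_opt)))
    for i in range(1, L + 1):              # switch only on strict improvement
        if beta * v[i - 1] < beta * v[i]:
            u[i] = 0
```

The recorded errors are β, β², β³, …, matching (7.77): each iteration fixes the decision at only one additional state.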
The idea is to replace, in Howard's algorithm, the linear system resolution with finitely many value iteration-like steps in which the decision is frozen.
4. q := q + 1; go to step 2.
So, informally speaking, when m is large, the sequence v^q should not be too different from the one computed by Howard's algorithm. The gain is that the frozen value-iteration steps are generally much faster than the corresponding classical value iterations.
τ := min{ k ∈ ℕ; x^k ∉ Ŝ }.  (7.81)
V_i := (1 − β) inf_{u∈Φ} E^u [ Σ_{k=0}^{τ−1} β^k c_{x^k}(u_{x^k}) + β^τ ϕ_{x^τ} | x^0 = i ].  (7.82)
Remark 7.25 If the ci are set to zero and ϕi = 1 (resp. ϕi = −1) for all i ∈ S \ Sˆ,
we see that the problem consists, roughly speaking, in maximizing (resp. minimizing)
a discounted value of the exit time.
It appears that exit problems reduce to the standard one by adding a final state, say i_f, to the state space, which becomes S := S ∪ {i_f}. The decision sets are, for i ∈ S:
U_i if i ∈ Ŝ, and {0} otherwise.  (7.83)
In other words, for any i ∈ S \ Ŝ, the only possible transition is to the final state i_f, and when in i_f the process remains there. The costs are
c_i(u) = c_i(u) if i ∈ Ŝ;  ϕ_i if i ∈ S \ Ŝ;  0 if i = i_f.  (7.85)
Proposition 7.26 Let supu∈U |ci (u)| be finite and ϕ bounded. Then the value function
of the exit time problem is the unique solution of the dynamic programming equation
v_i = inf_{u∈U_i} { (1 − β)c_i(u) + β Σ_j M_{ij}(u) v_j },  i ∈ Ŝ;
v_i = (1 − β)ϕ_i,  i ∈ S \ Ŝ.  (7.86)
Proof This is a consequence of our previous results. We have just shown that exit
time problems can be rewritten as standard controlled Markov chain problems, and
the value at the state i f is zero. Writing the corresponding dynamic programming
equation, for i ∈ Sˆ we get the first row in (7.86), and otherwise we get
Remark 7.27 The value iteration algorithm (rewriting the exit problem as a standard
one), when starting with initial values such that
satisfies
v_i^q = (1 − β)ϕ_i, i ∈ S \ Ŝ;  v^q_{i_f} = 0,  for all q ∈ ℕ.  (7.89)
Exercise 7.28 Extend the policy iteration algorithm to the present setting, the
sequence of values satisfying (7.89).
7.1.6.1 Setting
We now study an extension of the previous framework, with the additional possibility of a stopping decision at any state i ∈ S, with cost ψ_i ∈ ℝ ∪ {+∞} (in fact, the possibly infinite value restricts the possibility of stopping to the states with a finite value of ψ). We assume that ψ has a finite infimum. Let M(u) be the transition operator of the controlled Markov chain. We assume that (7.48) holds. Given Ŝ ⊂ S, we denote by τ the first exit time from Ŝ, and consider the additional decision θ, called the stopping time (a function of i ∈ S). Set
called the stopping time (a function of i ∈ S ). Set
1 if θ < τ,
χθ<τ = (7.91)
0 otherwise,
and adopt a similar convention for χθ≥τ . We consider the controlled stopping time
problem
V_i := (1 − β) inf_{u∈Φ} E^u [ Σ_{k=0}^{(θ∧τ)−1} β^k c(u)_{x^k} + β^θ χ_{θ<τ} ψ_{x^θ} + β^τ χ_{θ≥τ} ϕ_{x^τ} | x^0 = i ].  (7.92)
Remark 7.29 (i) When Ui is a singleton for all i ∈ S , the only decision is when to
stop. We speak then of a pure stopping problem. (ii) The optimal policy may be to
never stop.
In the sequel we assume that
(i) the compactness hypothesis (7.48) holds;  (ii) sup_{i∈S, u∈U_i} |c_i(u)| < ∞;  (iii) ϕ ∈ ℓ^∞;  (iv) inf ψ is finite.  (7.93)
Theorem 7.30 The value function V of the stopping problem belongs to ℓ^∞, and is the unique solution of the dynamic programming equation
(i) v_i = min( inf_{u∈U_i} { (1 − β)c_i(u) + β Σ_j M_{ij}(u) v_j }, (1 − β)ψ_i ),  i ∈ Ŝ;
(ii) v_i = (1 − β)ϕ_i,  i ∉ Ŝ.  (7.94)
Proof Choosing a policy without stopping, we get that V_i ≤ ‖c‖_∞. Changing c_i into (inf c)1 and ψ_i into (inf ψ)1, for each i ∈ S, we get V_i ≥ min(−‖c‖_∞, (1 − β) inf ψ) (remember that ψ has a finite infimum). So, V ∈ ℓ^∞.
We can rewrite the stopping problem as a standard one. As in the case of exit problems, we add to S a final state i_f with only a transition to itself, and transitions from any i ∈ S \ Ŝ to i_f, with associated cost ϕ_i. The difference is that we add the possible decision to move from any i ∈ S to i_f, with associated cost ψ_i. The associated dynamic programming equation then reads
v_i = min( inf_{u∈U_i} { (1 − β)c_i(u) + β Σ_j M_{ij}(u) v_j }, (1 − β)ψ_i + β v_{i_f} ),  i ∈ Ŝ;
v_i = (1 − β)ϕ_i + β v_{i_f},  i ∈ S \ Ŝ;
v_{i_f} = β v_{i_f}.  (7.95)
Clearly this holds iff vi f = 0 and the second row of (7.94) is satisfied. So, (7.94) is
equivalent to the dynamic programming equation of the reformulation as a standard
problem and therefore characterizes the minimum value function.
As in the case of exit problems, we easily check that the value iteration algorithm
(applied to the reformulation as a standard problem), initialized with v0 such that
satisfies
So, we can define the value iteration algorithm for stopping problems as computing the sequence satisfying (7.97) as well as
v_i^{q+1} = min( inf_{u∈U_i} { (1 − β)c_i(u) + β Σ_j M_{ij}(u) v_j^q }, (1 − β)ψ_i ),  i ∈ Ŝ.  (7.98)
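A sketch of (7.98) for a pure stopping problem (a single continuation action; Ŝ = S, so the exit part is vacuous; all data hypothetical), together with the stopping region read off at the fixed point:

```python
import numpy as np

beta = 0.9
c = np.array([1.0, 1.0, 1.0, 1.0])         # continuation cost
psi = np.array([-1.0, 0.5, -0.2, 2.0])     # stopping costs (negative = reward)
M = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.5, 0.0, 0.0, 0.5]])

# Iteration (7.98): compare continuation with immediate stopping at each state.
v = np.zeros(4)
for _ in range(500):
    v = np.minimum((1 - beta) * c + beta * M @ v, (1 - beta) * psi)

# States where stopping achieves the minimum in (7.94)(i).
stop = (1 - beta) * psi <= (1 - beta) * c + beta * M @ v
```

For these data the fixed point is v = (−0.1, 0.05, −0.02, 0.1): it is optimal to stop everywhere except at the last state, whose stopping cost is too high.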
v_i^q < ψ_i for all q ∈ ℕ. That is, the set I^q of states with a 'non-stopping decision at iteration q' (defined precisely below) is nondecreasing. We next formulate the Howard algorithm. Given a policy u^q ∈ Φ, one has to compute the solution v^q of the linear equation
v_i^q = (1 − β)c_i(u_i^q) + β Σ_j M_{ij}(u_i^q) v_j^q,  i ∈ I^q;
v_i^q = (1 − β)ψ_i,  i ∈ Ŝ \ I^q;
v_i^q = (1 − β)ϕ_i,  i ∉ Ŝ.  (7.99)
u_i^q ∈ argmin_{u∈U_i} { (1 − β)c_i(u) + β Σ_j M_{ij}(u) v_j^{q−1} },  i ∈ Ŝ.  (7.100)
3. Set
I^q := I^{q−1} ∪ { i ∈ Ŝ; (1 − β)c_i(u_i^q) + β Σ_j M_{ij}(u_i^q) v_j^{q−1} < (1 − β)ψ_i }.  (7.101)
4. Compute the solution v^q of the linear equation (7.99); go to step 2.
From the study of the policy iteration in the standard framework, see Proposition
7.19, we deduce that:
Proposition 7.32 The Howard algorithm computes a nonincreasing sequence v^q that satisfies
‖v^{q+1} − V‖_∞ ≤ β ‖v^q − V‖_∞.  (7.102)
It may happen that exit time or stopping problems have finite values in the absence of
discounting. Indeed, consider the controlled stopping time problem similar to (7.92),
but without discounting:
V_i := inf_{u∈Φ} E^u [ Σ_{k=0}^{(θ∧τ)−1} c(u)_{x^k} + χ_{θ<τ} ψ_{x^θ} + χ_{θ≥τ} ϕ_{x^τ} | x^0 = i ].  (7.103)
Example 7.33 Assume that c(u) and ϕ have nonnegative values, and that ψ ∈ ℓ^∞ (in particular, stopping in any state is possible), with inf ψ < 0. Minorizing V_i by changing c(u) and ϕ to zero, we obtain that for all i ∈ Ŝ, inf ψ ≤ V_i ≤ ψ_i, so that V ∈ ℓ^∞.
Using the arguments of the previous sections, one easily checks that the value
functions satisfy a dynamic programming principle similar to those already stated,
but with β = 1. We leave the details as an exercise.
This section presents some more advanced aspects of the theory of controlled Markov chains, among them problems with expectation constraints and problems with partial information, including open-loop control.
We will see, in the presence of constraints over the expectations of functions of the
state, a nice relation with the duality theory presented in the first chapter.
7.2.1.1 Setting
We apply the duality theory of Chap. 1 to this (nonconvex) problem. For λ ∈ Rr and
v ∈ Ui , set
ciλ (v) := ci (v) + λ · Ψi (v). (7.108)
and set V̄_i^λ := inf_{u∈Φ} V_i^λ(u). The (standard) Lagrangian, duality Lagrangian, and dual cost associated with problem (7.107) are, respectively:
L(u, λ) := V_{i_0}(u) + λ · W_{i_0}(u) = V^λ_{i_0}(u);
L′(u, λ) := L(u, λ) − σ_K(λ);
δ(λ) := inf_{u∈Φ} L′(u, λ) = V̄^λ_{i_0} − σ_K(λ).  (7.110)
7.2 Advanced Material on Controlled Markov Chains 245
We know that its value (the dual value) is a lower bound of the value of (7.107). This
lower bound is often useful, since the primal problem is not easy to solve. We next
analyze some cases when there is no duality gap, i.e., the primal and dual values are
equal.
$$|S| = m < \infty, \quad (7.112)$$
Note that the above hypotheses do not imply that problem (7.107) is convex (for instance, the criterion is not a convex function of the policy). We can rewrite the dual problem as the minimization of an l.s.c. function.
Theorem 7.34 Let hypotheses (7.112)–(7.114) hold and the primal problem be feasible. Then
(i) The set of solutions of the dual problem (7.111) is nonempty and compact, and $\lambda$ is a dual solution iff there exists a Borel probability measure $\mu$ over $\Phi$ such that, denoting by $\mathbb{E}_\mu g(u) = \int_\Phi g(u)\,\mathrm{d}\mu(u)$ the associated expectation, the following holds:
$$\operatorname{supp}\mu \subset \operatorname*{argmin}_{u\in\Phi} L(u,\lambda); \quad \mathbb{E}_\mu W_{i_0}(u) \in K; \quad \lambda \in N_K\big(\mathbb{E}_\mu W_{i_0}(u)\big). \quad (7.116)$$
(ii) Problems (7.107) and (7.111) have equal value, and there exists a primal-dual
solution (ū, λ). Any such primal-dual solution (ū, λ) is characterized by the relations
Proof (i) It is easily checked that $u \mapsto (V_{i_0}(u), W_{i_0}(u))$ is continuous. Indeed, let $u$ and $u'$ be two policies. Then
Denote the corresponding value associated with $u \in \Phi$ by $V^n_{i_0}(u)$; the associated perturbed problem is
$$\min_{u\in\Phi} V^n_{i_0}(u); \quad W_{i_0}(u) \in K. \quad (P_n)$$
$$V_{i_0}(\bar u) = \bar V_{i_0} \le \lim_n \bar V^n_{i_0} \le \lim_n V^n_{i_0}(\bar u) = V_{i_0}(\bar u). \quad (7.122)$$
The first inequality follows from $c_i(u) \le c_i^n(u)$, and the two other relations are obvious. So, $\bar V^n_{i_0} \to \bar V_{i_0}$.
Let λn be a dual solution of the perturbed problem (Pn ) (it exists by the same
arguments as for the nominal problem). In view of the qualification condition (7.113),
$\{\lambda^n\}$ is bounded (adapt the arguments in the proof of Proposition 1.160). Extracting
a subsequence if necessary, we may assume that λn → λ̄.
For $u \in \Phi$ and $j \in S$, set $V_j^{\lambda,n}(u) := V_j^n(u) + \lambda\cdot W_j(u)$, as well as the associated Lagrangian $L^n(u,\lambda)$.
Let $\Phi^n$ be the set of $u \in \Phi$ that attain the minimum in $L^n(\cdot,\lambda^n)$. Let $u^n \in \Phi^n$, with accessible set $S_{u^n}$ when starting from state $i_0$. By Theorem 7.14, for all $i \in S_{u^n}$, $u_i^n$ attains the minimum over $U_i$ of $u \mapsto c_i^n(u) + \sum_{j\in S} M_{ij}(u)V_j^{\lambda,n}$. The latter being strictly convex, all optimal policies have the same value at $i_0$, and therefore by induction the same accessible set, and coincide over this accessible set. Of course, for states outside the accessible set, the control can take arbitrary values. So all elements of $\Phi^n$ have the same value of the constraint $W_{i_0}(u)$. By Proposition 1.164, $(P_n)$ and its dual have the same value, $u^n$ is a solution of $(P_n)$, and we have that
$$W_{i_0}(u^n) \in K; \quad \lambda^n \in N_K\big(W_{i_0}(u^n)\big); \quad V_{i_0}^n(u^n) + \lambda^n\cdot W_{i_0}(u^n) \le V_{i_0}^n(u) + \lambda^n\cdot W_{i_0}(u), \ \text{for all } u \in \Phi. \quad (7.124)$$
We have proved that $\operatorname{val}(P_n) \to \operatorname{val}(P)$. Passing to the limit in the above optimality conditions at $(u^n, \lambda^n)$, we obtain that the limit point $(\bar u, \bar\lambda)$ satisfies the optimality conditions for the original problem, i.e.
$$W_{i_0}(\bar u) \in K; \quad \bar\lambda \in N_K\big(W_{i_0}(\bar u)\big); \quad V_{i_0}(\bar u) + \bar\lambda\cdot W_{i_0}(\bar u) \le V_{i_0}(u) + \bar\lambda\cdot W_{i_0}(u), \ \text{for all } u \in \Phi. \quad (7.125)$$
By Proposition 1.164, ū is a primal solution and the primal and dual problems have
the same value. That the primal-dual solutions are characterized by (7.117) is a
standard result of duality theory.
Theorem 7.35 Let (7.127) and (7.128) hold. Then the primal and dual problems
have the same value, and a nonempty set of solutions.
Proof Adapt the techniques in the proofs of the previous statements to the case of a
finite horizon.
Remark 7.36 Obviously the technique can easily be adapted to the case of several
probabilistic constraints.
We next come back to the finite horizon framework. Assume that the Ui are equal to
a set denoted by U , and consider the problem of control of the Markov chain without
observation of the state, and knowing only a probability law of the initial state.
We consider a problem starting at time $k \in \{0,\dots,N-1\}$, with initial probability law $\pi^k$. An open-loop policy is now an element $u$ of $U^{N-k}$, whose component $u^\ell$ represents the decision taken at time $\ell = k,\dots,N-1$. The transition matrices $M^k(u)$ are known, and therefore so are the probability laws for $x^\ell$. Equivalently, for $\ell = k+1,\dots,N$:
$$\pi^\ell(u) = \pi^k M^{k\ell}(u), \quad \text{where } M^{k\ell}(u) := \prod_{q=k}^{\ell-1} M^q(u^q). \quad (7.130)$$
So, the criterion associated with an open-loop policy $u \in U^{N-k}$ and an initial probability law $\pi^k$ is
$$V^k(u,\pi^k) = \mathbb{E}^u\Big[\sum_{\ell=k}^{N-1} c^\ell_{x^\ell}(u^\ell) + \varphi_{x^N}\Big] = \sum_{\ell=k}^{N-1} \pi^\ell(u)c^\ell(u^\ell) + \pi^N(u)\varphi. \quad (7.131)$$
It is a linear function of $\pi^k$:
$$V^k(u,\pi^k) = \pi^k\hat V^k(u), \quad \text{where } \hat V^k(u) := \sum_{\ell=k}^{N-1} M^{k\ell}(u)c^\ell(u^\ell) + M^{kN}(u)\varphi. \quad (7.132)$$
Note that $M^{kk}(u)$ is the identity mapping. For any open-loop policy $u$, the linear mapping $\pi^k \mapsto V^k(u,\pi^k)$ is Lipschitz from $\ell^1$ into $\mathbb{R}$, with a constant independent of $u$.
Set
$$\mathcal{U} := \text{the set of mappings } S \to U. \quad (7.134)$$
Since an infimum of uniformly Lipschitz functions is Lipschitz with the same constant, the Bellman values
$$\bar V^k(\pi) = \inf_{u\in U^{N-k}} \pi\hat V^k(u) \quad (7.135)$$
are Lipschitz as well.
We can link the previous results to the first-order optimality conditions of some
discrete-time optimal control problem. For the sake of clarity, let us first consider an
abstract discrete-time optimal control problem with state equation
$$y^k = F_k(u^k, y^{k-1}), \quad k = 1,\dots,N; \qquad \hat y^0 - y^0 = 0, \quad (7.137)$$
with cost function
$$J(u,y) := \sum_{k=1}^N \ell_k(u^k, y^{k-1}) + \Psi(y^N). \quad (7.138)$$
The reduced cost is $f(u) := J(u, y[u])$. The optimal control problem is to minimize $f(u)$ over the admissible controls ($u^k \in U_k$ for all $k$). The associated Lagrangian is
$$\mathcal{L}(u,y,p) := J(u,y) + \sum_{k=1}^N p^k\cdot\big(F_k(u^k, y^{k-1}) - y^k\big) + p^0\cdot(\hat y^0 - y^0). \quad (7.140)$$
We next assume that the functions $F_k$, $\ell_k$ and $\Psi$ are continuously differentiable. The costate equation is obtained by setting
$$D_y\mathcal{L}(u,y,p) = 0. \quad (7.141)$$
Given $(u,y)$ with $y = y[u]$, the (backwards) costate equation has a unique solution, denoted by $p[u]$ and called the costate associated with $u$. Since $f(u) = \mathcal{L}(u, y[u], p[u])$, $D_y\mathcal{L}(u, y[u], p[u]) = 0$, and $(u, y[u])$ satisfies the state equation, we have, by the chain rule:
Lemma 7.39 Let u be a local solution of the optimal control problem. For k = 1 to
N , if the sets Uk are convex, then
Proof Let $v \in U_k$ and $k \in \{1,\dots,N\}$. For $t \in (0,1)$, set $w_t^k := (1-t)u^k + tv$, and $w_t^\ell = u^\ell$ for $\ell \in \{1,\dots,N\}$, $\ell \ne k$. Since $u$ is a local solution, we have that
$$0 \le \lim_{t\downarrow 0} \frac{f(w_t) - f(u)}{t} = \nabla_{u^k} f(u)\cdot(v - u^k), \quad (7.148)$$
We apply the previous results to the Markov chain open loop setting (7.135). The
state equation is the law of the Markov chain process in the absence of observation
(but here writing the state as a vertical vector in order to adapt the optimal control
setting):
$$\nu^{k+1} = M^k(u^k)^\top\nu^k, \quad k = 0,\dots,N-1; \qquad \hat\nu^0 - \nu^0 = 0, \quad (7.150)$$
where ν̂0 is a given probability law on S . The control variables are the u k , and the
state variables are the laws ν k represented as vertical vectors. The cost function is
$$J(u,\nu) := \sum_{k=0}^{N-1} \nu^k\cdot c^k + \nu^N\cdot\varphi. \quad (7.151)$$
$$\mathcal{L}(u,\nu,W) := J(u,\nu) + \sum_{k=0}^{N-1} W^{k+1}\cdot\big(M^k(u^k)^\top\nu^k - \nu^{k+1}\big) + W^0\cdot(\hat\nu^0 - \nu^0). \quad (7.153)$$
So, the costate equation gives
$$W^N = \varphi; \qquad W^k = c^k + M^k(u^k)W^{k+1}, \quad k = 0,\dots,N-1. \quad (7.154)$$
Lemma 7.41 The costate associated with the problem (7.152) coincides with the
value function V , and Pontryagin’s principle (7.149) holds for this problem.
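A quick numerical check of Lemma 7.41 on a hypothetical uncontrolled two-state chain (all data invented): propagating the laws $\nu^k$ forward by $\nu^{k+1} = M^\top\nu^k$ and the costate backward by $W^k = c + MW^{k+1}$, $W^N = \varphi$, the total cost equals $\nu^0\cdot W^0$, as expected if the costate is the value function.

```python
# Check that the costate of (7.154) prices the initial law: J = nu^0 . W^0.
# Hypothetical uncontrolled 2-state chain with time-independent data.

N = 3
M = [[0.7, 0.3], [0.4, 0.6]]       # transition matrix (rows sum to 1)
c = [1.0, 2.0]                     # running cost
phi = [0.5, 1.5]                   # final cost
nu0 = [0.25, 0.75]                 # initial law

# Forward: laws nu^k as row vectors, nu^{k+1} = M^T nu^k.
nus = [nu0]
for _ in range(N):
    prev = nus[-1]
    nus.append([sum(prev[i] * M[i][j] for i in range(2)) for j in range(2)])

# Backward: costate W^k = c + M W^{k+1}, starting from W^N = phi.
W = phi[:]
for _ in range(N):
    W = [c[i] + sum(M[i][j] * W[j] for j in range(2)) for i in range(2)]

# Total cost J = sum_k nu^k . c + nu^N . phi.
J = sum(sum(nus[k][i] * c[i] for i in range(2)) for k in range(N)) \
    + sum(nus[N][i] * phi[i] for i in range(2))
```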
As before x is the state of a Markov chain, and at each time step we observe a
signal y taking values in a finite set Y . The process starts at, say, time k and finishes
at time $N$. Let $k \le n \le N$. The probability of $((x^k,y^k),\dots,(x^n,y^n))$, given the initial probability law $\pi^{k,y^k} \in \ell^1$ for $x^k$ (which therefore depends on the initial signal $y^k$), is
$$P\big((x^k,y^k),\dots,(x^n,y^n) \mid \pi^{k,y^k}\big) = \pi^{k,y^k}_{x^k} \prod_{\ell=k}^{n-1} M^{\ell,y^{\ell+1}}_{x^\ell x^{\ell+1}}. \quad (7.156)$$
Here $M^{\ell,r}_{ij}$ represents the probability, being in state $i$ at time $\ell$, of having both the transition to state $j$, and the observation $r$ at time $\ell+1$. So, we have that
$$M^{\ell,r}_{ij} \ge 0; \qquad \sum_{r\in Y}\sum_{j\in S} M^{\ell,r}_{ij} = 1, \quad \text{for all } i \in S \text{ and } \ell \in \{k,\dots,N-1\}. \quad (7.157)$$
The marginal law of $(y^k,\dots,y^n,x^n)$ given $\pi^{k,y^k}$ is
$$P(y^k,\dots,y^n,x^n \mid \pi^{k,y^k}) = \sum_{x^k,\dots,x^{n-1}} P\big((x^k,y^k),\dots,(x^n,y^n) \mid \pi^{k,y^k}\big). \quad (7.158)$$
Therefore,
$$P(y^k,\dots,y^n,x^n \mid \pi^{k,y^k}) = \pi^{k,y^k} \prod_{\ell=k}^{n-1} M^{\ell,y^{\ell+1}} e_{x^n}, \quad (7.159)$$
where here $e_i$ denotes the element of $\ell^\infty$ with zero components except for the $i$th one, equal to 1. So, the probability law for the observations is
$$P(y^k,\dots,y^n \mid \pi^{k,y^k}) = \pi^{k,y^k} \prod_{\ell=k}^{n-1} M^{\ell,y^{\ell+1}} \mathbf{1}. \quad (7.160)$$
The conditional law of $x^n$, knowing the 'initial' law at time $k$ and the signal up to time $n$, is therefore
$$q^n = P\big(x^n \mid (y^k,\dots,y^n,\pi^{k,y^k})\big) = \frac{\pi^{k,y^k} \prod_{\ell=k}^{n-1} M^{\ell,y^{\ell+1}}}{\pi^{k,y^k} \prod_{\ell=k}^{n-1} M^{\ell,y^{\ell+1}} \mathbf{1}}. \quad (7.161)$$
One usually computes the marginal law (and therefore the conditional law) by induction, in the following way, for $n > k$:
$$p^k = \pi^{k,y^k}, \qquad p^n := p^{n-1} M^{n-1,y^n}, \qquad q^n := p^n/(p^n\mathbf{1}). \quad (7.162)$$
Next, knowing $(y^k,\dots,y^n,\pi^{k,y^k})$, the probability that $y^{n+1} = z$ is
$$\frac{P(y^k,\dots,y^n,y^{n+1}=z,\pi^{k,y^k})}{P(y^k,\dots,y^n,\pi^{k,y^k})} = q^n M^{n,z}\mathbf{1}. \quad (7.163)$$
As expressed by (7.162), the conditional law at step $n+1$, knowing $\pi^{k,y^k}$ and $(y^k,\dots,y^{n+1})$ with $y^{n+1} = z$, will be
$$q^{n+1} := \frac{q^n M^{n,z}}{q^n M^{n,z}\mathbf{1}}, \quad \text{with probability } q^n M^{n,z}\mathbf{1}, \text{ for any } z \in Y. \quad (7.164)$$
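The recursion (7.162)–(7.164) is easy to implement; the sketch below uses invented pair matrices $M^z$ (constant in time), whose entries sum to 1 over $(z,j)$ for each row $i$, as required by (7.157).

```python
# Filtering recursion (7.162)-(7.164): M[z][i][j] is the probability, from
# state i, of moving to j while observing z (hypothetical numbers; for each
# i, the sum over (z, j) equals 1).

def filter_step(q, M, z):
    """Unnormalized update p = q M^{.,z}, then renormalize."""
    n = len(q)
    p = [sum(q[i] * M[z][i][j] for i in range(n)) for j in range(n)]
    s = sum(p)                       # = probability of observing z, given q
    return [pj / s for pj in p], s

M = {0: [[0.4, 0.1], [0.1, 0.2]],
     1: [[0.3, 0.2], [0.3, 0.4]]}
q = [0.5, 0.5]                       # initial conditional law
for z in (1, 0, 1):                  # an arbitrary observed signal sequence
    q, prob_z = filter_step(q, M, z)
```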
This is the equation of a dynamical system with state $q^n$, whose transitions are governed by probability laws depending only on the state. We see that this structure is very similar to that of Markov chains. Consider the value function
$$V^k(q) := \sum_{\ell=k}^{N-1} \pi^\ell c^\ell + \pi^N\varphi. \quad (7.165)$$
Here $\pi^\ell$ is the law of the process at time $\ell$, with initial value $q$ at time $k$, and the functions $c^\ell$ and $\varphi$ belong to $\ell^\infty$. So, the value function will satisfy the following
equation:
$$\begin{cases} V^n(q) = qc^n + \displaystyle\sum_{z\in Y} (qM^{n,z}\mathbf{1})\, V^{n+1}\!\left(\frac{qM^{n,z}}{qM^{n,z}\mathbf{1}}\right), \quad n = k,\dots,N-1;\\[1ex] V^N(q) = q\varphi. \end{cases} \quad (7.166)$$
We come back to the setting of Sect. 7.1.3: infinite horizon, discount factor β ∈
$(0,1)$, with value function denoted by $V$. We say that $v \in \ell^\infty$ is a subsolution of the 'discounted' dynamic programming equation if
$$v_i \le (1-\beta)c_i(u) + \beta\sum_{j\in S} M_{ij}(u)v_j, \quad \text{for all } i \in S \text{ and } u \in U_i. \quad (7.169)$$
Setting
$$\delta_i := \inf_u\Big((1-\beta)c_i(u) + \beta\sum_{j\in S} M_{ij}(u)v_j\Big) - v_i, \quad (7.170)$$
Assume next that both $S$ and the sets $U_i$, for all $i \in S$, are finite. Then (7.171) is a linear programming problem, which gives a way to solve the problem numerically.
The associated Lagrangian function is
$$L(v,\lambda) := -\pi v + \sum_{i\in S}\sum_{u\in U_i} \lambda_i(u)\Big(v_i - \big((1-\beta)c_i(u) + \beta\sum_{j\in S} M_{ij}(u)v_j\big)\Big). \quad (7.172)$$
So, the expression of the dual problem is:
$$\max_{\lambda\ge 0}\ -(1-\beta)\sum_{i\in S}\sum_{u\in U_i} c_i(u)\lambda_i(u); \qquad \sum_{u\in U_i}\lambda_i(u) = \pi_i + \beta\sum_{j\in S}\sum_{\hat u\in U_j} M_{ji}(\hat u)\lambda_j(\hat u). \quad (7.173)$$
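A numerical illustration of the idea behind the linear program, on invented two-state data: the value function is the largest subsolution of (7.169), so maximizing $\pi v$ over subsolutions recovers it. The sketch computes $V$ by value iteration and checks that $V$ is a subsolution while no strictly larger vector is.

```python
# The value function is the largest subsolution of (7.169): v is a
# subsolution iff v <= T v componentwise, where T is the Bellman operator.
# Hypothetical 2-state, 2-action data.

beta = 0.5
cost = [[1.0, 4.0], [2.0, 0.0]]
M = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.6, 0.4], [0.5, 0.5]]]

def bellman(v):
    """Bellman operator with the (1 - beta) normalization of the text."""
    return [min((1 - beta) * cost[i][a]
                + beta * sum(M[i][a][j] * v[j] for j in range(2))
                for a in range(2)) for i in range(2)]

V = [0.0, 0.0]
for _ in range(200):        # contraction with rate beta: converges fast
    V = bellman(V)

def is_subsolution(v):
    w = bellman(v)
    return all(v[i] <= w[i] + 1e-9 for i in range(2))
```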
7.3 Ergodic Markov Chains 255
We now consider what happens for undiscounted finite horizon processes when the
horizon goes to infinity. Under appropriate hypotheses, we are able to compute the
limit of the average reward per unit time, and to extend these results to the controlled
setting.
In this section we assume the state space to be finite.
7.3.1 Orientation
Consider an autonomous (uncontrolled) Markov chain $(c, M)$ with finite state space $S$ and cost function $c \in \ell^\infty$. Consider the sequence generated by the value iteration operator:
$$V^{n+1} = c + MV^n. \quad (7.174)$$
Setting
$$S^n := (\mathrm{Id} + M + \cdots + M^n)/(n+1), \quad (7.176)$$
we have
$$\bar V^n := \frac{1}{n}V^n = S^{n-1}c. \quad (7.177)$$
Observe that
$$(M - I)S^{n-1} = S^{n-1}(M - I) = (M^n - I)/n, \quad (7.178)$$
and therefore
$$(M - I)\bar V^n = (M - I)S^{n-1}c = \frac{1}{n}(M^n - I)c. \quad (7.179)$$
Since M n is a bounded sequence, the r.h.s. converges to 0. Therefore, any limit-point
of V̄ n is an eigenvector of M with eigenvalue 1. Since 1 is an eigenvector of M with
eigenvalue 1, we may ask when V̄ n converges to a multiple of 1.
A related question is, given a probability law π 0 for the starting point of the
Markov chain, to see how the related probabilities π n at step n and the average
probability π̄ n over the first n steps behave. We know that π n = π 0 M n , so that
$$\bar\pi^n := \frac{1}{n}(\pi^0 + \cdots + \pi^{n-1}) = \pi^0 S^{n-1}, \quad \text{for } n \ge 1. \quad (7.180)$$
$$\bar V^n(\pi^0) := \bar\pi^n c = \pi^0\bar V^n. \quad (7.183)$$
It may happen that the sequence $w^n$ has a Cesàro limit but does not converge; take for example $w^n = (-1)^n$. On the other hand, if $w^n$ has a limit, then $w^n$ converges in the Cesàro sense to the same limit. If $w^n \to \bar w$ at a linear rate, in the sense that
$$|w^n - \bar w| \le C\eta^n, \quad \text{for some } C > 0 \text{ and } \eta \in (0,1), \quad (7.184)$$
Taking the example of a constant sequence (except for the first term), we see that convergence in the Cesàro sense is typically at best at speed $1/n$.
Coming back to Markov chains, in view of (7.180) and (7.183), we obviously
have
$$\text{If } S^n \to \bar S, \text{ then } \pi^n \xrightarrow{C} \bar\pi = \pi^0\bar S, \text{ and } \bar V^n \xrightarrow{C} \bar S c. \quad (7.186)$$
Example 7.43 Consider an uncontrolled Markov chain with $S = \{1,2\}$ and $M$ equal to the permutation matrix $M := \begin{pmatrix} 0 & 1\\ 1 & 0 \end{pmatrix}$. Then $M^n$ is equal to $M$ if $n$ is odd, and equal to the identity otherwise. So, $M^n$ and $\pi^n$ have no limit. However, we have the Cesàro limits
$$M^n \xrightarrow{C} \tfrac12\begin{pmatrix} 1 & 1\\ 1 & 1 \end{pmatrix}; \qquad \pi^n \xrightarrow{C} \big(\tfrac12, \tfrac12\big), \qquad \bar V^N \to \tfrac12(c_1 + c_2)\mathbf{1}. \quad (7.187)$$
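The Cesàro limits of Example 7.43 can be reproduced directly; the averages $S^n$ converge to the matrix with all entries $\tfrac12$ at the $1/n$ speed noted above, while $M^n$ keeps oscillating.

```python
# Example 7.43 revisited: M^n oscillates but S^n = (Id + M + ... + M^n)/(n+1)
# converges (at speed ~1/n) to the matrix with all entries 1/2.

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

M = [[0.0, 1.0], [1.0, 0.0]]       # permutation matrix
P = [[1.0, 0.0], [0.0, 1.0]]       # current power M^k, starting at M^0 = Id
S = [[0.0, 0.0], [0.0, 0.0]]
terms = 1001                       # averages M^0, ..., M^1000
for _ in range(terms):
    S = [[S[i][j] + P[i][j] for j in range(2)] for i in range(2)]
    P = matmul(P, M)
S = [[S[i][j] / terms for j in range(2)] for i in range(2)]
```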
With a transition matrix M we associate the graph G M in which the set of nodes (or
vertices) is the state set S , and there is an edge (a directed arc) between vertices i
and j iff Mi j > 0.
It is easily checked that there exists an $n$-step walk from $i$ to $j$ iff $M^n_{ij} > 0$, i.e., if $j$ is $n$-step accessible from $i$ (Definition 7.13). We say that two states $i$, $j$ communicate
if each of them is accessible from the other. This is an equivalence relation whose
classes are called communication classes or just classes. A recurrent class is a class
that contains all states that are accessible from any of its elements. Once the state
enters this class, it stays in it forever. A transient class is a class that is not recurrent.
A state is transient (resp. recurrent) if it belongs to a transient (resp. recurrent) class.
Definition 7.45 The class graph is the graph whose nodes are the communication classes, with a directed arc between two classes $C$, $C'$ iff $C \ne C'$ and $M_{ij} > 0$ for some $i \in C$ and $j \in C'$.
Observe that the class graph is acyclic (it contains no cycle) so that each maximal
path ends in a recurrent class. In particular, there exists at least one recurrent class.
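Communication classes and recurrence can be computed from the graph $G_M$ alone; below is a sketch on a hypothetical four-state chain (states 0, 1 form a transient class, states 2, 3 a recurrent one), using the characterization that a class is recurrent iff no state outside it is accessible from it.

```python
# Communication classes of G_M via mutual reachability; a class is recurrent
# iff every state accessible from it stays inside it.  Invented 4-state data.

M = [[0.5, 0.3, 0.2, 0.0],
     [0.4, 0.4, 0.0, 0.2],
     [0.0, 0.0, 0.1, 0.9],
     [0.0, 0.0, 0.7, 0.3]]
n = len(M)
edges = [[j for j in range(n) if M[i][j] > 0] for i in range(n)]

def reachable(i):
    """All states accessible from i (including i itself)."""
    seen, stack = {i}, [i]
    while stack:
        for j in edges[stack.pop()]:
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return seen

reach = [reachable(i) for i in range(n)]
# i and j communicate iff each is accessible from the other.
classes = {frozenset(k for k in range(n) if k in reach[i] and i in reach[k])
           for i in range(n)}
recurrent = {c for c in classes if all(reach[i] <= c for i in c)}
```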
Proof Let π be an invariant probability law. Let T (resp. R) denote the set of transient
(resp. recurrent) states. Then $M^n_{ij} = 0$ if $i \in R$ and $j \in T$, for any $n \ge 1$, and so,
$$\pi(T) = \sum_{j\in T}\sum_{i\in S} \pi_i M^n_{ij} = \sum_{j\in T}\sum_{i\in T} \pi_i M^n_{ij} = \sum_{i\in T}\pi_i\sum_{j\in T} M^n_{ij}. \quad (7.188)$$
This implies that if $\pi_i \ne 0$, then $\sum_{j\in T} M^n_{ij} = 1$ for all $n \ge 1$, meaning that all states accessible from $i$ are transient, contradicting the fact that (as is easily established) some recurrent state must be accessible from any transient state.
Lemma 7.47 Let $C$ denote the square submatrix of $M$ corresponding to the rows and columns associated with transient states. Then $C^n \to 0$ at a linear rate.
Proof For large enough n, the probability that the Markov chain starting at any
transient state $i \in T$ is in a recurrent state is positive, say greater than $\varepsilon > 0$ when $n > n_0$. Let $n > n_0$. By the previous discussion, $\sum_{j\in T} C^n_{ij} < 1 - \varepsilon$. But then $C^n$ is a strict contraction in $\ell^\infty$ since, for any $v \in \ell^\infty$:
$$\|C^n v\|_\infty \le \max_{i\in T}\sum_{j\in T} C^n_{ij}|v_j| \le \max_{i\in T}\Big(\sum_{j\in T} C^n_{ij}\Big)\|v\|_\infty \le (1-\varepsilon)\|v\|_\infty. \quad (7.189)$$
The first (resp. second) equality occurs iff for some i ∈ S , Mi j = 0 for any j ∈ S
such that y j > min(y) (resp. y j < max(y)), and
(ii) if M is a transition matrix such that ε := mini, j Mi j is positive, then for any
y ∈ ∞ , M n y converges to a constant vector and
$$\bar\pi^\perp = \{y \in \ell^\infty;\ \bar\pi y = 0\}. \quad (7.197)$$
We recover the fact that 1 is a simple eigenvalue of $M$ (the associated eigenspace has dimension 1) and that the other eigenvalues have modulus less than one. It follows that if $\pi^0$ is any probability law for $x^0$, then for some $C > 0$:
$$|\pi^0 M^n - \bar\pi| \le C\eta^n. \quad (7.199)$$
If the transition matrix M has a single class (which is therefore recurrent), we cannot
hope M n to converge in general (see Example 7.43), but we still have the following
result.
Lemma 7.53 A single class transition matrix has a unique invariant probability.
Proof For $\varepsilon \in (0,1)$, set $M^\varepsilon := \varepsilon I + (1-\varepsilon)M$, where $M$ is a single class transition matrix. Then, by the binomial formula for commuting matrices,
$$(M^\varepsilon)^n = \sum_{p=0}^n \binom{n}{p}\varepsilon^{n-p}(1-\varepsilon)^p M^p \quad (7.200)$$
has, for $n$ large enough, only positive elements. By Lemma 7.50, $M^\varepsilon$ has a unique invariant probability. Since it is easily checked that $M^\varepsilon$ and $M$ have the same invariant probabilities, the conclusion follows.
Lemma 7.54 A single class transition matrix $M$ (with therefore a unique invariant probability $\bar\pi$) is such that, for any probability $\pi^0$, $\bar\pi^n := \pi^0 S^n$ converges to $\bar\pi$, and $S^n$ converges to
$$\bar S = \begin{pmatrix} \bar\pi\\ \vdots\\ \bar\pi \end{pmatrix}, \quad (7.201)$$
the $m \times m$ matrix whose rows are all equal to $\bar\pi$.
More generally, after some permutation of the state indexes we may write the transition matrix in the form
$$M = \begin{pmatrix} A & 0\\ B & C \end{pmatrix}, \quad (7.202)$$
where $A$ is a $p \times p$ matrix, $p \le m$, the first $p$ states being recurrent, the others being transient. The matrix $A$ is block diagonal, each block corresponding to a recurrent class. Then
$$M^n = \begin{pmatrix} A^n & 0\\ B_n & C^n \end{pmatrix}, \quad (7.203)$$
where
$$B_{n+1} = BA^n + CBA^{n-1} + \cdots + C^nB = \sum_{i=0}^n C^iBA^{n-i}. \quad (7.204)$$
Since $C^n \to 0$ at a linear rate, the series $\sum_{k\ge 0} C^k$ converges, and
$$(I - C)\sum_{k=0}^{\infty} C^k = \lim_q (I - C)\sum_{k=0}^q C^k = \lim_q (I - C^{q+1}) = I. \quad (7.206)$$
Using (7.204), we obtain that, for some $\bar B_n$, the average sum $S^n$ takes the form
$$S^n = \begin{pmatrix} \bar A^n & 0\\ \bar B_n & \bar C^n \end{pmatrix}. \quad (7.208)$$
By Lemma 7.54, Ān → Ā (as before, the block diagonal matrix whose rows for
a given recurrent class are equal to the corresponding invariant probability) and
C̄ n → 0 since C n → 0 at a linear rate. By (7.204),
$$\bar B_n = \frac{1}{n+1}(B_1 + \cdots + B_n) = \frac{1}{n+1}\sum_{q=0}^{n}\sum_{i+j=q} C^iBA^j, \quad (7.209)$$
and therefore
$$\bar B_n = \frac{1}{n+1}\sum_{i=0}^{n}\sum_{j=0}^{n-i} C^iBA^j = \sum_{i=0}^{n} C^iB\,\frac{n-i+1}{n+1}\bar A^{n-i}. \quad (7.210)$$
We assume that the Markov chain has a unique invariant probability law π̄ (i.e., there
is a unique recurrent class). Set $W := M - I$, and consider the linear equation
$$c + WV = \eta\mathbf{1}, \quad (7.211)$$
with unknowns $(V,\eta)$; the scalar $\eta$ is the average cost. Solving the linear system (7.211) therefore gives a way to compute the average cost without computing the invariant probability law.
Proof Since the setting is finite-dimensional, it suffices to check that the only solutions for $c = 0$ are when $\eta = 0$ and $V$ is constant. That $\eta = 0$ follows from (7.212). Now let $V$ attain its maximum at $i_0$. Then
$$V_{i_0} = (MV)_{i_0} = \sum_j M_{i_0 j}V_j \le \sum_j M_{i_0 j}V_{i_0} = V_{i_0}, \quad (7.213)$$
where we used the fact that $M$ is a stochastic matrix. The equality means that $V_j = V_{i_0}$ whenever $M_{i_0 j} \ne 0$, i.e., when $j$ is 1-step accessible from $i_0$. By induction, we deduce that this holds for any state accessible from $i_0$, and in particular
for any element of the recurrent class. We have a similar result by considering a state
where V attains its minimum. Therefore the minimum of V is equal to its maximum.
The result follows.
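On a hypothetical two-state single-class chain, solving (7.211) with the first component of $V$ normalized to zero ($V$ is only determined up to a constant) indeed returns $\eta = \bar\pi c$ without ever forming $\bar\pi$:

```python
# Solving c + (M - I)V = eta * 1 (7.211) yields the average cost eta.
# Hypothetical 2-state data; normalize V[0] = 0 and solve the 2x2 system
# by hand:
#   c0 + M01 * V1 = eta,
#   c1 + (M11 - 1) * V1 = eta.

M = [[0.5, 0.5], [0.25, 0.75]]
c = [3.0, 0.0]
V1 = (c[1] - c[0]) / (M[0][1] - M[1][1] + 1.0)
eta = c[0] + M[0][1] * V1

# Cross-check against pi_bar . c, with invariant law pi_bar = (1/3, 2/3).
pi_bar = [1.0 / 3.0, 2.0 / 3.0]
```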
The linear equation (7.211), taking into account the decomposition (7.202) of the transition matrix $M$, is of the form
$$\begin{cases} \eta\mathbf{1} = c' + (A - I)V',\\ \eta\mathbf{1} = c'' + BV' + (C - I)V'', \end{cases} \quad (7.214)$$
where $c'$ refers to the subvector of $c$ with the components corresponding to the recurrent class, etc. Since $A$ is a transition matrix with a unique recurrent class, the first row has a unique solution that determines $(\eta, V')$. Since, by Lemma 7.55(i), $(C - I)$ is invertible, the second row determines the value of $V''$, given $(\eta, V')$. Observe that, in order to have the correct value of $\eta$, it is enough to solve the first block, i.e., we can ignore the transient states.
We have
$$\bar V^n = S^{n-1}c = S^{n-1}\eta\mathbf{1} - S^{n-1}(M - I)V = \eta\mathbf{1} - \frac{1}{n}(M^n - I)V. \quad (7.215)$$
Remark 7.57 If $M$ is a regular transition matrix then, by Remark 7.52, 1 is a simple eigenvalue. Take for $V$ (defined in $\ell^\infty/\mathbb{R}$) the representative $\hat V$ that is a combination of vectors of the other eigenspaces, whose eigenvalues have modulus less than 1, so that $\|M^n\hat V\| \le c\gamma^n$ for some $c > 0$, $\gamma < 1$. So, (7.215) gives the following expansion of $V^n = n\bar V^n$:
$$\|V^n - n\eta\mathbf{1} - \hat V\| = \|M^n\hat V\| \le c\gamma^n. \quad (7.216)$$
In view of (7.217), each feasible pair (u, π ) is such that π = π(u). By the previous
section, an equivalent problem is
Theorem 7.58 Let ū satisfy the ergodic dynamic programming principle (7.220).
Then
(i) ū is a solution of the ergodic problem (7.219).
(ii) If û is another solution of (7.220), writing c̄ = c(ū), ĉ = c(û), etc., then V̄ − V̂
is maximal and constant over S (û). If in addition S (ū) ∩ S (û) = ∅, then V̄ − V̂
is constant.
Proof (i) Let (ū, V̄ , η̄) satisfy (7.220) and let û ∈ Φ. Then
$$\delta V \le \hat M\,\delta V. \quad (7.222)$$
We deduce that if δV attains its maximum at state i, then the maximum is also attained
at each 1-step accessible state from i, and also by induction at any accessible state
from i (for the policy û). This implies that δV is maximal and constant over S (û).
Exchanging the roles of û and ū, we obtain that δV is minimal and constant over
S (ū). So, if S (ū) ∩ S (û) = ∅, then δV is constant.
We assume next that the Ui are compact, c and M are continuous, and (7.217)
holds; then, by Lemma 7.56, the following Howard policy iteration algorithm is
well-defined (compare to the Howard Algorithm 7.18 for the discounted case):
4. Go to step 2.
Next, consider the following single class hypothesis for any strategy, stronger than
(7.217):
For each u ∈ Φ, S = S (u). (7.225)
Theorem 7.60 (i) The sequence computed by the Howard algorithm is such that ηq
is nonincreasing.
(ii) If (7.225) holds, then any limit point of (u q , V q , ηq ) satisfies the ergodic dynamic
programming principle and is therefore an optimal policy in view of Theorem 7.58.
and therefore
By the definition of the Howard algorithm, the l.h.s., denoted by ξ q , has nonpositive
values. Multiplying on the left by π q+1 ≥ 0, since π q+1 W q+1 = 0, we obtain that
So, ηq is nonincreasing.
(ii) Being bounded, ηq converges to some η̄ ∈ R. Take a subsequence for which
(u q , u q+1 ) → (ū, û), with similar conventions for costs, probabilities, etc. Passing
to the limit in the relation
we obtain that
ĉ + Ŵ V̄ ≤ c(u) + W (u)V̄ , for all u ∈ Φ. (7.230)
With (7.225) we conclude that (η̄, V̄ ) satisfies the ergodic dynamic programming
principle (7.220). The conclusion follows.
Remark 7.61 Any solution of the ergodic dynamic programming principle (7.220)
provides a stationary sequence for Howard’s algorithm. So, if the single class hypoth-
esis (7.225) holds, by the above two theorems, a policy is optimal iff it satisfies the
ergodic dynamic programming principle.
7.4 Notes
For partially observed processes, see Monahan [81]. For more on ergodic Markov
chains, see Arapostathis et al. [7] and Hsu et al. [61]. On the superlinear convergence
of Howard’s type algorithms, see Bokanowski, Maroso and Zidani [22], and Santos
and Rust [109].
For further reading we refer to the books by Bertsekas [19], Altman [5] (especially
about expectation constraints), Puterman [91], and for continuous state spaces to
Hernández-Lerma and Lasserre [56, 57]. The link with discretization of continuous-time processes is discussed in Kushner and Dupuis [67]. On the modified policy
iteration algorithms for discounted Markov decision problems, see Puterman and
Shin [92].
Chapter 8
Algorithms
In this section we will study the case of convex dynamic problems, whose convex
Bellman values can be approximated by a collection of affine minorants. We start
with the static case.
So, computing $\varphi_k(x)$ can be done by storing only $k+1$ vectors of $\mathbb{R}^{n+1}$, instead of $2(k+1)$ vectors of $\mathbb{R}^n$, as the definition of $\varphi_k$ would suggest.
Lemma 8.3 If ε > 0, the algorithm stops after finitely many iterations. If ε = 0,
either it stops after finitely many iterations, or ϕk (x k+1 ) → min X f , and any limit-
point of x k is a solution of (8.1).
Proof It suffices to study the case when ε = 0 and the algorithm does not stop. Let x̄
be a limit-point of x k . For the associated subsequence x ki , since ϕk is Lipschitz and
nondecreasing as a function of k, the value of its minimum converges. We conclude
using Lemma 8.1, (8.8) and
$$\min_{x\in X} f(x) \ge \lim_k \varphi_k(x^{k+1}) = \lim_i \varphi_{k_i-1}(x^{k_i}) = \lim_i \varphi_{k_i-1}(\bar x) = f(\bar x). \quad (8.10)$$
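A minimal sketch of the cutting-plane scheme behind Lemma 8.3, on a one-dimensional example invented for illustration, $f(x) = x^2$ on $X = [-1,2]$: the model $\varphi_k(x) = \max_j \{f(x^j) + g^j\cdot(x - x^j)\}$ is minimized (here crudely, on a grid; in practice by linear programming), and $\varphi_k(x^{k+1})$ is a lower bound for $\min_X f$.

```python
# Cutting-plane sketch: minimize f(x) = x**2 on X = [-1, 2] through the
# polyhedral model built from subgradient cuts.

def f(x):
    return x * x

def df(x):
    return 2.0 * x                  # a subgradient of f at x

lo, hi = -1.0, 2.0
grid = [lo + i * (hi - lo) / 2000 for i in range(2001)]
cuts = []                           # pairs (x_j, f(x_j)); slope is df(x_j)

def phi(x):
    return max(fx + df(xj) * (x - xj) for xj, fx in cuts)

x = hi
for _ in range(40):
    cuts.append((x, f(x)))          # add the cut generated at the iterate
    x = min(grid, key=phi)          # minimize the polyhedral model on X
lower = phi(x)                      # phi_k(x_{k+1}) <= min_X f = 0
```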
8.1.2.1 Principle
$$J(u,y) := \sum_{t=0}^{N-1} \ell_t(u_t, y_t) + \ell_N(y_N), \quad (8.11)$$
$$y_{t+1} = A_t y_t + B_t u_t, \quad t = 0,\dots,N-1; \qquad y_0 = y^0, \quad (8.12)$$
$$u_t \in U_t, \quad t = 0,\dots,N-1, \quad (8.13)$$
where the minimization is over the control variables satisfying the control constraints
(8.13). Then the following dynamic programming principle holds:
Since the data are Lipschitz, so are the Bellman values vτ , with constant, say, L. The
algorithm is as follows. At iteration k, we have a convex minorant ϕtk of vt , which
is therefore necessarily Lipschitz with constant at most L, and a nondecreasing
function of k. The initialization with k = 0 is usually done by taking ϕt0 equal to
a large negative number. We first perform the forward step: this means computing
a feasible trajectory $(u^k, y^k)$ such that $u_t^k$ is a solution of the approximate dynamic programming strategy, where $v_{t+1}$ is replaced by $\varphi_{t+1}^k$:
$$u_t^k \in \operatorname*{argmin}_{u\in U_t}\big\{\ell_t(u, y_t^k) + \varphi_{t+1}^k(A_t y_t^k + B_t u)\big\}, \quad t = 0,\dots,N-1. \quad (8.17)$$
This step is forward in the sense that we first compute u k0 , then u k1 , etc. We then
see how to perform the backward step, which consists in computing an improved
minorant of vt , i.e., ϕtk+1 such that
We will improve the minorant of the Bellman function by applying the subdifferential calculus rule in Lemma 1.120 to the forward step (8.17), combined with Theorem 1.117. Since $\ell_t$ and $\varphi_{t+1}^k$ are continuous, the latter in view of (8.17), for $t = 0$ to $N-1$, there exists
$$r_t^k = (r_{ut}^k, r_{yt}^k) \in \partial\ell_t(u_t^k, y_t^k); \quad h_t^k \in N_{U_t}(u_t^k); \quad q_{t+1}^k \in \partial\varphi_{t+1}^k(y_{t+1}^k), \quad (8.21)$$
such that
$$r_{ut}^k + B_t^\top q_{t+1}^k + h_t^k = 0, \quad t = 0,\dots,N-1. \quad (8.22)$$
8.1 Stochastic Dual Dynamic Programming (SDDP) 271
$$\begin{cases} \ell_t(u,y) \ge \ell_t(u_t^k, y_t^k) + r_t^k\cdot\big((u,y) - (u_t^k, y_t^k)\big),\\ \varphi_{t+1}^k(y) \ge \varphi_{t+1}^k(y_{t+1}^k) + q_{t+1}^k\cdot(y - y_{t+1}^k),\\ 0 \ge h_t^k\cdot(u - u_t^k). \end{cases} \quad (8.23)$$
$$\ell_t(u,y) + \varphi_{t+1}^k(A_t y + B_t u) \ge \ell_t(u_t^k, y_t^k) + \varphi_{t+1}^k(y_{t+1}^k) + \big(r_{yt}^k + A_t^\top q_{t+1}^k\big)\cdot(y - y_t^k). \quad (8.25)$$
Minimizing the l.h.s. over u ∈ Ut we obtain an affine minorant of the value function
vt . Therefore, the above r.h.s. is itself an affine minorant of the value function vt . So,
we can update ϕtk as follows:
$$\varphi_t^{k+1}(y) := \max\Big\{\varphi_t^k(y),\ \ell_t(u_t^k, y_t^k) + \varphi_{t+1}^k(y_{t+1}^k) + \big(r_{yt}^k + A_t^\top q_{t+1}^k\big)\cdot(y - y_t^k)\Big\}. \quad (8.26)$$
We also update $\varphi_N^k$ as follows:
$$\varphi_N^{k+1}(y) := \max\big\{\varphi_N^k(y),\ \ell_N(y_N^k) + r_N^k\cdot(y - y_N^k)\big\}, \quad \text{where } r_N^k \in \partial\ell_N(y_N^k). \quad (8.27)$$
The updates of the $\varphi_t^k$ can be performed in parallel or in any order, and are in any case very fast. We see that the costly step of the algorithm is the forward one. Since $\varphi_t^k$ is nondecreasing in $k$ and bounded above by $v_t$, it has a limit, denoted by $\bar\varphi_t$.
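The forward/backward mechanism can be sketched on a hypothetical two-stage problem (all data invented): the forward step uses the current polyhedral model $\varphi_1$ of the second-stage value $v_1$, and the backward step adds the cut generated at the visited state, in the spirit of (8.17) and (8.26); the model value increases to the true optimal value.

```python
# Two-stage sketch (invented data):
#   min_{u0 in [-1,1]}  u0**2 + v1(y0 + u0),
#   v1(y) = min_{u1 in [-1,1]} (y + u1)**2,
# with phi1 a polyhedral lower model of v1, improved by one cut per pass.

y0 = 3.0
cuts = [(-1e6, 0.0, 0.0)]            # initial trivial cut (value, slope, point)

def phi1(y):
    return max(a + g * (y - yc) for a, g, yc in cuts)

def stage1(y):
    """Backward step: exact stage-1 value and a subgradient at y."""
    u1 = max(-1.0, min(1.0, -y))     # projection minimizes (y + u1)^2
    return (y + u1) ** 2, 2.0 * (y + u1)

grid = [-1.0 + i / 500 for i in range(1001)]
for _ in range(20):
    u0 = min(grid, key=lambda u: u * u + phi1(y0 + u))   # forward step
    y1 = y0 + u0
    val, slope = stage1(y1)
    cuts.append((val, slope, y1))                        # backward cut on v1
lower = min(u * u + phi1(y0 + u) for u in grid)          # model (lower) value
upper = min(u * u + stage1(y0 + u)[0] for u in grid)     # exact value
```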
Proof (a) We claim, using a backward induction argument, that (8.28) holds. For
t = N this follows from Lemma 8.1. Let it hold for t + 1, with 0 ≤ t ≤ N − 1. It
suffices to check the result for a subsequence ki such that y ki +1 is convergent.
Since the data are Lipschitz and the minorants $\varphi^k$ are Lipschitz with constant $L$, $c_t^k := r_{yt}^k + A_t^\top q_{t+1}^k$ is bounded. Given $\varepsilon > 0$, for large enough $i$, by the induction hypothesis, since $\varphi_t^k$ is nondecreasing w.r.t. $k$, for $j > i$, we have using (8.26) that
$$\begin{aligned} \varphi_t^{k_j}(y_t^{k_j}) &\ge \ell_t(u_t^{k_i}, y_t^{k_i}) + \varphi_{t+1}^{k_i}(A_t y_t^{k_i} + B_t u_t^{k_i}) + c_t^{k_i}\cdot(y_t^{k_j} - y_t^{k_i})\\ &\ge \ell_t(u_t^{k_i}, y_t^{k_i}) + v_{t+1}(A_t y_t^{k_i} + B_t u_t^{k_i}) - \varepsilon - |c_t^{k_i}|\,|y_t^{k_j} - y_t^{k_i}|\\ &\ge v_t(y_t^{k_i}) - \varepsilon - |c_t^{k_i}|\,|y_t^{k_j} - y_t^{k_i}|\\ &\ge v_t(y_t^{k_j}) - \varepsilon - (L + |c_t^{k_i}|)\,|y_t^{k_j} - y_t^{k_i}|. \end{aligned} \quad (8.29)$$
Since $|c_t^{k_i}|$ is bounded, this implies $\liminf_j \big(\varphi_t^{k_j}(y_t^{k_j}) - v_t(y_t^{k_j})\big) \ge 0$. Since $\varphi_t^k$ is a minorant of $v_t$, the claim follows.
(b) We must prove that any limit-point of $u^k$ is a solution of $(P)$. Indeed, we have that for all $u_t \in U_t$, in view of step (a):
$$\begin{aligned} \ell_t(u_t^k, y_t^k) + \varphi_{t+1}^k(A_t y_t^k + B_t u_t^k) &\le \ell_t(u_t, y_t^k) + \varphi_{t+1}^k(A_t y_t^k + B_t u_t)\\ &\le \ell_t(u_t, y_t^k) + v_{t+1}(A_t y_t^k + B_t u_t) + o(1). \end{aligned} \quad (8.30)$$
Letting $k \uparrow \infty$ we get that
By point (a), $\bar\varphi_{t+1}(\bar y_{t+1}) = v_{t+1}(\bar y_{t+1})$. Minimizing the r.h.s. over $u_t \in U_t$, we get the result, in view of the dynamic programming principle.
8.1.3.1 Principle
For the sake of simplicity we assume that $\Omega = \Omega_0^{N+1}$, and that $\omega = (\omega_0,\dots,\omega_N)$ with all components independent, with the same law. Additionally, $\Omega_0 = \{1,\dots,M\}$ and the event $i$ has probability $p_i$. We say that a random variable is $\mathcal{F}_t$-measurable if it depends on $(\omega_0,\dots,\omega_{t-1})$. We consider adapted policies: $u_t$ (and therefore also $y_t$) is $\mathcal{F}_t$-measurable, for $t = 0$ to $N-1$. We denote by $y[u]$ the state associated with the control $u$, the adapted solution of
The cost function is, given (u, y) adapted and a.s. bounded
$$J(u,y) := \mathbb{E}\Big[\sum_{t=0}^{N-1} \ell_t(u_t, y_t, \omega_t) + \ell_N(y_N, \omega_N)\Big]. \quad (8.34)$$
We assume that the functions entering into the cost are Lipschitz and convex w.r.t.
(u, y). Denote the reduced cost by
The problem is to minimize the reduced cost satisfying the control constraints:
$$v_N = \mathbb{E}\,\ell_N; \qquad v_\tau(x) := \min_{u_\tau,\dots,u_{N-1}} \mathbb{E}\Big[\sum_{t=\tau}^{N-1} \ell_t(u_t, y_t, \omega_t) + \ell_N(y_N, \omega_N) \,\Big|\, y_\tau = x\Big], \quad (8.37)$$
where the minimization is over the feasible adapted policies (feasible in the sense that they satisfy the above control constraints). The dynamic programming principle reads
$$v_t(y) = \min_{u\in U_t} \sum_{i=1}^M p_i\big(\ell_t(u, y, i) + v_{t+1}(A_t y + B_t u + e_i)\big). \quad (8.39)$$
$$u_t^k \in \operatorname*{argmin}_{u\in U_t} \sum_{i=1}^M p_i\big(\ell_t(u, y_t^k, i) + \varphi_{t+1}^k(A_t y_t^k + B_t u + e_i)\big), \quad t = 0,\dots,N-1, \quad (8.40)$$
and then compute $y_{t+1}^k$ according to (8.33). Assuming that in the case of multiple minima we choose one of them following a rule such as taking the solution of minimum norm, this determines an adapted policy. Computing trajectories when choosing $i$ with probability $p_i$, this procedure then appears as a Monte Carlo type computation for estimating the reduced cost $F(u^k)$ associated with the adapted policy $u^k$. We have that
$$\varphi_0^k(y^0) \le v_0(y^0) \le F(u^k). \quad (8.41)$$
So, provided we have a statistical procedure implying that for some ε > 0 and ak ∈ R
$$\begin{cases} \ell_t(u,y,i) \ge \ell_t(u_t^k, y_t^k, i) + r_{it}^k\cdot\big((u,y) - (u_t^k, y_t^k)\big),\\ \varphi_{t+1}^k(y) \ge \varphi_{t+1}^k(y_{i,t+1}^k) + q_{i,t+1}^k\cdot(y - y_{i,t+1}^k),\\ 0 \ge h_t^k\cdot(u - u_t^k). \end{cases} \quad (8.47)$$
Summing these relations (with weights $p_i$ for the first two, the second being evaluated at the points $A_t y + B_t u + e_i$) and using (8.46), we obtain that
$$\sum_{i=1}^M p_i\big(\ell_t(u,y,i) + \varphi_{t+1}^k(A_t y + B_t u + e_i)\big) \ge a_t^k + b_t^k\cdot(y - y_t^k), \quad (8.48)$$
where
$$\begin{cases} a_t^k := \sum_{i=1}^M p_i\big(\ell_t(u_t^k, y_t^k, i) + \varphi_{t+1}^k(y_{i,t+1}^k)\big),\\ b_t^k := \sum_{i=1}^M p_i\big(r_{iyt}^k + A_t^\top q_{i,t+1}^k\big). \end{cases} \quad (8.49)$$
Minimizing the l.h.s. of (8.48) over $u \in U_t$, we see that the above r.h.s. gives an affine minorant of the value function $v_t$, so that we can update $\varphi_t^k$ as follows:
$$\varphi_t^{k+1}(y) := \max\big\{\varphi_t^k(y),\ a_t^k + b_t^k\cdot(y - y_t^k)\big\}. \quad (8.50)$$
We recall that the Frobenius scalar product between two matrices $A$, $B$ of the same size is
$$\langle A, B\rangle_F = \sum_{i,j} A_{ij}B_{ij} = \operatorname{trace}(AB^\top). \quad (8.53)$$
Note that, if A, B, C are matrices such that AB and C have the same dimension,
then we have the “transposition rule”
8.2.2 Setting
Here $A$, a $p \times n$ matrix, $b(\omega) \in L^2(\Omega)^p$ and $c(\omega) \in L^2(\Omega)^n$ are given. We assume that the probability has support over the closed set $\Omega \subset \mathbb{R}^{n_\omega}$ and that, for some matrices $B$ and $C$ of appropriate dimension:
In addition we decide to take a linear decision rule, i.e., for some $X \in \mathbb{R}^{n\times n_\omega}$ we set $x(\omega) = X\omega$. We assume that
$$\omega_1 = 1 \quad \text{a.s. on } \Omega, \quad (8.58)$$
so that these linear decision rules are in fact affine decision rules in $\omega_2,\dots,\omega_{n_\omega}$.
Denoting by (AX − B)i the ith row of AX − B, the resulting problem reads:
$$\Omega = \{\omega \in \mathbb{R}^{n_\omega};\ W\omega + Zz \ge h\}. \quad (8.62)$$
8.2 Introduction to Linear Decision Rules 277
So, $v(y)$ is the value of a feasible linear program (we assume of course that $\Omega$ is nonempty) whose Lagrangian function is
$$-y\cdot\omega + \lambda\cdot(h - W\omega - Zz) = -(y + W^\top\lambda)\cdot\omega - (Z^\top\lambda)\cdot z + \lambda\cdot h. \quad (8.64)$$
Therefore, the dual problem has the same value as the primal one. In addition, both the primal and the dual problem have solutions if $v(y)$ is finite. So, $v(y) \ge 0$ iff $\lambda\cdot h \ge 0$ for some $\lambda$ satisfying the constraints in (8.65), which may be expressed in the form
$$W^\top\lambda + y = 0; \qquad Z^\top\lambda = 0; \qquad \lambda \ge 0. \quad (8.66)$$
Taking for y the rows of AX − B, and denoting by Λ the matrix whose rows are
the transpose of the corresponding λ, we obtain an equivalent linear programming
reformulation of problem (8.62):
Lemma 8.7 Let Ω be of the form (8.62). Then the value of the linear programming
problem (8.67) is an upper bound of the value of the original problem (8.55).
We next generalize the previous analysis by considering the setting of linear conical
optimization, see Chap. 1, Sect. 1.3.2. Assume that for some z ∈ Rn z , h ∈ Rn h , W and
Z matrices of appropriate dimensions, and some (finite-dimensional) closed convex
cone K :
Ω = {ω ∈ Rn ω ; W ω + Z z − h ∈ K }. (8.68)
Remember that the infimum is not necessarily attained, even if the value is finite.
Assume that the above problem is qualified, i.e., for some ε > 0:
Expressing the dual using K + rather than K − , by Corollary 1.144, we have that
either v(y) = −∞, or
So, v(y) ≥ 0 iff λ · h ≥ 0, for some dual feasible λ. The dual constraints may be
expressed in the form
W^⊤λ + y = 0;   Z^⊤λ = 0;   λ ∈ K^+.   (8.72)
Taking for y the rows of AX − B, and denoting by Λ the matrix whose rows are
the transpose of the corresponding λ, we obtain an equivalent conic reformulation
of (8.62), where Λi denotes the ith row of the matrix Λ:
Lemma 8.8 Let Ω be of the form (8.68), and satisfy the qualification condition
(8.70). Then the value of (8.73) is an upper bound of the value of the original
problem (8.55).
We are now looking for lower bounds of the value of the original stochastic
optimization problem (8.55), when Ω is of the form (8.68). Denoting by v_P the value of
(8.55), which we may express as

v_P = inf_{x ∈ L²(Ω)^n, s ∈ L²(Ω)₊^p}  sup_{y ∈ L²(Ω)^p}  E( c(ω) · x(ω) + y(ω) · (Ax(ω) + s(ω) − b(ω)) ),   (8.74)
we get a lower bound by restricting, in the above expression, y to some subspace say
Y of L 2 (Ω) p : so v P ≥ vY , where
v_Y := inf_{x ∈ L²(Ω)^n, s ∈ L²(Ω)₊^p}  sup_{y ∈ Y}  E( c(ω) · x(ω) + y(ω) · (Ax(ω) + s(ω) − b(ω)) ).   (8.75)

Note that

v_Y := inf_{x ∈ L²(Ω)^n, s ∈ L²(Ω)₊^p} { E c(ω) · x(ω) ; Ax(ω) + s(ω) − b(ω) ∈ Y^⊥ }.   (8.76)
Then

v_Y = inf_{x ∈ L²(Ω)^n, s ∈ L²(Ω)₊^p}  sup_{Y ∈ R^{p×n_ω}}  E( c(ω) · x(ω) + ω^⊤ Y^⊤ (Ax(ω) + s(ω) − b(ω)) ).   (8.78)

Set e(ω) := Ax(ω) + s(ω) − b(ω). Then by the transposition rule (8.54):

v_Y = inf_{x ∈ L²(Ω)^n, s ∈ L²(Ω)₊^p} { E c(ω) · x(ω) ; E( (Ax(ω) + s(ω) − b(ω)) ω^⊤ ) = 0 }.   (8.80)
Proof Since M is symmetric and positive semidefinite, it is not of full rank iff there
exists some nonzero g ∈ R^{n_ω} such that g^⊤ M g = E(g · ω)² = 0. So, M is not of full
rank iff ω lies a.s. in the orthogonal space of some nonzero vector g ∈ R^{n_ω}. The
conclusion follows.
On the other hand, given any n × n ω matrix X , we have that x(ω) = X ω is such that
the above first relation holds. Assume in the sequel that c(ω) = Cω. Using (8.54)
and the symmetry of M, we get that
v_Y = inf_{X, S, s ∈ L²(Ω)₊^p} { trace(MC^⊤X) ; SM = E s(ω)ω^⊤ ; (AX + S − B)M = 0 }.   (8.85)

Since M is invertible, we deduce that

v_Y = inf_{X, S, s ∈ L²(Ω)₊^p} { trace(MC^⊤X) ; AX + S = B ; SM = E s(ω)ω^⊤ }.   (8.86)
The above problem is still not tractable, but we will see that it has the following
tractable relaxation:
Set Γ := E z(ω)s(ω)^⊤; note that this expectation is finite since E|ω|² < ∞, in view
of (8.88). By the above display, the jth column of the r.h.s. matrix is
E( (Wω − h + Zz(ω)) s_j(ω) ). Since Wω − h + Zz(ω) ∈ K a.e., s_j(ω) ≥ 0, and K is a
closed convex cone, this column belongs to K. The conclusion follows.
Remark 8.11 (i) The derivation of this dual bound did not assume any qualification
condition.
(ii) For a refined analysis of the lower bound, in the case when K is the set of
nonnegative vectors, see [66].
8.3 Notes
Kelley’s [63] algorithm 8.1 for minimizing a convex function over a set X essentially
requires us to solve a linear programming problem at each step, if X is a polyhedron.
Various improvements, involving the quadratic penalization of the displacement and
therefore the resolution of convex quadratic programs, are described in Bonnans et
al. [24].
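As an illustration, here is a one-dimensional sketch of Kelley's method, in which the model minimization is done crudely on a grid rather than by linear programming (all problem data are hypothetical):

```python
import numpy as np

f = lambda x: x * x          # convex objective to minimize
df = lambda x: 2.0 * x       # its derivative, used to build cuts
lo, hi = -1.0, 2.0           # feasible interval X = [lo, hi]

xs = [hi]                    # points where cuts have been generated
grid = np.linspace(lo, hi, 2001)  # crude substitute for the LP model minimization
for _ in range(30):
    # Kelley model: max of the first-order cuts f(x_i) + f'(x_i)(x - x_i)
    model = np.max([f(x) + df(x) * (grid - x) for x in xs], axis=0)
    xs.append(float(grid[np.argmin(model)]))

x_star = xs[-1]  # converges to the minimizer x = 0
```

Each iteration adds one cut, so the piecewise-affine model (and hence the lower bound it provides) improves monotonically.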
The SDDP algorithm, due to Pereira and Pinto [86], can be seen as an extension
of the Benders decomposition [18]. Shapiro [113] analyzed the convergence of such
an algorithm for problems with potentially infinitely many scenarios, and considered
the case of a risk averse formulation, based on the conditional value at risk. See also
Girardeau et al. [52]. In the case of a random noise process with memory, a possibility
is to approximate it by a Markov chain, obtained by a quantization method, and to
apply the SDDP approach to the resulting dynamic programming formulation. This
applies more generally when the value functions are convex w.r.t. some variables
only, see Bonnans et al. [23]. The SDDP approach can also provide useful bounds
in the case of problems with integer constraints, see Zou, Ahmed and Sun [128].
In the presentation of linear decision rules we follow Georghiou et al. [51, 66].
The primal upper bound (8.73) can be computed by efficient algorithms when K
is the product of polyhedral cones, second-order cones, and cones of semidefinite
symmetric matrices. See e.g. Nesterov and Nemirovski [85]. For other aspects of
linear decision rules, in connection with robust optimization (for which a reference
book is [16]), see Ben-Tal et al. [14].
Chapter 9
Generalized Convexity and
Transportation Theory
Summary This chapter first presents the generalization of convexity theory when
replacing duality products with general coupling functions on arbitrary sets. The
notions of Fenchel conjugates, cyclical monotonicity and duality of optimization
problems, have a natural extension to this setting, in which the augmented Lagrangian
approach has a natural interpretation.
Convex functions over measure spaces, constructed as Fenchel conjugates of
integral functions of continuous functions, are shown to be sometimes equal to some
integral of a function of their density. This is used in the presentation of optimal
transportation theory over compact sets, and the associated penalized problems. The
chapter ends with a discussion of the multi-transport setting.
Clearly, this holds iff β ≥ ϕ κ (y). In other words, for any given y ∈ Y , if ϕ κ (y) is
finite, then x → κ(x, y) − ϕ κ (y) is the ‘best’ κ-minorant of the form κ(x, y) − β,
for some β ∈ R. If ϕ κ (y) = ∞, there is no such minorant. Finally, ϕ κ (y) = −∞
means that ϕ(x) = ∞, for any x ∈ X .
Since X and Y play symmetric roles we have similar notions for ψ : Y → R. For
instance, the κ-Fenchel conjugate of ψ is the function ψ κ : X → R defined by
Lemma 9.1 (i) The biconjugate of ϕ is the greatest κ-convex function dominated by
ϕ. That is, if f : X → R is κ-convex and f (x) ≤ ϕ(x) for all x ∈ X , then f (x) ≤
ϕ κκ (x) for all x ∈ X .
(ii) A function is κ-convex iff it is equal to its biconjugate.
(iii) A supremum of κ-convex functions is κ-convex.
so that
ψ(y) ≥ sup(κ(x, y) − ϕ(x)) = ϕ κ (y) (9.10)
x∈X
ϕ(x) = sup sup(κ(x, y) − ϕiκ (y)) = sup(κ(x, y) − inf ϕiκ (y)), (9.11)
i∈I y∈Y y∈Y i∈I
In other words,
∂κ ϕ(x̄) = ∅ ⇒ ϕ κκ (x̄) = ϕ(x̄). (9.13)
So when ϕ and its biconjugate have the same value at some point they also have the
same κ-subdifferential.
Remark 9.4 When X is a Banach space, Y is its dual, and κ(x, y) = y, x is the
usual duality product, we will speak of usual convexity, and then we recover the
usual Fenchel transform. Note, however, the difference in the definition of convex
functions.
Σ_{i=1}^N κ(x_i, y_i) ≥ Σ_{i=1}^N κ(x_{i+1}, y_i),  where x_{N+1} := x_1.   (9.15)
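For a finite set of pairs, condition (9.15) can be checked by brute force over all cycles; the following sketch does so for the classical coupling κ(x, y) = xy, for which κ-cyclical monotonicity amounts to monotonicity of the pairing (Python, illustrative data):

```python
from itertools import permutations, combinations

def is_cyclically_monotone(pairs, kappa, tol=1e-9):
    """Brute-force check of (9.15) over every cycle drawn from `pairs`."""
    n = len(pairs)
    for size in range(2, n + 1):
        for subset in combinations(range(n), size):
            for order in permutations(subset):
                xs = [pairs[i][0] for i in order]
                ys = [pairs[i][1] for i in order]
                lhs = sum(kappa(x, y) for x, y in zip(xs, ys))
                # close the cycle: x_{N+1} = x_1
                rhs = sum(kappa(xs[(i + 1) % size], ys[i]) for i in range(size))
                if lhs < rhs - tol:
                    return False
    return True

kappa = lambda x, y: x * y                    # usual (bilinear) coupling
mono = [(-1.0, -2.0), (0.0, 0.0), (1.0, 3.0)] # graph of an increasing map
not_mono = [(-1.0, 3.0), (1.0, -2.0)]         # decreasing pairing: fails (9.15)
```

The brute force is exponential in the number of pairs, so this is only a verification tool for small examples.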
Lemma 9.5 We have that Γ is κ-cyclically monotone iff there exists a κ-convex
function ϕ over X such that y ∈ ∂_κ ϕ(x), for all (x, y) ∈ Γ.   (9.16)
Proof (i) If some κ-convex function ϕ over X satisfies (9.16) then, by the κ-Fenchel–
Young inequality:
κ(xi , yi ) = ϕ(xi ) + ϕ κ (yi ),
(9.17)
−κ(xi+1 , yi ) ≥ −ϕ(xi+1 ) − ϕ κ (yi ).
ϕ(x) := sup Σ_{i=1}^N ( κ(x_{i+1}, y_i) − κ(x_i, y_i) ),   (9.18)

the supremum being over all N ∈ N and all (x_i, y_i) ∈ Γ, i = 1, …, N, with x_{N+1} := x.
ϕ(x) ≥ Σ_{i=1}^{N+1} ( κ(x_{i+1}, y_i) − κ(x_i, y_i) ) = κ(x, ȳ) − κ(x̄, ȳ) + Σ_{i=1}^N ( κ(x_{i+1}, y_i) − κ(x_i, y_i) ),   (9.20)

where (x_{N+1}, y_{N+1}) := (x̄, ȳ) and x_{N+2} := x.
Taking the supremum over the last sum, it follows that
Taking x = x1 we deduce that ϕ(x̄) < ∞. It follows that ϕ(x̄) is finite, and by the
above display, ȳ ∈ ∂κ ϕ(x̄). The conclusion follows.
9.1 Generalized Convexity 287
9.1.3 Duality
Our previous results on generalized convexity (in particular Lemma 9.3) lead to the
following weak duality result:
We continue in the previous setting, in the case when X is again an arbitrary set, Y
is a Banach space and the family of optimization problems is
(b) Dualization of the original problem (P_y), with value v(y), using the coupling
between Y∗ and Y defined by
Remark 9.8 Observe that, when P(0) = 0, in cases (a) and (b), the duality gap is the
same for the unperturbed problem y = 0. So, the augmented Lagrangian approach
can be seen as a generalized convexity approach on the original problem with the
nonstandard coupling ⟨y∗, y⟩ − rP(y).
Example 9.9 The classical example is when Y is a Hilbert space identified with its
dual, and P(y) = ½‖y‖². Then the penalty term in the augmented Lagrangian is

inf_{z∈K} ( ½r‖z − g(x)‖² + ⟨y∗, g(x) − z⟩ ) = inf_{z∈K} ½r‖z − g(x) − y∗/r‖² − ‖y∗‖²/(2r),   (9.37)

and therefore the augmented Lagrangian is

L_r(x, y∗) := f(x) + ½r dist_K( g(x) + y∗/r )² − ‖y∗‖²/(2r).   (9.38)

The case of finitely many inequality constraints corresponds to the case when Y = R^m
is endowed with the Euclidean norm and K = R^m₋. The expression of the augmented
Lagrangian is then

L_r(x, y∗) := f(x) + ½r Σ_{i=1}^m ( g_i(x) + y_i∗/r )₊² − ‖y∗‖²/(2r).   (9.39)
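A sketch of the resulting method of multipliers: minimize (9.39) in x, then update y∗_i := (y∗_i + r g_i(x))₊. The two-dimensional problem below is hypothetical (min |x|² subject to x₁ + x₂ ≥ 1, whose solution is x = (½, ½) with multiplier 1):

```python
import numpy as np
from scipy.optimize import minimize

f = lambda x: x[0] ** 2 + x[1] ** 2          # objective (illustrative data)
g = lambda x: np.array([1.0 - x[0] - x[1]])  # constraint g(x) <= 0

r, y = 10.0, np.zeros(1)                     # penalty parameter, multiplier
x = np.zeros(2)
for _ in range(20):
    # augmented Lagrangian (9.39), dropping the constant term -|y*|^2/(2r)
    L = lambda x: f(x) + 0.5 * r * np.sum(np.maximum(g(x) + y / r, 0.0) ** 2)
    x = minimize(L, x, method="BFGS").x
    # multiplier update of the method of multipliers
    y = np.maximum(y + r * g(x), 0.0)
```

The inner minimization is smooth (the squared positive part is C¹), which is one practical advantage of the augmented Lagrangian over an exact penalty.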
Let Ω be a compact subset of Rn , and C(Ω) denote the set of continuous functions
over Ω. Let p ∈ N be nonzero and set X := C(Ω) p , whose elements are viewed as
continuous functions over Ω with value in R p , and norm
Indeed a(ω) · x − f ∗ (a(ω)) ≤ f (x), so that the above inequality holds (since f ∗ has
an affine minorant, the integral has value in [−∞; ∞)), and the equality is obvious
since ϕk → ϕ in X . By Proposition 3.74, the supremum over a(·) of the r.h.s. is
precisely F(ϕ). The result follows.
Recall the Definition 5.3 of regular measures. The dual of C(Ω) is M(Ω), the
set of finite Borel regular measures over Ω; see [77, Chap. II, Sect. 5]. The Fenchel
conjugate of F is F ∗ : M(Ω) p → R̄ defined by
⟨μ, ϕ⟩_X := Σ_{i=1}^p ∫_Ω ϕ_i(ω) dμ_i(ω).   (9.44)
9.2 Convex Functions of Measures 291
Let L¹_μ := Π_{i=1}^p L¹_{μ_i}(Ω) denote the set of integrable functions for the measure μ.
Definition 9.12 We say that h : R^p → R̄ has superlinear growth if, for all k > 0,
h(y)/|y| > k when |y| > r_k, for some r_k > 0.
⟨μ₁, ϕ_ε⟩ → c ⟨μ₁, 1_K⟩ = c μ₁(K).   (9.50)
We next identify ϕε with the element of C(Ω) p with first component ϕε and the
other components equal to zero. By Lemma 9.13, f is Lipschitz on bounded sets,
and so, by the dominated convergence theorem, F(ϕε ) → 0. Therefore, F ∗ (μ) ≥
limε ( μ1 , ϕε − F(ϕε )) = cμ1 (K ). Letting c ↑ ∞ we deduce that F ∗ (μ) = +∞.
(ii) Let μs = 0. Then
where in the last inequality we use the Fenchel–Young inequality. We next prove the
converse inequality. Set b(ω, v) := μ_a(ω) · v − f(v). Let (a_k) be a dense sequence in
dom f and let ϕ_k ∈ L^∞(Ω) be inductively defined by ϕ_0(ω) = a_0 and

ϕ_k(ω) = a_k if b(ω, a_k) > b(ω, ϕ_{k−1}(ω));   ϕ_k(ω) = ϕ_{k−1}(ω) otherwise.   (9.52)
Then b(ω, ϕ_k(ω)) → f∗(μ_a(ω)) a.e. and, by the monotone convergence theorem,
∫_Ω b(ω, ϕ_k(ω))dω → ∫_Ω f∗(μ_a(ω))dω. We cannot conclude the result from this,
since the ϕ_k are not continuous. So, given ε > 0, fix k such that

∫_Ω b(ω, ϕ_k(ω))dω > ∫_Ω f∗(μ_a(ω))dω − ε   if ∫_Ω f∗(μ_a(ω))dω < ∞,
∫_Ω b(ω, ϕ_k(ω))dω > 1/ε                     otherwise.   (9.53)
Fix M > 0 such that ‖μ_a^M − μ_a‖_{L¹(Ω)} < ε. Extend ϕ_k over R^n by 0 and let η : R^n →
R₊ be of class C^∞ with integral 1 and support in the unit ball. Set, for α > 0, η_α(x) :=
α^{−n} η(x/α), and ϕ̂_α := ϕ_k ∗ η_α (convolution product). By Jensen's inequality,
In the last inequality we use ‖ϕ̂_α‖_∞ ≤ ‖ϕ_k‖_∞ and ϕ̂_α → ϕ_k in L¹(Ω). We conclude
by combining the previous inequality with (9.53) and (9.56).
Clearly G is convex and bounded over bounded sets. So, it is continuous, with
conjugate
G∗(μ) := sup_{ϕ∈X} { ⟨μ, ϕ⟩_X − ∫_Ω g(ω, ϕ(ω))dω }.   (9.59)
Proof This is an easy variant of the proof of Lemma 9.14. Let us just mention that,
while Jensen’s inequality in (9.55) cannot be easily extended, we get directly the
analogous to (9.56), namely
by the dominated convergence theorem, since ϕ̂α is bounded in L ∞ (Ω) p and con-
verges to ϕ in L 1 (Ω) p .
We next analyze in a simple way the Kantorovich duality that extends the classical
Monge problem.
Let X be a compact subset of R^n, and C(X) denote the space of continuous functions
over X, endowed with the uniform norm. This is a Banach space, with dual denoted
by M(X). We say that η ∈ M(X) is nonnegative, and we write η ≥ 0, if ⟨η, ϕ⟩_{C(X)} ≥ 0
for any nonnegative ϕ. We denote by M₊(X) the positive cone (set of nonnegative
elements) of M(X). It is known that M(X) is the space of finite Borel measures over X,
see [77, Chap. 2].
Given a compact subset Y of R^p, set Z := X × Y (this is a compact subset of R^{n+p}).
To ϕ ∈ C(X) we associate b_X ϕ ∈ C(Z) defined by
the marginal of μ over X . The marginal mapping μ → μ| X is nothing but the trans-
pose of the canonical injection from C(X ) into C(Z ), and is non-expansive in the
sense that
‖μ|_X‖_{M(X)} ≤ ‖μ‖_{M(Z)}.   (9.65)
Let 1_X have value 1 over X. The marginals are related by the compatibility relation
⟨μ|_X, 1_X⟩_{C(X)} = ⟨μ|_Y, 1_Y⟩_{C(Y)}.
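For discrete couplings (nonnegative matrices), the marginals are just row and column sums, and the compatibility of total masses is immediate; a minimal sketch (illustrative data):

```python
import numpy as np

# a discrete coupling mu on Z = X x Y, represented by a nonnegative matrix
mu = np.array([[0.1, 0.2],
               [0.3, 0.4]])

mu_X = mu.sum(axis=1)   # marginal on X: integrate out y
mu_Y = mu.sum(axis=0)   # marginal on Y: integrate out x

# compatibility: both marginals carry the total mass of mu
total = mu.sum()
```

For signed μ the same row/column sums give the marginals, and the non-expansiveness (9.65) corresponds to the triangle inequality applied entrywise.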
We fix (η, ν) ∈ P(X ) × P(Y ), and c(x, y) ∈ C(Z ). Consider the Kantorovich
problem
Proposition 9.16 Problems (9.71) and (9.68) have the same finite value, and both
have a nonempty set of solutions.
Proof (a) The dual problem (9.71) is feasible (take for μ the product of η and ν) and
the primal problem (9.68) is qualified: there exists a pair (ϕ0 , ψ0 ) in C(X ) × C(Y )
such that c(x, y) − ϕ0 (x) − ψ0 (y) is uniformly positive. By general results of convex
duality theory, problems (9.68) and (9.71) have the same finite value, and (9.71) has
a nonempty and bounded set of solutions.
(b) It remains to prove that (9.68) has a nonempty set of solutions. Let (ϕ_k, ψ_k) be a
minimizing sequence. Set

ψ̂_k(y) := min{ c(x, y) − ϕ_k(x) ; x ∈ X },
ϕ̂_k(x) := min{ c(x, y) − ψ̂_k(y) ; y ∈ Y }.   (9.72)

It is easily checked that these two functions are continuous, and satisfy the primal
constraint as well as the inequality (ϕ̂_k, ψ̂_k) ≥ (ϕ_k, ψ_k), implying that the associated
cost is smaller than the one for (ϕ_k, ψ_k); so, (ϕ̂_k, ψ̂_k) is another minimizing sequence.
In addition, (ϕ̂_k, ψ̂_k) has a continuity modulus not greater than the one of c (in short,
it has a c-continuity modulus), since a finitely-valued infimum of functions with
c-continuity modulus has a c-continuity modulus. Since we can always add a constant
to ϕ̂_k and subtract it from ψ̂_k, we get the existence of a minimizing sequence (ϕ̂_k, ψ̂_k)
with c-continuity modulus and such that ϕ̂_k(x_0) = 0. It easily follows that (ϕ̂_k, ψ̂_k)
is bounded in C(X) and C(Y), resp. By the Ascoli–Arzelà theorem, there exists a
subsequence in C(X) × C(Y) converging to some (ϕ, ψ). Passing to the limit in
the cost function and constraints of (9.68), we obtain that (ϕ, ψ) is a solution to this
problem.
Remark 9.17 The primal solution (ϕ, ψ) constructed in the above proof satisfies

ψ(y) = min{ c(x, y) − ϕ(x) ; x ∈ X },
ϕ(x) = min{ c(x, y) − ψ(y) ; y ∈ Y }.   (9.73)
Setting κ(x, y) := −c(x, y), the above relations can be interpreted as κ-conjugates
in the sense of (9.1):
−ψ = (−ϕ)^κ;   −ϕ = (−ψ)^κ.   (9.74)
Let (ϕ, ψ) and μ be primal and dual feasible, resp. The difference of associated costs
is, since η and ν are the marginals of μ:
As expected it is nonnegative, and (since the primal and dual problem have the same
value), (ϕ, ψ) and μ are primal and dual solutions, resp., iff the above r.h.s. is equal
to zero, meaning that c(x, y) = ϕ(x) + ψ(y) over the support of μ, which we denote
by Γ . Let (x̄, ȳ) ∈ Γ . Then
c(x̄, ȳ) − ϕ(x̄) = ψ( ȳ) ≤ c(x, ȳ) − ϕ(x), for all x ∈ X. (9.76)
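On finite supports the Kantorovich problem is an ordinary linear program, so the primal-dual optimality discussed above can be observed directly; a sketch using scipy (illustrative data):

```python
import numpy as np
from scipy.optimize import linprog

x = np.array([0.0, 1.0])              # support of eta
y = np.array([0.0, 1.0])              # support of nu
eta = np.array([0.5, 0.5])
nu = np.array([0.25, 0.75])
C = 0.5 * (x[:, None] - y[None, :]) ** 2   # cost c(x, y) = |x - y|^2 / 2

m, n = C.shape
A_eq = np.zeros((m + n, m * n))
for i in range(m):
    A_eq[i, i * n:(i + 1) * n] = 1.0       # row marginals = eta
for j in range(n):
    A_eq[m + j, j::n] = 1.0                # column marginals = nu
b_eq = np.concatenate([eta, nu])

res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
mu = res.x.reshape(m, n)                   # optimal transport plan
```

Here the cheapest plan keeps as much mass as possible on the diagonal, moving only the excess 0.25 of mass from x = 0 to y = 1, for a total cost of 0.125.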
In the sequel we assume that X and Y are the closure of their interior, and that c(·, ·)
is of class C 1 . By the above remark, we may assume that ϕ and ψ satisfy (9.74) and
therefore are Lipschitz.
By Rademacher’s theorem, see [6, Thm. 2.14], ϕ is a.e. differentiable over int(X ).
If x̄ ∈ int(X) and ϕ is differentiable at x̄, (9.76) implies that
Example 9.18 Take c(x, y) = 21 |x − y|2 . We obtain that ∇ϕ(x̄) = x̄ − ȳ. Therefore
so that the support of μ is contained in the graph of the transportation map T (x). If
η has a density, we can identify μ with this transportation map. In addition, since
(9.74) holds for (ϕ, ψ) we have that ϕ̂ := −ϕ satisfies
ϕ̂(x) = max_{y∈Y} { −c(x, y) + ψ(y) } = −½|x|² + max_{y∈Y} { x · y − ½|y|² + ψ(y) }.   (9.79)

The last maximum of affine functions of x, say F(x), is a convex function of x. We
deduce that ϕ(x) = ½|x|² − F(x), with F convex. We have proved that

if c(x, y) = ½|x − y|², then the transportation plan is of the form T(x) = ∇F(x) a.e., where F is a convex function.   (9.80)
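In dimension one, (9.80) reduces to the familiar fact that the quadratic-cost optimal map is nondecreasing (the derivative of a convex function): on empirical samples with uniform weights, it is the monotone rearrangement obtained by sorting. A sketch (illustrative data):

```python
import numpy as np

rng = np.random.default_rng(1)
xs = np.sort(rng.standard_normal(1000))        # samples of eta
ys = np.sort(rng.standard_normal(1000) * 2.0)  # samples of nu

# quadratic-cost optimal coupling of sorted samples: i-th point goes to
# i-th point, so the induced map x_i -> y_i is nondecreasing, i.e. the
# derivative of a convex potential F
T = ys
monotone = bool(np.all(np.diff(T) >= 0))
```

For these two Gaussian samples the underlying optimal map is x ↦ 2x, which the sorted pairing approximates.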
9.3 Transportation Theory 297
Example 9.19 More generally assume that c(x, y) = f (x − y) with f convex and
Lipschitz. Then (9.76) implies that
meaning that
9.3.3.1 Duality
with ê(0) := 0 and domain [0, ∞). The penalty term is defined as

P(μ) = +∞ if μ_s ≠ 0;   P(μ) = ∫_Ω e(μ_a(ω))dω otherwise.   (9.87)
Set

f(μ) = ε ∫_Z P(μ(x, y)) d(x, y);   F(ν) = I_{{0}}(ν);   Aμ = −(μ|_X, μ|_Y);   y = (η, ν).   (9.89)

Here A is from M(Z) into M(X) × M(Y). The penalized dual problem can be written
in the form

Min_{μ∈M(Z)}  f(μ) + ⟨μ, c⟩_{C(Z)} + F(Aμ + y).   (9.90)
We can compute the dual to this problem in dual spaces, as explained in Chap. 1,
Sect. 1.2.1.2. While the problem is in a dual space setting, the computations are similar
to those in the standard Fenchel duality framework, so that the ‘bidual’ expressed as
a minimization has expression
Lemma 9.20 We have that f ∗ has finite values and, for every c ∈ C(Z ):
Proof Let fˆ(c) denote the above r.h.s. Since e is l.s.c. proper convex, it is the Fenchel
conjugate of e∗ . So, by Lemma 9.13, since e has superlinear growth, e∗ is finite-
valued and bounded over bounded sets. So, fˆ(c) is a continuous convex function
and, by Lemma 9.14, its conjugate is P(μ). Since fˆ(c) is equal to its biconjugate,
the conclusion follows.
Proposition 9.21 The penalized problem (9.88) is the dual of problem (9.93).
Proof Apply the Fenchel duality theory, taking into account Lemma 9.20.
Remark 9.22 Since the primal penalized problem is qualified (in the case of the
usual penalties given above) its dual has a nonempty and bounded set of solutions.
The semiprimal problem consists in minimizing the primal cost w.r.t. ϕ only. The
primal cost can be expressed as
∫_X ( ε ∫_Y P∗( (ϕ(x) + ψ(y) − c(x, y))/ε ) dy − η(x)ϕ(x) ) dx − ⟨ν, ψ⟩_{C(Y)}.   (9.94)
So the first-order condition for minimizing w.r.t. ϕ is that
Example 9.23 Entropy penalty: then P∗(s) = DP∗(s) = e^s, and (9.95) reduces to
Since the l.h.s. of (9.96) is a positive and continuous function of x, (9.96) has a
solution iff η is absolutely continuous, with positive and continuous density (since
Ω is compact, this implies that the density has a positive minimum), and the solution
is ϕ such that
ϕ(x)
+ log (exp((ψ(y) − c(x, y))/ε)dy = log η(x). (9.97)
ε Y
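Numerically, the entropic penalty leads to couplings of the form μ = diag(u) K diag(v) with K = exp(−c/ε), and alternately enforcing the two marginal constraints is Sinkhorn's algorithm (see the notes). A sketch (illustrative data):

```python
import numpy as np

def sinkhorn(C, eta, nu, eps, iters=500):
    """Entropically penalized transport: alternately match the two marginals
    of mu = diag(u) K diag(v), with Gibbs kernel K = exp(-C/eps)."""
    K = np.exp(-C / eps)
    u = np.ones_like(eta)
    for _ in range(iters):
        v = nu / (K.T @ u)   # match the second marginal
        u = eta / (K @ v)    # match the first marginal
    return u[:, None] * K * v[None, :]

x = np.linspace(0.0, 1.0, 5)
C = 0.5 * (x[:, None] - x[None, :]) ** 2
eta = np.full(5, 0.2)
nu = np.full(5, 0.2)
mu = sinkhorn(C, eta, nu, eps=0.05)
```

Each update is a closed-form rescaling, which is why the entropic relaxation is so much cheaper than the exact linear program.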
9.3.4 Barycenters
Min_{ϕ,ψ}  −Σ_{k=1}^K ⟨ν^k, ψ^k⟩_{C(Y_k)} ;
ϕ^k(x) + ψ^k(y) − c^k(x, y) ≤ 0, for all x ∈ X and y ∈ Y_k,   (9.100)
Σ_{k=1}^K ϕ^k(x) = 0, for all x ∈ X,
ϕ^k ∈ C(X);  ψ^k ∈ C(Y_k),  k = 1, …, K.
Proposition 9.24 Problems (9.99) and (9.100) have the same finite value, and both
have a nonempty set of solutions.
Proof (a) The dual problem (9.99) is feasible (take for μk the product of η and ν k ) and
the primal problem (9.100) is qualified: for k = 1 to K , there exists pairs (ϕ0k , ψ0k ) in
C(X ) × C(Yk ) such that ck (x, y) − ϕ0k (x) − ψ0k (y) is uniformly positive. By general
results of convex duality theory, problems (9.99) and (9.100) have the same finite
value, and (9.99) has a nonempty and bounded set of solutions.
(b) It remains to show that the primal problem (9.100) has solutions. We adapt the
ideas in the proof of Proposition 9.16. Let (ϕ j , ψ j ) be a minimizing sequence. Set
ψ̂_j^k(y) := min{ c^k(x, y) − ϕ_j^k(x) ; x ∈ X },  k = 1, …, K,
ϕ̂_j^k(x) := min{ c^k(x, y) − ψ̂_j^k(y) ; y ∈ Y_k },  k = 1, …, K − 1,   (9.101)
ϕ̂_j^K(x) := −Σ_{k=1}^{K−1} ϕ̂_j^k(x).
9.3.4.2 Penalization
Max_{μ,η}  −Σ_{k=1}^K ( ⟨μ^k, c^k⟩_{Z_k} + ε ∫_{Z_k} P(μ^k(x, y)) d(x, y) ) ;
μ^k|_X = η;  μ^k|_{Y_k} = ν^k;   (9.102)
μ^k ∈ M₊(Z_k), k = 1, …, K;  η ∈ M(X).
Computing the ‘bidual’ problem we again recognize the Fenchel duality framework
with
f(μ, η) = f₁(μ) + f₂(η);   f₁(μ) = ε Σ_{k=1}^K ∫_{Z_k} P(μ^k(x, y)) d(x, y).   (9.103)
Min_{ϕ,ψ}  Σ_{k=1}^K ( −⟨η^k, ϕ^k⟩_{C(X)} − ⟨ν^k, ψ^k⟩_{C(Y_k)} + ε ∫_{Z_k} P∗( (ϕ^k(x) + ψ^k(y) − c^k(x, y))/ε ) d(x, y) ) ;
Σ_{k=1}^K ϕ^k = 0;  ϕ^k ∈ C(X);  ψ^k ∈ C(Y_k),  k = 1, …, K.   (9.104)
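In dimension one with quadratic cost, the Wasserstein barycenter of measures with equal weights is obtained by averaging their quantile functions; for empirical measures with equally many atoms this amounts to averaging sorted samples. A minimal sketch of this known special case (illustrative data):

```python
import numpy as np

# 1D Wasserstein-2 barycenter of two empirical measures with equal weights:
# average the sorted samples, i.e. average the quantile functions
a = np.sort(np.array([0.0, 1.0, 2.0, 3.0]))
b = np.sort(np.array([4.0, 5.0, 6.0, 7.0]))
bary = 0.5 * (a + b)
```

In higher dimension no such closed form is available, which is what the duality and penalization schemes above address.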
9.4 Notes
Brøndsted [30] and Dolecki and Kurcyusz [44] are early references for generalized
convexity. The augmented Lagrangian approach was introduced by Powell [88] and
Hestenes [58], and linked to the dual proximal algorithm in Rockafellar [101]. For its
application to infinite-dimensional problems, see Fortin and Glowinski [50]. Convex
functions of measures are discussed in Demengel and Temam [41, 42].
On transportation theory, see the monographs by Villani [122] and Santambrogio
[108]. The link (9.80) between a transportation map and the derivative of a convex
function is known as Brenier’s theorem [27]. Augmented Lagrangians are a useful
numerical tool for solving optimal transport problems, see Benamou and Carlier
[17]. Cuturi [35] introduced the entropic penalty, and showed that the resulting prob-
lem can be efficiently solved thanks to Sinkhorn’s algorithm [116] (for computing
matrices with prescribed row and column sums). Barycenters in the optimal transport
framework were introduced in Carlier and Ekeland [31]. See also Agueh and Carlier
[1]. They provide a powerful tool for clustering, see Cuturi and Doucet [36].
References
1. Agueh, M., Carlier, G.: Barycenters in the Wasserstein space. SIAM J. Math. Anal. 43(2),
904–924 (2011)
2. Akhiezer, N.I.: The Classical Moment Problem. Hafner Publishing Co., New York (1965)
3. Aliprantis, C.D., Border, K.C.: Infinite dimensional analysis, 3rd edn. Springer, Berlin (2006)
4. Alizadeh, F., Goldfarb, D.: Second-order cone programming. Math. Program. 95(1, Ser. B),
3–51 (2003)
5. Altman, E.: Constrained Markov Decision Processes. Stochastic Modeling. Chapman &
Hall/CRC, Boca Raton (1999)
6. Ambrosio, L., Fusco, N., Pallara, D.: Functions of Bounded Variation and Free Discontinuity
Problems. Oxford Mathematical Monographs. The Clarendon Press, Oxford University Press,
New York (2000)
7. Arapostathis, A., Borkar, V.S., Fernández-Gaucherand, E., Ghosh, M.K., Marcus, S.I.:
Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J.
Control Optim. 31(2), 282–344 (1993)
8. Araujo, A., Giné, E.: The Central Limit Theorem for Real and Banach Valued Random
Variables. Wiley, New York (1980)
9. Artzner, P., Delbaen, F., Eber, J.M., Heath, D.: Coherent measures of risk. Math. Financ. 9(3),
203–228 (1999)
10. Attouch, H., Brézis, H.: Duality for the sum of convex functions in general Banach spaces.
In: Barroso, J.A. (ed) Aspects of Mathematics and its Applications, pp. 125–133 (1986)
11. Aubin, J.-P., Ekeland, I.: Estimates of the duality gap in nonconvex optimization. Math. Oper.
Res. 1(3), 225–245 (1976)
12. Aubin, J.P., Frankowska, H.: Set-Valued Analysis. Birkhäuser, Boston (1990)
13. Bardi, M., Capuzzo-Dolcetta, I.: Optimal Control and Viscosity Solutions of Hamilton-Jacobi-
Bellman Equations. Birkhäuser, Boston (1997)
14. Ben-Tal, A., Golany, B., Nemirovski, A., Vial, J.-Ph.: Supplier-retailer flexible commitments
contracts: a robust optimization approach. Manuf. Serv. Oper. Manag. 7(3), 248–273 (2005)
15. Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, MPS/SIAM Series on Optimization (2001)
16. Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press,
Princeton (2009)
17. Benamou, J.-D., Carlier, G.: Augmented Lagrangian methods for transport optimization, mean
field games and degenerate elliptic equations. J. Optim. Theory Appl. 167(1), 1–26 (2015)
18. Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems.
Numer. Math. 4, 238–252 (1962)
19. Bertsekas, D.P.: Dynamic Programming and Optimal Control, 2nd edn., vol I & II. Athena
Scientific, Belmont (2000, 2001)
20. Billingsley, P.: Convergence of Probability Measures, 2nd edn. Wiley Inc., New York (1999)
21. Birge, J.R., Louveaux, F.: Introduction to Stochastic Programming. Springer, New York (1997)
22. Bokanowski, O., Maroso, S., Zidani, H.: Some convergence results for Howard’s algorithm.
SIAM J. Numer. Anal. 47(4), 3001–3026 (2009)
23. Bonnans, J.F., Cen, Z., Christel, Th.: Energy contracts management by stochastic programming
techniques. Ann. Oper. Res. 200, 199–222 (2012)
24. Bonnans, J.F., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.: Numerical Optimization: The-
oretical and Numerical Aspects, 2nd edn. Universitext. Springer, Berlin (2006)
25. Bonnans, J.F., Ramírez, H.: Perturbation analysis of second-order cone programming prob-
lems. Math. Program. 104(2–3, Ser. B), 205–227 (2005)
26. Bonnans, J.F., Shapiro, A.: Perturbation Analysis Of Optimization Problems. Springer, New
York (2000)
27. Brenier, Y.: Polar factorization and monotone rearrangement of vector-valued functions. Com-
mun. Pure Appl. Math. 44(4), 375–417 (1991)
28. Brézis, H.: Functional Analysis. Sobolev Spaces and Partial Differential Equations. Springer,
New York (2011)
29. Brézis, H., Lieb, E.: A relation between pointwise convergence of functions and convergence
of functionals. Proc. Am. Math. Soc. 88(3), 486–490 (1983)
30. Brøndsted, A.: Convexification of conjugate functions. Math. Scand. 36, 131–136 (1975)
31. Carlier, G., Ekeland, I.: Matching for teams. Econ. Theory 42(2), 397–418 (2010)
32. Carpentier, P., Chancelier, J-Ph, Cohen, G., De Lara, M.: Stochastic Multi-stage Optimization.
Springer, Berlin (2015)
33. Castaing, C., Valadier, M.: Convex Analysis and Measurable Multifunctions. Lecture Notes
in Mathematics, vol. 580. Springer, Berlin (1977)
34. Csiszár, I.: Information-type measures of difference of probability distributions and indirect
observations. Stud. Sci. Math. Hungar. 2, 299–318 (1967)
35. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transport. In: Advances in
Neural Information Processing Systems 26, pp. 2292–2300 (2013)
36. Cuturi, M., Doucet, A.: Fast computation of Wasserstein barycenters. In: Xing, E.P., Jebara,
T. (eds.) Proceedings of the 31st International Conference on Machine Learning. Proceedings
of Machine Learning Research, vol. 32, pp. 685–693, Bejing, China (2014)
37. Dallagi, A.: Méthodes particulaires en commande optimale stochastique. Ph.D. thesis, Uni-
versité Paris I (2007)
38. Danskin, J.M.: The Theory of Max-Min and Its Applications to Weapons Allocation Problems.
Springer, New York (1967)
39. Decarreau, A., Hilhorst, D., Lemaréchal, C., Navaza, J.: Dual methods in entropy maximiza-
tion. Application to some problems in crystallography. SIAM J. Optim. 2(2), 173–197 (1992)
40. Dellacherie, C., Meyer, P.-A.: Probabilities and potential. North-Holland Mathematics Stud-
ies, vol. 29. North-Holland Publishing Co., Amsterdam (1978)
41. Demengel, F., Temam, R.: Convex functions of a measure and applications. Indiana Univ.
Math. J. 33(5), 673–709 (1984)
42. Demengel, F., Temam, R.: Convex function of a measure: the unbounded case. FERMAT days
85: mathematics for optimization (Toulouse, 1985). North-Holland Mathematics Studies, vol.
129, pp. 103–134. North-Holland, Amsterdam (1986)
43. Dentcheva, D., Ruszczyński, A.: Common mathematical foundations of expected utility and
dual utility theories. SIAM J. Optim. 23(1), 381–405 (2013)
44. Dolecki, S., Kurcyusz, S.: On Φ-convexity in extremal problems. SIAM J. Control Optim.
16(2), 277–300 (1978)
45. Dudley, R.M.: Real Analysis and Probability. Cambridge University Press, Cambridge (2002).
Revised reprint of the 1989 original
46. Ekeland, I., Temam, R.: Convex Analysis and Variational Problems. Studies in Mathematics
and its Applications, vol. 1. North-Holland, Amsterdam (1976). French edition: Analyse
convexe et problèmes variationnels. Dunod, Paris (1974)
47. Fenchel, W.: On conjugate convex functions. Can. J. Math. 1, 73–77 (1949)
48. Fenchel, W.: Convex Cones and Functions. Lecture Notes. Princeton University, Princeton
(1953)
49. Föllmer, H., Schied, A.: Stochastic Finance: An Introduction in Discrete Time. de Gruyter
Studies in Mathematics, vol. 27. Walter de Gruyter & Co., Berlin (2002)
50. Fortin, M., Glowinski, R.: Augmented Lagrangian Methods. North-Holland, Amsterdam
(1983)
51. Georghiou, A., Wiesemann, W., Kuhn, D.: Generalized decision rule approximations for
stochastic programming via liftings. Math. Program. 152(1-2, Ser. A), 301–338 (2015)
52. Girardeau, P., Leclere, V., Philpott, A.B.: On the convergence of decomposition methods for
multistage stochastic convex programs. Math. Oper. Res. 40(1), 130–145 (2015)
53. Goberna, M.A., Lopez, M.A.: Linear Semi-infinite Optimization. Wiley Series in Mathemat-
ical Methods in Practice, vol. 2. Wiley, Chichester (1998)
54. Gol’shtein, E.G.: Theory of Convex Programming. Translations of Mathematical Mono-
graphs, vol. 36. American Mathematical Society, Providence (1972)
55. Gouriéroux, C.: ARCH Models and Financial Applications. Springer, New York (1997)
56. Hernández-Lerma, O., Lasserre, J.B.: Discrete-Time Markov Control Processes. Springer,
New York (1996)
57. Hernández-Lerma, O., Lasserre, J.B.: Further Topics on Discrete-Time Markov Control Pro-
cesses. Springer, New York (1999)
58. Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4, 303–320 (1969)
59. Hoffman, A.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bureau
Stand., Sect. B, Math. Sci. 49, 263–265 (1952)
60. Horn, R.A., Johnson, C.R.: Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge
(2013)
61. Hsu, S.-P., Chuang, D.-M., Arapostathis, A.: On the existence of stationary optimal policies
for partially observed MDPs under the long-run average cost criterion. Syst. Control Lett.
55(2), 165–173 (2006)
62. Kall, P., Wallace, S.W.: Stochastic Programming. Wiley, Chichester (1994)
63. Kelley, J.E.: The cutting plane method for solving convex programs. J. Soc. Indust. Appl.
Math. 8, 703–712 (1960)
64. Komiya, H.: Elementary proof for Sion’s minimax theorem. Kodai Math. J. 11(1), 5–7 (1988)
65. Krein, M., Milman, D.: On extreme points of regular convex sets. Studia Math. 9, 133–138
(1940)
66. Kuhn, D., Wiesemann, W., Georghiou, A.: Primal and dual linear decision rules in stochastic
and robust optimization. Math. Program. 130(1, Ser. A), 177–209 (2011)
67. Kushner, H.J., Dupuis, P.G.: Numerical Methods for Stochastic Control Problems in Contin-
uous Time. Applications of Mathematics, vol. 24, 2nd edn. Springer, New York (2001)
68. Lang, S.: Real and Functional Analysis, 3rd edn. Springer, New York (1993)
69. Lasserre, J.B.: Semidefinite programming versus LP relaxations for polynomial programming.
Math. Oper. Res. 27, 347–360 (2002)
70. Lemaréchal, C., Oustry, F.: Semidefinite relaxations and Lagrangian duality with application
to combinatorial optimization. Rapport de Recherche INRIA 3710, (1999)
71. Lewis, A.: The mathematics of eigenvalue optimization. Math. Programm. 97, 155–176 (2003)
72. Lewis, A.S.: The convex analysis of unitarily invariant matrix functions. J. Convex Anal.
2(1–2), 173–183 (1995)
73. Lewis, A.S., Overton, M.L.: Eigenvalue optimization. In: Acta numerica, 1996, pp. 149–190.
Cambridge University Press, Cambridge (1996)
74. Liapounoff, A.: Sur les fonctions-vecteurs complètement additives. Bull. Acad. Sci. URSS.
Sér. Math. [Izvestia Akad. Nauk SSSR] 4, 465–478 (1940)
75. Linderoth, J.T., Shapiro, A., Wright, S.: The empirical behavior of sampling methods for
stochastic programming. Technical Report 02-01, Computer Science Department, University
of Wisconsin-Madison (2002)
76. Lobo, M.S., Vandenberghe, L., Boyd, S., Lebret, H.: Applications of second-order cone pro-
gramming. Linear Algebra Appl. 284, 193–228 (1998)
77. Malliavin, P.: Integration and Probability. Springer, New York (1995). French edition: Masson,
Paris (1982)
78. Mandelbrojt, S.: Sur les fonctions convexes. C. R. Acad. Sci., Paris 209, 977–978 (1939)
79. Maréchal, P.: On the convexity of the multiplicative potential and penalty functions and related
topics. Math. Program. 89(3, Ser. A), 505–516 (2001)
80. Modica, L.: The gradient theory of phase transitions and the minimal interface criterion. Arch.
Ration. Mech. Anal. 98(2), 123–142 (1987)
81. Monahan, G.E.: A survey of partially observable Markov decision processes: theory, models,
and algorithms. Manag. Sci. 28(1), 1–16 (1982)
82. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93,
273–299 (1965)
83. Moreau, J.-J.: Fonctionnelles convexes. In: Leray, J. (ed.) Séminaire sur les équations aux
dérivées partielles, vol. 2, pp. 1–108. Collège de France (1966/1967). www.numdam.org
84. Moreau, J.-J.: Inf-convolution, sous-additivité, convexité des fonctions numériques. J. Math.
Pures Appl. 9(49), 109–154 (1970)
85. Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia (1994)
86. Pereira, M.V.F., Pinto, L.M.V.G.: Multi-stage stochastic optimization applied to energy plan-
ning. Math. Program. 52(2, Ser. B), 359–375 (1991)
87. Pontryagin, L.S., Boltyanskiı̆, V.G., Gamkrelidze, R.V., Mishchenko, E.F.: The Mathematical
Theory of Optimal Processes. Gordon & Breach Science Publishers, New York (1986). Reprint
of the 1962 English translation
88. Powell, M.J.D.: A method for nonlinear constraints in minimization problems. In: Fletcher,
R. (ed.) Optimization, pp. 283–298. Academic, New York (1969)
89. Powell, M.J.D.: Approximation Theory and Methods. Cambridge University Press, Cam-
bridge (1981)
90. Pulleyblank, W.R.: Polyhedral combinatorics. In: Nemhauser, G.L., et al. (eds.) Optimization.
Elsevier, Amsterdam (1989)
91. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley, New York (1994)
92. Puterman, M.L., Shin, M.C.: Modified policy iteration algorithms for discounted Markov
decision problems. Manag. Sci. 24(11), 1127–1137 (1978)
93. Rockafellar, R.T.: Duality theorems for convex functions. Bull. Am. Math. Soc. 70, 189–192
(1964)
94. Rockafellar, R.T.: Extension of Fenchel’s duality theorem for convex functions. Duke Math.
J. 33, 81–90 (1966)
95. Rockafellar, R.T.: Extension of Fenchel’s duality theorem for convex functions. Duke Math.
J. 33, 81–89 (1966)
96. Rockafellar, R.T.: Integrals which are convex functionals. Pacif. J. Math. 24, 525–539 (1968)
97. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
98. Rockafellar, R.T.: Convex integral functionals and duality. In: Contributions to Nonlinear
Functional Analysis (Proc. Sympos., Math. Res. Center, University of Wisconsin, Madison,
Wisconsin, 1971), pp. 215–236. Academic, New York (1971)
99. Rockafellar, R.T.: Integrals which are convex functionals. II. Pacif. J. Math. 39, 439–469
(1971)
100. Rockafellar, R.T.: Conjugate Duality and Optimization. Regional Conference Series in
Applied Mathematics, vol. 16. SIAM, Philadelphia (1974)
101. Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm
in convex programming. Math. Oper. Res. 1, 97–116 (1976)
102. Rockafellar, R.T.: Integral functionals, normal integrands and measurable selections. In: Non-
linear Operators and the Calculus of Variations (Summer School, Univ. Libre Bruxelles, Brus-
sels, 1975). Lecture Notes in Mathematics, vol. 543, pp. 157–207. Springer, Berlin (1976)
103. Rockafellar, R.T., Wets, R.J.-B.: Stochastic convex programming: basic duality. Pacif. J. Math.
62(1), 173–195 (1976)
104. Rockafellar, R.T., Wets, R.J.-B.: Stochastic convex programming: singular multipliers and
extended duality. Pacif. J. Math. 62(2), 507–522 (1976)
105. Royden, H.L.: Real Analysis, 3rd edn. Macmillan Publishing Company, New York (1988)
106. Ruszczynski, A., Shapiro, A. (eds.): Stochastic Programming. Handbook in Operations
Research and Management, vol. 10. Elsevier, Amsterdam (2003)
107. Ruszczyński, A., Shapiro, A.: Conditional risk mappings. Math. Oper. Res. 31(3), 544–561
(2006)
108. Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser (2015)
109. Santos, M.S., Rust, J.: Convergence properties of policy iteration. SIAM J. Control Optim.
42(6), 2094–2115 (electronic) (2004)
110. Schrijver, A.: Theory of Linear and Integer Programming. Wiley, New Jersey (1986)
111. Shapiro, A.: Asymptotic analysis of stochastic programs. Ann. Oper. Res. 30(1–4), 169–186
(1991). Stochastic programming, Part I (Ann Arbor, MI, 1989)
112. Shapiro, A.: Asymptotics of minimax stochastic programs. Stat. Probab. Lett. 78(2), 150–157
(2008)
113. Shapiro, A.: Analysis of stochastic dual dynamic programming method. Eur. J. Oper. Res.
209(1), 63–72 (2011)
114. Shapiro, A., Dentcheva, D., Ruszczynski, A.: Lectures on Stochastic Programming: Modelling
and Theory, 2nd edn. SIAM (2014)
115. Shiryaev, A.N.: Probability. Graduate Texts in Mathematics, vol. 95, 2nd edn. Springer, New
York (1996). Translated from the first (1980) Russian edition by R.P. Boas
116. Sinkhorn, R.: Diagonal equivalence to matrices with prescribed row and column sums. Am.
Math. Mon. 74(4), 402–405 (1967)
117. Sion, M.: On general minimax theorems. Pacif. J. Math. 8, 171–176 (1958)
118. Skorohod, A.V.: Limit theorems for stochastic processes. Teor. Veroyatnost. i Primenen. 1,
289–319 (1956)
119. Tardella, F.: A new proof of the Lyapunov convexity theorem. SIAM J. Control Optim. 28(2),
478–481 (1990)
120. Tibshirani, R.: Regression shrinkage and selection via the lasso: a retrospective. J. R. Stat.
Soc. Ser. B 73, Part 3, 273–282 (2011)
121. Villani, C.: Intégration et analyse de Fourier. ENS Lyon (2007). Revised in 2010
122. Villani, C.: Optimal Transport. Old and New. Springer, Berlin (2009)
123. Wallace, S.W., Ziemba, W.T. (eds.): Applications of Stochastic Programming. MPS/SIAM
Series on Optimization, vol. 5. SIAM, Philadelphia (2005)
124. Wets, R.J.-B.: Stochastic programs with fixed recourse: the equivalent deterministic program.
SIAM Rev. 16, 309–339 (1974)
125. Wolkowicz, H., Saigal, R., Vandenberghe, L. (eds.): Handbook of Semidefinite Programming.
Kluwer Academic Publishers, Boston (2000)
126. Yosida, K., Hewitt, E.: Finitely additive measures. Trans. Am. Math. Soc. 72, 46–66 (1952)
127. Zhou, L.: A simple proof of the Shapley-Folkman theorem. Econom. Theory 3(2), 371–372
(1993)
128. Zou, J., Ahmed, S., Sun, X.A.: Stochastic dual dynamic integer programming. Math. Program.
(2018)
Index
A
Acceptation set, 170
Accessible set, 234
Adjoint state, 215
Algebra, 117
Algorithm
  cutting plane, 268
  Kelley, 268
Application
  measurable, 119
Approximation
  Moreau–Yosida, 38
Approximation in the sense of Chebyshev, 101

B
Backward equation, 215
Biconjugate, 19
Bidual, 89
Bochner, 140
Borel σ-algebra, 118
Borel–Cantelli, 125
Borelian function, 121
Bounded
  in probability, 181

C
Calmness, 51
Carathéodory, 142
Carathéodory theorem, 123
Castaing representation, 143
Cauchy sequence, 5
Class
  recurrent, 257
  transient, 257
Compatibility, 209
Compatible
  multimapping, 144
Conditional
  expectation, 202
  variance, 208
Cone
  normal, 31
  recession, 41
  tangent, 31
Cone of nonincreasing vectors, 80
Conjugate, 17, 21
Contact set, 96
Convergence
  in law, 180
  in measure, 127
  in probability, 127
  narrow, 180
  simple, 120, 233
Convex
  function, 3
  set, 3
Convex closure, 20
Core, 26
Costate, 215
  adapted, 217
Countable additivity, 122
Covariance, 185
Cycle, 257
Cylinder, 125

D
Distribution
  empirical, 190
Disutility, 165
Domain, 1
© Springer Nature Switzerland AG 2019
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2
I
Indicatrix function, 2

N
Neutral risk probability, 167
Nonanticipativity constraint, 218
Norm
  differentiable, 29
  Frobenius, 75
Normal integrand, 146

O
Oblique hyperplane, 19
Optimality condition, 35
Optimization
  semi-infinite, 95

P
Path, 257
Point of interpolation, 104
Policy
  feedback, 229
  open-loop, 248
Polyhedron, 57
Polynomial
  nonnegative, 106
  of Chebyshev, 103
Pontryagin's principle, 251
Positively homogeneous, 4
Preference, 166
Probability
  invariant, 256
Programming
  positive semidefinite, 77
  positive semidefinite linear, 77
  semi-infinite, 95
Projector, 209

R
Recourse, 153
Reference (of a polynomial), 102
Regular
  probability, 179
Regularisation
  Lipschitz, 181
Regularity
  constraints, 96
Relative interior, 9
Risk-averse, 166
Rotationally invariant, 80

S
Sample approximation, 186
SDP relaxation, 87
Second-order cone, 91
Separable, 140
Separation, 6
Set
  feasible, 1
  negligible, 124
  solution, 2
σ-algebra, 117
  complete, 124
  Lebesgue, 125
  product, 118
  trivial, 117
Skorokhod–Dudley representation theorem, 185
Space
  measurable, 117
  measure, 122
  normed, 5
  probability, 122
  separable, 122
Stability, 36
State
  accessible, 257
  communicating, 257
State equation, 215
  linearized, 215
Subadditive, 4
Subdifferential, 22
  partial, 44
Support, 97
  function, 22
  of a measure, 56

T
Time
  exit, 239
  stopping, 240
Trajectory, 273
Transition
  matrix (regular), 259
  operator, 224
Translation invariance, 168

U
Uniform integrability, 134

V
Value, 1
Value at risk, 174
Vertical hyperplane, 19
Vitali, 134

W
Walk, 257