Convex and Stochastic Optimization

This document provides an overview of the book "Convex and Stochastic Optimization" by J. Frédéric Bonnans. The book introduces convex analysis and its applications to stochastic programming. It covers classical topics such as convex functions, duality theory, and specific structures like polyhedra. It also discusses more advanced topics including semidefinite programming, semi-infinite programming, and duality theory for nonconvex problems. The book aims to present the basic tools from convex analysis and measure theory needed to study optimization problems involving uncertainties.

Universitext

J. Frédéric Bonnans

Convex and Stochastic Optimization

Universitext

Series Editors
Sheldon Axler
San Francisco State University
Carles Casacuberta
Universitat de Barcelona
Angus MacIntyre
Queen Mary University of London
Kenneth Ribet
University of California, Berkeley
Claude Sabbah
École Polytechnique, CNRS, Université Paris-Saclay, Palaiseau
Endre Süli
University of Oxford
Wojbor A. Woyczyński
Case Western Reserve University

Universitext is a series of textbooks that presents material from a wide variety of mathematical disciplines at master’s level and beyond. The books, often well class-tested by their author, may have an informal, personal, even experimental approach to their subject matter. Some of the most successful and established books in the series have evolved through several editions, always following the evolution of teaching curricula, to very polished texts.

Thus as research topics trickle down into graduate-level teaching, first textbooks
written for new, cutting-edge courses may make their way into Universitext.

More information about this series at http://www.springer.com/series/223


J. Frédéric Bonnans

Convex and Stochastic Optimization
J. Frédéric Bonnans
Inria-Saclay
and
Centre de Mathématiques Appliquées
École Polytechnique
Palaiseau, France

ISSN 0172-5939 ISSN 2191-6675 (electronic)


Universitext
ISBN 978-3-030-14976-5 ISBN 978-3-030-14977-2 (eBook)
https://doi.org/10.1007/978-3-030-14977-2
Library of Congress Control Number: 2019933717

Mathematics Subject Classification (2010): 90C15, 90C25, 90C39, 90C40, 90C46

© Springer Nature Switzerland AG 2019


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made. The publisher remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
This book is dedicated to Viviane, Juliette,
Antoine, and Na Yeong
Preface

These lecture notes are an extension of those given in the master programs of the Universities Paris VI and Paris-Saclay, and at the École Polytechnique. They give an introduction to convex analysis and its applications to stochastic programming, i.e., to optimization problems where the decision must be taken in the presence of uncertainties. This is an active subject of research that covers many applications.
Classical textbooks are Birge and Louveaux [21], Kall and Wallace [62]. The book
[123] by Wallace and Ziemba is dedicated to applications. Some more advanced
material is presented in Ruszczynski and Shapiro [105], Shapiro et al. [113],
Föllmer and Schied [49], and Carpentier et al. [32]. Let us also mention the historical review paper by Wets [124].
The basic tool for studying such problems is the combination of convex analysis
with measure theory. Classical sources in convex analysis are Rockafellar [96],
Ekeland and Temam [46]. An introduction to integration and probability theory is
given in Malliavin [76].
The author expresses his thanks to Alexander Shapiro (Georgia Tech) for introducing him to the subject, to Darinka Dentcheva (Stevens Institute of Technology), Andrzej Ruszczyński (Rutgers), Michel de Lara, and Jean-Philippe Chancelier (École des Ponts ParisTech) for stimulating discussions, and to Pierre Carpentier, with whom he shared the course on stochastic optimization in the optimization masters at the Université Paris-Saclay.

Palaiseau, France J. Frédéric Bonnans

Contents

1 A Convex Optimization Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


1.1 Convex Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Optimization Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Separation of Convex Sets . . . . . . . . . . . . . . . . . . . . . . . . 4
1.1.3 Weak Duality and Saddle Points . . . . . . . . . . . . . . . . . . . . 10
1.1.4 Linear Programming and Hoffman Bounds . . . . . . . . . . . . 11
1.1.5 Conjugacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2 Duality Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.2.1 Perturbation Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
1.2.2 Subdifferential Calculus . . . . . . . . . . . . . . . . . . . . . . . . . . 44
1.2.3 Minimax Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.2.4 Calmness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
1.3 Specific Structures, Applications . . . . . . . . . . . . . . . . . . . . . . . . . 53
1.3.1 Maxima of Bounded Functions . . . . . . . . . . . . . . . . . . . . . 53
1.3.2 Linear Conical Optimization . . . . . . . . . . . . . . . . . . . . . . . 56
1.3.3 Polyhedra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
1.3.4 Infimal Convolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
1.3.5 Recession Functions and the Perspective Function . . . . . . . 62
1.4 Duality for Nonconvex Problems . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.4.1 Convex Relaxation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
1.4.2 Applications of the Shapley–Folkman Theorem . . . . . . . . . 70
1.4.3 First-Order Optimality Conditions . . . . . . . . . . . . . . . . . . . 72
1.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
2 Semidefinite and Semi-infinite Programming . . . . . . . . . . . . . . . . . . 75
2.1 Matrix Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.1.1 The Frobenius Norm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
2.1.2 Positive Semidefinite Linear Programming . . . . . . . . . . . . 77


2.2 Rotationally Invariant Matrix Functions . . . . . . . . . . . . . . . . . . . . 80


2.2.1 Computation of the Subdifferential . . . . . . . . . . . . . . . . . . 80
2.2.2 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
2.2.3 Logarithmic Penalty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
2.3 SDP Relaxations of Nonconvex Problems . . . . . . . . . . . . . . . . . . 87
2.3.1 Relaxation of Quadratic Problems . . . . . . . . . . . . . . . . . . . 87
2.3.2 Relaxation of Integer Constraints . . . . . . . . . . . . . . . . . . . 90
2.4 Second-Order Cone Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . 91
2.4.1 Examples of SOC Reformulations . . . . . . . . . . . . . . . . . . 91
2.4.2 Linear SOC Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
2.4.3 SDP Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
2.5 Semi-infinite Programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.5.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
2.5.2 Multipliers with Finite Support . . . . . . . . . . . . . . . . . . . . . 97
2.5.3 Chebyshev Approximation . . . . . . . . . . . . . . . . . . . . . . . . 101
2.5.4 Chebyshev Polynomials and Lagrange Interpolation . . . . . . 103
2.6 Nonnegative Polynomials over R . . . . . . . . . . . . . . . . . . . . . . . . 106
2.6.1 Nonnegative Polynomials . . . . . . . . . . . . . . . . . . . . . . . . . 106
2.6.2 Characterisation of Moments . . . . . . . . . . . . . . . . . . . . . . 111
2.6.3 Maximal Loading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
2.7 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
3 An Integration Toolbox . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.1 Measure Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.1.1 Measurable Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.1.2 Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.1.3 Kolmogorov’s Extension of Measures . . . . . . . . . . . . . . . . 125
3.1.4 Limits of Measurable Functions . . . . . . . . . . . . . . . . . . . . 127
3.1.5 Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
3.1.6 Lp Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
3.1.7 Bochner Integrals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
3.2 Integral Functionals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
3.2.1 Minimization of Carathéodory Integrals . . . . . . . . . . . . . . 142
3.2.2 Measurable Multimappings . . . . . . . . . . . . . . . . . . . . . . . . 143
3.2.3 Convex Integrands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
3.2.4 Conjugates of Integral Functionals . . . . . . . . . . . . . . . . . . 146
3.2.5 Deterministic Decisions in Rm . . . . . . . . . . . . . . . . . . . . . 150
3.2.6 Constrained Random Decisions . . . . . . . . . . . . . . . . . . . . 152
3.2.7 Linear Programming with Simple Recourse . . . . . . . . . . . . 153
3.3 Applications of the Shapley–Folkman Theorem . . . . . . . . . . . . . . 156
3.3.1 Integrals of Multimappings . . . . . . . . . . . . . . . . . . . . . . . . 156
3.3.2 Constraints on Integral Terms . . . . . . . . . . . . . . . . . . . . . . 159

3.4 Examples and Exercises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161


3.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4 Risk Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.2 Utility Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.2.1 Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
4.2.2 Optimized Utility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
4.3 Monetary Measures of Risk . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.3.1 General Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
4.3.2 Convex Monetary Measures of Risk . . . . . . . . . . . . . . . . . 169
4.3.3 Acceptation Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
4.3.4 Risk Trading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
4.3.5 Deviation and Semideviation . . . . . . . . . . . . . . . . . . . . . . 172
4.3.6 Value at Risk and CVaR . . . . . . . . . . . . . . . . . . . . . . . . . 174
4.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
5 Sampling and Optimizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.1 Examples and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.1.1 Maximum Likelihood . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
5.2 Convergence in Law and Related Asymptotics . . . . . . . . . . . . . . . 178
5.2.1 Probabilities over Metric Spaces . . . . . . . . . . . . . . . . . . . . 178
5.2.2 Convergence in Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
5.2.3 Central Limit Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . 185
5.2.4 Delta Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
5.2.5 Solving Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
5.3 Error Estimates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
5.3.1 The Empirical Distribution . . . . . . . . . . . . . . . . . . . . . . . . 190
5.3.2 Minimizing over a Sample . . . . . . . . . . . . . . . . . . . . . . . . 191
5.3.3 Uniform Convergence of Values . . . . . . . . . . . . . . . . . . . . 193
5.3.4 The Asymptotic Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
5.3.5 Expectation Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . 195
5.4 Large Deviations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
5.4.1 The Principle of Large Deviations . . . . . . . . . . . . . . . . . . 198
5.4.2 Error Estimates in Stochastic Programming . . . . . . . . . . . . 200
5.5 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
6 Dynamic Stochastic Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
6.1 Conditional Expectation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
6.1.1 Functional Dependency . . . . . . . . . . . . . . . . . . . . . . . . . . 201
6.1.2 Construction of the Conditional Expectation . . . . . . . . . . . 201
6.1.3 The Conditional Expectation of Non-integrable
Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
6.1.4 Computation in Some Simple Cases . . . . . . . . . . . . . . . . . 206

6.1.5 Convergence Theorems . . . . . . . . . . . . . . . . . . . . . . . . . . 207


6.1.6 Conditional Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 208
6.1.7 Compatibility with a Subspace . . . . . . . . . . . . . . . . . . . . . 209
6.1.8 Compatibility with Measurability Constraints . . . . . . . . . . 212
6.1.9 No Recourse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
6.2 Dynamic Stochastic Programming . . . . . . . . . . . . . . . . . . . . . . . . 214
6.2.1 Dynamic Uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
6.2.2 Abstract Optimality Conditions . . . . . . . . . . . . . . . . . . . . . 215
6.2.3 The Growing Information Framework . . . . . . . . . . . . . . . . 217
6.2.4 The Standard full Information Framework . . . . . . . . . . . . . 218
6.2.5 Independent Noises . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.2.6 Elementary Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
6.2.7 Application to the Turbining Problem . . . . . . . . . . . . . . . . 220
6.3 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
7 Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.1 Controlled Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.1.1 Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
7.1.2 The Dynamic Programming Principle . . . . . . . . . . . . . . . . 229
7.1.3 Infinite Horizon Problems . . . . . . . . . . . . . . . . . . . . . . . . 231
7.1.4 Numerical Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
7.1.5 Exit Time Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239
7.1.6 Problems with Stopping Decisions . . . . . . . . . . . . . . . . . . 240
7.1.7 Undiscounted Problems . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.2 Advanced Material on Controlled Markov Chains . . . . . . . . . . . . 243
7.2.1 Expectation Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . 243
7.2.2 Partial Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
7.2.3 Linear Programming Formulation . . . . . . . . . . . . . . . . . . . 254
7.3 Ergodic Markov Chains . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
7.3.1 Orientation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
7.3.2 Transient and Recurrent States . . . . . . . . . . . . . . . . . . . . . 256
7.3.3 Ergodic Dynamic Programming . . . . . . . . . . . . . . . . . . . . 263
7.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
8 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 267
8.1 Stochastic Dual Dynamic Programming (SDDP) . . . . . . . . . . . . . 267
8.1.1 Static Case: Kelley’s Algorithm . . . . . . . . . . . . . . . . . . . . 267
8.1.2 Deterministic Dual Dynamic Programming . . . . . . . . . . . . 269
8.1.3 Stochastic Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272
8.2 Introduction to Linear Decision Rules . . . . . . . . . . . . . . . . . . . . . 275
8.2.1 About the Frobenius Norm . . . . . . . . . . . . . . . . . . . . . . . . 275
8.2.2 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 276

8.2.3 Linear Programming Reformulation . . . . . . . . . . . . . . . . . 276


8.2.4 Linear Conic Reformulation . . . . . . . . . . . . . . . . . . . . . . . 277
8.2.5 Dual Bounds in a Conic Setting . . . . . . . . . . . . . . . . . . . . 278
8.3 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281
9 Generalized Convexity and Transportation Theory . . . . . . . . . . . . . 283
9.1 Generalized Convexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
9.1.1 Generalized Fenchel Conjugates . . . . . . . . . . . . . . . . . . . . 283
9.1.2 Cyclical Monotonicity . . . . . . . . . . . . . . . . . . . . . . . . . . . 285
9.1.3 Duality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 287
9.1.4 Augmented Lagrangian . . . . . . . . . . . . . . . . . . . . . . . . . . 287
9.2 Convex Functions of Measures . . . . . . . . . . . . . . . . . . . . . . . . . . 289
9.2.1 A First Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 290
9.2.2 A Second Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
9.3 Transportation Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
9.3.1 The Compact Framework . . . . . . . . . . . . . . . . . . . . . . . . . 294
9.3.2 Optimal Transportation Maps . . . . . . . . . . . . . . . . . . . . . . 296
9.3.3 Penalty Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . 297
9.3.4 Barycenters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 299
9.4 Notes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 309
Chapter 1
A Convex Optimization Toolbox

Summary This chapter presents the duality theory for optimization problems, by
both the minimax and perturbation approach, in a Banach space setting. Under some
stability (qualification) hypotheses, it is shown that the dual problem has a nonempty
and bounded set of solutions. This leads to the subdifferential calculus, which appears
to be nothing but a partial subdifferential rule. Applications are provided to the infimal
convolution, as well as recession and perspective functions. The relaxation of some
nonconvex problems is analyzed thanks to the Shapley–Folkman theorem.

1.1 Convex Functions

1.1.1 Optimization Problems

1.1.1.1 The Language of Minimization Problems

Denote the set of extended real numbers by R̄ := R ∪ {−∞} ∪ {+∞}. A minimization problem is of the form

Min_x f(x); x ∈ K, (P f,K)

where K is a subset of some set X , and f : X → R̄ (we say that f is extended


real-valued). The domain of f is

dom( f ) := {x ∈ X ; f (x) < +∞}. (1.1)

We say that f is proper if its domain is not empty, and if f (x) > −∞, for all x ∈ X .
The feasible set and value of (P f,K ) are resp.

F(P f,K) := dom(f) ∩ K; val(P f,K) := inf{f(x); x ∈ F(P f,K)}. (1.2)

© Springer Nature Switzerland AG 2019
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_1

Since the infimum over the empty set is +∞, we have that val(P f,K) < +∞ iff
F(P f,K) ≠ ∅. The solution set of (P f,K) is defined as

S(P f,K ) := {x ∈ F(P f,K ); f (x) = val(P f,K )}. (1.3)

Note that S(P f,K ) = ∅ when F(P f,K ) = ∅.


A metric over X is a function, say d : X × X → R+ , such that d(x, y) = 0 iff
x = y, that is symmetric: d(x, y) = d(y, x) and that satisfies the triangle inequality

d(x, z) ≤ d(x, y) + d(y, z), for all x, y, z in X. (1.4)

We say that X is a metric space if it is endowed with a metric. In that case, we say that f is lower semicontinuous, or l.s.c., if for all x ∈ X, f(x) ≤ lim inf_k f(x_k) whenever x_k → x.
A minimizing sequence for problem (P f,K) is a sequence x_k in F(P f,K) such that f(x_k) → val(P f,K). Such a sequence exists iff F(P f,K) is nonempty. Any (infinite) subsequence of a minimizing sequence is itself a minimizing sequence. If X is a metric space, K is closed and f is l.s.c., then any limit point of a minimizing sequence is a solution of (P f,K).

Example 1.1 Consider the problem of minimizing the exponential function over R. The value is finite, but the solution set is empty. Note that minimizing sequences have no limit point in R.
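As a quick numerical illustration of Example 1.1 (this sketch is not from the book): the sequence x_k = −k is minimizing for the exponential, since its values decrease to the optimal value 0, yet the sequence has no limit point in R.

```python
import math

f = math.exp
# x_k = -k is a minimizing sequence: f(x_k) decreases to 0 = val(P),
# yet (x_k) has no limit point in R, so no minimizer exists.
seq = [-k for k in range(1, 200)]
values = [f(x) for x in seq]
print(values[0], values[-1])  # values shrink toward 0 but stay positive
```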

1.1.1.2 Operations on Extended Real-Valued Functions

In the context of minimization problems, f(x) = +∞ is just a way to express that x is not feasible. Therefore the following algebraic rules for extended real-valued functions are to be used: if f and g are extended real-valued functions over X, then h := f + g is the extended real-valued function over X defined by

h(x) = { +∞, if max(f(x), g(x)) = +∞;  f(x) + g(x), otherwise }. (1.5)

Note that there is no ambiguity in this definition (taking the usual addition rules in
the presence of ±∞). The domain of the sum is the intersection of the domains.

Example 1.2 With a subset K of X we associate the indicatrix function I_K : X → R̄ defined by

I_K(x) := { 0, if x ∈ K;  +∞, otherwise }. (1.6)

Let f_K(x) := f(x) + I_K(x). Then (P f,K) has the same feasible set, value and set of solutions as (P f_K,X).
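The rules (1.5) and (1.6) are easy to mimic numerically. The sketch below is a hypothetical illustration, not from the book (the helper names `ext_add` and `indicatrix` are mine); it recasts a constrained minimization over a grid as an unconstrained one, as in Example 1.2.

```python
INF = float("inf")

def ext_add(a, b):
    """Extended-real addition following rule (1.5): +inf dominates."""
    if a == INF or b == INF:
        return INF
    return a + b

def indicatrix(K):
    """The indicatrix I_K of (1.6): 0 on K, +inf outside."""
    return lambda x: 0.0 if K(x) else INF

# Minimizing f over K is the same as minimizing f + I_K everywhere.
f = lambda x: (x - 2.0) ** 2
K = lambda x: x <= 1.0              # feasible set K = (-inf, 1]
fK = lambda x: ext_add(f(x), indicatrix(K)(x))
grid = [i / 10 for i in range(-30, 31)]
print(min(grid, key=fK))            # 1.0: the closest feasible point to 2
```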

If f is an extended real-valued function and λ > 0, we may define λf by the natural rule

(λf)(x) = λ f(x), for all x ∈ X. (1.7)

Observe that (P λf,K) has the same feasible set and set of solutions as (P f,K), and the values are related by

val(P λf,K) = λ val(P f,K). (1.8)

For λ = 0, we must think of 0f as the limit of λf as λ ↓ 0, and therefore set

(0f)(x) = { f(x), if f(x) = ±∞;  0, otherwise }. (1.9)

Then (P 0f,K) has the same feasible set as (P f,K), and its set of solutions is F(P f,K) if f is proper.

Example 1.3 Consider the entropy function f (x) = x log x (with the convention
that 0 log 0 = 0) if x ≥ 0, and +∞ otherwise. Then 0 f is the indicatrix of R+ .
More generally, if f is proper, then 0 f is the indicatrix of its domain.
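The convention (1.9) can be illustrated numerically (a hedged sketch, not from the book; the function names are mine): scaling the entropy function of Example 1.3 by λ ↓ 0 flattens its finite values to 0 while keeping the +∞ values that encode infeasibility, yielding the indicatrix of R_+.

```python
import math

def entropy(x):
    """Entropy function of Example 1.3, with the convention 0 log 0 = 0."""
    if x < 0:
        return float("inf")
    if x == 0:
        return 0.0
    return x * math.log(x)

def scale(lam, f):
    """lam * f for lam > 0; for lam = 0, apply rule (1.9) instead."""
    if lam > 0:
        return lambda x: lam * f(x)
    return lambda x: f(x) if abs(f(x)) == float("inf") else 0.0

# As lam decreases to 0, lam*entropy tends to the indicatrix of R_+:
# finite values are flattened to 0 while +inf (infeasibility) persists.
for lam in (1.0, 0.1, 0.0):
    g = scale(lam, entropy)
    print(lam, g(0.5), g(-1.0))
```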

1.1.1.3 Maximization Problems

For a maximization problem

Max_x g(x); x ∈ K, (D g,K)

we have similar conventions, adapted to the maximization framework. In particular the domain of g is dom(g) = {x ∈ X; g(x) > −∞}. The domain of the sum is still the intersection of the domains. Note that (D g,K) is essentially the same problem as

Min_x −g(x); x ∈ K. (P −g,K)

Indeed these two problems have the same feasible set and set of solutions, and they have opposite values.

1.1.1.4 Convex Sets and Functions

Let X be a vector space. We say that K ⊂ X is convex if

For any x and y in K , and α ∈ (0, 1), we have that αx + (1 − α)y ∈ K . (1.10)

We say that f : X → R̄ is convex if

for any x and y in dom(f), and α ∈ (0, 1), we have that
f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y). (1.11)

We see that a convex function has a convex domain.


The epigraph of f : X → R̄ is the set

epi( f ) := {(x, α) ∈ X × R; α ≥ f (x)}. (1.12)

Its projection over X is dom( f ). One easily checks the following:

Lemma 1.4 Let f : X → R̄. Then


(i) its epigraph is convex iff f is convex,
(ii) if X is a metric space, its epigraph is closed iff f is l.s.c.

Example 1.5 The epigraph of the indicatrix of K ⊂ X is K × R+ .
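Lemma 1.4(i) can be spot-checked numerically for a convex function such as f(x) = x²: random convex combinations of epigraph points should remain in the epigraph. The following is an illustration only, not a proof.

```python
import random

def in_epi(f, x, a):
    """(x, a) belongs to epi(f) iff a >= f(x), cf. (1.12)."""
    return a >= f(x)

f = lambda x: x * x        # a convex function, so epi(f) should be convex
random.seed(0)             # reproducible draws
ok = True
for _ in range(1000):
    x1, x2 = random.uniform(-5, 5), random.uniform(-5, 5)
    # lift each point into the epigraph with a positive margin
    a1 = f(x1) + random.uniform(0.1, 3.0)
    a2 = f(x2) + random.uniform(0.1, 3.0)
    t = random.random()
    # a convex combination of epigraph points must stay in the epigraph
    ok = ok and in_epi(f, t * x1 + (1 - t) * x2, t * a1 + (1 - t) * a2)
print(ok)  # True
```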

1.1.2 Separation of Convex Sets

We recall without proof the Hahn–Banach theorem, valid in a vector space setting,
and deduce from it some results of separation of convex sets in normed vector spaces.

1.1.2.1 The Hahn–Banach Theorem

Let X be a vector space. We say that p : X → R is positively homogeneous and subadditive if it satisfies

(i) p(αx) = α p(x), for all x ∈ X and α > 0;
(ii) p(x + y) ≤ p(x) + p(y), for all x and y in X. (1.13)

Remark 1.6 (a) Taking x = 0 in (1.13)(i), we obtain that p(0) = 0, and so we could
as well take α = 0 in (1.13)(i).
(b) If β ∈ (0, 1), combining the above relations, we obtain that

p(βx + (1 − β)y) ≤ βp(x) + (1 − β) p(y), (1.14)

i.e., p is convex. Conversely, it is easily checked that a positively homogeneous


(finite-valued) convex function is subadditive.

The analytical form of the Hahn–Banach theorem, a nontrivial consequence of


Zorn’s lemma, is as follows (see [28] for a proof):

Theorem 1.7 Let p satisfy (1.13), let X_1 be a vector subspace of X, and let λ be a linear form defined on X_1 that is dominated by p in the sense that

λ(x) ≤ p(x), for all x ∈ X_1. (1.15)

Then there exists a linear form μ on X, dominated by p, whose restriction to X_1 coincides with λ.

We say that a real vector space X is a normed space when endowed with a mapping X → R, x ↦ ‖x‖, satisfying the three axioms

(i) ‖x‖ ≥ 0, with equality iff x = 0;
(ii) ‖αx‖ = |α| ‖x‖, for all α ∈ R, x ∈ X;
(iii) ‖x + x′‖ ≤ ‖x‖ + ‖x′‖ (triangle inequality). (1.16)

Then (x, y) ↦ ‖x − y‖ is a metric over X. We denote the norm of Euclidean spaces, i.e., finite-dimensional spaces endowed with the norm (Σ_i x_i²)^{1/2}, by |x|.
A sequence x_k in a normed vector space X is said to be a Cauchy sequence if ‖x_p − x_q‖ → 0 when p, q ↑ ∞. We say that X is a Banach space if every Cauchy sequence has a (necessarily unique) limit.
The topological dual X∗ of the normed vector space X is the set of continuous linear forms (maps X → R) on X. In the sequel, by dual space we will mean the topological dual. We denote the duality product between x∗ ∈ X∗ and x ∈ X by ⟨x∗, x⟩_X or simply ⟨x∗, x⟩. Note that a linear form, say ℓ over X, is continuous iff it is continuous at 0, which holds iff sup{ℓ(x); ‖x‖ ≤ 1} < ∞. So we may endow X∗ with the norm

‖x∗‖∗ := sup{⟨x∗, x⟩; ‖x‖ ≤ 1}. (1.17)

It is easily checked that X ∗ is a Banach space. The dual of Rn (space of vertical


vectors) is denoted by Rn∗ (space of horizontal vectors).
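As an illustration of (1.17) (this example is not in the text): on Rⁿ with the ℓ1 norm, the supremum in (1.17) is attained at a signed coordinate vector, so the dual norm is the ℓ∞ norm. A brief Python check:

```python
def dual_norm_l1(xstar):
    """Evaluate (1.17) for the l1 norm on R^n: the unit ball's extreme
    points are the signed coordinate vectors, so it suffices to test them."""
    n = len(xstar)
    best = 0.0
    for i in range(n):
        for sign in (1.0, -1.0):
            x = [0.0] * n
            x[i] = sign                       # ||x||_1 = 1
            pairing = sum(a * b for a, b in zip(xstar, x))
            best = max(best, pairing)
    return best

xstar = [3.0, -7.0, 2.0]
print(dual_norm_l1(xstar))  # 7.0, i.e. max_i |x*_i| (the l_infinity norm)
```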
In the sequel we may simply denote the dual norm by ‖x∗‖. If X and Y are Banach spaces, we denote by L(X, Y) the Banach space of linear continuous mappings X → Y, endowed with the norm ‖A‖ := sup{‖Ax‖; ‖x‖ ≤ 1}. We denote by B_X (resp. B̄_X) the open (resp. closed) unit ball of X. If x∗_1 is a continuous linear form on a linear subspace X_1 of X, its norm is defined accordingly:

‖x∗_1‖_{1,∗} = sup{⟨x∗_1, x⟩; x ∈ X_1, ‖x‖ ≤ 1}. (1.18)

Here are some other corollaries of the Hahn–Banach theorem.

Corollary 1.8 Let x∗_1 be a continuous linear form on a linear subspace X_1 of the normed space X. Then there exists an x∗ ∈ X∗ whose restriction to X_1 coincides with x∗_1, and such that

‖x∗‖∗ = ‖x∗_1‖_{1,∗}. (1.19)

Proof Apply Theorem 1.7 with p(x) := ‖x∗_1‖_{1,∗} ‖x‖. Since ⟨x∗, ±x⟩ ≤ p(x), we have that ‖x‖ ≤ 1 implies ⟨x∗, ±x⟩ ≤ ‖x∗_1‖_{1,∗}. The result follows. □

Corollary 1.9 Let x_0 belong to the normed vector space X. Then there exists an x∗ ∈ X∗ such that ‖x∗‖ = 1 and ⟨x∗, x_0⟩ = ‖x_0‖.

Proof Apply Corollary 1.8 with X_1 = Rx_0 and x∗_1(t x_0) = t ‖x_0‖, for t ∈ R. □

The orthogonal of E ⊂ X is the closed subspace of X∗ defined by

E⊥ := {x∗ ∈ X∗; ⟨x∗, x⟩ = 0, for all x ∈ E}. (1.20)

Lemma 1.10 Let E be a subspace of X. Then E⊥ = {0} iff E is dense.

Proof (a) If E is dense, given x ∈ X, there exists a sequence x_k in E, x_k → x, and hence, for all x∗ ∈ E⊥, ⟨x∗, x⟩ = lim_k ⟨x∗, x_k⟩ = 0, proving that x∗ = 0.
(b) If E is not dense, let x_0 ∉ Ē (closure of E). We may assume that ‖x_0‖ = 1 and that B(x_0, ε) ∩ E = ∅ for some ε > 0. Let E_0 := E ⊕ (Rx_0) denote the space spanned by E and x_0. Consider the linear form λ on E_0 defined by

λ(e + αx_0) = α, for all e ∈ E and α ∈ R. (1.21)

Since any x ∈ E_0 has a unique decomposition as x = e + αx_0 with e ∈ E and α ∈ R, the linear form is well-defined. Let such an x satisfy α ≠ 0. Since e′ := −e/α does not belong to B(x_0, ε), we have that ‖x‖ = |α| ‖x_0 − e′‖ ≥ ε|α|, and hence, λ(x) = α ≤ ‖x‖/ε. If α = 0 we still have λ(x) ≤ ‖x‖/ε. By Corollary 1.8, λ has an extension to a continuous linear form on X, which is a nonzero element of E⊥. □

Bidual space, reflexivity

Given x ∈ X, the mapping x̂ : X∗ → R, x∗ ↦ ⟨x∗, x⟩ is by (1.17) linear and continuous. Since |⟨x∗, x⟩| ≤ ‖x∗‖ ‖x‖, its norm ‖x̂‖ (in the bidual space X∗∗) is not greater than ‖x‖, and as a consequence of Corollary 1.9, is equal to ‖x‖: the mapping x ↦ x̂ is isometric. This allows us to identify X with a closed subspace of X∗∗. We say that X is reflexive if X = X∗∗. Hilbert spaces are reflexive, see [28].

1.1.2.2 Separation Theorems

We assume here that X is a normed vector space. A (topological) hyperplane of X
is a set of the form

H_{x∗,α} := {x ∈ X; ⟨x∗, x⟩ = α}, for some (x∗, α) ∈ X∗ × R with x∗ ≠ 0. (1.22)

We call a set of the form

{x ∈ X; ⟨x∗, x⟩ ≤ α}, where x∗ ≠ 0, (1.23)

a (closed) half-space of X .
1.1 Convex Functions 7

Definition 1.11 Let A and B be two subsets of X. We say that the hyperplane H_{x∗,α}
separates A and B if

⟨x∗, a⟩ ≤ α ≤ ⟨x∗, b⟩, for all (a, b) ∈ A × B. (1.24)

We speak of a strict separation if

⟨x∗, a⟩ < ⟨x∗, b⟩, for all (a, b) ∈ A × B, (1.25)

and of a strong separation if, for some ε > 0,

⟨x∗, a⟩ + ε ≤ α ≤ ⟨x∗, b⟩ − ε, for all (a, b) ∈ A × B. (1.26)

We say that x∗ ∈ X∗ (nonzero) separates A and B if (1.24) holds for some α, strictly
separates A and B if (1.25) holds, and strongly separates A and B if (1.26) holds for
some ε > 0 and α. If A is the singleton {a}, then we say that x∗ separates a and B,
etc.
Given two subsets A and B of a vector space X , we define their Minkowski sum
and difference as 
A + B = {a + b; a ∈ A, b ∈ B},
(1.27)
A − B = {a − b; a ∈ A, b ∈ B}.

The first geometric form of the Hahn–Banach theorem is as follows:


Theorem 1.12 Let A and B be two nonempty subsets of the normed vector space X,
with empty intersection. If A − B is convex and has a nonempty interior, then there
exists a hyperplane H_{x∗,α} separating A and B, such that

⟨x∗, a⟩ < ⟨x∗, b⟩, whenever (a, b) ∈ A × B and a − b ∈ int(A − B). (1.28)

Note that A − B has a nonempty interior whenever either A or B has a nonempty


interior. The proof needs the following concept.
Definition 1.13 Let C be a convex subset of X whose interior contains 0. The gauge
function of C is
gC(x) := inf{β > 0; β⁻¹x ∈ C}. (1.29)

Example 1.14 If C is the closed unit ball of X, then gC(x) = ‖x‖ for all x ∈ X.
A gauge function is obviously positively homogeneous and finite. If B(0, ε) ⊂ C
for some ε > 0, then
gC(x) ≤ ‖x‖/ε, for all x ∈ X, (1.30)

so it is bounded over bounded sets. In addition, for any β > gC (x) and γ > 0, since
x ∈ βC and B(0, γ ε) ⊂ γ C, we get x + B(0, γ ε) ⊂ (β + γ )C, so that gC (y) ≤
gC (x) + γ , for all y ∈ B(x, γ ε). We have proved that

If B(0, ε) ⊂ C, then gC is Lipschitz with constant 1/ε. (1.31)

It easily follows that

{x ∈ X ; gC (x) < 1} = int(C) ⊂ C̄ = {x ∈ X ; gC (x) ≤ 1}. (1.32)

Lemma 1.15 A gauge is subadditive and convex.

Proof Let x and y belong to X. For all βx > gC(x) and βy > gC(y), we have that
(βx)⁻¹x ∈ C and (βy)⁻¹y ∈ C, so that

(x + y)/(βx + βy) = [βx/(βx + βy)] (βx)⁻¹x + [βy/(βx + βy)] (βy)⁻¹y ∈ C. (1.33)

Therefore, gC(x + y) ≤ βx + βy. Since this holds for any βx > gC(x) and βy >
gC(y), we obtain that gC is subadditive. Since gC is positively homogeneous, it
easily follows that gC is convex. □
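The gauge of a specific set is easy to probe numerically. The following sketch (ours, not the book's) takes C = [−1, 1]², the unit ball of the sup-norm in R², for which Example 1.14 predicts gC(x) = max(|x₁|, |x₂|); the membership test and bisection helper are our own illustrative constructions.

```python
def in_C(z):
    """Membership test for C = [-1, 1]^2, a convex set whose interior contains 0."""
    return abs(z[0]) <= 1.0 and abs(z[1]) <= 1.0

def gauge(x):
    """g_C(x) = inf{beta > 0 : x/beta in C}, computed by bisection on beta."""
    lo, hi = 0.0, 1e6
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid > 0.0 and in_C((x[0] / mid, x[1] / mid)):
            hi = mid  # beta = mid is admissible, try a smaller one
        else:
            lo = mid
    return hi

# Example 1.14 specialized to the sup-norm ball: g_C(x) = max(|x1|, |x2|)
assert abs(gauge((3.0, -1.0)) - 3.0) < 1e-6
assert abs(gauge((0.5, 2.0)) - 2.0) < 1e-6
# subadditivity, as proved in Lemma 1.15
x, y = (3.0, -1.0), (0.5, 2.0)
assert gauge((x[0] + y[0], x[1] + y[1])) <= gauge(x) + gauge(y) + 1e-6
```

Replacing `in_C` with the membership test of any other convex set containing a ball around 0 gives the corresponding gauge.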

Proof (Proof of Theorem 1.12) Let x₀ ∈ int(B − A); since A ∩ B = ∅, x₀ ≠ 0. Set

C := {a − b + x₀; a ∈ A, b ∈ B}. (1.34)

We easily check that 0 ∈ int(C). Obviously, x∗ separates A and B iff it separates
C and {x₀}. Let λ be the linear form defined on X₁ := Rx₀ by λ(tx₀) = t, for t ∈
R. Since A ∩ B = ∅, x₀ ∉ C, and hence gC(x₀) ≥ 1. It easily follows that λ is
dominated by gC on X₁. By Theorem 1.7, there exists a (possibly not continuous)
linear form x∗ on X, dominated by gC, whose restriction to X₁ coincides with λ. Since
0 ∈ int(C) we have that (1.30) holds for some ε > 0, and hence, being dominated
by gC, x∗ is continuous. It follows that ⟨x∗, x⟩ ≤ 1 for all x ∈ C, or equivalently

⟨x∗, a⟩ − ⟨x∗, b⟩ + ⟨x∗, x₀⟩ ≤ 1, for all (a, b) ∈ A × B, (1.35)

whereas ⟨x∗, x₀⟩ = 1. Therefore x∗ separates A and B. In addition, if a − b ∈
int(A − B), say B(a − b, ε) ⊂ A − B for some ε > 0, then ⟨x∗, a − b + e⟩ ≤ 0
whenever ‖e‖ ≤ ε; maximizing over e ∈ B(0, ε) we obtain that ⟨x∗, a − b⟩ ≤
−ε‖x∗‖. Relation (1.28) follows. □

Corollary 1.16 Let E be a closed convex subset of the normed space X. Then there
exists a hyperplane that strongly separates any x₀ ∉ E and E.

Proof For ε > 0 small enough, the open convex set A := B(x₀, ε) has empty intersection
with E. By Theorem 1.12, there exists an x∗ ≠ 0 separating A and E, that
is,
⟨x∗, x₀⟩ + ε‖x∗‖∗ = sup{⟨x∗, a⟩; a ∈ A} ≤ inf{⟨x∗, b⟩; b ∈ E}. (1.36)

The conclusion follows. 



Remark 1.17 Corollary 1.16 can be reformulated as follows: any closed convex
subset of a normed space is the intersection of half spaces in which it is contained.

The following example shows that, even in a Hilbert space, one cannot in general
separate two convex sets with empty intersection.

Example 1.18 Let X = ℓ² be the space of square-summable real sequences. Let C
be the subset of X of sequences with finitely many nonzero coefficients, the last one
being positive. Then C is a convex cone that does not contain 0. Let x∗ separate 0
and C. We can identify the Hilbert space X with its dual, and therefore x∗ with an
element of X. Since each element ei of the natural basis belongs to the cone C, we
must have xi∗ ≥ 0 for all i, and x∗j > 0 for some j. For any ε > 0 small enough,
x := −ej + εe(j+1) belongs to C, but ⟨x∗, x⟩ = −x∗j + εx∗(j+1) < 0, contradicting
the separation. So, 0 and C cannot be separated.

1.1.2.3 Relative Interior

Again, let X be a normed vector space, and let E be a convex subset of X. We denote
by affhull(E) the intersection of the affine spaces containing E, and by cl affhull(E) its
closure; the latter is the smallest closed affine space containing E. The relative
interior of E, denoted by rint(E), is the interior of E viewed as a subset of cl affhull(E).

Proposition 1.19 Let A and B be two nonempty subsets of X, with empty intersection.
If A − B is convex and has a nonempty relative interior, then there exists a
hyperplane H_{x∗,α} separating A and B, and such that

⟨x∗, a⟩ < ⟨x∗, b⟩, whenever (a, b) ∈ A × B and a − b ∈ rint(A − B). (1.37)

Proof Set E := B − A and Y := cl affhull(E). By Theorem 1.12 (applied in the space Y),
there exists a y∗ in Y∗ separating 0 and E, with strict inequality on rint(E). By Theorem 1.7, there
exists an x∗ ∈ X∗ whose restriction to Y is y∗, and the conclusion holds with x∗. □

Remark 1.20 Applying the previous proposition when B = {b} is a singleton, and noting that
rint(A − b) = rint(A) − b, we obtain that, when A is convex, if b ∉ rint(A) then
there exists an x∗ ∈ X∗ such that

⟨x∗, a⟩ < ⟨x∗, b⟩, whenever a ∈ rint(A). (1.38)

Since any convex subset of a finite-dimensional subspace has a nonempty relative


interior,1 we deduce the following:

1 Except maybe when the set is a singleton and then the dimension is zero, where this is a matter
of definition. However the case when A − B reduces to a singleton means that both A and B are
singletons and then it is easy to separate them.

Corollary 1.21 Let A and B be two convex and nonempty subsets of a Euclidean
space, with empty intersection. Then there exists a hyperplane Hx ∗ ,α separating A
and B, such that (1.37) holds.

1.1.3 Weak Duality and Saddle Points

Let X and Y be two sets and let L : X × Y → R. Then we have the weak duality
inequality
sup_{y∈Y} inf_{x∈X} L(x, y) ≤ inf_{x∈X} sup_{y∈Y} L(x, y). (1.39)

Indeed, let (x0 , y0 ) ∈ X × Y . Then

inf_{x∈X} L(x, y₀) ≤ L(x₀, y₀) ≤ sup_{y∈Y} L(x₀, y). (1.40)

Removing the middle term, maximizing the left-hand side w.r.t. y0 and minimizing
the right-hand side w.r.t. x0 , we obtain (1.39). We next define the primal and dual
values, resp., for x ∈ X and y ∈ Y , by

p(x) := sup_{y∈Y} L(x, y); d(y) := inf_{x∈X} L(x, y), (1.41)

and the primal and dual problems by

Min_{x∈X} p(x), (P)

Max_{y∈Y} d(y). (D)

The weak duality inequality says that val(D) ≤ val(P). We say that (x̄, ȳ) ∈ X × Y
is a saddle point of L over X × Y if

L(x̄, y) ≤ L(x̄, ȳ) ≤ L(x, ȳ), for all (x, y) ∈ X × Y. (1.42)

An equivalent relation is

sup_{y∈Y} L(x̄, y) = L(x̄, ȳ) = inf_{x∈X} L(x, ȳ). (1.43)

Minorizing the left-hand term by changing x̄ into the infimum w.r.t. x ∈ X and
majorizing symmetrically the right-hand term, we obtain

inf_{x∈X} sup_{y∈Y} L(x, y) ≤ L(x̄, ȳ) ≤ sup_{y∈Y} inf_{x∈X} L(x, y), (1.44)

which, combined with the weak duality inequality, shows that x̄ ∈ S(P), ȳ ∈ S(D)
and
val(D) = val(P) = L(x̄, ȳ). (1.45)

But we have more: in fact, if we denote by SP(L) the set of saddle points, then:

Lemma 1.22 The following holds:

{val(D) = val(P) is finite} ⇒ SP(L) = S(P) × S(D). (1.46)

Proof Indeed, let x̄ ∈ S(P) and ȳ ∈ S(D). Then

val(D) = inf_{x∈X} L(x, ȳ) ≤ L(x̄, ȳ) ≤ sup_{y∈Y} L(x̄, y) = val(P). (1.47)

If val(D) = val(P), then these inequalities are equalities, so that (1.43) holds,
and therefore (x̄, ȳ) is a saddle point. The converse implication has already been
obtained. 
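When X and Y are finite index sets, the suprema and infima in (1.39)–(1.45) become maxima and minima over a payoff table, so both sides of the weak duality inequality can be computed exactly. The following sketch (our toy tables, not from the text) shows one table with a strict duality gap and one whose primal and dual values coincide at a saddle point:

```python
def dual_value(L):
    """sup_y inf_x L(x, y): the best column of the column-wise minima."""
    return max(min(row[j] for row in L) for j in range(len(L[0])))

def primal_value(L):
    """inf_x sup_y L(x, y): the best row of the row-wise maxima."""
    return min(max(row) for row in L)

L1 = [[0, 1], [1, 0]]   # no saddle point: val(D) = 0 < 1 = val(P)
L2 = [[1, 2], [0, 3]]   # (x, y) = (row 0, column 1) is a saddle point of value 2

assert dual_value(L1) <= primal_value(L1)        # weak duality (1.39)
assert (dual_value(L1), primal_value(L1)) == (0, 1)
assert dual_value(L2) == primal_value(L2) == 2   # no gap at a saddle point
```

The table `L1` is the classical matching-pennies payoff, the standard example of a strict gap between the two sides of (1.39).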

1.1.4 Linear Programming and Hoffman Bounds

1.1.4.1 Linear Programming

We assume in this section that X is a vector space (with no associated topology; this
abstract setting does not make the proofs more complicated). Consider the infinite-
dimensional linear program

Min_{x∈X} ⟨c, x⟩; ⟨ai, x⟩ ≤ bi, i = 1, . . . , p, (L P)

where c and ai, i = 1, . . . , p, are linear forms over X, b ∈ R^p, and ⟨·, ·⟩ denotes the
action of a linear form over X. The associated Lagrangian function L : X × R^{p∗} →
R is defined as

L(x, λ) := ⟨c, x⟩ + Σ_{i=1}^p λi (⟨ai, x⟩ − bi), (1.48)

where the multiplier λ has to belong to R^{p∗}_+. The primal value satisfies

p(x) = sup_{λ∈R^{p∗}_+} L(x, λ) = ⟨c, x⟩ if x ∈ F(L P), and +∞ otherwise. (1.49)

Therefore (L P) and the primal problem (of minimizing p(x)) have the same value
and set of solutions. Since L(x, λ) = ⟨c + Σ_{i=1}^p λi ai, x⟩ − λb, we have that

d(λ) = inf_x L(x, λ) = −λb if c + Σ_{i=1}^p λi ai = 0, and −∞ otherwise. (1.50)

The dual problem has therefore the same value and set of solutions as the following
problem, called dual to (L P):

Max_{λ∈R^{p∗}_+} −λb; c + Σ_{i=1}^p λi ai = 0. (L D)

For x ∈ F(L P), we denote the associated set of active constraints by

I(x) := {1 ≤ i ≤ p; ⟨ai, x⟩ = bi}. (1.51)

Consider the optimality system

(i) c + Σ_{i=1}^p λi ai = 0,
(ii) λi ≥ 0, ⟨ai, x⟩ ≤ bi, λi (⟨ai, x⟩ − bi) = 0, i = 1, . . . , p. (1.52)

Lemma 1.23 The pair (x, λ) ∈ F(L P) × F(L D) is a saddle point of the
Lagrangian iff (1.52) holds.

Proof Let x ∈ F(L P) and λ ∈ F(L D). Then (1.52)(i) holds, implying that the
difference of the cost functions is equal to

⟨c, x⟩ + λb = Σ_{i=1}^p λi (bi − ⟨ai, x⟩). (1.53)

This sum of nonnegative terms is equal to zero iff the last relation of (1.52)(ii)
holds, and then x ∈ S(L P) and λ ∈ S(L D), proving that (x, λ) is a saddle point.
The converse implication is easily obtained. 
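On a concrete finite-dimensional instance, the coincidence of the values of (L P) and (L D) (established below in Lemma 1.26) and the complementarity condition in (1.52)(ii) can be verified numerically. The sketch below is our example, not the book's, and assumes SciPy's `linprog` is available; both problems are solved directly in the form given in the text.

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 1.0])
A = np.array([[-1.0, 0.0],    # -x1 <= -1, i.e. x1 >= 1
              [0.0, -1.0]])   # -x2 <= -2, i.e. x2 >= 2
b = np.array([-1.0, -2.0])

# primal (L P): min <c, x>  s.t.  A x <= b
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)
# dual (L D): max -b.lam  s.t.  c + A^T lam = 0, lam >= 0
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * 2)
assert primal.status == 0 and dual.status == 0

val_p, val_d = primal.fun, -dual.fun
assert abs(val_p - val_d) < 1e-8                 # val(L P) = val(L D) = 3
slack = b - A @ primal.x
assert np.all(np.abs(dual.x * slack) < 1e-8)     # complementarity in (1.52)(ii)
```

Here the primal solution is x = (1, 2) with active constraints I(x) = {1, 2}, and the dual solution λ = (1, 1) satisfies (1.52)(i): c + λ₁a₁ + λ₂a₂ = 0.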

We next deal with the existence of solutions.

Lemma 1.24 If (L P) (resp. (L D)) has a finite value, then its set of solutions is
nonempty.

Proof The proof of the two cases being similar, it suffices to prove the first statement.
Since (L P) has a finite value, there exists a minimizing sequence x^k. Extracting a
subsequence if necessary, we may assume that I(x^k) is constant, say equal to J.
Among such minimizing sequences, we may assume that J is of maximal cardinality.
If ⟨c, x^k⟩ has, for large enough k, a constant value, then the corresponding x^k
is a solution of (L P). Otherwise, extracting a subsequence if necessary, we may
assume that ⟨c, x^{k+1}⟩ < ⟨c, x^k⟩ for all k. Set d^k := x^{k+1} − x^k, and consider the set
E^k := {ρ ≥ 0; x^k + ρd^k ∈ F(L P)}. Since ⟨c, d^k⟩ < 0 and val(L P) > −∞, this
set is bounded. Its maximal element is

ρ_k := min_i {(bi − ⟨ai, x^k⟩)/⟨ai, d^k⟩; ⟨ai, d^k⟩ > 0}. (1.54)

We have that y^k := x^k + ρ_k d^k ∈ F(L P), ⟨c, y^k⟩ < ⟨c, x^k⟩ and J ⊂ I(y^k) strictly.
Extracting from the y^k a minimizing sequence with a constant set of active constraints
strictly containing J, we contradict the definition of J. □
Lemma 1.25 Given y¹, . . . , y^q in Rⁿ, the set E := {Σ_{i=1}^q λi y^i; λ ≥ 0} is closed.

Proof (i) Assume first that the y^i are linearly independent, and denote by Y the
generated vector space, of which they are a basis. The coefficients λi represent the
coordinates in this basis, and so, if e^k → e in E, its coordinates converge to those of
e and the result follows.
(ii) We next deal with the general case. Let e^k → e in E. Let us associate with each
e^k some λ^k ∈ R^q_+ such that e^k = Σ_{i=1}^q λ^k_i y^i, and such that λ^k has minimal support (the
support being the set of nonzero components). Taking if necessary a subsequence,
we may assume that this support J is constant along the sequence. We claim that
{y^i, i ∈ J} is linearly independent: for if Σ_{i=1}^q μi y^i = 0 with μ ≠ 0, and μi = 0
when λ^k_i = 0, we can find βk so that μ^k := λ^k + βk μ is nonnegative and has a support
smaller than J, and e^k = Σ_{i=1}^q μ^k_i y^i, in contradiction with the definition of J.
Since {y^i, i ∈ J} is linearly independent, we deduce as in step (i) that the λ^k
converge to some λ̄ and that e = Σ_{i=1}^q λ̄i y^i. The conclusion follows. □
We next prove a strong duality result.
Lemma 1.26 If val(L P) is finite, then val(L P) = val(L D) and both S(L P) and
S(L D) are nonempty.
Proof By Lemma 1.24, (L P) has a solution x̄. Set J := I(x̄) and j := |J| (the cardinality
of J); we may assume that J = {1, . . . , j}. Consider the set

C := {d ∈ X; ⟨ai, d⟩ ≤ 0, i ∈ J}. (1.55)

Obviously, if d ∈ C, then x̄ + ρd ∈ F(L P) for small enough ρ > 0, and consequently

⟨c, d⟩ ≥ 0, for all d ∈ C. (1.56)

Consider the mapping Ax := (⟨a1, x⟩, . . . , ⟨aj, x⟩, ⟨c, x⟩) over X, with image E1 ⊂
R^{j+1}. We claim that the point z := (0, . . . , 0, −1) does not belong to the set
E2 := E1 + R^{j+1}_+. Indeed, otherwise we would have z ≥ Ax for some x ∈ X, and
so ⟨ai, x⟩ ≤ 0 for all i ∈ J, whereas z_{j+1} = −1 contradicts (1.56).
Let z¹, . . . , z^q be a basis of the vector space E1. Then E2 is the set of nonnegative
linear combinations of {±z¹, . . . , ±z^q, e¹, . . . , e^{j+1}}, where by e^i we denote the
elements of the natural basis of R^{j+1}. By Lemma 1.25, E2 is closed. Corollary 1.16
allows us to strictly separate z and E2. That is,

−λ_{j+1} = λ · z < inf_{y∈E2} λ · y, for some nonzero λ ∈ R^{j+1}. (1.57)

Since E2 is a cone, the above infimum is zero, whence λ_{j+1} > 0. Changing λ into
λ/λ_{j+1} if necessary, we may assume that λ_{j+1} = 1. Since E2 = E1 + R^{j+1}_+, it follows
that λ ∈ R^{j+1}_+, and since E1 ⊂ E2 we deduce that 0 ≤ ⟨Σ_{i∈J} λi ai + c, d⟩ for all
d ∈ X, meaning that Σ_{i∈J} λi ai + c = 0. Let us now set λi := 0 for the indices i ∈ {1, . . . , p} \ J.
Then (1.52) holds at the point x̄. By Lemma 1.23, (x̄, λ) is a saddle point and the
conclusion follows. □

Remark 1.27 It may happen, even in a finite-dimensional setting, that both the pri-
mal and dual problem are unfeasible, so that they have value +∞ and −∞ resp.;
consider for instance the problem Min x∈R {−x; 0 × x = 1; −x ≤ 0}, whose dual is
Maxλ∈R2 {−λ1 ; −1 − λ2 = 0; λ2 ≥ 0}.

1.1.4.2 Hoffman Bounds

As an application of linear programming duality we present Hoffman’s lemma [59].

Lemma 1.28 Given a Banach space X, a1, . . . , ap in X∗, and b ∈ R^p, set

Cb := {x ∈ X; ⟨ai, x⟩ ≤ bi, i = 1, . . . , p}. (1.58)

Then there exists a Hoffman constant M > 0, not depending on b, such that, if
Cb ≠ ∅, then

dist(x, Cb) ≤ M Σ_{i=1}^p (⟨ai, x⟩ − bi)+, for all x ∈ X. (1.59)

Proof (a) Define A ∈ L(X, R^p) by Ax := (⟨a1, x⟩, . . . , ⟨ap, x⟩). Let x1, . . . , xq be
elements of X such that (Ax1, . . . , Axq) is a basis of Im(A). Then q ≤ p, and the
family {x1, . . . , xq} is linearly independent. Denote by H the vector space with basis
x1, . . . , xq. The Euclidean norm over H is equivalent to the one induced by X, since
all norms are equivalent on finite-dimensional spaces.
(b) Let x ∈ X. We may express the coordinates of Ax in the basis {Axj} as functions
of x, i.e., write Ax = Σ_{j=1}^q αj(x) Axj for some linear function α : X → R^q that is
continuous. Indeed, by the equivalence of norms in a finite-dimensional space, for
some positive c not depending on x:

|α(x)| ≤ c |Ax| ≤ c ‖A‖ ‖x‖. (1.60)


Setting Cx := Σ_{j=1}^q αj(x) xj and Bx := x − Cx, we may write x = Bx + Cx, and

ABx = Ax − Σ_{j=1}^q αj(x) Axj = 0, (1.61)

i.e., AB = 0. So, Cb (assumed to be nonempty) is invariant under the addition of an
element of the image of B. In particular, it contains an element x̂ such that x̂ − x ∈ H,
that is,

x̂ = x + Σ_{j=1}^q βj xj. (1.62)

Specifically, let x̂ be such an element of Cb for which |β| is minimal, that is, β is a
solution of the problem

Min_{γ∈R^q} ½|γ|²; ⟨ai, x + Σ_{j=1}^q γj xj⟩ ≤ bi, i = 1, . . . , p. (1.63)

The following optimality conditions hold: there exists a λ ∈ R^p_+ such that

γj + Σ_{i=1}^p λi ⟨ai, xj⟩ = 0, j = 1, . . . , q, (1.64)

and

λi (⟨ai, x + Σ_{j=1}^q γj xj⟩ − bi) = 0, i = 1, . . . , p. (1.65)

Using first (1.64) and then (1.65), we obtain

|γ|² = Σ_{j=1}^q γj² = −Σ_{i,j} λi ⟨ai, γj xj⟩ = Σ_{i=1}^p λi (⟨ai, x⟩ − bi) ≤ ‖λ‖∞ Σ_{i=1}^p (⟨ai, x⟩ − bi)+. (1.66)

(c) Among all possible multipliers λ we may take one with minimal support. From
(1.64) we deduce the existence of M1 > 0, not depending on x and b, such that

‖λ‖∞ ≤ M1 |γ|. (1.67)

Combining with (1.66), we deduce that

|γ| ≤ M1 Σ_{i=1}^p (⟨ai, x⟩ − bi)+. (1.68)

The conclusion follows since, as noticed before, the Euclidean norm on H is equiv-
alent to the one induced by the norm of X . 
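A case where the Hoffman constant is explicit may help fix ideas. In the sketch below (our example, not the book's), Cb is the nonpositive orthant of R² (each ai is a coordinate form and b = 0); the Euclidean projection onto Cb is x ↦ min(x, 0), so dist(x, Cb) = ‖x₊‖₂ ≤ (x₁)₊ + (x₂)₊, i.e. (1.59) holds with M = 1.

```python
import numpy as np

# a_i = e_i and b = 0, so C_b is the nonpositive orthant of R^2
rng = np.random.default_rng(0)
for _ in range(100):
    x = 3.0 * rng.normal(size=2)
    dist = np.linalg.norm(np.maximum(x, 0.0))      # ||x - min(x, 0)||_2
    residual = float(np.sum(np.maximum(x, 0.0)))   # sum_i (<a_i, x> - b_i)_+
    assert dist <= residual + 1e-12                # (1.59) with M = 1
```

The inequality here is just ‖v‖₂ ≤ ‖v‖₁ applied to v = x₊; for a general system of inequalities, the constant M is not explicit and depends on the data a₁, . . . , a_p, as in the proof above.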

1.1.4.3 The Open Mapping Theorem

We will generalize the previous result, and for this we need the following fundamental
result in functional analysis.
Theorem 1.29 (Open mapping theorem) Let X and Y be Banach spaces, and let
A ∈ L(X, Y) be surjective. Then αB_Y ⊂ A(B_X), for some α > 0.
Proof See e.g. [28]. 

Corollary 1.30 Let A and α be as in Theorem 1.29. Then Im(A∗) is closed, and

‖A∗λ‖ ≥ α‖λ‖, for all λ ∈ Y∗. (1.69)

Proof By the open mapping theorem, we have that

‖A∗λ‖ = sup_{‖x‖≤1} ⟨λ, Ax⟩ ≥ α sup_{‖y‖≤1} ⟨λ, y⟩ = α‖λ‖, (1.70)

proving (1.69). Let us now check that Im(A∗) is closed. Let x∗_k in Im(A∗) converge
to x∗. There exists a sequence λk ∈ Y∗ such that x∗_k = A∗λk. In view of (1.69), λk
is a Cauchy sequence and hence has a limit λ̄ ∈ Y∗. Therefore x∗ = A∗λ̄ ∈ Im(A∗). The
conclusion follows. □
Proposition 1.31 Let X and Y be Banach spaces, and A ∈ L(X, Y). Then Im(A∗) ⊂
(Ker A)⊥, with equality if A has a closed range.

Proof (a) Let x∗ ∈ Im(A∗), i.e., x∗ = A∗y∗ for some y∗ ∈ Y∗, and let x ∈ Ker A. Then
⟨x∗, x⟩ = ⟨y∗, Ax⟩ = 0. Therefore, Im(A∗) ⊂ (Ker A)⊥.
(b) Assume now that A has closed range. Let x∗ ∈ (Ker A)⊥. For y ∈ Im(A), set

v(y) := ⟨x∗, x⟩, where x ∈ X satisfies Ax = y. (1.71)

Since x∗ ∈ (Ker A)⊥, any x such that Ax = y gives the same value of ⟨x∗, x⟩,
and therefore v(y) is well-defined. It is easily checked that it is a linear function.
By the open mapping theorem, applied to the restriction of A from X to its image
(the latter being a Banach space by hypothesis), there exists an x ∈ α⁻¹‖y‖ B_X
such that Ax = y, so that |v(y)| ≤ α⁻¹‖x∗‖ ‖y‖. So, v is a linear and continuous
form on Im(A), i.e., there exists a y∗ ∈ Y∗ such that v(y) = ⟨y∗, y⟩. For all x ∈ X, we
have therefore ⟨x∗, x⟩ = ⟨y∗, Ax⟩ = ⟨A∗y∗, x⟩, so that x∗ = A∗y∗, as was to
be proved. □
Remark 1.32 See another proof, based on duality theory, in Example 1.115.
Example 1.33 Let X := L²(0, 1), Y := L¹(0, 1), and let A ∈ L(X, Y) be the injection
of X into Y. Then Ker A is reduced to 0, and therefore its orthogonal is X∗. On the
other hand, we have that, for y∗ ∈ L∞(0, 1), A∗y∗ is the element of X∗ defined by
x ↦ ∫₀¹ y∗(t)x(t) dt. So the image of A∗ is a dense subspace of X∗, but it is not all
of X∗, and therefore it is not closed.



We next give a useful generalization of Hoffman's Lemma 1.28 in the homogeneous case.

Lemma 1.34 Given Banach spaces X and Y, A ∈ L(X, Y) surjective, and a1, . . . ,
ap in X∗, set C := {x ∈ X; Ax = 0; ⟨ai, x⟩ ≤ 0, i = 1, . . . , p}. Then there exists
a Hoffman constant M > 0 such that

dist(x, C) ≤ M (‖Ax‖ + Σ_{i=1}^p (⟨ai, x⟩)+), for all x ∈ X. (1.72)

Proof By the open mapping Theorem 1.29, there exists an x′ ∈ Ker A such that
‖x′ − x‖ ≤ α⁻¹‖Ax‖, where α is given by Theorem 1.29. Therefore, for some M > 0
not depending on x,

(⟨ai, x′⟩)+ ≤ (⟨ai, x⟩)+ + M‖Ax‖. (1.73)

Applying Lemma 1.28 to x′, with Ker A in place of X, we obtain the desired conclusion. □

1.1.5 Conjugacy

1.1.5.1 Basic Properties

Let X be a Banach space and f : X → R̄. Its (Legendre–Fenchel) conjugate is the
function f∗ : X∗ → R̄ defined by

f∗(x∗) := sup_{x∈X} {⟨x∗, x⟩ − f(x)}. (1.74)

This can be motivated as follows. Let us look for an affine minorant of f of the form
x ↦ ⟨x∗, x⟩ − β. For given x∗, the best (i.e., minimal) value of β is precisely f∗(x∗).
Being a supremum of affine functions, f∗ is l.s.c. convex. We obviously have the
Fenchel–Young inequality

f∗(x∗) ≥ ⟨x∗, x⟩ − f(x), for all x ∈ X and x∗ ∈ X∗. (1.75)

Remark 1.35 We have that

f∗(x∗) = −∞ for each x∗ if dom(f) = ∅, (1.76)
f∗(x∗) > −∞ for each x∗ if dom(f) ≠ ∅. (1.77)

Since the supremum over an empty set is −∞, we may always express f∗ by maximizing
over dom(f):

f∗(x∗) = sup_{x∈dom(f)} {⟨x∗, x⟩ − f(x)}. (1.78)

If f(x) is finite we can write the Fenchel–Young inequality in the more symmetric
form
⟨x∗, x⟩ ≤ f(x) + f∗(x∗). (1.79)

Lemma 1.36 Let f be proper. Then the symmetric form (1.79) of the Fenchel–Young
inequality is valid for any (x, x ∗ ) in X × X ∗ .

Proof By (1.77), f ∗ (x ∗ ) > −∞, and f (x) > −∞, so that (1.79) makes sense and
is equivalent to the Fenchel–Young inequality. 

Example 1.37 Let X be a Hilbert space, and define f : X → R by f(x) := ½‖x‖².
We identify X with its dual. Then it happens that f∗(x) = f(x) for all x in X, leading
to the well-known inequality (where the l.h.s. is the scalar product of x and y):

(x, y) ≤ ½‖x‖² + ½‖y‖², for all x, y in X. (1.80)
Example 1.38 Let p > 1. Define f : R → R by f(x) := |x|^p/p. For y ∈ R, the
maximum of x ↦ xy − f(x) is attained at 0 if y = 0, and otherwise at some
x ≠ 0 of the same sign as y such that |x|^{p−1} = |y|. Introducing the conjugate
exponent p∗ such that 1/p∗ + 1/p = 1, we get f∗(y) = |y|^{p∗}/p∗, so that xy ≤
|x|^p/p + |y|^{p∗}/p∗, for all x, y in R. Similarly, for some p > 1, let f : Rⁿ → R
be defined by f(x) := ‖x‖_p^p/p, where ‖x‖_p^p = Σ_{i=1}^n |xi|^p. We easily obtain that
f∗(y) = ‖y‖_{p∗}^{p∗}/p∗, leading to the Young inequality

Σ_{i=1}^n xi yi ≤ (1/p)‖x‖_p^p + (1/p∗)‖y‖_{p∗}^{p∗}, for all x, y in Rⁿ. (1.81)
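The conjugacy claim of Example 1.38 is easy to probe numerically by replacing the supremum in (1.74) with a maximum over a fine grid. In the following sketch (the grid bounds, tolerance, and the choice p = 3 are ours), the grid value is compared with |y|^{p∗}/p∗:

```python
import numpy as np

p = 3.0
p_star = p / (p - 1.0)                    # conjugate exponent: 1/p + 1/p* = 1
xs = np.linspace(-10.0, 10.0, 2_000_001)  # fine grid standing in for the sup over R

def conj_numeric(y):
    """Grid approximation of f*(y) = sup_x (x*y - |x|^p / p)."""
    return float(np.max(xs * y - np.abs(xs) ** p / p))

for y in (-2.0, -0.5, 0.0, 1.0, 1.7):
    exact = abs(y) ** p_star / p_star
    assert abs(conj_numeric(y) - exact) < 1e-4
```

For |y| ≤ 10^{p−1} the true maximizer |x| = |y|^{1/(p−1)} lies inside the grid, so the approximation error comes only from the grid spacing.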

Exercise 1.39 Let A be a symmetric, positive definite n × n matrix. (i) Check that
the conjugate of f(x) := ½ x⊤Ax is f∗(y) := ½ y⊤A⁻¹y. Taking x = y, deduce the
Young inequality

|x|² ≤ ½ x⊤Ax + ½ x⊤A⁻¹x, for all x ∈ Rⁿ. (1.82)

Conclude that

A + A⁻¹ − 2I is positive semidefinite, if A is symmetric and positive definite. (1.83)
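Both claims of Exercise 1.39 can be checked with a few lines of linear algebra. The sketch below (the particular matrix A is our choice) evaluates the candidate conjugate at the stationary point x = A⁻¹y of the concave map x ↦ y·x − f(x), and tests (1.83) through the eigenvalues of A + A⁻¹ − 2I:

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # symmetric positive definite (det = 1.75 > 0)
Ainv = np.linalg.inv(A)

y = np.array([1.0, -2.0])
x_opt = Ainv @ y             # where x -> y.x - x.Ax/2 is maximized
conj_val = y @ x_opt - 0.5 * x_opt @ A @ x_opt
assert abs(conj_val - 0.5 * y @ Ainv @ y) < 1e-12   # f*(y) = y'A^{-1}y / 2

# (1.83): each eigenvalue l of A contributes l + 1/l - 2 = (sqrt(l) - 1/sqrt(l))^2 >= 0
eigs = np.linalg.eigvalsh(A + Ainv - 2.0 * np.eye(2))
assert np.all(eigs >= -1e-12)
```

The eigenvalue identity in the comment is the reason (1.83) holds: A and A⁻¹ are simultaneously diagonalizable.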

Exercise 1.40 Check that the conjugate of the indicatrix of the (open or closed) unit
ball of X is the dual norm.
Exercise 1.41 Let f (x) := αg(x) with α > 0 and g : X → R̄. Show that

f ∗ (x ∗ ) = αg ∗ (x ∗ /α). (1.84)

Exercise 1.42 Show that the Fenchel conjugate of the exponential is the entropy
function H with value H (x) = x(log x − 1) if x > 0, H (0) = 0, and H (x) = ∞ if
x < 0. Deduce the inequality x y ≤ e x + y(log y − 1), for all x ∈ R and y > 0.
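As with Example 1.38, the conjugate of the exponential in Exercise 1.42 can be approximated by maximizing over a grid; the maximizer of xy − eˣ is x = log y, with value y(log y − 1) = H(y). The following sketch (grid and tolerances are our choices) checks this for a few y > 0:

```python
import numpy as np

xs = np.linspace(-20.0, 20.0, 400_001)   # grid standing in for the sup over R

def exp_conj(y):
    """Grid approximation of sup_x (x*y - e^x)."""
    return float(np.max(xs * y - np.exp(xs)))

for y in (0.5, 1.0, 3.0):
    H = y * (np.log(y) - 1.0)            # entropy value H(y) for y > 0
    assert abs(exp_conj(y) - H) < 1e-6
```

For y < 0 the map x ↦ xy − eˣ is unbounded above as x → −∞, consistent with H(y) = ∞ there.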
The biconjugate of f is the function f∗∗ : X → R̄ defined by

f∗∗(x) := sup_{x∗∈X∗} {⟨x∗, x⟩ − f∗(x∗)}. (1.85)

Proposition 1.43 The biconjugate f ∗∗ is the supremum of the affine minorants of f .


Proof Let x∗ ∈ X∗. If f∗(x∗) = +∞, then f has no affine minorant with slope x∗.
Otherwise, as we already observed, x ↦ ⟨x∗, x⟩ − f∗(x∗) is an affine minorant of f with
the best possible constant term. The conclusion follows. □
A hyperplane (−x∗, β) ∈ X∗ × R separating (x₀, α₀) from epi(f) is such that

−⟨x∗, x₀⟩ + βα₀ ≤ −⟨x∗, x⟩ + βα, for all (x, α) ∈ epi(f). (1.86)

If dom(f) ≠ ∅, then, taking some x ∈ dom(f) and letting α → +∞, it follows
that β ≥ 0. If β = 0, we say that the hyperplane is vertical, and the above relation
reduces to
⟨x∗, x₀⟩ ≥ ⟨x∗, x⟩ for all x ∈ dom(f), (1.87)

i.e., −x∗ separates x₀ from dom(f). Otherwise, we say that the separating hyperplane
is oblique. We may then assume that β = 1, and we obtain that

−⟨x∗, x₀⟩ + α₀ ≤ −⟨x∗, x⟩ + f(x), for all x ∈ dom(f). (1.88)

This is equivalent to saying that x ↦ ⟨x∗, x − x₀⟩ + α₀ is an affine minorant of f.


Theorem 1.44 Let f : X → R̄ be proper, l.s.c. convex. Then f = f ∗∗ .
Proof (a) It suffices to prove that any (x₀, α₀) ∉ epi(f) can be strictly separated
from epi(f) by an oblique hyperplane. Indeed, the corresponding affine minorant
then guarantees that (x₀, α₀) ∉ epi(f∗∗). Since f∗∗ ≤ f, it follows that epi(f) =
epi(f∗∗), as was to be proved.
(b) Since epi(f) is a closed convex subset of X × R, by Corollary 1.16, it can be
strictly separated from (x₀, α₀). Note that not all separating hyperplanes are vertical,
since otherwise f would have value −∞ over its (nonempty) domain. It follows that
f has an affine minorant, say x ↦ ⟨x∗, x⟩ + γ.
If the hyperplane strictly separating (x₀, α₀) and epi(f) is oblique, it provides an
affine minorant of f with value greater than α₀ at the point x₀, as required. If, on
the contrary, the hyperplane strictly separating (x₀, α₀) and epi(f) is vertical, say
⟨y∗, x − x₀⟩ + ε ≤ 0 for all x ∈ dom(f), with y∗ ≠ 0 and ε > 0, then we have that,
for any β > 0 and all x ∈ dom(f),

f(x) ≥ β(⟨y∗, x − x₀⟩ + ε) + ⟨x∗, x⟩ + γ, (1.89)

meaning that the above r.h.s. is an affine minorant of f. At the same time, its value at
x₀ is βε + ⟨x∗, x₀⟩ + γ, which for β > 0 large enough is larger than α₀. So this r.h.s.
provides an oblique hyperplane strictly separating (x₀, α₀) from epi(f). The conclusion follows. □

Definition 1.45 (i) Let E ⊂ X . The convex hull conv(E) is the smallest convex
set containing E, i.e., the set of finite convex combinations (linear combinations
with nonnegative weights whose sum is 1) of elements of E. The convex closure
of E, denoted by conv(E), is the smallest closed convex set containing E (i.e., the
intersection of the closed convex sets containing E).
(ii) Let f : X → R̄. The convex closure of f is the function conv( f ) : X → R̄
whose epigraph is conv(epi( f )) (note that conv( f ) is the supremum of l.s.c. convex
minorants of f ).
We obviously have that f = conv( f ) iff f is convex and l.s.c.
Theorem 1.46 (Fenchel–Moreau–Rockafellar) Let f : X → R̄. We have the fol-
lowing alternative: either
(i) f ∗∗ = −∞ identically, conv( f ) has no finite value, and has value −∞ at some
point, or
(ii) f ∗∗ = conv( f ) and conv( f )(x) > −∞, for all x ∈ X .
Proof If f is identically equal to +∞, the conclusion is obvious. So we may
assume that dom( f ) = ∅. Since f ∗∗ is an l.s.c. convex minorant of f , we have
that f ∗∗ ≤ conv( f ). So, if conv( f )(x1 ) = −∞ for some x1 ∈ X , then f has no
affine minorant and f ∗∗ = −∞. In addition, since conv( f ) is l.s.c. convex, for any
x ∈ dom(conv(f)), setting x^θ := θx + (1 − θ)x₁, we have that

conv(f)(x) ≤ lim_{θ↑1} conv(f)(x^θ) ≤ lim_{θ↑1} [θ conv(f)(x) + (1 − θ) conv(f)(x₁)] = −∞, (1.90)
so that (i) holds. On the contrary, if (i) does not hold, then f has a continuous affine
minorant, so that then conv( f )(x) > −∞, for all x ∈ X . Being proper, l.s.c. and
convex, conv( f ) is by Theorem 1.44 the supremum of its affine minorants, which
coincide with the affine minorants of f . The conclusion follows. 
Corollary 1.47 Let f : X → R̄ be convex. Then:
(i) conv(f)(x) = lim inf_{x′→x} f(x′), for all x ∈ X;
(ii) if f is finite-valued and l.s.c. at some x₀ ∈ X, then f(x₀) = f∗∗(x₀).

Proof (i) Set g(x) := lim inf_{x′→x} f(x′). It is easily checked that g is an l.s.c. convex
minorant of f, and therefore g ≤ conv(f). On the other hand, since conv(f) is an
l.s.c. minorant of f, we have that conv(f)(x) ≤ lim inf_{x′→x} f(x′) = g(x), proving
(i).
(ii) By point (i), since f is finite-valued and l.s.c. at x₀, we have that f(x₀) =
conv(f)(x₀) > −∞, and we conclude by Theorem 1.46. □

Example 1.48 Let K be a nonempty closed convex subset of X, and set f(x) = −∞
if x ∈ K, and f(x) = +∞ otherwise. Then f is l.s.c. convex, and f∗∗ has value −∞
everywhere, so that f ≠ f∗∗ (unless K = X).

1.1.5.2 Conjugacy in Dual Spaces

Let X be a Banach space, and g : X∗ → R̄. Its (Legendre–Fenchel) conjugate (in the
dual sense) is the function g∗ : X → R̄ defined by

g∗(x) := sup_{x∗∈X∗} {⟨x∗, x⟩ − g(x∗)}. (1.91)

So we have the dual Fenchel–Young inequality

g∗(x) ≥ ⟨x∗, x⟩ − g(x∗), for all x ∈ X and x∗ ∈ X∗. (1.92)

Being a supremum of affine functions, g∗ is l.s.c. convex. Its biconjugate g∗∗ is the
Legendre–Fenchel conjugate (in the sense of Sect. 1.1.5) of g∗, i.e.,

g∗∗(x∗) := sup_{x∈X} {⟨x∗, x⟩ − g∗(x)}. (1.93)

Let us call a function X∗ → R of the form x∗ ↦ ⟨x∗, x⟩ + α, with (x, α) ∈ X × R,
a ∗affine function; note that this excludes the affine functions of the form x∗ ↦
⟨x∗∗, x∗⟩ + α, where x∗∗ ∈ X∗∗ \ X. We call the ∗affine functions that minorize g
the ∗affine minorants of g. By the same arguments as in Sect. 1.1.5, we obtain that

g∗∗ is the supremum of the ∗affine minorants of g, (1.94)

and so we get the following result:

Lemma 1.49 Let g : X ∗ → R̄. We have that g = g ∗∗ iff g is a supremum of ∗affine


functions.

Remark 1.50 For any f : X → R̄, the Fenchel conjugate f ∗ is a supremum of


∗affine functions. It follows that

f ∗∗∗ = f ∗ . (1.95)

Example 1.51 Recall the definition (1.6) of the indicatrix function. The support
function σK : X∗ → R̄ (sometimes also denoted by σ(·, K)) is defined by

σK(x∗) := sup{⟨x∗, x⟩; x ∈ K}. (1.96)

Clearly σK is the conjugate of IK. If K is closed and convex, then IK is proper, l.s.c.
and convex, and hence is equal to its biconjugate, so that the conjugate of σK is IK.
Otherwise, letting conv(K) denote the smallest closed convex set containing K (the
convex closure of Definition 1.45), it is easily checked that the conjugate of I_{conv(K)}
is σK. Since I_{conv(K)} is l.s.c., convex and proper, it is equal to its biconjugate. We
have proved that

I_{conv(K)} and σK are conjugate to each other. (1.97)
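For a polytope K = conv{v₁, . . . , v_k}, the supremum in (1.96) is attained at a vertex, so σK can be computed by enumerating vertices. The sketch below (the triangle is our example) also checks that σK is subadditive and positively homogeneous, as any support function, being a supremum of linear maps, must be:

```python
import numpy as np

V = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])   # vertices of a triangle K in R^2

def sigma(xstar):
    """sigma_K(x*) = max over vertices of <x*, v>, for a polytope K."""
    return float(np.max(V @ xstar))

assert sigma(np.array([1.0, 0.0])) == 1.0    # attained at the vertex (1, 0)
assert sigma(np.array([-1.0, -1.0])) == 0.0  # attained at the origin
u, w = np.array([2.0, -1.0]), np.array([0.5, 3.0])
assert sigma(u + w) <= sigma(u) + sigma(w) + 1e-12   # subadditive
assert abs(sigma(3.0 * u) - 3.0 * sigma(u)) < 1e-12  # positively homogeneous
```

Since K here is already closed and convex, I_K and σ_K are exact conjugates of one another, as stated in (1.97).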

1.1.5.3 Continuity and Subdifferentiability

Let f : X → R̄ have a finite value at some x ∈ X. We define the subdifferential of
f at x as the set

∂f(x) := {x∗ ∈ X∗; f(x′) ≥ f(x) + ⟨x∗, x′ − x⟩, for all x′ ∈ X}. (1.98)

Equivalently, ∂f(x) is the set of slopes of the affine minorants of f that are exact (i.e.,
equal to f) at the point x. The inequality in (1.98) may be written as

⟨x∗, x⟩ − f(x) ≥ ⟨x∗, x′⟩ − f(x′), for all x′ ∈ X. (1.99)

Therefore x∗ ∈ ∂f(x) iff x attains the maximum in the definition of f∗(x∗), i.e., we
have that
{x∗ ∈ ∂f(x)} ⇔ {f∗(x∗) + f(x) = ⟨x∗, x⟩}. (1.100)

In other words, x∗ ∈ ∂f(x) iff it gives an equality in the Fenchel–Young inequality
(1.79).
By (1.85), f∗∗ is the supremum of its affine minorants, which are of the form
⟨x∗, x⟩ − β, with β ≥ f∗(x∗), for x∗ ∈ dom(f∗). It follows that the affine minorants that
are exact at x for f∗∗ are those which attain the supremum in (1.85), i.e.,

{x∗ ∈ ∂f∗∗(x)} ⇔ {f∗(x∗) + f∗∗(x) = ⟨x∗, x⟩}. (1.101)

Also, if ∂f(x) ≠ ∅, then the corresponding affine minorants, exact at x, are also
minorants of f∗∗ exact at x, and therefore

{∂f(x) ≠ ∅} ⇒ {f∗∗(x) = f(x)} ⇒ {∂f∗∗(x) = ∂f(x)}. (1.102)

We may also define the subdifferential of a function g : X∗ → R̄ with finite value at
x∗ as

∂g(x∗) := {x ∈ X; g(y∗) ≥ g(x∗) + ⟨y∗ − x∗, x⟩, for all y∗ ∈ X∗}. (1.103)

Similarly to what was done before, we can express the above inequality as

⟨x∗, x⟩ − g(x∗) ≥ ⟨y∗, x⟩ − g(y∗), for all y∗ ∈ X∗. (1.104)

This means that x ∈ ∂g(x∗) iff x∗ attains the maximum in the definition of g∗(x),
i.e., we have that

{x ∈ ∂g(x∗)} ⇔ {g(x∗) + g∗(x) = ⟨x∗, x⟩}. (1.105)

With similar arguments we obtain that

{x ∈ ∂g∗∗(x∗)} ⇔ {g∗∗(x∗) + g∗(x) = ⟨x∗, x⟩}. (1.106)

When g is itself a conjugate function we deduce the following.

Lemma 1.52 Let f : X → R̄ have a finite value at some x ∈ X . That equality holds
in the Fenchel–Young inequality (1.79) implies that x ∈ ∂ f ∗ (x ∗ ); the converse holds
if f is proper, l.s.c. convex.

Proof If (1.79) holds with equality, we know that f (x) = f ∗∗ (x) and so, by (1.105)
applied to g = f ∗ , x ∈ ∂ f ∗ (x ∗ ) holds. Conversely, if x ∈ ∂ f ∗ (x ∗ ), then by (1.105)
applied to g = f ∗ , we have that f ∗ (x ∗ ) + f ∗∗ (x) = x ∗ , x. When f is proper, l.s.c.
convex, f ∗∗ (x) = f (x), so that equality holds in the Fenchel–Young inequality, as
was to be proved. 

Remark 1.53 So, if f is proper, l.s.c. convex, we have that

x ∗ ∈ ∂ f (x) iff x ∈ ∂ f ∗ (x ∗ ). (1.107)

In this sense the Fenchel Legendre transform is an extension of the property of


the classical Legendre transform which, under certain conditions, associates with a
smooth function f over Rn another smooth function fˆ over Rn such that y = f  (x)
iff x = fˆ (y).

In the analysis of stochastic problems we will need the following sensitivity anal-
ysis results for linear programs.

Example 1.54 Given d ∈ Rm , b ∈ R p and matrices A and M of size p × m and


p × n resp., let f : Rn → R̄ be defined by

f (x) := inf_{y∈Rᵐ₊} {d · y; Ay = b + M x}. (1.108)

This is the value of a linear program whose dual is


24 1 A Convex Optimization Toolbox

Max_{λ∈Rᵖ} −λ · (b + M x); d + A⊤λ ≥ 0. (Dx )

The next lemma gives an expression of ∂ f (x).

Lemma 1.55 Let f have a finite value at x̄ ∈ X . Then ∂ f (x̄) is nonempty and
satisfies
∂ f (x̄) = {−M⊤λ; λ ∈ S(Dx̄ )}. (1.109)

Proof The conjugate of f is

f ∗ (x ∗ ) = sup_{x; y≥0} {x ∗ · x − d · y; Ay = b + M x}
= − inf_{x; y≥0} {d · y − x ∗ · x; Ay = b + M x}. (1.110)

Since f (x̄) is finite, the linear program involved in the above r.h.s. is feasible. By
linear programming duality (Lemma 1.26) it has the same value as its dual, and
hence,

− f ∗ (x ∗ ) = sup_{λ∈Rᵖ} {−λ · b; x ∗ = −M⊤λ; d + A⊤λ ≥ 0}. (1.111)

The Fenchel–Young inequality implies

0 ≤ f (x̄) + f ∗ (x ∗ ) − x ∗ · x̄
= f (x̄) − x ∗ · x̄ + inf_{λ∈Rᵖ} {λ · b; x ∗ = −M⊤λ; d + A⊤λ ≥ 0}. (1.112)

Since f (x̄) is the finite value of a feasible linear program, it is equal to val(Dx̄ ). So,
let λ̄ ∈ S(Dx̄ ). The Fenchel–Young inequality (1.112) is equivalent to

λ̄ · (b + M x̄) ≤ −x ∗ · x̄ + inf_{λ∈Rᵖ} {λ · b; x ∗ = −M⊤λ; d + A⊤λ ≥ 0}. (1.113)

When equality holds, the linear program on the r.h.s. has a solution, say λ, and
−x ∗ · x̄ = λ · M x̄, so that equality holds iff

λ̄ · (b + M x̄) = min_{λ∈Rᵖ} {λ · (b + M x̄); x ∗ = −M⊤λ; d + A⊤λ ≥ 0}. (1.114)

Recall that this is the case of equality in the Fenchel–Young inequality, and therefore
it holds iff x ∗ ∈ ∂ f (x̄). Since the cost function and last constraint correspond to those
of (Dx̄ ), it follows that any solution λ̂ of the linear program on the r.h.s. belongs to
S(Dx̄ ). We have proved that, if x ∗ ∈ ∂ f (x̄), then x ∗ = −M⊤λ for some λ ∈ S(Dx̄ ).
The converse obviously holds in view of (1.114). 
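Lemma 1.55 can be checked numerically on a small instance. The sketch below is an illustration, not part of the text: it solves the linear program (1.108) with SciPy's HiGHS backend, whose equality-constraint marginals are the sensitivities of the optimal value w.r.t. the right-hand side (equal to −λ for a dual solution λ of (Dx) in this sign convention), so that M⊤(marginals) = −M⊤λ should be a subgradient of f. The data d, A, b, M are arbitrary choices.

```python
# Check Lemma 1.55: a subgradient of f(x) = inf{d.y : Ay = b + Mx, y >= 0}
# is -M^T lam with lam a dual solution; HiGHS reports d(val)/d(b_eq) = -lam.
import numpy as np
from scipy.optimize import linprog

d = np.array([1.0, 3.0, 1.0])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 1.0])
M = np.array([[1.0],
              [0.5]])

def f_and_marginals(x):
    res = linprog(d, A_eq=A, b_eq=b + M @ x, bounds=[(0, None)] * 3, method="highs")
    assert res.success
    return res.fun, np.asarray(res.eqlin.marginals)

x0 = np.array([0.2])
f0, marg = f_and_marginals(x0)
g = M.T @ marg  # chain rule: df/dx = M^T d(val)/d(b_eq), i.e. -M^T lam
# Subgradient inequality f(x') >= f(x0) + g.(x' - x0) at a few nearby points:
ok = all(f_and_marginals(x0 + np.array([t]))[0] >= f0 + float(g @ np.array([t])) - 1e-8
         for t in (-0.1, -0.05, 0.05, 0.1))
```

On this instance the optimal basis keeps y₂ = 0 for all nearby x, so f is locally affine and the subgradient inequality holds with equality.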

Remark 1.56 Consider the particular case when b = 0 and M is the opposite of the
identity. Rewriting as b the variable x, we obtain that the function

f (b) := inf_{y∈Rᵐ₊} {d · y; Ay + b = 0} (1.115)

has, over its domain, a subdifferential equal to the set of solutions of the dual problem

Max_{λ∈Rᵖ} λ · b; d + A⊤λ ≥ 0. (1.116)

We now show that, for convex functions, a local uniform upper bound implies a
Lipschitz property as well as subdifferentiability.

Lemma 1.57 Let f : X → R̄ be convex, finitely-valued at x0 , and uniformly upper


bounded near x0 , i.e., for some a ∈ R and r > 0:

f (x) ≤ a whenever x − x0  ≤ r. (1.117)

Then f is Lipschitz with some constant L on B(x0 , r/2).

Proof Let ε ∈ ]0, 1[ and h ∈ X , ‖h‖ = εr . Since x0 ± ε⁻¹h ∈ B̄(x0 , r ), we have by


convexity of f that

f (x0 + h) ≤ (1 − ε) f (x0 ) + ε f (x0 + ε−1 h) ≤ (1 − ε) f (x0 ) + εa,


f (x0 + h) ≥ (1 + ε) f (x0 ) − ε f (x0 − ε−1 h) ≥ (1 + ε) f (x0 ) − εa.

It follows that

| f (x0 + h) − f (x0 )| ≤ ε(a − f (x0 )) = r ⁻¹(a − f (x0 ))‖h‖. (1.118)

Therefore, for all x ∈ B̄(x0 , r ), we have that f (x) ≥ b, with b := f (x0 ) −
(a − f (x0 )). Let x1 ∈ B̄(x0 , r1 ), where r1 := r/2. Then b ≤ f (x) ≤ a, for all x ∈
B̄(x1 , r1 ). Applying (1.118) at the point x1 , with r = r1 , we get

| f (x1 + h) − f (x1 )| ≤ r1⁻¹(a − b)‖h‖, for all ‖h‖ < r1 . (1.119)

Therefore f is Lipschitz with constant r1⁻¹(a − b) over B̄(x0 , r1 ), as was to be


proved. 

Corollary 1.58 Let f : Rn → R̄ be proper convex. Then it is locally Lipschitz over


the interior of its domain.

Proof Let x̄ ∈ int dom( f ). There exists x 0 , . . . , x n in dom( f ) such that x̄ ∈ int E,
where E := conv({x 0 , . . . , x n }). Then f (x) ≤ max{ f (x 0 ), . . . , f (x n )} over E. We
conclude by Lemma 1.57. 

Lemma 1.59 Let f : X → R̄ be convex, and Lipschitz with constant L near x0 ∈ X .


Then ∂ f (x0 ) is nonempty and included in B̄(0, L).

Proof Let r > 0 be such that f is Lipschitz with constant L on B(x0 , r ), and let
x̂ ∈ B(x0 , r/2). Set E = {(x, γ ) ∈ X × R; γ > f (x)}. Since f is continuous
at x0 , for ε > 0 small enough,

B(x0 , ε) × [ f (x0 ) + 1, ∞) ⊂ E, (1.120)

so that E has a nonempty interior. By Theorem 1.12, there exists (λ, α) ∈ X ∗ × R


separating (x̂, f (x̂)) and E, i.e., such that

λ, x̂ + α f (x̂) ≤ λ, x + αγ , for all x ∈ dom( f ), γ > f (x). (1.121)

Taking x = x0 and γ > f (x0 ), γ → +∞, we see that α ≥ 0. The separating hyper-
plane cannot be vertical, since x̂ ∈ int(dom( f )), so that we may take α = 1. Mini-
mizing w.r.t. γ we obtain that f (x) ≥ f (x̂) − λ, x − x̂, proving that −λ ∈ ∂ f (x̂).
We now check that ∂ f (x) ⊂ B̄(0, L). Assume that x ∗ ∈ ∂ f (x), with ‖x ∗ ‖∗ > L.
Then there exists a d ∈ X with ‖d‖ = 1 and ⟨x ∗ , d⟩ > L. Therefore by the definition
of a subdifferential,

f (x + σ d) − f (x)
lim ≥ x ∗ , d > L , (1.122)
σ ↓0 σ

in contradiction with the fact that L is a local Lipschitz constant. 

Example 1.60 Consider the entropy function f (x) = x log x if x ≥ 0 (with value 0
at zero), and f (x) = +∞ otherwise. Then f is l.s.c. convex, and the subdifferential
is empty at x = 0. So, in general, even in a Euclidean space, an l.s.c. convex function
may have an empty subdifferential at some points of its domain.
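A one-line numerical check of this example (illustrative only): a subgradient s of f at 0 would have to satisfy s ≤ (f(t) − f(0))/t = log t for every t > 0, and these difference quotients decrease to −∞.

```python
# Example 1.60: f(x) = x log x with f(0) = 0. The difference quotients at 0
# equal log t, which decrease without bound, so no affine minorant of f is
# exact at 0: the subdifferential there is empty.
import math

ts = (1e-2, 1e-5, 1e-10)
quotients = [(t * math.log(t) - 0.0) / t for t in ts]  # = log t
decreasing = all(a > b for a, b in zip(quotients, quotients[1:]))
unbounded_below = quotients[-1] < -20.0  # log(1e-10) is about -23.03
```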

We next introduce a concept that in some sense is an algebraic variant of the


interior.

Definition 1.61 Let S ⊂ X , where X is a vector space. Then we say that x ∈ S


belongs to the core of S and write x ∈ core(S) if, for each h ∈ X , there exists an
ε > 0 such that [x − εh, x + εh] ⊂ S.

Lemma 1.62 We have that int(S) ⊂ core(S), and the converse holds in the following
cases: (i) int(S) ≠ ∅, (ii) S is finite-dimensional, (iii) S is closed and convex.

Proof That int(S) ⊂ core(S) is an immediate consequence of the definition. That


the converse holds in cases (i) and (ii) is left as an exercise. Let us suppose now that
S is closed and convex. If core(S) = ∅, the conclusion trivially holds. Otherwise, we
may assume that 0 ∈ core(S). For k ∈ N, set Sk := k S; then X = ∪k Sk . By Baire's
lemma,² at least one element of the family has a nonempty interior. Since Sk = k S,
this means that there exists an x1 ∈ S such that B(x1 , ε) ⊂ S for some ε > 0. We
have that −x1 ∈ ℓS for some ℓ ∈ N, as well as B(x1 , ε) ⊂ ℓS, and so, since ℓS is
convex, B(0, ε/2) ⊂ ℓS, proving that B(0, ε ′ ) ⊂ S with ε ′ := ε/(2ℓ). The conclusion
follows. 

² Baire’s lemma tells us that any countable intersection of dense subsets in X is dense, or
equivalently, that any countable union of closed sets with empty interiors has an empty interior.

We next give an example3 of a set with an empty interior, and a nonempty core.
Example 1.63 Let X be an infinite-dimensional Banach space. It is known that there
exists a non-continuous linear form on X , that we denote by a(x). Set

A := {x ∈ X ; |a(x)| ≤ 1}. (1.123)

Clearly, 0 ∈ core(A). However, since a(x) is not continuous, and therefore not
bounded in a neighbourhood of 0, A has an empty interior.
Proposition 1.64 Let f be a convex function Rn → R̄. Then it is continuous over
the interior of its domain.
Proof Let x̄ belong to the interior of dom( f ). Then there exist x⁰ , . . . , xⁿ in dom( f )
whose convex hull E is such that B(x̄, ε) ⊂ int(E), for some ε > 0. Since f is convex
it follows that f (x) ≤ maxᵢ f (xⁱ ) for all x in B(x̄, ε). So, the conclusion follows
from Lemma 1.57. 
Proposition 1.65 Let f be a proper l.s.c. convex function X → R̄. Then it is con-
tinuous over the interior of its domain.
Proof Let x0 ∈ int(dom( f )), and set S := {x ∈ X ; f (x) ≤ f (x0 ) + 1}. Since f is
l.s.c., this is a closed set. Fix h ∈ X ; for t ∈ R, the function ϕ(t) := f (x0 + th) has a
finite value at 0, and its domain contains [−ε, ε] for some ε > 0. For t ∈ [−ε, ε], we
have that ϕ(t) ≤ max( f (x0 − εh), f (x0 + εh)). By Lemma 1.57, ϕ is continuous
at 0, proving that x0 ∈ core(S). By Lemma 1.62, x0 ∈ int(S), meaning that f is
bounded from above near x0 . We conclude with Lemma 1.57. 
If f is convex, then for all x and h in X , ( f (x + th) − f (x))/t is nondecreasing
w.r.t. t ∈ (0, ∞). Therefore the directional derivative

f (x + th) − f (x)
f  (x, h) := lim (1.124)
t↓0 t

always exists. Let us see how its value is related to the subdifferential. For this we
need a preliminary lemma on positively homogeneous functions.
Lemma 1.66 Let F : X → R̄ be positively homogeneous, i.e.

F(γ x) = γ F(x), for all γ > 0, (1.125)

with F(0) = 0. Then (i) F ∗ is the indicatrix of ∂ F(0), and

∂ F(0) := {x ∗ ∈ X ∗ ; x ∗ , h ≤ F(h), for all h ∈ X }, (1.126)


F ∗∗ (h) = sup{x ∗ , h; x ∗ ∈ ∂ F(0)}. (1.127)

3 Provided by Lionel Thibault, U. Montpellier II.



(ii) If ∂ F(0) = ∅, then

∂ F ∗∗ (h) = {x ∗ ∈ ∂ F(0); F ∗∗ (h) = x ∗ , h}. (1.128)

Proof (i) Relation (1.126) is the definition of ∂ F(0). By positive homogeneity,


F ∗ (x ∗ ) is equal to 0 if x ∗ ∈ ∂ F(0), and +∞ otherwise, proving that F ∗ is the
indicatrix of ∂ F(0). It follows that F ∗∗ satisfies (1.127), and so is positively homo-
geneous.
(ii) if ∂ F(0) = ∅, then by (1.102), F ∗∗ (0) = 0 = F(0), and ∂ F(0) = ∂ F ∗∗ (0). Now
let x ∗ ∈ ∂ F ∗∗ (h). It is easily checked that for each γ > 0 and y ∈ X :

γ F ∗∗ (y) = F ∗∗ (γ y) ≥ F ∗∗ (h) + x ∗ , γ y − h. (1.129)

Dividing by γ ↑ ∞ we obtain that x ∗ , y ≤ F ∗∗ (y), for all y ∈ X , i.e., x ∗ ∈


∂ F ∗∗ (0) = ∂ F(0). Taking y = 0 in (1.129) we get the opposite inequality F ∗∗ (h) ≤
x ∗ , h; therefore x ∗ belongs to the r.h.s. of (1.128).
Conversely, let x ∗ ∈ ∂ F ∗∗ (0) = ∂ F(0) be such that x ∗ , h = F ∗∗ (h). Then for
any y ∈ X , we have that F ∗∗ (y) ≥ x ∗ , y = x ∗ , y − h + F ∗∗ (h), proving that
x ∗ ∈ ∂ F ∗∗ (h). The conclusion follows. 

Theorem 1.67 Let f : X → R̄ be convex, finitely-valued at x̄, and set F(·) :=


f  (x̄, ·). Then (i) ∂ f (x̄) = ∂ F(0), and (ii) if ∂ f (x̄) = ∅, then

f ′ (x̄, h) ≥ lim inf_{h ′ →h} f ′ (x̄, h ′ ) = sup{⟨x ∗ , h⟩; x ∗ ∈ ∂ f (x̄)}. (1.130)

Proof (i) The function F is positively homogeneous, with value 0 at 0, and is easily
proved to be convex. Let x ∗ ∈ ∂ F(0). Then

x ∗ , x − x̄ = F(0) + x ∗ , x − x̄ ≤ F(x − x̄) ≤ f (x) − f (x̄), for all x ∈ X,


(1.131)
implying that ∂ F(0) ⊂ ∂ f (x̄). Conversely, let x ∗ ∈ ∂ f (x̄). Then, for any h ∈ X :

f (x̄ + th) − f (x̄)


F(h) = lim ≥ x ∗ , h, (1.132)
t↓0 t

proving the converse inclusion; point (i) follows.


(ii) Since ∂ f (x̄) = ∅, we have that ∂ F(0) = ∅. By Theorem 1.46 and Corollary
1.47(i), we have that

F ∗∗ (h) = conv(F)(h) = lim inf_{h ′ →h} F(h ′ ). (1.133)

We then deduce the equality in (1.130) from Lemma 1.66(i). The first inequality
being trivial, the conclusion follows. 

Definition 1.68 Let F : X → Y , where X and Y are Banach spaces. We say that F is
Gâteaux differentiable (or G-differentiable) at x̄ ∈ X if, for any h ∈ X , the directional
derivative F  (x̄, h) exists and the mapping h → F  (x̄, h) is linear and continuous.
We denote by D F(x̄) ∈ L(X, Y ) the derivative of F defined by D F(x̄)h = F  (x̄, h),
for all h ∈ X .
Corollary 1.69 Let f : X → R̄ be convex, and continuous at x̄. Then

f  (x̄, h) = max{x ∗ , h; x ∗ ∈ ∂ f (x̄)}. (1.134)

If in addition ∂ f (x̄) is the singleton {x ∗ }, then f is G-differentiable at x̄ with G-


derivative x ∗ .
Proof By Lemmas 1.57 and 1.59, f is locally Lipschitz near x̄ and ∂ f (x̄) is
nonempty. Since F(·) := f  (x̄, ·) is Lipschitz (as is easily shown), we have equality
in (1.130), and F = F ∗∗ has a subderivative at h, characterized by (1.128), so that the
supremum in (1.130) is a maximum. If ∂ f (x̄) is a singleton, the G-differentiability
easily follows. 
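As an illustration of Corollary 1.69 (a numerical sketch, not from the text), take f = ‖·‖₁ on R³ at x̄ = 0: there ∂ f (0) is the cube [−1, 1]³, and f ′(0, h) = ‖h‖₁ equals the maximum of ⟨x∗, h⟩ over the cube, attained at the vertex x∗ = sign(h).

```python
# Corollary 1.69 for f(x) = ||x||_1 at xbar = 0: f'(0,h) = ||h||_1 and
# max{<x*,h> : x* in [-1,1]^3} is attained at a vertex of the cube, sign(h).
import itertools
import numpy as np

h = np.array([0.5, -2.0, 1.0])
t = 1e-8
directional = (np.sum(np.abs(t * h)) - 0.0) / t  # (f(0 + t h) - f(0))/t, exact by homogeneity
# brute-force the maximum of <x*, h> over the extreme points of [-1,1]^3:
best = max(np.dot(v, h) for v in itertools.product([-1.0, 1.0], repeat=3))
gap = abs(directional - best)
```

Only the 8 vertices need to be checked since a linear form attains its maximum over the cube at an extreme point.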
Exercise 1.70 Let X := ℓ² be the space of square summable sequences and f :
X → R ∪ {+∞} be defined by f (x) := Σk k xk². Let x̄ be the null sequence. Show
that ∂ f (x̄) is a singleton, although f is not G-differentiable at x̄.
Hint: show that ∂ f (x̄) = {0}, and that t → f (t x) is not continuous at 0 if x ∉
dom( f ).
Definition 1.71 The norm of X is said to be differentiable if it is Fréchet differ-
entiable at any nonzero point, and strictly subadditive if ‖x ′ + x ″ ‖ < ‖x ′ ‖ + ‖x ″ ‖
except if x ′ and x ″ are both nonnegative multiples of some x ∈ X . We adopt similar
definitions for X ∗ .
Exercise 1.72 Show that (i) the conjugate of the norm is the indicatrix of the closed
unit ball of X ∗ , (ii) the norm is never differentiable at 0 (except for null spaces!),
(iii) if x ∈ X is nonzero, then

∂‖x‖ = {x ∗ ∈ X ∗ ; ‖x ∗ ‖∗ = 1; ⟨x ∗ , x⟩ = ‖x‖}, (1.135)

(iv) the space X has a differentiable norm if the dual norm is strictly subadditive.

Example 1.73 (Example of strict inequality in (1.130)) Let f (x) := x1²/(2x2 ), with
domain the elements x ∈ R² such that x1 > 0 and x2 > 0. Since

∇ f (x) = ( x1 /x2 , −x1²/(2x2²) ); D² f (x) = [ 1/x2 , −x1 /x2² ; −x1 /x2² , x1²/x2³ ], (1.136)

we easily check that D 2 f (x) is positive semidefinite, and hence, f is convex over
its convex domain. We set f (x) = +∞ if x1 < 0 or x2 < 0, and examine how to
define f on R2+ when min(x1 , x2 ) = 0, in order to make f l.s.c., i.e. we compute

f (x) := lim inf{ f (x ′ ); min(x1′ , x2′ ) > 0, x ′ → x}. Clearly, when x2 > 0 (resp. x1 > 0) there
exists a limit of value 0 (resp. +∞), and so, since f is nonnegative, its value at 0
should be 0 (resp. +∞). So we finally set

f (x) := 0 if x = 0; x1²/(2x2 ) if x1 ≥ 0 and x2 > 0; +∞ otherwise. (1.137)

We easily check that f has the following strange property: while min( f ) = 0, there
exists a sequence x k ∈ dom( f ) such that D f (x k ) → 0 and f (x k ) → +∞.
The directional derivatives of f at x = 0 are, for h ≠ 0:

f ′ (0, h) := 0 if h1 ≥ 0 and h2 > 0; +∞ otherwise. (1.138)

For h̄ = (1, 0)⊤, we have that lim inf_{h ′ →h̄} f ′ (0, h ′ ) = 0 < +∞ = f ′ (0, h̄). In (1.130),
the inequality is strict, and the supremum is attained for x ∗ = 0.
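The "strange property" claimed for f in (1.137) can be seen numerically (an illustrative sketch; the path x_k = (k^{3/4}, k) is one convenient choice): along it f(x_k) = ½ k^{1/2} → +∞ while both components of ∇f from (1.136) tend to 0.

```python
# Example 1.73: f(x) = x1^2/(2 x2). Along x_k = (k^0.75, k), f blows up
# while the gradient components from (1.136) tend to 0.
def f(x1, x2):
    return 0.5 * x1 ** 2 / x2

def grad_norm(x1, x2):
    # max-norm of the gradient (x1/x2, -x1^2/(2 x2^2))
    return max(abs(x1 / x2), abs(0.5 * x1 ** 2 / x2 ** 2))

ks = (1e2, 1e4, 1e8)
vals = [f(k ** 0.75, k) for k in ks]           # = 0.5 * k**0.5, increasing
grads = [grad_norm(k ** 0.75, k) for k in ks]  # = k**-0.25, decreasing
```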

1.1.5.4 Polarity of Convex Sets

We have already discussed in Example 1.51 the link between the indicatrix and
support functions.

Definition 1.74 Let K be a subset of a Banach space X , and x0 ∈ X . The (negative)


polar set of K w.r.t. x0 is the set

K − (x0 ) := {x ∗ ∈ X ∗ ; ⟨x ∗ , x − x0 ⟩ ≤ 1, for all x ∈ K }. (1.139)

Let K ∗ be a subset of X ∗ , and x0∗ ∈ X ∗ . The (negative) polar set of K ∗ w.r.t. x0∗ is
the set

K ∗− (x0∗ ) := {x ∈ X ; x ∗ − x0∗ , x ≤ 1, for all x ∗ ∈ K ∗ }. (1.140)

Observe that we obtain the same polar sets if we replace K or K ∗ by their convex
closure. We also define the positive polar sets as

K + (x0 ) := −K − (x0 ) = {x ∗ ∈ X ∗ ; −⟨x ∗ , x − x0 ⟩ ≤ 1, for all x ∈ K }, (1.141)

and similarly K ∗+ (x0∗ ) := −K ∗− (x0∗ ). When x0 = 0 (resp. x0∗ = 0), we simply denote
the polar set by K − (resp. K ∗− ). The bipolar set is defined as e.g. K −− := (K − )− .

Exercise 1.75 Let C be the closed unit ball of the Banach space X . Check that C −
is the closed unit ball of X ∗ , and that C −− = C.
Hint: use Corollary 1.8.

Exercise 1.76 Let K be a convex subset of a Banach space X . Check that:


(i) If B(x0 , ε) ⊂ K , for some ε > 0, then K − (x0 ) ⊂ B(0, 1/ε).
(ii) If K is bounded, then 0 ∈ int(K − (x0 )).

Lemma 1.77 Let K be a subset of X . Then K −− = conv(K ∪ {0}). In particular, if


K is closed and convex, and contains 0, then K −− = K .

Proof It suffices to prove the first statement. It is easily seen that both K and 0 belong
to K −− . Since K −− is closed and convex, it contains K := conv(K ∪ {0}). Now let
/ K . We can strictly separate K from x̄, i.e., there exists an x ∗ ∈ X ∗ such that
x̄ ∈
supx∈K x ∗ , x < x ∗ , x̄. Since 0 ∈ K , x ∗ , x̄ > 0. For any positive α < x ∗ , x̄,
close enough to x ∗ , x̄, we have that y ∗ := α −1 x ∗ is such that y ∗ , x̄ > 1, and
y ∗ , x ≤ 1 for all x ∈ K , so that y ∗ ∈ K − and then x̄ cannot belong to K −− . The
conclusion follows. 

We will mostly use the notion of polarity for convex cones.

Exercise 1.78 Check that, when K (resp. K ∗ ) is a cone, then (i) K − (resp. K ∗− ) is
itself a cone, called the (negative) polar cone, such that

K − := {x ∗ ∈ X ∗ ; ⟨x ∗ , x⟩ ≤ 0, for all x ∈ K },
K ∗− := {x ∈ X ; ⟨x ∗ , x⟩ ≤ 0, for all x ∗ ∈ K ∗ }, (1.142)

and (ii) the Fenchel conjugates of the corresponding indicatrix functions satisfy

(I K )∗ = σ K = I K − ; (I K ∗ )∗ = I K ∗− . (1.143)

Exercise 1.79 Let X be a Banach space and C1 and C2 be two convex cones of the
same space Y , with either Y = X or Y = X ∗ . Check that

(C1 + C2 )− = C1− ∩ C2− . (1.144)

Definition 1.80 Let K be a convex subset of a Banach space X , and x̄ ∈ K . (i) We


call the closure of R+ (K − x̄) the tangent cone (in the sense of convex analysis) to
K at x̄, and denote it by TK (x̄). (ii) We call the set

N K (x̄) := {x ∗ ∈ X ∗ ; x ∗ , x − x̄ ≤ 0, for all x ∈ K } (1.145)

the normal cone to K at x̄.

We note that, in the setting of the previous definition, if h ∈ TK (x̄), then

dist(x̄ + σ h, K ) = o(σ ), for σ > 0. (1.146)

Exercise 1.81 Let K be a closed convex subset of a Banach space X .


1. Check that the tangent and normal cone (to a convex set) are polar to each other.

2. Let x̄ ∈ K . Check that ∂ I K (x̄) = N K (x̄).


3. Let x ∗ ∈ X ∗ be such that σ K (x ∗ ) is finite. Show that

∂σ K (x ∗ ) = {x ∈ K ; x ∗ , x ≥ x ∗ , x  , for all x  ∈ K },

or equivalently, setting N K−1 (x ∗ ) := {x ∈ K ; x ∗ ∈ N K (x)}:

∂σ K (x ∗ ) = N K−1 (x ∗ ). (1.147)

Exercise 1.82 Let C be a closed convex cone of a Banach space X , and let x̄ ∈ C.
Check that
NC (x̄) = C − ∩ (x̄)⊥ ; TC (x̄) = cl(C + Rx̄). (1.148)

Hint: for the second relation, apply (1.144) with C1 = C and C2 = Rx̄.

1.2 Duality Theory

1.2.1 Perturbation Duality

1.2.1.1 General Relations

Consider the family of “primal” problems

Min_{x∈X} ϕ(x, y) − ⟨x ∗ , x⟩, (Py )

where X and Y are Banach spaces, ϕ : X × Y → R̄, x ∗ ∈ X ∗ , and y ∈ Y . We denote


the associated value function by

v(y) := inf_x (ϕ(x, y) − ⟨x ∗ , x⟩). (1.149)

Observe that

v∗ (y ∗ ) = sup_y { ⟨y ∗ , y⟩ − inf_x (ϕ(x, y) − ⟨x ∗ , x⟩) }
= sup_{x,y} { ⟨y ∗ , y⟩ + ⟨x ∗ , x⟩ − ϕ(x, y) } = ϕ ∗ (x ∗ , y ∗ ). (1.150)

It follows that
v∗∗ (y) = sup_{y ∗ ∈Y ∗} { ⟨y ∗ , y⟩ − ϕ ∗ (x ∗ , y ∗ ) }. (1.151)

Define the dual problem as

Max_{y ∗ ∈Y ∗} ⟨y ∗ , y⟩ − ϕ ∗ (x ∗ , y ∗ ). (D y )

Then by the definition of (D y ) we obtain, without any hypothesis, using (1.101)–


(1.102):

Proposition 1.83 The following weak duality relation holds:

val(D y ) = v∗∗ (y) ≤ v(y) = val(Py ), (1.152)

and we have that, if val(D y ) is finite:

S(D y ) = ∂v∗∗ (y). (1.153)

Additionally:
If ∂v(y) = ∅, then ∂v(y) = S(D y ). (1.154)

In the sequel we will analyze the case of strong duality, i.e. when v(y) = v∗∗ (y), in
order to get some information of ∂v(y).

Remark 1.84 The dual problem can also be obtained by dualizing in the usual way
an equality constraint. Indeed, write the primal problem in the form below, with
z ∈ Y:
Min ϕ(x, z) − x ∗ , x; y − z = 0, (1.155)
x,z

with associated duality Lagrangian function, where y ∗ ∈ Y ∗ :

L (x, z, y, y ∗ ) := ϕ(x, z) − x ∗ , x + y ∗ , y − z. (1.156)

We have that

sup y ∗ L (x, z, y, y ∗ ) = ϕ(x, y) − x ∗ , x if y = z, +∞ otherwise,
(1.157)
inf x,z L (x, z, y, y ∗ ) = y ∗ , y − ϕ ∗ (x ∗ , y ∗ ).

The dual problem obtained in the present perturbation duality framework may there-
fore be viewed as a particular case of the minimax duality discussed in Sect. 1.1.3.

1.2.1.2 Problems in Dual Spaces

Let ψ : X ∗ × Y ∗ → R̄. Consider a family of problems in the dual space:

Min_{y ∗ ∈Y ∗} ψ(x ∗ , y ∗ ) − ⟨y ∗ , y⟩, (PxD∗ )

with value denoted by v D (x ∗ ). Then v∗D : X → R̄ satisfies

v∗D (x) = sup_{x ∗ ,y ∗} { ⟨x ∗ , x⟩ + ⟨y ∗ , y⟩ − ψ(x ∗ , y ∗ ) } = ψ ∗ (x, y), (1.158)

so that
v∗∗D (x ∗ ) = sup_x { ⟨x ∗ , x⟩ − ψ ∗ (x, y) }. (1.159)

Therefore, we may define a problem dual to (PxD∗ ) as

Max_{x∈X} ⟨x ∗ , x⟩ − ψ ∗ (x, y). (DxD∗ )

As in Proposition 1.83, we have the weak duality relation

v∗∗D (x ∗ ) = val(DxD∗ ) ≤ val(PxD∗ ) = v D (x ∗ ), (1.160)
and also, in view of (1.106):

S(DxD∗ ) = ∂v∗∗D (x ∗ ). (1.161)

Additionally,
if ∂v D (x ∗ ) = ∅, then ∂v D (x ∗ ) = S(DxD∗ ). (1.162)

Now starting from a problem of type (Py ), and rewriting its dual (D y ) as a mini-
mization problem, we can dualize it. Writing the obtained bidual as a minimization
problem, we see that its expression is nothing but

Min_{x∈X} ϕ ∗∗ (x, y) − ⟨x ∗ , x⟩. (Py∗∗ )

By Theorem 1.44, the duality mapping is involutive in the class of proper, l.s.c.
convex functions, in the following sense:

Lemma 1.85 Let ϕ be proper, l.s.c. and convex. Then (Py ) and its bidual problem
coincide.

Remark 1.86 If X and Y are reflexive, then the bidual problem is the classical dual
of the dual one, so that we will be able to apply the duality theory that follows to the
dual problem.

1.2.1.3 Strong Duality

We call the relation of equality between a primal and a dual cost, that is, for
(x, y, y ∗ ) ∈ X × Y × Y ∗ :

ϕ(x, y) − x ∗ , x = y ∗ , y − ϕ ∗ (x ∗ , y ∗ ) (1.163)

an optimality condition (in the context of duality theory). By weak duality, this
implies that the primal and dual problem have the same value. If the latter is finite,
then x ∈ S(Py ) and y ∗ ∈ S(D y ), and (1.163) is equivalent to

ϕ(x, y) + ϕ ∗ (x ∗ , y ∗ ) = x ∗ , x + y ∗ , y. (1.164)

We recognize the case of equality in the Fenchel–Young inequality. By (1.100), this


is equivalent to
(x ∗ , y ∗ ) ∈ ∂ϕ(x, y). (1.165)

Theorem 1.87 The following relations hold:

∂v(y) = ∅ ⇒ val(D y ) = val(Py ) ⇒ ∂v(y) = S(D y ) (1.166)

and

If (1.163) holds with finite value, then
(1.167)
x ∈ S(Py ), y ∗ ∈ S(D y ), val(Py ) = val(D y ), and ∂v(y) = S(D y ).

Proof Relation (1.166) follows from Proposition 1.83, and is easily seen to imply
(1.167). 

We next need stronger assumptions that guarantee the equality of the primal and
dual cost.

Theorem 1.88 Assume that v is convex, uniformly upper bounded near y, and
finitely-valued at y. Then (i) val(D y ) = val(Py ), (ii) x ∈ S(Py ) iff there exists a
y ∗ ∈ Y ∗ such that the optimality condition (1.163) holds, (iii) ∂v(y) = S(D y ), the
latter being nonempty and bounded, and (iv) the directional derivatives of v satisfy,
for all z ∈ Y :
v ′ (y, z) = sup{⟨y ∗ , z⟩; y ∗ ∈ S(D y )}. (1.168)

Proof By Lemma 1.57, v is continuous at y. By Corollary 1.47, v(y) = v∗∗ (y),


meaning that val(D y ) = val(Py ), and by Lemma 1.59, ∂v(y) is nonempty and
bounded. The conclusion follows from the second implication in (1.166) and Corol-
lary 1.69. 

Remark 1.89 (i) A sufficient condition for v to be convex is that ϕ is convex. (ii)
A sufficient condition for having a uniform upper bound near y is that ϕ(x0 , ·) is
continuous at y, for some x0 ∈ X .

It may happen, however, that while ϕ is l.s.c. convex, v is not l.s.c., and this
prevents us from deducing its continuity from Proposition 1.65.

Exercise 1.90 Let X = L ∞ (0, 1), Y = L 2 (0, 1), and denote by A the injection from
X into Y . Take x ∗ = 0 and

ϕ(x, y) := 0 if Ax = y, +∞ otherwise. (1.169)

Check that ϕ is l.s.c. convex, but v(y), equal to the indicatrix of L ∞ (0, 1), is not l.s.c.
(see the related analysis in Example 1.136).

We next state a stability condition, also called a qualification condition, that pro-
vides a sufficient condition for the continuity of the value function. The condition is
that y ∈ int(dom(v)), or equivalently:

For all y  ∈ Y close enough to y,
(1.170)
there exists an x  ∈ X such that ϕ(x  , y  ) < ∞.

Lemma 1.91 Assume that ϕ is l.s.c. convex, the stability condition (1.170) holds,
and v( ȳ) is finite. Then v is continuous at ȳ.

Proof See e.g. [26, Prop. 2.152]; the proof is too technical to be reproduced here. 

Corollary 1.92 Under the assumptions of Lemma 1.91, the conclusion of Theorem
1.88 holds.

Proof Combine the previous lemma with Theorem 1.88. 

Example 1.93 (A strange example) Consider the reverse entropy function, where
x ∈ R:

Ĥ (x) = x log x if x > 0, Ĥ (0) = 0, and Ĥ (x) = +∞ if x < 0. (1.171)

This is an l.s.c. convex function, with domain R+ . Consider the problem

Min x; s.t. Ĥ (x) ≤ 0, (1.172)


x∈R

corresponding to ϕ(x, y) = x + I{ Ĥ (x)+y≤0} . It obviously has the unique solution


x̄ = 0, and the stability condition holds (with here y = 0). The Lagrangian of the
problem is L(x, λ) := x + λ Ĥ (x), and we can check that the dual problem is

Max δ(λ), (1.173)


λ≥0

where δ(λ) := inf x L(x, λ). By the duality theory, the dual problem has a bounded
and nonempty set of solutions and the primal and dual value are equal, i.e., λ is a dual
solution iff δ(λ) = 0, with infimum in the Lagrangian attained at 0. Now if λ > 0,
the infimum is attained at a positive point. So, the unique dual solution is λ̄ = 0 and
the optimality condition reads
 
0 ∈ argmin_{x∈R} ( x + 0 × Ĥ (x) ). (1.174)

This indeed holds if we correctly interpret the product 0 × Ĥ (x) as being equal to
+∞ whenever Ĥ (x) = +∞, see Sect. 1.1.1.2.

1.2.1.4 Projections and Moreau–Yosida Approximations

In many applications, we can check in a direct way the continuity of the value
function. Here is a specific example.

Proposition 1.94 Let K be a closed convex subset of the Hilbert space X . Then the
function v(y) := ½ dist(y, K )² is convex and of class C 1 , with derivative Dv(y) =
y − PK (y).

Proof Consider the function X × X → R, ϕ(x, y) := ½‖x − y‖² + I K (x), and take


x ∗ = 0. Obviously, ϕ is l.s.c. and convex, and the unique solution of the primal
problem (Py ) is x(y) := PK (y), the projection of y onto K . The (convex) primal
value is v(y).
We next compute the dual cost, identifying X and its dual. We have that

ϕ ∗ (0, y ∗ ) = sup_{x,y} (y ∗ , y) − ϕ(x, y)
= sup_{x∈K ,y} (y ∗ , y − x) − ½‖x − y‖² + (y ∗ , x)
= sup_{x∈K ,y ′} (y ∗ , y ′ ) − ½‖y ′ ‖² + (y ∗ , x)
= ½‖y ∗ ‖² + σ K (y ∗ ). (1.175)

Since v is locally upper bounded, it is locally Lipschitz, so that by Lemma 1.59,


its subdifferential is nonempty and bounded. It is equal to the solution of the dual
problem
Max_{y ∗ ∈X} (y ∗ , y) − ½‖y ∗ ‖² − σ K (y ∗ ), (1.176)

and the optimality condition can be arranged in the following way:

½‖x − y‖² + ½‖y ∗ ‖² − (y ∗ , y − x) + I K (x) + σ K (y ∗ ) − (y ∗ , x) = 0. (1.177)

The sum of the three first terms is ½‖y − x − y ∗ ‖², and the sum of the three last is,
by the Fenchel–Young inequality, nonnegative. Therefore (1.177) is equivalent to

(i) y ∗ = y − x; (ii) (y ∗ , x  − x) ≤ 0, for all x  ∈ K . (1.178)

This is easily seen to be equivalent to x = PK (y), so that ∂v(y) = {y − PK (y)}. Since


v is a continuous function, we deduce from Corollary 1.69 that it is G-differentiable

with G-derivative v (y) = y − PK (y). The derivative being a continuous function of


y, it follows that v is of class C 1 . 
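Proposition 1.94 is easy to test numerically when the projection is explicit. The sketch below is an illustration (not from the text) with K = the box [0, 1]³, where P_K is coordinatewise clipping; the formula Dv(y) = y − P_K(y) is compared with central finite differences.

```python
# v(y) = (1/2) dist(y, K)^2 with K = [0,1]^3; check Dv(y) = y - P_K(y).
import numpy as np

def proj(y):
    # projection onto the box [0,1]^3 is coordinatewise clipping
    return np.clip(y, 0.0, 1.0)

def v(y):
    return 0.5 * np.sum((y - proj(y)) ** 2)

y = np.array([1.7, -0.3, 0.4])
analytic = y - proj(y)          # the claimed derivative Dv(y)
eps = 1e-6
numeric = np.array([(v(y + eps * e) - v(y - eps * e)) / (2 * eps) for e in np.eye(3)])
err = float(np.max(np.abs(numeric - analytic)))
```

Note the third coordinate of y lies in K, so the corresponding component of the derivative vanishes, as the formula predicts.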

A generalization of the above result is provided by the Moreau–Yosida approxi-


mation, which is the subject of the exercise below.

Exercise 1.95 Given a Hilbert space X (identified with its dual), f l.s.c. proper
convex X → R̄, y ∈ X , and ε > 0, consider the problem
Min_{x∈X} f (x) + (ε/2)‖x − y‖². (1.179)

(i) Show that this problem has a unique solution xε (y) (hint: the cost is strongly
convex), called the proximal point to y.
(ii) Check that the dual problem is

Max_{y ∗ ∈X} (y ∗ , y) − (1/(2ε))‖y ∗ ‖² − f ∗ (y ∗ ). (1.180)

(iii) Show that the primal and dual values are equal, and that the dual problem has a
unique solution yε∗ (y) = ε(y − xε (y)).
(iv) Show that f ε (y) := inf_x ( f (x) + (ε/2)‖x − y‖² ) (the Moreau–Yosida approxima-
tion) has a continuous derivative D f ε (y) = ε(y − xε (y)).
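For f = |·| on R the proximal point of Exercise 1.95 is available in closed form (soft-thresholding at 1/ε), which makes the identity D f_ε(y) = ε(y − x_ε(y)) easy to verify numerically; the sketch below is an illustration with the arbitrary choice ε = 2.

```python
# Moreau-Yosida regularization of f = |.|: the prox of (1.179) is
# soft-thresholding, x_eps(y) = sign(y) * max(|y| - 1/eps, 0), and the
# derivative of f_eps should be eps * (y - x_eps(y)).
import math

eps = 2.0

def prox(y):
    return math.copysign(max(abs(y) - 1.0 / eps, 0.0), y)

def f_eps(y):
    # value of the minimization (1.179) at its unique solution
    x = prox(y)
    return abs(x) + 0.5 * eps * (x - y) ** 2

h = 1e-6
max_err = max(abs((f_eps(y + h) - f_eps(y - h)) / (2 * h) - eps * (y - prox(y)))
              for y in (-1.5, -0.2, 0.1, 0.9, 3.0))
thresholded = prox(3.0)  # |y| > 1/eps = 0.5, so x = 3.0 - 0.5 = 2.5
```

For |y| ≤ 1/ε the prox is 0 and f_ε(y) = (ε/2)y², while for larger |y| the value grows linearly: f_ε is the Huber-type C¹ smoothing of the absolute value.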

1.2.1.5 Composite Functions

In most applications we have to solve optimization problems with the following


structure:
Min_{x∈X} f (x) + F(G(x) + y) − ⟨x ∗ , x⟩. (Py )

Here G : X → Y and F : Y → R̄. This enters into our general framework, with here

ϕ(x, y) := f (x) + F(G(x) + y). (1.181)

Defining the standard Lagrangian4

L(x, y ∗ ) := f (x) + y ∗ , G(x), (1.182)

we have that

4 Not to be confused with the duality Lagrangian defined in (1.156).



ϕ ∗ (x ∗ , y ∗ ) = sup_{x,y} (⟨y ∗ , y⟩ − f (x) − F(G(x) + y) + ⟨x ∗ , x⟩)
= sup_{x,y} (⟨y ∗ , G(x) + y⟩ − F(G(x) + y) − L(x, y ∗ ) + ⟨x ∗ , x⟩)
= sup_{x,y ′} (⟨y ∗ , y ′ ⟩ − F(y ′ ) − L(x, y ∗ ) + ⟨x ∗ , x⟩)
= F ∗ (y ∗ ) − inf_x (L(x, y ∗ ) − ⟨x ∗ , x⟩), (1.183)
so that the dual problem is

Max_{y ∗ ∈Y ∗} ⟨y ∗ , y⟩ − F ∗ (y ∗ ) + inf_x (L(x, y ∗ ) − ⟨x ∗ , x⟩). (D y )

We can express the optimality condition (1.164) in the form

( F(G(x) + y) + F ∗ (y ∗ ) − ⟨y ∗ , G(x) + y⟩ )
+ ( L(x, y ∗ ) − ⟨x ∗ , x⟩ − inf_{x ′} (L(x ′ , y ∗ ) − ⟨x ∗ , x ′ ⟩) ) = 0. (1.184)

Each row above being nonnegative by the Fenchel–Young inequality, this is equiva-
lent to the relations

(i) y ∗ ∈ ∂ F(G(x) + y);
(1.185)
(ii) x ∈ argmin(L(·, y ∗ ) − x ∗ , ·).

Remark 1.96 Since, as we have seen, these relations express nothing but the Fenchel–
Young equality for ϕ, we deduce that if ϕ(x, y) is finite, then

∂ϕ(x, y) = {(x ∗ , y ∗ ) ∈ X ∗ × Y ∗ ; (1.185) holds}. (1.186)

Remark 1.97 Since (Py ) is feasible iff y ∈ dom(F) − G(x) for some x ∈ dom( f ),
we have that dom(v) = dom(F) − G(dom( f )), and the stability condition (1.170)
reads:
y ∈ int (dom(F) − G(dom( f ))) . (1.187)

Proposition 1.98 Let ϕ be l.s.c. convex, and y ∈ Y be such that v(y) is finite, and
(1.187) holds. Then v is continuous at y, and the conclusion of Theorem 1.88 holds.

Proof Immediate consequence of Corollary 1.92. 

Example 1.99 Given K ⊂ Y nonempty, closed and convex, the problem

min f (x) − x ∗ , x; G(x) + y ∈ K (1.188)


x

enters into the previous framework with F = I K , the indicatrix of K . In that case
the optimality conditions (1.185) reduce to

(i) y ∗ ∈ N K (G(x) + y);
(1.189)
(ii) x ∈ argmin(L(·, y ∗ ) − x ∗ , ·).

1.2.1.6 Convexity of Composite Functions

Under which conditions is the function ϕ defined in (1.181) jointly convex and l.s.c.?
Varying only y, we see that F must be l.s.c. convex. An obvious case is when f and
F are l.s.c. convex, and G is affine and continuous. But there are some other cases
when this property holds, although G is nonlinear.

Example 1.100 Let F be a nondecreasing, l.s.c. proper convex function over R, and
G be an l.s.c. proper convex function over X . We claim that ψ(x, y) := F(G(x) +
y) is l.s.c. convex. Setting X ′ := X × Y and G ′ (x, y) := G(x) + y, we reduce the
discussion to the l.s.c. and convexity of F(G(x)). Let xk → x̄ in X . Then

F(G(x̄)) ≤ F(lim inf_k G(xk )) ≤ lim inf_k F(G(xk )). (1.190)

The first inequality uses the fact that F is nondecreasing and G is l.s.c.; the second
inequality uses the l.s.c. of F. So, F ◦ G is l.s.c. Now, for α ∈ (0, 1) and x ′ , x ″ in X ,
setting x := αx ′ + (1 − α)x ″ :

F(G(x)) ≤ F(αG(x ′ ) + (1 − α)G(x ″ )) ≤ α F(G(x ′ )) + (1 − α)F(G(x ″ )). (1.191)
We have used the convexity of G and the fact that F is nondecreasing in the first
inequality, and the convexity of F in the second one. So, F ◦ G is convex; the claim
follows.

Example 1.101 More generally, consider the case when F is an l.s.c. proper convex function over R^p that is nondecreasing (for the usual order relation: y ≤ z if y_i ≤ z_i for i = 1 to p), and G(x) = (G_1(x), . . . , G_p(x)), with G_i(x) an l.s.c. proper convex function over X, for i = 1 to p. By similar arguments we get that ψ(x, y) := F(G(x) + y) is l.s.c. convex. A particular case is that of the supremum of convex functions, see Sect. 1.2.3.

A more general analysis of the case of composite functions in the format (1.181) is as follows. Assume F to be l.s.c. proper convex. By Theorem 1.44, it is equal to its biconjugate, and hence,

F(y) = sup{⟨y∗, y⟩ − F∗(y∗); y∗ ∈ dom F∗}. (1.192)

Therefore,

ϕ(x, y) = f(x) + sup{⟨y∗, G(x) + y⟩ − F∗(y∗); y∗ ∈ dom F∗}. (1.193)

Since the supremum of l.s.c. convex functions is l.s.c. convex, we deduce that

Lemma 1.102 Let F be l.s.c. proper convex, and x ↦ ⟨y∗, G(x)⟩ be l.s.c. convex for any y∗ ∈ dom F∗. Then ϕ is l.s.c. convex.

1.2.1.7 Convex Mappings

Definition 1.103 The recession cone of the closed convex subset K of Y is the
closed convex cone defined by

K∞ := {y ∈ Y; K + y ⊂ K}. (1.194)

Remark 1.104 (i) If K is bounded, its recession cone reduces to {0}. The con-
verse holds if Y is finite-dimensional. In infinite-dimensional spaces, there may
exist unbounded convex sets with recession cone reducing to {0}: see [26, Example
2.43].
(ii) We have that K ∞ = K if K is a closed convex cone.

Definition 1.105 Let G : X → Y, and K be a closed convex subset of Y. We say that G is K-convex if, for all α ∈ (0, 1) and x′, x″ in X:

G(αx′ + (1 − α)x″) − αG(x′) − (1 − α)G(x″) ∈ K∞. (1.195)

Remark 1.106 We slightly changed the classical definition [26, Def. 2.103], but
the theory is essentially the same. Note that any affine mapping is K -convex. The
converse holds if K ∞ = {0}. On the other hand, if K = Y then any mapping is
K -convex.

Lemma 1.107 Let f : X → R̄ be l.s.c. convex, and G be continuous and K-convex, where K is a closed convex subset of Y. Then ϕ(x, y) := f(x) + I_K(G(x) + y) is l.s.c. convex.

Proof The l.s.c. being obvious, it suffices to check that I_K(G(x) + y) is convex. Let α ∈ (0, 1), x′, x″ in X, and y′, y″ in Y. Set (x, y) := α(x′, y′) + (1 − α)(x″, y″). Then

κ := G(x) − αG(x′) − (1 − α)G(x″) (1.196)

belongs to K∞, and therefore

G(x) + y = α(y′ + G(x′)) + (1 − α)(y″ + G(x″)) + κ (1.197)

belongs to K. The result follows. □

We next give a practical tool for recognizing K-convex mappings.

Lemma 1.108 We have that G : X → Y is K-convex iff, for any λ ∈ (K∞)⁻, the function G_λ(x) := ⟨λ, G(x)⟩ is convex.

Proof Since K∞ is closed and convex, by Lemma 1.77, it is the negative polar cone of (K∞)⁻, i.e., y₀ ∈ K∞ iff ⟨λ, y₀⟩ ≤ 0, for all λ ∈ (K∞)⁻. Therefore, G is K-convex iff

⟨λ, G(αx′ + (1 − α)x″) − αG(x′) − (1 − α)G(x″)⟩ ≤ 0, (1.198)

for all λ ∈ (K∞)⁻. The conclusion follows. □

Remark 1.109 Lemma 1.107 can be deduced from Lemma 1.102, where F = I_K and F∗ = σ_K, observing that dom(σ_K) ⊂ (K∞)⁻.

Example 1.110 Let Y := R^p and K := R^p₋ (the case of finitely many inequality constraints). Then (K∞)⁻ = K⁻ = R^p₊. As expected, we obtain that G is K-convex iff each of the p components of G is convex.

Example 1.111 Let Y := C(Ω), where Ω is a compact metric space, and K := Y₋ (the case of pointwise inequality constraints). Then (K∞)⁻ = K⁻ = Y∗₊ is the set of nonnegative Borel measures on Ω, and we obtain that G is K-convex iff G_ω(x) is convex, for each ω ∈ Ω.

Exercise 1.112 Let K = {x ∈ R²; x₂ ≥ x₁²}. (i) Show that K∞ = {0} × R₊, and that we have the strict inclusion

dom(σ_K) = {0} ∪ (R × (−∞, 0)) ⊂ R × R₋ = (K∞)⁻. (1.199)

(ii) Show that G is K-convex iff G₁ is affine and G₂ is concave.

1.2.1.8 Fenchel Duality

When G(x) = Ax, with A ∈ L(X, Y), the Lagrangian defined in (1.182) is such that

inf_x (L(x, y∗) − ⟨x∗, x⟩) = inf_x ( f(x) + ⟨A⊤y∗ − x∗, x⟩)
  = − sup_x ( ⟨x∗ − A⊤y∗, x⟩ − f(x) )
  = − f∗(x∗ − A⊤y∗). (1.200)

The expressions of the primal and dual problems are therefore

Min_{x∈X} f(x) + F(Ax + y) − ⟨x∗, x⟩, (P_y)

Max_{y∗} ⟨y∗, y⟩ − f∗(x∗ − A⊤y∗) − F∗(y∗). (D_y)

Finally, the optimality condition

( f(x) + f∗(x∗ − A⊤y∗) − ⟨x∗ − A⊤y∗, x⟩ ) + ( F(Ax + y) + F∗(y∗) − ⟨y∗, Ax + y⟩ ) = 0 (1.201)

is equivalent to the relations

y∗ ∈ ∂F(Ax + y); ∂f(x) + A⊤y∗ ∋ x∗. (1.202)

The function ϕ(x, y) = f (x) + F(Ax + y) is l.s.c. convex if f and F are l.s.c.
convex, and the stability condition (1.170) reads, since dom(v) = y + dom(F) −
A dom( f ):
y ∈ int (dom(F) − A dom( f )) . (1.203)

We have obtained the following:

Theorem 1.113 (Fenchel duality) Let f and F be l.s.c. convex, and (1.203) hold.
Then

inf { f (x) + F(Ax) − x ∗ , x} = max



{− f ∗ (x ∗ − A y ∗ ) − F ∗ (y ∗ )} < +∞,
x y
(1.204)
the maximum being attained on a nonempty and bounded set if the above value is
finite.
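Fenchel duality is easy to observe numerically. Here is a sketch on an assumed one-dimensional toy instance (not from the book): X = Y = R, A the identity, x∗ = 0, f(x) = ½x² and F(y) = ½(y − b)², whose conjugates are f∗(z) = ½z² and F∗(u) = ub + ½u².

```python
import numpy as np

b = 2.0
f  = lambda x: 0.5 * x**2
F  = lambda y: 0.5 * (y - b)**2
fs = lambda z: 0.5 * z**2              # f*, conjugate of ½x²
Fs = lambda u: u * b + 0.5 * u**2      # F*, conjugate of ½(y - b)²

grid = np.linspace(-10, 10, 200001)
primal = np.min(f(grid) + F(grid))       # inf_x f(x) + F(Ax), A = identity
dual   = np.max(-fs(-grid) - Fs(grid))   # max_{y*} -f*(x* - A^T y*) - F*(y*), x* = 0

# Both values equal b**2/4, attained at x = b/2 and y* = -b/2.
assert abs(primal - b**2 / 4) < 1e-6
assert abs(dual - b**2 / 4) < 1e-6
assert abs(primal - dual) < 1e-6
```

The stability condition (1.203) holds trivially here since dom(F) − A dom(f) = R.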

Example 1.114 Given a nonempty closed convex subset K of Y, the problem

Min_x f(x) − ⟨x∗, x⟩; Ax + y ∈ K (P_y)

is the particular case of the previous framework in which F(y) = I_K(y), and therefore the dual problem is

Max_{y∗} ⟨y∗, y⟩ − σ_K(y∗) − f∗(x∗ − A⊤y∗). (D_y)

The optimality condition is equivalent to

x∗ − A⊤y∗ ∈ ∂f(x); y∗ ∈ N_K(Ax + y). (1.205)

The function ϕ(x, y) = f(x) + I_K(Ax + y) is l.s.c. convex if f is l.s.c. convex and K is a closed convex set, and dom(v) = K − A dom(f). By Theorem 1.88 applied when y = 0, we have that:

if f is l.s.c. convex and 0 ∈ int (K − A dom(f)), then
inf_x { f(x) − ⟨x∗, x⟩; Ax ∈ K } = max_{y∗} {− f∗(x∗ − A⊤y∗) − σ_K(y∗)},
the maximum being attained on a bounded set if the value is finite. (1.206)

Example 1.115 Consider the particular case of the previous example in which A is surjective, K = {0}, x∗ = 0, and f(x) = ⟨c, x⟩, with c ∈ (Ker A)⊥. By the open mapping theorem, there exists a C > 0 such that, for each y, there is a feasible x(y) with ‖x(y)‖ ≤ C‖y‖. Since c ∈ (Ker A)⊥, x(y) is a primal solution. The value function v(·), being both locally upper bounded and finite, is locally Lipschitz. By the discussion in Example 1.114, we have that c = A⊤λ, for some λ ∈ Y∗. We have proved that (Ker A)⊥ ⊂ Im(A⊤). Since the converse inclusion is easily proved, we have obtained another proof of Proposition 1.31.

Exercise 1.116 (Tychonoff and Lasso [120] type regression) Assuming that Y is a Hilbert space identified with its topological dual, and given A ∈ L(X, Y), b ∈ Y, ε > 0 and a 'regularizing function' R : X → R̄, consider the regularized linear least-square problem

Min_{x∈X} ½‖Ax − b‖²_H + εR(x). (P_y)

Deduce from (1.84) that the dual problem is

Max_{λ∈Y} −(λ, b)_Y − ½‖λ‖²_Y − εR∗(−A⊤λ/ε). (1.207)

Show that the optimality conditions are

λ = Ax − b; −(1/ε)A⊤λ ∈ ∂R(x). (1.208)

In the case of the Tychonoff regularization R(x) = ½‖x‖²_X (assuming X to be a Hilbert space identified with its topological dual), show that: the primal and dual values are equal, both the primal and dual problems have a unique solution, and the second relation in (1.208) reduces to −A⊤λ = εx.
In the case when R is positively homogeneous, convex and continuous, with subdifferential at 0 denoted by K, show that: the primal and dual values are equal, and the dual problem is

Max_{λ∈Y} −(λ, b)_Y − ½‖λ‖²_Y; −A⊤λ ∈ εK, (1.209)

and the second relation in (1.208) is equivalent to

−A⊤λ ∈ εK and −(λ, Ax)_H = εR(x). (1.210)

Specialize this result to the Lasso type regularization, where X = Rⁿ, Y = Rᵖ and R(x) = ‖x‖₁ = Σ_{i=1}^n |xᵢ|.
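The Tychonoff case of this exercise is easy to check numerically in finite dimension. A sketch with assumed random data, using the closed-form primal solution x = (A⊤A + εI)⁻¹A⊤b of the normal equations:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, eps = 4, 6, 0.5
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)

# Primal solution of min ½‖Ax − b‖² + (ε/2)‖x‖²: (AᵀA + εI) x = Aᵀ b.
x = np.linalg.solve(A.T @ A + eps * np.eye(n), A.T @ b)
lam = A @ x - b                      # first optimality relation in (1.208)

# Second relation in (1.208) for R = ½‖·‖²: −Aᵀλ = εx.
assert np.allclose(-A.T @ lam, eps * x)

# Equality of the primal value and the dual value (1.207) with R*(z) = ½‖z‖²:
primal = 0.5 * np.linalg.norm(A @ x - b)**2 + 0.5 * eps * np.linalg.norm(x)**2
dual = -lam @ b - 0.5 * lam @ lam - (0.5 / eps) * np.linalg.norm(A.T @ lam)**2
assert np.isclose(primal, dual)
```

Here εR∗(−A⊤λ/ε) = ‖A⊤λ‖²/(2ε), which is the last term of the dual value in the code.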

We will next see how to compute the subdifferential of a composition of functions. This will be a consequence of the duality theory, based on a formula for partial subdifferentials.

1.2.2 Subdifferential Calculus

1.2.2.1 General Subdifferential Calculus Rules

We come back to the general format of Sect. 1.2.1.1. Given ϕ : X × Y → R̄, we denote the partial subdifferential w.r.t. x by

∂ₓϕ(x, y) := {x∗ ∈ X∗; ϕ(x′, y) ≥ ϕ(x, y) + ⟨x∗, x′ − x⟩, for all x′ ∈ X}. (1.211)

As in the case of differentiable functions, one may ask if the partial subdifferential is the restriction of the "full" subdifferential, i.e., if x∗ ∈ ∂ₓϕ(x, y), does there exist a y∗ ∈ Y∗ such that

(x∗, y∗) ∈ ∂ϕ(x, y)? (1.212)

Theorem 1.117 If (1.212) holds, then x ∗ ∈ ∂x ϕ(x, y). Conversely, if ϕ is l.s.c. con-
vex and the stability condition (1.170) holds, then x ∗ ∈ ∂x ϕ(x, y) iff the set of y ∗ ∈ Y ∗
satisfying (1.212) is nonempty and bounded.

Proof That x∗ ∈ ∂ₓϕ(x, y) when (1.212) holds follows from the definition of full and partial subdifferentials. Now let ϕ be as in the theorem. It suffices to prove that if x∗ ∈ ∂ₓϕ(x, y), then there exists a y∗ ∈ Y∗ such that (1.212) holds. Since x∗ ∈ ∂ₓϕ(x, y), the function x′ ↦ ϕ(x′, y) − ⟨x∗, x′⟩ attains its minimum at x. By the duality result in Corollary 1.92, the set of solutions y∗ of the dual problem, satisfying the optimality condition (1.163), which (by the discussion after (1.163)) is equivalent to (1.212), is nonempty and bounded. The conclusion follows. □

We now specialize the previous theorem to the case of the composite function

ϕ(x, y) = f (x) + F(G(x) + y), (1.213)

recalling that the (standard) Lagrangian was defined in (1.182). We give a direct
proof of the expression of the subdifferential of ϕ, already obtained in Remark 1.96:

Lemma 1.118 We have that (x∗, y∗) ∈ ∂ϕ(x, y) iff (1.185) holds.

Proof Let (x∗, y∗) ∈ ∂ϕ(x, y). Using ϕ(x, y′) ≥ ϕ(x, y) + ⟨y∗, y′ − y⟩ for all y′ ∈ Y, we obtain (1.185)(i). Taking x′ ∈ X and y′ := G(x) − G(x′) + y, we get that

f(x′) + F(G(x) + y) ≥ ϕ(x, y) + ⟨x∗, x′ − x⟩ + ⟨y∗, G(x) − G(x′)⟩, (1.214)

or equivalently

L(x′, y∗) ≥ L(x, y∗) + ⟨x∗, x′ − x⟩ for all x′ ∈ X, (1.215)

implying (1.185)(ii). Conversely, let (1.185) hold. Then

ϕ(x′, y′) = f(x′) + F(G(x′) + y′)
  ≥ f(x′) + F(G(x) + y) + ⟨y∗, G(x′) − G(x) + y′ − y⟩
  = ϕ(x, y) + L(x′, y∗) − L(x, y∗) + ⟨y∗, y′ − y⟩
  ≥ ϕ(x, y) + ⟨x∗, x′ − x⟩ + ⟨y∗, y′ − y⟩, (1.216)

proving that (x∗, y∗) ∈ ∂ϕ(x, y). □

Theorem 1.119 Assume that ϕ(x, y) = f (x) + F(G(x) + y) is l.s.c. convex, and
that the stability condition (1.187) holds. Then x ∗ ∈ ∂x ϕ(x, y) iff the set of y ∗ ∈ Y ∗
such that (1.185) holds is nonempty and bounded.

Proof Combine Theorem 1.117 and Lemma 1.118. 

1.2.2.2 Fenchel’s Duality

In the case of Fenchel’s duality, i.e., when G(x) = Ax with A ∈ L(X, Y ), we see
that (1.185)(ii) holds iff

f (x  ) ≥ f (x) + x ∗ − A y ∗ , x  − x, for all x  ∈ R, (1.217)

i.e. iff x ∗ − A y ∗ ∈ ∂ f (x). We obtain the following Fenchel subdifferential formula:

Lemma 1.120 Let X and Y be Banach spaces, A ∈ L(X, Y ), f : Y → R̄ and


F : Y → R̄ be l.s.c. convex, and set ϕ(x, y) = f (x) + F(Ax + y). Then (x ∗ , y ∗ ) ∈
∂ϕ(x, y) iff
(i) y ∗ ∈ ∂ F(Ax + y); (ii) x ∗ − A y ∗ ∈ ∂ f (x). (1.218)

We have that x ∗ ∈ ∂x ϕ(x, y) iff (1.218) holds for some y ∗ ∈ Y ∗ , whenever the
stability condition (1.203) is satisfied.

Proof Direct application of the previous statements. 

In the case when A is the identity operator, we obtain the

Corollary 1.121 Let f and g be l.s.c. convex functions X → R̄, with finite value at
x0 . If 0 ∈ int (dom( f ) − dom(g)) (which holds in particular if f or g is continuous
at x0 ), then ∂( f + g)(x0 ) = ∂ f (x0 ) + ∂g(x0 ).
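For instance (a hypothetical one-dimensional illustration), with f = |·| and g = |· − 1|, both continuous, Corollary 1.121 gives ∂(f + g)(0) = [−1, 1] + {−1} = [−2, 0]. This is easy to test numerically via the subgradient inequality:

```python
import numpy as np

f = lambda x: abs(x)
g = lambda x: abs(x - 1.0)
h = lambda x: f(x) + g(x)

# ∂f(0) = [-1, 1] and ∂g(0) = {-1}, so the sum rule gives ∂h(0) = [-2, 0].
xs = np.linspace(-3, 3, 601)

def is_subgradient(s, x0):
    # s ∈ ∂h(x0) iff h(x) >= h(x0) + s (x - x0) for all x (checked on a grid).
    return bool(np.all(h(xs) >= h(x0) + s * (xs - x0) - 1e-12))

assert all(is_subgradient(s, 0.0) for s in np.linspace(-2.0, 0.0, 21))
assert not is_subgradient(0.5, 0.0) and not is_subgradient(-2.5, 0.0)
```

The qualification 0 ∈ int(dom(f) − dom(g)) is automatic here since both functions are finite everywhere.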

We next discuss the case of the sum of a finite number of functions.

Example 1.122 Let gᵢ, i = 1 to n, be l.s.c. proper convex functions over the Banach space X. We set

G(x) := Σ_{i=1}^n gᵢ(x), with dom(G) = ∩_{i=1}^n dom(gᵢ). (1.219)

Then G is of the form F ◦ A, with Y := Xⁿ, Ax = (x, . . . , x) (n times), and

F(x₁, . . . , xₙ) := Σ_{i=1}^n gᵢ(xᵢ), with dom(F) = Π_{i=1}^n dom(gᵢ). (1.220)

For (x₁∗, . . . , xₙ∗) ∈ (X∗)ⁿ, we have that A⊤(x₁∗, . . . , xₙ∗) = Σ_{i=1}^n xᵢ∗ (the transpose of the copy operator is the sum). The qualification condition (1.203) can be written, since B_Y = (B_X)ⁿ, as

∀ (x₁, . . . , xₙ) ∈ ε(B_X)ⁿ, ∃ x ∈ X: xᵢ ∈ dom(gᵢ) − x, i = 1, . . . , n. (1.221)

It follows by Lemma 1.120, where here f = 0 and y = 0, that

∂G(x) = Σ_{i=1}^n ∂gᵢ(x), for all x ∈ X, if (1.221) holds. (1.222)

Remark 1.123 A sufficient condition for (1.221) is that (indeed, take x = x₀ − xₙ):

there exists an x₀ ∈ dom(gₙ) such that gᵢ is continuous at x₀, for i = 1 to n − 1. (1.223)

1.2.2.3 Geometric Calculus Rules

We show here how subdifferential calculus gives calculus rules for normal and tangent
cones, starting with the simple case of the intersection of two convex sets.

Lemma 1.124 Let K 1 and K 2 be two closed convex subsets of X , and let K :=
K 1 ∩ K 2 , and x̄ ∈ K . Then

TK (x̄) ⊂ TK 1 (x̄) ∩ TK 2 (x̄) and N K (x̄) ⊃ N K 1 (x̄) + N K 2 (x̄). (1.224)

If in addition 0 ∈ int(K 1 − K 2 ), equality holds in the above two inclusions.

Proof The relations in (1.224) are easy consequences of the definition of tangent
and normal cones. We next apply Corollary 1.121 with f := I K 1 and g := I K 2 , so
that f + g = I K . Since dom( f ) − dom(g) = K 1 − K 2 and ∂ I K (x) = N K (x), we
deduce that if 0 ∈ int(K 1 − K 2 ), then N K (x̄) = N K 1 (x̄) + N K 2 (x̄). Computing the
normal cones (we have seen in (1.144) that the polar of a sum of convex cones is
the intersection of their polar cones), it follows that TK (x̄) = TK 1 (x̄) ∩ TK 2 (x̄). The
conclusion follows. 

By similar techniques one can prove various extensions of this result, given as
exercises.

Exercise 1.125 Consider the subsets of R² defined by K₁ = {x; x₂ ≥ x₁²}, K₂ := −K₁, and K := K₁ ∩ K₂. Check that (1.224) holds with strict inclusion. Does 0 belong to int(K₁ − K₂)? Make the connection with Lemma 1.124.

Exercise 1.126 Let K₁, . . . , Kₙ be closed convex subsets of X. Set K := K₁ ∩ · · · ∩ Kₙ. Let x̄ ∈ K. Assume that

∀ (x₁, . . . , xₙ) ∈ ε(B_X)ⁿ, ∃ x ∈ X: xᵢ ∈ Kᵢ − x, i = 1, . . . , n. (1.225)

Show that, then:

N_K(x̄) = Σ_{i=1}^n N_{Kᵢ}(x̄); T_K(x̄) = ∩_{i=1}^n T_{Kᵢ}(x̄). (1.226)

Hint: apply Example 1.122 with gᵢ(x) = I_{Kᵢ}(x), and use ∂I_K(x̄) = N_K(x̄).

Exercise 1.127 Let K_X and K be closed convex subsets of X and Y resp., A ∈ L(X, Y), and b ∈ Y. Set

𝒦 := {x ∈ K_X; Ax + b ∈ K}. (1.227)

(i) Show that 𝒦 is a closed convex set.
(ii) Let x̄ ∈ 𝒦. Show that, if 0 ∈ int (K − b − A K_X), then

N_𝒦(x̄) = N_{K_X}(x̄) + A⊤ N_K(Ax̄ + b). (1.228)

Hint: apply Lemma 1.120, with f = I_{K_X} and F(y) = I_K(y).

Exercise 1.128 Let K_X and K be closed convex subsets of X and Y resp., and G : X → Y. Set, for ȳ ∈ Y:

𝒦̂ := {(x, y′) ∈ K_X × Y; G(x) + y′ ∈ K},
𝒦 := {x ∈ K_X; G(x) + ȳ ∈ K}. (1.229)

Assume that 𝒦̂ is a closed convex set, and that

0 ∈ int (K − G(K_X) − ȳ) . (1.230)

Set L(x, y∗) := ⟨y∗, G(x)⟩. Show that

N_𝒦(x̄) = { x∗ ∈ X∗; x̄ ∈ argmin_{x∈K_X} (L(x, y∗) − ⟨x∗, x⟩), for some y∗ ∈ N_K(G(x̄) + ȳ) }. (1.231)

Hint: apply Theorem 1.119, with f = I_{K_X} and F = I_K.

Remark 1.129 In the framework of the previous exercise, assume in addition that G is G-differentiable and x ↦ L(x, y∗) is convex, for all y∗ ∈ N_K(G(x̄) + ȳ). Then x̄ ∈ argmin_{x∈K_X} (L(·, y∗) − ⟨x∗, ·⟩) iff x∗ ∈ N_{K_X}(x̄) + DG(x̄)⊤y∗, so that

N_𝒦(x̄) = N_{K_X}(x̄) + DG(x̄)⊤ N_K(G(x̄) + ȳ). (1.232)

This holds in particular if G is affine (and continuous): we then recover the conclusion of Exercise 1.127.

1.2.3 Minimax Theorems

In this section we start from a relatively general Lagrangian function and see how to obtain the minimax duality thanks to the perturbation duality. Let X and Y be Banach spaces, X₀ ⊂ X and Y₀∗ ⊂ Y∗, both nonempty, and L : X₀ × Y₀∗ → R. By (1.39), we have the weak duality inequality:

sup_{y∗∈Y₀∗} inf_{x∈X₀} L(x, y∗) ≤ inf_{x∈X₀} sup_{y∗∈Y₀∗} L(x, y∗). (1.233)

In order to see when equality holds, it is of interest to introduce the perturbation Lagrangian, where y ∈ Y is the perturbation parameter:

𝓛(x, y, y∗) := ⟨y∗, y⟩ + L(x, y∗). (1.234)

We have the more general weak duality inequality

sup_{y∗∈Y₀∗} inf_{x∈X₀} 𝓛(x, y, y∗) ≤ inf_{x∈X₀} sup_{y∗∈Y₀∗} 𝓛(x, y, y∗), for all y ∈ Y. (1.235)

Let us apply the perturbation duality theory of Sect. 1.2.1 with:

ϕ(x, y) := sup_{y∗∈Y₀∗} 𝓛(x, y, y∗) if x ∈ X₀, and +∞ otherwise. (1.236)

Clearly the primal problem

Min_{x∈X} ϕ(x, y) (P_y)

has value v(y) = val(P_y) equal to the r.h.s. of (1.235). We know by (1.151) that v∗∗(y) = sup_{y∗} {⟨y∗, y⟩ − ϕ∗(0, y∗)}. Define L̂ : X × Y∗ → R̄ by

L̂(x, y∗) := +∞ if x ∈ X₀ and y∗ ∉ Y₀∗; −L(x, y∗) if (x, y∗) ∈ X₀ × Y₀∗; −∞ otherwise. (1.237)

Denoting by L̂∗_y(x, y) the partial Fenchel–Legendre transform (in the dual space Y∗) of L̂(x, ·) w.r.t. the second variable, we have that, for all x ∈ X:

ϕ(x, y) = sup_{y∗∈Y∗} { ⟨y∗, y⟩ − L̂(x, y∗) } = L̂∗_y(x, y). (1.238)
It follows that the partial conjugate of ϕ w.r.t. y satisfies ϕ∗_y(x, y∗) = L̂∗∗_y(x, y∗) (equal to −∞ if x ∉ X₀), and therefore

ϕ∗(0, y∗) = sup_{x∈X₀} ϕ∗_y(x, y∗) = − inf_{x∈X₀} ( −L̂∗∗_y(x, y∗) ). (1.239)

Consequently, v∗∗(y) = val(D_y), where the dual problem (D_y) is defined as

Max_{y∗∈Y∗} inf_{x∈X₀} { ⟨y∗, y⟩ − L̂∗∗_y(x, y∗) }. (D_y)

Since a function always majorizes its biconjugate,

𝓛(x, y, y∗) = ⟨y∗, y⟩ − L̂(x, y∗) ≤ ⟨y∗, y⟩ − L̂∗∗_y(x, y∗). (1.240)

We deduce the "canonical" relation between minimax and perturbation dualities

sup_{y∗∈Y₀∗} inf_{x∈X₀} 𝓛(x, y, y∗) ≤ v∗∗(y) ≤ v(y) = inf_{x∈X₀} sup_{y∗∈Y₀∗} 𝓛(x, y, y∗). (1.241)

In view of the expression of (D_y), the inequality on the left is an equality whenever

L̂(x, y∗) = L̂∗∗_y(x, y∗), for all x ∈ X₀. (1.242)

By Lemma 1.49, this holds iff, for each x ∈ X₀, y∗ ↦ L̂(x, y∗) is a supremum of w∗-continuous affine functions, or equivalently, if y∗ ↦ L(x, y∗) is an infimum of w∗-continuous affine functions.

Theorem 1.130 Assume that X₀ and Y₀∗ are nonempty and convex subsets, X₀ is closed, L(·, y∗) is l.s.c. convex for each y∗ ∈ Y₀∗, (1.242) holds, and Y₀∗ is bounded. Then equality holds in (1.233), and the set of y∗ for which the supremum on the left is attained is nonempty and bounded.

Proof (a) Since X₀ is convex and closed, for each y∗ ∈ Y₀∗, the function (x, y) ↦ 𝓛(x, y, y∗), extended by +∞ if x ∉ X₀, is an l.s.c. convex function of (x, y), and hence its supremum over y∗ ∈ Y₀∗, i.e. ϕ(x, y), is itself l.s.c. convex.
(b) Let us check that v(y) < +∞. Fix x₀ ∈ X₀. Since y∗ ↦ L(x₀, y∗) is an infimum of w∗-continuous affine functions, we have that, for some (y₀, c₀) ∈ Y × R (depending on x₀):

L(x₀, y∗) ≤ ⟨y∗, y₀⟩ + c₀, for all y∗ ∈ Y₀∗, (1.243)

and then, since Y₀∗ is bounded:

v(y) ≤ ϕ(x₀, y) ≤ sup_{y∗∈Y₀∗} ⟨y∗, y + y₀⟩ + c₀ < ∞. (1.244)

(c) If the primal value is −∞, the conclusion follows from the weak duality inequality, the maximum of the dual cost being attained at each y∗ ∈ Y₀∗.
(d) In view of the expression of ϕ in (1.236), if v is finite at some y ∈ Y, we have that

|v(y′) − v(y)| ≤ sup_{y∗∈Y₀∗} |⟨y∗, y′ − y⟩| ≤ sup_{y∗∈Y₀∗} ‖y∗‖ ‖y′ − y‖, (1.245)

proving that v is everywhere finite and Lipschitz. Since v is convex and Lipschitz, by Lemma 1.59, ∂v(y) is nonempty and bounded, and therefore v(y) = v∗∗(y) and the set of dual solutions is nonempty and bounded. We conclude by (1.241), in which by (1.242) the first inequality is an equality. □

A direct consequence of the previous result is, see [94, Corollary 37.3.2]:

Lemma 1.131 Let A and B be nonempty closed convex subsets of Rⁿ and Rᵠ, resp., with B bounded, and let L be a continuous convex-concave mapping A × B → R. Then

sup_{y∈B} inf_{x∈A} L(x, y) = inf_{x∈A} sup_{y∈B} L(x, y), (1.246)

and the supremum on the l.h.s. is attained.

1.2.4 Calmness

Definition 1.132 Let f : X → R̄ have a finite value at x̄. We say that f is calm at x̄ with constant r > 0 if

f(x̄) ≤ f(x) + r‖x − x̄‖, for all x ∈ X. (1.247)

Lemma 1.133 Let f : X → R̄ be convex, and calm at x̄ with constant r > 0. Then (i) f is l.s.c. at x̄, and (ii) ∂f(x̄) has at least one element of norm at most r.

Proof (i) Immediate consequence of (1.247).
(ii) Let f̄(x) := conv(f)(x). By the Fenchel–Moreau–Rockafellar Theorem 1.46, f̄ = f∗∗. In view of (i) and Corollary 1.47(i), f(x̄) = f̄(x̄) = f∗∗(x̄). By (1.247), f̄_r(x) := f̄(x) + r‖x − x̄‖ attains its minimum at x̄, and so 0 ∈ ∂f̄_r(x̄). By the subdifferential calculus rule for a sum (Corollary 1.121), and since the subdifferential of the norm at the origin is the closed dual unit ball, we have that

0 ∈ ∂f̄_r(x̄) = ∂f̄(x̄) + B̄(0, r)_{X∗}, (1.248)

proving that ∂f̄(x̄) has an element in B̄(0, r)_{X∗}. The conclusion follows. □



Remark 1.134 Conversely, if f : X → R̄ has a subgradient q at x̄ of norm not greater than r > 0, then

f(x) ≥ f(x̄) + ⟨q, x − x̄⟩ ≥ f(x̄) − r‖x − x̄‖, (1.249)

which shows that f is calm at x̄ with constant r. So, if f is convex and f(x̄) is finite, then ∂f(x̄) is nonempty iff f is calm at x̄.
Corollary 1.135 In the framework of the perturbation duality theory presented in
Sect. 1.2.1.1, assume that ϕ is convex (not necessarily l.s.c.), and that the value
function v(·) is calm at y ∈ Y , with constant r > 0. Then val(Py ) = val(D y ), and
∂v(y) = S(D y ) has at least one element of norm at most r .
Since, by Remark 1.134, calmness characterizes subdifferentiability for convex
functions, the difficulty is of course to check this condition! We first present a “patho-
logical” example that illustrates the theory.
Example 1.136 Let X = L²(0, 1), Y = L¹(0, 1), g ∈ X, and A be the canonical injection X → Y. Denote by (·, ·)_X the scalar product in X. Consider the problem

Min_{x∈X} (g, x)_X; Ax + y = 0 in Y. (P_y)

This enters into the framework of perturbation duality, with

ϕ(x, y) = (g, x)_X if x = −y, and +∞ otherwise. (1.250)

The value of (P_y) is therefore

v(y) = −(g, y)_X if y ∈ X, and +∞ otherwise. (1.251)

We distinguish two cases:
(a) g ∉ L∞(0, 1). Given y ∈ Y, it is easy to build a sequence y_k in X such that y_k → y in Y and (g, y_k)_X → +∞, so that v(y_k) → −∞. So v(·) is nowhere l.s.c.
(b) g ∈ L∞(0, 1). Then for y, y′ in X we have that

|v(y′) − v(y)| ≤ ‖g‖∞ ‖y′ − y‖₁, (1.252)

proving that v is calm with constant r := ‖g‖∞ at each y ∈ X.
We compute the dual problem by applying Example 1.114, with here K = {0}. Since f∗(x∗) = 0 if x∗ = g, and +∞ otherwise, we get:

Max_{y∗∈Y∗} ⟨y∗, y⟩; g = −A⊤y∗. (D_y)

If y ∈ X then y = −Ax, for some x ∈ X. Then, if y∗ ∈ F(D_y):

⟨y∗, y⟩ = −⟨y∗, Ax⟩ = −⟨A⊤y∗, x⟩ = (g, x)_X = −(g, y)_X, (1.253)

and so the primal and dual values are equal, and the dual problem has the solution −g, in accordance with Corollary 1.135. Of course it can be checked by direct means that ∂v(y) = {−g}.

Example 1.137 Consider the family of linear optimization problems

Min_{x∈X} ⟨c, x⟩; ⟨aᵢ, x⟩ + yᵢ ≤ 0, i = 1, . . . , p. (P_y)

Here

ϕ∗(0, y∗) = sup_{x,y} { y∗ · y − ⟨c, x⟩; ⟨aᵢ, x⟩ + yᵢ ≤ 0, i = 1, . . . , p }. (1.254)

A supremum less than +∞ implies y∗ ≥ 0, and the optimal choice for y is then yᵢ = −⟨aᵢ, x⟩, so that ϕ∗(0, y∗) = 0 if y∗ ≥ 0 and c + Σ_{i=1}^p yᵢ∗aᵢ = 0, and +∞ otherwise. The dual problem (in the framework of perturbation duality) is therefore, denoting by λ the dual variable:

Max_{λ∈R₊ᵖ} λ · y; c + Σ_{i=1}^p λᵢaᵢ = 0. (D_y)

By Hoffman's Lemma 1.28, calmness is satisfied whenever v(y) is finite, and hence the primal and dual values are equal and the dual problem has a solution, in agreement with Lemma 1.26.
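This primal-dual pair can be checked numerically on a small assumed instance, using scipy.optimize.linprog (which solves min c·x subject to A_ub x ≤ b_ub and A_eq x = b_eq); scipy is an assumed dependency here.

```python
import numpy as np
from scipy.optimize import linprog

# Instance: min x1 + x2  s.t.  -x1 + 1 <= 0, -x2 + 1 <= 0  (i.e. <a_i, x> + y_i <= 0).
c = np.array([1.0, 1.0])
a = np.array([[-1.0, 0.0], [0.0, -1.0]])   # rows a_i
y = np.array([1.0, 1.0])

primal = linprog(c, A_ub=a, b_ub=-y, bounds=[(None, None)] * 2)

# Dual (D_y): max λ·y  s.t.  c + Σ λ_i a_i = 0, λ >= 0; linprog minimizes, so negate y.
dual = linprog(-y, A_eq=a.T, b_eq=-c, bounds=[(0, None)] * 2)

assert primal.status == 0 and dual.status == 0
assert np.isclose(primal.fun, -dual.fun)   # equal primal and dual values (both equal 2)
```

The optimal primal point is (1, 1) and the dual solution is λ = (1, 1), as predicted by c + Σ λᵢaᵢ = 0.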

Remark 1.138 The stability condition (1.170) does not hold in Example 1.136, and
does not necessarily hold in Example 1.137. So these examples show the usefulness
of the concept of calmness.

1.3 Specific Structures, Applications

1.3.1 Maxima of Bounded Functions

Coming back to the minimization of composite functions in Sect. 1.2.1.5, assume that f is proper, l.s.c. convex, and that F is l.s.c., convex, and positively homogeneous with value 0 at 0. Then F(y) > −∞ for all y. By Theorem 1.44, F is equal to its biconjugate, and by Lemma 1.66, F(y) = σ_{K∗}(y) and F∗ = I_{K∗}, where K∗ = ∂F(0). So problem (P_y) in Sect. 1.2.1.5 is of the form

Min_{x∈X} f(x) − ⟨x∗, x⟩ + sup_{y∗∈K∗} ⟨y∗, G(x) + y⟩. (1.255)

As in (1.182) we set L(x, y ∗ ) := f (x) + y ∗ , G(x). Since F ∗ = I K ∗ , by


Sect. 1.2.1.5, the dual problem can be expressed as

Max y ∗ , y + inf (L(x, y ∗ ) − x ∗ , x). (1.256)


y ∗ ∈K ∗ x

In the sequel to this section, we assume that Y is a space of bounded functions,


denoted by yω , over a certain set Ω, containing constant functions, and is a Banach
space endowed with the uniform norm

y := sup {|yω |; ω ∈ Ω} . (1.257)

Remark 1.139 An obvious choice for Y is the space of bounded functions over Ω. If
Ω is a compact metric space, we can also choose the space of continuous and bounded
functions over Ω (indeed, by the Heine–Cantor theorem, a continuous function over
a compact set is uniformly continuous, and this easily implies that a uniform limit
of continuous functions is continuous).
The dual space Y∗ is endowed with the norm

‖y∗‖ := sup{⟨y∗, y⟩; y ∈ Y, |y_ω| ≤ 1 for all ω ∈ Ω}.

We say that y∗ ∈ Y∗ is nonnegative, and write y∗ ≥ 0, if ⟨y∗, y⟩ ≥ 0 for all y ≥ 0 (we recognize here a polarity relation between the (closed convex) cone of nonnegative functions of Y, and the positive polar cone of nonnegative linear forms). In the sequel of this section, we discuss problems of the type (in certain applications, the supremum will be an essential supremum)

Min_x f(x) − ⟨x∗, x⟩ + sup_{ω∈Ω} {G_ω(x) + y_ω}. (P_y)

If y : Ω → R, we denote the supremum function by sup y := sup{y_ω; ω ∈ Ω}. Let us denote by 1 the function with constant value 1 over Ω. We will see that the subdifferential of the supremum function at 0 is the set

S(Ω) := {y∗ ∈ Y∗; y∗ ≥ 0; ⟨y∗, 1⟩ = 1}. (1.258)

We say that a function is non-expansive if it has Lipschitz constant one.

Lemma 1.140 The convex, positively homogeneous function sup : Y → R is non-
expansive, and its subdifferential at 0 is S (Ω), so that for all y ∈ Y and y ∗ ∈ Y ∗ :

sup(y) = max y ∗ ∈S (Ω) y ∗ , y; (sup)∗ (y ∗ ) = IS (Ω) (y ∗ );
(1.259)
∂ sup(y) = {y ∗ ∈ S (Ω); sup(y) = y ∗ , y}.

Proof The non-expansivity is a direct consequence of the definition of the supremum,


and ensures that the subdifferential of the supremum at any point is contained in the
1.3 Specific Structures, Applications 55

closed unit ball. Let y ∗ ∈ S (Ω), and y ∈ Y . Since y ∗ is nonnegative, we have

sup(y) = y ∗ , sup(y)1 − y + y ∗ , y ≥ y ∗ , y,

which proves that S (Ω) ⊂ ∂ sup(0). Conversely, let y ∗ ∈ ∂ sup(0). Then ±1 =


sup(±1) ≥ y ∗ , ±1 implies y ∗ , 1 = 1. In addition, for all y ≥ 0, we have that
0 ≥ sup(−y) ≥ y ∗ , −y, so y ∗ ≥ 0. We have shown that y ∗ ∈ S (Ω). We conclude
by Lemma 1.66. 
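For a finite set Ω = {1, . . . , n}, the first identity in (1.259) says that the maximum of the components of y equals the maximum of ⟨y∗, y⟩ over probability vectors y∗, attained at any vertex e_i with y_i maximal. A small check on assumed random data:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.standard_normal(8)

# Random probability vectors never beat max(y)...
for _ in range(1000):
    p = rng.random(8)
    p /= p.sum()                 # p ∈ S(Ω): p >= 0 and <p, 1> = 1
    assert p @ y <= y.max() + 1e-12

# ...and the vertex e_i at an index attaining the max realizes it.
e = np.zeros(8)
e[np.argmax(y)] = 1.0
assert np.isclose(e @ y, y.max())
```

This is exactly the finite-dimensional special case made explicit in Example 1.142 below, where S(Ω) is identified with the simplex of probabilities over p points.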
It follows from Sect. 1.2.1.5 that the dual of (P_y) can be written in the form

Max_{y∗∈S(Ω)} ⟨y∗, y⟩ + inf_x (L(x, y∗) − ⟨x∗, x⟩). (D_y)

Theorem 1.141 Let Y be a Banach space endowed with the norm (1.257), containing the constant functions. We assume that f is proper, l.s.c., convex, that x ↦ G(x) is continuous, and that for any y∗ ∈ S(Ω), x ↦ ⟨y∗, G(x)⟩ is convex. Then problems (P_y) and (D_y) have the same value, that is finite or equal to −∞. If this value is finite, then S(D_y) is nonempty (and necessarily bounded, since S(Ω) is). In addition, x ∈ S(P_y) and y∗ ∈ S(D_y) iff (x, y∗) satisfies

x∗ ∈ ∂ₓL(x, y∗); y∗ ∈ S(Ω); ⟨y∗, G(x) + y⟩ = sup(G(x) + y). (1.260)

Proof The function

(x, y) ↦ σ_{S(Ω)}(G(x) + y) = sup_{ω∈Ω} (G_ω(x) + y_ω) = sup_{y∗∈S(Ω)} ⟨y∗, G(x) + y⟩ (1.261)

is continuous (being a composition of continuous functions) and convex. Since f is l.s.c. convex, the function (x, y) ↦ ϕ(x, y) := f(x) + σ_{S(Ω)}(G(x) + y) is l.s.c. convex. As f is proper, (P_y) is feasible for all y. Corollary 1.92 ensures the equality of the primal and dual values. Finally, the optimality conditions follow from the duality theory for composite functions (Proposition 1.98) combined with Lemma 1.66. □
Note that the hypotheses made on G in the above theorem imply in particular that
for all ω ∈ Ω, the function x → G ω (x) is convex continuous (since y → yω is a
linear continuous form that belongs to S (Ω)).
Example 1.142 Under the hypotheses of Theorem 1.141, let Ω be finite, of cardinality p (therefore each component Gᵢ is convex). We will then identify C(Ω) with Rᵖ, and S(Ω) with the set of probabilities over Ω: Sᵖ := { y∗ ∈ R₊ᵖ; Σ_{i=1}^p yᵢ∗ = 1 }. Using the subdifferential calculus rule in Example 1.122, and especially (1.223), we see that the optimality condition (1.260) reduces to

0 ∈ ∂f(x) + Σ_{i=1}^p yᵢ∗ ∂ₓGᵢ(x); y∗ ∈ Sᵖ; y∗_j = 0, j ∉ argmaxᵢ Gᵢ(x). (1.262)

Another case of interest is that of compact spaces.

Example 1.143 Let Y = C(Ω), the space of continuous functions on the compact metric space Ω. The dual space is the space of finite Borel measures over Ω, and S(Ω) is nothing but the set P(Ω) of Borel probability measures over Ω. We can define the support of a measure, denoted by supp(·), as the complement of the largest open set where it is equal to 0. Then the last two relations of (1.260) are equivalent to

y∗ ∈ P(Ω); supp(y∗) ⊂ argmax(G(x) + y). (1.263)

1.3.2 Linear Conical Optimization

The literature often refers to linear conical optimization problems, which are as follows. Given two Banach spaces X and Y, consider the problem

Min_{x∈X} ⟨c, x⟩; Ax − b ∈ C, (1.264)

where C ⊂ Y is a closed convex cone, c ∈ X∗, A ∈ L(X, Y), and b ∈ Y. When Y = R^{q+p} and C = {0}_{R^q} × R^p₋, we recover the class of (possibly infinite-dimensional) linear programs. If C is the cone Sⁿ₊ of symmetric positive semidefinite matrices of size n, we obtain (possibly infinite-dimensional) semidefinite programming problems.
Linear conical problems are nothing but particular cases of Fenchel duality, more precisely of those problems discussed in Example 1.114, where K := b + C, so that σ_K(λ) = ⟨λ, b⟩ + I_{C⁻}(λ), and f(x) := ⟨c, x⟩ is such that

f∗(z) = 0 if z = c, and +∞ otherwise. (1.265)

So the expression of the dual problem when y = 0 is

Max_{λ∈C⁻} −⟨λ, b⟩; c + A⊤λ = 0. (1.266)

If we prefer to use the positive polar cone C⁺ = −C⁻, the expression of the dual problem becomes

Max_{η∈C⁺} ⟨η, b⟩; A⊤η = c. (1.267)

The dual problem is itself in the conical linear class, except of course that the spaces are of dual type. It can be rewritten, setting 𝒦 := C⁻ × {0} (zero in X∗), in the form (formally close to (1.264)):

Min_{λ∈Y∗} ⟨λ, b⟩; (λ, c + A⊤λ) ∈ 𝒦. (1.268)

We can also dualize the dual problem (1.268); in view of Lemma 1.85, the resulting
bidual problem will coincide with the original one.

Corollary 1.144 (Primal qualification) If there exists an ε > 0 such that

ε BY ⊂ C + b + Im(A) (1.269)

(in particular, if there exists an x0 ∈ X such that Ax0 − b ∈ int C), then (1.264) and
(1.266) have the same value. If the latter is finite, then the solution set of the dual
problem (1.266) is nonempty and bounded.

Proof Apply Corollary 1.92. 

Corollary 1.145 (Dual qualification) Let X and Y be reflexive. If there exists an


ε > 0 such that
ε B X ∗ ⊂ c + A C − , (1.270)

then (1.264) and (1.266) have the same value, and if this common value is finite, then
the solution set of the primal problem (1.264) is nonempty and bounded.

Proof It suffices to check that (1.270) is equivalent to the stability condition for the
dual problem. The latter holds iff there exists an ε > 0 such that

ε (BY ∗ × B X ∗ ) ⊂ C − × {−c} − {(λ, A λ); λ ∈ Y ∗ }. (1.271)

This holds iff, for all (μ, η) close to 0 in Y ∗ × X ∗ , there exists a λ ∈ Y ∗ such that

μ ∈ C − − λ; η = −c − A λ. (1.272)

The first relation is equivalent to λ = λ̂ − μ, with λ̂ ∈ C⁻. Eliminating λ in the second relation, we obtain c + A*λ̂ = A*μ − η. One easily shows that this is equivalent to the existence of an ε > 0 such that (1.270) holds. □
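As a concrete finite-dimensional illustration of the conical linear pair (1.264)/(1.267) and of the absence of a duality gap, here is a small numerical check in Python. The data (A, b, c) and the candidate points are our own illustrative choices, not taken from the text; with C = C⁺ = R³₊ (a self-dual cone), exhibiting feasible primal and dual points with equal cost certifies optimality on both sides by weak duality.

```python
# Illustrative instance (not from the text) of the conical linear pair:
#   Primal:  min <c, x>   s.t.  Ax - b in C,  with C = R^3_+  (i.e. Ax >= b)
#   Dual:    max <eta, b> s.t.  A^T eta = c,  eta in C^+ = R^3_+
A = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
b = [1.0, 1.0, 3.0]
c = [1.0, 2.0]

x = [2.0, 1.0]          # candidate primal point
eta = [0.0, 1.0, 1.0]   # candidate dual point

# primal feasibility: Ax - b in C, componentwise
primal_feasible = all(sum(A[i][j] * x[j] for j in range(2)) >= b[i] - 1e-12
                      for i in range(3))
# dual feasibility: eta in C^+ and A^T eta = c
dual_feasible = (all(e >= 0.0 for e in eta) and
                 all(abs(sum(A[i][j] * eta[i] for i in range(3)) - c[j]) < 1e-12
                     for j in range(2)))

primal_value = sum(c[j] * x[j] for j in range(2))
dual_value = sum(eta[i] * b[i] for i in range(3))
# equal feasible values (here both equal 4) certify a zero duality gap
```

Since both points are feasible and the values coincide, x solves the primal and η solves the dual, as Corollaries 1.144 and 1.145 predict under the qualification conditions.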

1.3.3 Polyhedra

Let X be a Banach space. A polyhedron P of X is a subset defined by a finite number


of inequalities:
P = {x ∈ X ;  ⟨a_i, x⟩ ≤ b_i,  i = 1, …, p},    (1.273)

where a1 , . . . , a p belong to X ∗ . We set I = {1, . . . , p} and call {(ai , bi ); i ∈ I } a


parameterization of P. The latter is of course not unique. If x ∈ P, we denote the
set of active constraints by

I(x) := {i ∈ I ;  ⟨a_i, x⟩ = b_i}.    (1.274)


58 1 A Convex Optimization Toolbox

Lemma 1.146 Let x̄ ∈ P. Then


N_P(x̄) = { Σ_{i∈I(x̄)} λ_i a_i ;  λ ≥ 0 }.    (1.275)


Proof Let N̂_P(x̄) denote the r.h.s. of (1.275). If x* = Σ_{i∈I(x̄)} λ_i a_i with λ ≥ 0, and x ∈ P, then

⟨x*, x − x̄⟩ = Σ_{i∈I(x̄)} λ_i ⟨a_i, x − x̄⟩ = Σ_{i∈I(x̄)} λ_i (⟨a_i, x⟩ − b_i) ≤ 0,    (1.276)

proving that N̂_P(x̄) ⊂ N_P(x̄). Conversely, let x* ∈ N_P(x̄). Then x̄ is a solution of the linear program Min{−⟨x*, x⟩; x ∈ P}. By the strong duality for linear programs (Lemma 1.26), there exists a solution λ of the dual problem. By dual feasibility and the complementarity conditions, we deduce that x* ∈ N̂_P(x̄). □
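The easy inclusion of Lemma 1.146 can be checked numerically. In the sketch below (all data are our own illustrative choices) we take P = {x ∈ R² : x₁ ≤ 1, x₂ ≤ 1}, pick a nonnegative combination of the gradients active at x̄ = (1, 1), and verify the defining inequality ⟨x*, x − x̄⟩ ≤ 0 of the normal cone on a grid of points of P:

```python
# Check of (1.275) on P = {x in R^2 : x1 <= 1, x2 <= 1} at xbar = (1, 1),
# where both constraints are active, so N_P(xbar) should consist of the
# nonnegative combinations of a1 = (1, 0) and a2 = (0, 1).
a = [(1.0, 0.0), (0.0, 1.0)]
xbar = (1.0, 1.0)

lam = (0.7, 2.5)  # arbitrary nonnegative multipliers
xstar = (lam[0] * a[0][0] + lam[1] * a[1][0],
         lam[0] * a[0][1] + lam[1] * a[1][1])

# sample P on a grid and verify <xstar, x - xbar> <= 0 everywhere
max_inner = max(
    xstar[0] * (x1 - xbar[0]) + xstar[1] * (x2 - xbar[1])
    for x1 in [-2.0 + 0.1 * k for k in range(31)]   # x1 ranges over [-2, 1]
    for x2 in [-2.0 + 0.1 * k for k in range(31)])  # x2 ranges over [-2, 1]
```

The maximum of the inner product over the sampled points is 0 (attained at x̄ itself), consistent with x* ∈ N_P(x̄).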

Consider now a collection of a_i in X*, for i in I ∪ J, the sets I and J being finite. Set Q := {x ∈ X ; ⟨a_j, x⟩ ≤ 0, j ∈ J}, and for x ∈ Q,

I(x) = {i ∈ I ;  ⟨a_i, x⟩ ≥ ⟨a_k, x⟩, for all k ∈ I};
J(x) = {j ∈ J ;  ⟨a_j, x⟩ = 0}.    (1.277)

Set g(x) := max{⟨a_i, x⟩; i ∈ I}.

Lemma 1.147 Let x̄ ∈ X . Then

∂g(x̄) = conv{ai ; i ∈ I (x̄)}. (1.278)

Proof For z ∈ Rⁿ, set max(z) := max(z_1, …, z_n). It is an elementary exercise to check that

∂max(z) = conv{e_i ;  1 ≤ i ≤ n,  z_i = max(z)}.    (1.279)

Since g is the composition of the max function with the linear mapping (assuming that |I| = n) x ↦ Ax := (⟨a_1, x⟩, …, ⟨a_n, x⟩), so that A*λ = Σ_j λ_j a_j, we conclude with Lemma 1.120. □
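Formula (1.278) can be tested through directional derivatives: the directional derivative of g at x̄ is the support function of ∂g(x̄), so g′(x̄; h) should equal max{⟨a_i, h⟩ : i ∈ I(x̄)}. The following Python sketch (with illustrative data of our own choosing) compares a finite-difference quotient with this prediction:

```python
# For g(x) = max_i <a_i, x>, (1.278) gives
#   g'(xbar; h) = max{<a_i, h> : i in I(xbar)},
# the support function of conv{a_i : i in I(xbar)}.
a = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
xbar = (1.0, 1.0)  # a1, a2 are active (value 1); a3 is not (value -2)

def g(x):
    return max(ai[0] * x[0] + ai[1] * x[1] for ai in a)

vals = [ai[0] * xbar[0] + ai[1] * xbar[1] for ai in a]
active = [i for i, v in enumerate(vals) if abs(v - max(vals)) < 1e-12]

h = (0.3, -0.8)
t = 1e-6
fd = (g((xbar[0] + t * h[0], xbar[1] + t * h[1])) - g(xbar)) / t
support = max(a[i][0] * h[0] + a[i][1] * h[1] for i in active)
# fd and support both equal 0.3 here, up to discretization error
```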

Let Φ : X → R̄ be defined by

Φ(x) := max{⟨a_i, x⟩; i ∈ I} + I_Q(x) = g(x) + I_Q(x).    (1.280)

Definition 1.148 If E is a subset of X , we denote the convex cone generated by E


(the set of finite nonnegative combinations of elements of E) by cone(E).

Lemma 1.149 Let x̄ ∈ Q. Then

∂Φ(x̄) = conv{a_i ;  i ∈ I(x̄)} + cone{a_j ;  j ∈ J(x̄)}.    (1.281)
Proof Since g is convex and continuous, and I Q is l.s.c. convex, by the subdifferential
calculus rules (Lemma 1.120), we have that ∂Φ(x̄) = ∂g(x̄) + ∂ I Q (x̄). We conclude
by noting that ∂ I Q (x̄) = N Q (x̄), whose expression is given by Lemma 1.146, and
by Lemma 1.147. 

Lemma 1.150 With the above notation, let M ∈ L(Z, X), where Z is a Banach space, and let Ψ := Φ ∘ M have a finite value at z̄ ∈ Z. Set x̄ = Mz̄. Then

∂Ψ(z̄) = M*∂Φ(Mz̄).    (1.282)

Proof The function Ψ is of the same nature as Φ, replacing the a_i by M*a_i, for i ∈ I ∪ J. The conclusion follows by Lemma 1.149. □

We admit the Minkowski–Weyl theorem on the representation of polyhedra (see [97, Part IV], or [110, Chap. 8]); we assume that X = Rⁿ.

Theorem 1.151 Let P satisfy (1.273) and be nonempty. Then there exist elements x_i ∈ X, with i ∈ I ∪ J, I and J finite sets, such that

Consider now the following family of linear programs

Min_{x∈X}  ⟨c, x⟩;   ⟨a_i, x⟩ + y_i ≤ 0,  i = 1, …, p,    (LP_y)

parameterized by y ∈ R^p. The dual problem is

Max_{λ∈R^p_+}  λ · y;   c + Σ_{i=1}^p λ_i a_i = 0.    (LD_y)

Now let Z be a Banach space, and let M_i ∈ L(Z, X), for i = 1 to p. Define Mz := (M_1 z, …, M_p z)^⊤. Set v(y) := val(LP_y), and V(z) := v(Mz).

Theorem 1.152 Fix z̄ ∈ Z, set ȳ = Mz̄, and let x̄ ∈ S(LP_ȳ). Then

∂V(z̄) = M*∂v(ȳ) = M*S(LD_ȳ).    (1.284)



Proof By linear programming duality and the general duality theory (Lemma 1.26 and Theorem 1.87), we have that

val(LP_ȳ) = val(LD_ȳ);   ∂v(ȳ) = S(LD_ȳ).    (1.285)

Let {λ_i, i ∈ I ∪ J} be a Minkowski–Weyl representation of F(LD_ȳ); note that the latter does not depend on ȳ. It is easily checked that

val(LD_{Mz}) = Φ(Mz) = Ψ(z),    (1.286)

where Φ was defined in (1.280). We conclude by Lemma 1.150. □

This result, which is essentially another proof of (1.109), will be used in


Sect. 3.2.7.

1.3.4 Infimal Convolution

Let X be a Banach space, and f 1 , f 2 be two extended real-valued functions over X .


Their infimal convolution is the extended real-valued function over X defined as

(f_1 □ f_2)(y) := inf_{x∈X} ( f_1(y − x) + f_2(x) ).    (1.287)

It is easily seen that the operator □ (that to two extended real-valued functions over X associates their infimal convolution) is commutative and associative. More generally, the infimal convolution of n extended real-valued functions f_1, …, f_n over X is defined as

(□_{i=1}^n f_i)(y) := inf_{x∈Xⁿ} { Σ_{i=1}^n f_i(x_i) ;  Σ_{i=1}^n x_i = y }.    (1.288)

One easily checks that (f_1 □ f_2) □ f_3 = □_{i=1}^3 f_i. In order to fit with our duality theory, consider the related problem

Min_{x∈Xⁿ}  Σ_{i=1}^n ( f_i(x_i) − ⟨x_i*, x_i⟩ );   Σ_{i=1}^n x_i = y,    (P_y)

with value function denoted by v(y); we have that

v(y) = (□_{i=1}^n f_i)(y)  whenever x* = 0,    (1.289)

as well as

v*(y*) = sup_y ⟨y*, y⟩ − v(y)
       = sup_{x,y} { ⟨y*, y⟩ + Σ_{i=1}^n ( ⟨x_i*, x_i⟩ − f_i(x_i) ) ;  Σ_{i=1}^n x_i = y }    (1.290)
       = sup_x Σ_{i=1}^n ( ⟨y* + x_i*, x_i⟩ − f_i(x_i) ) = Σ_{i=1}^n f_i*(y* + x_i*).

Taking all x_i* equal to zero, we obtain that the Fenchel conjugate of the infimal convolution is the sum of the conjugates, i.e.

(□_{i=1}^n f_i)*(y*) = Σ_{i=1}^n f_i*(y*).    (1.291)

The dual problem to (P_y) is

Max_{y*∈Y*}  ⟨y*, y⟩ − Σ_{i=1}^n f_i*(y* + x_i*).    (D_y)

Since dom(v) = Σ_{i=1}^n dom(f_i), the stability condition is

y ∈ int( Σ_{i=1}^n dom(f_i) ).    (1.292)

We deduce that
Proposition 1.153 Assume that the f i are l.s.c. convex, and let (1.292) hold. Then
v(y) = val(D y ), and if the value is finite, S(D y ) is nonempty and bounded.
When the f i are proper, l.s.c. convex, the cost function of (Py ) is itself a proper,
l.s.c. and convex function of (x, y). By Lemma 1.85, (Py ) is the dual of (D y ). In
view of Remark 1.86, when X is reflexive, we may regard (Py ) as the “classical” dual
of (D_y), with perturbation parameter x*. Clearly (D_y) is feasible iff there exists a y* ∈ Y* such that y* + x_i* ∈ dom(f_i*), for i = 1 to n, i.e., iff x* ∈ Π_i dom(f_i*) − Ay* for some y*, where the operator A : Y* → (Y*)ⁿ is defined by Ay* = (y*, …, y*) (n times). The dual stability condition is therefore

(x_1*, …, x_n*) ∈ int( Π_{i=1}^n dom(f_i*) − AY* ).    (1.293)

We have proved that:


Proposition 1.154 Let X be reflexive and the f i be proper, l.s.c. convex, and (1.293)
hold. Then val(Py ) = val(D y ), and if the value is finite, S(Py ) is nonempty and
bounded.

Example 1.155 Let X = R, f_1(x) = eˣ, f_2(x) = e⁻ˣ. Set g(x) := (f_1 □ f_2)(x). Then g(x) = 0 for all x in R, dom(f_1*) = R₊, dom(f_2*) = R₋, and, with the convention that 0 log 0 = 0:

f_1*(x′) = x′ log x′ − x′;   f_2*(x′) = −x′ log(−x′) + x′.    (1.294)

Let x_1* = x_2* = 0. The dual problem reads

Max_{y*∈Y*}  ⟨y*, y⟩ − f_1*(y*) − f_2*(y*).    (D_y)

The unique feasible point is y ∗ = 0, which is also the unique dual solution. The
primal stability condition holds, and accordingly we find that the primal and dual
values are equal and that the dual solution (equal to 0) is the subgradient of the infimal
convolution.
The dual stability condition cannot hold since the infimum in the infimal convo-
lution is not attained. Indeed this condition is that for any x ∗ close to 0 in R2 , there
exists a y ∈ R such that x2∗ ≤ y ≤ x1∗ , which is impossible.
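The non-attainment in Example 1.155 is easy to observe numerically. In the sketch below (parameters are illustrative choices), the infimum of e^{y−x} + e^{−x} over the truncated ranges x ∈ [0, T] decreases strictly to 0 as T grows, while remaining positive, confirming that the value 0 of the infimal convolution is not attained at any finite x:

```python
# Example 1.155: (f1 [] f2)(y) = inf_x e^{y-x} + e^{-x} equals 0 for all y,
# but the infimum is not attained: restricted minima over [0, T] decrease
# to 0 as T grows without ever reaching it.
import math

def partial_inf(y, T, steps=2000):
    return min(math.exp(y - x) + math.exp(-x)
               for k in range(steps + 1)
               for x in [T * k / steps])

y = 1.0
vals = [partial_inf(y, T) for T in (5.0, 10.0, 20.0, 40.0)]
# vals is strictly decreasing, tends to 0, and every entry is positive
```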

1.3.5 Recession Functions and the Perspective Function

1.3.5.1 Recession Functions

Let f be a proper l.s.c. convex function over X . Given x0 ∈ dom( f ), we define the
recession function f ∞ : X → R̄ by

f_∞(d) := sup_{τ>0} ( f(x_0 + τd) − f(x_0) ) / τ.    (1.295)

It is easily checked that f_∞ is convex and positively homogeneous, and that, the difference quotient being nondecreasing in τ, the supremum is attained in the limit τ → +∞.

Lemma 1.156 The recession function is the support function of the domain of f*, that is,

f_∞(d) = sup_{x*∈X*} { ⟨x*, d⟩ ;  f*(x*) < +∞ }.    (1.296)

Proof Being proper l.s.c. convex, f is equal to its biconjugate, that is,

f(x_0 + τd) = sup_{x*∈X*} ⟨x*, x_0 + τd⟩ − f*(x*).    (1.297)

Therefore,

f_∞(d) = sup_{τ>0} sup_{x*∈X*} ( ⟨x*, x_0 + τd⟩ − f*(x*) − f(x_0) ) / τ.    (1.298)

Changing the order of maximization, we get that

f_∞(d) = sup_{x*∈X*} ( ⟨x*, d⟩ + sup_{τ>0} ( ⟨x*, x_0⟩ − f*(x*) − f(x_0) ) / τ ).    (1.299)

By the Fenchel–Young inequality, and since f ∗ is proper, the second supremum is 0


if f ∗ (x ∗ ) < +∞, and −∞ otherwise. The result follows. 

By the above lemma, the recession function does not depend on the element x0 ∈
dom( f ) used in its definition.
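Lemma 1.156 lends itself to a quick numerical check. Take the softplus function f(x) = log(1 + eˣ) (our illustrative choice): its conjugate has domain [0, 1], so the lemma predicts f_∞(d) = sup{x*d : x* ∈ [0, 1]} = max(d, 0). The sketch below compares this with the difference quotients of (1.295):

```python
# Recession function of the softplus f(x) = log(1 + e^x).
# dom(f*) = [0, 1], so Lemma 1.156 predicts f_infty(d) = max(d, 0).
import math

def f(x):
    # numerically safe evaluation of log(1 + e^x)
    return x + math.log1p(math.exp(-x)) if x > 0 else math.log1p(math.exp(x))

x0 = 0.0

def quotient(d, tau):
    return (f(x0 + tau * d) - f(x0)) / tau

slope_pos = quotient(+1.0, 1e4)   # should approach max(+1, 0) = 1
slope_neg = quotient(-1.0, 1e4)   # should approach max(-1, 0) = 0
```

The difference quotients are nondecreasing in τ and approach 1 and 0 respectively, as predicted.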

1.3.5.2 Perspective Function

With f as before we associate the perspective function g : X × R → R̄, with domain


]0, ∞[× dom( f ), defined by

g(x, t) := t f (x/t) (where t > 0). (1.300)

Being proper l.s.c. convex, f has an affine minorant, say ⟨a, x⟩_X + b; then g has the affine minorant ⟨a, x⟩_X + bt.

Lemma 1.157 The perspective function is convex and positively homogeneous; its
conjugate is the indicatrix of the set

C := {(x ∗ , t ∗ ) ∈ X ∗ × R; f ∗ (x ∗ ) + t ∗ ≤ 0}. (1.301)

Proof Note that the domain of g is convex. Let x_1, x_2 ∈ X, t_1 > 0, t_2 > 0, and θ ∈ ]0, 1[. Set

x := θx_1 + (1−θ)x_2;   t := θt_1 + (1−θ)t_2;   θ′ := θt_1/t.    (1.302)

Then θ′ ∈ ]0, 1[, 1 − θ′ = (1−θ)t_2/t, and x/t = θ′x_1/t_1 + (1−θ′)x_2/t_2. Using the convexity of f, we get

g(x, t) ≤ t ( θ′f(x_1/t_1) + (1−θ′)f(x_2/t_2) )
        = θt_1 f(x_1/t_1) + (1−θ)t_2 f(x_2/t_2)    (1.303)
        = θg(x_1, t_1) + (1−θ)g(x_2, t_2),

proving that g is convex. The positive homogeneity is obvious; it follows that, by Lemma 1.66, g* is the indicatrix of the convex set

C_1 := { (x*, t*) ∈ X* × R ;  ⟨x*, x⟩ + t*t ≤ g(x, t), for all (x, t) ∈ dom(g) }.    (1.304)

Dividing by t > 0 and setting y := x/t we see that the above set of inequalities is equivalent to ⟨x*, y⟩ − f(y) + t* ≤ 0, for all y ∈ X. Maximizing in y we obtain the conclusion. □
Since f is proper l.s.c. convex, so is f*. Therefore C is nonempty. It follows that g** = σ_C is never equal to −∞ (this also follows from the fact that g has, as already established, an affine minorant). By the Fenchel–Moreau–Rockafellar Theorem 1.46, g** is equal to the convex closure of g.
Lemma 1.158 The biconjugate of the perspective function satisfies, for all x ∈ X :

g**(x, t) =  (i) +∞ if t < 0;   (ii) g(x, t) if t > 0;   (iii) f_∞(x) if t = 0.    (1.305)

Proof We have that g** is the support function of C, and therefore:

g**(x, t) = sup{ ⟨x*, x⟩ + tt* ;  f*(x*) + t* ≤ 0 }.    (1.306)

(i) If t < 0, we may take x_0* in the nonempty set dom(f*) and set (x*, t*) := (x_0*, −f*(x_0*) − t′), with t′ → ∞; it follows that g**(x, t) = +∞.
(ii) If t > 0, maximizing in t* in (1.306) and since f = f**, we get

g**(x, t) = sup_{x*}{ ⟨x*, x⟩ − t f*(x*) } = t sup_{x*}{ ⟨x*, x/t⟩ − f*(x*) } = t f(x/t) = g(x, t).

(iii) If t = 0, then

g**(x, 0) = sup{ ⟨x*, x⟩ ;  f*(x*) ≤ −t* } = sup{ ⟨x*, x⟩ ;  x* ∈ dom(f*) }.    (1.307)

So, by Lemma 1.156, g**(x, 0) = f_∞(x). □
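Lemma 1.157 can be illustrated numerically. For f(x) = x² (our illustrative choice), the perspective is g(x, t) = x²/t on t > 0; the sketch below samples random points to test midpoint convexity and checks positive homogeneity:

```python
# Perspective of f(x) = x^2, namely g(x, t) = x^2 / t on t > 0 (eq. (1.300)).
# Random sampling to test the convexity asserted in Lemma 1.157, plus a
# positive-homogeneity check.
import random

random.seed(0)

def g(x, t):
    return x * x / t

ok_convex = True
for _ in range(1000):
    x1, x2 = random.uniform(-5, 5), random.uniform(-5, 5)
    t1, t2 = random.uniform(0.1, 5), random.uniform(0.1, 5)
    th = random.uniform(0, 1)
    lhs = g(th * x1 + (1 - th) * x2, th * t1 + (1 - th) * t2)
    rhs = th * g(x1, t1) + (1 - th) * g(x2, t2)
    ok_convex = ok_convex and lhs <= rhs + 1e-9

s = 3.7
homog_gap = abs(g(s * 2.0, s * 0.5) - s * g(2.0, 0.5))  # should vanish
```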

1.3.5.3 Minimizing over a Union of Convex Sets

We next relate the perspective function to the resolution of the nonconvex problem

Min_{x∈X}  ⟨c, x⟩;   f_1(x) ≤ 0 or f_2(x) ≤ 0,    (P_12)

with c ∈ X*, and f_i l.s.c. proper convex functions X → R, for i = 1, 2. We assume that the sets K_i := f_i⁻¹(R₋), i = 1, 2, are nonempty. Next, consider the convex problem

Min_{(x_1,x_2)∈X×X, t_1>0, t_2>0}  ⟨c, x_1 + x_2⟩;   t_1 f_1(x_1/t_1) ≤ 0;  t_2 f_2(x_2/t_2) ≤ 0;  t_1 + t_2 = 1.    (P′_12)


Lemma 1.159 Problems (P_12) and (P′_12) have the same value.

Proof Let (x_1, x_2, t_1, t_2) be in the feasible set of (P′_12). Setting x′_i := x_i/t_i, for i = 1, 2, one easily checks that (P′_12) has the same value as the problem

Min_{(x′_1,x′_2)∈X×X, t_1>0, t_2>0}  ⟨c, t_1 x′_1 + t_2 x′_2⟩;   f_1(x′_1) ≤ 0;  f_2(x′_2) ≤ 0;  t_1 + t_2 = 1.    (P″_12)

Minimizing w.r.t. (x′_1, x′_2) first, we see that the value of problem (P″_12) is equal to

inf_{t_i>0, t_1+t_2=1} ( t_1 inf_{x_1∈K_1} ⟨c, x_1⟩ + t_2 inf_{x_2∈K_2} ⟨c, x_2⟩ ) = min( inf_{x_1∈K_1} ⟨c, x_1⟩, inf_{x_2∈K_2} ⟨c, x_2⟩ ).    (1.308)

The result easily follows. □

1.4 Duality for Nonconvex Problems

1.4.1 Convex Relaxation

In this section we discuss a nonconvex optimization problem with a finite dimensional


constraint.

1.4.1.1 Coercive Dual Cost

Consider a problem of the form

Min_{x∈X}  f(x);   g(x) ∈ K.    (P)

Here X is an arbitrary set, f : X → R, g : X → R^p, and K is a (possibly nonconvex) nonempty subset of R^p. Denote by K̄ the closed convex hull of K, and recall that K and K̄ have the same support function. The associated Lagrangian is


L(x, λ) := f(x) + Σ_{i=1}^p λ_i g_i(x),    (1.309)

and the opposite of the dual criterion is

d(λ) := σ_K̄(λ) + sup_x {−L(x, λ)}.    (1.310)

This is obviously an l.s.c. convex function, everywhere greater than −∞. We will assume that it is proper; this holds, for instance, if inf f > −∞, since then 0 ∈ dom(d). The dual problem can be written as

Min_{λ∈R^p}  d(λ).    (D′)

We denote it by (D′) to take into account the change of sign, but call the amount −val(D′) the dual value in order to remain coherent with the general duality theory.

Proposition 1.160 We assume (i) that the function d(·) is proper, and (ii) the existence of an ε > 0 such that

ε B ⊂ conv( g(X) − K̄ ).    (1.311)

Then the dual problem has a nonempty and compact set of solutions.

Remark 1.161 Note that (1.311) is equivalent to the same relation in which we write K instead of K̄.

Proof (Proof of Proposition 1.160) Since the unit ball B of R^p is compact, (1.311) implies the existence of x_1, …, x_r in X and k_1, …, k_r in K̄ such that

½ ε B ⊂ conv( {k_i − g(x_i),  i = 1, …, r} ).    (1.312)

We have that

d(λ) ≥ max_i {−f(x_i) + ⟨λ, k_i − g(x_i)⟩}
     ≥ min_i {−f(x_i)} + max_i ⟨λ, k_i − g(x_i)⟩    (1.313)
     ≥ min_i {−f(x_i)} + ½ ε|λ|,

the last inequality using the fact that a maximum of linear forms is equal to the maximum over their convex hull. It follows that a minimizing sequence λ_k (which exists since d is proper) is bounded and therefore has a subsequence converging to some λ̄. Since d(·) is l.s.c. (being a supremum of affine forms), λ̄ ∈ S(D′). That S(D′) is bounded is a consequence of the coercivity property (1.313). □

1.4.1.2 Dual Optimality Conditions

Let us now add some hypotheses for ensuring the existence of points minimizing the
Lagrangian in the vicinity of dual solutions.

Proposition 1.162 Assume that there exists a compact metric set Ω ⊂ X such that, if λ is close enough to S(D′), the set of minima of L(·, λ) has at least one point in Ω, and that f and g are continuous over Ω.
Then λ ∈ S(D′) iff there exists a Borelian probability measure μ over Ω such that, denoting by E_μ g(x) = ∫_Ω g(x) dμ(x) the associated expectation, the following holds:

supp μ ⊂ argmin L(·, λ);   E_μ g(x) ∈ K̄;   λ ∈ N_K̄(E_μ g(x)).    (1.314)

Proof Set δ(λ) := sup_{x∈X} {−L(x, λ)}. By our assumptions, when λ is close enough to S(D′), δ(λ) is equal to the continuous function

δ′(λ) := max_{x∈Ω} {−L(x, λ)}.    (1.315)

Since δ(·) and δ′(·) are convex, and coincide near S(D′), they have the same subdifferential near S(D′).
Let λ ∈ S(D′). Since δ′(·) is continuous at λ, Corollary 1.121 implies that

0 ∈ ∂d(λ) = ∂σ_K̄(λ) + ∂δ′(λ).    (1.316)

By (1.147), y ∈ ∂σ_K̄(λ) iff y ∈ K̄ and λ ∈ N_K̄(y). So, (1.316) is equivalent to

λ ∈ N_K̄(−q),  for some q ∈ ∂δ′(λ).    (1.317)

We next give an expression for ∂δ′(λ). We have that δ′(λ) = F[G(λ)], where F : C(Ω) → R is defined by F(y) := max{y_x,  x ∈ Ω}, and the affine mapping G : R^p → C(Ω) is defined by (denoting the value at x ∈ Ω by a subindex) G(λ)_x := −L(x, λ). Set A := DG(λ). Since F is Lipschitz, the subdifferential calculus rules (Theorem 1.119) apply, so that by (1.317):

q ∈ ∂δ′(λ) = A*∂F(G(λ)).    (1.318)

Now A ∈ L(R^p, C(Ω)) satisfies (Aλ)_x := −Σ_{i=1}^p λ_i g_i(x). For μ ∈ C(Ω)* we have

⟨μ, Aλ⟩_{C(Ω)} = −∫_Ω Σ_{i=1}^p λ_i g_i(x) dμ(x),

so that A*μ = −∫_Ω g(x) dμ(x). By Lemma 1.140, ∂F(y) is equal to the set of Borel probability measures over Ω, with support within the set of points where y attains its maximum. The conclusion follows. □

Remark 1.163 When X is a compact metric set we can also consider the following relaxed formulation

Min_{μ∈P(X)}  ∫_X f(x) dμ(x);   ∫_X g(x) dμ(x) ∈ K̄.    (1.319)

The stability condition for this convex problem is precisely (1.311). So, under this condition, if the above problem is feasible, there is no duality gap. The Lagrangian is

ℒ(μ, λ) = ∫_X ( f(x) + Σ_{i=1}^p λ_i g_i(x) ) dμ(x) − σ_K̄(λ) = ∫_X L(x, λ) dμ(x) − σ_K̄(λ).    (1.320)

Therefore the infimum of the Lagrangian w.r.t. the primal variable μ ∈ P(X) can be expressed as

inf_{μ∈P(X)} ℒ(μ, λ) = inf_{x∈X} L(x, λ) − σ_K̄(λ).    (1.321)

That is, (1.319) has the same dual as the original problem. If (1.311) holds, then
the stability condition holds for the convex problem (1.319), and hence, there is no
duality gap. We can therefore interpret the dual problem as the dual of the relaxed
problem.
Proposition 1.164 (i) Let λ̄ ∈ S(D′). If x̄ ∈ argmin L(·, λ̄) is such that g(x̄) ∈ K and λ̄ ∈ N_K̄(g(x̄)), then x̄ ∈ S(P), and the primal and dual problems have the same value.
(ii) Under the hypotheses of Proposition 1.162, if K is closed and convex, and λ̄ ∈ S(D′) is such that x ↦ g(x) is constant over argmin L(·, λ̄) (which is the case in particular if L(·, λ̄) attains its minimum at a single point), then any x̄ ∈ argmin L(·, λ̄) is a solution of (P) and the conclusion of point (i) is therefore satisfied.
Proof (i) Since L(x̄, λ̄) = inf_x L(x, λ̄) and λ̄ ∈ N_K̄(g(x̄)), so that σ_K̄(λ̄) is equal to ⟨λ̄, g(x̄)⟩, we have that

f(x̄) = L(x̄, λ̄) − σ_K̄(λ̄) = inf_x L(x, λ̄) − σ_K̄(λ̄) = −d(λ̄),    (1.322)

i.e., x̄ and λ̄ are primal and dual feasible with equal cost, meaning that x̄ is a solution of the primal problem, λ̄ is a solution of the dual one, and the primal and dual values are equal.
(ii) We apply Proposition 1.162. Since g(x) is constant over argmin L(·, λ̄), we obtain the existence of a probability measure μ with support in argmin L(·, λ̄), such that for any x̄ ∈ argmin L(·, λ̄), g(x̄) = E_μ g(x) ∈ K. We conclude using point (i). □
Remark 1.165 In most nonconvex problems there is a duality gap; the hypotheses of the proposition above are then not satisfied. Still, the hypotheses of Propositions 1.160 and 1.162 are weak. So, in general, the dual problem has a compact and nonempty set of solutions, but at each of these solutions, the minimum of the Lagrangian is attained at several points (with different values of the constraint g(x)).
Remark 1.166 We will apply point (ii) of Proposition 1.164 (in a case when the set
of minima of the Lagrangian is in general not a singleton) to the study of controlled
Markov chains with expectation constraints, see Theorem 7.34.
Exercise 1.167 For x ∈ R, let f (x) = 1 − x 2 and g(x) = x. The problem of mini-
mizing f (x) over X := [−1, 1], under the constraint g(x) = 0, has a unique solution

x̄ = 0 and value 1. The Lagrangian L(x, λ) = 1 − x² + λx is concave in x, and therefore attains its minimum over [−1, 1] at x = ±1, so that the opposite of the dual cost is d(λ) = |λ| (note that here σ_K is the null function). So, the dual problem has the unique solution λ̄ = 0,
for which the Lagrangian attains its minimum at ±1, and so, the relaxed solution is
the measure with equal probability 1/2 at ±1 (so that, as required, the expectation
of g(x) is zero).
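Exercise 1.167 can be carried out numerically. The sketch below (a direct grid computation, nothing beyond the data of the exercise) evaluates d(λ) = sup over x ∈ [−1, 1] of x² − 1 − λx, confirms that it equals |λ|, and exhibits the duality gap of 1:

```python
# Exercise 1.167: f(x) = 1 - x^2, g(x) = x on X = [-1, 1], constraint
# g(x) = 0 (K = {0}, so sigma_K = 0).  The opposite of the dual cost is
#   d(lambda) = sup_{x in [-1,1]} (x^2 - 1 - lambda x) = |lambda|,
# the sup being attained at the endpoints x = +-1.
def d(lam, steps=2000):
    return max(x * x - 1.0 - lam * x
               for k in range(steps + 1)
               for x in [-1.0 + 2.0 * k / steps])

gaps = [abs(d(lam) - abs(lam)) for lam in (-2.0, -0.5, 0.0, 0.7, 3.0)]

primal_value = 1.0          # attained at xbar = 0
dual_value = -d(0.0)        # lambda = 0 minimizes d, so the dual value is 0
duality_gap = primal_value - dual_value   # equals 1
```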

1.4.1.3 Estimate of Duality Gap

We assume here that X is a convex subset of a Banach space X′ and that g(x) = Ax, with A ∈ L(X′, R^p). The convexification of f : X → R̄ is defined, for x ∈ conv(dom(f)), by

conv(f)(x) := inf { Σ_i α_i f(xⁱ) ;  Σ_i α_i xⁱ = x },    (1.323)

the infimum being over all finite families (α_i, xⁱ), i ∈ I, with α_i ≥ 0, Σ_i α_i = 1 and xⁱ ∈ X. This is the largest convex function minorizing f. It satisfies, for x ∈ X:

f(x) − conv(f)(x) ≤ ρ_X(f),  where ρ_X(f) := sup_{x∈X} ( f(x) − conv(f)(x) ).    (1.324)

Note that ρ_X(f) ≥ 0, with equality iff f is convex over X. We call it the estimate of lack of convexity of f over X. The perturbed relaxed primal problem, in this setting, for y ∈ R^p, reads

Min_{x∈X}  conv(f)(x);   Ax + y ∈ K.    (P_y)

In this setting the stability condition similar to (1.311) reads as

ε B ⊂ conv (A dom( f ) − K ) . (1.325)

Proposition 1.168 Let (1.325) hold, and val(P) be finite. Then

val(P) − val(D) ≤ ρ X ( f ). (1.326)

Proof If ρ X ( f ) = ∞ the conclusion is obvious. So, let us assume that ρ X ( f ) < ∞.


We easily check that (P) and (P0 ) have the same dual (a similar observation was
made after (1.321)). It follows that

val(D) ≤ val(P0 ) ≤ val(P) < ∞. (1.327)



By the stability condition (1.325), val(Py ) is finite near ȳ = 0. By Proposition 1.64,


val(Py ) is continuous at ȳ = 0. So, by Theorem 1.88, val(D) = val(P0 ). Therefore,

val(P) − val(D) = val(P) − val(P0 ) ≤ ρ X ( f ), (1.328)

where the last inequality follows from the definition of ρ X ( f ). The conclusion fol-
lows. 

Remark 1.169 Proposition 1.168 was obtained by Aubin and Ekeland [11, Thm. A].

We will next see how to improve this estimate in the case of decomposable problems.

1.4.2 Applications of the Shapley–Folkman Theorem

1.4.2.1 The Shapley–Folkman Theorem

We give a simple proof of this theorem, following [127].

Theorem 1.170 Let S_i, i = 1 to p, be nonempty subsets of Rⁿ, with p > n. Set S := S_1 + ⋯ + S_p. Then any x ∈ conv(S) has the representation x = Σ_{i=1}^p x_i, where x_i ∈ conv(S_i) for all i, and x_i ∈ S_i for at least p − n indices.
p
Proof Since a sum of convex sets is convex, and S ⊂ Σ_{i=1}^p conv(S_i), we have that conv(S) ⊂ Σ_{i=1}^p conv(S_i). So any x ∈ conv(S) has the representation x = Σ_{i=1}^p y_i, with y_i ∈ conv(S_i). By the definition of conv(S_i), there exist finite sets J_i, coefficients α_ij ≥ 0, j ∈ J_i, with Σ_{j∈J_i} α_ij = 1, and elements y_ij ∈ S_i, for all j ∈ J_i, such that y_i = Σ_{j∈J_i} α_ij y_ij. Define the following elements of R^{n+p}:
z      := (x^⊤,      1, 1, …, 1)^⊤,
z_{1j} := (y_{1j}^⊤, 1, 0, …, 0)^⊤,
z_{2j} := (y_{2j}^⊤, 0, 1, …, 0)^⊤,
  ⋮
z_{pj} := (y_{pj}^⊤, 0, 0, …, 1)^⊤.    (1.329)
Then z = Σ_{i=1}^p Σ_{j∈J_i} α_ij z_ij. Since any nonnegative combination of elements of R^{n+p} is a nonnegative combination of at most n + p of them,⁵ we have that

⁵ If the minimal number of terms in such a nonnegative combination were greater than n + p, then, adding a suitable linear combination of these elements equal to 0 with not all coefficients zero, we could easily find another nonnegative combination giving z with fewer nonzero coefficients, which would give a contradiction.

z = Σ_{i=1}^p Σ_{j∈J_i} β_ij z_ij with at most n + p nonzero β_ij. Since for all i, Σ_{j∈J_i} β_ij = 1, at least one β_ij is nonzero for each i, meaning that at most n indices i have more than one nonzero β_ij. As x = Σ_{i=1}^p Σ_{j∈J_i} β_ij y_ij, the conclusion follows. □
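The theorem is easy to visualize in the scalar case. In the sketch below (an illustrative instance with n = 1 and S_i = {0, 1}), any point of conv(S) = [0, p] is written as a sum of p components, all of which lie in conv(S_i) = [0, 1], and at most n = 1 of which lies outside S_i:

```python
# Shapley-Folkman with n = 1 and S_i = {0, 1}, i = 1..p:
# S = {0, 1, ..., p} while conv(S) = [0, p].  Any x in [0, p] splits as
# x = sum_i x_i with x_i in conv(S_i) = [0, 1] and at most n = 1
# component outside S_i itself.
p = 10
x = 4.3  # a point of conv(S) that is not in S

ones = int(x)             # this many components are set to 1 (in S_i)
frac = x - ones           # a single component in conv(S_i) \ S_i
xs = [1.0] * ones + [frac] + [0.0] * (p - ones - 1)

in_conv = all(0.0 <= xi <= 1.0 for xi in xs)
non_extreme = sum(1 for xi in xs if xi not in (0.0, 1.0))
total = sum(xs)
```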

1.4.2.2 Estimate of Duality Gap for Decomposable Problems

Consider again problem (P) of Sect. 1.4.1.1, that is,

Min f (x); g(x) ∈ K , (1.330)


x∈X

with K a convex subset of R^p, assuming that the constraints are linear and the cost function decomposable:

f(x) = Σ_{k=1}^N f_k(x_k);   g(x) = Σ_{k=1}^N g_k(x_k),   g_k(x_k) = A_k x_k,  k = 1, …, N,    (1.331)
(1.331)
with X = X_1 × ⋯ × X_N, X_k a closed convex subset of a Banach space X′_k, A_k ∈ L(X′_k, R^p), and x_k ∈ X_k for each k. The associated Lagrangian defined in (1.309) satisfies

L(x, λ) = Σ_{k=1}^N L_k(x_k, λ),  where L_k(x_k, λ) := f_k(x_k) + λ · g_k(x_k).    (1.332)

So, we have a decomposability property for the Lagrangian: the (opposite of the) dual cost satisfies

d(λ) = Σ_{k=1}^N d_k(λ),  where d_k(λ) := inf_{x_k∈X_k} L_k(x_k, λ).    (1.333)

We recall the definition of the measure of lack of convexity in (1.324), and set ρ_k := ρ_{X_k}(f_k), for k = 1 to N.

Proposition 1.171 Let (1.325) and (1.331) hold. Then

val(P) − val(D) ≤ max_{I⊂{1,…,N}} { Σ_{k∈I} ρ_k ;  |I| ≤ p + 1 } ≤ (p + 1) max_k ρ_k.    (1.334)

Proof Let S_k := {(f_k(x_k), g_k(x_k));  x_k ∈ X_k}, for k = 1 to N, and set S := S_1 + ⋯ + S_N. Write s ∈ S as (s′, s″), where s′ is the first component and s″ ∈ R^p. The relaxed problem has the same value as

Min_{s∈conv(S)}  s′;   s″ ∈ K.    (1.335)

Let s be feasible for this problem. By the Shapley–Folkman Theorem 1.170, we may assume that s_k = (f_k(x̄_k), g_k(x̄_k)), with x̄_k ∈ X_k, except for at most p + 1 indices, say the first p + 1. For k = 1 to p + 1, since X_k is convex, there exists an x̄_k ∈ X_k such that s″_k = A_k x̄_k. Then x̄ (which is well defined as an element of X) is feasible and satisfies f_k(x̄_k) ≤ s′_k + ρ_k, for k = 1 to p + 1. The result follows. □

Remark 1.172 This result is due to Aubin and Ekeland [11, Thm. D]. In this decomposable setting, it is easily checked that ρ_X(f) = Σ_{k=1}^N ρ_k. So, in general the duality estimate improves the one of Proposition 1.168 when p + 1 < N.

1.4.3 First-Order Optimality Conditions

While these notes are mainly devoted to convex problems, it is useful to discuss optimality conditions in the case of nonlinear equality constraints. The (general) result below will have an application in the theory of semidefinite programming, see the proof of Lemma 2.12. So, consider the following problem

Min_{x∈X}  f(x);   g(x) = 0,    (P)

where X and Y are Banach spaces, g : X → Y is of class C¹, and f : X → R is continuous and convex.

Definition 1.173 We say that x̄ ∈ X is a local solution of problem (P) if g(x̄) = 0


and f (x̄) ≤ f (x) whenever g(x) = 0 and x is close enough to x̄.

Theorem 1.174 Let x̄ be a local solution of (P), such that Dg(x̄) is onto. Then there exists a unique λ ∈ Y* such that 0 ∈ ∂f(x̄) + Dg(x̄)*λ.

The proof of the theorem is based on Liusternik’s theorem that essentially gives a
sufficient condition for an element of Ker Dg(x̃) to be a tangent direction to g −1 (0).

Theorem 1.175 Let x̃ be such that g(x̃) = 0 and Dg(x̃) is onto. Let h ∈ Ker Dg(x̃).
Then there exists a path R+ → X , t → x(t) such that x(t) = x̃ + th + o(t) and
g(x(t)) = 0.

Proof Set A := Dg(x̃) and denote by c(·) the modulus of continuity of Dg(x) at x̃, so that

‖Dg(x) − Dg(x̃)‖ ≤ c(r)  whenever ‖x − x̃‖ ≤ r.    (1.336)

By the open mapping Theorem 1.29, there exists a c_A > 0 such that, for any b ∈ Y, there exists an a ∈ X that satisfies Aa = b and ‖a‖ ≤ c_A‖b‖. So, given t > 0, consider the sequence x^k in X such that x⁰ = x̃ + th and

g(x^k) + A(x^{k+1} − x^k) = 0  and  ‖x^{k+1} − x^k‖ ≤ c_A‖g(x^k)‖,  k ≥ 0.    (1.337)



Set e^k := x^{k+1} − x^k. Then

g(x^{k+1}) = g(x^k) + ∫₀¹ Dg(x^k + σe^k) e^k dσ = ∫₀¹ ( Dg(x^k + σe^k) − A ) e^k dσ,    (1.338)

and therefore

‖g(x^{k+1})‖ ≤ ∫₀¹ ‖Dg(x^k + σe^k) − A‖ dσ ‖e^k‖ ≤ c_k ‖e^k‖ ≤ c_k c_A ‖g(x^k)‖,    (1.339)

where

c_k ≤ max( c(‖x^{k+1} − x̃‖), c(‖x^k − x̃‖) ).    (1.340)

Let R > 0 be such that c(‖x − x̃‖) ≤ 1/(2c_A) whenever ‖x − x̃‖ ≤ R. Let K be an integer such that for all 0 ≤ k ≤ K + 1, we have that ‖x^k − x̃‖ ≤ R. Then by induction we obtain that

‖g(x^{k+1})‖ ≤ 2^{−k−1} ‖g(x⁰)‖;   ‖e^k‖ ≤ 2^{−k} c_A ‖g(x⁰)‖,    (1.341)

and so

‖x^ℓ − x̃‖ ≤ ‖x⁰ − x̃‖ + 2c_A ‖g(x⁰)‖,  for all ℓ ≤ K + 1.    (1.342)

Now let x⁰ be such that ‖x⁰ − x̃‖ + 2c_A ‖g(x⁰)‖ ≤ R. The above relations imply that (1.341) holds for all k. Therefore x^k converges to some x^a such that g(x^a) = 0 and, in addition, ‖x^a − x⁰‖ ≤ 2c_A ‖g(x⁰)‖. Since ‖g(x⁰)‖ = ‖g(x̃ + th)‖ = o(t), the result follows by taking x(t) := x^a. □
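The iteration (1.337) is constructive and can be run on a concrete example. In the sketch below (an illustrative choice, not from the text) we take the circle constraint g(x) = x₁² + x₂² − 1 in R², with x̃ = (1, 0), A = Dg(x̃) = (2, 0) and h = (0, 1) ∈ Ker A; the minimum-norm solution of Aa = b is a = (b/2, 0). Starting from x̃ + th, the corrections converge to a feasible point x(t) whose distance to x̃ + th is of order t², hence o(t):

```python
# Correction scheme (1.337) on g(x) = x1^2 + x2^2 - 1 at xtilde = (1, 0),
# with A = (2, 0) and the tangent direction h = (0, 1).
def g(x):
    return x[0] ** 2 + x[1] ** 2 - 1.0

def correct(x, iters=30):
    # x^{k+1} = x^k - a, where a = (g(x^k)/2, 0) solves A a = g(x^k)
    for _ in range(iters):
        x = (x[0] - g(x) / 2.0, x[1])
    return x

t = 0.1
x0 = (1.0, t)             # xtilde + t h
xt = correct(x0)
residual = abs(g(xt))     # feasibility of the limit point
drift = abs(xt[0] - 1.0)  # distance to xtilde + t h, roughly t^2/2 = o(t)
```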

Proof (Proof of Theorem 1.174) (a) The difference of two multipliers belongs to the kernel of Dg(x̄)*. But Dg(x̄) being onto implies that its adjoint is injective. The uniqueness of the multiplier follows.
(b) We prove the existence of the multiplier. Given h ∈ Ker Dg(x̄), let x(t) be the associated feasible path provided by Theorem 1.175. As f is locally Lipschitz, we have that f(x(t)) = f(x̄ + th) + o(t). Since x̄ is a local solution, it follows that

0 ≤ lim_{t↓0} ( f(x(t)) − f(x̄) ) / t = lim_{t↓0} ( f(x̄ + th) − f(x̄) ) / t = f′(x̄, h).    (1.343)

Consider the convex problem

min_{h∈X}  f(x̄ + h);   Dg(x̄)h = 0.    (1.344)

If h is feasible, then 0 ≤ f′(x̄, h) ≤ f(x̄ + h) − f(x̄), and therefore h̄ = 0 is a solution of (1.344). This problem satisfies the stability condition since Dg(x̄) is onto. We conclude by applying Fenchel duality (Example 1.114), with here K = {0}, and so, N_K(g(x̄)) = Y*. □

1.5 Notes

Conjugate functions were introduced by Mandelbrojt [78] for functions on R, and in


the Rn setting by Fenchel [47]. The Fenchel conjugate, in the smooth case, reduces to
the Legendre transform. Then (as quoted in [93], which includes an extension of this
result) Fenchel stated a strong duality result [48] for problems which have a structure
corresponding to our Example 1.2.1.8, whence the name “Fenchel duality”.
Many extensions were obtained in the sixties, especially by Moreau in his university lecture notes and various notes to the French Academy of Sciences, synthesized in [82, 84], and by Rockafellar [93, 95], who introduced the technique of duality
through perturbations [100]. Some classical references, still worth consulting, are
the lecture notes by Moreau [83], and the books by Rockafellar [97] in the finite-
dimensional setting, and by Ekeland and Temam [46] for infinite-dimensional spaces.
Theorem 1.130 is a particular case of Sion’s theorem [117], in which hypotheses
of “quasi-convexity” and “quasi-concavity” are made, in a topological vector space
setting; see [64] for a simple proof.
The Attouch–Brézis theorem [10] has a weak qualification condition under which
the equality of primal and dual values hold, the dual problem having solutions.
About extensions of the perspective function, see Maréchal [79].
Chapter 2
Semidefinite and Semi-infinite
Programming

Summary This chapter discusses optimization problems in the cone of positive


semidefinite matrices, and the duality theory for such ‘linear’ problems. We relate
convex rotationally invariant matrix functions to convex functions of the spectrum;
this allows us to compute the conjugate of the logarithmic barrier function and the
dual of associate optimization problems. The semidefinite relaxation of problems
with nonconvex quadratic cost and constraints is presented. Second-order cone opti-
mization is shown to be a subclass of semidefinite programming.
The second part of the chapter is devoted to semi-infinite programming and its
dual in the space of measures with finite support, with application to Chebyshev
approximation and to one-dimensional polynomial optimization.

2.1 Matrix Optimization

This section is devoted to optimization problems in matrix spaces. We identify


L(R p , Rn ) with the vector space of matrices of size p × n, and denote by S n the
space of symmetric matrices of size n.

2.1.1 The Frobenius Norm

Let us endow L(R^p, Rⁿ) with the Frobenius norm and its associated scalar product; for p × n matrices A and B:

‖A‖_F := ( Σ_{i,j} A_ij² )^{1/2};   ⟨A, B⟩_F := Σ_{i,j} A_ij B_ij.    (2.1)

© Springer Nature Switzerland AG 2019


J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_2

In particular, let x and x′ be in Rⁿ, and y and y′ be in R^p. Denoting by “·” the Euclidean scalar product, we get:

⟨A, yx^⊤⟩_F = y^⊤Ax;   ⟨y′(x′)^⊤, yx^⊤⟩_F = (y′ · y)(x′ · x).    (2.2)

Let A and B belong to L(R^p, Rⁿ). Then

⟨A, B⟩_F = trace(AB^⊤) = trace(BA^⊤) = trace(A^⊤B) = trace(B^⊤A).    (2.3)

To prove this, it suffices to check the first relation and use the fact that the Frobenius scalar product is symmetric, together with the identity ⟨A, B⟩_F = ⟨A^⊤, B^⊤⟩_F.
Being the sum of the eigenvalues, the trace of a matrix is invariant under a change of basis: for all square matrices $M$ and $P$, with $P$ invertible, we have that
\[
\operatorname{trace}(M) = \operatorname{trace}(P^{-1} M P). \tag{2.4}
\]

In particular, let $Q$ and $\hat Q$ be orthonormal matrices of size $p$ and $n$ respectively, so that $Q^{-1} = Q^\top$ and $\hat Q^{-1} = \hat Q^\top$. Then
\[
\langle A, B\rangle_F = \operatorname{trace}(Q^\top A \hat Q\, \hat Q^\top B^\top Q) = \langle Q^\top A \hat Q,\, Q^\top B \hat Q\rangle_F. \tag{2.5}
\]

In other words, the Frobenius scalar product is invariant under orthonormal basis
changes in Rn and R p . In particular, let x 1 , . . . , x n be an orthonormal system (i.e.,
the columns of an orthonormal matrix Q). Then

\[
\|A\|_F^2 = \operatorname{trace}(Q^\top A^\top A Q) = \sum_i |Ax^i|^2. \tag{2.6}
\]

Consider now the case of symmetric matrices. We know that $A \in S^n$ can be diagonalized by an orthonormal basis change. Denoting by $\lambda_i(A)$ the eigenvalues of $A$, counted with their multiplicities and arranged in nonincreasing order, we obtain by (2.5):
\[
\|A\|_F^2 = \operatorname{trace}(A^2) = \sum_{i=1}^n \lambda_i(A)^2. \tag{2.7}
\]

Let $A$ and $B$ belong to $S^n$. Denote by $\lambda_i$ and $\mu_j$ their eigenvalues, and by $x^i$, $y^j$ orthonormal systems of associated eigenvectors. We get by (2.2):
\[
\langle A, B\rangle_F = \sum_{i,j=1}^n \lambda_i \mu_j\, (x^i \cdot y^j)^2. \tag{2.8}
\]
One easily deduces from this formula the following result:



Theorem 2.1 (Fejer) The symmetric square matrix $A$ is positive semidefinite iff $\langle A, B\rangle_F \ge 0$ for all symmetric positive semidefinite $B$.

Denote by $S^n_+$ the set of positive semidefinite matrices. By Fejer's theorem this is a selfdual cone (i.e., equal to its positive polar).

Proposition 2.2 Let $A \in S^n$, and let $Q$ be an orthonormal matrix such that $A = Q^\top D Q$, where $D$ is a diagonal matrix. Then the projection of $A$ onto $S^n_+$ in the Frobenius norm is $Q^\top D_+ Q$, where $D_+$ is the diagonal matrix with diagonal elements $(D_{ii})_+$, $i = 1$ to $n$.

Proof The Frobenius norm endows $S^n$ with a Hilbert space structure. The projection, say $B$, of $A$ onto the nonempty closed convex set $S^n_+$ is therefore well defined, and characterized by the relation
\[
B \in S^n_+; \qquad \langle B - A, C - B\rangle_F \ge 0, \quad \text{for all } C \in S^n_+, \tag{2.9}
\]
which is a consequence of (and in fact is equivalent to) the two relations $\langle B - A, C\rangle_F \ge 0$ for all $C \in S^n_+$, and $\langle B - A, B\rangle_F = 0$. Clearly, $Q^\top D_+ Q$ satisfies these relations (the first one by Fejer's theorem). □
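The eigenvalue-clipping recipe of Proposition 2.2 is easy to test. The sketch below (an illustration of ours, not from the book) computes the projection and verifies the variational inequality (2.9) against random PSD test matrices:

```python
import numpy as np

# Project a symmetric A onto S^n_+ by clipping negative eigenvalues, then
# check the characterization (2.9): <B - A, C - B>_F >= 0 for all PSD C.
rng = np.random.default_rng(1)
n = 4
M = rng.standard_normal((n, n))
A = (M + M.T) / 2                        # symmetric, generally indefinite

d, Q = np.linalg.eigh(A)                 # A = Q diag(d) Q^T
B = Q @ np.diag(np.maximum(d, 0)) @ Q.T  # projection onto the PSD cone

for _ in range(100):
    R = rng.standard_normal((n, n))
    C = R @ R.T                          # a random PSD matrix
    assert np.sum((B - A) * (C - B)) >= -1e-9
```

Note that $B - A$ has eigenvalues $(-d_i)_+ \ge 0$, so $\langle B - A, C\rangle_F \ge 0$ by Fejer's theorem, while $\langle B - A, B\rangle_F = 0$ since, for each $i$, one of the two factors $(d_i)_+ - d_i$ and $(d_i)_+$ vanishes.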

2.1.2 Positive Semidefinite Linear Programming

2.1.2.1 Framework

Positive semidefinite linear programs (SDP) are optimization problems of the form
\[
\min_{x \in \mathbb{R}^n} c \cdot x; \quad A_0 + \sum_{i=1}^n x_i A_i \succeq 0, \tag{SDP}
\]
where the $A_i$, $i = 0$ to $n$, are symmetric matrices of size $p$, and, given two symmetric matrices $A$ and $B$ of the same size, "$A \succeq B$" means that $A - B$ is positive semidefinite. (In a similar way, we will use $\succ$ to denote positive definiteness, and $\preceq$ and $\prec$ for negative semidefiniteness and negative definiteness resp.) Let us see how to reduce some optimization problems to the SDP format. It is trivial to reduce linear constraints to SDP constraints¹:
\[
Ax - b \le 0 \iff -\operatorname{diag}(Ax - b) \succeq 0.
\]

In the case of quadratic convex constraints such as

1 We denote by diag the operator that to a vector associates the diagonal matrix having this vector
for its diagonal, and also the operator that to a square matrix associates its diagonal.

\[
q(x) := (Ax + b)\cdot(Ax + b) - c\cdot x - d,
\]
we have that $q(x) \le 0$ iff
\[
\begin{pmatrix} I & Ax + b\\ (Ax+b)^\top & c\cdot x + d \end{pmatrix} \succeq 0.
\]

This is a trivial consequence of the following, easily proved lemma:

Lemma 2.3 (Schur lemma) Let $\mathcal{A} = \begin{pmatrix} A & B\\ B^\top & C\end{pmatrix}$, with $A$ and $C$ symmetric and $A$ invertible. Then
\[
\mathcal{A} \succeq 0 \iff \{A \succeq 0 \text{ and } C \succeq B^\top A^{-1} B\}.
\]

Example 2.4 Let $(X, x) \in S^n \times \mathbb{R}^n$. Then $X \succeq xx^\top$ iff $\begin{pmatrix} 1 & x^\top\\ x & X \end{pmatrix} \succeq 0$.
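Example 2.4 can be spot-checked numerically. The sketch below (our illustration, not from the book) verifies on random data that $X - xx^\top$ is PSD exactly when the bordered matrix is PSD:

```python
import numpy as np

# Numerical check of Example 2.4 (a direct use of Schur's lemma with the
# (1,1) block equal to 1 > 0): X >= x x^T iff [[1, x^T], [x, X]] >= 0.
rng = np.random.default_rng(2)
n = 3
for _ in range(200):
    x = rng.standard_normal(n)
    M = rng.standard_normal((n, n))
    X = (M + M.T) / 2
    lhs = np.min(np.linalg.eigvalsh(X - np.outer(x, x))) >= -1e-9
    big = np.block([[np.ones((1, 1)), x[None, :]], [x[:, None], X]])
    rhs = np.min(np.linalg.eigvalsh(big)) >= -1e-9
    assert lhs == rhs
```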
The following problem, with quadratic criterion and constraints:
\[
\min_{x \in \mathbb{R}^n} q_0(x); \quad q_i(x) \le 0, \ i = 1,\dots,p,
\]
is equivalent to the problem with a linear cost and quadratic constraints:
\[
\min_{(x,t)} t; \quad q_0(x) - t \le 0; \ q_i(x) \le 0, \ i = 1,\dots,p.
\]
This allows us to reduce problems with convex quadratic cost function and constraints to the SDP format. Another type of example is the minimization of the greatest eigenvalue:
\[
\min_{(x,t)} t; \quad tI - A(x) \succeq 0.
\]

2.1.2.2 Linear Duality

We next apply the duality theory to problem (SDP) of Sect. 2.1.2.1. It is a special case of linear conical optimization (Chap. 1, Sect. 1.3.2). However, we will derive the dual problem in a direct way. We have seen that, by Fejer's theorem 2.1, the polar cone of $S^p_+$ is $S^p_- := -S^p_+$.

Set $A(x) := A_0 + \sum_{i=1}^n x_i A_i$. The Lagrangian of problem (SDP) is
\[
L(x, \lambda) = c \cdot x + \langle \lambda, A(x)\rangle_F
\]
with $\lambda \in S^p$, i.e., $L(x, \lambda) = \langle \lambda, A_0\rangle_F + \sum_{i=1}^n (c_i + \langle \lambda, A_i\rangle_F)\, x_i$. The dual problem is therefore
\[
\max_{\lambda \in S^p_-} \langle A_0, \lambda\rangle_F; \quad c_i + \langle A_i, \lambda\rangle_F = 0, \ i = 1,\dots,n. \tag{DSDP}
\]

On the other hand, the family of perturbed problems associated with (SDP) is
\[
\min_{x \in \mathbb{R}^n} c \cdot x; \quad A_0 + \sum_{i=1}^n x_i A_i + y \succeq 0, \tag{SDP$_y$}
\]
with here $y \in S^p$. Set $v(y) := \operatorname{val}(SDP_y)$. The (strong duality) Corollary 1.92 implies the following.

Theorem 2.5 Assume that val(SDP) is finite, and that the following stability condition holds: there exists an $\hat x \in \mathbb{R}^n$ such that $A(\hat x) \succ 0$. Then
(a) we have the equality val(DSDP) = val(SDP),
(b) the set S(DSDP) is nonempty and bounded,
(c) for all $z \in S^p$, we have $v'(0, z) = \max\{\langle y^*, z\rangle_F;\ y^* \in S(DSDP)\}$.
By Lemma 1.85, the primal problem is also the dual of its dual. So, consider the perturbation of the equality constraints
\[
\max_{\lambda \in S^p_-} \langle A_0, \lambda\rangle_F; \quad c_i + \langle A_i, \lambda\rangle_F + h_i = 0, \ i = 1,\dots,n. \tag{DSDP$_h$}
\]
Here $h \in \mathbb{R}^n$ can be interpreted as a perturbation of the primal cost. Set $w(h) := \operatorname{val}(DSDP_h)$; this is a concave function. When $h = 0$, the bidual problem is nothing else than the primal problem (SDP). Applying the strong duality Corollary 1.92, we deduce that:

Theorem 2.6 Assume that val(DSDP) is finite, and that the following stability condition is satisfied: the family $A_1, \dots, A_n$ is linearly independent, and there exists a $\lambda \prec 0$ feasible for (DSDP). Then
(a) we have the equality val(DSDP) = val(SDP),
(b) the set S(SDP) is nonempty and bounded,
(c) for all $d \in \mathbb{R}^n$, we have $w'(0, d) = \min\{x \cdot d;\ x \in S(SDP)\}$.
The exercises below, taken from [125, Chap. 4], show important differences with
the duality theory for linear programming.
Exercise 2.7 Check that the following problem has no solution, despite the absence of a duality gap and the finiteness of the common value:
\[
\min x_1; \quad \begin{pmatrix} x_1 & 1\\ 1 & x_2 \end{pmatrix} \succeq 0.
\]
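The phenomenon in Exercise 2.7 can be illustrated numerically (our sketch, not from the book): along the feasible curve $x_1 = 1/x_2$ the cost tends to 0, but the point $x_1 = 0$ is infeasible.

```python
import numpy as np

# Along x1 = 1/x2 the matrix [[x1, 1], [1, x2]] stays PSD (determinant 0,
# positive trace) while the cost x1 decreases to 0; the infimum 0 is not attained.
def feasible(x1, x2):
    return np.min(np.linalg.eigvalsh(np.array([[x1, 1.0], [1.0, x2]]))) >= -1e-9

vals = []
for x2 in [1e1, 1e3, 1e5]:
    x1 = 1.0 / x2
    assert feasible(x1, x2)
    vals.append(x1)

assert vals[0] > vals[1] > vals[2] > 0    # cost decreases toward 0
assert not feasible(0.0, 1e6)             # but x1 = 0 itself is infeasible
```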

Exercise 2.8 Show that the following problem has a nonzero duality gap, although both the primal and dual problems have solutions:
\[
\min x_2; \quad \begin{pmatrix} x_2 + 1 & 0 & 0\\ 0 & x_1 & x_2\\ 0 & x_2 & 0 \end{pmatrix} \succeq 0.
\]
Hint: check that the feasible set is $\mathbb{R}_+ \times \{0\}$, and so the primal value is 0, while the dual is
\[
\max_{\lambda \in S^3_-} \lambda_{11}; \quad \lambda_{22} = 0, \quad 1 + \lambda_{11} + 2\lambda_{23} = 0,
\]
and therefore any dual feasible $\lambda$ satisfies $\lambda_{23} = 0$, hence $\lambda_{11} = -1$, so that the dual value is $-1$.

2.2 Rotationally Invariant Matrix Functions

2.2.1 Computation of the Subdifferential

Let $F$ be a mapping from $S^n$ into $\overline{\mathbb{R}}$. One says that $F$ is rotationally invariant if, for all orthonormal matrices $Q$ of size $n$, we have
\[
F(M) = F(Q M Q^\top), \quad \text{for all } M \in S^n. \tag{2.10}
\]
Let $f : \mathbb{R}^n \to \overline{\mathbb{R}}$. One says that $f$ is symmetric if, for every permutation $\pi$ of $\{1,\dots,n\}$ (i.e., a bijective mapping from $\{1,\dots,n\}$ into itself), we have
\[
f(x_1, \dots, x_n) = f(x_{\pi_1}, \dots, x_{\pi_n}), \quad \text{for all } x \in \mathbb{R}^n. \tag{2.11}
\]
Let us recall that we denote by $\lambda_1(M), \dots, \lambda_n(M)$ the eigenvalues of $M \in S^n$ in nonincreasing order, and we set $\lambda(M) := (\lambda_1(M), \dots, \lambda_n(M))^\top$.

Lemma 2.9 The function $F : S^n \to \overline{\mathbb{R}}$ is rotationally invariant iff there exists a symmetric function $f : \mathbb{R}^n \to \overline{\mathbb{R}}$ such that
\[
F(M) = f(\lambda_1(M), \dots, \lambda_n(M)) \quad \text{for all } M \in S^n. \tag{2.12}
\]

Proof Let $F$ be rotationally invariant. We can choose $Q$ in such a way that $Q M Q^\top$ is a diagonal matrix whose diagonal elements are the eigenvalues of $M$, arranged in an arbitrary order. It follows that $F$ is a symmetric function of the spectrum of $M$, whence (2.12). The converse is immediate. □

We will call f the spectral function associated with F. Let us see how to compute
the Fenchel conjugate of a rotationally invariant function.

Theorem 2.10 Let F : S n → R̄ be rotationally invariant, and f the associated


spectral function. Then (i) the Fenchel conjugate of F is rotationally invariant, with
associated spectral function f ∗ , the Fenchel conjugate of f , (ii) the function F is
convex, l.s.c. and proper iff f is so.

The first step of the proof deals with the cone of nonincreasing vectors:
\[
K_d := \{x \in \mathbb{R}^n;\ x_1 \ge x_2 \ge \cdots \ge x_n\}. \tag{2.13}
\]

Lemma 2.11 (i) The polar of the cone of nonincreasing vectors is
\[
K_d^- = \Big\{y \in \mathbb{R}^n;\ \sum_{i=1}^{j} y_i \le 0, \ j = 1,\dots,n-1;\ \sum_{i=1}^{n} y_i = 0\Big\}. \tag{2.14}
\]

(ii) In addition, if $x \in K_d$ and $y \in K_d^-$, then $x \cdot y = 0$ iff
\[
(x_{i-1} - x_i) \sum_{k=1}^{i-1} y_k = 0, \quad i = 2,\dots,n. \tag{2.15}
\]
(iii) If $x$ and $z$ are elements of $K_d$, and $P$ is a permutation matrix, then $y := Pz - z$ belongs to $K_d^-$, and $x^\top y = 0$ iff there exists a permutation matrix $Q$ such that
\[
Qx = x; \qquad QPz = z. \tag{2.16}
\]

Proof If $x$ and $y$ belong to $\mathbb{R}^n$, we have
\[
x^\top y = (x_1 - x_2)\, y_1 + (x_2 - x_3)(y_1 + y_2) + \cdots + (x_{n-1} - x_n) \sum_{k=1}^{n-1} y_k + x_n \sum_{k=1}^{n} y_k. \tag{2.17}
\]
It follows that the r.h.s. of (2.14) is included in $K_d^-$. Conversely, let $1 \le p \le n-1$, $y \in K_d^-$, and let $x \in \mathbb{R}^n$ have its first $p$ coordinates equal to 1 and the other ones equal to 0. Then $x \in K_d$, and so $0 \ge x^\top y = \sum_{k=1}^{p} y_k$. Choosing $x = \pm\mathbf{1}$ (the vector of all ones), we obtain $0 \ge \pm\sum_{k=1}^{n} y_k$, whence (i). Point (ii) is an immediate consequence of (i) and (2.17). Let us show (iii). By the definition of $y$ and (i), it is clear that $y \in K_d^-$. If (2.16) is satisfied, then
\[
x^\top P z = x^\top Q^\top Q P z = (Qx)^\top z = x \cdot z \tag{2.18}
\]
and so $x \cdot y = 0$. Let us show the converse. By (ii), $x \cdot y = 0$ iff (2.15) is satisfied. Let $\mathcal{I}$ be the set of equivalence classes of components of $x$, and let $Q$ be a permutation; then $Qx = x$ iff every $I \in \mathcal{I}$ is stable under $Q$. In particular, there exists a permutation $Q$ for which $QPz$ is nonincreasing over each $I \in \mathcal{I}$. If $i(I)$ denotes the smallest index of each class, we observe that (2.15) is equivalent to $\sum_{k=1}^{i(I)-1} y_k = 0$ for all $I \in \mathcal{I}$, and so $\sum_{i \in I} y_i = 0$, $I \in \mathcal{I}$, i.e., $\sum_{i \in I} (QPz)_i = \sum_{i \in I} z_i$, for all $I \in \mathcal{I}$. But $QPz \le z$, and these two vectors have nonincreasing components over each $I \in \mathcal{I}$; they are therefore equal. □

Lemma 2.12 Let $X$ and $Y$ belong to $S^n$. Then
\[
\langle X, Y\rangle_F \le \lambda(X) \cdot \lambda(Y), \tag{2.19}
\]

with equality iff there exists an orthonormal matrix $U$ diagonalizing these two matrices, and such that
\[
U^\top X U = \operatorname{diag}(\lambda(X)); \qquad U^\top Y U = \operatorname{diag}(\lambda(Y)). \tag{2.20}
\]

Proof Consider the optimization problem
\[
\max_{Z \in M^n} \operatorname{trace}(Z^\top X Z Y); \quad I - Z^\top Z = 0, \tag{2.21}
\]
where $I$ is the identity in $\mathbb{R}^n$. We take $S^n$ (and not $M^n$) as the constraint space. The feasible set is the set of orthonormal matrices, which is compact. So, the above problem has (at least) one solution $\bar Z$. Let us check that the derivative of the constraints is surjective at this point. Indeed, the linearized equation
\[
-\bar Z^\top W - W^\top \bar Z = A, \tag{2.22}
\]
where $A \in S^n$, has the solution $W = -\tfrac12 \bar Z A$. The Lagrangian of this problem can be expressed as
\[
\operatorname{trace}\big(Z^\top X Z Y + \Lambda - Z^\top Z \Lambda\big). \tag{2.23}
\]

By Theorem 1.174, there exists a unique Lagrange multiplier $\Lambda \in S^n$ such that the above Lagrangian has a zero derivative w.r.t. $Z$ at $\bar Z$. In other words, for all $W \in M^n$, we have that
\[
\operatorname{trace}\big(W^\top X \bar Z Y + \bar Z^\top X W Y - W^\top \bar Z \Lambda - \bar Z^\top W \Lambda\big) = 0. \tag{2.24}
\]
Using (2.3), we obtain the equivalent expression
\[
\operatorname{trace}\big(W^\top (X \bar Z Y - \bar Z \Lambda)\big) + \operatorname{trace}\big((Y \bar Z^\top X - \Lambda \bar Z^\top)\, W\big) = 0. \tag{2.25}
\]
Set $M := X \bar Z Y - \bar Z \Lambda$. Taking $W = M$, we obtain $0 = \operatorname{trace}(M^\top M) = \|M\|_F^2$, and so $M = 0$. It follows that
\[
\bar Z^\top X \bar Z Y = \Lambda = \Lambda^\top = Y \bar Z^\top X \bar Z. \tag{2.26}
\]

This means that $Y$ and $\bar Z^\top X \bar Z$ commute. So there exists [60] an orthonormal matrix $V$ diagonalizing these two matrices, which means that
\[
V^\top Y V = \operatorname{diag}(P_1 \lambda(Y)); \qquad
V^\top \bar Z^\top X \bar Z V = \operatorname{diag}(P_2 \lambda(\bar Z^\top X \bar Z)) = \operatorname{diag}(P_2 \lambda(X)), \tag{2.27}
\]
where $P_1$ and $P_2$ are permutation matrices. We can assume that $P_2 = I$. We then get, since $\bar Z$ is a solution of (2.21), that
\[
\operatorname{trace}(XY) \le \operatorname{trace}(\bar Z^\top X \bar Z Y) = \operatorname{trace}(V^\top \bar Z^\top X \bar Z V\, V^\top Y V) = \lambda(X)^\top P_1 \lambda(Y). \tag{2.28}
\]
By Lemma 2.11(iii), we have $\lambda(X)^\top P_1 \lambda(Y) \le \lambda(X) \cdot \lambda(Y)$, with equality iff there exists a permutation matrix $Q$ leaving $\lambda(X)$ invariant, and such that $Q P_1 \lambda(Y) = \lambda(Y)$. Then $U := V Q^\top$ satisfies (2.20). Indeed, using (2.27), we get (leaving the details of the last equality in (2.29) to the reader)
\[
U^\top Y U = Q V^\top Y V Q^\top = Q \operatorname{diag}(P_1 \lambda(Y))\, Q^\top = \operatorname{diag}(\lambda(Y)) \tag{2.29}
\]
and
\[
U^\top X U = Q V^\top X V Q^\top = Q \operatorname{diag}(\lambda(X))\, Q^\top = \operatorname{diag}(\lambda(X)). \tag{2.30}
\]
Conversely, if (2.20) is satisfied, it is clear that equality holds in (2.19). □
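Inequality (2.19) and its equality case can be tested numerically. The following sketch (ours, not from the book) checks the inequality on random symmetric pairs and the equality when $X$ and $Y$ share an ordered eigenbasis:

```python
import numpy as np

# Check of Lemma 2.12: <X, Y>_F <= lambda(X) . lambda(Y), with equality
# when X and Y are diagonalized by the same ordered orthonormal basis.
rng = np.random.default_rng(3)
n = 4
for _ in range(50):
    X = rng.standard_normal((n, n)); X = (X + X.T) / 2
    Y = rng.standard_normal((n, n)); Y = (Y + Y.T) / 2
    lx = np.sort(np.linalg.eigvalsh(X))[::-1]   # nonincreasing spectra
    ly = np.sort(np.linalg.eigvalsh(Y))[::-1]
    assert np.sum(X * Y) <= lx @ ly + 1e-9

# Equality case: Y = h(X) for a monotone h preserves the ordered eigenbasis.
X = rng.standard_normal((n, n)); X = (X + X.T) / 2
w, U = np.linalg.eigh(X)
Y = U @ np.diag(w**3) @ U.T                     # same eigenvectors, same order
lx, ly = np.sort(w)[::-1], np.sort(w**3)[::-1]
assert abs(np.sum(X * Y) - lx @ ly) < 1e-8
```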

Proof (Proof of Theorem 2.10) By Lemma 2.12, we have that
\[
F^*(Y) = \sup_{X \in S^n} \{\langle X, Y\rangle_F - F(X)\} \le \sup_{X \in S^n} \{\lambda(X) \cdot \lambda(Y) - f(\lambda(X))\} \le f^*(\lambda(Y)). \tag{2.31}
\]
Taking $Y = U^\top \operatorname{diag}(\lambda(Y))\, U$, with $U$ orthonormal, and $X$ of the form $U^\top \operatorname{diag}(x)\, U$, with $x \in \mathbb{R}^n$, we get
\[
F^*(Y) \ge \sup_{x \in \mathbb{R}^n} \{\langle U^\top \operatorname{diag}(x)\, U, Y\rangle_F - F(U^\top \operatorname{diag}(x)\, U)\}
= \sup_{x \in \mathbb{R}^n} \{x^\top \lambda(Y) - f(x)\} = f^*(\lambda(Y)). \tag{2.32}
\]
By (2.31)-(2.32), $f^*$ is the spectral function associated with $F^*$, whence (i).
If $F$ is convex, l.s.c. and proper, it is, by Theorem 1.44, equal to its biconjugate $F^{**}$, which by (i) has the spectral function $f^{**}$; hence $f = f^{**}$. Therefore $f$ is convex and l.s.c., and proper since $F$ is. Conversely, if $f$ is convex, l.s.c. and proper, then $F = F^{**}$ by (i), so $F$ is convex and l.s.c., and proper since $f$ is; whence (ii). □

One deduces from the previous results an expression for the subdifferential of a
rotationally invariant function.

Proposition 2.13 Let F : S n → R̄ be rotationally invariant, f be the associated


spectral function, and let X ∈ S n be such that F(X ) ∈ R. Then Y ∈ ∂ F(X ) iff the
following two relations are satisfied: (i) λ(Y ) ∈ ∂ f (λ(X )) and (ii) there exists an
orthonormal matrix U satisfying (2.20).

Proof The Fenchel–Young inequality ensures that $Y \in \partial F(X)$ iff $F(X) + F^*(Y) = \langle X, Y\rangle_F$, which is equivalent to $f(\lambda(X)) + f^*(\lambda(Y)) = \langle X, Y\rangle_F$. Now Lemma 2.12, combined with the Fenchel–Young inequality, ensures that this equality is satisfied iff $\lambda(Y) \in \partial f(\lambda(X))$ and there exists an orthonormal matrix $U$ satisfying (2.20). The conclusion follows. □

2.2.2 Examples

When applying the previous results, it is convenient to rewrite (2.20) in the form
\[
X = U \operatorname{diag}(\lambda(X))\, U^\top; \qquad Y = U \operatorname{diag}(\lambda(Y))\, U^\top. \tag{2.33}
\]
The columns of $U$ form an orthonormal basis of eigenvectors of $X$, ordered by nonincreasing eigenvalues. We will speak of an ordered basis. The condition on $Y$ is therefore that at least one ordered basis for $X$ is also an ordered basis for $Y$. Denote by $U_i$ the $i$th column of $U$. Then
\[
Y = U \operatorname{diag}(\lambda(Y))\, U^\top = \sum_{i=1}^n \lambda_i(Y)\, U_i U_i^\top. \tag{2.34}
\]

Example 2.14 Let $q$ be a nonnegative integer, and let the function $F : S^n \to \mathbb{R}$ be defined by $F(X) := \operatorname{trace}(X^q)$. Since $\lambda(X^q) = \lambda(X)^q$ (the power of the vector being taken componentwise), we get $F(X) = \sum_{i=1}^n \lambda_i(X)^q$, and so the associated spectral function is $f(x) = \sum_{i=1}^n x_i^q$.
If $q$ is even, then $f$ is convex and (since it is differentiable) its subdifferential reduces to its derivative. We get, for $U$ orthonormal satisfying (2.20):
\[
DF(X) = U \operatorname{diag}\big(q\,\lambda(X)^{q-1}\big)\, U^\top = q X^{q-1}. \tag{2.35}
\]
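Formula (2.35) lends itself to a finite-difference test. The sketch below (our illustration, not from the book) checks the gradient of $F(X) = \operatorname{trace}(X^q)$ against $qX^{q-1}$:

```python
import numpy as np

# Finite-difference check of (2.35): for F(X) = trace(X^q), q even,
# the gradient in the Frobenius scalar product is q X^(q-1).
rng = np.random.default_rng(4)
n, q = 4, 4
M = rng.standard_normal((n, n)); X = (M + M.T) / 2
H = rng.standard_normal((n, n)); H = (H + H.T) / 2   # a symmetric direction

F = lambda A: np.trace(np.linalg.matrix_power(A, q))
grad = q * np.linalg.matrix_power(X, q - 1)

eps = 1e-6
fd = (F(X + eps * H) - F(X - eps * H)) / (2 * eps)   # central difference
assert abs(fd - np.sum(grad * H)) < 1e-4
```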

Example 2.15 The function $F(X) := \lambda_1(X)$ (greatest eigenvalue) has the associated spectral function $f(x) = \max_i x_i$. If $x \in \mathbb{R}^n$ has nonincreasing components, and $x_1 = \cdots = x_p > x_{p+1}$, it follows from Lemma 1.140 that
\[
\partial f(x) = \Big\{y \in \mathbb{R}^n_+;\ \sum_{i=1}^n y_i = 1;\ y_i = 0,\ i > p\Big\}. \tag{2.36}
\]
Denote by $\mathcal{U}$ the set of orthonormal matrices whose first $p$ columns form a basis of the eigenspace $E_1$ associated with $\lambda_1(X)$. By (2.34), $Y \in \partial F(X)$ iff, for a certain $U \in \mathcal{U}$:
\[
Y = \sum_{i=1}^p \alpha_i\, U_i U_i^\top; \qquad \alpha \in \mathbb{R}^p_+, \quad \sum_{i=1}^p \alpha_i = 1. \tag{2.37}
\]
Setting
\[
P_p = \Big\{\alpha \in \mathbb{R}^p_+;\ \sum_{i=1}^p \alpha_i = 1\Big\}, \tag{2.38}
\]
we deduce the directional derivative formula


\[
\lambda_1'(X, Z) = \max\{\langle Y, Z\rangle_F;\ Y \in \partial\lambda_1(X)\}
= \max\Big\{\sum_{i=1}^p \alpha_i\, U_i^\top Z U_i;\ \alpha \in P_p,\ U \in \mathcal{U}\Big\}
= \max\{\lambda_1(U_{1:p}^\top Z U_{1:p});\ U \in \mathcal{U}\}. \tag{2.39}
\]
We have proved the following:
\[
\left\{
\begin{array}{l}
\text{The directional derivative of the greatest eigenvalue of } X\\
\text{in the direction } Z \text{ is the greatest eigenvalue}\\
\text{of the restriction of } Z \text{ (seen as a quadratic form) to } E_1.
\end{array}
\right. \tag{2.40}
\]
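In the generic case of a simple top eigenvalue ($p = 1$), statement (2.40) reduces to $\lambda_1'(X, Z) = u_1^\top Z u_1$, which is easy to verify by finite differences (our sketch, not from the book):

```python
import numpy as np

# Finite-difference check of (2.40) when the top eigenvalue is simple:
# the directional derivative of lambda_1 at X in direction Z is u1^T Z u1,
# with u1 a unit eigenvector for lambda_1(X).
rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n)); X = (M + M.T) / 2
Z = rng.standard_normal((n, n)); Z = (Z + Z.T) / 2

w, U = np.linalg.eigh(X)          # eigh: eigenvalues in ascending order
u1 = U[:, -1]                     # eigenvector of the largest eigenvalue
pred = u1 @ Z @ u1

eps = 1e-6
lam = lambda A: np.linalg.eigvalsh(A)[-1]
fd = (lam(X + eps * Z) - lam(X - eps * Z)) / (2 * eps)
assert abs(fd - pred) < 1e-5
```

For a random symmetric matrix the top eigenvalue is simple with probability one, which is why this one-dimensional version of (2.40) applies.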

2.2.3 Logarithmic Penalty

2.2.3.1 Logarithmic Barrier Function

Set $\mathbb{R}_{++} := (0, \infty)$, $\mathbb{R}_{--} := -\mathbb{R}_{++}$. The function
\[
f(\lambda) := -\sum_{i=1}^n \log \lambda_i \ \text{ if } \lambda_i > 0,\ i = 1,\dots,n; \quad +\infty \text{ otherwise}, \tag{2.41}
\]
is l.s.c., convex, and differentiable over its domain $\mathbb{R}^n_{++}$. The associated matrix function, called the logarithmic barrier of the cone $S^n_+$, is
\[
F(X) := -\log \det X \ \text{ if } X \succ 0; \quad +\infty \text{ otherwise}. \tag{2.42}
\]

By the above theory, its derivative is the negative of the inverse of $X$:
\[
DF(X) = U \operatorname{diag}\big({-\lambda(X)^{-1}}\big)\, U^\top = -X^{-1}, \tag{2.43}
\]
where again the inversion of the vector is computed componentwise. Since the conjugate of $f(t) = -\log t$ (with domain $\mathbb{R}_{++}$) is $f^*(t^*) = -1 - \log(-t^*)$ (with domain $\mathbb{R}_{--}$), the conjugate of $f(x) = -\sum_{i=1}^n \log x_i$ (with domain $\mathbb{R}^n_{++}$) is $f^*(x^*) = -n - \sum_{i=1}^n \log(-x_i^*)$ (with domain $\mathbb{R}^n_{--}$), and the conjugate function of $F$ is
\[
F^*(Y^*) = -n - \log \det(-Y^*), \tag{2.44}
\]
whose domain is the set of negative definite symmetric matrices of size $n$.
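Both (2.43) and (2.44) can be checked numerically. The sketch below (ours, not from the book) tests the gradient by finite differences and evaluates the conjugate at the maximizing point $X = -Y^{-1}$:

```python
import numpy as np

# Checks for the log barrier F(X) = -log det X on X > 0:
# (2.43) DF(X) = -X^{-1}, and (2.44) F*(Y) = -n - log det(-Y),
# the sup defining F* being attained at X = -Y^{-1}.
rng = np.random.default_rng(6)
n = 3
R = rng.standard_normal((n, n))
X = R @ R.T + n * np.eye(n)                 # positive definite
H = rng.standard_normal((n, n)); H = (H + H.T) / 2

F = lambda A: -np.log(np.linalg.det(A))
eps = 1e-6
fd = (F(X + eps * H) - F(X - eps * H)) / (2 * eps)
assert abs(fd - np.sum(-np.linalg.inv(X) * H)) < 1e-5

Y = -np.linalg.inv(X)                       # negative definite; X = -Y^{-1}
conj_at_X = np.sum(X * Y) - F(X)            # <X, Y>_F - F(X)
assert abs(conj_at_X - (-n - np.log(np.linalg.det(-Y)))) < 1e-8
```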

2.2.3.2 Central Trajectory

The logarithmic barrier allows the extension to linear SDP problems of the interior point algorithms for linear programming. Here we just give a brief discussion of the penalized problem. With problem (SDP) of Sect. 2.1.2.1 we associate a problem with logarithmic penalty, where $\mu > 0$ is the penalty parameter, setting $A(x) := A_0 + \sum_{i=1}^n x_i A_i$:
\[
\min_{x \in \mathbb{R}^n} c \cdot x - \mu \log \det(A(x)). \tag{SDP$_\mu$}
\]

We apply Fenchel's duality (Chap. 1, Sect. 1.2.1.8), taking into account that (i) the conjugate of $x \mapsto c \cdot x$ is the indicatrix of $\{c\}$, (ii) $F^*(\mu^{-1} Y) = F^*(Y) + n \log \mu$, (iii) the conjugate of $F_1(A) := \mu F(A_0 + A)$ is
\[
\begin{aligned}
F_1^*(Y) &= \sup_{A \in S^n} \{\langle Y, A\rangle_F - \mu F(A_0 + A)\}\\
&= \sup_{A \in S^n} \{-\langle Y, A_0\rangle_F + \mu\,(\langle \mu^{-1} Y, A + A_0\rangle_F - F(A_0 + A))\}\\
&= -\langle Y, A_0\rangle_F + \mu F^*(\mu^{-1} Y)\\
&= -n\mu(1 - \log \mu) - \langle Y, A_0\rangle_F - \mu \log \det(-Y).
\end{aligned} \tag{2.45}
\]

The dual problem is therefore
\[
\max_{Y \in S^n_{--}} n\mu(1 - \log \mu) + \langle A_0, Y\rangle_F + \mu \log \det(-Y); \quad c_i = -\langle A_i, Y\rangle_F, \ i = 1,\dots,n. \tag{DSDP$_\mu$}
\]
It is usually written in terms of $S = -Y$ as
\[
\max_{S \in S^n_{++}} n\mu(1 - \log \mu) - \langle A_0, S\rangle_F + \mu \log \det S; \quad c_i = \langle A_i, S\rangle_F, \ i = 1,\dots,n. \tag{DSDP$_\mu$}
\]
The optimality condition can be written as
 

n 
n
c·x+ I{ci } (−Ai , Y  F ) + xi Ai , Y  F
(2.46)
 i=1 i=1 
+ μ − log det(A(x)) − n − log det(−μ−1 Y ) − A(x), μ−1 Y  F = 0.

Each row corresponds to an equality in the Fenchel–Young inequality for f (x) :=


c · x and F resp., and by (2.43), the above display is equivalent to, using the variable
S rather than Y and denoting by Id the identity matrix:

S A(x) = μId ; S 0; A(x) 0; ci = Ai , S F , i = 1, . . . , n. (2.47)

One may prefer to rewrite the first relation in a symmetrized form (which is equivalent
since S A(x) = μId implies that S and A(x) commute):

S A(x) + A(x)S = μId ; S 0; A(x) 0; ci = Ai , S F , i = 1, . . . , n.


(2.48)
See [125] for how to solve this system by efficient algorithms.

2.3 SDP Relaxations of Nonconvex Problems

2.3.1 Relaxation of Quadratic Problems

In this section we study a problem with quadratic criterion and constraints:
\[
\min_{x \in \mathbb{R}^n} f_0(x); \quad f_i(x) \le 0, \ i = 1,\dots,p, \tag{QCP}
\]
with
\[
f_i(x) = \tfrac12 x^\top A^i x + b^i \cdot x + c_i, \quad i = 0,\dots,p, \tag{2.49}
\]
where the $A^i$, $b^i$ and $c_i$ are given in $S^n$, $\mathbb{R}^n$ and $\mathbb{R}$ respectively; we can assume that $c_0 = 0$. We already discussed this problem in the case when the $A^i$ are positive semidefinite; here we make no such hypothesis, so that problem (QCP) is in general nonconvex. We can write it in the form
\[
\min_{x \in \mathbb{R}^n,\ X \in S^n} \tfrac12 \langle A^0, X\rangle_F + b^0 \cdot x; \quad \tfrac12 \langle A^i, X\rangle_F + b^i \cdot x + c_i \le 0, \ i = 1,\dots,p; \quad X = x x^\top. \tag{QCP$'$}
\]
We will call the SDP relaxation of problem (QCP) the variant of the formulation (QCP$'$) in which the constraint $X = x x^\top$ is relaxed to $X \succeq x x^\top$. By Example 2.4, an equivalent formulation of the relaxed problem is
\[
\min_{x \in \mathbb{R}^n,\ X \in S^n} \tfrac12 \langle A^0, X\rangle_F + b^0 \cdot x; \quad \tfrac12 \langle A^i, X\rangle_F + b^i \cdot x + c_i \le 0, \ i = 1,\dots,p; \quad
\begin{pmatrix} 1 & x^\top\\ x & X \end{pmatrix} \succeq 0. \tag{RQCP}
\]
The SDP relaxation is therefore a linear SDP problem.

Remark 2.16 It may happen that $b^i = 0$ for $i = 0$ to $p$. It is then optimal to choose $x = 0$ in the SDP relaxation, and the SDP constraint reduces to $X \succeq 0$.

Proposition 2.17 We have that val(RQCP) ≤ val(QCP). If in addition the matrices $A^i$, $i = 0$ to $p$, are positive semidefinite (in other words, if the criterion and the constraints are convex), then val(RQCP) = val(QCP).

Proof Since problem (RQCP) has the same criterion as (QCP$'$), and a larger feasible set, we have that val(RQCP) ≤ val(QCP). If the matrices $A^i$, $i = 0$ to $p$, are positive semidefinite, let $(x, X) \in F(RQCP)$. Define $X' := x x^\top$ and $\varphi(x, X) := \tfrac12 \langle A^0, X\rangle_F + b^0 \cdot x$. Since $X \succeq X'$, by Fejer's theorem, $\langle A^i, X'\rangle_F \le \langle A^i, X\rangle_F$, so that $(x, X') \in F(QCP')$, and $\varphi(x, X') \le \varphi(x, X)$; so val(QCP$'$) ≤ val(RQCP) and the result follows. □

In the sequel we will show that the SDP relaxation is strongly related to classical duality, whose discussion needs a generalization of the Schur lemma 2.3. We introduce the pseudoinverse of $A \in S^n$,
\[
A^\dagger := \sum_{i=1}^n \lambda_i^\dagger\, x_i x_i^\top, \tag{2.50}
\]
where the $x_i$ form an orthonormal basis of eigenvectors of $A$, the $\lambda_i$ are the associated eigenvalues, and
\[
\lambda_i^\dagger = \lambda_i^{-1} \ \text{if } \lambda_i \ne 0, \ \text{and } 0 \text{ otherwise}. \tag{2.51}
\]
We leave the (easy) proof of the next lemma as an exercise.

Lemma 2.18 (Generalized Schur lemma) Let $\mathcal{A} = \begin{pmatrix} A & B\\ B^\top & C \end{pmatrix}$, with $A$ and $C$ symmetric. Then
\[
\mathcal{A} \succeq 0 \iff \{A \succeq 0, \ C \succeq B^\top A^\dagger B, \text{ and } \operatorname{Im}(B) \subset \operatorname{Im}(A)\}.
\]
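Lemma 2.18 can be illustrated on a small singular example (our sketch, not from the book; the particular matrices are chosen for illustration only):

```python
import numpy as np

# Illustration of the generalized Schur lemma with singular A = diag(1, 0):
# the block matrix is PSD iff A >= 0, C >= B^T A^+ B, and Im(B) ⊂ Im(A)
# (here: the second row of B must vanish).
A = np.diag([1.0, 0.0])
B = np.array([[0.5, 0.2], [0.0, 0.0]])      # Im(B) ⊂ Im(A)
C = B.T @ np.linalg.pinv(A) @ B + 0.1 * np.eye(2)   # C strictly above B^T A^+ B

big = np.block([[A, B], [B.T, C]])
assert np.min(np.linalg.eigvalsh(big)) >= -1e-9     # all three conditions hold

# If Im(B) is not contained in Im(A), PSD fails even for a large C:
B_bad = np.array([[0.5, 0.2], [0.3, 0.0]])          # second row hits ker(A)
big_bad = np.block([[A, B_bad], [B_bad.T, 100 * np.eye(2)]])
assert np.min(np.linalg.eigvalsh(big_bad)) < 0
```

The range condition is what distinguishes the singular case: with $A$ invertible it holds automatically and Lemma 2.18 reduces to Lemma 2.3.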

The Lagrangian of problem (QCP) is
\[
L(x, \lambda) = \tfrac12 x^\top A(\lambda)\, x + b(\lambda) \cdot x + c(\lambda),
\]
where $\lambda \in \mathbb{R}^p$ and (setting $\lambda_0 = 1$)
\[
A(\lambda) = \sum_{i=0}^p \lambda_i A^i; \qquad b(\lambda) = \sum_{i=0}^p \lambda_i b^i; \qquad c(\lambda) = \sum_{i=1}^p \lambda_i c_i.
\]
We will denote the dual criterion by $q(\lambda) := \inf_x L(x, \lambda)$. The dual problem is therefore:
\[
\max_{\lambda \in \mathbb{R}^p} q(\lambda); \quad \lambda \ge 0. \tag{DQCP}
\]

Lemma 2.19 (i) The dual criterion can be expressed as
\[
q(\lambda) = \begin{cases} c(\lambda) - \tfrac12 b(\lambda)^\top A(\lambda)^\dagger b(\lambda) & \text{if } A(\lambda) \succeq 0 \text{ and } b(\lambda) \in \operatorname{Im}(A(\lambda)),\\ -\infty & \text{otherwise}. \end{cases}
\]
(ii) The dual problem is equivalent to the following SDP problem (in the sense that it has the same value, and their solutions have the same components for $\lambda$)
\[
\max_{\lambda \ge 0,\ w \in \mathbb{R}} w; \quad \begin{pmatrix} c(\lambda) - w & \tfrac12 b(\lambda)^\top\\ \tfrac12 b(\lambda) & \tfrac12 A(\lambda) \end{pmatrix} \succeq 0. \tag{DQCP$'$}
\]

Proof Point (i) is an elementary computation, and point (ii) is an immediate application of the generalized Schur lemma. □
Since (DQCP$'$) is an SDP linear problem, we know how to compute its dual problem; it is convenient to call the latter the bidual problem of (QCP). Let us write the multiplier, an element of $S^{n+1}$, in the form $\begin{pmatrix} \alpha & x^\top\\ x & X \end{pmatrix}$. The Lagrangian of problem (DQCP$'$) can be expressed as
\[
L'(\lambda, w, \alpha, x, X) := w + \alpha\,(c(\lambda) - w) + b(\lambda) \cdot x + \tfrac12 \langle A(\lambda), X\rangle_F.
\]
Define
\[
C := \Big\{(\alpha, x, X) \in \mathbb{R} \times \mathbb{R}^n \times S^n;\ \begin{pmatrix} \alpha & x^\top\\ x & X \end{pmatrix} \succeq 0\Big\}.
\]

We can rewrite (DQCP$'$) in the form
\[
\max_{\lambda \ge 0,\ w} \ \inf_{(\alpha, x, X) \in C} L'(\lambda, w, \alpha, x, X).
\]
The bidual problem is therefore
\[
\min_{(\alpha, x, X) \in C} \ \sup_{\lambda \ge 0,\ w} L'(\lambda, w, \alpha, x, X). \tag{BQCP}
\]
We get
\[
L'(\lambda, w, \alpha, x, X) = (1 - \alpha)\, w + \sum_{i=0}^p \lambda_i \Big(\tfrac12 \langle A^i, X\rangle_F + b^i \cdot x + \alpha c_i\Big).
\]

By an elementary computation, we obtain the


Lemma 2.20 The bidual problem coincides with (R QC P).
The qualification hypothesis for problem (DQCP$'$) is equivalent to the existence of $\lambda \in \mathbb{R}^p$ and $w \in \mathbb{R}$ such that
\[
\lambda_i > 0, \ i = 1,\dots,p; \qquad \begin{pmatrix} c(\lambda) - w & \tfrac12 b(\lambda)^\top\\ \tfrac12 b(\lambda) & \tfrac12 A(\lambda) \end{pmatrix} \succ 0. \tag{2.52}
\]
It is easy to see that one obtains an equivalent condition by writing $\lambda_i \ge 0$ in lieu of $\lambda_i > 0$; using the Schur lemma, we see that (2.52) is equivalent to
\[
\text{There exists a } \lambda \in \mathbb{R}^p_+ \text{ such that } A(\lambda) \succ 0. \tag{2.53}
\]

This hypothesis is satisfied, for example, if $A^0 \succ 0$ (take $\lambda = 0$). The following theorem sums up the main results of the section.

Theorem 2.21 (i) We have the relations
\[
\operatorname{val}(DQCP) \le \operatorname{val}(RQCP) \le \operatorname{val}(QCP). \tag{2.54}
\]
(ii) If the criterion and constraints of (QCP) are convex, then val(RQCP) = val(QCP).
(iii) If problem (DQCP$'$) satisfies the qualification hypothesis (2.53), then (DQCP) and (RQCP) have the same value, i.e., the SDP relaxation has the same value as the classical dual.

The SDP relaxation therefore does at least as well as classical duality and, in many cases, both have the same value.

2.3.2 Relaxation of Integer Constraints

Consider a variant of the previous problem, where the $f_i$ are still defined by (2.49), with an additional integrality constraint:
\[
\min_{x \in \mathbb{R}^n} f_0(x); \quad f_i(x) \le 0, \ i = 1,\dots,p; \quad x \in E := \{-1, 1\}^n. \tag{QCPI}
\]

Remark 2.22 One easily reduces the more usual constraint $x \in \{0,1\}^n$ to the above integrality constraint.

Observe that when $x \in E$, the matrix $X = x x^\top$ has all its diagonal elements equal to 1. This leads to the SDP relaxation
\[
\min_{x \in \mathbb{R}^n,\ X \in S^n} \tfrac12 \langle A^0, X\rangle_F + b^0 \cdot x; \quad \tfrac12 \langle A^i, X\rangle_F + b^i \cdot x + c_i \le 0, \ i = 1,\dots,p; \quad
\begin{pmatrix} 1 & x^\top\\ x & X \end{pmatrix} \succeq 0; \quad X_{ii} = 1, \ i = 1,\dots,n. \tag{RQCPI}
\]
Note the obvious extension of Remark 2.16 to the present framework.

Remark 2.23 In the case of a linear programming problem with the above integrality constraints, all the $A^i$ are equal to 0, and hence the formulation of the relaxed problem reduces to
\[
\min_{x \in \mathbb{R}^n,\ X \in S^n} b^0 \cdot x; \quad b^i \cdot x + c_i \le 0, \ i = 1,\dots,p; \quad
\begin{pmatrix} 1 & x^\top\\ x & X \end{pmatrix} \succeq 0; \quad X_{ii} = 1, \ i = 1,\dots,n. \tag{2.55}
\]

2.4 Second-Order Cone Constraints

2.4.1 Examples of SOC Reformulations

Given a positive integer $m$, we choose to denote by $s = (s_0, \dots, s_m)^\top$ the elements of $\mathbb{R}^{m+1}$, and we set $\bar s := (s_1, \dots, s_m)^\top$. The second-order cone (SOC), or Lorentz cone, is defined as
\[
Q_{m+1} := \{s \in \mathbb{R}^{m+1};\ s_0 \ge |\bar s|\}. \tag{2.56}
\]
The associated order relation is, given $x$ and $y$ in $\mathbb{R}^{m+1}$: $x \succeq_{Q_{m+1}} y$ if $x - y \in Q_{m+1}$. We will see how to rewrite various relations in the form
\[
Ax + b \in \mathbb{R}^p_- \times Q_{m_1+1} \times \cdots \times Q_{m_q+1}. \tag{2.57}
\]
We then speak of a linear SOC reformulation.

Exercise 2.24 Let $w \in \mathbb{R}^n$, and let $\alpha$ and $\beta$ be scalars. Check that
\[
\alpha \ge 0, \ \beta \ge 0, \ |w|^2 \le \alpha\beta \iff \alpha + \beta \ge \left|\begin{pmatrix} 2w\\ \alpha - \beta \end{pmatrix}\right|. \tag{2.58}
\]
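The equivalence (2.58) follows by squaring both sides (note that the right-hand side forces $\alpha + \beta \ge |\alpha - \beta|$, hence $\alpha, \beta \ge 0$). A quick numerical check (our sketch, not from the book):

```python
import numpy as np

# Check of (2.58): alpha, beta >= 0 and |w|^2 <= alpha*beta is equivalent
# to the SOC inequality alpha + beta >= |(2w, alpha - beta)|.
rng = np.random.default_rng(7)
for _ in range(300):
    w = rng.standard_normal(3)
    alpha, beta = rng.standard_normal(2)
    lhs = alpha >= 0 and beta >= 0 and w @ w <= alpha * beta
    rhs = alpha + beta >= np.sqrt(4 * (w @ w) + (alpha - beta) ** 2)
    assert lhs == rhs
```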

Exercise 2.25 Given $a_i$, $i = 1$ to $p$, and $c_j$, $j = 1$ to $q$, in $\mathbb{R}^n$, and given $b \in \mathbb{R}^p$ and $d \in \mathbb{R}^q$, consider the problem
\[
\min_{x \in \mathbb{R}^n} \sum_{i=1}^p \frac{1}{a_i \cdot x + b_i}; \quad a_i \cdot x + b_i > 0, \ i = 1,\dots,p; \quad c_i \cdot x + d_i \ge 0, \ i = 1,\dots,q. \tag{2.59}
\]
Check that an equivalent formulation is
\[
\min_{x \in \mathbb{R}^n,\ t \in \mathbb{R}^p} \sum_{i=1}^p t_i; \quad t_i\,(a_i \cdot x + b_i) \ge 1, \ t_i \ge 0, \ a_i \cdot x + b_i \ge 0, \ i = 1,\dots,p; \quad c_i \cdot x + d_i \ge 0, \ i = 1,\dots,q. \tag{2.60}
\]
Obtain a linear SOC reformulation by applying Exercise 2.24.

Exercise 2.26 Given $a_i$, $i = 1$ to $p$, in $\mathbb{R}^n$, and $b \in \mathbb{R}^p$ with positive coordinates, consider the problem of uniform approximation in the logarithmic scale:
\[
\min_{x \in \mathbb{R}^n} \ \max_i\ |\log(a_i \cdot x) - \log(b_i)| \tag{2.61}
\]
with the implicit constraint $a_i \cdot x > 0$ for all $i$. Show that a reformulation of this problem is
\[
\min_{x \in \mathbb{R}^n,\ t \in \mathbb{R}} t; \quad \frac{1}{t} \le \frac{a_i \cdot x}{b_i} \le t, \ i = 1,\dots,p. \tag{2.62}
\]
Apply to the inequalities on the left, rewritten in the form $t\, a_i \cdot x \ge b_i$, the result of Exercise 2.24, and conclude that we have a linear SOC reformulation of this problem.

We next discuss some more elaborate examples.

Example 2.27 Let $\ell$ be a positive integer. Let us show that we can rewrite "linearly" the relation
\[
x \in \mathbb{R}^{2^\ell}_+; \quad t \in \mathbb{R}; \quad t \le (x_1 x_2 \cdots x_{2^\ell})^{1/2^\ell}. \tag{2.63}
\]
For $\ell = 1$ this boils down to
\[
t \le \tau; \qquad 0 \le \tau \le \sqrt{x_1 x_2}, \tag{2.64}
\]
and the last inequality can be rewritten as $\tau^2 \le x_1 x_2$; we conclude by applying Exercise 2.24. For $\ell = 2$ we introduce $y \in \mathbb{R}^2$ and rewrite (2.63) in the form
\[
x \in \mathbb{R}^4_+; \ y \in \mathbb{R}^2_+; \ t \in \mathbb{R}; \quad t \le \tau; \ 0 \le \tau \le \sqrt{y_1 y_2}; \ y_1 \le \sqrt{x_1 x_2}; \ y_2 \le \sqrt{x_3 x_4}; \tag{2.65}
\]
which itself can be rewritten as
\[
x \ge 0; \ y \ge 0; \ \tau \ge 0; \quad t \le \tau; \ \tau^2 \le y_1 y_2; \ y_1^2 \le x_1 x_2; \ y_2^2 \le x_3 x_4. \tag{2.66}
\]
We again apply Exercise 2.24. We leave to the reader the generalization to arbitrary $\ell$, checking that one obtains $O(2^\ell)$ "linear" relations in $\mathbb{R}^3$.

Example 2.28 Consider the relations
\[
x \in \mathbb{R}^n_+; \quad t \in \mathbb{R}_+; \quad t \le x_1^{\pi_1} x_2^{\pi_2} \cdots x_n^{\pi_n}. \tag{2.67}
\]
We assume that $\pi_i = p_i/p$, with $p_i$ a positive integer and $p$ an integer, $p \ge \sum_i p_i$. Let $\ell$ be such that $2^\ell \ge p$. Consider the relation
\[
0 \le t \le (x_1 x_2 \cdots x_{2^\ell})^{1/2^\ell} \tag{2.68}
\]
where the $x_i$ are replaced by $x_1$ for the first $p_1$ indexes, by $x_2$ for the following $p_2$ indexes, and so on until $x_n$; then by $t$ for the following $2^\ell - p$ indexes, and finally by 1 for the $p - \sum_i p_i$ remaining indexes. Raising both sides of (2.68) to the power $2^\ell$, we get
\[
t^{2^\ell} \le x_1^{p_1} x_2^{p_2} \cdots x_n^{p_n}\, t^{2^\ell - p}. \tag{2.69}
\]
Dividing by $t^{2^\ell - p}$ and taking the $p$th root, we see that (2.68) is equivalent to (2.67); using Example 2.27, it follows that (2.67) has a linear SOC reformulation.

Note that, in particular, the geometric mean can be SOC linearly rewritten.

2.4.2 Linear SOC Duality

Consider the following SOC linear problem, in which $A_j$ is an $(m_j + 1) \times n$ matrix and $b_j \in \mathbb{R}^{m_j+1}$, for $j = 1, \dots, J$:
\[
\min_{x \in \mathbb{R}^n} c \cdot x; \quad A_j x - b_j \succeq_{Q_{m_j+1}} 0, \ j = 1,\dots,J. \tag{LSOCP}
\]
In order to compute the dual problem, we introduce the operator $\mathbb{R}^{m+1} \to \mathbb{R}^{m+1}$, $y \mapsto \tilde y := (y_0, -\bar y)$, which leaves $Q_{m+1}$ invariant.

Lemma 2.29 (i) The second-order cone $Q_{m+1}$ is selfdual (equal to its positive polar cone). (ii) In addition, when $x$ and $y$ are two nonzero elements of $Q_{m+1}$, we have $x \cdot y = 0$ iff $x_0 = |\bar x|$ and $y \in \mathbb{R}_+ \tilde x$.

Proof Let $x$ and $y$ belong to $Q_{m+1}$. If $x_0 = 0$, then $x$ is zero and $x \cdot y = 0$, and the same for $y$. Assume now that $x_0$ and $y_0$ are positive. Then
\[
x \cdot y = x_0 y_0 + \bar x \cdot \bar y \ge x_0 y_0 - |\bar x|\,|\bar y| = x_0 y_0 \Big(1 - \frac{|\bar x|}{x_0}\,\frac{|\bar y|}{y_0}\Big). \tag{2.70}
\]
By definition of $Q_{m+1}$, the above fractions have values in $[0,1]$, whence $x \cdot y \ge 0$, which proves that $Q_{m+1} \subset Q_{m+1}^+$. In addition, $x \cdot y = 0$ iff $x_0 = |\bar x|$, $y_0 = |\bar y|$ and $\bar x \cdot \bar y = -|\bar x|\,|\bar y|$. Since $\bar x \ne 0$, the last relation is equivalent to $\bar y \in \mathbb{R}_- \bar x$, whence (ii).
It remains to show that $Q_{m+1}^+ \subset Q_{m+1}$. Let $y \in Q_{m+1}^+$. If $\bar y = 0$, let $x \in Q_{m+1}$ be such that $x_0 = 1$. We get that $0 \le x \cdot y = y_0$, so $y \in Q_{m+1}$. If on the contrary $\bar y \ne 0$, set $z := (|\bar y|, -\bar y)$. Then $z \in Q_{m+1}$, and so $0 \le y \cdot z = y_0 |\bar y| - |\bar y|^2 = |\bar y|\,(y_0 - |\bar y|)$, implying $y_0 \ge |\bar y|$, as was to be shown. □
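Both assertions of Lemma 2.29 are easy to spot-check numerically (our sketch, not from the book):

```python
import numpy as np

# (i) Any two points of Q_{m+1} have nonnegative scalar product (one
# inclusion of self-duality); (ii) for a boundary point x (x0 = |x_bar|),
# every y in R_+ * x~ = R_+ * (x0, -x_bar) is orthogonal to x.
rng = np.random.default_rng(8)
m = 4

def soc_point(rng):
    s_bar = rng.standard_normal(m)
    s0 = np.linalg.norm(s_bar) + abs(rng.standard_normal())
    return np.concatenate([[s0], s_bar])

for _ in range(200):
    x, y = soc_point(rng), soc_point(rng)
    assert x @ y >= -1e-12                              # part (i)

x_bar = rng.standard_normal(m)
x = np.concatenate([[np.linalg.norm(x_bar)], x_bar])    # boundary: x0 = |x_bar|
y = 2.5 * np.concatenate([[x[0]], -x_bar])              # y in R_+ * x~
assert abs(x @ y) < 1e-9                                # part (ii)
```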

The dual of (LSOCP) (which again is a particular case of conical linear optimization) can therefore be expressed as
\[
\max_{y \in \Pi_{j=1}^J Q_{m_j+1}} \ \sum_{j=1}^J b_j \cdot y_j; \quad \sum_{j=1}^J (A_j)^\top y_j = c. \tag{LSOCP*}
\]
We deduce the optimality conditions: primal and dual feasibility, and complementarity, the latter being obtained for each $j$:
\[
A_j x - b_j \in Q_{m_j+1}, \ y_j \in Q_{m_j+1}, \ (A_j x - b_j) \cdot y_j = 0, \ j = 1,\dots,J; \quad \sum_{j=1}^J (A_j)^\top y_j = c. \tag{2.71}
\]

2.4.3 SDP Representation

Let us show how to represent a linear SOC constraint as a linear SDP constraint. Given $s \in \mathbb{R}^{m+1}$, we define the "arrow" mapping $\operatorname{Arw} : \mathbb{R}^{m+1} \to S^{m+1}$ (we recall that $S^{m+1}$ is the space of symmetric matrices of size $m+1$):
\[
\operatorname{Arw}(s) := \begin{pmatrix} s_0 & \bar s^\top\\ \bar s & s_0 I_m \end{pmatrix}. \tag{2.72}
\]

Lemma 2.30 We have $s \in Q_{m+1}$ iff $\operatorname{Arw}(s) \succeq 0$.

Proof If $s_0 < 0$, then $s \notin Q_{m+1}$, and $\operatorname{Arw}(s)$ cannot be positive semidefinite. If $s_0 > 0$, by application of the Schur lemma (eliminating the last block), $\operatorname{Arw}(s) \succeq 0$ iff $s_0 - |\bar s|^2/s_0 \ge 0$, and so $s \in Q_{m+1}$ iff $\operatorname{Arw}(s) \succeq 0$. Finally, if $s_0 = 0$, we know that a symmetric matrix with zero diagonal is positive semidefinite iff it is equal to 0, and so $\operatorname{Arw}(s) \succeq 0$ iff $s = 0$, whence the conclusion. □
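Lemma 2.30 can be verified directly on random vectors (our sketch, not from the book):

```python
import numpy as np

# Check of Lemma 2.30: s in Q_{m+1} iff Arw(s) is PSD. The eigenvalues of
# Arw(s) are s0 - |s_bar|, s0 + |s_bar|, and s0 (multiplicity m - 1).
rng = np.random.default_rng(9)
m = 3

def arw(s):
    s0, s_bar = s[0], s[1:]
    top = np.concatenate([[s0], s_bar])[None, :]
    bottom = np.hstack([s_bar[:, None], s0 * np.eye(m)])
    return np.vstack([top, bottom])

for _ in range(200):
    s = rng.standard_normal(m + 1)
    in_cone = s[0] >= np.linalg.norm(s[1:])
    psd = np.min(np.linalg.eigvalsh(arw(s))) >= -1e-9
    assert in_cone == psd
```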

We can therefore rewrite an SOC linear problem as an SDP linear problem. We will compare the dual solutions. The primal formulation can be expressed as
\[
\min_{x \in \mathbb{R}^n} c \cdot x; \quad \operatorname{Arw}(A_j x - b_j) \succeq 0, \ j = 1,\dots,J. \tag{LSDP}
\]
We define $s_j := A_j x - b_j$, $j = 1$ to $J$. Partitioning the symmetric matrices of $S^{m+1}$ (with indexes from 0 to $m$) in the form
\[
Y = \begin{pmatrix} Y_{00} & \bar Y_0^\top\\ \bar Y_0 & \bar Y \end{pmatrix}, \tag{2.73}
\]
we obtain that $\langle \operatorname{Arw}(s), Y\rangle_F = s_0 \operatorname{trace}(Y) + 2\bar s \cdot \bar Y_0$, and so the transpose operator $\operatorname{Arw}^\top : S^{m+1} \to \mathbb{R}^{m+1}$ can be expressed as
\[
\operatorname{Arw}^\top Y := \begin{pmatrix} \operatorname{trace}(Y)\\ 2\bar Y_0 \end{pmatrix}. \tag{2.74}
\]
The dual formulation of (LSDP) hence has the expression
\[
\max_{Y \in \Pi_{j=1}^J S^{m_j+1}_+} \ \sum_{j=1}^J \big(b_0^j \operatorname{trace}(Y^j) + 2\bar b^j \cdot \bar Y_0^j\big); \quad \sum_{j=1}^J (A_j)^\top \begin{pmatrix} \operatorname{trace}(Y^j)\\ 2\bar Y_0^j \end{pmatrix} = c. \tag{LSDP*}
\]

Proposition 2.31 (i) The dual problems (LSOCP*) and (LSDP*) have the same value. (ii) The feasible set of (LSOCP*) is the image under the mapping $\operatorname{Arw}^\top$ of the feasible set of (LSDP*).

Proof It suffices to check point (ii), which, in view of the dual costs, implies point (i). Let us show that $\operatorname{Arw}^\top S^{m+1}_+ \subset Q_{m+1}$. Indeed, if $s \in Q_{m+1}$ and $Y \in S^{m+1}_+$, we have by Fejer's theorem 2.1
\[
s^\top \operatorname{Arw}^\top Y = \langle \operatorname{Arw}(s), Y\rangle_F \ge 0, \tag{2.75}
\]
and we conclude by Lemma 2.29. On the other hand, $\operatorname{Arw}$ is injective; its transpose operator is therefore surjective. We conclude by identifying the feasible points of (LSOCP*) with the elements of the form $(\operatorname{trace}(Y^j), 2\bar Y_0^j)$, where $Y^j \in S^{m_j+1}_+$. □

Note that no qualification hypothesis was made, so that the primal and dual values can be different. In order to obtain an expression for the solutions of (LSDP∗) as a function of those of (LSOCP∗), we must, given y ∈ Q^{m+1}, express the set

Arw^{−}(y) := {Y ∈ S_+^{m+1}; Arw^⊤ Y = y}.      (2.76)

We will only discuss the most interesting case, when y_0 = |ȳ| > 0.

Lemma 2.32 Let y ∈ Q^{m+1} be such that y_0 = |ȳ| > 0. Then Arw^{−}(y) reduces to the single element

Y(y) = ½ [ y_0       ȳ^⊤
           ȳ    ȳ ȳ^⊤/y_0 ].      (2.77)

Proof We have that Arw^⊤ Y(y) = y, and by the Schur lemma 2.3, Y(y) ⪰ 0. Let now Y ∈ Arw^{−}(y). Then Ȳ_0 = ½ȳ, and Y_00 cannot be zero (otherwise Y ⪰ 0 would force Ȳ_0 = 0, whereas |ȳ| = y_0 > 0); the Schur lemma then implies Ȳ = ¼ ȳ ȳ^⊤/Y_00 + M, with M ⪰ 0. Therefore, using |ȳ| = y_0,

y_0 = trace(Y) = Y_00 + trace(Ȳ) = Y_00 + ¼ y_0²/Y_00 + trace(M)
    = y_0 + ( Y_00^{1/2} − ½ y_0 Y_00^{−1/2} )² + trace(M).      (2.78)

This implies that Y_00 = ½ y_0 and trace(M) = 0, whence M = 0, as was to be proved. □
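Formula (2.77) can also be checked by elementary computation: with v := (y_0, ȳ), one has Y(y) = v v^⊤/(2y_0), a rank-one (hence positive semidefinite) matrix whose image under the adjoint map (2.74) is y. A minimal numerical sketch (the helper names `Y_of` and `arw_adjoint` are ours):

```python
import math

def Y_of(y):
    """Candidate matrix Y(y) of (2.77): (1/2) [[y0, ybar^T], [ybar, ybar ybar^T / y0]]."""
    y0, ybar = y[0], list(y[1:])
    m = len(ybar)
    rows = [[0.5 * y0] + [0.5 * yi for yi in ybar]]
    for i in range(m):
        rows.append([0.5 * ybar[i]] + [0.5 * ybar[i] * ybar[j] / y0 for j in range(m)])
    return rows

def arw_adjoint(Y):
    """Adjoint map (2.74): Arw^T Y = (trace(Y), 2 * (first column below the corner))."""
    tr = sum(Y[i][i] for i in range(len(Y)))
    return [tr] + [2.0 * Y[i][0] for i in range(1, len(Y))]

# boundary vector of the cone: y0 = |ybar| > 0
ybar = [3.0, 4.0]
y = [math.hypot(*ybar)] + ybar          # y = (5, 3, 4)
Y = Y_of(y)
assert all(abs(u - v) < 1e-12 for u, v in zip(arw_adjoint(Y), y))
# Y(y) is PSD: it equals v v^T / (2 y0) with v = (y0, ybar), a rank-one matrix
v = [y[0]] + ybar
for i in range(3):
    for j in range(3):
        assert abs(Y[i][j] - v[i] * v[j] / (2.0 * y[0])) < 1e-12
```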

2.5 Semi-infinite Programming

2.5.1 Framework

In this section we study linear semi-infinite programming problems of the following type:

Min_{x∈R^n}  c · x;   a_ω · x ≤ b_ω,  ω ∈ Ω.      (SIL)

Here c ∈ R^n, Ω is a compact metric space, and for each ω ∈ Ω, a_ω ∈ R^n, b_ω ∈ R, and the mapping Ω → R^{n+1}, ω ↦ (a_ω, b_ω), is continuous. We denote by Y = C(Ω) the space of continuous functions over Ω. Endowed with the norm ‖y‖ := max{|y_ω|; ω ∈ Ω}, this is a Banach space. One defines the contact set of x̄ ∈ F(SIL) as

Ω(x̄) := {ω ∈ Ω; a_ω · x̄ = b_ω},      (2.79)

and the qualification (Slater) condition by

There exists an x̂ ∈ R^n such that a_ω · x̂ < b_ω, for all ω ∈ Ω.      (2.80)

By the compactness of Ω and the continuity of the mapping ω ↦ (a_ω, b_ω), this hypothesis implies the existence of an ε > 0 such that

a_ω · x̂ − b_ω ≤ −ε, for all ω ∈ Ω.      (2.81)

Finally, the linearized problem at the point x̄ is:

Min_{h∈R^n}  c · h;   a_ω · h ≤ 0,  ω ∈ Ω(x̄).      (L_x̄)

The following lemma allows us to reduce the study of first-order optimality conditions to those of a homogeneous problem.
Lemma 2.33 (i) If x̄ ∈ F(S I L) is such that h = 0 is a solution of the linearized
problem, then x̄ ∈ S(S I L).
(ii) If the qualification condition (2.80) holds, and x̄ ∈ F(S I L), then x̄ ∈ S(S I L)
iff h = 0 is a solution of the linearized problem.
Proof (i) If, on the contrary, x̄ ∈ / S(S I L), then there exists an x̃ ∈ F(S I L) such
that c · x̃ < c · x̄, and then h := x̃ − x̄ is feasible for the linearized problem, and
c · h < 0, so that 0 is not a solution of the linearized problem.
(ii) In view of step (i), it suffices to prove that, if x̄ ∈ S(SIL), then h = 0 is a solution of the linearized problem. Assume on the contrary that c · h < 0 for some h ∈ F(L_x̄). Let ε > 0 be small enough so that h_ε := h + ε(x̂ − x̄) satisfies c · h_ε < 0. Set x(t) := x̄ + t h_ε. Let us show that, for t > 0 small enough, we have x(t) ∈ F(SIL). If this is not the case, there exist a sequence t_k ↓ 0 and ω_k ∈ Ω such that a_{ω_k} · x(t_k) > b_{ω_k}. Extracting a subsequence if necessary, we can assume that ω_k → ω̄. Passing to the limit in the previous inequality, we get ω̄ ∈ Ω(x̄), and therefore a_ω̄ · h_ε = a_ω̄ · h + ε a_ω̄ · (x̂ − x̄) ≤ ε a_ω̄ · (x̂ − x̄) < 0. For ω close enough to ω̄, we then have a_ω · h_ε < 0, and so, if k is large enough:

a_{ω_k} · x(t_k) = a_{ω_k} · x̄ + t_k a_{ω_k} · h_ε < b_{ω_k},      (2.82)

which gives the desired contradiction.



Since x(t) ∈ F(SIL) for small t > 0 and x̄ ∈ S(SIL), we have 0 ≤ lim_{t↓0} t^{−1} c · (x(t) − x̄) = c · h_ε, which contradicts c · h_ε < 0. Hence c · h ≥ 0 for all h ∈ F(L_x̄), i.e., h = 0 is a solution of (L_x̄), as was to be proved. □

2.5.2 Multipliers with Finite Support

We know that the topological dual of C(Ω) is the space M(Ω) of (signed) finite Borel measures over Ω (see Malliavin [77, Chap. 2]), or in short, measures. Rather, we will show how to obtain in a “direct way” the existence of Lagrange multipliers as measures with finite support, i.e., linear combinations of finitely many Dirac measures, in the form ⟨λ, y⟩ = ∑_{ω∈supp(λ)} λ_ω y_ω. Here the set supp(λ) is a finite subset of Ω, called the support of λ, and such that λ_ω ≠ 0, for all ω ∈ supp(λ). Denote by M(Ω)_+ the cone of positive measures.
If λ is a measure with finite support, we call {λ_ω, ω ∈ supp(λ)} the components of λ, and we say that λ is positive if its components are. We denote by M_F(Ω) the set of measures with finite support over Ω, by M_F^p(Ω) the set of measures with support of cardinality at most p, and by M_F(Ω)_+, M_F^p(Ω)_+ the corresponding positive cones.
One defines the dual problem “with finite support”, or “finite dual”, as

Max_{λ∈M_F(Ω)_+}  −∑_{ω∈supp(λ)} b_ω λ_ω;   c + ∑_{ω∈supp(λ)} λ_ω a_ω = 0.      (FSID)

Let us first state a weak duality result:

Proposition 2.34 (i) We have val(FSID) ≤ val(SIL).
(ii) If val(FSID) = val(SIL), then λ ∈ S(FSID) and x ∈ S(SIL) imply the complementarity condition

a_ω · x = b_ω, for all ω ∈ supp(λ).      (2.83)

(iii) Conversely, if λ ∈ F(FSID) and x ∈ F(SIL) satisfy (2.83), then (FSID) and (SIL) have the same value, λ ∈ S(FSID), and x ∈ S(SIL).

Proof Let λ ∈ F(FSID) and x ∈ F(SIL). Then

c · x ≥ c · x + ∑_{ω∈supp(λ)} λ_ω (a_ω · x − b_ω) = −∑_{ω∈supp(λ)} b_ω λ_ω.

Taking the infimum over x ∈ F(SIL) and the supremum over λ ∈ F(FSID), we obtain (i). In addition, if the primal and dual values are equal, this relation implies that x ∈ S(SIL) and λ ∈ S(FSID) iff the inequality is in fact an equality, whence (ii), and by the same type of argument (iii). □

Let us now state the main result of the section. We will say that E ⊂ M_F(Ω)_+ is bounded if there exists an α > 0 such that ∑_{ω∈supp(λ)} λ_ω ≤ α, for all λ ∈ E.

Theorem 2.35 Let the Slater hypothesis (2.80) hold, and val(S I L) be finite. Then
val(S I L) = val(F S I D), and S(F S I D) is nonempty and bounded. In addition,
(F S I D) has at least one solution with support of cardinality at most n.
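As an illustration of Theorem 2.35, consider the semi-infinite program with Ω = [0, 2π], a_ω = (cos ω, sin ω) and b_ω ≡ 1, whose feasible set is the unit disk of R². The sketch below (our own toy example, not from the book) verifies that a single Dirac mass — support of cardinality 1 ≤ n = 2 — is a finite-support multiplier achieving zero duality gap.

```python
import math

# Omega = [0, 2*pi], a_w = (cos w, sin w), b_w = 1: feasible set = unit disk.
c = (1.0, 2.0)
norm_c = math.hypot(*c)

# primal solution: x = -c/|c| (pushes c.x to its minimum over the disk)
x = (-c[0] / norm_c, -c[1] / norm_c)
primal_value = c[0] * x[0] + c[1] * x[1]             # = -|c|

# finite-support dual candidate: one Dirac mass at w_bar with a_{w_bar} = -c/|c|
w_bar = math.atan2(-c[1], -c[0])
a_bar = (math.cos(w_bar), math.sin(w_bar))
lam = norm_c                                          # weight of the Dirac mass

# dual feasibility: c + lam * a_bar = 0
assert all(abs(c[i] + lam * a_bar[i]) < 1e-12 for i in range(2))
# complementarity (2.83): the constraint is active at w_bar
assert abs((a_bar[0] * x[0] + a_bar[1] * x[1]) - 1.0) < 1e-12
# equal values => both are optimal (Proposition 2.34(iii))
dual_value = -1.0 * lam                               # -sum b_w lam_w
assert abs(primal_value - dual_value) < 1e-12
# primal feasibility on a fine grid of Omega
for k in range(1000):
    w = 2.0 * math.pi * k / 1000.0
    assert math.cos(w) * x[0] + math.sin(w) * x[1] <= 1.0 + 1e-12
```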

The proof is based on the next lemmas, which are of independent interest.

Remark 2.36 Applying the duality theory of Chap. 1, we obtain the existence of Lagrange multipliers in M(Ω)_+. Then the Krein–Milman theorem [65] allows us to obtain the existence of multipliers with finite support. On the other hand, our approach uses only elementary computations.

Let the convex cone generated by {a_ω; ω ∈ Ω} be denoted by

C := { ∑_{ω∈supp(λ)} λ_ω a_ω ; λ ∈ M_F(Ω)_+ } ∪ {0}.      (2.84)

Lemma 2.37 The cone C is generated by the nonnegative linear combinations of at most n of the vectors a_ω. In other words,

for all y ∈ C \ {0}, there exists a λ ∈ M_F^n(Ω)_+ such that y = ∑_{ω∈supp(λ)} λ_ω a_ω.      (2.85)

Proof Let y ∈ C. There exists a λ ∈ M_F(Ω)_+ such that y = ∑_{ω∈supp(λ)} λ_ω a_ω. Choose such a λ with support of minimal cardinality, say p, and let us obtain a contradiction if p > n. In that case the vectors {a_ω; ω ∈ supp(λ)} are linearly dependent: there exists a μ ∈ R^p, μ ≠ 0, such that ∑_{ω∈supp(λ)} μ_ω a_ω = 0. Changing μ into −μ if necessary, we can assume that min_ω μ_ω < 0. Let t > 0 be the largest value such that λ_ω + tμ_ω ≥ 0, for all ω ∈ supp(λ). Then y = ∑_{ω∈supp(λ)} (λ_ω + tμ_ω) a_ω, and the support of λ + tμ is strictly included in that of λ. This gives the desired contradiction. □
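The reduction argument of the proof is constructive, and can be sketched numerically in the smallest interesting case p = 3 > n = 2 (the data below are an arbitrary illustration of ours):

```python
# Reduce a conic combination of three vectors in R^2 (p = 3 > n = 2) to two atoms,
# following the proof: find mu with sum_w mu_w a_w = 0, then move along mu until
# one weight hits zero.
a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]    # a_w, w = 0, 1, 2
lam = [2.0, 3.0, 4.0]                        # positive weights
y = [sum(lam[w] * a[w][i] for w in range(3)) for i in range(2)]    # y = (6, 7)

# null-space vector: fix mu_2 = 1 and solve mu_0 a_0 + mu_1 a_1 = -a_2 (Cramer's rule)
det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
mu0 = (-a[2][0] * a[1][1] + a[2][1] * a[1][0]) / det
mu1 = (-a[2][1] * a[0][0] + a[2][0] * a[0][1]) / det
mu = [mu0, mu1, 1.0]
assert all(abs(sum(mu[w] * a[w][i] for w in range(3))) < 1e-12 for i in range(2))

if min(mu) >= 0:          # ensure min mu_w < 0, changing mu into -mu if necessary
    mu = [-m for m in mu]
# largest t keeping lam + t*mu >= 0: at this t one weight vanishes
t = min(lam[w] / -mu[w] for w in range(3) if mu[w] < 0)
new_lam = [lam[w] + t * mu[w] for w in range(3)]

assert min(new_lam) >= -1e-12                        # still a conic combination
assert any(abs(l) < 1e-12 for l in new_lam)          # support strictly smaller
y2 = [sum(new_lam[w] * a[w][i] for w in range(3)) for i in range(2)]
assert all(abs(y[i] - y2[i]) < 1e-12 for i in range(2))    # same y
```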

Lemma 2.38 Let the Slater hypothesis (2.80) hold. If the dual problem (F S I D) is
feasible, then it has a nonempty and bounded solution set.

Proof Let λ ∈ F(FSID). Using (2.81), we get

−c · x̂ = ∑_{ω∈supp(λ)} λ_ω a_ω · x̂ ≤ ∑_{ω∈supp(λ)} λ_ω (b_ω − ε),

and so

ε ∑_{ω∈supp(λ)} λ_ω ≤ c · x̂ + ∑_{ω∈supp(λ)} λ_ω b_ω.

If λ is an ε′-solution of (FSID), with ε′ > 0 (such a λ exists since, the primal being feasible, the dual value is finite), i.e.,

−∑_{ω∈supp(λ)} λ_ω b_ω ≥ val(FSID) − ε′,      (2.86)

then we obtain the estimate ∑_{ω∈supp(λ)} λ_ω = O(1). As a consequence, a maximizing sequence {λ^k} of problem (FSID) is bounded. In addition, by Lemma 2.37, we can w.l.o.g. assume that the support of each element of this sequence has cardinality at most n. Extracting a subsequence if necessary, we can assume that this cardinality p ≤ n is constant along the sequence. So, let {ω_1^k, …, ω_p^k} denote the support of λ^k. Extracting again a subsequence, we can assume that the points of the supports converge to (ω̄_1, …, ω̄_p) (some of these limits could coincide) and that λ_i^k → λ̄_i. We deduce that λ̄ ∈ S(FSID), with support of cardinality at most n. □

Lemma 2.39 Let the Slater hypothesis (2.80) hold. If (S I L) has a solution, then it
has the same value as (F S I D), and S(F S I D) is nonempty and bounded.

Proof (a) Let x̄ ∈ S(SIL). By Lemma 2.33, h = 0 is a solution of the linearized problem (L_x̄). Set

C(x̄) := { ∑_{ω∈supp(λ)} λ_ω a_ω ; λ ∈ M_F(Ω)_+, supp(λ) ⊂ Ω(x̄) } ∪ {0}.      (2.87)

The argument of the proof of Lemma 2.37 tells us that

for all y ∈ C(x̄) \ {0}, there exists a λ ∈ M_F^n(Ω)_+ with supp(λ) ⊂ Ω(x̄) such that y = ∑_{ω∈supp(λ)} λ_ω a_ω.      (2.88)
(b) Let us show that C(x̄) is closed. Indeed, let y ∈ C(x̄), with λ as in (2.88). Computing the scalar product of y with x̂ − x̄, with x̂ given by the Slater condition, and using a_ω · x̄ = b_ω for ω ∈ supp(λ) ⊂ Ω(x̄), we get

y · (x̂ − x̄) = ∑_{ω∈supp(λ)} λ_ω a_ω · (x̂ − x̄) = ∑_{ω∈supp(λ)} λ_ω (a_ω · x̂ − b_ω) ≤ −ε ∑_{ω∈supp(λ)} λ_ω,

which shows that ∑_{ω∈supp(λ)} λ_ω = O(‖y‖). If a sequence y^k of C(x̄) converges to ȳ, the associated sequence λ^k (for which, by (2.88), we can assume that the support has cardinality at most n) is therefore bounded, and hence we can pass to the limit, whence the closedness of C(x̄).
(c) Let us show that −c ∈ C (x̄). Since C (x̄) is convex and closed, if this is not the
case, we can strictly separate −c and C (x̄). So, there exist h ∈ Rn and α ∈ R such
that −c · h > α and y · h ≤ α, for all y ∈ C (x̄). Taking y = 0, we get c · h < 0, and
also aω · h ≤ 0, for all ω ∈ Ω(x̄), which gives the desired contradiction to Lemma
2.33.

(d) Since −c ∈ C(x̄), there exists a λ ∈ F(FSID) with support included in Ω(x̄). By Proposition 2.34(iii), λ ∈ S(FSID), and problems (SIL) and (FSID) have the same value. Finally, S(FSID) is nonempty and bounded in view of Lemma 2.38. □

We now relax the hypothesis of existence of a solution to the primal problem.

Lemma 2.40 Let val(SIL) be finite, and let the Slater hypothesis (2.80) hold. Then (SIL) and (FSID) have the same value, and S(FSID) has at least one element with support of cardinality at most n.

Proof (a) We apply Lemma 2.39 to the perturbed problem

Min_{x∈R^n}  c · x + γ ∑_{i=1}^n |x_i|;   a_ω · x ≤ b_ω,  ω ∈ Ω,      (SIL_γ)

where γ > 0. We first show that this problem, whose value is finite, has a solution. Given ε > 0, let x be an ε-solution of (SIL_γ). We then have

val(SIL) + γ ∑_{i=1}^n |x_i| ≤ c · x + γ ∑_{i=1}^n |x_i| ≤ val(SIL_γ) + ε,      (2.89)

and so

γ ∑_{i=1}^n |x_i| ≤ val(SIL_γ) + ε − val(SIL).

A minimizing sequence of (SIL_γ) is therefore bounded. Passing to the limit, we deduce, for each γ > 0, the existence of x^γ ∈ S(SIL_γ).
(b) Problem (SIL_γ) can be rewritten as a linear semi-infinite optimization problem:

Min_{x∈R^n, z∈R^n}  c · x + γ ∑_{i=1}^n z_i;   ±x_i ≤ z_i, i = 1, …, n;   a_ω · x ≤ b_ω, ω ∈ Ω.      (SIL_γ)

In addition, set ẑ_i := 1 + |x̂_i|, i = 1 to n (where x̂ satisfies (2.80)). Then (x̂, ẑ) satisfies the Slater hypothesis for problem (SIL_γ). Denote by 1 the vector of R^n with all components equal to 1. Lemma 2.39 implies the equality of the values of (SIL_γ) and of its finite dual, which can be written as

Max_{μ∈R^n_+, η∈R^n_+, λ∈M_F(Ω)_+}  −∑_{ω∈supp(λ)} b_ω λ_ω;   c + μ − η + ∑_{ω∈supp(λ)} λ_ω a_ω = 0;   μ + η = γ1.      (FSID_γ)
It also ensures that (FSID_γ) has a solution (μ^γ, η^γ, λ^γ). We have in addition, when γ ↓ 0,
−∑_{ω∈supp(λ^γ)} b_ω λ_ω^γ → val(SIL);      ‖ c + ∑_{ω∈supp(λ^γ)} λ_ω^γ a_ω ‖ → 0.      (2.90)

Indeed, the first relation follows from the equality val(SIL_γ) = val(FSID_γ), and from the equality lim_{γ↓0} val(SIL_γ) = val(SIL), which can be easily checked. The second is an immediate consequence of the definition of (FSID_γ). These relations allow us to show that λ^γ is bounded; indeed,

o(1) = ( c + ∑_{ω∈supp(λ^γ)} λ_ω^γ a_ω ) · x̂ ≤ c · x̂ + ∑_{ω∈supp(λ^γ)} λ_ω^γ (b_ω − ε),

and so, by the first relation of (2.90),

ε ∑_{ω∈supp(λ^γ)} λ_ω^γ ≤ c · x̂ + ∑_{ω∈supp(λ^γ)} λ_ω^γ b_ω + o(1) = O(1).

To obtain λ ∈ S(F S I D), via Proposition 2.34(i) it then suffices to pass to the limit
(in a subsequence) in (2.90). 

Proof (of Theorem 2.35) Under the hypotheses of the theorem, Lemma 2.40 ensures the equality val(SIL) = val(FSID), as well as the existence of an element of S(FSID) with support of cardinality at most n. Combining with Lemma 2.38, we obtain that S(FSID) is bounded. □

2.5.3 Chebyshev Approximation

Let a and b be two real numbers, with a < b. The problem of the best uniform approximation of a continuous function f over [a, b] by a polynomial of degree n can be written as

Min_{p∈P_n}  max{ |p(x) − f(x)|; x ∈ [a, b] },      (AT)

where P_n denotes the set of polynomials of degree at most n with real coefficients. We denote by I_+(p) (resp. I_−(p)) the set of points where p(x) − f(x) attains its maximum (resp. minimum), and we set I(p) := I_+(p) ∪ I_−(p). We recall that ‖f‖_∞ := sup{|f(x)|; x ∈ [a, b]}.

Lemma 2.41 A polynomial p ∈ P_n is a solution of (AT) iff there exists no polynomial r ∈ P_n such that

(f(x) − p(x)) r(x) < 0, for all x ∈ I(p).      (2.91)



Proof We can rewrite (AT) as a linear semi-infinite optimization problem:

Min_{v∈R, p∈P_n}  v;   ±(f(x) − p(x)) − v ≤ 0, for all x ∈ [a, b].      (AT′)

The cost function and the constraints are affine functions of the optimization parameters, and the Slater condition (2.80) is satisfied (take (v̄, p̄) with v̄ = 1 + ‖f‖_∞ and p̄ = 0). By Lemma 2.33, (v, p) ∈ S(AT′) iff (w, r) = 0 is a solution of the linearized problem. The latter can be written as follows:

Min_{w∈R, r∈P_n}  w;   r(x) ≤ w, x ∈ I_+(p);   −r(x) ≤ w, x ∈ I_−(p).      (LAT′)

In other words, (v, p) ∈ S(AT′) iff there exists no polynomial r ∈ P_n such that r(x) < 0 when x ∈ I_+(p) and r(x) > 0 when x ∈ I_−(p). The conclusion follows upon changing r into −r. □

Theorem 2.42 (Characterization theorem) A polynomial p ∈ P_n is a solution of (AT) iff there exist n + 2 points x_0 < x_1 < ⋯ < x_{n+1} in [a, b] such that

|p(x_i) − f(x_i)| = ‖p − f‖_∞,  i = 0, …, n + 1;      (2.92)

p(x_{i+1}) − f(x_{i+1}) = −[p(x_i) − f(x_i)],  i = 0, …, n.      (2.93)

Proof By Lemma 2.41, it suffices to check that (2.92)–(2.93) holds iff (2.91) has no solution. If (2.92)–(2.93) holds and r satisfies (2.91), then r changes sign at least n + 1 times, and therefore has at least n + 1 distinct roots, which is impossible for a nonzero r ∈ P_n. If on the contrary (2.92)–(2.93) does not hold, then (p − f) changes sign at most n times over I(p). So there exist m ≤ n and numbers α_0, …, α_{m+1}, with α_i ∉ I(p) for all i, such that

a = α_0 < α_1 < ⋯ < α_{m+1} = b,

and (p − f) is nonzero with constant sign, alternately +1 and −1, over I(p) ∩ ]α_i, α_{i+1}[ for all i = 0 to m. The same alternation holds for r(x) := Π_{i=1}^m (x − α_i). Therefore either r or −r satisfies (2.91). □

We say that the set of points x0 < x1 < · · · < xn+1 in [a, b] is a reference of the
polynomial p if (2.92)–(2.93) is satisfied.

Theorem 2.43 (Uniqueness theorem) Problem (AT ) has a unique solution.

Proof (a) Existence: the space P_n of polynomials of degree at most n, whose elements are written p_z = ∑_{i=0}^n z_i x^i, is of finite dimension, so the two norms ‖p_z‖_P := ∑_{i=0}^n |z_i| and ‖p_z‖_∞ are equivalent. A minimizing sequence is therefore bounded, and we easily deduce the existence of a solution.
(b) Uniqueness. Let p and q be two distinct solutions. Set r := p − q, and let x_0, …, x_{n+1} be a reference of p. The relations ‖f − p‖_∞ = ‖f − q‖_∞ and

r(x_i) = (p(x_i) − f(x_i)) − (q(x_i) − f(x_i)),  i = 0, …, n + 1,      (2.94)

imply that, for each i, either r(x_i) is equal to zero, or it has the sign of p(x_i) − f(x_i). Set

I := {i : r(x_i) ≠ 0, i = 0, …, n + 1};   J := {i : r(x_i) = 0, i = 0, …, n + 1}.

Since r is a polynomial of degree at most n, it suffices to check that it has at least n + 1 zeros, the latter being counted with their order of multiplicity; then r = 0, contradicting p ≠ q. If J is empty, then r changes sign at least n + 1 times, and this holds; we get the same conclusion if J contains no points other than 0 and n + 1. Otherwise, let i ∈ J be different from 0 and n + 1, and let s = ±1 be the sign of p(x_i) − f(x_i). Set α := max{s r(x); x_{i−1} ≤ x ≤ x_{i+1}}. Then α ≥ s r(x_i) ≥ 0, and as r(x_{i+1}) and r(x_{i−1}) have a sign opposite to that of s (or vanish), the maximum is attained at a point x̂_i ∈ ]x_{i−1}, x_{i+1}[. Set x̂_i := x_i when i ∈ I, or when i ∈ J with i = 0 or n + 1. Then r(x̂_i) has the same sign as p(x_i) − f(x_i), and if r(x̂_i) = 0, then x̂_i is a zero of r of multiplicity at least two. Therefore, to each interval ]x̂_i, x̂_{i+1}[ we can associate a zero of r that corresponds either to a change of sign of r over ]x̂_i, x̂_{i+1}[, or to one of the multiple zeros of r at x̂_i or x̂_{i+1} (or to a simple zero at an endpoint of the interval). We have shown that r has at least n + 1 zeros, counted with multiplicities, as was to be done. □

2.5.4 Chebyshev Polynomials and Lagrange Interpolation

The previous results allow us to present the theory of Chebyshev polynomials, and
their application to Lagrange interpolation.
The Chebyshev polynomial of degree n, denoted by T_n, is defined over [−1, 1] by the equality T_n(cos θ) = cos(nθ), or equivalently T_n(x) = cos(n arccos x). The formula

cos((n + 1)θ) + cos((n − 1)θ) = 2 cos θ cos(nθ)

implies the induction relation

T_{n+1}(x) = 2x T_n(x) − T_{n−1}(x).

In particular,

T_0(x) = 1,  T_1(x) = x,  T_2(x) = 2x² − 1.
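The induction relation gives a stable way to evaluate T_n numerically; the sketch below (with our helper `chebyshev`) checks the defining identity T_n(cos θ) = cos(nθ):

```python
import math

def chebyshev(n, x):
    """T_n(x) via the three-term recurrence T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

# defining identity T_n(cos(theta)) = cos(n*theta)
for n in range(8):
    for k in range(50):
        theta = math.pi * k / 49.0
        assert abs(chebyshev(n, math.cos(theta)) - math.cos(n * theta)) < 1e-9
```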

Proposition 2.44 For all n ∈ N, the polynomial (1/2)^n T_{n+1} is, among all polynomials of degree n + 1 whose coefficient of x^{n+1} is 1, the one of minimal uniform norm over [−1, 1].

Proof (a) One easily checks by induction that the coefficient of x^{n+1} in T_{n+1} is 2^n. The coefficient of x^{n+1} in (1/2)^n T_{n+1} is therefore 1.

(b) We want to show that the coefficients of degree 0 to n of (1/2)^n T_{n+1} yield a solution of the problem

Min_{z_0,…,z_n}  max_{x∈[−1,1]}  | x^{n+1} − ∑_{i=0}^n z_i x^i |.      (2.95)

This can be interpreted as the problem of uniform approximation of x^{n+1} by a polynomial of degree n over [−1, 1]. By Theorems 2.42 and 2.43, there exists a unique solution characterized by (2.92)–(2.93), with here f(x) = x^{n+1}. Now the uniform norm of (1/2)^n T_{n+1} is (1/2)^n, and (1/2)^n T_{n+1}(x) equals alternately ±(1/2)^n at the points x = cos((1 − i/(n + 1))π), i = 0, …, n + 1. The result follows. □
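Proposition 2.44 can be illustrated numerically: (1/2)^n T_{n+1} takes the alternating values ±(1/2)^n at the n + 2 points cos((1 − i/(n+1))π), which therefore form a reference in the sense of Theorem 2.42. A sketch for n = 4 (helper `chebyshev` is ours):

```python
import math

def chebyshev(n, x):
    """T_n(x) via the recurrence T_{k+1} = 2x T_k - T_{k-1}."""
    t_prev, t = 1.0, x
    if n == 0:
        return t_prev
    for _ in range(n - 1):
        t_prev, t = t, 2.0 * x * t - t_prev
    return t

n = 4
scale = 0.5 ** n
# the n+2 alternation points x_i = cos((1 - i/(n+1)) * pi), i = 0, ..., n+1
pts = [math.cos((1.0 - i / (n + 1)) * math.pi) for i in range(n + 2)]
vals = [scale * chebyshev(n + 1, x) for x in pts]
# |values| all equal (1/2)^n, with alternating signs: a reference of Theorem 2.42
assert all(abs(abs(v) - scale) < 1e-12 for v in vals)
assert all(vals[i] * vals[i + 1] < 0 for i in range(n + 1))
# and (1/2)^n is indeed the uniform norm over [-1, 1]
grid_max = max(abs(scale * chebyshev(n + 1, -1.0 + 2.0 * k / 2000.0))
               for k in range(2001))
assert grid_max <= scale + 1e-12
```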

Consider now the problem of interpolation of a continuous function f by a polynomial of degree n over an interval [a, b]. The method of Lagrange interpolation consists in the choice of n + 1 distinct points x_0, …, x_n in [a, b], called interpolation points, and of the polynomial of degree n equal to f at these n + 1 points:

p(x_i) = f(x_i),  i = 0, …, n.      (2.96)

Given j ∈ {0, …, n}, there is a unique polynomial of degree n that vanishes at all points x_i except at x_j, where it is equal to one, namely ℓ_j(x) := Π_{i≠j} (x − x_i)/(x_j − x_i), and (2.96) therefore has the unique solution p(x) = ∑_{j=0}^n f(x_j) ℓ_j(x). A naive choice of the interpolation points is to take them with constant increments. This leads to significant errors. We will see that, in some sense, the zeros of the Chebyshev polynomial are the best possible choice.

Lemma 2.45 Let f ∈ C^{n+1}[a, b]. Then the error e(x) = f(x) − p(x) satisfies

e(x) = (1/(n + 1)!) Π_{i=0}^n (x − x_i) f^{(n+1)}(ξ),      (2.97)

where the point ξ ∈ [a, b] depends on x.

Proof (a) If a function of class C 1 over [a, b] vanishes at two distinct points, then
by Rolle’s theorem, its derivative has at least a zero between these two points. By
induction, we deduce that if g ∈ C n+1 [a, b] vanishes at n + 2 distinct points, then
its derivative of order n + 1 has at least a zero in [a, b].
(b) If x ∈ [a, b] is an interpolation point, (2.97) is trivially satisfied. Otherwise, set

g(t) = f(t) − p(t) − e(x) Π_{i=0}^n (t − x_i)/(x − x_i).      (2.98)

Since g is of class C n+1 [a, b] and vanishes at all interpolation points and at x,
there exists an ξ ∈ [a, b] such that g (n+1) (ξ ) = 0. Computing g (n+1) (ξ ), the result
follows. 

The previous lemma suggests, in the absence of specific information about f^{(n+1)}, choosing the interpolation points so as to minimize the uniform norm of the product function

prod(x) := Π_{i=0}^n (x − x_i).      (2.99)

Proposition 2.46 There exists a unique choice of interpolation points minimizing the uniform norm of the product function, namely

x_i = ½(a + b) + ½(b − a) cos( (2(n − i) + 1)π / (2(n + 1)) ),  i = 0, …, n.      (2.100)

Choosing these points, we have

‖f − p‖_∞ ≤ ((b − a)/2)^{n+1} ‖f^{(n+1)}‖_∞ / (2^n (n + 1)!).      (2.101)

Proof It suffices to check the result when [a, b] = [−1, 1]; the general case follows by the affine change of variable x = ½(a + b) + ½(b − a)t, which multiplies a monic polynomial of degree n + 1 by the factor ((b − a)/2)^{n+1}. We must find the polynomial of degree n + 1, monic and with roots in [−1, 1], of minimal uniform norm. By Proposition 2.44, (1/2)^n T_{n+1} is the unique solution of this problem, and the interpolation points are the zeros of T_{n+1}, whence (2.100). Using ‖T_{n+1}‖_∞ = 1 and (2.97), we obtain (2.101). □
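A numerical illustration of Proposition 2.46 (our own toy data: f = exp over [0, 1], with n = 5; the interval scaling factor ((b − a)/2)^{n+1} is written out explicitly in the bound):

```python
import math

a, b, n = 0.0, 1.0, 5
f = math.exp                       # f^{(n+1)} = exp as well, so ||f^{(n+1)}|| = e

# interpolation points (2.100)
x = [0.5 * (a + b)
     + 0.5 * (b - a) * math.cos((2 * (n - i) + 1) * math.pi / (2 * (n + 1)))
     for i in range(n + 1)]

def lagrange(t):
    """Lagrange interpolant p of f at the nodes x, evaluated at t."""
    total = 0.0
    for j in range(n + 1):
        lj = 1.0
        for i in range(n + 1):
            if i != j:
                lj *= (t - x[i]) / (x[j] - x[i])
        total += f(x[j]) * lj
    return total

max_err = max(abs(f(a + (b - a) * k / 2000.0) - lagrange(a + (b - a) * k / 2000.0))
              for k in range(2001))
# error bound of (2.101), with the interval scaling ((b-a)/2)^{n+1} made explicit
bound = ((b - a) / 2.0) ** (n + 1) * math.e / (2.0 ** n * math.factorial(n + 1))
assert 0.0 < max_err <= bound
```

With six Chebyshev nodes the bound is below 2·10⁻⁶, and the sampled error respects it.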

Corollary 2.47 If f is a polynomial of degree n + 1, its best uniform approximation by a polynomial of degree n is the one obtained by interpolation at the points given by (2.100).

Proof Since f (n+1) is constant, by Lemma 2.45, the maximal error is proportional
to the uniform norm of the product function. By Proposition 2.46, this amount is
minimal if the interpolation points are given by (2.100). 

Remark 2.48 The corollary is useful in the following situation. Let p be a polynomial of degree n + 1, which is a candidate for the approximation of f. We can wonder what would be the quality of the approximation of f by a polynomial of degree n. Taking the polynomial q that interpolates p at the points given by (2.100), we have by (2.101) the estimate

‖f − q‖_∞ ≤ ‖f − p‖_∞ + ((b − a)/2)^{n+1} ‖p^{(n+1)}‖_∞ / (2^n (n + 1)!).      (2.102)

2.6 Nonnegative Polynomials over R

2.6.1 Nonnegative Polynomials

We show in this section that the nonnegativity of a polynomial over an interval of R is equivalent to the positive semidefiniteness of a matrix whose coefficients are related to those of the polynomial by linear relations. This is of interest since efficient algorithms for solving linear positive semidefinite optimization problems are known.
Consider the polynomial

p_z(ω) = ∑_{k=0}^n z_k ω^k      (2.103)

of coefficients (z_0, …, z_n). Let us start by characterizing the nonnegativity of this polynomial over R. Of course, this implies that n is even.

Lemma 2.49 Let n be even. Then the polynomial p_z(ω) is nonnegative over R iff there exists a symmetric, positive semidefinite matrix Φ = {Φ_ij}, 0 ≤ i, j ≤ n/2, such that

z_k = ∑_{i+j=k} Φ_ij,  k = 0, …, n.      (2.104)

Proof If (2.104) holds, set y := (1, ω, …, ω^{n/2}). Then

p_z(ω) = ∑_{k=0}^n ( ∑_{i+j=k} Φ_ij ) ω^k = ∑_{i,j=0}^{n/2} Φ_ij ω^i ω^j = y^⊤ Φ y ≥ 0,

and therefore the polynomial p_z is nonnegative. Conversely, assume that p_z(ω) ≥ 0 for all ω. Then its real roots have even multiplicity, since otherwise a change of sign would occur. Denote by α_i the real roots, of multiplicity 2r_i, and by a_j ± ib_j the conjugate complex roots. Necessarily z_n ≥ 0. Let us decompose the polynomial

q(ω) := z_n^{1/2} Π_i (ω − α_i)^{r_i} Π_j (ω − a_j − i b_j)

in the form q(ω) = A(ω) + iB(ω), where A and B are polynomials with real coefficients. Then p_z(ω) = A(ω)² + B(ω)². We have obtained a decomposition of p_z(ω) as a sum of two squares of polynomials, of degree at most n/2. Now consider a polynomial of degree at most n/2, say ∑_{k=0}^{n/2} c_k ω^k. Its square is of the desired form with Φ_ij = c_i c_j for all i and j (this rank-one matrix is positive semidefinite). The same holds for a sum of squares (it suffices to sum the corresponding matrices). □

Remark 2.50 (i) We have shown that a polynomial is nonnegative over R iff it is the
sum of at most two squares of polynomials. (ii) Lemma 2.49 allows us to check the
nonnegativity of a polynomial by solving an SDP problem.
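The correspondence (2.104) between Gram matrices and coefficients can be illustrated on p(ω) = (ω² − 1)², a single square (our own toy example, in pure Python):

```python
# p(w) = (w^2 - 1)^2 = 1 - 2 w^2 + w^4 is a sum of squares; a Gram matrix for it
# is the rank-one Phi = c c^T built from the coefficients c = (-1, 0, 1) of w^2 - 1.
c = [-1.0, 0.0, 1.0]
Phi = [[ci * cj for cj in c] for ci in c]          # PSD by construction (rank one)

# recover the coefficients z_k of p via (2.104): z_k = sum_{i+j=k} Phi_ij
z = [sum(Phi[i][k - i] for i in range(3) if 0 <= k - i < 3) for k in range(5)]
assert z == [1.0, 0.0, -2.0, 0.0, 1.0]

# sanity check: p_z is nonnegative on a sample of R
def p_z(w):
    return sum(z[k] * w ** k for k in range(5))

assert all(p_z(-3.0 + 6.0 * k / 500.0) >= -1e-9 for k in range(501))
```

Conversely, an SDP solver would search over all PSD matrices Φ satisfying the linear constraints (2.104); here the certificate is written by hand.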

Example 2.51 Let K be a polyhedron in R^{n+1}. Then the problem

Min_z  ∑_{i=0}^n c_i z_i;   z ∈ K;   ∑_{k=0}^n z_k ω^k ≥ 0, for all ω ∈ R,

is equivalent to the SDP problem

Min_{z,Φ}  ∑_{i=0}^n c_i z_i;   z ∈ K;   z_k = ∑_{i+j=k} Φ_ij, k = 0, …, n;   Φ ⪰ 0.

The previous result allows us to deduce an analogous result in the case of the
nonnegativity of a polynomial over R+ .

Lemma 2.52 A polynomial p_z(ω) of degree n is nonnegative over R_+ iff there exists a symmetric, positive semidefinite matrix Φ = {Φ_ij}, 0 ≤ i, j ≤ n, such that

0 = ∑_{i+j=2k−1} Φ_ij,  1 ≤ k ≤ n;      z_k = ∑_{i+j=2k} Φ_ij,  0 ≤ k ≤ n.      (2.105)

Proof The nonnegativity of pz (ω) over R+ is equivalent to that of the polynomial

z 0 + z 1 ω2 + · · · + z n ω2n (2.106)

over R, whence the result by Lemma 2.49. 

This parametrization involves a matrix of size 1 + n. We can do better by parametrizing with two matrices of size close to 1 + ½n. We first give a preliminary result.

Lemma 2.53 Let a, f_1 and f_2 be three functions R → R such that f_i(ω) = q_i(ω)² + a(ω) r_i(ω)², where q_i (resp. r_i) are polynomials of degree n_i (resp. n_i − 1). Then the function f(ω) = f_1(ω) f_2(ω) is of the form q(ω)² + a(ω) r(ω)², where q and r are functions R → R that are polynomial if a(·) is. If in addition a(·) is a polynomial of degree at most 2, we can choose polynomials q and r of degree at most n_1 + n_2 and n_1 + n_2 − 1, respectively.

Proof It suffices to use the identity

f_1(ω) f_2(ω) = ( q_1(ω) q_2(ω) + a(ω) r_1(ω) r_2(ω) )² + a(ω) ( q_1(ω) r_2(ω) − q_2(ω) r_1(ω) )².      (2.107)  □
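The identity (2.107) is a Brahmagupta–Fibonacci type (two-squares) identity; it can be checked numerically at random data (the helper `poly_eval` is ours):

```python
import random

def poly_eval(coeffs, w):
    """Evaluate a polynomial given by its coefficients in increasing degree."""
    return sum(c * w ** k for k, c in enumerate(coeffs))

random.seed(1)
for _ in range(100):
    # random polynomials q_i, r_i and a quadratic a(.)
    q1 = [random.uniform(-1, 1) for _ in range(3)]
    r1 = [random.uniform(-1, 1) for _ in range(2)]
    q2 = [random.uniform(-1, 1) for _ in range(3)]
    r2 = [random.uniform(-1, 1) for _ in range(2)]
    a_ = [random.uniform(-1, 1) for _ in range(3)]
    w = random.uniform(-2, 2)
    aw = poly_eval(a_, w)
    f1 = poly_eval(q1, w) ** 2 + aw * poly_eval(r1, w) ** 2
    f2 = poly_eval(q2, w) ** 2 + aw * poly_eval(r2, w) ** 2
    # right-hand side of (2.107)
    rhs = (poly_eval(q1, w) * poly_eval(q2, w)
           + aw * poly_eval(r1, w) * poly_eval(r2, w)) ** 2 \
        + aw * (poly_eval(q1, w) * poly_eval(r2, w)
                - poly_eval(q2, w) * poly_eval(r1, w)) ** 2
    assert abs(f1 * f2 - rhs) < 1e-9
```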

We denote by ⌊x⌋ the integer part of x (the greatest integer less than or equal to x).

Lemma 2.54 A polynomial p_z(ω) of degree n is nonnegative over R_+ iff it satisfies one of the following two conditions:

(i) There exist two polynomials q and r, of degree at most ⌊½n⌋ and ⌈½n⌉ − 1 respectively, such that

p_z(ω) = q(ω)² + ω r(ω)².      (2.108)

(ii) There exist two positive semidefinite matrices Φ and Ψ, with indices varying from 0 to ⌊½n⌋ and from 0 to ⌊½(n − 1)⌋ respectively, such that

z_0 = Φ_00;   z_k = ∑_{i+j=k} Φ_ij + ∑_{i+j=k−1} Ψ_ij,  k = 1, …, n.      (2.109)

Proof (i) It is clear that if p_z is of the form (2.108), it is nonnegative over R_+. Conversely, let p_z be nonnegative over R_+. We give a proof by induction over n. If n = 0 or 1, the decomposition (2.108) is easily obtained. Let us deal with the case n = 2. If p_z has real roots, either there is a double root and (2.108) holds with r = 0, or they are simple and then p_z is the product of two affine functions that are nonnegative over R_+; then (2.108) is a consequence of Lemma 2.53. Finally, in the case of conjugate complex roots, p_z(ω) = a[(ω + β)² + α], with β ∈ R, α > 0, and a > 0 since p_z is nonnegative over R_+. It is enough to discuss the case when a = 1. Then p_z(ω) = (ω − γ)² + δω, with γ = √(β² + α) and δ := 2β + 2√(β² + α) > 0, which gives the desired decomposition.
Assume now that the conclusion holds up to n − 1, with n > 2. Let us check the existence of α ≥ 0 and β ∈ R such that

p_z(ω) = (ω + α) q(ω)   or   p_z(ω) = ((ω + β)² + α) q(ω),      (2.110)

where q is a polynomial nonnegative over R_+, of degree n − 1 in the first case, and n − 2 in the second. Indeed, if p_z has a root ω_0 in R_−, it is of the first form with α = −ω_0. Otherwise, p_z has either a positive root −β, which necessarily has even multiplicity, or two conjugate roots −β ± i√α. In both cases p_z is of the second form. We have shown that p_z is a product of polynomials of the desired form (taking into account the discussion of the case n = 2). We then conclude by Lemma 2.53.
(ii) If (2.109) is satisfied, set y := (1, ω, …, ω^{⌊n/2⌋}) and ŷ := (1, ω, …, ω^{⌊(n−1)/2⌋}). Then

p_z(ω) = ∑_{k=0}^n ( ∑_{i+j=k} Φ_ij + ∑_{i+j=k−1} Ψ_ij ) ω^k
       = ∑_{i,j} Φ_ij ω^i ω^j + ω ∑_{i,j} Ψ_ij ω^i ω^j
       = y^⊤ Φ y + ω ŷ^⊤ Ψ ŷ

is nonnegative over R_+. Conversely, let p_z be nonnegative over R_+. Then it has a decomposition of the form (2.108). Denote by z¹ and z² the coefficient vectors of q and r, respectively. Then the matrices Φ = z¹(z¹)^⊤ and Ψ = z²(z²)^⊤ satisfy (2.109). □
We can state a similar result in the case of a bounded interval.

Lemma 2.55 Let a and b be two real numbers, with a < b. A polynomial p_z(ω) of degree n is nonnegative over [a, b] iff it satisfies one of the two following conditions:
(i) There exist two polynomials q and r, of degree at most ½n and ½n − 1 respectively if n is even, and at most ½(n − 1) if n is odd, such that

p_z(ω) = q(ω)² + (b − ω)(ω − a) r(ω)²   if n is even;
p_z(ω) = (ω − a) q(ω)² + (b − ω) r(ω)²  otherwise.      (2.111)

(ii) There exist two positive semidefinite matrices Φ and Ψ, with indices varying from 0 to ½n and ½n − 1 respectively if n is even, and from 0 to ½(n − 1) if n is odd, such that, if n is even:

z_k = ∑_{i+j=k} Φ_ij − ab ∑_{i+j=k} Ψ_ij + (a + b) ∑_{i+j=k−1} Ψ_ij − ∑_{i+j=k−2} Ψ_ij,  k = 0, …, n;      (2.112)

and if n is odd:

z_k = −a ∑_{i+j=k} Φ_ij + ∑_{i+j=k−1} Φ_ij + b ∑_{i+j=k} Ψ_ij − ∑_{i+j=k−1} Ψ_ij,  k = 0, …, n.      (2.113)

Proof We follow the scheme of proof of Lemma 2.54(i).
(a) We first check that the set of polynomials of the form (2.111) is stable under multiplication. For the product of two even polynomials, this follows from Lemma 2.53, with here a(ω) = (b − ω)(ω − a). For the product of two odd polynomials of the form p_i = (ω − a) q_i(ω)² + (b − ω) r_i(ω)², i = 1, 2, we obtain with (2.107), omitting the argument ω:

p_1 p_2 = (ω − a)² [ q_1² + ((b−ω)/(ω−a)) r_1² ] [ q_2² + ((b−ω)/(ω−a)) r_2² ]
        = (ω − a)² [ ( q_1 q_2 + ((b−ω)/(ω−a)) r_1 r_2 )² + ((b−ω)/(ω−a)) ( q_1 r_2 − q_2 r_1 )² ]
        = ( (ω − a) q_1 q_2 + (b − ω) r_1 r_2 )² + (b − ω)(ω − a)( q_1 r_2 − q_2 r_1 )²,

which is of the form (2.111), since p_1 p_2 is even. Finally, if

p_1 = q_1² + (b − ω)(ω − a) r_1² is even, and p_2 = (ω − a) q_2² + (b − ω) r_2² is odd,      (2.114)

we get by (2.107)

p_1 p_2 = (ω − a)³ [ q_1²/(ω − a)² + ((b−ω)/(ω−a)) r_1² ] [ q_2² + ((b−ω)/(ω−a)) r_2² ]
        = (ω − a)³ [ ( (q_1/(ω−a)) q_2 + ((b−ω)/(ω−a)) r_1 r_2 )² + ((b−ω)/(ω−a)) ( (q_1/(ω−a)) r_2 − q_2 r_1 )² ]
        = (ω − a)( q_1 q_2 + (b − ω) r_1 r_2 )² + (b − ω)( q_1 r_2 − (ω − a) q_2 r_1 )²,

which is still of the desired form.


(b) If p_z is of the form (2.111), it is clearly nonnegative over [a, b]. Conversely, let p_z be nonnegative over [a, b]. We proceed by induction over n. For n = 0 or 1, one easily obtains the decomposition (2.111). Let us check it when n = 2. In that case q and r are of degree at most 1 and 0. If p has a real root, either it is of even multiplicity and we obtain the desired form with δ = 0, or it is outside (a, b) and p is then the product of two factors of the desired form for n = 1; we have checked in point (a) that the product still has the desired form. It remains to deal with the case of conjugate complex roots, i.e., (normalizing the leading coefficient) p_1(ω) = (ω + β)² + α, with β ∈ ℝ and α > 0. By the change of variable ω ↦ (ω − a)/(b − a), we reduce to the case when a = 0 and b = 1. In the sequel we look for q of degree 1. If β = −1/2, the desired decomposition is p_1 = (ω − 1/2)² + α = (4α + 1)(ω − 1/2)² + 4αω(1 − ω). If β ≠ −1/2, let us check that p_1 is of the form γ(ω − ω₀)² + δω(1 − ω), with γ ≥ 0, δ ≥ 0, and ω₀ ∈ ℝ. Writing the equality of the coefficients of each degree, and eliminating δ = γ − 1 (degree two), it remains to solve (1 − 2ω₀)γ = 2β + 1 and γω₀² = α + β². Since β ≠ −1/2, we have ω₀ ≠ 1/2, and so γ = (2β + 1)/(1 − 2ω₀). Combining with the previous equality, we get (2β + 1)ω₀² = (1 − 2ω₀)(α + β²), which (since it has positive discriminant) necessarily has a real solution, different from 1/2.
Now assume the conclusion holds up to n − 1, with n ≥ 3. Proceeding as in the
proof of Lemma 2.54, we see that pz can be written as a product of polynomials with
constant sign over [a, b] of the form (2.110), with α ∈ R. We conclude by Lemma
2.53.
(ii) The argument is similar to the one used in the previous proofs. □
Remark 2.56 We can also reduce nonnegativity over an interval to nonnegativity over ℝ, by using the following relations:

$$
\begin{aligned}
p_z(\omega) \ge 0,\ \omega \in [a, \infty[, \quad &\text{iff} \quad p_z(a + \omega^2) \ge 0,\ \omega \in \mathbb{R};\\
p_z(\omega) \ge 0,\ \omega \in (-\infty, a], \quad &\text{iff} \quad p_z(a - \omega^2) \ge 0,\ \omega \in \mathbb{R};\\
p_z(\omega) \ge 0,\ \omega \in [a, b], \quad &\text{iff} \quad (1 + \omega^2)^n\, p_z\Big( a + (b - a)\,\frac{\omega^2}{1 + \omega^2} \Big) \ge 0,\ \omega \in \mathbb{R}.
\end{aligned}
$$

However, if the polynomial is of degree n, the SDP constraints involve a matrix of size 1 + n for the nonnegativity over a half-line or over a bounded interval. This is less efficient than the characterizations of the previous lemmas.
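The stability under products used in point (a) of the proof above is an instance of the two-squares identity (u₁² + λv₁²)(u₂² + λv₂²) = (u₁u₂ + λv₁v₂)² + λ(u₁v₂ − u₂v₁)², with λ = (b − ω)/(ω − a). As a numerical sanity check, here is a short sketch (Python; the function names are ours, not the book's) testing both product formulas pointwise at random values:

```python
import random

def odd_odd_identity_gap(a, b, w, q1, r1, q2, r2):
    # p_i = (w - a) q_i^2 + (b - w) r_i^2: product of two "odd" factors
    p1 = (w - a) * q1 ** 2 + (b - w) * r1 ** 2
    p2 = (w - a) * q2 ** 2 + (b - w) * r2 ** 2
    rhs = ((w - a) * q1 * q2 + (b - w) * r1 * r2) ** 2 \
        + (b - w) * (w - a) * (q1 * r2 - q2 * r1) ** 2
    return abs(p1 * p2 - rhs)

def even_odd_identity_gap(a, b, w, q1, r1, q2, r2):
    # p1 = q1^2 + (b - w)(w - a) r1^2 ("even"); p2 = (w - a) q2^2 + (b - w) r2^2 ("odd")
    p1 = q1 ** 2 + (b - w) * (w - a) * r1 ** 2
    p2 = (w - a) * q2 ** 2 + (b - w) * r2 ** 2
    rhs = (w - a) * (q1 * q2 + (b - w) * r1 * r2) ** 2 \
        + (b - w) * (q1 * r2 - (w - a) * q2 * r1) ** 2
    return abs(p1 * p2 - rhs)

random.seed(0)
a, b = -1.0, 2.0
max_gap = 0.0
for _ in range(1000):
    w = random.uniform(a, b)
    args = [random.uniform(-2, 2) for _ in range(4)]
    max_gap = max(max_gap,
                  odd_odd_identity_gap(a, b, w, *args),
                  even_odd_identity_gap(a, b, w, *args))
```

Both gaps stay at floating-point roundoff level, consistent with the products being again of the decomposed form discussed above.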
2.6 Nonnegative Polynomials over R 111

2.6.2 Characterization of Moments

In this section we assume that Ω is a closed interval of ℝ, not reduced to a point. We have defined the space of measures M(Ω), as well as its positive and negative cones M(Ω)+ and M(Ω)−, in Sect. 2.5.2. The moment of order k ∈ ℕ of the positive measure μ ∈ M(Ω)+ is, whenever it is defined, the integral

$$
M_k(\mu) := \int_\Omega \omega^k \, \mathrm{d}\mu(\omega). \tag{2.115}
$$
We denote the set of possible values of the first n + 1 moments of positive measures
over Ω by

M n := {(m 0 , . . . , m n ); ∃μ ∈ M(Ω)+ ; Mk (μ) = m k , k = 0, . . . , n} .

Similarly, we denote by M Fn the set of moments of positive measures with finite sup-
port over Ω; of course M Fn ⊂ M n . We will, in this section, study characterizations
of the sets M^n and M_F^n. The latter are obviously convex cones of ℝ^{n+1}.

Lemma 2.57 The set M_F^n has a nonempty interior, and ℝ^{n+1} = M_F^n − M_F^n.
Proof We will prove a more precise result: the conclusion remains true if Ω includes
at least n + 1 distinct points.
(a) Let ω0 , . . . , ωn be distinct points of Ω. Let us show that the set of moments of
measures with support over ω0 , . . . , ωn is equal to Rn+1 . Indeed these moments form
a vector subspace; let z belong to its orthogonal. We then have, for all (λ₀, ..., λₙ),

$$
0 = \sum_{i=0}^{n} z_i \Big( \sum_{k=0}^{n} \lambda_k\, \omega_k^i \Big) = \sum_{k=0}^{n} \lambda_k\, p_z(\omega_k).
$$

This proves that the polynomial pz vanishes at the points ω0 , . . . , ωn , i.e., it has more
roots than its degree, implying that z = 0, as was to be proved.
(b) We show that the interior of M Fn is nonempty, by checking that the set E of
moments of positive measures with support over the distinct points ω0 , . . . , ωn of Ω
has a nonempty interior. Since E is convex, if it has an empty interior, it is included
in a hyperplane with normal λ; then λ is also normal to E − E, but we checked that
E − E = ℝ^{n+1}, which is a contradiction. □
Remark 2.58 The set of moments of positive measures with support over the points
{ω0 , . . . , ωn } is of course the cone generated by the n + 1 Dirac measures asso-
ciated with {ω0 , . . . , ωn }. It is therefore characterized by a finite number of linear
inequalities (Pulleyblank [90]).
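The spanning argument in the proof of Lemma 2.57 is of Vandermonde type: moments of signed combinations of Dirac masses at n + 1 distinct points reach any vector of ℝ^{n+1}. A small pure-Python sketch (the solver and the names are ours; it illustrates, not proves, the lemma):

```python
def solve(A, b):
    # Gauss-Jordan elimination with partial pivoting (A square, b a vector)
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[r][c] != 0.0:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

n = 3
pts = [0.0, 0.5, 1.0, 2.0]                          # n + 1 distinct points of Omega
V = [[p ** i for p in pts] for i in range(n + 1)]   # row i maps weights to moment of order i
target = [1.0, -2.0, 3.0, 0.25]                     # an arbitrary vector of R^{n+1}
w = solve(V, target)                                # signed weights on the points
moments = [sum(w[j] * pts[j] ** i for j in range(n + 1)) for i in range(n + 1)]
```

The solved weights have both signs, illustrating that a difference of two positive finitely supported measures is needed in general.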
The aim of this section is to present a method of characterization of the set M n .
We first recall a classical result, based on the following matrices (often called moment
matrices in the literature)
$$
M_0(m) := \begin{pmatrix}
m_0 & m_1 & \cdots & m_n \\
m_1 & m_2 & \cdots & m_{n+1} \\
\vdots & \vdots & \ddots & \vdots \\
m_n & m_{n+1} & \cdots & m_{2n}
\end{pmatrix}; \qquad
M_1(m) := \begin{pmatrix}
m_1 & m_2 & \cdots & m_{n+1} \\
m_2 & m_3 & \cdots & m_{n+2} \\
\vdots & \vdots & \ddots & \vdots \\
m_{n+1} & m_{n+2} & \cdots & m_{2n+1}
\end{pmatrix}.
$$

Lemma 2.59 (i) Let (m₀, ..., m_{2n+1}) ∈ M^{2n+1}. Then M₀(m) ⪰ 0. (ii) If in addition Ω ⊂ ℝ₊, then M₁(m) ⪰ 0.

Proof (i) Set x(ω) := (1, ω, ω², ..., ωⁿ)ᵀ. From x(ω)x(ω)ᵀ ⪰ 0 and μ ≥ 0, we deduce that

$$
M_0(m) = \int_\Omega x(\omega)\, x(\omega)^\top \, \mathrm{d}\mu(\omega) \succeq 0. \tag{2.116}
$$

(ii) Since Ω ⊂ ℝ₊, the vector x̂(ω) := (ω^{1/2}, ω^{3/2}, ..., ω^{n+1/2})ᵀ is well-defined. The relations x̂(ω)x̂(ω)ᵀ ⪰ 0 and μ ≥ 0 imply that

$$
M_1(m) = \int_\Omega \hat{x}(\omega)\, \hat{x}(\omega)^\top \, \mathrm{d}\mu(\omega) \succeq 0. \tag{2.117}
$$

□
Remark 2.60 We can give other examples of necessary conditions based on similar arguments. For instance, when Ω = [0, 1], the vector

$$
\tilde{x}(\omega) := \big( (1-\omega)^{1/2}, (1-\omega)^{3/2}, \ldots, (1-\omega)^{n+1/2} \big)^\top
$$

is well-defined, and so, M₂(m) := ∫_Ω x̃(ω)x̃(ω)ᵀ dμ(ω) ⪰ 0. This gives additional information: for example, the nonnegativity of the first entry of this matrix gives m₀ ≥ m₁.
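Lemma 2.59 and Remark 2.60 can be illustrated numerically. The sketch below (pure Python; the helper names are ours) builds the Hankel-type matrices M₀(m) and M₁(m) from the moments of a finitely supported measure on [0, 1] and tests positive definiteness by attempting a Cholesky factorization; it also exhibits a vector failing the test, hence not a moment vector:

```python
import math

def moment(mu, k):
    # k-th moment of a finitely supported positive measure mu = {point: weight}
    return sum(w * x ** k for x, w in mu.items())

def hankel(ms, lo, n):
    # (n+1) x (n+1) matrix with entries m_{lo+i+j}
    return [[ms[lo + i + j] for j in range(n + 1)] for i in range(n + 1)]

def is_positive_definite(A):
    # plain Cholesky factorization; it succeeds iff A is symmetric positive definite
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0.0:
                    return False
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return True

mu = {0.1: 0.2, 0.3: 0.3, 0.6: 0.4, 0.9: 0.1}       # 4 support points in [0, 1]
n = 2
ms = [moment(mu, k) for k in range(2 * n + 2)]       # m_0, ..., m_{2n+1}
ok0 = is_positive_definite(hankel(ms, 0, n))         # M_0(m): entries m_{i+j}
ok1 = is_positive_definite(hankel(ms, 1, n))         # M_1(m): entries m_{1+i+j}
bad = is_positive_definite(hankel([1.0, 0.5, 0.2], 0, 1))  # m_2 - m_1^2 < 0: not moments
```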
Our study of characterizations of moments will use duality theory. Consider the following problem:

$$
\operatorname*{Min}_{z \in \mathbb{R}^{n+1}} \; \sum_{k=0}^{n} m_k z_k; \quad p_z(\omega) \ge 0 \text{ over } \Omega. \tag{PM}
$$

The criterion is linear, and the feasible domain is a cone; the value of this problem is therefore 0 or −∞. The “finite dual” problem (in the sense of Sect. 2.5) is

$$
\operatorname*{Max}_{\mu \in M_F(\Omega)_+} \; 0; \quad M_k(\mu) = m_k, \; k = 0, \ldots, n. \tag{DM_F}
$$

Its value is 0 if m ∈ M n , and −∞ otherwise. Its feasible set is the set of positive
measures having m for first moments.
Lemma 2.61 (i) We have val(D M F ) ≤ val(P M).
(ii) If in addition Ω is compact, then val(D M F ) = val(P M), and M n = M Fn .
Proof (i) If the dual is not feasible, so that its value is −∞, then val(DM_F) ≤ val(PM) trivially holds. Otherwise, let μ ∈ M(Ω)+ be such that M_k(μ) = m_k, for k = 0 to n, and let z ∈ F(PM). Then

$$
\sum_{k=0}^{n} m_k z_k \;\ge\; \sum_{k=0}^{n} m_k z_k - \int_\Omega p_z(\omega)\, \mathrm{d}\mu(\omega) \;=\; \sum_{k=0}^{n} z_k \big( m_k - M_k(\mu) \big) \;=\; 0. \tag{2.118}
$$

In particular, taking μ ∈ F(DM_F), we obtain (i).
(ii) Problem (PM) satisfies the Slater hypothesis (2.80): it suffices to take the polynomial constant equal to 1. Theorem 2.35 ensures the equality val(DM_F) = val(PM). In addition, if m ∈ M^n, then val(PM) = 0 by (2.118), and Theorem 2.35 implies m ∈ M_F^n, whence the conclusion. □
The lemmas of Sect. 2.6.1 imply that, when Ω is a closed interval of ℝ, bounded or not, problem (PM) has the same value as an SDP problem of the type

$$
\operatorname*{Min}_{z, \Phi} \; \sum_{k=0}^{n} m_k z_k; \quad z = \sum_{\ell=1}^{L} A_\ell \Phi_\ell, \quad \Phi_\ell \succeq 0, \; \ell = 1, \ldots, L, \tag{2.119}
$$

where L = 1 or 2, the Φ_ℓ being symmetric; the linear mappings (depending on Ω) A_ℓ : S^{n_ℓ} → ℝ^{n+1}, for some n_ℓ, can be deduced from the relations in Lemmas 2.53–2.55, or from Remark 2.56.
Lemma 2.62 Problem (2.119) has value 0 if A_ℓᵀm ⪰ 0, for ℓ = 1 to L, and −∞ otherwise.

Proof Eliminating z, we can write (2.119) in the form

$$
\operatorname*{Min}_{\Phi} \; \sum_{\ell=1}^{L} \langle \Phi_\ell, A_\ell^\top m \rangle; \quad \Phi_\ell \succeq 0, \; \ell = 1, \ldots, L. \tag{PM'}
$$

We conclude by Fejér’s Theorem 2.1. □


Example 2.63 Let Ω = ℝ₊. We have defined the matrices M₀(m) and M₁(m) in (2.116)–(2.117). Let A₁ and A₂ be deduced from the parametrization (2.109). We can check that A_ℓᵀm equals M₀(m) and M₁(m) for ℓ = 1 and 2 resp. (taking the convention that the indexes of these matrices go from 0 to n). Lemma 2.62 then implies that val(PM) = 0 iff M₀(m) ⪰ 0 and M₁(m) ⪰ 0.
Combining Lemmas 2.61 and 2.62, we deduce the following result:

Theorem 2.64 Let Ω be a closed interval of ℝ. (i) If m ∈ M^n, then A_ℓᵀm ⪰ 0, ℓ = 1, ..., L. (ii) If in addition Ω is bounded, then the converse holds: m ∈ M^n iff A_ℓᵀm ⪰ 0, ℓ = 1, ..., L. In addition, there exists a finite measure, with cardinality of support at most n + 1, having m for its first moments.

The previous theorem provides a characterization of the set M^n whenever Ω is bounded. Let us briefly discuss the case when Ω is unbounded.

Proposition 2.65 Let m ∈ int M^n. Then there exists a finite measure, with cardinality of support at most n + 1, having m for first moments.

Proof Let r > 0 and set Ω_r := Ω ∩ [−r, r]. If the conclusion does not hold, there exists no measure with support in Ω_r having m for first moments. By Lemma 2.61, there exists a z^r ∈ ℝ^{n+1} such that p_{z^r} is nonnegative over Ω_r, and Σ_k m_k z_k^r < 0. Let z̄ ≠ 0 be a limit point of z^r/|z^r|. Then p_{z̄} is nonnegative over Ω, and Σ_k m_k z̄_k ≤ 0. Choose m′ so close to m that Σ_k m′_k z̄_k < 0. Then problem (PM) for m′ has value −∞, which by (2.118) implies that m′ ∉ M^n, in contradiction with m ∈ int M^n. □

Example 2.66 Let us show that if Ω = ℝ₊, the set M^n is not closed. Let r > 1. To the measure μ_r = (1 − r^{−n})δ₀ + r^{−n}δ_r are associated the moments m^r = (1, r^{1−n}, ..., r^{−1}, 1), with limit m = (1, 0, ..., 0, 1). It is clear that m ∉ M^n (when n ≥ 2: m₂ = 0 forces a positive measure to be concentrated at 0, contradicting m_n = 1).
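A quick numerical illustration of this example (a sketch; the helper name is ours): the moment vectors m^r approach (1, 0, ..., 0, 1) as r grows, while the measures μ_r put an escaping mass r^{−n} at the point r:

```python
def moments_of(support, n):
    # support: list of (point, weight) pairs; returns (m_0, ..., m_n)
    return [sum(w * x ** k for x, w in support) for k in range(n + 1)]

n = 4
for r in [10.0, 100.0, 1000.0]:
    mu_r = [(0.0, 1.0 - r ** (-n)), (r, r ** (-n))]   # (1 - r^-n) delta_0 + r^-n delta_r
    m_r = moments_of(mu_r, n)
    # m_r = (1, r^{1-n}, ..., r^{-1}, 1): the middle entries vanish as r grows

gap = max(abs(x) for x in m_r[1:n])   # distance of middle entries from 0 at r = 1000
```

The limit vector itself is not a moment vector: a positive measure with second moment 0 is concentrated at the origin, which contradicts m_n = 1.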

2.6.3 Maximal Loading

Let n ∈ ℕ, n > 0, and let S be an interval contained in Ω. We consider the problem of maximal loading on the set S, under moment constraints:

$$
\operatorname*{Max}_{\mu \in M(\Omega)_+} \; \int_S \mathrm{d}\mu(\omega); \quad M_k(\mu) = m_k, \; k = 0, \ldots, n. \tag{DM_S}
$$

The data are m = (m₀, ..., mₙ)ᵀ. We may assume that m₀ = 1; the value of this problem is equal to 1 iff it is possible to realize the moments m_k with a probability measure with support over S. Denote by χ_S the characteristic function of S:

$$
\chi_S(\omega) = \begin{cases} 1 & \text{if } \omega \in S, \\ 0 & \text{otherwise.} \end{cases} \tag{2.120}
$$

With z = (z₀, ..., zₙ)ᵀ we associate the polynomial defined in (2.103). We will interpret this problem as the dual of the following “primal” problem:

$$
\operatorname*{Min}_{z \in \mathbb{R}^{n+1}} \; \sum_{k=0}^{n} m_k z_k; \quad p_z(\omega) - \chi_S(\omega) \ge 0 \text{ over } \Omega. \tag{PM_S}
$$

We can rewrite the latter in the form

$$
\operatorname*{Min}_{z \in \mathbb{R}^{n+1}} \; \sum_{k=0}^{n} m_k z_k; \quad p_z(\omega) \ge 1 \text{ over } S; \quad p_z(\omega) \ge 0 \text{ over } \Omega \setminus S. \tag{PM_S}
$$

Given measures μ₁ and μ₂ with support in S and Ω \ S resp., the Lagrangian of the problem is, denoting by M_{0:n}(·) the vector of moments of order 0 to n:

$$
\begin{aligned}
L(\mu, z) &:= \sum_{k=0}^{n} m_k z_k - \int_S (p_z(\omega) - 1)\, \mathrm{d}\mu_1(\omega) - \int_{\Omega \setminus S} p_z(\omega)\, \mathrm{d}\mu_2(\omega) \\
&= \big( m - M_{0:n}(\mu_1 + \mu_2) \big) \cdot z + \int_S \mathrm{d}\mu_1(\omega),
\end{aligned} \tag{2.121}
$$

and therefore the dual problem is

$$
\operatorname*{Max}_{\mu_1, \mu_2} \; \int_S \mathrm{d}\mu_1(\omega); \quad M_{0:n}(\mu_1 + \mu_2) = m; \quad \mu_1 \in M(S)_+; \quad \mu_2 \in M(\Omega \setminus S)_+. \tag{2.122}
$$

We can write in a unique way μ₂ = μ₂′ + μ₂″ with μ₂′(Ω \ S) = 0 and μ₂″(S) = 0. Changing if necessary μ₁ into μ₁ + μ₂′ and μ₂ into μ₂″, we see that it is optimal that μ₂(S) = 0, and hence the dual cost has the same value and constraints as the maximal loading problem for μ := μ₁ + μ₂. We deduce the following result:

Theorem 2.67 (i) We have val(DM_S) ≤ val(PM_S). (ii) If in addition Ω is compact, then val(DM_S) = val(PM_S), and (DM_S) has a solution with finite support.

Remark 2.68 By the results of Sect. 2.6.1, problem (PM_S) is equivalent to a linear positive semidefinite optimization problem. When Ω is compact, the maximal loading problem therefore reduces to an SDP problem.

Remark 2.69 The previous results can be useful in the context of risk control. Assume that certain moments of a probability distribution of gains, with values in a bounded interval Ω, are known. We can then compute the maximal probability that the gain lies below a certain threshold s, by solving a maximal loading problem, with here S := ]−∞, s] ∩ Ω.
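For a toy instance of this risk-control use (our construction, not the book's; only m₀ = 1 and the mean m₁ are imposed), the maximal loading over Ω = [0, M] and S = [0, s] can be found by brute force over two-point probability measures, consistent with the finite-support solutions of Theorem 2.67. The closed form (M − m₁)/(M − s), valid for s < m₁ < M, is our own elementary derivation for this special case:

```python
M, s, m1 = 1.0, 0.3, 0.5    # gains in [0, M], threshold s, prescribed mean m1
best = 0.0
N = 400
for i in range(N + 1):          # support point a in S = [0, s]
    a = s * i / N
    for j in range(N + 1):      # support point b in [s, M]
        b = s + (M - s) * j / N
        if b <= a:
            continue
        p = (b - m1) / (b - a)  # weight on a enforcing p*a + (1-p)*b = m1
        if 0.0 <= p <= 1.0:
            best = max(best, p)  # p = mu([0, s]): the loading of S
closed_form = (M - m1) / (M - s)
```

The optimum is attained by a measure with two support points (at s and at M), in line with Theorem 2.67(ii).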

2.7 Notes

An overview of SDP optimization is provided in the Handbook [125] edited by


Wolkowicz et al. Proposition 2.13 is due to Lewis [72]; see Lewis and Overton [73]
and Lewis [71]. The SDP relaxation of quadratic problems is discussed in [125, Chap.
13]; our presentation is inspired by Lemaréchal and Oustry [70]. Second-order cone
models are discussed in Ben-Tal and Nemirovski [15] and Lobo Sousa et al. [76];

questions of sensitivity are dealt with in Bonnans and Ramírez [25], and an overview
is given in Alizadeh and Goldfarb [4].
About semi-infinite programming, see Bonnans and Shapiro [26, Sect. 5.4], or
Goberna and Lopez [53]. The problem of moments is discussed in Chap. 16 of [125];
a classical reference is Akhiezer [2]. The related work by Lasserre [69] deals with
the minimization of polynomial functions of several variables, with polynomial con-
straints. Our discussion of Chebyshev interpolation follows Powell’s book [89], a
classical reference in approximation theory.
Chapter 3
An Integration Toolbox

Summary This chapter gives a concise presentation of integration theory in a general


measure space, including classical theorems on the limit of integrals. It gives an
extension to the Bochner integrals, needed for measurable functions with values in
a Banach space. Then it shows how to compute the conjugate and subdifferential of
integral functionals, either in the convex case, based on convex integrand theory, or
in the case of Carathéodory integrands. Then optimization problems with integral
cost and constraint functions are analyzed using the Shapley–Folkman theorem.

3.1 Measure Theory

3.1.1 Measurable Spaces

Let Ω be a set; we denote by P(Ω) the set of its subsets. We say that F ⊂ P(Ω)
is an algebra (resp. σ -algebra) if it contains ∅ and Ω, the complement of each
of its elements, and the finite (resp. countable) unions of its elements. Note that an
algebra (resp. a σ -algebra) also contains the finite (resp. countable) intersections of its
elements. The trivial σ -algebra is the algebra {∅, Ω}. An intersection of algebras
(resp. σ -algebras) is an algebra (resp. σ -algebra). Therefore, if E ⊂ P(Ω), we
may define its generated algebra (resp. generated σ-algebra) as the intersection of
algebras (resp. σ -algebras) containing it, or equivalently the smallest algebra (resp.
σ -algebra) containing it. The above intersections are not over an empty set since
they contain the trivial σ -algebra. If F is a σ -algebra of Ω, we say that (Ω, F ) is
a measurable space, and call the elements of F measurable sets.

Remark 3.1 We can build the algebra (resp. σ -algebra) generated by E ⊂ P(Ω) as
follows. Consider the sequence Ek ⊂ P(Ω), k ∈ N, such that E0 := E , and Ek+1 is
the subset of P(Ω) whose elements are the elements of Ek , as well as their com-
plements and finite (resp. countable) unions. We can call E_k the k-step completion (resp. σ-completion) of E. This is a nondecreasing sequence, and the algebra (resp. σ-algebra) generated by E ⊂ P(Ω) is the limiting set ∪_k E_k.

© Springer Nature Switzerland AG 2019
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_3

Example 3.2 (i) If E is a partition of Ω (a finite family of pairwise disjoint subsets


with union Ω), the generated σ -algebra is the set of (possibly empty) unions of
elements of E . More generally, if E is a countable partition of Ω (a countable family
of pairwise disjoint subsets with union Ω), the generated σ -algebra is the set of
(possibly empty) unions of elements of E .
(ii) If f : E → Ω, where E is an arbitrary set and F is a σ -algebra in Ω, then
{ f −1 (A); A ∈ F } is a σ -algebra in E, called the σ -algebra generated by f .
(iii) If Ω is a topological space,1 then we call the σ -algebra generated by the open
subsets of Ω the Borel σ -algebra and denote it by B(Ω).
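Example 3.2(i) can be checked mechanically for a small finite partition (a sketch; the helper names are ours): the generated σ-algebra is exactly the family of the 2^m unions of blocks, stable under complement and union:

```python
from itertools import chain, combinations

Omega = frozenset(range(6))
partition = [frozenset({0, 1}), frozenset({2}), frozenset({3, 4, 5})]

def generated(parts):
    # all (possibly empty) unions of blocks of the partition
    out = set()
    for r in range(len(parts) + 1):
        for combo in combinations(parts, r):
            out.add(frozenset(chain.from_iterable(combo)))
    return out

F = generated(partition)
stable_compl = all(Omega - A in F for A in F)         # closed under complement
stable_union = all(A | B in F for A in F for B in F)  # closed under (finite) union
```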

Definition 3.3 Given two sets Ω₁ and Ω₂, and subsets F̂ᵢ of P(Ωᵢ), with generated σ-algebras denoted by Fᵢ, for i = 1, 2, we set

$$
\hat{\mathcal{F}}_1 \otimes \hat{\mathcal{F}}_2 := \{ F_1 \times F_2; \; F_i \in \hat{\mathcal{F}}_i, \; i = 1, 2 \}, \tag{3.1}
$$

and let F̂₁ ⊗^σ F̂₂ be the σ-algebra in Ω₁ × Ω₂ generated by F̂₁ ⊗ F̂₂ (called the product σ-algebra).

We have the obvious inclusion

$$
\hat{\mathcal{F}}_1 \otimes^\sigma \hat{\mathcal{F}}_2 \subset \mathcal{F}_1 \otimes^\sigma \mathcal{F}_2. \tag{3.2}
$$

Consider the following hypothesis

Ωi is a countable union of elements of Fˆi , for i = 1, 2. (3.3)


Proposition 3.4 If (3.3) holds, then F̂₁ ⊗^σ F̂₂ = F₁ ⊗^σ F₂.

Proof We follow Villani [121, Prop. III.35].
(a) In view of (3.2) it suffices to prove that F₁ ⊗^σ F₂ ⊂ F̂₁ ⊗^σ F̂₂. Since the r.h.s. is a σ-algebra, this holds if the following claim holds: for any A ∈ F₁ and B ∈ F₂, A × B ∈ F̂₁ ⊗^σ F̂₂.
(b) When A ∈ F̂₁ and B ∈ F̂₂, by (3.3), A × Ω₂ and Ω₁ × B belong to F̂₁ ⊗^σ F̂₂. Using this, given B ∈ F̂₂, we easily check that the set of A ⊂ Ω₁ such that A × B ∈ F̂₁ ⊗^σ F̂₂ is a σ-algebra. So, A × B ∈ F̂₁ ⊗^σ F̂₂ whenever A ∈ F₁ and B ∈ F̂₂. Since F̂₁ ⊗^σ F̂₂ is a σ-algebra, the same argument applied to the second variable proves the claim. □

1 This means that there exists a subset O of P (Ω) that contains Ω and ∅, and is stable under finite
intersection and arbitrary union. Its elements are called open sets. The complements of open sets
are called closed sets.
3.1 Measure Theory 119

Definition 3.5 Let Ωi , i = 1, 2, be topological spaces. Then the product topology


in Ω := Ω1 × Ω2 is defined as follows: A ⊂ Ω is an open set if for any a ∈ A, there
exist O₁ and O₂, open subsets of Ω₁ and Ω₂ resp., such that a ∈ O₁ × O₂ ⊂ A.

If (Y, ρ) is a metric space, its open balls with center y ∈ Y and radius r > 0 are
denoted as
B(y, r) := {y′ ∈ Y; ρ(y, y′) < r}. (3.4)

The open subsets of Y are defined as unions of open balls. This makes Y a topological
space. In the sequel metric spaces will always be endowed with their Borel σ -algebra.

Proposition 3.6 Let Ωᵢ, i = 1, 2, be separable metric spaces. Let Ω := Ω₁ × Ω₂ be endowed with the product topology. Then

$$
\mathcal{B}(\Omega) = \mathcal{B}(\Omega_1) \otimes^\sigma \mathcal{B}(\Omega_2). \tag{3.5}
$$

Proof We follow Villani [121, Prop. III.36].
(a) Let aᵢ ∈ Ωᵢ, for i = 1, 2. Since Ωᵢ is the union of the open balls B(aᵢ, k) for k ∈ ℕ, hypothesis (3.3) holds. Let Oᵢ denote the family of open subsets of Ωᵢ. By Proposition 3.4, with F̂ᵢ = Oᵢ and Fᵢ = B(Ωᵢ), we have that O₁ ⊗^σ O₂ = B(Ω₁) ⊗^σ B(Ω₂). So, we need to prove that O₁ ⊗^σ O₂ = B(Ω). That O₁ ⊗^σ O₂ ⊂ B(Ω) follows from the obvious inclusion O₁ ⊗ O₂ ⊂ B(Ω). We next show the converse inclusion. Since B(Ω) is generated by the open subsets, it suffices to prove that any open subset of Ω belongs to O₁ ⊗^σ O₂.
(b) For i = 1, 2, let x_k^i be a dense sequence in Ωᵢ, let Oᵢ be an open subset of Ωᵢ, and let x^i ∈ Oᵢ. Then B(x^i, 2/n) ⊂ Oᵢ for large enough n ∈ ℕ. Pick k such that dist(x_k^i, x^i) < 1/n. Then x^i ∈ B(x_k^i, 1/n) ⊂ Oᵢ.
(c) Let O be an open subset of Ω, and z ∈ O. There exist O₁ and O₂, open subsets of Ω₁ and Ω₂ resp., such that z = (x, y) ∈ O₁ × O₂ ⊂ O. By point (b), x ∈ B(x_k^1, 1/n′) ⊂ O₁ and y ∈ B(x_{k″}^2, 1/n″) ⊂ O₂. So, an open subset of Ω is a countable union of sets of the form B(x_k^1, 1/n′) × B(x_{k″}^2, 1/n″). It therefore belongs to O₁ ⊗^σ O₂, as was to be proved. □

Measurable Mappings

Let (X, F X ) and (Y, FY ) be two measurable spaces. The mapping f : X → Y is said
to be measurable if, for all Y1 ∈ FY , f −1 (Y1 ) ∈ F X . A composition of measurable
mappings is therefore measurable, if the σ -algebra of the intermediate space is the
same for the two mappings.

Lemma 3.7 Let (X, F X ) and (Y, FY ) be two measurable spaces, the σ -algebra
FY being generated by G ⊂ P(Y ). Then f : X → Y is measurable iff f −1 (g) is
measurable, for all g ∈ G .

Proof If f is measurable and g ∈ G, clearly f⁻¹(g) ∈ F_X. Conversely, assume that f⁻¹(g) ∈ F_X, for all g ∈ G. As in Remark 3.1, denote by G_k the k-step σ-completion of G. For k = 0 we have that f⁻¹(g) ∈ F_X, for all g ∈ G_k. On the other hand, if for some k ∈ ℕ, f⁻¹(g) ∈ F_X, for all g ∈ G_k, since F_X is a σ-algebra, we easily see that f⁻¹(g) ∈ F_X, for all g ∈ G_{k+1}. So, f⁻¹(g) ∈ F_X, for all g ∈ G_k, for all k ∈ ℕ. Since the σ-algebra F_Y generated by G coincides with ∪_k G_k, the conclusion follows. □
Corollary 3.8 Let (X, F X ) and (Y, FY ) be two measurable spaces, and let f :
X → Y . If FY is a Borel σ -algebra, then f is measurable iff the inverse image of
any open set is measurable.
Lemma 3.9 Let (X, F X ), (Y, FY ) and (Z , F Z ) be three measurable spaces, and
let f : X → Y × Z , with components denoted as f (x) := ( f 1 (x), f 2 (x)). Then f is
measurable iff its components f 1 and f 2 are.
Proof If f is measurable, then for all A ∈ F_Y, f₁⁻¹(A) = f⁻¹(A × Z) is measurable. Therefore f₁ is measurable, as is f₂ by a symmetry argument.
Assume now that f₁ and f₂ are measurable. Since the product σ-algebra is generated by the elements of the form A × B, with A ∈ F_Y and B ∈ F_Z, by Lemma 3.7, it suffices to check that f⁻¹(A × B) is measurable, which is immediate since f⁻¹(A × B) = f₁⁻¹(A) ∩ f₂⁻¹(B). □
In the sequel (Y, ρ) is a metric space.
Definition 3.10 We denote by L 0 (Ω, Y ) the vector space of measurable functions
on Ω with values in Y , and by E 0 (Ω, Y ) the subspace of simple functions (sometimes
called step functions), i.e., of measurable functions with finite range. If Y = R, we
denote these spaces by L 0 (Ω) and E 0 (Ω) resp.
By simple convergence of a sequence of functions we mean convergence at every point. If V ⊂ Y and y ∈ Y, we define the distance function ρ(y, V) := inf{ρ(y, y′); y′ ∈ V}, and for r > 0, we set² V_r := {y ∈ Y; ρ(y, Y \ V) > 1/r}.
Lemma 3.11 Let f_k be a sequence of measurable functions Ω → Y, simply converging to f̄. Then f̄ is measurable, and for any open set O in Y:

$$
\bar{f}^{-1}(O) = \bigcup_{r > 0} \bigcup_{k \in \mathbb{N}} \bigcap_{\ell \ge k} f_\ell^{-1}(O_r). \tag{3.6}
$$

Proof Let O be an open subset of Y. Clearly, O = ∪_{r>0} O_r, and hence, x ∈ f̄⁻¹(O) iff there exists an r₀ > 0 such that x ∈ f̄⁻¹(O_{r₀}), i.e., there exists a y ∈ O_{r₀} such that y = f̄(x) = lim_k f_k(x). This holds iff, for any r₁ > r₀, f_k(x) ∈ O_{r₁} for large enough k, i.e., iff x belongs to ∩_{ℓ≥k} f_ℓ⁻¹(O_{r₁}) for large enough k: relation (3.6) follows. □

2 By definition, A \ B := {x ∈ A; x ∉ B}.

We next give a way to approximate measurable functions with values in ℝⁿ by functions having a countable or finite image.

Definition 3.12 Let ⌊a⌋ denote the integer part of a, i.e., the greatest integer m such that m ≤ a. For k ∈ ℕ, set

$$
\lfloor a \rfloor_k := \begin{cases} 2^{-k} \lfloor 2^k a \rfloor & \text{if } a \ge 0, \\ -2^{-k} \lfloor -2^k a \rfloor & \text{otherwise.} \end{cases} \tag{3.7}
$$

If now f is a real-valued function, define ⌊f⌋_k by ⌊f⌋_k(x) := ⌊f(x)⌋_k. If f is a mapping with values in ℝⁿ, define ⌊f⌋_k by (⌊f⌋_k)ᵢ(x) := ⌊fᵢ(x)⌋_k, i = 1 to n. We call ⌊f⌋_k the floor approximation of f.

We recall that a function is simple if it is measurable with finite image.

Lemma 3.13 Let f be a measurable function. Then (i) ⌊f⌋_k is measurable, has a countable range, and converges uniformly to f; (ii) the truncation

$$
f_k := \max\big( -k, \min(k, \lfloor f \rfloor_k) \big) \tag{3.8}
$$

is a sequence of simple functions that converges simply to f. In addition, if f is nonnegative, so is f_k, and ⌊f⌋_k as well as f_k are nondecreasing (in k).

Proof It suffices to discuss the case when f is nonnegative. The image of ⌊f⌋_k is included in 2^{−k}ℕ, and for j ∈ ℕ, ⌊f⌋_k⁻¹(2^{−k}j) = f⁻¹([2^{−k}j, 2^{−k}(j+1)[) is measurable, so that ⌊f⌋_k is measurable. The conclusion easily follows. □
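A direct implementation of (3.7) and (3.8) (a sketch; f = sin is just a stand-in measurable function) exhibits the uniform rate 2^{−k} of point (i) and the monotonicity in k for nonnegative values:

```python
import math

def floor_k(a, k):
    # the dyadic approximation (3.7): rounds a toward zero on the grid 2^{-k} Z
    s = 2.0 ** k
    return math.floor(s * a) / s if a >= 0 else -math.floor(-s * a) / s

def trunc_k(a, k):
    # the truncation (3.8): a simple-function approximation with finite range
    return max(-k, min(k, floor_k(a, k)))

pts = [math.sin(3.0 * i / 50) for i in range(50)] + [-0.7, -0.25, 1.9]
unif = all(abs(floor_k(a, k) - a) <= 2.0 ** (-k)
           for a in pts for k in range(1, 20))      # uniform error at most 2^-k
mono = all(floor_k(a, k) <= floor_k(a, k + 1)
           for a in pts if a >= 0 for k in range(1, 20))  # nondecreasing in k
```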

Definition 3.14 We say that f : Rn → R p is Borelian if it is measurable when Rn


and R p are endowed with the Borel σ -algebra. More generally, if f is measurable
X → Y , where X and Y are topological sets endowed with their Borel σ -algebra,
we say that f is Borelian.

Lemma 3.15 (Doob–Dynkin) Let Ω be an arbitrary set and X , Y be two measurable


functions from Ω to Rn and R p resp. Denote by F X the σ -algebra generated by X .
Then Y is F X measurable iff there exists a Borelian function g : Rn → R p such that
Y = g(X ).

Proof If Y = g(X ) for some Borelian function g : Rn → R p , and if B is an open


set in R p , then Y −1 (B) = X −1 [g −1 (B)] ∈ F X , and so Y is F X measurable.
We show the converse in the case when p = 1, the extension to p > 1 being easy.
Let Y be F X measurable. If Y = 1 A is the characteristic function of the set A ∈ F X ,
since A = X −1 (B) where B is Borelian, we have that Y = 1 X −1 (B) = 1 B (X ), and the
conclusion holds with g = 1 B . More generally, let Y be of the form Y = k αk 1 Ak
(finite or countable sum) with αk all different and Ak = Y −1 (αk ) ∈ F X for all k,
necessarily pairwise disjoint. Note that the sum is well defined since at most one
term is nonzero. Then Ak = X −1 (Bk ), where  the Bk are pairwise disjoint Borelian
sets and the conclusion holds with g(x) = k αk 1 Bk (x) (again, at most one term in
122 3 An Integration Toolbox

the sum is nonzero). Finally, in the general case, by the previous discussion, there
exists a measurable function gk such that Yk = gk (X ), and for m > n, we have
|gm (x) − gk (x)| ≤ 2−k , proving that the sequence gk simply converges to some g,
which is measurable by Lemma 3.26. 
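The mechanism of the proof can be mimicked in a toy discrete setting (a sketch; the example functions are ours): when Y is constant on the atoms of the σ-algebra generated by X, the factorization Y = g(X) is read off one value per atom:

```python
Omega = range(12)
X = lambda w: w % 3              # X generates the partition by the atoms {w: w % 3 = c}
Y = lambda w: (w % 3) ** 2 + 1   # constant on each atom: sigma(X)-measurable

g = {}
for w in Omega:
    # Y being constant on atoms makes g well-defined (this is the measurability of Y)
    assert g.setdefault(X(w), Y(w)) == Y(w)
factorized = all(g[X(w)] == Y(w) for w in Omega)
```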

Remark 3.16 For a measurable function f with values in a separable (i.e., containing a dense sequence) metric space (E, ρ), we can do a somewhat similar construction. Let e_k, k ∈ ℕ, be a dense sequence in E. Set E_{k,ℓ} := {e_j; k ≤ j ≤ ℓ}. Define f_k by induction: f_k(x) = e₀ if ρ(e₀, f(x)) ≤ ρ(e, f(x)) for all e ∈ E_{0,k}, and at step i, 1 ≤ i ≤ k, if f_k(x) has not been set yet, then f_k(x) = e_i if ρ(e_i, f(x)) ≤ ρ(e, f(x)), for all e ∈ E_{i,k}. In this way we obtain a sequence of simple functions that simply converges to f. It follows that the above Doob–Dynkin lemma holds when replacing ℝᵖ by a separable metric space.

3.1.2 Measures

Let (Ω, F) be a measurable space. We say that μ : F → ℝ₊ ∪ {+∞} is a measure if it satisfies the two axioms of countable additivity,

$$
\mu\big( \cup_{i \in I} A_i \big) = \sum_{i \in I} \mu(A_i), \quad \text{for } I \text{ finite or countable and } \{A_i\}_{i \in I} \subset \mathcal{F} \text{ such that } A_i \cap A_j = \emptyset \text{ if } i \ne j, \tag{3.9}
$$

and σ -finiteness:

There exists an exhaustion sequence A_k in F, i.e., such that μ(A_k) < ∞ and Ω = ∪_k A_k. (3.10)

We say that (Ω, F , μ) is a measure space. If in addition μ(Ω) = 1, we say that μ


is a probability measure and that (Ω, F , μ) is a probability space. If A ∈ F , we
then interpret μ(A) as the probability that ω ∈ A. It follows from (3.9) that

$$
\mu\big( \cup_{i \in I} A_i \big) \le \sum_{i \in I} \mu(A_i), \quad \text{if } \{A_i\}_{i \in I} \text{ is a finite or countable family in } \mathcal{F}. \tag{3.11}
$$

Indeed, we may assume that I = ℕ. Setting B_i := ∪_{j≤i} A_j and C_i := B_i \ B_{i−1} (with B₀ := ∅), it suffices to apply (3.9) to the family {C_i}_{i∈I}, whose union is ∪_{i∈I} A_i; since C_i ⊂ A_i, the result follows from

$$
\mu\big( \cup_{i \in I} A_i \big) = \mu\big( \cup_{i \in I} C_i \big) = \sum_{i \in I} \mu(C_i) \le \sum_{i \in I} \mu(A_i). \tag{3.12}
$$

We also have
μ(∪_k A_k) = lim_k μ(A_k) if A_k ⊂ A_{k+1} for all k. (3.13)

Indeed, apply (3.9) to the disjoint family A_k \ A_{k−1} (for k ≥ 1, assuming w.l.o.g. that A₀ = ∅). Let us show next that (3.13) implies

$$
\mu\big( \cap_k A_k \big) = \lim_k \mu(A_k) \quad \begin{array}{l} \text{if } \{A_k\} \text{ is a nonincreasing family of measurable sets} \\ \text{having finite measure for large enough } k. \end{array} \tag{3.14}
$$

We may assume that μ(A₁) is finite. The family A_k′ := A₁ \ A_k is nondecreasing, and A₁ \ (∩_k A_k) = ∪_k A_k′. By (3.13), μ(∩_k A_k) = μ(A₁) − lim_k μ(A_k′) = lim_k μ(A_k).

Construction of Measures

In the case of Lebesgue measure over R the starting point is the length of finite
intervals; one has to check that the latter can be extended to a measure over the
Borelian σ -algebra. More generally let Fˆ be an algebra of subsets of Ω, and let F
denote the generated σ -algebra. Let μ : Fˆ → R+ ∪ {+∞} be a σ -additive function
over Fˆ , i.e., if ∪i∈I Ai ∈ Fˆ , then

$$
\mu\big( \cup_{i \in I} A_i \big) = \sum_{i \in I} \mu(A_i), \quad \text{for } I \text{ finite or countable and } \{A_i\}_{i \in I} \subset \hat{\mathcal{F}} \text{ such that } A_i \cap A_j = \emptyset \text{ if } i \ne j. \tag{3.15}
$$

We have Carathéodory’s extension theorem:


Theorem 3.17 Let μ̂ be a nonnegative σ -additive function over Fˆ . Then it has a
unique extension as a measure over F .
Proof See Royden [105, Chap. 12, Th. 8]. 
Remark 3.18 (i) Note that the proof needs Ω to be σ -finite.
(ii) The delicate point when applying this theorem is to check the σ -additivity
assumption.
(iii) For an extension see Villani [121, Thm. I.69].
As a consequence we obtain the construction of the Lebesgue measure on R.
Corollary 3.19 There exists a unique measure over R, endowed with the Borelian
σ -algebra, that for a finite segment [a, b] has value b − a.
Proof (i) It is easily checked that the length of segments has a unique extension μ̂ to the algebra F̂ generated by segments, which is nothing but the family of finite unions of (bounded or not) segments. So, by the extension Theorem 3.17, it suffices to check the σ-additivity assumption on μ̂ over F̂. For this, see Royden [105, Chap. 3], or Dudley [45, Chap. 3]. □
Remark 3.20 For the construction of a non-measurable subset of R, see Royden
[105, Chap. 4, Sect. 4].

Product Spaces

We next show how to construct measures over products of measure spaces.

Proposition 3.21 Let (Ωᵢ, Fᵢ, μᵢ), for i = 1, 2, be two measure spaces. Let μ be the set function over F₁ × F₂ defined by μ(F₁ × F₂) := μ₁(F₁)μ₂(F₂), for all F₁ ∈ F₁ and F₂ ∈ F₂. Then there exists a unique measure μ̄ over F₁ ⊗^σ F₂ that extends μ, in the sense that μ̄(F) = μ(F), for all F ∈ F₁ × F₂.

Proof (Taken from Royden [105, Chap. 12, Lemma 14].)
(i) Let F̂ denote the algebra generated by F₁ × F₂. Any A ∈ F̂ is a finite disjoint union of elements of F₁ × F₂, say A = ∪ᵢ Aᵢ × Aᵢ′ with Aᵢ in F₁ and Aᵢ′ in F₂, the union being over a finite set. We define μ̂(A) := Σᵢ μ₁(Aᵢ)μ₂(Aᵢ′). While the decomposition is not unique, all possible decompositions give the same value for μ̂(A), so that μ̂ is well-defined as a nonnegative, finitely additive set function over F̂.
(ii) By the extension Theorem 3.17, it suffices now to check that μ̂ is σ-additive over F̂. Since each element of F̂ has a representation as a finite disjoint union of elements of F₁ × F₂, it suffices to prove that μ is σ-additive over F₁ × F₂.
(iii) So, let A × B ∈ F₁ × F₂ be the disjoint union of Aᵢ × Bᵢ ∈ F₁ × F₂, with i ∈ I countable. Given (x, y) ∈ A × B, (x, y) belongs to only one of the Aᵢ × Bᵢ, whose index is denoted by j(x, y). Then I(x) := ∪_{y∈B} {j(x, y)} is the subset of those i ∈ I such that x ∈ Aᵢ. Since A × B is the disjoint union of the Aᵢ × Bᵢ, B = ∪_{i∈I(x)} Bᵢ. Since μ₂ is σ-additive it follows that μ₂(B) = Σ_{i∈I(x)} μ₂(Bᵢ), and so,

$$
\mu_2(B)\, \chi_A(x) = \sum_{i \in I} \mu_2(B_i)\, \chi_{A_i}(x). \tag{3.16}
$$

By the Lebesgue theorem on series (Theorem 3.32, obtained later, but by independent arguments) we deduce, integrating with respect to μ₁, that μ₂(B)μ₁(A) = Σ_{i∈I} μ₂(Bᵢ)μ₁(Aᵢ), as was to be proved. □

Negligible Sets and Completed σ -Algebras

A (not necessarily measurable) subset A of Ω is said to be negligible if, for all


ε > 0, it is contained in a measurable set of measure less than ε. In particular, for
each nonzero k ∈ N, there exists a measurable set Ωk of measure at most 1/k such
that A ⊂ Ωk , and so A ⊂ ∩k Ωk . This proves that a negligible subset is included in
a measurable set of zero measure.
A countable union of negligible sets is negligible. A property that holds outside
of a negligible set (or equivalently, outside of a subset of zero measure) is said to
be true almost everywhere (a.e.), or almost surely (a.s.) in the case of a probability
measure.
We say that the σ -algebra F is complete if it contains the negligible sets. We call
the family of subsets

Fc := {A := A1 ∪ A2 ; A1 ∈ F ; A2 is negligible} (3.17)

the completed σ -algebra of F . It is easily seen that Fc is a σ -algebra, and we endow


it with the completed measure defined by μc (A1 ∪ A2 ) := μ(A1 ). The completion
of the Borel σ -algebra of Rn is called the Lebesgue σ -algebra.

Lemma 3.22 Let (Ω, F, μ) be a measure space, and let f : Ω → ℝ be F_c-measurable. Then there exists an F-measurable real-valued function g such that f = g a.e.

Proof We follow Aliprantis and Border [3, Thm. 10.35].
(a) Decomposing f as a difference of nonnegative functions, we see that it suffices to consider the case when f(x) ≥ 0 a.e. If f = χ_A for some A ∈ F_c, then A = A₁ ∪ A₂ with A₁ ∈ F and A₂ negligible, and we may take g = χ_{A₁}. The set of functions f for which the conclusion holds is a vector space, and so, the conclusion holds when f is a simple function.
(b) In the general case, by Lemma 3.13, there exists a nondecreasing sequence f_k of F_c-simple functions, a.e. converging to f. By step (a), there exists g_k, F-measurable and equal to f_k except on a negligible set A_k. Being negligible, ∪_k A_k is included in a zero measure set, whose complement is denoted by B. Then g̃_k(x) := g_k(x)χ_B(x) converges a.e. to the function g equal to f(x) on B, and to 0 on its complement. By Lemma 3.11, g is measurable. The conclusion follows. □

A Useful Lemma

Lemma 3.23 (Borel–Cantelli) Let (Ω, F, μ) be a measure space. If the sequence A_k in F satisfies Σ_k μ(A_k) < ∞, then the following holds for almost all ω ∈ Ω:

$$
\{ k \in \mathbb{N};\; \omega \text{ belongs to } A_k \} \text{ is finite.} \tag{3.18}
$$

Proof That B_n := ∪_{k≥n} A_k is nonincreasing implies μ(∩_n B_n) = lim_n μ(B_n). As μ(B_n) ≤ Σ_{k≥n} μ(A_k), the limit is equal to 0. Since ω ∉ ∩_n B_n iff (3.18) holds, the conclusion follows. □
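As an illustration (our choice of sets, not the book's), take Ω = [0, 1] with Lebesgue measure and A_k = [0, 1/k²]; then Σ_k μ(A_k) < ∞, and indeed each ω > 0 belongs to only finitely many A_k, namely those with k ≤ ω^{−1/2}:

```python
def hit_count(omega, K):
    # number of indices k <= K with omega in A_k = [0, 1/k^2]
    return sum(1 for k in range(1, K + 1) if omega <= 1.0 / k ** 2)

total_measure = sum(1.0 / k ** 2 for k in range(1, 10 ** 5))  # partial sum, bounded by pi^2/6
counts = {omega: (hit_count(omega, 10 ** 3), hit_count(omega, 10 ** 4))
          for omega in (0.5, 0.1, 0.01, 0.001)}
# for each omega the count is the same for both horizons: only finitely many hits
```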

3.1.3 Kolmogorov’s Extension of Measures

Set X = (ℝᵖ)^∞, i.e., any x ∈ X has the representation x = (x₁, x₂, ...) with each xᵢ in ℝᵖ. To any Borelian subset A of ℝ^{p×n} we associate the cylinder

$$
C(A) := \{ x \in X;\; (x_1, \ldots, x_n) \in A \}. \tag{3.19}
$$

Denote by Fˆ (resp. F ) the algebra (resp. σ -algebra) generated by the cylinders.


Let μn be a sequence of probability measures on R p×n (endowed with the Borelian
σ -algebra), having the consistency property
126 3 An Integration Toolbox

μn+1 (A × R p ) = μn (A), n = 1, . . . . (3.20)

Then we can define a finitely additive set function μ̂ over the set of cylinders of X
by
μ̂(C(A)) := μn (A), for each measurable A in R p×n . (3.21)

If A and A′ are Borelian sets in R^{p×i} and R^{p×j} resp. with j > i, then C(A) coincides
with C(A′) iff A′ = A × R^{p×(j−i)}. So, in view of (3.20), μ̂(C(A)) is well defined.

Theorem 3.24 The set function μ̂ has a unique extension to a probability measure
μ on F .

Proof (Taken from Shiryaev [115, Chap. 2, Sect. 3]). By Carathéodory’s extension
Theorem 3.17, it suffices to prove that μ̂ is σ -additive.
Since μ̂ is finitely additive, its σ -additivity is equivalent (taking complements)
to the property of “continuity at 0”, i.e., if Ck := C(Ak ) is a decreasing sequence in
Fˆ with empty intersection, then μ̂(Ck ) → 0. Note that we may assume that Ak is
a Borelian set in R p×n k , with n k ≥ k. Assume on the contrary that μ̂(Ck ) → δ > 0.
We prove (independently) in Lemma 5.4 that any Borelian subset A of Rn is such
that

For any ε > 0, there exist F, G, resp. closed and open subsets of R^n,
such that F ⊂ A ⊂ G and P(G \ F) < ε. (3.22)

Intersecting F with a closed ball of arbitrarily large radius, it is easily seen that we
may in addition assume F to be compact. So, let Fk ⊂ Ak be compact sets such that
μ_{n_k}(Ak \ Fk) < δ/2^{k+1}. Let F̂k := ∩_{q≤k} C(Fq). Then

Ck \ F̂k = Ck \ (∩_{q≤k} C(Fq)) = ∪_{q≤k} (Ck \ C(Fq)), (3.23)

and therefore

μ̂(Ck \ F̂k) ≤ Σ_{p≤k} μ̂(Ck \ C(Fp)) ≤ Σ_{p≤k} μ_{n_p}(Ap \ Fp) ≤ δ/2. (3.24)

Since μ̂(Ck) → δ > 0 it follows that lim_k μ̂(F̂k) ≥ δ/2. So, there exists a sequence x^k in X such that x^k ∈ F̂k for all k. Let p ≥ 1 be an integer. Since each Fk is a compact subset of R^{p×n_k}, with n_k ≥ k, the sequence k ↦ x^k_p is bounded. By a diagonal argument we can, up to the extraction of a subsequence, assume that k ↦ x^k_p is convergent for each p to, say, x_p. We easily see that x belongs to ∩_k Ck, contradicting our hypothesis. □

Remark 3.25 As mentioned in [115], the proof has an immediate extension to the
case when X = Y ∞ , where Y is a metric space endowed with the Borelian σ -algebra.
We need probabilities μn on Y n , satisfying the consistency property μn+1 (A × Y ) =
μn (A), for all n = 1, . . .. Then we are able to build an extension of the μn on the
3.1 Measure Theory 127

σ-algebra generated by the cylinders, provided that to each Borelian A ⊂ Y^n and ε > 0 we can associate a compact set F ⊂ A such that μn(A \ F) < ε.

3.1.4 Limits of Measurable Functions

Let (X, F_X), (Y, F_Y) be two measurable spaces. We endow (X, F_X) with a measure μ. Denote by M the vector space of functions X → Y that are measurable, after modification on a negligible set, and by Mμ the quotient space through the equivalence relation f ∼ f′ iff f(x) = f′(x) a.e. We call an element of an equivalence class a representative of that class, and say that the sequence fk ∈ Mμ converges a.e. to f ∈ Mμ if the convergence holds a.e. for some choice of representatives.

Lemma 3.26 Let (X, F X , μ) be a measure space, and (Y, dY ) be a metric space. If
a sequence in Mμ converges a.e., then its limit is a measurable function.

Proof Let f k be a sequence in Mμ converging a.e. to f¯. Let Ω0 be a zero measure


set such that f k simply converges on Ω1 := Ω \ Ω0 . Let y0 ∈ Y , and set

gk(ω) := fk(ω) if ω ∈ Ω1; gk(ω) := y0 otherwise. (3.25)

Then gk simply converges to the function f̃ equal to f̄ on Ω1, and to y0 on Ω0. By Lemma 3.11, f̃ is measurable; so is f̄, being equal to f̃ a.e. □

Theorem 3.27 (Egoroff) Let (X, F_X, μ) be a measure space such that μ(X) < ∞, (Y, ρ) a metric space, and f̄k a sequence of Mμ(X, Y). If f̄k converges a.e. to ḡ, then for any representatives (fk, g) of (f̄k, ḡ) and ε > 0, there exists a K ∈ F_X such that μ(X\K) ≤ ε and fk uniformly converges to g on K.

Proof The family indexed by k and q in N, q ≥ 1:

A_{k,q} := ∪_{ℓ≥k} {x ∈ X; ρ(f_ℓ(x), g(x)) > 1/q} (3.26)

is nonincreasing in k, and by (3.14), lim_k μ(A_{k,q}) = μ(∩_k A_{k,q}) = 0. So there exists a k_q ∈ N such that μ(A_{k_q,q}) ≤ ε 2^{−q}. Set K̂ := ∪_q A_{k_q,q}, and let K be the complement of K̂. Then μ(K̂) ≤ ε and

ρ(f_ℓ(x), g(x)) ≤ 1/q whenever ℓ ≥ k_q, for all x ∈ K, (3.27)

implying the uniform convergence on K . 
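A hypothetical example showing the content of Egoroff's theorem (not from the text): on X = [0, 1] with Lebesgue measure, fk(x) = x^k converges a.e. to 0 but not uniformly; removing the small set (1 − δ, 1] restores uniform convergence. A numerical sketch, where the helper name and grid size are illustrative choices:

```python
# Egoroff illustration: f_k(x) = x**k on [0, 1] converges a.e. to 0
# (everywhere except x = 1), but not uniformly.  Removing the set
# (1 - delta, 1] of measure delta makes the convergence uniform.

def sup_error(k, delta, n=10_000):
    """Approximate sup of |f_k| over K = [0, 1 - delta] on a grid."""
    top = 1.0 - delta
    return max((top * i / n) ** k for i in range(n + 1))

delta = 0.01                                   # measure of the removed set
sups = [sup_error(k, delta) for k in (10, 100, 1000)]
assert sups[0] > sups[1] > sups[2]             # decreasing to 0 on K
assert sups[2] < 1e-4                          # uniformly small for large k
assert sup_error(1000, 0.0) == 1.0             # no uniformity on all of [0, 1]
```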

If (Y, ρ) is a metric space, we say that a sequence fk in Mμ(X, Y) converges in measure (in probability, when μ is a probability measure) to g ∈ Mμ(X, Y) if

For all ε > 0, we have that μ({x ∈ X; ρ(fk(x), g(x)) > ε}) → 0. (3.28)

Theorem 3.28 Let (X, F X , μ) be a measure space, and (Y, ρ) be a metric space.
Let f k be a sequence of measurable mappings X → Y . Then: (i) Convergence in
measure implies convergence a.e. of a subsequence. (ii) If μ(X ) < ∞, convergence
a.e. implies convergence in measure.

Proof (i) Let f k converge in measure to f¯, and ε > 0. Set

Ak := {ω ∈ Ω; ρ( f k (ω), f¯(ω)) > ε}. (3.29)

Extracting a subsequence if necessary, we may assume that μ(Ak ) ≤ 2−k . By the


Borel–Cantelli Lemma 3.23, for a.a. ω ∈ Ω, ω belongs to finitely many Ak ; that
is, there exists a function k(ω) such that ρ( f k (ω), f¯(ω)) < ε a.e. for k > k(ω).
This being true for all ε > 0, the convergence a.e. of f k to f¯ follows, along the
subsequence.
(ii) Immediate consequence of Egoroff’s Theorem 3.27. 
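The gap in part (i) between convergence in measure and convergence a.e. can be seen on the classical "typewriter" sequence (an assumed example, not from the text): the indicators of the intervals [j/n, (j+1)/n), enumerated over n = 1, 2, . . ., converge to 0 in measure but converge at no point, while a subsequence (the first interval of each level) converges a.e. A sketch:

```python
# The "typewriter" sequence on [0, 1]: enumerate the intervals
# [j/n, (j+1)/n), n = 1, 2, ..., j = 0, ..., n-1, and let f_k be the
# indicator of the k-th interval.  Then mu(f_k > 0) = 1/n -> 0, so
# f_k -> 0 in measure, yet f_k(x) converges at no x: at every level n,
# x falls in exactly one interval, so f_k(x) takes both values 1 and 0
# infinitely often.

def interval(k):
    """Level n and endpoints of the k-th interval (k = 0, 1, ...)."""
    n, start = 1, 0
    while k - start >= n:          # walk through levels 1, 2, ...
        start += n
        n += 1
    j = k - start
    return n, j / n, (j + 1) / n

x = 0.3
values = []
for k in range(5050):              # levels n = 1, ..., 100
    n, a, b = interval(k)
    values.append(1 if a <= x < b else 0)

assert sum(values) == 100                           # one hit per level
assert 1 in values[-100:] and 0 in values[-100:]    # no pointwise limit at x
# Still, the sub-sequence of first intervals of each level is the
# indicator of [0, 1/n), which converges to 0 at every x > 0.
```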

Metrizability of Convergence in Measure

We briefly review some results, referring for the proof to [77, Chap. 1]. For f, g in L_0(Ω) set

e(f, g) := inf_{ε>0} {ε + μ(|f − g| > ε)}. (3.30)

This is a symmetric function with values in R+ ∪ {+∞}, such that e( f, g) = 0 iff


f = g a.e., and that satisfies the triangle inequality

e( f, h) ≤ e( f, g) + e(g, h). (3.31)

It easily follows that δ( f, g) := e( f, g)/(1 + e( f, g)) is a metric over Mμ .

Theorem 3.29 We have that fk → f in measure iff δ(fk, f) → 0, and Mμ endowed with the metric δ is complete.
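On a finite measure space, e(f, g) from (3.30) can be computed exactly by scanning the finitely many values of |f − g| (between two consecutive values, ε ↦ ε + μ(|f − g| > ε) is increasing). The sketch below is illustrative; the function names e_dist and delta are ours:

```python
# The distance e(f, g) = inf_{eps > 0} [eps + mu(|f - g| > eps)] of (3.30),
# computed exactly on a finite measure space: it suffices to scan
# eps in {0} union {values of |f - g|}.

def e_dist(f, g, weights):
    d = [abs(a - b) for a, b in zip(f, g)]
    def h(eps):
        return eps + sum(w for di, w in zip(d, weights) if di > eps)
    return min(h(v) for v in {0.0} | set(d))

def delta(f, g, weights):
    ed = e_dist(f, g, weights)
    return ed / (1.0 + ed)              # the metric of Theorem 3.29

w = [0.25, 0.25, 0.25, 0.25]            # uniform probability on 4 atoms
f = [0.0, 0.0, 0.0, 0.0]
g = [0.1, 0.1, 5.0, 5.0]                # close on half the space, far on half
# taking eps = 0.1 gives 0.1 + mu(|f - g| > 0.1) = 0.1 + 0.5 = 0.6
assert abs(e_dist(f, g, w) - 0.6) < 1e-12
assert e_dist(f, f, w) == 0.0           # e(f, g) = 0 iff f = g a.e.
assert 0.0 < delta(f, g, w) < 1.0       # delta stays below 1
```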

3.1.5 Integration

Let (Ω, F, μ) be a measure space. The spaces L_0(Ω) and E_0(Ω) of measurable and simple functions, resp., were introduced in Definition 3.10. If f ∈ E_0(Ω) has values a1 < · · · < an, the sets Ai := f^{−1}(ai), i = 1 to n, are measurable and give a partition of Ω. Denote by E_1(Ω) the subspace of E_0(Ω) for which the Ai have a finite measure whenever ai ≠ 0. If f ∈ E_1(Ω), we define the integral of f as

∫_Ω f(ω)dμ(ω) := Σ_{i=1}^n ai μ(Ai). (3.32)

This defines a linear form over E_1(Ω), and the function

‖f‖_1 := ∫_Ω |f(ω)| dμ(ω) = Σ_{i=1}^n |ai| μ(Ai), (3.33)

where |ai| μ(Ai) = 0 if ai = 0, is a seminorm (nonnegative, positively homogeneous function that satisfies the triangle inequality) taking the same value for all representatives of the equivalence class under the relation of being equal a.e. Over these equivalence classes, this seminorm induces a norm, denoted again by ‖·‖_1, that satisfies the Tchebycheff inequality: for all ε > 0,

μ({ω ∈ Ω; |f(ω) − g(ω)| > ε}) ≤ (1/ε) ∫_Ω |f(ω) − g(ω)| dμ(ω) = (1/ε) ‖f − g‖_1. (3.34)
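The Tchebycheff inequality (3.34) is easy to verify numerically on a finite measure space; the weights and values below are an assumed toy example:

```python
# Tchebycheff inequality (3.34) on a finite measure space:
# mu(|f - g| > eps) <= ||f - g||_1 / eps.

def mu_exceed(d, weights, eps):
    """Measure of the set where |d| exceeds eps."""
    return sum(w for di, w in zip(d, weights) if abs(di) > eps)

def l1_norm(d, weights):
    """L1 norm of d for the discrete measure given by weights."""
    return sum(abs(di) * w for di, w in zip(d, weights))

weights = [0.2, 0.3, 0.1, 0.4]
diff = [1.5, -0.2, 3.0, 0.05]              # plays the role of f - g
for eps in (0.1, 0.5, 1.0, 2.0):
    assert mu_exceed(diff, weights, eps) <= l1_norm(diff, weights) / eps + 1e-15
```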
We shall build L_1(Ω) as the completion, for the norm ‖·‖_1, of the equivalence classes of functions of the space E_1(Ω). More precisely, let fk be a Cauchy sequence in E_1(Ω). Extracting a subsequence if necessary, we may assume that ‖fk − f_{k+1}‖_1 ≤ 2^{−k−1}. Fix ε > 0 and set

Ak := {ω ∈ Ω; sup_{ℓ≥k} |fk(ω) − f_ℓ(ω)| > ε}. (3.35)

A variant of the Tchebycheff inequality gives

μ(Ak) ≤ (1/ε) ∫_Ω sup_{ℓ≥k} |fk(ω) − f_ℓ(ω)| dμ(ω) ≤ (1/ε) Σ_{ℓ≥k} ‖f_{ℓ+1} − f_ℓ‖_1 ≤ 2^{−k}/ε. (3.36)

By the Borel–Cantelli Lemma 3.23, a.e. ω belongs to a finite number of the Ak, showing that |fk(ω) − f_ℓ(ω)| ≤ ε for k = k(ω) large enough and ℓ ≥ k. In other words, fk(ω) is a.e. a Cauchy sequence, and therefore fk converges a.e. to some g, which is measurable by Lemma 3.26. Since the integral is a continuous mapping for the norm ‖·‖_1, the sequence ∫_Ω fk(ω)dμ(ω) also converges.
 
Lemma 3.30 We may set ∫_Ω g(ω)dμ(ω) := lim_k ∫_Ω fk(ω)dμ(ω), in the sense that if another Cauchy sequence f′k in E_1(Ω) converges a.e. to the same function g, then ∫_Ω fk(ω)dμ(ω) and ∫_Ω f′k(ω)dμ(ω) have the same limit.

Proof We follow [77, French edition, p. 37]. If the conclusion does not hold, then hk := fk − f′k is a Cauchy sequence that converges a.e. to zero and such that ∫_Ω hk(ω)dμ(ω) has a nonzero limit, say γ, so that, for large enough k, ‖hk‖_1 ≥ ½|γ| > 0 and ‖hk − h_ℓ‖_1 < |γ|/8 for ℓ > k. Fix such a k and write hk = Σ_q ξ_q 1_{A_q}. Then, for ℓ > k:
Σ_q ∫_{A_q} |ξ_q(ω) − h_ℓ(ω)| dμ(ω) ≤ ‖hk − h_ℓ‖_1 ≤ ¼ ‖hk‖_1 = ¼ Σ_q ∫_{A_q} |ξ_q| dμ(ω). (3.37)
So, we must have that

∫_{A_q} |ξ_q(ω) − h_ℓ(ω)| dμ(ω) ≤ ¼ ∫_{A_q} |ξ_q(ω)| dμ(ω) = ¼ |ξ_q| μ(A_q), for some q. (3.38)
For this particular q, by the Tchebycheff inequality:

μ({ω ∈ A_q; |ξ_q(ω) − h_ℓ(ω)| > ½|ξ_q(ω)|}) ≤ ½ μ(A_q), (3.39)

so that

μ({ω ∈ A_q; |h_ℓ(ω)| ≥ ½|ξ_q(ω)|}) ≥ ½ μ(A_q). (3.40)

Since 1_{A_q} h_ℓ converges a.e. to zero on A_q, which has finite measure, by Theorem 3.28(ii), it converges in measure, which contradicts (3.40). The conclusion follows. □
The vector space L_1(Ω) of equivalence classes (for the relation of equality a.e.) of such limits is endowed with ‖g‖_1 := lim_k ‖fk‖_1, which is easily checked to be a norm. This space, being constructed as limits of Cauchy sequences, is easily seen to be complete. Over E_1(Ω), the operator g ↦ ∫_Ω g(ω)dμ(ω) is linear, nondecreasing and non-expansive (Lipschitz with constant 1). Since E_1(Ω) is a dense subset of L_1(Ω), the integral has a unique extension to L_1(Ω) that keeps these properties. The integral is a continuous linear form over the space L_1(Ω), with unit norm. Therefore, fk → g in L_1(Ω) implies ∫_Ω fk(ω)dμ(ω) → ∫_Ω g(ω)dμ(ω). Since the Tchebycheff inequality (3.34) holds on L_1(Ω), by Theorem 3.28, the following holds:
Lemma 3.31 Convergence in L 1 (Ω) implies convergence in measure, and hence,
convergence a.e. for a subsequence.
If f ∈ L 0 (Ω) is bounded, by Lemma 3.13, we can approximate it uniformly by
functions in E 0 (Ω). Therefore, if μ(Ω) < ∞, or more generally if f is zero a.e.
outside of a set of finite measure, then f ∈ L 1 (Ω).
Theorem 3.32 (Lebesgue's theorem on series) Let the sequence fk in L_1(Ω) converge normally, i.e., Σ_k ‖fk‖_1 < ∞. Then (i) the series Fn(ω) := Σ_{k=0}^n fk converges in L_1(Ω) to some g, (ii) ∫_Ω g(ω)dμ(ω) = lim ∫_Ω Fk(ω)dμ(ω), (iii) the series Fk(ω) absolutely converges a.e. to g(ω), that is, Σ_k |fk(ω)| < ∞ and Fk(ω) → g(ω) a.e.
Proof (i) Any normally convergent series in a complete space is convergent. (ii) Since the integral is linear and continuous, the integral of the limit is the limit of the integrals of the partial sums. (iii) The remainders rn := Σ_{k=n}^∞ |fk| converge to 0 in L_1(Ω), and hence, a.e. for a subsequence. For a nonincreasing and nonnegative sequence, convergence a.e. to 0 for a subsequence implies convergence a.e. for the sequence. So the sequence rn converges a.e. to 0. The result follows. □

We can define integrals with infinite values in the following way.



Definition 3.33 Let
 f ∈ L 0 (Ω). We set Ω f (ω)dμ(ω) := −∞ if f + ∈ L 1 (Ω) and
f− ∈
/ L 1 (Ω), and Ω f (ω)dμ(ω) := +∞ if f − ∈ L 1 (Ω) and f + ∈ / L 1 (Ω).

With the above definition we have the usual calculus rules such as

∫_Ω (f + g)(ω)dμ(ω) = ∫_Ω f(ω)dμ(ω) + ∫_Ω g(ω)dμ(ω), (3.41)

whenever the integrals of f and g are defined, except of course if f and g have
infinite integrals of opposite sign.

Theorem 3.34 (Monotone convergence) Let fk be a nondecreasing sequence of L_1(Ω), with limit a.e. g. Then

lim_k ∫_Ω fk(ω)dμ(ω) = ∫_Ω g(ω)dμ(ω), (3.42)

the limit being possibly +∞. If in addition, lim_k ∫_Ω fk(ω)dμ(ω) < ∞, then g ∈ L_1(Ω) and fk → g in L_1(Ω).
 
Proof Since fk ≤ g, ∫_Ω fk(ω)dμ(ω) ≤ ∫_Ω g(ω)dμ(ω). So, (3.42) holds if

lim_k ∫_Ω fk(ω)dμ(ω) = ∞. (3.43)

Otherwise, we conclude by applying Theorem 3.32 to the normally convergent series with general term f_{k+1} − fk. □

Example 3.35 The sequence of functions fk : R → R, fk(x) = −1_{x≥k}(x), is nondecreasing and has limit g(x) = 0 a.e., and yet

lim_k ∫_Ω fk(ω)dμ(ω) = −∞ < 0 = ∫_Ω g(ω)dμ(ω). (3.44)

The above theorem does not apply since fk is not integrable.

Lemma 3.36 Let f ∈ L_1(Ω) be nonnegative. Then the mapping F → R, A ↦ ρ_f(A) := ∫_A f(ω)dμ(ω), is a measure.

Proof The σ-finiteness axiom (3.10) holds since ρ_f(Ω) = ‖f‖_1 < ∞. It remains to show that, if the Ai satisfy the assumptions in (3.9), then ρ_f(∪_{i∈I} Ai) = Σ_{i∈I} ρ_f(Ai), or equivalently ∫_{∪_{i∈I} Ai} f(ω)dμ(ω) = Σ_{i∈I} ∫_{Ai} f(ω)dμ(ω). This follows from the monotone convergence Theorem 3.34, where we set fk(ω) := f(ω) Σ_{i≤k} 1_{Ai}(ω). □

Corollary 3.37 Let {Bk} ⊂ F be such that B_{k+1} ⊂ Bk, and B := ∩_k Bk has zero measure. Then ∫_{Bk} f(ω)dμ(ω) → 0, for all f ∈ L_1(Ω).

Proof Decomposing f into its positive and negative parts, we see that it suffices to prove the result when f ≥ 0. Since ∫_{Bk} f(ω)dμ(ω) = ρ_f(Bk), and ρ_f(B) = 0, this follows from Lemma 3.36 and (3.14). □
Theorem 3.38 (Lebesgue dominated convergence) Let the sequence fk of L_1(Ω) converge a.e. to g, and be dominated by h ∈ L_1(Ω), in the sense that |fk(ω)| ≤ h(ω) a.e. Then g ∈ L_1(Ω), fk → g in L_1(Ω), and ∫_Ω fk(ω)dμ(ω) → ∫_Ω g(ω)dμ(ω).
Proof (a) Since g is dominated by h ∈ L_1(Ω), so are the floor approximations g_k, which (being measurable) are therefore integrable. Applying the monotone convergence Theorem 3.34 to the positive and negative parts of g_k, we deduce that g ∈ L_1(Ω). The relation ∫_Ω fk(ω)dμ(ω) → ∫_Ω g(ω)dμ(ω) is a consequence of the convergence of fk to g in L_1(Ω), which we prove next.
(b) We first assume that μ(Ω) < ∞. By Egoroff's Theorem 3.27, for each ℓ ∈ N, ℓ > 0, there exists a K_ℓ ∈ F_X such that K̄_ℓ := X \ K_ℓ satisfies μ(K̄_ℓ) ≤ 1/ℓ, and fk converges uniformly on K_ℓ to g. Changing if necessary K_ℓ into ∪_{q≤ℓ} K_q, we may assume that the sequence K̄_ℓ is nonincreasing. Let ρ_h denote the measure associated with h (see Lemma 3.36). Since ∩_ℓ K̄_ℓ has zero measure, and ρ_h(K̄_ℓ) is finite, by (3.14), we have that lim_ℓ ρ_h(K̄_ℓ) = lim_ℓ ∫_{K̄_ℓ} |h(ω)|dμ(ω) = 0, and so, when ℓ ↑ +∞:

α_ℓ := sup_k ∫_{K̄_ℓ} |fk(ω) − g(ω)| dμ(ω) ≤ 2 ∫_{K̄_ℓ} h(ω) dμ(ω) → 0. (3.45)

On the other hand, since fk converges uniformly on K_ℓ:

∫_{K_ℓ} |fk(ω) − g(ω)| dμ(ω) → 0. (3.46)

So, given γ > 0, take ℓ such that α_ℓ ≤ γ. By (3.46), we have that lim sup_k ‖fk − g‖_1 ≤ γ. It follows that fk → g in L_1(Ω).
(c) Assume now that μ(Ω) = ∞, and let A_ℓ be the exhaustion sequence in (3.10). Set B_ℓ := Ω \ A_ℓ. By step (b), ∫_{A_ℓ} |fk(ω) − g(ω)|dμ(ω) → 0, and so

lim sup_k ‖fk − g‖_1 = lim sup_k ∫_{B_ℓ} |fk(ω) − g(ω)|dμ(ω) ≤ 2 ∫_{B_ℓ} |h(ω)|dμ(ω). (3.47)

Now ∫_{B_ℓ} |h(ω)|dμ(ω) = ρ_h(B_ℓ). Since ∩_ℓ B_ℓ has zero measure, by Corollary 3.37, the above r.h.s. converges to 0 when ℓ ↑ +∞. The conclusion follows. □
Remark 3.39 We have proved in step (a) that a measurable function belongs to
L 1 (Ω) whenever it is dominated by some h ∈ L 1 (Ω).
Example 3.40 Define the functions fk and g : R → R by fk(x) := e^{−(x−k)²}, g(x) = 0. Then fk and g are integrable, and fk → g a.e. However, the integral of fk does not converge to that of g. The above theorem does not apply, since the domination hypothesis does not hold.
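A numerical check of this example (the grid and truncation bounds are illustrative choices): the Riemann sums of fk stay close to √π for every shift k, while fk vanishes at each fixed point.

```python
import math

# Example 3.40 numerically: f_k(x) = exp(-(x - k)**2) is a fixed-shape bump
# sliding to the right.  Its integral stays equal to sqrt(pi) for every k,
# while f_k(x) -> 0 at each fixed x: the mass "escapes to infinity".

def integral_fk(k, lo=-50.0, hi=150.0, n=40_000):
    """Riemann-sum approximation of the integral of f_k over R."""
    h = (hi - lo) / n
    return h * sum(math.exp(-((lo + i * h) - k) ** 2) for i in range(n))

for k in (0, 5, 20):
    assert abs(integral_fk(k) - math.sqrt(math.pi)) < 1e-6   # mass is conserved
assert math.exp(-(1.0 - 20.0) ** 2) < 1e-100                 # yet f_20(1) ~ 0
```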

We recall that, if fk is a sequence of real-valued functions over Ω, its lower limit is defined by lim inf_k fk(ω) := lim_k inf_{j≥k} f_j(ω). Since the r.h.s. is nondecreasing in k, the limit exists in R̄.

Lemma 3.41 (Fatou's lemma) Let fk be a sequence in L_1(Ω), with fk ≥ g, where g is an integrable function. Then

∫_Ω lim inf_k fk(ω) dμ(ω) ≤ lim inf_k ∫_Ω fk(ω) dμ(ω). (3.48)

Proof If lim inf_k ∫_Ω fk(ω)dμ(ω) = ∞, then (3.48) certainly holds. Otherwise, note that gk := inf_{j≥k} f_j satisfies g ≤ gk ≤ fk. So, by Remark 3.39, gk ∈ L_1(Ω) and it satisfies

∫_Ω gk(ω)dμ(ω) ≤ ∫_Ω fk(ω)dμ(ω). (3.49)

Since the l.h.s. is nondecreasing, we have that

lim_k ∫_Ω gk(ω)dμ(ω) ≤ lim inf_k ∫_Ω fk(ω)dμ(ω). (3.50)

Let ḡ(ω) := lim_k gk(ω) = lim inf_k fk(ω). By the monotone convergence Theorem 3.34, the l.h.s. of (3.50) is equal to ∫_Ω ḡ(ω)dμ(ω). The conclusion follows. □

Remark 3.42 Fatou’s lemma allows us to prove the l.s.c. of some integral functionals,
see e.g. after (3.134).

Corollary 3.43 Let f k be a sequence in L 1 (Ω) such that  f k 1 ≤ C for all k. If f k


converges a.e. to f , then f ∈ L 1 (Ω) and  f 1 ≤ C.

Proof Apply Lemma 3.41 to the sequence | f k |, which converges a.e. to | f |, with
g = 0. 

Example 3.44 The integrable sequence fk(x) := e^{−(x−k)²} simply converges to 0, and gives an example of strict inequality in (3.48). It also shows that the convergence in L_1(Ω) does not necessarily occur in the setting of Corollary 3.43. Taking now fk(x) := −e^{−(x−k)²}, we verify that, to obtain (3.48), the hypothesis that fk ≥ g, with g integrable, cannot be omitted.

We have until now presented the standard theorems of integration theory. We now
end this section with some more advanced results. Let us first show that Fatou’s
lemma implies an easy and useful generalized dominated convergence theorem, see
Royden [105, Chap. 4, Thm. 17].

Theorem 3.45 Let fk and gk be sequences in L_1(Ω) such that

(i) |fk(ω)| ≤ gk(ω) a.e.,
(ii) (fk, gk) converges a.e. to (f, g),
(iii) ∫_Ω gk(ω)dμ(ω) → ∫_Ω g(ω)dμ(ω).

Then ∫_Ω fk(ω)dμ(ω) → ∫_Ω f(ω)dμ(ω).

Proof Since ψk^± := gk ± fk is integrable, nonnegative, and converges a.e. to ψ^± := g ± f, by Fatou's lemma, ∫_Ω ψ^±(ω)dμ(ω) ≤ lim inf_k ∫_Ω ψk^±(ω)dμ(ω). Using (iii), it follows that ±∫_Ω f(ω)dμ(ω) ≤ lim inf_k ∫_Ω (±fk(ω))dμ(ω). The conclusion follows. □

We next present Vitali’s convergence theorem.

Definition 3.46 (Uniform integrability) Let (Ω, F, P) be a probability space. We say that a set E of measurable functions is uniformly integrable if, for all ε > 0, there exists an M_ε > 0 such that E[|f| 1_{|f|>M_ε}] ≤ ε, for all f ∈ E.

Theorem 3.47 Let (Ω, F , P) be a probability space, and f k be a uniformly inte-


grable sequence in L 1 (Ω), with a.s. finite limit f . Then f ∈ L 1 (Ω), and f k → f in
L 1 (Ω).

Proof Let ε, M_ε be as above. Since ‖fk‖_1 ≤ M_ε + ε, fk is bounded in L_1(Ω), and Corollary 3.43 implies that f ∈ L_1(Ω). By Egoroff's Theorem 3.27, for all ε_j ↓ 0, there exists an E_j ∈ F such that fk → f uniformly over E_j, and F_j := Ω \ E_j has measure less than ε_j. Therefore,

∫_{F_j} |fk(ω)|dP(ω) ≤ M_ε P(F_j) + E[|fk| 1_{|fk|>M_ε}] ≤ ε_j M_ε + ε (3.51)

for all k. Changing E_j into ∪_{i≤j} E_i if necessary, we may assume that F_j is a nonincreasing sequence, whose intersection has zero measure. By Corollary 3.37, we may fix j such that ∫_{F_j} |f(ω)|dP(ω) ≤ ε and ε_j M_ε + ε ≤ 2ε, so that ∫_{F_j} |fk − f|(ω)dP(ω) ≤ 3ε. Since fk → f uniformly on E_j, the conclusion follows. □

Remark 3.48 The theorem does not hold over a measure space when μ(Ω) is not
finite, as Example 3.44 shows.

Exercise 3.49 Let Ω := [0, 1] be endowed with Lebesgue’s measure. Let fk (ω) = k
over [0, 1/k] and f k (ω) = 0 otherwise. Show that this sequence is not uniformly
integrable, and does not satisfy the conclusion of Vitali’s theorem.
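A sketch of the computation behind this exercise, using exact rational arithmetic; the helper names are ours:

```python
from fractions import Fraction

# Exercise 3.49 numerically: on Omega = [0, 1] with Lebesgue measure, let
# f_k = k on [0, 1/k] and 0 elsewhere.  Each integral equals 1 although
# f_k -> 0 a.e., and uniform integrability fails: for every level M,
# E[|f_k| 1_{|f_k| > M}] = 1 whenever k > M, so no single M works for all k.

def integral_fk(k):
    # value k on a set of measure 1/k (exact rational arithmetic)
    return Fraction(k) * Fraction(1, k)

def tail_expectation(k, M):
    # |f_k| > M holds on the whole support iff k > M, and nowhere otherwise
    return integral_fk(k) if k > M else Fraction(0)

assert all(integral_fk(k) == 1 for k in (1, 10, 1000))   # no L1 convergence to 0
assert tail_expectation(101, 100) == 1                   # tail mass stuck at 1
assert tail_expectation(50, 100) == 0
```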

3.1.6 L p Spaces

Let (Ω, F, μ) be a measure space. For f ∈ L_0(Ω), set

‖f‖_∞ := inf{α > 0; |f(ω)| ≤ α a.e.}. (3.52)

Let
L_∞(Ω) := {f ∈ L_0(Ω); ‖f‖_∞ < ∞}. (3.53)

It is easily checked that this space, endowed with the norm ‖·‖_∞, is a Banach space. Now, for p ∈ [1, ∞) set

L_p(Ω) := {f ∈ L_0(Ω); |f|^p ∈ L_1(Ω)}. (3.54)

For f ∈ L_p(Ω) we set

‖f‖_p := ( ∫_Ω |f(ω)|^p dμ(ω) )^{1/p}. (3.55)

We will check in Lemma 3.53 that this is a norm. Let us prove that L_p(Ω) is a vector space. It is enough to check that if f, g are in L_p(Ω), then f + g is in L_p(Ω). Indeed, the function x ↦ |x|^p being convex, we have that

2^{−p} ‖f + g‖_p^p = ∫_Ω |½(f + g)|^p ≤ ∫_Ω (½|f|^p + ½|g|^p) = ½‖f‖_p^p + ½‖g‖_p^p. (3.56)

3.1.6.1 Hölder’s Inequality

Let p ∈ [1, ∞], and q be the conjugate exponent, such that 1/ p + 1/q = 1. The
following lemma shows that to every element of L q (Ω) is associated a continuous
linear form on L p (Ω):

Lemma 3.50 (Hölder inequality) Let f ∈ L_p(Ω) and g ∈ L_q(Ω). Then fg ∈ L_1(Ω), and

‖fg‖_1 ≤ ‖f‖_p ‖g‖_q. (3.57)

Proof The result is obvious if p ∈ {1, ∞}. So, let p ∈ (1, ∞). Since the inequality (3.57) is positively homogeneous w.r.t. f and g, it is enough to check that ‖fg‖_1 ≤ 1 whenever ‖f‖_p = ‖g‖_q = 1. So, given f ∈ L_p(Ω), ‖f‖_p = 1, we need to check that the convex problem below has value not less than −1:

Min_{g∈L_q(Ω)} −∫_Ω f(ω)g(ω)dμ(ω); (1/q) ∫_Ω |g(ω)|^q dμ(ω) ≤ 1/q. (3.58)

We may always assume that fg ≥ 0 a.e., since otherwise we obtain a lower cost by changing g(ω) into −g(ω) on {ω ∈ Ω; f(ω)g(ω) < 0}. We may also assume that f(ω) ≥ 0 a.e., in view of the discussion on the sign of fg. We will solve this qualified convex problem by finding a solution to the optimality system, with multiplier λ > 0. The Lagrangian function can be expressed as
∫_Ω ( −f(ω)g(ω) + λ|g(ω)|^q/q ) dμ(ω) − λ/q, (3.59)

whose minimum is attained for g ≥ 0 such that −f(ω) + λ g(ω)^{q−1} = 0 a.e., i.e., g(ω) = (f(ω)/λ)^{p/q}, which is an element of L_q(Ω). Since λ > 0 the constraint is binding, and so,

1 = ∫_Ω |g(ω)|^q dμ(ω) = λ^{−p} ∫_Ω |f(ω)|^p dμ(ω) = λ^{−p}, (3.60)

so that λ = 1. Finally, integrating the product of f(ω) = g(ω)^{q−1} with g(ω), we see that the value of problem (3.58) is −1, as was to be proved. □
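Hölder's inequality, and its equality case |g| proportional to |f|^{p−1} (visible in the proof above), can be checked numerically for the counting measure on finitely many points; the vectors below are an assumed toy example:

```python
# Hölder's inequality (3.57) for the counting measure on n points:
# sum |f_i g_i| <= (sum |f_i|^p)^(1/p) (sum |g_i|^q)^(1/q), 1/p + 1/q = 1.

def norm_p(v, p):
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

f = [1.0, -2.0, 3.0, 0.5]
g = [0.3, 1.1, -0.7, 2.0]
for p in (1.5, 2.0, 3.0):
    q = p / (p - 1.0)                          # conjugate exponent
    lhs = sum(abs(a * b) for a, b in zip(f, g))
    assert lhs <= norm_p(f, p) * norm_p(g, q) + 1e-12

# equality case, as in the proof: |g| proportional to |f|**(p - 1)
p = 3.0
q = 1.5
g_eq = [abs(x) ** (p - 1.0) for x in f]
lhs = sum(abs(a * b) for a, b in zip(f, g_eq))
assert abs(lhs - norm_p(f, p) * norm_p(g_eq, q)) < 1e-9
```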

Corollary 3.51 Let 1/p + 1/q = 1/r with r ≥ 1, f ∈ L_p(Ω), and g ∈ L_q(Ω). Then fg ∈ L_r(Ω), and

‖fg‖_r ≤ ‖f‖_p ‖g‖_q. (3.61)

Proof Apply the Hölder inequality (3.57), with the conjugate exponents p/r and q/r, to f′ := |f|^r and g′ := |g|^r. □

Corollary 3.52 Let μ(Ω) < ∞, and 1/p + 1/q = 1/r with r ∈ (1, ∞). Then L_p(Ω) ⊂ L_r(Ω) and, if f ∈ L_p(Ω), we have that

‖f‖_r ≤ μ(Ω)^{1/q} ‖f‖_p. (3.62)

Proof Apply Corollary 3.51 with g(ω) = 1. □

Lemma 3.53 The space L_p(Ω) is a normed vector space; in particular, for any f, g in L_p(Ω), the following Minkowski inequality holds:

‖f + g‖_p ≤ ‖f‖_p + ‖g‖_p. (3.63)

Proof It is enough to check the triangle inequality (3.63) when f and g are nonnegative. Let q be such that 1/p + 1/q = 1. By Lemma 3.50, since p − 1 = p/q:

∫_Ω (f + g)^p = ∫_Ω f (f + g)^{p−1} + ∫_Ω g (f + g)^{p−1} ≤ (‖f‖_p + ‖g‖_p) ‖(f + g)^{p/q}‖_q. (3.64)

Note that

‖(f + g)^{p/q}‖_q = ( ∫_Ω (f + g)^p )^{1/q} = ‖f + g‖_p^{p/q} = ‖f + g‖_p^{p−1}. (3.65)

We obtain that ‖f + g‖_p^p ≤ (‖f‖_p + ‖g‖_p) ‖f + g‖_p^{p−1}. The conclusion follows. □
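Minkowski's inequality can likewise be verified for the counting measure on finitely many points; the last lines also illustrate why the restriction p ≥ 1 matters (for p < 1 the triangle inequality fails). An illustrative sketch with assumed data:

```python
# Minkowski's inequality (3.63) for the counting measure on n points:
# ||f + g||_p <= ||f||_p + ||g||_p, valid for p >= 1.

def norm_p(v, p):
    return sum(abs(x) ** p for x in v) ** (1.0 / p)

f = [1.0, -2.0, 3.0, 0.5]
g = [0.3, 1.1, -0.7, 2.0]
for p in (1.0, 1.5, 2.0, 4.0):
    s = [a + b for a, b in zip(f, g)]
    assert norm_p(s, p) <= norm_p(f, p) + norm_p(g, p) + 1e-12

# for p < 1 the triangle inequality fails, so ||.||_p is not a norm:
p = 0.5
f1, g1 = [1.0, 0.0], [0.0, 1.0]
assert norm_p([1.0, 1.0], p) > norm_p(f1, p) + norm_p(g1, p)
```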

Note the following variant in L p (Ω) of the dominated convergence Theorem 3.38:

Theorem 3.54 (Dominated convergence in L_p(Ω)) Let the sequence fk of L_p(Ω), with p ∈ (1, ∞), converge a.e. to g, and be dominated by h ∈ L_p(Ω), in the sense that |fk(ω)| ≤ h(ω) a.e. Then g ∈ L_p(Ω), and fk → g in L_p(Ω).

Proof Apply the dominated convergence Theorem 3.38 to f′k := |fk − g|^p, which converges a.e. to 0, is integrable, and is dominated by the integrable function 2^p h^p. □

Remark 3.55 Under the hypotheses of the above theorem, if μ(Ω) < ∞, by Corollary 3.52, for all r ∈ [1, p), fk and g belong to L_r(Ω) and fk → g in L_r(Ω). Taking r = 1 it follows that ∫_Ω fk(ω)dμ(ω) → ∫_Ω g(ω)dμ(ω).

Next we will check that, for p ∈ (1, ∞), L p (Ω) is complete, by characterizing it
as a dual space.

3.1.6.2 Dual Spaces: the Riesz Theorem

In the sequel we will characterize the dual of L p (Ω) spaces. See also Royden [105,
Chap. 11] or Lang [68, Chap. VII].

Theorem 3.56 (Riesz representation theorem) Let G be a continuous linear form on L_p(Ω), with p ∈ [1, ∞[. Then there exists a g ∈ L_q(Ω), with 1/p + 1/q = 1, such that

G(f) = ∫_Ω f(ω)g(ω)dμ(ω), for any f ∈ L_p(Ω). (3.66)

Proof We just give the proof in the case when p = 1.


(a) Assume first that μ(Ω) < ∞. Then L_2(Ω) ⊂ L_1(Ω) with continuous inclusion. Denote by G′ the restriction of G to L_2(Ω). By the Cauchy–Schwarz inequality, for all f ∈ L_2(Ω), we have that

|G′(f)| ≤ ‖G‖ ‖f‖_1 ≤ ‖G‖ μ(Ω)^{1/2} ‖f‖_2, (3.67)

and therefore G′ is a continuous linear form over L_2(Ω). By the Riesz representation theorem for Hilbert spaces, there exists a g ∈ L_2(Ω) such that G(f) = ∫_Ω g(ω)f(ω)dμ(ω), for all f ∈ L_2(Ω). We next prove that g ∈ L_∞(Ω). Let fk be the characteristic function of the set {ω; g(ω) ≥ k}. Then G(fk) ≥ k‖fk‖_1 and therefore we must have fk = 0 for large enough k. This proves that ess sup g < ∞, and by a symmetric argument we obtain that g ∈ L_∞(Ω). Since L_2(Ω) is a dense subset of L_1(Ω), it easily follows that G(f) = ∫_Ω g(ω)f(ω)dμ(ω), for all f ∈ L_1(Ω), and so the conclusion holds.
(b) When Ω = ∪_k Ω_k with μ(Ω_k) < ∞, by the previous arguments, for each k we have that G(f) = ∫_{Ω_k} g_k(ω)f(ω)dμ(ω), for all f ∈ L_1(Ω_k), with ‖g_k‖_∞ ≤ ‖G‖. We may assume that the Ω_k are nondecreasing. Then we may define g ∈ L_∞(Ω) by g(ω) = g_k(ω) for all ω ∈ Ω_k and all k. Given f ∈ L_1(Ω), let f_k(ω) = f(ω) if

ω ∈ Ω_k, and f_k(ω) = 0 otherwise. By dominated convergence, (f_k, g f_k) → (f, g f) in L_1(Ω), and therefore

G(f) = lim_k G(f_k) = lim_k ∫_{Ω_k} g(ω)f(ω)dμ(ω) = ∫_Ω g(ω)f(ω)dμ(ω). (3.68)

The result follows. 


Remark 3.57 The conclusion when p ∈ (1, 2] can be obtained in a similar way, using
again the Riesz representation theorem for Hilbert spaces. For p ∈ (2, ∞) the idea
is to decompose a continuous linear form into the difference of nonnegative linear
forms. Applying such a nonnegative linear form G to characteristic functions, we
obtain a measure with value zero on negligible sets. It can be proved then that this
measure has a density g w.r.t. μ, and g is in L q (Ω).

3.1.6.3 The Brézis–Lieb Theorem

A somewhat surprising improvement of Fatou’s lemma is due to Brézis and Lieb


[29].
Theorem 3.58 Let fk be a bounded sequence in L_p(Ω), p ∈ [1, ∞[, converging a.e. to some f. Then we have that f ∈ L_p(Ω), and in addition,

‖f‖_p^p = lim_k ( ‖fk‖_p^p − ‖f − fk‖_p^p ). (3.69)

Proof That f ∈ L_p(Ω) easily follows from Corollary 3.43. We check in Remark 3.59 below that, for any ε > 0, there exists a C_ε > 0 such that, for any a, b in R:

| |a + b|^p − |a|^p | ≤ ε|a|^p + C_ε|b|^p. (3.70)

Set h_k(ω) := |f_k(ω)|^p − |f_k(ω) − f(ω)|^p − |f(ω)|^p and

g_k(ω) := ( |h_k(ω)| − ε|f_k(ω) − f(ω)|^p )_+. (3.71)

Obviously h_k → 0 a.e., and so does g_k. Taking a := f_k(ω) − f(ω) and b := f(ω) in (3.70), we obtain that

|h_k(ω)| ≤ | |f_k(ω)|^p − |f_k(ω) − f(ω)|^p | + |f(ω)|^p ≤ ε|f_k(ω) − f(ω)|^p + (1 + C_ε)|f(ω)|^p, (3.72)

so that |g_k(ω)| ≤ (1 + C_ε)|f(ω)|^p. By the Corollary 3.43 of Fatou's lemma, |f|^p is integrable. So, by the dominated convergence Theorem 3.38, g_k → 0 in L_1(Ω). On the other hand, |h_k(ω)| ≤ g_k(ω) + ε|f_k(ω) − f(ω)|^p, and so, lim sup_k ‖h_k‖_1 = O(ε). Therefore, h_k → 0 in L_1(Ω), and so, ∫_Ω h_k(ω)dμ(ω) → 0. The conclusion follows. □
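The identity (3.69) can be observed exactly on the sequence space ℓ^p (counting measure), with fk equal to a fixed f plus a bump translated toward infinity, so that fk → f pointwise while remaining bounded in norm. An assumed toy example:

```python
# Brezis-Lieb identity (3.69) on the sequence space l^p (counting measure):
# take f_k = f + (bump shifted by k).  Then f_k -> f pointwise, f_k is
# bounded in l^p, and once the supports are disjoint the identity
# ||f_k||_p^p - ||f - f_k||_p^p = ||f||_p^p holds exactly.

def pth_power_norm(v, p):
    return sum(abs(x) ** p for x in v)

p = 2.5
N = 60
f = [3.0, -1.0, 2.0] + [0.0] * (N - 3)          # the a.e. limit
bump = [1.0, 2.0]                               # mass escaping to infinity

def f_k(k):
    out = list(f)
    for i, b in enumerate(bump):                # bump supported on {k, k+1}
        out[k + i] += b
    return out

for k in (10, 30, 50):                          # supports already disjoint
    fk = f_k(k)
    diff = [a - b for a, b in zip(fk, f)]       # this is the shifted bump
    lhs = pth_power_norm(fk, p) - pth_power_norm(diff, p)
    assert abs(lhs - pth_power_norm(f, p)) < 1e-12
    # the escaping mass itself does not vanish in norm:
    assert abs(pth_power_norm(diff, p) - pth_power_norm(bump, p)) < 1e-12
```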

Remark 3.59 The inequality (3.70) can be justified as follows. For p = 1 it is trivial.
Now let p ∈ (1, ∞). If |b| > 2|a|, then

| |a + b|^p − |a|^p | = |a + b|^p − |a|^p ≤ 2^p |b|^p, (3.73)

and the desired relation holds with C_ε = 2^p. Otherwise, by the mean value theorem, we have that, for some θ ∈ ]0, 1[:

| |a + b|^p − |a|^p | = p|a + θb|^{p−1} |b| ≤ 3^{p−1} p |a|^{p−1} |b|. (3.74)

Let q be such that 1/p + 1/q = 1. We conclude by using Young's inequality (1.81): αβ ≤ α^q/q + β^p/p (for α, β nonnegative) with α = (qε)^{1/q} |a|^{p−1} and β := (qε)^{−1/q} 3^{p−1} p|b|. The desired relation holds with

C_ε = max(2^p, (qε)^{−p/q} 3^{p(p−1)} p^{p−1}). (3.75)

3.1.7 Bochner Integrals

We need to discuss integrals with values in a Banach space Y. Given a measure space (Ω, F, μ), by L_0(Ω; Y) we denote the space of measurable functions Ω → Y; remember that the Banach space Y is implicitly endowed with the Borel σ-algebra (the one generated by open subsets), so that f ∈ L_0(Ω; Y) iff, for any Borel subset A of Y, f^{−1}(A) is measurable. The subspace of simple functions (with finitely many values except on a null set of Ω) is denoted by L_{00}(Ω; Y). Simple functions can be written as f = Σ_{i=1}^n y_i 1_{A_i}, where y_i ∈ Y, and the A_i are measurable subsets of Ω, with negligible intersections. We may define the integral and norm of the simple function f by

∫_Ω f(ω)dω := Σ_{i=1}^n y_i mes(A_i); ‖f‖_{1,Y} := Σ_{i=1}^n ‖y_i‖_Y mes(A_i). (3.76)

Note that
‖f‖_{1,Y} = ∫_Ω ‖f(ω)‖_Y dω, for all f ∈ L_{00}(Ω; Y). (3.77)

The space L_1(Ω; Y) of (Bochner) integrable functions is obtained, as is done for the Lebesgue integral, by passing to the limit in Cauchy sequences of simple functions. If f_k is such a sequence, extracting a subsequence if necessary, we may assume that ‖f_q − f_p‖_{1,Y} ≤ 2^{−q} for any q < p, so that the series Σ_k ‖f_{k+1} − f_k‖_{1,Y} is convergent. Consider the series of terms s_k(ω) := ‖f_{k+1}(ω) − f_k(ω)‖_Y and the corresponding partial sums S_k(ω) := Σ_{ℓ≤k} s_ℓ(ω). By the monotone convergence theorem, S_k converges in L_1(Ω) to some S_∞, and (being nondecreasing) converges also for a.a. ω. So, for
a.a. ω, the normally convergent sequence f_k(ω) has a limit f(ω) in Y, such that

‖f(ω) − f_k(ω)‖_Y ≤ S_∞(ω) − S_k(ω). (3.78)

Therefore

∫_Ω ‖f(ω) − f_k(ω)‖_Y dω ≤ ∫_Ω |S_∞(ω) − S_k(ω)| dω = o(1). (3.79)

We define the integral and norm of f as the limits of those of the f_k. The integral and norm of f are well defined, since they coincide for every Cauchy sequence of simple functions having the same limit. Indeed, let f′_k be another Cauchy sequence of simple functions for the L_1 norm, converging to f for a.a. ω. By (3.77), (3.79) applied to f_k and f′_k, and the triangle inequality:

‖f_k − f′_k‖_{1,Y} = ∫_Ω ‖f_k(ω) − f′_k(ω)‖_Y dω ≤ ∫_Ω ‖f_k(ω) − f(ω)‖_Y dω + ∫_Ω ‖f(ω) − f′_k(ω)‖_Y dω (3.80)

converges to 0, so that g_k := f_k − f′_k converges to zero both in L_1 and (by the previous discussion) a.e.

Remark 3.60 By (3.77) and (3.79), we have that

‖f‖_{1,Y} = lim_k ‖f_k‖_{1,Y} = lim_k ∫_Ω ‖f_k(ω)‖_Y dω = ∫_Ω ‖f(ω)‖_Y dω, (3.81)

the last equality being a consequence of the dominated convergence theorem; the domination hypothesis holds since ‖f_k(ω)‖_Y ≤ ‖f_0(ω)‖_Y + S_∞(ω) and the r.h.s. belongs to L_1(Ω).

Remark 3.61 An element of L_0(Ω; Y) is said to be Bochner measurable (or strongly measurable) if it has values (up to a null measure subset of Ω) in a separable subspace of Y (we recall that a subspace is separable if it contains a dense sequence). Being an a.e. limit of simple functions, an element of L_1(Ω; Y) is strongly measurable. Conversely, let f be strongly measurable. By Remark 3.16, f is a limit a.e. of simple functions. If in addition ‖f(ω)‖_Y is integrable, using the σ-finiteness hypothesis (3.10) we easily deduce that f ∈ L_1(Ω; Y). In general, we have the strict inclusion

L_1(Ω; Y) ⊂ { f ∈ L_0(Ω; Y); ∫_Ω ‖f(ω)‖_Y dω < ∞ }. (3.82)

Note that there is a version of the dominated convergence theorem for Bochner
integrals, see also Aliprantis and Border [3, Thm. 11.46]:

Theorem 3.62 Let f_k be a sequence in L_1(Ω; Y), converging a.e. to f ∈ L_0(Ω; Y), such that ‖f_k(ω)‖_Y ≤ g(ω) for a.a. ω, where g ∈ L_1(Ω). Then f ∈ L_1(Ω; Y), and f_k → f in L_1(Ω; Y).

Proof Let g_k(ω) := ‖f_k(ω) − f(ω)‖_Y. Then g_k → 0 a.e. and g_k(ω) ≤ 2g(ω) a.e. By the (standard) dominated convergence theorem, g_k → 0 in L_1(Ω). Extracting a subsequence if necessary, we may assume that ‖g_k‖_{L_1(Ω)} ≤ 2^{−k}. Then

‖f_q − f_k‖_{L_1(Ω;Y)} ≤ ∫_Ω ‖f_q(ω) − f(ω)‖_Y dμ(ω) + ∫_Ω ‖f(ω) − f_k(ω)‖_Y dμ(ω) = ∫_Ω (g_q(ω) + g_k(ω))dμ(ω) (3.83)
converges to 0. That is, f k is a Cauchy sequence in L 1 (Ω; Y ). Being constructed
as a set of limits of Cauchy sequences, L 1 (Ω; Y ) is necessarily complete, and we
have seen that convergence in this space implies convergence a.e. for a subsequence.
Since f_k → f a.e., it follows that f_k → f in L_1(Ω; Y). The conclusion follows. □

Example 3.63 Let Y = C(X), the space of continuous functions over the compact metric set X, known to be separable (as a consequence of the Stone–Weierstrass theorem). Then L_1(Ω, Y) coincides with the set of measurable functions f : Ω → Y such that ‖f(·, ω)‖_Y is integrable, and

‖f‖_{L_1(Ω,Y)} = ∫_Ω max_{x∈X} |f(x, ω)| dμ(ω). (3.84)

By the above dominated convergence theorem, if f_k ∈ L^1(Ω, Y) satisfies the domi-
nation hypothesis, and if f_k(·, ω) → f(·, ω) a.e. in C(X), i.e., max_{x∈X} |f_k(x, ω) −
f(x, ω)| → 0 a.e., then f_k → f in L^1(Ω, Y), that is,

∫_Ω max_{x∈X} |f_k(x, ω) − f(x, ω)| dμ(ω) → 0.   (3.85)
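The convergence (3.85) can be checked numerically in a toy case. The following sketch is an illustration under stated assumptions (the choice f_k(x, ω) = ωx/k, the grids, and Lebesgue measure on Ω = X = [0, 1] are not from the text): f_k is dominated by g(ω) = ω and converges to f = 0 in C(X) for each ω, so its L^1(Ω; C(X)) norms (3.84) decrease to 0.

```python
import numpy as np

# Toy check of Theorem 3.62 with Y = C(X), X = Omega = [0, 1] (Lebesgue
# measure); f_k(x, w) = w * x / k is an illustrative assumption.
xs = np.linspace(0.0, 1.0, 201)            # grid on X
ws = (np.arange(2000) + 0.5) / 2000        # midpoint grid on Omega

def l1_cx_norm(k):
    # integral over Omega of sup_x |f_k(x, w)|; here sup_x |w * x / k| = w / k
    sup_norms = np.array([np.max(np.abs(w * xs / k)) for w in ws])
    return sup_norms.mean()

norms = [l1_cx_norm(k) for k in (1, 2, 4, 8)]
assert all(a > b for a, b in zip(norms, norms[1:]))   # decreasing to 0
assert abs(norms[0] - 0.5) < 1e-9                     # exact value is 1/(2k)
```

The domination by the integrable g(ω) = ω is what licenses exchanging the limit in k with the integral over Ω.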

3.2 Integral Functionals

Let (Ω, F, μ) be a measure space, and let f : Ω × R^m → R̄. We consider an opti-
mization problem of the form

Min_{u ∈ L^p(Ω;R^m)}  F(u) := ∫_Ω f(ω, u(ω)) dμ(ω);  u(ω) ∈ U a.e.,   (3.86)

where p ∈ [1, ∞], and U ⊂ Rm . We adopt the following definition of the domain of
an integral cost, valid in the context of a minimization problem.

Definition 3.64 Let F be as above. We define its domain as

dom(F) := { u ∈ L^p(Ω; R^m); f(ω, u(ω)) is measurable and f(ω, u(ω))_+ is integrable }.   (3.87)

Denote the set of elements of L^p(Ω; R^m) with values a.e. in U by

L^p(Ω; U) := { u ∈ L^p(Ω)^m; u(ω) ∈ U a.e. }.   (3.88)

Intuitively, we would expect that the infimum in (3.86) can be computed by the
exchange of the minimization and integration operators, i.e.,

inf_{u ∈ L^p(Ω;U)} ∫_Ω f(ω, u(ω)) dμ(ω) = ∫_Ω inf_{v∈U} f(ω, v) dμ(ω).   (3.89)

This, however, raises some technical issues, the first of them being to check that
the r.h.s. integral is well-defined. We will solve this problem assuming that f is a
Carathéodory function, and then in the case when, in addition, the local constraint
set depends on ω. We will analyze conjugate functions, also in the case of more general
convex integrands, and discuss the related problem of minimizing such an integral
subject to the constraint that some integrals of the same type are nonpositive.

3.2.1 Minimization of Carathéodory Integrals

Definition 3.65 We say that f : Ω × R^m → R is a Carathéodory function if, for
a.a. ω, f(ω, ·) is continuous, and if, for all v ∈ R^m, f(·, v) is measurable.

Lemma 3.66 Let f be a Carathéodory function. Then ω → f(ω, u(ω)) is measurable,
for all u ∈ L^0(Ω; R^m).

Proof By Lemma 3.13, u ∈ L^0(Ω; R^m) is the limit a.e. of a sequence of simple func-
tions u_k(ω) = Σ_{i∈I_k} u_{ki} 1_{ω∈A_{ki}}, where the I_k are finite sets, and the A_{ki} are measurable
sets with null measure intersections. Therefore

f(ω, u_k(ω)) = Σ_{i∈I_k} f(ω, u_{ki}) 1_{ω∈A_{ki}}   (3.90)

is measurable and converges a.e. to f(ω, u(ω)). We conclude by Lemma 3.26. 

Proposition 3.67 Let f be a Carathéodory function, and dom(F) be nonempty.
Then ω → inf_{v∈U} f(ω, v) is a measurable function, and the exchange property (3.89)
holds.

Proof (a) Let û ∈ dom(F). Consider a dense sequence a_k in R^m. Let b_k ∈ U be such
that |a_k − b_k| ≤ 2 dist(a_k, U). Then b_k is a dense sequence in U. Let the sequence
u_k of functions Ω → R^m be inductively defined by u_0 = û, and for k ≥ 1:

u_k(ω) = { u_{k−1}(ω)   if f(ω, u_{k−1}(ω)) ≤ f(ω, b_k),
         { b_k          otherwise.   (3.91)

(b) Then u_k is measurable and f(ω, u_k(ω)) is a nonincreasing function of k. Since
u_0 = û ∈ dom(F), it follows that u_k ∈ dom(F) as well. Since
b_k is a dense sequence in U, and f(ω, ·) is continuous, we have that
bk is a dense sequence in U , and f (ω, ·) is continuous, we have that

lim_k f(ω, u_k(ω)) = inf_{v∈U} f(ω, v),   (3.92)

proving that the r.h.s. is measurable. If ∫_Ω f(ω, u_k(ω)) dμ(ω) = −∞ for large enough
k, then the equality (3.89) holds with value −∞. Otherwise, we conclude by the
monotone convergence Theorem 3.34. 
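The exchange property (3.89) is easy to verify numerically on a discretized example. The sketch below is an illustration under stated assumptions (Ω = [0, 1] with the uniform measure, the finite set U, and the integrand f(ω, v) = (v − ω)² are hypothetical choices, not data from the text): the integral of the pointwise infimum coincides with the value achieved by the measurable selection picking an argmin for each ω.

```python
import numpy as np

# Discretized check of (3.89): Omega = [0, 1], U = {0, 0.5, 1.0},
# f(w, v) = (v - w)^2 (all illustrative choices).
U = np.array([0.0, 0.5, 1.0])
ws = (np.arange(10000) + 0.5) / 10000        # midpoint grid on Omega

vals = (U[None, :] - ws[:, None]) ** 2       # f(w, v) for grid w, v in U

# r.h.s. of (3.89): integrate the pointwise infimum over v
rhs = vals.min(axis=1).mean()

# l.h.s.: the measurable selection u(w) = argmin_v f(w, v) attains it
u_star = U[vals.argmin(axis=1)]
lhs = ((u_star - ws) ** 2).mean()
assert abs(lhs - rhs) < 1e-12
```

The selection u_star plays the role of the limit ū of the minimizing sequence in the proof: it is piecewise constant, hence measurable.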

3.2.2 Measurable Multimappings

We next discuss a more general case where we have a constraint of the form

u(ω) ∈ U(ω), for a.a. ω ∈ Ω,   (3.93)

where U is a multimapping Ω → P(R^m). We say that U is measurable if, for any
closed set C ⊂ R^m, U^{−1}(C) is measurable, and that U is closed-valued if U(ω) is
closed for a.a. ω. Given a measurable multimapping U, for p ∈ [1, ∞], consider the
set

L^p(Ω; U) := { u ∈ L^p(Ω; R^m); u(ω) ∈ U(ω), a.a. ω ∈ Ω }.   (3.94)

Definition 3.68 Let U be a multimapping Ω → P(R^m). A sequence u_k in
L^p(Ω; U) such that, for a.a. ω, U(ω) is the closure of {u_k(ω); k ∈ N} is called a
Castaing representation of U in L^p(Ω; R^m).
By a result due to C. Castaing (see e.g. [102, Thm. 1B, p. 161]), any measurable
multimapping with closed values has a Castaing representation. We next prove this
result. We first need to properly define a single-valued projection on a nonconvex,
nonempty closed set. Let C be a closed subset of R^m. For z ∈ R^m, set P_z C :=
{ c ∈ C; |z − c| = dist(z, C) }. Next, let z_0, …, z_m be affinely independent (i.e., not
included in a hyperplane). Set

π_{z_0,…,z_m} C := P_{z_0} ··· P_{z_m} C.   (3.95)

It can be proved by induction that the intersection of k + 1 spheres with affinely
independent centers in R^m is a sphere in a subspace of dimension less than m − k.
It follows that π_{z_0,…,z_m} C is a singleton.
Definition 3.69 Assuming that z_m = 0, we define the projection of a point a ∈ R^m
over a closed set C by:

P̂_a(C) := π_{a+z_0,…,a+z_m} C.   (3.96)

Clearly, P̂_a(C) ∈ P_a C, so that if C is convex we recover the usual projection on a
convex set. We next denote by I the countable set of affinely independent elements
of (R^m)^m, with rational coordinates.
We say that the multimapping U(·) is compatible with the space L^p(Ω; R^m) if
the function ω → dist(0, U(ω)) (which by the proposition below is measurable) belongs
to L^p(Ω).

Proposition 3.70 Let U : Ω → P(R^m) be a measurable and closed-valued mul-
timapping. Then:
(i) For any b ∈ I, the map ω ∈ Ω → π̂_b(ω) := π_b U(ω) ∈ R^m is measurable.
(ii) If U is compatible with L^p(Ω; R^m), then the family {π̂_b(ω), b ∈ I} is a
Castaing representation of U.

Proof Since (ii) easily follows from (i), it suffices to prove the latter. We essen-
tially reproduce the arguments in [102]. Let a ∈ R^m. Since P̂_a is a composition of
projections, it suffices to prove that, if Γ(ω) is a measurable closed-valued mul-
timapping, then P_a(ω) := P_a Γ(ω) is measurable. For this, consider the sequence of
multimappings

Γ^k(ω) := { v ∈ R^m; dist(v, Γ(ω)) < k^{−1}; |v − a| < dist(a, Γ(ω)) + k^{−1} }.   (3.97)

Let C be a closed subset of R^m. Then P_a(ω) ∩ C ≠ ∅ iff C ∩ Γ^k(ω) ≠ ∅ for all k, and
thus

P_a^{−1}(C) = ∩_k (Γ^k)^{−1}(C).   (3.98)

Next, let D be a countable dense subset of C, which always exists. We claim that

(Γ^k)^{−1}(C) = (Γ^k)^{−1}(D) = ∪_{d∈D} (Γ^k)^{−1}(d).   (3.99)

The second equality is obvious and, since D is a dense subset of C, in order to
establish the first equality it suffices to check that if c ∈ C and ω_0 ∈ (Γ^k)^{−1}(c), then
for c′ close enough to c we have that ω_0 ∈ (Γ^k)^{−1}(c′). But this follows directly from
the definition of Γ^k(ω) in (3.97). Our claim follows.
On the other hand, for any v ∈ R^m and α ≥ 0 we have

{ ω ∈ Ω; dist(v, Γ(ω)) < α } = Γ^{−1}(v + αB),   (3.100)

where B is the unit ball in R^m. Thus, since Γ is measurable, so is the function
ω → dist(v, Γ(ω)). Therefore, from the definition (3.97), for any v ∈ R^m we
have that (Γ^k)^{−1}(v) is measurable. By (3.98) and (3.99), P_a^{−1}(C) is an intersection of unions
of measurable sets, and is therefore itself measurable. 
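In dimension one the machinery simplifies drastically, and the idea of a Castaing representation can be sketched directly. The example below is an illustrative assumption (the interval-valued multimapping U(ω) = [ω, ω + 1] and the dense family of rationals are not from the text): projecting a fixed dense sequence onto the closed set U(ω) produces countably many measurable selections whose values are dense in U(ω).

```python
import numpy as np

# Sketch of a Castaing representation for U(w) = [w, w + 1] in R:
# project a fixed dense sequence (rationals) onto the closed set U(w).
rationals = [p / q for q in range(1, 20) for p in range(-2 * q, 3 * q)]

def proj(a, w):
    # projection of the point a onto the closed interval [w, w + 1]
    return min(max(a, w), w + 1.0)

w = 0.3
values = sorted({proj(a, w) for a in rationals})
# the projected values fill [w, w + 1] densely: endpoints are attained
# exactly, and consecutive gaps are small
assert abs(values[0] - w) < 1e-9 and abs(values[-1] - (w + 1)) < 1e-9
assert max(b - a for a, b in zip(values, values[1:])) < 0.1
```

Each map ω → proj(a, ω) is continuous in ω, hence measurable, mirroring point (i) of Proposition 3.70 in this elementary convex case.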

We apply the previous result to the problem of minimizing an integral cost.


Proposition 3.71 Let f : Ω × R^m → R be a Carathéodory function, and U(ω) be
a measurable, closed-valued multimapping from Ω to R^m. Assume that there exists
a û in dom(F) ∩ L^p(Ω, U). Then the following exchange property holds:

inf_{u ∈ L^p(Ω;U)} ∫_Ω f(ω, u(ω)) dμ(ω) = ∫_Ω inf_{v∈U(ω)} f(ω, v) dμ(ω).   (3.101)

Proof Let a_k be a Castaing representation of the multimapping U. Consider the
sequence u_k defined by u_0 := û, and for k ≥ 1:

u_k(ω) = { u_{k−1}(ω)   if f(ω, u_{k−1}(ω)) ≤ f(ω, a_k(ω)),
         { a_k(ω)       otherwise.   (3.102)

We conclude, as in step (b) of the proof of Proposition 3.67, by the monotone con-
vergence Theorem 3.34. 

Remark 3.72 If U(ω) is, for a.a. ω, a finite set, then the above minimizing sequence
u_k converges a.e. to some ū ∈ L^0(Ω), with values in U(ω). By the monotone con-
vergence Theorem 3.34, we have that

inf_{u ∈ L^p(Ω;U)} ∫_Ω f(ω, u(ω)) dμ(ω) = ∫_Ω f(ω, ū(ω)) dμ(ω).   (3.103)

3.2.3 Convex Integrands

In the case of convex integrands (such that f (ω, ·) is, for a.a. ω, convex) we can deal
with integral functionals using the following result.

Lemma 3.73 Let g be a proper, l.s.c. convex function R^m → R̄, and E be a dense
subset of dom(g). Then, for all y ∈ dom(g), we have that

g(y) = lim inf { g(x); x ∈ E, x → y }.   (3.104)

Proof Denote by ĝ(y) the r.h.s. of (3.104). Since g is l.s.c., g(y) ≤ ĝ(y). We next
prove the opposite inequality. Changing if necessary R^m into the affine space spanned
by dom(g), we may assume that the latter has a nonempty interior. We know that g
is continuous over the interior of its domain. Since E is a dense subset of dom(g), if
y ∈ int(dom(g)), then (3.104) holds, and hence, for all y ∈ dom(g):

ĝ(y) ≤ lim inf { g(x); x → y; x ∈ int(dom(g)) }.   (3.105)

Let y ∈ dom(g), y_0 ∈ int(dom(g)), and set y_t := (1 − t)y_0 + t y, with
t ∈ (0, 1). Since t → g(y_t) is l.s.c. convex, we have

ĝ(y) ≤ lim sup_{t↑1} g(y_t) ≤ lim sup_{t↑1} ((1 − t)g(y_0) + t g(y)) = g(y),   (3.106)

as was to be shown. 
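Lemma 3.73 can be illustrated numerically at a boundary point of the domain. The example below is a sketch under stated assumptions (the particular g and dense set E are hypothetical): g(x) = −√x on dom(g) = [0, 1] is proper, l.s.c. and convex, E is the set of dyadic rationals in (0, 1), and approaching y = 1 through E recovers g(1) = −1.

```python
import numpy as np

# Sketch of Lemma 3.73: g(x) = -sqrt(x) for x in [0, 1], +infinity outside,
# is proper l.s.c. convex; E (dyadic rationals) is dense in dom(g).
def g(x):
    return -np.sqrt(x) if 0.0 <= x <= 1.0 else np.inf

E = [k / 2 ** 12 for k in range(1, 2 ** 12)]
# values of g at points of E close to the boundary point y = 1
approx = [g(x) for x in E if abs(x - 1.0) < 1e-3]
assert approx and abs(min(approx) - g(1.0)) < 1e-3
```

Note that y = 1 is not interior to dom(g); the lemma is precisely what allows the recovery of g there from the dense subset E.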

Proposition 3.74 Let f : Ω × R^m → R be such that, for a.a. ω, f(ω, ·) is l.s.c.
convex with a domain having a nonempty interior, and, for all v ∈ R^m, f(·, v) is measurable. Assume
that there exists a û ∈ dom(F). Then the exchange property (3.101) holds, with U(ω) = R^m.

Proof The proof is similar to that of Proposition 3.67, with here U(ω) = R^m. Given
a dense sequence a_k in R^m, set

u_k(ω) = { u_{k−1}(ω)   if f(ω, u_{k−1}(ω)) ≤ f(ω, a_k),
         { a_k          otherwise.   (3.107)

Then u_k is measurable, and lim_k f(ω, u_k(ω)) = inf_{v∈R^m} f(ω, v) in view of
Lemma 3.73. 

We next deal with the more general situation when dom( f (ω, ·)) may have an
empty interior.

Definition 3.75 Let p ∈ [1, ∞]. We say that f : Ω × R^m → R̄ is a normal inte-
grand if the multimapping dom(f(ω, ·)) has a Castaing representation, i.e., if there
exists a sequence u_k in L^p(Ω)^m such that {u_k(ω)} is dense in dom(f(ω, ·)), for a.a.
ω. If in addition f(ω, ·) is l.s.c. convex for a.a. ω, we say that f is a normal convex
integrand.

Proposition 3.76 Let f : Ω × Rm → R be a normal convex integrand. Then the


exchange property (3.101) holds.

Proof The proof is an easy variant of that of Proposition 3.74. The details are left to
the reader. 

Remark 3.77 The difficulty here is to check the existence of a Castaing representa-
tion in Definition 3.75. If dom( f (ω, ·)) is a closed-valued measurable multimapping,
this follows from Proposition 3.70. If f is a convex integrand and dom( f (ω, ·)) has
a nonempty interior a.e., a Castaing representation is the sequence u k constructed in
the proof of Proposition 3.74.

3.2.4 Conjugates of Integral Functionals

3.2.4.1 Case p < ∞

As we have seen, when p ∈ [1, ∞), the dual of L^p(Ω)^m is L^q(Ω)^m, where 1/p +
1/q = 1, and when p = ∞, its dual contains L^1(Ω)^m. Let U(·) be a measurable
multimapping over Ω with image in R^m. Given f : Ω × R^m → R̄, consider the
function F : L^p(Ω)^m → R̄,

F(u) := ∫_Ω f(ω, u(ω)) dμ(ω),   (3.108)

with domain

dom(F) := { u ∈ L^p(Ω)^m; f(ω, u(ω)) is measurable; f(ω, u(ω))_+ ∈ L^1(Ω) },   (3.109)

and

F_U(u) := F(u) if u ∈ L^p(Ω; U), +∞ otherwise,   (3.110)

with domain

dom(F_U) := dom(F) ∩ L^p(Ω; U).   (3.111)

Let u^* ∈ L^q(Ω)^m. Then

F_U^*(u^*) := sup_{u ∈ dom(F_U)} ∫_Ω ( u^*(ω) · u(ω) − f(ω, u(ω)) ) dμ(ω).   (3.112)

This amounts to minimizing the integral of ω → f(ω, u(ω)) − u^*(ω) · u(ω) over
L^p(Ω, U). The latter integrand is Carathéodory (resp. a normal convex integrand) iff the same
holds for f(ω, u). Set

f_U(ω, u) := f(ω, u) + I_{U(ω)}(u),   (3.113)

whose Fenchel conjugate is

f_U^*(ω, u^*) := sup_{u ∈ U(ω)} ( u^* · u − f(ω, u) ).   (3.114)

As a consequence of Propositions 3.71 and 3.76, we obtain the following statements:

Proposition 3.78 Let f : Ω × R^m → R be a Carathéodory function, and U(ω) be a
measurable, closed-valued multimapping, such that dom(F_U) ≠ ∅. Let p ∈ [1, ∞],
and u^* ∈ L^q(Ω)^m. Then

F_U^*(u^*) = ∫_Ω f_U^*(ω, u^*(ω)) dμ(ω).   (3.115)

Corollary 3.79 Let f, p and u^* be as in Proposition 3.78, and let F_U have a finite
value at u. Then u^* ∈ ∂F_U(u) iff u^*(ω) ∈ ∂f_U(ω, u(ω)) a.e.

Proof By the above proposition, the Fenchel–Young inequality for F_U reads

∫_Ω ( f_U(ω, u(ω)) + f_U^*(ω, u^*(ω)) − u^*(ω) · u(ω) ) dμ(ω) ≥ 0,   (3.116)

and u^* ∈ ∂F_U(u) iff equality holds, i.e., iff the above integrand is equal to 0 a.e. The
conclusion follows. 
Proposition 3.80 Let f : Ω × R^m → R be a normal convex integrand. Let p ∈
[1, ∞], and u^* ∈ L^q(Ω)^m. If dom(F) ≠ ∅, then

F^*(u^*) = ∫_Ω f^*(ω, u^*(ω)) dμ(ω).   (3.117)

Corollary 3.81 Let f, p and u^* be as in Proposition 3.80, and let F have a finite
value at u. Then u^* ∈ ∂F(u) iff u^*(ω) ∈ ∂f(ω, u(ω)) a.e.
Proof The argument is similar to the one in the proof of Corollary 3.79. 
Example 3.82 We extend Example 1.38 to the present setting as follows. Take
f(x) := ‖x‖_p^p / p with p > 1. Then for u^* ∈ L^q(Ω), with 1/p + 1/q = 1, we have
that

F^*(u^*) = (1/q) ∫_Ω ‖u^*(ω)‖_q^q dμ(ω).   (3.118)
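The scalar conjugate computation behind (3.118) can be checked numerically by brute force; the grid and test points below are illustrative assumptions. For f(x) = |x|^p/p the pointwise Fenchel conjugate is f*(s) = |s|^q/q with 1/p + 1/q = 1.

```python
import numpy as np

# Numeric check of the scalar conjugate pair behind (3.118):
# f(x) = |x|^p / p  has conjugate  f*(s) = |s|^q / q,  1/p + 1/q = 1.
p, q = 3.0, 1.5
xs = np.linspace(-50.0, 50.0, 200001)   # fine grid standing in for R

def conjugate(s):
    # sup_x ( s * x - f(x) ), approximated on the grid
    return np.max(s * xs - np.abs(xs) ** p / p)

for s in (-2.0, 0.5, 1.7):
    assert abs(conjugate(s) - abs(s) ** q / q) < 1e-3
```

The maximizer x = sign(s)|s|^{1/(p−1)} lies well inside the grid for these test points, which is why the brute-force supremum is accurate.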

3.2.4.2 General Case When p = ∞

We next consider the case when p = ∞, and u^* ∉ L^1(Ω)^m. We need the following
characterization of the elements of L^∞(Ω)^*.
Lemma 3.83 Each u^* ∈ L^∞(Ω)^* has the unique decomposition u^* = u_1^* + u_s^*,
where the regular part u_1^* belongs to L^1(Ω), and the singular part u_s^* is such that
there exists a nondecreasing sequence A_k of measurable subsets of Ω, such that
Ω = ∪_k A_k, and that for any k ∈ N, ⟨u_s^*, u⟩ = 0, for all u ∈ L^∞(Ω) with zero value
on Ω \ A_k.

Proof This difficult result is due to Yosida and Hewitt [126]. See also Castaing
and Valadier [33, Chap. 8] (it is convenient to say that u_s^* is concentrated on the
complement of the A_k, in the sense of the above definition).
Example 3.84 Take Ω = N, endowed with a probability measure μ such that each
“basis” element e_i (a sequence of zeros except for the ith term, which is equal to 1)
has a positive probability. Let X := ℓ^∞ be the space of bounded sequences. Given
u^* ∈ X^*, set a_i := ⟨u^*, e_i⟩, i ∈ N. Then the regular part is defined by ⟨u_1^*, u⟩ =
Σ_{i∈N} a_i u_i, for all u ∈ ℓ^∞ with u = Σ_{i∈N} u_i e_i, and the singular part depends only
on the behavior at infinity of u.
We can construct a singular element of X^* as follows. If x ∈ X has a limit, denote
it by lim(x). This is a continuous linear form over the subspace X_1 of sequences
having a limit. Then extend this linear form over X thanks to Corollary 1.8.
Lemma 3.85 Let f be a normal convex integrand, and let F : L^∞(Ω; R^m) → R̄,
defined by F(u) = ∫_Ω f(ω, u(ω)) dμ(ω), be proper. Let u^* ∈ L^∞(Ω; R^m)^* have the
decomposition u^* = u_1^* + u_s^* as in Lemma 3.83. Then we have the decoupling property:

F^*(u^*) = F^*(u_1^*) + σ(u_s^*, dom(F)),
F^*(u_1^*) = ∫_Ω f^*(ω, u_1^*(ω)) dμ(ω).   (3.119)

Proof The second relation follows from Proposition 3.80; let us prove the first one.
For all α < F^*(u_1^*), there exists a u_α ∈ dom(F) such that

α < ⟨u_1^*, u_α⟩ − ∫_Ω f(ω, u_α(ω)) dμ(ω).   (3.120)

For all β < σ(u_s^*, dom(F)), there exists a u_β ∈ dom(F) such that ⟨u_s^*, u_β⟩ > β. Let
A_k be as in the definition of the singular part of an element of L^∞(Ω)^*. Set

u_{α,β,k}(ω) := { u_α(ω)   if ω ∈ A_k,
               { u_β(ω)   otherwise.   (3.121)

Then

⟨u_s^*, u_{α,β,k}⟩ = ⟨u_s^*, 1_{Ω\A_k} u_β⟩ = ⟨u_s^*, u_β⟩ > β.   (3.122)

For a.a. ω, u_{α,β,k}(ω) = u_α(ω) for large enough k, so that

u_{α,β,k}(ω) → u_α(ω) and f(ω, u_{α,β,k}(ω)) → f(ω, u_α(ω)) a.e., when k ↑ ∞.   (3.123)

So, by the dominated convergence theorem (note that, by the definition of u_α and
u_β, f(ω, u_α(ω)) and f(ω, u_β(ω)) are integrable), we get

lim_k ∫_Ω ( u_1^*(ω) · u_{α,β,k}(ω) − f(ω, u_{α,β,k}(ω)) ) dμ(ω)
    = ∫_Ω ( u_1^*(ω) · u_α(ω) − f(ω, u_α(ω)) ) dμ(ω),   (3.124)

and so

F^*(u^*) ≥ lim_k ( ⟨u_1^* + u_s^*, u_{α,β,k}⟩ − ∫_Ω f(ω, u_{α,β,k}(ω)) dμ(ω) ) > α + β.   (3.125)

Maximizing over α and β, we get F^*(u^*) ≥ F^*(u_1^*) + σ(u_s^*, dom(F)). The opposite
inequality is obvious, since for u ∈ dom(F), by the Fenchel–Young inequality for
F^*(u_1^*), we get

⟨u^*, u⟩ − F(u) = ⟨u_1^*, u⟩ − F(u) + ⟨u_s^*, u⟩ ≤ F^*(u_1^*) + σ(u_s^*, dom(F)).   (3.126)

The conclusion follows. 

Corollary 3.86 Let F be as in Lemma 3.85, with a finite value at u. Then u^* = u_1^* + u_s^*
belongs to ∂F(u) iff the two conditions below hold:

(i) u_1^* ∈ ∂F(u), i.e., u_1^*(ω) ∈ ∂f(ω, u(ω)) a.e.,
(ii) ⟨u_s^*, u⟩ = σ(u_s^*, dom(F)), i.e., u_s^* ∈ N_{dom(F)}(u).   (3.127)

Proof By Lemma 3.85, the Fenchel–Young inequality for F reads as

∫_Ω ( f(ω, u(ω)) + f^*(ω, u_1^*(ω)) − u_1^*(ω) · u(ω) ) dμ(ω)
    + σ(u_s^*, dom(F)) − ⟨u_s^*, u⟩ ≥ 0,   (3.128)

and u^* ∈ ∂F(u) iff the sum equals 0, i.e., iff both the integral and

Δ := σ(u_s^*, dom(F)) − ⟨u_s^*, u⟩   (3.129)

are equal to 0 (since each of these two terms is nonnegative). Now

Δ = 0 iff ⟨u_s^*, u′ − u⟩ ≤ 0, for all u′ ∈ dom(F),   (3.130)

i.e., Δ = 0 iff u_s^* ∈ N_{dom(F)}(u). The conclusion follows. 

Remark 3.87 With the previous notation, if u ∈ int(dom(F)), then

N_{dom(F)}(u) = {0} and ∂F(u) ⊂ L^1(Ω, R^m).   (3.131)

3.2.5 Deterministic Decisions in R^m

Consider now the case when the decision x ∈ R^m should not depend on ω. We have
to minimize

f̄(x) := ∫_Ω f(ω, x) dμ(ω),   (3.132)

where x ∈ R^m and f is a normal convex integrand. Set Y := L^p(Ω, R^m), with
p ∈ [1, ∞]. We need Y to contain the constant functions, so that μ(Ω) < ∞, and so
we may assume that (Ω, F, μ) is a probability space. Denote by A the operator that
to x ∈ R^m associates the constant function in Y with value x. Define F : Y → R̄ by

F(y) := ∫_Ω f(ω, y(ω)) dμ(ω).   (3.133)

Then f̄ = F ∘ A, and F is convex. Assuming that, for some g ∈ L^1(Ω),

f(ω, y(ω)) ≥ g(ω) a.e.,   (3.134)

it follows from Fatou's Lemma 3.41 and the l.s.c. of f(ω, ·) a.s. that F is l.s.c. Given
a nonempty closed, convex subset K of R^m, consider the problem

Min_{x∈K} f̄(x).   (P)

We are in the framework of the Fenchel duality theory in Example 1.2.1.8. The
expression of the stability condition (1.203) becomes

ε B_Y ⊂ dom(F) − AK, for some ε > 0.   (3.135)

By the subdifferential calculus rules (Lemma 1.120), if f̄ has a finite value at x ∈ R^m,
then:

∂f̄(x) ⊃ A^* ∂F(Ax), with equality if (3.135) holds.   (3.136)

Let us give the expression of A^*. Here (p, q) are such that 1/p + 1/q = 1.

Definition 3.88 Let y_s^* be a singular element of L^∞(Ω, R^m)^*. Let 1 denote the
constant function of L^∞(Ω) with value 1. Then we define the expectation of y_s^* by,
for i = 1, …, m:

(E y_s^*)_i = ⟨y_{si}^*, 1⟩.   (3.137)

Lemma 3.89 (i) If y^* ∈ L^q(Ω, R^m), then A^* y^* = ∫_Ω y^*(ω) dμ(ω) = E y^*.
(ii) When p = ∞, if y^* ∈ L^∞(Ω, R^m)^* has the decomposition y^* = y_1^* + y_s^*, with
y_1^* ∈ L^1(Ω, R^m), and y_s^* singular with components denoted by y_{si}^*, i = 1, …, m, we
have

A^* y^* = E y_1^* + E y_s^*.   (3.138)

Proof Point (i) follows from

⟨y^*, Ax⟩ = ∫_Ω y^*(ω) · x dμ(ω) = ( ∫_Ω y^*(ω) dμ(ω) ) · x.   (3.139)

Point (ii) follows from ⟨y^*, Ax⟩ = ( ∫_Ω y_1^*(ω) dμ(ω) ) · x + Σ_{i=1}^m x_i ⟨y_{si}^*, 1⟩. 

We deduce the following result.

Proposition 3.90 Let f be a normal convex integrand such that f̄ has a finite value
at x and that (3.134), (3.135) hold. Then, when p ∈ [1, ∞):

∂f̄(x) = { ∫_Ω x^*(ω) dμ(ω); x^* ∈ L^q(Ω); x^*(ω) ∈ ∂f(ω, x) a.s. },   (3.140)

and when p = ∞, for some singular x_s^*:

∂f̄(x) = { ∫_Ω x_1^*(ω) dμ(ω) + E x_s^*; x_1^* ∈ L^1(Ω);
           x_1^*(ω) ∈ ∂f(ω, x) a.s.; x_s^* ∈ N_{dom(F)}(x1) }.   (3.141)

Corollary 3.91 If x is a solution of (P), and (3.134), (3.135) hold, we deduce from
the above proposition that, if p ∈ [1, ∞), then ∫_Ω x^*(ω) dμ(ω) + N_K(x) ∋ 0, with
x^* as in (3.140), and if p = ∞, then ∫_Ω x_1^*(ω) dμ(ω) + E x_s^* + N_K(x) ∋ 0, with x_1^*,
x_s^* as in (3.141).

Remark 3.92 The sum in the first line of (3.141) reduces to E(x_1^* + x_s^*)
(where the expectation of the sum is defined as the sum of the expectations, which is
correct since the decomposition is unique).
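The first case of formula (3.140) can be illustrated on a toy instance. The sketch below is an illustration under stated assumptions (a finite probability space stands in for (Ω, F, μ), and f(ω, x) = |x − ω| is a hypothetical integrand): averaging a measurable selection of the pointwise subdifferentials yields a subgradient of the expectation f̄.

```python
import numpy as np

# Sketch of (3.140): f(w, x) = |x - w|, finite probability space.
# x*(w) = sign(x - w) is a selection of the pointwise subdifferentials,
# and its expectation is a subgradient of f_bar(x) = E|x - w|.
ws = np.array([0.1, 0.4, 0.7])
probs = np.array([0.2, 0.5, 0.3])
x = 0.55

def fbar(t):
    return np.sum(probs * np.abs(t - ws))

g = np.sum(probs * np.sign(x - ws))        # E[x*(w)]
for y in np.linspace(-1.0, 2.0, 61):
    # subgradient inequality: f_bar(y) >= f_bar(x) + g * (y - x)
    assert fbar(y) >= fbar(x) + g * (y - x) - 1e-12
```

Since x avoids the kinks of f(ω, ·), each ∂f(ω, x) is the singleton {sign(x − ω)}, so g is in fact the derivative of f̄ at x.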

3.2.6 Constrained Random Decisions

We consider a more general situation where the decision space is a Banach space different
from L^p(Ω), having in mind the case when ω is a vector and the decision might
depend on some components of the vector. So let X be a Banach space and let
A ∈ L(X, L^p(Ω)). Given a closed convex subset K of X, we consider the problem

Min_{x∈K} f̄(x),   (P)

where F(y) := ∫_Ω f(ω, y(ω)) dμ(ω), and f̄(x) := F(Ax). We assume that the fol-
lowing stability condition (similar to (3.135)) holds:

ε B_Y ⊂ dom(F) − AK, for some ε > 0.   (3.142)

By the same arguments as before we obtain:

Proposition 3.93 Let f be a normal convex integrand such that f̄ has a finite value
at x, and that (3.134) and (3.142) hold. Then (i) when p ∈ [1, ∞):

∂f̄(x) = { A^* x^*; x^* ∈ L^q(Ω); x^*(ω) ∈ ∂f(ω, (Ax)(ω)) a.s. },   (3.143)

and when p = ∞, for some singular x_s^*:

∂f̄(x) = { A^*(x_1^* + x_s^*); x_1^* ∈ L^1(Ω);
           x_1^*(ω) ∈ ∂f(ω, (Ax)(ω)) a.s.; x_s^* ∈ N_{dom(F)}(Ax) }.   (3.144)

(ii) If x is a solution of (P), say when p = ∞, we have, with x_1^* and x_s^* as above, that

A^*(x_1^* + x_s^*) + N_K(x) ∋ 0.   (3.145)

We next apply this result when Ω = Ω_1 × Ω_2, where the (Ω_i, F_i, μ_i) are measure spaces
for i = 1, 2, F is the product σ-algebra and μ is the product of μ_1 and μ_2. We take
X = L^p(Ω_1)^m, that is, the decision x may depend on ω_1, but not on ω_2. Then A is
the embedding from X = L^p(Ω_1)^m into Y = L^p(Ω)^m, and so A^* maps each x^* ∈ Y^*
to its restriction to the subspace X. If u^* ∈ L^q(Ω)^m, its restriction v^* := A^* u^* is such
that, for any v ∈ L^p(Ω_1)^m:

⟨v^*, v⟩ = ∫_Ω u^*(ω_1, ω_2) · v(ω_1) dμ(ω)
        = ∫_{Ω_1} v(ω_1) · ( ∫_{Ω_2} u^*(ω_1, ω_2) dμ_2(ω_2) ) dμ_1(ω_1),   (3.146)

and therefore, for a.a. ω_1:

v^*(ω_1) = ∫_{Ω_2} u^*(ω_1, ω_2) dμ_2(ω_2).   (3.147)
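The adjoint formula (3.147) can be verified on a finite product space. The sketch below is an illustration under stated assumptions (finite uniform spaces Ω_1, Ω_2 and random data replace the general setting): the adjoint of the embedding v → u(ω_1, ω_2) = v(ω_1) averages the dual variable over ω_2.

```python
import numpy as np

# Finite-dimensional check of (3.146)-(3.147): Omega = Omega1 x Omega2
# with uniform probabilities (illustrative assumption).
rng = np.random.default_rng(0)
n1, n2 = 4, 5
mu1, mu2 = np.full(n1, 1 / n1), np.full(n2, 1 / n2)
u_star = rng.normal(size=(n1, n2))     # dual variable on the product space
v = rng.normal(size=n1)                # decision depending on omega1 only

lhs = np.sum(u_star * v[:, None] * mu1[:, None] * mu2[None, :])  # <u*, Av>
v_star = u_star @ mu2                                            # (3.147)
rhs = np.sum(v_star * v * mu1)                                   # <A*u*, v>
assert abs(lhs - rhs) < 1e-12
```

The operation u^* → v^* is exactly a partial integration over ω_2, i.e., (up to normalization) a conditional-expectation-like averaging.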

3.2.7 Linear Programming with Simple Recourse

Let us consider the following problem of linear programming with simple recourse:

Min_{x,y}  c · x + E_ω[d_ω · y_ω];
x ∈ R^n_+;  A^0 x ≤ b^0;   (3.148)
y_ω ∈ R^m_+;  A_ω y_ω = b_ω + M_ω x, a.s.

Here (Ω, F, μ) is a probability space, and (d_ω, A_ω, b_ω, M_ω) are measurable, essen-
tially bounded vector or matrix functions whose dimensions do not depend on ω. For
given x ∈ R^n and ω, the recourse y_ω is the solution of the following problem:

Min d_ω · y_ω;  y_ω ∈ R^m_+;  A_ω y_ω = b_ω + M_ω x.   (P_ω(x))

The linear program dual to (P_ω(x)) is

Max_{λ_ω}  −λ_ω · (b_ω + M_ω x);  d_ω + (A_ω)^⊤ λ_ω ≥ 0.   (D_ω(x))

Since its feasible set does not depend on x, it is natural to suppose that it is nonempty
a.s. (otherwise (3.148) would have infimum −∞ whenever it is feasible). Denote
by v_ω(x) the value of problem (P_ω(x)). By linear programming duality theory
(Lemma 1.26), we have that

v_ω(x) = val(P_ω(x)) = val(D_ω(x)) a.s.   (3.149)
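On a small instance, the duality relation (3.149) can be checked with a generic LP solver. The data (d, A, b, M, x) below are hypothetical, chosen only so that both programs are feasible and bounded:

```python
import numpy as np
from scipy.optimize import linprog

# Check val(P_w(x)) = val(D_w(x)) for one fixed scenario w and decision x.
d = np.array([1.0, 2.0, 0.5])
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([1.0, 2.0])
M = np.array([[0.5, 0.0],
              [0.0, 1.0]])
x = np.array([1.0, 1.0])

# primal: min d.y  s.t.  A y = b + M x,  y >= 0
primal = linprog(d, A_eq=A, b_eq=b + M @ x, bounds=[(0, None)] * 3)
# dual: max -lam.(b + M x)  s.t.  d + A^T lam >= 0
# (written as a minimization of lam.(b + M x) with -A^T lam <= d)
dual = linprog(b + M @ x, A_ub=-A.T, b_ub=d, bounds=[(None, None)] * 2)
assert primal.success and dual.success
assert abs(primal.fun + dual.fun) < 1e-8   # val(P) = -dual.fun = val(D)
```

Since the dual is posed as a minimization, val(D_ω(x)) equals −dual.fun, whence the sign in the final assertion.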

Lemma 3.94 The function v_ω is a normal convex integrand.

Proof (i) It is easily checked that v_ω(x) is a.s. convex. Since v_ω(x) = val(D_ω(x))
a.s., the latter being a supremum of affine functions of x, it is also l.s.c.
(ii) Let y^k be a dense sequence in R^m_+. Let |·|_1 denote the ℓ^1 norm in a finite-
dimensional space. The function

φ_ω(x) := inf_k |A_ω y^k − b_ω − M_ω x|_1   (3.150)

is measurable, and satisfies

φ_ω(x) = min_{y ∈ R^m_+} |A_ω y − b_ω − M_ω x|_1,   (3.151)

the infimum being attained a.e. since it corresponds to the value of a linear program.
Therefore, φ_ω(x) = 0 iff x ∈ dom v_ω, and dom v_ω = φ_ω^{−1}(0) is a.s. nonempty. In
addition, let the sequence x^j → x be such that φ_ω(x^j) → 0. Then there exists a
sequence y^j ∈ R^m_+ such that |A_ω y^j − b_ω − M_ω x^j|_1 → 0. It follows that |A_ω y^j −
b_ω − M_ω x|_1 → 0, that is,

x^j → x and φ_ω(x^j) → 0 implies φ_ω(x) = 0.   (3.152)

Set

G_ω := { (y_ω, x) ∈ R^m_+ × R^n_+; A_ω y_ω − b_ω − M_ω x = 0 }.   (3.153)

By Lemma 1.28, there exists a Hoffman constant c_ω > 0 such that

dist((y_ω, x), G_ω) ≤ c_ω |A_ω y_ω − b_ω − M_ω x|, for all (y_ω, x) ∈ R^m_+ × R^n_+.   (3.154)

Minimizing the r.h.s. over y_ω ∈ R^m_+, we obtain

dist(x, dom v_ω) ≤ inf_{y_ω ∈ R^m_+} dist((y_ω, x), G_ω) ≤ c_ω φ_ω(x).   (3.155)

Now it is enough to check that for any bounded closed subset C of R^n, (dom v)^{−1}(C)
is measurable. Let c^ℓ be a dense sequence in C. We claim that

(dom v)^{−1}(C) = E, where E := ∩_{ε>0} ∪_{ℓ∈N} { ω ∈ Ω; φ_ω(c^ℓ) ≤ ε }.   (3.156)

Indeed, let ω ∈ (dom v)^{−1}(C), and let x ∈ dom(v_ω) ∩ C, so that φ_ω(x) = 0. For
any sequence ε_k ↓ 0 there exists some ĉ^k in the sequence c^ℓ such that
|x − ĉ^k| < ε_k. Being an infimum of continuous functions, φ_ω(·) is u.s.c., and therefore
lim sup_k φ_ω(ĉ^k) ≤ φ_ω(x) = 0. It follows that ω ∈ E.
Conversely, let ω ∈ E. Given ε_k ↓ 0 there exists a sequence c^k in C such that
φ_ω(c^k) ≤ ε_k. Extracting a subsequence if necessary, we may assume that c^k → x ∈
C. By (3.152) we have that φ_ω(x) = 0, that is, x ∈ dom(v_ω), as was to be proved.
The claim (3.156) follows.
Since the set E is obviously measurable, the multimapping ω → dom v_ω is mea-
surable. Since it is also closed-valued, we deduce from Proposi-
tion 3.70 the existence of a Castaing representation. The conclusion follows. 

Since v_ω is a normal convex integrand, we have that

inf_{y ∈ L^∞(Ω)^m} { E[d_ω · y_ω]; y_ω ∈ F(P_ω(x)) a.s. } = E[v_ω(x)].   (3.157)

Therefore, the original problem is equivalent to

Min_x  c · x + E_ω v_ω(x);  x ∈ R^n_+;  A^0 x ≤ b^0.   (3.158)

Consider the qualification condition:

There exist ε > 0 and x̂ ∈ R^n such that B(x̂, ε) ⊂ dom v_ω a.s., x̂ > 0 and A^0 x̂ < b^0.   (3.159)

Define F : L^∞(Ω)^n → R̄ by F(z) := E_ω v_ω(z_ω). We recall the definition of the expec-
tation of elements of L^∞(Ω, R^n)^* given in Definition 3.88.

Theorem 3.95 Let the qualification condition (3.159) hold. If x̄ is a solution of
(3.148), then there exist s̄ ∈ R^n_+, a multiplier η̄ ≥ 0, λ̄ ∈ L^1(Ω) with λ̄(ω) ∈ S(D_ω(x̄)) a.s., and
x_s^* ∈ N_{dom(F)}(x̄1), such that

c + s̄ + (A^0)^⊤ η̄ + E x_s^* − E[(M_ω)^⊤ λ̄(ω)] = 0,
s̄ ≥ 0;  s̄ · x̄ = 0;  η̄ ≥ 0;  η̄ · (A^0 x̄ − b^0) = 0.   (3.160)

Proof Denote by ∂v_ω(x) the partial subdifferential of v_ω(x) w.r.t. x. By Lemma 1.55,
we have that

∂v_ω(x) = −(M_ω)^⊤ S(D_ω(x)) a.s.   (3.161)

Proposition 3.90, whose hypotheses hold in view of (3.159), implies that

∂F(x̄) = { ∫_Ω x_1^*(ω) dμ(ω) + E x_s^*; x_1^* ∈ L^1(Ω);
           x_1^*(ω) ∈ ∂v_ω(x̄) a.s.; x_s^* ∈ N_{dom(F)}(x̄1) }.   (3.162)

Let P^0 := { x ∈ R^n_+; A^0 x ≤ b^0 }. Condition (3.159) implies that

c + ∂F(x̄) + N_{P^0}(x̄) ∋ 0.   (3.163)


By linear programming duality,

N_{P^0}(x̄) = { (s, η) ∈ R^n_+ × R^q_+; s · x̄ = η · (A^0 x̄ − b^0) = 0 }.   (3.164)

We conclude by (3.161). 

3.3 Applications of the Shapley–Folkman Theorem

We have already stated the Shapley–Folkman Theorem 1.170.

3.3.1 Integrals of Multimappings

Let (Ω, F, μ) be a probability space. We assume that μ is non-atomic, i.e., for any
A ∈ F with μ(A) > 0, there exists a B ∈ F, B ⊂ A, such that 0 < μ(B) < μ(A).
This is known to be equivalent to the Darboux property³:

For all α ∈ (0, 1), there exists a B ∈ F, B ⊂ A,
such that μ(B) = αμ(A).   (3.165)

Let F be a (not necessarily measurable) multimapping Ω → R^n, defined a.e.
on Ω. If f ∈ L^1(Ω) is such that f(ω) ∈ F(ω) a.e., we say that f is an integrable
selection of F. We set

∫_Ω F := { ∫_Ω f(ω) dμ(ω); f is an integrable selection of F }.   (3.166)

The following holds [74]. Our proof follows [119].

Theorem 3.96 We have that ∫_Ω F is a convex subset of R^n.

Proof Let x_1 and x_2 belong to ∫_Ω F, and x = αx_1 + (1 − α)x_2, for some α ∈ (0, 1).
We have to prove that x ∈ ∫_Ω F. So, f_1, f_2 being the integrable selections associated
with x_1 and x_2, it suffices to consider the case when F(ω) := {f_1(ω), f_2(ω)}. Let
p > n. The Darboux property implies that Ω is the union of p disjoint measurable
sets A_i, each of measure 1/p. Then

x ∈ conv( ∫_Ω F ) = Σ_{i=1}^p conv( ∫_{A_i} F ).   (3.167)

3 For a proof of the Darboux property, based on Zorn’s lemma, see [3, Theorem 10.52, p. 395].
By the Shapley–Folkman Theorem 1.170, there exists an I ⊂ {1, …, p} of cardinal-
ity at most n, such that we have the representation x = Σ_{i=1}^p x_i, with x_i ∈ conv( ∫_{A_i} F )
if i ∈ I, and x_i ∈ ∫_{A_i} F otherwise. Repeating a similar argument for each set
A_i, for i ∈ I, we obtain by induction a sequence of representations of the form
x = y^k + z^k, where, for some measurable partition (A_k, B_k) of Ω, with A_k nonde-
creasing and μ(B_k) → 0, y^k ∈ ∫_{A_k} F, B_k = ∪_{ℓ∈I_k} B_{kℓ}, the B_{kℓ} being disjoint mea-
surable subsets of B_k, and z^k = Σ_{ℓ∈I_k} z_{kℓ}, with z_{kℓ} ∈ conv( ∫_{B_{kℓ}} F ) for all ℓ ∈ I_k. Set
f̄(ω) := max(|f_1(ω)|, |f_2(ω)|). Since f̄ is integrable, |z^k| ≤ ∫_{B_k} f̄(ω) dμ(ω) and
μ(B_k) → 0, so that, by Corollary 3.37, we have that z^k → 0. We conclude by passing to the
limit. 
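The mechanism behind Theorem 3.96 can be observed numerically. The sketch below uses illustrative assumptions (Ω = [0, 1] with Lebesgue measure and specific selections f_1, f_2): selecting f_2 on the first α-fraction of each of N equal subintervals and f_1 elsewhere is an integrable selection of F(ω) = {f_1(ω), f_2(ω)}, and its integral approaches the convex combination αx_2 + (1 − α)x_1 as the mixing gets finer.

```python
import numpy as np

# Approaching a convex combination of integrals by ever finer mixing of
# two selections (illustrative instance of Theorem 3.96's proof idea).
f1 = lambda w: np.sin(2 * np.pi * w)
f2 = lambda w: w ** 2
ws = (np.arange(200000) + 0.5) / 200000    # midpoint grid on [0, 1]
x1, x2 = f1(ws).mean(), f2(ws).mean()      # integrals of the selections
alpha, N = 0.3, 1000

# take f2 on the first alpha-fraction of each of N equal subintervals
use_f2 = (ws * N) % 1.0 < alpha
mixed = np.where(use_f2, f2(ws), f1(ws)).mean()
assert abs(mixed - (alpha * x2 + (1 - alpha) * x1)) < 1e-2
```

The residual error here plays the role of the term z^k in the proof: it is controlled by the measure of the sets where the convexification has not yet been resolved, and vanishes as N → ∞.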

We next discuss the case when, for some measurable multimapping I : Ω →
P({1, …, p}), and integrable functions f_1, …, f_p, F is defined by

F(ω) = { f_i(ω); i ∈ I(ω) }, a.e. on Ω.   (3.168)

The multimapping conv F is defined by

conv F(ω) = conv{ f_i(ω); i ∈ I(ω) }, a.e. on Ω.   (3.169)

Set

A := ∫_Ω F,  A^c := ∫_Ω conv F,  S^p := { α ∈ R^p_+; Σ_i α_i = 1 },   (3.170)

and

S_I^c := { α ∈ L^∞(Ω)^p; α(ω) ∈ S^p; α_i(ω) = 0, i ∉ I(ω); a.e. },   (3.171)
S_I := { α ∈ S_I^c; α_i(ω) ∈ {0, 1}; a.e. }.   (3.172)

The next proposition is a variant of the Lyapunov convexity theorem.

Proposition 3.97 Let (3.168) hold. Then A is equal to A^c, is convex and compact,
and any x ∈ A has the following representation:

x = Σ_{i=1}^p ∫_Ω α_i(ω) f_i(ω) dμ(ω), for some α ∈ S_I.   (3.173)

Proof (a) By Theorem 3.96, A is convex; since the f_i are integrable, it is bounded.
Let f be an integrable selection of F. Set

E_1 := { ω ∈ Ω; 1 ∈ I(ω); f_1(ω) = f(ω) },   (3.174)

and by induction, for i = 2, …, p:

E_i := { ω ∈ Ω; i ∈ I(ω); ω ∉ E_j, j < i; f_i(ω) = f(ω) }.   (3.175)

Let α_i be the indicatrix of E_i. Then α ∈ S_I and (3.173) holds.


(b) It remains to show that A is closed and is equal to Ac . Since A is a convex subset
of Ac , it suffices to check that any x̄ ∈ ∂ Ac belongs to A. By Corollary 1.21, we can
separate x̄ and rint(Ac ), and so, there exists a λ ∈ Rn∗ such that λx̄ ≤ λx, for all
x ∈ Ac , or equivalently
p 
λx̄ = inf c λ αi (ω) f i (ω)dμ(ω). (3.176)
α∈S I Ω
i=1

Let f λ := (λ f 1 , . . . , λ f p ) . By Proposition 3.71, we have that



λx̄ = min{ f iλ (ω); i ∈ I (ω)}dμ(ω). (3.177)
Ω

By Remark 3.72, there exists an α ∈ S_I that reaches the infimum in (3.176). Set

A_λ^c := { x ∈ A^c; λx = λx̄ };  A_λ := { x ∈ A; λx = λx̄ }.   (3.178)

We have proved that A_λ^c contains an element of A_λ. On the other hand, since A^c
is bounded, to any nonzero λ ∈ R^{n*} is associated some x̄ ∈ ∂A^c such that (3.176)
holds. Therefore A and A^c have the same support function, and since these sets are
convex, they have the same closure.⁴ So it suffices to prove that A is closed, which
is equivalent to the equality A_λ^c = A_λ, for any pair (x̄, λ) as above.
We conclude by an induction argument over the dimension of A. If A is one-
dimensional, then A_λ^c = {x̄}, and since it contains one point of A, it follows that
A_λ^c = A_λ. Let the result hold when the dimension of A is q − 1, for q ≥ 2, and let
A have dimension q. Define

I^λ(ω) := { i ∈ I(ω); f_i^λ(ω) ≤ f_j^λ(ω), for all j ∈ I(ω) },   (3.179)

and consider the multimapping

F^λ(ω) = { f_i(ω); i ∈ I^λ(ω) }, a.e. on Ω.   (3.180)

We see that A_λ = ∫_Ω F^λ. Since F^λ has the same structure as F and A_λ has dimension
at most q − 1, we have that A_λ is closed and equal to A_λ^c. The conclusion follows. 

⁴ The indicatrix function of a nonempty closed convex set is l.s.c. convex, and hence equal to its
biconjugate. Since the conjugate of the indicatrix is the corresponding support function, two closed
convex sets having the same associated support function are equal.

3.3.2 Constraints on Integral Terms

We next consider the following problem

Min F0 (u); (F1 (u), . . . , Fq (u)) ∈ K , (P I )


u∈L p (Ω)

where again μ is a non-atomic probability measure, K is a closed convex subset


of R^q, and for i = 0 to q, given Carathéodory functions ℓ_i : Ω × R^m → R and a
measurable multimapping U : Ω → R^m, for all u ∈ L^p(Ω, R^m):

F_i(u) := ∫_Ω ℓ_i(ω, u(ω))dμ(ω)  if u(ω) ∈ U(ω) a.e.,  and F_i(u) := +∞ otherwise,

with the convention that F_0(u) is equal to +∞ if ℓ_0(ω, u(ω))_+ is not integrable. We
assume (for the sake of simplicity) that for any u ∈ L^p(U), ℓ_i(ω, u(ω)) is integrable,
i = 0 to q. The Lagrangian of the problem, L : L^p(Ω)^m × R^{q∗} → R̄, is defined by

L(u, λ) := F_0(u) + Σ_{i=1}^q λ_i F_i(u).   (3.181)

The dual problem is

Max_λ d(λ) := inf_{u∈L^p(U)} L(u, λ) − σ_K(λ).   (D_I)

Set F(u) := (F_0(u), . . . , F_q(u))^⊤, with range

E := {F(u); u ∈ L p (U )}. (3.182)

By Theorem 3.96, this set is convex; its components are indexed from 0 to q. We
may rewrite the primal problem in the form

Min_{e∈E} e_0;   e_{1:q} ∈ K,

where e1:q ∈ Rq has components e1 to eq , and set E 1:q := {e1:q ; e ∈ E}. The primal
problem is feasible iff 0 ∈ E 1:q − K , and the stability condition (1.170) of perturba-
tion duality is
ε B ⊂ E 1:q − K , for some ε > 0. (3.183)

Theorem 3.98 Let (3.183) hold. Then val(P I ) = val(D I ), and ū ∈ S(P I ) iff there
exists a λ ∈ N K (F(ū)) such that

L(ū, λ) ≤ L(u, λ), for all u ∈ L p (Ω). (3.184)



Proof The convex sets E and E′ := (−∞, val(P_I)) × K are disjoint, since any
point in the intersection is the image of a feasible u ∈ L^p(Ω) with cost function
lower than the value of (P_I). By Corollary 1.21, we can separate E′ and E, i.e., there
exists a nonzero pair (β, λ) ∈ R × R^{q∗} such that

βγ + λk ≤ βe0 + λ · e1:q , for all γ < val(P I ) and (k, e) ∈ K × E. (3.185)

Fixing k ∈ K and making γ ↓ −∞ we deduce that β ≥ 0. If β = 0, then λ ≠ 0 and
λ(e_{1:q} − k) ≥ 0, for all (k, e) ∈ K × E. By (3.183) this implies that λ = 0, which is
a contradiction. We have proved that β > 0, and so dividing (β, λ) by β if necessary,
we can assume that β = 1. Maximizing over γ in (3.185) and recalling the definition
of E, we deduce that

val(P I ) ≤ L(u, λ) − λk, for all (k, u) ∈ K × L p (Ω). (3.186)

Minimizing the r.h.s. over (k, u) we obtain that val(P I ) ≤ d(λ) ≤ val(D I ). Since
the converse inequality obviously holds, the primal and dual values are equal.
Assume now that ū ∈ S(P I ). Then, by (3.186),

L(ū, λ) − λF1:q (ū) = val(P I ) ≤ L(u, λ) − λk, for all (k, u) ∈ K × L p (Ω),
(3.187)
or equivalently
 
0 ≤ inf_{u∈L^p(Ω)} (L(u, λ) − L(ū, λ)) + inf_{k∈K} λ(F_{1:q}(ū) − k).   (3.188)

Taking u = ū and k = F1:q (ū), we see that each infimum is nonpositive. Therefore
they are both equal to zero, i.e., λ ∈ N K (F(ū)) and (3.184) holds. Conversely, if
ū ∈ dom(F) is such that λ ∈ N K (F(ū)) and (3.184) holds, then for all u ∈ dom(F):

F0 (u) = L(u, λ) − λF1:q (u) ≥ L(ū, λ) − λF1:q (ū) = F0 (ū), (3.189)

and hence, ū ∈ S(P I ). The conclusion follows. 

Remark 3.99 While problem (P I ) is not convex in general (for instance, its cost
function is not convex) we have been able to reformulate it as a convex problem. Set
ℓ[λ](ω, u) := ℓ_0(ω, u) + λ · ℓ_{1:q}(ω, u). Since the Lagrangian is itself an integral, to
which Proposition 3.71 applies, we deduce that, under the hypotheses of the above
theorem, if ū ∈ S(P_I), then

ū(ω) ∈ argmin_{u∈U(ω)} ℓ[λ](ω, u),  for a.a. ω.   (3.190)

Exercise 3.100 Discuss the case when Ω = [0, 1], the integrands f i do not depend
on ω and are polynomials of degree at most n, and U (ω) = R.

3.4 Examples and Exercises

Example 3.101 (Constrained entropy maximization) Let Ω be a measurable subset


of Rn with finite Lebesgue measure. Consider the set of measurable, a.e. positive
functions in X := L 1 (Ω):

X + := {u ∈ L 1 (Ω); u(ω) ≥ 0 a.e.}. (3.191)

We have observations

∫_Ω a_i(ω)u(ω)dω = b_i,   i = 1, . . . , N,   (3.192)

where each ai belongs to X ∗ = L ∞ (Ω) and b ∈ R N is a noisy measurement, so that


the available information is that b ∈ K , where K is a closed convex subset of R N .
We define 
Ĥ(x) := x log x;   H(u) := ∫_Ω Ĥ(u(ω))dω.   (3.193)

The strictly convex l.s.c. function Ĥ(x) has domain R_+, with value 0 at 0. We have in
view cases when u is a probability density, and so we assume that a_1(ω) = 1. In the
crystallographic applications that we have in mind, u(ω) is the probability density
for atoms to be at position ω and the observations correspond to the computation of
Fourier modes, see [39]. The problem to be considered is

Min_{u∈X} H(u);   Au ∈ K,   (3.194)


where (Au)_i := ∫_Ω a_i(ω)u(ω)dω. The cost function is obviously convex, and is l.s.c.
in view of Fatou’s Lemma 3.41 (where we can take g(ω) = −c, c being the maximum
of − Ĥ ). So, the Fenchel duality framework is applicable. Set

N
Ĥ λ (ω, v) := Ĥ (v) + λi ai (ω) · v. (3.195)
i=1

Let a(ω) := (a_1(ω), . . . , a_N(ω))^⊤. Observe that

inf_v Ĥ^λ(ω, v) = −Ĥ^∗(−a(ω) · λ).   (3.196)

The Lagrangian function is

L(u, λ) := H(u) + λ · Au = ∫_Ω Ĥ^λ(ω, u(ω))dω.   (3.197)

The integrand is normal convex. Therefore, the dual cost satisfies




δ(λ) = inf_{u∈X} ∫_Ω Ĥ^λ(ω, u(ω))dω − σ_K(λ) = −∫_Ω Ĥ^∗(−a(ω) · λ)dω − σ_K(λ).   (3.198)
We assume that the primal problem is feasible and that the stability condition holds:

0 ∈ int(K − A dom(H )). (3.199)

Since Ĥ (v) ≥ −c, the primal value is not less than −c|Ω|. So, the primal problem
has a finite value. By (3.199), the primal and dual values are equal and the set of dual
solutions is nonempty and bounded. Let λ̄ be a dual solution. Then u ∈ dom Ĥ is a
primal solution iff it satisfies the optimality condition

Ĥ(u(ω)) + Ĥ^∗(−a(ω) · λ̄) = −(a(ω) · λ̄)u(ω) a.e.   (3.200)

Since Ĥ is strictly convex, there is a unique primal solution ū that is determined by


the above relation. Indeed, we have that DĤ(v) = 1 + log v = z iff v = e^{z−1}, and so

ū(ω) = e^{−a(ω)·λ̄−1}.   (3.201)

It follows that
Ĥ(ū(ω)) = −(a(ω) · λ̄ + 1)e^{−a(ω)·λ̄−1},   (3.202)

so that the dual cost is



δ(λ̄) = −∫_Ω e^{−a(ω)·λ̄−1} dω − σ_K(λ̄).   (3.203)

Example 3.102 Consider the particular case of the previous example when N = 1,
and the constraint is that u is a probability density, i.e. a(ω) = 1 a.e. and K = {1}. Then the
dual cost is −|Ω|e^{−λ−1} − λ, which attains its maximum when |Ω|e^{−λ−1} = 1, i.e.,
for λ̄ = log |Ω| − 1; the optimal density is u = e^{−λ̄−1} = 1/|Ω|, as expected (the
uniform law maximizes the entropy).
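The closed-form solution of this example can be checked numerically. The sketch below uses hypothetical illustrative values (|Ω| = 2; the grid search is only a sanity check, not part of the text) to verify that λ̄ = log |Ω| − 1 maximizes the dual cost and that the primal and dual values coincide.

```python
import math

def dual_cost(lam, vol):
    # Dual cost from Example 3.102: delta(lambda) = -|Omega| e^{-lambda-1} - lambda
    return -vol * math.exp(-lam - 1.0) - lam

vol = 2.0                          # |Omega|; an illustrative choice
lam_bar = math.log(vol) - 1.0      # claimed dual solution log|Omega| - 1
u_bar = math.exp(-lam_bar - 1.0)   # optimal density e^{-lambda-1} (here a(w) = 1)

# primal value: H(u_bar) = |Omega| * u_bar * log(u_bar) for the constant density
primal = vol * u_bar * math.log(u_bar)

# a crude grid search around lam_bar confirms it maximizes the concave dual cost
grid = [lam_bar + 0.01 * k for k in range(-200, 201)]
best = max(grid, key=lambda l: dual_cost(l, vol))
```

As expected, u_bar equals the uniform density 1/|Ω| and the duality gap vanishes.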

Example 3.103 (Phase transition models, see [80]) Let f : R → R, f (u) := u(1 −
u), and let Ω be a measurable subset of R^n. We choose the function space X :=
L^p(Ω), p ∈ [1, ∞). For u ∈ X, set F(u) := ∫_Ω f(u(ω))dμ(ω), where dμ is the
Lebesgue measure. Consider the problem of minimizing F(u) under the constraints
u(ω) ∈ U a.e., U := [0, 1], and ∫_Ω u(ω)dμ(ω) = a, a ∈ (0, mes(Ω)).
Given λ ∈ R, the Lagrangian of this problem is
  
L(u, λ) := F_U(u) + λ(∫_Ω u(ω)dμ(ω) − a) = ∫_Ω (f(u(ω)) + λu(ω))dμ(ω) − λa.   (3.204)

The dual cost function is therefore



δ(λ) = −F_U^∗(−λ) − λa = −∫_Ω f_U^∗(−λ)dμ(ω) − λa.   (3.205)

We compute
f_U^∗(z) := sup_{u∈U} (uz − u(1 − u)).   (3.206)

Since u(1 − u) is concave the supremum is attained at 0 if z ≤ 0 and at 1 otherwise,


and so,

f_U^∗(z) = 0 if z ≤ 0,  and f_U^∗(z) = z if z ≥ 0.   (3.207)

In other words, f_U^∗(z) = max(0, z) = z_+. So


 
δ(λ) = −∫_Ω (−λ)_+ dμ(ω) − λa = λ(mes(Ω) − a) if λ ≤ 0,  and −λa if λ ≥ 0.   (3.208)

Clearly it attains its maximum at λ̄ = 0, and so, the primal and dual values are equal,
although the problem is nonconvex.
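The conjugate computation (3.206)–(3.208) can be checked numerically. In the sketch below (illustrative values mes(Ω) = 1 and a = 0.3, which are not from the text), f_U^∗ is evaluated by brute force over a grid of U = [0, 1], and the dual cost of (3.208) is seen to be maximized at λ̄ = 0.

```python
def conj_fU(z, n=1000):
    # brute-force evaluation of f_U^*(z) = sup_{u in [0,1]} (u z - u(1 - u));
    # the map u -> u z - u(1 - u) is convex in u, so the sup sits at an endpoint
    return max(u * z - u * (1.0 - u) for u in (k / n for k in range(n + 1)))

def dual_value(lam, vol=1.0, a=0.3):
    # delta(lambda) from (3.208), with mes(Omega) = vol (illustrative values)
    return lam * (vol - a) if lam <= 0.0 else -lam * a
```

The brute-force conjugate matches z_+ exactly on the grid, and δ is maximal at 0.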

Example 3.104 This example illustrates how singular multipliers occur in optimality
systems. Consider the problem

Min_{x∈R} x;   x + 1/(k + 1) ≥ 0,  k = 0, 1, . . . .   (3.209)

We choose ℓ^∞ (the space of bounded sequences) as the constraint space and denote
by 1 and b the sequences with generic terms 1 and 1/(k + 1), respectively. Thus we
are considering the problem

Min_{x∈R} x;   x1 + b ≥ 0,   (3.210)

where we have used the natural order relation for sequences. Let K = ℓ^∞_+ be the
convex cone of elements of ℓ^∞ with nonnegative terms, and let A : R → ℓ^∞,
Ax := x1. The constraint can be written as Ax + b ∈ K. The duality Lagrangian is

x + ⟨λ, x1 + b⟩ − σ_K(λ) = ⟨λ, b⟩ + x(1 + ⟨λ, 1⟩) − σ_K(λ).   (3.211)

The dual cone to K is

K^− := {λ ∈ (ℓ^∞)^∗; ⟨λ, y⟩ ≤ 0, for all y ∈ ℓ^∞_+}.   (3.212)

So, the dual problem is



Max_{λ∈K^−} ⟨λ, b⟩;   1 + ⟨λ, 1⟩ = 0.   (3.213)

The problem is convex and the stability condition obviously holds, and so, primal
and dual values are equal. The optimality condition is, in view of the dual constraint:

0 = x − ⟨λ, b⟩ = −⟨λ, x1 + b⟩.   (3.214)

For any y ∈ K, N_K(y) = K^− ∩ y^⊥ (see Chap. 1), so that K^− ∩ (x1 + b)^⊥ =
N_K(x1 + b). The set of dual solutions (which is nonempty and bounded) is

{λ ∈ N_K(x1 + b); 1 + ⟨λ, 1⟩ = 0}.   (3.215)

We now use the structure of elements of (ℓ^∞)^∗. Any λ ∈ (ℓ^∞)^∗ can be uniquely
decomposed as λ = λ^1 + λ^s, where λ^1 ∈ ℓ^1 and the singular part λ^s depends only
on the behavior at infinity.
For any y ∈ K, we have that

0 ≥ ⟨λ, y⟩ = ⟨λ^1, y⟩ + ⟨λ^s, y⟩.   (3.216)

Taking, for i ∈ N, y = e_i (the sequence with all components equal to 0 except the
ith, equal to 1), we obtain that λ^1 ∈ K^−. Then let y ∈ K. Denote by y^N the sequence
whose N first terms are zero, the others being equal to those of y. We have that
⟨λ^1, y^N⟩ = o(1) and ⟨λ^s, y^N⟩ = ⟨λ^s, y⟩. Since λ ∈ K^−, we deduce that λ^s ∈ K^−.
Finally, take y = e_k/(k + 1). Then x1 + b ± y ∈ K, and therefore 0 ≥ ⟨λ, ±y⟩ =
±λ^1_k/(k + 1), proving that λ^1_k = 0, and therefore λ^1 = 0. In view of the dual constraint,
it follows that λ^s ≠ 0.

3.5 Notes

For complements on Sect. 3.1 (measure theory) we refer to e.g. Malliavin [77]. The
integral functionals discussed in Sect. 3.2 were studied in Rockafellar [96, 98, 99];
see also Castaing and Valadier [33], Aubin and Frankowska [12]. The proof of the
Shapley–Folkman theorem is taken from Zhou [127]. We use it to prove the convexity
of integrals of multimappings. See Tardella [119] and its references on the Lyapunov
theorem. Maréchal [79] introduced useful generalizations of the perspective function.
Chapter 4
Risk Measures

Summary Minimizing an expectation gives little control of the risk of a reward that is
far from the expected value. So, it is useful to design functionals whose minimization
will allow one to make a tradeoff between the risk and expected value. This chapter
gives a concise introduction to the corresponding theory of risk measures. After an
introduction to utility functions, the monetary measures of risk are introduced and
connected to their acceptation sets. Then the case of deviation and semi-deviation,
as well as the (conditional) value at risk, are discussed.

4.1 Introduction

When minimizing an expectation, we miss the possibility of large variance of the


cost, leading to high risk of a poor result. So it may be wise to modify the cost
function in order to reduce the associated risk. We present in this chapter some tools
which allow us to do this.

4.2 Utility Functions

4.2.1 Framework

Definition 4.1 We call a nondecreasing function u : R → R̄, with connected domain


denoted by D(u), a disutility function.

Note that classical economic theory deals with gain maximization and (often con-
cave) utility functions. However, since we choose to analyze minimization problems,
we will use disutility functions (which will be the opposite of the utility functions).

© Springer Nature Switzerland AG 2019 165


J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_4

Definition 4.2 Let (Ω, F , μ) be a probability space and let s ∈ [1, ∞]. The prefer-
ence function associated with the disutility function u is the function U : L s (Ω) → R̄
defined by
U (y) := E[u(y)], (4.1)

with domain

D(U ) = {x ∈ L s (Ω); u(x) is measurable and E[max(u(x), 0)] < +∞}. (4.2)

If x and y belong to L s (Ω) (representing losses) we say that x is preferred to y if


U (x) ≤ U (y).
Definition 4.3 We say that the preference function U is risk-averse if the disutility
function u is proper, l.s.c. convex.
Let U be risk-averse. Then it is convex, and u has an affine minorant, say ay + b,
so that U has the affine minorant aEy + b and is therefore proper (its domain is
nonempty since it contains the constant functions with value in dom(u)). By Fatou’s
Lemma 3.41 we deduce that U is proper l.s.c. convex. Let y ∈ D(U ). By Jensen’s
inequality, we have that
u[E(y)] ≤ E[u(y)]. (4.3)

This expresses the preference for getting the mean value of a random variable rather
than the variable itself.
Definition 4.4 A certainty equivalent price (also called “utility equivalence price”)
of y ∈ D(U ) is defined as an amount α ∈ R such that

u[α] = E[u(y)]. (4.4)

Since u is nondecreasing, E[u(y)] belongs to the image of u, and so the set of


certainty equivalent prices is a nonempty interval. If u is increasing, it is a singleton,
equal to u −1 (E[u(y)]), denoted by ce(y).
Remark 4.5 If U is risk-averse and y ∈ D(U ), in view of (4.3), we always have
ce(y) ≥ E(y), and we can interpret ce(y) as the “fair price” of the random variable
y.
Example 4.6 The exponential disutility function u(x) = e x is risk-averse, and for
all a ∈ R, we have

U(y + a) = E[u(y + a)] = e^a E[u(y)] = e^a U(y),   (4.5)

so that
ce(y + a) = log(e^a U(y)) = a + log(U(y)) = a + ce(y).   (4.6)

So, in the case of the exponential utility function, the certainty equivalent price
satisfies the relation of translation invariance: ce(y + a) = a + ce(y).
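The translation invariance (4.6) is easy to check on a simulated sample; the following sketch (sample size and shift are arbitrary illustrative choices) approximates ce(y) = log E[e^y] by a Monte Carlo average.

```python
import math
import random

random.seed(0)
y = [random.gauss(0.0, 1.0) for _ in range(10000)]  # simulated losses

def ce(sample):
    # certainty equivalent for u(x) = e^x: u(ce(y)) = E[u(y)] gives ce(y) = log E[e^y]
    return math.log(sum(math.exp(v) for v in sample) / len(sample))

a = 0.7
lhs = ce([v + a for v in y])  # ce(y + a)
rhs = a + ce(y)               # translation invariance (4.6)
mean_y = sum(y) / len(y)      # ce(y) >= E(y) by Jensen (Remark 4.5)
```

On the sample, ce(y + a) = a + ce(y) holds to machine precision, and ce(y) dominates the sample mean, illustrating risk aversion.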

4.2.2 Optimized Utility

We now interpret y as the gain of a portfolio that can be combined with other random
variables, called free assets. If a financial asset z, an element of L s (Ω), has price pz
on the market, then the asset z − pz has a zero price. These market prices should not
be confused with utility indifference prices that apply to assets that are not priced in
the market. So if z 1 , . . . , z n are zero value assets, for any θ ∈ Rn , we may choose to
have the portfolio
y(θ ) = y + θ1 z 1 + · · · + θn z n . (4.7)

We assume that there is no constraint on the decision variables θ . Therefore the


investor minimizes its disutility by solving the problem

Min_{θ∈R^n} U[y(θ)].   (4.8)

Assuming that the above function of θ is differentiable, and that the rule for
differentiating under the expectation sign holds, we see that the optimality condition
of this problem is

0 = ∂U[y(θ)]/∂θ_i = E_μ[u′(y(θ))z_i],   i = 1, . . . , n.   (4.9)

Assume that u′(·) is positive everywhere. Let the random variable η_θ be defined by

η_θ = u′(y(θ)) / E_μ[u′(y(θ))].   (4.10)

Being positive and with unit expectation, η_θ is the density of the equivalent probability
measure μ_θ such that dμ_θ = η_θ dμ. We may write the optimality condition (4.9) as

0 = Eμθ [z i ]. (4.11)

In other words, optimal portfolios are those for which the financial assets have null
expectation under their associated probability μ_θ. In such a case, we say that μ_θ is
a risk-neutral probability.
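A small simulation illustrates the computation above for the exponential disutility u(x) = e^x with a single asset (all numerical choices below — sample size, laws, search bracket — are hypothetical): minimizing the empirical disutility over θ and reweighting by η_θ from (4.10) makes the asset's expectation vanish, as in (4.11).

```python
import math
import random

random.seed(1)
n = 2000
y = [random.gauss(0.0, 0.5) for _ in range(n)]  # initial portfolio outcome (sample)
z = [random.gauss(0.1, 1.0) for _ in range(n)]  # payoff of one zero-price asset

def disutility(theta):
    # empirical counterpart of E[u(y + theta * z)] with u(x) = e^x
    return sum(math.exp(yi + theta * zi) for yi, zi in zip(y, z)) / n

# ternary search on a bracket: theta -> disutility(theta) is convex
lo, hi = -5.0, 5.0
for _ in range(100):
    m1, m2 = lo + (hi - lo) / 3.0, hi - (hi - lo) / 3.0
    if disutility(m1) < disutility(m2):
        hi = m2
    else:
        lo = m1
theta = 0.5 * (lo + hi)

# density of the risk-neutral measure (4.10): eta_theta proportional to e^{y + theta z}
w = [math.exp(yi + theta * zi) for yi, zi in zip(y, z)]
tot = sum(w)
ez = sum(wi * zi for wi, zi in zip(w, z)) / tot  # E_{mu_theta}[z], cf. (4.11)
```

The reweighted expectation of the asset is numerically zero at the optimal θ, i.e. μ_θ is risk neutral on the sample.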
Remark 4.7 If short positions are forbidden, meaning that we have the constraint
θ ≥ 0, then the optimality conditions may be expressed as

Eμθ [z i ] ≥ 0; θ ≥ 0; θi Eμθ [z i ] = 0, i = 1, . . . , n. (4.12)

In particular, all assets in the optimal portfolio are risk neutral.


Remark 4.8 Given a nonempty closed convex subset K of Rn , we can apply the
Fenchel duality setting (Theorem 1.113) to the problem

Min_{θ∈K} U[y(θ)],   (4.13)

with X = Rn , f = I K , Y = L s (Ω), x = θ , Aθ = θ1 y1 + · · · + θn yn , F = U . The


stability condition is
0 ∈ int (dom(U ) − AK ) . (4.14)

This will hold if s = 1 and u satisfies a linear growth condition, since in this case
dom(U) = Y. Since f^∗ = σ_K, and U^∗(·) = E(u^∗(·)) by Proposition 3.80, the
expression of the dual problem is, assuming s ∈ [1, ∞) and 1/s + 1/s′ = 1:

Max_{y^∗∈L^{s′}(Ω)} E(y^∗ y − u^∗(y^∗)) − σ_K(−A^∗ y^∗).   (4.15)

Since (A^∗ y^∗)_i = ⟨y^∗, y_i⟩_s (duality product in L^s(Ω)) for i = 1 to n, and ∂I_K = N_K,
the optimality condition at a solution θ̄ is, setting ȳ := y(θ̄):

N_K(θ̄) + (⟨y^∗, y_1⟩_s, . . . , ⟨y^∗, y_n⟩_s)^⊤ ∋ 0;   y^∗ ∈ ∂u(ȳ) a.s.   (4.16)

In particular, if K = R^n this means that ⟨y^∗, y_i⟩_s = 0, for i = 1 to n, for some
y^∗ ∈ L^{s′}(Ω), y^∗ ∈ ∂u(ȳ) a.s. If y^∗ is a.s. positive, then ȳ^∗ := y^∗/E_μ y^∗ is well-defined
and is the density of an equivalent probability measure, under which the assets have
null expectation.

4.3 Monetary Measures of Risk

4.3.1 General Properties

We now give an axiomatic approach to risk measures associated with estimates of


incomes, and explicit expressions of some of these risk measures, in the form of a
maximum of mean values.
Let Ω be the set of events. An (uncertain) outcome (opposite of income) is a
function x : Ω → R; x(ω) is the actual outcome obtained if event ω occurs. Outcome
functions are assumed to belong to a Banach space X , containing constant functions.
The space X is endowed with the order relation for functions of ω: if x, y belong to
X , then x ≤ y if x(ω) ≤ y(ω) (either everywhere or a.e.).

Definition 4.9 A mapping ρ : X → R is called a monetary measure of risk (MMR)


if, for all x and y in X , the following holds:
Monotonicity: if x ≥ y, then ρ(x) ≥ ρ(y), i.e., ρ is nondecreasing.
Translation invariance: if a ∈ R, then ρ(x + a) = ρ(x) + a.

Lemma 4.10 (i) The set of monetary measures of risk is convex, and invariant under
addition of a constant and under translation.¹ (ii) Let f_i, i ∈ I, be a family of MMRs. If the

1 Change of ρ(x) into ρ(x + a), with a ∈ R.



supremum (resp. infimum) is everywhere finite, then it is an MMR. (iii) A monetary


measure of risk is non-expansive (i.e., Lipschitz continuous with constant at most 1)
with respect to the supremum norm:

|ρ(x) − ρ(y)| ≤ sup_ω |x(ω) − y(ω)|.   (4.17)

Proof The proof of (i)–(ii) being immediate, it suffices to prove (iii). Let x and
y be in X , and M := supω |x(ω) − y(ω)|. Then x ≥ y − M. By monotonicity and
translation invariance, ρ(x) ≥ ρ(y − M) = ρ(y) − M. Exchanging x and y, we
obtain the converse inequality. 
We recall (see Sect. 1.3.4, Chap. 1) that the infimal convolution of a family f i of
extended real-valued functions over X , i ∈ I finite, is defined as

(□_{i∈I} f_i)(x) := inf { Σ_{i∈I} f_i(x_i);  Σ_{i∈I} x_i = x }.   (4.18)

Lemma 4.11 The infimal convolution of a finite family of monetary measures of risk
is, whenever it is finite-valued, a monetary measure of risk.
Proof Let the f i be extended real-valued functions over X , for i = 1 to n. Then

(□_{i∈I} f_i)(x) = inf_{x_1,...,x_{n−1}} { Σ_{1≤i≤n−1} f_i(x_i) + f_n(x − Σ_{1≤i≤n−1} x_i) }.   (4.19)

By Lemma 4.10(i), each term in the “inf” is an MMR. We conclude by Lemma


4.10(ii). 

4.3.2 Convex Monetary Measures of Risk

We denote by S the following set:

S := {Q ∈ X^∗;  Q ≥ 0,  ⟨Q, 1⟩ = 1}.   (4.20)

In some cases S will have the interpretation of probability measures. We have


established in (1.291) that the Fenchel conjugate of the infimal convolution is the
sum of conjugates.
Lemma 4.12 (i) If ρ is an MMR (possibly nonconvex), then ρ^∗(Q) = +∞ if Q ∉ S.
(ii) The function ρ : X → R is a convex l.s.c. MMR iff it has finite values and satisfies:

ρ(x) = sup{⟨Q, x⟩ − ρ^∗(Q); Q ∈ S}.   (4.21)



Proof (i) If Q ≱ 0, there exists a y ≥ 0 such that ⟨Q, y⟩ < 0. Let x ∈ X. Then

ρ^∗(Q) ≥ ⟨Q, x − y⟩ − ρ(x − y) ≥ ⟨Q, x⟩ − ⟨Q, y⟩ − ρ(x).   (4.22)

Taking the supremum over x, we obtain ρ^∗(Q) ≥ ρ^∗(Q) − ⟨Q, y⟩. Since ρ^∗(Q) >
−∞ and ⟨Q, y⟩ < 0, this implies ρ^∗(Q) = +∞. If on the other hand ⟨Q, 1⟩ ≠ 1,
by translation invariance, we get

ρ^∗(Q) ≥ sup_{α∈R} {⟨Q, α1⟩ − ρ(α1)} = sup_{α∈R} {α(⟨Q, 1⟩ − 1) − ρ(0)} = +∞.   (4.23)

(ii) If ρ : X → R is an l.s.c. convex MMR, it is equal to its biconjugate, and so,


by (i), (4.21) holds. Conversely, (4.21) expresses that ρ is a supremum of MMRs.
Having finite values, it is an MMR. 

4.3.3 Acceptation Sets

A monetary measure of risk ρ is characterized by its associated zero sublevel set,


called in this setting an acceptation set:

Aρ := {x ∈ X ; ρ(x) ≤ 0}. (4.24)

Indeed, by the translation invariance property, we have that

ρ(x) = min{α ∈ R; x − α1 ∈ Aρ }. (4.25)

In other words, ρ(x) is the smallest constant reduction of losses that allows one to
get a nonpositive risk. Acceptation sets satisfy

(i) A − X + ⊂ A,
(4.26)
(ii) For all x ∈ X , ρ(x) := min{α ∈ R; x − α1 ∈ A} is finite.

Conversely, to a set A satisfying (4.26) is associated an MMR ρ A defined by

ρ A (x) = inf{α ∈ R; x − α1 ∈ A}. (4.27)

Lemma 4.13 A monetary measure of risk ρ is convex iff its associated acceptance
set is convex.

Proof If ρ is a convex MMR, then Aρ = ρ −1 (R− ) is obviously convex. Conversely,


assume that Aρ is convex. Let x1 and x2 belong to X . Then xi − ρ(xi ) ∈ Aρ , i = 1, 2.
Since Aρ is convex, for any γ ∈ [0, 1], we have that

γ (x1 − ρ(x1 )) + (1 − γ )(x2 − ρ(x2 )) ∈ Aρ . (4.28)

Set x := γ x1 + (1 − γ )x2 . Then x − γρ(x1 ) − (1 − γ )ρ(x2 ) ∈ Aρ which, in view


of the definition of Aρ , implies, ρ(x) ≤ γρ(x1 ) + (1 − γ )ρ(x2 ) as was to be
proved. 
In order to obtain lower estimates of the value of optimization problems associated
with MMRs, it is useful to characterize the greatest convex minorant of an MMR.
The notion of convex closure for sets and functions was introduced in Definition
1.45.
Lemma 4.14 Let ρ be an MMR, with acceptation set A. Assume that conv(A) is not
the entire space. Then conv(ρ) is a monetary measure of risk whose acceptation set
is conv(A).

Proof By the Hahn–Banach theorem, since conv(A) ≠ X, there exist (Q, α) ∈ X^∗ ×
R, with Q ≠ 0, such that ⟨Q, y⟩ ≤ α for all y ∈ A. For any y ∈ A and z ∈ X_+,
y − z ∈ A, and hence, ⟨Q, y − z⟩ ≤ α, proving that Q ≥ 0. Next, given x ∈ X, we
have that y := x − ρ(x)1 ∈ A, and hence,

⟨Q, 1⟩ ρ(x) ≥ ⟨Q, x⟩ − α.   (4.29)

If ⟨Q, 1⟩ = 0, then ⟨Q, x⟩ ≤ α for all x ∈ X, which cannot hold since Q ≥ 0 and
Q ≠ 0. So ⟨Q, 1⟩ > 0, and we may assume that ⟨Q, 1⟩ = 1. It then follows from
(4.29) that ⟨Q, x⟩ − α is a minorant of ρ.
Since ρ has an affine minorant, by the Fenchel–Moreau–Rockafellar theorem
1.46, conv(ρ) is equal to ρ ∗∗ . Since ρ ∗∗ ≤ ρ, we have that ρ ∗∗ is everywhere finite.
By Lemma 4.12(i), ρ ∗∗ is a supremum of the form (4.21). We conclude by Lemma
4.12(ii) that ρ ∗∗ = conv(ρ) is an l.s.c. convex MMR such that conv(ρ) ≤ ρ, and so
A ⊂ Aconv(ρ) . Since Aconv(ρ) is closed and convex, this implies

conv(A) ⊂ Aconv(ρ) . (4.30)

We now show the converse inclusion by checking that conv(A) satisfies the axioms
of an acceptation set. Condition (4.26)(i) is a consequence of the one satisfied by A,
and for (4.26)(ii), set

r (x) := inf{γ ∈ R; x − γ 1 ∈ conv(A)}. (4.31)

Since A ⊂ conv(A), we have that r(x) ≤ ρ(x). The affine minorant ℓ(x) := ⟨Q, x⟩
− α of ρ(x) is such that ℓ(x) ≤ 0 for all x ∈ A, and hence, for all x ∈ conv(A).
So, x − γ1 ∈ conv(A) implies α ≥ ⟨Q, x − γ1⟩ = ⟨Q, x⟩ − γ, and so γ ≥
⟨Q, x⟩ − α. Therefore, the infimum in (4.31) is finite and, as conv(A) is closed, is
attained. The associated MMR r is an l.s.c. convex minorant of ρ, and so r ≤ conv(ρ).
But, since the mapping ρ → A_ρ is nonincreasing, the converse inclusion in (4.30)
holds. The conclusion follows. □

4.3.4 Risk Trading

This model will illustrate the above concepts. It involves two agents, the issuer A and
the buyer B. An asset F is to be sold to the buyer at a price π to be determined. Initially,
A and B have outcome functions X and Y , in the Banach space X , and assess risk
with risk measures ρ A and ρ B . The buyer will find the transaction advantageous if

ρ B (Y + F + π ) ≤ ρ B (Y ). (4.32)

In view of the translation invariance property, the best (highest) price is π(F) :=
ρ B (Y ) − ρ B (Y + F). The financial product F minimizing the risk of the issuer in a
class F is then the solution of

Min ρ A (X − F − π(F)); π(F) := ρ B (Y ) − ρ B (Y + F). (4.33)


F∈F

Using again the translation invariance property, we obtain the equivalent problem

Min ρ A (X − F) + ρ B (Y + F) − ρ B (Y ). (4.34)
F∈F

If F = X , using the identity

inf_{F∈X} {ρ_A(X − F) + ρ_B(Y + F)} = inf_{G∈X} {ρ_A(X + Y − G) + ρ_B(G)},   (4.35)

we recognize an inf-convolution: the above infimum is (ρ_A □ ρ_B)(X + Y) − ρ_B(Y),
while the gain of the issuer is ρ_A(X) + ρ_B(Y) − (ρ_A □ ρ_B)(X + Y).

4.3.5 Deviation and Semideviation

Let (Ω, F , μ) be some probability space.


Let p ∈ [1, ∞), X = L^p(Ω). The deviation in L^p(Ω) is

Ψ_p(x) := ( ∫_Ω |x(ω) − E(x)|^p dμ(ω) )^{1/p}.   (4.36)

This is the composition of the “centering” continuous mapping Ax = x − E(x)1 with
the L^p norm, and is a positively homogeneous continuous convex function. Since the
subdifferential of a norm at 0 is the closed dual unit ball, ∂Ψ_p(0) = A^∗ B_q, with B_q
the unit ball of L^q(Ω), 1/p + 1/q = 1. For y ∈ L^q(Ω), we have A^∗ y = y − E(y)1,
and so, by Lemma 1.66:
 
∂Ψ_p(x) = { z = y − E(y)1;  ‖y‖_{L^q(Ω)} ≤ 1;  ∫_Ω z(ω)x(ω)dμ(ω) = Ψ_p(x) }.   (4.37)
When p = 1, q = +∞, and for all y ∈ B_∞ we have |E(y)| ≤ 1, so if z ∈ ∂Ψ_1(0),
z ≥ −2 a.s. The function

ρ_1(x) := E(x) + c ∫_Ω |x(ω) − E(x)|dμ(ω)   (4.38)

is, for all c ≥ 0, convex and continuous, translation invariant, and satisfies

∂ρ_1(0) = 1 + c{y − E(y)1; ‖y‖_∞ ≤ 1}.   (4.39)

If c ∈ [0, 1/2], any z ∈ ∂ρ1 (0) is nonnegative and has unit expectation. We deduce
that:

Lemma 4.15 For c ∈ [0, 1/2], the function ρ1 (x) defined in (4.38) is a convex MMR.
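On a finite probability space with uniform weights, the defining properties of ρ₁ can be checked directly; the sketch below (random illustrative data, not from the text) tests monotonicity, translation invariance, and convexity for c = 1/2.

```python
import random

random.seed(2)

def rho1(x, c=0.5):
    # empirical version of (4.38) on a uniform n-point space:
    # rho1(x) = E(x) + c * E|x - E(x)|
    m = sum(x) / len(x)
    return m + c * sum(abs(v - m) for v in x) / len(x)

x = [random.uniform(-1.0, 1.0) for _ in range(50)]
y = [v + random.uniform(0.0, 1.0) for v in x]    # y >= x pointwise
a = 0.3
mid = [(xi + yi) / 2.0 for xi, yi in zip(x, y)]  # midpoint, for convexity
```

The three assertions below correspond to monotonicity, translation invariance, and (midpoint) convexity of ρ₁.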

4.3.5.1 Semi-deviation

Consider now the function


Φ_p(x) := ( ∫_Ω [x(ω) − E(x)]_+^p dμ(ω) )^{1/p}.   (4.40)

This is the composition of the same “centering” mapping Ax = x − E(x)1 with the
function x → ‖x_+‖_p. Let us show that the latter is convex: it is positively homogeneous,
and since (x + y)_+ ≤ x_+ + y_+, we have

‖(x + y)_+‖_p ≤ ‖x_+ + y_+‖_p ≤ ‖x_+‖_p + ‖y_+‖_p,   (4.41)

i.e., x → ‖x_+‖_p is sublinear.² Now a sublinear, positively homogeneous function is
convex.³

Lemma 4.16 The subdifferential at 0 of x → ‖x_+‖_p is (B_q)_+, the set of nonnegative
elements of B_q.

Proof Since x → ‖x_+‖_p is nondecreasing and non-expansive, the elements of its
subdifferential are nonnegative and contained in the closed dual unit ball, i.e.,
∂‖·_+‖_p(0) ⊂ (B_q)_+. Conversely, if y ∈ (B_q)_+, then for all x in X, we have ‖x_+‖_p ≥
⟨y, x_+⟩ ≥ ⟨y, x⟩, and so, y ∈ ∂‖·_+‖_p(0). □

² A function f is sublinear if f(x + y) ≤ f(x) + f(y), for all x and y.
³ Since, if f is sublinear and positively homogeneous, for α ∈ ]0, 1[, we have f(αx + (1 − α)y) ≤
f(αx) + f((1 − α)y) = α f(x) + (1 − α) f(y).

By the above discussion,

∂Φ_p(0) = {y − E(y)1; y ∈ (B_q)_+}.   (4.42)

As in the case of the deviation function, since E(y) ≤ 1 when y ∈ (B_q)_+, the sub-
differential of Φ_p is a.s. greater than or equal to −1. We deduce the following result:

Lemma 4.17 For p ∈ [1, ∞) and c ∈ [0, 1], the following function is a convex
MMR:
ρ̂ p (x) := E(x) + cΦ p (x). (4.43)

Remark 4.18 The function ρ̂_p is of practical interest since it penalizes losses, and
not gains, w.r.t. the average revenue.

4.3.6 Value at Risk and CVaR

4.3.6.1 Value at Risk

Risk models often involve a constraint on the probability that losses are no more than
a given level. Denote by
H X (a) := P[X ≤ a] (4.44)

the cumulative distribution function (CDF) of the real random variable X . This is a
nondecreasing function with limits 0 at −∞, and 1 at +∞, which is right continuous.
Setting H X (a − ) := limb↑a H X (b), we have that

H X (a − ) = P[X < a]; P(X = a) = H X (a) − H X (a − ). (4.45)

Given α ∈]0, 1[, we call any number a ∈ R such that

P[X < a] ≤ α ≤ P[X ≤ a] (4.46)

an α quantile. Having in view the minimization of losses, we define the value at risk
of level α ∈]0, 1[ as

VaR_α(X) := min{a ∈ R; H_X(a) ≥ 1 − α} = min{a ∈ R; P[X > a] ≤ α}.   (4.47)

A constraint of the type


VaR_α(X) ≤ a   (4.48)

means that the probability of a loss greater than a is no more than α.


Obviously, VaR_α is an MMR. Its acceptation set is

A_{VaR,α} := {X; P[X > 0] ≤ α}.   (4.49)

Since the acceptation set is nonconvex, the value at risk is also nonconvex.

4.3.6.2 Conditional Value at Risk

Consider an optimization problem of the form:

Min_{X∈X} F(X);   VaR_α(X) ≤ 0,   (4.50)

where X is a Banach space. Let us see how to compute a convex function G(X )
such that G(X ) > 0 if VaRα (X ) > 0; the related problem

Min_{X∈X} F(X);   G(X) ≤ 0   (4.51)

might be easier to solve, and its value will provide an upper bound for the value of
(4.50).
Observe that, for any γ > 0:

P(X > 0) = E1{X >0} ≤ E[1 + γ X ]+ = γ E[γ −1 + X ]+ . (4.52)

Dividing by γ > 0, we deduce that

inf_{γ>0} {E[γ^{−1} + X]_+ − α/γ} ≤ 0  ⇒  VaR_α(X) ≤ 0.   (4.53)

Setting δ = −1/γ and dividing by α, we obtain the equivalent relation

inf_{δ<0} {δ + α^{−1}E[X − δ]_+} ≤ 0  ⇒  VaR_α(X) ≤ 0.   (4.54)

We can show more, defining

CVaR_α(X) := inf_{δ∈R} {δ + α^{−1}E[X − δ]_+}.   (4.55)

Lemma 4.19 Assume that E|X | is a continuous function over X . Then CVaRα is
a continuous, convex risk measure.

Proof Clearly, CVaR is nondecreasing and translation invariant, and so is a risk mea-
sure. Since (δ, X ) → δ + α −1 E[X − δ]+ is convex, CVaRα , which is the infimum
w.r.t. δ, is convex. Taking δ = 0, we get CVaRα (X ) ≤ α −1 E|X |, proving that CVaR
is locally upper bounded, and hence, by Proposition 1.65, is continuous. 

Lemma 4.20 The infimum in the r.h.s. of (4.55) is attained for δ = VaRα (X ), and
hence,

CVaRα (X ) = VaRα (X ) + α −1 E[X − VaRα (X )]+ . (4.56)

Proof The function ϕ(δ) := δ + α −1 E[X − δ]+ is convex. If H X is continuous at δ,


its derivative is 1 + α −1 (H X (δ) − 1). Otherwise, denoting by H X (δ ± ) the right and
left limits of H X , we have that:

∂ϕ(δ) = 1 − α^{−1} + α^{−1}[H_X(δ^−), H_X(δ^+)].   (4.57)

The minimum is attained iff 0 ∈ ∂ϕ(δ), and so, if 1 − α ∈ [H X (δ − ), H X (δ + )]. In


particular, the minimum is attained at δ = VaRα (X ). The result follows. 

As a consequence, for the function G(X ) we may choose the CVaR function.

Lemma 4.21 If H X is continuous at a = VaRα (X ), then


CVaR_α(X) = α^{−1} ∫_{VaR_α(X)}^{∞} x dH_X(x) = E[X | X ≥ VaR_α(X)].   (4.58)

Proof By the previous lemma, for δ = VaR_α(X), we have:

CVaR_α(X) = δ + α^{−1}E[X − δ]_+ = δ + α^{−1} ∫_δ^{∞} (x − δ)dH_X(x).   (4.59)

Since H_X is continuous at δ, we have ∫_δ^{∞} dH_X(x) = α, whence the first equality,
from which the second immediately follows. □
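The relations of Lemmas 4.20 and 4.21 are easy to verify on an empirical distribution. In the sketch below (illustrative uniform sample; αN is an integer, so the discrete formulas are exact up to rounding), the infimum in (4.55) over the sample points equals φ(VaR_α(X)), and CVaR reduces to the mean of the αN worst scenarios.

```python
import random

random.seed(3)
n, alpha = 1000, 0.1
xs = sorted(random.uniform(0.0, 1.0) for _ in range(n))
k = round(alpha * n)          # number of scenarios allowed to exceed the VaR level

# empirical VaR_alpha: smallest a with P[X > a] <= alpha (here an order statistic)
var = xs[n - k - 1]

def phi(delta):
    # objective of (4.55): delta + alpha^{-1} E[X - delta]_+
    return delta + sum(max(x - delta, 0.0) for x in xs) / (alpha * n)

cvar = phi(var)                    # by Lemma 4.20, the infimum is attained at VaR
best = min(phi(d) for d in xs)     # brute-force infimum over the kink points of phi
top_mean = sum(xs[-k:]) / k        # mean of the k = alpha*n largest scenarios
```

On this sample, CVaR coincides both with the brute-force infimum and with the mean of the worst 10% of scenarios.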

4.4 Notes

Risk measures were introduced by Artzner et al. [9] with an axiomatic approach.
The most commonly used are the VaR and CVaR. See Shapiro et al. [114, Chap. 6].
A reference book on this subject, with applications in finance, is Föllmer and
Schied [49]. An important extension is the concept of dynamic risk measure, see
Ruszczyński and Shapiro [107]. For the link with utility functions, see Dentcheva
and Ruszczyński [43].
Chapter 5
Sampling and Optimizing

Summary This chapter discusses what happens when, instead of minimizing an


expectation, one minimizes the sample approximation obtained by getting a sample
of independent events. The analysis relies on the theory of asymptotic laws (delta
theorems) and its applications in stochastic programming. We extend the results to
the case of constraints in expectation.

5.1 Examples and Motivation

5.1.1 Maximum Likelihood

Consider the problem of estimating a parameter θ ∈ R^m of a probability law with
density of the form ϕ(θ, ω)dμ(ω), where (Ω, F, μ) is a measure space. We assume that the
true value θ̄ is such that the associated density function ϕ(θ̄, ω) is μ-a.e. positive.
Given a sample ω_1, . . . , ω_N, which are independent and with the true law for ω, the
maximum likelihood estimator is a value of θ that maximizes the joint density of
the N observations, i.e. Π_{i=1}^N ϕ(θ, ω_i). It is equivalent to maximize the logarithm of
this amount, called the log-likelihood, or, after normalization by 1/N:

(1/N) Σ_{i=1}^N log ϕ(θ, ω_i).   (5.1)

This can be interpreted as a sampling approach for maximizing the following expec-
tation: 
Φ(θ) := E_θ̄ log[ϕ(θ, ·)] = ∫_Ω log[ϕ(θ, ω)]ϕ(θ̄, ω)dμ(ω).   (5.2)

Lemma 5.1 We have that Φ(θ) ≤ Φ(θ̄) for all θ, with equality iff ϕ(θ, ω) =
ϕ(θ̄, ω) a.s.
© Springer Nature Switzerland AG 2019 177
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_5

Proof Set ϕ̄(θ, ω) = ϕ(θ, ω)/ϕ(θ̄, ω). Since log(s) ≤ s − 1, with equality iff s = 1, we deduce that log[ϕ̄(θ, ω)] ≤ ϕ̄(θ, ω) − 1 and so,

    Φ(θ) − Φ(θ̄) = ∫_Ω log[ϕ̄(θ, ω)] ϕ(θ̄, ω) dμ(ω)
                 ≤ ∫_Ω (ϕ(θ, ω) − ϕ(θ̄, ω)) dμ(ω) = 0,

the last equality being due to the fact that ϕ(θ, ω) and ϕ(θ̄, ω) are density functions of probabilities, and so have unit integral. The result follows. □

The maximum likelihood approach to the parameter estimation problem can therefore be interpreted as an expectation maximization based on a sample.
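As a concrete (made-up) instance of (5.1), one may take for ϕ(θ, ·) the N(θ, 1) density; maximizing the normalized log-likelihood on a grid then recovers the sample mean, the closed-form MLE for this family. This sketch is illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = 2.0
omega = rng.normal(theta_true, 1.0, size=5_000)   # i.i.d. sample with the true law

def log_density(theta, w):
    # log of the N(theta, 1) density phi(theta, w)
    return -0.5 * (w - theta) ** 2 - 0.5 * np.log(2.0 * np.pi)

def log_likelihood(theta):
    # normalized log-likelihood (5.1): (1/N) sum_i log phi(theta, omega_i)
    return np.mean(log_density(theta, omega))

grid = np.linspace(0.0, 4.0, 4001)
theta_hat = grid[np.argmax([log_likelihood(t) for t in grid])]

print(theta_hat, omega.mean())   # the grid maximizer agrees with the sample mean
```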

Remark 5.2 The log-likelihood approach is related to the following notion. Given a strictly convex function ϕ over R, such that ϕ(1) = 0 and ∂ϕ(1) ≠ ∅, and given p, q, densities of the probability laws P, Q over (Ω, F, μ), the ϕ-divergence, or Csiszar divergence [34], is the function

    I_ϕ(Q, P) := ∫_Ω ϕ(q(ω)/p(ω)) p(ω) dμ(ω),    (5.3)

assuming that p(ω) > 0 a.s. Clearly I_ϕ(P, P) = 0, and for a ∈ ∂ϕ(1):

    I_ϕ(Q, P) = ∫_Ω ϕ(1 + (q(ω) − p(ω))/p(ω)) p(ω) dμ(ω)
              ≥ a ∫_Ω ((q(ω) − p(ω))/p(ω)) p(ω) dμ(ω) = 0,    (5.4)

since p and q are densities. In addition, since ϕ is strictly convex, equality holds iff q(ω) = p(ω) a.s. Taking ϕ = − log we recover, up to a constant, the (opposite of the) above function Φ.
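For laws with finitely many atoms the integral in (5.3) reduces to a finite sum. The sketch below (with made-up discrete distributions) checks inequality (5.4) for ϕ = − log, in which case I_ϕ(Q, P) equals the Kullback–Leibler divergence KL(P‖Q):

```python
import numpy as np

def csiszar_divergence(phi, q, p):
    # I_phi(Q, P) = sum_i phi(q_i / p_i) * p_i, the discrete analogue of (5.3)
    return float(np.sum(phi(q / p) * p))

p = np.array([0.2, 0.3, 0.5])     # density of P (all atoms positive)
q = np.array([0.25, 0.25, 0.5])   # density of Q

kl = csiszar_divergence(lambda s: -np.log(s), q, p)    # phi = -log as in the remark
zero = csiszar_divergence(lambda s: -np.log(s), p, p)  # I_phi(P, P) = 0

print(kl, zero)   # kl > 0, zero = 0
```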

5.2 Convergence in Law and Related Asymptotics

In this section we will discuss random variables with image in a metric space. So, (Y, ρ) will be a metric space, ρ denoting the associated distance. An example that will be considered in applications is the space of continuous functions over a compact set.

5.2.1 Probabilities over Metric Spaces

As a σ -field over Y we take the Borelian field (generated by open sets; its elements
are called the Borelian subsets).

Definition 5.3 We say that the probability measure P over Y is regular if any
Borelian subset A of Y is such that,

For any ε > 0, there exist F, G resp. closed and open subsets of Y
(5.5)
such that F ⊂ A ⊂ G and P(G \ F) < ε.

Lemma 5.4 Any probability measure over a metric space is regular.

Proof We follow [20, Ch. 1]. If A is closed, take F = A and for some δ > 0,
G := G δ , where G δ := {y ∈ Y ; ρ(y, A) < δ}. Then P(G δ \ A) = E1{0<ρ(y,A)<δ} .
By the dominated convergence theorem, E1{0<ρ(y,A)<δ} → 0 when δ → 0 and so,
the regularity property holds for closed sets.
Since the closed sets generate the Borelian σ -field, it suffices to check that the
set of regular Borelian subsets of Y is closed under (i) complementation and (ii)
countable unions. Indeed, let (5.5) hold for a given Borelian set A. Denoting by Ac
the complement of A, etc., we have that G c ⊂ Ac ⊂ F c , G c is closed, F c is open,
and F c \ G c = G \ F has probability less than ε. Point (i) follows. Now let An ,
n ∈ N , be a sequence of regular Borelian sets and ε > 0. Let Fn , G n be respectively
open and closed subset such that Fn ⊂ An ⊂ G n , and P(G n \ Fn ) < 2−(n+2) ε. Then
(5.5) holds with G := ∪n G n and F := ∪n≤k Fn , for large enough k. The conclusion
follows. 

5.2.2 Convergence in Law

Let (Ω, F , μ) be a probability space. We know that a random variable (r.v.) y over
Ω with image in Y induces over Y the image probability of μ by y, called the law
or distribution of y, denoted by y∗ μ, and defined by

(y∗ μ)(B) := μ(y −1 (B)), for all Borelian subsets B of Y . (5.6)

Lemma 5.5 If f : Y → R is measurable and f ◦ y is integrable, the following change of variable formula holds:

    E_{y∗μ} f = ∫_Y f(x) d(y∗μ)(x) = ∫_Ω f(y(ω)) dμ(ω) = E_μ(f ◦ y).    (5.7)

Proof If f is a simple function, i.e., f = Σ_{i=1}^n a_i 1_{A_i}, where the a_i are nonzero and the A_i are Borelian subsets of Y, then

    E_{y∗μ} f = Σ_{i=1}^n a_i (y∗μ)(A_i) = Σ_{i=1}^n a_i μ(y^{-1}(A_i)),    (5.8)

so that (5.7) holds. In the general case, we can build a sequence f_k of simple functions converging a.s. to f, and dominated by |f|, so that f_k ◦ y → f ◦ y in L¹(Ω). Then, (5.8) and the dominated convergence theorem imply

    E_{y∗μ} f = lim_k E_{y∗μ} f_k = lim_k E_μ(f_k ◦ y) = E_μ(f ◦ y).    (5.9)    □

Given F ⊂ Y and y ∈ Y, we denote the distance from y to F by

    ρ(y, F) := inf{ρ(y, y′); y′ ∈ F}.    (5.10)

Definition 5.6 Let x and x′ be two r.v.s (with possibly different associated probability spaces) with values in the same metric space Y, and laws denoted by P and P′. We say that x ∼_L x′ if x and x′ have the same law.

If f is a bounded, continuous function over Y, it is measurable (since the inverse image of an open set is open). Using the approximation in Lemma 3.13 we easily check that if P is a probability law over Y, then ∫_Y f(z) dP(z) is well-defined and finite. We denote by Cb(Y) the set of continuous and bounded functions over Y.

Lemma 5.7 We have that x ∼_L x′ iff

    ∫_Y f(z) dP(z) = ∫_Y f(z) dP′(z), for all f ∈ Cb(Y).    (5.11)

Proof Clearly, if x and x′ have the same law, then (5.11) holds. Conversely, let (5.11) hold. Let F be a closed subset of Y. For ε > 0, define f_ε : Y → R by f_ε(y) := (1 − ρ(y, F)/ε)_+. By monotone convergence,

    P(F) = lim_{ε↓0} ∫_Y f_ε(y) dP = lim_{ε↓0} ∫_Y f_ε(y) dP′ = P′(F).    (5.12)

So the two probabilities are equal over closed sets, and so also over open sets. Since by Lemma 5.4 any probability measure over a metric space is regular, the result follows. □

Definition 5.8 We say that a sequence P_k of measures over the metric space Y narrowly converges to a measure P over Y if

    ∫_Y f(x) dP_k(x) → ∫_Y f(x) dP(x), for all f ∈ Cb(Y).    (5.13)

Definition 5.9 Let X, X_k (for k ∈ N) be r.v.s over the probability spaces (Ω, F, P) and (Ω_k, F_k, P_k) resp., both with image in Y. We say that the sequence of r.v.s X_k over Ω_k converges in law to the r.v. X, and write X_k →_L X, if the laws of X_k narrowly converge to the law of X. In other words, by Lemma 5.5, denoting by E_k (resp. E) the expectation associated with the law of X_k (resp. X), we have that:

    X_k → X in law iff the following holds:
    E_k f(X_k) → E f(X), for all f : Y → R continuous and bounded.    (5.14)

Definition 5.10 One says that the sequence of r.v.s X_k is bounded in probability if, for any¹ y_0 ∈ Y, we have, setting |X|∼ := ρ(X, y_0):

    for all ε > 0, there exists a c_ε > 0 such that P_k(|X_k|∼ > c_ε) ≤ ε.    (5.15)

If X is an r.v. with values in Y, we have²

    for all ε > 0, there exists a κ_ε > 0 such that μ(|X|∼ > κ_ε) ≤ ε/2.    (5.16)

Lemma 5.11 Let X_k be a sequence of r.v.s with image in the metric space Y, converging in law to X. Then X_k is bounded in probability.

Proof Let ε > 0, κ_ε be given by (5.16), and f : Y → R be continuous with image in [0, 1], with value 0 if |y|∼ ≤ κ_ε, and 1 if |y|∼ ≥ κ_ε + 1. Then

    P_k(|X_k|∼ > κ_ε + 1) ≤ E_k f(X_k) → E f(X) ≤ μ(|X|∼ > κ_ε) ≤ ε/2.    (5.17)

We get the conclusion with c_ε := κ_ε + 1. □


Definition 5.12 A function f : Y → R is said to be uniformly continuous if for all ε > 0, there exists an α > 0 such that |f(y_1) − f(y_2)| ≤ ε when ρ(y_1, y_2) ≤ α. The function is said to be Lipschitz with constant L_f if |f(y_1) − f(y_2)| ≤ L_f ρ(y_1, y_2), for all y_1, y_2 in Y.

Definition 5.13 Let f : Y → R be bounded. Given λ > 0, its Lipschitz regularisation is defined by

    f_λ(y) := inf_{z∈Y} { f(z) + (1/λ) ρ(y, z) }, for all y ∈ Y.    (5.18)

We recognize the natural extension to a metric space of an infimal convolution. We have in particular inf f ≤ f_λ(y) ≤ f(y), for all y ∈ Y.

Lemma 5.14 Let f : Y → R be bounded and continuous, with Lipschitz regularisation f_λ. Then (i) f_λ is Lipschitz with constant 1/λ, (ii) we have f_λ(y) ↑ f(y) when λ ↓ 0, for all y ∈ Y.

¹ The definition is independent of y_0. In the applications Y will be a Banach space and we will take y_0 = 0, so that |·|∼ will be equal to the norm of Y.
² Indeed, the family A_n := {ω ∈ Ω; |X|∼ > n} being nonincreasing with empty intersection, μ(A_n) ↓ 0 by (3.14).

Proof (i) Bounding the difference of infima by the supremum of the differences, and using the triangle inequality, we get

    f_λ(y′) − f_λ(y) ≤ sup_{z∈Y} (1/λ)[ρ(y′, z) − ρ(y, z)] ≤ (1/λ) ρ(y′, y).    (5.19)

By symmetry we deduce that f_λ is Lipschitz with constant 1/λ.
(ii) By the definition, f_λ(y) ≤ f(y) and f_λ(y) increases when λ ↓ 0. Fix y ∈ Y. Since f is continuous at y, for all ε > 0, there exists an α > 0 such that |f(z) − f(y)| ≤ ε when ρ(z, y) ≤ α. So,

    f_λ(y) = min( inf_{ρ(z,y)≤α} { f(z) + (1/λ) ρ(z, y) }, inf_{ρ(z,y)>α} { f(z) + (1/λ) ρ(z, y) } )
           ≥ min( f(y) − ε, inf f + α/λ ),    (5.20)

and hence lim inf_{λ↓0} f_λ(y) ≥ f(y) − ε. The conclusion follows. □
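On a finite grid the infimum in (5.18) can be evaluated exhaustively, which gives a direct check of Lemma 5.14. The sketch below is illustrative (an arbitrary smooth f on [0, 1] is our choice); it verifies that f_λ ≤ f, that f_λ increases as λ decreases, and that the 1/λ Lipschitz bound holds:

```python
import numpy as np

y = np.linspace(0.0, 1.0, 501)
f = np.sin(6.0 * y)             # bounded continuous function, Lipschitz constant 6

def f_lam(lam):
    # f_lambda(y) = inf_z { f(z) + (1/lam) * |y - z| }, computed on the grid (5.18)
    return np.min(f[None, :] + np.abs(y[:, None] - y[None, :]) / lam, axis=1)

f1, f02 = f_lam(1.0), f_lam(0.2)

# monotonicity: f_1 <= f_{0.2} <= f (f_lambda increases as lambda decreases)
print(np.all(f1 <= f02 + 1e-12), np.all(f02 <= f + 1e-12))

# Lipschitz constant of f_{0.2} is at most 1/0.2 = 5, below f's own constant 6
slopes = np.abs(np.diff(f02)) / np.diff(y)
print(slopes.max())
```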

Lemma 5.15 The convergence in law of X k to X holds iff

Ek f (X k ) → E f (X ), for all f : Y → R Lipschitz and bounded. (5.21)

Proof The condition is obviously necessary; let us show that it is sufficient. So, let
(5.21) be satisfied, and let f : Y → R be continuous and bounded. By symmetry, it
suffices to show that lim inf k Ek f (X k ) ≥ E f (X ). The Lipschitz regularization f λ of
f being Lipschitz and bounded, it satisfies

Ek f λ (X k ) → E f λ (X ). (5.22)

By monotone convergence and in view of Lemma 5.14(ii), we have that for all ε > 0,
there exists a λε such that

E f λ (X ) ≥ E f (X ) − ε if λ < λε . (5.23)

Using f_λ(y) ≤ f(y), we get when λ < λ_ε:

    lim inf_k E_k f(X_k) ≥ lim inf_k E_k f_λ(X_k) = E f_λ(X) ≥ E f(X) − ε,    (5.24)

as was to be shown. □

Corollary 5.16 The convergence of a sequence of random variables over a given probability space, either a.s. or in probability, implies the convergence in law.

Proof By Theorem 3.28(ii), the a.s. convergence of an r.v. on a probability space implies the convergence in probability. It suffices therefore to consider the case of a sequence of r.v.s X_k over (Ω, F, P) converging to X in probability. Let f be Lipschitz and bounded, with Lipschitz constant L. Then

    |E(f(X_k) − f(X))| ≤ E 1_{|X_k−X|>ε} |f(X_k) − f(X)| + E 1_{|X_k−X|≤ε} |f(X_k) − f(X)|
                       ≤ 2 ‖f‖∞ meas({|X_k − X| > ε}) + εL.    (5.25)

Since X_k → X in probability, the first term tends to 0, so that lim sup_k |E(f(X_k) − f(X))| ≤ εL for every ε > 0; hence E f(X_k) → E f(X). We conclude by the previous lemma. □

Definition 5.17 We say that f : Y → R has bounded support if f (y) = 0 when


|y|∼ is large enough (i.e., when f is zero outside a set of finite diameter).

Actually we can use as test functions Lipschitz functions with bounded support:

Lemma 5.18 Let X_k be bounded in probability. Then X_k →_L X iff

    E_k f(X_k) → E f(X), for all f : Y → R Lipschitz with bounded support.    (5.26)

Proof It suffices to check that (5.26) implies the convergence in law. Let f be Lipschitz and bounded. For M > 0, let ϕ_M be Lipschitz R → [0, 1], with value 1 over [0, M] and 0 over [M + 1, ∞[. By dominated convergence,

    lim_{M↑∞} E ϕ_M(|X|∼) = E 1 = 1.    (5.27)

Fix ε > 0. Let M_ε be such that E ϕ_{M_ε}(|X|∼) ≥ 1 − ε/2. For k large enough, by (5.26), we have that E_k ϕ_{M_ε}(|X_k|∼) ≥ 1 − ε. Define ψ_ε(t) := 1 − ϕ_{M_ε}(t). Then E ψ_ε(|X|∼) ≤ ε/2 and, for large enough k, E_k ψ_ε(|X_k|∼) ≤ ε. We have then

    E_k f(X_k) = E_k f(X_k) ϕ_{M_ε}(|X_k|∼) + E_k f(X_k) ψ_ε(|X_k|∼).    (5.28)

Using E_k f(X_k) ϕ_{M_ε}(|X_k|∼) → E f(X) ϕ_{M_ε}(|X|∼) and

    |E_k f(X_k) ψ_ε(|X_k|∼)| ≤ ‖f‖∞ E_k ψ_ε(|X_k|∼) ≤ ε ‖f‖∞,    (5.29)

we get with (5.28)

    lim inf_k E_k f(X_k) ≥ E f(X) ϕ_{M_ε}(|X|∼) − ε ‖f‖∞.    (5.30)

By the monotone convergence theorem, E f(X) ϕ_{M_ε}(|X|∼) → E f(X) when ε ↓ 0, and so lim inf_k E_k f(X_k) ≥ E f(X). Changing f into −f we obtain the converse inequality. The conclusion follows. □

Definition 5.19 Let y_k be a sequence of r.v.s with image in Y. One says that y_k converges in probability to ȳ ∈ Y (deterministic) if it converges in probability to the constant function with value ȳ over Y, i.e., if

    P_k{ω ∈ Ω_k; ρ(y_k(ω), ȳ) > ε} → 0, for all ε > 0.    (5.31)

In particular, if y′_k is another sequence of r.v.s over the same probability spaces (Ω_k, F_k, P_k) as y_k, with image in the separable³ metric space Y, one says that (the sequence of r.v.s Ω_k → R) ρ(y_k, y′_k) converges in probability to 0 if P_k{ρ(y_k, y′_k) > ε} → 0, for all ε > 0.
Lemma 5.20 The convergence in probability of y_k to ȳ ∈ Y is equivalent to the convergence in law of y_k to the Dirac measure at ȳ.

Proof (a) If y_k converges in probability to ȳ ∈ Y, for all f Lipschitz and bounded, and ε > 0, we have

    E_k f(y_k) = E_k f(y_k) 1_{ρ(y_k,ȳ)≤ε} + E_k f(y_k) 1_{ρ(y_k,ȳ)>ε}
               ≥ f(ȳ) − εL_f + o(1) (as k → ∞),    (5.32)

so that lim inf_k E_k f(y_k) ≥ f(ȳ) − εL_f. By symmetry we deduce that E_k f(y_k) → f(ȳ) and so, y_k converges in law to the Dirac measure at ȳ.
(b) Conversely, if y_k converges in law to the Dirac measure at ȳ, taking f(y) := min(1, ρ(y, ȳ)), we get for all ε ∈ (0, 1):

    0 = lim_k [E_k f(y_k) − f(ȳ)] = lim_k E_k f(y_k) ≥ ε lim sup_k P_k{ρ(y_k, ȳ) ≥ ε}.    (5.33)

The conclusion follows. □


Proposition 5.21 Let y_k and y′_k be two sequences of r.v.s with image in the separable metric space Y, such that y_k and y′_k have the same probability space (Ω_k, F_k, P_k), and ρ(y_k, y′_k) → 0 in probability. Then:
(i) We have that E_k[f(y_k) − f(y′_k)] → 0, for all f Lipschitz and bounded.
(ii) If y′_k →_L ȳ, where ȳ is an r.v., then y_k →_L ȳ.

Proof (i) Let f be Lipschitz and bounded. Then

    E_k[f(y_k) − f(y′_k)] = E_k (f(y_k) − f(y′_k)) 1_{ρ(y_k,y′_k)>ε} + E_k (f(y_k) − f(y′_k)) 1_{ρ(y_k,y′_k)≤ε}
                          ≥ −2 ‖f‖∞ P_k[ρ(y_k, y′_k) > ε] − εL_f,    (5.34)

and so lim inf_k E_k[f(y_k) − f(y′_k)] ≥ −εL_f, which by symmetry implies E_k[f(y_k) − f(y′_k)] → 0, as was to be shown.
(ii) A consequence of (i) and of Lemma 5.15. □


Remark 5.22 We will apply the proposition in the case when y′_k is a constant sequence equal to some r.v. ȳ. We have proved that, if ρ(ȳ, y_k) converges in probability to 0, then y_k converges in law to ȳ.

³ The separability of Y ensures that ρ(y_k(ω), y′_k(ω)) is measurable; see Billingsley [20, Appendix II].

We recall the Skorokhod–Dudley representation theorem [118]; see [45, Thm. 11.7.2] for a proof.

Theorem 5.23 Let y^k be a sequence of r.v.s over (Ω_k, F_k, P_k), with values in a separable Banach space Y, converging in law to a probability P. Then there exists a probability space (Ω, F, μ) and a sequence ŷ^k of r.v.s over (Ω, F, μ) with values in Y, such that ŷ^k ∼_L y^k (and therefore ŷ^k converges in law to P), and ŷ^k converges a.s. (and therefore also, by Theorem 3.28, in probability).

5.2.3 Central Limit Theorems

We first recall the classical result; see e.g. [20].

Theorem 5.24 (Central limit) Let X be an r.v. with values in R^m, with finite second moment, expectation X̄, and covariance matrix V of size m × m. Set X_N := N^{-1}(X_1 + · · · + X_N), where the X_i are independent with the law of X. Then N^{1/2}(X_N − X̄) converges in law to the Gaussian of expectation 0 and variance V.
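A quick numerical illustration (our own toy setting, not from the text): take X uniform on [0, 1], so m = 1, X̄ = 1/2 and V = 1/12, and look at the distribution of N^{1/2}(X_N − X̄) over many independent repetitions:

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials = 400, 20_000
samples = rng.random((trials, N))              # X uniform on [0,1]: mean 1/2, variance 1/12
z = np.sqrt(N) * (samples.mean(axis=1) - 0.5)  # N^{1/2} (X_N - X_bar), one value per trial

coverage = np.mean(np.abs(z) <= 1.96 * np.sqrt(1.0 / 12.0))
print(z.std())    # close to sqrt(1/12) ≈ 0.2887, the limit standard deviation
print(coverage)   # close to 0.95, the Gaussian two-sided 1.96-sigma probability
```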
In what follows we will consider samples of functions to be minimized. So we
need an infinite-dimensional version of the previous results.
Definition 5.25 Let y and z be two r.v.s over the probability space (Ω, F, μ) with image in a Banach space Y. We assume that y and z have finite second moments, and denote by ȳ, z̄ their expectations. For any pair (g, h) in Y* × Y*, we define the covariance of (y, z) along (g, h) by

    cov[y, z](g, h) := E[⟨g, y − ȳ⟩ ⟨h, z − z̄⟩].    (5.35)

Note that the functions

    (y, z) → cov[y, z](g, h) and (g, h) → cov[y, z](g, h)

are bilinear and continuous, from L²(Ω, Y)² and Y* × Y* to R resp. Set

    var[y](g) := cov[y, y](g, g).    (5.36)

Definition 5.26 We say that a measure μ over Y is Gaussian if, for all nonzero
y ∗ ∈ Y ∗ , the following measure over R is Gaussian:

μ[y ∗ ](B) := μ({y ∈ Y ; y ∗ , y ∈ B}), for all Borelian B ⊂ R. (5.37)

Consider a probability space (Ω, F, μ), a compact set X ⊂ R^n, and a Carathéodory function f : Ω × R^n → R^p, i.e., f(ω, x) is continuous w.r.t. x a.s., and measurable in ω for all x. We assume that f is Lipschitz (in x) with a square integrable Lipschitz constant, in the sense that

    | f(ω, x′) − f(ω, x)| ≤ a(ω) |x′ − x|, for all x and x′ in X,    (5.38)

with a(ω) ∈ R+ measurable and of finite second moment:

Ea(ω)2 < ∞. (5.39)

We assume the existence of a finite second moment at a particular point x_0 ∈ X:

    E | f(ω, x_0)|² < ∞,    (5.40)

which combined with the previous hypotheses implies the existence of a finite second
moment of f (·, x) for any x ∈ X .
Then ω → f(ω, ·) is an r.v. with image in the Banach space Y = Cb(X)^p, with expectation denoted by f̄(x). We denote the sample approximation of f̄ by

    f̂_N(x) := (1/N) Σ_{i=1}^N f(ω_i, x).    (5.41)

We next state a Functional Central Limit Theorem (FCLT; functional here means infinite-dimensional).

Theorem 5.27 If (5.38)–(5.40) hold, then N^{1/2}( f̂_N − f̄ ) converges in law to the Gaussian of covariance equal to that of f.

Proof See Araujo and Giné [8, Cor. 7.17] for the proof of this difficult result. □

5.2.4 Delta Theorems

5.2.4.1 The First-Order Delta Theorem

We now establish some differential calculus rules for r.v.s converging in law.

Theorem 5.28 (Delta theorem) Let Y_k be a sequence of r.v.s with values in a separable Banach space Y containing η, τ_k ↑ ∞, and Z an r.v. with values in Y, such that Z_k := τ_k(Y_k − η) converges in law to Z. Let G : Y → W, where W is a Banach space, be differentiable at η. Then τ_k(G(Y_k) − G(η)) converges in law to G′(η)Z.

Proof In view of the representation Theorem 5.23, we may suppose that the Y_k are r.v.s over the same probability space (Ω, F, P) and that Z_k → Z a.s. Since G is differentiable at η,

    τ_k(G(Y_k) − G(η)) → G′(η)Z a.s.    (5.42)

We conclude by applying Corollary 5.16 to the above expression. □
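A scalar sanity check of the Delta theorem (again a made-up setting): with Y_N the empirical mean of N uniform [0, 1] variables, η = 1/2, τ_N = N^{1/2} and G(y) = y², we have G′(η) = 1, so τ_N(G(Y_N) − G(η)) should be asymptotically N(0, 1/12), like Z itself:

```python
import numpy as np

rng = np.random.default_rng(3)
N, trials = 400, 20_000
G = lambda y: y ** 2                       # smooth map with G'(1/2) = 1
samples = rng.random((trials, N))          # uniform on [0,1]: mean 1/2, variance 1/12
ybar = samples.mean(axis=1)                # Y_N for each trial
w = np.sqrt(N) * (G(ybar) - G(0.5))        # tau_N (G(Y_N) - G(eta))

# the limit G'(eta) Z is N(0, 1/12), with standard deviation ≈ 0.2887
print(w.mean(), w.std())
```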


In the applications we wish to expand the minimum value of an expectation function. Since the minimum is not a differentiable function, we need to extend the Delta theorem.
Definition 5.29 Let X be a Banach space, K ⊂ X , and x̄ ∈ K . We call the set

TK (x̄) := {h ∈ X ; there exists tk ↓ 0, xk ∈ K ; (xk − x̄)/tk → h} (5.43)

the (tangent) cone of Bouligand to K at x̄.


Note that, if K is convex, this set coincides with the tangent cone in the sense of
convex analysis (Definition 1.80).
Definition 5.30 Let X and W be two Banach spaces, K ⊂ X, and G : K → W. One says that G is Hadamard differentiable at x̄ ∈ K, tangentially to K, in the direction h ∈ TK(x̄) if, for any sequence (t_k, x_k) associated with Definition 5.29, we have that (G(x_k) − G(x̄))/t_k has a limit, independent of the particular sequence (t_k, x_k), denoted by G′(x̄, h). If this holds for all h ∈ TK(x̄), one says that G is Hadamard differentiable at x̄ tangentially to K. When K = X, one says that G is Hadamard differentiable at x̄.

Lemma 5.31 If G is the restriction of a Lipschitz mapping X → W, with directional derivatives at x̄, then it is Hadamard differentiable at x̄.

Proof Indeed, let G have Lipschitz constant L. When t_k ↓ 0 and (x_k − x̄)/t_k → h, we have

    lim_k (G(x_k) − G(x̄))/t_k = lim_k (G(x̄ + t_k h) − G(x̄))/t_k + lim_k (G(x_k) − G(x̄ + t_k h))/t_k.    (5.44)

Since

    ‖G(x_k) − G(x̄ + t_k h)‖ ≤ L ‖x_k − (x̄ + t_k h)‖ = o(t_k),    (5.45)

the limit of the r.h.s. of (5.44) is G′(x̄, h). The result follows. □
We next introduce the “Hadamard” version of the Delta theorem.

Theorem 5.32 (Hadamard Delta Theorem) Let Y and W be Banach spaces, with Y separable, K a subset of Y, G : K → W Hadamard differentiable at η ∈ K tangentially to K, and Y_k a sequence of r.v.s with values in K. Let τ_k ↑ ∞, and Z an r.v. with values in Y, such that Z_k := τ_k(Y_k − η) converges in law to Z. Then τ_k(G(Y_k) − G(η)) converges in law to G′(η, Z).

Proof The proof is similar to that of Theorem 5.28, replacing (5.42) with

    τ_k(G(Y_k) − G(η)) → G′(η, Z) a.s.    (5.46)    □

5.2.4.2 The Second-Order Delta Theorem

Definition 5.33 Let X and W be two Banach spaces, K ⊂ X, and G : K → W be Hadamard differentiable at x̄ ∈ K, tangentially to K, in the direction h ∈ TK(x̄), with directional derivative denoted by G′(x̄, h). One says that G is second-order Hadamard differentiable at x̄ ∈ K, tangentially to K, in the direction h ∈ TK(x̄) if for any sequence (t_k, x_k) associated with Definition 5.29, we have the existence of

    G″(x̄, h) := lim_k [ G(x_k) − G(x̄) − G′(x̄, x_k − x̄) ] / ( t_k²/2 ),    (5.47)

the limit being independent of the sequence (t_k, x_k). If this holds for all h ∈ TK(x̄), one says that G is second-order Hadamard differentiable at x̄ tangentially to K. When K = X, one says that G is second-order Hadamard differentiable at x̄.

Observe that, if G is of class C², then

    G″(x̄, h) = D²G(x̄)(h, h).    (5.48)

Theorem 5.34 (Second-order Hadamard Delta Theorem) Let Y and W be Banach spaces, with Y separable, K a subset of Y, G : K → W second-order Hadamard differentiable at η ∈ K tangentially to K, and Y_k a sequence of r.v.s with values in K. Let τ_k ↑ ∞, and Z an r.v. with values in Y, such that Z_k := τ_k(Y_k − η) converges in law to Z. Then we have the convergence in law

    2τ_k² ( G(Y_k) − G(η) − G′(η, Y_k − η) ) →_L G″(η, Z).    (5.49)

Proof The arguments are similar to those of the first-order delta theorem. □

5.2.5 Solving Equations

5.2.5.1 Taylor Expansion of the Solution of an Equation

Let Z be an open subset of Rn , with closure denoted by Z̄ , and Lipschitz boundary


∂ Z (meaning that locally, up to a diffeomorphism, Z coincides with the set {z ∈
Rn ; z n ≤ f (z 1 , . . . , z n−1 )} for some Lipschitz function f ). If ϕ is a C p function
over Z with image in Rn and uniformly continuous derivatives up to order p, we
extend these derivatives over ∂ Z by continuity. We denote by Φ p the space of such
C p function over Z̄ . We can identify Φ 1 with a closed subspace of Cb ( Z̄ )n+1 , and
similarly identify Φ p with a closed subspace of Cb ( Z̄ )n( p) , for some n( p).
For ϕ ∈ Φ p , p ≥ 1, and z ∈ Z , consider the equation

F(ϕ, z) := ϕ(z) = 0. (5.50)



Clearly F is, for given z, a continuous linear function of ϕ ∈ Φ^p, with derivative

    DF(ϕ, z)(ψ, ζ) = ψ(z) + ϕ′(z)ζ.    (5.51)

This derivative being a continuous function of (ϕ, z), F is of class C¹.
Assume next that ϕ̄ has a root z̄ in the interior of Z, and that ϕ̄′(z̄) is invertible. By the implicit function theorem we have that, locally, ϕ(z) = 0 iff z = G(ϕ) for some C¹ function G : Φ^p → Z. So we have that ϕ(G(ϕ)) = 0. Computing the derivative of ϕ(G(ϕ)) = 0 in the direction ψ ∈ Φ^p, at ϕ̄, we obtain

    ψ(z̄) + ϕ̄′(z̄)G′(ϕ̄)ψ = 0.    (5.52)

Since ϕ̄′(z̄) is invertible, we obtain the expression of the derivative of G at ϕ̄ as

    G′(ϕ̄)ψ = −ϕ̄′(z̄)⁻¹ψ(z̄).    (5.53)

We can write this for a neighbouring function ϕ as

    ϕ′(G(ϕ))G′(ϕ)ψ + ψ(G(ϕ)) = 0.    (5.54)

Differentiating the above expression w.r.t. ϕ in the direction ψ, and setting z := G(ϕ), we obtain

    ψ′(z)G′(ϕ)ψ + ϕ″(z)(G′(ϕ)ψ)² + ϕ′(z)G″(ϕ)(ψ)² + ψ′(z)G′(ϕ)ψ = 0.    (5.55)

Since ϕ′(z) is invertible this provides an expression for G″(ϕ)(ψ)². The first and last terms are identical, and we can eliminate G′(ϕ)ψ using

    G′(ϕ)ψ = −ϕ′(z)⁻¹ψ(z).    (5.56)

So,

    G″(ϕ̄)(ψ)² = ϕ̄′(z̄)⁻¹ [ 2ψ′(z̄)ϕ̄′(z̄)⁻¹ψ(z̄) − ϕ̄″(z̄)(ϕ̄′(z̄)⁻¹ψ(z̄))² ].    (5.57)

5.2.5.2 Stochastic Equations

Let f(ω, x) be a Carathéodory function Ω × R^n → R^n, a.s. of class C² w.r.t. x. Denote by Df(ω, x) and D²f(ω, x) the corresponding derivatives. Assume that f, Df(ω, x) and D²f(ω, x) are square integrable and Lipschitz in x with a square integrable Lipschitz constant (hypotheses (5.38)–(5.40)). Setting f̄(x) := E f(·, x), consider the equation

    f̄(x) = 0.    (5.58)

We assume that it has a regular root x̄, i.e., f̄(x̄) = 0 and Df̄(x̄) is invertible. There exists an ε > 0 such that f̄ has no other root in B̄(x̄, ε), and Df̄ is uniformly invertible over B̄(x̄, ε). Let Y denote the separable Banach space of C¹ functions over B̄(x̄, ε) with image in R^n, endowed with the natural norm

    ‖y‖_Y := max_x |y(x)| + max_x |Dy(x)|.    (5.59)

If g ∈ Y is close enough to the restriction of f̄ to B̄(x̄, ε), the equation g(x) = 0 has a unique solution in B̄(x̄, ε), denoted by χ(g). Otherwise we set χ(g) equal to a given x_0 ∈ R^n. The sampling approximation of (5.58) is

    f̂_N(x) = 0.    (5.60)

We set x̂_N := χ( f̂_N ).

Theorem 5.35 Let f, x̄ be as above, and let Z(x̄) denote the Gaussian with covariance equal to that of f(·, x̄). Then

    N^{1/2}(x̂_N − x̄) →_L − f̄′(x̄)⁻¹ Z(x̄),    (5.61)

and 2N(x̂_N − x̄ + f̄′(x̄)⁻¹ Z(x̄)) converges in law to

    f̄′(x̄)⁻¹ [ 2Z′(x̄) f̄′(x̄)⁻¹ Z(x̄) + f̄″(x̄)( f̄′(x̄)⁻¹ Z(x̄))² ].    (5.62)
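A one-dimensional illustration of (5.61) (with a made-up f, not from the text): take f(ω, x) = x³ + x − ω with ω standard Gaussian. Then f̄(x) = x³ + x, x̄ = 0, f̄′(x̄) = 1, and f(·, x̄) = −ω has unit variance, so N^{1/2} x̂_N should be approximately standard Gaussian. The sampled equation is solved here by Newton's method:

```python
import numpy as np

rng = np.random.default_rng(7)
N, trials = 400, 5_000
omega = rng.standard_normal((trials, N))
wbar = omega.mean(axis=1)                  # the sampled equation reads x^3 + x = wbar

# Newton's method, vectorized over the trials (the map x -> x^3 + x is monotone)
x = np.zeros(trials)
for _ in range(50):
    x -= (x ** 3 + x - wbar) / (3.0 * x ** 2 + 1.0)

z = np.sqrt(N) * x    # N^{1/2} (x_hat_N - x_bar), with x_bar = 0
print(z.mean(), z.std())   # approximately 0 and 1
```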

5.3 Error Estimates

In this section, given a probability space (Ω, F, P), we assume that

    X is a compact subset of R^n, and f : Ω × X → R is a Carathéodory function.    (5.63)

We also assume that f has finite second moments, and denote its expectation by f̄(x).

5.3.1 The Empirical Distribution

When optimizing an expectation, it frequently occurs that the law μ is not known, but it is nevertheless possible to get a sample of realizations that follows the law μ. Given an integer N > 0, we obtain an empirical distribution μ̂_N that associates the probability 1/N with each element ω_1, . . . , ω_N of the sample (or rather probability p/N in the case of p identical realizations) and zero to the others. This empirical distribution is an r.v. We recall that we denote by

    f̂_N(x) = (1/N) Σ_{i=1}^N f(ω_i, x)    (5.64)

the mean value under the empirical distribution, also called the standard estimator of the mean value. We recall that this estimator is unbiased, since

    E f̂_N(x) = (1/N) Σ_{i=1}^N E f(ω_i, x) = f̄(x).    (5.65)

The estimation error is f̂_N(x) − f̄(x), with variance

    E( f̂_N(x) − f̄(x) )² = (1/N²) Σ_{i=1}^N E( f(ω_i, x) − f̄(x) )² = V(f, x)/N.    (5.66)

So, the standard deviation (square root of the variance) of f̂_N(x) is N^{-1/2} V(f, x)^{1/2}.
We recall the classical estimator of the variance.

Lemma 5.36 A convergent, unbiased estimator of V(f, x) is

    V̂(f, x) := (1/(N−1)) Σ_{i=1}^N ( f(ω_i, x) − f̂_N(x) )².    (5.67)

Proof Omitting the dependence on x and assuming w.l.o.g. that f̄ = 0, we get

    (N − 1) V̂(f) = Σ_{i=1}^N f(ω_i)² − 2 f̂_N Σ_{i=1}^N f(ω_i) + N f̂_N² = Σ_{i=1}^N f(ω_i)² − N f̂_N²,    (5.68)

and so (N − 1) E V̂(f) = N V(f) − V(f); the result follows. □

Remark 5.37 It follows that the naive estimator below has a negative bias of order 1/N:

    Ṽ(f, x) := (1/N) Σ_{i=1}^N ( f(ω_i, x) − f̂_N(x) )².    (5.69)
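Both bias statements are easy to check by simulation; the sketch below (illustrative, using standard normal samples so that V(f, x) = 1) compares the unbiased estimator (5.67) with the naive one (5.69):

```python
import numpy as np

rng = np.random.default_rng(4)
N, trials = 10, 200_000
x = rng.standard_normal((trials, N))    # true variance V(f, x) = 1

v_hat = x.var(axis=1, ddof=1)     # unbiased estimator (5.67): divide by N - 1
v_tilde = x.var(axis=1, ddof=0)   # naive estimator (5.69): divide by N

print(v_hat.mean(), v_tilde.mean())   # ≈ 1 and ≈ (N-1)/N = 0.9
```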

5.3.2 Minimizing over a Sample

The problem of minimizing the expectation of f:

    Min_{x∈X} f̄(x)    (P)

can be approximated by the problem of minimizing the standard estimate of the mean value:

    Min_{x∈X} f̂_N(x).    (P̂_N)

Lemma 5.38 The function N → E val(P̂_N) is nondecreasing and satisfies

    lim_N E val(P̂_N) ≤ val(P).    (5.70)

Proof (a) We first show that E val(P̂_N) ≤ val(P). Since f̄(x) = E f̂_N(x), this is equivalent to

    inf_{x∈X} E f̂_N(x) ≥ E [ inf_{x∈X} f̂_N(x) ],    (5.71)

which is a consequence of Jensen's inequality, the infimum being a concave function.
(b) Let us check that v_N := E val(P̂_N) is nondecreasing. Indeed, by Jensen's inequality again, writing f̂_{N+1}(x) as the average over i of the leave-one-out sample means (1/N) Σ_{j≠i} f(ω_j, x):

    v_{N+1} = E inf_{x∈X} (1/(N+1)) Σ_{i=1}^{N+1} (1/N) Σ_{j≠i} f(ω_j, x)
            ≥ (1/(N+1)) Σ_{i=1}^{N+1} E inf_{x∈X} (1/N) Σ_{j≠i} f(ω_j, x)    (5.72)
            = (1/(N+1)) Σ_{i=1}^{N+1} v_N = v_N,

as was to be shown. □

By the above lemma, val( P̂N ) is an estimate of val(P) with a nonpositive bias.

Remark 5.39 As an illustration of Lemma 5.38, consider the unbiased estimate V̂(f, x) of the variance, defined in (5.67). Alternatively we could solve the problem

    Min_{e∈R} (1/N) Σ_{i=1}^N ( f(ω_i) − e )²,

whose solution is e = f̂_N. The value of this problem is the estimator Ṽ(f, x) = (N − 1)N⁻¹ V̂(f, x), which as we have seen has a negative bias.
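Lemma 5.38 can also be illustrated on this example recast as problem (P̂_N). The sketch below assumes f(ω, x) = (x − ω)² with ω standard normal, so that f̄(x) = x² + 1 and val(P) = 1; then E val(P̂_N) = (N − 1)/N is nondecreasing in N and always below val(P):

```python
import numpy as np

rng = np.random.default_rng(5)
trials = 100_000
vals = []
for N in (2, 5, 20):
    omega = rng.standard_normal((trials, N))
    # val(P_hat_N) = min_x (1/N) sum_i (x - omega_i)^2, attained at the sample mean,
    # so it equals the naive variance estimator of the sample
    vals.append(omega.var(axis=1, ddof=0).mean())

print(vals)   # ≈ [0.5, 0.8, 0.95]: nondecreasing in N, below val(P) = 1
```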

5.3.3 Uniform Convergence of Values

Set g(ω) := max_{x∈X} | f(ω, x)|. Since X is compact, it contains a dense sequence x^k, and since f(ω, ·) ∈ Cb(X) a.s., we have that g(ω) = sup_k | f(ω, x^k)| a.s., proving that g is measurable.

Theorem 5.40 Let (5.63) hold. If g is integrable, then f̂_N(x) → f̄(x) uniformly, with probability (w.p.) 1.

Proof Since g is integrable, by the dominated convergence theorem, for any x ∈ X, we have that (i) f(·, x) is integrable, so that f̄(x) is real-valued, and (ii) if x^j → x in X, then f̄(x^j) → f̄(x), i.e., f̄ is continuous.
Define, for x and x′ in X and ω ∈ Ω:

    h(ω, x, x′) := | f(ω, x) − f(ω, x′)|.    (5.73)

This function is continuous w.r.t. (x, x′), and is dominated by 2g(ω). By arguments similar to the previous ones, its expectation h̄(x, x′) is a continuous function, with zero value when x′ = x. Since a continuous function over a compact set is uniformly continuous, for all ε > 0, there exists an α_ε > 0 such that h̄(x, x′) < ε when |x′ − x| ≤ α_ε. In addition,

    lim sup_N sup_{x′∈B(x,α_ε)} | f̂_N(x) − f̂_N(x′)| ≤ lim sup_N (1/N) Σ_{i=1}^N sup_{x′∈B(x,α_ε)} h(ω_i, x, x′) ≤ ε w.p. 1,    (5.74)

where the first inequality uses the triangle inequality, and the second one uses the separability of B(x, α_ε), which ensures the measurability of sup_{x′∈B(x,α_ε)} h(ω, x, x′), and the law of large numbers.
Covering the compact set X by finitely many open balls with radius α_ε and centers x^k, k = 1 to K_ε, and using f̂_N(x^k) → f̄(x^k) w.p. 1, we get, for x ∈ B(x^k, α_ε):

    lim sup_N | f̂_N(x) − f̄(x)| ≤ lim sup_N | f̂_N(x) − f̂_N(x^k)| + | f̄(x^k) − f̄(x)|.    (5.75)

The first term on the r.h.s. is w.p. 1 no more than ε by (5.74), and the second one can be made arbitrarily small by taking α_ε small enough. The conclusion follows. □

Corollary 5.41 Under the hypotheses of Theorem 5.40, val(P̂_N) → val(P) with probability 1.

Proof Indeed, the theorem ensures w.p. 1 the uniform convergence of the cost function, and the function f → min_{x∈X} f(x) is continuous over Cb(X). □

5.3.4 The Asymptotic Law

Let f : X → R. We set

    S(f) := {x̄ ∈ X; f(x̄) = inf_{x∈X} f(x)}.    (5.76)

The next proposition is due to Danskin [38].

Proposition 5.42 The map min : Cb(X) → R, that to f ∈ Cb(X) associates the value min_{x∈X} f(x), is Hadamard differentiable, and its derivative at f in the direction g ∈ Cb(X) is

    min′(f, g) = min_{x∈S(f)} g(x).    (5.77)

Proof Since |min(f) − min(f′)| ≤ sup_x | f(x) − f′(x)|, we have that min(·) is Lipschitz. By Lemma 5.31, it suffices to check that it has directional derivatives satisfying (5.77). Let f and g belong to Cb(X). We have, for ε > 0:

    min(f + εg) ≤ min_{x∈S(f)} ( f(x) + εg(x) ) = min(f) + ε min_{x∈S(f)} g(x).    (5.78)

On the other hand, let ε_k ↓ 0, and x^k ∈ S(f + ε_k g). Extracting a subsequence if necessary, we may assume that x^k → x̄. Passing to the limit in the relation

    f(x^k) + ε_k g(x^k) ≤ f(x) + ε_k g(x), for all x ∈ X,    (5.79)

we deduce that x̄ ∈ S(f). By continuity of g, we have

    min(f + ε_k g) = f(x^k) + ε_k g(x^k) = f(x^k) + ε_k g(x̄) + o(ε_k)
                   ≥ min(f) + ε_k min_{x∈S(f)} g(x) + o(ε_k),    (5.80)

which combined with (5.78) implies the conclusion. □
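Formula (5.77) can be checked numerically by finite differences. The sketch below is illustrative (a fine grid standing in for the compact set X = [−2, 2]): f(x) = (x² − 1)² has S(f) = {−1, 1}, and for the direction g(x) = x the derivative of the min should be min_{x∈S(f)} g(x) = −1:

```python
import numpy as np

x = np.linspace(-2.0, 2.0, 40_001)
f = (x ** 2 - 1.0) ** 2          # minimizers S(f) = {-1, +1}, minimum value 0
g = x                            # direction of perturbation; min over S(f) of g is -1

eps = 1e-3
directional = (np.min(f + eps * g) - np.min(f)) / eps
print(directional)               # ≈ -1, the value predicted by (5.77)
```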


Theorem 5.43 Let f (ω, x) satisfy (5.38)–(5.40), and denote by Z (x) the Gaussian
with variance equal to that of f (ω, x). Then N 1/2 (val( P̂N ) − val(P)) converges in
law to min x∈S( f¯)) Z (x).

Proof By Theorem 5.27, N 1/2 ( fˆN (x) − f¯(x)) converges in law to Z . We conclude
by combining Proposition 5.42 and the Hadamard Delta Theorem 5.32. 
Remark 5.44 The asymptotic law of N^{1/2}(val(P̂_N) − val(P)), when the minimum of f̄ over X is not unique, is therefore in general not Gaussian.

Example 5.45 Let ω be a standard Gaussian variable, X = {1, 2}, f(ω, 1) = ω, and f(ω, 2) = 0. Then √N f̂_N(1) is a standard Gaussian variable, and √N f̂_N(2) = 0. So the law of √N val(P̂_N) is that of min(0, Z_1), where Z_1 is a standard Gaussian variable. We have that √N f̂_N converges in law to Z := (Z_1, 0), so that, as follows from the above theorem, √N min_x f̂_N(x) converges in law (since in fact here the law is constant along the sequence) to min(0, Z_1).
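The example is easy to simulate (a sketch, not from the text): for each trial, val(P̂_N) = min( f̂_N(1), f̂_N(2) ) = min(ω̄_N, 0), and the rescaled values reproduce the law min(0, Z_1), which has an atom of mass 1/2 at 0 and mean −1/√(2π) ≈ −0.399:

```python
import numpy as np

rng = np.random.default_rng(6)
N, trials = 200, 20_000
omega = rng.standard_normal((trials, N))
fhat1 = omega.mean(axis=1)                 # f_hat_N(1); f_hat_N(2) = 0 identically
v = np.sqrt(N) * np.minimum(fhat1, 0.0)    # sqrt(N) (val(P_hat_N) - val(P)), val(P) = 0

print(np.mean(v == 0.0))   # ≈ 1/2: the atom of min(0, Z_1) at zero
print(v.mean())            # ≈ -1/sqrt(2 pi) ≈ -0.399
```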

5.3.5 Expectation Constraints

Problems with expectation type constraints need a more involved analysis. We restrict
ourselves to the convex setting, which is the only one that is well understood.

5.3.5.1 Marginal Analysis of Convex Problems

Let X be a compact subset of R^n and let (f, G) be continuous functions from X to R
and R^p respectively. The associated optimization problem is

Min_x f(x); G(x) ≤ 0, x ∈ X. (P_{f,G})

Denote by val(f, G) its value, and by L[f, G](x, λ) := f(x) + λ · G(x) its
Lagrangian. The dual problem is

Max_{λ∈R^p_+} inf_{x∈X} L[f, G](x, λ), (D_{f,G})

with solution set denoted by Λ( f, G). We know that, if the duality gap is zero, then
(x̄, λ̄) is a primal-dual solution4 iff

x̄ ∈ argmin_{x∈X} L[f, G](x, λ̄); λ̄ ≥ 0; λ̄ · G(x̄) = 0. (5.81)

One easily checks that the stability condition (1.170) of the duality theory holds iff

There exists β_{f,G} > 0 and x_{f,G} ∈ X such that
G_i(x_{f,G}) < −β_{f,G}, i = 1, . . . , p. (5.82)

The following result is a consequence of Proposition 1.98:


Lemma 5.46 Assume that X ⊂ Rn is convex and compact, f and G i , i = 1 to p, are
continuous and convex functions over Rn , and the stability condition (5.82) holds.
Then (P f,G ) and (D f,G ) have the same value, the sets S( f, G) and Λ( f, G) are
compact and nonempty, and the primal-dual solutions are characterized by (5.81).
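The characterization (5.81) can be made concrete on a one-dimensional toy instance (all data made up): f(x) = x², G(x) = 1 − x on X = [−2, 2], for which x̄ = 1, λ̄ = 2, and the Slater point x_{f,G} = 2 satisfies (5.82). A grid check of (5.81) and of the zero duality gap:

```python
# Toy instance of (P_{f,G}) with an explicit primal-dual solution (made-up data):
# minimize x^2 subject to 1 - x <= 0 over X = [-2, 2]; x_bar = 1, lambda_bar = 2.
f = lambda x: x * x
G = lambda x: 1.0 - x
L = lambda x, lam: f(x) + lam * G(x)    # Lagrangian L[f, G](x, lambda)

grid = [-2.0 + k / 1000.0 for k in range(4001)]
x_bar, lam_bar = 1.0, 2.0

# (5.81): x_bar minimizes L(., lambda_bar) over X; lambda_bar >= 0; complementarity.
assert all(L(x_bar, lam_bar) <= L(x, lam_bar) + 1e-12 for x in grid)
assert lam_bar >= 0.0 and lam_bar * G(x_bar) == 0.0

# Zero duality gap: inf_x L(x, lambda_bar) equals the primal value f(x_bar) = 1.
dual_val = min(L(x, lam_bar) for x in grid)
print(f(x_bar), dual_val)
```

Here L(x, λ̄) = (x − 1)² + 1, so the Lagrangian minimum over X is attained at x̄ with value equal to val(f, G), as in (5.81).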
Denote by Cconv (X ) the set of restrictions to X of continuous convex functions
over Rn . The reference functional space is Cb (X ). The following result takes its
origin in Gol’shtein [54]:

4 This means that x̄ is a solution of (P f,G ), and λ̄ is a solution of (D f,G ).


196 5 Sampling and Optimizing

Theorem 5.47 Assume that X ⊂ R^n is convex and compact, f and G_i, i = 1 to
p, are restrictions of continuous convex functions over R^n, and the stability con-
dition (5.82) holds. Then val(·, ·) is Hadamard differentiable at (f, G) tangentially
to C_conv(X), and the expression of its derivative in the direction (φ, Ψ) tangent to
C_conv(X) at (f, G) is

val′(f, G)(φ, Ψ) = min_{x∈S(f,G)} max_{λ∈Λ(f,G)} L[φ, Ψ](x, λ). (5.83)

In addition, let a sequence in C_conv(X) be of the form (f + ε_k φ^k, G + ε_k Ψ^k), with
(φ^k, Ψ^k) → (φ, Ψ) uniformly and ε_k ↓ 0. Then any limit point of S(f + ε_k φ^k, G +
ε_k Ψ^k) belongs to the set

argmin_{x∈S(f,G)} max_{λ∈Λ(f,G)} L[φ, Ψ](x, λ). (5.84)

Proof (a) Let (φ^k, Ψ_i^k), i = 1 to p, belong to C_b(R^n), and be such that (f +
ε_k φ^k, G_i + ε_k Ψ_i^k), i = 1 to p, are continuous convex functions over R^n, and (φ^k, Ψ^k)
converge uniformly to (φ, Ψ) over X. Set

v_k := val(f + ε_k φ^k, G + ε_k Ψ^k). (5.85)

Let x̄ ∈ S(f, G), γ ∈ (0, 1), and set x^γ := γ x_{f,G} + (1 − γ)x̄, with x_{f,G} defined in
(5.82). Then x^γ ∈ X and

G_i(x^γ) ≤ γ G_i(x_{f,G}) + (1 − γ)G_i(x̄) < −γ β_{f,G}. (5.86)

For k large enough, we have that x^γ ∈ F(f + ε_k φ^k, G + ε_k Ψ^k). Let x^k ∈ S(f +
ε_k φ^k, G + ε_k Ψ^k). Then

lim sup_k f(x^k) = lim sup_k v_k ≤ inf_{γ∈(0,1)} lim_k (f + ε_k φ^k)(x^γ) = val(f, G). (5.87)

(b) Again, let x^k ∈ S(f + ε_k φ^k, G + ε_k Ψ^k). For all x ∈ S(f, G) and λ ∈ Λ(f, G),
we have that:

v_k = (f + ε_k φ^k)(x^k) ≥ (f + ε_k φ^k)(x^k) + λ · (G + ε_k Ψ^k)(x^k)
    = L[f, G](x^k, λ) + ε_k L[φ^k, Ψ^k](x^k, λ) (5.88)
    ≥ val(f, G) + ε_k L[φ^k, Ψ^k](x^k, λ).

We used here the complementarity conditions, the minimality of the Lagrangian
L[f, G](·, λ) at S(f, G), and the fact that L[f, G](x, λ) = val(f, G) when x ∈
S(f, G). By (5.87) and (5.88), v_k → val(f, G), and since v_k = f(x^k) + o(1) and
G(x^k)_+ = O(ε_k), any limit point x̂ of x^k belongs to S(f, G), and we get by (5.88),
extracting a subsequence if necessary:
extracting a subsequence if necessary:
v_k ≥ val(f, G) + ε_k L[φ, Ψ](x̂, λ) + ε_k o(1 + |λ|). (5.89)

Maximizing w.r.t. λ in the compact set Λ(f, G), we get

v_k ≥ val(f, G) + ε_k max_{λ∈Λ(f,G)} L[φ, Ψ](x̂, λ) + o(ε_k). (5.90)

Minimizing then w.r.t. x̂ ∈ S(f, G), we obtain

v_k ≥ val(f, G) + ε_k min_{x∈S(f,G)} max_{λ∈Λ(f,G)} L[φ, Ψ](x, λ) + o(ε_k). (5.91)

(c) Again, let x^k ∈ S(f + ε_k φ^k, G + ε_k Ψ^k). Fix x̄ ∈ S(f, G). The stability
condition (5.82) implies that Λ^k := Λ(f + ε_k φ^k, G + ε_k Ψ^k) is uniformly bounded
for k large enough. Let λ^k ∈ Λ^k. Extracting a subsequence if necessary, we may
assume that λ^k → λ̄, and one shows easily that λ̄ ∈ Λ(f, G). We get

v_k = f(x^k) + ε_k φ^k(x^k)
    = min_{x∈X} L[f + ε_k φ^k, G + ε_k Ψ^k](x, λ^k)
    ≤ L[f + ε_k φ^k, G + ε_k Ψ^k](x̄, λ^k) (5.92)
    = L[f, G](x̄, λ^k) + ε_k L[φ^k, Ψ^k](x̄, λ^k)
    ≤ val(f, G) + ε_k L[φ, Ψ](x̄, λ̄) + o(ε_k).

The second inequality uses the relation L[f, G](x̄, λ^k) ≤ L[f, G](x̄, λ̄) = val(f, G),
a consequence of the fact that λ̄ is a dual solution. Since Λ(f, G) is bounded, we get

v_k ≤ val(f, G) + ε_k max_{λ∈Λ(f,G)} L[φ, Ψ](x̄, λ) + o(ε_k), (5.93)

and minimizing w.r.t. x̄ ∈ S(f, G) we obtain the converse inequality of (5.91), imply-
ing (5.83). Finally the property about limit points of primal solutions follows from
(5.83) and (5.90). 

5.3.5.2 Application to Expectation Constraints

Let f(ω, x) : Ω × R^n → R and G(ω, x) : Ω × R^n → R^p. Assume that f and G_i,
i = 1 to p, are convex w.r.t. x a.s., and measurable in ω for all x (Carathéodory
conditions for convex functions), and satisfy (5.38)–(5.40). Their expectations
f̄(x) = E f(·, x) and Ḡ_i(x) = E G_i(·, x) are therefore convex and continuous. Let
us consider the convex problem

Min_x f̄(x); Ḡ(x) ≤ 0, x ∈ X, (P_{f̄,Ḡ})
with X a compact and convex subset of R^n. The sample approximation of this problem
is

Min_x f̂_N(x); Ĝ_N(x) ≤ 0, x ∈ X, (P_{f̂_N,Ĝ_N})

where f̂_N is the empirical estimate (5.41), with the same convention for Ĝ_N (with the
same sample). We need the qualification condition

There exists β > 0 and x^0 ∈ X such that Ḡ_i(x^0) < −β, i = 1 to p. (5.94)

The set S( f¯, Ḡ) of solutions of (P f¯,Ḡ ) is a convex and compact subset of X . We
recall that we denote by L[ f¯, Ḡ](x, λ) := f¯(x) + λ · Ḡ(x) the Lagrangian and by
Λ( f¯, Ḡ) the set of Lagrange multipliers, solutions of the dual problem

Max_{λ∈R^p_+} inf_{x∈X} L[f̄, Ḡ](x, λ). (D_{f̄,Ḡ})

Theorem 5.48 Let f(ω, x) and G(ω, x) satisfy (5.38)–(5.40), and let the qualifi-
cation condition (5.94) hold. Let (Z_f̄, Z_Ḡ) denote the components of the Gaussian
variable with image in C_b(X)^{p+1}, of covariance equal to that of (f, G). Let Z_i
denote the component associated with Ḡ_i. Then we have the convergence in law of
N^{1/2}(val(P_{f̂_N,Ĝ_N}) − val(P_{f̄,Ḡ})) towards

min_{x∈S(f̄,Ḡ)} max_{λ∈Λ(f̄,Ḡ)} ( Z_f̄(x) + Σ_{i=1}^p λ_i Z_i(x) ). (5.95)

Proof By (5.38)–(5.40) and Theorem 5.27, N^{1/2}(f̂_N − f̄, Ĝ_N − Ḡ) converges in
law towards (Z_f̄, Z_Ḡ). We conclude by combining Theorem 5.47 and the Hadamard
Delta Theorem 5.32. 

5.4 Large Deviations

In this section we briefly recall the starting point of the theory of large deviations,
and show how to apply this theory to stochastic optimization problems.

5.4.1 The Principle of Large Deviations

Let X_1, . . . , X_N be independent r.v.s with law equal to that of X. Set Z_N :=
N^{-1}(X_1 + · · · + X_N). For all a ∈ R and t > 0, we have that
P(Z_N ≥ a) = E[1_{Z_N ≥ a}] ≤ e^{-ta} E[e^{tZ_N} 1_{Z_N ≥ a}] ≤ e^{-ta} E[e^{tZ_N}]. (5.96)

Denote by M(t) := E[e^{tX}] the moment-generating function. We know that

E[e^{tZ_N}] = Π_{i=1}^N E[e^{tX_i/N}] = M(t/N)^N. (5.97)

Denote by LM(t) := log M(t) the logarithmic moment-generating function. Then

(1/N) log P(Z_N ≥ a) ≤ −(t/N) a + LM(t/N). (5.98)
Minimizing over t > 0, we obtain

(1/N) log P(Z_N ≥ a) ≤ −I^+(a), (5.99)

where
I^+(a) := sup_{τ>0} {aτ − LM(τ)}. (5.100)

This definition is close to that of the Fenchel transform of the logarithmic moment-
generating function, also called the rate function:

I(a) := sup_τ {aτ − LM(τ)}. (5.101)

We have of course I^+(a) ≤ I(a). The interesting case is when a > E(X); then the
probability of Z_N ≥ a tends to zero as N ↑ ∞. We will then see that I^+(a) = I(a)
under weak hypotheses, and this gives the following large deviations estimate:

Theorem 5.49 (Cramér's theorem) Let a > E(X). If M(τ) has a finite value in
[−τ, τ] for some τ > 0, then I^+(a) = I(a), and so with (5.99):

P(Z_N ≥ a) ≤ e^{-N I(a)}. (5.102)

Proof We have that M(0) = 1. Set τ′ := τ/2, and let t ∈ (−τ′, τ′). By the mean
value theorem, e^{tX(ω)} − 1 = t X(ω) e^{θX(ω)} for some θ = θ(ω) ∈ (−τ′, τ′), and
since τ′|X(ω)| ≤ e^{τ′|X(ω)|}:

|e^{tX(ω)} − 1| / |t| ≤ (1/τ′) τ′|X(ω)| e^{τ′|X(ω)|} ≤ (1/τ′) e^{τ|X(ω)|}. (5.103)

Since M(τ) has a finite value in [−τ, τ], the r.h.s. has an expectation majorized by
(M(τ) + M(−τ))/τ′. By the dominated convergence theorem, (M(t) − M(0))/t
has, when t ↓ 0, a limit equal to M′(0^+) = E(X). Consequently,

LM′(0) = M′(0)/M(0) = E(X). (5.104)



Since LM is convex,5 aτ − LM(τ) is concave, and has derivative a − E(X) > 0 at
τ = 0; hence the supremum in (5.101) is attained for some τ > 0. The result
follows. 
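For a standard Gaussian X one has LM(t) = t²/2, hence I(a) = sup_τ(aτ − τ²/2) = a²/2, attained at τ = a. The sketch below (toy parameters, chosen arbitrarily) recovers I(a) by a grid search over τ and checks the bound (5.102) against empirical frequencies:

```python
import math
import random

# Standard Gaussian X: LM(t) = t^2 / 2, so the rate function is I(a) = a^2 / 2.
LM = lambda t: 0.5 * t * t
a = 0.5
I_grid = max(a * (k / 1000.0) - LM(k / 1000.0) for k in range(5001))
print(I_grid)                            # close to a^2 / 2 = 0.125

# Empirical check of (5.102): P(Z_N >= a) <= exp(-N * I(a)).
random.seed(1)
N, R = 20, 20000
hits = sum(
    1 for _ in range(R)
    if sum(random.gauss(0.0, 1.0) for _ in range(N)) / N >= a
)
bound = math.exp(-N * a * a / 2.0)       # exp(-2.5), about 0.082
print(hits / R, bound)
```

Here Z_N is N(0, 1/N), so the true probability P(Z_N ≥ 0.5) ≈ 0.013 sits well below the Cramér bound, as expected from an exponential upper estimate.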

5.4.2 Error Estimates in Stochastic Programming

Let us come back to the stochastic optimization problem (P) and its sampled version
( P̂N ) of Sect. 5.3.2. Assume for the sake of simplicity that (P) has at least one solution
x̄, and that the moment-generating function M(t) is finitely-valued for t > 0 small
enough. By the large deviations principle, for all a > f¯(x̄), we have, denoting by Ix
the rate function associated with f (ω, x):

P(val(P̂_N) ≥ a) ≤ P(f̂_N(x̄) ≥ a) ≤ e^{-N I_x̄(a)}. (5.105)

So, the value of the sampled problem has an “exponentially weak” probability of
being more than f¯(x̄) plus a given positive amount.

Remark 5.50 When minimizing over a finite set, it follows by similar arguments that
the value of the sampled problem has an “exponentially weak” probability of being
less than f¯(x̄) minus a given positive amount.

5.5 Notes

The state of the art on the subject of the chapter is given in Ruszczynski and Shapiro
[106], and Shapiro et al. [114]. The Hadamard Delta Theorems 5.32 and 5.43 are due
to Shapiro [111]. Theorem 5.47 is also due to Shapiro [112].
Linderoth et al. [75] made extensive numerical tests to obtain statistical estimates
of the value function for simple recourse problems.

5 It suffices to check this in the case of a finite sum. Let LM(t) = log(Σ_{i=1}^n p_i e^{tx_i}), with the p_i
positive and of sum one. Then LM′(t) = M(t)^{-1} Σ_{i=1}^n p_i x_i e^{tx_i} and LM″(t) =
M(t)^{-1} Σ_{i=1}^n p_i x_i^2 e^{tx_i} − M(t)^{-2} (Σ_{i=1}^n p_i x_i e^{tx_i})^2. We conclude by the Cauchy–Schwarz inequality.
Chapter 6
Dynamic Stochastic Optimization

Summary Dynamic stochastic optimization problems have the following information
constraint: each decision must be a function of the available information at
the corresponding time. This can be expressed as a linear constraint involving con-
ditional expectations. This chapter develops the corresponding theory for convex
problems with full observation of the state. The resulting optimality system involves
a backward costate equation, the control variable being a point of minimum of some
Hamiltonian function.

6.1 Conditional Expectation

6.1.1 Functional Dependency

Quite often a decision needs to be a function of certain signals, or outputs of the


system. Mathematically this means that, given two functions X (the signal) and Y
(the decision) over the set Ω of events, we need to take Y = g(X ) for some function g.
As Lemma 3.15 shows, in the framework of finite-dimensional measurable functions,
this (nonlinear) constraint can be expressed as the (linear) constraint that Y is mea-
surable w.r.t. the σ-algebra generated by X. So in the sequel we will study optimization
problems with the constraint of measurability w.r.t. a certain sub-σ-algebra.

6.1.2 Construction of the Conditional Expectation

Let (Ω, F , P) be a probability space, and let G be a sub σ -algebra of F . For


s ∈ [1, ∞], we write

L s (F ) := L s (Ω, F ); HF := L 2 (F )m , (6.1)

© Springer Nature Switzerland AG 2019
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_6

with a similar convention for G . The scalar product in HF is denoted by

(X, X′)_F := E(X · X′), for all X, X′ in H_F. (6.2)

Both HF and HG are Hilbert spaces, and the norm on HG is induced by the norm
of HF . It follows that HG is a closed subspace of HF . The (orthogonal) projection
operator from HF onto HG is called the conditional expectation (over G ) and usually
denoted by E[·|G ]; but this notation is often too heavy and so it is convenient to write
PG instead. So, if X ∈ HF , its projection Y onto HG is such that

Y = PG X = E[X |G ]. (6.3)

The mapping PG is obviously linear. Consequently, if α1 and α2 belong to Rm , then

PG (α1 · X 1 + α2 · X 2 ) = α1 · PG X 1 + α2 · PG X 2 . (6.4)

Also, P_G is non-expansive: ‖P_G X‖ ≤ ‖X‖, and therefore continuous, and it operates
componentwise, i.e., Y_i = P_G X_i, for i = 1 to m.
Clearly P_G X′ = X′ iff X′ ∈ H_G. For any a ∈ R^m, we have that, identifying a
constant with the constant function of H_F having the same value:

PG (a + X ) = a + PG X. (6.5)

We give some additional properties of the conditional expectation in the L² setting.
We define the componentwise product of random variables Z, Z′ with values in R^m
by (Z Z′)_i(ω) := Z_i(ω) Z′_i(ω), for i = 1 to m and ω ∈ Ω.

Lemma 6.1 Let X ∈ HF and Y = PG X . Then (i) Y is characterized by the follow-


ing relations

Y ∈ HG and E(Y · Z ) = E(X · Z ), for all Z ∈ HG . (6.6)

(ii) We have that

PG (Z X ) = Z PG X = Z Y, for all Z ∈ L ∞ (G )m . (6.7)

(iii) For any X and X′ in H_F, with m = 1, we have that

X ≤ X′ ⇒ P_G X ≤ P_G X′. (6.8)

Proof (i) The expectations in (6.6) are the scalar products in H_F of Z with Y and
with X. So we can rewrite this equation as (X − Y, Z)_F = 0, for all Z ∈ H_G, which is
the characterization of the projection onto a subspace.
(ii) Set Y_Z := P_G(Z X). By point (i), Y_Z is characterized by

E(Y_Z · Z′) = E(X Z · Z′), for all Z′ ∈ H_G. (6.9)

Now Z Z′ ∈ H_G, and so by point (i):

E(X Z · Z′) = (X, Z Z′)_F = (Y, Z Z′)_F = (Y Z, Z′)_F. (6.10)

Since Y Z ∈ H_G, the result follows with (i).


(iii) By linearity it suffices to check that if X ≥ 0, then Y ≥ 0. Indeed, Y_+ (the positive
part of Y taken a.s.) clearly satisfies Y_+ ∈ H_G and ‖X − Y_+‖ ≤ ‖X − Y‖. Since
Y is the orthogonal projection of X onto H_G, this implies Y = Y_+ and the result
follows. 

Taking Z in (6.6) constant, we get

EPG X = EX, for all X ∈ HF . (6.11)
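On a finite probability space, P_G simply averages over the atoms of G, and the characterizations (6.6) and (6.11) can be checked directly. A small sketch with made-up data (a six-point Ω and a two-atom partition generating G):

```python
# Finite probability space Omega = {0,...,5}; G is generated by the partition
# {0,1,2} | {3,4,5}.  All numerical data is made up for illustration.
p = [0.1, 0.2, 0.1, 0.25, 0.15, 0.2]
X = [1.0, 4.0, -2.0, 0.0, 3.0, 5.0]
atoms = [[0, 1, 2], [3, 4, 5]]

# Y = P_G X is constant on each atom G, with value E(X 1_G) / P(G).
Y = [0.0] * 6
for A in atoms:
    mean_A = sum(p[w] * X[w] for w in A) / sum(p[w] for w in A)
    for w in A:
        Y[w] = mean_A

# Characterization (6.6): E((X - Y) Z) = 0 for Z the indicator of each atom.
for A in atoms:
    assert abs(sum(p[w] * (X[w] - Y[w]) for w in A)) < 1e-12
# Relation (6.11): E(P_G X) = E(X).
assert abs(sum(pw * yw for pw, yw in zip(p, Y))
           - sum(pw * xw for pw, xw in zip(p, X))) < 1e-12
print(Y)
```

The atom-averaging formula used here is stated below in Example 6.10; the assertions are exactly the orthogonality relations defining the projection.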

We now present the conditional Jensen’s inequality and some of its consequences.

Lemma 6.2 Let X ∈ HF , Y = PG X , and ϕ be a proper l.s.c. convex function over


Rm . Then
(i) The following conditional Jensen inequality holds:

ϕ(Y ) ≤ PG (ϕ(X )) a.s. on Ω. (6.12)

(ii) Let K be a nonempty, closed convex subset of Rm . Then

X (ω) ∈ K a.s. ⇒ Y (ω) ∈ K a.s. (6.13)

(iii) We have the integral Jensen inequality (the expectations having values in R ∪
{+∞}):
Eϕ(Y ) ≤ Eϕ(X ). (6.14)

(iv) For any s ∈ [1, ∞], we have that

‖P_G X‖_s ≤ ‖X‖_s. (6.15)

Proof (i) Since ϕ is proper, l.s.c. and convex, we have that ϕ is a supremum
of its affine minorants, i.e., there exists an A ⊂ R^m × R such that, for all x ∈ R^m:

ϕ(x) = sup{a · x + b; (a, b) ∈ A}. (6.16)

In view of (6.4), (6.5) and (6.8), we have that for any (a, b) ∈ A:

a · Y + b = a · PG X + b = PG [a · X + b] ≤ PG (ϕ(X )) . (6.17)

Maximizing the l.h.s. over (a, b) ∈ A, we get the desired result.


(ii) Take ϕ = I_K, the indicator function of K, in (6.12). The r.h.s. is equal to 0, and
hence the l.h.s. is nonpositive. The conclusion follows.
(iii) Take expectations on both sides of (6.12), noting that, since ϕ has an affine
minorant, the expectations are well-defined with values in R ∪ {+∞}, and use (6.11).
(iv) For s ∈ [1, ∞), apply point (iii) with ϕ(x) = |x|^s. For s = ∞, apply point (ii)
with K = B̄(0, ‖X‖_∞). 
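The conditional Jensen inequality (6.12) can be checked atom by atom on a finite space. A toy sketch with ϕ(x) = x² and hand-picked data:

```python
# Atomwise check of the conditional Jensen inequality (6.12) with phi(x) = x^2.
# Toy data: four-point space, G generated by the partition {0,1} | {2,3}.
p = [0.3, 0.2, 0.1, 0.4]
X = [1.0, -2.0, 3.0, 0.5]
atoms = [[0, 1], [2, 3]]
phi = lambda x: x * x

for A in atoms:
    PA = sum(p[w] for w in A)
    Y_A = sum(p[w] * X[w] for w in A) / PA            # value of P_G X on the atom
    phiX_A = sum(p[w] * phi(X[w]) for w in A) / PA    # value of P_G(phi(X)) on it
    print(phi(Y_A), "<=", phiX_A)
    assert phi(Y_A) <= phiX_A + 1e-12                 # (6.12) on this atom

# Taking expectations of both sides gives the integral Jensen inequality (6.14).
```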

We next show how to extend the conditional expectation from H_F to the larger
space L¹(F)^m. By (6.15) we already know that, for all X ∈ H_F:

‖P_G X‖_1 ≤ ‖X‖_1, (6.18)

and consequently P_G has a unique continuous extension to L¹(F)^m, denoted in
the same way.

Remark 6.3 (i) If G is the σ -algebra generated by a random variable, say g : Ω →


Rq , then we write the conditional expectation of the random variable X in the form
E[X |g]. As we have seen, then, E[X |g](ω) = h(g(ω)) a.e. for some Borelian func-
tion h.
(ii) We define the conditional expectation of X when g(ω) = a as

E[X |g = a] := h(a) if g −1 (a) has a positive probability, 0 otherwise. (6.19)

Lemma 6.4 (i) Relation (6.18) is also satisfied by the extension of the conditional
expectation to L¹(F)^m.
(ii) The latter satisfies (6.4), (6.5), (6.7), (6.8), and (6.12)–(6.15). If X ∈ L¹(F)^m,
then Y = P_G X is characterized by the relation

Y ∈ L¹(G)^m and E(Y · Z) = E(X · Z), for all Z ∈ L^∞(G)^m. (6.20)

Proof (i) That (6.18) holds for all X ∈ H_F follows from Lemma 6.2(iv) with s = 1.
Given X ∈ L¹(F)^m and k ∈ N, k ≠ 0, consider the truncation

X^k(ω) := 0 if |X(ω)| > k, and X(ω) otherwise. (6.21)

Then X^k belongs to H_F, and is a Cauchy sequence converging to X in L¹(F)^m.
Thanks to (6.18) (for X ∈ H_F), Y^k := P_G X^k is a Cauchy sequence in L¹(G)^m, and
so has in this space a limit Y, which by the definition of the extended operator satisfies
Y = P_G X.
(ii) We leave the proofs (based again on the sequences X^k and Y^k) as an exercise. 


For s ∈ [1, ∞], we denote by s′ its conjugate number, such that 1/s + 1/s′ = 1,
and set U_s := L^s(F)^m. We denote by P_s the restriction of the conditional expectation
(over U_1) to U_s, and view it as an element of L(U_s).

Lemma 6.5 (i) Let s ∈ [1, ∞] and u ∈ U_s. Then P_s u is characterized by

∫_Ω Z(ω) · (P_s u)(ω) dω = ∫_Ω Z(ω) · u(ω) dω, for all Z ∈ L^{s′}(G)^m. (6.22)

(ii) For any s ∈ [1, ∞), we have that P_s^⊤ = P_{s′}.
(iii) Let u ∈ U_1. Then P_∞^⊤ u = P_1 u.

Proof (i) By Lemma 6.4(ii), P_s u is characterized by the equality in (6.22), for all
Z ∈ L^∞(G)^m. So we only have to prove that (6.22) holds for s ∈ (1, ∞]. Let
v ∈ L^{s′}(G)^m. The componentwise truncated sequence

v_i^k(ω) := max(−k, min(k, v_i(ω))), i = 1 to m, (6.23)

belongs to L^∞(G)^m and converges to v in U_{s′}. By (6.20) we deduce that (the duality
product being the one of the space U_s):

⟨v, P_s u⟩ = lim_k ⟨v^k, P_s u⟩ = lim_k ⟨v^k, u⟩ = ⟨v, u⟩. (6.24)

Point (i) follows.



(ii) Let s ∈ [1, ∞) and (u, v) ∈ U_s × U_{s′}. Since P_{s′} v ∈ L^{s′}(G)^m and P_s u ∈ L^s(G)^m,
we have by point (i) that

⟨P_{s′} v, u⟩ = ⟨P_{s′} v, P_s u⟩ = ⟨v, P_s u⟩, (6.25)

proving that P_s^⊤ = P_{s′}.
(iii) By the same arguments, when v ∈ U_∞ and u ∈ U_1 (which is a subset of U_∞^*),
we have that P_∞^⊤ u = P_1 u. 

An obvious consequence of Lemma 6.5(i) is the following corollary:

Corollary 6.6 Let s ∈ [1, ∞] and u ∈ U_s. Let E be a subset of L^{s′}(G), whose
spanned vector space is dense in L^{s′}(G). Then P_s u is characterized by

∫_Ω Z(ω) · (P_s u)(ω) dω = ∫_Ω Z(ω) · u(ω) dω, for all Z ∈ E. (6.26)

This holds in particular when taking for E the set of characteristic functions of
G-measurable sets.

We recall that, by Lemma 3.83, an element v∗ ∈ L ∞ (F )∗ can be decomposed in


a unique way as v∗ = v1 + vs , where v1 ∈ L 1 (F ) and vs is a singular multiplier.

Definition 6.7 The conditional expectation of v^* ∈ L^∞(F)^* is defined by

E[v^*|G] := P_∞^⊤ v^*. (6.27)

Remark 6.8 (i) In view of Lemma 6.5(iii), when v^* ∈ L¹(F), we recover the usual
conditional expectation.
(ii) By the same lemma, for all s ∈ [1, ∞], P_s^⊤ is a conditional expectation (but of
course P_∞^⊤ ≠ P_1).

6.1.3 The Conditional Expectation of Non-integrable


Functions

Let (Ω, F, P) be, as before, a probability space, and let G be a sub-σ-algebra of
F. Denote by L⁰(Ω, F) the set of measurable functions w.r.t. the σ-algebra F,
and by L⁰₊(Ω, F) the set of such measurable functions that are nonnegative a.s. To
f ∈ L⁰₊(Ω, F) we associate the sequence of truncated functions f^k, k ∈ N, such
that f^k(ω) := min(f(ω), k), and their conditional expectations g^k := E[f^k|G]. The
latter are well-defined since f^k ∈ L^∞(Ω, F). The conditional expectation being
a nondecreasing mapping, the sequence g^k is nondecreasing and converges a.s. to
some g ∈ L⁰₊(Ω, G). We say that g is the conditional expectation of f and write
g = E[f|G].
More generally, if f ∈ L⁰(Ω, F) is such that f ≥ h a.s. for some h in L¹(Ω, F),
we can define E[f|G] as the limit a.s. of the nondecreasing sequence E[f^k|G].

Lemma 6.9 Let f ∈ L⁰₊(Ω, F). Then g = E[f|G] satisfies

E[f · z] = E[g · z], for all z ∈ L^∞_+(Ω, G). (6.28)

Proof Let z ∈ L^∞_+(Ω, G). Using the monotone convergence Theorem 3.34 twice,
and the fact that g^k = E[f^k|G], we get

E[f · z] = lim_k E[f^k · z] = lim_k E[g^k · z] = E[g · z]. (6.29)

The conclusion follows. 

6.1.4 Computation in Some Simple Cases

Let s ∈ [1, ∞]. The two extreme cases are: when G is the trivial σ -algebra, the con-
ditional expectation coincides with the expectation; when G = F , the conditional
expectation is the identity operator in L s (F ).

Example 6.10 Let G ⊂ Ω be an atom of G . Then g = PG f has over G the value


E( f 1G )/P(G) (the mean value of f over G). Indeed, it suffices to get the result
when f is scalar, and to take Z = 1G in (6.22).

Example 6.11 Let (Ω1 , F1 ) and (Ω2 , F2 ) be measurable spaces, and let F be the
product σ -algebra (the one generated by F1 × F2 ). Set Ω := Ω1 × Ω2 , and let
P be a probability measure on (Ω, F ). Set G := F1 × {Ω2 , ∅}. The associated
random functions are those that do not depend on ω2 . Then, roughly speaking, Y :=
E[X |G ] is obtained by averaging for each ω1 ∈ Ω1 the value of X (ω1 , ·). More
precise statements follow.
Example 6.12 In the framework of Example 6.11, assume that Ω₁ and Ω₂ are finite
sets, say equal to {1, . . . , p} and {1, . . . , q} resp., with elements denoted by i and j;
let p_{ij} be the probability of (i, j). Taking for Z the characteristic function 1_{i₀}(i, j),
for any i₀ ∈ {1, . . . , p}, in (6.22), we deduce that

Y(i) = ( Σ_{j∈Ω₂} p_{ij} X(i, j) ) / ( Σ_{j∈Ω₂} p_{ij} ), for all i ∈ Ω₁. (6.30)
Example 6.13 (Independent noises) In the framework of Example 6.11, let P be the
product of the probability P₁ over (Ω₁, F₁) and P₂ over (Ω₂, F₂), so that ω₁ and
ω₂ are independent. Then Y := E[X|G] is given by, a.s.:

Y(ω₁) = ∫_{Ω₂} X(ω₁, ω₂) dP₂(ω₂). (6.31)

Remark 6.14 More general expressions can be obtained using the disintegration
theorem [40, Chap. III]. In most applications we have (reformulating the model if
necessary) independent noises.

6.1.5 Convergence Theorems

The main convergence theorems of integration theory have their counterparts for
conditional expectations.
Theorem 6.15 (Monotone convergence) Let f k be a nondecreasing sequence of
L 1 (F ), with limit a.s. f ∈ L 1 (F ). Set gk := E[ f k |G ] and g := E[ f |G ]. Then gk is
nondecreasing, and converges to g both a.s. and in L 1 (G ).
Proof Since f k is nondecreasing, by (6.8) (which is valid in L 1 (F )) so is gk , and
hence, gk → ĝ a.s. for some measurable function ĝ, such that gk ≤ ĝ ≤ g. By dom-
inated convergence, ĝ is integrable. Let A ∈ G with characteristic function z = 1 A .
Using the monotone convergence Theorem 3.34 twice, we get:

E(z ĝ) = lim E(zgk ) = lim E(z f k ) = E(z f ) = E(zg). (6.32)


k k

We deduce by Corollary 6.6 that ĝ = g, and therefore gk → g in L 1 (G ) by monotone


convergence. 

Theorem 6.16 (Lebesgue dominated convergence) Let the sequence f k of L 1 (F )


converge a.e. to f , and be dominated by h ∈ L 1 (F ), in the sense that | f k (ω)| ≤ h(ω)
a.s. Set gk := E[ f k |G ] and g := E[ f |G ]. Then g ∈ L 1 (G ), and gk → g in L 1 (G ).
Proof By the Lebesgue dominated convergence Theorem 3.38, f k → f in L 1 (F ),
and by Lemma 6.4(i) the conditional expectation is a continuous operator in L 1 (F ).
The conclusion follows. 
Lemma 6.17 (Fatou’s lemma) Let f k be a sequence in L 1 (F ), with f k ≥ h, where h
is an integrable function. Let f := lim inf k f k be integrable, and set gk := E[ f k |G ],
g := E[ f |G ]. Then
g ≤ lim inf gk a.s. (6.33)
k

Proof Set f̂^k := inf{f^ℓ; ℓ ≥ k}, and ĝ^k := E[f̂^k|G]. Then f̂^k is nondecreasing and
converges a.s. to f. Since h ≤ f̂^k ≤ f^k, f̂^k is integrable. By the monotone conver-
gence Theorem 6.15, ĝ^k ↑ g a.s. Since f̂^k ≤ f^k, we have that ĝ^k ≤ g^k. The conclusion
follows. 

6.1.6 Conditional Variance

Definition 6.18 Let G be a sub-σ-algebra of some σ-algebra F, X ∈ L²(F), and
Y = E^G X. We call the G-measurable function

var^G X := E^G((X − Y)(X − Y)) (6.34)

the conditional variance of X.

Lemma 6.19 Let F, G, X and Y be as in the previous definition, with X ∈ L²(F).
Then we have the law of total variance

var X = E var^G X + var Y. (6.35)

Proof We may assume that EX = EY = 0, and it is enough to prove the result when
X is scalar. Then

var X = E X² = E(X − Y + Y)² = E(X − Y)² + 2E[(X − Y)Y] + var Y. (6.36)

Now
E(X − Y)² = E E^G(X − Y)² = E var^G X (6.37)

and
E[X Y] = E E^G[X Y] = E(Y E^G[X]) = E Y² (6.38)

so that E[(X − Y)Y] = 0. The result follows. 



Remark 6.20 The law of total variance (6.35) can be interpreted as the decomposition
of the variance as the sum of the term var Y explained, or predicted, by G, and of the
unexplained, or unpredicted, term E var^G X.
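The law of total variance (6.35) is easy to verify numerically on a finite space; the data below is made up, with G generated by a two-atom partition:

```python
# Numerical check of the law of total variance (6.35) on a four-point space.
p = [0.2, 0.3, 0.1, 0.4]
X = [2.0, -1.0, 5.0, 1.0]
atoms = [[0, 1], [2, 3]]                  # partition generating G

EX = sum(pw * xw for pw, xw in zip(p, X))
varX = sum(pw * (xw - EX) ** 2 for pw, xw in zip(p, X))

Y, cvar = [0.0] * 4, [0.0] * 4            # Y = E^G X; cvar = var^G X (atomwise)
for A in atoms:
    PA = sum(p[w] for w in A)
    m = sum(p[w] * X[w] for w in A) / PA
    v = sum(p[w] * (X[w] - m) ** 2 for w in A) / PA
    for w in A:
        Y[w], cvar[w] = m, v

EY = sum(pw * yw for pw, yw in zip(p, Y))
varY = sum(pw * (yw - EY) ** 2 for pw, yw in zip(p, Y))
E_cvar = sum(pw * cw for pw, cw in zip(p, cvar))
print(varX, E_cvar + varY)                # the two sides of (6.35)
assert abs(varX - (E_cvar + varY)) < 1e-12
```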

6.1.7 Compatibility with a Subspace

In this subsection, instead of a measurability constraint, we consider the more general


case of a Banach space U with a closed subspace V . This abstract setting simplifies the
discussion and allows us to apply the results to more general frameworks (dynamic
case). We assume the existence of a projector P from U onto V , i.e., P ∈ L(U ) and

Pu ∈ V, for all u ∈ U,
(6.39)
Pu = u, for all u ∈ V.

Note that P′ := I − P is itself a projector, onto the closed subspace

V′ := Im(P′) = Ker P. (6.40)

Any u ∈ U can be decomposed in a unique way as u = u′ + u″, with u′ ∈ V′
and u″ ∈ V. Also, the transpose operator P^⊤ can be interpreted as the restriction
of linear forms to the subspace V. In the applications, u ∈ V might represent a
measurability constraint, and P would then be the corresponding conditional expec-
tation. Remember that then, P^⊤ is also a conditional expectation. Given K ⊂ U,
nonempty, closed and convex, we set KV := K ∩ V .

Definition 6.21 We say that K is compatible with P if PK ⊂ K , i.e., if any


u ∈ K is such that Pu ∈ K .

Remark 6.22 By Remark 1.17, there exists an E ⊂ U^* × R such that

K = {u ∈ U; ⟨u^*, u⟩ ≤ b, for all (u^*, b) ∈ E}. (6.41)

Therefore, K is compatible whenever, for all (u^*, b) ∈ E, we have that ⟨u^*, u⟩ =
⟨u^*, Pu⟩ for all u ∈ U, or equivalently, if

u^* = P^⊤ u^*, for all (u^*, b) ∈ E. (6.42)

We emphasize the fact that we consider here KV as a subset of U (and not of V ).


Therefore the normal cone to KV at u ∈ KV (of which the lemma below gives an
expression) is considered as a subset of U ∗ (and not of V ∗ ). Of course V ⊥ denotes
the orthogonal of V in U ∗ .

Lemma 6.23 (i) We have that Ker P^⊤ = V^⊥.
(ii) Let K be compatible with P. Then, for all u ∈ K_V:

N_{K_V}(u) = N_K(u) + V^⊥. (6.43)

Proof (i) Let u^* ∈ U^* and u ∈ U. Then ⟨P^⊤ u^*, u⟩ = ⟨u^*, Pu⟩. Since the range of
P is V, the result follows.
(ii) We have the trivial inclusion for normal cones of an intersection: N_K(u) and
V^⊥ being subsets of N_{K_V}(u), and the latter being a convex cone, the inclusion
N_{K_V}(u) ⊃ N_K(u) + V^⊥ follows.
We next show the converse inclusion. Let u ∈ K_V and u^* ∈ N_{K_V}(u). Given v ∈ K,
define v₁ := Pv. By the definition of compatibility, v₁ ∈ K_V, and so

0 ≥ ⟨u^*, v₁ − u⟩ = ⟨u^*, P(v − u)⟩ = ⟨P^⊤ u^*, v − u⟩, (6.44)

proving that P^⊤ u^* ∈ N_K(u). So it suffices to prove that u^* − P^⊤ u^* ∈ V^⊥. Indeed,
if v ∈ V then we have that ⟨u^* − P^⊤ u^*, v⟩ = ⟨u^*, v − Pv⟩ = 0. The result
follows. 

Remark 6.24 We proved in Lemma 1.124 the following geometric calculus rule: the
normal cone of an intersection of closed convex sets is the sum of normal cones
to these sets, provided that the qualification condition 0 ∈ int(K 1 − K 2 ) holds. In
the above lemma we obtained the geometric calculus rule without the qualification
condition.

An easy application of the above result, that we state for future reference, is as
follows. Consider the problem

Min_{u∈K_V} F(u); y[u] := Au ∈ K_Y, (6.45)

where F is a continuous convex function over U, Y is another Banach space,
A ∈ L(U, Y), and K_Y is a closed convex subset of Y. Assume that the following
qualification condition holds, where B stands for the open unit ball:

εB ⊂ A K_V − K_Y, for some ε > 0. (6.46)

Proposition 6.25 Let ū ∈ F(6.45) satisfy (6.46), and set ȳ = Aū. Then ū is a solu-
tion of (6.45) iff there exist y^* ∈ N_{K_Y}(ȳ), u^* ∈ ∂F(ū) and q ∈ N_K(ū) such that

P^⊤(A^⊤ y^* + u^* + q) = 0. (6.47)

Proof In view of the qualification condition (6.46), by the subdifferential calculus
rules (Lemma 1.120), we have that ū ∈ S(6.45) iff there exist y^* ∈ N_{K_Y}(ȳ), u^* ∈
∂F(ū) and q₁ ∈ N_{K_V}(ū) such that

A^⊤ y^* + u^* + q₁ = 0. (6.48)

By Lemma 6.23(ii) this is equivalent to A^⊤ y^* + u^* + q ∈ V^⊥, for some q ∈ N_K(ū),
and we conclude by Lemma 6.23(i). 

Remark 6.26 (i) Let K_i, i ∈ I, be nonempty closed convex subsets of U, compatible
with the subspace V. Then K := ∩_{i∈I} K_i is a closed convex subset of U, which (if
nonempty) is obviously compatible with V.
(ii) If in addition I is finite and the following condition for normal cones holds, for
all u ∈ K ∩ V:

N_K(u) = Σ_{i∈I} N_{K_i}(u), (6.49)

then by Proposition 6.25, if ū ∈ F(6.45) satisfies (6.46) and ȳ = Aū, then ū ∈
S(6.45) iff there exist y^* ∈ N_{K_Y}(ȳ), u^* ∈ ∂F(ū) and q_i ∈ N_{K_i}(ū), for all i ∈ I,
such that

P^⊤( A^⊤ y^* + u^* + Σ_{i∈I} q_i ) = 0. (6.50)

Example 6.27 (Product structure) In the applications to stochastic programming,
we have a discrete set of times T := {0, . . . , T}, and (note that the index of control
variables runs from 0 to T − 1, and that of state variables from 1 to T):

U = Π_{t=0}^{T−1} U_t; K = Π_{t=0}^{T−1} K_t; Y = Π_{t=1}^T Y_t; K_Y = Π_{t=1}^T K_t^Y, (6.51)

where K_t is a nonempty, closed convex subset of a Banach space U_t, K_t^Y is a
nonempty, closed convex subset of a Banach space Y_t, and P_t ∈ L(U_t) is a projection
onto a closed subspace V_t of U_t. We may write

y_τ[u] = Σ_{t=0}^{T−1} A_{τt} u_t, τ = 1, . . . , T, where A_{τt} ∈ L(U_t, Y_τ). (6.52)

If the qualification condition (6.46) holds, a solution ū will be characterized by the
existence of

u^* ∈ ∂F(ū); y_t^* ∈ N_{K_t^Y}(ȳ_t), t = 1, . . . , T; q_t ∈ N_{K_t}(ū_t), t = 0, . . . , T − 1, (6.53)

such that

P_t^⊤( Σ_{τ=1}^T A_{τt}^⊤ y_τ^* + u_t^* + q_t ) = 0, t = 0, . . . , T − 1. (6.54)

Remark 6.28 By Lemma 6.23(ii), (6.54) is equivalent to

Σ_{τ=1}^T A_{τt}^⊤ y_τ^* + u_t^* + N_{K_t∩V_t}(ū_t) ∋ 0, t = 0, . . . , T − 1. (6.55)

So (by the subdifferential calculus rule for a sum) it is also equivalent to the fact that
ū is a solution of the problem

min_u F(u) + Σ_{t=0}^{T−1} Σ_{τ=1}^T ⟨y_τ^*, A_{τt} u_t⟩; u_t ∈ K_t ∩ V_t, t = 0, . . . , T − 1. (6.56)

6.1.8 Compatibility with Measurability Constraints

We apply the results of the previous section in the case of measurability constraints,
i.e., (Ω, F , μ) is a probability space, and G is a σ -algebra included in F . For
some s ∈ [1, ∞], we assume that U = L s (F )m and V = L s (G )m . We recall that Ps
denotes the conditional expectation operator in L s (F )m .

Definition 6.29 Let K be a closed convex subset of L s (F )m , for some s ∈ [1, ∞].
We say that K is compatible with G if PG K ⊂ K , i.e., if any x ∈ K is such that
PG x ∈ K .

Proposition 6.30 Let ū ∈ F(6.45) satisfy the qualification condition (6.46); set ȳ = A ū. Then ū ∈ S(6.45) iff there exist y^∗ ∈ N_{K^Y}(ȳ), u^∗ ∈ ∂F(ū) and q ∈ N_K(ū) such that P_s(A^∗ y^∗ + u^∗ + q) = 0, or equivalently,

E[A^∗ y^∗ + u^∗ + q | G ] = 0.   (6.57)

Proof Immediate consequence of Proposition 6.25, Lemma 6.5(ii) and Definition 6.7. □

We next present some examples of compatible constraints.
Definition 6.31 Let K be a closed convex subset of L s (F )m , for some s ∈ [1, ∞].
(i) We say that K defines a Jensen type constraint if, for some proper l.s.c. convex
function ϕ over Rm , we have that

K = {x ∈ L s (F )m ; ϕ(x(ω)) ≤ 0 a.s.}. (6.58)

(ii) We say that K defines an integral Jensen type constraint if, for some proper l.s.c.
convex function ϕ over Rm , we have that

K = {x ∈ L s (F )m ; Eϕ(x(·)) ≤ 0}. (6.59)

(iii) We say that K defines a local constant constraint if there exists a nonempty
closed convex subset K of Rm such that
K = {x ∈ L s (F )m ; x(ω) ∈ K a.e.}. (6.60)

Clearly a local constant constraint is a special case of a Jensen type constraint with ϕ = I_K.
Lemma 6.32 A Jensen (resp. integral Jensen) type constraint is compatible with G
measurability.
Proof Immediate consequence of the Jensen and integral Jensen inequalities (6.12)
and (6.14). □
We next give some generalizations of the previous examples.
Definition 6.33 Consider a measurable function ϕ : Ω × Rm → R ∪ {+∞} of the
form
ϕ(ω, u) := sup{ai (ω) · u + bi (ω), i ∈ I }, (6.61)

where I is a countable set and the (ai , bi )i∈I are F -measurable and essentially
bounded. We say that ϕ is an F -adapted function, and that it is G -adapted if in
addition any (ai , bi ), for i ∈ I , is G -measurable.
Definition 6.34 Let K be a nonempty, closed convex subset of L s (F )m , for some
s ∈ [1, ∞].
(i) We say that K defines a generalized Jensen type constraint if, for some function
ϕ satisfying (6.61), G -adapted, we have that

K = {u ∈ L s (F )m ; ϕ(ω, u(ω)) ≤ 0 a.e.}. (6.62)

(ii) We say that K defines a generalized integral Jensen type constraint if, for some
function ϕ satisfying (6.61), G -adapted, we have that

K = {u ∈ L s (F )m ; Eϕ(·, u(·)) ≤ 0}. (6.63)

Lemma 6.35 All constraints of the previous type are G-compatible.
Proof It is enough to discuss case (i). If u ∈ K, then for all i ∈ I, g_i(ω) := a_i(ω) · u(ω) + b_i(ω) ≤ 0. Since the conditional expectation is nondecreasing, and a_i, b_i are G-measurable, we deduce that

a_i(ω) · E[u|G](ω) + b_i(ω) = E[g_i |G](ω) ≤ 0.   (6.64)

The conclusion follows by taking the supremum over i ∈ I. □
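The compatibility mechanism of Lemma 6.32 can be checked numerically. Below is a minimal sketch (not from the book) on a finite probability space: the conditional expectation w.r.t. a σ-algebra generated by a partition reduces to per-cell averaging, and for a Jensen-type set K = {x : ϕ(x(ω)) ≤ 0 a.s.} with ϕ convex, averaging cannot leave K. All data (the partition, the probabilities, and ϕ) are made up for the illustration.

```python
def cond_exp(x, partition, prob):
    """E[x | G] on a finite space; G is generated by `partition`
    (a list of lists of atom indices), `prob` holds atom probabilities."""
    y = [0.0] * len(x)
    for cell in partition:
        mass = sum(prob[w] for w in cell)
        avg = sum(prob[w] * x[w] for w in cell) / mass
        for w in cell:
            y[w] = avg
    return y

phi = lambda u: u * u - 1.0           # convex; K = {x : |x(w)| <= 1 a.s.}

prob = [0.1, 0.2, 0.3, 0.4]
partition = [[0, 1], [2, 3]]          # the two cells generating G
x = [-1.0, 0.5, 1.0, -0.25]           # satisfies phi(x(w)) <= 0 everywhere

y = cond_exp(x, partition, prob)
assert all(phi(v) <= 1e-12 for v in y)   # E[x|G] still lies in K
```

The same averaging argument is exactly what the proof of Lemma 6.35 exploits, with coefficients a_i, b_i constant on each cell of the partition.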

6.1.9 No Recourse

The problem without recourse is a particular case of the previous theory, when
G = {∅, Ω} is the trivial σ -algebra. Then the conditional expectation in L s (F )
coincides with the expectation when s ∈ [1, ∞). If u^∗ ∈ L^∞(F)^m, its conditional expectation, denoted by Eu^∗, is the element of R^m defined by

(Eu^∗)_i = E u_i^∗ = ⟨u_i^∗ , 1⟩.   (6.65)

A very simple example illustrates the fact that, in the presence of constraints to be
satisfied a.e., the multipliers in the dual of L ∞ typically have singular parts.

Example 6.36 Let u ∈ R_+ represent a number of items to be ordered at price p_0, and sold at price p_1 > p_0. The stochastic demand is ω, with uniform law in Ω = [d_m, d_M] with 0 < d_m < d_M. However, all bought items must be sold. The optimal decision is therefore ū = d_m. The mathematical formulation of the optimization problem is,
setting p := p_1 − p_0:

Min_{u≥0} −p u ;  y[u](ω) := u − ω ≤ 0 a.s.   (6.66)

Set ȳ(ω) := ū − ω. Taking Y = L ∞ (Ω) as constraint space, and observing that the
constraint is qualified, we obtain the existence of a multiplier λ such that

λ ∈ N_{Y_−}(ȳ);  −p + ⟨λ, 1⟩ = 0.   (6.67)

For any ε > 0 and y ∈ Y with zero value on (d_m, d_m + ε), there exists a ρ > 0 such that ȳ ± ρy ∈ Y_−. Since λ ∈ N_{Y_−}(ȳ), and so ⟨λ, ȳ⟩ = 0, it follows that ⟨λ, y⟩ = 0. We have proved that λ is equal to its singular part; note that it is nonzero in view of (6.67), since p > 0.

6.2 Dynamic Stochastic Programming

6.2.1 Dynamic Uncertainty

Random variables such as prices, temperatures, etc. that depend on time are modelled as time series, say y_t ∈ R^n with t ∈ Z. Quite often the y_t are not independent variables, and we can express them as functions of past values:

yt = Ψ (yt−1 , . . . , yt−q ) + Φ(yt−1 , . . . , yt−q )et , (6.68)

where the random variables et ∈ Rm , called innovations, are “white noise”, i.e., i.i.d.
with zero mean and unit variance. A simple example is the one of autoregressive (AR)
models
yt = a1 yt−1 + · · · + aq yt−q + Φ̂et , (6.69)
where the a_i are n × n matrices and Φ̂ is a given matrix; this model of order q is also called ARq. Then the vector Y_t := (y_t, y_{t−1}, . . . , y_{t−q+1}) has the first-order dynamics

          ⎛ a_1  a_2  · · ·  a_{q−1}  a_q ⎞         ⎛ Φ̂ ⎞
          ⎜  1    0                    0  ⎟         ⎜ 0 ⎟
Y_{t+1} = ⎜           ⋱                   ⎟ Y_t  +  ⎜ ⋮ ⎟ e_t .   (6.70)
          ⎝  0   · · ·     1           0  ⎠         ⎝ 0 ⎠

So this type of model is suitable for our framework. For more on AR models and
their nonlinear extensions, we refer to [55].
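The reduction of an ARq model to the first-order dynamics (6.70) is easy to check numerically. The following sketch (with made-up matrices a_i and Φ̂ = I) simulates the AR recursion (6.69) and the companion-form recursion side by side and verifies that they produce the same trajectory; the "1" entries of (6.70) become n × n identity blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q, T = 2, 3, 20
a = [rng.standard_normal((n, n)) * 0.2 for _ in range(q)]   # a_1, ..., a_q
Phi = np.eye(n)                                             # stand-in for the matrix in (6.69)

# simulate the ARq recursion y_s = a_1 y_{s-1} + ... + a_q y_{s-q} + Phi e
y = [rng.standard_normal(n) for _ in range(q)]              # initial values y_0, ..., y_{q-1}
e = [rng.standard_normal(n) for _ in range(T)]
for t in range(T):
    y.append(sum(a[j] @ y[-1 - j] for j in range(q)) + Phi @ e[t])

# companion form (6.70): first block row (a_1 ... a_q), identity blocks below
C = np.zeros((n * q, n * q))
for j in range(q):
    C[:n, j * n:(j + 1) * n] = a[j]
for j in range(q - 1):
    C[(j + 1) * n:(j + 2) * n, j * n:(j + 1) * n] = np.eye(n)
G = np.zeros((n * q, n)); G[:n, :] = Phi

Y = np.concatenate(y[q - 1::-1])        # Y = (y_{q-1}, y_{q-2}, ..., y_0)
for t in range(T):
    Y = C @ Y + G @ e[t]
assert np.allclose(Y[:n], y[-1])        # first block of Y is the current y
```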

6.2.2 Abstract Optimality Conditions

We start with the general setting of an abstract problem in product form of Exam-
ple 6.27. We call u the control, and y the state, and assume that the control to state
mapping is defined by the state equation

yt+1 = At yt + Bt u t + dt , t = 0, . . . , T − 1; y0 ∈ Y0 given, (6.71)

with At ∈ L(Yt , Yt+1 ), Bt ∈ L(Ut , Yt+1 ), dt ∈ Yt+1 , and solution denoted by y[u],
and that the cost function has the following form:

F(u) = J(u, y[u]),  with  J(u, y) := ∑_{t=0}^{T−1} ℓ_t(u_t, y_t) + ϕ(y_T).   (6.72)

Here ℓ_t and ϕ are continuous convex functions over U_t × Y_t, for t = 0 to T − 1, and over Y_T, resp. The linearized state equation is

z t+1 = At z t + Bt vt , t = 0, . . . , T − 1; z 0 = 0. (6.73)

We first give a means to express the subdifferential of F, using the adjoint state (or
costate) approach.

Definition 6.37 Set P := Y_1^∗ × · · · × Y_T^∗ as costate space. Let ū ∈ U have associated state ȳ := y[ū]. The costate p ∈ P (i.e., p_t ∈ Y_t^∗, t = 1 to T ) associated with ū, y^∗ ∈ Y^∗ and w^∗ ∈ Y^∗ (we distinguish these two dual variables since they will play different roles) is defined as the solution of the backward equation (p_t is computed by backward induction)

p_t = y_t^∗ + w_t^∗ + A_t^∗ p_{t+1} ,  t = 1, . . . , T − 1;
p_T = y_T^∗ + w_T^∗ .   (6.74)
We note the useful identity, where (v, z) satisfies the linearized state equation (6.73):

∑_{t=1}^{T} ⟨y_t^∗ + w_t^∗ , z_t⟩ = ⟨p_T , z_T⟩ + ∑_{t=1}^{T−1} ⟨p_t − A_t^∗ p_{t+1} , z_t⟩
                                 = ∑_{t=1}^{T} ⟨p_t , z_t⟩ − ∑_{t=0}^{T−1} ⟨p_{t+1} , A_t z_t⟩
                                 = ∑_{t=1}^{T} ⟨p_t , z_t⟩ + ∑_{t=0}^{T−1} ⟨p_{t+1} , B_t v_t − z_{t+1}⟩
                                 = ∑_{t=0}^{T−1} ⟨B_t^∗ p_{t+1} , v_t⟩ .   (6.75)

Note that (v^∗, y^∗) ∈ U^∗ × Y^∗ belongs to ∂J(ū, ȳ) iff v_0^∗ ∈ ∂_u ℓ_0(ū_0, ȳ_0), (v_t^∗, y_t^∗) ∈ ∂ℓ_t(ū_t, ȳ_t), for t = 1 to T − 1, and y_T^∗ ∈ ∂ϕ(ȳ_T).

Lemma 6.38 We have that u^∗ ∈ ∂F(ū) iff there exists (v^∗, y^∗) ∈ ∂J(ū, ȳ) such that the costate p associated with y^∗ and w^∗ = 0 satisfies

u_t^∗ = v_t^∗ + B_t^∗ p_{t+1} ,  t = 0, . . . , T − 1.   (6.76)

Proof We have that the state satisfies y[u] = Au + d for some linear continuous operator A and some d in an appropriate space. Since F(u) = J(u, y[u]), by the subdifferential calculus rules in Lemma 1.120, we have that u^∗ ∈ ∂F(ū) iff u^∗ = v^∗ + A^∗ y^∗ for some (v^∗, y^∗) ∈ ∂J(ū, ȳ), or equivalently, if

∑_{t=0}^{T−1} ⟨u_t^∗ , v_t⟩ = ∑_{t=0}^{T−1} ⟨v_t^∗ , v_t⟩ + ∑_{t=1}^{T} ⟨y_t^∗ , z_t⟩ .   (6.77)

We conclude by (6.75), where here w_t^∗ = 0 for all t. □
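Lemma 6.38 can be sanity-checked in a smooth finite-dimensional setting (all data below are made up for the illustration): with ℓ_t(u, y) = ½|u|² + ½|y|² and ϕ(y) = ½|y|², subgradients are gradients, and the backward recursion (6.74) with w^∗ = 0 together with formula (6.76) must reproduce the derivative of F computed by finite differences.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, m = 4, 3, 2
A = [rng.standard_normal((n, n)) * 0.3 for _ in range(T)]
B = [rng.standard_normal((n, m)) for _ in range(T)]
d = [rng.standard_normal(n) for _ in range(T)]
y0 = rng.standard_normal(n)

def state(u):                     # y[u] from the state equation (6.71)
    y = [y0]
    for t in range(T):
        y.append(A[t] @ y[t] + B[t] @ u[t] + d[t])
    return y

def F(u):                         # l_t(u, y) = |u|^2/2 + |y|^2/2, phi = |y|^2/2
    y = state(u)
    return sum(0.5 * u[t] @ u[t] + 0.5 * y[t] @ y[t] for t in range(T)) \
        + 0.5 * y[T] @ y[T]

u = [rng.standard_normal(m) for _ in range(T)]
y = state(u)

p = [None] * (T + 1)
p[T] = y[T]                       # = grad phi(y_T)
for t in range(T - 1, 0, -1):     # backward recursion (6.74), w* = 0
    p[t] = y[t] + A[t].T @ p[t + 1]
grad = [u[t] + B[t].T @ p[t + 1] for t in range(T)]   # formula (6.76)

eps, t0, i0 = 1e-6, 2, 1          # central finite-difference check
up = [v.copy() for v in u]; um_ = [v.copy() for v in u]
up[t0][i0] += eps; um_[t0][i0] -= eps
fd = (F(up) - F(um_)) / (2 * eps)
assert abs(fd - grad[t0][i0]) < 1e-6
```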

We are now in a position to state the optimality conditions.

Theorem 6.39 Let ū be feasible, with associated state ȳ. Assume that the qualification condition (6.46) holds, and that the constraints that ū_t belongs to K_t are compatible with the projector P_t, for t = 0 to T − 1. Then ū is a solution of the abstract optimal control problem (6.45) iff there exist y_T^∗ ∈ ∂ϕ(ȳ_T), and

(v_t^∗ , y_t^∗) ∈ ∂ℓ_t(ū_t , ȳ_t),  w_{t+1}^∗ ∈ N_{K_{t+1}^Y}(ȳ_{t+1}),  q_t ∈ N_{K_t}(ū_t),  t = 0, . . . , T − 1,   (6.78)

such that the costate p̄ ∈ P, a solution of (6.74), satisfies

P_t( v_t^∗ + B_t^∗ p̄_{t+1} + q_t ) = 0,  t = 0, . . . , T − 1.   (6.79)
Proof Immediate consequence of Example 6.27 and Lemma 6.38. □
Remark 6.40 Similarly to Remark 6.28 we can observe that (6.79) is equivalent to the fact that for t = 0 to T − 1, ū_t minimizes u ↦ ℓ_t(u, ȳ_t) + ⟨p̄_{t+1} , B_t u⟩ over K_t.

6.2.3 The Growing Information Framework

We now particularize the previous setting by assuming that the spaces Ut and Yt
do not depend on t, so we may denote them as U0 , Y0 , and that, if y = y[u] with
u t ∈ Vt for all t, then yt belongs to some closed subspace Z t of Y0 , with which is
associated a projector Q t . We assume that the operators Pt ∈ L(U0 ) and Q t ∈ L(Y0 )
(which in our stochastic programming applications correspond to some conditional
expectations) satisfy P_T = I, Q_T = I as well as the following identities:

P_t = P_t P_τ = P_τ P_t ;  Q_t = Q_t Q_τ = Q_τ Q_t ,  t = 0, . . . , τ − 1,   (6.80)

and
Q_{t+1} A_t^∗ = A_t^∗ Q_{t+1} ,  t = 0, . . . , T − 1,   (6.81)

P_{t+1} B_t^∗ = B_t^∗ Q_{t+1} ,  t = 0, . . . , T − 2.   (6.82)

Note that (6.80) implies that the sequences of spaces V_t and Z_t are nondecreasing. We introduce the adapted costate

p̄_t = Q_t p_t ,  t = 1, . . . , T.   (6.83)

Remark 6.41 By Remark 6.8, the transposes of conditional expectations are conditional expectations (in a generalized sense for L^∞ norms), so that (at least in the case of L^s spaces for s ∈ [1, ∞)), in the stochastic optimization applications, p̄_t will be adapted. This justifies the terminology of adapted costate.
Lemma 6.42 Under the assumptions of Lemma 6.38, if (6.80)–(6.82) hold, then the following adapted costate equation holds

p̄_t = Q_t( y_t^∗ + w_t^∗ + A_t^∗ p̄_{t+1} ),  t = 1, . . . , T − 1;
p̄_T = y_T^∗ + w_T^∗ ,   (6.84)

as well as (6.78) and

P_t( v_t^∗ + B_t^∗ p̄_{t+1} + q_t ) = 0,  t = 0, . . . , T − 1.   (6.85)

Proof Applying (6.80)–(6.81) several times, we have that

Q_t A_t^∗ p_{t+1} = Q_t Q_{t+1} A_t^∗ p_{t+1} = Q_t A_t^∗ p̄_{t+1} .   (6.86)

Multiplying by Q_t on both sides of the costate equation (6.74), we get (6.84). Now (6.85) holds for t = T − 1 since p̄_T = p_T. By (6.81)–(6.82),

P_t B_t^∗ = P_t P_{t+1} B_t^∗ = P_t B_t^∗ Q_{t+1} ,   (6.87)

so we get P_t B_t^∗ p_{t+1} = P_t B_t^∗ p̄_{t+1}; (6.85) then follows from (6.79). □

Remark 6.43 Similarly to Remark 6.40 we observe that (6.85) is equivalent to the fact that for t = 0 to T − 1, ū_t minimizes u ↦ ℓ_t(u, ȳ_t) + ⟨p̄_{t+1} , B_t u⟩ over K_t.

6.2.4 The Standard Full Information Framework

We now apply the previous ‘abstract’ framework to stochastic programming prob-


lems. We consider a nondecreasing sequence F0 , . . . , FT of σ -algebras, included
in F , such that FT = F , called a filtration. Roughly speaking, Ft represents the
information available at time t, when taking the decision u t .

Definition 6.44 We say that a measurable mapping (with values in a Banach space)
u = (u 0 , . . . , u T −1 ) is adapted to the filtration if u t is Ft measurable for t = 0 to
T − 1.

We also call the fact that u needs to be adapted a nonanticipativity constraint. In the sequel we assume that it holds. The function spaces are, for s ∈ [1, ∞]:

Ut := L s (F )m ; Vt := L s (Ft )m ; Yt := L s (F )n ; Z t := L s (Ft )n . (6.88)

We also assume that, for t = 0 to T − 1, At ∈ L(Y0 , Y0 ) and Bt ∈ L(U0 , Y0 ) satisfy

At ∈ L(Z t , Z t+1 ); Bt ∈ L(Vt , Z t+1 ); dt ∈ Z t+1 , t = 0, . . . , T − 1. (6.89)

Later we will see examples of operators A_t and B_t. The state equation is

y_{t+1}(ω) = (A_t y_t)(ω) + (B_t u_t)(ω) + d_t(ω) a.s.,  t = 0, . . . , T − 1;  with y_0 ∈ Z_0 given,   (6.90)

and we have indeed that y_t ∈ Z_t, for t = 0 to T. We assume next that the cost function
is an expectation with the property of additivity w.r.t. time, i.e.,

ℓ_t(u_t, y_t) = E ℓ̂_t(ω, u_t(ω), y_t(ω)),  t = 0, . . . , T − 1;
ϕ(y_T) = E ϕ̂(ω, y_T(ω)),   (6.91)

where the functions ℓ̂_t(ω, ·, ·) and ϕ̂(ω, ·) are a.s. convex functions. Under technical conditions seen in Sect. 3.2 of Chap. 3, we have that, for t = 0 to T − 1:
∂ℓ_t(u_t, y_t) = {(v_t^∗, y_t^∗) ∈ U_t^∗ × Y_t^∗ ; (v_t^∗(ω), y_t^∗(ω)) ∈ ∂ℓ̂_t(ω, u_t(ω), y_t(ω)) a.s.},   (6.92)
∂ϕ(y_T) = {y_T^∗ ∈ Y_T^∗ ; y_T^∗(ω) ∈ ∂ϕ̂(ω, y_T(ω)) a.s.}.   (6.93)

We may denote the conditional expectation over Ft by Et . Noticing that the operators
Pt and Q t as well as their adjoints are conditional expectations over Ft , we may
write the adapted costate equation (6.84) in the following form:

p̄_t = E_t( y_t^∗ + w_t^∗ + A_t^∗ p̄_{t+1} ),  t = 1, . . . , T − 1;
p̄_T = y_T^∗ + w_T^∗ ,   (6.94)

and the optimality condition (6.85) in the form

E_t( v_t^∗ + B_t^∗ p̄_{t+1} + q_t ) = 0,  t = 0, . . . , T − 1.   (6.95)

6.2.5 Independent Noises

We assume here, as is often the case in applications, that we can write ω = (ω_0, . . . , ω_T) with the ω_t independent variables, each over some probability space (Ω̂_t, F̂_t, P_t), and the decision u_t is a function of (the past information) (ω_0, . . . , ω_t). Then the filtration is such that F_t is the set of measurable functions of (ω_0, . . . , ω_t). We have seen in Example 6.13 how to compute conditional expectations in the case of independent noises. So we can write for t = 0 to T − 1:

u_t = u_t(ω_0, . . . , ω_t),  y_{t+1} = y_{t+1}(ω_0, . . . , ω_{t+1}),  p_{t+1} = p_{t+1}(ω_0, . . . , ω_{t+1}),   (6.96)

etc., and the conditional expectation of Φ, F_{t+1}-measurable, w.r.t. F_t is a.s.

(E_t Φ)(ω_0, . . . , ω_t) = ∫_{Ω̂_{t+1}} Φ(ω_0, . . . , ω_{t+1}) dP_{t+1}(ω_{t+1}).   (6.97)

Remark 6.45 In practice it is not easy to deal with functions of several variables.
Storing them, or computing conditional expectations becomes very expensive when
the dimension increases. The optimality conditions are nevertheless of interest for
studying theoretical properties (such as sensitivity analysis).
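On a finite sample space, (6.97) is just an average over the law of the next noise. A minimal sketch (the laws of ω_0, ω_1 and the function Φ below are made up) also checks the tower property E[E_0 Φ] = E[Φ]:

```python
from itertools import product

vals0, prob0 = [0, 1], [0.5, 0.5]             # law of omega_0
vals1, prob1 = [-1, 0, 2], [0.2, 0.5, 0.3]    # law of omega_1
Phi = lambda w0, w1: (w0 + 1) * w1 ** 2       # an F_1-measurable random variable

def E0(Phi, w0):                               # conditional expectation (6.97)
    return sum(p * Phi(w0, w1) for w1, p in zip(vals1, prob1))

# tower property: E[E_0 Phi] = E[Phi]
lhs = sum(p0 * E0(Phi, w0) for w0, p0 in zip(vals0, prob0))
rhs = sum(p0 * p1 * Phi(w0, w1)
          for (w0, p0), (w1, p1) in product(zip(vals0, prob0), zip(vals1, prob1)))
assert abs(lhs - rhs) < 1e-12
```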

6.2.6 Elementary Examples

We may define operators A_t and B_t in the following way. If Â_t is an n × n matrix, set

(A_t y_t)(ω) := Â_t y_t(ω).   (6.98)
More generally, if Â_t is an n × n matrix that is F_{t+1}-measurable and essentially bounded, set

(A_t y_t)(ω) := Â_t(ω) y_t(ω).   (6.99)

This case of a local operator is quite common in practice. Assuming that B_t has the same structure and identifying the operators A_t and Â_t, B_t and B̂_t, we can express the optimality conditions in the following form:

y_{t+1}(ω) = Â_t(ω) y_t(ω) + B̂_t(ω) u_t(ω) + d_t(ω) a.s.,  t = 0, . . . , T − 1;  y_0 ∈ Z_0 given,   (6.100)

p̄_t = E_t( y_t^∗ + w_t^∗ + Â_t^∗ p̄_{t+1} ),  t = 1, . . . , T − 1;  p̄_T = y_T^∗ + w_T^∗ .   (6.101)

E_t( v_t^∗ + B̂_t^∗ p̄_{t+1} + q_t ) = 0,  t = 0, . . . , T − 1.   (6.102)

6.2.7 Application to the Turbining Problem

6.2.7.1 Framework

Let yt ∈ [ym , y M ] denote the amount of water at a dam at the beginning of day t.
We can turbine an amount u t ∈ [u m , u M ], and spill an amount st ≥ 0. The natural
increment of water is bt ≥ 0. So the dynamics is

yt+1 = yt + bt − u t − st , t = 0, . . . , T − 1. (6.103)

Each day we have to fix u_t and s_t. So we have the constraints

yt+1 ∈ [ym , y M ]; u t ∈ [u m , u M ]; st ≥ 0, t = 0, . . . , T − 1. (6.104)

The price of the electricity market is c_t ≥ 0, t = 0 to T − 1. The total revenue, to be maximized, is

∑_{t=0}^{T−1} c_t u_t + C_T y_T ,   (6.105)

where C T ≥ 0 is an estimation of the water price at final time.

6.2.7.2 A Deterministic Model

In a deterministic version of this problem, where bt and ct are known for all time t,
the problem of maximizing the revenue can be written as
Min_{u,s} −∑_{t=0}^{T−1} c_t u_t − C_T y_T  s.t. (6.103)–(6.104).   (6.106)

Denoting by pt ∈ R the costate we obtain the costate equation

pt = wt∗ + pt+1 , t = 1, . . . , T − 1; pT = wT∗ − C T , (6.107)

where
w_t^∗ ≤ 0 if y_t = y_m ;  w_t^∗ = 0 if y_t ∈ (y_m, y_M) ;  w_t^∗ ≥ 0 if y_t = y_M .   (6.108)

Eliminating w^∗ we can also write

p_t ≤ p_{t+1} if y_t = y_m ;    p_T ≤ −C_T if y_T = y_m ;
p_t = p_{t+1} if y_t ∈ (y_m, y_M) ;    p_T = −C_T if y_T ∈ (y_m, y_M) ;   (6.109)
p_t ≥ p_{t+1} if y_t = y_M ;    p_T ≥ −C_T if y_T = y_M .

Similarly to Remark 6.43 we can observe that for t = 0 to T − 1, ū_t minimizes v ↦ −(c_t + p_t)v over [u_m, u_M], and therefore, setting p̂_t := −p_t:

u_t = u_m if p̂_t < c_t ;  u_t = u_M if p̂_t > c_t .   (6.110)

We can interpret p̂t as the marginal value of storing, called in this context the water
price. If the market price ct is strictly smaller (resp. strictly greater) than the water
value, then one should store (resp. turbine) as much as possible. Observe that the
water value decreases (resp. increases) when the storage attains the minimum (resp.
maximum) value.
For the spilling variable s the policy is to take st = 0 as long as the water value is
positive, and st ≥ 0 otherwise (with a value compatible with the constraint yt+1 ≤
y M ).
This is in agreement with the following observation. If during some time interval
the inflows are important, it may be worth turbining even if the market price is low.
So the water price should be small, and possibly become greater after.

Exercise 6.46 If ym = −∞ and y M = +∞, show that the optimal strategies are to
take u t = u m if ct < C T , and u t = u M if ct > C T , and u t ∈ [u m , u M ] otherwise.
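A quick brute-force check of this exercise (with made-up data): since the revenue ∑_t c_t u_t + C_T y_T with y_T = y_0 + ∑_t (b_t − u_t) is linear in u, each u_t enters with the coefficient c_t − C_T, so it suffices to compare bang-bang controls, and the maximizer is the threshold policy of the exercise.

```python
from itertools import product

T, um, uM, CT, y0 = 4, 0.0, 2.0, 1.0, 5.0
b = [1.0, 0.5, 2.0, 0.0]          # inflows (made up)
c = [0.4, 1.7, 0.9, 2.5]          # market prices, all different from CT

def revenue(u):                    # objective of the unconstrained-storage problem
    yT = y0 + sum(b[t] - u[t] for t in range(T))
    return sum(c[t] * u[t] for t in range(T)) + CT * yT

# linear objective: it is enough to search among bang-bang controls
best = max(product([um, uM], repeat=T), key=revenue)
policy = tuple(uM if c[t] > CT else um for t in range(T))
assert best == policy              # threshold rule on the sign of c_t - CT
```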

6.2.7.3 Stochastic Model

We may assume that randomness occurs only in the variables bt and ct . Here we will
assume that
bt is deterministic; ym = 0; y M = +∞, (6.111)
so that no spilling occurs. We also assume that c_t ∈ L^1(F_t) for t = 0 to T − 1, and C_T ∈ L^1(F). We choose the function spaces

V_t = Z_t = L^∞(F_t).   (6.112)

The cost function is

−E( ∑_{t=0}^{T−1} c_t u_t + C_T y_T ).   (6.113)

Also, we have that

K_t := {u ∈ V_t ; u_m ≤ u(ω) ≤ u_M a.s.},  K_t^Y = (Z_t)_+ := {y ∈ Z_t ; y(ω) ≥ 0 a.s.},   (6.114)

and so, by Exercise 1.82, for any y_0 ∈ (Z_t)_+:

N_{K_t^Y}(y_0) = {w^∗ ∈ (Z_t^∗)_− ; ⟨w^∗, y_0⟩ = 0}.   (6.115)

The adapted costate equation is

p̄_t = E_t( w_t^∗ + p̄_{t+1} ),  t = 1, . . . , T − 1;  p̄_T = w_T^∗ − C_T ,   (6.116)

where w_t^∗ ∈ N_{K_t^Y}(ȳ_t), for t = 1 to T, and

E_t( −c_t − p̄_{t+1} + q_t ) = 0;  q_t ∈ N_{K_t}(ū_t);  t = 0, . . . , T − 1.   (6.117)

Since the conditional expectation is a nondecreasing operator, by (6.116), the adapted costate is itself nondecreasing. Set

c̄_t := −E_t(c_t + p̄_{t+1}).   (6.118)

The relation (6.117) implies

⟨c̄_t , v − ū_t⟩ ≥ 0,  for all v ∈ K_t.   (6.119)

6.3 Notes

The discussion of conditional expectation is classical, see e.g. Malliavin [77], and
Dellacherie and Meyer [40]. For more on first-order optimality conditions, see
Rockafellar and Wets [103, 104], Wets [124] and Dallagi [37] for the numerical
aspects.
Chapter 7
Markov Decision Processes

Summary This chapter considers the problem of minimizing the expectation of a reward for a controlled Markov chain process, either with a finite horizon, or an infinite one for which the reward has discounted values, including the cases of exit times and stopping decisions. The value and policy (Howard) iterations are compared. Extensions of these results are provided for problems with expectation constraints, partial observation, and for the ergodic case, limit in some sense of large horizon problems with undiscounted cost.

7.1 Controlled Markov Chains

7.1.1 Markov Chains

7.1.1.1 The Probability Setting

We consider a state space S, equal to either {1, . . . , m}, with m ∈ N, or to N^∗ = {1, 2, . . .}, and a time index k ∈ {0, . . . , N} where N ∈ N^∗ is called the horizon. For k ∈ {0, . . . , N}, we denote by x^ℓ a process (i.e. a random function of time) with values in S, for ℓ = k (the starting time of the process) to N.
A Markov chain is a process whose transition from state i at time ℓ to state j at time ℓ + 1 (for ℓ = k to N − 1) happens with a given probability M_{ij}^ℓ, independently of the values taken by the process for times less than ℓ. Obviously M_{ij}^ℓ ≥ 0 and ∑_{j∈S} M_{ij}^ℓ = 1. The Markov chain framework can be put in the setting of probability
spaces in the following way. Let X k be the class of processes starting at time k, and

X_i^k := {x ∈ X^k ; x^k = i},  for all i ∈ S.   (7.1)

Any element of X_i^k has the representation x = (i, x^{k+1}, . . . , x^N). Let the set of events (denoted by Ω in probability theory) be X_i^k, with σ-field P(X_i^k). We denote by P_i^k the probability defined as follows. Since X_i^k is a countable set, the probability of A ⊂ X_i^k is the sum of the probabilities of the elements of A, the latter being defined by

P_i^k(x) := M_{i x^{k+1}}^k · · · M_{x^{N−1} x^N}^{N−1} = ∏_{ℓ=k}^{N−1} M_{x^ℓ x^{ℓ+1}}^ℓ ,  for all x in X_i^k.   (7.2)

© Springer Nature Switzerland AG 2019
J. F. Bonnans, Convex and Stochastic Optimization, Universitext, https://doi.org/10.1007/978-3-030-14977-2_7

In the sequel we will often use the more intuitive notation

P((x^k, . . . , x^N) | x^k = i) := P_i^k(x),   (7.3)

which remains meaningful for a process starting at a time possibly less than k.
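A small numerical sketch of (7.2) (with a made-up, time-independent transition matrix): path probabilities are products of the transition probabilities along the path, and they sum to 1 over all paths issued from a given state.

```python
from itertools import product

M = [[0.9, 0.1],
     [0.4, 0.6]]                 # a 2-state transition matrix (rows sum to 1)

def path_prob(path):             # P_i^k(x) = product of M[x^l][x^{l+1}]
    p = 1.0
    for a, b in zip(path, path[1:]):
        p *= M[a][b]
    return p

N_minus_k = 3                    # number of transitions
i = 0                            # starting state
total = sum(path_prob((i,) + tail)
            for tail in product(range(2), repeat=N_minus_k))
assert abs(total - 1.0) < 1e-12  # probabilities of all paths from i sum to 1
```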
In the next lemma, we check that, given the knowledge of the state at some time ℓ < N, the additional knowledge of past states (for times up to ℓ − 1) is useless for the estimation of x^{ℓ+1} (and so, by induction, for x^j, j > ℓ + 1).

Lemma 7.1 Given times 0 ≤ k < ℓ < N, A ⊂ S^{ℓ−k}, and q ∈ S, set

A_q := {x ∈ X^k ; (x^k, . . . , x^{ℓ−1}) ∈ A; x^ℓ = q}.   (7.4)

Assume that A_q has a positive probability. Then

P_i^k(x^{ℓ+1} = j | x ∈ A_q) = P_i^k(x^{ℓ+1} = j | x^ℓ = q) = M_{qj}^ℓ .   (7.5)

Proof We have that

P(x^{ℓ+1} = j and x ∈ A_q) = ∑_{(x^k,...,x^{ℓ−1})∈A} M_{qj}^ℓ M_{x^k x^{k+1}}^k · · · M_{x^{ℓ−1} q}^{ℓ−1} = M_{qj}^ℓ P(x ∈ A_q).   (7.6)

Therefore by the Bayes rule

P(x^{ℓ+1} = j | x ∈ A_q) = P(x^{ℓ+1} = j and x ∈ A_q) / P(x ∈ A_q) = M_{qj}^ℓ ,   (7.7)

as was to be shown. □

7.1.1.2 Transition Operators

We can view M^k = {M_{ij}^k}_{(i,j)∈S×S} as a possibly ‘infinite matrix’ with a (nonnegative) element M_{ij}^k in row i and column j, the sum over each row being equal to 1. We
call such a ‘matrix’ having these two properties a transition operator. If S is finite,
a transition operator M reduces to a stochastic matrix (a matrix with nonnegative
elements whose sum over each row is 1).
We have the following calculus rules that extend the usual matrix calculus: prod-
ucts between transition operators, and the product of a transition operator with a
horizontal vector on the left, or a vertical vector on the right, under appropriate
conditions on these vectors.
More precisely, let ℓ^1 and ℓ^∞, respectively, denote the spaces of summable and bounded sequences, whose elements are represented as horizontal (for ℓ^1) and vertical (for ℓ^∞) vectors. These spaces are resp. endowed with the norms

‖π‖_1 := ∑_{i∈S} |π_i| ;  ‖v‖_∞ := sup_{i∈S} |v_i| .   (7.8)

We recall that ℓ^∞ is the topological dual (the set of continuous linear forms) of ℓ^1. We denote the duality pairing by

πv := ∑_{i∈S} π_i v_i ,  for all π ∈ ℓ^1 and v ∈ ℓ^∞ .   (7.9)

This is in accordance with the rules for products of vectors in the case of a finite state space. Let π ∈ ℓ^1, v ∈ ℓ^∞, and M be a transition operator. We define the products πM ∈ ℓ^1 and Mv ∈ ℓ^∞ by

(πM)_j := ∑_{i∈S} π_i M_{ij} ;  (Mv)_i := ∑_{j∈S} M_{ij} v_j ,  for all i, j in S.   (7.10)

We easily check that π ↦ πM and v ↦ Mv are non-expansive, i.e.,

‖πM‖_1 ≤ ‖π‖_1 ;  ‖Mv‖_∞ ≤ ‖v‖_∞ .   (7.11)

In addition, for all v ∈ ℓ^∞:

inf_i v_i ≤ inf_i (Mv)_i ≤ sup_i (Mv)_i ≤ sup_i v_i .   (7.12)

If M^1 and M^2 are two transition operators, their product M^1 M^2 is defined as

(M^1 M^2)_{ij} := ∑_{q∈S} M_{iq}^1 M_{qj}^2 ,  for all i, j in S.   (7.13)

It is easy to check that the product of two transition operators is a transition operator.
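The calculus rules (7.10)–(7.13) are easy to check numerically on a finite state space. The sketch below (random made-up data) verifies that πM is again a probability law, that Mv satisfies the bounds (7.12), and that the product of two stochastic matrices is stochastic.

```python
import numpy as np

rng = np.random.default_rng(2)

def stochastic(n):
    """A random n x n stochastic matrix (rows sum to 1)."""
    M = rng.random((n, n))
    return M / M.sum(axis=1, keepdims=True)

n = 5
M1, M2 = stochastic(n), stochastic(n)
pi = rng.random(n); pi /= pi.sum()        # a probability law on S
v = rng.standard_normal(n)                # a bounded value vector

# pi M is a probability law (left product of (7.10))
assert np.all(pi @ M1 >= 0) and abs((pi @ M1).sum() - 1.0) < 1e-12
# the bounds (7.12) for the right product
assert v.min() - 1e-12 <= (M1 @ v).min() and (M1 @ v).max() <= v.max() + 1e-12
# the product (7.13) of two transition operators is a transition operator
P = M1 @ M2
assert np.allclose(P.sum(axis=1), 1.0) and np.all(P >= 0)
```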
We interpret

P := {π ∈ ℓ^1 ; π_i ≥ 0, i ∈ S ; ∑_{i∈S} π_i = 1}   (7.14)

as a set of probability laws over S, and ℓ^∞ as a values space. The (left) product of a probability law π with a transition operator is a probability law, and we can interpret the pairing (7.9) as the expectation of v under the probability law π. One can interpret
the ith row of M^k as the probability law of x^{k+1}, knowing that the process x satisfies x^k = i ∈ S.
Let x ∈ X^k, the class of processes starting at time k. It may happen that the initial state x^k is unknown, but has a known probability law π^k; we then write x^k ∼ π^k. Then we may define the event set as X^k and the probability of x ∈ X^k as

P_{π^k}^k(x) := π_{x^k}^k M_{x^k x^{k+1}}^k · · · M_{x^{N−1} x^N}^{N−1} .   (7.15)

We note that
P_{π^k}^k(x) = π_{x^k}^k P_{x^k}^k(x),   (7.16)

and that for ℓ > k, the probability law of x^ℓ, i.e. π^ℓ := P(x^ℓ | x^k ∼ π^k), satisfies the forward Kolmogorov equation

π_j^{ℓ+1} = ∑_i π_i^ℓ P[x^{ℓ+1} = j | x^ℓ = i] = ∑_i π_i^ℓ M_{ij}^ℓ ,  for ℓ = k to N − 1,   (7.17)

or equivalently

π^{ℓ+1} = π^ℓ M^ℓ = π^k ∏_{q=k}^{ℓ} M^q ,  for ℓ = k to N − 1.   (7.18)
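A small check of the forward Kolmogorov equation (with made-up data): propagating the law by a left product with the transition matrix agrees with summing path probabilities.

```python
import numpy as np
from itertools import product

M = np.array([[0.7, 0.3],
              [0.2, 0.8]])
pi0 = np.array([0.6, 0.4])      # law of x^0
L = 3
piL = pi0.copy()
for _ in range(L):              # forward Kolmogorov steps (7.18)
    piL = piL @ M

# compare with the brute-force law of x^L obtained from path probabilities
brute = np.zeros(2)
for path in product(range(2), repeat=L + 1):
    p = pi0[path[0]]
    for a, b in zip(path, path[1:]):
        p *= M[a, b]
    brute[path[-1]] += p
assert np.allclose(piL, brute)
```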

7.1.1.3 Cost Processes

We define a Markov cost process by associating with a Markov chain process {x^k} the cost function {c_i^k}, i ∈ S, k ∈ N. We assume that c^k := {c_i^k}_{i∈S} belongs to ℓ^∞, which means that the costs are uniformly bounded in space. We represent c^k as a vertical vector. Recalling the notion of conditional expectation for a given value of a random variable (Remark 6.3), define the value associated with c and the Markov chain starting at time k with state i and horizon N ≥ k as

V_i^k := E[ ∑_{ℓ=k}^{N} c_{x^ℓ}^ℓ | x^k = i ].   (7.19)

The above conditional expectation is well-defined, since c is bounded. The probabilities π^ℓ being defined by (7.18), we have that

V_i^k = c_i^k + ∑_{ℓ=k+1}^{N} π^ℓ c^ℓ .   (7.20)

Denote by e_j the probability concentrated at state j, i.e., the element of ℓ^1 with all components equal to 0, except for the jth one, equal to 1.
Proposition 7.2 For all k = 0, . . . , N, the value function V^k belongs to ℓ^∞, and is the solution of the backward Kolmogorov equation

V^k = c^k + M^k V^{k+1} ,  k = 0, . . . , N − 1;
V^N = c^N .   (7.21)

Proof That V^N = c^N is obvious. Now let k ∈ {0, . . . , N − 1}. Then

V_i^k = c_i^k + ∑_{j∈S} P[x^{k+1} = j | x^k = i] E[ ∑_{ℓ=k+1}^{N} c_{x^ℓ}^ℓ | x^{k+1} = j ].   (7.22)

Now P[x^{k+1} = j | x^k = i] = M_{ij}^k and E[ ∑_{ℓ=k+1}^{N} c_{x^ℓ}^ℓ | x^{k+1} = j ] = V_j^{k+1}. The conclusion follows. □
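Proposition 7.2 can be illustrated on a tiny chain (made-up data): the backward recursion (7.21) reproduces the expected cumulated cost obtained by enumerating all paths.

```python
import numpy as np
from itertools import product

M = np.array([[0.5, 0.5],
              [0.1, 0.9]])
c = [np.array([1.0, 0.0]), np.array([0.0, 2.0]), np.array([3.0, 1.0])]
N = 2                                       # times k = 0, 1, 2

V = c[N].copy()
for k in range(N - 1, -1, -1):              # backward Kolmogorov (7.21)
    V = c[k] + M @ V

i = 0                                       # start state
brute = 0.0
for path in product(range(2), repeat=N):    # enumerate (x^1, ..., x^N)
    p, cum, s = 1.0, c[0][i], i
    for k, t in enumerate(path):
        p *= M[s, t]; s = t
        cum += c[k + 1][s]
    brute += p * cum
assert abs(V[i] - brute) < 1e-12
```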

7.1.1.4 Discounted Problems with Infinite Horizon

In the case of an infinite horizon, the probability space can be defined by Kol-
mogorov’s extension of finite horizon probabilities, see Theorem 3.24. We first con-
sider a problem with discount rate β ∈ (0, 1) and non-autonomous data, i.e., ck and
M k depend on the time k. We assume that

‖c‖_∞ := sup_{k∈N} ‖c^k‖_∞ < ∞.   (7.23)

The associated value function, starting at state i and time k, is defined by

V_i^k := (1 − β) E[ ∑_{ℓ=k}^{∞} β^{ℓ−k} c_{x^ℓ}^ℓ | x^k = i ].   (7.24)

It is well-defined and belongs to ℓ^∞, since

|V_i^k| ≤ (1 − β) ∑_{ℓ=k}^{∞} β^{ℓ−k} ‖c^ℓ‖_∞ ≤ ‖c‖_∞ .   (7.25)

Lemma 7.3 We have that

V^k = (1 − β)c^k + β M^k V^{k+1} ,  k ∈ N.   (7.26)

Proof In view of (7.18), and since (e_i M^k)_j = M_{ij}^k, it follows that

V_i^k/(1 − β) = c_i^k + e_i ∑_{ℓ=k+1}^{∞} β^{ℓ−k} M^k · · · M^{ℓ−1} c^ℓ
             = c_i^k + ∑_{j∈S} M_{ij}^k ( β c_j^{k+1} + ∑_{ℓ=k+2}^{∞} β^{ℓ−k} e_j M^{k+1} · · · M^{ℓ−1} c^ℓ ),   (7.27)

so that

V_i^k/(1 − β) = c_i^k + ∑_{j∈S} M_{ij}^k E[ ∑_{ℓ=k+1}^{∞} β^{ℓ−k} c_{x^ℓ}^ℓ | x^{k+1} = j ],   (7.28)

and the above expectation is nothing else than βV_j^{k+1}/(1 − β). The conclusion follows. □

Remark 7.4 Lemma 7.3 allows us to compute V k given V k+1 . In practice, we can
compute an approximation of V k given a horizon N > 0, setting

ck,N = ck if k < N , and ck,N = 0 otherwise. (7.29)

The corresponding expectation

V_i^{k,N} := (1 − β) E[ ∑_{ℓ=k}^{N−1} β^{ℓ−k} c_{x^ℓ}^{ℓ,N} | x^k = i ]   (7.30)

is the value function of a problem with finite horizon N and therefore can be computed by induction, starting from V^{N,N} = 0. We have the error estimate

‖V^{k,N} − V^k‖_∞ ≤ (1 − β) ∑_{ℓ≥N} β^{ℓ−k} ‖c^ℓ‖_∞ ≤ β^{N−k} ‖c‖_∞ .   (7.31)

Remark 7.5 In the autonomous case, i.e., when (ck , M k ) does not depend on time,
and is then denoted as (c, M), it is easily checked that V k actually does not depend
on k, and is therefore denoted by V . Then Lemma 7.3 tells us that V satisfies

V = (1 − β)c + β M V. (7.32)

Since M is non-expansive in ℓ^∞, V ↦ (1 − β)c + β M V is a contraction with coefficient β. By the Banach–Picard theorem, (7.32) has a unique solution.
As observed in Remark 7.4, applying the above contraction mapping N times, starting from the zero value function, is equivalent to computing the value function V^N of the corresponding problem with horizon N and zero terminal cost, and we have, as in (7.31):

‖V^N − V‖_∞ ≤ β^N ‖c‖_∞ .   (7.33)
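Remark 7.5 in code (random made-up data): on a finite state space the exact fixed point of (7.32) solves the linear system (I − βM)V = (1 − β)c, and N steps of the contraction from V = 0 give the finite-horizon value V^N within the bound (7.33).

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta = 4, 0.9
M = rng.random((n, n)); M /= M.sum(axis=1, keepdims=True)   # stochastic matrix
c = rng.standard_normal(n)

# exact fixed point of (7.32): (I - beta M) V = (1 - beta) c
V = np.linalg.solve(np.eye(n) - beta * M, (1 - beta) * c)

# value iteration from 0: after N steps this is V^N of Remark 7.4
VN = np.zeros(n)
N = 50
for _ in range(N):
    VN = (1 - beta) * c + beta * M @ VN
err = np.abs(VN - V).max()
assert err <= beta ** N * np.abs(c).max() + 1e-12           # estimate (7.33)
```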

Remark 7.6 We often have periodic data (think of seasonal effects in economic
modelling), i.e., (ck , M k ) = (ck+K , M k+K ), where the positive integer K is called
the period. It is easily checked that V^k is periodic with period K, so it suffices to compute (V^1, . . . , V^K). It then follows from (7.26) that, with obvious notation:

⎛ V^1 ⎞            ⎛ c^1 ⎞     ⎛ M^1        0 ⎞ ⎛ V^2     ⎞
⎜  ⋮  ⎟ = (1 − β) ⎜  ⋮  ⎟ + β ⎜      ⋱       ⎟ ⎜  ⋮      ⎟ ,   (7.34)
⎝ V^K ⎠            ⎝ c^K ⎠     ⎝ 0        M^K ⎠ ⎝ V^{K+1} ⎠

with V^{K+1} := V^1. We see that (V^1, . . . , V^K) is a solution of a contracting fixed-point equation, of the same nature as the one obtained in the autonomous case (but K times larger).

7.1.2 The Dynamic Programming Principle

Consider now a Markov chain whose transition probabilities M_{ij}^k(u) depend on a control variable u ∈ U_i^k, where U_i^k is an arbitrary set depending on the time k and state i ∈ S. We have costs depending on the control and state, c_i^k(u) : U_i^k → R, and final values ϕ ∈ ℓ^∞, such that

‖c‖_∞ := sup_{k,i,u} |c_i^k(u)| < ∞.   (7.35)

Let Φ^k denote the set of feedback mappings (at time k), that to each i ∈ S associate a u_i ∈ U_i^k. Given a horizon N > k, we choose a feedback policy, i.e., an element u of the set

Φ^{(0,N−1)} := Φ^0 × · · · × Φ^{N−1} ,   (7.36)

that to each i ∈ S and k ∈ {0, . . . , N − 1} associates an element u_i^k of U_i^k. We denote by M^k(u^k) the transition operator with generic term M_{ij}^k(u_i^k), and by P^u and E^u the associated probability and expectation. From the discussion of the uncontrolled case it follows that with the feedback policy u are associated the values

V_i^k(u) := E^u[ ∑_{ℓ=k}^{N−1} c_{x^ℓ}^ℓ(u_{x^ℓ}^ℓ) + ϕ_{x^N} | x^k = i ],  k ∈ N,  i ∈ S.   (7.37)

By our previous results, these values are characterized by the relations

V^k(u) = c^k(u) + M^k(u)V^{k+1}(u),  k = 0, . . . , N − 1;
V^N(u) = ϕ.   (7.38)

Here, by the short notation c^k(u), we mean the function of i ∈ S with value c_i^k(u_i). Also, the following holds:
‖V^k(u)‖_∞ ≤ ∑_{ℓ=k}^{N−1} ‖c^ℓ‖_∞ + ‖ϕ‖_∞ ,  k = 0, . . . , N.   (7.39)

The (minimal) value is defined by

V_i^k := inf_{u∈Φ^{(0,N−1)}} V_i^k(u),  i ∈ S ;  k = 0, . . . , N.   (7.40)

In view of (7.39), we have that

‖V^k‖_∞ ≤ ∑_{ℓ=k}^{N−1} ‖c^ℓ‖_∞ + ‖ϕ‖_∞ ,  k = 0, . . . , N.   (7.41)

Given ε ≥ 0 and k ∈ {0, . . . , N − 1}, we define the set Φ^{k,ε} of ε-optimal feedback policies at time k, as

Φ^{k,ε} = { û ∈ Φ^k ; û_i ∈ ε-argmin_{u∈U_i^k} { c_i^k(u) + ∑_j M_{ij}^k(u) V_j^{k+1} },  for all i ∈ S }.   (7.42)

By ε-argmin_{u∈U_i^k}, we mean the set of points where the infimum is attained up to ε, that is, in the present setting, the set of û_i ∈ U_i^k such that

c_i^k(û_i) + ∑_j M_{ij}^k(û_i) V_j^{k+1} ≤ ε + inf_{u∈U_i^k} { c_i^k(u) + ∑_j M_{ij}^k(u) V_j^{k+1} }.   (7.43)

Note that this set may be empty if ε = 0. Consider the dynamic programming equation: find v = (v^0, . . . , v^N) ∈ (ℓ^∞)^{N+1} such that

v_i^k = inf_{u∈U_i^k} { c_i^k(u) + ∑_j M_{ij}^k(u) v_j^{k+1} },  i ∈ S ,  k = 0, . . . , N − 1;
v^N = ϕ.   (7.44)

Proposition 7.7 The (minimal) value function V^k is the unique solution of the dynamic programming equation. If the policy ū is such that for some ε_k ≥ 0, ū^k ∈ Φ^{k,ε_k} for all k, then

V_i^k ≤ V_i^k(ū) ≤ V_i^k + ε̄_k ,  ε̄_k := ∑_{ℓ=k}^{N−1} ε_ℓ ,  k = 0, . . . , N − 1.   (7.45)
In particular, if the above relation holds with ε_k = 0 for all k, then the policy ū is optimal in the sense that V_i^k = V_i^k(ū), for all k = 0, . . . , N − 1, and i ∈ S.

Proof By (backward) induction, we easily obtain that the dynamic programming principle (7.44) has a unique solution v in (ℓ^∞)^{N+1} that satisfies the estimate (7.41). Given a policy ū, we claim that v^k ≤ V^k(ū). This holds (with equality) for k = N, and if it holds at time k + 1, then the claim follows by induction, since

$$\begin{aligned} v_i^k &= \inf_{u \in U_i^k} \Big\{ c_i^k(u) + \sum_j M_{ij}^k(u) v_j^{k+1} \Big\} \\ &\le c_i^k(\bar{u}_i^k) + \sum_j M_{ij}^k(\bar{u}_i^k) v_j^{k+1} \\ &\le c_i^k(\bar{u}_i^k) + \sum_j M_{ij}^k(\bar{u}_i^k) V_j^{k+1}(\bar{u}) = V_i^k(\bar{u}). \end{aligned} \tag{7.46}$$

Minimizing over ū, we obtain that v^k ≤ V^k. We next prove the second inequality in (7.45), with v in lieu of V. It obviously holds when k = N, and if it does at time k + 1, then

$$\begin{aligned} V_i^k(\bar{u}) &= c_i^k(\bar{u}_i^k) + \sum_j M_{ij}^k(\bar{u}_i^k) V_j^{k+1}(\bar{u}) \\ &\le \varepsilon_k + \inf_{u \in U_i^k} \Big\{ c_i^k(u) + \sum_j M_{ij}^k(u) V_j^{k+1}(\bar{u}) \Big\} \\ &\le \bar{\varepsilon}_k + \inf_{u \in U_i^k} \Big\{ c_i^k(u) + \sum_j M_{ij}^k(u) v_j^{k+1} \Big\} = \bar{\varepsilon}_k + v_i^k. \end{aligned} \tag{7.47}$$

So, we have proved that v_i^k ≤ V_i^k ≤ V_i^k(ū) ≤ ε̄_k + v_i^k. Since ε̄_k can be taken arbitrarily small, v^k = V^k for all k, and the conclusion follows. □
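On a finite state space, the backward recursion (7.44) is direct to implement. Below is a minimal sketch; all data (two states, two actions shared by every state, horizon N = 2) are invented for illustration.

```python
import numpy as np

def backward_dp(c, M, phi):
    """Backward induction for the dynamic programming equation (7.44).

    c[k][a] : cost vector (c_i^k(a))_i for action a at time k
    M[k][a] : transition matrix M^k(a) (rows sum to one)
    phi     : final cost vector
    Returns the values v^0, ..., v^N and a greedy feedback policy.
    """
    N = len(c)
    v = [None] * (N + 1)
    policy = [None] * N
    v[N] = np.asarray(phi, dtype=float)
    for k in range(N - 1, -1, -1):
        # one row per action: c_i^k(a) + sum_j M_ij^k(a) v_j^{k+1}
        q = np.array([c[k][a] + M[k][a] @ v[k + 1] for a in range(len(c[k]))])
        policy[k] = q.argmin(axis=0)   # minimizing action, state by state
        v[k] = q.min(axis=0)
    return v, policy

# Invented autonomous data: action 0 stays put, action 1 swaps the two states.
Ms = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
cs = [np.array([1.0, 0.0]), np.array([2.0, 0.5])]
v, policy = backward_dp([cs, cs], [Ms, Ms], phi=np.array([0.0, 10.0]))
```

On this data v[0] = (2.0, 0.5): starting from state 1, it is worth paying the swap cost 0.5 at time 1 to avoid the final cost 10.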

7.1.3 Infinite Horizon Problems

7.1.3.1 Main Result

In this section, we assume that the data are autonomous: the cost function, transition operator and control sets do not depend on time, and we have a discount coefficient β ∈ (0, 1). The following theorem characterizes the optimal policies, and shows in particular that we can limit ourselves to autonomous (time-independent) feedback policies, i.e., elements of the set Φ of mappings that with each i ∈ S associate an element u_i of U_i. Sometimes we will use the following hypothesis:

$$\text{For all } i \text{ and } j \text{ in } S,\ U_i \text{ is a compact metric space, and the functions } c_i(u) \text{ and } M_{ij}(u) \text{ are continuous.} \tag{7.48}$$

Set, for all i ∈ S:

$$V_i(u) := (1 - \beta)\, \mathbb{E}^u \Big[ \sum_{k=0}^{\infty} \beta^k c_{x^k}(u_{x^k}) \,\Big|\, x^0 = i \Big]. \tag{7.49}$$
232 7 Markov Decision Processes

Given the discount factor β ∈ ]0, 1[, the (minimal) value function is defined by

$$V_i := \inf_{u \in \Phi} V_i(u), \qquad i \in S. \tag{7.50}$$

Theorem 7.8 (i) The value function is the unique solution of the dynamic programming equation: find v ∈ ℓ^∞ such that

$$v_i = \inf_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j \Big\}, \qquad i \in S. \tag{7.51}$$

(ii) Given ε ≥ 0, let u ∈ Φ be an autonomous policy and V(u) ∈ ℓ^∞ be the associated value, the unique solution of

$$V(u) = (1 - \beta) c(u) + \beta M(u) V(u). \tag{7.52}$$

Assume that, for all i ∈ S,

$$V_i(u) \le \inf_{\tilde{u} \in U_i} \Big( (1 - \beta) c_i(\tilde{u}) + \beta \sum_j M_{ij}(\tilde{u}) V_j(u) \Big) + \varepsilon. \tag{7.53}$$

Set ε′ := (1 − β)^{-1} ε. Then the policy u is ε′-suboptimal, in the sense that the associated value V(u) satisfies

$$V_i(u) \le V_i + \varepsilon', \qquad \text{for all } i \in S. \tag{7.54}$$

(iii) Let (7.48) hold. Then there exists (at least) one optimal policy.

We recall that

$$\Big| \inf_{u \in U} a(u) - \inf_{u \in U} b(u) \Big| \le \sup_{u \in U} |a(u) - b(u)|, \tag{7.55}$$

and define the Bellman operator T : ℓ^∞ → ℓ^∞ as

$$(Tw)_i := \inf_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) w_j \Big\}. \tag{7.56}$$

Proof (a) Let us show first that (7.51) has a unique solution. This equation is of the form v = T v. Since ‖T w‖_∞ ≤ (1 − β)‖c‖_∞ + β‖w‖_∞, the operator T indeed maps ℓ^∞ into itself. Given w and w′ in ℓ^∞, using (7.55) and the fact that M(u) is a transition operator, we get

$$\frac{1}{\beta} \big| (Tw')_i - (Tw)_i \big| \le \sup_{u \in U_i} \Big| \sum_{j=1}^{m} M_{ij}(u) (w' - w)_j \Big| \le \sup_{u \in U_i} \sum_{j=1}^{m} M_{ij}(u) \|w' - w\|_\infty,$$

and the r.h.s. is equal to ‖w′ − w‖_∞. So, T is a contraction with coefficient β and, by the Banach–Picard theorem, the equation v = T v has a unique solution, denoted by v. We next prove that v is equal to the minimal value V.
(b) Let u ∈ Φ be a policy, with associated value V(u). Since

$$v \le (1 - \beta) c(u) + \beta M(u) v, \tag{7.57}$$

we deduce using (7.52) that v − V(u) ≤ β M(u)(v − V(u)). Lemma 7.9 below ensures that v ≤ V(u). Since this holds for all policies, we also have v ≤ V.
(c) If (7.53) is satisfied, using (7.55) we get

$$V_i(u) - v_i \le \varepsilon + \beta \sup_{\tilde{u} \in U_i} \sum_{j \in S} M_{ij}(\tilde{u}) (V_j(u) - v_j) \le \varepsilon + \beta \sup (V(u) - v). \tag{7.58}$$

Taking the supremum in i, we deduce that sup(V(u) − v) ≤ ε′. Since v ≤ V(u) for any u ∈ Φ, we deduce (7.54), whence (ii).
(d) It follows from (ii) that a policy satisfying the dynamic programming equation
(7.51) is optimal. Such a policy exists whenever (7.48) holds. Points (i) and (iii)
follow. 
Lemma 7.9 Let M be a transition operator, β ∈ ]0, 1[, ε ≥ 0 and w ∈ ℓ^∞ satisfy w ≤ ε1 + βMw. Then w ≤ (1 − β)^{-1} ε 1.
Proof We have Mw ≤ (sup w)1 since M is a transition operator, and so w ≤ (ε + β sup w)1. Therefore, sup w ≤ ε + β sup w, whence the conclusion. □
Definition 7.10 We say that the sequence {u^q} of autonomous feedback policies simply converges to ū ∈ Φ if u_i^q → ū_i, for all i ∈ S. We define in the same way the simple convergence in ℓ^1 and ℓ^∞.

Lemma 7.11 Let {u^q} simply converge to ū in Φ. Then the associated value sequence V(u^q) simply converges to V(ū).

Proof Since V(u^q) is bounded in ℓ^∞, by a diagonal argument there exists a subsequence of V(u^q) that simply converges to some V̄ ∈ ℓ^∞. We will show that V̄ = V(ū). It easily follows then that the sequence V(u^q) simply converges to V(ū).
So, extracting a subsequence if necessary, we may assume that V(u^q) simply converges to V̄ ∈ ℓ^∞. Fix ε ∈ (0, 1) and i ∈ S. There exists a partition (I, J) of S such that

$$I \text{ has a finite cardinality and } \sum_{j \in I} M_{ij}(\bar{u}) \ge 1 - \tfrac{1}{2}\varepsilon. \tag{7.59}$$

Since I is finite and u^q simply converges to ū, for q large enough we have that Σ_{j∈I} M_{ij}(u^q) ≥ 1 − ε, and so

$$\sum_{j \in J} M_{ij}(\bar{u}) \le \varepsilon; \qquad \sum_{j \in J} M_{ij}(u^q) \le \varepsilon. \tag{7.60}$$

Set, for i ∈ S, Δ_i := lim sup_q |V_i(u^q) − V_i(ū)|. Since I is finite, we have that

$$\begin{aligned} \Delta_i &= \limsup_q \Big| (1 - \beta)\big(c_i(u_i^q) - c_i(\bar{u}_i)\big) + \beta \sum_j \big( M_{ij}(u_i^q) V_j(u^q) - M_{ij}(\bar{u}_i) V_j(\bar{u}) \big) \Big| \\ &\le \beta \limsup_q \Big| \sum_{j \in I} \big( M_{ij}(u^q) V_j(u^q) - M_{ij}(\bar{u}) V_j(\bar{u}) \big) \Big| + \varepsilon \Big( \sup_q \|V(u^q)\|_\infty + \|V(\bar{u})\|_\infty \Big) \\ &\le \varepsilon \Big( \sup_q \|V(u^q)\|_\infty + \|V(\bar{u})\|_\infty \Big). \end{aligned}$$

Since we may take ε arbitrarily small, the result follows. □


Remark 7.12 By similar arguments it can be shown that, in a finite horizon setting, if a sequence {u^q} of feedback policies simply converges to the feedback policy ū, then the associated values V(u^q) simply converge to V(ū).

7.1.3.2 Characterization of Optimal Policies

We now want to characterize optimal policies when starting from a given point, say
i ∈ S . That is, a policy u ∈ Φ such that the associated value satisfies Vi (u) = Vi .
Definition 7.13 Consider an autonomous Markov chain with transition operator M.
Let i ∈ S . We say that j ∈ S is q-steps accessible from i (with q ≥ 1) if a Markov
chain starting at state i and time 0 has a nonzero probability of having its state equal
to j at time q. We say that j is accessible from i if it is n-steps accessible for some
n ≥ 1. The union of such j is called the accessible set from state i.
Let M^q denote here the q-fold product of M. It is easily checked by induction that M_{ij}^q > 0 iff the Markov chain starting at i at time 0 has a positive probability of being equal to j at time q. Therefore the accessible set is

$$S_i = \bigcup_{q=1}^{\infty} \{ j \in S;\ M_{ij}^q > 0 \}. \tag{7.61}$$
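For a finite state set, the accessible set (7.61) can be computed without forming matrix powers, by a breadth-first search on the graph that has an edge i → j whenever M_{ij} > 0. A sketch, with an invented 4-state transition matrix:

```python
from collections import deque

def accessible_set(M, i):
    """States j accessible from i, i.e. M^q_{ij} > 0 for some q >= 1 (cf. (7.61))."""
    n = len(M)
    seen = set(j for j in range(n) if M[i][j] > 0)
    queue = deque(seen)
    while queue:
        k = queue.popleft()
        for j in range(n):
            if M[k][j] > 0 and j not in seen:
                seen.add(j)
                queue.append(j)
    return seen

# Hypothetical 4-state chain: 0 -> {0, 1}, 1 -> 2, 2 -> 2, 3 -> 0.
M = [[0.5, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0, 0.0],
     [1.0, 0.0, 0.0, 0.0]]
```

Note that state 3 is not accessible from itself (accessibility requires q ≥ 1), which is why Ŝ_i(u) adds the starting point explicitly.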

In the case of a controlled Markov chain, we denote by S_i(u) the accessible set when starting from i, with the policy u ∈ Φ. Set Ŝ_i(u) := {i} ∪ S_i(u).

Theorem 7.14 A policy u ∈ Φ is optimal, when starting from i_0 ∈ S, iff it satisfies the dynamic programming equation over Ŝ_{i_0}(u), i.e.,

$$u_i \in \operatorname*{argmin}_{v \in U_i} \Big\{ (1 - \beta) c_i(v) + \beta \sum_j M_{ij}(v) V_j \Big\}, \qquad \text{for all } i \in \hat{S}_{i_0}(u). \tag{7.62}$$


Proof Let i ∈ S be such that V_i(u) = V_i. Then

$$(1 - \beta) c_i(u_i) + \beta \sum_j M_{ij}(u_i) V_j(u) = V_i = \inf_{v \in U_i} \Big\{ (1 - \beta) c_i(v) + \beta \sum_j M_{ij}(v) V_j \Big\}. \tag{7.63}$$

Since V_j ≤ V_j(u), this holds iff V_j(u) = V_j whenever M_{ij}(u) ≠ 0. The result then follows by induction, starting with i = i_0. □

7.1.4 Numerical Algorithms

7.1.4.1 Value Iteration

In the case of autonomous infinite horizon problems, the simplest method for solving the dynamic programming principle (7.51) is the value iteration algorithm: compute the sequence v^q in ℓ^∞, for q ∈ ℕ, the solution of

$$v_i^{q+1} = \inf_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j^q \Big\}, \qquad i \in S,\ q \in \mathbb{N}. \tag{7.64}$$

We initialize the sequence with an arbitrary element v^0 of ℓ^∞. The sequence v^q is not to be confused with the values v^k used in the case of finite horizon. Observe that (7.64) coincides with the formula for computing the value of finite horizon problems (up to the fact that here we increase the index q instead of decreasing it). It easily follows that v^q is the value function of the following finite horizon, discounted problem

$$v_i^q = (1 - \beta) \min_{u \in \Phi^{(0,q-1)}} \mathbb{E}^u \Big[ \sum_{\ell=0}^{q-1} \beta^\ell c_{x^\ell}(u^\ell_{x^\ell}) + \beta^q v_{x^q}^0 \,\Big|\, x^0 = i \Big], \qquad q \in \mathbb{N},\ i \in S, \tag{7.65}$$

where the set Φ^{(0,N−1)} of feedback policies was defined in (7.36).

Proposition 7.15 The value iteration algorithm converges to the unique solution V of (7.51), and we have

$$\|v^q - V\|_\infty \le \beta^q \|v^0 - V\|_\infty, \qquad \text{for all } q \in \mathbb{N}. \tag{7.66}$$

Proof We showed in the proof of Theorem 7.8 that the Bellman operator T, defined in (7.56), is a contraction with ratio β in the uniform norm. We conclude by the Banach–Picard theorem. □

Remark 7.16 When taking v^0 = 0, we obtain the explicit estimate of the distance to the solution:

$$\|v^q - V\|_\infty \le \beta^q \|V\|_\infty \le \beta^q \|c\|_\infty, \qquad \text{for all } q \in \mathbb{N}. \tag{7.67}$$

Remark 7.17 Observe that v^{q+1} is a nondecreasing function of v^q. So if v^1 ≤ v^0, we obtain by an induction argument that v^q is a nonincreasing sequence. This is the case in particular if v_j^0 ≥ sup_i c_i, for all j ∈ S. Similarly, if v_j^0 ≤ inf_i c_i for all j ∈ S, then v^q is nondecreasing.
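A minimal sketch of the value iteration (7.64), on invented two-state data with two actions ('stay' and 'swap'); by Proposition 7.15 the error decays like β^q:

```python
import numpy as np

def value_iteration(c, M, beta, v0, tol=1e-10, max_iter=10_000):
    """Iterate v^{q+1}_i = min_a { (1-beta) c_i(a) + beta sum_j M_ij(a) v^q_j }.

    c[a] : cost vector for action a;  M[a] : transition matrix for action a.
    Stops when successive iterates differ by less than tol in sup-norm.
    """
    v = np.asarray(v0, dtype=float)
    for _ in range(max_iter):
        q = np.array([(1 - beta) * c[a] + beta * (M[a] @ v) for a in range(len(c))])
        v_new = q.min(axis=0)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
    return v

beta = 0.9
c = [np.array([1.0, 0.0]), np.array([0.2, 1.0])]            # invented costs
M = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]         # stay / swap
V = value_iteration(c, M, beta, v0=np.zeros(2))
```

On this data the limit is V = (0.02, 0): in state 1 the optimal decision is to stay at zero cost, and in state 0 to pay (1 − β)·0.2 once and move to state 1.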

7.1.4.2 Policy Iteration

When β is close to 1, the value iteration algorithm can be very slow. A possible alternative is the policy iteration, or Howard, algorithm. Roughly speaking, the idea is, for a given policy, to compute the associated value, and then to update the policy by computing the argument of the minimum in the dynamic programming operator. We assume that the compactness hypothesis (7.48) holds. Each iteration of the algorithm has two steps:
Algorithm 7.18 (Howard algorithm)
1. Initialization: choose a policy u^0 ∈ Φ; set q := 0.
2. Compute the value function v^q associated with the policy u^q ∈ Φ, i.e., the solution of the linear equation

$$v^q = (1 - \beta) c(u^q) + \beta M(u^q) v^q. \tag{7.68}$$

3. Compute a policy u^{q+1} ∈ Φ, a solution of

$$u_i^{q+1} \in \operatorname*{argmin}_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j^q \Big\}, \qquad \text{for all } i \in S. \tag{7.69}$$

4. q := q + 1; go to step 2.
Denote by V the value function, the unique solution of the dynamic programming principle (7.51).

Proposition 7.19 Let (7.48) hold. Then the Howard algorithm is well-defined. The sequence v^q is nonincreasing and satisfies

$$\|v^{q+1} - V\|_\infty \le \beta \|v^q - V\|_\infty, \qquad \text{for all } q \in \mathbb{N}. \tag{7.70}$$

In addition, denote by v̄^{q+1} the value obtained by applying one value iteration step to v^q. Then v^{q+1} ≤ v̄^{q+1}.
Proof The linear system (7.68) has a unique solution in ℓ^∞, since it is a fixed point equation of a contraction. In view of (7.48), the minimum in the second step is attained. The sequence v^q is bounded in ℓ^∞ since we have

$$\|v^q\|_\infty \le (1 - \beta) \|c(u^q)\|_\infty + \beta \|M(u^q) v^q\|_\infty \le (1 - \beta) \|c(u^q)\|_\infty + \beta \|v^q\|_\infty, \tag{7.71}$$

and therefore ‖v^q‖_∞ ≤ ‖c‖_∞. Relations (7.68) and (7.69) imply

$$(1 - \beta) c(u^{q+1}) + \beta M(u^{q+1}) v^q \le (1 - \beta) c(u^q) + \beta M(u^q) v^q, \tag{7.72}$$

whence

$$v^{q+1} - v^q = (1 - \beta)\big(c(u^{q+1}) - c(u^q)\big) + \beta \big(M(u^{q+1}) v^{q+1} - M(u^q) v^q\big) \le \beta M(u^{q+1})(v^{q+1} - v^q),$$

and so v^{q+1} − v^q ≤ 0 by Lemma 7.9.
By Proposition 7.15, ‖v̄^{q+1} − V‖_∞ ≤ β‖v^q − V‖_∞. Since V ≤ v^{q+1}, it is enough to establish that v^{q+1} ≤ v̄^{q+1}. Indeed, we get after cancellation that

$$v^{q+1} - \bar{v}^{q+1} = \beta M(u^{q+1})(v^{q+1} - v^q) \le 0,$$

since M(u^{q+1}) has nonnegative elements and v^{q+1} ≤ v^q. The conclusion follows. □

Remark 7.20 The previous proof shows that the policy iterations converge at least as
rapidly as the value iteration. However, each iteration needs to solve a linear system.
This can be expensive, especially if the transition operators are not sparse.
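A sketch of Algorithm 7.18 on invented two-state data (two actions, 'stay' and 'swap'); step 2 solves the linear policy-evaluation system (7.68) with a direct solver:

```python
import numpy as np

def policy_iteration(c, M, beta, u0=None, max_iter=100):
    """Howard's algorithm 7.18: policy evaluation (7.68) + greedy update (7.69)."""
    n = len(c[0])
    u = np.zeros(n, dtype=int) if u0 is None else np.asarray(u0)
    for _ in range(max_iter):
        # Step 2: solve v = (1-beta) c(u) + beta M(u) v (a linear system).
        Mu = np.array([M[u[i]][i] for i in range(n)])   # row i of M(u_i)
        cu = np.array([c[u[i]][i] for i in range(n)])
        v = np.linalg.solve(np.eye(n) - beta * Mu, (1 - beta) * cu)
        # Step 3: greedy policy with respect to v.
        q = np.array([(1 - beta) * c[a] + beta * (M[a] @ v) for a in range(len(c))])
        u_new = q.argmin(axis=0)
        if np.array_equal(u_new, u):
            return v, u
        u = u_new
    return v, u

beta = 0.9
c = [np.array([1.0, 0.0]), np.array([0.2, 1.0])]
M = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
V, u = policy_iteration(c, M, beta)
```

Here two iterations suffice: the greedy policy u = (1, 0) ('swap' in state 0, 'stay' in state 1) gives V = (0.02, 0), and re-evaluating does not change it.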

Remark 7.21 The contraction constant β is optimal for the Howard algorithm, as Example 7.22 shows. In addition, in this example the sequences computed by the value and Howard algorithms coincide. So, in general, the Howard algorithm does not converge more rapidly than the value iteration.

Example 7.22 Here is a variant of an example due to Tsitsiklis, see Santos and Rust [109], showing that the Howard algorithm does not necessarily converge faster than the value iteration algorithm. Let S = ℕ, and for all i ∈ ℕ, i ≠ 0, U_i = {0, 1}. The decision 0 (resp. 1) represents a (deterministic) move from state i to state i − 1 (resp. to itself). The only possible decision at state 0 is to remain there. The cost is 1 at any state i ≠ 0, and 0 at state 0. So the optimal policy is to choose u = 0 when i ≠ 0. The optimal value is V_0 = 0 and, for i > 0,

$$V_i = (1 - \beta)(1 + \beta + \cdots + \beta^{i-1}) = 1 - \beta^i. \tag{7.73}$$

We choose to initialize Howard's algorithm with the policy u = 1 for any i > 0. So

$$v_i^0 = 0 \text{ if } i = 0; \qquad v_i^0 = 1 \text{ otherwise.} \tag{7.74}$$

We then have

$$\|v^0 - V\|_\infty = v_1^0 - V_1 = 1 - (1 - \beta) = \beta. \tag{7.75}$$

At each iteration q of the algorithm the decision at state q changes from 1 to 0, and this is the only change, so that

$$v_i^q = V_i,\ 0 \le i \le q; \qquad v_i^q = 1,\ i > q. \tag{7.76}$$

It follows that

$$\|v^q - V\|_\infty = v_{q+1}^q - V_{q+1} = 1 - (1 - \beta^{q+1}) = \beta^{q+1} = \beta^q \|v^0 - V\|_\infty. \tag{7.77}$$
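The relations (7.74)–(7.77) can be verified numerically on a finite truncation of this chain; truncation is harmless for the states kept, since no decision ever moves to a higher state (truncation level and number of sweeps chosen arbitrarily):

```python
import numpy as np

beta, n, Q = 0.9, 12, 8       # discount, truncation level, number of sweeps

V = 1 - beta ** np.arange(n)  # exact values (7.73); V_0 = 0

v = np.ones(n)
v[0] = 0.0                    # initialization (7.74)
for q in range(Q):
    w = v.copy()
    # at i >= 1 both decisions have stage cost 1: move to i-1, or stay at i
    w[1:] = (1 - beta) + beta * np.minimum(v[:-1], v[1:])
    v = w
```

After Q sweeps the first Q + 1 entries agree with (7.73), the remaining ones still equal 1, and the sup-norm error equals β^{Q+1}, exactly as in (7.76)–(7.77).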

7.1.4.3 Modified Policy Iteration Algorithms

The idea is to replace, in Howard's algorithm, the linear system resolution with finitely many value iteration-like steps, where the decision is frozen.

Algorithm 7.23 (Modified policy iteration algorithm)
1. Initialization: choose a policy u^0 ∈ Φ, an initial value estimate v^{−1} ∈ ℓ^∞, and m ∈ ℕ*. Set q := 0.
2. Set v^{q,0} := v^{q−1}. Compute v^{q,k}, k = 1 to m, as the solution of 'frozen value iteration steps' as follows:

$$v^{q,k} := (1 - \beta) c(u^q) + \beta M(u^q) v^{q,k-1}. \tag{7.78}$$

3. Set v^q := v^{q,m} and compute the policy u^{q+1} ∈ Φ, a solution of

$$u_i^{q+1} \in \operatorname*{argmin}_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j^q \Big\}, \qquad \text{for all } i \in S. \tag{7.79}$$

4. q := q + 1; go to step 2.

Remark 7.24 (i) If m = 1 we recover the value iteration algorithm.
(ii) Denote by v̂^q the value associated with the policy u^q. The convergence analysis of the frozen value iteration steps is similar to that of the 'classical' value iterations. We deduce that

$$\|v^{q,m} - \hat{v}^q\|_\infty \le \beta^m \|v^{q-1} - \hat{v}^q\|_\infty. \tag{7.80}$$

So, informally speaking, when m is large, the sequence v^q should not be too different from the one computed by Howard's algorithm. The gain is that the frozen value iteration steps are generally much faster than the corresponding classical value iterations.
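A sketch of Algorithm 7.23 on invented two-state data; the exact policy evaluation of Howard's algorithm is replaced by m frozen sweeps (7.78):

```python
import numpy as np

def modified_policy_iteration(c, M, beta, m, v=None, outer=500):
    """Algorithm 7.23: greedy policy update (7.79) followed by m 'frozen'
    value iteration sweeps (7.78) with the decision held fixed."""
    n = len(c[0])
    v = np.zeros(n) if v is None else np.asarray(v, dtype=float)
    for _ in range(outer):
        q = np.array([(1 - beta) * c[a] + beta * (M[a] @ v) for a in range(len(c))])
        u = q.argmin(axis=0)                        # policy update (7.79)
        cu = np.array([c[u[i]][i] for i in range(n)])
        Mu = np.array([M[u[i]][i] for i in range(n)])
        for _ in range(m):                          # frozen sweeps (7.78)
            v = (1 - beta) * cu + beta * (Mu @ v)
    return v

beta = 0.9
c = [np.array([1.0, 0.0]), np.array([0.2, 1.0])]   # invented costs
M = [np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]  # stay / swap
V = modified_policy_iteration(c, M, beta, m=5)
```

With m = 1 this reduces to the value iteration, as noted in Remark 7.24(i); on this data the iterates settle at V = (0.02, 0).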

7.1.5 Exit Time Problems

Let Ŝ be a subset of S, and consider an autonomous controlled Markov chain model, the control sets U_i being metric and compact, and the transition operator M and the costs c_i(u) being continuous. Let τ be the first exit time from Ŝ of the Markov chain starting at i ∈ Ŝ at time zero:

$$\tau := \min\{ k \in \mathbb{N};\ x^k \notin \hat{S} \}. \tag{7.81}$$

Note that τ = ∞ if no exit occurs. We consider the value function, for i ∈ S:

$$V_i := (1 - \beta) \inf_{u \in \Phi} \mathbb{E}^u \Big[ \sum_{k=0}^{\tau - 1} \beta^k c_{x^k}(u_{x^k}) + \beta^\tau \varphi_{x^\tau} \,\Big|\, x^0 = i \Big]. \tag{7.82}$$

Remark 7.25 If the ci are set to zero and ϕi = 1 (resp. ϕi = −1) for all i ∈ S \ Sˆ,
we see that the problem consists, roughly speaking, in maximizing (resp. minimizing)
a discounted value of the exit time.
It appears that exit problems reduce to the standard one by adding a final state, say i_f, to the state space, which becomes S := S ∪ {i_f}. The decision sets are, for i ∈ S:

$$U_i = \begin{cases} U_i & \text{if } i \in \hat{S}, \\ \{0\} & \text{otherwise.} \end{cases} \tag{7.83}$$

The associated transition operators are defined by

$$M_{ij}(u) = \begin{cases} M_{ij}(u) & \text{for } i \in \hat{S},\ u \in U_i, \\ \delta_{j i_f} & \text{if } i \notin \hat{S}. \end{cases} \tag{7.84}$$

In other words, for any i ∈ S \ Ŝ, the only possible transition is to the final state i_f, and when in i_f the process remains there. The costs are

$$c_i(u) = \begin{cases} c_i(u) & \text{if } i \in \hat{S}, \\ \varphi_i & \text{if } i \in S \setminus \hat{S}, \\ 0 & \text{if } i = i_f. \end{cases} \tag{7.85}$$

Proposition 7.26 Let sup_{u∈U} |c_i(u)| be finite and ϕ be bounded. Then the value function of the exit time problem is the unique solution of the dynamic programming equation

$$\begin{cases} v_i = \displaystyle\inf_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j \Big\}, & i \in \hat{S}, \\ v_i = (1 - \beta) \varphi_i, & i \in S \setminus \hat{S}. \end{cases} \tag{7.86}$$

Proof This is a consequence of our previous results. We have just shown that exit time problems can be rewritten as standard controlled Markov chain problems, and the value at the state i_f is zero. Writing the corresponding dynamic programming equation, for i ∈ Ŝ we get the first row in (7.86), and otherwise we get

$$v_i = (1 - \beta) \varphi_i + \beta v_{i_f},\ i \in S \setminus \hat{S}; \qquad v_{i_f} = \beta v_{i_f}. \tag{7.87}$$

Therefore v_{i_f} = 0, and then (7.86) follows. □

Remark 7.27 The value iteration algorithm (rewriting the exit problem as a standard one), when starting with initial values such that

$$v_i^0 = (1 - \beta) \varphi_i,\ i \in S \setminus \hat{S}; \qquad v_{i_f}^0 = 0, \tag{7.88}$$

satisfies

$$v_i^q = (1 - \beta) \varphi_i,\ i \in S \setminus \hat{S}; \qquad v_{i_f}^q = 0, \qquad \text{for all } q \in \mathbb{N}. \tag{7.89}$$

So we can express it in the form, for q ∈ ℕ:

$$\begin{cases} v_i^{q+1} = \displaystyle\inf_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j^q \Big\}, & i \in \hat{S}, \\ v_i^{q+1} = (1 - \beta) \varphi_i, & i \in S \setminus \hat{S}. \end{cases} \tag{7.90}$$

Exercise 7.28 Extend the policy iteration algorithm to the present setting, the
sequence of values satisfying (7.89).
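A sketch of the iteration (7.90): values outside Ŝ stay frozen at (1 − β)φ_i and only the states of Ŝ are updated. The 4-state data (Ŝ = {0, 1}, zero running cost as in Remark 7.25, cyclic moves) are invented:

```python
import numpy as np

def exit_value_iteration(c, M, beta, phi, inside, iters=2000):
    """Value iteration (7.90): update states of S_hat; elsewhere the value
    is frozen at (1 - beta) * phi."""
    phi = np.asarray(phi, dtype=float)
    v = np.where(inside, 0.0, (1 - beta) * phi)
    for _ in range(iters):
        q = np.array([(1 - beta) * c[a] + beta * (M[a] @ v) for a in range(len(c))])
        v = np.where(inside, q.min(axis=0), (1 - beta) * phi)
    return v

beta = 0.5
inside = np.array([True, True, False, False])   # S_hat = {0, 1}
phi = np.array([0.0, 0.0, -2.0, 2.0])           # exit costs (arbitrary inside)
c = [np.zeros(4), np.zeros(4)]                  # zero running cost
shift = np.roll(np.eye(4), -1, axis=0)          # action 0: i -> i+1 (mod 4)
M = [shift, np.eye(4)]                          # action 1: stay put
v = exit_value_iteration(c, M, beta, phi, inside)
```

The fixed point here is v = (−0.25, −0.5, −1, 1): from state 1 it is optimal to exit immediately through state 2, collecting the discounted exit reward.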

7.1.6 Problems with Stopping Decisions

7.1.6.1 Setting

We now study an extension of the previous framework, with the additional possibility of a stopping decision at any state i ∈ S, with cost ψ_i in ℝ ∪ {+∞} (in fact, the possibly infinite value restricts the possibility of stopping to the states with a finite value of ψ). We assume that ψ has a finite infimum. Let M(u) be the transition operator of the controlled Markov chain. We assume that (7.48) holds. Given Ŝ ⊂ S, we denote by τ the first exit time from Ŝ, and consider the additional decision θ, called the stopping time (a function of i ∈ S). Set

$$\chi_{\theta < \tau} = \begin{cases} 1 & \text{if } \theta < \tau, \\ 0 & \text{otherwise,} \end{cases} \tag{7.91}$$

and adopt a similar convention for χ_{θ≥τ}. We consider the controlled stopping time problem

$$V_i := (1 - \beta) \inf_{u \in \Phi} \mathbb{E}^u \Big[ \sum_{k=0}^{(\theta \wedge \tau) - 1} \beta^k c(u)_{x^k} + \beta^\theta \chi_{\theta < \tau} \psi_{x^\theta} + \beta^\tau \chi_{\theta \ge \tau} \varphi_{x^\tau} \,\Big|\, x^0 = i \Big]. \tag{7.92}$$

Remark 7.29 (i) When U_i is a singleton for all i ∈ S, the only decision is when to stop. We speak then of a pure stopping problem. (ii) The optimal policy may be to never stop.
In the sequel we assume that

$$\text{(i) the compactness hypothesis (7.48) holds;}\quad \text{(ii) } \sup_{u \in U} |c_i(u)| < \infty;\quad \text{(iii) } \varphi \in \ell^\infty;\quad \text{(iv) } \inf \psi \text{ is finite.} \tag{7.93}$$

Theorem 7.30 The value function V of the stopping problem belongs to ℓ^∞, and is the unique solution of the dynamic programming equation

$$\begin{cases} \text{(i)} & v_i = \min\Big( \displaystyle\inf_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j \Big\},\ (1 - \beta) \psi_i \Big), & i \in \hat{S}, \\ \text{(ii)} & v_i = (1 - \beta) \varphi_i, & i \notin \hat{S}. \end{cases} \tag{7.94}$$
Proof Choosing a policy without stopping, we get that V_i ≤ ‖c‖_∞. Changing c_i into (inf c)1 and ψ_i into (inf ψ)1, for each i ∈ S, we get V_i ≥ min(−‖c‖_∞, (1 − β) inf ψ) (remember that ψ has a finite infimum). So, V ∈ ℓ^∞.
We can rewrite the stopping problem as a standard one. As in the case of exit problems, we add to S a final state i_f with only the transition to itself, and transitions from any i ∈ S \ Ŝ to i_f, with associated cost ϕ_i. The difference is that we add the possible decision to move from any i ∈ S to i_f, with associated cost ψ_i. The associated dynamic programming equation then reads

$$\begin{cases} v_i = \min\Big( \displaystyle\inf_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j \Big\},\ (1 - \beta) \psi_i + \beta v_{i_f} \Big), & i \in \hat{S}, \\ v_i = (1 - \beta) \varphi_i + \beta v_{i_f}, & i \in S \setminus \hat{S}, \\ v_{i_f} = \beta v_{i_f}. \end{cases} \tag{7.95}$$

Clearly this holds iff v_{i_f} = 0 and (7.94) is satisfied. So, (7.94) is equivalent to the dynamic programming equation of the reformulation as a standard problem, and therefore characterizes the minimal value function. □
As in the case of exit problems, we easily check that the value iteration algorithm (applied to the reformulation as a standard problem), initialized with v^0 such that

$$v_{i_f}^0 = 0; \qquad v_i^0 = (1 - \beta) \varphi_i, \quad \text{for all } i \in S \setminus \hat{S}, \tag{7.96}$$

satisfies

$$v_{i_f}^q = 0; \qquad v_i^q = (1 - \beta) \varphi_i, \quad \text{for all } i \in S \setminus \hat{S} \text{ and all } q \in \mathbb{N}. \tag{7.97}$$

So, we can define the value iteration algorithm for stopping problems as computing the sequence satisfying (7.97) as well as

$$v_i^{q+1} = \min\Big( \inf_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j^q \Big\},\ (1 - \beta) \psi_i \Big), \qquad i \in \hat{S}. \tag{7.98}$$
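A sketch of the iteration (7.98) on invented 4-state data (Ŝ = {0, 1}); compared with a pure exit problem, the only change is the outer min with the stopping cost (1 − β)ψ_i, here made attractive in state 0 only:

```python
import numpy as np

def stopping_value_iteration(c, M, beta, phi, psi, inside, iters=2000):
    """Value iteration (7.98): outer min with the stopping cost (1-beta)*psi
    on S_hat; outside S_hat the value is frozen at (1-beta)*phi."""
    phi, psi = np.asarray(phi, dtype=float), np.asarray(psi, dtype=float)
    v = np.where(inside, 0.0, (1 - beta) * phi)
    for _ in range(iters):
        q = np.array([(1 - beta) * c[a] + beta * (M[a] @ v) for a in range(len(c))])
        cont = q.min(axis=0)                       # continuation value
        v = np.where(inside,
                     np.minimum(cont, (1 - beta) * psi),
                     (1 - beta) * phi)
    return v

beta = 0.5
inside = np.array([True, True, False, False])      # S_hat = {0, 1}
phi = np.array([0.0, 0.0, -2.0, 2.0])              # exit costs
psi = np.array([-0.8, 0.0, 0.0, 0.0])              # stopping costs (invented)
c = [np.zeros(4), np.zeros(4)]                     # zero running cost
shift = np.roll(np.eye(4), -1, axis=0)             # action 0: i -> i+1 (mod 4)
M = [shift, np.eye(4)]                             # action 1: stay put
v = stopping_value_iteration(c, M, beta, phi, psi, inside)
```

Without the stopping option the value at state 0 would be −0.25; stopping improves it to (1 − β)ψ_0 = −0.4.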

7.1.6.2 Policy Iterations Algorithm

The policy iteration algorithm (applied to the reformulation as a standard problem) can be expressed as follows. We again have that the initialization (7.96) implies (7.97). We know that the sequence v^q computed by the policy iteration algorithm is nonincreasing. Therefore, if i ∈ Ŝ is such that v_i^q < ψ_i for some q ∈ ℕ, then v_i^{q′} < ψ_i for all q′ ≥ q. That is, the set I^q of states with 'non-stopping decision at iteration q' (defined precisely below) is nondecreasing. We next formulate the Howard algorithm. Given a policy u^q ∈ Φ, one has to compute the solution v^q of the linear equation

$$\begin{cases} v_i^q = (1 - \beta) c_i(u_i^q) + \beta \displaystyle\sum_j M_{ij}(u_i^q) v_j^q, & i \in I^q, \\ v_i^q = (1 - \beta) \psi_i, & i \in \hat{S} \setminus I^q, \\ v_i^q = (1 - \beta) \varphi_i, & i \notin \hat{S}. \end{cases} \tag{7.99}$$

Let us now state the Howard algorithm:

Algorithm 7.31 (Policy iteration for stopping problems)
1. Choose u^0 ∈ Φ; set q := 0, I^0 := ∅, and compute the solution v^0 of (7.99) with q = 0.
2. q := q + 1. Compute u_i^q, for all i ∈ Ŝ, such that

$$u_i^q \in \operatorname*{argmin}_{u \in U_i} \Big\{ (1 - \beta) c_i(u) + \beta \sum_j M_{ij}(u) v_j^{q-1} \Big\}, \qquad i \in \hat{S}. \tag{7.100}$$

3. Set

$$I^q := I^{q-1} \cup \Big\{ i \in \hat{S};\ (1 - \beta) c_i(u_i^q) + \beta \sum_j M_{ij}(u_i^q) v_j^{q-1} < (1 - \beta) \psi_i \Big\}. \tag{7.101}$$

4. Compute the solution v^q of the linear equation (7.99); go to step 2.
From the study of the policy iteration in the standard framework (see Proposition 7.19), we deduce that:

Proposition 7.32 The Howard algorithm computes a nonincreasing sequence v^q that satisfies

$$\|v^{q+1} - V\|_\infty \le \beta \|v^q - V\|_\infty. \tag{7.102}$$

7.1.7 Undiscounted Problems

It may happen that exit time or stopping problems have finite values in the absence of discounting. Indeed, consider the controlled stopping time problem similar to (7.92), but without discounting:

$$V_i := \inf_{u \in \Phi} \mathbb{E}^u \Big[ \sum_{k=0}^{(\theta \wedge \tau) - 1} c(u)_{x^k} + \chi_{\theta < \tau} \psi_{x^\theta} + \chi_{\theta \ge \tau} \varphi_{x^\tau} \,\Big|\, x^0 = i \Big]. \tag{7.103}$$

Example 7.33 Assume that c(u) and ϕ have nonnegative values, and that ψ ∈ ℓ^∞ (in particular, stopping in any state is possible), with inf ψ < 0. Minorizing V_i by changing c(u) and ϕ to zero, we obtain that, for all i ∈ Ŝ, inf ψ ≤ V_i ≤ ψ_i, so that V ∈ ℓ^∞.
Using the arguments of the previous sections, one easily checks that the value functions satisfy a dynamic programming principle similar to those already stated, but with β = 1. We leave the details as an exercise.

7.2 Advanced Material on Controlled Markov Chains

This section presents some more advanced aspects of the theory of controlled Markov chains, among them problems with expectation constraints and problems with partial information, including open-loop control.

7.2.1 Expectation Constraints

We will see, in the presence of constraints over the expectations of functions of the
state, a nice relation with the duality theory presented in the first chapter.

7.2.1.1 Setting

As in Sect. 7.1.3, we consider a problem with autonomous data, infinite horizon, and discount rate β ∈ ]0, 1[. We consider only autonomous feedback policies, i.e. elements of the set Φ of mappings that to each i ∈ S associate some u_i ∈ U_i. We fix the starting point i_0 ∈ S of the Markov chain. The value function associated with a policy u ∈ Φ is, in the spirit of (7.50), given by

$$V_{i_0}(u) := (1 - \beta)\, \mathbb{E}^u \Big[ \sum_{k=0}^{\infty} \beta^k c_{x^k}(u_{x^k}) \,\Big|\, x^0 = i_0 \Big]. \tag{7.104}$$

We have in addition expectation constraints of the form

$$W_{i_0}(u) \in K, \tag{7.105}$$

where again u ∈ Φ, K is a nonempty, closed convex subset of ℝ^r, and W_i(u) is the value associated with uniformly bounded functions Ψ_i : U_i → ℝ^r, for all i ∈ S:

$$W_i(u) := (1 - \beta)\, \mathbb{E}^u \Big[ \sum_{k=0}^{\infty} \beta^k \Psi_{x^k}(u_{x^k}) \,\Big|\, x^0 = i \Big]. \tag{7.106}$$

The problem is therefore

$$\min_{u \in \Phi}\ V_{i_0}(u); \qquad W_{i_0}(u) \in K. \tag{7.107}$$

7.2.1.2 Weak Duality

We apply the duality theory of Chap. 1 to this (nonconvex) problem. For λ ∈ ℝ^r and v ∈ U_i, set

$$c_i^\lambda(v) := c_i(v) + \lambda \cdot \Psi_i(v). \tag{7.108}$$

Denote by V^λ(u) the associated value function, defined by

$$V_i^\lambda(u) := (1 - \beta)\, \mathbb{E}^u \Big[ \sum_{k=0}^{\infty} \beta^k c_{x^k}^\lambda(u_{x^k}) \,\Big|\, x^0 = i \Big], \qquad i \in S, \tag{7.109}$$

and set V̄_i^λ := inf_{u∈Φ} V_i^λ(u). The (standard) Lagrangian, duality Lagrangian, and dual cost associated with problem (7.107) are, resp.:

$$\begin{cases} L(u, \lambda) := V_{i_0}(u) + \lambda \cdot W_{i_0}(u) = V_{i_0}^\lambda(u), \\ \mathscr{L}(u, \lambda) := L(u, \lambda) - \sigma_K(\lambda), \\ \delta(\lambda) := \inf_{u \in \Phi} \mathscr{L}(u, \lambda) = \bar{V}_{i_0}^\lambda - \sigma_K(\lambda). \end{cases} \tag{7.110}$$

The dual problem is

$$\max_\lambda\ \delta(\lambda). \tag{7.111}$$

We know that its value (the dual value) is a lower bound for the value of (7.107). This lower bound is often useful, since the primal problem is not easy to solve. We next analyze some cases in which there is no duality gap, i.e., the primal and dual values are equal.
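The dual bound is cheap to evaluate: for each λ, computing δ(λ) only requires solving a standard problem with the modified cost (7.108). A sketch for a scalar constraint with K = (−∞, α] (so σ_K(λ) = λα for λ ≥ 0, and +∞ otherwise), on invented two-state data; note the decision sets here are two-point sets, not convex, so a duality gap may remain — the computed value is only a valid lower bound:

```python
import numpy as np

def value_iteration(c, M, beta, iters=500):
    v = np.zeros(len(c[0]))
    for _ in range(iters):
        v = np.array([(1 - beta) * c[a] + beta * (M[a] @ v)
                      for a in range(len(c))]).min(axis=0)
    return v

def dual_value(c, Psi, M, beta, i0, lam, alpha):
    """delta(lambda) for K = (-inf, alpha] and lambda >= 0: value of the
    standard problem with cost c + lam * Psi, minus sigma_K(lam) = lam*alpha."""
    c_lam = [c[a] + lam * Psi[a] for a in range(len(c))]
    return value_iteration(c_lam, M, beta)[i0] - lam * alpha

# Invented data: action 1 is cheap but 'risky' (Psi = 1); both actions stay put.
beta, alpha = 0.9, 0.5
c = [np.array([1.0, 1.0]), np.array([0.0, 0.0])]
Psi = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
M = [np.eye(2), np.eye(2)]
lower_bound = max(dual_value(c, Psi, M, beta, 0, lam, alpha)
                  for lam in np.linspace(0.0, 5.0, 101))
```

Here the only feasible deterministic policy costs 1, while the dual bound is 0.5 — a gap that, consistently with Theorem 7.34 below, is closed by randomized (relaxed) policies.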

7.2.1.3 Strong Duality; Relaxation

In this subsection we assume the following hypotheses, in order to obtain strong duality results: the state set is finite,

$$|S| = m < \infty; \tag{7.112}$$

the following qualification condition holds:

$$\varepsilon B \subset \operatorname{conv}\big( \operatorname{Im}(W_{i_0}) \big) - K, \quad \text{for some } \varepsilon > 0, \tag{7.113}$$

where B is the unit ball of ℝ^r; and

$$\begin{cases} \text{the } U_i \text{ are convex, compact subsets of } \mathbb{R}^{n_u}, \\ u \mapsto M(u) \text{ is affine,} \\ u \mapsto c_i(u) \text{ is Lipschitz and convex for any state } i, \\ u \mapsto \Psi_i(u) \text{ is affine for any state } i. \end{cases} \tag{7.114}$$

Note that the above hypotheses do not imply that problem (7.107) is convex (for instance, the criterion is not a convex function of the policy). We can rewrite the dual problem as the one of minimizing the l.s.c. function

$$d(\lambda) := -\delta(\lambda) = \sigma_K(\lambda) + \sup_{u \in \Phi} \big( -V_{i_0}^\lambda(u) \big). \tag{7.115}$$

Theorem 7.34 Let hypotheses (7.112)–(7.114) hold and the primal problem be feasible. Then
(i) The set of solutions of the dual problem (7.111) is nonempty and compact, and λ is a dual solution iff there exists a Borelian probability measure μ over Φ such that, denoting by E_μ g(u) = ∫_Φ g(u) dμ(u) the associated expectation, the following holds:

$$\operatorname{supp} \mu \subset \operatorname*{argmin}_{u \in \Phi} L(u, \lambda); \qquad \mathbb{E}_\mu W_{i_0}(u) \in K; \qquad \lambda \in N_K\big(\mathbb{E}_\mu W_{i_0}(u)\big). \tag{7.116}$$

(ii) Problems (7.107) and (7.111) have equal values, and there exists a primal-dual solution (ū, λ). Any such primal-dual solution (ū, λ) is characterized by the relations

$$\bar{u} \in \operatorname*{argmin}_{u \in \Phi} L(u, \lambda); \qquad W_{i_0}(\bar{u}) \in K; \qquad \lambda \in N_K\big(W_{i_0}(\bar{u})\big). \tag{7.117}$$

Proof (i) It is easily checked that u ↦ (V_{i_0}(u), W_{i_0}(u)) is continuous. Indeed, let u and u′ be two policies. Then

$$V(u) = (1 - \beta) c(u) + \beta M(u) V(u); \qquad V(u') = (1 - \beta) c(u') + \beta M(u') V(u'), \tag{7.118}$$

so that W := V(u′) − V(u) satisfies

$$W = (1 - \beta)\big(c(u') - c(u)\big) + \beta M(u') W + \beta \big(M(u') - M(u)\big) V(u). \tag{7.119}$$

Since M(u′) is a stochastic matrix, it is easily deduced that, since ‖V(u)‖_∞ ≤ ‖c(u)‖_∞:

$$\begin{aligned} (1 - \beta) \|W\|_\infty &\le \|(I - \beta M(u')) W\|_\infty \\ &\le (1 - \beta) \|c(u') - c(u)\|_\infty + \beta \|(M(u') - M(u)) V(u)\|_\infty \\ &\le (1 - \beta) \|c(u') - c(u)\|_\infty + \beta \|M(u') - M(u)\|_\infty \|c\|_\infty. \end{aligned} \tag{7.120}$$

Since c and M are uniformly continuous, the continuity of V(u) follows. We easily deduce that the set of solutions is not empty.
(ii) Given a sequence ε_n ↓ 0 of positive numbers, consider the associated perturbed cost function

$$c_i^n(u) := c_i(u) + \varepsilon_n |u|^2. \tag{7.121}$$

Denote the corresponding value associated with u ∈ Φ by V_{i_0}^n(u); the associated perturbed problem is

$$\min_{u \in \Phi}\ V_{i_0}^n(u); \qquad W_{i_0}(u) \in K. \tag{P_n}$$

Let ū be a solution of the original problem. Then

$$V_{i_0}(\bar{u}) = \bar{V}_{i_0} \le \lim_n \bar{V}_{i_0}^n \le \lim_n V_{i_0}^n(\bar{u}) = V_{i_0}(\bar{u}). \tag{7.122}$$

The first inequality follows from c_i(u) ≤ c_i^n(u), and the two other relations are obvious. So, V̄_{i_0}^n → V̄_{i_0}.
Let λ^n be a dual solution of the perturbed problem (P_n) (it exists by the same arguments as for the nominal problem). In view of the qualification condition (7.113), {λ^n} is bounded (adapt the arguments in the proof of Proposition 1.160). Extracting a subsequence if necessary, we may assume that λ^n → λ̄.
For u ∈ Φ and j ∈ S, set V_j^{λ,n}(u) := V_j^n(u) + λ · W_j(u), as well as

$$V_j^{\lambda,n} := \min_{u \in \Phi} V_j^{\lambda,n}(u); \qquad L^n(u, \lambda) := V_{i_0}^{\lambda,n}(u). \tag{7.123}$$

Let Φ^n be the set of u ∈ Φ that attain the minimum in L^n(·, λ^n). Let u^n ∈ Φ^n, having accessible set S_{u^n} when starting from state i_0. By Theorem 7.14, for all i ∈ S_{u^n}, u_i^n attains the minimum over U_i of u ↦ (1 − β)(c_i^n(u) + λ^n · Ψ_i(u)) + β Σ_{j∈S} M_{ij}(u) V_j^{λ^n,n}. The latter being strictly convex, all optimal policies have the same value at i_0, and therefore by induction the same accessible set, and coincide over this accessible set. Of course, for states outside the accessible set, the control can take arbitrary values. So all elements of Φ^n have the same value of the constraint W_{i_0}(u). By Proposition 1.164, (P_n) and its dual have the same value, u^n is a solution of (P_n), and we have that

$$\begin{cases} W_{i_0}(u^n) \in K; \qquad \lambda^n \in N_K(W_{i_0}(u^n)); \\ V_{i_0}^n(u^n) + \lambda^n \cdot W_{i_0}(u^n) \le V_{i_0}^n(u) + \lambda^n \cdot W_{i_0}(u), \quad \text{for all } u \in \Phi. \end{cases} \tag{7.124}$$

We have proved that val(P_n) → val(P). Passing to the limit in the above optimality conditions at (u^n, λ^n), we obtain that the limit point (ū, λ̄) satisfies the optimality conditions for the original problem, i.e.

$$\begin{cases} W_{i_0}(\bar{u}) \in K; \qquad \bar{\lambda} \in N_K(W_{i_0}(\bar{u})); \\ V_{i_0}(\bar{u}) + \bar{\lambda} \cdot W_{i_0}(\bar{u}) \le V_{i_0}(u) + \bar{\lambda} \cdot W_{i_0}(u), \quad \text{for all } u \in \Phi. \end{cases} \tag{7.125}$$

By Proposition 1.164, ū is a primal solution and the primal and dual problems have the same value. That the primal-dual solutions are characterized by (7.117) is a standard result of duality theory. □

7.2.1.4 Probabilistic Constraints

Let Ŝ ⊂ S. We consider the standard problem of minimization of a controlled Markov chain over a finite horizon N, with the additional probability constraint on the final state: P[x^N ∈ Ŝ] ≤ α. The constraint can be rewritten as an expectation constraint:

$$\mathbb{E}^u\, 1_{\hat{S}}(x^N) \le \alpha. \tag{7.126}$$

Since we have a scalar inequality constraint, we may take K := (−∞, α], and the qualification condition (7.113) is equivalent to the existence of û ∈ Φ such that

$$\mathbb{E}^{\hat{u}}\, 1_{\hat{S}}(x^N(\hat{u})) < \alpha. \tag{7.127}$$

We assume that, for u ∈ U_i^k, i ∈ S and k = 0 to N − 1:

$$\begin{cases} \text{each } U_i^k \text{ is a nonempty, convex, compact subset of } \mathbb{R}^m, \\ u \mapsto M^k(u) \text{ is affine,} \\ u \mapsto c_i^k(u) \text{ is continuous and convex.} \end{cases} \tag{7.128}$$

Theorem 7.35 Let (7.127) and (7.128) hold. Then the primal and dual problems
have the same value, and a nonempty set of solutions.

Proof Adapt the techniques in the proofs of the previous statements to the case of a
finite horizon. 

Remark 7.36 Obviously the technique can easily be adapted to the case of several
probabilistic constraints.

7.2.2 Partial Information

7.2.2.1 Open Loop Control

We next come back to the finite horizon framework. Assume that the U_i are equal to a set denoted by U, and consider the problem of control of the Markov chain without observation of the state, knowing only a probability law of the initial state.
We consider a problem starting at time k in {0, . . . , N − 1}, with initial probability law π^k. An open-loop policy is now an element u of U^{N−k}, whose component u^ℓ represents the decision taken at time ℓ = k, . . . , N − 1. The transition matrices M^ℓ(u) are known, and therefore so are the probability laws of the x^ℓ:

$$\pi^{\ell+1}(u) = \pi^\ell(u) M^\ell(u^\ell), \qquad \ell = k, \ldots, N - 1. \tag{7.129}$$

Equivalently, for ℓ = k + 1, . . . , N:

$$\pi^\ell(u) = \pi^k M^{k\ell}(u), \qquad \text{where } M^{k\ell}(u) := \prod_{q=k}^{\ell-1} M^q(u^q). \tag{7.130}$$

So, the criterion associated with an open-loop policy u ∈ U^{N−k} and an initial probability law π^k is

$$V^k(u, \pi^k) = \mathbb{E}^u \Big[ \sum_{\ell=k}^{N-1} c_{x^\ell}(u^\ell) + \varphi_{x^N} \Big] = \sum_{\ell=k}^{N-1} \pi^\ell(u) c^\ell(u^\ell) + \pi^N(u) \varphi. \tag{7.131}$$

It is a linear function of π^k:

$$V^k(u, \pi^k) = \pi^k \hat{V}^k(u), \qquad \text{where } \hat{V}^k(u) := \sum_{\ell=k}^{N-1} M^{k\ell}(u) c^\ell(u^\ell) + M^{kN}(u) \varphi. \tag{7.132}$$

Note that M^{kk}(u) is the identity mapping. For any open-loop policy u, the linear mapping π^k ↦ V^k(u, π^k) is Lipschitz from ℓ^1 into ℝ, with constant at most

$$L := N \|c\|_\infty + \|\varphi\|_\infty. \tag{7.133}$$

Set

$$\mathcal{U} := \text{the set of mappings } S \to U. \tag{7.134}$$

Since an infimum of uniformly Lipschitz functions is Lipschitz with the same constant, the Bellman values

$$\bar{V}^k(\pi) = \inf_{u \in U^{N-k}} \pi \hat{V}^k(u) \tag{7.135}$$

are also Lipschitz with constant given by (7.133).


Theorem 7.37 The value functions V̄^k(π) satisfy the dynamic programming principle

$$\bar{V}^k(\pi) = \inf_{u \in U} \big( \pi c^k(u) + \bar{V}^{k+1}(\pi M^k(u)) \big), \quad k = 0, \ldots, N - 1; \qquad \bar{V}^N(\pi) = \pi \varphi. \tag{7.136}$$

Proof Elementary, left to the reader. □
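A sketch of (7.129)–(7.131): propagate the law π^ℓ forward and accumulate the expected costs (two states, two decisions, all data invented):

```python
import numpy as np

def open_loop_value(pi0, u, M, c, phi):
    """Criterion (7.131): propagate the law by (7.129) and accumulate costs.

    u    : open-loop decisions u^0, ..., u^{N-1}
    M(a) : transition matrix for decision a;  c(a) : cost vector for decision a
    """
    pi, total = np.asarray(pi0, dtype=float), 0.0
    for a in u:
        total += pi @ c(a)       # expected running cost at this stage
        pi = pi @ M(a)           # pi^{l+1} = pi^l M(u^l), cf. (7.129)
    return total + pi @ np.asarray(phi)

# Invented data: decision 0 keeps the state, decision 1 swaps the two states.
M = lambda a: np.eye(2) if a == 0 else np.array([[0.0, 1.0], [1.0, 0.0]])
c = lambda a: np.array([1.0, 0.0]) if a == 0 else np.array([0.5, 0.5])
phi = np.array([0.0, 3.0])
val = open_loop_value(pi0=np.array([1.0, 0.0]), u=[1, 0], M=M, c=c, phi=phi)
```

The value is linear in the initial law π^k, as stated in (7.132), so optimizing over all open-loop sequences amounts to minimizing finitely many linear functions of π^k.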
Remark 7.38 The state space is now continuous. In order to get an effective algorithm
we need to discretize it. One possibility is a triangulation of the domain, see [13,
Appendix A by M. Falcone]. In most cases the dimension of the problem will make
the numerical resolution very difficult.

7.2.2.2 Costate and Hamiltonian: A General Setting

We can link the previous results to the first-order optimality conditions of some
discrete-time optimal control problem. For the sake of clarity, let us first consider an
abstract discrete-time optimal control problem with state equation

    y^k = F_k(u^k, y^{k−1}),  k = 1, . . . , N;   ŷ^0 − y^0 = 0.     (7.137)

The state variables y^k belong to ℝ^n, and the control variables u^k belong to ℝ^m.
The initial state ŷ^0 ∈ ℝ^n and the dynamics F_k : ℝ^m × ℝ^n → ℝ^n, for k = 1 to N, are
given. The state and control spaces are resp. Y := (ℝ^n)^{N+1} and U := (ℝ^m)^N. Given
a control u ∈ U, the state equation has a unique solution in Y, denoted by y[u].
The cost function is, for given ℓ_k : ℝ^m × ℝ^n → ℝ, k = 1 to N, and Ψ : ℝ^n → ℝ:

    J(u, y) := \sum_{k=1}^{N} ℓ_k(u^k, y^{k−1}) + Ψ(y^N).     (7.138)

The reduced cost is f(u) := J(u, y[u]). The optimal control problem is

    Min_{u ∈ U} f(u);   u^k ∈ U_k,  k = 1, . . . , N,     (7.139)
250 7 Markov Decision Processes

where the U_k are subsets of ℝ^m. The Lagrangian of the problem is

    L(u, y, p) := J(u, y) + \sum_{k=1}^{N} p^k · ( F_k(u^k, y^{k−1}) − y^k ) + p^0 · ( ŷ^0 − y^0 ).     (7.140)

We next assume that the functions F_k, ℓ_k and Ψ are continuously differentiable. The
costate equation is obtained by setting

    D_y L(u, y, p) = 0.     (7.141)

By a first-order Taylor expansion, one easily checks that this is equivalent to

    p^k = ∇_y ℓ_{k+1}(u^{k+1}, y^k) + D_y F_{k+1}(u^{k+1}, y^k)^⊤ p^{k+1},  k = 0, . . . , N − 1,     (7.142)

with the final condition

    p^N = ∇Ψ(y^N).     (7.143)

Given (u, y) with y = y[u], the (backward) costate equation has a unique solution,
denoted by p[u] and called the costate associated with u. Since f(u) =
L(u, y[u], p[u]), D_y L(u, y[u], p[u]) = 0, and (u, y[u]) satisfies the state equation,
we have, by the chain rule:

    ∇f(u) = ∇_u L(u, y[u], p[u]).     (7.144)

Introduce the Hamiltonian function, for (u, y, p) ∈ ℝ^m × ℝ^n × ℝ^n and k = 1 to N:

    H_k(u, y, p) := ℓ_k(u, y) + p · F_k(u, y).     (7.145)

Setting p = p[u], we obtain that, for k = 1 to N:

    ∇_{u^k} f(u) = ∇_u ℓ_k(u^k, y^{k−1}) + D_u F_k(u^k, y^{k−1})^⊤ p^k = ∇_u H_k(u^k, y^{k−1}, p^k).     (7.146)

We obtain the following:

Lemma 7.39 Let u be a local solution of the optimal control problem. For k = 1 to
N, if the sets U_k are convex, then

    ∇_u H_k(u^k, y^{k−1}, p^k) · (v − u^k) ≥ 0,  for all v ∈ U_k.     (7.147)

Proof Let v ∈ U_k and k ∈ {1, . . . , N}. For t ∈ (0, 1), set w_t^k := (1 − t)u^k + tv, and
w_t^ℓ := u^ℓ for ℓ ∈ {1, . . . , N}, ℓ ≠ k. Since u is a local solution, we have that

    0 ≤ \lim_{t↓0} ( f(w_t) − f(u) ) / t = ∇_{u^k} f(u) · (v − u^k),     (7.148)

and we conclude using (7.146). 


Remark 7.40 (i) Note that (7.147) is the first-order necessary condition for the
optimization problem
    Min_{v ∈ U_k} H_k(v, y^{k−1}, p^k).     (7.149)

(ii) If, in addition, for k = 1 to N, ℓ_k is a convex function of its first argument, and F_k
is an affine function of its first argument, then H_k(·, y^{k−1}, p^k) is a convex function,
and (7.147) holds iff u^k is a solution of (7.149).
By analogy with continuous-time optimal control problems, we will say that u
satisfies Pontryagin's principle [87] if u^k is a solution of (7.149), for k = 1 to N.
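The costate machinery can be checked numerically. The sketch below (hypothetical linear–quadratic data, not taken from the book) computes ∇f(u) through the costate recursion (7.142)–(7.143) and formula (7.146), and compares it with central finite differences, as (7.144) predicts:

```python
import numpy as np

# Hypothetical data: F_k(u, y) = A y + B u, running cost
# l_k(u, y) = 0.5*(|u|^2 + |y|^2), final cost Psi(y) = 0.5*|y|^2.
rng = np.random.default_rng(0)
n, m, N = 3, 2, 5
A = 0.3 * rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
y0 = rng.standard_normal(n)

def forward(u):                      # states y[0..N]; u[k] plays the role of u^{k+1}
    y = [y0]
    for k in range(N):
        y.append(A @ y[k] + B @ u[k])
    return y

def cost(u):                         # J(u, y[u]) of (7.138)
    y = forward(u)
    return sum(0.5 * (u[k] @ u[k] + y[k] @ y[k]) for k in range(N)) \
        + 0.5 * y[N] @ y[N]

def gradient(u):                     # costate (7.142)-(7.143), then (7.146)
    y = forward(u)
    p = [None] * (N + 1)
    p[N] = y[N]                      # grad Psi(y^N)
    for k in range(N - 1, -1, -1):   # p^k = grad_y l_{k+1} + (DyF_{k+1})^T p^{k+1}
        p[k] = y[k] + A.T @ p[k + 1]
    return np.array([u[k] + B.T @ p[k + 1] for k in range(N)])

u = rng.standard_normal((N, m))
g = gradient(u)
# Central finite-difference check of (7.144).
eps = 1e-6
g_fd = np.zeros_like(u)
for k in range(N):
    for j in range(m):
        du = np.zeros_like(u)
        du[k, j] = eps
        g_fd[k, j] = (cost(u + du) - cost(u - du)) / (2 * eps)
assert np.allclose(g, g_fd, atol=1e-5)
```

Since the box constraints are absent here, the stationarity condition (7.147) reduces to ∇f(u) = 0 at a minimizer.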

7.2.2.3 Costate and Hamiltonian in a Markov Chain Setting

We apply the previous results to the Markov chain open-loop setting (7.135). The
state equation is the law of the Markov chain process in the absence of observation
(but here writing the state as a vertical vector in order to fit the optimal control
setting):

    ν^{k+1} = M^k(u^k)^⊤ ν^k,  k = 0, . . . , N − 1;   ν̂^0 − ν^0 = 0,     (7.150)

where ν̂^0 is a given probability law on S. The control variables are the u^k, and the
state variables are the laws ν^k, represented as vertical vectors. The cost function is

    J(u, ν) := \sum_{k=0}^{N−1} ν^k · c^k + ν^N · ϕ.     (7.151)

Therefore, the problem is

    Min_{u,ν} J(u, ν)  s.t. (7.150) and u ∈ U^{N−k}.     (7.152)

The Lagrangian function, with costate denoted by W, is

    L(u, ν, W) := J(u, ν) + \sum_{k=0}^{N−1} W^{k+1} · ( M^k(u^k)^⊤ ν^k − ν^{k+1} ) + W^0 · (ν̂^0 − ν^0).     (7.153)
So, the costate equation gives

    W^N = ϕ;   W^k = c^k + M^k(u^k) W^{k+1},  k = 0, . . . , N − 1.     (7.154)

Therefore, W is equal to the value function V. In addition, since the expression of
the Hamiltonian is

    H_k(u, ν, W) = ν · c^k + ν^⊤ M^k(u) W,     (7.155)

we see that the dynamic programming principle is equivalent to Pontryagin’s prin-


ciple (7.149). We have proved that:

Lemma 7.41 The costate associated with the problem (7.152) coincides with the
value function V , and Pontryagin’s principle (7.149) holds for this problem.

7.2.2.4 Nonlinear Filtering for a Markov Chain

As before, x^ℓ is the state of a Markov chain, and at each time step ℓ we observe a
signal y^ℓ taking values in a finite set Y. The process starts at, say, time k and finishes
at time N. Let k ≤ n ≤ N. The probability of ((x^k, y^k), . . . , (x^n, y^n)), given the
initial probability law π^{k,y^k} ∈ ℓ^1 for x^k (which therefore depends on the initial
signal y^k), is

    P( ((x^k, y^k), . . . , (x^n, y^n)) | π^{k,y^k} ) = π^{k,y^k}_{x^k} \prod_{ℓ=k}^{n−1} M^{ℓ,y^{ℓ+1}}_{x^ℓ x^{ℓ+1}}.     (7.156)

Here M^{ℓ,r}_{ij} represents the probability, being in state i at time ℓ, of having both the
transition to state j and the observation r at time ℓ + 1. So, we have that

    M^{ℓ,r}_{ij} ≥ 0;   \sum_{r ∈ Y} \sum_{j ∈ S} M^{ℓ,r}_{ij} = 1,  for all i ∈ S and ℓ ∈ {k, . . . , N − 1}.     (7.157)

The marginal law of (y^k, . . . , y^n, x^n) given π^{k,y^k} is

    P(y^k, . . . , y^n, x^n | π^{k,y^k}) = \sum_{x^k, . . . , x^{n−1}} P( ((x^k, y^k), . . . , (x^n, y^n)) | π^{k,y^k} ).     (7.158)

Therefore,

    P(y^k, . . . , y^n, x^n | π^{k,y^k}) = π^{k,y^k} ( \prod_{ℓ=k}^{n−1} M^{ℓ,y^{ℓ+1}} ) e_{x^n},     (7.159)

where here e_i denotes the element of ℓ^∞ with zero components except for the ith
one, equal to 1. So, the probability law for the observations is

    P(y^k, . . . , y^n | π^{k,y^k}) = π^{k,y^k} ( \prod_{ℓ=k}^{n−1} M^{ℓ,y^{ℓ+1}} ) 1.     (7.160)

The conditional law of x^n, knowing the 'initial' law at time k and the signal up to
time n, is therefore

    q^n = P(x^n | (y^k, . . . , y^n, π^{k,y^k})) = ( π^{k,y^k} \prod_{ℓ=k}^{n−1} M^{ℓ,y^{ℓ+1}} ) / ( π^{k,y^k} ( \prod_{ℓ=k}^{n−1} M^{ℓ,y^{ℓ+1}} ) 1 ).     (7.161)

One usually computes the marginal law (and therefore the conditional law) by
induction, in the following way, for n > k:

    p^k = π^{k,y^k};   p^n := p^{n−1} M^{n−1,y^n};   q^n := p^n / (p^n 1).     (7.162)
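A minimal sketch of the recursion (7.162), with hypothetical transition/observation data satisfying (7.157):

```python
import numpy as np

def filter_laws(pi0, M, signals):
    """Unnormalized/normalized filter of (7.162).

    M[l][r] has entries M^{l,r}_{ij}: probability, from state i at time l, of
    moving to state j while observing signal r at time l+1.
    `signals` lists the observed y^{k+1}, ..., y^n.
    """
    p = np.asarray(pi0, dtype=float)
    for l, r in enumerate(signals):
        p = p @ M[l][r]              # p^n = p^{n-1} M^{n-1, y^n}
    return p, p / p.sum()            # marginal weight and conditional law q^n

# Hypothetical 2-state, 2-signal model; for each i, the entries over (j, r)
# sum to 1, as required by (7.157).
M_step = {0: np.array([[0.6, 0.1], [0.1, 0.2]]),
          1: np.array([[0.2, 0.1], [0.3, 0.4]])}
M = [M_step, M_step]                 # time-homogeneous for the sketch
p, q = filter_laws([0.5, 0.5], M, signals=[0, 1])
assert abs(q.sum() - 1.0) < 1e-12
```

Here `p.sum()` is exactly the observation probability (7.160), and `q` the conditional law (7.161).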
Next, knowing (y^k, . . . , y^n, π^{k,y^k}), the probability that y^{n+1} = z is

    P(y^k, . . . , y^n, y^{n+1} = z | π^{k,y^k}) / P(y^k, . . . , y^n | π^{k,y^k}) = q^n M^{n,z} 1.     (7.163)
As expressed by (7.162), the conditional law at step n + 1, knowing π^{k,y^k} and
(y^k, . . . , y^{n+1}) with y^{n+1} = z, will be

    q^{n+1} := q^n M^{n,z} / (q^n M^{n,z} 1),   with probability q^n M^{n,z} 1, for any z ∈ Y.     (7.164)

This is the equation of a dynamical system with state q^n, whose transitions are
governed by probability laws depending only on the state. We see that this structure
is very similar to that of Markov chains. Consider the value function

    V^k(q) := \sum_{ℓ=k}^{N−1} π^ℓ c^ℓ + π^N ϕ.     (7.165)

Here π^ℓ is the law of the process at time ℓ, with initial value q at time k, and the
functions c^ℓ and ϕ belong to ℓ^∞. So, the value function satisfies the following
equation:

    V^n(q) = q c^n + \sum_{z ∈ Y} (q M^{n,z} 1) V^{n+1}( q M^{n,z} / (q M^{n,z} 1) ),  n = k, . . . , N − 1;
    V^N(q) = q ϕ.     (7.166)

7.2.2.5 Control with Partial Information

We consider a similar setting, the decision u^n ∈ U (the control set being independent
of the state) being taken at time n knowing the initial law π^{k,y^k} and the
observations (y^k, . . . , y^n), and the cost function c^n and transition matrices M^{n,z}
being functions of u^n, for all k ≤ n < N. We assume that the cost functions c^n(·)
are uniformly bounded and that ϕ belongs to ℓ^∞.
By similar arguments, we obtain that the conditional law at step n + 1, knowing
(u^k, . . . , u^n) and (y^k, . . . , y^{n+1}) with y^{n+1} = z, will be
    q^{n+1} := q^n M^{n,z}(u^n) / (q^n M^{n,z}(u^n) 1),   with probability q^n M^{n,z}(u^n) 1, for any z ∈ Y.     (7.167)

The dynamic programming principle reads

    V^n(q) = \min_{u ∈ U} { q c^n(u) + \sum_{z ∈ Y} (q M^{n,z}(u) 1) V^{n+1}( q M^{n,z}(u) / (q M^{n,z}(u) 1) ) },  n = k, . . . , N − 1;
    V^N(q) = q ϕ.     (7.168)

7.2.3 Linear Programming Formulation

We come back to the setting of Sect. 7.1.3: infinite horizon, discount factor β ∈
(0, 1), with value function denoted by V. We say that v ∈ ℓ^∞ is a subsolution of the
'discounted' dynamic programming equation if

    v_i ≤ (1 − β) c_i(u) + β \sum_{j ∈ S} M_{ij}(u) v_j,  for all i ∈ S and u ∈ U_i.     (7.169)

Setting

    δ_i := \inf_u ( (1 − β) c_i(u) + β \sum_{j ∈ S} M_{ij}(u) v_j ) − v_i,     (7.170)

we see that δ ≥ 0 and that v is a solution of the discounted dynamic programming
equation with cost c_i(u) − δ_i. Then v ≤ V (since it is easily checked that the value
is a nondecreasing function of the cost). It follows that V is the greatest subsolution
of the discounted dynamic programming equation. Let π be an arbitrary probability
on S, with positive components. A way to compute V is to solve the optimization
problem

    Min_{v ∈ ℓ^∞} − \sum_{i ∈ S} π_i v_i   s.t. (7.169).     (7.171)

Assume next that both S and the sets U_i, for all i ∈ S, are finite. Then (7.171) is a
linear programming problem, which gives a way to numerically solve the problem.
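The LP (7.171) can be handed to an off-the-shelf solver. The following sketch (hypothetical two-state, two-action data; `scipy` is assumed to be available) builds the constraints (7.169) and cross-checks the LP solution against fixed-point iteration on the discounted dynamic programming equation:

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical 2-state, 2-action discounted MDP.
beta = 0.9
c = np.array([[1.0, 2.0],                       # c_i(u): row = state i, col = action u
              [0.5, 3.0]])
M = {0: np.array([[0.8, 0.2], [0.3, 0.7]]),     # one transition matrix M(u) per action
     1: np.array([[0.1, 0.9], [0.6, 0.4]])}

# Constraints (7.169): v_i - beta * sum_j M_ij(u) v_j <= (1 - beta) c_i(u).
A_ub, b_ub = [], []
for i in range(2):
    for u in range(2):
        row = -beta * M[u][i]
        row[i] += 1.0
        A_ub.append(row)
        b_ub.append((1 - beta) * c[i, u])

# Maximize pi . v, i.e. minimize -pi . v, for any positive probability pi.
pi = np.array([0.5, 0.5])
res = linprog(-pi, A_ub=np.array(A_ub), b_ub=b_ub, bounds=[(None, None)] * 2)
V_lp = res.x

# Cross-check against fixed-point iteration on the same equation.
V = np.zeros(2)
for _ in range(2000):
    V = np.min((1 - beta) * c
               + beta * np.array([[M[u][i] @ V for u in range(2)]
                                  for i in range(2)]), axis=1)
assert np.allclose(V_lp, V, atol=1e-6)
```

The variables v are free (no sign constraint), which is why the `bounds` entries are `(None, None)`.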
The associated Lagrangian function is

    L(v, λ) := −π v + \sum_{i ∈ S} \sum_{u ∈ U_i} λ_i(u) ( v_i − (1 − β) c_i(u) − β \sum_{j ∈ S} M_{ij}(u) v_j ).     (7.172)
So, the expression of the dual problem is:

    Max_{λ ≥ 0} −(1 − β) \sum_{i ∈ S} \sum_{u ∈ U_i} c_i(u) λ_i(u);   \sum_{u ∈ U_i} λ_i(u) = π_i + β \sum_{j ∈ S} \sum_{û ∈ U_j} M_{ji}(û) λ_j(û).     (7.173)

7.3 Ergodic Markov Chains

We now consider what happens for undiscounted finite horizon processes when the
horizon goes to infinity. Under appropriate hypotheses, we are able to compute the
limit of the average reward per unit time, and to extend these results to the controlled
setting.
In this section we assume the state space to be finite.

7.3.1 Orientation

Consider an autonomous (uncontrolled) Markov chain (c, M) with finite state space
S and cost function c ∈ ℓ^∞. Consider the sequence formed by the value iteration
operator:
    V^{n+1} = c + M V^n,     (7.174)

initialized with V^0 = 0, so that V^1 = c, V^2 = c + Mc, etc. Then V^q represents the
value function at time zero for a problem with horizon q, running cost c, and zero
final cost:
    V^n = c + Mc + · · · + M^{n−1} c.     (7.175)

Setting
    S^n := (I + M + · · · + M^n)/(n + 1),     (7.176)

we may write V^n = n S^{n−1} c. In general, V^n grows at a linear rate, so we study
possible limits of the average cost over the horizon n, i.e.

    V̄^n := V^n / n = S^{n−1} c.     (7.177)
Observe that
    (M − I) S^{n−1} = S^{n−1} (M − I) = (M^n − I)/n,     (7.178)

and therefore
    (M − I) V̄^n = (M − I) S^{n−1} c = (1/n)(M^n − I) c.     (7.179)
Since M^n is a bounded sequence, the r.h.s. converges to 0. Therefore, any limit-point
of V̄^n is an eigenvector of M with eigenvalue 1. Since 1 is an eigenvector of M with
eigenvalue 1, we may ask when V̄^n converges to a multiple of 1.
A related question is, given a probability law π^0 for the starting point of the
Markov chain, to see how the related probabilities π^n at step n and the average
probability π̄^n over the first n steps behave. We know that π^n = π^0 M^n, so that

    π̄^n := (1/n)(π^0 + · · · + π^{n−1}) = π^0 S^{n−1},  for n ≥ 1.     (7.180)

By (7.178) it follows that

    π̄^n (M − I) = π^0 S^{n−1} (M − I) = π^0 (M^n − I)/n,  for n ≥ 1.     (7.181)

Any limit point π̄ of π̄^n is a probability that, by the above display, is an invariant
probability, in the sense that
    π̄ = π̄ M.     (7.182)

The expected value, when x^0 has law π^0, is

    V̄^n(π^0) := π̄^n c = π^0 V̄^n.     (7.183)

Definition 7.42 Given a sequence w^n in ℝ^m, we say that w̄ is the Cesaro limit of
w^n, and write w̄ = C-lim w^n, if w̄ = \lim_n (1/n) \sum_{k=0}^{n−1} w^k.

It may happen that the sequence w^n has a Cesaro limit but does not converge; take
for example w^n = (−1)^n. On the other hand, if w^n has a limit, then w^n converges in
the Cesaro sense to the same limit. If w^n → w̄ at a linear rate, in the sense that

    |w^n − w̄| ≤ C η^n,  for some C > 0 and η ∈ (0, 1),     (7.184)

then

    | (1/n) \sum_{k=0}^{n−1} w^k − w̄ | = | (1/n) \sum_{k=0}^{n−1} (w^k − w̄) | ≤ (C/n) (1 − η^n)/(1 − η).     (7.185)

Taking the example of a constant sequence (except for the first term), we see that
convergence in the Cesaro sense is typically at best at speed 1/n.
Coming back to Markov chains, in view of (7.180) and (7.183), we obviously have:

    if S^n → S̄ in the Cesaro sense, then π^n → π̄ = π^0 S̄ in the Cesaro sense, and V̄^n → S̄ c.     (7.186)

Example 7.43 Consider an uncontrolled Markov chain with S = {1, 2} and M equal
to the permutation matrix M := ( 0 1 ; 1 0 ). Then M^n is equal to M if n is odd, and
equal to the identity otherwise. So, M^n and π^n have no limit. However, we have the
Cesaro limits

    M^n → (1/2)( 1 1 ; 1 1 ),   π^n → (1/2, 1/2),   V̄^n → (1/2)(c_1 + c_2) 1   (in the Cesaro sense).     (7.187)
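Example 7.43 is easy to reproduce numerically (a small sketch):

```python
import numpy as np

# M^n oscillates between M and the identity, yet the Cesaro averages
# S^{n-1} = (I + M + ... + M^{n-1})/n converge to the rank-one matrix with
# all entries 1/2, as in (7.187).
M = np.array([[0.0, 1.0], [1.0, 0.0]])
n = 200
S = sum(np.linalg.matrix_power(M, k) for k in range(n)) / n
assert np.allclose(S, 0.5)

pi0 = np.array([1.0, 0.0])
assert np.allclose(pi0 @ S, [0.5, 0.5])   # Cesaro limit of pi^n
```

With an even number of terms the average is exact; for odd n it differs by O(1/n), illustrating the 1/n rate noted after (7.185).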

7.3.2 Transient and Recurrent States

With a transition matrix M we associate the graph G_M in which the set of nodes (or
vertices) is the state set S, and there is an edge (a directed arc) from vertex i to
vertex j iff M_{ij} > 0.

Definition 7.44 An (n-step) walk in G_M is an ordered string of nodes (i_0, . . . , i_n),
n ≥ 1, such that there exists an arc from i_k to i_{k+1}, for k = 0 to n − 1. A path is a
walk in which no node is repeated. A cycle is a walk with the same initial and final
node, and no other repeated node.

It is easily checked that there exists an n-step walk from i to j iff M^n_{ij} > 0, i.e., iff j
is n-step accessible from i (Definition 7.13). We say that two states i, j communicate
if each of them is accessible from the other. This is an equivalence relation, whose
classes are called communication classes, or just classes. A recurrent class is a class
that contains all states that are accessible from any of its elements. Once the state
enters this class, it stays in it forever. A transient class is a class that is not recurrent.
A state is transient (resp. recurrent) if it belongs to a transient (resp. recurrent) class.

Definition 7.45 The class graph is the graph whose nodes are the communication
classes, with a directed arc between two classes C, C′ iff C ≠ C′ and M_{ij} > 0 for
some i ∈ C and j ∈ C′.

Observe that the class graph is acyclic (it contains no cycle) so that each maximal
path ends in a recurrent class. In particular, there exists at least one recurrent class.
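This classification can be computed directly from the graph G_M. A small sketch (a hypothetical helper; reachability via repeated boolean squaring, fine for small m):

```python
import numpy as np

def classify_states(M, tol=0.0):
    """Communication classes of G_M, each flagged recurrent or transient."""
    m = len(M)
    # Reflexive-transitive closure of the adjacency relation M_ij > 0.
    reach = ((np.asarray(M) > tol) | np.eye(m, dtype=bool)).astype(int)
    for _ in range(m):
        reach = ((reach @ reach) > 0).astype(int)
    reach = reach > 0
    comm = reach & reach.T            # i and j communicate
    classes, seen = [], set()
    for i in range(m):
        if i in seen:
            continue
        cls = frozenset(j for j in range(m) if comm[i, j])
        seen |= set(cls)
        # Recurrent iff every state accessible from i communicates back.
        recurrent = all(comm[i, j] for j in range(m) if reach[i, j])
        classes.append((cls, recurrent))
    return classes

# Hypothetical chain: state 0 is transient, {1, 2} is a recurrent class.
M = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.3, 0.7],
              [0.0, 0.6, 0.4]])
classes = classify_states(M)
```

For larger state spaces one would instead compute strongly connected components in linear time (Tarjan's algorithm); the quadratic/cubic closure above just keeps the sketch short.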

7.3.2.1 Invariant Probabilities

Recall that a probability π is invariant iff π = π M, i.e., iff it is a left eigenvector of
M with eigenvalue 1. If M is the identity operator, any probability law is invariant.
Therefore, invariant probabilities are in general nonunique. We call the support of
a probability law the set of states over which it is nonzero, and if B ⊂ S, we set
π(B) := \sum_{i ∈ B} π_i.

Lemma 7.46 Let π be an invariant probability law. Then, for all i ∈ S, π_i = 0
whenever i is transient.

Proof Let π be an invariant probability law. Let T (resp. R) denote the set of transient
(resp. recurrent) states. Then M^n_{ij} = 0 if i ∈ R and j ∈ T, for any n ≥ 1, and so,

    π(T) = \sum_{j ∈ T} \sum_{i ∈ S} π_i M^n_{ij} = \sum_{j ∈ T} \sum_{i ∈ T} π_i M^n_{ij} = \sum_{i ∈ T} π_i \sum_{j ∈ T} M^n_{ij}.     (7.188)

This implies that if π_i ≠ 0 for some i ∈ T, then \sum_{j ∈ T} M^n_{ij} = 1 for all n ≥ 1,
meaning that all states accessible from i are transient, contradicting the fact that (as
is easily established) some recurrent state must be accessible from any transient state. □

Lemma 7.47 Let C denote the square submatrix of M corresponding to the rows and
columns associated with transient states. Then C^n → 0 at a linear rate.

Proof For large enough n, the probability that the Markov chain starting at any
transient state i ∈ T is in a recurrent state is positive, say greater than ε > 0 when
n > n_0. Let n > n_0. By the previous discussion, \sum_{j ∈ T} C^n_{ij} < 1 − ε. But then C^n is
a strict contraction in ℓ^∞ since, for any v ∈ ℓ^∞:

    ‖C^n v‖_∞ ≤ \max_{i ∈ T} \sum_{j ∈ T} C^n_{ij} |v_j| ≤ \max_{i ∈ T} ( \sum_{j ∈ T} C^n_{ij} ) ‖v‖_∞ ≤ (1 − ε) ‖v‖_∞,     (7.189)

and since C is non-expansive, for m = qn + r, 0 ≤ r < n, we have that

    ‖C^m v‖_∞ ≤ (1 − ε)^q ‖v‖_∞ ≤ (1 − ε)^{m/n − 1} ‖v‖_∞.     (7.190)

The conclusion follows. □


Remark 7.48 Let π be an invariant probability and R be a recurrent class. Then
either R is included in the support of π , or π vanishes over R. Indeed, if i, j belong
to R and πi > 0, for some n, Minj > 0, and since π = π M n , π j ≥ πi Minj > 0.

7.3.2.2 Regular Transition Matrices

We start by discussing the contraction property of transition operators.

Lemma 7.49 Let M be a transition operator and y ∈ ℓ^∞. Set z := M y. Then:
(i) the following holds:

    min(y) ≤ min(z);   max(z) ≤ max(y).     (7.191)

The first (resp. second) equality occurs iff, for some i ∈ S, M_{ij} = 0 for any j ∈ S
such that y_j > min(y) (resp. y_j < max(y));
(ii) if M is a transition matrix such that ε := \min_{i,j} M_{ij} is positive, then, for any
y ∈ ℓ^∞, M^n y converges to a constant vector, and

    max(M^n y) − min(M^n y) ≤ (1 − 2ε)^n ( max(y) − min(y) ).     (7.192)

Proof (i) Immediate.
(ii) Given y ∈ ℝ^m, set z := M y, a := min(y), b := max(y), attained at indexes i_1 and
i_2 resp., and ε := \min_{i,j} M_{ij}. Let y′ ∈ ℝ^m be such that y′_{i_1} = a and y′_i = b
otherwise. Since M ≥ 0, \sum_j M_{ij} = 1 and y ≤ y′, we have that, for all i ∈ S:

    z_i = (M y)_i ≤ (M y′)_i = M_{i i_1} a + (1 − M_{i i_1}) b ≤ ε a + (1 − ε) b.     (7.193)

Similarly, let y″ ∈ ℝ^m be such that y″_{i_2} = b and y″_i = a otherwise. Then

    z_i = (M y)_i ≥ (M y″)_i = M_{i i_2} b + (1 − M_{i i_2}) a ≥ ε b + (1 − ε) a.     (7.194)

Setting N(y) := max(y) − min(y), we obtain that N(M y) ≤ (1 − 2ε) N(y), and
conclude by an induction argument. □
7.3 Ergodic Markov Chains 259

We say that M is a regular transition matrix if there exists an n_0 such that M^{n_0}
has no zero components. It is easily checked that M^n then has no zero components
for all n > n_0. If M has a unique invariant probability law π̄, we denote by M̄ the
matrix whose rows are all equal to π̄.
Lemma 7.50 A regular transition matrix M has a unique invariant probability law
π̄, which is the unique left eigenvector of M having an eigenvalue of modulus greater
than or equal to 1. Also, M^n → M̄ at a linear rate, in the sense that, for some C > 0
and η ∈ (0, 1):
    ‖M^n − M̄‖ ≤ C η^n.     (7.195)

Proof (a) Let y ∈ ℝ^m. By Lemma 7.49(i), y^n := M^n y is such that min(y^n) is
nondecreasing and max(y^n) is nonincreasing. By Lemma 7.49(ii), max(y^{n_0 j}) −
min(y^{n_0 j}) → 0 at a linear rate. Combining the two results, we see that y^n converges
towards a constant vector. Taking y equal to column j of M, so that y^n equals column
j of M^n, we deduce that M^n converges to some transition matrix M̄ whose column i
is of the form π̄_i 1, with 0 ≤ min(π̄) ≤ max(π̄) ≤ 1. Since M̄ is a transition matrix,
π̄ has sum 1 and is therefore a probability law. For any horizontal vector z, we have
that

    \lim_n z M^n = z M̄ = (π̄_1, . . . , π̄_m) \sum_i z_i.     (7.196)

(b) Let z be a left eigenvector of M with eigenvalue λ ∈ ℂ. If |λ| > 1 then z M^n = λ^n z
diverges, which contradicts (7.196). If λ = e^{iθ}, then z M^n = e^{niθ} z. By (7.196) this
implies that θ = 0 (modulo 2π) and that z is colinear to π̄. So, π̄ is the unique left
eigenvector of M with an eigenvalue of modulus at least one. In particular, π̄ is the
unique invariant probability law of M. □
Example 7.51 A transition matrix may have nonzero eigenvalues of modulus less
than 1. For instance, M = ( 1 0 ; 1/2 1/2 ) has eigenvalues 1 and 1/2.
Remark 7.52 Let M be a regular transition matrix. The orthogonal space to the
invariant probability π̄ is

    π̄^⊥ = { y ∈ ℓ^∞ ; π̄ y = 0 }.     (7.197)

For y ∈ ℓ^∞, we have the unique decomposition y = α 1 + z, z ∈ π̄^⊥, with α = π̄ y.
Since M̄ y = α 1, it follows from (7.195) that, for some positive η < 1, |M^n z| =
O(η^n), and therefore:
    M^n y = α 1 + O(η^n).     (7.198)

We recover the fact that 1 is a simple eigenvalue of M (the associated eigenspace has
dimension 1) and that the other eigenvalues have modulus less than one. It follows
that, if π^0 is any probability law for x^0, then, for some C > 0:

    |π^0 M^n − π̄| ≤ C η^n.     (7.199)
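A quick numerical check of Lemma 7.50 and (7.199), on a hypothetical regular matrix:

```python
import numpy as np

# For a regular transition matrix, M^n converges to the matrix M-bar whose
# rows all equal the unique invariant law (Lemma 7.50).
M = np.array([[0.7, 0.3],
              [0.4, 0.6]])
Mn = np.linalg.matrix_power(M, 60)
pi_bar = Mn[0]                           # any row approximates pi_bar
assert np.allclose(pi_bar, Mn[1])        # rows agree in the limit
assert np.allclose(pi_bar, pi_bar @ M)   # invariance (7.182): pi = pi M
```

Here the second eigenvalue of M is 0.3, so the convergence rate η in (7.195) is 0.3 and sixty powers are far more than enough.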

7.3.2.3 Single Class Transition Matrices

If the transition matrix M has a single class (which is therefore recurrent), we cannot
expect M^n to converge in general (see Example 7.43), but we still have the following
result.

Lemma 7.53 A single class transition matrix has a unique invariant probability.

Proof For ε ∈ (0, 1), set M^ε := ε I + (1 − ε) M, where M is a single class transition
matrix. Then, by the binomial formula for commuting matrices,

    (M^ε)^n = \sum_{p=0}^{n} \binom{n}{p} ε^{n−p} (1 − ε)^p M^p     (7.200)

has, for n large enough, only positive elements. By Lemma 7.50, M^ε has a unique
invariant probability. Since it is easily checked that M^ε and M have the same invariant
probabilities, the conclusion follows. □

Lemma 7.54 A single class transition matrix M (with, therefore, a unique invariant
probability π̄) is such that, for any probability π^0, π^n := π^0 S^n converges to π̄, and
S^n converges to the matrix each of whose m rows equals π̄:

    S̄ = ( π̄ ; · · · ; π̄ ).     (7.201)

Proof Since S^n is a transition matrix, π^n is a sequence of probabilities. Since
S^n (I − M) = (I − M^{n+1})/(n + 1) → 0, we have that π^n (I − M) → 0. So, any
limit-point of π^n is an invariant probability, and is by Lemma 7.53 equal to π̄.
This implies that π^n → π̄. So, any limit point S̄ of (the bounded sequence) S^n is
such that π^0 S̄ = π̄ for any initial probability π^0, from which (7.201) follows. □

7.3.2.4 General Case

More generally, after some permutation of the state indexes, we may write the
transition matrix in the form
    M = ( A 0 ; B C ),     (7.202)

where A is a p × p matrix, p ≤ m, the first p states being recurrent, the others being
transient. The matrix A is block diagonal, each block corresponding to a recurrent
class. Then
    M^n = ( A^n 0 ; B_n C^n ),     (7.203)

with B_1 := B and B_{n+1} = B_n A + C^n B, so that, by induction,

    B_{n+1} = B A^n + C B A^{n−1} + · · · + C^n B = \sum_{i=0}^{n} C^i B A^{n−i}.     (7.204)

We set (we will check in the lemma below that (I − C) is invertible):

    M̄ := ( Ā 0 ; B̄ 0 ),  where  B̄ := (I − C)^{−1} B Ā.     (7.205)

Lemma 7.55 (i) The matrix (I − C) is invertible, and (I − C)^{−1} = \sum_{k=0}^{∞} C^k.
(ii) If all recurrent classes are regular, then M^n → M̄. (iii) Otherwise, M^n converges
to M̄ in the Cesaro sense.



Proof (i) By Lemma 7.47, C^n → 0 geometrically, so that C′ := \sum_{k=0}^{∞} C^k is
well-defined, and

    (I − C) C′ = \lim_q (I − C) \sum_{k=0}^{q} C^k = \lim_q (I − C^{q+1}) = I.     (7.206)

Point (i) follows.


(ii) Assume that all recurrent classes are regular. By Lemma 7.50, A^n → Ā, a block-
diagonal matrix each block of which has identical rows equal to the unique invariant
probability of the corresponding block of A, as in (7.201). It remains to prove that
B_n → B̄. Since C^n → 0 at a linear rate, and A^n is bounded and converges to Ā, this
follows from the dominated convergence theorem in the space of summable sequences.
Point (ii) follows.
(iii) General case. Setting

    Ā_n := (I + A + · · · + A^n)/(n + 1),   C̄_n := (I + C + · · · + C^n)/(n + 1),     (7.207)

and using (7.204), we obtain that, for some matrix B̃_n, the expression of the average
sum S^n is
    S^n = ( Ā_n 0 ; B̃_n C̄_n ).     (7.208)

By Lemma 7.54, Ā_n → Ā (as before, the block diagonal matrix whose rows for
a given recurrent class are equal to the corresponding invariant probability), and
C̄_n → 0 since C^n → 0 at a linear rate. By (7.204),

    B̃_n = (B_1 + · · · + B_n)/(n + 1) = (1/(n + 1)) \sum_{q=0}^{n−1} \sum_{i+j=q} C^i B A^j,     (7.209)

and therefore

    B̃_n = (1/(n + 1)) \sum_{i=0}^{n−1} C^i B \sum_{j=0}^{n−1−i} A^j = \sum_{i=0}^{n−1} ((n − i)/(n + 1)) C^i B Ā_{n−1−i}.     (7.210)

Again by a dominated convergence argument, B̃_n → (I − C)^{−1} B Ā. The conclusion
follows. □

7.3.2.5 A Linear System for the Average Return

We assume that the Markov chain has a unique invariant probability law π̄ (i.e., there
is a unique recurrent class). Set W := M − I, and consider the linear equation

    c + W V = η 1,     (7.211)

where c ∈ ℓ^∞ is given, and the unknown is (V, η) ∈ ℓ^∞ × ℝ. Observe that,
if (V, η) is a solution, then, for all α ∈ ℝ, setting V′ := V + α 1, we have that (V′, η)
is another solution, so that we redefine the solution space as (ℓ^∞/ℝ) × ℝ. Being an
invariant probability law, π̄ belongs to the left kernel of W. Multiplying (7.211) on
the left by π̄, we deduce that
    η = π̄ c     (7.212)

is the average cost. Solving the linear system (7.211) therefore gives a way to compute
the average cost without computing the invariant probability law.
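A sketch of this computation (hypothetical two-state data): fixing the representative of V in ℓ^∞/ℝ by V_0 = 0 turns (7.211) into a square linear system in (V_1, …, V_{m−1}, η):

```python
import numpy as np

# Solve (7.211): c + (M - I) V = eta 1, with normalization V_0 = 0.
M = np.array([[0.7, 0.3],
              [0.4, 0.6]])
c = np.array([1.0, 3.0])
m = len(c)
W = M - np.eye(m)

# Unknowns (V_1, ..., V_{m-1}, eta): drop column 0 of W (since V_0 = 0)
# and append the eta column; the system reads W V - eta 1 = -c.
A = np.hstack([W[:, 1:], -np.ones((m, 1))])
sol = np.linalg.solve(A, -c)
V = np.concatenate([[0.0], sol[:-1]])
eta = sol[-1]
# eta equals the average cost pi_bar . c of (7.212), here 13/7,
# without pi_bar ever being computed.
```

For this chain the invariant law is (4/7, 3/7), so π̄c = 4/7 + 9/7 = 13/7, matching `eta`.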

Lemma 7.56 Equation (7.211) has a unique solution in (ℓ^∞/ℝ) × ℝ.

Proof Since the setting is finite-dimensional, it suffices to check that the only
solution for c = 0 is η = 0 with V constant. That η = 0 follows from (7.212).
Now let V attain its maximum at i_0. Then

    V_{i_0} = (M V)_{i_0} = \sum_j M_{i_0 j} V_j ≤ \sum_j M_{i_0 j} V_{i_0} = V_{i_0},     (7.213)

where we used the fact that M is a stochastic matrix. The equality means that
V_j = V_{i_0} whenever M_{i_0 j} ≠ 0, i.e., when j is 1-step accessible from i_0. By
induction, we deduce that this holds for any state accessible from i_0, and in particular
for any element of the recurrent class. We have a similar result by considering a state
where V attains its minimum. Therefore the minimum of V is equal to its maximum.
The result follows. □

7.3.2.6 More on Average Return

The linear equation (7.211), taking into account the decomposition (7.202) of the
transition matrix M, is of the form

    η 1 = c′ + (A − I) V′,
    η 1 = c″ + B V′ + (C − I) V″,     (7.214)

where c′ refers to the subvector of c whose components correspond to the recurrent
states, etc. Since A is a transition matrix with a unique recurrent class, the first row
has a unique solution, which determines (η, V′). Since, by Lemma 7.55(i), (C − I) is
invertible, the second row then determines the value of V″, given (η, V′). Observe
that, in order to obtain the correct value of η, it is enough to solve the first block,
i.e., we can ignore the transient states.

7.3.2.7 A Link with Finite Horizon Problems

Let (η, V) be a solution of the average return equation (7.211). Then

    V̄^n = S^{n−1} c = S^{n−1} ( η 1 − (M − I) V ) = η 1 − (1/n)(M^n − I) V.     (7.215)
Remark 7.57 If M is a regular transition matrix then, by Remark 7.52, 1 is a simple
eigenvalue. Take for V (defined in ℓ^∞/ℝ) the representative V̂ that is a combination
of vectors of the other eigenspaces, whose eigenvalues have modulus less than 1, so
that ‖M^n V̂‖ ≤ c γ^n for some c > 0, γ < 1. So, (7.215) gives the following expansion
of V^n = n V̄^n:
    ‖V^n − n η 1 − V̂‖ = ‖M^n V̂‖ ≤ c γ^n.     (7.216)

7.3.3 Ergodic Dynamic Programming

We next consider the problem of minimizing the average cost.

7.3.3.1 Controlled Markov Chains

We still assume that S is finite, and that:

    for each u ∈ Φ, the Markov chain M(u) has a unique recurrent
    class S(u), and therefore a unique invariant probability law π(u).     (7.217)

The minimum ergodic cost problem can then be defined as

    Min_{u ∈ Φ, π ∈ P} π c(u);   π M(u) = π.     (7.218)

In view of (7.217), each feasible pair (u, π) is such that π = π(u). By the previous
section, an equivalent problem is

    Min_{u ∈ Φ, V, η} η;   η 1 + V = c(u) + M(u) V.     (7.219)

We connect this to the following ergodic dynamic programming principle:

    η + V_i = \min_{u ∈ U_i} ( c(u) + M(u) V )_i,  for all i ∈ S.     (7.220)

For u ∈ Φ, we set W(u) := M(u) − I.

Theorem 7.58 Let ū satisfy the ergodic dynamic programming principle (7.220).
Then:
(i) ū is a solution of the ergodic problem (7.219).
(ii) If û is another solution of (7.220), writing c̄ = c(ū), ĉ = c(û), etc., then V̄ − V̂
is maximal and constant over S(û). If, in addition, S(ū) ∩ S(û) ≠ ∅, then V̄ − V̂
is constant.

Proof (i) Let (ū, V̄, η̄) satisfy (7.220) and let û ∈ Φ. Then

    η̄ 1 = c̄ + W̄ V̄ ≤ ĉ + Ŵ V̄ = Ŵ (V̄ − V̂) + ĉ + Ŵ V̂ = Ŵ (V̄ − V̂) + η̂ 1.     (7.221)

Multiplying on the left by the invariant probability law π̂ of the policy û, since
π̂ Ŵ = 0, we obtain that η̄ ≤ η̂. Point (i) follows.
(ii) Let (û, V̂, η̂) be another solution of (7.220). Set δV := V̄ − V̂. Since η̄ = η̂ by
point (i), (7.221) implies that Ŵ δV ≥ 0, i.e.

    δV ≤ M̂ δV.     (7.222)

We deduce that, if δV attains its maximum at state i, then the maximum is also
attained at each 1-step accessible state from i, and also, by induction, at any state
accessible from i (for the policy û). This implies that δV is maximal and constant
over S(û). Exchanging the roles of û and ū, we obtain that δV is minimal and constant
over S(ū). So, if S(ū) ∩ S(û) ≠ ∅, then δV is constant. □

We assume next that the Ui are compact, c and M are continuous, and (7.217)
holds; then, by Lemma 7.56, the following Howard policy iteration algorithm is
well-defined (compare to the Howard Algorithm 7.18 for the discounted case):

Algorithm 7.59 (Ergodic Howard algorithm)

1. Initialization: choose a policy u^0 ∈ Φ; q := 0.
2. Compute a solution (V^q, η^q) in ℓ^∞ × ℝ of the linear equation

    η^q 1 + V^q = c(u^q) + M(u^q) V^q.     (7.223)

3. Compute a policy u^{q+1}, a solution of

    u_i^{q+1} ∈ arg min_{u ∈ U_i} { c_i(u) + \sum_j M_{ij}(u) V_j^q },  for all i ∈ S.     (7.224)

4. Set q := q + 1 and go to step 2.
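A compact sketch of Algorithm 7.59 for finite state and action sets (hypothetical data layout: `c[u][i]` is c_i(u), `M[u]` the matrix M(u); step 2 solves (7.223) under the normalization V_0 = 0, which by Lemma 7.56 pins down the representative in ℓ^∞/ℝ):

```python
import numpy as np

def ergodic_howard(c, M, max_iter=50):
    """Ergodic Howard policy iteration, a small sketch of Algorithm 7.59."""
    nu, m = len(M), len(M[0])
    policy = np.zeros(m, dtype=int)
    V, eta = np.zeros(m), 0.0
    for _ in range(max_iter):
        # Step 2: solve eta 1 + V = c(u) + M(u) V with V_0 = 0, cf. (7.223).
        Mu = np.array([M[policy[i]][i] for i in range(m)])
        cu = np.array([c[policy[i]][i] for i in range(m)])
        W = Mu - np.eye(m)
        A = np.hstack([W[:, 1:], -np.ones((m, 1))])   # unknowns (V_1.., eta)
        sol = np.linalg.solve(A, -cu)
        V = np.concatenate([[0.0], sol[:-1]])
        eta = sol[-1]
        # Step 3: policy improvement (7.224).
        Q = np.array([[c[u][i] + M[u][i] @ V for u in range(nu)]
                      for i in range(m)])
        new_policy = Q.argmin(axis=1)
        if np.array_equal(new_policy, policy):        # stationary: stop
            break
        policy = new_policy
    return policy, V, eta

# Hypothetical two-state, two-action data.
c = [np.array([2.0, 2.0]), np.array([1.0, 3.0])]
M = [np.array([[0.5, 0.5], [0.5, 0.5]]),
     np.array([[0.9, 0.1], [0.2, 0.8]])]
policy, V, eta = ergodic_howard(c, M)
```

On this example the algorithm stops at the policy (1, 0) with average cost η = 7/6, in agreement with Theorem 7.60(i): the computed η^q is nonincreasing along the iterations.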

Next, consider the following single class hypothesis for any strategy, stronger than
(7.217):
For each u ∈ Φ, S = S (u). (7.225)

Theorem 7.60 (i) The sequence computed by the Howard algorithm is such that ηq
is nonincreasing.
(ii) If (7.225) holds, then any limit point of (u q , V q , ηq ) satisfies the ergodic dynamic
programming principle and is therefore an optimal policy in view of Theorem 7.58.

Proof (i) Setting cq := c(u q ), etc., we have that

W q+1 V q+1 + cq+1 − ηq+1 1 = 0 = W q V q + cq − ηq 1, (7.226)

and therefore

(W q+1 − W q )V q + cq+1 − cq = W q+1 (V q − V q+1 ) + (ηq+1 − ηq )1. (7.227)

By the definition of the Howard algorithm, the l.h.s., denoted by ξ q , has nonpositive
values. Multiplying on the left by π q+1 ≥ 0, since π q+1 W q+1 = 0, we obtain that

0 ≥ π q+1 ξ q = ηq+1 − ηq . (7.228)

So, ηq is nonincreasing.
(ii) Being bounded, ηq converges to some η̄ ∈ R. Take a subsequence for which
(u q , u q+1 ) → (ū, û), with similar conventions for costs, probabilities, etc. Passing
to the limit in the relation

cq+1 + W q+1 V q ≤ c(u) + W (u)V q , for all u ∈ Φ, (7.229)

we obtain that
ĉ + Ŵ V̄ ≤ c(u) + W (u)V̄ , for all u ∈ Φ. (7.230)

Taking u = ū and adding the relation

c̄ + W̄ V̄ = η̄1 = η̂1 = ĉ + Ŵ V̂ , (7.231)

we deduce that ζ := Ŵ (V̄ − V̂) ≤ 0. Since π̂ Ŵ = 0, we have that π̂ ζ = 0. Since
π̂ ≥ 0 and ζ ≤ 0, it follows that ζ vanishes over the support of π̂, which is the
recurrent class, say Ŝ, associated with the policy û. So, for any i ∈ Ŝ we get, using
ζ_i = 0 and (7.230):

(c̄ + W̄ V̄ )i = η̄ = η̂ = (ĉ + Ŵ V̂ )i = (ĉ + Ŵ V̄ )i ≤ (c(u) + W (u)V̄ )i . (7.232)

With (7.225) we conclude that (η̄, V̄ ) satisfies the ergodic dynamic programming
principle (7.220). The conclusion follows. 

Remark 7.61 Any solution of the ergodic dynamic programming principle (7.220)
provides a stationary sequence for Howard’s algorithm. So, if the single class hypoth-
esis (7.225) holds, by the above two theorems, a policy is optimal iff it satisfies the
ergodic dynamic programming principle.

7.4 Notes

For partially observed processes, see Monahan [81]. For more on ergodic Markov
chains, see Arapostathis et al. [7] and Hsu et al. [61]. On the superlinear convergence
of Howard’s type algorithms, see Bokanowski, Maroso and Zidani [22], and Santos
and Rust [109].
For further reading we refer to the books by Bertsekas [19], Altman [5] (especially
about expectation constraints), Puterman [91], and, for continuous state spaces, to
Hernández-Lerma, and Lasserre [56, 57]. The link with discretization of continuous
time processes is discussed in Kushner and Dupuis [67]. On the modified policy
iteration algorithms for discounted Markov decision problems, see Puterman and
Shin [92].
Chapter 8
Algorithms

Summary In the case of convex, dynamical stochastic optimization problems, the


Bellman functions, being convex, can be approximated as finite suprema of affine
functions. Starting with static and deterministic problems, it is shown how this leads
to the effective stochastic dual dynamic programming algorithm.
The second part of the chapter is devoted to the promising approach of linear
decision rules, which allows one to obtain upper and lower bounds of the value
functions of stochastic optimization problems.

8.1 Stochastic Dual Dynamic Programming (SDDP)

In this section we will study the case of convex dynamic problems, whose convex
Bellman values can be approximated by a collection of affine minorants. We start
with the static case.

8.1.1 Static Case: Kelley’s Algorithm

Consider the problem


Min f (x), (8.1)
x∈X

where X is a convex, compact subset of Rn and f : Rn → R is convex. Given


sequences x k in X and y k ∈ ∂ f (x k ), k ∈ N, we define the sequence of functions
ϕk : Rn → R by  
ϕk (x) := max f (x i ) + y i · (x − x i ) . (8.2)
0≤i≤k

In view of the definition of a subgradient, we have that

ϕk (x) ≤ f (x), for all x ∈ Rn . (8.3)


© Springer Nature Switzerland AG 2019 267
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_8

Setting ak := f (x k ) − y k · x k we note that


 
ϕk (x) := max ai + y i · x . (8.4)
0≤i≤k

So, computing ϕk (x) can be done by storing only (k + 1) vectors of R^{n+1}, instead
of the 2(k + 1) vectors of R^n that the definition of ϕk would suggest.
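This cut storage can be sketched in a few lines (a hedged numpy illustration; the test function f and the sample points are our own choices, not from the text):

```python
import numpy as np

# Convex test function and a subgradient selection (our choice, for illustration).
f = lambda x: np.dot(x, x)
grad = lambda x: 2.0 * x

# Sample points x^i with subgradients y^i in ∂f(x^i); store (a_i, y_i) with
# a_i = f(x^i) - y^i · x^i, as in (8.4).
xs = [np.array([1.0, 0.0]), np.array([0.0, -1.0]), np.array([0.5, 0.5])]
cuts = []
for xi in xs:
    yi = grad(xi)
    cuts.append((f(xi) - yi.dot(xi), yi))

def phi(x):
    """Evaluate the cutting-plane model ϕ_k(x) = max_i (a_i + y_i · x)."""
    return max(a + y.dot(x) for a, y in cuts)

# ϕ_k is a minorant of f, cf. (8.3), and is exact at the sample points.
assert phi(np.array([0.3, -0.2])) <= f(np.array([0.3, -0.2])) + 1e-12
```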

Lemma 8.1 We have that f (x k ) − ϕk (x k ) → 0, and if x̄ is a limit-point of x k , then


ϕk (x̄) → f (x̄).

Proof (a) Let L be a Lipschitz constant of f over a bounded neighbourhood of X .


Then ∂ f (x) ⊂ B̄(0, L), for all x ∈ X . Being a maximum of Lipschitz functions with
constant L, ϕk is itself Lipschitz with constant L. Let x̄ be a limit-point of x k . For
any ε > 0 there exists a kε such that |x kε − x̄| < ε. Since ϕk and f are Lipschitz with
constant L, we get

ϕk (x̄) ≥ f (x kε ) + y kε · (x̄ − x kε ) ≥ f (x kε ) − L|x kε − x̄| ≥ f (x̄) − 2Lε, for all k > kε .


(8.5)
Since ϕk (x̄) ≤ f (x̄), it follows that ϕk (x̄) → f (x̄).
(b) Suppose that, along a subsequence, f (x^{k_i}) − ϕ_{k_i}(x^{k_i}) converges to a positive
limit. Since ϕk is Lipschitz with constant L not depending on k, it converges uniformly,
and for some limit-point x̄ of x^{k_i}:

  0 < lim_i ( f (x^{k_i}) − ϕ_{k_i}(x^{k_i}) ) = lim_i ( f (x̄) − ϕ_{k_i}(x̄) ),   (8.6)

which gives a contradiction with (a). The conclusion follows. 

The cutting plane (or Kelley) algorithm is as follows:


Algorithm 8.2 (Cutting plane)
1. Data: x 0 ∈ X , ε ≥ 0. Set k := 0.
2. Compute x k+1 ∈ X such that

ϕk (x k+1 ) ≤ ϕk (x), for all x ∈ X. (8.7)

3. If f (x k+1 ) − ϕk (x k+1 ) ≤ ε, return x̂ := x k+1 .


Otherwise, set k := k + 1 and go to step 2.
Note that computing x k+1 , when X is a polyhedron, means solving a linear pro-
gram. By (8.3), we have that

  ϕ_k(x^{k+1}) = min_{x∈X} ϕ_k(x) ≤ min_{x∈X} f (x) ≤ f (x^{k+1}).   (8.8)

Set ε_{k+1} := f (x^{k+1}) − ϕ_k(x^{k+1}). It follows that for k ≥ 1, x^k is an ε_k-solution of
(8.1), in the sense that

  f (x^k) − min_{x∈X} f (x) ≤ ε_k .   (8.9)

In particular, when the algorithm stops, the return point x̂ is an ε-solution.

Lemma 8.3 If ε > 0, the algorithm stops after finitely many iterations. If ε = 0,
either it stops after finitely many iterations, or ϕk (x k+1 ) → min X f , and any limit-
point of x k is a solution of (8.1).

Proof It suffices to study the case when ε = 0 and the algorithm does not stop. Let x̄
be a limit-point of x k . For the associated subsequence x ki , since ϕk is Lipschitz and
nondecreasing as a function of k, the value of its minimum converges. We conclude
using Lemma 8.1, (8.8) and

  min_{x∈X} f (x) ≥ lim_k ϕ_k(x^{k+1}) = lim_i ϕ_{k_i−1}(x^{k_i}) = lim_i ϕ_{k_i−1}(x̄) = f (x̄).   (8.10)
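Algorithm 8.2 can be sketched on a one-dimensional instance, where step 2 becomes the linear program min{t : t ≥ a_i + y_i x, x ∈ X} in the variables (x, t) (a hedged sketch assuming scipy is available; the objective f(x) = x² and X = [−1, 2] are our own choices):

```python
from scipy.optimize import linprog

f = lambda x: x * x          # convex test objective (our choice)
subgrad = lambda x: 2.0 * x  # a subgradient of f

x_k, cuts, eps = 2.0, [], 1e-6
for _ in range(50):
    # Add the cut a_k + y_k x with a_k = f(x^k) - y_k x^k, cf. (8.2)-(8.4).
    y_k = subgrad(x_k)
    cuts.append((f(x_k) - y_k * x_k, y_k))
    # Step 2 of Algorithm 8.2: minimize ϕ_k over X = [-1, 2] as the LP
    #   min t  s.t.  y_i x - t <= -a_i  for all stored cuts.
    A_ub = [[y, -1.0] for a, y in cuts]
    b_ub = [-a for a, y in cuts]
    res = linprog(c=[0.0, 1.0], A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-1.0, 2.0), (None, None)])
    x_k, model_val = res.x[0], res.x[1]
    # Step 3: stop when f(x^{k+1}) - ϕ_k(x^{k+1}) <= eps.
    if f(x_k) - model_val <= eps:
        break
# x_k is then an eps-solution in the sense of (8.9); here the minimizer is 0.
```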

8.1.2 Deterministic Dual Dynamic Programming

8.1.2.1 Principle

Consider now the problem (P) of minimizing

  J(u, y) := Σ_{t=0}^{N−1} ℓ_t(u_t, y_t) + ℓ_N(y_N),   (8.11)

subject to the state equation and control constraints

yt+1 = At yt + Bt u t , t = 0, . . . , N − 1; y0 = y 0 , (8.12)

u t ∈ Ut , t = 0, . . . , N − 1. (8.13)

Here At and Bt are matrices of size n × n and n × m resp., (yt , u t ) ∈ Rn × Rm ,


the initial condition y 0 is given, the Ut are convex and compact subsets of Rm , and
the functions ℓ_t for 0 ≤ t ≤ N − 1, and ℓ_N , are convex and Lipschitz. We say that
(u, y) is a feasible trajectory if it satisfies (8.12)–(8.13). We denote by y[u] the state
associated with control u and denote the reduced cost by

F(u) := J (u, y[u]). (8.14)

The Bellman values are such that, for τ = 0 to N − 1 and x ∈ Rn :


  v_N = ℓ_N ;   v_τ(x) := min_{u_τ,…,u_{N−1}} { Σ_{t=τ}^{N−1} ℓ_t(u_t, y_t) + ℓ_N(y_N) ;  y_τ = x },   (8.15)

where the minimization is over the control variables satisfying the control constraints
(8.13). Then the following dynamic programming principle holds:

  v_τ(x) = min_{u∈U_τ} ( ℓ_τ(u, x) + v_{τ+1}(A_τ x + B_τ u) ).   (8.16)
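The recursion (8.15)–(8.16) can be illustrated by brute-force value iteration on a state grid (a hedged sketch, distinct from the cutting-plane algorithm developed below; the one-dimensional instance, the grids and the linear interpolation are all our own choices):

```python
import numpy as np

# Backward dynamic programming (8.15)-(8.16) on a grid, for the 1-D instance
# y_{t+1} = y_t + u_t, running cost u^2 + y^2, final cost y^2, U_t = [-1, 1].
N = 3
ys = np.linspace(-2.0, 2.0, 81)    # state grid (interpolation clamps outside it)
us = np.linspace(-1.0, 1.0, 41)    # control grid
v = ys ** 2                        # v_N = ℓ_N
for t in reversed(range(N)):
    v_next = v
    v = np.empty_like(ys)
    for j, y in enumerate(ys):
        # v_t(y) = min_u ( ℓ_t(u, y) + v_{t+1}(y + u) ), interpolating v_{t+1}.
        v[j] = min(u * u + y * y + np.interp(y + u, ys, v_next) for u in us)
# v now approximates v_0 on the grid; v_0(0) = 0 since u ≡ 0 is optimal there.
```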

Since the data are Lipschitz, so are the Bellman values vτ , with constant, say, L. The
algorithm is as follows. At iteration k, we have a convex minorant ϕtk of vt , which
is therefore necessarily Lipschitz with constant at most L, and a nondecreasing
function of k. The initialization with k = 0 is usually done by taking ϕt0 equal to
a large negative number. We first perform the forward step: this means computing
a feasible trajectory (u k , y k ) such that u kt is a solution of the approximate dynamic
programming strategy, where v_{t+1} is replaced by ϕ_{t+1}^k :

  u_t^k ∈ argmin_{u∈U_t} ( ℓ_t(u, y_t^k) + ϕ_{t+1}^k(A_t y_t^k + B_t u) ),   t = 0, …, N − 1.   (8.17)

This step is forward in the sense that we first compute u k0 , then u k1 , etc. We then
see how to perform the backward step, which consists in computing an improved
minorant of vt , i.e., ϕtk+1 such that

ϕtk ≤ ϕtk+1 ≤ vt . (8.18)

Let us note that


ϕ0k (y 0 ) ≤ v0 (y 0 ) ≤ F(u k ). (8.19)

So we have that u k is an εk -solution with

εk := F(u k ) − ϕ0k (y 0 ). (8.20)

8.1.2.2 Backward Step and Convergence

We will improve the minorant of the Bellman function by applying the subdifferential
calculus rule in Lemma 1.120 to the forward step (8.17), combined with Theorem
1.117. Since ℓ_t and ϕ_{t+1}^k are continuous, the latter in view of (8.17), for t = 0 to
N − 1, there exists

  r_t^k = (r_{ut}^k , r_{yt}^k) ∈ ∂ℓ_t(u_t^k, y_t^k);   h_t^k ∈ N_{U_t}(u_t^k);   q_{t+1}^k ∈ ∂ϕ_{t+1}^k(y_{t+1}^k),   (8.21)

such that

  r_{ut}^k + B_t^⊤ q_{t+1}^k + h_t^k = 0,   t = 0, …, N − 1.   (8.22)

Relation (8.21) means that for all u ∈ U_t , y and y' in R^n :

  ℓ_t(u, y) ≥ ℓ_t(u_t^k, y_t^k) + r_{ut}^k · (u − u_t^k) + r_{yt}^k · (y − y_t^k),
  ϕ_{t+1}^k(y') ≥ ϕ_{t+1}^k(y_{t+1}^k) + q_{t+1}^k · (y' − y_{t+1}^k),              (8.23)
  0 ≥ h_t^k · (u − u_t^k).

Summing these relations with y' = A_t y + B_t u, so that

  q_{t+1}^k · (y' − y_{t+1}^k) = (A_t^⊤ q_{t+1}^k) · (y − y_t^k) + (B_t^⊤ q_{t+1}^k) · (u − u_t^k),   (8.24)

and using (8.22), we obtain that

  ℓ_t(u, y) + ϕ_{t+1}^k(A_t y + B_t u) ≥ ℓ_t(u_t^k, y_t^k) + ϕ_{t+1}^k(y_{t+1}^k) + ( r_{yt}^k + A_t^⊤ q_{t+1}^k ) · (y − y_t^k).
                                                                                   (8.25)
Minimizing the l.h.s. over u ∈ Ut we obtain an affine minorant of the value function
vt . Therefore, the above r.h.s. is itself an affine minorant of the value function vt . So,
we can update ϕ_t^k as follows:

  ϕ_t^{k+1}(y) := max{ ϕ_t^k(y),  ℓ_t(u_t^k, y_t^k) + ϕ_{t+1}^k(y_{t+1}^k) + ( r_{yt}^k + A_t^⊤ q_{t+1}^k ) · (y − y_t^k) }.   (8.26)

We also update ϕ_N^k as follows:

  ϕ_N^{k+1}(y) := max{ ϕ_N^k(y),  ℓ_N(y_N^k) + r_N^k · (y − y_N^k) },  where r_N^k ∈ ∂ℓ_N(y_N^k).   (8.27)
The updates of the ϕ_t^k can be performed in parallel or in any order, and are anyway
very fast. We see that the costly step of the algorithm is the forward one. Since ϕtk is
nondecreasing and upper bounded by vt , it has a limit denoted by ϕ̄t .
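The cut collections ϕ_t^k can be maintained exactly as in the static case: each update (8.26) or (8.27) appends one pair (intercept, slope). A minimal sketch (the class and the toy stage function |y| are our own illustration, not the book's implementation):

```python
import numpy as np

class CutStore:
    """One stage's minorant ϕ_t(y) = max_j (alpha_j + beta_j · y), cf. (8.26)."""
    def __init__(self, dim, lower=-1e9):
        # Initialization "with a large negative number", as in the text.
        self.cuts = [(lower, np.zeros(dim))]

    def add_cut(self, value, slope, y_ref):
        # Append the affine minorant value + slope · (y - y_ref); this realizes
        # ϕ^{k+1} = max(ϕ^k, new cut) without touching the stored cuts.
        self.cuts.append((value - slope.dot(y_ref), np.asarray(slope)))

    def __call__(self, y):
        return max(a + b.dot(y) for a, b in self.cuts)

# Tiny usage check on a 1-D stage value function v(y) = |y| (our toy example):
phi = CutStore(dim=1)
for y0 in [np.array([-1.0]), np.array([1.0]), np.array([0.5])]:
    phi.add_cut(abs(y0[0]), np.sign(y0), y0)   # np.sign(y0) ∈ ∂|·|(y0)
assert phi(np.array([0.25])) <= 0.25 + 1e-12   # still a minorant of |y|
```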

Lemma 8.4 We have that ϕ0k (y 0 ) → v0 (y 0 ). More generally,

vt (ytk+1 ) − ϕtk (ytk+1 ) → 0, for t = 0 to N , (8.28)

and any limit-point of u k is a solution of (P).

Proof (a) We claim, using a backward induction argument, that (8.28) holds. For
t = N this follows from Lemma 8.1. Let it hold for t + 1, with 0 ≤ t ≤ N − 1. It
suffices to check the result for a subsequence ki such that y ki +1 is convergent.
Since the data are Lipschitz and the minorants ϕ^k are Lipschitz with constant L,
c_t^k := r_{yt}^k + A_t^⊤ q_{t+1}^k is bounded.
   Given ε > 0, for large enough i, by the induction hypothesis, since ϕ_t^k is nonde-
creasing w.r.t. k, for j > i, we have, using (8.26), that

  ϕ_t^{k_j}(y_t^{k_j}) ≥ ϕ_t^{k_i+1}(y_t^{k_j})
       ≥ ℓ_t(u_t^{k_i}, y_t^{k_i}) + ϕ_{t+1}^{k_i}(A_t y_t^{k_i} + B_t u_t^{k_i}) + c_t^{k_i} · (y_t^{k_j} − y_t^{k_i})
       ≥ ℓ_t(u_t^{k_i}, y_t^{k_i}) + v_{t+1}(A_t y_t^{k_i} + B_t u_t^{k_i}) − ε − |c_t^{k_i}| |y_t^{k_j} − y_t^{k_i}|    (8.29)
       ≥ v_t(y_t^{k_i}) − ε − |c_t^{k_i}| |y_t^{k_j} − y_t^{k_i}|
       ≥ v_t(y_t^{k_j}) − ε − (L + |c_t^{k_i}|) |y_t^{k_j} − y_t^{k_i}|.

Since |c_t^{k_i}| is bounded, this implies lim inf_j ( ϕ_t^{k_j}(y_t^{k_j}) − v_t(y_t^{k_j}) ) ≥ 0. Since ϕ_t^k is a
minorant of v_t, the claim follows.
(b) We must prove that any limit-point of u k is a solution of (P). Indeed we have
that for all u t ∈ Ut , in view of step (a):

t (u kt , ytk ) + ϕt+1
k
(At ytk + Bt u kt ) ≤ t (u t , ytk ) + ϕt+1
k
(At ytk + Bt u t )
≤ t (u t , yt ) + vt+1 (At ytk + Bt u t ) + o(1).
k

(8.30)
Making k ↑ ∞ we get that

  ℓ_t(ū_t, ȳ_t) + ϕ̄_{t+1}(ȳ_{t+1}) ≤ ℓ_t(u_t, ȳ_t) + v_{t+1}(A_t ȳ_t + B_t u_t).   (8.31)

By point (a), ϕ̄t+1 ( ȳt+1 ) = vt+1 ( ȳt+1 ). Minimizing the r.h.s. over u t ∈ Ut we get that
in view of the dynamic programming principle

  ℓ_t(ū_t, ȳ_t) + v_{t+1}(ȳ_{t+1}) ≤ v_t(ȳ_t).   (8.32)

So, ū satisfies the DPP and is therefore optimal. 

8.1.3 Stochastic Case

8.1.3.1 Principle

For the sake of simplicity we assume that Ω = Ω0N +1 , and that ω = (ω0 , . . . , ω N )
with all components independent, of the same law. Additionally Ω0 = {1, . . . , M}
and the event i has probability pi . We say that a random variable is Ft -measurable
if it depends on (ω0 , . . . , ωt−1 ). We consider adapted policies: u t (and therefore also
yt ) is Ft -measurable, for t = 0 to N − 1. We denote by y[u] the state associated
with control u, the adapted solution of

yt+1 = At yt + Bt u t + et (ωt ), t = 0, . . . , N − 1. (8.33)

The cost function is, given (u, y) adapted and a.s. bounded

  J(u, y) := E[ Σ_{t=0}^{N−1} ℓ_t(u_t, y_t, ω_t) + ℓ_N(y_N, ω_N) ].   (8.34)

We assume that the functions entering into the cost are Lipschitz and convex w.r.t.
(u, y). Denote the reduced cost by

F(u) := J (u, y[u]). (8.35)

The problem is to minimize the reduced cost satisfying the control constraints:

Min F(u); u adapted, u t ∈ Ut a.s., 0 ≤ t ≤ N − 1. (8.36)


u

The Bellman values are, for τ = 0, . . . , N − 1 and x ∈ Rn , solutions of:

  v_N = E ℓ_N ;   v_τ(x) := min_{u_τ,…,u_{N−1}} E[ Σ_{t=τ}^{N−1} ℓ_t(u_t, y_t, ω_t) + ℓ_N(y_N, ω_N) | y_τ = x ],   (8.37)
where the minimization is over the feasible adapted policies (feasible in the sense
that they satisfy the above control constraints). The dynamic principle reads

  v_t(y) = min_{u∈U_t} E_{ω_t} ( ℓ_t(u, y, ω_t) + v_{t+1}(A_t y + B_t u + e_t(ω_t)) ),   (8.38)

or equivalently writing i = ωt , ei := e(i):


  v_t(y) = min_{u∈U_t} Σ_{i=1}^M p_i ( ℓ_t(u, y, i) + v_{t+1}(A_t y + B_t u + e_i) ).   (8.39)

The SDDP algorithm will compute a nondecreasing sequence of minorants ϕt of vt .


We can then compute a trajectory based on the approximate dynamic principle, i.e.,
(u k , y k ) such that for a given realization of ω:


M
 
  u_t^k ∈ argmin_{u∈U_t} Σ_{i=1}^M p_i ( ℓ_t(u, y_t^k, i) + ϕ_{t+1}^k(A_t y_t^k + B_t u + e_i) ),   t = 0, …, N − 1,   (8.40)

and then compute y_{t+1}^k according to (8.33). Assuming that in the case of multiple
minima we choose one of them following a rule such as choosing the solution of
minimum norm, this determines an adapted policy. Computing trajectories when
choosing i with probability pi , this procedure then appears as a Monte Carlo type
computation for estimating the reduced cost F(u k ) associated with the adapted policy
u k . We have that
ϕ0k (y 0 ) ≤ v0 (y 0 ) ≤ F(u k ). (8.41)

So, provided we have a statistical procedure implying that for some ε > 0 and ak ∈ R

F(u k ) ≤ ak with probability 1 − ε, (8.42)

we deduce the estimate

F(u k ) − v0 (y 0 ) ≤ a k − ϕ0k (y 0 ) with probability 1 − ε. (8.43)
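One such statistical procedure of type (8.42) is a Monte Carlo estimate of F(u^k) with a one-sided confidence bound (a hedged sketch: the normal-approximation recipe below is a standard CLT-based choice, not prescribed by the text, and the per-scenario costs are synthetic):

```python
import random
from statistics import NormalDist, mean, stdev

def upper_bound(costs, eps=0.05):
    """One-sided (1 - eps) normal-approximation bound a_k such that, roughly,
    the expected cost is <= a_k with probability 1 - eps, cf. (8.42)."""
    n = len(costs)
    z = NormalDist().inv_cdf(1.0 - eps)
    return mean(costs) + z * stdev(costs) / n ** 0.5

random.seed(0)
# Simulated per-scenario costs of a fixed adapted policy u^k (synthetic data).
samples = [1.0 + random.gauss(0.0, 0.2) for _ in range(1000)]
a_k = upper_bound(samples)
```

Together with the lower bound ϕ_0^k(y^0), this yields the optimality-gap estimate (8.43); the normal approximation is only asymptotically valid.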

8.1.3.2 Backward Step and Convergence

We next provide an extension of the backward step of the deterministic case. We


apply the subdifferential calculus rule in Lemma 1.120 to the forward step (8.40).
Since ℓ_t and ϕ_{t+1}^k are continuous functions of (u, y) and y resp., setting

  y_{i,t+1}^k := A_t y_t^k + B_t u_t^k + e_i ,   (8.44)

there exists for t = 0 to N − 1:

  h_t^k ∈ N_{U_t}(u_t^k);   r_{it}^k = (r_{iut}^k , r_{iyt}^k) ∈ ∂ℓ_t(u_t^k, y_t^k, i);   q_{i,t+1}^k ∈ ∂ϕ_{t+1}^k(y_{i,t+1}^k),   i = 1, …, M,   (8.45)

such that

  h_t^k + Σ_{i=1}^M p_i ( r_{iut}^k + B_t^⊤ q_{i,t+1}^k ) = 0,   t = 0, …, N − 1.   (8.46)

Relation (8.45) means that for all u ∈ U_t , y and y' in R^n :

  ℓ_t(u, y, i) ≥ ℓ_t(u_t^k, y_t^k, i) + r_{iut}^k · (u − u_t^k) + r_{iyt}^k · (y − y_t^k),
  ϕ_{t+1}^k(y') ≥ ϕ_{t+1}^k(y_{i,t+1}^k) + q_{i,t+1}^k · (y' − y_{i,t+1}^k),            (8.47)
  0 ≥ h_t^k · (u − u_t^k).

Summing these relations (with weights p_i for the first two) with y' = A_t y + B_t u + e_i
and using (8.46), we obtain that

  Σ_{i=1}^M p_i ( ℓ_t(u, y, i) + ϕ_{t+1}^k(A_t y + B_t u + e_i) ) ≥ a_t^k + b_t^k · (y − y_t^k),   (8.48)

where

  a_t^k := Σ_{i=1}^M p_i ( ℓ_t(u_t^k, y_t^k, i) + ϕ_{t+1}^k(y_{i,t+1}^k) ),
                                                                          (8.49)
  b_t^k := Σ_{i=1}^M p_i ( r_{iyt}^k + A_t^⊤ q_{i,t+1}^k ).

Minimizing the l.h.s. of (8.48) over u ∈ Ut , we see that the above r.h.s. gives an
affine minorant of the value function vt , so that we can update ϕtk as follows:
 
ϕtk+1 (y) := max ϕtk (y), atk + btk · (y − ytk ) . (8.50)

We also update ϕ_N^k as follows:

  ϕ_N^{k+1}(y) := max{ ϕ_N^k(y),  Σ_{i=1}^M p_i ( ℓ_N(y_N^k, i) + r_{iN}^k · (y − y_N^k) ) },  where r_{iN}^k ∈ ∂ℓ_N(y_N^k, i).
                                                                          (8.51)
Note that we can perform the update of the ϕtk in parallel or in any order.
Lemma 8.5 We have that ϕ0k (y 0 ) → v0 (y 0 ). More generally, vt (ytk+1 ) − ϕtk (ytk+1 )
converges to 0, for t = 0 to N .
Proof We show by backward induction that vt (ytk+1 ) − ϕtk (ytk+1 ) → 0, for t = 0 to
N . For t = N this follows from Lemma 8.1. Let it hold for t + 1, with 0 ≤ t ≤ N − 1.
Let k j be a subsequence such that u k j +1 → ū (in the space of adapted strategies).
Let k := k_j , u := u^k , etc. Then, given ε > 0, for large enough j, by the
induction hypothesis,

  ϕ_t^{k+1}(y_t^k) ≥ Σ_{i=1}^M p_i ( ℓ_t(u_t^k, y_t^k, i) + ϕ_{t+1}^k(A_t y_t^k + B_t u_t^k + e_i) )
                  ≥ Σ_{i=1}^M p_i ( ℓ_t(u_t^k, y_t^k, i) + v_{t+1}(A_t y_t^k + B_t u_t^k + e_i) ) − ε     (8.52)
                  ≥ v_t(y_t^k) − ε.

The conclusion follows. 


For a discussion of the SDDP approach we refer to the notes at the end of this
chapter.

8.2 Introduction to Linear Decision Rules

8.2.1 About the Frobenius Norm

We recall that the Frobenius scalar product between two matrices A, B of same size
is 
A, B F = Ai, j Bi, j = trace(AB  ). (8.53)
i, j

Note that, if A, B, C are matrices such that AB and C have the same dimension,
then we have the “transposition rule”

AB, C F = trace(ABC  ) = trace(A(C B  ) ) = A, C B  F. (8.54)


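These identities can be verified numerically (a small numpy sanity check; the matrix sizes are our own choice):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((3, 5))   # AB and C have the same dimension

frob = lambda X, Y: np.trace(X @ Y.T)       # ⟨X, Y⟩_F as in (8.53)
# Transposition rule (8.54): ⟨AB, C⟩_F = ⟨A, C Bᵀ⟩_F.
assert np.isclose(frob(A @ B, C), frob(A, C @ B.T))
# And ⟨X, Y⟩_F is also the elementwise sum of products.
assert np.isclose(frob(A @ B, C), ((A @ B) * C).sum())
```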

8.2.2 Setting

Let (Ω, F , P) be a probability space. Consider the problem

Min E c(ω) · x(ω); Ax(ω) ≤ b(ω) a.s. (8.55)


x∈L 2 (Ω)n

Here A, a p × n matrix, b(ω) ∈ L 2 (Ω) p and c(ω) ∈ L 2 (Ω)n are given. We assume
that the probability has support over the closed set Ω ⊂ Rn ω and that for some
matrices B and C of appropriate dimension:

c(ω) = Cω; b(ω) = Bω; E|ω|2 < ∞. (8.56)

In addition we decide to take a linear decision rule, i.e. for some X ∈ Rn×n ω :

x(ω) = X ω a.s. on Ω. (8.57)

Remark 8.6 We may assume that

ω1 = 1 a.s. on Ω, (8.58)

so that these linear decision rules are in fact affine decision rules on ω2 , . . . , ωn ω .
Denoting by (AX − B)i the ith row of AX − B, the resulting problem reads:

Min E(Cω) · (X ω); (AX − B)i ω ≤ 0 a.s., i = 1, . . . , p. (8.59)


X

Denoting the second moment of ω by M := E ωω^⊤, we get by (8.54) that

  E(Cω) · (Xω) = E ω^⊤ C^⊤ X ω = E ⟨ωω^⊤, X^⊤ C⟩_F = ⟨M, X^⊤ C⟩_F = trace(M C^⊤ X),   (8.60)

so that (8.59) can be reformulated as

  Min_X  trace(M C^⊤ X);   (AX − B)_i ∈ Ω^− ,  i = 1, …, p.   (8.61)

This is a linear problem in X , which might be tractable if Ω − has a nice structure.
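The identity (8.60) underlying this reformulation can be checked numerically; with the empirical second moment in place of M the equality is exact (a hedged numpy sketch; the dimensions and the sampling law are our own choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_omega, n_samp = 2, 3, 1000
C = rng.standard_normal((n, n_omega))
X = rng.standard_normal((n, n_omega))

# Samples of ω, with first component 1 as in (8.58); the law is our choice.
W = np.column_stack([np.ones(n_samp), rng.standard_normal((n_samp, n_omega - 1))])
M = W.T @ W / n_samp                                  # empirical M = E ωωᵀ
lhs = np.mean(np.sum((W @ C.T) * (W @ X.T), axis=1))  # E (Cω)·(Xω)
rhs = np.trace(M @ C.T @ X)                           # trace(M Cᵀ X), cf. (8.60)
assert np.isclose(lhs, rhs)
```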

8.2.3 Linear Programming Reformulation

Let (z, h) ∈ Rn z × Rn h . Assume that, for some matrices W , Z of appropriate size:

Ω = {ω ∈ Rn ω ; W ω + Z z ≥ h}. (8.62)

Let y ∈ Rn ω . That y ∈ Ω − means that v(y) ≥ 0, where

v(y) := inf {−y · ω; W ω + Z z ≥ h}. (8.63)


ω,z

So, v(y) is the value of a feasible linear program (we assume of course that Ω is
nonempty) whose Lagrangian function is

  − y · ω + λ · (h − W ω − Z z) = −(y + W^⊤ λ) · ω − (Z^⊤ λ) · z + λ · h.   (8.64)

Therefore, the dual problem has the same value as the primal one, i.e.,

  v(y) = sup_{λ≥0} { λ · h;  y + W^⊤ λ = 0;  Z^⊤ λ = 0 }.   (8.65)

In addition, both the primal and dual problem have solutions if v(y) is finite. So,
v(y) ≥ 0 iff λ · h ≥ 0, for some λ satisfying the constraints in (8.65), which may be
expressed in the form

  λ^⊤ W + y^⊤ = 0;   λ^⊤ Z = 0;   λ ≥ 0.   (8.66)

Taking for y^⊤ the rows of AX − B, and denoting by Λ the matrix whose rows are
the transpose of the corresponding λ, we obtain an equivalent linear programming
reformulation of problem (8.61):

  Min_{X,Λ}  trace(M C^⊤ X);   AX + ΛW = B;   ΛZ = 0;   Λh ≥ 0;   Λ ≥ 0.   (8.67)

So, we have proved that

Lemma 8.7 Let Ω be of the form (8.62). Then the value of the linear programming
problem (8.67) is an upper bound of the value of the original problem (8.55).

8.2.4 Linear Conic Reformulation

We next generalize the previous analysis by considering the setting of linear conical
optimization, see Chap. 1, Sect. 1.3.2. Assume that for some z ∈ Rn z , h ∈ Rn h , W and
Z matrices of appropriate dimensions, and some (finite-dimensional) closed convex
cone K :
Ω = {ω ∈ Rn ω ; W ω + Z z − h ∈ K }. (8.68)

That y ∈ Ω − means that v(y) ≥ 0, where

v(y) := inf {−y · ω; W ω + Z z − h ∈ K }. (8.69)


ω,z

Remember that the infimum is not necessarily attained, even if the value is finite.
Assume that the above problem is qualified, i.e., for some ε > 0:

ε BY ⊂ K + h + Im(W ) + Im(Z ). (8.70)

Expressing the dual using K + rather than K − , by Corollary 1.144, we have that
either v(y) = −∞, or

  v(y) = max_{λ∈K^+} { λ · h;  y + W^⊤ λ = 0;  Z^⊤ λ = 0 }.   (8.71)

So, v(y) ≥ 0 iff λ · h ≥ 0, for some dual feasible λ. The dual constraints may be
expressed in the form

  λ^⊤ W + y^⊤ = 0;   λ^⊤ Z = 0;   λ ∈ K^+ .   (8.72)

Taking for y^⊤ the rows of AX − B, and denoting by Λ the matrix whose rows are
the transpose of the corresponding λ, we obtain an equivalent conic reformulation
of (8.61), where Λ_i denotes the ith row of the matrix Λ:

  Min_{X,Λ}  trace(M C^⊤ X);   AX + ΛW = B;   ΛZ = 0;
                                                                (8.73)
  Λh ≥ 0;   Λ_i ∈ K^+ ,  i = 1, …, p.

We have proved that

Lemma 8.8 Let Ω be of the form (8.68), and satisfy the qualification condition
(8.70). Then the value of (8.73) is an upper bound of the value of the original
problem (8.55).

8.2.5 Dual Bounds in a Conic Setting

8.2.5.1 Derivation of the Dual Bound

We are now looking for lower bounds of the value of the original stochastic opti-
mization problem (8.55), when Ω is of the form (8.68). Denoting by v P the value of
(8.55), which we may express as

  v_P = inf_{x∈L²(Ω)^n, s∈L²(Ω)^p_+}  sup_{y∈L²(Ω)^p}  E ( c(ω) · x(ω) + y(ω) · (Ax(ω) + s(ω) − b(ω)) ),   (8.74)
we get a lower bound by restricting, in the above expression, y to some subspace say
Y of L 2 (Ω) p : so v P ≥ vY , where

  v_Y := inf_{x∈L²(Ω)^n, s∈L²(Ω)^p_+}  sup_{y∈Y}  E ( c(ω) · x(ω) + y(ω) · (Ax(ω) + s(ω) − b(ω)) ).   (8.75)

Note that

  v_Y := inf_{x∈L²(Ω)^n, s∈L²(Ω)^p_+}  { E c(ω) · x(ω);   Ax(ω) + s(ω) − b(ω) ∈ Y^⊥ }.   (8.76)

Consider the particular case of a linear multiplier rule:

Y = {y ∈ L 2 (Ω) p ; y(ω) = Y ω for some matrix Y }. (8.77)

Then
 
  v_Y = inf_{x∈L²(Ω)^n, s∈L²(Ω)^p_+}  sup_Y  E ( c(ω) · x(ω) + ω^⊤ Y^⊤ (Ax(ω) + s(ω) − b(ω)) ).   (8.78)
Set e(ω) := Ax(ω) + s(ω) − b(ω). Then by the transposition rule (8.54):

  E ω^⊤ Y^⊤ e(ω) = E e(ω)^⊤ Y ω = E ⟨Y, e(ω)ω^⊤⟩_F = ⟨Y, E e(ω)ω^⊤⟩_F ,   (8.79)

and therefore e(·) ∈ Y^⊥ iff E e(ω)ω^⊤ = 0. It follows that

  v_Y = inf_{x∈L²(Ω)^n, s∈L²(Ω)^p_+}  { E c(ω) · x(ω);   E (Ax(ω) + s(ω) − b(ω)) ω^⊤ = 0 }.   (8.80)

We next discuss the second-order moment of ω.

Lemma 8.9 We have that M = E ωω^⊤ is of full rank iff Ω spans R^{n_ω}.

Proof Since M is symmetric and positive semidefinite, it is not of full rank iff there exists
some nonzero g ∈ R^{n_ω} such that

  0 = g^⊤ M g = E g^⊤ ωω^⊤ g = E (ω^⊤ g)² .   (8.81)

So, M is not of full rank iff ω a.s. lies in the orthogonal complement of some nonzero vector g ∈ R^{n_ω}.
The conclusion follows. 

In the sequel we assume that

  M = E ωω^⊤ is of full rank.   (8.82)

So, the matrices X , S, B are uniquely defined by the relations below:

  X M = E x(ω)ω^⊤ ;   S M = E s(ω)ω^⊤ ;   B M = E b(ω)ω^⊤ .   (8.83)



On the other hand, given any n × n ω matrix X , we have that x(ω) = X ω is such that
the above first relation holds. Assume in the sequel that c(ω) = Cω. Using (8.54)
and the symmetry of M, we get that

  E c(ω) · x(ω) = E x(ω)^⊤ C ω = ⟨C, E x(ω)ω^⊤⟩_F = ⟨C, X M⟩_F
                = ⟨M X^⊤, C^⊤⟩_F = ⟨M, C^⊤ X⟩_F = ⟨M, X^⊤ C⟩_F = trace(M C^⊤ X).
                                                                        (8.84)
So, we can express vY in terms of X rather than x. It follows that

  v_Y = inf_{X, S, s∈L²(Ω)^p_+}  { trace(M C^⊤ X);   S M = E s(ω)ω^⊤ ;   (AX + S − B)M = 0 }.   (8.85)
Since M is invertible, we deduce that

  v_Y = inf_{X, S, s∈L²(Ω)^p_+}  { trace(M C^⊤ X);   AX + S = B;   S M = E s(ω)ω^⊤ }.   (8.86)

The above problem is still not tractable, but we will see that it has the following
tractable relaxation:

  v_Y^1 = inf_{X,S,Γ}  { trace(M C^⊤ X);   AX + S = B;   (W − h e_1^⊤) M S^⊤ + Z Γ ∈ K^p }.   (8.87)

The last inclusion relation means that each column of (W − h e_1^⊤) M S^⊤ + Z Γ
belongs to K . We need to assume that

There exists a measurable mapping Ω → R^{n_z} , ω → z(ω) such that
                                                                        (8.88)
W ω + Z z(ω) − h ∈ K and, for some c > 0, |z(ω)| ≤ c(1 + |ω|) a.s.

Lemma 8.10 Let (8.58) and (8.88) hold. Then v_Y^1 ≤ v_Y ≤ v_P .


Proof The second inequality follows from the previous arguments. We next prove
the first one. Since vY1 and vY have the same cost function, it suffices to check that
if (X, S, s) satisfies the constraints in (8.86), then (X, S, Γ ) satisfies the constraints
in (8.87), for some Γ . Indeed, let s ∈ L²(Ω)^p_+ and S be such that S M = E s(ω)ω^⊤.
Since, by (8.58), any element of Ω has a first component equal to 1:

  (W − h e_1^⊤) M S^⊤ = (W − h e_1^⊤) E ω s(ω)^⊤ = E (W ω − h) s(ω)^⊤ .   (8.89)

Set Γ := E z(ω)s(ω)^⊤ ; note that this expectation is finite since E|ω|² < ∞, in view
of (8.88). By the above display,

  (W − h e_1^⊤) M S^⊤ + Z Γ = E (W ω − h + Z z(ω)) s(ω)^⊤ .   (8.90)

The jth column of the r.h.s. matrix is E(W ω − h + Z z(ω))s j (ω). Since W ω − h +
Z z(ω) ∈ K a.e., and K is a closed convex cone, it belongs to K . The conclusion
follows. 

Remark 8.11 (i) The derivation of this dual bound did not assume any qualification
condition.
(ii) For a refined analysis of the lower bound, in the case when K is the set of
nonnegative vectors, see [66].

8.3 Notes

Kelley’s [63] algorithm 8.1 for minimizing a convex function over a set X essentially
requires us to solve a linear programming problem at each step, if X is a polyhedron.
Various improvements, involving the quadratic penalization of the displacement and
therefore the resolution of convex quadratic programs, are described in Bonnans et
al. [24].
The SDDP algorithm, due to Pereira and Pinto [86], can be seen as an extension
of the Benders decomposition [18]. Shapiro [113] analyzed the convergence of such
an algorithm for problems with potentially infinitely many scenarios, and considered
the case of a risk averse formulation, based on the conditional value at risk. See also
Girardeau et al. [52]. In the case of a random noise process with memory, a possibility
is to approximate it by a Markov chain, obtained by a quantization method, and to
apply the SDDP approach to the resulting dynamic programming formulation. This
applies more generally when the value functions are convex w.r.t. some variables
only, see Bonnans et al. [23]. The SDDP approach can also provide useful bounds
in the case of problems with integer constraints, see Zou, Ahmed and Sun [128].
In the presentation of linear decision rules we follow Georghiou et al. [51, 66].
The primal upper bound (8.73) can be computed by efficient algorithms when K
is the product of polyhedral cones, second-order cones, and cones of semidefinite
symmetric matrices. See e.g. Nesterov and Nemirovski [85]. For other aspects of
linear decision rules, in connection with robust optimization (for which a reference
book is [16]), see Ben-Tal et al. [14].
Chapter 9
Generalized Convexity and
Transportation Theory

Summary This chapter first presents the generalization of convexity theory when
replacing duality products with general coupling functions on arbitrary sets. The
notions of Fenchel conjugates, cyclical monotonicity and duality of optimization
problems, have a natural extension to this setting, in which the augmented Lagrangian
approach has a natural interpretation.
Convex functions over measure spaces, constructed as Fenchel conjugates of
integral functions of continuous functions, are shown to be sometimes equal to some
integral of a function of their density. This is used in the presentation of optimal
transportation theory over compact sets, and the associated penalized problems. The
chapter ends with a discussion of the multi-transport setting.

9.1 Generalized Convexity

9.1.1 Generalized Fenchel Conjugates

Let X and Y be arbitrary sets and κ : X × Y → R, called a coupling between X and


Y (and then X , Y are called in this context coupled spaces). The κ-Fenchel conjugate
of ϕ : X → R is ϕ κ : Y → R, defined by

ϕ κ (y) := sup (κ(x, y) − ϕ(x)) . (9.1)


x∈X

We have the κ-Fenchel–Young inequality

ϕ κ (y) ≥ κ(x, y) − ϕ(x), for all x ∈ X and y ∈ Y. (9.2)

If ϕ has a finite value at x ∈ X , we define the κ-subdifferential of ϕ at x ∈ X as

∂κ ϕ(x) := {y ∈ Y ; ϕ(x) + ϕ κ (y) = κ(x, y)}. (9.3)


© Springer Nature Switzerland AG 2019 283
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2_9

So y ∈ ∂κ ϕ(x) iff equality holds in the κ-Fenchel–Young inequality. Recalling the


definition of ϕ κ , we see that y ∈ ∂κ ϕ(x) iff ϕ(x) is finite and the following κ-
subdifferential inequality holds:

  ϕ(x') ≥ ϕ(x) + κ(x', y) − κ(x, y),  for all x' ∈ X.   (9.4)

We call a κ-minorant of ϕ any function over X of the form x → κ(x, y) − β, for


some (y, β) ∈ Y × R, that is a minorant of ϕ, i.e., such that

β ≥ κ(x, y) − ϕ(x), for all x ∈ X. (9.5)

Clearly, this holds iff β ≥ ϕ κ (y). In other words, for any given y ∈ Y , if ϕ κ (y) is
finite, then x → κ(x, y) − ϕ κ (y) is the ‘best’ κ-minorant of the form κ(x, y) − β,
for some β ∈ R. If ϕ κ (y) = ∞, there is no such minorant. Finally, ϕ κ (y) = −∞
means that ϕ(x) = ∞, for any x ∈ X .
Since X and Y play symmetric roles we have similar notions for ψ : Y → R. For
instance, the κ-Fenchel conjugate of ψ is the function ψ κ : X → R defined by

ψ κ (x) := sup (κ(x, y) − ψ(y)) . (9.6)


y∈Y

We can define the κ-biconjugate of ϕ : X → R as the conjugate of its conjugate, i.e.


the function X → R defined by

ϕ κκ (x) := sup (κ(x, y) − ϕ κ (y)) . (9.7)


y∈Y

In view of (9.5), the κ-biconjugate is the supremum of κ-minorants, and therefore is


itself a minorant of ϕ, that is,

ϕ κκ (x) ≤ ϕ(x), for all x ∈ X. (9.8)

We say that ϕ : X → R is κ-convex if it is the κ-conjugate of some function ψ :


Y → R. The following holds:

Lemma 9.1 (i) The biconjugate of ϕ is the greatest κ-convex function dominated by
ϕ. That is, if f : X → R is κ-convex and f (x) ≤ ϕ(x) for all x ∈ X , then f (x) ≤
ϕ κκ (x) for all x ∈ X .
(ii) A function is κ-convex iff it is equal to its biconjugate.
(iii) A supremum of κ-convex functions is κ-convex.

Proof (i) Let f be a κ-convex minorant of ϕ. Then f = ψ κ for some ψ : Y → R,


and then
ϕ(x) + ψ(y) ≥ f (x) + ψ(y) ≥ κ(x, y), (9.9)

so that
ψ(y) ≥ sup(κ(x, y) − ϕ(x)) = ϕ κ (y) (9.10)
x∈X

and therefore (since the κ-Fenchel conjugate is obviously decreasing) ϕ κκ (x) ≥


f (x).
(ii) Direct consequence of (i).
(iii) Let the ϕi : X → R̄ be κ-convex for i ∈ I , and set ϕ(x) := supi∈I ϕi (x). Since
each ϕi is equal to its biconjugate, we have that

ϕ(x) = sup sup(κ(x, y) − ϕiκ (y)) = sup(κ(x, y) − inf ϕiκ (y)), (9.11)
i∈I y∈Y y∈Y i∈I

which shows that ϕ is a κ-conjugate, and is therefore κ-convex. 


Remark 9.2 If the subdifferential of ϕ at x̄ ∈ X contains ȳ, then

ϕ κκ (x̄) ≥ κ(x̄, ȳ) − ϕ κ ( ȳ) = ϕ(x̄). (9.12)

In other words,
  ∂_κ ϕ(x̄) ≠ ∅  ⇒  ϕ^κκ(x̄) = ϕ(x̄).   (9.13)

Lemma 9.3 Let ϕ : X → R. Then


(i) ϕ and its biconjugate have the same conjugate.
(ii) Let ϕ be equal to ϕ κκ at x̄ ∈ X . Then ∂κ ϕ(x̄) = ∂κ ϕ κκ (x̄).
Proof (i) Since the biconjugate is the supremum of κ minorants, a function and its
biconjugate have the same κ-minorants, and hence, the same κ-conjugate.
(ii) The κ-subdifferential of the biconjugate of ϕ at any x ∈ X satisfies, in view of
(i):
∂κ ϕ κκ (x) := {y ∈ Y ; ϕ κκ (x) + ϕ κ (y) = κ(x, y)}. (9.14)

So when ϕ and its biconjugate have the same value at some point they also have the
same κ-subdifferential. 
Remark 9.4 When X is a Banach space, Y is its dual, and κ(x, y) = y, x is the
usual duality product, we will speak of usual convexity, and then we recover the
usual Fenchel transform. Note, however, the difference in the definition of convex
functions.
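On finite sets X and Y, a coupling is just a matrix and the κ-conjugates (9.1) and (9.7) reduce to column and row maxima, so the statements above are easy to check numerically (a hedged numpy sketch with random data of our own choosing):

```python
import numpy as np

rng = np.random.default_rng(2)
nx, ny = 6, 5
kappa = rng.standard_normal((nx, ny))   # coupling κ(x, y) on finite sets
phi = rng.standard_normal(nx)           # an arbitrary ϕ : X → R

phi_k = np.max(kappa - phi[:, None], axis=0)        # ϕ^κ(y), cf. (9.1)
phi_kk = np.max(kappa - phi_k[None, :], axis=1)     # ϕ^κκ(x), cf. (9.7)
phi_kkk = np.max(kappa - phi_kk[:, None], axis=0)   # conjugate of the biconjugate

assert (phi_kk <= phi + 1e-12).all()                # (9.8): ϕ^κκ ≤ ϕ
assert np.allclose(phi_kkk, phi_k)                  # Lemma 9.3(i)
```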

9.1.2 Cyclical Monotonicity

We say that the set Γ ⊂ X × Y is κ-cyclically monotone if for any positive N ∈ N


and finite sequence (x1 , y1 ), . . . , (x N , y N ) in Γ , setting x N +1 := x1 , the following
holds:


N 
N
κ(xi , yi ) ≥ κ(xi+1 , yi ). (9.15)
i=1 i=1

Lemma 9.5 We have that Γ is κ cyclically monotone iff there exists a κ-convex
function ϕ over X , such that

y ∈ ∂κ ϕ(x), for all (x, y) ∈ Γ. (9.16)

Proof (i) If some κ-convex function ϕ over X satisfies (9.16) then, by the κ-Fenchel–
Young inequality: 
κ(xi , yi ) = ϕ(xi ) + ϕ κ (yi ),
(9.17)
−κ(xi+1 , yi ) ≥ −ϕ(xi+1 ) − ϕ κ (yi ).

Summing these inequalities for i = 1 to N , we get (9.15).


(ii) Conversely, let (9.15) hold. Fix (x1 , y1 ) ∈ Γ and, for x ∈ X , set


  ϕ(x) := sup Σ_{i=1}^N ( κ(x_{i+1}, y_i) − κ(x_i, y_i) ).   (9.18)

The supremum is w.r.t. all nonzero N ∈ N, and all (x_i, y_i) in Γ , i = 2 to N ,
with x_{N+1} equal to x. Then ϕ is κ-convex since we may express it as

  ϕ(x) := sup_{y_N ∈ Y} ( κ(x, y_N) + sup ( −κ(x_N, y_N) + Σ_{i=1}^{N−1} ( κ(x_{i+1}, y_i) − κ(x_i, y_i) ) ) ),   (9.19)

the second supremum being w.r.t. all nonzero N ∈ N, and all (x_i, y_i) in Γ ,
i = 2 to N , with y_N = y given. Observe that ϕ(x) > −∞ for all x ∈ X . By cyclical
monotonicity, ϕ(x_1) ≤ 0. Taking N = 2 and (x_2, y_2) = (x_1, y_1), we obtain the
converse inequality; it follows that ϕ(x1 ) = 0.
Next, let (x̄, ȳ) ∈ Γ . We must prove that ϕ(x̄) is finite, and that ȳ ∈ ∂κ ϕ(x̄).
Setting (x N +1 , y N +1 ) := (x̄, ȳ) and x N +2 := x, we get that, by the definition of ϕ:


N +1 
N
ϕ(x) ≥ (κ(xi+1 , yi ) − κ(xi , yi )) = κ(x, ȳ) − κ(x̄, ȳ) + (κ(xi+1 , yi ) − κ(xi , yi )).
i=1 i=1
(9.20)
Maximizing over the last sum it follows that

ϕ(x) ≥ κ(x, ȳ) − κ(x̄, ȳ) + ϕ(x̄). (9.21)

Taking x = x1 we deduce that ϕ(x̄) < ∞. It follows that ϕ(x̄) is finite, and by the
above display, ȳ ∈ ∂κ ϕ(x̄). The conclusion follows. 
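Part (i) of the proof uses only the κ-Fenchel–Young (in)equality, so the graph built from any ϕ by taking maximizers in the conjugate (9.1) must satisfy (9.15); this can be checked numerically (a hedged sketch on random finite data, enumerating only cycles of length at most 3):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
kappa = rng.standard_normal((5, 5))   # coupling on finite sets X = Y = {0,...,4}
phi = rng.standard_normal(5)

# Build Γ ⊂ X × Y from the κ-subdifferential: for each y, any maximizer x of
# κ(x, y) - ϕ(x) satisfies y ∈ ∂_κ ϕ(x), cf. (9.3).
gamma = [(int(np.argmax(kappa[:, y] - phi)), y) for y in range(5)]

# Check κ-cyclical monotonicity (9.15) on every cycle drawn from Γ (N ≤ 3).
for N in (2, 3):
    for cyc in itertools.permutations(gamma, N):
        xs = [x for x, _ in cyc] + [cyc[0][0]]        # x_{N+1} := x_1
        lhs = sum(kappa[x, y] for x, y in cyc)        # Σ κ(x_i, y_i)
        rhs = sum(kappa[xs[i + 1], cyc[i][1]] for i in range(N))
        assert lhs >= rhs - 1e-12
```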

9.1.3 Duality

Consider a family of optimization problems of the form

  Min_{x∈X}  ϕ(x, y) − κ_X(x, x').                (P_y)

Here we have arbitrary sets X , X', Y , Y', y ∈ Y , x' ∈ X', and coupling functions κ_X ,
κ_Y between (X, X') and (Y, Y') resp. The product spaces (X, Y ) and (X', Y') are
endowed with the product coupling

  κ(x, y; x', y') := κ_X(x, x') + κ_Y(y, y').   (9.22)

We denote the value function (for fixed x') of problem (P_y) by

  v(y) := inf_{x∈X} ( ϕ(x, y) − κ_X(x, x') ).   (9.23)

Its κ_Y conjugate, denoted by v^κ, is

  v^κ(y') := sup_{(x,y)∈X×Y} ( κ_X(x, x') + κ_Y(y, y') − ϕ(x, y) ) = ϕ^κ(x', y').   (9.24)

So, its biconjugate is

  v^κκ(y) := sup_{y'∈Y'} ( κ_Y(y, y') − ϕ^κ(x', y') ).   (9.25)
y  ∈Y 

This leads us to define the dual problem as

  Max_{y'∈Y'}  κ_Y(y, y') − ϕ^κ(x', y').                (D_y)

Our previous results on generalized convexity (in particular Lemma 9.3) lead to the
following weak duality result:

Theorem 9.6 We have that

val(D y ) = vκκ (y) ≤ v(y) = val(Py ), (9.26)


S(D y ) = ∂κ vκκ (y), (9.27)
∂κ v(y) = ∅ ⇒ ∂κ v(y) = S(D y ). (9.28)
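
For the standard coupling κ(y, y′) = yy′ on R, κ-conjugation is Fenchel conjugation, and the weak duality inequality v^{κκ}(y) ≤ v(y) of (9.26) can be checked numerically on grids. The following is a sketch only; the nonconvex function chosen below is an arbitrary example of ours, not taken from the text.

```python
import numpy as np

# Grids for the variable y and the dual variable y' (coupling k(y, y') = y*y').
y = np.linspace(-3.0, 3.0, 601)
yp = np.linspace(-10.0, 10.0, 2001)

# A nonconvex value function v: minimum of two parabolas, minimized at y = -1, 1.
v = np.minimum((y - 1.0) ** 2, (y + 1.0) ** 2)

# Conjugate v^k(y') = sup_y { y*y' - v(y) } and biconjugate v^kk(y).
vk = np.max(np.outer(yp, y) - v[None, :], axis=1)
vkk = np.max(np.outer(y, yp) - vk[None, :], axis=1)

# Weak duality as in (9.26): the biconjugate minorizes v everywhere.
assert np.all(vkk <= v + 1e-9)
# At the minimizer y = 1 the two values coincide (up to grid error).
i = np.argmin(np.abs(y - 1.0))
assert abs(vkk[i] - v[i]) < 1e-2
```

On this example the biconjugate is the convex envelope of v, so the inequality is strict on the nonconvex region between the two minimizers and an equality elsewhere.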

9.1.4 Augmented Lagrangian

We continue in the previous setting, in the case when X is again an arbitrary set, Y is a Banach space, and the family of optimization problems is

Min_{x∈X} f(x) − κ_X(x, x′);   g(x) + y ∈ K,   (P_y)

with g : X → Y and K a closed convex subset of Y. We introduce a penalty function P : Y → R̄ and a penalty parameter r > 0. The penalized problem is

Min_{x∈X} f(x) − κ_X(x, x′) + r P(y);   g(x) + y ∈ K.   (P_{r,y})

Its value, denoted by v_r, satisfies

v_r(y) = inf_x { f(x) − κ_X(x, x′) + r P(y);  g(x) + y ∈ K } = v(y) + r P(y).   (9.29)
In the sequel we assume that Y′ is the (topological) dual of Y, and we consider two types of dualization:
(a) Dualization of the previous penalized problem using the standard coupling whose expression is κ_Y(y, y*) := ⟨y*, y⟩. We then have, writing y = z − g(x), with z ∈ K:

v_r^κ(y*) = sup_{x,y} { κ_X(x, x′) − f(x) + ⟨y*, y⟩ − r P(y);  g(x) + y ∈ K }
          = sup_x { κ_X(x, x′) − f(x) − ⟨y*, g(x)⟩ + sup_{z∈K} { ⟨y*, z⟩ − r P(z − g(x)) } }.   (9.30)

Define the augmented Lagrangian

L_r(x, y*) := f(x) + inf_{z∈K} { r P(z − g(x)) + ⟨y*, g(x) − z⟩ }.   (9.31)

We have shown that

v_r^κ(y*) = sup_x { κ_X(x, x′) − L_r(x, y*) }.   (9.32)

So, the dual problem is nothing but

Max_{y*∈Y*} ⟨y*, y⟩ + inf_x { L_r(x, y*) − κ_X(x, x′) }.   (D_{r,y})

(b) Dualization of the original problem (P_y), with value v(y), using the coupling between Y and Y′ defined by

κ̂_Y(y, y*) := ⟨y*, y⟩ − r P(y).   (9.33)

We denote the κ̂-conjugate of v(y) by v̂^κ. Then

v̂^κ(y*) = sup_{x,y} { κ_X(x, x′) − f(x) + ⟨y*, y⟩ − r P(y);  g(x) + y ∈ K }.   (9.34)

Therefore we get the same value function:

v̂^κ(y*) = sup_x { κ_X(x, x′) − L_r(x, y*) } = v_r^κ(y*).   (9.35)

Definition 9.7 We say that y* ∈ Y* is an augmented Lagrange multiplier of the unperturbed problem (y = 0) if it belongs to ∂v_r(0); that is, if v(0) = val(D_{r,0}) and y* ∈ S(D_{r,0}).

Note that y* ∈ Y* is an augmented Lagrange multiplier iff

v(0) = inf_x { L_r(x, y*) − κ_X(x, x′) }.   (9.36)

Remark 9.8 Observe that, when P(0) = 0, in cases (a) and (b), the duality gap is the same for the unperturbed problem y = 0. So, the augmented Lagrangian approach can be seen as a generalized convexity approach on the original problem with the nonstandard coupling ⟨y*, y⟩ − r P(y).

Example 9.9 The classical example is when Y is a Hilbert space identified with its dual, and P(y) = ½‖y‖². Then the penalty term in the augmented Lagrangian is

inf_{z∈K} { ½ r ‖z − g(x)‖² + ⟨y*, g(x) − z⟩ } = inf_{z∈K} ½ r ‖z − g(x) − (1/r) y*‖² − (1/(2r)) ‖y*‖²,   (9.37)

and therefore the augmented Lagrangian is

L_r(x, y*) := f(x) + ½ r dist_K(g(x) + (1/r) y*)² − (1/(2r)) ‖y*‖².   (9.38)

The case of finitely many inequality constraints corresponds to the case when Y = R^m is endowed with the Euclidean norm and K = R^m_−. The expression of the augmented Lagrangian is then

L_r(x, y*) := f(x) + Σ_{i=1}^m [ ½ r (g_i(x) + (1/r) y_i*)₊² − (1/(2r)) (y_i*)² ].   (9.39)
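
As a numerical illustration of (9.39) (with the trivial coupling κ_X ≡ 0), the sketch below applies the classical method of multipliers to min{x² : 1 − x ≤ 0}, whose solution is x = 1 with multiplier y* = 2. The grid minimization and the update rule y* ← (y* + r g(x))₊ are standard choices of ours, not derived in the text.

```python
import numpy as np

r = 10.0                        # penalty parameter
f = lambda x: x ** 2
g = lambda x: 1.0 - x           # constraint g(x) <= 0, i.e. K = R_-

def L_r(x, ystar):
    # Augmented Lagrangian (9.39) for a single inequality constraint.
    return (f(x) + 0.5 * r * np.maximum(g(x) + ystar / r, 0.0) ** 2
            - ystar ** 2 / (2.0 * r))

xgrid = np.linspace(0.0, 2.0, 20001)
ystar = 0.0
for _ in range(30):
    x = xgrid[np.argmin(L_r(xgrid, ystar))]   # minimize L_r in x on a grid
    ystar = max(ystar + r * g(x), 0.0)        # multiplier update

assert abs(x - 1.0) < 1e-3      # primal solution
assert abs(ystar - 2.0) < 1e-3  # augmented Lagrange multiplier
```

On this problem the multiplier iteration contracts geometrically (here with factor 1/6), which is the usual behavior of the method for r large enough.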

9.2 Convex Functions of Measures

In various applications we need to minimize nonlinear functions of measures, involving for instance entropic regularization terms, as we will see later in the context of optimal transportation problems. We will see how to construct some convex functions of measures, as Fenchel conjugates of integral functionals of continuous functions.

9.2.1 A First Result

Let Ω be a compact subset of R^n, and C(Ω) denote the set of continuous functions over Ω. Let p ∈ N be nonzero and set X := C(Ω)^p, whose elements are viewed as continuous functions over Ω with values in R^p, endowed with the norm

‖ϕ‖_X := max_{ω∈Ω} |ϕ(ω)|.   (9.40)

Given f : R^p → R̄, l.s.c. convex and proper, let F : X → R̄ be defined by

F(ϕ) := ∫_Ω f(ϕ(ω)) dω.   (9.41)

Lemma 9.10 The functional F is convex, l.s.c. and proper over X.

Proof By Theorem 1.44, f has an affine minorant, and therefore F is well-defined, with values in (−∞, +∞]. The convexity of F is obvious. Taking ϕ to be constant, equal to an element of the domain of f, we obtain that F is proper. Finally, we prove that F is l.s.c. It is enough to consider a sequence ϕ_k → ϕ in X such that lim_k F(ϕ_k) exists and is finite. For any measurable function a ∈ L¹(Ω)^p, we have that

lim_k F(ϕ_k) ≥ lim inf_k ∫_Ω (a(ω)·ϕ_k(ω) − f*(a(ω))) dω = ∫_Ω (a(ω)·ϕ(ω) − f*(a(ω))) dω.   (9.42)

Indeed a(ω)·x − f*(a(ω)) ≤ f(x), so that the above inequality holds (since f* has an affine minorant, the integral has values in [−∞, ∞)), and the equality is obvious since ϕ_k → ϕ in X. By Proposition 3.74, the supremum over a(·) of the r.h.s. is precisely F(ϕ). The result follows. □

Remark 9.11 If dom( f ) = R p then dom(F) = X , and F is bounded over bounded


sets, so that it is continuous.

Recall the Definition 5.3 of regular measures. The dual of C(Ω) is M(Ω), the set of finite Borel regular measures over Ω; see [77, Chap. II, Sect. 5]. The Fenchel conjugate of F is F* : M(Ω)^p → R̄ defined by

F*(μ) := sup_{ϕ∈X} ⟨μ, ϕ⟩_X − ∫_Ω f(ϕ(ω)) dω.   (9.43)

Here, denoting by μ_i, 1 ≤ i ≤ p, the components of the vector measure μ:

⟨μ, ϕ⟩_X := Σ_{i=1}^p ∫_Ω ϕ_i(ω) dμ_i(ω).   (9.44)
Let L¹_μ := Π_{i=1}^p L¹_{μ_i}(Ω) denote the set of integrable functions for the measure μ.

Definition 9.12 We say that h : R^p → R̄ has superlinear growth if, for all k > 0, h(y)/|y| > k when |y| > r_k, for some r_k > 0.

Lemma 9.13 We have that f* has superlinear growth iff dom(f) = R^p.

Proof Let c_k := sup{ f(x); |x| ≤ k }. Then

f*(y) = sup_{k,x} { x·y − f(x); |x| ≤ k } ≥ sup_{k,x} { x·y − c_k; |x| ≤ k } = sup_k (k|y| − c_k).   (9.45)

If dom(f) = R^p, by Corollary 1.58, f is continuous, so that c_k is finite, for all k, and then by the above display, f* has superlinear growth. Conversely, let f* have superlinear growth. Set

g_k(x) = sup_{|y|≤r_k} { x·y − f*(y) };   h_k(x) = sup_{|y|>r_k} { x·y − f*(y) }.   (9.46)

Then f(x) = max(g_k(x), h_k(x)), and when |x| < k:

h_k(x) ≤ sup_{|y|>r_k} { x·y − k|y| } ≤ sup_{|y|>r_k} (|x| − k)|y| ≤ 0.   (9.47)

On the other hand, f*(y) has an affine minorant, say a·y − b, so that

g_k(x) ≤ sup_{|y|≤r_k} { x·y − a·y + b } ≤ r_k |x − a| + b.   (9.48)

Therefore f(x) < ∞. □
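
A quick numerical sanity check of Lemma 9.13: for f(x) = x², which is finite on all of R, the conjugate is f*(y) = y²/4, whose growth is indeed superlinear. The example function is ours, not the text's; a minimal sketch:

```python
import numpy as np

# f(x) = x^2 has dom(f) = R, so its conjugate should grow superlinearly.
x = np.linspace(-100.0, 100.0, 200001)
fx = x ** 2

def fstar(y):
    # Discrete Fenchel conjugate f*(y) = sup_x { x*y - f(x) }.
    return np.max(x * y - fx)

for y in [4.0, 20.0, 80.0]:
    assert abs(fstar(y) - y ** 2 / 4.0) < 1e-3   # closed form f*(y) = y^2/4
# Superlinear growth: the ratio f*(y)/|y| (here equal to |y|/4) is unbounded.
assert fstar(80.0) / 80.0 > fstar(20.0) / 20.0 > fstar(4.0) / 4.0
```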

By the Lebesgue decomposition theorem, any μ ∈ X* can be decomposed in a unique way as μ = μ_s + μ_a, where μ_s is the singular part and μ_a is the absolutely continuous part; see [105, Chap. 11]. We identify μ_a ∈ L¹(Ω)^p with its density w.r.t. the Lebesgue measure.

Lemma 9.14 Let f* have superlinear growth. Then

F*(μ) = ∞ if μ_s ≠ 0;   F*(μ) = ∫_Ω f*(μ_a(ω)) dω otherwise.   (9.49)

Proof (i) If μ_s ≠ 0, there exists a measurable subset E of Ω, of null Lebesgue measure, such that μ(E) ≠ 0 (note that μ(E) ∈ R^p), say μ_1(E) > 0, where μ_1 is the first component of μ. Since μ_1 is regular, there exists a compact K ⊂ E such that μ_1(K) > 0. Given ε ∈ (0, 1), set ϕ_ε(ω) := c(1 − d_K(ω)/ε)₊ for some c > 0. By the dominated convergence theorem, ϕ_ε converges in L¹_{μ_1} to c1_K, so that

⟨μ_1, ϕ_ε⟩ → c⟨μ_1, 1_K⟩ = cμ_1(K).   (9.50)

We next identify ϕ_ε with the element of C(Ω)^p with first component ϕ_ε and the other components equal to zero. By Lemma 9.13, f is Lipschitz on bounded sets, and so, by the dominated convergence theorem, F(ϕ_ε) → 0. Therefore, F*(μ) ≥ lim_ε (⟨μ_1, ϕ_ε⟩ − F(ϕ_ε)) = cμ_1(K). Letting c ↑ ∞ we deduce that F*(μ) = +∞.
(ii) Let μ_s = 0. Then

F*(μ) = sup_{ϕ∈X} ∫_Ω (μ_a(ω)·ϕ(ω) − f(ϕ(ω))) dω ≤ ∫_Ω f*(μ_a(ω)) dω,   (9.51)

where in the last inequality we use the Fenchel–Young inequality. We next prove the converse inequality. Set b(ω, v) := μ_a(ω)·v − f(v). Let (a_k) be a dense sequence in dom f and let ϕ_k ∈ L^∞(Ω) be inductively defined by ϕ_0(ω) = a_0 and

ϕ_k(ω) = a_k if b(ω, a_k) > b(ω, ϕ_{k−1}(ω));   ϕ_k(ω) = ϕ_{k−1}(ω) otherwise.   (9.52)

Then b(ω, ϕ_k(ω)) → f*(μ_a(ω)) a.e. and, by the monotone convergence theorem, ∫_Ω b(ω, ϕ_k(ω)) dω → ∫_Ω f*(μ_a(ω)) dω. We cannot conclude the result from this since the ϕ_k are not continuous. So, given ε > 0, fix k such that

∫_Ω b(ω, ϕ_k(ω)) dω > ∫_Ω f*(μ_a(ω)) dω − ε if ∫_Ω f*(μ_a(ω)) dω < ∞;   ∫_Ω b(ω, ϕ_k(ω)) dω > 1/ε otherwise.   (9.53)

Given M > 0, denote the truncation of μ_a by

μ_a^M(ω) := max(−M, min(M, μ_a(ω))).   (9.54)

Fix M > 0 such that ‖μ_a^M − μ_a‖_{L¹(Ω)} < ε. Extend ϕ_k over R^n by 0 and let η : R^n → R₊ be of class C^∞ with integral 1 and support in the unit ball. Set, for α > 0, η_α(x) := α^{−n} η(x/α), and ϕ̂_α := ϕ_k ∗ η_α (convolution product). By Jensen's inequality,

∫_{R^n} f(ϕ̂_α(ω)) dω ≤ ∫_{R^n} (f(ϕ_k) ∗ η_α)(ω) dω = ∫_Ω f(ϕ_k(ω)) dω.   (9.55)

By a dominated convergence argument we obtain that ∫_{R^n∖Ω} f(ϕ̂_α(ω)) dω → 0. So, for α > 0 small enough, by the above inequality:

∫_Ω f(ϕ̂_α(ω)) dω ≤ ∫_Ω f(ϕ_k(ω)) dω + ε.   (9.56)

Also, since ‖μ_a^M − μ_a‖_{L¹(Ω)} < ε, for small enough α:

|⟨μ, ϕ̂_α − ϕ_k⟩| ≤ |⟨μ − μ_a^M, ϕ̂_α − ϕ_k⟩| + |⟨μ_a^M, ϕ̂_α − ϕ_k⟩|
               ≤ ε‖ϕ̂_α − ϕ_k‖_∞ + M‖ϕ̂_α − ϕ_k‖_{L¹(Ω)}
               ≤ ε(2‖ϕ_k‖_∞ + 1).   (9.57)

In the last inequality we use ‖ϕ̂_α‖_∞ ≤ ‖ϕ_k‖_∞ and ϕ̂_α → ϕ_k in L¹(Ω). We conclude by combining the previous inequality with (9.53) and (9.56). □

9.2.2 A Second Result

Let g(ω, x) : Ω × R^p → R be a continuous function, convex w.r.t. x. Define G : X → R by

G(ϕ) := ∫_Ω g(ω, ϕ(ω)) dω.   (9.58)

Clearly G is convex and bounded over bounded sets. So, it is continuous, with conjugate

G*(μ) := sup_{ϕ∈X} ⟨μ, ϕ⟩_X − ∫_Ω g(ω, ϕ(ω)) dω.   (9.59)

We denote by g* the Fenchel conjugate of g w.r.t. its second variable.

Lemma 9.15 We have that

G*(μ) = ∞ if μ_s ≠ 0;   G*(μ) = ∫_Ω g*(ω, μ_a(ω)) dω otherwise.   (9.60)

Proof This is an easy variant of the proof of Lemma 9.14. Let us just mention that, while Jensen's inequality in (9.55) cannot be easily extended, we get directly the analogue of (9.56), namely

∫_Ω g(ω, ϕ̂_α(ω)) dω ≤ ∫_Ω g(ω, ϕ_k(ω)) dω + ε,   (9.61)

by the dominated convergence theorem, since ϕ̂_α is bounded in L^∞(Ω)^p and converges to ϕ_k in L¹(Ω)^p. □

9.3 Transportation Theory

We next analyze in a simple way the Kantorovich duality that extends the classical
Monge problem.

9.3.1 The Compact Framework

Let X be a compact subset of R^n, and C(X) denote the space of continuous functions over X, endowed with the uniform norm

‖ϕ‖_X := max{ |ϕ(x)|; x ∈ X }.   (9.62)

This is a Banach space, with dual denoted by M(X). We say that η ∈ M(X) is nonnegative, and we write η ≥ 0, if ⟨η, ϕ⟩_{C(X)} ≥ 0 for any nonnegative ϕ. We denote by M₊(X) the positive cone (set of nonnegative elements) of M(X). It is known that M(X) is the space of finite Borel measures over X, see [77, Chap. 2].
Given a compact subset Y of R^p, set Z := X × Y (this is a compact subset of R^{n+p}). To ϕ ∈ C(X) we associate B_X ϕ ∈ C(Z) defined by

(B_X ϕ)(x, y) = ϕ(x), for all (x, y) ∈ Z.   (9.63)

We define in the same way B_Y ψ, where ψ ∈ C(Y), by (B_Y ψ)(x, y) = ψ(y), for all (x, y) ∈ Z. One easily checks that B_X (as well as B_Y) is isometric: ‖B_X ϕ‖_{C(Z)} = ‖ϕ‖_{C(X)}. So, we call B_X (resp. B_Y) the canonical injection from C(X) (resp. C(Y)) into C(Z).
Let μ ∈ M(Z). We call the element μ|_X of M(X) defined by

⟨μ|_X, ϕ⟩_{C(X)} = ⟨μ, B_X ϕ⟩_{C(Z)}, for all ϕ ∈ C(X),   (9.64)

the marginal of μ over X. The marginal mapping μ → μ|_X is nothing but the transpose of the canonical injection from C(X) into C(Z), and is non-expansive in the sense that

‖μ|_X‖_{M(X)} ≤ ‖μ‖_{M(Z)}.   (9.65)

Let 1_X have value 1 over X. The marginals are related by the compatibility relation

⟨μ|_X, 1_X⟩_{C(X)} = ⟨μ|_Y, 1_Y⟩_{C(Y)} = ⟨μ, 1_Z⟩_{C(Z)}.   (9.66)

By P(X) we denote the set of Borel probabilities over X, i.e.,

P(X) := { η ∈ M₊(X); ⟨η, 1_X⟩ = 1 }.   (9.67)
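
In the discrete case, C(X) pairs with vectors and a measure μ ∈ M(Z) with a matrix; the marginals of (9.64) are then row and column sums, and (9.66)–(9.67) read as below. The particular matrix is an arbitrary example of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = rng.random((4, 5))          # a nonnegative measure on a finite Z = X x Y
mu /= mu.sum()                   # normalize: mu in P(Z)

mu_X = mu.sum(axis=1)            # marginal over X: <mu|_X, phi> = <mu, B_X phi>
mu_Y = mu.sum(axis=0)            # marginal over Y

# Compatibility relation (9.66): all total masses coincide.
assert abs(mu_X.sum() - mu.sum()) < 1e-12
assert abs(mu_Y.sum() - mu.sum()) < 1e-12
# The marginals are Borel probabilities in the sense of (9.67).
assert np.all(mu_X >= 0) and abs(mu_X.sum() - 1.0) < 1e-12

# The product coupling of two probability vectors has them as marginals.
eta, nu = mu_X, mu_Y
prod = np.outer(eta, nu)
assert np.allclose(prod.sum(axis=1), eta)
assert np.allclose(prod.sum(axis=0), nu)
```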

We fix (η, ν) ∈ P(X) × P(Y), and c(x, y) ∈ C(Z). Consider the Kantorovich problem

Min_{ϕ∈C(X), ψ∈C(Y)} −⟨η, ϕ⟩_{C(X)} − ⟨ν, ψ⟩_{C(Y)};   ϕ(x) + ψ(y) − c(x, y) ≤ 0, for all (x, y) ∈ Z.   (9.68)

This is a convex problem, whose Lagrangian L : C(X) × C(Y) × M(Z) → R is

L(ϕ, ψ, μ) := −⟨η, ϕ⟩_{C(X)} − ⟨ν, ψ⟩_{C(Y)} + ⟨μ, B_X ϕ + B_Y ψ − c⟩_{C(Z)}
            = ⟨μ|_X − η, ϕ⟩_{C(X)} + ⟨μ|_Y − ν, ψ⟩_{C(Y)} − ⟨μ, c⟩_{C(Z)}.   (9.69)

Observe that

inf_{ϕ∈C(X), ψ∈C(Y)} L(ϕ, ψ, μ) = −∞ if μ|_X ≠ η or μ|_Y ≠ ν;   = −⟨μ, c⟩_{C(Z)} otherwise.   (9.70)

So, the dual problem is

Max_{μ∈M₊(Z)} −⟨μ, c⟩_{C(Z)};   μ|_X = η;   μ|_Y = ν.   (9.71)

Proposition 9.16 Problems (9.71) and (9.68) have the same finite value, and both
have a nonempty set of solutions.

Proof (a) The dual problem (9.71) is feasible (take for μ the product of η and ν) and
the primal problem (9.68) is qualified: there exists a pair (ϕ0 , ψ0 ) in C(X ) × C(Y )
such that c(x, y) − ϕ0 (x) − ψ0 (y) is uniformly positive. By general results of convex
duality theory, problems (9.68) and (9.71) have the same finite value, and (9.71) has
a nonempty and bounded set of solutions.
(b) It remains to prove that (9.68) has a nonempty set of solutions. Let (ϕ_k, ψ_k) be a minimizing sequence. Set

ψ̂_k(y) := min{ c(x, y) − ϕ_k(x); x ∈ X },
ϕ̂_k(x) := min{ c(x, y) − ψ̂_k(y); y ∈ Y }.   (9.72)

It is easily checked that these two functions are continuous, and satisfy the primal constraint as well as the inequality (ϕ̂_k, ψ̂_k) ≥ (ϕ_k, ψ_k), implying that the associated cost is smaller than the one for (ϕ_k, ψ_k); so, (ϕ̂_k, ψ̂_k) is another minimizing sequence. In addition, (ϕ̂_k, ψ̂_k) has a continuity modulus not greater than the one of c (in short, it has a c-continuity modulus), since a finitely-valued infimum of functions with c-continuity modulus has c-continuity modulus. Since we can always add a constant to ϕ̂_k and subtract it from ψ̂_k, we get the existence of a minimizing sequence (ϕ_k, ψ_k) with c-continuity modulus, and such that ϕ_k(x_0) = 0. It easily follows that (ψ_k) and (ϕ_k) are bounded in C(Y) and C(X) resp. By the Ascoli–Arzelà theorem, there exists a subsequence in C(X) × C(Y) converging to some (ϕ, ψ). Passing to the limit in the cost function and constraints of (9.68) we obtain that (ϕ, ψ) is a solution to this problem. □
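
In the discrete setting, the ingredients of this proof can be verified directly: any pair built by the c-transforms (9.72) is primal feasible, any coupling with marginals (η, ν), e.g. the product coupling, is dual feasible, and the duality gap between the two values is nonnegative. The instance below is an arbitrary example of ours:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 6
c = rng.random((n, m))                 # cost c(x, y) on a finite Z = X x Y
eta = np.full(n, 1.0 / n)              # eta in P(X)
nu = np.full(m, 1.0 / m)               # nu in P(Y)

# c-transforms as in (9.72), starting from phi = 0.
phi = np.zeros(n)
psi = np.min(c - phi[:, None], axis=0)     # psi(y) = min_x { c(x,y) - phi(x) }
phi = np.min(c - psi[None, :], axis=1)     # phi(x) = min_y { c(x,y) - psi(y) }

# Primal feasibility: phi(x) + psi(y) <= c(x, y) everywhere.
assert np.all(phi[:, None] + psi[None, :] <= c + 1e-12)

# A dual feasible coupling: the product measure eta (x) nu.
mu = np.outer(eta, nu)

# Weak duality: <mu, c> >= <eta, phi> + <nu, psi>.
assert (mu * c).sum() >= eta @ phi + nu @ psi - 1e-12
```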

Remark 9.17 The primal solution (ϕ, ψ) constructed in the above proof satisfies

ψ(y) = min{ c(x, y) − ϕ(x); x ∈ X },
ϕ(x) = min{ c(x, y) − ψ(y); y ∈ Y }.   (9.73)

Setting κ(x, y) := −c(x, y), the above relations can be interpreted as κ-conjugates in the sense of (9.1):

−ψ = (−ϕ)^κ;   −ϕ = (−ψ)^κ.   (9.74)

9.3.2 Optimal Transportation Maps

Let (ϕ, ψ) and μ be primal and dual feasible, resp. The difference of the associated costs is, since η and ν are the marginals of μ:

⟨μ, c⟩_{C(Z)} − ⟨η, ϕ⟩_{C(X)} − ⟨ν, ψ⟩_{C(Y)} = ⟨μ, c(x, y) − ϕ(x) − ψ(y)⟩_{C(Z)}.   (9.75)

As expected it is nonnegative, and (since the primal and dual problems have the same value) (ϕ, ψ) and μ are primal and dual solutions, resp., iff the above r.h.s. is equal to zero, meaning that c(x, y) = ϕ(x) + ψ(y) over the support of μ, which we denote by Γ. Let (x̄, ȳ) ∈ Γ. Then

c(x̄, ȳ) − ϕ(x̄) = ψ(ȳ) ≤ c(x, ȳ) − ϕ(x), for all x ∈ X.   (9.76)

In the sequel we assume that X and Y are the closure of their interior, and that c(·,·) is of class C¹. By the above remark, we may assume that ϕ and ψ satisfy (9.74) and therefore are Lipschitz.
By Rademacher's theorem, see [6, Thm. 2.14], ϕ is a.e. differentiable over int(X). If x̄ ∈ int(X) and ϕ is differentiable at x̄, (9.76) implies that

∇ϕ(x̄) = ∇_x c(x̄, ȳ) a.e.   (9.77)

Example 9.18 Take c(x, y) = ½|x − y|². We obtain that ∇ϕ(x̄) = x̄ − ȳ. Therefore

ȳ = T(x̄), where T(x) := x − ∇ϕ(x), a.e.,   (9.78)

so that the support of μ is contained in the graph of the transportation map T(x). If η has a density, we can identify μ with this transportation map. In addition, since (9.74) holds for (ϕ, ψ) we have that ϕ̂ := −ϕ satisfies

ϕ̂(x) = max_{y∈Y} { −c(x, y) + ψ(y) } = −½|x|² + max_{y∈Y} { x·y − ½|y|² + ψ(y) }.   (9.79)

The last maximum of affine functions of x, say F(x), is a convex function of x. We deduce that ϕ(x) = ½|x|² − F(x), with F convex. We have proved that

if c(x, y) = ½|x − y|², then the transportation map is a.e. of the form T(x) = ∇F(x), where F is a convex function.   (9.80)
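
In dimension one, (9.80) says the optimal map is the gradient of a convex function, i.e. nondecreasing: for two empirical measures with equal weights and c(x, y) = ½|x − y|², the optimal assignment sorts the points. A brute-force check on a small example (the point sets are arbitrary choices of ours):

```python
import itertools
import numpy as np

x = np.array([0.1, 0.9, 0.4, 0.7])
y = np.array([0.2, 0.05, 0.8, 0.5])

def cost(sigma):
    # Transport cost of the assignment x_i -> y_{sigma(i)} for c = 0.5|x - y|^2.
    return 0.5 * np.sum((x - y[list(sigma)]) ** 2)

best = min(itertools.permutations(range(len(x))), key=cost)

# Monotone matching: the i-th smallest x goes to the i-th smallest y.
ranks_x = np.argsort(np.argsort(x))
monotone = tuple(np.argsort(y)[ranks_x])

assert best == monotone
# Along increasing x, the assigned y increase: T is nondecreasing,
# consistent with T = grad(F) for a convex F.
assert np.all(np.diff(y[list(best)][np.argsort(x)]) > 0)
```

With distinct points the sorted matching is the unique minimizer, a discrete instance of the rearrangement inequality behind this example.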

Example 9.19 More generally, assume that c(x, y) = f(x − y) with f convex and Lipschitz. Then (9.76) implies that

∇ϕ(x̄) ∈ ∂f(x̄ − ȳ), a.e.,   (9.81)

where by ∂f we denote the subdifferential. If ∂f is injective, then we have that

∂f⁻¹(∇ϕ(x̄)) ∋ x̄ − ȳ a.e.,   (9.82)

meaning that

ȳ = T_f(x̄), with now T_f(x) := x − ∂f⁻¹(∇ϕ(x)) a.e.   (9.83)

So, if η has a density, we can identify μ with the transportation map T_f.

9.3.3 Penalty Approximations

9.3.3.1 Duality

The dual problem was set in (9.71). We assume that

η and ν have densities.   (9.84)

Consider a penalty function e : R → R̄ for the nonnegativity of the measure, of the following type:

e is proper l.s.c. convex with superlinear growth, and (0, ∞) ⊂ dom(e) ⊂ [0, ∞).   (9.85)

A typical example is the entropy penalty

ê(s) := s(log s − 1),   (9.86)

with ê(0) := 0 and domain [0, ∞). The penalty term is defined as

P(μ) = ∞ if μ_s ≠ 0;   P(μ) = ∫_Z e(μ_a(z)) dz otherwise.   (9.87)
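
For the entropy penalty (9.86), a one-line computation (the maximizer of st − s(log s − 1) is s = eᵗ) gives ê*(t) = eᵗ, the fact used in Example 9.23 below. A numerical sketch of this conjugation, with an arbitrary grid of ours:

```python
import numpy as np

def e_hat(s):
    # Entropy penalty (9.86), with e_hat(0) = 0.
    return np.where(s > 0, s * (np.log(np.maximum(s, 1e-300)) - 1.0), 0.0)

s = np.linspace(0.0, 60.0, 600001)

def conj(t):
    # Discrete Fenchel conjugate sup_s { s*t - e_hat(s) }.
    return np.max(s * t - e_hat(s))

for t in [-1.0, 0.0, 1.0, 3.0]:
    assert abs(conj(t) - np.exp(t)) < 1e-4   # closed form: e_hat*(t) = exp(t)
```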

A penalized version of the dual problem, with ε > 0, is (remember that Z := X × Y):

Max_{μ∈M₊(Z)} −⟨μ, c⟩_{C(Z)} − εP(μ);   μ|_X = η;   μ|_Y = ν.   (9.88)

Set

f(μ) := εP(μ);   F := I_{{0}};   Aμ := −(μ|_X, μ|_Y);   y := (η, ν).   (9.89)

Here A is from M(Z) into M(X) × M(Y). The penalized dual problem can be written in the form

Min_{μ∈M(Z)} f(μ) + ⟨μ, c⟩_{C(Z)} + F(Aμ + y).   (9.90)

We can compute the dual to this problem in dual spaces, as explained in Chap. 1, Sect. 1.2.1.2. While the problem is in a dual space setting, the computations are similar to those in the standard Fenchel duality framework, so that the 'bidual' expressed as a minimization has expression

Min_{ϕ,ψ} f*(−c − A*(ϕ, ψ)) + F*(ϕ, ψ) − ⟨η, ϕ⟩ − ⟨ν, ψ⟩.   (9.91)

Now F* is the null function, and f* = (εP)* = εP*(·/ε).

Lemma 9.20 We have that f* has finite values and, for every c ∈ C(Z):

P*(c) = ∫_Z e*(c(x, y)) d(x, y).   (9.92)

Proof Let f̂(c) denote the above r.h.s. Since e is l.s.c. proper convex, it is the Fenchel conjugate of e*. So, by Lemma 9.13, since e has superlinear growth, e* is finite-valued and bounded over bounded sets. So, f̂(c) is a continuous convex function and, by Lemma 9.14, its conjugate is P(μ). Since f̂(c) is equal to its biconjugate, the conclusion follows. □

Consider the problem

Min_{ϕ,ψ} −⟨η, ϕ⟩_{C(X)} − ⟨ν, ψ⟩_{C(Y)} + ε ∫_Z P*((ϕ(x) + ψ(y) − c(x, y))/ε) d(x, y).   (9.93)
Proposition 9.21 The penalized problem (9.88) is the dual of problem (9.93).

Proof Apply the Fenchel duality theory, taking into account Lemma 9.20. 

Remark 9.22 Since the primal penalized problem is qualified (in the case of the
usual penalties given above) its dual has a nonempty and bounded set of solutions.

The semiprimal problem consists in minimizing the primal cost w.r.t. ϕ only. The primal cost can be expressed as

∫_X ( ε ∫_Y P*((ϕ(x) + ψ(y) − c(x, y))/ε) dy − η(x)ϕ(x) ) dx − ⟨ν, ψ⟩_{C(Y)}.   (9.94)

So the first-order condition for minimizing w.r.t. ϕ is that

∫_Y DP*((ϕ(x) + ψ(y) − c(x, y))/ε) dy = η(x).   (9.95)

Example 9.23 Entropy penalty: then P*(s) = DP*(s) = e^s and (9.95) reduces to

exp(ϕ(x)/ε) ∫_Y exp((ψ(y) − c(x, y))/ε) dy = η(x).   (9.96)

Since the l.h.s. of (9.96) is a positive and continuous function of x, (9.96) has a solution iff η is absolutely continuous, with positive and continuous density (since X is compact, this implies that the density has a positive minimum), and the solution is ϕ such that

ϕ(x)/ε + log ∫_Y exp((ψ(y) − c(x, y))/ε) dy = log η(x).   (9.97)

Substituting into (9.94), we obtain the expression of the semiprimal cost:

ε ∫_X log( ∫_Y exp((ψ(y) − c(x, y))/ε) dy ) η(x) dx − ⟨ν, ψ⟩_{C(Y)} + ε − ε ∫_X η(x) log(η(x)) dx.   (9.98)
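
On a finite grid, the optimality condition (9.96) for ϕ, and its analogue for ψ, suggest alternately rescaling the rows and columns of the kernel exp(−c/ε): this is Sinkhorn's algorithm, mentioned in the notes at the end of the chapter. The discrete instance below is a sketch under arbitrary data of ours, not the text's construction:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, eps = 6, 7, 0.25
c = rng.random((n, m))                    # cost matrix on a finite Z = X x Y
eta = rng.random(n); eta /= eta.sum()     # target marginal on X
nu = rng.random(m); nu /= nu.sum()        # target marginal on Y

K = np.exp(-c / eps)                      # kernel exp(-c/eps)
u, v = np.ones(n), np.ones(m)
for _ in range(5000):
    u = eta / (K @ v)                     # scale rows: X-marginal condition
    v = nu / (K.T @ u)                    # scale columns: Y-marginal condition

mu = u[:, None] * K * v[None, :]          # entropic optimal coupling
phi, psi = eps * np.log(u), eps * np.log(v)   # potentials: u = exp(phi/eps)

assert np.allclose(mu.sum(axis=0), nu)            # exact after the v-update
assert np.allclose(mu.sum(axis=1), eta, atol=1e-9)
# Discrete analogue of (9.96): exp(phi/eps) * sum_y exp((psi - c)/eps) = eta.
lhs = np.exp(phi / eps) * np.exp((psi[None, :] - c) / eps).sum(axis=1)
assert np.allclose(lhs, eta, atol=1e-9)
```

Each update enforces one marginal exactly while perturbing the other; convergence of the alternation is the content of Sinkhorn's theorem on matrices with prescribed row and column sums.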

9.3.4 Barycenters

9.3.4.1 The Multi-transport Setting

Let X and Y_k, for k = 1 to K, be compact subsets of R^n, Z_k := X × Y_k, c^k ∈ C(Z_k), ν^k ∈ P(Y_k). Consider the problem in dual spaces

Max_{μ,η} −Σ_{k=1}^K ⟨μ^k, c^k⟩_{C(Z_k)};   μ^k ∈ M₊(Z_k), k = 1, …, K;   η ∈ M(X);
   μ^k|_X = η;   μ^k|_{Y_k} = ν^k.   (9.99)

In some cases these problems can be interpreted as the computation of barycenters, see the references at the end of the chapter. Note that, by the above constraints, η ∈ P(X). We easily check that problem (9.99) is the dual of

Min_{ϕ,ψ} −Σ_{k=1}^K ⟨ν^k, ψ^k⟩_{C(Y_k)};   ϕ^k(x) + ψ^k(y) − c^k(x, y) ≤ 0, for all x ∈ X and y ∈ Y_k;
   Σ_{k=1}^K ϕ^k(x) = 0, for all x ∈ X;
   ϕ^k ∈ C(X);   ψ^k ∈ C(Y_k), k = 1, …, K.   (9.100)

Proposition 9.24 Problems (9.99) and (9.100) have the same finite value, and both
have a nonempty set of solutions.

Proof (a) The dual problem (9.99) is feasible (take for μ^k the product of η and ν^k) and the primal problem (9.100) is qualified: for k = 1 to K, there exist pairs (ϕ_0^k, ψ_0^k) in C(X) × C(Y_k) such that c^k(x, y) − ϕ_0^k(x) − ψ_0^k(y) is uniformly positive. By general results of convex duality theory, problems (9.99) and (9.100) have the same finite value, and (9.99) has a nonempty and bounded set of solutions.
(b) It remains to show that the primal problem (9.100) has solutions. We adapt the ideas in the proof of Proposition 9.16. Let (ϕ_j, ψ_j) be a minimizing sequence. Set

ψ̂_j^k(y) := min{ c^k(x, y) − ϕ_j^k(x); x ∈ X }, k = 1, …, K,
ϕ̂_j^k(x) := min{ c^k(x, y) − ψ̂_j^k(y); y ∈ Y_k }, k = 1, …, K − 1,
ϕ̂_j^K(x) := −Σ_{k=1}^{K−1} ϕ̂_j^k(x).   (9.101)

Then ϕ̂_j^k ≥ ϕ_j^k, for k = 1 to K − 1, so that ϕ̂_j^K ≤ ϕ_j^K. It follows that (ϕ̂_j, ψ̂_j) is feasible. The associated cost is not greater than the one for (ϕ_j, ψ_j), since ψ̂_j^k(y) ≥ ψ_j^k for all k. Therefore (ϕ̂_j, ψ̂_j) is a minimizing sequence which, in addition, has a uniform continuity modulus. Changing ϕ̂_j^k(x) into ϕ̂_j^k(x) − ϕ̂_j^k(x_0) if necessary, we get that ϕ̂_j^k(x_0) = 0 for all k (the sum of the ϕ̂_j^k is still equal to 0, and this operation leaves the cost invariant). We have constructed a bounded minimizing sequence with uniform continuity modulus, and conclude by the Ascoli–Arzelà theorem. □

9.3.4.2 Penalization

As in the case of a standard transport problem, we start from a penalty approximation of the dual formulation; that is, we approximate (9.100) by

Max_{μ,η} −Σ_{k=1}^K ( ⟨μ^k, c^k⟩_{C(Z_k)} + εP(μ^k) );   μ^k|_X = η;   μ^k|_{Y_k} = ν^k;
   μ^k ∈ M₊(Z_k), k = 1, …, K;   η ∈ M(X).   (9.102)

Computing the 'bidual' problem we again recognize the Fenchel duality framework, with

f(μ, η) = f_1(μ) + f_2(η);   f_1(μ) = ε Σ_{k=1}^K P(μ^k);   f_2(η) = 0;
F = I_{{0}};   A(μ, η) = (η − μ|_X, −μ|_Y);   y = (0, ν).   (9.103)

We find that f_1* can be computed as in Sect. 9.3.3, and f_2* is the indicatrix of 0, so that the primal (or bidual) problem is

Min_{ϕ,ψ} Σ_{k=1}^K ( −⟨ν^k, ψ^k⟩_{C(Y_k)} + ε ∫_{Z_k} P*((ϕ^k(x) + ψ^k(y) − c^k(x, y))/ε) d(x, y) );
   Σ_{k=1}^K ϕ^k = 0;   ϕ^k ∈ C(X);   ψ^k ∈ C(Y_k), k = 1, …, K.   (9.104)
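
A discrete analogue of (9.102) with the entropy penalty can be solved by a Sinkhorn-type scheme in which, besides the usual scalings, the unknown common marginal η is updated as the geometric mean (equal weights 1/K) of the current transported marginals. This scheme and the instance below are a sketch of ours under stated assumptions, not an algorithm from the text:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K_, eps, iters = 5, 3, 0.25, 5000
costs = [rng.random((n, n)) for _ in range(K_)]           # c^k on X x Y_k
nus = [rng.random(n) for _ in range(K_)]
nus = [w / w.sum() for w in nus]                          # nu^k in P(Y_k)

Gs = [np.exp(-c / eps) for c in costs]                    # kernels exp(-c^k/eps)
us = [np.ones(n) for _ in range(K_)]
vs = [np.ones(n) for _ in range(K_)]
for _ in range(iters):
    vs = [nus[k] / (Gs[k].T @ us[k]) for k in range(K_)]  # fit the nu^k marginals
    # Common X-marginal eta: geometric mean of the transported marginals.
    eta = np.prod([Gs[k] @ vs[k] for k in range(K_)], axis=0) ** (1.0 / K_)
    us = [eta / (Gs[k] @ vs[k]) for k in range(K_)]       # fit the common eta

mus = [us[k][:, None] * Gs[k] * vs[k][None, :] for k in range(K_)]
for k in range(K_):
    # All couplings share the same X-marginal eta (the barycenter variable)...
    assert np.allclose(mus[k].sum(axis=1), eta, atol=1e-9)
    # ...and match their prescribed Y_k-marginal nu^k at convergence.
    assert np.allclose(mus[k].sum(axis=0), nus[k], atol=1e-6)
assert abs(eta.sum() - 1.0) < 1e-6
```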

9.4 Notes

Brøndsted [30] and Dolecki and Kurcyusz [44] are early references for generalized
convexity. The augmented Lagrangian approach was introduced by Powell [88] and
Hestenes [58], and linked to the dual proximal algorithm in Rockafellar [101]. For its
application to infinite-dimensional problems, see Fortin and Glowinski [50]. Convex
functions of measures are discussed in Demengel and Temam [41, 42].
On transportation theory, see the monographs by Villani [122] and Santambrogio [108]. The link (9.80) between a transportation map and the derivative of a convex function is known as Brenier's theorem [27]. Augmented Lagrangians are a useful numerical tool for solving optimal transport problems, see Benamou and Carlier [17]. Cuturi [35] introduced the entropic penalty, and showed that the resulting problem can be efficiently solved thanks to Sinkhorn's algorithm [116] (for computing matrices with prescribed row and column sums). Barycenters in the optimal transport framework were introduced in Carlier and Ekeland [31]; see also Agueh and Carlier [1]. They give a powerful tool for clustering, see Cuturi and Doucet [36].
References

1. Agueh, M., Carlier, G.: Barycenters in the Wasserstein space. SIAM J. Math. Anal. 43(2),
904–924 (2011)
2. Akhiezer, N.I.: The Classical Moment Problem. Hafner Publishing Co., New York (1965)
3. Aliprantis, C.D., Border, K.C.: Infinite dimensional analysis, 3rd edn. Springer, Berlin (2006)
4. Alizadeh, F., Goldfarb, D.: Second-order cone programming. Math. Program. 95(1, Ser. B),
3–51 (2003)
5. Altman, E.: Constrained Markov Decision Processes. Stochastic Modeling. Chapman &
Hall/CRC, Boca Raton (1999)
6. Ambrosio, L., Fusco, N., Pallara, D.: Functions of Bounded Variation and Free Discontinuity
Problems. Oxford Mathematical Monographs. The Clarendon Press, Oxford University Press,
New York (2000)
7. Arapostathis, A., Borkar, V.S., Fernández-Gaucherand, E., Ghosh, M.K., Marcus, S.I.:
Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J.
Control Optim. 31(2), 282–344 (1993)
8. Araujo, A., Giné, E.: The Central Limit Theorem for Real and Banach Valued Random
Variables. Wiley, New York (1980)
9. Artzner, P., Delbaen, F., Eber, J.M., Heath, D.: Coherent measures of risk. Math. Financ. 9(3),
203–228 (1999)
10. Attouch, H., Brézis, H.: Duality for the sum of convex functions in general Banach spaces.
In: Barroso, J.A. (ed) Aspects of Mathematics and its Applications, pp. 125–133 (1986)
11. Aubin, J.-P., Ekeland, I.: Estimates of the duality gap in nonconvex optimization. Math. Oper.
Res. 1(3), 225–245 (1976)
12. Aubin, J.P., Frankowska, H.: Set-Valued Analysis. Birkhäuser, Boston (1990)
13. Bardi, M., Capuzzo-Dolcetta, I.: Optimal Control and Viscosity Solutions of Hamilton-Jacobi-
Bellman Equations. Birkhäuser, Boston (1997)
14. Ben-Tal, A., Golany, B., Nemirovski, A., Vial, J.-Ph.: Supplier-retailer flexible commitments contracts: a robust optimization approach. Manuf. Serv. Oper. Manag. 7(3), 248–273 (2005)
15. Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization. Society for Industrial
and Applied Mathematics (SIAM), Philadelphia, MPS/SIAM Series on Optimization (2001)
16. Ben-Tal, A., El Ghaoui, L., Nemirovski, A.: Robust Optimization. Princeton University Press, Princeton (2009)
17. Benamou, J.-D., Carlier, G.: Augmented Lagrangian methods for transport optimization, mean
field games and degenerate elliptic equations. J. Optim. Theory Appl. 167(1), 1–26 (2015)
18. Benders, J.F.: Partitioning procedures for solving mixed-variables programming problems.
Numer. Math. 4, 238–252 (1962)
19. Bertsekas, D.P.: Dynamic Programming and Optimal Control, 2nd edn., vol I & II. Athena
Scientific, Belmont (2000, 2001)

© Springer Nature Switzerland AG 2019
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2

20. Billingsley, P.: Convergence of Probability Measures, 2nd edn. Wiley Inc., New York (1999)
21. Birge, J.R., Louveaux, F.: Introduction to Stochastic Programming. Springer, New York (1997)
22. Bokanowski, O., Maroso, S., Zidani, H.: Some convergence results for Howard’s algorithm.
SIAM J. Numer. Anal. 47(4), 3001–3026 (2009)
23. Bonnans, J.F., Cen, Z., Christel, Th.: Energy contracts management by stochastic programming techniques. Ann. Oper. Res. 200, 199–222 (2012)
24. Bonnans, J.F., Gilbert, J.C., Lemaréchal, C., Sagastizábal, C.: Numerical Optimization: The-
oretical and Numerical Aspects, 2nd edn. Universitext. Springer, Berlin (2006)
25. Bonnans, J.F., Ramírez, H.: Perturbation analysis of second-order cone programming prob-
lems. Math. Program. 104(2–3, Ser. B), 205–227 (2005)
26. Bonnans, J.F., Shapiro, A.: Perturbation Analysis Of Optimization Problems. Springer, New
York (2000)
27. Brenier, Y.: Polar factorization and monotone rearrangement of vector-valued functions. Com-
mun. Pure Appl. Math. 44(4), 375–417 (1991)
28. Brézis, H.: Functional Analysis. Sobolev Spaces and Partial Differential Equations. Springer,
New York (2011)
29. Brézis, H., Lieb, E.: A relation between pointwise convergence of functions and convergence
of functionals. Proc. Am. Math. Soc. 88(3), 486–490 (1983)
30. Brøndsted, A.: Convexification of conjugate functions. Math. Scand. 36, 131–136 (1975)
31. Carlier, G., Ekeland, I.: Matching for teams. Econ. Theory 42(2), 397–418 (2010)
32. Carpentier, P., Chancelier, J.-Ph., Cohen, G., De Lara, M.: Stochastic Multi-stage Optimization. Springer, Berlin (2015)
33. Castaing, C., Valadier, M.: Convex Analysis and Measurable Multifunctions. Lecture Notes
in Mathematics, vol. 580. Springer, Berlin (1977)
34. Csiszár, I.: Information-type measures of difference of probability distributions and indirect
observations. Stud. Sci. Math. Hungar. 2, 299–318 (1967)
35. Cuturi, M.: Sinkhorn distances: lightspeed computation of optimal transportation. In: Neural
Information Processing Conference Proceedings, pp. 2292–2300 (2013)
36. Cuturi, M., Doucet, A.: Fast computation of Wasserstein barycenters. In: Xing, E.P., Jebara,
T. (eds.) Proceedings of the 31st International Conference on Machine Learning. Proceedings
of Machine Learning Research, vol. 32, pp. 685–693, Bejing, China (2014)
37. Dallagi, A.: Méthodes particulaires en commande optimale stochastique. Ph.D. thesis, Uni-
versité Paris I (2007)
38. Danskin, J.M.: The Theory of Max-Min and Its Applications to Weapons Allocation Problems.
Springer, New York (1967)
39. Decarreau, A., Hilhorst, D., Lemaréchal, C., Navaza, J.: Dual methods in entropy maximiza-
tion. Application to some problems in crystallography. SIAM J. Optim. 2(2), 173–197 (1992)
40. Dellacherie, C., Meyer, P.-A.: Probabilities and potential. North-Holland Mathematics Stud-
ies, vol. 29. North-Holland Publishing Co., Amsterdam (1978)
41. Demengel, F., Temam, R.: Convex functions of a measure and applications. Indiana Univ.
Math. J. 33(5), 673–709 (1984)
42. Demengel, F., Temam, R.: Convex function of a measure: the unbounded case. FERMAT days
85: mathematics for optimization (Toulouse, 1985). North-Holland Mathematics Studies, vol.
129, pp. 103–134. North-Holland, Amsterdam (1986)
43. Dentcheva, D., Ruszczyński, A.: Common mathematical foundations of expected utility and
dual utility theories. SIAM J. Optim. 23(1), 381–405 (2013)
44. Dolecki, S., Kurcyusz, S.: On Φ-convexity in extremal problems. SIAM J. Control Optim.
16(2), 277–300 (1978)
45. Dudley, R.M.: Real Analysis and Probability. Cambridge University Press, Cambridge (2002).
Revised reprint of the 1989 original
46. Ekeland, I., Temam, R., Convex Analysis and Variational Problems. Studies in Mathematics
and its Applications, vol. 1. North-Holland, Amsterdam (1976). French edition: Analyse
convexe et problèmes variationnels. Dunod, Paris (1974)
47. Fenchel, W.: On conjugate convex functions. Can. J. Math. 1, 73–77 (1949)

48. Fenchel, W.: Convex Cones and Functions. Lecture Notes. Princeton University, Princeton
(1953)
49. Föllmer, H., Schied, A.: Stochastic Finance: An Introduction in Discrete Time. de Gruyter
Studies in Mathematics, vol. 27. Walter de Gruyter & Co., Berlin (2002)
50. Fortin, M., Glowinski, R.: Augmented Lagrangian Methods. North-Holland, Amsterdam
(1983)
51. Georghiou, A., Wiesemann, W., Kuhn, D.: Generalized decision rule approximations for
stochastic programming via liftings. Math. Program. 152(1-2, Ser. A), 301–338 (2015)
52. Girardeau, P., Leclere, V., Philpott, A.B.: On the convergence of decomposition methods for
multistage stochastic convex programs. Math. Oper. Res. 40(1), 130–145 (2015)
53. Goberna, M.A., Lopez, M.A.: Linear Semi-infinite Optimization. Wiley Series in Mathemat-
ical Methods in Practice, vol. 2. Wiley, Chichester (1998)
54. Gol’shtein, E.G.: Theory of Convex Programming. Translations of Mathematical Mono-
graphs, vol. 36. American Mathematical Society, Providence (1972)
55. Gouriéroux, C.: ARCH Models and Financial Applications. Springer, New York (1997)
56. Hernández-Lerma, O., Lasserre, J.B.: Discrete-Time Markov Control Processes. Springer,
New York (1996)
57. Hernández-Lerma, O., Lasserre, J.B.: Further Topics on Discrete-Time Markov Control Pro-
cesses. Springer, New York (1999)
58. Hestenes, M.R.: Multiplier and gradient methods. J. Optim. Theory Appl. 4, 303–320 (1969)
59. Hoffman, A.: On approximate solutions of systems of linear inequalities. J. Res. Natl. Bureau
Stand., Sect. B, Math. Sci. 49, 263–265 (1952)
60. Horn, R.A., Johnson, C.R.: Matrix Analysis, 2nd edn. Cambridge University Press, Cambridge
(2013)
61. Hsu, S.-P., Chuang, D.-M., Arapostathis, A.: On the existence of stationary optimal policies
for partially observed MDPs under the long-run average cost criterion. Syst. Control Lett.
55(2), 165–173 (2006)
62. Kall, P., Wallace, S.W.: Stochastic Programming. Wiley, Chichester (1994)
63. Kelley, J.E.: The cutting plane method for solving convex programs. J. Soc. Indust. Appl.
Math. 8, 703–712 (1960)
64. Komiya, H.: Elementary proof for Sion’s minimax theorem. Kodai Math. J. 11(1), 5–7 (1988)
65. Krein, M., Milman, D.: On extreme points of regular convex sets. Studia Math. 9, 133–138
(1940)
66. Kuhn, D., Wiesemann, W., Georghiou, A.: Primal and dual linear decision rules in stochastic
and robust optimization. Math. Program. 130(1, Ser. A), 177–209 (2011)
67. Kushner, H.J., Dupuis, P.G.: Numerical Methods for Stochastic Control Problems in Contin-
uous Time. Applications of Mathematics, vol. 24, 2nd edn. Springer, New York (2001)
68. Lang, S.: Real and Functional Analysis, 3rd edn. Springer, New York (1993)
69. Lasserre, J.B.: Semidefinite programming versus LP relaxations for polynomial programming.
Math. Oper. Res. 27, 347–360 (2002)
70. Lemaréchal, C., Oustry, F.: Semidefinite relaxations and Lagrangian duality with application
to combinatorial optimization. Rapport de Recherche INRIA 3710 (1999)
71. Lewis, A.: The mathematics of eigenvalue optimization. Math. Program. 97, 155–176 (2003)
72. Lewis, A.S.: The convex analysis of unitarily invariant matrix functions. J. Convex Anal.
2(1–2), 173–183 (1995)
73. Lewis, A.S., Overton, M.L.: Eigenvalue optimization. In: Acta numerica, 1996, pp. 149–190.
Cambridge University Press, Cambridge (1996)
74. Liapounoff, A.: Sur les fonctions-vecteurs complètement additives. Bull. Acad. Sci. URSS.
Sér. Math. [Izvestia Akad. Nauk SSSR] 4, 465–478 (1940)
75. Linderoth, J.T., Shapiro, A., Wright, S.: The empirical behavior of sampling methods for
stochastic programming. Technical Report 02-01, Computer Science Department, University
of Wisconsin-Madison (2002)
76. Lobo, M.S., Vandenberghe, L., Boyd, S., Lebret, H.: Applications of second-order cone pro-
gramming. Linear Algebra Appl. 284, 193–228 (1998)
77. Malliavin, P.: Integration and Probability. Springer, New York (1995). French edition: Masson,
Paris (1982)
78. Mandelbrojt, S.: Sur les fonctions convexes. C. R. Acad. Sci., Paris 209, 977–978 (1939)
79. Maréchal, P.: On the convexity of the multiplicative potential and penalty functions and related
topics. Math. Program. 89(3, Ser. A), 505–516 (2001)
80. Modica, L.: The gradient theory of phase transitions and the minimal interface criterion. Arch.
Ration. Mech. Anal. 98(2), 123–142 (1987)
81. Monahan, G.E.: A survey of partially observable Markov decision processes: theory, models,
and algorithms. Manag. Sci. 28(1), 1–16 (1982)
82. Moreau, J.-J.: Proximité et dualité dans un espace hilbertien. Bull. Soc. Math. France 93,
273–299 (1965)
83. Moreau, J.-J.: Fonctionnelles convexes. In: Leray, J. (ed.) Séminaire sur les équations aux
dérivées partielles, vol. 2, pp. 1–108. Collège de France (1966/1967). www.numdam.org
84. Moreau, J.-J.: Inf-convolution, sous-additivité, convexité des fonctions numériques. J. Math.
Pures Appl. 9(49), 109–154 (1970)
85. Nesterov, Y., Nemirovskii, A.: Interior-Point Polynomial Algorithms in Convex Programming.
Society for Industrial and Applied Mathematics (SIAM), Philadelphia (1994)
86. Pereira, M.V.F., Pinto, L.M.V.G.: Multi-stage stochastic optimization applied to energy plan-
ning. Math. Program. 52(2, Ser. B), 359–375 (1991)
87. Pontryagin, L.S., Boltyanskiı̆, V.G., Gamkrelidze, R.V., Mishchenko, E.F.: The Mathematical
Theory of Optimal Processes. Gordon & Breach Science Publishers, New York (1986). Reprint
of the 1962 English translation
88. Powell, M.J.D.: A method for nonlinear constraints in minimization problems. In: Fletcher,
R. (ed.) Optimization, pp. 283–298. Academic, New York (1969)
89. Powell, M.J.D.: Approximation Theory and Methods. Cambridge University Press, Cam-
bridge (1981)
90. Pulleyblank, W.R.: Polyhedral combinatorics. In: Nemhauser, G.L., et al. (eds.) Optimization.
Elsevier, Amsterdam (1989)
91. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming.
Wiley, New York (1994)
92. Puterman, M.L., Shin, M.C.: Modified policy iteration algorithms for discounted Markov
decision problems. Manag. Sci. 24(11), 1127–1137 (1978)
93. Rockafellar, R.T.: Duality theorems for convex functions. Bull. Am. Math. Soc. 70, 189–192
(1964)
94. Rockafellar, R.T.: Extension of Fenchel’s duality theorem for convex functions. Duke Math.
J. 33, 81–90 (1966)
95. Rockafellar, R.T.: Extension of Fenchel’s duality theorem for convex functions. Duke Math.
J. 33, 81–89 (1966)
96. Rockafellar, R.T.: Integrals which are convex functionals. Pacif. J. Math. 24, 525–539 (1968)
97. Rockafellar, R.T.: Convex Analysis. Princeton University Press, Princeton (1970)
98. Rockafellar, R.T.: Convex integral functionals and duality. In: Contributions to Nonlinear
Functional Analysis (Proc. Sympos., Math. Res. Center, University of Wisconsin, Madison,
Wisconsin, 1971), pp. 215–236. Academic, New York (1971)
99. Rockafellar, R.T.: Integrals which are convex functionals. II. Pacif. J. Math. 39, 439–469
(1971)
100. Rockafellar, R.T.: Conjugate Duality and Optimization. Regional Conference Series in
Applied Mathematics, vol. 16. SIAM, Philadelphia (1974)
101. Rockafellar, R.T.: Augmented Lagrangians and applications of the proximal point algorithm
in convex programming. Math. Oper. Res. 1, 97–116 (1976)
102. Rockafellar, R.T.: Integral functionals, normal integrands and measurable selections. In: Non-
linear Operators and the Calculus of Variations (Summer School, Univ. Libre Bruxelles, Brus-
sels, 1975). Lecture Notes in Mathematics, vol. 543, pp. 157–207. Springer, Berlin (1976)
103. Rockafellar, R.T., Wets, R.J.-B.: Stochastic convex programming: basic duality. Pacif. J. Math.
62(1), 173–195 (1976)
104. Rockafellar, R.T., Wets, R.J.-B.: Stochastic convex programming: singular multipliers and
extended duality. Pacif. J. Math. 62(2), 507–522 (1976)
105. Royden, H.L.: Real Analysis, 3rd edn. Macmillan Publishing Company, New York (1988)
106. Ruszczyński, A., Shapiro, A. (eds.): Stochastic Programming. Handbooks in Operations
Research and Management Science, vol. 10. Elsevier, Amsterdam (2003)
107. Ruszczyński, A., Shapiro, A.: Conditional risk mappings. Math. Oper. Res. 31(3), 544–561
(2006)
108. Santambrogio, F.: Optimal Transport for Applied Mathematicians. Birkhäuser (2015)
109. Santos, M.S., Rust, J.: Convergence properties of policy iteration. SIAM J. Control Optim.
42(6), 2094–2115 (electronic) (2004)
110. Schrijver, A.: Theory of Linear and Integer Programming. Wiley, New Jersey (1986)
111. Shapiro, A.: Asymptotic analysis of stochastic programs. Ann. Oper. Res. 30(1–4), 169–186
(1991). Stochastic programming, Part I (Ann Arbor, MI, 1989)
112. Shapiro, A.: Asymptotics of minimax stochastic programs. Stat. Probab. Lett. 78(2), 150–157
(2008)
113. Shapiro, A.: Analysis of stochastic dual dynamic programming method. Eur. J. Oper. Res.
209(1), 63–72 (2011)
114. Shapiro, A., Dentcheva, D., Ruszczyński, A.: Lectures on Stochastic Programming: Modeling
and Theory, 2nd edn. SIAM, Philadelphia (2014)
115. Shiryaev, A.N.: Probability. Graduate Texts in Mathematics, vol. 95, 2nd edn. Springer, New
York (1996). Translated from the first (1980) Russian edition by R.P. Boas
116. Sinkhorn, R.: Diagonal equivalence to matrices with prescribed row and column sums. Am.
Math. Mon. 74(4), 402–405 (1967)
117. Sion, M.: On general minimax theorems. Pacif. J. Math. 8, 171–176 (1958)
118. Skorohod, A.V.: Limit theorems for stochastic processes. Teor. Veroyatnost. i Primenen. 1,
289–319 (1956)
119. Tardella, F.: A new proof of the Lyapunov convexity theorem. SIAM J. Control Optim. 28(2),
478–481 (1990)
120. Tibshirani, R.: Regression shrinkage and selection via the lasso: a retrospective. J. R. Stat.
Soc. Ser. B 73, Part 3, 273–282 (2011)
121. Villani, C.: Intégration et analyse de Fourier. ENS Lyon (2007). Revised in 2010
122. Villani, C.: Optimal Transport. Old and New. Springer, Berlin (2009)
123. Wallace, S.W., Ziemba, W.T. (eds.): Applications of Stochastic Programming. MPS/SIAM
Series Optimization, vol. 5. SIAM, Philadelphia (2005)
124. Wets, R.J.-B.: Stochastic programs with fixed recourse: the equivalent deterministic program.
SIAM Rev. 16, 309–339 (1974)
125. Wolkowicz, H., Saigal, R., Vandenberghe, L. (eds.): Handbook of Semidefinite Programming.
Kluwer Academic Publishers, Boston (2000)
126. Yosida, K., Hewitt, E.: Finitely additive measures. Trans. Am. Math. Soc. 72, 46–66 (1952)
127. Zhou, L.: A simple proof of the Shapley-Folkman theorem. Econom. Theory 3(2), 371–372
(1993)
128. Zou, J., Ahmed, S., Sun, X.A.: Stochastic dual dynamic integer programming. Math. Program.
(2018)
Index

A
Acceptance set, 170
Accessible set, 234
Adjoint state, 215
Algebra, 117
Algorithm
  cutting plane, 268
  Kelley, 268
Application
  measurable, 119
Approximation
  Moreau–Yosida, 38
Approximation in the sense of Chebyshev, 101

B
Backward equation, 215
Biconjugate, 19
Bidual, 89
Bochner, 140
Borel σ-algebra, 118
Borel–Cantelli, 125
Borelian function, 121
Bounded
  in probability, 181

C
Calmness, 51
Carathéodory, 142
Carathéodory theorem, 123
Castaing representation, 143
Cauchy sequence, 5
Class
  recurrent, 257
  transient, 257
Compatibility, 209
Compatible
  multimapping, 144
Conditional
  expectation, 202
  variance, 208
Cone
  normal, 31
  recession, 41
  tangent, 31
Cone of nonincreasing vectors, 80
Conjugate, 17, 21
Contact set, 96
Convergence
  in law, 180
  in measure, 127
  in probability, 127
  narrow, 180
  simple, 120, 233
Convex
  function, 3
  set, 3
Convex closure, 20
Core, 26
Costate, 215
  adapted, 217
Countable additivity, 122
Covariance, 185
Cycle, 257
Cylinder, 125

D
Distribution
  empirical, 190
Disutility, 165
Domain, 1
© Springer Nature Switzerland AG 2019
J. F. Bonnans, Convex and Stochastic Optimization, Universitext,
https://doi.org/10.1007/978-3-030-14977-2
Dominated convergence, 132, 137
  generalized, 133
Doob–Dynkin, 121
Dynamic programming, 229, 230, 232, 239, 270
  ergodic, 264

E
Egoroff, 127
Entropy, 36
Epigraph, 4
Estimate of lack of convexity, 69
Exchange property, 142
Exhaustion sequence, 122
Expansive (non), 54
Extremal loading, 114

F
Fatou, 133
Fenchel
  duality, 42
  subdifferential formula, 46
Fenchel conjugate, 17
Fenchel–Young inequality, 17
Filtration, 218
Floor approximation, 121
Function
  moment-generating, 199
  perspective, 63
  rate, 199
  recession, 62
  simple, 120
  spectral, 80
  step, 120
  symmetric, 80

G
Gâteaux differentiable, 29
Gauge function, 7
Graph class, 257

H
Hadamard
  differentiability, 187, 188
Half-space, 6
Hamiltonian, 250
Hyperplane, 6

I
Indicatrix function, 2
Inequality
  Tchebycheff, 129
Infimal convolution, 60
Innovations, 214
Integrable selection, 156
Interpolation
  Lagrange, 104
Iteration
  Howard, 236
  policy, 236
  values, 235

K
Kolmogorov equation, 226

L
Lagrangian, 65
  duality, 33
  standard, 38
Legendre transform, 23
Lemma
  Schur, 78
  Schur (generalized), 88
Limit
  Cesaro, 256
Linear rate, 256
Local solution, 72
Log-likelihood, 177

M
Markov chain, 223
Maximum likelihood, 177
Measure, 122
  completed, 125
  Gaussian, 185
  non-atomic, 156
  positive, 97
  probability, 122
  with finite support, 97
Metric, 2
Minimizing sequence, 2
Minkowski sum, 7
Moments of distributions, 111
Monotone convergence, 131
Monotonicity, 168
Multiplier
  Lagrange with finite support, 97

N
Neutral risk probability, 167
Nonanticipativity constraint, 218
Norm
  differentiable, 29
  Frobenius, 75
Normal integrand, 146

O
Oblique hyperplane, 19
Optimality condition, 35
Optimization
  semi-infinite, 95

P
Path, 257
Point of interpolation, 104
Policy
  feedback, 229
  open-loop, 248
Polyhedron, 57
Polynomial
  nonnegative, 106
  of Chebyshev, 103
Pontryagin’s principle, 251
Positively homogeneous, 4
Preference, 166
Probability
  invariant, 256
Programming
  positive semidefinite, 77
  positive semidefinite linear, 77
  semi-infinite, 95
Projector, 209

R
Recourse, 153
Reference (of a polynomial), 102
Regular
  probability, 179
Regularisation
  Lipschitz, 181
Regularity
  constraints, 96
Relative interior, 9
Risk-averse, 166
Rotationally invariant, 80

S
Sample approximation, 186
SDP relaxation, 87
Second-order cone, 91
Separable, 140
Separation, 6
Set
  feasible, 1
  negligible, 124
  solution, 2
σ-algebra, 117
  complete, 124
  Lebesgue, 125
  product, 118
  trivial, 117
Skorokhod–Dudley representation theorem, 185
Space
  measurable, 117
  measure, 122
  normed, 5
  probability, 122
  separable, 122
Stability, 36
State
  accessible, 257
  communicating, 257
State equation, 215
  linearized, 215
Subadditive, 4
Subdifferential, 22
  partial, 44
Support, 97
  function, 22
  of a measure, 56

T
Time
  exit, 239
  stopping, 240
Trajectory, 273
Transition
  matrix (regular), 259
  operator, 224
Translation invariance, 168

U
Uniform integrability, 134

V
Value, 1
Value at risk, 174
Vertical hyperplane, 19
Vitali, 134

W
Walk, 257
